I’ve been noticing a weird issue where Postgres crashes silently while running on Docker Swarm and locks up the folders that are attached as volumes. The DB engine continues to run and you can connect to it, but the actual DB isn’t accessible.
Even after killing the stack/service, the process lingers and keeps the relevant files locked, preventing a new deployment from coming up. I don’t have logs from this time it happened, but I confirmed with lsof that this is what happened last time. Restarting the machine while the process is still running fixes the problem. The crash happens randomly and cannot be reproduced.
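For reference, this is roughly how I confirmed it last time; the volume path below is just a placeholder for our actual Ceph-backed data directory:

# List processes still holding files under the volume directory.
# Note: lsof itself can hang if the underlying mount is unresponsive.
lsof +D /mnt/cephfs/ITS_QA/xxxx_db

# The lingering postgres processes don't show up in docker ps, but they do show up in ps.
ps -ef | grep '[p]ostgres'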
Usually databases are run with local files, not with shared folders. If you want high availability, you would set up a cluster or primary/replica replication with multiple instances, each with its own local files.
That would be ideal, but this configuration works well for us, if not for this bug. I’m also not completely sure whether this is a Postgres bug or whether the problem lies somewhere else.
One instance ran into a “too many connections” issue, so I killed it with docker stack rm.
root@qa-docker-4:/home/varun# docker service ps ITS_QA_xxxx_db --no-trunc
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
lnctvg6iux6e19xfhf77xqb8c ITS_QA_xxxx_db.1 timescale/timescaledb-ha:pg16.3-ts2.15.3@sha256:1c656cade53ee0251f355157eb010bb278fc80fcc5c94a2b30fa1323c2db722c qa-docker-1 Running Starting 10 minutes ago
kx9hduy2sbuq2aadbsnyvuc7j \_ ITS_QA_xxxx_db.1 timescale/timescaledb-ha:pg16.3-ts2.15.3@sha256:1c656cade53ee0251f355157eb010bb278fc80fcc5c94a2b30fa1323c2db722c qa-docker-1 Shutdown Failed 10 minutes ago "task: non-zero exit (137): dockerexec: unhealthy container"
...
Now the DB won’t come up on the other Docker VM.
I tried finding the problematic process with lsof and restarted Docker/Ceph on qa-docker-4 (the last host where the DB crashed), to no avail. Nothing shows up in docker ps either.
Restarted the host and the DB is back online.
I’ve seen this issue once with a website as well, which makes me think it may not be something related to Postgres exclusively.
I am afraid you will have to wait until someone who is actually experienced with ceph sees the post and knows the answer.
It sounds pretty much like what happens when a process tries to access the mountpoint of an NFS or CIFS remote share while the remote server is temporarily unavailable. This is true regardless of whether the process runs as an isolated container process or as a native process on the host.
Since you mount ceph into a host folder and bind it into a container, the container is not involved in managing the ceph mount at all.
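In other words, docker only ever sees a host path. Something along these lines illustrates the layering; the mount options, paths and names here are made up, not taken from your setup:

# CephFS is mounted on the host by the kernel client; docker knows nothing about it.
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# The swarm service just binds a directory from that host mount into the container.
# The target path is a placeholder too - use whatever PGDATA your image expects.
docker service create --name xxxx_db \
  --mount type=bind,source=/mnt/cephfs/ITS_QA/xxxx_db,target=/var/lib/postgresql/data \
  timescale/timescaledb-ha:pg16.3-ts2.15.3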
I would recommend asking in a ceph forum, where you’re more likely to find ceph users who use ceph as backend storage for containerized processes (I doubt that it’s even relevant that it’s containerized) than docker users who use ceph.
It’s weird because the resource becomes accessible again once I just restart the problematic VM. Ceph isn’t rebooted; only the mount on the problematic VM is re-established. Ceph(FS) is running on the hypervisor hosts, with mount points inside the VMs. So I thought it might not be a storage issue.
We’ve had the same application running on Docker Swarm with GlusterFS and similar mounts in an on-premises setup for one of our customers, and it has been working fine with zero issues. I’m not sure whether that’s just because that environment isn’t loaded with as much traffic, but it could also indicate something specific to Ceph, like you said.
This is not really an argument against ceph being temporarily unavailable on the node, thus resulting in processes that access the mountpoint hanging.
I hope you do find a ceph expert who knows what’s responsible for it. From the perspective of docker it’s just a bind.
That makes sense, thank you. I don’t think it’s temporarily unavailable, but it seems to block access to the files (they are there, but when I try opening some of them with vim, the editor sometimes hangs), preventing the container from coming up on another node.
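If it helps anyone, a quick sanity check I can run next time is looking for processes stuck in uninterruptible sleep (state D), which is the usual symptom of being blocked on an unresponsive network filesystem. This is just a sketch, nothing specific to our setup:

# Show processes in state D together with the kernel function they are waiting in.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'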
I’ll try posting on a Ceph forum. We are also debating getting rid of Ceph entirely, since we don’t use it anywhere other than our QA environment, but we weren’t completely sure this was a Ceph issue.
OK, I think I may have found the solution? Not totally sure, but after 10 days I haven’t seen the issue again. Hopefully it doesn’t reoccur.
I think you were right.
I was using a single keyring for mounting CephFS on all client VMs. I regenerated individual keyring files for every client, and that seems to have solved it. I wouldn’t have thought the file locking mechanism depended on the client keyring.
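For anyone who runs into this later, this is roughly the shape of the change; the filesystem name, client id and paths are placeholders, not our exact commands:

# On a Ceph admin node: create a dedicated CephFS client per VM instead of sharing one keyring.
ceph fs authorize cephfs client.qa-docker-4 / rw | tee /etc/ceph/ceph.client.qa-docker-4.keyring

# Extract just the secret for the kernel mount helper.
ceph auth get-key client.qa-docker-4 > /etc/ceph/qa-docker-4.secret

# On that VM: mount CephFS with its own identity.
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=qa-docker-4,secretfile=/etc/ceph/qa-docker-4.secret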