I’ve been noticing a weird issue where Postgres crashes silently while running on Docker Swarm and locks up the folders that are attached as volumes. The DB engine continues to run and you can connect to it, but the actual DB isn’t accessible.
Even after killing the stack/service, the process lingers and keeps the relevant files locked, preventing a new deployment from coming up. I don’t have logs from this time it happened, but I confirmed with lsof that this is what happened last time. Restarting the machine while the process is still running fixes the problem. The crash happens randomly and cannot be reproduced.
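For reference, this is roughly how I confirmed it last time; the volume path below is just a placeholder for our actual Ceph-backed data directory:

# List processes still holding files under the volume directory.
# Note: lsof itself can hang if the underlying mount is unresponsive.
lsof +D /mnt/cephfs/ITS_QA/xxxx_db

# The lingering postgres processes don't show up in docker ps, but they do show up in ps.
ps -ef | grep '[p]ostgres'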
Usually databases are run with local files, not with shared folders. If you want high availability, you would set up a cluster or primary/replica replication with multiple instances, each with its own local files.
That would be ideal, but this configuration works well for us, if not for this bug. I’m also not completely sure whether this is a Postgres bug or whether the problem lies somewhere else.
One instance ran into a “too many connections” issue, so I killed it with docker stack rm.
root@qa-docker-4:/home/varun# docker service ps ITS_QA_xxxx_db --no-trunc
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
lnctvg6iux6e19xfhf77xqb8c ITS_QA_xxxx_db.1 timescale/timescaledb-ha:pg16.3-ts2.15.3@sha256:1c656cade53ee0251f355157eb010bb278fc80fcc5c94a2b30fa1323c2db722c qa-docker-1 Running Starting 10 minutes ago
kx9hduy2sbuq2aadbsnyvuc7j \_ ITS_QA_xxxx_db.1 timescale/timescaledb-ha:pg16.3-ts2.15.3@sha256:1c656cade53ee0251f355157eb010bb278fc80fcc5c94a2b30fa1323c2db722c qa-docker-1 Shutdown Failed 10 minutes ago "task: non-zero exit (137): dockerexec: unhealthy container"
...
Now the DB won’t come up on the other Docker VM.
I tried finding the problematic process with lsof and restarted Docker/Ceph on qa-docker-4 (the last host where the DB crashed), to no avail. Nothing shows up in docker ps either.
Restarted the host and the DB is back online.
I’ve seen this issue once with a website as well, which makes me think it may not be something related to Postgres exclusively.
I am afraid you will have to wait until someone who is actually experienced with ceph sees the post and knows the answer.
It sounds pretty much like what happens when a process tries to access the mountpoint of an NFS or CIFS remote share while the remote server is temporarily unavailable. This is true regardless of whether the process runs as an isolated container process or as a native process on the host.
Since you mount ceph into a host folder and bind it into a container, the container is not involved in managing the ceph mount at all.
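In other words, docker only ever sees a host path. Something along these lines illustrates the layering; the mount options, paths and names here are made up, not taken from your setup:

# CephFS is mounted on the host by the kernel client; docker knows nothing about it.
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# The swarm service just binds a directory from that host mount into the container.
# The target path is a placeholder too - use whatever PGDATA your image expects.
docker service create --name xxxx_db \
  --mount type=bind,source=/mnt/cephfs/ITS_QA/xxxx_db,target=/var/lib/postgresql/data \
  timescale/timescaledb-ha:pg16.3-ts2.15.3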
I would recommend asking in a ceph forum, where you’re more likely to find ceph users who use ceph as backend storage for containerized processes (I doubt that it’s even relevant that it’s containerized) than docker users who use ceph.
It’s weird because the resource becomes accessible again once I just restart the problematic VM. Ceph isn’t rebooted; only the mount on the problematic VM is re-established. Ceph(FS) is running on the hypervisor hosts, with mount points inside the VMs. So I thought it might not be a storage issue.
We’ve had the same application running on Docker Swarm with GlusterFS and similar mounts in an on-premises setup for one of our customers, and it has been working fine with zero issues. I’m not sure whether that’s just because that environment isn’t loaded with as much traffic, but it could also indicate something specific to Ceph, like you said.
This is not really an argument against ceph being temporarily unavailable on the node, thus resulting in processes that access the mountpoint hanging.
I hope you do find a ceph expert who knows what’s responsible for it. From the perspective of docker it’s just a bind.
That makes sense, thank you. I don’t think it’s temporarily unavailable, but it seems to block access to the files (they are there, but when I try opening some of them with vim, the editor sometimes hangs), preventing the container from coming up on another node.
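If it helps anyone, a quick sanity check I can run next time is looking for processes stuck in uninterruptible sleep (state D), which is the usual symptom of being blocked on an unresponsive network filesystem. This is just a sketch, nothing specific to our setup:

# Show processes in state D together with the kernel function they are waiting in.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'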
I’ll try posting on a Ceph forum. We are also debating getting rid of Ceph entirely, since we don’t use it anywhere other than our QA environment, but we weren’t completely sure this was a Ceph issue.
OK, I think I may have found the solution? Not totally sure, but after 10 days I haven’t seen the issue again. Hopefully it doesn’t reoccur.
I think you were right.
I was using a single keyring for mounting CephFS on all client VMs. I regenerated individual keyring files for every client, and that seems to have solved it. I wouldn’t have thought the file locking mechanism depended on the client keyring.
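For anyone who runs into this later, this is roughly the shape of the change; the filesystem name, client id and paths are placeholders, not our exact commands:

# On a Ceph admin node: create a dedicated CephFS client per VM instead of sharing one keyring.
ceph fs authorize cephfs client.qa-docker-4 / rw | tee /etc/ceph/ceph.client.qa-docker-4.keyring

# Extract just the secret for the kernel mount helper.
ceph auth get-key client.qa-docker-4 > /etc/ceph/qa-docker-4.secret

# On that VM: mount CephFS with its own identity.
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=qa-docker-4,secretfile=/etc/ceph/qa-docker-4.secret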