Hey there,
I’ve been running into a weird issue with databases corrupting, and I seem to have tracked it down to one main issue: apparently the docker-compose.yml files that I have included aren’t stopped when I run docker compose stop on the main docker-compose.yml.
Any idea why that is? Some containers come with preconfigured docker-compose files, and I also want to de-clutter my main docker-compose.yml where possible. How can I make this work?
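For reference, a rough sketch of how the main file pulls the others in, assuming the Compose include top-level element (the path is just a placeholder, not my real layout):

include:
  - invidious/docker-compose.yml   # preconfigured compose file shipped with the project (placeholder path)

services:
  caddy:
    build: .
  # ...the rest of the services shown in the config further down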
Please share the output of docker compose config so we can see the full rendered config. Make sure to anonymize public IPs, domain names and secrets, but leave the rest untouched.
Without any context information, I would assume it’s related to the grace period when containers are stopped. After a container receives the SIGTERM signal caused by docker compose stop, it is killed hard with SIGKILL once the grace period elapses. The default grace period is 10 seconds, which suits most cases, but not all.
Your grace period is probably too short for the database to finish a graceful shutdown, hence the data corruption.
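If that turns out to be the cause, the grace period can be raised per service in the compose file; a minimal sketch, with one minute as an example value:

services:
  invidious-db:
    image: docker.io/library/postgres:14
    stop_grace_period: 1m   # allow up to 60s for a clean shutdown before SIGKILL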
It’s definitely not the grace period. If I enter the directory with the second docker-compose.yml file and run docker compose stop there, the containers in question stop immediately, just as expected.
The corruption seems to stem from the fact that something the database server depends on is stopped while the database server itself isn’t. I’m not sure what exactly causes that, but docker doesn’t even try to stop the containers in question: if I look at the output of docker compose stop, they don’t show up at all.
The containers in question are invidious and invidious-db. Here is the docker compose config output:
name: podconf
services:
caddy:
build:
context: /podconf
dockerfile: /podconf/build/caddy.Dockerfile
args:
VER: 2.7.6
container_name: caddy
extra_hosts:
- host.docker.internal=host-gateway
networks:
php-fpm: null
webaccess: null
ports:
- mode: ingress
target: 80
published: "80"
protocol: tcp
- mode: ingress
target: 443
published: "443"
protocol: tcp
volumes:
- type: bind
source: /podconf/caddy/
target: /etc/caddy
bind:
create_host_path: true
- type: bind
source: /podconf/www/html
target: /var/www/html
bind:
create_host_path: true
- type: bind
source: /podconf/www/files-public
target: /var/www/files
bind:
create_host_path: true
cgit:
build:
context: /podconf/build/clearlinux-dockerfiles/cgit
dockerfile: Dockerfile
container_name: cgit
networks:
webaccess: null
restart: always
volumes:
- type: bind
source: /podconf/cgit-conf/cgitrc/
target: /etc/cgitrc
bind:
create_host_path: true
- type: bind
source: /podconf/cgit-conf/cgit/
target: /usr/share/cgit
bind:
create_host_path: true
- type: bind
source: /podconf/www/git/
target: /var/www/git
bind:
create_host_path: true
invidious:
depends_on:
invidious-db:
condition: service_started
required: true
environment:
INVIDIOUS_CONFIG: |
healthcheck:
test:
- CMD-SHELL
- wget -nv --tries=1 --spider http://127.0.0.1:3000/api/v1/trending || exit 1
timeout: 5s
interval: 30s
retries: 2
image: quay.io/invidious/invidious:latest
networks:
default: null
ports:
- mode: ingress
target: 3000
published: "9100"
protocol: tcp
restart: unless-stopped
invidious-db:
healthcheck:
test:
- CMD-SHELL
- pg_isready -U $$POSTGRES_USER -d $$POSTGRES_DB
image: docker.io/library/postgres:14
networks:
default: null
restart: unless-stopped
volumes:
- type: bind
source: /podconf/invidious/postgres_data
target: /var/lib/postgresql/data
bind:
create_host_path: true
- type: bind
source: /podconf/invidious/config/sql
target: /config/sql
bind:
create_host_path: true
- type: bind
source: /podconf/invidious/docker/init-invidious-db.sh
target: /docker-entrypoint-initdb.d/init-invidious-db.sh
bind:
create_host_path: true
navidrome:
container_name: navidrome
environment:
ND_BASEURL: /audio
ND_LOGLEVEL: debug
ND_PORT: "4040"
ND_SCANSCHEDULE: 1h
ND_SESSIONTIMEOUT: 24h
image: deluan/navidrome:latest
networks:
webaccess: null
restart: always
user: 1000:1000
volumes:
- type: bind
source: /podconf/navidrome/data
target: /data
bind:
create_host_path: true
- type: bind
source: /podconf/navidrome/music
target: /music
read_only: true
bind:
create_host_path: true
netdata:
cap_add:
- SYS_PTRACE
- SYS_ADMIN
container_name: netdata
image: docker.io/netdata/netdata:latest
network_mode: host
pid: host
restart: always
security_opt:
- apparmor=unconfined
volumes:
- type: bind
source: /podconf/netdata/config
target: /etc/netdata
bind:
create_host_path: true
- type: bind
source: /podconf/netdata/lib
target: /var/lib/netdata
bind:
create_host_path: true
- type: bind
source: /podconf/netdata/cache
target: /var/cache/netdata
bind:
create_host_path: true
- type: bind
source: /etc/passwd
target: /host/etc/passwd
read_only: true
bind:
create_host_path: true
- type: bind
source: /etc/group
target: /host/etc/group
read_only: true
bind:
create_host_path: true
- type: bind
source: /proc
target: /host/proc
read_only: true
bind:
create_host_path: true
- type: bind
source: /sys
target: /host/sys
read_only: true
bind:
create_host_path: true
- type: bind
source: /etc/os-release
target: /host/etc/os-release
read_only: true
bind:
create_host_path: true
- type: bind
source: /var/run/docker.sock
target: /var/run/docker.sock
read_only: true
bind:
create_host_path: true
php-fpm:
container_name: php-fpm
image: docker.io/bitnami/php-fpm:latest
networks:
php-fpm: null
volumes:
- type: bind
source: /podconf/www/html
target: /var/www/html
bind:
create_host_path: true
pihole:
cap_add:
- NET_ADMIN
container_name: pihole
environment:
WEBPASSWORD: admin
hostname: pihole
image: pihole/pihole:latest
networks:
wg-easy:
ipv4_address: 10.8.1.3
ports:
- mode: ingress
target: 80
published: "8081"
protocol: tcp
restart: unless-stopped
volumes:
- type: bind
source: /podconf/pihole/etc-pihole/
target: /etc/pihole
bind:
create_host_path: true
- type: bind
source: /podconf/pihole/etc-dnsmasq.d/
target: /etc/dnsmasq.d
bind:
create_host_path: true
redlib:
container_name: redlib
image: quay.io/redlib/redlib:latest
networks:
default: null
ports:
- mode: ingress
target: 8080
published: "9120"
protocol: tcp
stirling-pdf:
container_name: stirling-pdf
environment:
DOCKER_ENABLE_SECURITY: "false"
INSTALL_BOOK_AND_ADVANCED_HTML_OPS: "true"
LANGS: en_GB
image: frooodle/s-pdf:latest
networks:
default: null
ports:
- mode: ingress
target: 8080
published: "9020"
protocol: tcp
volumes:
- type: bind
source: /podconf/stirling-pdf/tesseract
target: /usr/share/tessdata
bind:
create_host_path: true
- type: bind
source: /podconf/stirling-pdf/config
target: /configs
bind:
create_host_path: true
networks:
default:
name: podconf_default
driver: bridge
php-fpm:
name: podconf_php-fpm
driver: bridge
webaccess:
name: podconf_webaccess
I just want to add that it’s entirely possible I’ve missed something here. I’m not saying I’m a docker expert, but when I include a file and the docs say it helps with de-cluttering the main docker-compose.yml, then I expect the included files to be stopped as well when I stop my main file.
It is highly unlikely that your problem is unrelated to the process inside the container being killed by SIGKILL; otherwise it would terminate gracefully without data corruption.
So either the process inside the container requires a different signal than SIGTERM to initiate graceful termination, and therefore always ends up being killed by SIGKILL after the grace period, or the graceful shutdown simply takes longer than the grace period.
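Both knobs can be set in the compose file. A minimal sketch with illustrative values (postgres, for instance, treats SIGINT as a “fast shutdown”):

services:
  invidious-db:
    image: docker.io/library/postgres:14
    stop_signal: SIGINT      # send a different signal than the default SIGTERM (illustrative value)
    stop_grace_period: 2m    # and allow more time before the hard SIGKILL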
I’m pretty sure it is.
I checked the logs (by attaching to the containers!), and immediately after I start the stopping process I get some weird corruption errors in the postgres logs, without any of the containers actually kicking me out (which would show that they’re being stopped or have been killed).
Not even the main invidious container shows any sign of being stopped.
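For reference, the same thing can be watched without attaching, by following the logs of the two containers; something along the lines of:

docker compose logs --follow --timestamps invidious-db invidious   # tail both containers while stopping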
Any updates on this? I don’t think that’s intended behaviour…
The first idea is to get a clear problem description.
You write about a corrupt database, and it seems your compose is using the default postgres image. Does the postgres log tell you it was shut down hard or needed to recover on startup?
Are the bind mounts local folders or on a remote share?
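For the first question, the relevant lines can usually be pulled out of the postgres log with a rough filter, for example:

docker compose logs invidious-db | grep -iE "shut down|shutdown|recovery|checkpoint"   # look for hard shutdowns or recovery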
First off: all the mounts are bind mounts, nothing remote.
Previously I wasn’t really willing to kill my containers again, so I wanted to avoid triggering this, but figuring it would help with the resolution, I just did exactly that.
Having changed absolutely nothing, my included containers did now shut down as expected. I’m very confused now, especially because I somewhat trust my problem-solving skills.
No visible errors, but it still corrupted:
invidious-db-1 | 2024-07-16 08:48:25.375 UTC [1] LOG: could not open file "postmaster.pid": No such file or directory
invidious-db-1 | 2024-07-16 08:48:25.375 UTC [1] LOG: performing immediate shutdown because data directory lock file is invalid
invidious-db-1 | 2024-07-16 08:48:25.375 UTC [1] LOG: received immediate shutdown request
invidious-db-1 | 2024-07-16 08:48:25.375 UTC [1] LOG: could not open file "postmaster.pid": No such file or directory
invidious-db-1 | 2024-07-16 08:48:25.401 UTC [1] LOG: database system is shut down
invidious-db-1 |
invidious-db-1 | PostgreSQL Database directory appears to contain a database; Skipping initialization
invidious-db-1 |
invidious-db-1 | 2024-07-16 08:48:26.039 UTC [1] LOG: starting PostgreSQL 14.12 (Debian 14.12-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
invidious-db-1 | 2024-07-16 08:48:26.040 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
invidious-db-1 | 2024-07-16 08:48:26.040 UTC [1] LOG: listening on IPv6 address "::", port 5432
invidious-db-1 | 2024-07-16 08:48:26.046 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
invidious-db-1 | 2024-07-16 08:48:26.057 UTC [27] LOG: database system was shut down at 2024-07-16 08:48:23 UTC
invidious-db-1 | 2024-07-16 08:48:26.058 UTC [27] LOG: invalid primary checkpoint record
invidious-db-1 | 2024-07-16 08:48:26.058 UTC [27] PANIC: could not locate a valid checkpoint record
invidious-db-1 | 2024-07-16 08:48:26.483 UTC [1] LOG: startup process (PID 27) was terminated by signal 6: Aborted
invidious-db-1 | 2024-07-16 08:48:26.483 UTC [1] LOG: aborting startup due to startup process failure
invidious-db-1 | 2024-07-16 08:48:26.498 UTC [1] LOG: database system is shut down
I just noticed that “data directory lock file” message for the first time… what is that?
Edit: I just checked, and that pid file definitely exists. It really shouldn’t go missing randomly…
This is the log of both shutdown and startup of postgres?
Yes. That’s the script running docker compose stop and then docker compose start.
It’s really strange; I haven’t seen that before. Did you check your disk for corruption?
Usually Postgres is very stable. You could try to isolate the problem and run fewer services.
Sorry, you probably won’t get around some more testing.
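For the isolation step, compose can bring up and stop just a subset of services, for example:

docker compose up -d invidious-db invidious   # start only the database and invidious
docker compose stop                           # then stop and check whether the corruption still occurs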
Yes. I’ve run a five-hour memtest, I’ve checked for corruption, I’ve updated all packages and I’ve checked the drive health. Everything is fine, or so it seems.
It also can’t really be another container interfering, because I’ve got them all separated.
What a peculiar issue…
It’s still happening. Next time I run the backup I’m gonna have auditctl watch the file so I can find out which process deleted it. This is so annoying.
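In case anyone wants to replicate the watch, it would look roughly like this (the key name is arbitrary; the path is the postgres bind mount from my config above):

# log writes and attribute changes on the pid file, so a deletion shows up in the audit log
auditctl -w /podconf/invidious/postgres_data/postmaster.pid -p wa -k pg-pidfile
# afterwards, look up which process touched it
ausearch -k pg-pidfile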
I think I figured it out. Docker had somehow lost track of the database container, and that somehow caused my postmaster.pid to get lost. (Maybe invidious was using the container that was lost, while docker had created another invidious database container, which then shut down as it should, deleting the postmaster.pid file before the actually relevant container had a chance to. I’m no expert on what might be possible here, though.)
I noticed that the invidious-db container was still running after I had run docker compose stop. I then decided to just clean out all the old garbage and whatever else there was (among it another database container that had gotten lost) using docker prune.
After that I ran docker compose up, and since then my invidious instance no longer corrupts out of the blue when stopping all containers. (At least for now. I hope this is going to last.)
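For anyone running into the same thing, the lost container showed up when comparing what compose thinks it manages with what docker itself has:

docker compose ps --all                      # containers the compose project knows about
docker ps --all --filter "name=invidious"    # every invidious* container docker knows about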
This included-file stuff just keeps on giving.
I just discovered that docker starts the containers within one included docker-compose.yml but not within the other. The paths are definitely correct, and I’ve also tried moving the problematic one to the top of the include list, but no dice. I’ve also checked the file permissions and they’re identical. The problematic docker-compose.yml works perfectly fine on its own.
I wonder if it has something to do with the directory name: one contains a hyphen, the other one doesn’t.
What on earth is going on here? I do not understand…
It seems like I found the issue, and it seems like a bit of an oversight to me.
Apparently one docker-compose.yml cannot see the stopped or started containers that belong to other (even included) compose files.
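My guess at the mechanism, assuming compose identifies “its” containers by project name, and assuming the included file lives in a subdirectory like /podconf/invidious (placeholder, not necessarily my actual layout):

cd /podconf && docker compose stop              # project "podconf" (from the name: field), stops its own containers
cd /podconf/invidious && docker compose stop    # project name defaults to the directory name here,
                                                # so compose looks for a different set of containers
docker compose --project-name podconf stop     # pinning the project name targets the same set again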