Behavior of `up --wait` changed?

Hello,

My deploy script starts all our compose stacks with `--wait` and then checks the exit code upon completion.

A number of my stacks have an init service (which itself depends on the database) that runs any pending migrations and then exits 0.

So basically: the database starts, the init service starts, optionally runs migrations and then exits, all other services that depend on the init service start, and the stack is up.
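
The real files are bigger, but the pattern is roughly this (a simplified sketch; image names and the healthcheck are placeholders):

services:
  database:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 10

  migrate:
    image: application/migrate   # runs pending migrations, then exits 0
    depends_on:
      database:
        condition: service_healthy

  dashboard:
    image: application/dashboard
    depends_on:
      - migrate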

This was never an issue for `--wait` before, but now (I’m guessing since our quarterly server patching/update) it seems to result in an exit code of 1, which causes my pipeline to fail.

Is this behavior new, and is it on purpose?
If so, how can I exempt a service from it? (I would like to keep using `--wait`, as it gives my pipeline some assurance that everything is running correctly.)

Thank you.

Can you share the output of `docker info`? If it contains any private info, like private plugins or registry IPs, you can remove those parts from the output.

Also, do you mean that `docker compose up --wait` returns exit code 1? If and when it does, are your services all healthy? Is it possible that some services reached the wait timeout (`--wait-timeout`)?

(docker info at the bottom)

The stack starts in about 10 seconds. I use a timeout of 120s, but even with the default 60s it doesn’t come anywhere near that.

Regular `up` (I’ve replaced the internal name with “application” for privacy purposes):

$ docker compose up -d
[+] Running 5/5
 ✔ Network application-dashboard_internal       Created                                                                              0.1s 
 ✔ Container application-dashboard-database-1   Started                                                                              2.6s 
 ✔ Container application-dashboard-migrate-1    Started                                                                              5.8s 
 ✔ Container application-dashboard-pgadmin-1    Started                                                                              5.2s 
 ✔ Container application-dashboard-dashboard-1  Started                                                                              9.0s 
$ echo $?
0

With --wait:

$ docker compose up -d --wait --wait-timeout 120
[+] Running 4/5
 ✔ Network application-dashboard_internal       Created                                                                              0.1s 
 ✔ Container application-dashboard-database-1   Healthy                                                                             10.1s 
 ⠸ Container application-dashboard-migrate-1    Waiting                                                                             10.1s 
 ✔ Container application-dashboard-pgadmin-1    Healthy                                                                             10.1s 
 ✔ Container application-dashboard-dashboard-1  Healthy                                                                             10.0s 
container application-dashboard-migrate-1 exited (0)
$ echo $?
1

But the other 3 services are running healthy after that.
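
Checking the health state directly confirms it, e.g.:

$ docker inspect --format '{{.State.Health.Status}}' application-dashboard-dashboard-1
healthy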

docker info:

Client: Docker Engine - Community
 Version:    28.0.4
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.22.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.34.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 102
  Running: 81
  Paused: 0
  Stopped: 21
 Images: 433
 Server Version: 28.0.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: <snip>
  Is Manager: true
  ClusterID: <snip>
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.123.14.29
  Manager Addresses:
   10.123.14.29:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.5-0-g59923ef
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 4.18.0-553.46.1.el8_10.x86_64
 Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)
 OSType: linux
 Architecture: x86_64
 CPUs: 192
 Total Memory: 1006GiB
 Name: <snip>
 ID: <snip>
 Docker Root Dir: /data/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false
 Default Address Pools:
   Base: 172.23.0.0/16, Size: 24

I have the same Compose version in Docker Desktop and could reproduce the issue, but only with a compose file that contains a service that exits after running, even when the exit code is 0. I would not expect that, so I would call it a bug that could be reported on GitHub, but it turns out it was already reported there in 2023.

You can join the discussion there.

As a workaround, someone there recommended using `sleep infinity`, but that would make the container run practically forever. I would instead sleep for a finite number of minutes or seconds larger than the wait timeout, so the container only stops after `docker compose` has already returned with exit code 0.

Update: an example:

services:
  migrate:
    image: bash
    init: true
    command:
      - sh
      - -c
      # do some work, then keep the container alive past the 120s
      # --wait-timeout so `up --wait` returns 0 before the exit
      - 'ls -la && sleep 130'

The service does indeed exit after the migrations are complete.

Thank you for helping to look into it and for discovering the bug report!
I guess that highlights one of the many “advantages” of having to update your servers from a company-provided mirror instead of the internet: you arrive at the party late, very late.
I didn’t even consider looking back that far.

Anyway, the last reply there provides a working workaround: using the explicit `service_completed_successfully` condition on the depending service to indicate that the dependency is allowed to terminate. We’ll be using that from now on.
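
In case it helps anyone else, the change amounts to this (a sketch, using the service names from my example above):

services:
  migrate:
    depends_on:
      database:
        condition: service_healthy
  dashboard:
    depends_on:
      migrate:
        condition: service_completed_successfully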


You are right, I thought it was about the init container service itself and overlooked the condition 🙂