I am running on the latest Docker for AWS version.
I am running a very simple repetitive test: docker stack deploy followed by docker stack rm, using the hello-world image. Every loop uses an additional 5-10 MB of RAM that is never released. Eventually one of the nodes runs out of RAM, and at that point the swarm exhibits multiple issues. The out-of-memory node is essentially unusable (CPU is at 100%, no memory is free, and running a command on the CLI takes 1-5 minutes). The other nodes get errors when running swarm commands. The only recourse is to terminate the out-of-memory node. Note that the other master nodes are almost out of memory by then too.
I have also tried stopping the test before the node runs out of memory. I pruned all resources and waited 12+ hours, but the memory was never released.
My system info is below. I have tested this on both a 3-master swarm and a 1-master swarm. Both run into the same out-of-memory issue.
~ $ docker info
Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: awslogs
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: z6fetv7d34xn5y5m5fek79ml5
 Is Manager: true
 ClusterID: alhqzwsnds4d68ryos9fyb1rk
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 172.31.14.156
 Manager Addresses:
  172.31.14.156:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.49-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785GiB
Name: ip-172-31-14-156.us-west-2.compute.internal
ID: OAZY:JMZV:4Q3I:M5VO:63WE:PNIZ:G6RU:ZC2G:WOOI:YTSD:HLWD:U5LG
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 73
 Goroutines: 1170
 System Time: 2017-11-22T22:44:19.05826862Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
 os=linux
 region=us-west-2
 availability_zone=us-west-2a
 instance_type=t2.large
 node_type=manager
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
My docker compose file (test.yml):
version: '3.0'
services:
  hellotest:
    image: hello-world
    deploy:
      mode: global
My test script (test.sh), which I run with: nohup ./test.sh &
#!/bin/sh

##### globals #####
# how many times to loop through test
LOOPS=1000

starttime=`date`
echo start: $starttime

counter=1
while [ $counter -le $LOOPS ]
do
    echo loop number $counter

    # start docker
    cd /home/docker
    echo starting docker: docker stack deploy --compose-file test.yml hello-test
    docker stack deploy --compose-file test.yml hello-test
    sleep 30

    # stop docker
    echo stopping docker: docker stack rm hello-test
    docker stack rm hello-test
    sleep 30

    counter=$((counter+1))
done

endtime=`date`
echo end: $endtime
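To put numbers on the per-loop growth, the loop could also record available memory on each iteration. The following is only a sketch, not part of the test I actually ran: the helper names `mem_available_kb` and `log_mem_sample` and the log file name are made up for illustration, and it assumes the kernel exposes a `MemAvailable` line in /proc/meminfo.

```shell
#!/bin/sh

# Hypothetical helper: print the MemAvailable value (in kB) from a
# meminfo-style file. Defaults to /proc/meminfo; the optional argument
# only exists so the parsing can be checked against a sample file.
mem_available_kb() {
    awk '/^MemAvailable:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: append one timestamped sample to a log file.
#   $1 = loop counter
#   $2 = log file path (the name is an assumption)
log_mem_sample() {
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) loop=$1 mem_available_kb=$(mem_available_kb)" >> "$2"
}
```

Calling `log_mem_sample $counter /home/docker/mem.log` just before `counter=$((counter+1))` in test.sh would give one line per loop, so the difference between consecutive samples can be compared against the 5-10 MB per loop described above.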