How to troubleshoot failed Docker swarm tasks

Hello,

I’m just starting out in the Docker world, and many of the basic principles of how everything is organized are still unclear to me. Please help me understand how we should approach troubleshooting failed Docker tasks.

My Docker service doesn’t work, but that is a secondary problem. The primary issue is that it’s totally unclear to me how to troubleshoot it.

This is my docker-compose file; it consists of an application service and a MongoDB service. The application service writes its logs to /opt/myapp/log/app.log. Complete sources can be found here. I also built a corresponding Docker image and uploaded it to Docker Hub: deniszhdanov/docker-swarm-troobleshoot-service:1
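
For reference, the compose file boils down to roughly this (a reconstructed sketch, not the literal file; in particular, the 50M value is my reading of the daemon logs quoted at the end of this post):

version: "3"
services:
  web:
    image: deniszhdanov/docker-swarm-troobleshoot-service:1
    deploy:
      resources:
        limits:
          memory: 50M   # copied from a tutorial; this setting turns out to matter
  db:
    image: mongo:latest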

Let’s start the stack:

docker swarm init
Swarm initialized: current node (xpkngdn0vpr73nioalzbkem1k) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-6109nv6pn7eb9gtam8bq4m198k5sk7ztzf7hy7yfv5c47kcrmq-9fbrmmccd977kx22mivs7segn 192.168.65.3:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

docker stack deploy -c docker-compose.yml myapp
Creating network myapp_default
Creating service myapp_web
Creating service myapp_db
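
At this point it is worth checking whether the stack actually converges:

docker service ls

A service whose task keeps dying never settles at its desired replica count; myapp_web showing 0/1 under REPLICAS, or flipping between 0/1 and 1/1, already hints that something keeps restarting it.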

After that, let’s wait a little while (~1 minute) and proceed:

docker ps -a
CONTAINER ID        IMAGE                                              COMMAND                  CREATED              STATUS                        PORTS               NAMES
31e9f3a8f5aa        deniszhdanov/docker-swarm-troobleshoot-service:1   "java -jar /opt/myap…"   31 seconds ago       Up 25 seconds                 8090/tcp            myapp_web.1.sij3z7cbbynsxos6608ru2f8a
9fc6a5868c12        mongo:latest                                       "docker-entrypoint.s…"   About a minute ago   Up About a minute             27017/tcp           myapp_db.1.gxl5xwj1tg80nr16clbskk2oc
a3ff2ba0c8c5        deniszhdanov/docker-swarm-troobleshoot-service:1   "java -jar /opt/myap…"   About a minute ago   Exited (137) 32 seconds ago                       myapp_web.1.3dv8x2dx6kig4qkf1wc2axro8
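
Swarm also keeps a per-task history for each service, including an ERROR column for failed attempts:

docker service ps --no-trunc myapp_web

For the exited task above this reports something like "task: non-zero exit (137)" in the ERROR column (wording from memory, so treat it as illustrative).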

We see that there is a failed task. Let’s try to understand what went wrong:

docker commit a3ff2ba0c8c5 snapshot
sha256:bec4756cadebbada400b4d1037cac671168396bf73b7d3e875c6f98f63522afd

docker run --rm -it snapshot /bin/sh
/opt/myapp # cat /opt/myapp/log/app.log
2018-10-05 15:34:20 - Starting Start on a3ff2ba0c8c5 with PID 1 (/opt/myapp/lib/myapp.jar started by root in /opt/myapp)
2018-10-05 15:34:21 - No active profile set, falling back to default profiles: default
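
Before moving on, one more thing can be pulled out of the dead container: docker inspect keeps its recorded exit state even after it has died:

docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' a3ff2ba0c8c5

Exit code 137 is 128 + 9, i.e. the process was terminated with SIGKILL, and an OOMKilled value of true would point straight at a memory limit. In hindsight this alone answers the question, but at the time I did not think to check it.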

The task’s log doesn’t contain enough information to troubleshoot the problem. However, that data is available when we run the same application image standalone:

docker run --rm -d deniszhdanov/docker-swarm-troobleshoot-service:1
825a818b425feb7ed1f593c14a411efb68457aee9c6bfcf27f745fd58cfa0001

docker ps
CONTAINER ID        IMAGE                                              COMMAND                  CREATED             STATUS              PORTS               NAMES
825a818b425f        deniszhdanov/docker-swarm-troobleshoot-service:1   "java -jar /opt/myap…"   46 seconds ago      Up 44 seconds       8090/tcp            zen_bardeen

docker exec -it 825a818b425f /bin/sh
/opt/myapp # cat /opt/myapp/log/app.log
2018-10-05 15:30:09 - Starting Start on 825a818b425f with PID 1 (/opt/myapp/lib/myapp.jar started by root in /opt/myapp)
2018-10-05 15:30:09 - No active profile set, falling back to default profiles: default
2018-10-05 15:30:09 - Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@18ef96: startup date <Fri Oct 05 15:30:09 GMT 2018>; root of context hierarchy
2018-10-05 15:30:10 - Cluster created with settings {hosts=<db:27017>, mode=MULTIPLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
2018-10-05 15:30:10 - Adding discovered server db:27017 to client view of cluster
2018-10-05 15:30:11 - Exception in monitor thread while connecting to server db:27017
com.mongodb.MongoSocketException: db: Name does not resolve
	at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:188)
	at com.mongodb.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:59)
	at com.mongodb.connection.SocketStream.open(SocketStream.java:57)
	at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:126)
	at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:114)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: db: Name does not resolve
	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
	at java.net.InetAddress.getByName(InetAddress.java:1076)
	at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:186)
	... 5 common frames omitted
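
As an aside, the MongoSocketException in the standalone run is expected: the hostname db only resolves on the stack’s overlay network, where Docker’s embedded DNS maps service names to tasks, and a plain docker run container is not attached to that network. The same name resolution can be reproduced outside swarm by putting both containers on a user-defined network (the network name below is made up for illustration):

docker network create test-net
docker run -d --name db --network test-net mongo:latest
docker run --rm -d --network test-net deniszhdanov/docker-swarm-troobleshoot-service:1

On a user-defined network Docker resolves container names, so db:27017 becomes reachable.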

Huh, it took a while to describe all of that. Thanks to everyone who managed to read this far :slight_smile:

Questions:

  1. Why do we see different states for standalone containers and swarm tasks?
  2. What is the recommended way to troubleshoot failing tasks?

Regards, Denis

Eventually, I found this Docker troubleshooting page, checked the Docker daemon logs, and found this:

2018-10-06 11:58:29.662461+0800  localhost com.docker.hyperkit[583]: [91168.810550] CPU: 3 PID: 50578 Comm: java Not tainted 4.9.93-linuxkit-aufs #1
2018-10-06 11:58:29.663013+0800  localhost com.docker.hyperkit[583]: [91168.811356] Hardware name:   BHYVE, BIOS 1.00 03/14/2014
2018-10-06 11:58:29.663984+0800  localhost com.docker.hyperkit[583]: [91168.811909]  0000000000000000 ffffffffa243922a ffffbb9dc0763de8 ffff937eb67aed00
2018-10-06 11:58:29.664792+0800  localhost com.docker.hyperkit[583]: [91168.812878]  ffffffffa21f5d85 0000000000000000 0000000000000000 ffffbb9dc0763de8
2018-10-06 11:58:29.665681+0800  localhost com.docker.hyperkit[583]: [91168.813694]  ffff937e6df576a0 0000000000000202 ffffffffa27f9dae ffff937eb67aed00
2018-10-06 11:58:29.666082+0800  localhost com.docker.hyperkit[583]: [91168.814585] Call Trace:
2018-10-06 11:58:29.666638+0800  localhost com.docker.hyperkit[583]: [91168.814984]  [<ffffffffa243922a>] ? dump_stack+0x5a/0x6f
2018-10-06 11:58:29.667238+0800  localhost com.docker.hyperkit[583]: [91168.815534]  [<ffffffffa21f5d85>] ? dump_header+0x78/0x1ed
2018-10-06 11:58:29.667980+0800  localhost com.docker.hyperkit[583]: [91168.816144]  [<ffffffffa27f9dae>] ? _raw_spin_unlock_irqrestore+0x16/0x18
2018-10-06 11:58:29.668686+0800  localhost com.docker.hyperkit[583]: [91168.816878]  [<ffffffffa21a1f90>] ? oom_kill_process+0x83/0x324
2018-10-06 11:58:29.669273+0800  localhost com.docker.hyperkit[583]: [91168.817583]  [<ffffffffa21a25b7>] ? out_of_memory+0x239/0x267
2018-10-06 11:58:29.669944+0800  localhost com.docker.hyperkit[583]: [91168.818162]  [<ffffffffa21ef2cd>] ? mem_cgroup_out_of_memory+0x4b/0x79
2018-10-06 11:58:29.670652+0800  localhost com.docker.hyperkit[583]: [91168.818834]  [<ffffffffa21f34a6>] ? mem_cgroup_oom_synchronize+0x26b/0x294
2018-10-06 11:58:29.671338+0800  localhost com.docker.hyperkit[583]: [91168.819560]  [<ffffffffa21ef650>] ? mem_cgroup_is_descendant+0x48/0x48
2018-10-06 11:58:29.671982+0800  localhost com.docker.hyperkit[583]: [91168.820253]  [<ffffffffa21a2612>] ? pagefault_out_of_memory+0x2d/0x6f
2018-10-06 11:58:29.672615+0800  localhost com.docker.hyperkit[583]: [91168.820886]  [<ffffffffa20459b0>] ? __do_page_fault+0x3c6/0x45f
2018-10-06 11:58:29.673158+0800  localhost com.docker.hyperkit[583]: [91168.821516]  [<ffffffffa27fb3c8>] ? page_fault+0x28/0x30
2018-10-06 11:58:29.677517+0800  localhost com.docker.hyperkit[583]: [91168.822159] Task in /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235 killed as a result of limit of /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235
2018-10-06 11:58:29.678200+0800  localhost com.docker.hyperkit[583]: [91168.826457] memory: usage 51188kB, limit 51200kB, failcnt 8764
2018-10-06 11:58:29.678913+0800  localhost com.docker.hyperkit[583]: [91168.827160] memory+swap: usage 102400kB, limit 102400kB, failcnt 78
2018-10-06 11:58:29.679469+0800  localhost com.docker.hyperkit[583]: [91168.827815] kmem: usage 884kB, limit 9007199254740988kB, failcnt 0
2018-10-06 11:58:29.681923+0800  localhost com.docker.hyperkit[583]: [91168.828391] Memory cgroup stats for /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235: cache:20KB rss:50284KB rss_huge:0KB mapped_file:4KB dirty:8KB writeback:0KB swap:51212KB inactive_anon:25264KB active_anon:25020KB inactive_file:8KB active_file:8KB unevictable:0KB
2018-10-06 11:58:29.682630+0800  localhost com.docker.hyperkit[583]: [91168.830820] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
2018-10-06 11:58:29.683423+0800  localhost com.docker.hyperkit[583]: [91168.831586] [50168]     0 50168   499844    14545      94       5    12964             0 java
2018-10-06 11:58:29.684186+0800  localhost com.docker.hyperkit[583]: [91168.832329] Memory cgroup out of memory: Kill process 50168 (java) score 1078 or sacrifice child
2018-10-06 11:58:29.685451+0800  localhost com.docker.hyperkit[583]: [91168.833172] Killed process 50168 (java) total-vm:1999376kB, anon-rss:49108kB, file-rss:9072kB, shmem-rss:0kB
2018-10-06 11:58:29.970702+0800  localhost com.docker.hyperkit[583]: [91169.119073] oom_reaper: reaped process 50168 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2018-10-06 11:58:30.206071+0800  localhost com.docker.driver.amd64-linux[579]: osxfs: die event: de-registering container bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235

In other words, the reason was that I had inadvertently copied a resources/limits/memory setting from one of the docker-compose tutorials, and Docker swarm kept OOM-killing my app because of it.
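
The fix was equally trivial: remove the limit, or raise it to something the JVM can actually live within (the value below is illustrative):

    deploy:
      resources:
        limits:
          memory: 256M

Alternatively (or additionally), the JVM heap can be capped with -Xmx so the process fits inside whatever limit is configured.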

At the end of the day, the problem was trivial, and the troubleshooting was not a real bear either: it was only necessary to look into the Docker daemon logs. Only because of my inexperience with Docker did I spend an evening trying to find the root cause from the container/swarm side (service logs, container logs, etc.). Well, practice makes perfect :slight_smile:
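
For the record, where those daemon/VM logs live depends on the platform. On a Linux host the daemon typically runs under systemd; on Docker for Mac (my case, hence the com.docker.hyperkit lines above) the VM’s kernel messages land in the macOS unified log. Something along these lines (the exact predicate may need tweaking):

# Linux host, systemd-managed daemon
journalctl -u docker.service

# Docker for Mac: hyperkit output goes to the macOS unified log
log show --last 1h --predicate 'process == "com.docker.hyperkit"'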