UCP crashes with 404 error occasionally when trying to scale a container


(Ken42) #1

Hello,

I have a UCP cluster installed on 2 machines. About 50% of the time, when I choose a running application container and tell it to “scale” on several machines, there will be no “loading” symbol and I will be immediately returned to the page for that machine without any message given. Then, when I refresh or go to a different page, I’ll get a 404 error which will persist for a few minutes before the UCP web console becomes available again. This occurs with all containers I’ve tried.

Also, the few times it does work, it will not scale to both machines. It will only create new containers on one of the machines. I get the same behavior when I run the “docker-compose scale” command directly from the machines as well. Not sure if these issues are related in any way.
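
To check how containers are actually distributed after a scale, one option is to count them per node. This is a minimal sketch in Python, assuming the names come from `docker ps --format '{{.Names}}'` run against the Swarm endpoint and that Swarm prefixes each name with its node as `node/container`. The sample names below are hypothetical, not output from this cluster.

```python
from collections import Counter


def containers_per_node(names):
    """Count containers per node from Swarm-style 'node/container' names."""
    counts = Counter()
    for raw in names:
        name = raw.strip().lstrip("/")
        if not name:
            continue
        # Names without a node prefix are grouped under "unknown".
        node, sep, _ = name.partition("/")
        counts[node if sep else "unknown"] += 1
    return dict(counts)


# Hypothetical sample: one container name per line of `docker ps` output.
sample = ["node1/app_web_1", "node1/app_web_2", "node1/app_web_3", "node2/app_db_1"]
print(containers_per_node(sample))
```

If every new container shows up under the same node after scaling, that would at least confirm the placement behavior independently of the UCP console.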

Software versions:
Redhat 7.2
Docker Engine 1.10
Docker-compose 1.6.0
UCP 0.8.0
Swarm 1.1.0-rc2 (installed by the UCP install)

All on EC2 instances.

Please let me know what additional information I can provide to help debug.

Thank you for your time


(Alex Rhea) #2

@ken42 I can confirm that my team, running on a very similar stack (RedHat 7.2), has run into these issues as well. We have observed that this happens on the master and the replicas when we try to perform “batch” operations like pulling 10 containers or scaling. Sometimes the controller comes back and other times we have to restart the daemon to get it to come back up.

We have also encountered a similar problem on this stack where the master and replicas throw the following error: “No Elected Cluster Leader.” We opened a support ticket, but I would be interested to know whether you have run into similar issues, as these errors seem to be RHEL-related.

Would be happy to work together and discuss issues that we have run into and how we solved them.

Thanks,
Alex


(Ken42) #3

Hi @arhea, thanks for responding.

“Sometimes the controller comes back and other times we have to restart the daemon to get it to come back up.”

Exactly the same here.

I have not gotten the “No Elected Cluster Leader” error, but I will keep an eye out for it.

My controller seems to have become more stable recently. When I last re-installed the cluster, I had all the follower nodes join using the master’s local network domain name rather than its public internet IP/domain. It’s been stable since then and hasn’t crashed while “scaling”, but it could just be a fluke.

I may try installing the trusted registry on one of the nodes and push all our images to that and then set UCP to use the trusted registry. Not sure if it will help but am willing to try anything at this point.


(Ken42) #4

@arhea Do you also get “Error scaling container 500 Internal Server Error: Could not get container for …” errors that I mentioned in my other threads?


(Alex Rhea) #5

@ken42 I haven’t seen that issue. My installation is behind a corporate firewall so we only use internal IP addresses. I am also wondering if I happened to get a bad installation. I may try reinstalling when I get some downtime.

We elected to install our DTR outside the swarm, on a standalone node.


(Ken42) #6

Back to getting the same old errors now. It now pretty much crashes every other time I try and scale a container.

I even tried setting up the remote syslog server connection to see what was happening but that didn’t record any logs or anything at all during the crash. Seems impossible to reproduce reliably and I don’t see any errors in any of the container logs.
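
Since the logs show nothing, one way to at least timestamp the crashes is to poll the web console from outside and record when it stops returning 200. This is a rough sketch in Python using only the standard library. `probe` and `outage_windows` are hypothetical helper names, and the sample data is made up for illustration.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 or 500
    except OSError:
        return None  # connection refused, timeout, TLS failure


def outage_windows(samples):
    """Group consecutive non-200 samples into (start, end) windows.

    samples is a list of (timestamp, status) pairs in time order.
    """
    windows = []
    start = end = None
    for ts, status in samples:
        if status == 200:
            if start is not None:
                windows.append((start, end))
                start = end = None
        else:
            if start is None:
                start = ts
            end = ts
    if start is not None:
        windows.append((start, end))
    return windows


# Hypothetical samples: (seconds since start, status code from probe).
samples = [(0, 200), (10, 404), (20, 404), (30, None), (40, 200), (50, 200)]
print(outage_windows(samples))
```

In practice `probe` would be called every few seconds against the console address (whatever that is for your install) with `time.time()` as the timestamp. If the console uses a self-signed certificate, `probe` will report None unless that certificate is trusted, so that case needs handling before the numbers mean anything.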


(Alex Rhea) #7

@ken42 yeah we searched the logs for an error to give to the UCP team. My belief is that the internal web service is crashing and that all other services remain up. The fact that it throws a 404 rather than a 500 leads me to believe it is a web app or web framework issue. But that is totally a guess.


(Vivek Saraswat) #8

Hi ken42,

We’ve been seeing other 404 errors which are related more to the web app than specifically to the scaling function. The team is currently working hard to address the 404 error.

I am currently not able to replicate the 500 error for scaling, however. Let me investigate further on that one.


(Ken42) #9

Thanks Vivek, please let me know any additional information I can provide to help reproduce the error.

The 500 scaling error only seems to occur when it is attempting to “scale” the container onto another machine (I think this occurs because the Swarm scheduler decides that the new container should be on a machine other than the one the container is currently on).

Scaling will sometimes work when it only tries to scale to the current machine (and when it doesn’t crash and give the 404 error).


(Ken42) #10

@arhea

Curious if you have set up the UCP multi-host networking as described here?

http://ucp-beta-docs.s3-website-us-west-1.amazonaws.com/networking/

Thanks


(Alex Rhea) #11

@ken42, my understanding is that overlay networks are not yet supported by the RHEL kernel within most enterprises.

We have written our own service discovery mechanism as the services available weren’t meeting our needs. We are planning on sharing these learnings soon, outside of our clients.


(Ken42) #12

Are you referring to the previous incompatibility of multi-host networking with the 3.10 kernel? Or some other incompatibility? Apparently multi-host networking is now supported with the 3.10 kernel in Docker Engine 1.10, so RHEL 7.2 should work fine.


(Alex Rhea) #13

@ken42 I believe you are correct. We are still on the CS engine version 1.9.


(Ken42) #14

@arhea

I see, thanks.

This is just a stab in the dark, but I wonder if the 404 error is somehow due to us not having multi-host networking set up and enabled? Maybe UCP is coded somewhere to assume multi-host networking is set up, and some uncaught exception from that assumption is crashing the controller.


(Vivek Saraswat) #15

Hey folks, just a note: the 404 error should be corrected for the 1.0 launch. Coming soon!