How do I repair or refresh the ingress network?

The ingress network is unstable.

If the load balancer in the ingress network stops responding, I do not know of a way to recover it.
How can I debug and recover it?

Expected behavior

  • Service creation succeeds with the publish option.
  • The ELB returns a response.

Actual behavior

  • Service creation succeeds with the publish option.
  • The ELB listener starts TCP balancing.
  • The ingress published port stops returning responses.

Additional Information

Template: d4x 1.12.1-beta5
3 managers (t2.small), 3 workers (t2.medium)

Steps to reproduce the behavior

  1. Create and delete a service a number of times.

We are in the process of adding a diagnostic tool to Docker for AWS that will collect all the data required to troubleshoot issues like this.

To help debug this issue, can you please share the following information?

  1. Are the ELB listeners properly programmed when the services come up and go down?
  2. Can you confirm whether the services can talk to each other within the swarm cluster? Is this an ingress-network-only issue?
  3. Exact commands and steps to reproduce the issue

  1. Are the ELB listeners properly programmed when the services come up and go down?

Yes.

  1. Can you confirm whether the services can talk to each other within the swarm cluster? Is this an ingress-network-only issue?

Creating and deleting services works, so it is probably only an ingress network issue.
I also noticed something when I added a custom worker node, built on Ubuntu, to the swarm cluster to check its behavior: the Ubuntu worker did not fail to respond, but the Moby Linux workers did.

  1. Exact commands and steps to reproduce the issue

I created a service with my own custom image. It takes 60-100 seconds before it listens on the port because of a startup script.

docker service create --name myservice -p 10000:8080 --constraint 'node.role == worker' myservice
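
Because the container needs 60-100 seconds before it listens, a probe sent right after `docker service create` can fail for reasons unrelated to the ingress network. The following is a minimal sketch of a wait loop that rules this out, assuming curl is installed; the function name `wait_for_port`, the retry count, and the delays are hypothetical and not part of Docker for AWS.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers or the retries run out.
# Usage: wait_for_port URL [MAX_TRIES] [DELAY_SECONDS]
wait_for_port() {
    url=$1
    max_tries=${2:-30}
    delay=${3:-5}
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # Any completed HTTP exchange counts as a response;
        # --max-time bounds each probe so a hung connection cannot stall the loop.
        if curl --silent --output /dev/null --max-time 5 "$url"; then
            echo "responding after $try attempt(s)"
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    echo "no response after $max_tries attempt(s)" >&2
    return 1
}
```

For the service above, this could be called as `wait_for_port "http://{ELB_ENDPOINT}:10000" 30 5`, which allows roughly 150 seconds of sleep time plus up to 5 seconds per probe before giving up.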

Thanks,

This is not really a supported use case - you should use the AWS scaling group to add or remove workers.

Can you provide more detailed steps to reproduce? E.g.:

  1. deployed Docker for AWS with X managers and Y workers of instance type Z
  2. deployed service foo
  3. checked hostname bar, and got no response
  4. …

Michael

This is not really a supported use case - you should use the AWS scaling group to add or remove workers.

I see. I only added it to check the swarm network. I’ll remove it from the cluster once I find out what the problem is.

deployed Docker for AWS with X managers and Y workers of instance type Z

  • 3 managers (t2.small)
  • 3 workers (t2.medium)

deployed service foo

I could not reproduce it 100% of the time…

  • docker service create --name 10001 -p 10001:80 nginx
    • curl http://{ELB_ENDPOINT}:10001
    • docker service rm 10001
  • docker service create --name 10002 -p 10002:80 nginx
    • curl http://{ELB_ENDPOINT}:10002
    • docker service rm 10002
  • docker service create --name 10003 -p 10003:80 nginx
    • curl http://{ELB_ENDPOINT}:10003
    • docker service rm 10003
  • docker service create --name 100xx -p 100xx:80 nginx
    • curl http://{ELB_ENDPOINT}:100xx
    • docker service rm 100xx
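
The create, curl, delete cycle above can be scripted so that an intermittent failure is easier to catch. This is only a sketch under assumptions: the function name `repro_ingress`, the `repro-` service name prefix, and the `REPRO_DELAY` variable are hypothetical, and the 30-second default wait is a guess at how long the task and the ELB listener need to come up.

```shell
#!/bin/sh
# Hypothetical reproduction loop for the steps listed above.
# Usage: repro_ingress ELB_ENDPOINT FIRST_PORT LAST_PORT
repro_ingress() {
    endpoint=$1
    port=$2
    last=$3
    failures=0
    while [ "$port" -le "$last" ]; do
        name="repro-$port"
        docker service create --name "$name" -p "$port:80" nginx >/dev/null
        # Assumed delay: give the task and the ELB listener time to come up.
        sleep "${REPRO_DELAY:-30}"
        if ! curl --silent --output /dev/null --max-time 5 "http://$endpoint:$port"; then
            echo "FAIL port $port"
            failures=$((failures + 1))
        fi
        docker service rm "$name" >/dev/null
        port=$((port + 1))
    done
    echo "$failures failure(s)"
    # Exit status is zero only when every port answered.
    [ "$failures" -eq 0 ]
}
```

Run from a manager node, `repro_ingress {ELB_ENDPOINT} 10001 10020` would walk twenty ports and print one line for each port that did not answer through the ELB.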

Sometimes curl from outside the cluster fails, but docker -H {Worker} exec {CT_ID} curl localhost succeeds.


(edit)

Once the cluster has fallen into this state, it can be recovered by replacing the workers.

Hey, are there any updates to this? I noticed the same in a 6-worker cluster running edge (Server Version: 17.05.0-ce).

Thx!

We’re seeing the same. The ELB all of a sudden drops listeners/ports, even though the services are up (and we can curl them internally). Any updates? Short of killing the Swarm, I can see no way of working around this, especially with these cumbersome SSL terminations where the cert needs to be passed in as a label. I am sure the CLI could help, alas!
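
One way to confirm that listeners have been dropped is to compare the ports the swarm publishes with the ports the ELB is listening on. The sketch below only does the comparison; the function name `missing_listeners` is hypothetical, and how the two port lists are collected is left as an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print every port in the first list that is absent from the second.
# Usage: missing_listeners "PUBLISHED_PORTS" "LISTENER_PORTS" (whitespace-separated lists)
missing_listeners() {
    status=0
    for port in $1; do
        found=no
        for listener in $2; do
            if [ "$port" = "$listener" ]; then
                found=yes
            fi
        done
        if [ "$found" = no ]; then
            echo "$port"
            status=1
        fi
    done
    # Non-zero exit status means at least one published port has no listener.
    return "$status"
}
```

The first list could be built from the `PublishedPort` values that `docker service inspect` reports, and the second from the listener ports that `aws elb describe-load-balancers` reports for the stack's load balancer. Any port printed is published by a service but missing on the ELB.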