
DTR @ AWS (quickstart): dtr-replica-03 node drops out (seems to be networking failure)

dtr

(Mark Henwood) #1

Hi there. I “successfully” installed the QuickStart DDC at AWS. The UCP and DTR web interfaces work fine. I created a repo and performed a docker login fine from my local Mac CLI. When trying to push an image, though, I get lots of retries, and each attempt ends with the error “blob upload unknown”. In investigating this, I found out that…

The Applications dashboard shows that dtr-replica-0{1,2} are running but that dtr-replica-03 is down. Neither restarting the container via the web UI nor restarting its node via a shell ‘shutdown’ command improves this. The container logs from the node are below.

Starting rethinkdb for replica: 000000000003
Trying to resolve own ip address dtr-rethinkdb-000000000003.dtr-br...
Admin interface host: 172.19.0.3
time="2016-11-02T09:29:38Z" level=info msg="arggen starting" 
time="2016-11-02T09:29:38Z" level=info msg="Waiting for etcd..." 
Starting with command: /usr/local/bin/rethinkdb --bind-http 172.19.0.3 --bind all --no-update-check --directory /data/rethink --driver-tls-key /ca/rethink-client/key.pem --driver-tls-cert /ca/rethink-client/cert.pem --driver-tls-ca /ca/rethink/cert.pem --cluster-tls-key /ca/rethink-client/key.pem --cluster-tls-cert /ca/rethink-client/cert.pem --cluster-tls-ca /ca/rethink/cert.pem --http-port 8080 --server-tag dtr_rethinkdb_000000000003 --server-name dtr_rethinkdb_000000000003 --canonical-address dtr-rethinkdb-000000000003.dtr-ol --join dtr-rethinkdb-000000000001.dtr-ol --join dtr-rethinkdb-000000000002.dtr-ol --join dtr-rethinkdb-000000000003.dtr-ol
getaddrinfo() failed for hostname 'dtr-rethinkdb-000000000003.dtr-ol': Name does not resolve (gai_errno -2)
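
That getaddrinfo failure looks like the container cannot resolve its own name on the dtr-ol overlay. As a sanity check (a minimal sketch, assuming the overlay is attachable from a plain docker run; alpine is just an arbitrary image that ships nslookup), resolution can be tested from another container on the same network:

$ # alpine here is only a convenient image with nslookup; any image with DNS tools works
$ docker run --rm --net dtr-ol alpine nslookup dtr-rethinkdb-000000000003.dtr-ol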

UPDATED

I also note that on the same node, dtr-phase2 is also failing. Here is the end of the log for that container; these lines seem to contain better clues:

DEBU[0033] envs:
DEBU[0033] env: HTTP_PROXY=
DEBU[0033] env: HTTPS_PROXY=
DEBU[0033] env: DTR_VERSION=2.0.3
DEBU[0033] env: DTR_REPLICA_ID=000000000003
DEBU[0033] env: NO_PROXY=dtr-etcd-000000000003.dtr-br, .dtr-etcd-000000000003.dtr-br, dtr-etcd-000000000003, .dtr-etcd-000000000003, dtr-etcd-000000000003.dtr-ol, .dtr-etcd-000000000003.dtr-ol, dtr-rethinkdb-000000000003.dtr-br, .dtr-rethinkdb-000000000003.dtr-br, dtr-rethinkdb-000000000003, .dtr-rethinkdb-000000000003, dtr-rethinkdb-000000000003.dtr-ol, .dtr-rethinkdb-000000000003.dtr-ol, dtr-registry-000000000003.dtr-br, .dtr-registry-000000000003.dtr-br, dtr-registry-000000000003, .dtr-registry-000000000003, dtr-registry-000000000003.dtr-ol, .dtr-registry-000000000003.dtr-ol, dtr-api-000000000003.dtr-br, .dtr-api-000000000003.dtr-br, dtr-api-000000000003, .dtr-api-000000000003, dtr-api-000000000003.dtr-ol, .dtr-api-000000000003.dtr-ol, dtr-nginx-000000000003.dtr-br, .dtr-nginx-000000000003.dtr-br, dtr-nginx-000000000003, .dtr-nginx-000000000003, dtr-nginx-000000000003.dtr-ol, .dtr-nginx-000000000003.dtr-ol, ucp.onetouchappsctrl.com, .ucp.onetouchappsctrl.com
DEBU[0033] env: DTR_REPLICA_ID=000000000003
DEBU[0033] env: constraint:node==dtr-replica-03
DEBU[0041] Network set to: dtr-replica-03/dtr-br
DEBU[0042] Network set to: dtr-ol
WARN[0042] Couldn't get network ID for network 'dtr-ol'
DEBU[0042] Checking for bridge network. Have node: dtr-replica-03
DEBU[0042] Bridge name is: dtr-replica-03/bridge
WARN[0042] Couldn't get network ID for network 'dtr-replica-03/bridge'
ERRO[0042] Couldn't get ha config: Failed to start container dtr-rethinkdb-000000000003: Error response from daemon: unauthorized
unable to authenticate user with session token on auth provider: invalid_grant: invalid authentication credentials given
FATA[0042] Failed to start container dtr-rethinkdb-000000000003: Error response from daemon: unauthorized
unable to authenticate user with session token on auth provider: invalid_grant: invalid authentication credentials given
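
Those two WARN lines make me think this node simply can’t see the networks at all. A quick check on the broken node itself (just the standard docker CLI, nothing DTR-specific) would be:

$ # list the networks this node's daemon can see, then inspect the overlay if it shows up
$ docker network ls
$ docker network inspect dtr-ol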

(Patrick Devine) #2

It seems like a problem with your overlay network (dtr-ol). The “invalid_grant” would imply that the auth credentials are invalid, although that again may be a consequence of the flaky overlay. In order to salvage the deployment, you may want to use the DTR bootstrapper directly to do a “remove” and then a “join” of the node.


(Mark Henwood) #3

Thanks @pdevine. I have managed to sort out the overlay problem, but I’m now having trouble recreating the DTR node and joining it to the rest of the cluster. I’ve cobbled together a restart script from bits of the ‘user-data’ CloudInit script that gets run at creation. It now fails at the ‘dtr join’ phase with the following:

DEBU[0019] Connected to etcd
DEBU[0019] client = &{%!q(*client.httpClusterClient=&{0x6369c0 [{https  <nil> dtr-etcd-000000000001.dtr-ol:2379    }] 0 <nil> {{0 0} 0 0 0 0} 0xc820442960})}
ERRO[0019] Couldn't join cluster: etcdserver: peerURL exists
ERRO[0019] Error joining cluster: etcdserver: peerURL exists
FATA[0019] etcdserver: peerURL exists
DEBU[0024] Deleting container ef9d538a39c7545f5f2afab980905644fb95fdf34ebf2e2983eb9d90e36c5d67
DEBU[0025] Deleting container ef9d538a39c7545f5f2afab980905644fb95fdf34ebf2e2983eb9d90e36c5d67
DEBU[0025] Failed to remove container: Error response from daemon: Container ef9d538a39c7545f5f2afab980905644fb95fdf34ebf2e2983eb9d90e36c5d67 not found
DEBU[0025] Ignored error Error response from daemon: Container ef9d538a39c7545f5f2afab980905644fb95fdf34ebf2e2983eb9d90e36c5d67 not found...
FATA[0025] Phase 2 returned non-zero status: 1

It sounds from your post like there is a better bootstrapper to use than my hackabout script. Can you point me in the direction of the docs? Thank you.


(Patrick Devine) #4

@mhenwood you can find the docs here. Substitute “remove” for “install” in the bootstrap command. You’ll need to know the replica ID of the bad installation, and it will prompt you for a healthy replica. The command should look something like:

$ docker run -it --rm docker/dtr remove --ucp-url <location of ucp> --ucp-username admin --ucp-password <password> --ucp-ca "$(cat ucp-ca.pem)"

You can get the ucp-ca.pem file by doing:

$ curl -k https://$UCP_HOST/ca > ucp-ca.pem

Or, alternatively, just use the --ucp-insecure-tls flag. The bootstrapper will prompt you both for the replica ID you want to remove and for a healthy replica that it can connect to.
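
Once the remove completes, the join is the same shape of command. Something like this sketch should work (double-check the flags against the docs for your DTR version; --ucp-node names the node the new replica should land on):

$ # flag values are placeholders; the bootstrapper will prompt for an existing replica to sync from
$ docker run -it --rm docker/dtr join --ucp-url <location of ucp> --ucp-username admin --ucp-password <password> --ucp-ca "$(cat ucp-ca.pem)" --ucp-node <node for the new replica>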


(Mark Henwood) #5

Unfortunately, @pdevine, when I run the remove, this is what I get:

INFO[0000] Beginning Docker Trusted Registry replica remove
INFO[0000] Validating UCP cert
INFO[0000] UCP cert validation successful
INFO[0001] This cluster contains the replicas: 000000000002 000000000001 
Choose a replica to remove [000000000002]:

Sadly, it’s replica 3 where the problems are occurring: it’s the addition of 000000000003 that is failing with the etcd error ‘peerURL exists’, yet that replica is apparently not in the cluster.

UPDATE
Currently scouring the ‘high availability DTR architecture’ docs for clues.


(Mark Henwood) #6

FURTHER UPDATE: Removing all the old Docker volumes cleared the ‘peerURL exists’ problem. Performing a full --fresh-install on the UCP node join, followed by a DTR replica join, has brought up the DTR containers successfully on the 3rd replica node for the first time ever (certain error messages notwithstanding).
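
For the record, the volume cleanup on the broken node amounted to something like this (a sketch; exact volume names vary, so list them and eyeball the output before removing anything):

$ # list candidate volumes first; the 000000000003 filter is just what matched my replica ID
$ docker volume ls -q | grep 000000000003
$ docker volume ls -q | grep 000000000003 | xargs docker volume rm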

However, now, when I try a docker push from my local machine, it still ends in error “blob upload unknown”. So I am kind of back where I started.


(Patrick Devine) #7

@mhenwood What are you using for the storage backend? Also, is there a load balancer in front of this? Is it possible the load balancer is configured incorrectly?
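
One reason I ask: if each replica is writing blobs to its own local disk, a push whose requests the load balancer spreads across replicas can fail in exactly this way, because an upload started on one replica isn’t visible to the others. A rough way to probe the load balancer is to hit the registry endpoint through it a few times and watch for inconsistent responses (the hostname below is a placeholder for your DTR load balancer):

$ # repeat a few times; inconsistent responses across hits suggest a misbehaving backend
$ curl -sk -o /dev/null -w "%{http_code}\n" https://<dtr-lb-host>/v2/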


(Mark Henwood) #8

Hi @pdevine. For this error I was using ‘local file’ storage. I switched to the S3 backend and instead got 500 errors.

However, I have since stopped my Proof of Concept investigation of DDC, because trying to figure out my various errors proved very time-consuming.

Thank you for your help.


(Patrick Devine) #9

@mhenwood Bummer. We’re really trying to make it easier to set up and configure, but there are a lot of tricky concepts. If you’ve got some time, I’d love to hear about some of the things that were unintuitive or didn’t work.


(Mark Henwood) #10

I kept a little list of things which weren’t going well. If you send me your email address I will happily mail it over.