
Docker EE on Premise - Air Gapped

We are setting up Docker EE on premises in an air-gapped environment and are running into some issues.

I am wondering if anyone has documented this process. I have developed a kickstart that does all of the heavy lifting and configuration, but there are still some components that appear to be trying to reach out to external registries etc.

Docker EE 18.09.6
UCP 3.1.7
DTR 2.6.6
OS: CentOS 7

We installed Docker EE on two RHEL-based environments without any issues by following these guides:


In preparation we downloaded both the offline packages and the RPMs to install the Docker engine.

What particular issue are you facing?

Hi,

Thanks for the links. I am seeing communication errors between the nodes. The main thing I did differently is that I rolled my own UCP and DTR tarballs, so that may be the difference. I just finished updating the kickstart with the new files and I am installing now. I will see how that goes and respond back to this thread.

Thanks.
-bill

After reinstalling with the new tarballs I am still having the same issues.

With just the UCP and the DTR set up, I am able to log in to both consoles and things look normal.

However, looking at traffic on the DTR node, I am seeing TCP resets when it tries to communicate with the etcd control API on port 12379.

14:00:48.529084 IP 10.10.171.104.34528 > 10.10.171.101.12379: Flags [R], seq 747690708, win 0, length 0
14:00:48.529100 IP 10.10.171.104.34528 > 10.10.171.101.12379: Flags [R], seq 747690708, win 0, length 0
...
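For reference, this is roughly how I captured that on the DTR node (the interface and filter may need adjusting for your setup):

# all traffic to/from the etcd control API port
tcpdump -nn -i any tcp port 12379
# or only the resets
tcpdump -nn -i any 'tcp port 12379 and tcp[tcpflags] & tcp-rst != 0'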

On the UCP node I see the following messages from the etcdserver:

2019-06-17 17:59:40.840602 I | embed: ready to serve client requests
2019-06-17 17:59:40.841877 I | embed: serving client requests on [::]:2379
2019-06-17 18:00:02.931898 I | embed: rejected connection from "172.17.0.1:55595" (error "EOF", ServerName "")
2019-06-17 18:00:06.151971 I | embed: rejected connection from "172.17.0.1:55665" (error "EOF", ServerName "")
...
2019-06-17 18:00:53.272537 W | etcdserver: read-only range request "key:\"/registry/roles\" range_end:\"/registry/rolet\" count_only:true " with result "range_response_count:0 size:7" took too long (352.926594ms) to execute
2019-06-17 18:00:53.272814 W | etcdserver: read-only range request "key:\"/registry/networkpolicies\" range_end:\"/registry/networkpoliciet\" count_only:true " with result "range_response_count:0 size:5" took too long (321.570885ms) to execute
2019-06-17 18:00:55.655102 I | embed: rejected connection from "172.17.0.1:57419" (error "EOF", ServerName "")
2019-06-17 18:01:02.534974 I | embed: rejected connection from "172.17.0.1:43962" (error "EOF", ServerName "")
2019-06-17 18:01:02.972566 I | embed: rejected connection from "172.17.0.1:57601" (error "EOF", ServerName "")
^C

These are the kinds of messages I have been getting.

NTP is up, so the clocks on the various nodes are all in sync. DNS is working. I am just not sure what is going on with the resets.

Any ideas are appreciated. Thanks.

If you run your stuff in a VM: the NTP client needs to run on the VM host, and the guest additions need to sync the time into the guest VM. If you run Docker EE on a cloud IaaS instance, you might need to look out for a specific time server implementation; e.g. on Amazon, chrony needs to be used for accurate time sync.
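For the Amazon case, the relevant chrony entry is roughly the following, assuming the Amazon Time Sync Service link-local endpoint (add it to /etc/chrony.conf and restart chronyd):

server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4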

As long as the nodes keep their hostnames and IPs, everything should be fine. Even renaming the hostname of a swarm cluster member is a terrible idea… been there, done that… and didn’t like the outcome :wink:

Are you running any sort of port filter on your nodes?
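On CentOS 7 a quick check would be something like this, depending on whether firewalld or plain iptables is in use:

firewall-cmd --state
firewall-cmd --list-all
# or inspect the raw rules
iptables -L -n -v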

Yes, I wound up setting up an NTP server on a local VM and syncing all the Docker guests to that host. ntpstat was reporting that all the nodes were in sync with the local server:

[root@dtr1 tcpdump]# ntpstat
synchronised to NTP server (10.10.171.10) at stratum 12
   time correct to within 45 ms
   polling server every 64 s
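The client side on each Docker node is just a standard ntpd entry pointing at that VM, roughly this in /etc/ntp.conf:

server 10.10.171.10 iburst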

No, in regard to the port filter question.

The other thing that I am still doing is using the self-signed certs (I have not configured the TLS stuff yet).

Do you think that may be the problem?

A self-signed certificate is worth as much as letting UCP/DTR create their own self-signed certificates. Those are just used for their web UI and REST interfaces. You can replace these certificates later on from the UI or by rerunning the setup container with different parameters. Though, I highly doubt that the UCP/DTR certificate is causing your problem.

Swarm itself sets up an internal CA and provides certificates to all swarm nodes. It takes care of creating, cycling, and invalidating keys and all the fun stuff no one would seriously want to do manually on a command line :slight_smile:
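If you ever want to inspect or rotate that CA, it is all exposed on the CLI; a quick sketch from a manager node (these commands are available since Docker 17.06):

docker swarm ca                          # print the current root CA certificate
docker swarm ca --rotate                 # rotate the root CA and re-issue node certificates
docker swarm update --cert-expiry 2160h  # adjust node certificate lifetime (default ~90 days)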

OK thanks.

In my configuration I changed the default UCP port to 8443, so I am switching that back to see if that has any effect. :slight_smile:
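For anyone following along, the port is chosen at install time via the UCP bootstrapper; what I run is along these lines (the host address is my UCP node, and --controller-port is, as I understand it, the flag that sets the web UI/API port):

docker container run --rm -it --name ucp \
  -v /var/run/docker.sock:/var/run/docker.sock \
  docker/ucp:3.1.7 install \
    --host-address 10.10.171.101 \
    --controller-port 443 \
    --interactive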

I wrote a script to dump all of the logs for all running processes, and there are other errors for other services, so I am going to weed through those and see if I can figure anything out. I did see a lot of attempts to connect to external registries. Maybe I need to add a registry entry in the daemon.json file and have it point internally. (Although the local registry has no additional images…)
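Something along these lines in /etc/docker/daemon.json is what I have in mind; dtr.example.local is just a placeholder for the internal DTR, the insecure-registries entry would only be needed until its CA is trusted, and the daemon needs a restart afterwards:

{
  "registry-mirrors": ["https://dtr.example.local"],
  "insecure-registries": ["dtr.example.local"]
}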

I appreciate all of your help.
-bill

I did not configure my DTR as the default registry in /etc/docker/daemon.json.
Though, I also didn’t investigate each and every log as long as everything works.

I would suggest adding a stack and seeing whether DNS resolution works for the containers, then checking in UCP that the details about nodes/stacks/services and containers are correct.
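A minimal test could look like this; the image names are just examples, in an air-gapped environment they would of course have to come from your local registry, and the exec has to be run on the node where the task landed:

# dns-test.yml
version: "3.7"
services:
  web:
    image: nginx:alpine
  ping:
    image: alpine:3.9
    command: sleep 3600

docker stack deploy -c dns-test.yml dnstest
docker exec -it $(docker ps -q -f name=dnstest_ping) nslookup dnstest_web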

Apart from seeing errors in the logs, what exactly is not working?

Sounds good. Thanks.

Still working through an approach for getting actual images into the DTR, so I have not stood up any additional applications. Will be testing that shortly.

Did you find that you had to have all of the UCP images available on the worker nodes, or just a subset? I know that it needs the ucp-agent, and I just went ahead and added the others. I guess the key is going to be to get the registry set up, and then I won’t need to duplicate the images etc.

Thanks.

Not sure; one of our staff members wrote the Ansible scripts to set up a cluster. Actually, it would make sense to have the offline tars loaded onto each and every node so the images are present in its local image cache.
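Loading the offline bundles into each node's cache would look roughly like this; the exact archive names depend on the bundles you downloaded for 3.1.7 / 2.6.6:

docker load -i ucp_images_3.1.7.tar.gz   # on every manager and worker
docker load -i dtr_images_2.6.6.tar.gz   # on the nodes that will run DTR
docker image ls | grep -E 'ucp|dtr'      # sanity check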

Regarding the registry: what works wonders for your own images (actually, Swarm/K8s without a local image repo is a mess) is most likely not the solution for distributing the UCP/DTR images to your nodes, as the image names are hard-coded and expected to be present on the host or to be pulled from Docker Hub.

Once your DTR is configured, make sure to import its CA certificate on each host:
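On CentOS/RHEL that is roughly the following; dtr.example.local is a placeholder for your DTR address, and DTR serves its CA certificate on the /ca endpoint:

curl -k https://dtr.example.local/ca -o /etc/pki/ca-trust/source/anchors/dtr.example.local.crt
update-ca-trust
systemctl restart docker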

Makes sense. I was wondering how I could duplicate the namespaces. Easier to just load them locally on each node, as they are already available in the kickstart.

Thanks again for all your responses.

It makes sense to share experience, so others have less to suffer *cough*

What I actually wanted to say: welcome!

Yes, I have a script that does that for me. Thanks!

Just to close this out a bit:

  • When I replaced my UCP and DTR tarballs with the official versions I was still seeing the same issues.
  • When I changed the UCP back to using the default port of 443 the issues appeared to go away.
  • Was still seeing connection attempts to external hosts but these all seemed to be related to usage reports and were easy to turn off in the UCP under Admin -> Settings -> Usage.
  • On the browser side I am not seeing graphs on the dashboard. This is related to the outdated version of Firefox that I am forced to use. Verified by spinning up a VM with a newer version, so this is just a local issue that I need to resolve.
  • Pushing and pulling images from the DTR works fine.
  • I ran into the OpenID authentication issue when it was trying to use the UCP for single sign-on. I got around this by not specifying the --dtr-external-url argument when installing the DTR. This was only necessary for the on-site version with the older browser. My lab setup worked fine with the original arguments.
  • Time is critical in this setup. I had to set up a local VM running ntpd for all of the internal nodes to sync time against.
  • I also had to set up my own internal name server and update the Docker daemon file to use it (see the sketch below). My original thinking was that it would fall back to the hosts file, but that was not the case. As documented, Docker will default to external servers (I think it mentions Google’s 8.8.8.8/8.8.4.4 servers somewhere).
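The name server sketch mentioned above is just the dns key in /etc/docker/daemon.json, followed by a daemon restart; the address here is a placeholder for your internal resolver:

{
  "dns": ["10.10.171.53"]
}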