USE CASE: I have an Airflow Docker container and I want to use some tools/frameworks (like HDFS, Spark, Hive) that exist on a remote host from within my Airflow Docker container. How can I achieve that? Does anybody have an idea, a hint, or examples?
…the same way you’d invoke those services if you weren’t running in a Docker container? The Docker engine will do outbound NAT for you. (The forum has had several questions from people who just want to run curl(1) from a containerized shell, which really should just work for them.)
@dmaze first of all thank you for your answer.
If I try to call an HDFS function like hdfs -c 'hadoop fs -mkdir /folder'
or spark-submit
or execute a jar file, it says that the command is not found. My Docker container and the tools that I want to use are not on the same host.
I believe what David is trying to tell you is that this is no different than if you were trying to access HDFS remotely from your laptop. Either way you will need a client to talk to the remote host. This means that you will need to add that client software to your Docker image.
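As a quick way to confirm that the "command not found" errors come from missing client software rather than from networking, here is a minimal sketch you could run inside the container. The tool names are just the ones mentioned in this thread, and missing_tools is a hypothetical helper, not part of Airflow.

```python
import shutil


def missing_tools(tools):
    """Return the names from `tools` that are not on this container's PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumed tool names, taken from this thread; adjust to what your DAGs call.
    missing = missing_tools(["hdfs", "hadoop", "spark-submit", "ssh"])
    if missing:
        print("Not installed in this image:", ", ".join(missing))
    else:
        print("All client tools found on PATH")
```

Anything it reports as missing has to be installed in the image before a task can call it, no matter where the cluster itself runs.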
What you need to do is build your own Docker image. Create a Dockerfile FROM airflow
and RUN
the commands to install any Hadoop or Spark tools that you need. This assumes that tools like hdfs
can talk to a remote host. Then spin up a container from that image and you will have the tools that you need.
A quick look at the Airflow site shows that there are packages for Hadoop and HDFS:
pip install airflow[devel_hadoop]
pip install airflow[hdfs]
Which installs Airflow + dependencies on the Hadoop stack and HDFS hooks and operators respectively. I’m not sure if these are the packages you need but you get the idea. You need to create your own image with the tools you need and instantiate a container from that.
~jr
@dmaze @rofrano thank you for your replies, but that is not what I meant. My use case is that I have dockerized Airflow installed on a server and I want to create a permanent ssh connection to my Hadoop cluster. My question is how to leverage Docker functionality to achieve this goal.
PS: the suggestions made by @rofrano don’t work because pip install airflow[…] assumes that Hadoop, for example, is already reachable for Airflow to use its functionality, either installed on the same host or connected to a remote host.
@sdikby I’m not sure that I follow you now that you have mentioned ssh
but let’s take that as an example of what you want to do:
What is stopping you from using ssh from your airflow container to your hadoop cluster?
Do you not have an ssh client in your airflow container? If not, install one.
I thought that you wanted to programmatically make client calls to a remote server in which case you would need some client software to talk to the remote. It sounds like now you want to ssh to the remote server and issue the commands on the remote. Is that correct?
If it’s just ssh that you want, then install an ssh client in your airflow container. I don’t see why this wouldn’t work for you. As @dmaze said, whatever you would do on a regular server you just do from within your container.
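To make the ssh route concrete, here is a rough sketch of running a command on the remote cluster from Python inside the container. It assumes an ssh client is installed in the image and that key-based login to the cluster is already set up. The host name, the user name, and both helper functions are made up for illustration.

```python
import subprocess


def build_ssh_command(host, remote_command, user=None):
    """Build the argument list for running `remote_command` on `host` over ssh."""
    target = f"{user}@{host}" if user else host
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which is what you want when no one is at the terminal.
    return ["ssh", "-o", "BatchMode=yes", target, remote_command]


def run_remote(host, remote_command, user=None):
    """Run the command remotely and return its stdout; raises if it fails."""
    result = subprocess.run(
        build_ssh_command(host, remote_command, user),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example call (hypothetical host and user):
# run_remote("hadoop-edge.example.com", "hadoop fs -mkdir /folder", user="airflow")
```

The command itself executes on the Hadoop host, so the container only needs the ssh client and not the Hadoop tools.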
~jr