Use cases of Docker for a production Hadoop cluster

Hi,

I’m new to Docker but I find it very interesting, in particular because it allows you to replicate the same environment for development and production, and it reduces dependency on a particular IaaS or PaaS provider.
I was wondering whether it is feasible to use Docker for implementing a production Hadoop cluster, with several services like HDFS, YARN, HBase, ZooKeeper, and Apache Kafka running on each of the slave nodes in order to obtain data locality. Do any of you folks have experience with a production Hadoop cluster based on Docker? Or, in general, do you think that makes sense? Is Docker a suitable technology for this, or is there some technical issue that makes this approach clearly wrong? It looks like the people from http://ferry.opencore.io/ have already made some progress in that direction, but from their documentation it looks like Ferry is more a development tool than something to be used in production; maybe I have missed something, though.

Thanks a lot for your help,

Greetings,

Juan


I’ve found this use case from the last Hadoop Summit, in case anyone is interested: http://es.slideshare.net/Hadoop_Summit/th-130p230-amatyasv2, https://www.youtube.com/watch?v=7sQNi-57dNU.

Greetings,

Juan

We use Docker on QuantConnect.com for easy updates of a Hadoop-style cloud. It’s a custom implementation, but it uses the same principles.

The Docker part is a “docker pull” of the image on reboot, followed by autostart of the latest container with our data processing application. This way we can issue a blanket restart to dozens of processing nodes and they’ll all reboot with the latest image! It saves a lot of time updating / imaging / stopping / re-creating 10-100 VMs :slight_smile:
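The pull-on-reboot pattern described above can be sketched roughly like this; the image and container names are hypothetical placeholders, not QuantConnect’s actual setup:

```shell
#!/bin/sh
# Sketch of a "pull latest image on boot, then run it" startup script.
# "myorg/data-processor" is a made-up image name for illustration.

IMAGE="myorg/data-processor:latest"

# Fetch the newest image from the registry.
docker pull "$IMAGE"

# Remove any stale container left over from the previous boot,
# ignoring the error if none exists.
docker rm -f data-processor 2>/dev/null || true

# Start the processing application from the freshly pulled image.
docker run -d --name data-processor "$IMAGE"
```

Issuing a reboot across the fleet then brings every node up on the latest image, since each one re-pulls before starting its container.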

The hardest part was just getting Docker to work properly on boot… that took hours.

Jared

Hi Juan,

Did you get to work on this any further? I am in the same boat.

Thanks,
Jeet

Hi Jared,

Thanks a lot for the feedback. If you are using it in production, I take it the overhead caused by the Docker container is not a problem for performance.

Thanks again for your answer,

Greetings,

Juan

Hi mrsingh,

I haven’t gone past the initial investigation stage, but it looks like the folks at SequenceIQ have it all pretty much figured out. At the end of the talk I mentioned in my second post, you can see a reference to Cloudbreak (http://sequenceiq.com/cloudbreak/), an open source project by SequenceIQ that uses Docker and Ambari for deploying Hadoop clusters on different cloud providers. I suggest you take a look at that; I will for sure when I have some time!

Greetings,

Juan

@juanrh, @mrsingh - It is completed and live in production. There’s no measurable overhead from Docker.

I used a simple 1-2 minute delay in rc.local to load the Docker image after system boot. This way the system had time to start the Docker service before issuing the docker update/run commands.
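A minimal rc.local along those lines might look like the sketch below; the sleep duration, image name, and container name are illustrative assumptions:

```shell
#!/bin/sh -e
# /etc/rc.local -- executed once at the end of the boot sequence.

# Give the Docker daemon time to come up before talking to it.
sleep 90

# Pull the latest image and start the container
# ("myorg/data-processor" is a hypothetical name).
docker pull myorg/data-processor:latest
docker run -d --name data-processor myorg/data-processor:latest

exit 0
```

A fixed sleep is a blunt instrument; on systemd-based hosts, a unit with an `After=docker.service` dependency achieves the same ordering without guessing a delay.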

Hi Jared,

I just came across your postings, as I was looking for some information about setting up a Hadoop cluster on Docker.

I heartily agree with you that Docker adds no (measurable) overhead to execution.

The thing I am wondering about is: which filesystem are you using in Docker? I am guessing it is not AUFS.
Do you have any comparison data for the Docker filesystems under the heavy load of Hadoop and its ecosystem?
It would be great if you could give me some hints or directions.

Thank you very much in advance.

Thorsten

We use XFS simply because of the number of files we have in the system, but I haven’t done any benchmarks on the other ones. We’re running a custom setup though, not true Hadoop.
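For anyone digging into the filesystem question: note that the backing filesystem (e.g. XFS) is separate from Docker’s storage driver (the layered filesystem used for images, e.g. AUFS or devicemapper). A quick way to check what a given host is using, as a sketch:

```shell
# Show the storage driver and backing filesystem the Docker daemon
# is currently using (look for "Storage Driver" in the output).
docker info | grep -i -A 2 "storage driver"
```

The storage driver can also be pinned explicitly via the daemon’s startup options rather than relying on the distribution’s default, which makes benchmark comparisons between drivers reproducible.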