Use cases of Docker for a production Hadoop cluster

Hi,

I’m new to Docker but I find it very interesting, in particular because it allows you to replicate the same environment for development and production, and it reduces dependency on a particular IaaS or PaaS provider.
I was wondering whether it is feasible to use Docker for implementing a production Hadoop cluster, with several services like HDFS, YARN, HBase, ZooKeeper, and Apache Kafka running on each of the slave nodes in order to obtain data locality. Do any of you folks have experience with a production Hadoop cluster based on Docker? Or, in general, do you think that makes sense? Is Docker a suitable technology for this, or is there some technical issue that makes this approach clearly wrong? It looks like the people from http://ferry.opencore.io/ have already made some progress in that direction, but from their documentation it looks like Ferry is more a development tool than something to be used in production; maybe I have missed something, though.

Thanks a lot for your help,

Greetings,

Juan


I’ve found this use case from the last Hadoop Summit, in case anyone is interested: http://es.slideshare.net/Hadoop_Summit/th-130p230-amatyasv2, https://www.youtube.com/watch?v=7sQNi-57dNU.

Greetings,

Juan

We use Docker on QuantConnect.com for easy updates of a Hadoop-style cloud. It’s a custom implementation, but it uses the same principles.

The Docker part is a “docker pull” of the image on reboot, followed by autostart of the latest container with our data processing application. This way we can issue a blanket restart to dozens of processing nodes and they’ll all reboot with the latest image! It saves a lot of time updating / imaging / stopping / re-creating 10-100 VMs :slight_smile:
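The pull-on-reboot pattern described above can be sketched roughly like this; the image and container names are hypothetical placeholders, not QuantConnect’s actual setup:

```shell
#!/bin/sh
# Sketch of a "pull latest image on boot, then run it" startup script.
# "myorg/data-processor" is a made-up image name for illustration.

IMAGE="myorg/data-processor:latest"

# Fetch the newest image from the registry.
docker pull "$IMAGE"

# Remove any stale container left over from the previous boot,
# ignoring the error if none exists.
docker rm -f data-processor 2>/dev/null || true

# Start the processing application from the freshly pulled image.
docker run -d --name data-processor "$IMAGE"
```

Issuing a reboot across the fleet then brings every node up on the latest image, since each one re-pulls before starting its container.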

The hardest part was just getting Docker to work properly on boot… that took hours.

Jared

Hi Juan,

Did you get to work on this any further? I am in the same boat.

Thanks,
Jeet

Hi Jared,

Thanks a lot for the feedback. If you are using it in production, I take it the overhead caused by the Docker container is not a problem for performance.

Thanks again for your answer,

Greetings,

Juan

Hi mrsingh,

I haven’t gone past the initial investigation stage, but it looks like the folks at SequenceIQ have it all pretty much figured out. At the end of the talk I mentioned in my second post, you can see a reference to Cloudbreak (http://sequenceiq.com/cloudbreak/), an open source project by SequenceIQ that uses Docker and Ambari for deploying Hadoop clusters on different cloud providers. I suggest you take a look at that; I will for sure when I have some time!

Greetings,

Juan

@juanrh, @mrsingh - It is completed and live in production. There’s no measurable overhead from Docker.

I used a simple 1-2 minute delay in rc.local to load the Docker image after system boot. This way the system had time to start the Docker service before issuing the docker update/run commands.
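A minimal rc.local along those lines might look like the sketch below; the sleep duration, image name, and container name are illustrative assumptions:

```shell
#!/bin/sh -e
# /etc/rc.local -- executed once at the end of the boot sequence.

# Give the Docker daemon time to come up before talking to it.
sleep 90

# Pull the latest image and start the container
# ("myorg/data-processor" is a hypothetical name).
docker pull myorg/data-processor:latest
docker run -d --name data-processor myorg/data-processor:latest

exit 0
```

A fixed sleep is a blunt instrument; on systemd-based hosts, a unit with an `After=docker.service` dependency achieves the same ordering without guessing a delay.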

Hi Jared,

I just came across your postings, as I was looking for some information about setting up a Hadoop cluster on Docker.

I heartily agree with you that Docker adds no (measurable) overhead to execution.

The thing I am wondering about is: which filesystem are you using in Docker? I am guessing it is not AUFS.
Do you have any comparison data for the Docker filesystems under the heavy load of Hadoop and its ecosystem?
It would be great if you could give me some hints or directions.

Thank you very much in advance.

Thorsten

We use XFS simply because of the number of files we have in the system, but I haven’t done any benchmarks on the other ones. We’re running a custom setup though, not true Hadoop.
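For anyone digging into the filesystem question: note that the backing filesystem (e.g. XFS) is separate from Docker’s storage driver (the layered filesystem used for images, e.g. AUFS or devicemapper). A quick way to check what a given host is using, as a sketch:

```shell
# Show the storage driver and backing filesystem the Docker daemon
# is currently using (look for "Storage Driver" in the output).
docker info | grep -i -A 2 "storage driver"
```

The storage driver can also be pinned explicitly via the daemon’s startup options rather than relying on the distribution’s default, which makes benchmark comparisons between drivers reproducible.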