Docker Community Forums


Use cases of Docker for a production Hadoop cluster


(Juan Rodriguez Hortala) #1

Hi,

I’m new to Docker but I find it very interesting, in particular because it lets you replicate the same environment for development and production, and limits dependence on a particular IaaS or PaaS provider.
I was wondering whether it is feasible to use Docker to implement a production Hadoop cluster, with several services such as HDFS, YARN, HBase, ZooKeeper, and Apache Kafka running on each of the slave nodes in order to obtain data locality. Do any of you have experience with a production Hadoop cluster based on Docker? More generally, do you think that makes sense? Is Docker a suitable technology for this, or is there some technical issue that makes this approach clearly wrong? It looks like the people from http://ferry.opencore.io/ have already made some progress in that direction, but from their documentation Ferry looks more like a development tool than something to be used in production. Maybe I have missed something.

Thanks a lot for your help,

Greetings,

Juan


(Juan Rodriguez Hortala) #2

I’ve found this use case from the last Hadoop Summit, in case someone is interested: http://es.slideshare.net/Hadoop_Summit/th-130p230-amatyasv2 and https://www.youtube.com/watch?v=7sQNi-57dNU

Greetings,

Juan


(Jared Broad) #3

We use Docker on QuantConnect.com to make it easy to update a Hadoop-style cloud. It’s a custom implementation but uses the same principles.

The Docker part is a “docker pull” of the image on reboot, followed by an autostart of the latest Docker container with our data processing application. This way we can issue a blanket restart to dozens of processing nodes and they all reboot with the latest image! It saves a lot of time otherwise spent updating, imaging, stopping, and re-creating 10-100 VMs :)
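
The pull-then-restart cycle described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual QuantConnect setup: the image name, the container name, and the `boot_commands` / `update_and_start` helpers are all hypothetical, and it assumes the `docker` CLI is on the PATH.

```python
import subprocess

# Hypothetical names for illustration; the thread does not give the real ones.
IMAGE = "example/data-processor:latest"
CONTAINER = "data-processor"


def boot_commands(image, container):
    """Return the docker CLI calls for one pull-then-restart cycle, in order."""
    return [
        ["docker", "pull", image],                            # fetch the newest image
        ["docker", "rm", "-f", container],                    # drop the old container
        ["docker", "run", "-d", "--name", container, image],  # start a fresh one
    ]


def update_and_start(image=IMAGE, container=CONTAINER, run=subprocess.run):
    """Run the boot commands in order; only the 'rm' step is allowed to fail."""
    for cmd in boot_commands(image, container):
        # 'docker rm' may exit non-zero on first boot, when no container exists yet.
        must_succeed = cmd[1] != "rm"
        run(cmd, check=must_succeed)
```

Calling `update_and_start()` from a boot-time script on each node would then mean that a blanket restart picks up whatever image is currently tagged `latest`.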

The hardest part was just getting Docker to work properly on boot… that took hours.

Jared


(Mrsingh) #4

Hi Juan,

Did you get to work on this further? I am in the same boat.

Thanks,
Jeet


(Juan Rodriguez Hortala) #5

Hi Jared,

Thanks a lot for the feedback. Since you are using it in production, I take it that the overhead of the Docker containers is not a performance problem.

Thanks again for your answer,

Greetings,

Juan


(Juan Rodriguez Hortala) #6

Hi mrsingh,

I haven’t gone past the first investigation stage, but it looks like the folks at SequenceIQ have it all pretty much figured out. At the end of the talk I mentioned in my second post, you can see a reference to Cloudbreak (http://sequenceiq.com/cloudbreak/), an open source project from SequenceIQ that uses Docker and Ambari to deploy Hadoop clusters on different cloud providers. I suggest you take a look at that; I will for sure when I have some time!

Greetings,

Juan


(Jared Broad) #7

@juanrh, @mrsingh - It is completed and live in production. There’s no measurable overhead from Docker.

I used a simple 1-2 minute delay in rc.local before loading the Docker image after system boot. This gave the system time to start the Docker service before the docker update/run commands were issued.
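
As a variation on the fixed delay described above, a boot script could poll until the Docker daemon answers and give up after a couple of minutes. The sketch below is an alternative to the fixed sleep, not what was actually deployed. The helper names `docker_is_up` and `wait_for_docker` and the 120-second and 5-second values are assumptions chosen for illustration. It relies on `docker info` exiting non-zero while the daemon is unreachable.

```python
import subprocess
import time


def docker_is_up():
    """Return True once the daemon answers 'docker info'."""
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # The docker CLI itself is not installed or not on the PATH.
        return False
    return result.returncode == 0


def wait_for_docker(is_up=docker_is_up, timeout=120.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the daemon is ready; return False if `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if is_up():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Compared with a fixed 1-2 minute sleep, polling continues as soon as the service is ready and reports failure explicitly if it never comes up.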


(Thorsten Michels) #8

Hi Jared,

I just came across your postings, as I was looking for some information about setting up a Hadoop cluster on Docker.

I heartily agree with you that Docker adds no (measurable) overhead to the execution.

What I am wondering about is which filesystem you are using in Docker. I am guessing it is not AUFS.
Do you have any comparison data for the Docker filesystems under the heavy load of Hadoop and its ecosystem?
It would be great if you could give me some hints or directions.

Thank you very much in advance.

Thorsten


(Jared Broad) #9

We use XFS simply because of the number of files we have in the system, but I haven’t done any benchmarks on other ones. We’re running a custom setup though, not true Hadoop.


(Rajkumarrrrr) #10

Can anyone suggest good Hadoop projects and Hadoop training?


(Genesissarah) #11

As a Hadoop developer, I often want an easier way to create a multi-node Hadoop cluster. I first tried VirtualBox and Vagrant, but launching a cluster was very slow, and the more nodes we added, the slower the launch became. I can’t afford to wait that long to check and debug each change, and I suspect every developer feels the same.