Docker on EMR thru mrjob?

My project has a bunch of dependencies that are a pain to set up (and maintain!) in bootstrap scripts on Amazon EMR and I’m wondering why I shouldn’t try to run stuff in Docker on EMR.

We use mrjob (Python) to set up our EMR jobs. EMR 3.11 runs Amazon Linux 2015.09, which already has Docker in it… I’m thinking I’ll just modify our MapReduce steps to (rough sketch after this list):

  • write MapReduce inputs to a file in a directory that gets bind-mounted into the container
  • use “docker run” to read the input file and perform operations inside our nice, easily maintainable Docker image
  • write output files to the same mounted directory
  • read the output files and yield output to the next step
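
Something like the following is what I have in mind. The image name (example/pipeline), the container command, and the /data mount point are all placeholders, not our real setup:

```python
# Rough sketch of one such step. The image name ("example/pipeline"),
# the container command, and the /data mount point are placeholders.
import os
import subprocess
import tempfile

from mrjob.job import MRJob


class MRDockerStep(MRJob):

    def mapper(self, _, line):
        # Write this input record to a directory we'll bind-mount
        # into the container.
        workdir = tempfile.mkdtemp(prefix='mrjob-docker-')
        with open(os.path.join(workdir, 'input.txt'), 'w') as infile:
            infile.write(line + '\n')

        # Do the real work inside the prebuilt image; -v mounts the
        # host directory at /data so the container can read input.txt
        # and write output.txt.
        subprocess.check_call([
            'sudo', 'docker', 'run', '--rm',
            '-v', '%s:/data' % workdir,
            'example/pipeline',
            'process', '/data/input.txt', '/data/output.txt',
        ])

        # Read the container's output and yield it to the next step.
        with open(os.path.join(workdir, 'output.txt')) as outfile:
            for out_line in outfile:
                yield None, out_line.rstrip('\n')


if __name__ == '__main__':
    MRDockerStep.run()
```

Spinning up a container per record would obviously be slow; in practice we’d batch a whole split’s worth of input per “docker run” (e.g., buffer in the mapper and flush in mapper_final).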

The advantage is that our bootstrap script for EMR becomes super simple and fast, because almost all of the setup is already done and baked into a Docker image. No more dependency madness. We just need to install Docker, pull the image, and then feed input into it in each step.
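
Since EMR bootstrap actions can be any executable, the bootstrap could itself stay in Python. A minimal sketch, assuming Docker is available in the Amazon Linux package repos (as on the 2015.09 AMI) and reusing the made-up image name from above:

```python
#!/usr/bin/env python
# Minimal bootstrap sketch: install Docker, start the daemon, and
# pre-pull our image. "example/pipeline" is a placeholder image name.
import subprocess

for cmd in [
    ['sudo', 'yum', 'install', '-y', 'docker'],
    ['sudo', 'service', 'docker', 'start'],
    ['sudo', 'docker', 'pull', 'example/pipeline'],
]:
    subprocess.check_call(cmd)
```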

The disadvantage is that we’ll introduce some computational overhead and complexity to the code. And I’m not 100% sure this is going to work (I can imagine multithreaded calls to the Docker daemon on EMR crossing the beams and majorly screwing up memory/disk resource sharing).

Does this sound crazy? Is there a more standard way people run Docker on EMR?

@silviaterra

I’m not that familiar with EMR (beyond the basic principle), but assuming you’re set on mrjob in particular (vs., say, plain Hadoop), using Docker to distribute the processes that run on the given hosts (i.e., your proposed approach) seems reasonable.

If your basic dependencies don’t change that often, you might instead consider using a tool such as Packer to “bake” an AMI and simply rsync the code over when a new host is brought up. That would free you from having to entangle your pipeline with Docker.

However, the disadvantage of this method is that iterating on an AMI is slower and costlier than iterating on a Docker image. It also seems much easier to experiment and play around with the same “stack” / pipeline in a local environment with Docker. And if using Amazon Linux is a hard requirement for you, Docker would also let you run userspace programs from Debian-based (or other) distros on top of it. IMO, that’s pretty nice.

(I can imagine multithreaded calls to the Docker daemon on EMR crossing the beams and majorly screwing up memory/disk resource sharing).

As for the performance concerns, I’m not really sure what the “beams” are here. Might be able to help more if you can elaborate: network? Page cache? In my opinion, Docker’s main performance concerns are disk-related, but those can be mitigated by using a volume as you suggested. The default bridge network driver has some overhead due to NAT, but it’s fast enough for many applications (basically, if pushing packets through quickly is not the main purpose of the containerized process, you’re unlikely to notice).

Thanks, Nathan! I’m going to give this a shot with a proof of concept and will post back if I manage to get it working.

After making the necessary blood sacrifices, I have managed to get a proof of concept working! Example at https://github.com/SilviaTerra/docker-emr-poc

One thing that’s still a bit gross is that I have to “sudo docker” everything. Per the instructions on AWS’s site, you need to add $USER to the “docker” group for that user to be able to run docker without sudo. But you have to log out and back in for this change to take effect, which is a non-starter in the bootstrap script for EMR. I tried newgrp too, but that opens up a sub-shell and doesn’t play nicely with the bootstrap process either. Any thoughts there?

Opened a question about the group permissions on ServerFault - http://serverfault.com/questions/762788/updating-group-without-log-out-or-subshell

Now you’re getting the hang of it!

One thing that's still a bit gross is that I have to "sudo docker" everything

I’m actually one of the weirdos who doesn’t mind sudo docker all that much, especially in higher environments, but I’d be surprised if there weren’t a workaround for reloading docker group membership, maybe by spawning a new shell (su?). @tianon @svendowideit do you know of one?
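
For instance, totally untested, and assuming the bootstrap already ran something like sudo usermod -aG docker hadoop: wrapping the call in a fresh login shell should re-read /etc/group without a logout.

```python
# Untested sketch: `su -` spawns a fresh login shell, which
# re-initializes supplementary groups from /etc/group, so the hadoop
# user's new "docker" membership applies without logging out. Assumes
# passwordless sudo (as on EMR AMIs) and that the bootstrap already
# ran `sudo usermod -aG docker hadoop`.
# (`sg docker -c 'docker info'` from shadow-utils might work too.)
import subprocess

subprocess.check_call([
    'sudo', 'su', '-', 'hadoop', '-c', 'docker info',
])
```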

Nice to hear you got it working!

Did you try restarting Hadoop on the EMR cluster after adding ‘yarn’ to the ‘docker’ group? I’m wondering if that might work.

Hi Praveen - I haven’t tried that. I googled around a bit and it seems like restarting Hadoop from within an mrjob bootstrap script is pretty complicated (although maybe I’m missing something obvious?). Not sure how that would fix the group permissions issue, though - I think you have to log out and back in for the new group membership to take effect.