My project has a bunch of dependencies that are a pain to set up (and maintain!) in bootstrap scripts on Amazon EMR and I’m wondering why I shouldn’t try to run stuff in Docker on EMR.
- write MapReduce inputs to a file in a host directory that's volume-mounted into the container
- use "docker run" to read the input file and perform operations inside of our nice, easily maintainable Docker image
- write output files to the same mounted directory
- read the output files and yield output to the next step
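To make the steps above concrete, here's a minimal sketch of what one step's driver could look like. Everything here is hypothetical — the image name `my-org/my-job-image`, its `process` entrypoint, and the `input.txt`/`output.txt` filenames are placeholders, and `cmd` is parameterized so you can swap in any command that reads the input file and writes the output file:

```python
import subprocess
from pathlib import Path

def run_step(input_lines, workdir, cmd=None):
    """Write input to a shared dir, run a command (e.g. `docker run` with
    that dir volume-mounted), and read back the output it produced."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    # 1. Write MapReduce inputs to a file in the mounted directory.
    (workdir / "input.txt").write_text("\n".join(input_lines) + "\n")

    if cmd is None:
        # Hypothetical invocation -- substitute your own image and entrypoint.
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{workdir.resolve()}:/data",
            "my-org/my-job-image",
            "process", "/data/input.txt", "/data/output.txt",
        ]

    # 2. Run the container (or any stand-in command) against the input file.
    subprocess.run(cmd, check=True)

    # 3. Read the output file and yield lines to the next step.
    return (workdir / "output.txt").read_text().splitlines()
```

Because `cmd` is injectable, you can test the plumbing locally without Docker by passing a plain shell command that does the same read-transform-write.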
The advantage is that our bootstrap script for EMR becomes super simple and fast because almost all of the setup is already done and saved in a docker image. No more dependency madness. We just need to install docker, pull the image, and then feed input into it in each step.
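For reference, this is roughly what I imagine the bootstrap action shrinking to (a sketch, assuming Amazon Linux with yum, the `hadoop` task user, and a placeholder image name):

```shell
#!/bin/bash
# EMR bootstrap action sketch: install Docker and pre-pull the job image.
# "my-org/my-job-image:latest" is a placeholder -- substitute your own image.
set -euxo pipefail

sudo yum install -y docker
sudo service docker start
sudo usermod -a -G docker hadoop   # let the hadoop user run containers

docker pull my-org/my-job-image:latest
```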
The disadvantage is that we'll introduce some computational overhead and complexity to the code. And I'm not 100% sure this is going to work (I can imagine concurrent `docker run` calls from multiple mapper tasks on EMR crossing the beams and majorly screwing up memory/disk resource sharing).
Does this sound crazy? Is there a more standard way people run Docker on EMR?