In our project we use Docker to containerize all parts of our application except the part that runs on a Spark cluster. The Spark cluster is already installed and running as part of a Hadoop installation, so we did not dockerize Spark.
Now, when we deploy our application, we deploy the latest Docker images (through Kubernetes). But our deployment includes a bunch of ‘reference data’ (plain CSV files) that needs to be placed in a common location that the Spark cluster can access.
One way we had thought of was to place this ‘reference data’ on the Docker containers themselves and copy it over to a shared storage location that Spark can access (during submission of Spark jobs).
The challenge here is that every time a Spark job is submitted from one of the Docker containers, we need to spend time copying these files over to the shared storage location. These files could be > 5 GB, and hence will take up precious minutes to copy over (even if it is within the data center).
We were thinking of having a separate, “disposable” Docker container which will
- have the reference data baked into it,
- copy the files to the network share as its first activity when it is brought up, and then
- shut down once the copy is done.
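To make the idea concrete, here is a minimal sketch of how this could be expressed as a Kubernetes Job (since our deployment already goes through Kubernetes). A Job runs its pod to completion and does not restart it afterwards, which matches the “copy, then shut down” behaviour. All names here (image, paths, PVC) are hypothetical placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: reference-data-copy            # hypothetical Job name
spec:
  template:
    spec:
      containers:
      - name: copy-reference-data
        image: myregistry/reference-data:latest   # hypothetical image with the CSVs baked in
        # The container's only task: copy the data to the shared mount, then exit.
        command: ["sh", "-c", "cp -r /reference-data/. /shared/reference-data/"]
        volumeMounts:
        - name: shared-storage
          mountPath: /shared
      restartPolicy: Never             # pod is not restarted once the copy succeeds
      volumes:
      - name: shared-storage
        persistentVolumeClaim:
          claimName: spark-shared-pvc  # hypothetical PVC backed by the network share
  backoffLimit: 2                      # retry a couple of times on failure
```

The pod terminates after the `cp` completes, so the container is disposable in exactly the sense described above.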
Is using a Docker image as a ‘disposable’ vehicle for file copies advisable?
Are there any other more elegant ways of achieving this?