Using Docker Images as a 'disposable' vehicle for file copy to shared storage

Hi,

In our project we use Docker to containerize all parts of our application except the part that runs on a Spark cluster. The Spark cluster is already installed and running as part of a Hadoop installation, hence we did not dockerize Spark.

Now, when we deploy our application, we deploy the latest Docker images (through Kubernetes). But our deployment includes a bunch of ‘reference data’ (plain CSV files) that needs to be placed in a common location for the Spark cluster to access.

One way we had thought of was to place this ‘reference data’ in the Docker containers themselves and copy it over to a shared storage location that Spark can access (during submission of Spark jobs).
The challenge here is that every time a Spark job is submitted from one of the Docker containers, we need to spend time copying these files over to the shared storage location. These files could be > 5 GB, so the copy takes up precious minutes (even within the data center).
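(For reference, the “bake the data into the image” part is just a build-time COPY; everything below is a placeholder sketch, not our actual layout.)

```dockerfile
# Placeholder sketch: reference CSVs layered into the application image at build time.
FROM our-app-base:latest                   # placeholder base image
COPY reference-data/ /opt/reference-data/
# In the current approach the application then copies these out to the
# network share whenever a Spark job is submitted, e.g.
#   cp -r /opt/reference-data/. /mnt/shared/reference-data/
```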

We were thinking of having a separate and “disposable” Docker container (sketched below) which will

  • have the reference data in it,
  • do the file copy to the network share as its first activity when it is brought up, and then
  • shut down once the copy is done.
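Since we already deploy through Kubernetes, we imagine running it as a Job, roughly like the sketch below (the image name, paths and claim name are placeholders, not our real setup):

```yaml
# Sketch of the 'disposable' copy container as a Kubernetes Job (all names are placeholders).
apiVersion: batch/v1
kind: Job
metadata:
  name: reference-data-copy
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never                           # the pod is done once the copy finishes
      containers:
        - name: copy
          image: myregistry/reference-data:latest    # image that only carries the CSV files
          command: ["sh", "-c", "cp -r /data/. /mnt/shared/reference-data/"]
          volumeMounts:
            - name: shared
              mountPath: /mnt/shared
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: spark-shared-data             # claim backed by the network share
```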

Is using Docker Images as a ‘disposable’ vehicle for file copy advisable?
Are there any other more elegant ways of achieving this?

thanks,
Raga

https://ender74.github.io/Sharing-Volumes-With-Docker-NFS/ might help :slight_smile:
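The idea there is to back a volume with the NFS share, so containers write straight to it instead of copying afterwards. Since the deployment already goes through Kubernetes, the same share could also be exposed as a PersistentVolume that the copy Job (and anything else) mounts. A minimal sketch, with the server address, export path and sizes as placeholders:

```yaml
# Sketch: expose an existing NFS share to Kubernetes (server/path/sizes are placeholders).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: spark-shared-data-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs.example.internal   # placeholder NFS server
    path: /exports/spark-shared    # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-shared-data          # matches the claimName used in the Job sketch above
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""             # bind to the statically provisioned PV
  resources:
    requests:
      storage: 50Gi
```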