If we use COPY in a Dockerfile where the FROM is a generic OS image, the copied file shows up in the resultant running container.
If we use the same COPY command in a Dockerfile where the FROM is a Spark image, the copied file is missing in the resultant running container.
To test this, we modified the SparkPi sample test to sleep for a long while before exiting. This allows us to log on to the container. The sample test JAR file is present, but the file copied to the image is not.
This is a simple test of installing our application files onto a Spark container. We understand that there are directories that get overwritten in the Spark installation, so we use a directory under “/”.
RUN mkdir -pv /foo
RUN chmod 777 /foo
COPY spark-examples_2.12-3.1.2-SNAPSHOT.jar /foo/
COPY bar /foo/
RUN ls -l /foo/
We execute the SparkPi test routine contained in the modified JAR file. The odd thing is that this JAR file shows up in the hadoop user's HOME directory, not in the target /foo directory. The /foo directory does not exist, so the file “bar” does not exist either.
So how does one install a set of directories onto a Spark image and have them show up in a running container?
This problem occurs in the Amazon EMR/EKS images as well as in the Spark image generated by the Spark distribution.
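One check that might help separate an image build problem from a runtime problem is to list /foo directly in the built image, bypassing the image's own entrypoint. The following is only a sketch: the tag my-spark-app:test is a hypothetical placeholder for whatever tag was passed to docker build, and it assumes the image contains an ls binary on its PATH.

```shell
#!/bin/sh
# Hypothetical tag: replace with the tag given to `docker build -t`.
IMAGE="${IMAGE:-my-spark-app:test}"

# --entrypoint replaces the image's own entrypoint script, so nothing
# Spark-specific runs before `ls`; the listing reflects the image as built.
list_foo() {
    docker run --rm --entrypoint ls "$1" -l /foo/
}

# Only run the check when docker is installed and the image exists locally,
# so the script never triggers a pull of the placeholder tag.
if command -v docker >/dev/null 2>&1 && docker image inspect "$IMAGE" >/dev/null 2>&1; then
    list_foo "$IMAGE"
    STATUS="checked"
else
    echo "docker or image $IMAGE not available locally; skipping check" >&2
    STATUS="skipped"
fi
```

If /foo and its files are listed here but are missing in the running container, the image itself is probably fine, and the container may have been started from a different image than the one that was built.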