
PySpark image to run on Kubernetes pods

I’ve been looking into how to build a custom PySpark Docker image to run on our company’s Kubernetes pods (which currently run Scala workloads). I haven’t found any examples that work. Anyone think they can help?

Running Spark Over Kubernetes
A big difference between running Spark on Kubernetes and using an enterprise deployment of Spark is that you don’t need YARN to manage resources; that task is delegated to Kubernetes. Kubernetes has its own RBAC functionality, as well as the ability to limit resource consumption.
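For example, a PySpark application can point its master URL straight at the Kubernetes API and declare its resource needs through Spark configuration, and Kubernetes schedules the executor pods with no YARN involved. The sketch below assumes client mode with a reachable driver; the API server address, image name, namespace, and service account are placeholders, not values from this thread.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a PySpark session whose executors are scheduled by
# Kubernetes rather than YARN. The API-server URL, image, namespace,
# and service account are placeholders for your own cluster.
spark = (
    SparkSession.builder
    .appName("pyspark-on-k8s-demo")
    .master("k8s://https://my-cluster-api:6443")            # Kubernetes API server (placeholder)
    .config("spark.kubernetes.container.image",
            "registry.example.com/pyspark-app:latest")       # your custom PySpark image
    .config("spark.kubernetes.namespace", "spark-jobs")      # namespace where RBAC and quotas apply
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "2")                 # Kubernetes schedules these executor pods
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "1")
    .getOrCreate()
)

# Trivial job just to confirm the executors come up.
print(spark.range(1_000_000).count())
spark.stop()
```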

You can build a standalone Spark cluster with a pre-defined number of workers, or you can use the Spark Operator for Kubernetes to deploy ephemeral clusters. The latter gives you the ability to deploy a cluster on demand, only when the application needs to run. Kubernetes works with Operators, which fully understand the requirements for deploying an application, in this case a Spark application.

What does the Operator do?

Reads your Spark cluster specification (CPU, memory, number of workers, GPU, etc.; see the sketch after this list)
Determines what type of Spark code you are running (Python, Java, Scala, etc.)
Retrieves the image you specify to build the cluster
Builds the cluster
Runs your application and deletes resources (technically the driver pod remains until garbage collection or until it’s manually deleted)
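To make the list concrete, here is a rough sketch of the kind of specification the Operator reads, expressed as a SparkApplication custom resource and submitted with the official Kubernetes Python client. It assumes the widely used spark-on-k8s operator, whose CRD group/version is sparkoperator.k8s.io/v1beta2; the namespace, image, application file, and resource values are placeholders, not settings from this thread.

```python
from kubernetes import client, config

# Hypothetical SparkApplication spec: the "cluster specification" the
# Operator reads -- language, image, driver/executor CPU and memory,
# number of workers. All values below are placeholders.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",   # CRD served by the spark-on-k8s operator
    "kind": "SparkApplication",
    "metadata": {"name": "pyspark-pi", "namespace": "spark-jobs"},
    "spec": {
        "type": "Python",                            # tells the Operator this is PySpark code
        "pythonVersion": "3",
        "mode": "cluster",
        "image": "registry.example.com/pyspark-app:latest",   # image used to build the cluster
        "mainApplicationFile": "local:///opt/app/pi.py",      # script baked into that image
        "sparkVersion": "3.5.0",
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"cores": 1, "instances": 2, "memory": "2g"},
        "restartPolicy": {"type": "Never"},
    },
}

config.load_kube_config()  # or load_incluster_config() when running inside a pod
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="spark-jobs",
    plural="sparkapplications",
    body=spark_app,
)
```

Once the resource is created, the Operator builds the driver and executor pods from the image, runs the application, and tears the executors down when the job finishes, with the driver pod lingering until it is garbage collected or deleted.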