I have a flight prediction project where I need to dockerize each of the services used, including Kafka, ZooKeeper, MongoDB, Spark, and a web server. The project is based on the following repository: https://github.com/Big-Data-ETSIT/practica_creativa (the lab assignment for the Big Data courses at DIT).
Here’s my docker-compose.yml file:
version: "3"
services:
zookeeper:
image: 'bitnami/zookeeper:3.8.1'
container_name: zookeeper
hostname: zookeeper
ports:
- '2181:2181'
environment:
ZOOKEEPER_CLIENT_PORT: 2181
ZOOKEEPER_TICK_TIME: 2000
ZOOKEEPER_SYNC_LIMIT: 2
ALLOW_ANONYMOUS_LOGIN: 'yes'
kafka:
image: 'bitnami/kafka:3.1.2'
container_name: kafka
hostname: kafka
ports:
- '9092:9092'
expose:
- "9093"
depends_on:
- zookeeper
working_dir: /opt/bitnami/kafka
environment:
- KAFKA_BROKER_ID=1
- KAFKA_CFG_LISTENERS=PLAINTEXT://kafka:9092
- KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092
- KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
- ALLOW_PLAINTEXT_LISTENER=yes
mongo:
container_name: mongo
ports:
- "27017:27017"
build:
context: ./docker/Mongo
dockerfile: Dockerfile
spark-master:
image: bde2020/spark-master:3.3.0-hadoop3.3
container_name: spark-master
ports:
- "8080:8080"
- "7077:7077"
environment:
- SPARK_HOME =/spark
- PROJECT_HOME=/main
volumes:
- ./:/home/lucia/practica_creativa
depends_on:
- kafka
- mongo
spark-worker-1:
image: bde2020/spark-worker:3.3.0-hadoop3.3
container_name: spark-worker-1
depends_on:
- spark-master
ports:
- "8081:8081"
environment:
- "SPARK_MASTER=spark://spark-master:7077"
- SPARK_HOME =/spark
- PROJECT_HOME=/main
volumes:
- ./:/home/lucia/practica_creativa
spark-worker-2:
image: bde2020/spark-worker:3.3.0-hadoop3.3
container_name: spark-worker-2
depends_on:
- spark-master
ports:
- "8082:8081"
environment:
- "SPARK_MASTER=spark://spark-master:7077"
- SPARK_HOME =/spark
- PROJECT_HOME=/main
volumes:
- ./:/home/lucia/practica_creativa
spark-history-server:
image: bde2020/spark-history-server:3.3.0-hadoop3.3
container_name: spark-history-server
depends_on:
- spark-master
ports:
- "18081:18081"
volumes:
- /tmp/spark-events-local:/tmp/spark-events
webserver:
container_name: webserver
ports:
- "5001:5001"
environment:
- SPARK_HOME=/spark
- PROJECT_HOME=/main
depends_on:
- spark-master
- spark-worker-1
- spark-worker-2
- spark-history-server
build:
context: ./docker/Flask
dockerfile: Dockerfile`
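The part I am least sure about is the Kafka listener setup. For comparison, a dual-listener bitnami/kafka layout (one listener for other containers, one for the host) would look roughly like the sketch below; the INTERNAL/EXTERNAL listener names and the 9093 host mapping are my assumptions, not what we currently run:

```yaml
kafka:
  ports:
    - '9093:9093'   # host access via the EXTERNAL listener
  environment:
    - KAFKA_CFG_LISTENERS=INTERNAL://:9092,EXTERNAL://:9093
    - KAFKA_CFG_ADVERTISED_LISTENERS=INTERNAL://kafka:9092,EXTERNAL://localhost:9093
    - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
    - KAFKA_CFG_INTER_BROKER_LISTENER_NAME=INTERNAL
    - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
    - ALLOW_PLAINTEXT_LISTENER=yes
```

With a layout like this, clients inside the compose network would use kafka:9092 and clients on the host would use localhost:9093.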
Here’s my Dockerfile for MongoDB:
```dockerfile
FROM mongo
WORKDIR /main

RUN apt-get update && \
    apt-get install -y nano

# Clone the repository with the data and trained models
RUN apt-get install -y git && \
    git clone https://github.com/Big-Data-ETSIT/practica_creativa && \
    mv practica_creativa/* . && \
    rm -r practica_creativa

# The image ships mongosh instead of the legacy mongo shell
RUN sed -i 's/mongo /mongosh /g' /main/resources/import_distances.sh
RUN chmod +x /main/resources/import_distances.sh

# mongod and the import run in the background; sleep keeps the container alive
CMD /bin/bash -c "mongod & ./resources/import_distances.sh & sleep 1234567"
```
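To sanity-check this image, we bring up just the mongo service and count the imported documents (a sketch; I am assuming the import script loads into an agile_data_science database with an origin_dest_distances collection, which may not match the repository exactly):

```bash
# Build and start only the mongo service
sudo docker-compose up -d --build mongo
# Count documents in the (assumed) target collection of import_distances.sh
sudo docker-compose exec mongo mongosh --quiet --eval \
  "db.getSiblingDB('agile_data_science').origin_dest_distances.countDocuments()"
```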
Here’s my Dockerfile for the web server:
```dockerfile
# Webserver
FROM python:3.7
WORKDIR /main

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y nano

# Clone the repository with the data and trained models
RUN apt-get install -y git && \
    git clone https://github.com/Big-Data-ETSIT/practica_creativa && \
    mv practica_creativa/* . && \
    rm -r practica_creativa

# Install the Python dependencies
RUN pip3 install -r requirements.txt

# Change to the web server directory and point the app at the
# kafka container instead of localhost
WORKDIR /main/resources/web
RUN sed -i 's/localhost/kafka/g' /main/resources/web/predict_flask.py
RUN chmod +x /main/resources/web/predict_flask.py

# Run the web server
CMD python3 predict_flask.py
```
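Once the stack is up, a minimal smoke test for the web server looks like this (assuming predict_flask.py binds to port 5001 inside the container and serves a page at the root path):

```bash
sudo docker-compose up -d --build webserver
# Expect an HTTP response from Flask on the mapped port
curl -i http://localhost:5001/
```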
I create the Kafka topic with the following command, and it is created successfully:

```bash
sudo docker-compose exec kafka bin/kafka-topics.sh --create --bootstrap-server kafka:9092 --replication-factor 1 --partitions 1 --topic flight_delay_classification_request
```
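The topic can also be verified from inside the broker container with:

```bash
sudo docker-compose exec kafka bin/kafka-topics.sh --describe --bootstrap-server kafka:9092 --topic flight_delay_classification_request
```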
However, when I try to run spark-submit to make the predictions, I receive the following error:

```bash
sudo docker-compose exec spark-master bash -c "/spark/bin/spark-submit --class es.upm.dit.ging.predictor.MakePrediction --packages org.mongodb.spark:mongo-spark-connector_2.12:10.1.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 /home/lucia/practica_creativa/flight_prediction/target/scala-2.12/flight_prediction_2.12-0.1.jar"
```
And the error message says:

```
23/05/24 17:43:20 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-0640fefe-71fe-4379-ba26-4ddb17c20b1e-1571676816-driver-0-2, groupId=spark-kafka-source-0640fefe-71fe-4379-ba26-4ddb17c20b1e-1571676816-driver-0] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
```
In our MakePrediction.scala file we tried changing localhost:9092 to kafka:9092 to establish a connection to the Kafka broker, but it didn't work.
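For reference, the Kafka source in MakePrediction.scala is configured roughly like this after that change (a sketch of the relevant lines, assuming the standard Structured Streaming Kafka API, not a verbatim copy of the file):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("MakePrediction")
  .getOrCreate()

// Kafka source for the prediction requests; the bootstrap server
// was changed from localhost:9092 to kafka:9092
val requests = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "flight_delay_classification_request")
  .load()
```

One thing we are unsure about: spark-submit runs the prebuilt jar under target/scala-2.12, so the edit only takes effect if the jar is rebuilt (e.g. with sbt package) before re-running the command.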
We expect to be able to access localhost:'port' and see the predictions made about the flight delays.