(I hope this question is a fit for ServerFault; if not, leave a comment and I'll delete it.)
I'm trying to create a Docker image where Cassandra and Spark are installed and configured to work together.
I've never used Spark (and never written a Dockerfile), only Cassandra, so this is new territory for me.
I've written a Dockerfile that installs Spark, Cassandra, and Kafka. Now how do I configure them inside the Dockerfile so they all work together?
I've found the Spark Cassandra Connector by DataStax, but I don't know what to do with it.
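From what I've read so far, the connector isn't installed like a system package; Spark fetches it at submit time via --packages, and you point it at Cassandra through spark.cassandra.connection.host. Assuming Cassandra runs inside the same container, I think the invocation would look something like this (the connector version matching Spark 2.4 / Scala 2.11 is my guess from the docs):

spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
            --conf spark.cassandra.connection.host=127.0.0.1

Is that the right approach, or should the connector jar be baked into the image under $SPARK_HOME/jars instead?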
Here is my Dockerfile so far:
FROM centos:centos7
RUN yum -y update && yum -y clean all
# Install basic tools
RUN yum install -y wget dialog curl sudo lsof vim axel telnet nano openssh-server openssh-clients bzip2 passwd tar bc git unzip deltarpm
#Install Java
RUN yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
# Install Python 3.6 - commented out; I install Python via Anaconda2 further down instead
#RUN yum install centos-release-scl -y
#RUN yum install rh-python36 -y
#RUN scl enable rh-python36 bash
# Create the guest user. IMPORTANT: change UID 1000 here to your host UID if you plan to share folders.
RUN useradd guest -u 1000
RUN echo guest | passwd guest --stdin
ENV HOME /home/guest
WORKDIR $HOME
USER guest
#Install Spark (Spark 2.4.0 - Nov 02, 2018, prebuilt for Hadoop 2.7 or higher)
# Download, unpack, and clean up in one layer to keep the image smaller
RUN wget http://mirror.csclub.uwaterloo.ca/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz && \
    tar xzf spark-2.4.0-bin-hadoop2.7.tgz && \
    mv spark-2.4.0-bin-hadoop2.7 spark && \
    rm spark-2.4.0-bin-hadoop2.7.tgz
ENV SPARK_HOME $HOME/spark
#Install Kafka
RUN wget http://mirror.csclub.uwaterloo.ca/apache/kafka/2.1.0/kafka_2.12-2.1.0.tgz && \
    tar xzf kafka_2.12-2.1.0.tgz && \
    mv kafka_2.12-2.1.0 kafka && \
    rm kafka_2.12-2.1.0.tgz
ENV PATH $HOME/spark/bin:$HOME/spark/sbin:$HOME/kafka/bin:$PATH
#Install Anaconda Python distribution
RUN wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh && \
    bash Anaconda2-4.4.0-Linux-x86_64.sh -b && \
    rm Anaconda2-4.4.0-Linux-x86_64.sh
ENV PATH $HOME/anaconda2/bin:$PATH
RUN conda install -c anaconda python
# RUN pip install --upgrade pip
#Install Kafka Python module
RUN pip install kafka-python
USER root
#Install Cassandra
COPY cassandra.repo /etc/yum.repos.d/datastax.repo
RUN yum install -y cassandra
#Environment variables for Spark and Java
COPY setenv.sh /home/guest/setenv.sh
RUN chown guest:guest setenv.sh
# Source the environment variables in the guest user's shell
RUN echo ". ./setenv.sh" >> .bashrc
# Startup script (starts SSH, Cassandra, ZooKeeper, and a Kafka producer)
COPY startup_script.sh /usr/bin/startup_script.sh
RUN chmod +x /usr/bin/startup_script.sh
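One thing I noticed while writing this up: the Dockerfile never actually runs that script, so I assume it needs something like the line below at the end (whether CMD or ENTRYPOINT is the better choice here, I don't know):

CMD ["/usr/bin/startup_script.sh"]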
The rest of the files are in the GitLab repo here: https://gitlab.com/HypeWolf/docker-cassandra-spark-kafka
The final goal is to be able to use everything Cassandra and Spark can offer inside one container, and to allow the user to pass a configuration file or environment variables to modify certain settings.
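For example, I imagine running the container along these lines (the image name and the variable are just placeholders for illustration; I believe the DataStax yum package puts Cassandra's config under /etc/cassandra/conf):

docker run -d \
    -v $(pwd)/cassandra.yaml:/etc/cassandra/conf/cassandra.yaml \
    -e SPARK_MASTER_HOST=127.0.0.1 \
    my-cassandra-spark-kafka

so that a mounted cassandra.yaml overrides the stock config, and environment variables tweak the Spark side. Is that a reasonable way to structure it?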