
(I hope this question is a good fit for Server Fault; if not, comment and I'll delete it.)

I'm trying to create a Docker image in which Cassandra and Spark are installed and configured to work together.

I've never used Spark (and have never written a Dockerfile before), only Cassandra, so this is new territory.

I created a Dockerfile that installs Spark, Cassandra, and Kafka. Now, how do I configure them inside the Dockerfile so they all work together?

I know DataStax publishes a Cassandra-Spark connector, but I don't know what to do with it.
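From what I can gather so far, nothing has to be baked into the image for the connector: Spark can fetch it from Maven Central at launch time via --packages. Something like this, presumably (the 2.11/2.4.0 coordinates are my guess for Spark 2.4.0 built against Scala 2.11, and may need adjusting):

# Launch spark-shell with the connector pulled from Maven Central,
# pointing it at the Cassandra node running in the same container
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
            --conf spark.cassandra.connection.host=127.0.0.1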

Here is my Dockerfile so far:

FROM centos:centos7

RUN yum -y update && yum -y clean all

# Install basic tools
RUN yum install -y wget dialog curl sudo lsof vim axel telnet nano openssh-server openssh-clients bzip2 passwd tar bc git unzip deltarpm

#Install Java
RUN yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel

# Install Python 3.6 - I used Anaconda2 to install it instead
#RUN yum install centos-release-scl -y
#RUN yum install rh-python36 -y
#RUN scl enable rh-python36 bash

#Create guest user. IMPORTANT: Change UID 1000 here to your host UID if you plan to share folders.
RUN useradd guest -u 1000
RUN echo guest | passwd guest --stdin

ENV HOME /home/guest
WORKDIR $HOME

USER guest

#Install Spark (Spark 2.4.0 - Nov 02, 2018, prebuilt for Hadoop 2.7 or higher)
RUN wget http://mirror.csclub.uwaterloo.ca/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
RUN tar xvf spark-2.4.0-bin-hadoop2.7.tgz
RUN mv spark-2.4.0-bin-hadoop2.7 spark

ENV SPARK_HOME $HOME/spark

#Install Kafka
RUN wget http://mirror.csclub.uwaterloo.ca/apache/kafka/2.1.0/kafka_2.12-2.1.0.tgz
RUN tar xvzf kafka_2.12-2.1.0.tgz
RUN mv kafka_2.12-2.1.0 kafka

ENV PATH $HOME/spark/bin:$HOME/spark/sbin:$HOME/kafka/bin:$PATH

#Install Anaconda Python distribution
RUN wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh
RUN bash Anaconda2-4.4.0-Linux-x86_64.sh -b
ENV PATH $HOME/anaconda2/bin:$PATH

RUN conda install -c anaconda python

# RUN pip install --upgrade pip

#Install Kafka Python module
RUN pip install kafka-python

USER root

#Install Cassandra
ADD cassandra.repo /etc/yum.repos.d/datastax.repo
RUN yum install -y cassandra

#Environment variables for Spark and Java
ADD setenv.sh /home/guest/setenv.sh
RUN chown guest:guest setenv.sh
RUN echo ". ./setenv.sh" >> .bashrc

#Startup (start SSH, Cassandra, Zookeeper, Kafka producer)
ADD startup_script.sh /usr/bin/startup_script.sh
RUN chmod +x /usr/bin/startup_script.sh
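
For reference, startup_script.sh boils down to something like this (a simplified sketch; the exact commands and paths are in the repo linked below):

#!/bin/bash
# Sketch of startup_script.sh (simplified; see the repo for the real file)

# Start the SSH daemon
/usr/sbin/sshd

# Start Cassandra in the background as its service user
sudo -u cassandra cassandra

# Start Zookeeper first, then the Kafka broker, using the scripts
# shipped in the Kafka tarball
/home/guest/kafka/bin/zookeeper-server-start.sh -daemon /home/guest/kafka/config/zookeeper.properties
/home/guest/kafka/bin/kafka-server-start.sh -daemon /home/guest/kafka/config/server.properties

# Keep the container in the foreground
tail -f /dev/null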

The GitLab repo to see the rest of the files is here: https://gitlab.com/HypeWolf/docker-cassandra-spark-kafka

The final goal is to be able to use everything Cassandra and Spark can offer inside one container, and to allow the user to pass a configuration file or environment values to modify certain settings.
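
In other words, the usage I have in mind is roughly this (the image name, variable, and mounted file below are placeholders for illustration, not something the image supports yet):

# Hypothetical usage: expose the CQL port, override one setting via an
# environment variable, and mount a custom cassandra.yaml
docker run -d -p 9042:9042 \
  -e CASSANDRA_CLUSTER_NAME=demo \
  -v $PWD/cassandra.yaml:/etc/cassandra/conf/cassandra.yaml \
  hypewolf/cassandra-spark-kafka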

  • It's better to use a separate Docker image for each component and link them together with docker-compose, etc. (see the sketch after this thread). The whole idea of Docker is to provide an image that does only one thing. Plus, you'll be able to find "official" images for Cassandra, Kafka, etc. – Alex Ott Jan 20 '19 at 13:28
  • You have a good point, but I was doing it this way to speed things up, as I'd heard network virtualization in Docker is very slow (sometimes a 50% performance hit). I'll investigate; you still have a good point, it would be simpler. – HypeWolf Jan 20 '19 at 21:16
  • Then use the host network, which isn't slow... and check every claim yourself. – Alex Ott Jan 20 '19 at 21:43
  • @AlexOtt Thanks. Any idea how to use the Cassandra-Spark connector? Or are you a Docker guru? :) – HypeWolf Jan 20 '19 at 21:53
  • Just use the docs for the connector - they're quite detailed... – Alex Ott Jan 20 '19 at 21:54
  • I don't think I have the same link as you, as I didn't find any installation instructions. :/ – HypeWolf Jan 20 '19 at 23:09
  • Oh, never mind, I finally found the installation instructions. Can't believe they hid them like that XD – HypeWolf Jan 20 '19 at 23:22
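
For reference, the docker-compose route suggested above would look something like this (the image names, tags, and settings here are illustrative, not tested):

version: '3'
services:
  cassandra:
    image: cassandra:3.11              # official Cassandra image
    ports:
      - "9042:9042"
  zookeeper:
    image: zookeeper:3.4               # official Zookeeper image
  kafka:
    image: confluentinc/cp-kafka:5.1.0 # Confluent's Kafka image
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  spark:
    image: bitnami/spark:2.4.0         # example third-party Spark image
    depends_on:
      - cassandra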
