Docker I: Discovering Docker and Cassandra

In this part we will learn how to run Docker containers. We will explore the basic Docker commands while deploying a small Cassandra cluster on separate hosts of my cluster. To keep things simple, we will use the official Cassandra image from Docker Hub to create the Cassandra containers. I will also explain a few basic Cassandra principles, keeping things simple for readers who have no knowledge of Cassandra.


Docker Setup

Docker Engine is the core component used to build and run Docker images and containers on a Linux host. The easiest way to install it on common Linux distributions is to run the remote “get docker” installation script:

$ curl -fsSL | sh

If you experience difficulties with the script, or are using an unsupported OS/distribution, you can find details for manual installation in the official Docker installation documentation.

I have installed Docker Engine on my 5 nodes (ubuntu[0-4]) running Ubuntu Server 14.04 LTS.

For reference, here are the commands for a full manual installation, testing, and setup of a new user called “dock” (a member of the docker group) on my Ubuntu nodes:

## 1. Change apt repository to use packages managed by docker repo
$ sudo apt-get update
$ sudo apt-get install apt-transport-https ca-certificates
$ sudo apt-key adv --keyserver hkp:// --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
$ echo "deb ubuntu-trusty main" | sudo tee /etc/apt/sources.list.d/docker.list
$ sudo apt-get update
$ sudo apt-get purge lxc-docker

## 2. Install linux-image-extra for aufs storage driver, apparmor for security restrictions, and finally Docker Engine
$ sudo apt-get install linux-image-extra-$(uname -r)
$ sudo apt-get install apparmor
$ sudo apt-get install docker-engine

## 3. Start docker and test it
$ sudo service docker start
$ sudo docker run hello-world

## 4. Add a user to the docker group to run docker commands without sudo
$ sudo adduser --ingroup docker dock
$ sudo usermod -aG docker dock

Step 3 of the commands starts the Docker daemon, and then runs a Hello-World container which simply prints out a message. The Hello-World image is automatically pulled (i.e. downloaded) from Docker Hub, because it was not found locally. Once the container is deployed, its main process prints a “hello world” message. Then, when the main process exits, the container automatically stops.

Overview of Cassandra

Cassandra is a NoSQL database. More specifically, it is a wide column store, and an AP system (in CAP theorem terms) which offers tunable consistency to reach C, at the cost of performance.


Its main advantages are:

  • Decentralized: all nodes have the same role. There is no master or slave. Easier configuration.
  • Linear Scalability: read/write throughput grows with the number of nodes, which suits very large clusters (although latency is only mediocre).
  • Fault-Tolerant: data is replicated across datacenters and failed nodes can be replaced without downtime.
  • Tunable Consistency: a level of consistency can be chosen on a per-query basis.

Cassandra is easy to set up and play with, because it has auto-discovery of nodes, and does not need a load balancer or a specific master configuration.

We can simply install 3 instances of Cassandra on 3 different nodes and they can form a cluster automatically (each node only needs to be informed of another node’s IP address at first). Then queries can be run against any instance.


Cassandra is a very efficient distributed database, but is not appropriate for all use-cases because:

  • Tables are designed to serve a single query. To query on different criteria or using different ordering fields, extra tables or Materialized Views must be created for those queries. The name Cassandra Query Language (CQL) suggests that you can query anything as in SQL, but this restriction means you can’t.
  • No aggregation or joining.

In our case we will only use one table to store posts from users, and always query them in the same order for our webapp, so we should be fine.

Using Containers

Let’s create 3 instances of Cassandra on 3 nodes of my cluster: ubuntu1, ubuntu2 and ubuntu3. Their respective containers will be called cass1, cass2 and cass3.

Starting a first Cassandra container

Let’s first start a Cassandra container on ubuntu1:

$ docker run -d --name=cass1 --net=host cassandra

The docker run command was used above to run the Hello-World container. But this time it pulls the official Cassandra image from Docker Hub instead of Hello-World. An image is a mold from which containers are spawned. It normally contains a Linux distribution (in this case Debian) with all of the necessary resources (in this case Cassandra binaries, scripts, folder structure, etc.) to provide the service that you need.

The container is started with the following options:

  • -d : run the container in detached mode, meaning in the background.
  • --name=cass1 : give it a name so that we can easily call commands on it later on.
  • --net=host : use the host’s network stack directly, and expose all container ports directly on the host IP address.

❗ The last network option is the easiest way for us to get started, but it is bad practice and should only be used when testing or playing around. By default, Docker isolates the container’s networking by placing it in a separate network stack. We’ll see more about that a bit later.

Basic Docker Commands

To view all running containers:

$ docker ps

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
4cbeb45c1d92        cassandra           "/"   51 seconds ago      Up 50 seconds                                cass1

To view the container logs:

$ docker logs cass1

INFO  15:05:00 Scheduling approximate time-check task with a precision of 10 milliseconds
INFO  15:05:01 Created default superuser role 'cassandra'

Stop the container:

$ docker stop cass1

View all containers, even the ones which are stopped:

$ docker ps -a

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS                        PORTS                    NAMES
4cbeb45c1d92        cassandra           "/"   3 minutes ago       Exited (143) 13 seconds ago                            cass1

Note the CONTAINER ID column. Its value can be used instead of the container name when running docker commands. This is the way to specify a container if you haven’t given it a name. A prefix of the ID also works, as long as it doesn’t conflict with another ID. For example, you can also call docker stop 4cbe to stop the cass1 container.

Start the container again:

$ docker start cass1

Remove a container (all data in the container will be lost):

$ docker rm -f cass1

If the -f option is not used, an error is raised if the container has not been stopped beforehand.

Cached Images

Now that the container has been deleted, you’ll have to execute the docker run command above again to recreate it. But this time it won’t need to download the Cassandra image, because the image was cached locally the first time.

To view docker images cached locally:

$ docker images

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
cassandra           latest              7514d930eadf        3 months ago        379.8 MB
debian              jessie              256adf7015ca        3 months ago        125.1 MB
hello-world         latest              690ed74de00f        7 months ago        960 B

As you can see, the Cassandra image is about 379.8 MB. It is bigger than the Debian image (125 MB), which is logical since the Cassandra image was in fact built on top of the Debian image. The Hello-World image is very small (< 1 KB) because it doesn’t even contain a Linux distribution!

To delete an image, use $ docker rmi <name_or_id>.

Instead of using docker run to start a container, you can also first download the image into your local cache using docker pull cassandra, and then use docker create with the same options to create the container. The container will exist but won’t be started until you call docker start <name_or_id>.
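As a rough sketch of that sequence, the three steps can be wrapped in a small shell function. The function name create_then_start and the DOCKER override variable are illustrative assumptions, not part of Docker itself; setting DOCKER=echo simply prints the commands instead of running them.

```shell
# Sketch only: split "docker run" into its three underlying steps.
# DOCKER may be overridden (e.g. DOCKER=echo) to preview the commands.
DOCKER="${DOCKER:-docker}"

# create_then_start NAME IMAGE [extra "docker create" options...]
create_then_start() {
    name="$1"
    image="$2"
    shift 2
    # 1. download the image into the local cache
    "$DOCKER" pull "$image" &&
    # 2. create the container without starting it
    "$DOCKER" create --name="$name" "$@" "$image" &&
    # 3. start the existing container
    "$DOCKER" start "$name"
}

# Example (not executed here):
#   create_then_start cass1 cassandra --net=host
```

Keeping the steps separate can be handy when you want to pre-download images on every host before starting anything.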

Executing Commands inside a Container

The Cassandra container behaves like a small Debian system which runs Cassandra (it shares the host’s kernel, but has its own file system and processes). We can run commands inside it using the following command on the host:

$ docker exec [-ti] cass1 <shell_command>

For example, you can check what process is running in the container by doing:

$ docker exec cass1 ps -ef

cassand+     1     0  1 Jun05 ?        00:20:59 java -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOn...
root       513     0  0 10:28 ?        00:00:00 ps -ef

Or check some networking information in the container:

$ docker exec cass1 ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet scope host lo
       valid_lft forever preferred_lft forever
2: p2p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:13:20:fe:54:7f brd ff:ff:ff:ff:ff:ff
    inet brd scope global p2p1
       valid_lft forever preferred_lft forever

$ docker exec cass1 cat /etc/hosts

localhost
ubuntu0
ubuntu1
ubuntu2
ubuntu3

Since we used the --net=host mode, the container has the same network interfaces as the host, and it also has the same /etc/hosts file.

To run commands interactively, use the -i (interactive) and -t (tty) options. For example, you can run a bash session in the container’s Debian environment:

$ docker exec -it cass1 bash

root@ed37536b1c48:/# cat /etc/*-release
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"

You can see that you are logged in as root in the Debian Jessie command line.

Creating the Cassandra Cluster

Now that we have a first Cassandra container running on the first node, let’s create 2 other containers on the ubuntu2 and ubuntu3 hosts.

dock@ubuntu2$ docker run -d --name=cass2 --net=host -e CASSANDRA_SEEDS="<seed_node_ip>" cassandra
dock@ubuntu2$ docker logs cass2
INFO 13:48:30 JOINING: waiting for ring information
INFO 13:48:31 Handshaking version with /
INFO 13:48:32 Node / is now part of the cluster
INFO 13:48:32 InetAddress / is now UP

dock@ubuntu3$ docker run -d --name=cass3 --net=host -e CASSANDRA_SEEDS="<seed_node_ip>" cassandra
dock@ubuntu3$ docker logs cass3
INFO 13:50:06 JOINING: waiting for ring information
INFO 13:50:07 Handshaking version with /
INFO 13:50:07 Handshaking version with /

We ran the same command as on ubuntu1, but this time we added a -e option to define the environment variable CASSANDRA_SEEDS.

The usable environment variables are generally specified or explained on the Docker Hub page of the image. In this case, this variable takes a list of IP addresses of nodes to join in a cluster.
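The Docker Hub page of the image describes CASSANDRA_SEEDS as a comma-separated list. As a minimal sketch, a helper such as the following could build that value from several addresses. The function name join_seeds is hypothetical, and the 192.0.2.x addresses in the comment are documentation placeholders, not the addresses of my nodes.

```shell
# Sketch only: join_seeds IP [IP...]
# Prints the given addresses joined by commas, the format expected by CASSANDRA_SEEDS.
join_seeds() {
    seeds=""
    for ip in "$@"; do
        # add a comma only when something is already in the list
        seeds="${seeds:+$seeds,}$ip"
    done
    printf '%s\n' "$seeds"
}

# Example (not executed here, placeholder addresses):
#   docker run -d --name=cass3 --net=host \
#     -e CASSANDRA_SEEDS="$(join_seeds 192.0.2.11 192.0.2.12)" cassandra
```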

We can check the status of our cluster using Cassandra’s nodetool utility from inside any of the cluster’s containers:

$ docker exec cass1 nodetool status

Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  111.97 KB  256          ?       17ad1ff5-40c7-49ad-bf29-df75ac3a29f9  rack1
UN  102.07 KB  256          ?       92ab6227-7a37-455b-a8b4-f8ce79467871  rack1
UN  110.96 KB  256          ?       6d8fd80e-54bb-457b-939c-dd3b0fa564a8  rack1

We can see that our 3 nodes are “UN” (Up & Normal). Our cluster is ready !
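If you want to script this check rather than read the table by eye, a small filter over the nodetool output is enough. This is only a sketch: the function name all_nodes_up is an assumption of mine, and it relies on node lines starting with the two-letter status code, as in the output above.

```shell
# Sketch only: all_nodes_up EXPECTED_COUNT
# Reads "nodetool status" output on stdin and succeeds only if exactly
# EXPECTED_COUNT node lines are present and every one of them is "UN".
all_nodes_up() {
    awk -v want="$1" '
        # node lines start with a status (U/D) and a state (N/L/J/M)
        /^[UD][NLJM] / { total++; if ($1 == "UN") up++ }
        END { exit !((total + 0 == want + 0) && (up + 0 == want + 0)) }
    '
}

# Example (not executed here):
#   docker exec cass1 nodetool status | all_nodes_up 3 && echo "cluster ready"
```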

Running the CQL Shell in a Container

Our cluster is running, so let’s create a table and insert data in it. To do this we can use the CQL Shell which comes with Cassandra.

We can call it from within any of the cluster’s containers, with the docker exec command in interactive mode.

$ docker exec -ti cass1 cqlsh localhost

Connected to Test Cluster at localhost:9042.
[cqlsh 5.0.1 | Cassandra 3.3 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cqlsh> exit

A better way is to create a new Cassandra container for the sole purpose of executing the CQL Shell inside of it:

$ docker run -it --rm cassandra cqlsh <cluster_node_ip>

Connected to Test Cluster at
[cqlsh 5.0.1 | Cassandra 3.3 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.

This technique is more docker-ish in philosophy. It respects functional isolation and does not run extra processes in the Cassandra cluster’s containers. Instead it runs a CQL Shell in a dedicated container (on any host, even one which is not part of the cluster) and communicates with one of the cluster’s nodes remotely through the CQL port 9042.

The command cqlsh is specified as a parameter after the image name cassandra. In this case, Docker uses the specified command as the container’s main process instead of its default one, which should normally be the cassandra daemon.

The container is created with the --rm option so that once the container is stopped, which happens automatically when exiting the shell, it will be automatically removed.

Cassandra Data Creation and Querying

We can now create a small table and insert some data into it. We must first create a Keyspace to define our cluster organisation and set how many replicas we want. Then in that keyspace we can create a table and insert some data:

> CREATE KEYSPACE posts_db WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };
> USE posts_db;
> CREATE TABLE posts (
    username varchar,
    creation timeuuid,
    content varchar,
    PRIMARY KEY ((username), creation));

> INSERT INTO posts (username, creation, content) values ('nicolas', now(), 'First Post');
> INSERT INTO posts (username, creation, content) values ('arnaud', now(), 'Salut');
> INSERT INTO posts (username, creation, content) values ('nicolas', now(), 'Second Post');
> SELECT * FROM posts;

 username | creation                             | content
   arnaud | 6324d3d0-2ca6-11e6-a6cd-7d8eb3b4e5df |       Salut
  nicolas | f967bc50-2ca5-11e6-a6cd-7d8eb3b4e5df |  First Post
  nicolas | 00925750-2ca7-11e6-a6cd-7d8eb3b4e5df | Second Post

(3 rows)

The only thing which looks different than SQL here is the PRIMARY KEY definition.

It is composed of 2 parts, in the format ((PARTITION_KEY), CLUSTERING_KEY). In our case, the partition key is the username field and the clustering key is the creation field, but both keys could be composite if you specify multiple field names separated by commas.

  • The partition key defines how to distribute data across nodes. In our case this means that all rows with the same username will be stored together in the same partition, on the same node.
  • The clustering key defines how the rows will be sorted within a partition.

The timeuuid data type serves as a uuid but also contains a timestamp.

Let’s look at a few CQL queries to see what Cassandra can and cannot do.

Allowed Queries

> SELECT username, dateOf(creation), content FROM posts WHERE username='nicolas' ORDER BY creation;

 username | system.dateof(creation)  | content
  nicolas | 2016-06-07 11:50:04+0000 |  First Post
  nicolas | 2016-06-07 11:57:26+0000 | Second Post

(2 rows)

The first query selects all rows for a given value of username, the partition key, and wants it ordered by creation, the clustering key. This is the perfect query for the table we created, and it is the one we will use in our webapp later on.

The results are easy to obtain, because Cassandra simply needs to find the partition (on one of our 3 nodes) which contains all the rows for “nicolas”. It then reads those rows sequentially, and they are conveniently sorted by creation already, because it is the clustering key. In fact if you don’t specify the ORDER BY clause you get exactly the same result, because the rows are already sorted anyways. However you can use the ORDER BY clause with DESC to get the results in reverse order.

> SELECT content FROM posts WHERE username='nicolas' AND creation = f967bc50-2ca5-11e6-a6cd-7d8eb3b4e5df;

 First Post

This query is also allowed. We still have our condition on the partition key, so Cassandra can go into the partition where the “nicolas” rows are, and from there it can easily find the rows where the creation field matches the value we asked for, since they are already ordered. Inequalities can also be used this way. Removing the condition on the partition key (username) is possible but not recommended, because it requires gathering results from multiple nodes. If you do so, Cassandra will reject the query and tell you to add ALLOW FILTERING at the end of your query to force it.

Forbidden Queries

> SELECT username FROM posts WHERE username='nicolas' ORDER BY content;
InvalidRequest: code=2200 [Invalid query] message="Order by is currently only supported on the clustered columns of the PRIMARY KEY, got content"

This query cannot be executed, because we are trying to sort by content value, whereas the rows are sorted by order of creation within the target partition containing the “nicolas” rows. Sorting by content would require random reads, which is bad for performance, or re-sorting the data in memory. Cassandra doesn’t bother to do that and simply throws an error. Using a condition on the content field is not allowed either, for the same reason.

> SELECT username FROM posts ORDER BY creation;
InvalidRequest: code=2200 [Invalid query] message="ORDER BY is only supported when the partition key is restricted by an EQ or an IN."

This query cannot be done either, because we did not specify a user. This means that we are trying to sort rows from all partitions from all nodes. This can’t be done with huge amounts of data, so Cassandra doesn’t bother either, and throws an error. As the error suggests, it is possible to get a result if you define a few desired username values using an IN clause. But this is not recommended because it requires calling multiple nodes and merging results, which can become quite messy.

Networking and Data Volumes

Our cluster is now working and usable. But there are still a couple of important configuration topics we need to know about.

Using the Bridge Network

Previously we used the “host” network mode, which puts our containers directly on the host’s network stack. I did this first because it’s a lazy and quick way to get things started, but it is bad practice in Docker. The better way is to put the containers on the host’s virtual Ethernet bridge.

By specifying --net=bridge, or by not specifying anything since it is the default behavior, each container gets its own IP:

dock@ubuntu1:~$ docker run --name=cass1 -d cassandra
dock@ubuntu1:~$ docker exec -ti cass1 bash
root@87e9528298d9:/# ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
128: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff
    inet scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:3/64 scope link
       valid_lft forever preferred_lft forever

root@87e9528298d9:/# cat /etc/hosts

localhost
::1             localhost ip6-localhost ip6-loopback
87e9528298d9

Compared to when we used --net=host, the following things have changed:

  • Calling localhost from within the container now reaches the container itself instead of the host machine ubuntu1.
  • The container has its own IP address on the bridge network, shown on the eth0 interface in the example above.
  • The /etc/hosts file is no longer the same as the one on the host. It now only contains the container’s own entries.

Container Communication

When using the bridge network, the following observations can be made:

  1. The container can ping the host, and the host can ping the container.
  2. Two containers on the bridge network of the same host can ping each other using their bridge 172.17.x.x IP addresses.
  3. A container can ping another host using its IP address. But this is not mutual (see point 5).
  4. However, containers are not aware of other container or host names, because their /etc/hosts only contains their own information.
  5. A container cannot be pinged directly from outside its host. For example, ubuntu2 cannot ping cass1 on ubuntu1, because the container’s IP address is on a virtual network, and only ubuntu1 knows about it.
    • This means that two containers on different hosts cannot ping each other.
    • It is however possible to ask the host to forward a port to one of its containers. That way, the container can be reached from outside through that port. More in the following section.

💡 It is possible with Docker to make containers from different hosts sit on the same network, by using an overlay network instead of the bridge network. We will see more about this in future posts. For now we are stuck with the bridge network.

Based on these points, we might run into a few problems when starting our cluster. When passing the CASSANDRA_SEEDS variable to a container, we must make sure to pass the IP address instead of the hostname, because of (4).

A bigger problem is that the containers cass1, cass2 and cass3, sitting on separate hosts, will auto-discover each other by exchanging IP addresses. They will broadcast their own IP address, which is on the bridge network. But this will lead to failed communications because of (5). Luckily, the Cassandra container takes an environment variable, CASSANDRA_BROADCAST_ADDRESS, in which we can tell the container to broadcast its host’s IP address instead.

So to overcome these networking obstacles, this is the command we can use to start our second container cass2:

$ docker run -d \
--name cass2 \
-e CASSANDRA_BROADCAST_ADDRESS=<ubuntu2_ip> \
-e CASSANDRA_SEEDS=<seed_node_ip> \
cassandra

Exposing Ports

When using the bridge network, all the container’s ports are exposed on its own bridge IP, but they are not reachable on the host’s IP, which means that they are not reachable from other hosts ubuntu2, ubuntu3, etc…

Fortunately, it is possible to expose ports through the host, using 2 options in the run command:

  • -P : maps every port exposed by the container to a random port number on the host.
  • -p <hp>:<cp> : maps the host port hp to the container port cp.

We need to publish our Cassandra container ports to the host because they will be called by other nodes from the cluster. So we must use the second option listed above and map each port to its same value. But how can we know in advance which ports to map? One way is to look at the Dockerfile of the Cassandra image, which is available on GitHub. The link can be found on the Docker Hub page of the image, which also normally documents how to run the container.

The Dockerfile defines how the image is built. If you look at the end of the file, just before the last line, you can see:

EXPOSE 7000 7001 7199 9042 9160

These are the ports which are used by the container. So to run a Cassandra container and expose all of its ports:

$ docker run -d \
--name=cass1 \
-p 7000-7001:7000-7001 \
-p 7199:7199 \
-p 9042:9042 \
-p 9160:9160 \
cassandra

The first -p option uses a range of ports instead of a single port. This can save a bit of typing if you have a lot of consecutive ports.
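As a sketch of how the same-value mappings could be derived rather than typed by hand, the following filter turns the EXPOSE lines of a Dockerfile into -p options. The function name publish_flags is hypothetical, and the sketch assumes that ports are written as plain numbers, optionally followed by a protocol such as /tcp.

```shell
# Sketch only: publish_flags
# Reads a Dockerfile on stdin and prints one "-p PORT:PORT" option for each
# port listed on its EXPOSE lines, mapping every port to the same host port.
publish_flags() {
    awk '
        toupper($1) == "EXPOSE" {
            for (i = 2; i <= NF; i++) {
                n = split($i, part, "/")            # "9042" or "9042/tcp"
                proto = (n > 1) ? "/" part[2] : ""
                printf "%s-p %s:%s%s", sep, part[1], part[1], proto
                sep = " "
            }
        }
        END { print "" }
    '
}

# Example (not executed here; the substitution is left unquoted on purpose
# so that each option becomes a separate argument):
#   docker run -d --name=cass1 $(publish_flags < Dockerfile) cassandra
```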

Now our containers are isolated in their own network stack, but can still be accessed from their host’s IP address, thanks to port forwarding. One advantage of this, compared to using the host network stack, is that you could deploy 3 HTTP server containers on the same host, each listening on port 80, without port conflict, by mapping each container’s port 80 to 8001, 8002 and 8003 on the host for example.

Mounting a Data Volume

Until now, all the data created in our Cassandra containers was stored in the container file system. This means that if we remove the containers and create new ones, we lose all of our Cassandra data. One way to persist data so that it can be kept after container destruction is to mount a host directory on the container.

This can be done using the -v <host_dir>:<mount_point> option in the run command:

$ docker run \
--name cass1 \
-p 7000-7001:7000-7001 -p 7199:7199 -p 9042:9042 -p 9160:9160 \
-v /data/cassandra:/var/lib/cassandra \
-d cassandra

In this case, we have mounted the /data/cassandra directory of the ubuntu1 host on the /var/lib/cassandra mount point of the Cassandra container. The mount points available on a container are probably documented on the Docker Hub page. If not, you can take a look at the image’s Dockerfile, and find the VOLUME values:

VOLUME /var/lib/cassandra

This way, after destroying a Cassandra node container, its data will still be present on the host’s /data/cassandra directory. If we recreate a new container, we can re-mount the same directory and continue using the same data.


With everything we have seen in this post, and with the use of ssh, we can deploy a cluster from a single client node, for example ubuntu0, in a few commands:

# Deploy the 3 nodes
dock@ubuntu0$ ssh ubuntu1 docker run --name cass1 -v /data/cassandra:/var/lib/cassandra -e CASSANDRA_BROADCAST_ADDRESS="<ubuntu1_ip>" -p 7000-7001:7000-7001 -p 7199:7199 -p 9042:9042 -p 9160:9160 -d cassandra
dock@ubuntu0$ ssh ubuntu2 docker run --name cass2 -v /data/cassandra:/var/lib/cassandra -e CASSANDRA_BROADCAST_ADDRESS="<ubuntu2_ip>" -e CASSANDRA_SEEDS="<seed_node_ip>" -p 7000-7001:7000-7001 -p 7199:7199 -p 9042:9042 -p 9160:9160 -d cassandra
dock@ubuntu0$ ssh ubuntu3 docker run --name cass3 -v /data/cassandra:/var/lib/cassandra -e CASSANDRA_BROADCAST_ADDRESS="<ubuntu3_ip>" -e CASSANDRA_SEEDS="<seed_node_ip>" -p 7000-7001:7000-7001 -p 7199:7199 -p 9042:9042 -p 9160:9160 -d cassandra

# Check status
dock@ubuntu0$ ssh ubuntu1 docker exec cass1 nodetool status

Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  122.34 KB  256          ?       c18755bc-fe98-4dc7-874f-795a5d971858  rack1
UN  117.62 KB  256          ?       1a69ff72-b9e2-43a2-8932-43c7ccef96aa  rack1
UN  202.03 KB  256          ?       c6840392-5841-4050-8885-43fda602ab69  rack1

As you can imagine, you could scale this up to thousands of nodes with a bit of scripting. In later posts we will have a look at tools which can orchestrate container deployment in a more sophisticated way.
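To give an idea of what such scripting might look like, here is a minimal sketch which only prints the ssh commands, so that they can be reviewed before being run. The function names node_run_cmd and deploy_cluster, the "host name ip" input format and the choice of the first listed node as the only seed are all assumptions of mine.

```shell
# Sketch only: node_run_cmd NAME BROADCAST_IP [SEEDS]
# Prints the "docker run" command for one node. SEEDS is omitted for the
# first node, which has nobody to join yet.
node_run_cmd() {
    name="$1"
    ip="$2"
    seeds="${3:-}"
    printf 'docker run --name %s -v /data/cassandra:/var/lib/cassandra' "$name"
    printf ' -e CASSANDRA_BROADCAST_ADDRESS=%s' "$ip"
    if [ -n "$seeds" ]; then
        printf ' -e CASSANDRA_SEEDS=%s' "$seeds"
    fi
    printf ' -p 7000-7001:7000-7001 -p 7199:7199 -p 9042:9042 -p 9160:9160 -d cassandra\n'
}

# Sketch only: deploy_cluster
# Reads "host name ip" lines on stdin and prints one ssh command per node.
# The first node listed is used as the seed for all the following ones.
deploy_cluster() {
    seed=""
    while read -r host name ip; do
        printf 'ssh %s %s\n' "$host" "$(node_run_cmd "$name" "$ip" "$seed")"
        if [ -z "$seed" ]; then
            seed="$ip"
        fi
    done
}

# Example (not executed here, nodes.txt is a hypothetical file):
#   deploy_cluster < nodes.txt        # review the commands
#   deploy_cluster < nodes.txt | sh   # then actually run them
```

Printing the commands first keeps the script easy to check, and piping the output to a shell is a deliberate second step.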

