In my previous series of posts, I focused on the distributed computing frameworks Hadoop and Spark, which had to be manually installed on Ubuntu on each of my cluster nodes.
In this series of posts I will write about how to use Docker to achieve automated, distribution-independent deployment of any type of service on my cluster.
Motivation to use Docker
Previously, Hadoop and Spark were configured to fully use the resources of my nodes, an obvious choice for performance since their sole purpose was to work on their part of a MapReduce or Spark job.
But what if I want to use my cluster for more general purposes, to host a more complex set of different services (web servers, distributed databases, parallel computing, etc.)? What if I don’t want those services to get in the way of each other (resource sharing, crashes, etc.)? What if I want to use my cluster as a testing platform for a bigger cluster, and need all services to be regularly and automatically re-installed? A good way to prevent a lot of these problems and make things much more convenient is to use Docker.
Docker is a relatively new but already very popular OS-level virtualization tool. With Docker, instead of running an application directly on the base OS of each node, we deploy a container (created from a pre-built image) which runs that application. This extra level of abstraction gives us the following benefits:
- Resource management: the RAM/CPU/network usage of each container can be limited.
- Functional isolation: each service runs in its own container, so services don’t interfere with each other.
- For small clusters, containers can be started manually: it only takes a single SSH command to deploy one. Application install/uninstall processes are replaced by extremely fast and simple creation/removal of containers.
- For bigger clusters, orchestration engines (Kubernetes, Mesos…) can automate the deployment of Docker containers.
- Integration with most cloud providers (AWS, GC, Azure, DigitalOcean…).
- Packaging and collaboration:
- Images are built automatically from a simple file, called Dockerfile, which contains build instructions.
- An image is made of several layers, which are version-controlled and reused to save build time and disk space.
- Docker comes with a sharing system to push/pull images to and from a repository (such as Docker Hub).
Docker also has much less resource overhead than VMs, because it doesn’t run a guest kernel for each container: all containers run on top of the host’s Linux kernel. This is a great advantage on small, low-powered computers such as my Minnowboards, which only have a 2-core CPU and 2 GB of RAM each.
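As a preview of the workflow, here is what a minimal Dockerfile might look like (the image content is purely illustrative; we will write real ones later in the series):

```dockerfile
# Each instruction creates a new image layer, which Docker caches and reuses.
FROM ubuntu:16.04                                # base image pulled from Docker Hub
RUN apt-get update && apt-get install -y nginx   # install a service into the image
COPY index.html /var/www/html/                   # add our own files as another layer
EXPOSE 80                                        # document the port the service listens on
CMD ["nginx", "-g", "daemon off;"]               # process to start when the container runs
```

Building and deploying it then takes just a couple of commands, e.g. `docker build -t my-nginx .` followed by `docker run -d -p 80:80 --memory=512m my-nginx` (the `--memory` flag being one of the resource limits mentioned above).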
I’ve come up with a small scenario which should cover a lot of Docker functionality. The objective is to create a scalable, highly-available, and easy-to-deploy platform to host a web application where users can write some posts. This post system will be very simple: a user can only create and read their own posts. No followers, friends, publishing, or anything else. Not even updating or deleting posts. It’s a bit lame, I know, but it will let us focus more on the technical side of things 🙂
Here is a diagram of the different modules which we will use and how they will interact with each other. Each blue box is a module instance, and will typically be a Docker container (sometimes a couple of them working together), so each module can be deployed on any node of the cluster.
More details on these components:
- Django WebApps: using the Nginx web server
- Handles authentication, post creation and viewing, serving dynamic content as well as static files.
- A Load Balancer will be used to distribute load to available web servers.
- MySQL relational database: for transactional and Django system data (user accounts and sessions, etc…).
- Master-slave replication will be used. If the master goes down, a slave can be promoted to master (we won’t go as far as automating this process, but at least we will have backups of the data in the slaves at all times).
- A Load Balancer will be used to forward write requests (INSERT, UPDATE, etc.) to the master only, and to distribute read requests (SELECTs) across the master and slaves. Read scalability can then be achieved simply by adding slaves.
- Cassandra NoSQL database: to store the user posts.
- Master-less replication, no single-point-of-failure, scalable.
- Doesn’t need a load balancer: load is balanced by the client driver and also internally by the cluster nodes.
- Spark: to analyse the Cassandra data.
- Deployed when needed, to quickly process data from all Cassandra nodes in a parallel fashion.
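To make the MySQL read/write split described above concrete, here is a sketch of the kind of HAProxy configuration we will build in Part II (the hostnames `db-master`, `db-slave1` and `db-slave2` are placeholders for the actual container names):

```
# All writes go through port 3306, which only ever talks to the master
listen mysql-write
    bind *:3306
    mode tcp
    server master db-master:3306 check

# Reads go through port 3307 and are spread across master and slaves;
# adding a "server" line here is all it takes to scale reads
listen mysql-read
    bind *:3307
    mode tcp
    balance roundrobin
    server master db-master:3306 check
    server slave1 db-slave1:3306 check
    server slave2 db-slave2:3306 check
```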
Summary of Future Posts
The following posts will be about how to use Docker to prepare images, and then use those images to get this whole platform deployed and working on a cluster.
Here is the list of posts, with the associated topics covered in each of them. This table will be updated regularly, because I haven’t planned everything out yet.
| Post | Topics covered |
| --- | --- |
| Part I: Discovering Docker and Cassandra | Docker installation and basic commands<br>Running a public image from Docker Hub<br>Interacting with Docker containers<br>Docker networking and data volumes |
| Part II: Replicated MySQL with HAProxy Load Balancing | Creating, building and sharing a custom Docker image<br>MySQL replication setup<br>Load balancing using an HAProxy container |
| Part III: The Django/uWSGI/Nginx Web App | Using Nginx as a reverse proxy to uWSGI<br>Communicating with the Cassandra and MySQL clusters in Django<br>Creating a bridge network to communicate between containers<br>Data volume sharing<br>Supervisord to run multiple processes in a container |
| Part IV: Docker Machine, Swarm, Compose | |
| Part V: Kubernetes | |
| Part VI: Mesos with Marathon | |