Author: Nicolas

We saw in the previous part that when using multiple reducers, each reducer receives the (key,value) pairs assigned to it by the Partitioner. The pairs received by a reducer are sorted by key, so the output of each reducer is generally also sorted by key. However, the outputs of different reducers are not ordered relative to each other, so they cannot be concatenated or read sequentially in the correct order.

For example, with 2 reducers sorting on simple Text keys, you can have:
– Reducer 1 output: (a,5), (d,6), (w,5)
– Reducer 2 output: (b,2), (c,5), (e,7)
The keys are only sorted if you look at each output individually; if you read the outputs one after the other, the overall ordering is broken.

The objective of Total Order Sorting is to have all outputs sorted across all reducers:
– Reducer 1 output: (a,5), (b,2), (c,5)
– Reducer 2 output: (d,6), (e,7), (w,5)
This way the outputs can be read/searched/concatenated sequentially as a single ordered output.

In this post we will first see how to create a manual total order sorting with a custom partitioner. Then we will learn to use Hadoop’s TotalOrderPartitioner to automatically create partitions on simple type keys. Finally we will see a more advanced technique which uses our Secondary Sort’s Composite Key (from the previous post) with this partitioner to achieve “Total Secondary Sorting”.
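To give an idea of the first approach, here is a minimal sketch of what a manual range partitioner could look like, assuming simple Text keys and hand-picked alphabetical split points (the class name and split values are purely illustrative, not taken from the post):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch only: routes Text keys to reducers by comparing them to
// hand-picked alphabetical split points, so reducer 0 receives the lowest key
// range, reducer 1 the next range, and so on.
public class ManualRangePartitioner extends Partitioner<Text, IntWritable> {

    // Hypothetical split points; in practice they should reflect the actual
    // key distribution, otherwise the reducers will be unbalanced.
    private static final String[] SPLIT_POINTS = { "d", "m", "t" };

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        int partition = 0;
        // Keys below "d" go to reducer 0, keys below "m" to reducer 1, etc.
        while (partition < SPLIT_POINTS.length
                && partition < numPartitions - 1
                && k.compareTo(SPLIT_POINTS[partition]) >= 0) {
            partition++;
        }
        return partition;
    }
}
```

Such a partitioner would be registered on the job with job.setPartitionerClass(ManualRangePartitioner.class). The hard part, which TotalOrderPartitioner solves by sampling the input, is choosing split points that spread the keys evenly across reducers.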

Hadoop Basics

Shuffle phase workflow

In the previous part we needed to sort our results on a single field. In this post we will learn how to sort data on multiple fields by using Secondary Sort.

We will first pose a query to solve, as we did in the last post, which will require sorting the dataset on multiple fields. Then we will study how the MapReduce shuffle phase works, before implementing our Secondary Sort to obtain the results for the given query.
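As a reminder of the idea, here is a minimal sketch of what a composite key for Secondary Sort can look like, assuming a String natural key and an int secondary sorting field (the field names are illustrative, not taken from the dataset):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Illustrative composite key sketch: a "natural" key used for grouping and
// partitioning, plus a secondary field used only to order records within a group.
public class CompositeKey implements WritableComparable<CompositeKey> {

    private String naturalKey;   // grouping / partitioning field
    private int secondaryField;  // sorting-only field

    public CompositeKey() { }

    public CompositeKey(String naturalKey, int secondaryField) {
        this.naturalKey = naturalKey;
        this.secondaryField = secondaryField;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(naturalKey);
        out.writeInt(secondaryField);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        naturalKey = in.readUTF();
        secondaryField = in.readInt();
    }

    @Override
    public int compareTo(CompositeKey other) {
        // Sort on the natural key first, then on the secondary field.
        int cmp = naturalKey.compareTo(other.naturalKey);
        return (cmp != 0) ? cmp : Integer.compare(secondaryField, other.secondaryField);
    }
}
```

To complete the pattern, the Partitioner and the grouping comparator only look at the natural key, so that all records of a group reach the same reduce call while arriving sorted on the secondary field.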

Hadoop Basics

Now that we have a Sequence File containing our newly “structured” data, let’s see how we can get the results of a basic query using MapReduce.

MapReduce Sketch
We will illustrate how filtering, aggregation and simple sorting can be achieved in MapReduce. For beginners, these are fundamental operations that can help you understand the MapReduce framework. Advanced readers can still skim it to get familiar with the dataset and get ready for the next posts, which will cover more advanced sorting and joining techniques.
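As a teaser of what these operations look like in code, here is a minimal, hypothetical sketch of filtering and aggregation in MapReduce: the mapper filters and projects the input records, and the reducer sums the emitted counts (the field positions and the filter condition are made up for the example, not taken from the dataset):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch: filter + project in the mapper, aggregate (sum) in the reducer.
public class FilterAggregateExample {

    public static class FilterMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // Filtering: keep only records matching some condition.
            if (fields.length > 1 && !fields[1].isEmpty()) {
                outKey.set(fields[0]);       // projection: keep one field as the key
                context.write(outKey, ONE);  // emit a count of 1 for aggregation
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();  // aggregation: sum the counts for this key
            }
            context.write(key, new LongWritable(sum));
        }
    }
}
```

Simple sorting on the output key then comes almost for free, since the shuffle phase already sorts each reducer’s input by key.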

Hadoop Basics

In this new series of posts, we will explore basic techniques for querying structured data. Querying means filtering, projecting, aggregating, sorting and joining data. We will look at different ways of querying with different Hadoop frameworks (MapReduce, Hive, Spark, etc.).

Sequence Files Compression
This first part will briefly introduce the dataset which will be used throughout this series, and then present the Sequence File data format. We will see how to write and read Sequence Files with a few code snippets, and benchmark the different compression types to choose the best one.

We will format our dataset in a Sequence File, for later use in various frameworks in my following posts.
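To give a rough idea before the full post, here is a minimal sketch of writing and then reading a block-compressed Sequence File with the Hadoop 2.x API (the path, key/value types and codec are assumptions for the example, not necessarily what the post uses):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

// Illustrative sketch: write a few (Text, IntWritable) records to a
// block-compressed Sequence File, then read them back sequentially.
public class SequenceFileExample {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path("example.seq");  // hypothetical output path

        // Write the records with block compression and the default codec.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK, new DefaultCodec()))) {
            writer.append(new Text("a"), new IntWritable(5));
            writer.append(new Text("b"), new IntWritable(2));
        }

        // Read the records back in order.
        Text key = new Text();
        IntWritable value = new IntWritable();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}
```

Swapping the CompressionType and codec in the writer options is what makes it easy to benchmark the different compression settings mentioned above.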

Hadoop Basics

Zeppelin Logo
In the previous post we saw how to quickly get IPython up and running with PySpark.

Now we will set up Zeppelin, which can run both Spark-Shell (in Scala) and PySpark (in Python) jobs from its notebooks.

We will build, run and configure Zeppelin to run the same Spark jobs in Scala and Python, using the Zeppelin SQL interpreter and Matplotlib to visualize SparkSQL query results.

To conclude this post, we will compare Scala and Python speeds, as well as Zeppelin and IPython.

Spark

Here is an easy way of running PySpark on IPython notebook for data science and visualization.

There are methods on the web that consist of creating an IPython profile or kernel in which PySpark is started along with the other necessary jars. These methods can seem a bit complicated and are not suitable for all versions of IPython, especially the newest ones, where profiles are deprecated because they were merged into the Jupyter configuration.

So here is a simple way to run PySpark with a basic default IPython configuration, in 10 minutes, for any version of IPython later than 1.0.0.

IPython Banner

We will use the Anaconda python distribution, because it can quickly install IPython and all the necessary scientific and data analysis tools by running a single installation script.

Spark

Benchmark Art
In this part, we will run a simple Word Count application on the cluster using Hadoop and Spark on various platforms and cluster sizes.

We will run and benchmark the same program on 5 datasets of different sizes on:

  • A single MinnowBoard MAX, using a simple multi-threaded Java application
  • A real home computer (my laptop), using the same Java application
  • MapReduce, using a cluster of 2 to 4 slaves
  • Spark, using a cluster of 2 to 4 slaves

Using these results we will hopefully be able to answer the original questions of this section: is a home cluster of such small computers worth it? How many nodes does it take to be faster than a single node, or faster than a real computer?

Mini-Cluster

Hadoop Spark logos

In this part, we will see how to install and configure Hadoop (2.7.1) and Spark (1.5.1) to have one master and four slaves.

The configurations in this part are adapted for MinnowBoard SBCs. I tried to give as much explanation as possible about the chosen values, which depend on the resources of this specific cluster. If you have any questions, or if you doubt my configuration, feel free to comment. 🙂

We start by creating a user that we will use for all Hadoop-related tasks. Then we will see how to install and configure the master and slaves. Finally, we will run a simple MapReduce job to check that everything works and to start getting familiar with the Hadoop ecosystem.

Mini-Cluster

Ethernet Icon
In this part, I will explain how I installed and configured the Ubuntu Server OS (along with the necessary tools and libraries) and network settings in order to prepare each node for Hadoop and Spark.

I wrote this part as a memo for myself and also to help beginners who are not comfortable with Ubuntu and networking. It can also be useful for people who have trouble specifically with the MinnowBoard MAX.

Mini-Cluster

In this first part, I will explain how and why I selected the various hardware (computer, storage, networking, etc.) to build my home cluster.

MinnowBoard MAX
The MinnowBoard MAX Single Board Computer

First we will compare the specs of the MinnowBoard with those of the Raspberry Pi 2. Then we will look at the different storage media available on Single Board Computers, with a few explanations and benchmarks that I made. Then I will show what network and rack setup I chose, and finally we will sum up all the component prices to see the total cost of my mini-cluster!

Mini-Cluster