Nico's Blog

Published January 4, 2016 by Nicolas

Hadoop Basics IV: Total Order Sorting in MapReduce

We saw in the previous part that when using multiple reducers, each reducer receives the (key,value) pairs assigned to it by the Partitioner. A reducer receives those pairs sorted by key, so the output of a reducer is generally also sorted by key. However, the outputs of different reducers are not ordered relative to one another, so they cannot be concatenated or read sequentially in the correct order.

For example, with 2 reducers sorting on simple Text keys, you can have:
– Reducer 1 output : (a,5), (d,6), (w,5)
– Reducer 2 output : (b,2), (c,5), (e,7)
The keys are only sorted if you look at each output individually, but if you read one after the other, the ordering is broken.

The objective of Total Order Sorting is to have all outputs sorted across all reducers:
– Reducer 1 output : (a,5), (b,2), (c,5)
– Reducer 2 output : (d,6), (e,7), (w,5)
This way the outputs can be read/searched/concatenated sequentially as a single ordered output.
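
To make the difference concrete, here is a minimal Python sketch that simulates the shuffle for the six pairs above. It is not Hadoop code: `run_reducers`, `hash_like_partition` and `range_partition` are hypothetical helper names, and the split point `"d"` is hard-coded for this tiny example.

```python
from collections import defaultdict


def run_reducers(pairs, num_reducers, partition):
    """Route each pair to a reducer, then sort each reducer's input by key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(buckets[r]) for r in range(num_reducers)]


def hash_like_partition(key, num_reducers):
    # Stand-in for a hash partitioner: deterministic, but unrelated to key order.
    return ord(key[0]) % num_reducers


def range_partition(key, num_reducers, split_points=("d",)):
    # Keys below the first split point go to reducer 0, the rest to reducer 1.
    # (num_reducers is unused here because the split points are hard-coded.)
    for index, split in enumerate(split_points):
        if key < split:
            return index
    return len(split_points)


def concatenate(outputs):
    """Read the reducer outputs one after the other."""
    return [pair for output in outputs for pair in output]


pairs = [("w", 5), ("b", 2), ("a", 5), ("e", 7), ("d", 6), ("c", 5)]

hashed = run_reducers(pairs, 2, hash_like_partition)
ranged = run_reducers(pairs, 2, range_partition)

print(concatenate(hashed))  # each reducer output is sorted, the whole is not
print(concatenate(ranged))  # sorted across both reducers
```

With the range-based partitioner, reducer 1 gets a, b, c and reducer 2 gets d, e, w, so reading the outputs in reducer order gives one fully sorted result.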

In this post we will first see how to implement total order sorting manually with a custom partitioner. Then we will learn to use Hadoop’s TotalOrderPartitioner to automatically create partitions on simple type keys. Finally we will see a more advanced technique to use our Secondary Sort’s Composite Key (from the previous post) with this partitioner to achieve “Total Secondary Sorting”.
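
Creating partitions automatically comes down to choosing good split points. The Python sketch below illustrates the general idea of sampling the keys to pick them; it is a simplified assumption about the approach, not Hadoop's actual sampler, and `choose_split_points` and `partition` are hypothetical names.

```python
import bisect
import random


def choose_split_points(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def partition(key, split_points):
    # A key goes to the partition whose range contains it; keys equal to a
    # split point go to the higher partition.
    return bisect.bisect_right(split_points, key)


keys = list(range(1000))
random.Random(1).shuffle(keys)

splits = choose_split_points(keys, num_reducers=4, sample_size=100)

counts = [0] * 4
for key in keys:
    counts[partition(key, splits)] += 1

print(splits)  # three split points taken from the sample
print(counts)  # number of keys per reducer; roughly balanced if the sample is good
```

The better the sample represents the key distribution, the more evenly the keys are spread over the reducers.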

Hadoop Basics

MapReduce Sorting

Published December 28, 2015 by Nicolas

Hadoop Basics III: Secondary Sort in MapReduce

In the previous part we needed to sort our results on a single field. In this post we will learn how to sort data on multiple fields by using Secondary Sort.

We will first pose a query to solve, as we did in the last post, which will require sorting the dataset on multiple fields. Then we will study how the MapReduce shuffle phase works, before implementing our Secondary Sort to obtain the results for the given query.
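
As a rough preview of the idea, here is a small Python simulation of a secondary sort. The records are made up for illustration (they are not the dataset used in the post) and all function names are hypothetical. The key points are that the map output key is a composite of (natural key, secondary field), that partitioning and grouping use the natural key only, and that sorting uses the full composite key.

```python
from itertools import groupby

# Hypothetical sample records: (state, city, amount)
records = [
    ("CA", "Fresno", 3),
    ("NY", "Albany", 7),
    ("CA", "Anaheim", 9),
    ("NY", "Buffalo", 1),
    ("CA", "Davis", 4),
]


def map_phase(records):
    # Emit a composite key (natural key, secondary field) with the value.
    for state, city, amount in records:
        yield (state, city), amount


def partition(composite_key, num_reducers):
    # Partition on the natural key only, so that all records of a state reach
    # the same reducer. A character sum keeps the result reproducible.
    state = composite_key[0]
    return sum(map(ord, state)) % num_reducers


def shuffle(pairs, num_reducers):
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    # Sort step: order each reducer's input by the full composite key.
    return [sorted(bucket) for bucket in buckets]


def reduce_phase(bucket):
    # Grouping step: one reduce call per natural key, values already ordered
    # by the secondary field.
    for state, group in groupby(bucket, key=lambda pair: pair[0][0]):
        yield state, [(key[1], value) for key, value in group]


results = {}
for bucket in shuffle(map_phase(records), num_reducers=2):
    results.update(reduce_phase(bucket))

print(results)
```

Each state ends up with its cities in alphabetical order without the reducer having to sort anything itself.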

Hadoop Basics

MapReduce Sorting

Published December 24, 2015 by Nicolas

Hadoop Basics II: Filter, Aggregate and Sort with MapReduce

Now that we have a Sequence File containing our newly “structured” data, let’s see how we can get the results of a basic query using MapReduce.

We will illustrate how filtering, aggregation and simple sorting can be achieved in MapReduce. For beginners, these are fundamental operations that can help you understand the MapReduce framework. Advanced readers can still read it quickly to get familiar with the dataset and get ready for the next posts which will be about more advanced sorting and joining techniques.
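
The three operations can be sketched in a few lines of Python that imitate the map, shuffle and reduce steps. This is only a toy simulation with made-up records and hypothetical function names, not the job from the post: the mapper filters, the reducer aggregates, and a second pass sorts by using the aggregated value as the key.

```python
from collections import defaultdict

# Hypothetical sample records: (customer, category, amount)
records = [
    ("alice", "books", 12.0),
    ("bob", "games", 30.0),
    ("alice", "games", 8.0),
    ("carol", "books", 5.0),
    ("bob", "books", 20.0),
]


def filter_mapper(record):
    customer, category, amount = record
    if category == "books":  # filtering happens in the mapper
        yield customer, amount


def sum_reducer(key, values):
    yield key, sum(values)  # aggregation happens in the reducer


def swap_mapper(pair):
    key, total = pair
    yield total, key  # make the total the key so the shuffle sorts on it


def identity_reducer(key, values):
    for value in values:
        yield key, value


def run_job(records, mapper, reducer):
    """Toy single-reducer job: map, group by key, reduce in sorted key order."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


totals = run_job(records, filter_mapper, sum_reducer)
by_total = run_job(totals, swap_mapper, identity_reducer)

print(totals)    # one total per customer, ordered by customer
print(by_total)  # same totals, ordered by amount
```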

Hadoop Basics

MapReduce Sorting

Published December 18, 2015 by Nicolas

Hadoop Basics I: Working with Sequence Files

In this new series of posts, we will explore basic techniques for querying structured data. Querying means filtering, projecting, aggregating, sorting and joining data. We will look at different methods of querying on different Hadoop frameworks (MapReduce, Hive, Spark, etc.).

This first part will briefly introduce the dataset which will be used throughout this series, and then present the Sequence File data format. We will see how to write and read Sequence Files with a few code snippets, and benchmark the different compression types to choose the best one.

We will format our dataset in a Sequence File, for later use in various frameworks in my following posts.
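
To give an intuition of why the compression type matters, here is a small Python sketch. It assumes that "compression types" refers to the Sequence File choice between no compression, record-level compression and block-level compression. It uses `zlib` on made-up records as a stand-in for Hadoop's codecs, so the sizes only illustrate the principle and are not benchmark results.

```python
import zlib

# Hypothetical records, similar in shape to rows of a structured dataset.
records = [f"{i},2015-12-18,donation,{i % 7 * 10}".encode() for i in range(1000)]

# No compression: records are stored as they are.
none_size = sum(len(record) for record in records)

# Record-level compression: each record is compressed on its own.
record_size = sum(len(zlib.compress(record)) for record in records)

# Block-level compression: many records are compressed together.
block_size = len(zlib.compress(b"\n".join(records)))

print(none_size, record_size, block_size)
```

Compressing many similar records together lets the compressor exploit the repetition between records, which is why the block-level size comes out much smaller here.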

Hadoop Basics

Benchmark Data Format Hadoop HDFS

Published November 16, 2015 by Nicolas

Setting up Zeppelin for Spark in Scala and Python

In the previous post we saw how to quickly get IPython up and running with PySpark.

Now we will set up Zeppelin, which can run both Spark-Shell (in Scala) and PySpark (in Python) Spark jobs from its notebooks.

We will build, run and configure Zeppelin to run the same Spark jobs in Scala and Python, using the Zeppelin SQL interpreter and Matplotlib to visualize SparkSQL query results.

To conclude this post, we will compare the speed of Scala and Python, and compare Zeppelin with IPython.

Spark

IPython Matplotlib Python Scala Spark SparkSQL Zeppelin

Published November 9, 2015 by Nicolas

Quick setup for PySpark with IPython notebook

Here is an easy way of running PySpark on IPython notebook for data science and visualization.

There are methods on the web that consist of creating an IPython profile or kernel in which PySpark must be started along with other necessary jars. These methods can seem a bit complicated and are not suitable for all versions of IPython, especially the newest ones, where profiles are deprecated because they were merged into the Jupyter configs.

So this is a simple way to run PySpark with a basic default IPython configuration, in 10 minutes, for any version of IPython later than 1.0.0.

We will use the Anaconda Python distribution, because it can quickly install IPython and all the necessary scientific and data analysis tools by running a single installation script.

Spark

Anaconda IPython Matplotlib Python Spark SparkSQL

Published October 31, 2015 by Nicolas

Mini-Cluster Part IV : Word Count Benchmark

In this part, we will run a simple Word Count application on the cluster using Hadoop and Spark on various platforms and cluster sizes.

We will run and benchmark the same program on 5 datasets of different sizes on:

– A single MinnowBoard MAX, using a simple multi-threaded Java application
– A real home computer (my laptop), using the same simple Java application
– MapReduce, using a cluster of 2 to 4 slaves
– Spark, using a cluster of 2 to 4 slaves

Using these results we will hopefully be able to answer the original questions of this section: is a home cluster with such small computers worth it? How many nodes does it take to be faster than a single node, or faster than a real computer?
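
For readers who have never written one, a multi-threaded word count can be sketched as follows. The benchmark itself uses a Java application; this Python version is only an illustration of the split, count and merge structure, with hypothetical function names and sample lines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    """Count the words of a chunk of lines (case-insensitive)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, workers=4):
    """Split the lines into chunks, count each chunk in a thread, merge the results."""
    chunk = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total


lines = ["the quick brown fox", "the lazy dog", "The end"]
result = parallel_word_count(lines, workers=2)
print(result.most_common(1))
```

Note that in CPython the global interpreter lock limits the speedup of CPU-bound threads, so this sketch shows the structure of the program rather than its performance.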

Mini-Cluster

Benchmark MapReduce MinnowBoard Spark YARN

Published October 24, 2015 by Nicolas

Mini-Cluster Part III : Hadoop & Spark Installation

In this part, we will see how to install and configure Hadoop (2.7.1) and Spark (1.5.1) to have one master and four slaves.

The configurations in this part are adapted for MinnowBoard SBCs. I tried to explain the chosen values as much as possible, since they depend on the resources of this specific cluster. If you have any questions, or if you doubt my configuration, feel free to comment. 🙂

We start by creating a user which we will use for all Hadoop-related tasks. Then we will see how to install and configure the master and slaves. Finally we will run a simple MapReduce job to check that everything works and to start getting familiar with the Hadoop ecosystem.

Mini-Cluster

Hadoop HDFS MapReduce Spark YARN

Published October 17, 2015 by Nicolas

Mini-Cluster Part II : Node & Network Setup

In this part, I will explain how I installed and configured the Ubuntu Server OS (along with the necessary tools and libraries) and network settings in order to prepare each node for Hadoop and Spark.

I wrote this part as a memo for myself and also to help out beginners who are not comfortable with Ubuntu and networking. It can also be useful for people who have trouble specifically with the MinnowBoard MAX.

Mini-Cluster

EFI-Boot MinnowBoard Networking Ubuntu

Published October 11, 2015 by Nicolas

Mini-Cluster Part I : Technical & Financial Choices

In this first part, I will explain how and why I selected the various hardware (computer, storage, networking, etc.) to build my home cluster.

The MinnowBoard MAX Single Board Computer

First we will compare the specs of the MinnowBoard with those of the Raspberry Pi 2. Then we will look at the different storage media available on Single Board Computers, with a few explanations and benchmarks that I made. Then I will show which network and rack setup I chose, and finally we will add up the component prices to see the total cost of my mini-cluster!

Mini-Cluster

HDD MinnowBoard Storage
