Month: <span>December 2015</span>

Shuffle phase workflow
Shuffle phase workflow

In the previous part we needed to sort our results on a single field. In this post we will learn how to sort data on multiple fields by using Secondary Sort.

We will first pose a query to solve, as we did in the last post, which will require sorting the dataset on multiple fields. Then we will study how the MapReduce shuffle phase works, before implementing our Secondary Sort to obtain the results for the given query.

Hadoop Basics

Now that we have a Sequence File containing our newly “structured” data, let’s see how can get the results to a basic query using MapReduce.

MapReduce SketchWe will illustrate how filtering, aggregation and simple sorting can be achieved in MapReduce. For beginners, these are fundamental operations that can help you understand the MapReduce framework. Advanced readers can still read it quickly to get familiar with the dataset and get ready for the next posts which will be about more advanced sorting and joining techniques.

Hadoop Basics

In this new series of posts, we will explore basic techniques on how to query structured data. Querying means filtering, projecting, aggregating, sorting and joining data. We will view different methods of querying on different Hadoop frameworks (MapReduce, Hive, Spark, etc …).

Sequence Files CompressionThis first part will briefly introduce the dataset which will be used throughout this series, and then present the Sequence File data format. We will see how to write and read Sequence Files with a few code snippets, and benchmark the different compression types to choose the best one.

We will format our dataset in a Sequence File, for later use in various frameworks in my following posts.

Hadoop Basics