February 2016 - Nico's Blog

Published February 11, 2016 by Nicolas

Hadoop Basics VIII: Running SQL Queries with Hive

In this part, we will use Hive to execute all the queries that we have been processing since the beginning of this series of tutorials.

In nearly all parts, we have coded MapReduce jobs to solve specific types of queries (filtering, aggregation, sorting, joining, etc.). It was a good exercise to understand Hadoop/MapReduce internals and some distributed processing theory, but it required writing a lot of code. Hive can translate SQL queries into MapReduce jobs, so we can get the results of a query without writing any MapReduce code ourselves.
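To give a rough idea of what "translating a query into a MapReduce job" means, here is a minimal sketch in plain Python of how an aggregation such as `SELECT donor_city, SUM(total) FROM donations GROUP BY donor_city` could be broken into map, shuffle and reduce steps. The query, the field names `donor_city` and `total`, and the sample rows are assumptions made up for illustration. This is not the plan Hive actually generates, only the general shape of the idea.

```python
from collections import defaultdict

# Hypothetical sample rows; the field names are assumptions for this sketch.
SAMPLE_DONATIONS = [
    {"donor_city": "Paris", "total": 10.0},
    {"donor_city": "Lyon", "total": 5.0},
    {"donor_city": "Paris", "total": 2.5},
]


def map_phase(rows):
    """Mapper: emit one (group key, value) pair per input row."""
    for row in rows:
        yield row["donor_city"], row["total"]


def shuffle(pairs):
    """Shuffle: gather all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: apply the aggregate (here SUM) to each group."""
    return {key: sum(values) for key, values in groups.items()}


def sum_by_city(rows):
    """Chain the three steps, as a single MapReduce job would."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(sum_by_city(SAMPLE_DONATIONS))
```

The point of the sketch is that the SQL text only states what we want, while the map, shuffle and reduce functions are the kind of code we had to write by hand in the previous parts.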

We will start by installing Hive and setting up tables for our datasets. Then we will run the queries from the previous parts and check whether Hive can achieve better execution times than our hand-coded MapReduce jobs.

Hadoop Basics

Benchmark, Hive, Joins, Sorting

Published February 4, 2016 by Nicolas

Hadoop Basics VII: Bloom Filters

In this part we will see what Bloom Filters are and how to use them in Hadoop.

We will first focus on creating and testing a Bloom Filter for the Projects dataset. Then we will use that filter in a Repartition Join and in a Replicated Join, and look at how it can help optimize either performance or memory usage.
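As a quick illustration of the data structure itself, here is a minimal Bloom filter sketch in plain Python. It is not Hadoop's implementation: the class name, the default sizes, the SHA-256 based hashing and the sample project ids are all assumptions chosen to keep the example short. The property to keep in mind is that a Bloom filter can return false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus several hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit position computed for this item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present", False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical project ids, used only for this example.
SAMPLE_PROJECT_IDS = ["p1", "p2", "p3"]

if __name__ == "__main__":
    bloom = BloomFilter()
    for project_id in SAMPLE_PROJECT_IDS:
        bloom.add(project_id)
    print(all(bloom.might_contain(pid) for pid in SAMPLE_PROJECT_IDS))
```

In a join, such a filter built from the Projects ids could be used to discard Donations records whose project id is definitely absent, before doing any more expensive work on them.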

Hadoop Basics

Joins, MapReduce

Published February 3, 2016 by Nicolas

Hadoop Basics VI: Replicated Join in MapReduce

The Repartition Join we saw in the previous part is a Reduce-Side Join, because the actual joining is done in the reducer. The Replicated Join we are going to discover in this post is a Map-Side Join: the joining is done in the mappers, and no reducer is even needed for this operation. So in a sense it should be a faster join, but only if certain requirements are met.

We will see how it works, try it on our Donations/Projects datasets, and check whether we can accomplish the same join we did in the previous post.
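The core idea of a map-side join can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the MapReduce code from the post: the field names `project_id`, `donation_id`, `amount` and `title` and the sample records are hypothetical, and the sketch assumes the smaller dataset (Projects here) fits entirely in the memory of each mapper.

```python
def replicated_join(donations, projects):
    """Inner join done entirely on the 'map' side.

    The small dataset is loaded into an in-memory lookup table once,
    then each record of the large dataset is joined as it streams by.
    """
    # Setup step: every mapper holds a full copy of the small dataset.
    projects_by_id = {project["project_id"]: project for project in projects}

    # Map step: one lookup per donation, no shuffle and no reducer.
    for donation in donations:
        project = projects_by_id.get(donation["project_id"])
        if project is not None:  # inner join: skip unmatched donations
            yield {**donation, **project}


# Hypothetical sample records, used only for this example.
SAMPLE_PROJECTS = [{"project_id": "p1", "title": "Books"}]
SAMPLE_DONATIONS = [
    {"donation_id": "d1", "project_id": "p1", "amount": 10.0},
    {"donation_id": "d2", "project_id": "p9", "amount": 5.0},
]

if __name__ == "__main__":
    for joined in replicated_join(SAMPLE_DONATIONS, SAMPLE_PROJECTS):
        print(joined)
```

Because each donation is joined with a single dictionary lookup, nothing has to be sorted or sent across the network to a reducer, which is where the expected speed-up comes from.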

Hadoop Basics

Joins, MapReduce

Month: February 2016
