Benchmark Archives - Nico's Blog

Published February 11, 2016 by Nicolas

Hadoop Basics VIII: Running SQL Queries with Hive

In this part, we will use Hive to execute all the queries that we have been processing since the beginning of this series of tutorials.

In nearly all parts, we have coded MapReduce jobs to solve specific types of queries (filtering, aggregation, sorting, joining, etc.). It was a good exercise for understanding Hadoop/MapReduce internals and some distributed processing theory, but it required writing a lot of code. Hive can translate SQL queries into MapReduce jobs, so we can get the results of a query without writing any MapReduce code.

We will start by installing Hive and setting up tables for our datasets, then execute the queries from the previous parts and check whether Hive achieves better execution times than our hand-coded MapReduce jobs.

Hadoop Basics

Benchmark, Hive, Joins, Sorting

Published December 18, 2015 by Nicolas

Hadoop Basics I: Working with Sequence Files

In this new series of posts, we will explore basic techniques for querying structured data. Querying means filtering, projecting, aggregating, sorting and joining data. We will look at different querying methods on different Hadoop frameworks (MapReduce, Hive, Spark, etc.).

This first part will briefly introduce the dataset which will be used throughout this series, and then present the Sequence File data format. We will see how to write and read Sequence Files with a few code snippets, and benchmark the different compression types to choose the best one.

We will store our dataset as a Sequence File, for later use with various frameworks in the following posts.

Hadoop Basics

Benchmark, Data Format, Hadoop, HDFS

Published October 31, 2015 by Nicolas

Mini-Cluster Part IV : Word Count Benchmark

In this part, we will run a simple Word Count application on the cluster using Hadoop and Spark on various platforms and cluster sizes.

We will run and benchmark the same program on 5 datasets of different sizes on:

A single MinnowBoard MAX, using a simple multi-threaded Java application
A real home computer (my laptop), using the same simple Java application
MapReduce, using a cluster of 2 to 4 slaves
Spark, using a cluster of 2 to 4 slaves

Using these results, we will hopefully be able to answer the original questions of this section: is a home cluster made of such small computers worth it? How many nodes does it take to be faster than a single node, or faster than a real computer?
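To give an idea of what the "simple multi-threaded Java application" in the list above might look like, here is a minimal sketch. It is not the program used in the benchmark: the class name `WordCount`, the `count` method, the whitespace tokenization and the lower-casing are all assumptions made for illustration, and the threading comes from a parallel stream rather than from hand-managed threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCount {

    // Hypothetical helper: counts word occurrences across all lines.
    // The parallel stream spreads the work over the available cores.
    static Map<String, Long> count(List<String> lines) {
        return lines.parallelStream()
                // Assumption: words are separated by whitespace.
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // Blank lines produce an empty token, so drop those.
                .filter(word -> !word.isEmpty())
                // Assumption: counting is case-insensitive.
                .map(String::toLowerCase)
                // Thread-safe grouping: one counter per distinct word.
                .collect(Collectors.groupingByConcurrent(word -> word, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "that is the question");
        // Print each word with its count (iteration order is not guaranteed).
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

A real benchmark would read the lines from the dataset files instead of an in-memory list, but the counting logic would stay much the same.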

Mini-Cluster

Benchmark, MapReduce, MinnowBoard, Spark, YARN
