Hands on with Hadoop


This post begins a series of exercises on Hadoop and its ecosystem. Here is a rough outline of the series:
  1. Hello Hadoop World, including loading data into and copying results out of HDFS.
  2. Hadoop Programming environments with Python, Pig and Hive. 
  3. Importing data into HDFS. 
  4. Importing data into HBase. 
  5. Other topics TBD.  
We assume that the reader is familiar with the concepts of Map/Reduce; if not, Google is your friend. Here are two of my favorite introductions: (1) The Story of Sam and (2) Map Reduce: a really simple introduction.

Hello Hadoop World

We build this first environment on Amazon EC2. We begin with ami-e450a28d, an EBS-based 64-bit Ubuntu machine that's pretty bare-bones. To run Hadoop we need at least an m1.large instance; a micro instance is just too painful.
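
If you would rather script the launch than click through the AWS console, here is a minimal sketch using the boto library. boto is my addition, not something the post depends on, and the region, key pair and security group names are placeholders.

    # Launch the instance described above with boto (Python EC2 bindings).
    # Assumptions: credentials in the environment or ~/.boto, the AMI lives
    # in us-east-1, and 'my-hadoop-key' / 'default' are placeholder names.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    reservation = conn.run_instances(
        'ami-e450a28d',               # EBS-based 64-bit Ubuntu AMI from the post
        instance_type='m1.large',     # anything smaller is too painful for Hadoop
        key_name='my-hadoop-key',
        security_groups=['default'])
    print(reservation.instances[0].id)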

For the actual exercise, we will use the excellent tutorial by Michael Noll. Michael first wrote it in 2007 and has kept it up to date. The tutorial walks you through installing Hadoop and using it to count every word in a large text. Michael's post is pretty much right on. Things to keep in mind:
  • Make sure Python 2.6 is installed on the machine and is the default Python version. If you are unsure, invoke python from the shell; the interpreter prints its version on startup (see the snippet after this list).
  • ami-e450a28d has no Java pre-installed on it. Michael's instructions are slightly obsolete and you have to update the /etc/apt/sources.list file as documented here.
  • lzop needs to be installed. sudo apt-get install lzop.
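
For the first bullet, here is a quick sanity check you can run on the instance; the 2.6 requirement comes from the tutorial's scripts, and the check itself is just my suggestion.

    # Confirm that the default `python` on the box is 2.6, the version the
    # tutorial expects.
    import sys

    print(sys.version)                                   # full version banner
    assert sys.version_info[:2] == (2, 6), 'default python is not 2.6'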

Once everything is installed, we download some books from gutenberg.org and run Map/Reduce to get a count of every word in all of those books combined.

A parenthetical note: too many people have misused the Amazon environment to hit public websites like gutenberg.org, so Project Gutenberg now disallows access from Amazon EC2. You have to find another route to get the files onto your server. One effective way is to download them to your laptop and then FTP them over to the server.

Once the files are on the server, we copy them into HDFS, run the Hadoop job, and copy the results back to the local file system, roughly as in the sketch below. We could do something more sophisticated, and will in later posts.
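
Concretely, the round trip looks roughly like this. The sketch drives the hadoop command from Python via subprocess only to keep the examples in one language; running the same commands directly in the shell is equivalent. All paths, including the streaming jar location, are assumptions about a typical single-node install, and mapper.py/reducer.py are the scripts sketched in the next section.

    # Sketch of the HDFS round trip: local disk -> HDFS, run the job, HDFS -> local disk.
    import subprocess

    def run(cmd):
        print(' '.join(cmd))
        subprocess.check_call(cmd)

    # 1. Copy the downloaded books into HDFS.
    run(['hadoop', 'fs', '-put', '/home/ubuntu/gutenberg', '/user/ubuntu/gutenberg'])

    # 2. Run the word count as a streaming job (the tutorial's Java WordCount
    #    example jar works just as well).
    run(['hadoop', 'jar', '/usr/local/hadoop/contrib/streaming/hadoop-streaming.jar',
         '-file', 'mapper.py', '-mapper', 'mapper.py',
         '-file', 'reducer.py', '-reducer', 'reducer.py',
         '-input', '/user/ubuntu/gutenberg', '-output', '/user/ubuntu/gutenberg-output'])

    # 3. Merge the part files and pull the result back to the local file system.
    run(['hadoop', 'fs', '-getmerge', '/user/ubuntu/gutenberg-output',
         '/home/ubuntu/gutenberg-wordcounts.txt'])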

How does it work? 

The source code of the WordCount program is pretty simple. Each mapper takes a line as input and breaks it into words, then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value pair with the word and its total.
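
The WordCount that ships with Hadoop is written in Java; since Python is where this series is headed, here is a minimal sketch of the same mapper/reducer logic as two streaming scripts (mapper.py and reducer.py, the placeholder names used in the previous section). It illustrates the idea rather than reproducing the actual WordCount source.

    #!/usr/bin/env python
    # mapper.py -- read lines from stdin, emit "word<TAB>1" for every word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t%s' % (word, 1))

Hadoop sorts the mapper output by key before it reaches the reducers, so the reducer only needs a running total per word:

    #!/usr/bin/env python
    # reducer.py -- input arrives sorted by word, so keep a running count and
    # emit "word<TAB>total" each time the word changes.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip('\n').split('\t', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print('%s\t%d' % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print('%s\t%d' % (current_word, current_count))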
