Hands on with Hadoop


This post begins a series of exercises on Hadoop and its ecosystem. Here is a rough outline of the series:
  1. Hello Hadoop World, including loading data into and copying results out of HDFS.
  2. Hadoop programming environments: Python, Pig, and Hive.
  3. Importing data into HDFS. 
  4. Importing data into HBase. 
  5. Other topics TBD.  
We assume that the reader is familiar with the concepts of MapReduce. If not, Google is your friend. Here are two of my favorite introductions: (1) The Story of Sam and (2) Map Reduce: a really simple introduction.

Hello Hadoop World

We build this first environment on Amazon EC2. We begin with ami-e450a28d, an EBS-based 64-bit Ubuntu machine that's pretty bare-bones. To run Hadoop, we need at least an m1.large instance; a micro instance is just too painful.
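For reference, provisioning such an instance from the command line looks roughly like this. This is a sketch using the AWS CLI; the key pair and security group names are placeholders for your own:

```shell
# Launch one m1.large instance from the AMI used in this post.
# "my-keypair" and "my-ssh-group" are placeholders, not names from the post.
aws ec2 run-instances \
    --image-id ami-e450a28d \
    --instance-type m1.large \
    --key-name my-keypair \
    --security-groups my-ssh-group

# Once the instance is running, SSH in as the default Ubuntu user.
ssh -i my-keypair.pem ubuntu@<instance-public-dns>
```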

For the actual exercise, we will use the excellent tutorial by Michael Noll. Michael first wrote it in 2007 and has kept it up to date. The tutorial walks you through installing Hadoop and using MapReduce to count the occurrences of every word in a large text. Michael's post is pretty much right on. Things to keep in mind:
  • Make sure Python 2.6 ...
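To give a flavor of the exercise, the word count can be sketched in a few lines of Python. In the tutorial the two halves live in standalone mapper.py and reducer.py scripts that read stdin and write stdout; the function names below are my own, chosen so the logic can be shown (and tested) in one file:

```python
# A sketch of the Hadoop Streaming word count from Michael Noll's tutorial.
# The tutorial splits this into mapper.py and reducer.py; the function
# names here are illustrative, not from the tutorial.

def map_lines(lines):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reduce_pairs(pairs):
    """Reducer: sum the counts for each word. Hadoop sorts mapper output
    by key before the reduce phase, so equal words arrive consecutively."""
    current, total = None, 0
    for word, count in pairs:
        if word == current:
            total += count
        else:
            if current is not None:
                yield current, total
            current, total = word, count
    if current is not None:
        yield current, total
```

On the cluster you would load the input text into HDFS with `hadoop fs -put`, launch the two scripts with the Hadoop Streaming jar (`hadoop jar ... -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input ... -output ...`), and copy the results back out with `hadoop fs -get`, which is exactly the HDFS round trip of exercise 1.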