Hands on with Hadoop
Posted by J Singh | Filed under Amazon EC2, Hadoop
This post begins a series of exercises on Hadoop and its ecosystem. Here is a rough outline of the series:
- Hello Hadoop World, including loading data into and copying results out of HDFS.
- Hadoop Programming environments with Python, Pig and Hive.
- Importing data into HDFS.
- Importing data into HBase.
- Other topics TBD.
We assume that the reader is familiar with the concepts of MapReduce. If not, be glad you have Google; here are two of my favorite introductions: (1) The Story of Sam and (2) Map Reduce: a really simple introduction.
Hello Hadoop World
We build this first environment on Amazon EC2. We begin with ami-e450a28d, an EBS-based 64-bit Ubuntu machine that's pretty bare-bones. To run Hadoop, we need at least an m1.large instance; a micro instance is just too painful.
- Make sure Python 2.6 is installed on the machine and is the default Python version. If you are unsure, invoke python from the shell; the interpreter banner reports the version (see the quick checks after this list).
- ami-e450a28d has no Java pre-installed on it. Michael's instructions are slightly out of date: you now have to update the /etc/apt/sources.list file as documented here.
- lzop needs to be installed: sudo apt-get install lzop.
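A quick way to sanity-check these prerequisites from the shell (the exact version strings will vary with the AMI):

    $ python -V       # expect Python 2.6.x
    $ java -version   # should report a JDK once the Java install has gone through
    $ lzop --version  # confirms lzop is on the PATH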
A parenthetical note: too many people have misused the Amazon environment to hammer public websites, and gutenberg.org now disallows access from Amazon EC2. You have to find another route to get the files onto your server. One effective way is to download them to your laptop and then FTP them over to the server.
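For example, assuming a downloaded ebook named pg20417.txt, a key pair your-key.pem, and the stock ubuntu login (all of these names are placeholders); scp is shown here, but FTP works just as well if you prefer it:

    # Run this on your laptop, not the EC2 instance; the login name depends on the AMI.
    $ scp -i your-key.pem pg20417.txt ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/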
Once the files are on the server, we copy them into HDFS, run the Hadoop job, and copy the results back to the local file system. We could do something more sophisticated, and will in later posts.
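With the Hadoop daemons running, that round trip looks roughly like this; the HDFS paths are placeholders, and the examples jar name varies from release to release:

    $ hadoop fs -mkdir gutenberg
    $ hadoop fs -put pg20417.txt gutenberg/
    # the output directory must not already exist
    $ hadoop jar hadoop-examples-*.jar wordcount gutenberg gutenberg-output
    # on older releases the output file may be named part-00000 instead
    $ hadoop fs -get gutenberg-output/part-r-00000 wordcounts.txt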
How does it work?
The source code of the WordCount program is pretty simple. Each mapper takes a line as input, breaks it into words, and emits a key/value pair of the word and 1. The framework groups these pairs by word between the map and reduce phases, so each reducer receives a word together with all of its counts; it sums them and emits a single key/value pair of the word and its total.
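For reference, here is a WordCount very close to the version that ships with the Hadoop examples (the exact source in your distribution may differ in small details):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: takes a line of text, splits it into words,
      // and emits (word, 1) for each word.
      public static class TokenizerMapper
           extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: receives (word, [1, 1, ...]) and emits (word, sum).
      public static class IntSumReducer
           extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer also works as a combiner, summing partial counts on the map side.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }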