Big Data Laboratory

So you want to bring your organization up to speed on Big Data. What next?

Your first step could be to set up a sandbox for the participants. We set up ours on Amazon EC2 because Amazon hosts a number of public data sets, as do InfoChimps and others. These data sets are convenient for initial experimentation.

It turns out that getting access to the data where you can touch it and feel it is the easy part. Christopher Miles takes an inspiring stab at the problem in his blog post "All Your HBase Are Belong to Clojure".

Our approach is a bit different, using Google App Engine's Map Reduce to let us focus on the task at hand. This approach doesn't work for all situations but can shave valuable time off the learning curve in some cases.

I had occasion to develop this idea further at a recent talk to the Google App Engine group. Slides from that talk:
Big Data Laboratory from J Singh

The Hadoop Ecosystem

Here's a more detailed outline of my talk on March 12. If you have a use case you'd like me to discuss, I'd love to hear about it and possibly incorporate it into the talk to make it more relevant to you. Join us for ... (see the end of this post).

If you came here looking for the presentation, here it is.


  1. What Hadoop is, and what it's not
  2. Origins and History
  3. Hello Hadoop, how to get started.

The Hadoop Bestiary

  1. Core: Hadoop Map Reduce and Hadoop Distributed File System (HDFS)
  2. Data Access: HBase, Pig and Hive
  3. Algorithms: Mahout
  4. Data Import: Flume, Sqoop and Nutch

The Hadoop Providers

  1. Apache
  2. Cloudera
  3. What to do if your data is in a database

The Hadoop Alternatives

  1. Amazon EMR
  2. Google App Engine
For those who weren't able to attend, here is the presentation:

Hands on with Hadoop

This post begins a series of exercises on Hadoop and its ecosystem. Here is a rough outline of the series:
  1. Hello Hadoop World, including loading data into and copying results out of HDFS.
  2. Hadoop Programming environments with Python, Pig and Hive. 
  3. Importing data into HDFS. 
  4. Importing data into HBase. 
  5. Other topics TBD.  
We assume that the reader is familiar with the concepts of Map/Reduce. If not, feel lucky you have Google. Here are two of my favorite introductions: (1) The Story of Sam and (2) Map Reduce: a really simple introduction.

Hello Hadoop World

We build this first environment on Amazon EC2. We begin with ami-e450a28d, an EBS-based 64-bit Ubuntu machine that's pretty bare-bones. To run Hadoop, we need at least an m1.large instance; using a micro instance is just too painful.

For the actual exercise, we will use the excellent tutorial by Michael Noll. Michael first wrote it in 2007 and has kept it up to date. The tutorial helps you install map/reduce and use it for computing the count of every word in a large text. Michael's post is pretty much right on. Things to keep in mind:
  • Make sure python 2.6 ...
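
For readers who want to see the shape of the word-count job before working through the tutorial, here is a minimal sketch in Python. It is not Michael's code: the function names mapper, reducer and word_count are our own, and the sort in word_count only stands in for the shuffle that Hadoop performs between the map and reduce phases. On a real cluster the two steps would typically be separate scripts that read lines from standard input.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce step: sum the counts for each word.

    Expects pairs already sorted by word, which is what the shuffle
    phase provides to a reducer.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort (standing in for the shuffle) and reduce locally."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Hypothetical sample text; any iterable of lines will do.
    sample = ["the quick brown fox", "the lazy dog"]
    for word, count in sorted(word_count(sample).items()):
        print("%s\t%d" % (word, count))
```

Running the sketch locally on a few lines is a cheap way to check the logic before paying for cluster time.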

Stop the Data Warehouse Creep

Have you experienced this?
  • Your data warehouse takes about 8 hours to load the cube. The cube is updated weekly; 8 hours per week is a small price to pay and everyone is happy.
  • It is missing a few data elements that another group of analysts really needs. You add those and now the cube takes 9 hours to load. No big deal.
  • Time passes, the business is doing well, the size of the data quadruples. It now takes 36 hours per week to load the cube.
  • Time passes, some of the data elements are not needed any more but it is too hard to take them out of the process — it continues to take 36 hours.
  • You add yet another group of analysts as users and a few more data elements; it now takes 44 hours per week!
  • You get the picture… the situation gets more and more precarious over time.

We hope the dam holds! 
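
To make the creep concrete, the figures in the story can be replayed in a few lines of Python. This is only a tally of the numbers given above: the step labels and function names are ours, and the final increase of 8 hours is simply the difference between the 36 and 44 hours quoted.

```python
HOURS_PER_WEEK = 7 * 24


def load_time_history():
    """Replay the weekly cube load time, one step of the story at a time."""
    steps = []
    hours = 8
    steps.append(("initial weekly load", hours))
    hours += 1  # a few extra data elements
    steps.append(("extra data elements", hours))
    hours *= 4  # the size of the data quadruples
    steps.append(("data volume quadruples", hours))
    # Unused elements are never removed, so nothing is saved.
    steps.append(("unused elements stay in", hours))
    hours += 8  # another group of analysts, a few more elements
    steps.append(("more analysts and elements", hours))
    return steps


def share_of_week(hours):
    """Fraction of the week spent loading the cube."""
    return hours / HOURS_PER_WEEK


if __name__ == "__main__":
    for label, hours in load_time_history():
        print("%-28s %3d hours  %5.1f%% of the week"
              % (label, hours, 100 * share_of_week(hours)))
```

At 44 hours, the load occupies more than a quarter of the 168 hours in a week.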

Here's an idea from Read Write Web: 

One example … was a credit card company working to implement fraud detection functionality. A traditional SQL data warehouse is more than likely already in place, and it may work well enough but without enough granularity for ...

Big data: Does size matter?

Big data is about so many things: 
  • Size, of course, but you don't have to be Google-scale to need big data technologies. Heck, a few hundred gigabytes will suffice. 
  • Ad-hoc. Big Data platforms enable ad-hoc analytics on non-relational (i.e. unmodelled) data. This lets you uncover answers to questions you never thought to ask. 
  • Streaming. You cannot deliver true analytics of Big Data relying only on batch insights. You must deliver streaming and real-time analytics. 
  • Inconsistent. Air or water quality is measured in impurities-per-million. Perhaps we should have similar consistency metrics for data? 
But the biggest difference is in the tools we use to analyze and present big data. Big data analysis involves a heavy dose of numerical analysis, statistical methods, algorithms for teasing signals from noise, and techniques that would be more familiar to a scientist than a database analyst. 
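
As one possible reading of the consistency metric floated in the "Inconsistent" bullet, here is a small Python sketch that reports inconsistent records per million. The function name, the sample readings and the validity rule are all hypothetical; a real data set would need its own definition of what counts as consistent.

```python
def inconsistencies_per_million(records, is_consistent):
    """Return the number of inconsistent records per million records.

    `is_consistent` is a caller-supplied predicate; what counts as
    consistent is an assumption that depends entirely on the data set.
    """
    total = 0
    bad = 0
    for record in records:
        total += 1
        if not is_consistent(record):
            bad += 1
    if total == 0:
        raise ValueError("no records to measure")
    return bad * 1000000 / total


if __name__ == "__main__":
    # Hypothetical sensor readings; a negative value is treated as invalid.
    readings = [10, -1, 20, 30]
    rate = inconsistencies_per_million(readings, lambda value: value >= 0)
    print("%.0f inconsistent records per million" % rate)
```

Because the function takes any iterable, the same measure can be computed in one pass over a file too large to hold in memory.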

Hands on Hadoop with Amazon EC2

A few months ago, I gave a talk at the Chelmsford Technology Skill Share Group. It was focused on the whys and wherefores of NoSQL and Map/Reduce. If you are interested in a copy of the presentation, please contact me.

Next week (9/14/11, 4:00 pm, Chelmsford Public Library, McCarthy Meeting Room. Directions.), I'll be giving a hands-on introduction to running Hadoop on Amazon EC2.

It will be as hands-on as the previous talk was conceptual. For the actual material, we will use the excellent tutorial by Michael Noll. Michael first wrote it in 2007 and has kept it up to date. The tutorial helps you install map/reduce and use it for computing the count of every word in a large text. We will use Ulysses by James Joyce as our sample text. Can we do this in 90 minutes? Yes, we can. But the goal is to get the most out of the journey.

To get the most out of the talk, you should be prepared to sign up for an Amazon account. They require a credit card but the credit card won't actually get charged because our usage will be a few ...