Big Data Laboratory

So you want to get going on having your organization come up to speed on Big Data. What next?

Your first step could be to set up a sandbox for the participants. We set up ours on Amazon EC2 because Amazon hosts a number of public data sets, as do InfoChimps and others. These data sets are convenient for initial experimentation.

It turns out that getting access to the data where you can touch it and feel it is the easy part. Christopher Miles takes an inspiring stab at the problem in his blog post All Your HBase Are Belong to Clojure

Our approach is a bit different, using Google App Engine's Map Reduce to let us focus on the task at hand. This approach doesn't work for all situations but can shave valuable time off the learning curve in some cases.

I had occasion to develop this idea further at a recent talk to the Google App Engine group. Slides from that talk:
Big Data Laboratory from J Singh

The Hadoop Ecosystem

Here's a more detailed outline of my talk on March 12. To make the talk more relevant to you, if you have a use case you'd like me to discuss, we'd love to hear about it, and possibly incorporate it into the talk. Join us for ... (see the end of this post).

If you came here looking for the presentation, here it is.


  1. What Hadoop is, and what it's not
  2. Origins and History
  3. Hello Hadoop, how to get started.

The Hadoop Bestiary

  1. Core: Hadoop Map Reduce and Hadoop Distributed File System (HDFS)
  2. Data Access: HBase, Pig and Hive
  3. Algorithms: Mahout
  4. Data Import: Flume, Sqoop and Nutch

The Hadoop Providers

  1. Apache
  2. Cloudera
  3. What to do if your data is in a database

The Hadoop Alternatives

  1. Amazon EMR
  2. Google App Engine
For those that weren' t able to attend, here is the presentation: