Big Data Laboratory
Posted by J Singh | Filed under HBase, Google App Engine, Hadoop
So you want to bring your organization up to speed on Big Data. What next?
Your first step could be to set up a sandbox for the participants. We set up ours on Amazon EC2 because Amazon hosts a number of public data sets, as do InfoChimps and others. These data sets are convenient for initial experimentation.
It turns out that getting hands-on access to the data is the easy part. Christopher Miles takes an inspiring stab at the problem in his blog post "All Your HBase Are Belong to Clojure".
Our approach is a bit different, using Google App Engine's Map Reduce to let us focus on the task at hand. This approach doesn't work for all situations but can shave valuable time off the learning curve in some cases.
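For readers new to the model, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It is not the Google App Engine Map Reduce API or Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are made up for illustration, and everything runs in a single process.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read from a data store.
    sample = ["big data lab", "big data sandbox"]
    print(run_job(sample))
```

In a hosted Map Reduce service the shuffle and the distribution of work are handled for you, which is why you typically write only the map and reduce functions.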
I had occasion to develop this idea further at a recent talk to the Google App Engine group. Slides from that talk:
The Hadoop Ecosystem
Posted by J Singh | Filed under Map Reduce, Hadoop, NoSQL
Here's a more detailed outline of my talk on March 12.
If you came here looking for the presentation, here it is.
Introduction
- What Hadoop is, and what it's not
- Origins and History
- Hello Hadoop: how to get started
The Hadoop Bestiary
- Core: Hadoop Map Reduce and Hadoop Distributed File System (HDFS)
- Data Access: HBase, Pig and Hive
- Algorithms: Mahout
- Data Import: Flume, Sqoop and Nutch
The Hadoop Providers
- Apache
- Cloudera
- What to do if your data is in a database
The Hadoop Alternatives
- Amazon EMR
- Google App Engine
For those who weren't able to attend, here is the presentation: