Big Data Laboratory
Posted by J Singh | Filed under HBase, Google App Engine, Hadoop
So you want to bring your organization up to speed on Big Data. What next?
Your first step could be to set up a sandbox for the participants. We set up ours on Amazon EC2 because Amazon hosts a number of public data sets, as do InfoChimps and others. These data sets are convenient for initial experimentation.
It turns out that getting hands-on access to the data is the easy part. Christopher Miles takes an inspiring stab at the problem in his blog post "All Your HBase Are Belong to Clojure."
Our approach is a bit different: we use Google App Engine's MapReduce so that we can focus on the task at hand. This approach doesn't work for all situations, but in some cases it can shave valuable time off the learning curve.
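For readers new to the programming model, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It runs entirely in memory and does not use the App Engine MapReduce library; the function names (map_words, reduce_counts, run_mapreduce) and the sample records are made up for illustration.

```python
from collections import defaultdict


def map_words(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for a single key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group values by key), and reduce in memory."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for rows of a public data set.
    sample = ["big data is big", "data sets are convenient"]
    print(run_mapreduce(sample, map_words, reduce_counts))
```

A hosted MapReduce service takes over the part that run_mapreduce fakes here (splitting the input, grouping by key, scheduling the work), which is why a sandbox user mostly needs to write only the mapper and reducer.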
I had occasion to develop this idea further at a recent talk to the Google App Engine group. Slides from that talk: