Big data: Does size matter?

Big data is about so many things: 
  • Size, of course, but you don't have to be Google-scale to need big data technologies. Heck, a few hundred gigabytes will suffice. 
  • Ad-hoc. Big Data platforms enable ad-hoc analytics on non-relational (i.e., unmodelled) data. This lets you uncover answers to questions that you never thought to ask. 
  • Streaming. You cannot deliver true analytics of Big Data relying only on batch insights. You must deliver streaming and real-time analytics. 
  • Inconsistent. Air or water quality is measured in impurities-per-million. Perhaps we should have similar consistency metrics for data? 
But the biggest difference is in the tools we use to analyze and present big data. Big data analysis involves a heavy dose of numerical analysis, statistical methods, algorithms for teasing signals from noise, and techniques that would be more familiar to a scientist than a database analyst. 

Mongo Boston 2011

Mongo Boston 2011 was held at the New England Research & Development (NERD) Center on October 3.

Here is our presentation from the conference. 

Convergence of Analysis and Data

Moon in the Water

Moon-in-the-water happens when the moon shines in the water.
When there is no moon, there is no moon-in-the-water.
When there is no water, there is no moon-in-the-water.
Zen Metaphor

The Convergence

Analysis and Data are not two things; they are one thing. Big Data is about both together!

Not so long ago, we had programmers who programmed and DBAs who managed the data. We forgot that both enterprises were about providing analysis to the users.

Big Data Solutions

Big Data solutions include traditional (relational) databases as well as the more recent open source NoSQL databases. They include traditional programming methods, but also the functional-style programming model of Map/Reduce, exemplified by Hadoop.

The focus is on getting analysis results quickly, and being able to do so even when the problem statement changes on a daily basis.

Traditional (pre-Google) programming fits the above definition. But we think it would be misleading to call it Big Data if all you are doing is traditional programming.

What do you think?

Hands on Hadoop with Amazon EC2

A few months ago, I gave a talk at the Chelmsford Technology Skill Share Group. It was focused on the whys and wherefores of NoSQL and Map/Reduce. If you are interested in a copy of the presentation, please contact me.

Next week (9/14/11, 4:00 pm, Chelmsford Public Library, McCarthy Meeting Room), I'll be giving a hands-on introduction to running Hadoop on Amazon EC2.

It will be as hands-on as the previous talk was conceptual. For the actual material, we will use the excellent tutorial by Michael Noll. Michael first wrote it in 2007 and has kept it up to date. The tutorial helps you install Hadoop and use Map/Reduce to count the occurrences of every word in a large text. We will use Ulysses by James Joyce as our sample text. Can we do this in 90 minutes? Yes, we can. But the goal is to get the most out of the journey.
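If you want a feel for the word-count exercise before the talk, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python, run in a single process. It is not the code from Michael's tutorial and it does not use Hadoop; the function names and the two sample lines are made up for illustration.

```python
import re
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def shuffle(pairs):
    """Shuffle step: group the emitted counts by word."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = chain.from_iterable(map_words(line) for line in lines)
    return reduce_counts(shuffle(pairs))


# Hypothetical sample input; the talk uses the full text of Ulysses instead.
sample = ["Yes I said yes", "I will Yes"]
counts = word_count(sample)
```

On a real cluster, the map and reduce steps run on many machines at once and the framework does the shuffle for you. That is the part we will see hands-on.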

To get the most out of the talk, you should be prepared to sign up for an Amazon account. Amazon requires a credit card, but the card won't actually be charged because our usage will be a few ...