The Hadoop Ecosystem


Here's a more detailed outline of my talk on March 12. If you have a use case you'd like me to discuss, I'd love to hear about it and possibly incorporate it into the talk. Join us for ... (see the end of this post).

If you came here looking for the presentation, here it is.

Introduction

  1. What Hadoop is, and what it's not
  2. Origins and History
  3. Hello Hadoop, how to get started.

The Hadoop Bestiary

  1. Core: Hadoop Map Reduce and Hadoop Distributed File System (HDFS)
  2. Data Access: HBase, Pig and Hive
  3. Algorithms: Mahout
  4. Data Import: Flume, Sqoop and Nutch

The Hadoop Providers

  1. Apache
  2. Cloudera
  3. What to do if your data is in a database

The Hadoop Alternatives

  1. Amazon EMR
  2. Google App Engine
For those who weren't able to attend, here is the presentation:

Big Data Executive Briefing


DataThinks' Executive Briefing on Big Data, NoSQL and Data Analytics is scheduled for March 1, 2012. Signup Information below.

These are some of the topics we intend to cover in that briefing.
  • What is Big Data and what problems does it solve? What opportunities does it present that weren't available before?
  • These terms are often seen in articles about Big Data: NoSQL, Map Reduce, Analytics, Schema-less Databases. How do they relate to each other and how do they differ?
  • What problems are you better off solving with traditional database solutions?
  • What are the various types of NoSQL databases, and what is each appropriate for?
    • Key-value stores (e.g., Riak, Redis, Voldemort, Tokyo Cabinet)
    • Document stores (e.g., MongoDB, CouchDB)
    • Wide column stores (e.g., Cassandra, Hypertable)
    • Graph databases (e.g., Neo4j)
  • Analysis and Visualization
    • What is Map Reduce, and what opportunities does it present for your business?
    • The cost / performance trade-offs of running Map Reduce in the cloud (on Amazon EC2 or on Google) versus in-house.
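For readers who have not seen Map Reduce before, here is a minimal single-process sketch of the idea, using the classic word-count example. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this post and are not Hadoop's API, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data about data"]
    print(word_count(docs))
```

Because each document is mapped independently and each key is reduced independently, both steps can be spread across as many machines as you have, which is where the cost / performance question comes in.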
Any additional topics you would like to see covered? Please leave a comment.

The event is closed but corporate briefings based on this material are available. Please contact us to arrange.

Big data: Does size matter?


Big data is about so many things: 
  • Size, of course, but you don't have to be Google-scale to need big data technologies. Heck, a few hundred gigabytes will suffice. 
  • Ad-hoc. Big Data platforms enable ad-hoc analytics on non-relational (i.e., unmodelled) data. This lets you uncover answers to questions you never thought to ask. 
  • Streaming. You cannot deliver true analytics of Big Data relying only on batch insights. You must deliver streaming and real-time analytics. 
  • Inconsistent. Air or water quality is measured in impurities-per-million. Perhaps we should have similar consistency metrics for data? 
But the biggest difference is in the tools we use to analyze and present big data. Big data analysis involves a heavy dose of numerical analysis, statistical methods, algorithms for teasing signals from noise, and techniques that would be more familiar to a scientist than a database analyst.
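
To make the "Inconsistent" point above a little more concrete, here is one hypothetical way such a metric could be computed: count the records that fail a validity check and scale the result to a rate per million. Both the function name `impurities_per_million` and the validity rule in the example are assumptions made for illustration, not an established standard.

```python
def impurities_per_million(records, is_valid):
    """Return the number of invalid records per million records.

    `is_valid` is a caller-supplied function that returns True for a
    clean record; what counts as "clean" depends on your data.
    """
    records = list(records)
    if not records:
        raise ValueError("no records to measure")
    bad = sum(1 for record in records if not is_valid(record))
    return bad * 1_000_000 / len(records)


if __name__ == "__main__":
    # Hypothetical sensor readings; a missing temperature counts as an impurity.
    readings = [{"temp": 21.5}, {"temp": None}, {"temp": 19.0}, {"temp": 22.1}]
    rate = impurities_per_million(readings, lambda r: r["temp"] is not None)
    print(rate)
```

Tracked over time, a number like this would let you say whether a data feed is getting cleaner or dirtier, much as you would for air or water.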