The Hadoop Ecosystem
Posted by J Singh | Filed under Map Reduce, Hadoop, NoSQL
Here's a more detailed outline of my talk on March 12.
Introduction
- What Hadoop is, and what it's not
- Origins and History
- Hello Hadoop: how to get started
The Hadoop Bestiary
- Core: Hadoop Map Reduce and Hadoop Distributed File System (HDFS)
- Data Access: HBase, Pig and Hive
- Algorithms: Mahout
- Data Import: Flume, Sqoop and Nutch
The Hadoop Providers
- Apache
- Cloudera
- What to do if your data is in a database
The Hadoop Alternatives
- Amazon EMR
- Google App Engine
Big Data Executive Briefing
Posted by J Singh | Filed under Executive Briefing, Big Data, NoSQL, Map/Reduce
- What is Big Data and what problems does it solve? What opportunities does it present that weren't available before?
- These terms are often seen in articles about Big Data: NoSQL, Map Reduce, Analytics, Schema-less Databases. How do they relate to each other and how do they differ?
- What problems are you better off solving with traditional database solutions?
- The various types of NoSQL databases and what each is appropriate for:
  - Key-value stores (e.g., Riak, Redis, Voldemort, Tokyo Cabinet)
  - Document stores (e.g., MongoDB, CouchDB)
  - Wide column stores (e.g., Cassandra, Hypertable)
  - Graph databases (e.g., Neo4J)
- Analysis and Visualization
- What Map / Reduce is and the opportunities it presents for your business.
- The cost / performance trade-offs of running Map Reduce in the cloud (on Amazon EC2 or on Google) versus in-house.
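For readers who want a concrete picture of Map / Reduce before the briefing, here is a minimal sketch of the classic word-count example. It is plain, single-process Python, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big Hadoop"])
```

The point of the split is that each `map_phase` call looks at one document only and each `reduce_phase` call looks at one key only, which is what lets the work be spread over a cluster.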
Big data: Does size matter?
Posted by J Singh | Filed under big data, nosql, hadoop, map reduce, statistical analysis, numerical methods
Big data is about so many things:
- Size, of course, but you don't have to be Google-scale to need big data technologies. Heck, a few hundred gigabytes will suffice.
- Ad-hoc. Big Data platforms enable ad-hoc analytics on non-relational (i.e., unmodelled) data. This lets you uncover answers to questions you never thought to ask.
- Streaming. You cannot deliver true analytics of Big Data relying only on batch insights. You must deliver streaming and real-time analytics.
- Inconsistent. Air or water quality is measured in impurities-per-million. Perhaps we should have similar consistency metrics for data?
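One way to picture such a consistency metric is to count the records that fail a validity check and scale the result to a per-million rate. The sketch below is only an illustration of that idea: the function `defects_per_million`, the rule `has_valid_reading` and the sample records are all hypothetical, and what counts as "inconsistent" would depend entirely on your data.

```python
def defects_per_million(records, is_valid):
    """Return the number of invalid records per million records inspected."""
    records = list(records)
    if not records:
        return 0.0
    bad = sum(1 for record in records if not is_valid(record))
    return bad / len(records) * 1_000_000


def has_valid_reading(record):
    """Hypothetical rule: a reading must be a non-negative number."""
    value = record.get("reading")
    return isinstance(value, (int, float)) and value >= 0


# Made-up sample: two of the four records break the rule.
sample = [
    {"reading": 1.2},
    {"reading": None},
    {"reading": -3},
    {"reading": 0.4},
]
rate = defects_per_million(sample, has_valid_reading)
```

With this sample, half of the records fail the check, so `rate` comes out at 500,000 per million. A production data set would hopefully score far lower.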