Clustering Big Data

Join Us for a discussion on Clustering Big Data

When: Thursday June 4, noon – 1:00 pm EST. 
Where: (Virtual Meeting) Contact Us for coordinates.

Description: Approximate Nearest Neighbor methods for clustering and indexing have been actively researched ever since the K-Means algorithm was published in 1975 (and coded in FORTRAN). A recent book lists about 300 variants and related topics.

The 50th Anniversary issue of Communications of the ACM in 2008 cited two pieces of "Breakthrough Research". One was MapReduce, the other was clustering based on Locality Sensitive Hashing (LSH). Locality Sensitive Hashing is designed for large data sets and alleviates many of the issues seen with k-means. Want to see if a body of code has remarkable similarity to a public GitHub repo? Want to see "similar" fragments of DNA that are common between several species? LSH will get you there faster than most other techniques.
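
To make the idea concrete, here is a minimal sketch of one common LSH scheme, MinHash signatures with banding, applied to near-duplicate text. It is an illustration only and is not the OpenLSH code; the function names, the 5-character shingles, and the split of 64 hashes into 16 bands are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of overlapping k-character substrings of text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, item):
    """Deterministic 64-bit hash of item, salted by an integer seed."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=64):
    """Keep the minimum of each seeded hash; similar sets agree on many positions."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def candidate_pairs(signatures, bands=16):
    """Split signatures into bands; documents sharing a whole band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact set similarity, used here only to check the candidates."""
    return len(a & b) / len(a | b)


# Made-up documents for illustration.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "locality sensitive hashing finds near duplicates quickly",
}
sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
signatures = {doc_id: minhash_signature(s) for doc_id, s in sets.items()}
for x, y in sorted(candidate_pairs(signatures)):
    print(x, y, round(jaccard(sets[x], sets[y]), 2))
```

The point of the banding step is that only documents landing in the same bucket are ever compared, so the expensive exact comparison runs on a short list of candidates rather than on every pair.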

The talk will demonstrate OpenLSH, an open source implementation of LSH we have been working on.

Speaker Bios: Dr. J Singh is a Principal at, a Cloud and Big Data consulting company. He is a frequent speaker on NoSQL, Hadoop--Map/Reduce and analytics of social media. He is the originator ...

Stop the Data Warehouse Creep

Have you experienced this?
  • Your data warehouse takes about 8 hours to load the cube. The cube is updated weekly; 8 hours per week is a small price to pay and everyone is happy.
  • It is missing a few data elements that another group of analysts really needs. You add those and now the cube takes 9 hours to load. No big deal.
  • Time passes, the business is doing well, the size of the data quadruples. It now takes 36 hours per week to load the cube.
  • Time passes, some of the data elements are not needed any more but it is too hard to take them out of the process — it continues to take 36 hours.
  • You add yet another group of analysts as users, a few more data elements, it now takes 44 hours per week!
  • You get the picture… the situation gets more and more precarious over time.

We hope the dam holds! 

Here's an idea from Read Write Web: 

One example … was a credit card company working to implement fraud detection functionality. A traditional SQL data warehouse is more than likely already in place, and it may work well enough but without enough granularity for ...

Sources of Noise in Big Data

Curt Monash published a trio of articles about dirty data last summer, in which he cited use cases for storing dirty data even when the signal-to-noise ratio is pretty low. He made this observation about the value of such data:

Intelligence work is one case where the occasional black swan might justify gilded cages for the whole aviary; the same might go for other forms of especially paranoid security.

Big Data can be dirty and the collection process must pay attention to data quality. This is different from the data we are used to in a typical database, with all its focus on data integrity. The point is that even data with very low integrity has valuable nuggets of information.

Whatever happened to data integrity? The RDBMS people mastered the art of data integrity long ago and along come these NoSQL folks who just ignore all that learning. This data is junk!

Human input is often a source of noise in the data. Big data is used for analyzing Twitter feeds, blog posts, comment forums, chat sessions and the like, and text data can be ambiguous. It's context ...

Big Data Executive Briefing

DataThinks' Executive Briefing on Big Data, NoSQL and Data Analytics is scheduled for March 1, 2012. Signup Information below.

These are some of the topics we intend to cover in that briefing.
  • What is Big Data and what problems does it solve? What opportunities does it present that weren't available before?
  • These terms are often seen in articles about Big Data: NoSQL, Map Reduce, Analytics, Schema-less Databases. How do they relate to each other and how do they differ?
  • What problems are you better off solving with traditional database solutions?
  • What are the various types of NoSQL databases, and what is each appropriate for?
    • Key-value stores (e.g., Riak, Redis, Voldemort, Tokyo Cabinet)
    • Document stores (e.g., MongoDB, CouchDB)
    • Wide column stores (e.g., Cassandra, Hypertable)
    • Graph Databases (e.g., Neo4J)
  • Analysis and Visualization
    • What Map / Reduce is and the opportunities it presents for your business.
    • The cost / performance trade-offs of running Map Reduce in the cloud, on Amazon EC2 or on Google, or in-house.
Any additional topics you would like to see covered? Please leave a comment.

The event is closed but corporate briefings based on this material are available. Please contact us to arrange.

Big data: Does size matter?

Big data is about so many things: 
  • Size, of course, but you don't have to be Google-scale to need big data technologies. Heck, a few hundred gigabytes will suffice. 
  • Ad-hoc. Big Data platforms enable ad-hoc analytics on non-relational (i.e., unmodelled) data. This lets you uncover answers to questions that you never thought to ask.
  • Streaming. You cannot deliver true analytics of Big Data relying only on batch insights. You must deliver streaming and real-time analytics. 
  • Inconsistent. Air or water quality is measured in impurities-per-million. Perhaps we should have similar consistency metrics for data? 
But the biggest difference is in the tools we use to analyze and present big data. Big data analysis involves a heavy dose of numerical analysis, statistical methods, algorithms for teasing signals from noise, and techniques that would be more familiar to a scientist than a database analyst. 
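
As a rough illustration of the "impurities-per-million" idea in the list above, the sketch below counts records that fail any of a set of validity checks and scales the count to one million records. The function name, the record fields, and the checks themselves are hypothetical and chosen only for the example; a real metric would depend on what "consistent" means for your data.

```python
def defects_per_million(records, checks):
    """Return the number of records failing any check, scaled to one million records."""
    records = list(records)
    if not records:
        return 0.0
    bad = sum(1 for r in records if not all(check(r) for check in checks))
    return bad / len(records) * 1_000_000


# Hypothetical checks for a sensor-reading record.
checks = [
    lambda r: r.get("sensor_id") is not None,
    lambda r: isinstance(r.get("ppm"), (int, float)) and r["ppm"] >= 0,
]

# Made-up readings for illustration.
readings = [
    {"sensor_id": "s1", "ppm": 12.5},
    {"sensor_id": "s2", "ppm": -3.0},   # negative reading: inconsistent
    {"sensor_id": None, "ppm": 7.1},    # missing sensor id: inconsistent
    {"sensor_id": "s3", "ppm": 0.4},
]

print(defects_per_million(readings, checks))  # 2 of 4 records fail: 500000.0
```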

Convergence of Analysis and Data

Moon in the Water

Moon-in-the-water happens when the moon shines in the water.
When there is no moon, there is no moon-in-the-water.
When there is no water, there is no moon-in-the-water.
Zen Metaphor

The Convergence

Analysis and Data are not two things, they are one thing. Big Data is about both together!

Not so long ago, we used to have programmers who programmed, and DBAs who managed the data. We forgot that both enterprises were about providing analysis to the users.

Big Data Solutions

Big Data Solutions include traditional (relational) databases. They also include the more recent open source NoSQL databases. They include the traditional programming methods, but they also include functional programming methods exemplified by Hadoop or Map/Reduce.
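
For readers who have not seen the functional style mentioned above, here is a minimal in-process sketch of the map and reduce steps, using a word count. It assumes plain Python rather than Hadoop; the function names and sample lines are made up for illustration, and a real Map/Reduce framework would run the same two steps in parallel across many machines.

```python
from collections import Counter
from functools import reduce


def map_phase(line):
    """Map step: turn one input line into partial word counts."""
    return Counter(line.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts. Merging is associative,
    so partial results can be combined in any grouping."""
    return left + right


# Made-up input lines for illustration.
lines = [
    "big data is about analysis",
    "analysis and data are one thing",
]

totals = reduce(reduce_phase, map(map_phase, lines), Counter())
print(totals["data"], totals["analysis"])  # prints: 2 2
```

Because each line is mapped independently and the merge is associative, the work can be split across workers without changing the result, which is what makes the style attractive for large data sets.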

The focus is on getting analysis results quickly, and being able to do so even when the problem statement changes on a daily basis.

Traditional (pre-Google) programming fits the above definition. But we think it would be misleading to call it Big Data if all you are doing is traditional programming.

What do you think?