Sources of Noise in Big Data

Curt Monash published a trio of articles about dirty data last summer in which he cited use cases for storing dirty data even when the signal-to-noise ratio is pretty low. He offered this observation about the value of such data:

Intelligence work is one case where the occasional black swan might justify gilded cages for the whole aviary; the same might go for other forms of especially paranoid security.

Big Data can be dirty, and the collection process must pay attention to data quality. This is different from the data we are used to in a typical database, where all the focus is on data integrity. The point is that even data with very low integrity holds valuable nuggets of information.

Whatever happened to data integrity? The RDBMS people mastered the art of data integrity long ago and along come these NoSQL folks who just ignore all that learning. This data is junk!

Human input is often a source of noise in the data. Big Data is used for analyzing Twitter feeds, blog posts, comment forums, chat sessions and the like, and text data can be ambiguous. It's context-sensitive and sometimes coded. "Two sticks, a dash and a cake with a stick down" was reportedly Mohammed Atta's code for 9/11. An automated algorithm could be forgiven for missing the meaning of that message. Algorithms used for sentiment analysis sometimes get individual messages wrong, but quite often they can still produce statistically significant results in aggregate.
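To make the sentiment point concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the function names are hypothetical, the "classifier" is just a coin flip that is right 75% of the time, and errors are assumed to be symmetric with a known accuracy. Real sentiment models are messier than this.

```python
import random


def noisy_classifier(true_label, accuracy, rng):
    """Hypothetical classifier: returns the true label with probability
    `accuracy`, otherwise the opposite label."""
    return true_label if rng.random() < accuracy else not true_label


def estimate_positive_share(true_labels, accuracy, rng):
    """Classify every message, then estimate the share of positive messages."""
    predictions = [noisy_classifier(t, accuracy, rng) for t in true_labels]
    observed = sum(predictions) / len(predictions)
    # With symmetric errors: observed = p * a + (1 - p) * (1 - a).
    # Solving for the true share p gives the correction below.
    corrected = (observed - (1 - accuracy)) / (2 * accuracy - 1)
    return observed, corrected


rng = random.Random(42)  # fixed seed so the run is repeatable
# Assumed ground truth: roughly 60% of 100,000 messages are positive.
true_labels = [rng.random() < 0.60 for _ in range(100_000)]
true_share = sum(true_labels) / len(true_labels)

observed, corrected = estimate_positive_share(true_labels, accuracy=0.75, rng=rng)
print(f"true share:      {true_share:.3f}")
print(f"observed share:  {observed:.3f}")
print(f"corrected share: {corrected:.3f}")
```

One message in four is misclassified, yet the observed share still lands on the right side of 50%, and the corrected share comes out close to the true one. The individual errors wash out once the volume is large enough.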

In other words, Big Data is about problems we never used to solve with RDBMS technology and we shouldn't expect those solutions to apply here.

The data collection apparatus is often a source of noise as well. When thousands of browsers feed data into an analytics collector (Google Analytics or another), bugs and errors in the browsers or in the network in between corrupt some of the data along the way. Generally, the collector just throws those records away.
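A rough sketch of what "throwing away those records" might look like is below. It is not how Google Analytics or any particular collector works; the JSON payload format, the field names in `REQUIRED_FIELDS` and the helper functions are all made up for illustration.

```python
import json

# Hypothetical schema: the fields this imaginary collector insists on.
REQUIRED_FIELDS = {"visitor_id", "url", "timestamp"}


def parse_hit(raw):
    """Return the hit as a dict, or None if the record looks corrupted."""
    try:
        record = json.loads(raw)
    except ValueError:  # covers json.JSONDecodeError (truncated or garbled payload)
        return None
    if not isinstance(record, dict):
        return None
    if not REQUIRED_FIELDS.issubset(record):
        return None  # a required field went missing along the way
    if not isinstance(record["timestamp"], (int, float)):
        return None  # field is present but has the wrong type
    return record


def collect(raw_hits):
    """Keep the well-formed hits and count the ones that were discarded."""
    kept, dropped = [], 0
    for raw in raw_hits:
        record = parse_hit(raw)
        if record is None:
            dropped += 1
        else:
            kept.append(record)
    return kept, dropped


raw_hits = [
    '{"visitor_id": "a1", "url": "/home", "timestamp": 1700000000}',
    '{"visitor_id": "a2", "url": "/pricing", "timest',  # truncated in transit
    '{"visitor_id": "a3", "url": "/home"}',  # missing timestamp
    '{"visitor_id": "a4", "url": "/docs", "timestamp": "yesterday"}',  # wrong type
    '{"visitor_id": "a5", "url": "/docs", "timestamp": 1700000060}',
]

kept, dropped = collect(raw_hits)
print(f"kept {len(kept)} hits, dropped {dropped}")
```

Counting the dropped records instead of discarding them silently is a deliberate choice in this sketch: the drop rate is a cheap measure of how noisy the collection path is.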

Bottom line: noise is a feature of Big Data, not a bug! The power of your Big Data solution depends on how well it deals with the noise. 

