Stop the Data Warehouse Creep

Have you experienced this?
  • Your data warehouse takes about 8 hours to load the cube. The cube is updated weekly; 8 hours per week is a small price to pay and everyone is happy.
  • It is missing a few data elements that another group of analysts really needs. You add those and now the cube takes 9 hours to load. No big deal.
  • Time passes, the business is doing well, the size of the data quadruples. It now takes 36 hours per week to load the cube.
  • Time passes, some of the data elements are not needed any more but it is too hard to take them out of the process — it continues to take 36 hours.
  • You add yet another group of analysts as users, plus a few more data elements; it now takes 44 hours per week!
  • You get the picture… the situation gets more and more precarious over time.
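The creep in the timeline above is easy to put into numbers. Here is a back-of-the-envelope sketch, assuming (purely for illustration) that load time scales roughly linearly with both data volume and the number of data elements; the scaling factors are hypothetical numbers chosen to mirror the story:

```python
# Simple linear model of cube load-time creep.
# Assumption (for illustration only): weekly load time scales linearly
# with data volume and with the number of data elements carried along.

def load_hours(base_hours, volume_factor, element_factor):
    """Weekly cube load time under a linear scaling assumption."""
    return base_hours * volume_factor * element_factor

# The original cube: 8 hours per week.
t0 = load_hours(8, volume_factor=1, element_factor=1)

# A few new data elements push it to 9 hours (9/8 element overhead).
t1 = load_hours(8, volume_factor=1, element_factor=9 / 8)

# Data volume quadruples; the unused elements never get removed.
t2 = load_hours(8, volume_factor=4, element_factor=9 / 8)

# Another group of analysts, a few more elements still.
t3 = load_hours(8, volume_factor=4, element_factor=11 / 8)

print(t0, t1, t2, t3)  # 8, 9, 36, 44 hours per week
```

Under even this crude model, the load time is dominated by multiplicative growth in data volume, which is why pruning a few elements barely helps once the volume has quadrupled.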

We hope the dam holds! 

Here's an idea from Read Write Web: 

One example … was a credit card company working to implement fraud detection functionality. A traditional SQL data warehouse is more than likely already in place, and it may work well enough but without enough granularity for ...

Sources of Noise in Big Data

Curt Monash published a trio of articles about dirty data last summer in which he cited use cases for storing dirty data even when the signal-to-noise ratio is quite low. He had this observation about the value of such data:

Intelligence work is one case where the occasional black swan might justify gilded cages for the whole aviary; the same might go for other forms of especially paranoid security.

Big Data can be dirty, and the collection process must pay attention to data quality. This is different from the data we are used to in a typical database, where the focus is on data integrity. The point is that even data with very low integrity can hold valuable nuggets of information.

Whatever happened to data integrity? The RDBMS people mastered the art of data integrity long ago and along come these NoSQL folks who just ignore all that learning. This data is junk!

Human input is often a source of noise in the data. Big data is used for analyzing Twitter feeds, blog posts, comment forums, chat sessions and the like, and that kind of text data can be ambiguous. It's context ...

Executive Briefing Promotion

We will be giving away one free ticket to the executive briefing. To be entered into the drawing for it, please let us know why you would like to attend and what you expect to get out of the briefing.

The winner will be drawn at random from the attendees and will be announced at the briefing. To be considered for the free ticket, please submit your response by February 26, 2012.

Click here to let us know.