Stop the Data Warehouse Creep
Posted by J Singh | Filed under ETL process, Hadoop, Big Data, Data Warehouse
Have you experienced this?
- Your data warehouse takes about 8 hours to load the cube. The cube is updated weekly; 8 hours per week is a small price to pay and everyone is happy.
- It is missing a few data elements that another group of analysts really needs. You add those and now the cube takes 9 hours to load. No big deal.
- Time passes, the business is doing well, the size of the data quadruples. It now takes 36 hours per week to load the cube.
- Time passes, some of the data elements are not needed any more but it is too hard to take them out of the process — it continues to take 36 hours.
- You add yet another group of analysts as users, plus a few more data elements, and now it takes 44 hours per week!
- You get the picture… the situation gets more and more precarious over time.
We hope the dam holds!
Here's an idea from Read Write Web:
> One example … was a credit card company working to implement fraud detection functionality. A traditional SQL data warehouse is more than likely already in place, and it may work well enough but without enough granularity for an analysis system to accurately capture or isolate the sequence of events that may lead up to a fraud incident. So one smart strategy he suggested was for that same warehouse to begin storing a supplemental stream of raw transactional data, perhaps several years' worth, through Hadoop. That way, when a potential fraud incident is isolated using SQL, rapid analytics over billions of transactions may become available through Hadoop. From those analytics, a model for predicting future fraud events can be constructed that benefits both SQL and Hadoop engines.
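To make that concrete, here is a minimal sketch of the Hadoop side, written as a Hadoop Streaming mapper in Python. The record layout (tab-separated card_id, timestamp, merchant, amount) and the flagged card number are purely illustrative assumptions; in practice the SQL warehouse would supply the card isolated by the fraud query.

```python
#!/usr/bin/env python3
# Streaming mapper: pull every raw transaction for one flagged card out of
# years of transaction logs stored in HDFS.
# Assumed (hypothetical) record layout, one tab-separated line per transaction:
#   card_id <TAB> timestamp <TAB> merchant <TAB> amount
import sys

SUSPECT_CARD = "4111111111111111"  # placeholder: the card the SQL side flagged

def main():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed records rather than failing the job
        card_id, timestamp, merchant, amount = fields
        if card_id == SUSPECT_CARD:
            # Keying on timestamp means the shuffle sort hands the reducer
            # this card's history in chronological order.
            print(f"{timestamp}\t{merchant}\t{amount}")

if __name__ == "__main__":
    main()
```

Run through Hadoop Streaming with this script as the mapper and a single `cat` reducer, the job scans billions of raw transactions in parallel and hands back one card's chronologically sorted history; you can test the same pipeline locally with `cat transactions.tsv | python3 mapper.py | sort`.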
Putting it another way, Hadoop is a great way to stop the creep, maybe even roll it back a little, or at the very least get it under control.
- Keep the cube for the core needs of your analysts.
- Bring everything else into a suitable store: HDFS, HBase, some NoSQL database, whatever makes sense.
- Serve the new needs and ad-hoc queries through Hadoop coupled with a visualization engine (a sketch of such a query follows this list).
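As a hedged example of what one of those ad-hoc queries might look like, here is a map/reduce rollup of transaction volume per merchant per day, whose output a visualization engine could chart directly. It assumes the same illustrative record layout as the sketch above; the script and its `reduce` flag are inventions for this post, not a prescription.

```python
#!/usr/bin/env python3
# Ad-hoc rollup over raw transactions: total amount per merchant per day.
# Assumed record layout, as above: card_id, timestamp, merchant, amount.
import sys

def mapper():
    """Emit merchant <TAB> day <TAB> amount for every well-formed record."""
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue
        _card, timestamp, merchant, amount = fields
        day = timestamp[:10]  # assumes ISO-8601 timestamps, e.g. 2011-06-07T09:15:00
        print(f"{merchant}\t{day}\t{amount}")

def reducer():
    """Sum amounts for each (merchant, day) key in shuffle-sorted input."""
    current_key, total = None, 0.0
    for line in sys.stdin:
        merchant, day, amount = line.rstrip("\n").split("\t")
        key = (merchant, day)
        if key != current_key:
            if current_key is not None:
                print(f"{current_key[0]}\t{current_key[1]}\t{total:.2f}")
            current_key, total = key, 0.0
        total += float(amount)
    if current_key is not None:
        print(f"{current_key[0]}\t{current_key[1]}\t{total:.2f}")

if __name__ == "__main__":
    # One file for both phases; pass "reduce" to run the reducer,
    # or split into two scripts for hadoop-streaming's -mapper/-reducer flags.
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()
```

Tested locally, the whole pipeline is `cat transactions.tsv | python3 rollup.py | sort | python3 rollup.py reduce`; on the cluster, Hadoop's shuffle replaces the `sort` and the output lands in HDFS, ready for whatever visualization engine sits on top.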
The resulting Hadoop setup will be more agile and more responsive to business needs than yet another round of changes to the ETL process. And it will help you and the analysts refine the requirements.
They will love you for it.