Over the last few years, new technology has boosted the performance of databases far beyond most people’s expectations. The promises from database vendors get bolder almost every day, and every time an even faster database is announced, it is quickly followed by predictions that the ‘Enterprise Data Warehouse is dead’.
Who needs a Data Warehouse if data can be accessed from the source directly? Well, guess what: an Enterprise Data Warehouse (EDW) is a lot more than a ‘copy and paste’ of data to boost reporting performance. Thanks to developing technology, the EDW will change radically, but it will remain valid as a concept. So don’t switch off your EDW just yet, and don’t sack your Data Warehouse architects. Instead, get ready for change, or you will struggle with your business requirements, which are becoming more demanding at almost the same pace as database technology itself.
When I started my career, ‘normalisation’ was king. It meant that data models were designed to avoid duplication, so there was always just a ‘single image of the truth’. Data redundancy was seen as a necessary evil, tolerated only to meet performance requirements. During those years more and more processes were automated, data models grew increasingly complex, and volumes increased to maximum system capacity. ‘Normalisation’ could no longer keep up with requirements, and IT teams were in trouble.
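The trade-off can be sketched in a few lines. This is a toy illustration with hypothetical data, not a real schema: the same orders held once in a denormalised form, where the customer’s city is repeated on every row, and once in a normalised form, where it is stored exactly once.

```python
# Denormalised: the customer's city is duplicated on every order row.
denormalised = [
    {"order_id": 1, "customer": "Acme", "city": "London", "amount": 100},
    {"order_id": 2, "customer": "Acme", "city": "London", "amount": 250},
]

# Normalised: customer details live in one place -- a 'single image of the truth'.
customers = {"Acme": {"city": "London"}}
orders = [
    {"order_id": 1, "customer": "Acme", "amount": 100},
    {"order_id": 2, "customer": "Acme", "amount": 250},
]

# If Acme moves, the denormalised form needs every row updated, and a missed
# row silently creates two 'truths'; the normalised form needs one change.
customers["Acme"]["city"] = "Manchester"
cities = {customers[o["customer"]]["city"] for o in orders}
print(cities)  # {'Manchester'}
```

The price of the normalised form is the join at query time, which is exactly what stopped scaling as models grew more complex.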
Enter Ralph Kimball and Bill Inmon. They separated data storage for business transactions from data storage for analysis purposes and introduced new rules for modelling data for analysis (‘Online Analytical Processing’, or OLAP, versus ‘Online Transactional Processing’, or OLTP). Suddenly data redundancy was fine, as long as it was created in a controlled way, i.e. via automated ‘load processes’ feeding read-only (reporting) applications. The Enterprise Data Warehouse was born. (It is Bill Inmon who is generally credited with coining the term.)
Eventually, even smart analytical data models could not keep up with ever-growing volumes of data, and technology once again came to the rescue. First in the form of automatically generated subsets of data (these have different names depending on the technology you are using, but are often referred to as ‘micro cubes’ or ‘aggregates’). Again this meant data had to be copied, so more storage was required, but user queries could usually be answered from a specific subset of the data, so not all records had to be evaluated for every query. Another revolutionary performance boost came from ‘In-Memory’ technology, which meant a whole subset of data could be held in memory rather than on disk, often making queries 10 to 100 times faster.
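The idea behind aggregates can be shown in miniature. This is a minimal sketch with made-up fact rows, assuming a simple sum-by-month aggregate; real products build and maintain such structures automatically:

```python
from collections import defaultdict

# Hypothetical raw fact rows: (month, product, amount). In a real warehouse
# there would be millions of these.
sales = [
    ("2023-01", "widget", 10.0),
    ("2023-01", "gadget", 5.0),
    ("2023-02", "widget", 7.5),
]

# Build the aggregate once, at load time.
monthly_totals = defaultdict(float)
for month, _product, amount in sales:
    monthly_totals[month] += amount

# A query for January now reads one pre-computed value instead of
# scanning every fact row.
print(monthly_totals["2023-01"])  # 15.0
```

The copy costs storage, but turns a full scan into a single lookup, which is the whole bargain the article describes.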
About ten years ago, this In-Memory technology was first applied in Enterprise Data Warehouse solutions; today it enables us to run complete business applications ‘In-Memory’. And this is why people ask themselves: “If my business application is ‘In-Memory’ and I can run a report within (milli)seconds, why should I bother copying the data to an Enterprise Data Warehouse system?”
In other words, copying data is now seen as a costly relic of the past, which should perhaps be avoided. And there is an important point here: the true cost of data duplication is not just the storage capacity, but also the cost of managing the copy processes and the risk of errors creeping in as the data is copied.
So should we go back to normalisation and should we get rid of the Enterprise Data Warehouse?
Yes and no.
The answer to the first question does not automatically answer the second. That is because the Enterprise Data Warehouse does not only boost reporting performance; it is also the single repository for integrated data from a large number of business applications and other sources of information. Information from a single system is usually not very interesting, or at least not for long. The business wants to evaluate it in the context of other information: sales versus plan, year-on-year sales, consolidated sales from all business units, sales against competitors’ performance, and so on. To compare data from different sources, the data needs to be transformed so that you can compare ‘apples with apples’. This transforming, unifying and often cleansing of data happens in the Data Warehouse.

Until now, this has mostly been done when the data is loaded into the Enterprise Data Warehouse system. This is going to change. There will be less need to physically store the data in an EDW, so the transformation, unification and cleansing must take place ‘on the fly’ when reporting directly from the source systems. To support this, your Data Warehouse needs a virtual transformation layer.

Back in 2011 Gartner described this process and coined the term ‘Logical Data Warehouse’ (LDW), but I prefer to call it a ‘Virtual Data Warehouse’ (VDW). After all, the term ‘Logical’ might suggest something about its predecessor, and I would not want to call the current generation of Enterprise Data Warehouses ‘Illogical’.
(Gartner, ‘Does the 21st-Century “Big Data” Warehouse Mean the End of the Enterprise Data Warehouse?’, 2011)
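What ‘on the fly’ transformation means can be sketched as follows. The two ‘source systems’, the field names and the exchange rate here are all hypothetical; in a real virtual layer each fetch would be a live query against the source application rather than an in-memory list:

```python
# Hypothetical source systems with incompatible keys and currencies.
crm_sales = [{"cust": "ACME LTD", "revenue_gbp": 100.0}]
erp_sales = [{"customer_id": "acme", "revenue_eur": 230.0}]

GBP_PER_EUR = 0.85  # assumed rate, for illustration only

def unified_sales():
    """Transform, unify and cleanse 'on the fly' -- nothing is stored."""
    for row in crm_sales:
        # Cleanse the customer key so both systems speak the same language.
        yield {"customer": row["cust"].lower().replace(" ltd", ""),
               "revenue_gbp": row["revenue_gbp"]}
    for row in erp_sales:
        # Convert currency at query time to compare 'apples with apples'.
        yield {"customer": row["customer_id"],
               "revenue_gbp": row["revenue_eur"] * GBP_PER_EUR}

total = {}
for row in unified_sales():
    total[row["customer"]] = total.get(row["customer"], 0.0) + row["revenue_gbp"]
print(total)  # {'acme': 295.5}
```

The same harmonisation logic that an EDW load process would apply once, at load time, runs here every time the report is queried, which is why a well-designed virtual transformation layer becomes the critical asset.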
Get ready for change
What your Enterprise Data Warehouse team should be doing is making sure the EDW architecture is future-proof. Your EDW will not change into an LDW overnight, but the EDW architecture needs to be able to support the transition. Many EDWs are already partly ‘virtual’, but only a very small portion of queries is serviced from the ‘logical’ part of the Data Warehouse (in most cases probably less than one percent). Ten years from now, I expect most queries will be serviced from the virtual part of a Data Warehouse. If your EDW managers are not yet thinking about how to make that transition, your users will eventually pay the price: they will not be able to get the information they need when they need it.