SAP has recently launched the SAP Data Hub, a tool to evolve Enterprise Data Warehouses (EDW) to Big Data Warehouses (BDW), using SAP BW/4HANA. Jan van Ansem takes a closer look at this transformation and the role of the SAP Data Hub.
The trend in data warehousing is shifting from EDW to BDW. SAP is one of the players driving this trend, and rumour has it they will soon announce a new reference architecture for BDW. It will be interesting to see which tools will play a key role in implementing a BDW, based on SAP’s best practice.
For implementing an EDW, you only need one application: SAP Business Warehouse (BW/4HANA). In a BDW architecture, BW/4HANA will still be there but alongside other components. For example, a BDW architecture will not be complete without a ‘Hadoop-like’ distributed file system function.
SAP recognises that not all BDW functions can be provided by BW/4HANA, but the lack of cohesion between the various products in a BDW implementation could be problematic. The risk is that the move from EDW to BDW will result in an unmanageable diversification of tools and data load processes. This post looks at how to prepare for a move to a BDW by:
- Looking at BDW architecture.
- Giving examples of different tools which can be used for BDW implementation and how they work together.
- Providing tips for successful BDW implementation.
SAP’s Big Data Warehouse architecture
SAP has not yet published an official BDW architecture, but to get an idea of what SAP’s reference architecture might look like, see Thomas Zurek’s blog BW/4HANA and the SAP Data Hub. He introduces a three-layered model, with an 'Ingestion Layer', a 'Process & Refinement Layer' and a 'Relational Data Warehouse Layer'. SAP and Bluefin had been working on BDW architecture for one of Bluefin’s customers, and SAP had made a compelling case for adding an 'Integration and Harmonisation' layer. In the overview below, I used the building blocks from Thomas’ model, but added this extra layer.
Fig 1. Big Data Warehouse layered architecture with BW/4HANA
Boxes represent systems. Systems are easy to manage. Make sure they don’t get too full and they stay cool and the boxes will be happy.
Arrows represent some form of interfacing or transformation. I call them 'magic arrows' because in many architectural diagrams this flow of data seems to be happening automatically, completely effortlessly. In reality, arrows are an IT function's biggest headache. Interfaces take time, need to be managed on a day-to-day basis and are the cause of data discrepancies between different applications.
As you can see, a BDW architecture has many magic arrows, and the interfaces they represent can vary wildly. Below is my classification of magic arrows, starting with good magic and finishing with evil magic.
Magic arrow ranking
Overnight batch has been the norm since the introduction of the EDW in the 1980s. Up until now, the alternatives (virtual interface, or near real-time replication) have not been powerful enough to deal with the large volumes and high complexity of EDW transformation to replace batch altogether. Slowly but surely, overnight batch is being replaced with other interface types and there is an anticipation that the ‘zero batch’ EDW will be a reality soon.
The most controversial point in the BDW diagram above is the split between ‘BW/4HANA compliant sources’ loading into BW/4HANA and the ‘Legacy systems’ loading into the SAP HANA Harmonisation and Integration layers. The thing to remember is that although ‘HANA’ and ‘BW/4HANA’ are drawn as separate boxes, the data is completely transparent both ways. The data in BW/4HANA can seamlessly be used in HANA and vice versa. So this is one arrow we don’t have to worry about.
What tools do I need to implement the ‘magic arrows’ in my Big Data Warehouse (BDW)?
The answer to that question varies, depending on the nature of the sources, the complexity of the required transformations and the need for real-time. The following examples give a better understanding as to why it is no longer sufficient to just use BW/4HANA in a BDW scenario.
Example 1: real-time replication from non-SAP sources
Assume there are some business-critical requirements for real-time reporting in your organisation, for example monitoring stock levels in outlets. SAP has various replication tools: SAP Replication Server (SRS), SAP Landscape Transformation Replication Server (LT) and HANA Smart Data Integration (SDI). Ultimately, they all provide the same function: replicating data from a source to a target in real-time with some basic filter and field conversion rules. Which tool you choose depends on your landscape. If your sources and targets are mainly ABAP-based systems, LT is the obvious choice. If you are on HANA, then SDI is probably a good fit. Not all of these tools replicate directly into BW/4HANA. Instead, data is replicated to a staging layer (in HANA) and then either loaded into BW/4HANA or exposed using virtual models.
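To make 'basic filter and field conversion rules' concrete, here is a minimal sketch in plain Python. It does not use any SAP API. The field names, the `FIELD_MAP` mapping and the `apply_change` helper are all hypothetical, and a dictionary stands in for the staging table in HANA.

```python
# Hypothetical field conversion rule: source column name -> staging column name.
# Source columns that are not listed here are dropped during replication.
FIELD_MAP = {
    "store_no": "outlet",
    "item_no": "material",
    "qty_on_hand": "stock_qty",
}


def convert(row):
    """Rename the mapped fields and drop everything else."""
    return {FIELD_MAP[name]: value for name, value in row.items() if name in FIELD_MAP}


def apply_change(staging, change, outlets):
    """Apply one source change record to the staging table.

    `staging` is a dict keyed by (outlet, material), standing in for a table.
    `change` is a dict with an "op" (insert, update or delete) and a "row".
    `outlets` is the filter rule: only these outlets are replicated.
    """
    row = convert(change["row"])
    if row["outlet"] not in outlets:
        return  # filtered out, nothing is replicated
    key = (row["outlet"], row["material"])
    if change["op"] == "delete":
        staging.pop(key, None)
    else:
        # Inserts and updates both overwrite the current row for the key.
        staging[key] = row
```

Each change is applied as soon as it arrives, which is what separates this style of interface from an overnight batch: the staging layer always reflects the latest known stock level per outlet and material.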
Example 2: Complex batch transformations for high data volumes
After an acquisition, organisations often find themselves with multiple Enterprise Resource Planning (ERP) systems. That might be a temporary situation, but it is not unusual that the different ERP systems are kept in place to support various parts of the business. Transformation and integration at data warehouse level can be very complex. In such a scenario, SDI could play a pivotal role by integrating data from different systems into one data warehouse. SDI can connect to a wide variety of sources (all major database vendors as well as other data sources) and it is easy to design complex transformations. For example, SDI can extract from various sources and many tables, and load into different targets within one load process, something which is not possible in BW. When running SDI in batch mode, it is very capable of processing very large volumes of data quickly.
Example 3: virtual models and overnight batch in BW/4HANA
BW/4HANA can consume virtual models on the HANA platform directly and use further BW/4HANA virtual models to transform data or integrate data from various sources. For some use cases, data needs to be optimised for reporting. This can be done by using a data mart. You can define the logic for a transformation once in a virtual model and then use this virtual model as a source for persisting data. If required, you can combine the persisted data with the virtual data to achieve high performance: the bulk of your data is persisted in a reporting-optimised form and your virtual model pulls through any changes in real-time.
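The pattern of combining persisted and virtual data can be sketched in a few lines of plain Python. This is an illustration of the idea only, not of how BW/4HANA implements it. The `Sale` record, the `watermark` and both functions are hypothetical names. The transformation logic is defined once in `aggregate_by_outlet` and used both for the overnight persist and for the real-time delta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sale:
    doc_id: int   # ever-increasing document number, used as the load watermark
    outlet: str
    amount: int


def aggregate_by_outlet(rows):
    """The transformation logic, defined once (stands in for the virtual model)."""
    totals = {}
    for row in rows:
        totals[row.outlet] = totals.get(row.outlet, 0) + row.amount
    return totals


def combined_view(persisted_totals, source_rows, watermark):
    """Merge the persisted bulk with a virtual delta computed at query time.

    `persisted_totals` was produced by aggregate_by_outlet in the last batch run
    and covers every row with doc_id <= watermark. Only newer rows are
    transformed on the fly.
    """
    delta = aggregate_by_outlet(r for r in source_rows if r.doc_id > watermark)
    result = dict(persisted_totals)
    for outlet, amount in delta.items():
        result[outlet] = result.get(outlet, 0) + amount
    return result
```

The query only transforms the rows that arrived after the last persist, so the cost of the virtual part stays proportional to the size of the delta rather than to the full history.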
Enter: SAP Data Hub
The examples above describe physical load processes in Replication Server, SDI and BW/4HANA. That is three applications and three different interfaces to manage load processes. What SAP Data Hub promises is one ‘cockpit’ to manage all load processes in the BDW, reducing the risk of failures and making maintenance easier. SAP Data Hub also comes with features for lineage and impact analysis which, if implemented correctly, will reduce the effort of root cause analysis when there are data issues.
These examples are by no means a complete overview of all the available tools. It’s also worth adding that the best tools for one BDW implementation are not necessarily the best tools for the next BDW implementation. It is clear though that BW/4HANA is no longer the only answer for modern big data warehousing requirements.
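The lineage and impact analysis mentioned above can be pictured as a walk over the 'magic arrows'. The sketch below is plain Python and makes no claim about how SAP Data Hub works internally. The dataset names in `FLOWS` are invented for illustration, and `impact_of` and `lineage_of` are hypothetical helpers.

```python
from collections import deque

# Each pair is one 'magic arrow': (source dataset, target dataset).
# All names are made up for this example.
FLOWS = [
    ("erp_sales", "hana_staging"),
    ("hana_staging", "bw4_sales_model"),
    ("hadoop_clickstream", "hana_harmonised"),
    ("hana_harmonised", "bw4_sales_model"),
    ("bw4_sales_model", "sales_report"),
]


def reachable(start, edges):
    """Return every dataset that can be reached from `start` by following edges."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def impact_of(dataset):
    """Impact analysis: everything downstream that a change to `dataset` can affect."""
    return reachable(dataset, FLOWS)


def lineage_of(dataset):
    """Lineage: everything upstream that feeds `dataset`, found by reversing the arrows."""
    return reachable(dataset, [(target, source) for source, target in FLOWS])
```

When a report shows a wrong figure, `lineage_of` gives the list of places to check during root cause analysis, and `impact_of` tells you who to warn before changing a load process. Both only work if every arrow is registered in one place, which is the argument for a single cockpit.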
Tips for successful BDW implementation
Know the depth and breadth of your requirements, but start simple
Early on you should identify which tools you need to support your business requirements for big data warehousing. Start simple – your team needs time to get used to the different tools; it is better to have an in-depth knowledge of one new application instead of superficial knowledge of several. Develop the team's skills one application at a time.
Use the different tools to their strengths
Many tools have overlapping capabilities but most tools have their specific strengths. The choice of tools for implementation of a specific scenario should be based on what tool is best for the job and not personal preferences. Change Management procedures and testing requirements should be tool-agnostic and standardised across the organisation.
Mix the skillsets of your team members and your support organisation
A sure way to disaster is siloed thinking, based on different applications. Team members should be able to see the complete data journey, from source to the ultimate target. A support organisation will not be able to perform efficient root cause analysis if different people need to be involved for each different application used for data transformation.
The journey from EDW to BDW is an exciting one, with some pitfalls. It cannot be avoided though, as sooner or later your business users will demand integration of ‘Big Data’ (in whatever shape or form) with traditional enterprise data. SAP has recognised that BW/4HANA alone is no longer the one-stop shop for data warehousing and that some central orchestration is required to make sure all the different components of a BDW work together efficiently. To what degree SAP Data Hub can provide this function remains to be seen, but on paper it seems to be a promising step towards the ‘single point of management’ for a BDW.