For over 18 months I’ve become increasingly drawn into the world of Big Data, and never more so than over the past year, when I’ve been hands-on in Big Data projects. I have come to the conclusion that ‘Big Data’ really is an overly simplistic term.
I recently presented at a UK & Ireland SAP User Group Analytics Symposium on how to deploy Big Data infrastructure. One of the key messages I wanted to convey was that Big Data by itself is meaningless - it has to be paired with either a Data Science system/process or an Analytics process.
Why are we querying data?
In my opinion, at the present time, data is queried for two main purposes: structured reporting or data science. Structured reporting feeds data into a business process, either to inform that process or to confirm a decision, e.g. stock levels or end of quarter sales figures. Data science tends to ask questions which are infrequent and often complex, requiring specific skills and/or domain knowledge. As a result, the platforms on which these systems reside will be deployed and managed differently, often with very different software.
The table below summarises some of the key attributes of both data science platforms and structured reporting platforms.
The commonality of data
What both of these types of use cases have in common is data: the data sets a company has access to in order to produce insights. These data sets reside in what is fashionably called a ‘Data Lake’. A data lake is a logical construct of locations where data sits and can be called upon directly (not replicated) to be combined in a query and give a result set. My colleagues and I also talk about a ‘Pond of Significance’ when discussing data lakes with customers.
Having spent the past year working closely with clients deploying these Big Data (structured reporting and data science) platforms, I have some lessons I would like to share with you. These focus on the compute infrastructure and storage which underpin a landscape used for data science:
- Use infrastructure platforms which allow you to define resources in code - this increases the consistency of your platform and allows easy reusability to shorten deployment time.
- Do not skimp on the hardware resources for your platform - compute is relatively cheap, whereas the loss of confidence in your solution in front of your stakeholders can be very expensive.
- Automate as many activities as possible to allow you and your team to undertake more ‘important’ activities.
- Scalability is critical because it can be hard to predict how much resource is required for a particular activity. For TechEd Barcelona, Bluefin were able to scale a Hadoop cluster from 20 nodes to 100 nodes easily by utilising software defined infrastructure.
- Storage is cheap but not all storage is equal - choose the correct performing storage for your application.
- Size your storage correctly - applications like Hadoop were built for consumer grade components and therefore used replication as a means of dealing with component failure. This means that a 300GB file on HDFS takes up 900GB by default.
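The storage sizing point comes down to simple arithmetic. As a minimal sketch, assuming the HDFS default replication factor of 3 (the `dfs.replication` setting), the raw disk needed for a given amount of data can be estimated as below. The function names are my own for illustration, and the calculation ignores other overheads such as temporary files or operating system space:

```python
def raw_storage_gb(logical_gb, replication_factor=3):
    """Raw disk consumed across the cluster for a given amount of logical data."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return logical_gb * replication_factor


def usable_capacity_gb(raw_gb, replication_factor=3):
    """Logical data that fits into a given amount of raw disk."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_gb / replication_factor


# The 300GB file from the lesson above, stored with three replicas.
print(raw_storage_gb(300))  # 900
```

Running the calculation in the other direction is just as useful: `usable_capacity_gb` shows how much data a cluster can really hold once replication is taken into account.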
These lessons may seem self-evident, but the methods used to provide these capabilities are not often used with SAP and so are rarely employed by those with an SAP background. This is a result of the typically siloed nature of SAP applications, which are not usually found in the more bleeding edge areas of IT.
Finally, there are questions about the data which resides within your data lake, fundamental questions which are technical, security related and legal in nature. For example:
- What are the data life cycle processes in your company - do they even exist?
- How long do you legally need to keep data, and for how much of that period is it useful?
- What datasets are you not allowed to combine in an analysis without some legal oversight? This is usually an easier question to answer than what you can combine.
- What processes are you going to use to curate the data in your data sets?
- How do you determine who is accessing your data? Recent products from the Apache Software Foundation like Ranger have made this much easier.
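The question about which datasets must not be combined lends itself to a simple check before a query is run. The sketch below is only an illustration of the idea: the dataset names and the `needs_legal_review` function are hypothetical, and a real list of restricted combinations would come from your legal and compliance teams rather than from code:

```python
# Hypothetical dataset names; a real list would come from legal review.
RESTRICTED_COMBINATIONS = {
    frozenset({"hr_records", "web_clickstream"}),
    frozenset({"customer_health", "marketing_contacts"}),
}


def needs_legal_review(datasets):
    """Return the restricted combinations contained in the requested datasets."""
    requested = set(datasets)
    return sorted(
        sorted(combo)
        for combo in RESTRICTED_COMBINATIONS
        if combo <= requested  # every dataset in the combination was requested
    )


# An analysis that joins HR data with clickstream data gets flagged.
print(needs_legal_review(["sales", "hr_records", "web_clickstream"]))
```

Keeping a list of what cannot be combined, rather than everything that can, reflects the point above that the former is usually the easier question to answer.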
From conversations I had at the recent UK & Ireland SAP User Group Analytics Symposium, it is clear that this is an agenda item for most SAP users and many of them are struggling with it. The data that they are bringing into their data lake is both structured and unstructured, collected from across their organisation and not just from SAP. To address this, SAP has adopted a very collaborative approach, allowing clients to integrate different products to create their required architecture.
It’s key that clients find a partner, whether internal or external, who understands the challenges not only of using these architectures but also of building and running them.