When good datacenters go bad - building fault tolerant solutions

13 October 2016

Mathew Matt

Principal HANA Architect

Every technical resource worth their salt has had “that moment”. That moment when your pupils dilate and your muscles contract, which is typically followed by a string of profanities that would make the burliest of sailors cower in the corner in fear. This is one of a series of insights detailing considerations to build truly fault tolerant SAP solutions. Consider this your introduction to the very important and very expensive world of fault tolerance. 

“The production system is down, and this is bad.”

Despite all of your planning and good intentions, issues do arise beyond your control. Is your mission-critical system, the one system your revenue relies on, truly disaster tolerant? Have you thought through nearly all contingencies in the event of a failure? Are you fully aware of the mechanisms available to you within said system? Are you the guy who will have all fingers pointed at him when things go awry?

“But I had a backup…”

… said the technical lead as he was escorted out of the building, after turning in his badge and equipment. Easily avoided technical outages, especially those we have seen lately, tend to end badly for those who architected the solutions. There’s a saying, “the devil is in the detail”, and this certainly holds true for fault and disaster tolerance. Besides, backups are so 1980s. Let’s have a look at the basics.

High Availability (HA) vs. Disaster Recovery (DR)

It strikes me as funny (and by funny, I mean insanely frustrating) when I walk into a customer site and some of the core decision makers don’t fully understand HA vs. DR infrastructure. This has happened more times than I care to mention, so let’s start here.

High Availability is simply the ability to account for the simplest of failures within the hardware or software infrastructure, most often within the same data center. Each of these can be handled at the individual component level. Think down to the lowest levels when planning this, starting with the datacenter itself: true infrastructure items like power, HVAC, fire protection and network routing. Then we move deeper to the servers themselves: power supplies, networking, storage. Virtualization and hypervisors bring their own level of complexity as well (hypervisor tolerance, storage and virtual network multi-pathing, etc.). Only after we chase down our known single points of failure can we move to the database and application tier with any level of comfort. We will touch on some key SAP elements further on in this post (I have to get you to keep reading somehow).

Disaster recovery is a different beast entirely. I look at it this way: if your data center were to melt off the face of the earth, what would happen? Do you have a geographically separate location where your primary revenue-generating system can keep running?

Recovery Time Objective (RTO), Recovery Point Objective (RPO) and why you should care.

Put simply, RTO is how much time you can allow for bringing your primary, revenue-generating systems back (note that I keep stressing revenue-generating). How much downtime can you afford in an HA scenario (failing systems over to new hardware, let’s say)? This should typically be measured in seconds. How much time in a DR scenario? The one thing to note here is that costs climb steeply the shorter you require the RTO to be. This is due to the duplication of hardware, as well as the associated supporting costs (bandwidth, infrastructure).

Recovery Point Objective is how much data (generally measured in time units) you can “afford” to lose. In a disaster scenario, truly synchronous replication between sites tends to be limited by infrastructure costs, so as much as you’d like it to be, the answer is rarely “none”.
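
To make the two objectives concrete, here is a minimal Python sketch of how the results of a failover test might be compared against them. Everything in it is an assumption for illustration: the `RecoveryObjectives` class, the `meets_objectives` helper, and the target and measured values are made up and do not come from any SAP tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    """Hypothetical container for the two targets discussed above."""
    rto: timedelta  # longest tolerable time to bring the system back
    rpo: timedelta  # longest tolerable window of lost data


def meets_objectives(objectives, measured_recovery_time, measured_data_loss_window):
    """Return (rto_ok, rpo_ok) for a single failover test."""
    return (
        measured_recovery_time <= objectives.rto,
        measured_data_loss_window <= objectives.rpo,
    )


# Assumed DR targets: back within 4 hours, lose at most 15 minutes of data.
dr_targets = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))

# Assumed test result: recovery took 3 hours, asynchronous replication lagged 20 minutes.
rto_ok, rpo_ok = meets_objectives(
    dr_targets,
    measured_recovery_time=timedelta(hours=3),
    measured_data_loss_window=timedelta(minutes=20),
)
print(f"RTO met: {rto_ok}, RPO met: {rpo_ok}")
```

In this made-up test the system came back in time, but the replication lag was larger than the data loss the business said it could tolerate, so the RPO was missed even though the RTO was met. The two objectives have to be checked separately.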

Faulty logic

To be frank, the purpose of this introduction is to scare you… thoroughly. To 90% of the SAP (and other) customers out there in the world: you are not safe. If you’re not thinking about where your weakest links are, in other words your single points of failure, then ‘you are the weakest link’.

If you think that HA/DR is too expensive, consider this. How much revenue and productivity will you lose if you’re down for 10 minutes? 30 minutes? 3 hours? 3 days? Does this keep you up at night? It should, but fret not. We can help.
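
If you want to put rough numbers on those questions, a back-of-the-envelope calculation is enough to start the conversation. The Python sketch below assumes a simple linear model (lost revenue plus idle staff cost per hour of outage); the `downtime_cost` function and both hourly figures are hypothetical placeholders, so substitute your own.

```python
def downtime_cost(minutes_down, revenue_per_hour, staff_cost_per_hour=0.0):
    """Estimate outage cost as lost revenue plus idle staff cost, linear in time."""
    hours_down = minutes_down / 60
    return hours_down * (revenue_per_hour + staff_cost_per_hour)


# Hypothetical figures for illustration only; replace with your own.
REVENUE_PER_HOUR = 50_000
STAFF_COST_PER_HOUR = 10_000

# The outage durations asked about in the text above.
outages = [
    ("10 minutes", 10),
    ("30 minutes", 30),
    ("3 hours", 3 * 60),
    ("3 days", 3 * 24 * 60),
]

for label, minutes in outages:
    cost = downtime_cost(minutes, REVENUE_PER_HOUR, STAFF_COST_PER_HOUR)
    print(f"{label:>10}: {cost:,.0f}")
```

A linear model almost certainly understates a multi-day outage, since it ignores penalties, lost customers and reputational damage, but even this crude estimate gives you a figure to weigh against the price of HA/DR infrastructure.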


About the author

Mathew Matt

Principal HANA Architect

“May I have a desk in the data center?” is actually a question I enjoy asking customers, though I have yet to find one willing to take me up on the offer. 

While this may seem like a silly request, it sums up how I feel about infrastructure, and in particular SAP running on said infrastructure. It’s about one simple premise… passion. I thoroughly enjoy designing high performing, fault tolerant SAP solutions that emphasize efficiency and yes, even a fair amount of grace and artistic flair.   

“Artistic flair?” you say? The bottom line is that each and every organization has different needs and requirements when designing a solution, and I thrive on the challenge of creating that solution. Highly available, secure, high performing? Yes, please.

When it comes to the actual work performed on your enterprise system, who do you want to run with it? Someone passionate who constantly repeats the words “We can do better”, or… well… someone else.