Disaster recovery - when things go Mad Max

22 December 2016

Mathew Matt

Principal HANA Architect

Let’s be brutally honest and face it. At some point in time, things will go bad. It’s happened to many corporations, large and small. The question is simple: are you prepared for a worst-case scenario? If your primary data center were pummelled, destroyed, obliterated, incinerated, mashed, flattened, or otherwise rendered inoperable, could you recover? In short: can you still do business in the post-apocalyptic wasteland? Can you still sell widgets?

Admittedly, this is all pretty dramatic, but I had to try to hook you somehow, didn’t I? Still, it’s not totally inconceivable that you could have a true disaster, natural or otherwise, that renders your primary data center completely useless. With a little foresight and planning, you can sleep a little better at night.

We’ve been through the concepts of High Availability (HA) and Disaster Recovery (DR) in the first post of this series, and then we went over (at a high level) how to make your SAP Applications highly available. Now it’s time to plan for when things really go bad. That is, the datacenter and all of your servers within are, effectively, toast. 

… But this is hard 

Building a truly effective Disaster Recovery solution for any application can be an arduous process, at best. Building one with SAP can be brutal. Each of the components that we considered in the HA guide also has to be replicated here. Add to that the implications of name resolution (DNS) and of the network throughput that the replication pieces need. Alongside these, you have segmentation; have you ever had two production instances with the same name, up at the same time? It’s foul! Finally, there are all sorts of other traps that can cause a level of havoc unprecedented within your organization. Think carefully through everything that must take place when you first test your DR strategy.

DR on the DB… 10-4 

Disaster recovery on an SAP HANA Scale-Up database is pretty straightforward. Essentially, you would set up a server that’s geographically disparate (i.e. far away) from your primary data center. You would then use asynchronous HANA replication to create the database at that location. With the proper network connection and throughput, you can have a fully synchronized database in a matter of a few hours. On the DR server, the asynchronous copy is not a working database, as it only has the row store loaded into memory. What this means, however, is that I can put this secondary (or even tertiary) copy of the database onto, say, a second instance on my Quality Assurance system, and it will have a comparatively minimal footprint.
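To put a rough number on “a matter of a few hours”, here is a back-of-the-envelope sketch in Python. It is not an SAP tool; the function name and the efficiency factor (how much of the link the replication traffic actually gets) are assumptions of mine, and real sync times also depend on latency, compression, and whatever else shares the link.

```python
def initial_sync_hours(db_size_gb: float,
                       link_mbit_per_s: float,
                       efficiency: float = 0.7) -> float:
    """Estimate the hours needed to ship a full initial copy over a WAN link.

    db_size_gb       -- size of the data to replicate, in gigabytes (decimal)
    link_mbit_per_s  -- nominal link speed in megabits per second
    efficiency       -- assumed fraction of the link usable for replication
    """
    if db_size_gb < 0:
        raise ValueError("db_size_gb must not be negative")
    if link_mbit_per_s <= 0:
        raise ValueError("link_mbit_per_s must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")

    size_megabits = db_size_gb * 8 * 1000        # GB -> megabits
    effective_rate = link_mbit_per_s * efficiency  # usable megabits per second
    seconds = size_megabits / effective_rate
    return seconds / 3600
```

Under these assumptions, a 2 TB database over a 1 Gbit/s link at 70% efficiency works out to roughly 6.3 hours (`initial_sync_hours(2000, 1000, 0.7)`). If the answer for your environment comes out in days rather than hours, that is a good sign the network needs attention before the DR design does.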

Things to keep in mind: We cannot perform a proper takeover of the primary database without a little manual intervention. In a disaster, we generally try to avoid automatic failovers because we want to exert control over things like DNS, network, and the order in which pieces of our environments become functional.
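One way to picture that “controlled, manual” failover is as a runbook where every step waits for an operator to say go. The sketch below is purely illustrative: the step names, their order, and the placeholder actions are hypothetical, and in a real environment each action would call your own tooling.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_dr_runbook(steps: List[Step],
                   confirm: Callable[[str], bool]) -> List[str]:
    """Run the steps in order, asking for approval before each one.

    Stops at the first step that is not approved and returns the names
    of the steps that were actually executed.
    """
    completed: List[str] = []
    for name, action in steps:
        if not confirm(name):
            break  # a human said no: stop here, do not skip ahead
        action()
        completed.append(name)
    return completed


def example_steps(log: List[str]) -> List[Step]:
    """Hypothetical step order; the actions only record that they ran."""
    return [
        ("Confirm the primary site is really down",
         lambda: log.append("primary-checked")),
        ("Take over the HANA database at the DR site",
         lambda: log.append("db-takeover")),
        ("Repoint DNS to the DR site",
         lambda: log.append("dns-switched")),
        ("Start the application servers",
         lambda: log.append("apps-started")),
    ]
```

For an interactive run, `confirm` could be something like `lambda name: input(f"{name}? [y/N] ").lower() == "y"`. The point of the design is simply that nothing happens automatically, and nothing happens out of order.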

Clustering Inside Linux: At this point, neither SUSE Linux (SLES) nor Red Hat (RHEL) supports a clustered replication chain (i.e. HA Node 1 -> HA Node 2 -> DR Node). While it can be set up, and can work, we lose a little of the automation that the cluster resource agents provide. The best advice that I can give anyone here is test, test, and when you think you’re done, test some more. Practice failovers from nearly any scenario, and practice recreating the chain after those failovers occur. You’ll thank me for it later.

Application tier tumbles

The application tier introduces its own set of complexities. Remember DVEBMGS from the HA piece in the series? Each of those components must be duplicated somehow in your environment to work properly in the case of a disaster. File system replication can (and even should) be used to accommodate this, but be very careful about the solution you pick. Most virtual machine (and even cloud) providers have solutions for this very scenario.

Fuzzy bits 

The “fuzzy bits” of your SAP solution are the flotsam and jetsam that generally sit around your core instance, and tend to get very little attention. That is, they don’t get attention until they go down. These can include anything from your front-end providers (Portals, UI, etc.) to mission-critical tax calculation applications that help you avoid the rubber glove treatment from the IRS. Be incredibly mindful of these applications and ensure that they’re also accounted for in your disaster recovery plan. Needless to say, we could do an entire series on fuzzy bits and only scratch the surface.

‘I have a (nerdy) dream’ 

At some point in the future, I think we’ll end up somewhere in between HA and DR. A dream of mine is to have a low-latency, high-throughput connection that is fast enough to allow a geo-clustered SAP application (think message server with failover to a secondary data center, with application servers performing as if they were in the same rack). On top of that, I’d want readable HANA replicas that replicate synchronously (HANA 2.0 introduced the concept of a “readable” replica). That is my dream, where we essentially have a geographically disparate HA cluster. Frankly, I’m pretty sure we may not be that far off from it, but for now, it’s just a dream.

And… we’re done 

While some of this may seem trivial, and even somewhat basic, the nuts and bolts of disaster recovery are anything but. With a lot of thoughtful design and some very well-placed testing, though, you too can have a near-bulletproof SAP environment.


About the author

Mathew Matt

Principal HANA Architect

“May I have a desk in the data center?” is actually a question I enjoy asking customers, though I have yet to find one willing to take me up on the offer. 

While this may seem like a silly request, it sums up how I feel about infrastructure, and in particular SAP running on said infrastructure. It’s about one simple premise… passion. I thoroughly enjoy designing high performing, fault tolerant SAP solutions that emphasize efficiency and yes, even a fair amount of grace and artistic flair.   

“Artistic flair?” you say? The bottom line is that each and every organization has different needs and requirements when designing a solution, and I thrive on the challenge of creating that solution. Highly available, secure, high performing? Yes, please.

When it comes to the actual work performed on your enterprise system, who do you want to run with it? Someone passionate who constantly repeats the words “We can do better”, or… well… someone else.