High Availability specifics – your inevitable failure

Mathew Matt, Principal HANA Architect, Bluefin Solutions

Published Tuesday 18 October 2016

You’re not as protected as you think. You might be thinking “Wow, that’s a bold claim, Basis gorilla”, and yet I can say with near certainty that it’s true. We spend so much time, effort and money maintaining the High Availability (HA) / Disaster Recovery (DR) strategy that we tend to miss the oh-so-small pieces within the environment. Here, we’ll go over basic redundancy requirements and touch on HA for both the database and application tiers.

Hardware redundancy redundancy 

In an effort to reduce the time it takes for you to read this, we’re only going to touch on hardware redundancy ever so lightly. Just a quick snippet on a few of the hardware pieces that are easily done.

Storage

Newer Storage Area Network (SAN) technologies have a ton of redundancy built in. Older straight Logical Unit Number (LUN) partitioning has given way to the newer “virtual LUN” based partitioning and has also given us better performance in the meantime by essentially spreading all of our workload over all disks (of the same tier). However, make sure that you consider a couple of things. Are the filers at the top of the SAN chain redundant? Do I have multiple paths to each of the filers? That is, does my server have multiple SAN interface cards, each with a path to the individual head-ends of the SAN?

Network

By all accounts, this is similar to SAN in that the key is creating multiple paths, with multiple Network Interface Controllers (NICs), to multiple switches. Make sure that you consider both internal and external NICs (internal potentially between servers in the environment, and external to your commodity network). Do they each have multiple paths to redundant switching fabrics? Also, don’t assume that your virtual infrastructure is safe simply because it’s virtual. If I have a hypervisor server sitting there with a single 10Gb connection and that connection goes bad, no amount of virtual switching in the world will save me.
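The hypervisor example lends itself to a quick check: model the components as a graph, knock out one component at a time, and see whether the server can still reach the network. What follows is a minimal sketch in Python, not a real discovery tool. The component names (`nic0`, `switch_a`, `core`) and both topologies are made up for illustration, and the same idea applies to SAN paths.

```python
from collections import deque


def undirected(pairs):
    """Build an adjacency map from a list of (a, b) cable/link pairs."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links, start, goal, removed=None):
    """Breadth-first search from start to goal, pretending `removed` has failed."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in links.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, start, goal):
    """Every intermediate component whose loss alone cuts start off from goal."""
    candidates = set(links) - {start, goal}
    return sorted(c for c in candidates
                  if not reachable(links, start, goal, removed=c))


# Hypothetical hypervisor with one 10Gb NIC feeding two switches.
single_nic = undirected([
    ("hypervisor", "nic0"),
    ("nic0", "switch_a"), ("nic0", "switch_b"),
    ("switch_a", "core"), ("switch_b", "core"),
])

# Same hypervisor with a second NIC, each NIC cabled to its own switch.
dual_nic = undirected([
    ("hypervisor", "nic0"), ("hypervisor", "nic1"),
    ("nic0", "switch_a"), ("nic1", "switch_b"),
    ("switch_a", "core"), ("switch_b", "core"),
])

print(single_points_of_failure(single_nic, "hypervisor", "core"))  # ['nic0']
print(single_points_of_failure(dual_nic, "hypervisor", "core"))    # []
```

Two redundant switches do nothing for the first topology because everything still funnels through `nic0`. The check only covers single failures; two components dying together would need a wider search.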

Operating System

We build tons of redundancy in nearly every aspect of our architecture but what if the failure is actually in the Operating System (OS) (unresponsive, blue-screen-of-death, memory leaks, kernel dumps, etc.)? Is your OS smart enough to move the critical pieces to another server automatically?

Others

Other items to consider when planning your HA strategy include power (battery, generator, switching), rack cooling, management interfaces, and even fire suppression. Yes, this is all for us to consider. Even if you don’t actually manage your hardware, do you know just how redundant you are with these components?

The Database: Juggernaut of your SAP environment

Database Selection and Shameless Plug. We’re going to use HANA as our database platform of choice. The selection criterion is simple: Bluefin Solutions specializes in SAP HANA. It has performed numerous platform migrations and has a very distinct passion for the SAP HANA platform. It’s this passion, which you only occasionally see in ‘technical’ folks, that brought me to Bluefin Solutions as a HANA architect. Hats off to John Appleby, Lloyd Palfrey, and the rest of the team for igniting that spark in me, personally. End shameless plug.

You’ll notice that HA within the HANA database is not overly complicated but it’s not without its nuances. Focusing on a scale-up installation of HANA, we use a combination of database replication and Linux clustering (the clustering will work on both SLES and Red Hat, as they both use Pacemaker with HANA resource agents).

The bottom line is that we set up HANA replication, as well as the HANA resource agents within the Operating System. The agents’ sole purpose is to monitor replication and, when a host fails, perform a “takeover” on a secondary host. This creates a somewhat “hot/warm” setup for our highly available environment: if you use a synchronous type of replication (sync or syncmem), the memory of the secondary node stays “preloaded” so that it can come up in a matter of seconds, as opposed to minutes. In these cases, our recovery time is measured in seconds, and our data loss is zero due to the synchronous replication.
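The agent behaviour described above boils down to a small decision made on every monitoring cycle. The sketch below is a toy model in Python, not the logic of the real Pacemaker resource agents. The `ReplicationPair` class, the action names, and the choice to hold for an operator when the secondary is not in sync are all assumptions made for illustration. The replication mode names (sync, syncmem, async) are the ones HANA system replication uses.

```python
from dataclasses import dataclass

# The two synchronous modes named in the text above.
SYNCHRONOUS_MODES = {"sync", "syncmem"}


@dataclass
class ReplicationPair:
    """Hypothetical snapshot of what a cluster agent might observe."""
    primary_alive: bool
    secondary_alive: bool
    replication_mode: str     # "sync", "syncmem" or "async"
    secondary_in_sync: bool   # was replication healthy at the last check?


def decide(pair: ReplicationPair) -> str:
    """Return the action this toy agent takes for one monitoring cycle."""
    if pair.primary_alive:
        return "monitor"            # nothing to do, keep watching
    if not pair.secondary_alive:
        return "alert"              # no takeover target left
    if pair.replication_mode in SYNCHRONOUS_MODES and pair.secondary_in_sync:
        return "takeover"           # secondary acknowledged every commit
    # Assumption: a takeover here could lose data, so leave it to a human.
    return "hold_for_operator"


healthy = ReplicationPair(True, True, "syncmem", True)
failed_primary = ReplicationPair(False, True, "sync", True)
stale_secondary = ReplicationPair(False, True, "async", False)

for pair in (healthy, failed_primary, stale_secondary):
    print(pair.replication_mode, "->", decide(pair))
```

The third case is why the zero data loss claim depends on synchronous replication: with an asynchronous or out-of-sync secondary, a fast takeover and a lossless takeover are no longer the same thing.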

[Figure: Obligatory crude diagram showing movement of both cluster resources and HANA application.]

DVEBMGS and Sometimes Y 

If you’re a pinkie-up executive type, go to your nearest Basis person and ask them, “What is DVEBMGS?” If he looks at you with the empty stare of a deer in headlights, revel in the fact that you have a larger technical acumen than your Basis person. After that, it’s probably appropriate to cry a little because you have a larger technical acumen than your Basis person.  

DVEBMGS lists the types of work-processes present in any ABAP based SAP system. These are the core pieces of SAP that must be considered when planning a highly available landscape. When architecting said landscape, there should always be an answer to, “What if THIS process dies?” Below is a very quick definition of each and how they can be made highly available.

  • Dialog: The process that users “attach” to in order to display their screen. These are replicated across multiple application servers, so as long as users connect through a message server (see Message Server below), the loss of one application server does not lock them out.
     
  • Update: V=Update (who knew; it comes from the German “Verbuchung”). These guys have the tough job of actually performing the changes to the database. After a Dialog or Batch process performs its work and needs to update, create, or delete, it will push to the Enqueue (see below), and then to an Update process. Update processes can be replicated in each application server, similar to Dialog and Batch processes.
     
  • Enqueue: These processes control updates and locks within SAP’s own mechanisms. The Enqueue keeps a small lock table, held in memory on its host rather than in the database, in order to keep track of said updates and locks. This table can be replicated using an ERS (Enqueue Replication Service, naturally) instance.
     
  • Batch: Workhorses of nearly every SAP environment. Any time you schedule a batch / background job, these are the dudes. Unfettered by things like work-process time limits, and generally a little more open in terms of memory usage, these processes perform a lot of heavy lifting. Like Dialog processes, they get replicated in each application instance.
     
  • Message Server: The Message Server within the SAP environment simply accepts requests and distributes load throughout your environment. He can distribute GUI and/or Web traffic (though I have seen issues with attempting to use the MS as an HTTP load balancer) and essentially forwards traffic to an appropriate application server. This guy tends to be the most important, as he often ends up as a single point of failure. That is, unless you utilize some form of clustering to protect him. You can do this via Microsoft’s MSCS clustering or the native clustering found in most Linux distributions.
     
  • Gateway: The Gateway is an oft forgotten component within the SAP environment. He can be used for varying types of RFC connections and, by default, exists on each application server individually. While that is handy, you generally run into an issue where the RFCs are pointed at a specific application server, thus rendering that RFC inoperable if the app server dies. The way around this is quite simple: you can add a gateway process to the clustered Message Server (see above).
     
  • Spool: In short, spool processes are used to create output. That is, a Dialog or Batch process creates a spool request, and the Spool process converts that to something legible. Like Dialog and Batch processes, it’s generally wise to have a few of these (or at least one) in each application server.
     
  • Sometimes Y: While all of the above covers the core components residing in the SAP Application tier, we still have other pieces that we always need to be aware of. Web Dispatchers, Tax Calculators, and other external apps, which I like to call the “fuzzy bits”, need to be taken into consideration. I smell a new post on nothing but taking care of the fuzzy bits, but let me finish this one first!
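The “What if THIS process dies?” question can be turned into a crude checklist. The Python sketch below decodes letter-coded instance names such as DVEBMGS00 into the process types listed above and flags any type that appears on only one instance. The landscape and the instance name `DVBS02` are hypothetical, and counting letters is a deliberately naive stand-in: it knows nothing about ERS instances or clustering, and it does not handle names that are not letter-coded.

```python
from collections import Counter

# Letter codes as defined in the list above.
WORK_PROCESS_TYPES = {
    "D": "Dialog",
    "V": "Update",
    "E": "Enqueue",
    "B": "Batch",
    "M": "Message Server",
    "G": "Gateway",
    "S": "Spool",
}


def decode_instance_name(name):
    """Turn e.g. 'DVEBMGS00' into the list of process types it advertises."""
    letters = name.rstrip("0123456789")   # drop the instance number
    unknown = [ch for ch in letters if ch not in WORK_PROCESS_TYPES]
    if unknown:
        raise ValueError(f"unrecognised letters in {name!r}: {unknown}")
    return [WORK_PROCESS_TYPES[ch] for ch in letters]


def unprotected_types(instance_names):
    """Process types that run on fewer than two instances in the landscape."""
    counts = Counter(
        ptype
        for name in instance_names
        for ptype in decode_instance_name(name)
    )
    return sorted(p for p in WORK_PROCESS_TYPES.values() if counts[p] < 2)


# Hypothetical landscape: one central instance plus two application servers.
landscape = ["DVEBMGS00", "D01", "DVBS02"]

print(decode_instance_name("DVEBMGS00"))
print(unprotected_types(landscape))   # ['Enqueue', 'Gateway', 'Message Server']
```

In this made-up landscape the check lands on the Enqueue, Gateway and Message Server, which are the same pieces the list above protects with ERS and clustering rather than with extra application servers.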

Finally, it’s over

First, apologies for the longer, more technical blog post, but the moral of the story is simply to “tend to your garden”. Know everything that’s in your garden and the care and feeding for each (i.e. how to make them highly available), and you have a heart-warming baseline that can only be found in the movies. 
