Description

Provides an overview of the concept of Server Anti-Affinity and identifies how to use it to avoid downtime due to host hardware failure.

Content / Solution:

Background

Cloud Servers are virtual servers that run on physical servers called "hosts". In VMware data centers, the system uses VMware's Distributed Resource Scheduler (DRS) to dynamically balance load across a "cluster" of these physical hosts, "moving" the CPU and RAM associated with each Cloud Server between hosts in the cluster as needed to maximize performance and reliability. These moves use vSphere's vMotion feature, so the Cloud Server itself is unaffected by them.

However, if a physical host experiences a catastrophic failure, every Cloud Server currently running on that host fails along with it. In this case, VMware's High Availability (HA) service restarts the affected Cloud Servers on other hosts in the cluster. This process can take anywhere from a few minutes to an hour, as DRS must reallocate resources and restart every affected Cloud Server on a different physical host. The net result is that some amount of downtime will occur.

We take extensive steps to ensure redundancy throughout the infrastructure, with a minimum of N+1 redundancy at each level. Even so, although it remains rare, the failure of a single physical host is the most common event that affects Cloud Server availability. The Server Anti-Affinity function provides a tool to mitigate this risk.

Anti-Affinity

Server Anti-Affinity increases resiliency by implementing a rule that keeps two Cloud Servers from residing on the same physical host wherever possible. When you establish an Anti-Affinity relationship between two servers, VMware's DRS enforces a rule ensuring that, as it manages virtual servers across the physical cluster, the two Cloud Servers' CPU and RAM never run on the same physical host at the same time. Therefore, if a physical host fails, at most one of the two Cloud Servers is affected.
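To make the behavior concrete, here is a minimal Python sketch of a placement routine that honors anti-affinity pairs. This is an illustration of the concept only, not VMware's actual DRS implementation; all names are invented for the example.

```python
def place_servers(servers, hosts, anti_affinity_pairs):
    """Assign each server to a host, keeping anti-affinity pairs apart.

    Illustrative only -- real DRS also weighs load, capacity, and other
    rules when choosing a host.
    """
    placement = {}
    for server in servers:
        # Find this server's anti-affinity partners, if any.
        partners = {b if a == server else a
                    for a, b in anti_affinity_pairs if server in (a, b)}
        for host in hosts:
            # Skip any host already running one of this server's partners.
            if any(placement.get(p) == host for p in partners):
                continue
            placement[server] = host
            break
    return placement

placement = place_servers(
    ["web-1", "web-2"], ["host-a", "host-b"], [("web-1", "web-2")])
# web-1 and web-2 end up on different hosts
```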

As such, Anti-Affinity only applies when the Cloud Servers are deployed on the same Cluster. If you decide to move a Cloud Server which is a member of an Anti-Affinity rule to a different User-Manageable Cluster, the Anti-Affinity rule will be dropped as it is no longer required. See Introduction to User-Manageable Clusters and How to Move a Cloud Server between User-Manageable Clusters for more information.

How To Use It

To take advantage of Server Anti-Affinity, deploy two Cloud Servers in a redundant fashion so that if one fails, the other can take over whatever role your servers are performing. This usually involves creating a pool of load-balanced servers using VIP functionality or using a clustered OS or application implementation.

Once you've created such a redundant solution, you can implement Anti-Affinity on the two Cloud Servers. See the article below for details on how to establish a relationship:

Note: Anti-Affinity is available in MCP 2.0 Data Center locations only on servers deployed on "Advanced" Network Domains. 

If your redundant resource approach uses a pool of more than two servers, you can establish Anti-Affinity between different pairs of Cloud Servers within the pool. However, each Cloud Server can only be a member of one Anti-Affinity relationship. Also note that Anti-Affinity requires the servers to be located on the same Cloud Network (in MCP 1.0) or Cloud Network Domain (in MCP 2.0).
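Since each Cloud Server can belong to only one Anti-Affinity relationship, a larger pool must be split into disjoint pairs. The following sketch shows one simple way to organize such pairings; the approach and names are our own illustration, not a CloudControl API.

```python
def pair_pool(servers):
    """Split a pool into disjoint anti-affinity pairs.

    Each server appears in at most one pair; with an odd-sized pool,
    the last server is left unpaired.
    """
    return [tuple(servers[i:i + 2]) for i in range(0, len(servers) - 1, 2)]

pairs = pair_pool(["app-1", "app-2", "app-3", "app-4"])
# [('app-1', 'app-2'), ('app-3', 'app-4')]
```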

Limitations

Anti-Affinity increases the resiliency of your solution, but it cannot by itself ensure application availability, which also depends on your application and its design. Anti-Affinity may keep one of the two Cloud Servers from being affected by a physical hardware failure, but if the surviving Cloud Server cannot properly take over, an outage may still result. Be sure your overall design is built with such scenarios in mind.

In addition, a few low-likelihood infrastructure events can theoretically occur and cause undesired effects. The design of the Cloud infrastructure makes these scenarios unlikely, but the limitations are worth noting.

One scenario is that two or more physical hosts fail at the same time and the two Cloud Servers are "unlucky" enough to be located on those hosts when it happens.

Anti-Affinity rules also relate only to CPU and RAM resiliency. It is theoretically possible for the two Cloud Servers to share the same underlying storage LUN, in which case a problem with that LUN could affect both servers. This is unlikely because the storage design itself is highly redundant, with redundant paths between the physical hosts and the storage infrastructure, and with the LUNs themselves built on RAID that prevents a single disk failure from causing an outage.

In either case, increasing the size of your pool reduces the possibility of such a situation occurring.
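To give a feel for how unlikely the multi-host scenario is, here is a back-of-the-envelope calculation under a deliberately simplified model (our assumption, not a published figure): an anti-affinity pair is placed uniformly at random on two distinct hosts, and exactly two hosts fail simultaneously.

```python
from fractions import Fraction

def both_affected_probability(num_hosts):
    """P(an anti-affinity pair sits exactly on the two failed hosts),
    assuming uniform random placement on distinct hosts.

    There are n * (n - 1) ordered placements on distinct hosts, and only
    2 of those orderings put the pair exactly on the two failed hosts.
    """
    return Fraction(2, num_hosts * (num_hosts - 1))

print(both_affected_probability(20))  # 1/190, roughly 0.5%
```

Note how the probability falls quadratically as the cluster grows, which is consistent with the article's advice that a larger pool further reduces the risk.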