1. Create an HA Cluster in vSphere 7
In this video, I’ll demonstrate how to configure High Availability in vSphere 7. One important thing to note right up front is that we’re creating an HA cluster. This is not vCenter High Availability; that’s a completely different feature. vSphere High Availability provides high availability for all of your virtual machines; it’s not specifically aimed at vCenter itself. So we’re going to create an HA cluster, and the first step is to actually create the cluster. I’m going to right-click the folder my hosts are stored in, choose New Cluster, and call it Rick HA Cluster. For the moment, I’m not going to enable HA, vSAN, or DRS; I’m just going to hit OK. There’s my cluster, and I’m going to drag these two ESXi hosts into it.
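By the way, if you’d rather script these steps than click through the UI, here’s a rough pyVmomi sketch of the same thing. The vCenter address, credentials, datacenter name, and cluster name are hypothetical placeholders, and it assumes the two hosts are already standalone objects in that datacenter:

```python
# Hypothetical pyVmomi sketch: create a cluster and move standalone hosts into it.
# vCenter address, credentials, datacenter name, and cluster name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; use valid certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
content = si.RetrieveContent()

# Find the datacenter and create an empty cluster (no HA/DRS/vSAN yet).
dc = next(d for d in content.rootFolder.childEntity if d.name == "Lab-DC")
cluster = dc.hostFolder.CreateClusterEx(name="Rick HA Cluster",
                                        spec=vim.cluster.ConfigSpecEx())

# The drag-and-drop equivalent: move the standalone hosts into the new cluster.
standalone = [cr for cr in dc.hostFolder.childEntity
              if isinstance(cr, vim.ComputeResource)
              and not isinstance(cr, vim.ClusterComputeResource)]
cluster.MoveInto_Task(host=[cr.host[0] for cr in standalone])

Disconnect(si)
```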
So that’s what a cluster really is: a container object for multiple ESXi hosts. Now I’m going to click on vSphere Availability under Configure, and we can see that vSphere HA is currently turned off. Let’s fix that. I’ll click Edit and enable vSphere HA with this little slider. The first setting that pops up is whether I want to enable host monitoring; in other words, do I want to monitor my ESXi hosts for things like failures or host isolation? I definitely want that turned on. Now I’m going to walk through the settings I can configure. First off, what do I want to do if one of the ESXi hosts in my cluster fails? I can choose Disabled, meaning HA won’t respond to host failures at all, or I can choose to restart virtual machines. And that’s how vSphere High Availability works.
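Continuing the hypothetical sketch above, enabling vSphere HA with host monitoring looks roughly like this through the API; `cluster` is the cluster object created in the earlier snippet:

```python
# Hypothetical continuation: enable vSphere HA with host monitoring on the cluster.
from pyVmomi import vim

das = vim.cluster.DasConfigInfo()
das.enabled = True               # turn on vSphere HA
das.hostMonitoring = "enabled"   # watch hosts for failures and isolation

spec = vim.cluster.ConfigSpecEx(dasConfig=das)
cluster.ReconfigureComputeResource_Task(spec, True)  # True = modify existing config
```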
Understand that if an ESXi host fails, HA does involve downtime: the virtual machines running on that host will all go down when it fails, but they will then be restarted on other ESXi hosts in the cluster. So that’s what I want: in the event of a host failure, I want my virtual machines to be restarted on other ESXi hosts. Next is the default virtual machine restart priority. I can go with Lowest, Low, Medium, High, or Highest, which basically says, in the event of a failure, which virtual machines should restart first? By default, all virtual machines are configured with a restart priority of Medium, so they’re all basically equal.
Now, I may eventually change that default restart priority on certain specific VMs, but for now, any new virtual machine that gets created will have a default restart priority of Medium. So let’s assume some of our VMs are set to Medium, some to High, and some to Low. The VMs with the High priority are going to boot first, and then the next group will start when the VM dependency restart condition is met. So what is that restart condition? Is it when resources are allocated for all of the VMs in the higher priority group? Is it when the higher-priority VMs are actually powered on? Or is it when the higher-priority VMs are actually sending VMware Tools guest heartbeats? That’s the dependency restart condition that has to be satisfied before HA moves on to the next restart priority. So HA starts with the highest priority group and moves on once, say, guest heartbeats are detected, or maybe just once those VMs are powered on. I can configure that dependency restart condition here, and if I want, I can even configure an additional delay before moving on to the lower-priority VM restarts. So those are the host failure responses. How about host isolation responses? If I have an ESXi host that appears to be functional, one that still has access to storage and is still powered on but is isolated from the network, what should happen? Should the virtual machines on that host be left alone because the host is still running? Or should they be powered off and restarted on other ESXi hosts? We may want that, because if the host is isolated, those VMs may not be able to communicate with the network.
Or do I want to shut those VMs down more gracefully on the isolated host and restart them on other ESXi hosts? So I can choose the response for a host isolation situation here. What about a permanent device loss (PDL)? That’s a storage connectivity failure where a physical storage device has actually failed and sent a SCSI sense code saying the device is down. In that case, what do we want to happen to the virtual machines on that host? Should they continue to run even though the host has lost access to that storage? Should they remain where they are while HA issues an event? Or should the VMs on that host be powered off and rebooted on other ESXi hosts? These are decisions that you really have to make for your particular environment.
The answer as to what makes the most sense here depends on what you think is happening in your environment under these conditions. Very similar to a permanent device loss is an all paths down (APD) condition. Again, this is a storage connectivity failure, one where the ESXi host is unable to communicate with the storage system. And in that case, we have similar options. Do we want to do nothing? Do we simply want to generate events? Do we power off VMs and reboot them conservatively, only if they can be restarted on another host? Or do we power off VMs aggressively, even if HA cannot detect the resources on other hosts? So those are our options with an all paths down condition.
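To tie these per-VM cluster defaults together, here’s a hedged sketch of how restart priority, the isolation response, and the PDL/APD responses map onto the API. The specific values are illustrative placeholders, not recommendations, and `cluster` is still the object from the first sketch:

```python
# Hypothetical defaults for the cluster's HA VM settings; values are illustrative.
from pyVmomi import vim

vm_defaults = vim.cluster.DasVmSettings()
vm_defaults.restartPriority = "medium"     # default restart priority for new VMs
vm_defaults.restartPriorityTimeout = 120   # seconds to wait on the restart condition
                                           # before HA moves to the next priority group
vm_defaults.isolationResponse = "shutdown" # gracefully shut down VMs on an isolated host

# PDL/APD responses live in the VM component protection settings.
vcp = vim.cluster.VmComponentProtectionSettings()
vcp.vmStorageProtectionForPDL = "restartAggressive"    # power off and restart on PDL
vcp.vmStorageProtectionForAPD = "restartConservative"  # restart only if another host
                                                       # has resources available
vm_defaults.vmComponentProtectionSettings = vcp

das = vim.cluster.DasConfigInfo(enabled=True, hostMonitoring="enabled",
                                defaultVmSettings=vm_defaults)
das.vmComponentProtecting = "enabled"  # component protection must also be switched on

cluster.ReconfigureComputeResource_Task(vim.cluster.ConfigSpecEx(dasConfig=das), True)
```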
And then finally, virtual machine monitoring. Do we want to enable monitoring of individual virtual machines? For example, if a VM stops sending VMware Tools heartbeats, what should happen to it? Should nothing happen? Should the virtual machine be reset on the same ESXi host if those heartbeats stop coming? Or do we want to not only monitor the virtual machine but also turn on application heartbeats, so we can monitor applications and potentially reset virtual machines based on that? So you can choose whether or not to enable individual virtual machine monitoring within your HA cluster. And those are some of the basic configuration options for an HA cluster.
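As one last hypothetical fragment for this lesson, VM and application monitoring sit on the same HA config object; the sensitivity numbers below are placeholders, not recommendations:

```python
# Hypothetical sketch: turn on VM monitoring (VMware Tools heartbeats).
from pyVmomi import vim

das = vim.cluster.DasConfigInfo(enabled=True)
das.vmMonitoring = "vmAndAppMonitoring"  # or "vmMonitoringOnly" / "vmMonitoringDisabled"

tools = vim.cluster.VmToolsMonitoringSettings()
tools.enabled = True
tools.failureInterval = 30     # seconds without heartbeats before HA reacts
tools.maxFailures = 3          # resets allowed within the failure window
tools.maxFailureWindow = 3600  # window (seconds) for counting those resets

das.defaultVmSettings = vim.cluster.DasVmSettings(vmToolsMonitoringSettings=tools)
cluster.ReconfigureComputeResource_Task(vim.cluster.ConfigSpecEx(dasConfig=das), True)
```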
2. Configure Admission Control
In this video I’ll demonstrate how to configure admission control on an HA cluster in vSphere 7. Here you can see I’m logged into the vSphere Client. I’m going to go to Hosts and Clusters, where I have a cluster I’ve created, and at the moment High Availability is not configured on it. So I’m going to very quickly edit the cluster and accept all of the default configuration options; we went through those in the last video. Now let’s focus on admission control. The purpose of admission control is to make sure there are enough resources available when a host within my cluster fails. Let’s say I have four hosts in my cluster. If one of those hosts fails, I have to have enough remaining capacity on the surviving three hosts to power on all of the VMs that were running on the failed host.
So the objective here is to reserve some of those resources to ensure that, yes, if a host fails, we have enough capacity to tolerate that failure. Let’s start setting our options. The first option is how many host failures I want this cluster to tolerate. What we’re doing here is configuring admission control with the cluster resource percentage method: it reserves a percentage of all of the CPU and memory resources in the cluster to provide recovery from a host failure. We’re setting aside a certain amount of CPU and memory, reserving those resources for failure, and the number of host failures I choose here determines how much needs to be reserved.
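In API terms, this maps onto the failover-resources admission control policy. Here’s a hedged sketch, with placeholder numbers, that also includes the performance degradation threshold we’ll discuss in a moment:

```python
# Hypothetical sketch: cluster resource percentage admission control.
from pyVmomi import vim

policy = vim.cluster.FailoverResourcesAdmissionControlPolicy()
policy.failoverLevel = 1              # host failures the cluster should tolerate
policy.autoComputePercentages = True  # let HA derive the reserved percentages
# To override the calculated values instead:
# policy.autoComputePercentages = False
# policy.cpuFailoverResourcesPercent = 33
# policy.memoryFailoverResourcesPercent = 33
policy.resourceReductionToToleratePercent = 35  # allowed performance degradation

das = vim.cluster.DasConfigInfo(enabled=True, admissionControlEnabled=True,
                                admissionControlPolicy=policy)
cluster.ReconfigureComputeResource_Task(vim.cluster.ConfigSpecEx(dasConfig=das), True)
```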
Now, I only have three hosts in my cluster, so one is a good number here. But if I had, say, 20 hosts in my cluster, maybe I’d make this number two or three, and the cluster would reserve the corresponding percentage of resources to accomplish that goal. Or I could override what HA calculates here and specify the exact CPU and memory percentages I want to reserve. The critical thing to understand is that this resource calculation is based on virtual machine reservations. So if you have a cluster full of virtual machines with no CPU or memory reservations, this isn’t going to do a whole lot for us.
It’s not going to reserve resources for us if no reservations are specified, and that’s where the performance degradation that VMs tolerate comes in. We want to ensure that virtual machines will actually perform well after a failure, so we’re not just concerned about CPU and memory reservations; we’re also concerned about the overall performance of VMs that do not have reservations. I can specify a threshold here, say 35% performance degradation that VMs will tolerate if there is a failure. This is an important setting to understand, because let’s say all four of our hosts are running at around 80% CPU and 80% memory usage, but we don’t have reservations.
That’s going to be really problematic if one of those hosts fails, so this setting helps keep overall utilization in line to allow acceptable performance after a host failure. So that’s the cluster resource percentage, and by default that’s the method used for admission control. But we could also choose one of the other methods: dedicated failover hosts, or the slot admission control policy. The slot policy has been around for a while, and it works by determining a slot size. Basically, it looks at the virtual machine with the largest CPU reservation and the virtual machine with the largest memory reservation, says “that’s my typical VM,” figures out how many slots the cluster has based on that, and that’s how many VMs are allowed to run. Or we can specify a fixed slot size.
With a fixed slot size, we’re assuming each VM’s slot is a certain number of megahertz of CPU and a certain amount of memory, and from that we can determine exactly how many virtual machines can run within the cluster. This can get a little confusing and, frankly, a little tricky to manage, so I typically don’t recommend this approach. I recommend either the cluster resource percentage or, if you want, establishing certain hosts that are dedicated to failover: hosts that just sit there waiting for a failure to occur and don’t run any virtual machines under normal circumstances. In my opinion, the ideal policy is the cluster resource percentage.
I don’t like having failover hosts just sitting there doing nothing, because, well, they’re not doing anything, and how much can I really count on them working when I need them? Typically, if I’m working with a consulting client, I’ll set up admission control with the cluster resource percentage and configure the performance degradation threshold that I feel is appropriate, or that the customer wants. That keeps things relatively simple and straightforward, and it functions well. If you go with the slot policy, in my experience you tend to run into situations where you don’t have enough slots available to boot more virtual machines, and in many of those cases it’s because the slot size is skewed: you still have plenty of resources, but you’re just out of slots.
Think of slots as parking spots in a parking lot. If I make all of my parking spaces too small, that’s going to be problematic. But if I make them too big, way too big, I’m not going to be able to park many cars in that lot. I might have plenty of room in the parking lot, but every car is taking up twice as much space as it needs. If the slot size is not right, that’s the problem you’re going to run into: you won’t be able to boot as many virtual machines as you should be able to. After years of watching people run into those kinds of issues and going in to resolve them, I’ve concluded that the cluster resource percentage is the ideal admission control policy for most use cases, and that’s why it’s the default method shown here.
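For completeness, here’s a hedged sketch of what the two alternative policies look like in the API; the slot sizes are placeholders, and `standby_host` is a hypothetical host object you’d look up from inventory first:

```python
# Hypothetical sketches of the two alternative admission control policies.
from pyVmomi import vim

# Slot policy: tolerate one host failure, with an explicit fixed slot size.
slot_policy = vim.cluster.FailoverLevelAdmissionControlPolicy()
slot_policy.failoverLevel = 1
slot_policy.slotPolicy = vim.cluster.FixedSizeSlotPolicy(cpu=500,    # MHz per slot
                                                         memory=1024)  # MB per slot

# Dedicated failover hosts: 'standby_host' is a vim.HostSystem looked up beforehand.
host_policy = vim.cluster.FailoverHostAdmissionControlPolicy()
host_policy.failoverHosts = [standby_host]

das = vim.cluster.DasConfigInfo(enabled=True, admissionControlEnabled=True,
                                admissionControlPolicy=slot_policy)  # or host_policy
cluster.ReconfigureComputeResource_Task(vim.cluster.ConfigSpecEx(dasConfig=das), True)
```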
3. Configure Heartbeat Datastores
In this video, I’ll demonstrate how to configure heartbeat datastores for a High Availability cluster. Here you can see I’ve created an HA cluster, and I haven’t configured heartbeat datastores at all yet. Here are my configuration options. What these heartbeat datastores are used for is this: if a host becomes isolated, if it loses access to the HA heartbeat network, it keeps a little file on these heartbeat datastores and it locks that file.
That way, other ESXi hosts can reach out to those datastores and see whether that file still exists and is still locked. If the file is locked, the other hosts know that the host that stopped sending network heartbeats is actually up: it’s still working, still maintaining a lock on that datastore heartbeat file. So it’s an important option for getting a more accurate picture of whether a host is truly down or just isolated from the HA heartbeat network. And if I want to, I can let my HA cluster automatically select datastores accessible from the hosts in the cluster, I can specify a list of datastores to use for datastore heartbeating, or I can say, use the datastores from the specified list and complement automatically if needed.
If I choose that final option, I can select the preferred datastores that HA should use for datastore heartbeating. HA chooses from among the datastores on that list, and if one of them becomes unavailable, HA automatically chooses a different one. If none of the listed datastores are available, HA will choose any available cluster datastore at that time. So it adds a bit of a foolproof quality to datastore heartbeats: say we remove these datastores but don’t think to come back and reconfigure High Availability to point at new ones, or say this storage array goes down; we’ll still automatically have functional heartbeat datastores.
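In the API, this choice maps onto the heartbeat datastore candidate policy. Here’s a hedged sketch, where `ds1` and `ds2` are hypothetical datastore objects looked up from inventory:

```python
# Hypothetical sketch: preferred heartbeat datastores, complemented automatically.
from pyVmomi import vim

das = vim.cluster.DasConfigInfo(enabled=True)
das.hBDatastoreCandidatePolicy = "allFeasibleDsWithUserPreference"  # list + auto-complement
das.heartbeatDatastore = [ds1, ds2]  # preferred datastores for heartbeating

# Other policy values: "userSelectedDs" (only the list) and "allFeasibleDs" (fully automatic).
cluster.ReconfigureComputeResource_Task(vim.cluster.ConfigSpecEx(dasConfig=das), True)
```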