1. Design Business Continuity (10-15%)
The fourth section of the exam is called “Design Business Continuity” and is worth 10 to 15% of the exam score. Underneath the topic of business continuity are two subtopics. One is backup and recovery, and the other is high availability. Now, we should deal with these together because they are obviously closely related. Having an application that has high availability is probably the best defence against a system that’s going to be unstable, crash, or need restoration. And so we’re going to talk about high availability and what you need to do to ensure that your applications stay up, whether you’re using infrastructure as a service or platform as a service. But you might not be able to have multiple instances of your application running, in which case you need to have a solid backup and recovery strategy, so that if something does happen, you’re able to restore from backup, you know exactly what your downtime is going to be, you know exactly how much data loss you’re going to have, and you know when that application is going to be available for users to start using again. That’s where backup and recovery comes in. So that’s what we’re going to talk about in this section of the course. I hope you’re enjoying it so far. Thanks a lot. Be sure to leave your questions and comments in the Q&A section of the course if you’ve got ideas for improvement. I’m always open to improving this course for yourself and for other students. Let’s keep going.
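To make "know your downtime and know your data loss" a bit more concrete, here is a minimal sketch in Python. It assumes a simple restore-from-backup plan; the function name `recovery_estimate` and the sample times are made up for illustration and are not part of any Azure tooling. These two numbers are what people commonly call the recovery point and recovery time.

```python
from datetime import datetime, timedelta


def recovery_estimate(last_backup, failure, restore_duration):
    """Return (data_loss_window, back_online_at) for a restore-from-backup plan.

    Hypothetical helper: it only does the arithmetic described in the text.
    """
    if failure < last_backup:
        raise ValueError("failure must not precede the last backup")
    # Anything written after the last backup is lost when you restore.
    data_loss_window = failure - last_backup
    # Users wait from the moment of failure until the restore completes.
    back_online_at = failure + restore_duration
    return data_loss_window, back_online_at


# Illustrative values only.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 9, 30)
loss, online = recovery_estimate(last_backup, failure, timedelta(hours=1))
print(loss)    # 7:30:00
print(online)  # 2024-01-01 10:30:00
```

If the data loss window is bigger than the business can accept, that is the signal to back up more often or to move toward replication.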
2. Introduction to Azure Site Recovery (ASR)
So let’s talk about the concepts of business continuity and disaster recovery. And what is our strategy for protecting our entire website and web applications from a disaster? Now, we have to admit that no cloud provider is perfect. Whether you’re Amazon, Microsoft, Google, or anyone else, there are times when you’re having a particularly bad day. When a natural disaster strikes—a hurricane, a flood, an earthquake, or something else—those data centers are knocked offline.
Or it can be a man-made issue, like what can happen when there’s a bad software deployment. So the Azure team goes and updates the Azure firmware within a region, and they test it and miss something, and that region goes down. Assume you have a perfectly beautiful software solution running in the East US region, and customers are happy; you’re happy; everything is going great. And then Azure has a problem. Suddenly, the East US region has been knocked offline, and nobody can access it. You go into your portal, you look into Azure Service Health, and they say the engineers are investigating. They don’t know what happened with the East US region.
So, what do you do as an engineer or architect on this team? Do you wait? Do you say, “Well, we’ve got to hope that they fix it quickly”? That’s option number one: just sit there for 30 minutes, an hour, 2 hours, or 3 hours until the solution comes back online. Or do you scramble and say, “You know what? We can go to the West US region,” and create those resources, redeploy the VMs, redeploy the code, and redeploy the databases from a backup? Do you manually rebuild it? Or maybe you were smart enough to have planned ahead of time and have that standby region that we saw in an earlier diagram. You’ve been paying money all along for this insurance policy, essentially, and now you just need to fail over to the already running code. Or do you simply have everything in standby mode, so that you’re not using the resources but have your finger on the button that will automatically get you up and running with no manual rebuilding required?
In the event of a disaster, these are pretty much our only options. So enter ASR (Azure Site Recovery). Now, Azure Site Recovery is a service designed to help us with our business continuity issues. Okay? Now, this is something you have to do in advance. If your site is currently down, if you’re in East US watching this video and you happen to be down, you have got to take one of those other options. You can’t use Azure Site Recovery because the sites aren’t available. But if you plan this in advance, you can create yourself a backup vault, a Recovery Services vault. Keep copies of your VMs and all the things that you need in this Recovery Services vault. Make sure your Recovery Services vault is in a region other than your primary region, because if East US goes down and your Recovery Services vault is in East US, you’re still screwed.
And this is an example of something that Azure Site Recovery can help you with. So we can have, in this case, the East US region. We have our storage accounts, we have our virtual machines, we have virtual networks, availability sets, and subnets, and everything is great. And if East US collapses, we’d like to relocate to Central US. But notice in the diagram that there are no VMs running in Central. There’s no storage account provisioned. It’s an empty set of boxes. So we can have the VNets created; those are free. We can have the subnets created; those are free. We can have an empty storage account with no data in it; that’s free. And so ASR will help us set up this ghost copy that doesn’t really cost us anything in the Central US region. Once you’ve configured ASR, you’ll need to install extensions into your virtual machines, including the Site Recovery extension, and your data will be cached in another storage account.
So you need another storage account that’s going to serve as the temporary holding area for data that’s going to get copied over to Central. Now, in order to have that data available to us, we need to keep a copy of the data in two places. So we have East US, which is our primary active region, and then we have Central US, which will have our data but doesn’t currently have any running virtual machines. And this is what ASR will help us set up. So we have this in case of emergency; that’s what we’re doing. We’re paying for two storage accounts, but we’re not paying for double sets of virtual machines at this time. And then, bam, the source environment, East US, goes down, and we can trigger a failover using ASR, and we say, “Okay, we want to switch everything to Central.” ASR will take the replicated copies of the virtual machines that it manages through the Recovery Services vault, restore them, and get them up and running within Central US. We already have the data sitting there from the synchronization that’s been going on, and within 30 minutes or whatever, we are back up and running.
So it takes five or ten minutes for the virtual machines to get spun up. Those machines then have to run through whatever startup tasks you have. So ASR is not perfect, but this would be sort of like Plan B or C if the East US region was to go down. You’ve already been using ASR, and then you manually flip the switch and get things running in Central, and you only had 30 minutes or an hour of downtime, whereas East US could be down for 7 hours, and you’ve basically saved yourself a lot of trouble in the process. No manual rebuilding; this has all been taken care of for you by Azure Site Recovery.
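The "ghost copy" idea from this section can be sketched as a tiny model. This is only an illustration under the assumptions stated in the walkthrough: VNets and subnets cost nothing, while running VMs and storage that holds data do. The `Environment` class and both functions are hypothetical names, and the model ignores any charge for the Site Recovery service itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Environment:
    region: str
    running_vms: int = 0
    storage_with_data: int = 0
    vnets: int = 0      # free, per the walkthrough
    subnets: int = 0    # free, per the walkthrough


def billable_items(env):
    """Count the things we pay for: running VMs and storage that holds data."""
    return env.running_vms + env.storage_with_data


def fail_over(source, target):
    """Return the target as it looks after failover: the source's VMs now run there."""
    return replace(target, running_vms=source.running_vms)


primary = Environment("East US", running_vms=2, storage_with_data=1, vnets=1, subnets=1)
standby = Environment("Central US", running_vms=0, storage_with_data=1, vnets=1, subnets=1)

print(billable_items(standby))   # 1  (only the replicated data)
after = fail_over(primary, standby)
print(billable_items(after))     # 3  (two VMs plus the data)
```

The point of the sketch is the difference between the two numbers: before failover the standby region only carries the replicated data, and the VM cost appears in Central only once you actually flip the switch.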
3. Testing Failover and Initiating Failover
So, continuing to talk about business continuity and specifically the Azure Site Recovery strategy that we talked about in the last video, let’s talk about site failover and failback. Now, to remind you, we have this ASR site recovery plan set up where we have a source environment, in this case, in East US. It has a couple of virtual machines. It has a couple of storage accounts. And the site recovery system within Azure is actually keeping a backup of the data and basically synchronizing that data with another environment. Now, there are no VMs set up there, but the site is ready to go as soon as you initiate a failover. You would expect that the two VMs would be created there, and you could start directing traffic to that location.
So it’s an emergency failover site. But how do you test that? When we’re talking about real high-stakes consequences, it’s critical to have a backup plan or a business continuity plan. If your application going down has a significant financial impact, and you need to get it back up and running within minutes, you should test this. You’d better actually run through it, have your team go through it, and prepare. It’s just like a fire alarm test or an evacuation test; they do those for hurricanes too. You’re going to want to test your site recovery plan. This is not something you just set up and never try. And you can do that. You go into the Recovery Services vault, and you’ll see that there’s the source and target set up in a recovery plan. So, basically, you can click this “test failover” button, and it will ask you for some configurations, such as which recovery point you want to test this failover from, and it will attempt to create that application. It will get the data up, it will get the virtual machines up in the right configuration, and you can go and see whether it actually worked or not. It also causes no data loss or downtime in your source environment. So this is something you can do without affecting production. So that’s one type of test.
When you perform that type of test, you will gain confidence that the site recovery plan works. But it’s not full confidence yet, right? It’s not an actual disaster, and you’re not actually shutting down the source environment to get up and running in the new environment. You’re leaving the original alone. Now, when you do a test failover, it brings the machines up inside a virtual network that you choose, and then after the fact, you can play with it, test it, make sure it works, and then clean up after the test. Another type of test is to actually initiate a failover. This is closer to the real thing. This is you going in and saying, “We have a disaster, the source environment is down, we don’t trust it anymore, or something happened, and I want the target environment to be alive.” And so this is a real failover. You go into the more options section of your Recovery Services vault and you click the failover button. You can choose the direction and the recovery point, and you can choose to shut down the virtual machines in the source environment before the failover starts. Now, that’s assuming that they’re accessible. Then you select failover.
And then, doing this, the failover really does happen. This is not a test. This is an actual failover. Again, you could shut down the virtual machines, and if that worked, you’d have no data loss. By shutting down the VMs, you ensure that no new data is written once you start the failover process. So that’s the failover option, which is more of an emergency: I need this to happen now. Then there is also this concept of the planned failover. Now, this would be more like you actually moving to that new environment. So you want to get away from East US and into Central US, and you are replicating that environment. Everything is set up, and then you want to make the move, so you can use this planned failover as an actual move: move my application and everything else I need into a new environment. So on that menu, right above the failover option, is the “planned failover” button. And that’s, again, a no-data-loss option, but it’s going to shut down your source environment. So when you hit that button, it’s going to say, “Okay, we’re shutting down the VMs, we’re making sure the data gets replicated to the last bit, and that application is going to be down during that time.” Performing a planned failover is clearly a disruptive event, because it shuts down the VMs.
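The three options covered in this video (test failover, failover, planned failover) can be summed up as a small decision helper. This is a sketch of the reasoning as described here, not an Azure API; `FailoverKind`, `AFFECTS_PRODUCTION`, and `choose_failover` are hypothetical names.

```python
from enum import Enum


class FailoverKind(Enum):
    TEST = "test failover"
    UNPLANNED = "failover"
    PLANNED = "planned failover"


# Whether the source (production) environment is affected, as described in the video.
AFFECTS_PRODUCTION = {
    FailoverKind.TEST: False,      # source keeps running, no downtime or data loss
    FailoverKind.UNPLANNED: True,  # source is down or no longer trusted
    FailoverKind.PLANNED: True,    # source VMs are shut down on purpose
}


def choose_failover(rehearsal, source_healthy):
    """Pick the kind of failover that matches the situation."""
    if rehearsal:
        return FailoverKind.TEST       # practice run, cleaned up afterwards
    if source_healthy:
        return FailoverKind.PLANNED    # deliberate move to the target environment
    return FailoverKind.UNPLANNED      # emergency: get the target alive now


print(choose_failover(rehearsal=True, source_healthy=True).value)    # test failover
print(choose_failover(rehearsal=False, source_healthy=False).value)  # failover
print(choose_failover(rehearsal=False, source_healthy=True).value)   # planned failover
```

The only option here that is safe to run at any time is the test failover, which is why it is the one to rehearse regularly.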
4. ASR Supported Workloads
Now, one thing I haven’t mentioned yet about Azure Site Recovery is that it operates in a number of environments. This is not just for replicating an existing Azure network with applications into another Azure region. It supports what we just talked about, which is Azure virtual machines on Windows or Linux, from one region to another. It also supports a lot of on-premises backup and recovery strategies. So, say you’ve got VMware set up within your own environment. You’ve got a VMware box with a number of VMs running on top of that hypervisor. You can use ASR to have this backup strategy go from your own premises to Azure. So imagine using the cloud as the backup site but having your own site as the primary.
And then disaster strikes. You initiate the failover. Then, suddenly, you have an Azure region with your virtual machines, which were originally running on VMware, now running within Azure. Some of you may be thinking, “Well, can’t we use this as a copy or a migration technique?” Of course. So if you have a VMware application and a number of VMs running on top of that and you want to move that permanently into Azure, you can use ASR to set it up, test the failover, make sure the applications are all working, and then initiate the failover.
Then, all of a sudden, your workloads are running in Azure. It works well not only for VMware but also for Hyper-V and physical servers. So, if you just have a Windows server that’s not virtualized in any way, you can migrate that into Azure or have Azure as an emergency backup. Finally, and perhaps most unexpectedly, you can migrate and replicate data between two on-premises sites, even though neither of them is in the cloud. So, if you have a network with some physical servers running on your premises and want an emergency backup so you can fail over into another network, ASR can handle that workload as well. Here’s a diagram from the VMware side, where it shows on the left a source set of VMware vSphere VMs and a bunch of physical servers as well. Azure Site Recovery lives in the middle. And then there is a secondary on-premises site that is also VMware vSphere. So this is ASR protecting an on-premises environment that uses a combination of physical servers and VMware. And you can see the sort of processes that are required to make that work.
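As a quick recap, the source-to-target combinations mentioned in this video can be written down as a small lookup table. This is only a summary of what is said here, not an official support matrix; the labels and the names `MENTIONED_PATHS`, `is_mentioned`, and `targets_for` are made up for illustration, and a combination missing from the table just means this video did not cover it.

```python
# (source workload, target) pairs mentioned in this video.
MENTIONED_PATHS = {
    ("Azure VM", "Azure region"),
    ("VMware VM", "Azure region"),
    ("Hyper-V VM", "Azure region"),
    ("physical server", "Azure region"),
    ("VMware VM", "secondary on-premises site"),
    ("physical server", "secondary on-premises site"),
}


def is_mentioned(source, target):
    """True if the video describes replicating this source to this target."""
    return (source, target) in MENTIONED_PATHS


def targets_for(source):
    """List the targets the video mentions for a given source workload."""
    return sorted(t for s, t in MENTIONED_PATHS if s == source)


print(targets_for("VMware VM"))
# ['Azure region', 'secondary on-premises site']
print(is_mentioned("physical server", "Azure region"))  # True
```

For real planning, check the current Azure Site Recovery support documentation for your exact workload rather than relying on a summary like this.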
5. ASR Geographies and Paired Regions
So let’s talk about the effect of geography on your Azure Site Recovery or business continuity strategy. Now, different people are going to use it for different reasons, right? Some people, like we saw in an earlier example, are using it purely for disaster recovery. So you have an existing application that is located in East US.
And in case of emergency, you’ll want that to be running in Central, and there’s a reason why you’re staying so close. Part of that is that you’re not moving your app to Europe; you just want it to still be running in the United States. It’s just that particular region that is having downtime. Microsoft has this concept of “paired regions,” where it typically takes two regions within a particular geography and pairs them together. So we can see from the official documentation here that for any particular geography, whether it’s the United States, Canada, or Australia, there are two regions that are paired. It’s not that one is the primary and the other is the backup. It’s more that, between East Asia and Southeast Asia, this is the quickest connection that you’re going to get out of all the regions. So if you have a workload running in East Asia and the backup is running in Southeast Asia, then the data synchronisation is going to be the quickest.
The moment data is written to East Asia, it’ll get to Southeast Asia quicker than to any other region, which reduces data loss. Now, it’s possible that your goal isn’t to have the fastest disaster recovery, but rather to use ASR as a migration strategy or as a way to copy an application. Do keep in mind that there is pricing based on your geography. So for the data connection between one location and another, you are charged for the data that transfers between regions. So if you are picking two regions within America, that’s the cheapest and the simplest charge. Of course, you can have your app in East US and copy it to a region in Canada. Do keep in mind that you cannot mix the North American Azure public cloud with the US government cloud. The US government cloud is its own geography. Generally, you’re not going to do Azure Site Recovery between those two locations. And that kind of makes sense, I guess. Now if you wanted to move your site from the US to Europe, there’s going to be a slightly higher charge for the bandwidth.
So there’s a higher per-gigabyte charge for data travelling between the US and Europe. South Africa is counted in Europe for some reason; I guess there’s only one African location, and counting that as its own geography isn’t quite ready yet. And so that’s Europe. Then there are the other geographic clusters. We saw that they listed Asia, and there are also India, Australia, and China. For China, you need a special agreement with the Chinese company that runs Azure there. And in Brazil, there’s actually only one data centre, so it can’t even be paired with another Brazilian one. It has to be paired with the US. So these things, again, are the pricing implications of going to further geographies. But it may be that you want to copy your app, have a second backup, or use a permanent failover because you’ve realised you don’t want to be hosted in just one location.
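To tie the geography rules together, here is a rough sketch of how you might reason about a source and target region. The cluster assignments cover only a few regions used as examples, the rule that public and government clouds do not mix follows this video, and the per-gigabyte rates are made-up placeholders, not real Azure prices. All names in the code are hypothetical.

```python
# A few regions and the geographic cluster they fall into (illustrative subset).
GEO_CLUSTER = {
    "East US": "America",
    "Central US": "America",
    "West Europe": "Europe",
    "East Asia": "Asia",
    "Southeast Asia": "Asia",
    "US Gov Virginia": "US Government",
}

# Placeholder rates in cents per GB. These are NOT real prices.
RATE_CENTS_PER_GB = {"same_cluster": 2, "cross_cluster": 5}


def can_replicate(source, target):
    """Public cloud and US government cloud cannot be mixed."""
    clusters = {GEO_CLUSTER[source], GEO_CLUSTER[target]}
    return not ("US Government" in clusters and len(clusters) > 1)


def transfer_cost_cents(source, target, gigabytes):
    """Estimate the bandwidth charge for replicating `gigabytes` of data."""
    if not can_replicate(source, target):
        raise ValueError(f"cannot replicate from {source} to {target}")
    same = GEO_CLUSTER[source] == GEO_CLUSTER[target]
    rate = RATE_CENTS_PER_GB["same_cluster" if same else "cross_cluster"]
    return gigabytes * rate


print(transfer_cost_cents("East US", "Central US", 100))   # 200
print(transfer_cost_cents("East US", "West Europe", 100))  # 500
print(can_replicate("East US", "US Gov Virginia"))         # False
```

With real numbers from the Azure bandwidth pricing page plugged in, the same comparison shows how much extra you pay for choosing a target region outside your own geography.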