1. AWS Lambda Architectures
So now let’s talk about how we can use Lambda to do some automations with our databases. Number one, Lambda can be used to react to events happening within your database, using CloudWatch Events, also called Amazon EventBridge. RDS, for example, emits events whenever there is a DB snapshot, a parameter group change or a security group change. All these events can go into CloudWatch Events or EventBridge, and a CloudWatch Events or EventBridge rule can have a Lambda function as its target. So that is a very common pattern: for example, reacting to a database snapshot being completed and then triggering something with a Lambda function.
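To make that pattern concrete, here is a minimal sketch using boto3 that wires an EventBridge rule for RDS snapshot events to a Lambda target. The rule name, the Lambda ARN and the exact event pattern are illustrative assumptions, not something the lecture prescribes.

```python
import json
import boto3

events = boto3.client("events")

# Match RDS DB snapshot events emitted by the RDS event source
events.put_rule(
    Name="rds-snapshot-rule",                       # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.rds"],
        "detail-type": ["RDS DB Snapshot Event"],   # assumed detail-type for snapshot events
    }),
    State="ENABLED",
)

# Send matching events to the Lambda function that should react to them
events.put_targets(
    Rule="rds-snapshot-rule",
    Targets=[{
        "Id": "react-to-snapshot",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:on-rds-event",  # placeholder ARN
    }],
)
```

You would also grant EventBridge permission to invoke the function, for example with the Lambda add-permission API.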
An alternative way to do the exact same architecture, or at least get the same outcome, is to use SNS topics and RDS event subscriptions. These are the event notifications coming directly out of RDS, and they can go straight into an SNS topic. They will be the exact same events as before, because the source is the same; it’s just the target that’s different. And from this SNS topic we can send an email, trigger another Lambda function, or send the data into an SQS queue.
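Here is a minimal sketch of that SNS-based alternative, assuming a pre-existing SNS topic; the subscription name, topic ARN and event categories are placeholders.

```python
import boto3

rds = boto3.client("rds")

# Subscribe an SNS topic to RDS event notifications for DB instances
rds.create_event_subscription(
    SubscriptionName="rds-backup-events",                          # hypothetical name
    SnsTopicArn="arn:aws:sns:us-east-1:123456789012:rds-events",   # placeholder topic
    SourceType="db-instance",
    EventCategories=["backup"],                                    # e.g. backup/snapshot events
    Enabled=True,
)
```

From that topic you can then subscribe an email address, a Lambda function or an SQS queue, exactly as described above.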
So both of these allow us to react to events happening in our RDS database. Next, we can react to API calls using CloudTrail. Say, for example, that we want to react whenever a database instance is created or terminated. Users will make API calls to do this, the CreateDBInstance and DeleteDBInstance APIs on the RDS service. CloudTrail is the AWS service that intercepts and logs all these API calls, and from CloudTrail we can then trigger a Lambda function, for example a Lambda function that is invoked whenever one of these API calls gets logged into CloudTrail.
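As a sketch, reacting to those API calls typically goes through an EventBridge rule that matches the CloudTrail-recorded event; the rule name below is made up, and a Lambda target would be attached just like in the first sketch.

```python
import json
import boto3

events = boto3.client("events")

# Match the RDS API calls recorded by CloudTrail
events.put_rule(
    Name="rds-api-calls",                                   # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.rds"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {"eventName": ["CreateDBInstance", "DeleteDBInstance"]},
    }),
    State="ENABLED",
)
```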
We also have cron jobs with Lambda. So CloudWatch Events and EventBridge can trigger Lambda not just on events, but also on a cron schedule. We can have CloudWatch Events, say every hour, trigger a Lambda function, and that Lambda function could execute a script against an RDS database, therefore creating what’s called a cron job, because it runs on a schedule. Now, you need to know the limitations of this: the maximum execution time of a Lambda function is 15 minutes, so you need to make sure that whatever script you run against your database, for example a cleaning script, does not time out, so does not exceed 15 minutes.
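Here is a minimal sketch of wiring up such a schedule, assuming a hypothetical cleanup function; the names and the ARN are placeholders.

```python
import boto3

events = boto3.client("events")

# Fire every hour (a cron expression such as cron(0 * * * ? *) would also work)
events.put_rule(
    Name="hourly-db-cleanup",            # hypothetical rule name
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)

# Invoke the cleanup Lambda on that schedule
events.put_targets(
    Rule="hourly-db-cleanup",
    Targets=[{
        "Id": "cleanup",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:db-cleanup",  # placeholder ARN
    }],
)
```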
And in terms of languages available to execute that script, you have Python, JavaScript, Java, et cetera, and not only that: container images (Docker) can now be used as a Lambda runtime to package and execute the script. Another way you could use Lambda is for disaster recovery. We know that automated RDS backups happen once a day, during the backup window we defined, and that frequency is fixed at once a day; we cannot schedule them more often. But sometimes you may want automated backups every 12 hours, and for this you need to use manual snapshots, because you can take a manual snapshot whenever you want.
So how do we automate this? Well, CloudWatch Events can trigger a Lambda function every 12 hours, as we’ve seen on the cron job slides before, and this Lambda function can issue an API call to start a manual database snapshot. Now, starting a snapshot is asynchronous: the RDS database will create the snapshot in its own time. So how do we react to the snapshot being created? Well, if you followed, RDS will emit an event when the snapshot is completed, and this can again be intercepted by CloudWatch Events to trigger another Lambda function. Both functions are sketched below.
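Here is a minimal sketch of the two Lambda handlers, assuming hypothetical instance and snapshot identifiers and an arbitrary pair of regions; the shape of the incoming event in the second function is also an assumption.

```python
import datetime
import boto3

rds = boto3.client("rds")                                    # primary region
rds_dr = boto3.client("rds", region_name="eu-west-1")        # DR region (placeholder)


def create_snapshot_handler(event, context):
    """First function: runs every 12 hours and starts a manual snapshot."""
    timestamp = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H-%M")
    rds.create_db_snapshot(
        DBInstanceIdentifier="my-database",                  # hypothetical instance
        DBSnapshotIdentifier=f"my-database-{timestamp}",
    )
    # The call returns immediately; RDS finishes the snapshot asynchronously.


def copy_snapshot_handler(event, context):
    """Second function: triggered by the 'snapshot completed' event, copies it cross-region."""
    source_arn = event["detail"]["SourceArn"]                # assumed event shape
    rds_dr.copy_db_snapshot(
        SourceDBSnapshotIdentifier=source_arn,
        TargetDBSnapshotIdentifier="my-database-dr-copy",    # hypothetical name
        SourceRegion="us-east-1",                            # placeholder source region
    )
```

In practice these would be two separate Lambda functions, each with its own trigger.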
When that happens, the second Lambda function knows that the snapshot is completed and can copy that snapshot across regions, therefore creating a disaster recovery strategy: we have automated a backup being created in RDS every 12 hours and copied automatically into another region, using two Lambda functions. So those are just some of the combinations you can do with Lambda, CloudWatch Events and RDS. Hopefully these make sense. They unlock a lot of use cases and the exam may test you on a few of them. So I hope you liked this and I will see you in the next lecture.
2. Server Migration Service
Now let’s talk about a service that may appear at the exam, probably as a distractor. It is AWS Server Migration Service, or SMS, which is used to migrate entire virtual machines to AWS from on-premises, and it is a replacement and improvement over the EC2 VM Import/Export service that used to exist. Server Migration Service, as the name indicates, is meant to migrate servers, not databases; for databases we instead have DMS, the Database Migration Service, which has a very explicit name. When you use SMS, the operating system, the data and everything else is kept intact.
Then, once you load the VM onto EC2, you can update the OS and the data and make an AMI from it. So SMS is just used to migrate one server from on-premises to the cloud, and this is called rehosting. Also, SMS only works with specific kinds of on-premises systems: VMware vSphere, Windows Hyper-V and Azure VMs. Every time SMS runs, it creates an EBS snapshot or an AMI that is ready for deployment on EC2, with replication being incremental. You can choose a one-time migration, or replication at an interval of every X hours.
So DMS, you remember, is for database migration and DMS is continuous, whereas SMS does one-time migrations or replication on an interval, and is therefore not continuous. What I want to illustrate here is that SMS is not used for database migration, DMS is used for that; SMS is only used for server migration and it is not continuous. So hopefully that helps you make the right decision if that service comes up at the exam. That’s it for me, I will see you in the next lecture.
3. EBS-optimized instances
So let’s look at a subtlety with EBS-optimized instances. If we consider an EC2 instance that is EBS-optimized, what this means is that there is dedicated bandwidth between the EC2 instance and the EBS volume. Say, for example, we have an EBS volume of type io1, which means it has Provisioned IOPS, so we get dedicated IOPS on that volume. Now, we need to note that EC2 instances that are EBS-optimized do have dedicated bandwidth, but that bandwidth changes based on the instance size. For example, if I consider an i3.large, the maximum bandwidth is 425 megabits per second.
But if I go to an i3.16xlarge, I get 14,000 megabits per second, so a lot more bandwidth. What this means, and this is very important, is: choose an EBS-optimized instance that provides more dedicated Amazon EBS throughput than your application needs; otherwise, the connection between Amazon EBS and Amazon EC2 can become a performance bottleneck. So even though you have an EBS volume with a lot of IOPS, if you don’t have enough bandwidth between your EC2 instance and your EBS volume, then not all of the IOPS will be used and you will have a bottleneck.
So that applies to EC2 instances. But because RDS databases are built on top of the same type of instances, EC2 instances, the same conclusion applies to an RDS database that has mounted, for example, an io1 EBS volume: in order to leverage the full Provisioned IOPS of your volume, you need to increase the RDS instance size to get the maximum performance. So how do we know when to increase the instance size? There are two metrics to look at: the EBSIOBalance% and the EBSByteBalance% metrics.
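To make the metrics concrete, here is a minimal sketch of pulling them from CloudWatch for an RDS instance over the last day; the instance identifier is a placeholder.

```python
import datetime
import boto3

cw = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()

for metric in ["EBSIOBalance%", "EBSByteBalance%"]:
    stats = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-database"}],  # placeholder
        StartTime=now - datetime.timedelta(days=1),
        EndTime=now,
        Period=3600,
        Statistics=["Average"],
    )
    averages = [point["Average"] for point in stats["Datapoints"]]
    print(metric, min(averages) if averages else "no data")
```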
If these metrics are consistently low, the instance is a candidate to size up; and if an instance has balance percentages that never drop below 100%, it is a candidate for downsizing. I invite you to read a bit more at this URL if you want more information. But from an exam perspective, what you have to remember is that if you have a bottleneck on your IOPS because they’re not all being used, it may be because you have to increase the instance type in order to fix that issue, because you’ll get more network bandwidth out of it. So that’s it, I hope you liked it and I will see you in the next lecture.
4. Transferring large amounts of data into AWS
So here is a lecture on transferring large amounts of data into AWS. This is just a short one to illustrate a point. Say, for example, you want to transfer 200 terabytes of data into the cloud and you have a 100 megabit per second internet connection. If you go over the internet or use a Site-to-Site VPN, it’s going to be immediate to set up, and it will take about 185 days if you do the computation. If you have a faster connection, for example you set up a Direct Connect connection between your on-premises site and AWS at one gigabit per second,
then it’s going to take a long time to set up, because there’s a one-time setup for Direct Connect of over a month. But once it is set up, the transfer is ten times as fast because the connection is ten times as fast, so it will take only 18.5 days. And finally, if you want to transfer this amount of data over Snowball, so over the physical route, you will need two or three Snowballs in parallel to load all the data, and the transfer will take about one week end to end, because the Snowball devices obviously have to be shipped to AWS.
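If you want to check those numbers, here is the back-of-the-envelope computation (decimal units, ignoring protocol overhead):

```python
data_bits = 200e12 * 8                         # 200 TB expressed in bits

internet_days = data_bits / 100e6 / 86400      # 100 Mbps -> ~185 days
direct_connect_days = data_bits / 1e9 / 86400  # 1 Gbps   -> ~18.5 days

print(round(internet_days, 1), round(direct_connect_days, 1))  # 185.2 18.5
```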
The cool thing about Snowball, as you’ve learned in this course, is that it can be combined with DMS, the Database Migration Service. So we can do a one-time load of the data using Snowball and then do continuous replication after that using DMS. And for continuous, ongoing replication, you would pair up Site-to-Site VPN or Direct Connect alongside DMS CDC (change data capture) replication. So that’s it, just something that should make sense for you already, but it’s good to repeat it once. I will see you in the next lecture.
5. Disaster Recovery
So disaster recovery as a solutions architect is super important, and the exam expects you to know about disaster recovery. There’s a white paper on it and you should read it, but I try to summarize everything clearly with graphs and diagrams in this lecture, so you don’t have to read it if you don’t want to. Overall, you can expect some questions on disaster recovery, and as a solutions architect you need to know about disaster recovery anyway. Don’t worry, I tried to make this as simple as possible for you. So what is a disaster? Well, it’s any event that has a negative impact on a company’s business continuity or finances. And disaster recovery is about preparing for and recovering from these disasters.
So what kind of disaster recovery can we do on AWS, or in general? Well, we can do on-premises to on-premises: we have a first data center, maybe in California, and another data center, maybe in Seattle. This is traditional disaster recovery and it’s actually very expensive. Or we can start using the cloud and keep on-premises as the main data center, and if we have any disaster, use the cloud; this is called hybrid recovery. Or, if you’re all in the cloud, you can do AWS Cloud region A to AWS Cloud region B, and that would be a full-cloud type of disaster recovery.
Now, before we do disaster recovery, we need to define two key terms, and you need to understand them from an exam perspective. The first one is called RPO, Recovery Point Objective, and the second one is called RTO, Recovery Time Objective. So remember these two terms, and I’m going to explain them right now. The RPO, Recovery Point Objective, is basically how often you run backups, so how far back in time you can recover. When a disaster strikes, the time between the last recovery point and the disaster is going to be your data loss.
For example, if you back up data every hour and a disaster strikes, then you can go back in time one hour, so you will have lost up to one hour of data. The RPO can sometimes be an hour, sometimes one minute; it really depends on your requirements. But RPO is how much data loss you are willing to accept in case a disaster happens. RTO, on the other hand, is when you recover from your disaster.
Okay? So between the disaster and the RTO is the amount of downtime your application has. Sometimes it’s okay to have 24 hours of downtime (I don’t think it is), sometimes it’s not okay, and maybe you need just one minute of downtime.
Okay? So basically, optimizing for the RPO and the RTO drives some solution architecture decisions, and obviously, the smaller you want these to be, the higher the cost usually is. So let’s talk about disaster recovery strategies. The first one is backup and restore, the second one is pilot light, the third one is warm standby, and the fourth one is a hot site or multi-site approach. If we rank them, they all have different RTOs: backup and restore has the highest RTO, then pilot light, then warm standby, then multi-site. Each of these costs more money, but gets you a faster RTO.
That means you have less downtime overall. So let’s look at all of these one by one in detail to really understand, from an architectural standpoint, what they mean. Backup and restore has a high RPO. Say you have a corporate data center, and here is your AWS cloud with an S3 bucket. If you want to back up your data over time, maybe you can use AWS Storage Gateway, and have a lifecycle policy that puts data into Glacier for cost-optimization purposes. Or maybe once a week you’re sending a ton of data into Glacier using AWS Snowball.
So here, if you use Snowball, your RPO is going to be about one week, because if your data center burns down, or whatever, and you lose all your data, then you’ve lost one week of data, because you sent that Snowball device once a week. If you’re using the AWS cloud instead, maybe with EBS volumes, Redshift and RDS, and you schedule regular snapshots and back them up, then your RPO is going to be maybe 24 hours or 1 hour, based on how frequently you create these snapshots. Then, when a disaster strikes you and you need to restore all your data, you can use AMIs to recreate EC2 instances and spin up your applications.
Or you can restore straight from a snapshot and recreate your Amazon RDS database, your EBS volume, your Redshift cluster, whatever you want. Restoring this data can take a lot of time, so you get a high RTO as well. But the reason we do this is that backup and restore is quite cheap: we don’t manage any infrastructure in the middle, we just recreate infrastructure when we need it, when we have a disaster, so the only cost we have is the cost of storing these backups. So that gives you an idea: backup and restore is very easy and fairly cheap, but you get a high RPO and a high RTO.
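As a small illustration of the lifecycle policy idea mentioned above, here is a sketch that transitions backups to Glacier after 30 days and expires them after a year; the bucket name, prefix and day counts are made-up examples.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-backup-bucket",                    # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-backups",
            "Status": "Enabled",
            "Filter": {"Prefix": "backups/"},
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```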
The second one is going to be pilot light. With pilot light, a small version of the app is always running in the cloud, and usually that’s going to be your critical core; this is what’s called the pilot light. It’s very similar to backup and restore, but this time it’s faster, because your critical systems are already up and running, and when you do recover, you just need to add on all the other systems that are not as critical. So let’s take an example. This is your data center: it has a server and a database. And this is the AWS cloud: maybe you’re going to do continuous data replication from your critical database into RDS, which is going to be running at all times.
So you get an RDS database ready to go and running. But your EC2 instances are not critical just yet; what’s really important is your data, so they’re not running. In case a disaster happens, Route 53 will allow you to fail over from the server in your data center: you recreate that EC2 instance in the cloud and make it up and running, but your RDS database is already ready. So what do we get here? Well, we get a lower RPO and a lower RTO, and we still manage costs: we do have to have RDS running, but only the RDS database is running, the rest is not, and your EC2 instances are only created when you need to do a disaster recovery.
So pilot light is a very popular choice; remember, it’s only for critical core systems. Warm standby is when you have a full system up and running, but at minimum size, so it’s ready to go, and upon disaster we can scale it to production load. So let’s have a look. We have our corporate data center, maybe a bit more complicated this time: we have a reverse proxy, an app server and a master database, and currently Route 53 is pointing the DNS to our corporate data center. In the cloud, we’ll still have data replication to an RDS slave database that is running, and maybe an EC2 Auto Scaling group running at minimum capacity that currently talks to our corporate data center database.
And maybe we have an ELB as well, ready to go. If a disaster strikes, because we have a warm standby, we can use Route 53 to fail over to the ELB, and we can use the failover to also change where our application gets its data from, so that it now reads from the RDS database in AWS. We’ve effectively used our warm standby, and maybe, using auto scaling, our application will scale up pretty quickly. This is more costly, because we already have an ELB and EC2 Auto Scaling running at all times, but again, doing this decreases your RPO and your RTO. And finally, we get the multi-site, or hot site, approach.
It has a very low RTO, we’re talking minutes or seconds, but it’s also very expensive: you run full production scale on AWS and on-premises. So that means you have your on-premises data center at full production scale, and your AWS environment at full production scale, with some data replication happening. Because you have a hot site that’s already running, Route 53 can route requests to both your corporate data center and the AWS cloud, and this is called an active-active type of setup. The idea here is that the failover can happen: your EC2 instances can fail over to your RDS slave database if need be, but you get full production scale running on AWS and on-premises.
So this costs a lot of money, but at the same time you’re ready to fail over already, and you’re running a multi-DC type of infrastructure, which is quite cool. Finally, if you wanted to go all cloud, it would be the same kind of architecture, but multi-region. Maybe we could use Aurora here, because we’re fully in the cloud: we have a master database in one region, and an Aurora global database being replicated to another region as a slave. Both regions are working for me, and when I want to fail over, I’m ready to go at full production scale in another region if I need to.
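Here is a minimal sketch of setting up that Aurora global database from an existing regional cluster; all identifiers, the engine and the regions are placeholders.

```python
import boto3

rds_primary = boto3.client("rds", region_name="us-east-1")     # primary region (placeholder)
rds_secondary = boto3.client("rds", region_name="eu-west-1")   # DR region (placeholder)

# Wrap the existing primary cluster in a global cluster
rds_primary.create_global_cluster(
    GlobalClusterIdentifier="my-global-db",
    SourceDBClusterIdentifier="arn:aws:rds:us-east-1:123456789012:cluster:my-primary-cluster",
)

# Attach a read-only secondary cluster in the other region
rds_secondary.create_db_cluster(
    DBClusterIdentifier="my-secondary-cluster",
    Engine="aurora-mysql",                     # must match the primary's engine
    GlobalClusterIdentifier="my-global-db",
)
```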
So this gives you an idea of all the strategies you can have for disaster recovery. It’s really up to you to select the disaster recovery strategy you need, but the exam will ask you, based on some scenarios, what you recommend: backup and restore, pilot light, warm standby, or multi-site / hot site? Finally, some disaster recovery tips, and this is more real-life stuff. For backups, you can use EBS snapshots, RDS automated snapshots and backups, et cetera, and you can push all these snapshots regularly to S3, S3 IA or Glacier, and implement a lifecycle policy.
You can use cross-region replication if you want to make sure these backups are in different regions, and if you want to move your data from on-premises to the cloud, Snowball or Storage Gateway are great technologies. For high availability, using Route 53 to migrate DNS from one region to another is really, really helpful and easy to implement; a minimal failover record is sketched right below. You can also use technologies that are Multi-AZ, such as RDS Multi-AZ, ElastiCache Multi-AZ, EFS and S3: all of these are highly available by default, or once you enable it.
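Here is a minimal sketch of such a failover setup in Route 53, assuming plain A records; the hosted zone ID, record name, IPs and health check ID are placeholders (an ELB target would use an AliasTarget instead).

```python
import boto3

r53 = boto3.client("route53")

records = [
    # (set identifier, failover role, IP address, health check)
    ("primary", "PRIMARY", "203.0.113.10", "health-check-id-123"),
    ("secondary", "SECONDARY", "198.51.100.20", None),
]

changes = []
for set_id, role, ip, health_check in records:
    record = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check:
        record["HealthCheckId"] = health_check
    changes.append({"Action": "UPSERT", "ResourceRecordSet": record})

r53.change_resource_record_sets(
    HostedZoneId="Z123456ABC",                  # placeholder hosted zone
    ChangeBatch={"Changes": changes},
)
```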
For the high availability of your network: maybe you’ve implemented Direct Connect to connect your corporate data center to AWS, but what if that connection goes down for whatever reason? You can use a Site-to-Site VPN as a recovery option for your network. In terms of replication, you can use RDS cross-region replication and Aurora global databases, you can use database replication software to replicate your on-premises database to RDS, or you can use Storage Gateway as well. In terms of automation, how do we recover from disasters? I think you know already: CloudFormation and Elastic Beanstalk can help recreate whole new environments in the cloud very quickly, or, if we use CloudWatch, we can recover or reboot our EC2 instances when a CloudWatch alarm goes off. AWS Lambda can also be great for custom automation: Lambda functions are great for building REST APIs, but they can also be used to automate your entire AWS infrastructure. So overall, if you manage to automate your whole disaster recovery, then you are really well set for success. And then finally, chaos testing. How do you know you can recover from a disaster? Well, you create disasters. An example that I think is widely quoted now, in AWS circles as well, is Netflix: they run everything on AWS, and they have created something called the Simian Army, and they randomly terminate EC2 instances.
For example, they do much more than that, but basically they just take an application server and terminate it randomly, in production, okay? Not in dev or test, in production. They want to make sure that their infrastructure is capable of surviving failures, and that’s why they run a bunch of chaos monkeys that just terminate stuff randomly, to make sure their infrastructure is rock solid and can survive any type of failure. So that’s it for this section on disaster recovery. I hope you enjoyed it and I will see you in the next lecture.