67. Advanced Cloud Troubleshooting Module Introduction
As you know, in the previous module, we took a look at the very basic, fundamental concepts of troubleshooting, and now, of course, it’s time for us to take a look at more advanced troubleshooting topics, specifically those involving cloud. Let’s get started.
68. Advanced Cloud Troubleshooting
Now, you know me, I hope. I would never want you to be intimidated. We have the word advanced in the title of this video, but this is not going to be something that’s intimidating, something that’s too intense for you. So, while we are gonna talk about some more advanced cloud troubleshooting topics, I hope you trust me now. This is not gonna be difficult. Let’s once again use AWS as an example.
Say we have some type of solution, we’ll call it solution A, running inside of AWS, and we are interested in performing advanced troubleshooting of it. We’re gonna be taking advantage of, I can almost guarantee you, a couple of services: CloudWatch and CloudTrail. And yes, there are going to be services just like these inside of the other public clouds, and I’m just remembering the names now. It’s so funny, right? When you jump between the different cloud providers as much as I do, you end up really just having to do kind of a vocabulary brush-up on what things are called.
So, the two services here, CloudWatch and CloudTrail, have equivalents in the other public clouds. These are not approaches that are exclusive to AWS. But sure enough, what CloudWatch is doing is providing you with metrics, and their values, for your cloud solution. There are plenty that are always free, and you can turn on more detailed metrics if you want to, though you might end up spending a little bit of money for that privilege. So, CloudWatch is something that we would most definitely incorporate in our advanced troubleshooting efforts in something like AWS. And what would CloudTrail be doing? Well, it would be recording the API calls that are made against the solution. So, this is really something that we would use for auditing purposes. And you can already see why I love tools like this, tools that are tracking everything that happens and giving us glimpses into the various metrics, the various health parameters, of our cloud solutions.
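To make that "who did what and when" idea a little more concrete, here is a minimal sketch in plain Python. It assumes you have already exported some audit records and loaded them as dictionaries. The field names (eventTime, eventSource, eventName, userIdentity, errorCode) follow the CloudTrail record format, but the sample records, the account ID, and the user names are made up for illustration.

```python
from collections import Counter

# Hypothetical sample records, shaped like CloudTrail event records.
SAMPLE_EVENTS = [
    {"eventTime": "2024-01-05T10:05:00Z", "eventSource": "ec2.amazonaws.com",
     "eventName": "TerminateInstances", "errorCode": "UnauthorizedOperation",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"}},
    {"eventTime": "2024-01-05T10:00:00Z", "eventSource": "ec2.amazonaws.com",
     "eventName": "RunInstances",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"}},
    {"eventTime": "2024-01-05T10:07:00Z", "eventSource": "ec2.amazonaws.com",
     "eventName": "DescribeInstances",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"}},
]


def who_did_what(events):
    """Return (time, caller, action, error) rows, oldest first."""
    rows = []
    for event in events:
        caller = event.get("userIdentity", {}).get("arn", "unknown")
        action = f'{event["eventSource"]}:{event["eventName"]}'
        rows.append((event["eventTime"], caller, action, event.get("errorCode")))
    # ISO 8601 UTC timestamps sort correctly as plain strings.
    return sorted(rows, key=lambda row: row[0])


def failed_calls_by_caller(events):
    """Count the calls that came back with an error, per caller."""
    return Counter(
        event.get("userIdentity", {}).get("arn", "unknown")
        for event in events
        if "errorCode" in event
    )


if __name__ == "__main__":
    for row in who_did_what(SAMPLE_EVENTS):
        print(row)
    print(failed_calls_by_caller(SAMPLE_EVENTS))
```

A caller who keeps racking up failed calls is often the first thread to pull on when you are troubleshooting a permissions problem.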
And notice it’s not just so that we can be more reactive when something goes wrong. I like tools like this because they help us to be more proactive. That’s right. We wanna be finding issues before they become full-blown trouble tickets. And if you are doing your due diligence and paying attention to that rich CloudWatch information and that rich CloudTrail information, you can get into a situation where you start proactively finding problems before they become full-blown trouble tickets. So, in more advanced cloud troubleshooting scenarios, we will be turning to the tools that are built into whatever cloud tech we’re using. In the case of AWS, it would be CloudWatch and CloudTrail, which are free services, at least to start, and which are gonna allow us to get health information and detailed audit records of who did what and when inside of our cloud solution. Thank you so much for watching.
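Here is one way to picture that proactive approach: a tiny sketch of a threshold check on a metric, in the spirit of a monitoring alarm. This is a simplified model I am assuming for illustration, not how CloudWatch itself evaluates alarms. The function name, the parameter names, and the CPU numbers are all hypothetical.

```python
def breaching_alarm(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    """Return True when enough of the most recent datapoints exceed the threshold.

    Looks at the last `evaluation_periods` datapoints and fires when at least
    `datapoints_to_alarm` of them are above `threshold`.
    """
    if evaluation_periods < 1 or datapoints_to_alarm < 1:
        raise ValueError("evaluation_periods and datapoints_to_alarm must be >= 1")
    window = datapoints[-evaluation_periods:]
    breaches = sum(1 for value in window if value > threshold)
    return breaches >= datapoints_to_alarm


if __name__ == "__main__":
    # Hypothetical CPU utilization samples (percent), oldest first.
    cpu_percent = [35.0, 42.0, 81.5, 88.0, 79.0, 91.2]
    # Fire when 3 of the last 5 samples are above 80 percent.
    print(breaching_alarm(cpu_percent, 80.0, 3, 5))
```

Requiring several breaching datapoints instead of one is a deliberate choice here: a single spike should not open a ticket, but a sustained climb is exactly what you want to catch early.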
69. TS Automation and Orchestration
Troubleshooting automation and orchestration is nowhere near as tricky as it used to be. Let me demonstrate.
I must tell you, I am somewhat jealous when I see tools like this, because back in the old days when I first started with automation and orchestration in IT, it was in a more traditional environment and we had to do all of this kind of stuff by hand. Notice I am in the CloudFormation service inside of AWS, and I’m gonna say that we would like to use a quick sample template here in CloudFormation to do something like, oh, I don’t know. How about we set up Windows Active Directory? That’s right. Have this automate the creation of a Windows Active Directory domain controller. So, I’ll call this demo_dc. The parameters. A fully qualified domain name. I’ll do lab.ajsnetworking.com. We will have a domain NetBIOS name. That will be lab_ajsnet_com. Instance type? I will say, oh my gosh, I don’t need... I’ll do a c1.medium type of instance. And a key name, we’ll go ahead and utilize a key that I know I have access to. How about my Ansible_KP? I know I have access to that. We’ll do a restore mode password, and that is good. And then a source CIDR for the RDP. I don’t wanna look up my IP address right now, so for demonstration purposes I will just say we are gonna open it up to the entire interweb. So, there you go. I’m going to click Next. Oh, I can’t use an underscore there in the stack name, no problem. So, I clean up that little issue. Say Next, and look at this.
This is what I wanted to show you from a troubleshooting perspective. Look at this. What do we want to happen should there be a failure? Do we want to roll back all the stack resources, or do we want to preserve those elements that did successfully get provisioned from the stack? Keep in mind what’s happening here. We’re clicking a button, and based on this template, it’s gonna spin up a whole bunch of resources, including the ingredients for connectivity between those resources. In our case, we are spinning up an Active Directory domain controller with all of the necessary configuration inside it, given the parameters that we have provided here.
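If you were doing this from code instead of the console, the same failure choice shows up as a field on the request. Below is a rough sketch that only builds the request dictionary, so it runs without any AWS account. The OnFailure values (ROLLBACK and DO_NOTHING) are options of the CloudFormation CreateStack API. The helper name, the template URL, and every parameter key other than DomainNetBIOSName are assumptions on my part, and the password is deliberately left as a placeholder.

```python
def build_create_stack_request(stack_name, template_url, parameters,
                               rollback_on_failure=True):
    """Build keyword arguments for a CreateStack call (hypothetical helper).

    `parameters` is a plain dict of template parameter names to values.
    """
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        # ROLLBACK removes everything on failure; DO_NOTHING keeps whatever
        # was provisioned successfully so you can inspect it.
        "OnFailure": "ROLLBACK" if rollback_on_failure else "DO_NOTHING",
    }


if __name__ == "__main__":
    request = build_create_stack_request(
        stack_name="demodc",
        template_url="https://example.com/demo-dc-template.json",  # hypothetical URL
        parameters={
            # Parameter key names are assumed; check them against your template.
            "DomainDNSName": "lab.ajsnetworking.com",
            "DomainNetBIOSName": "labajsnetcom",
            "InstanceType": "c1.medium",
            "KeyName": "Ansible_KP",
            "RestoreModePassword": "<restore-mode-password>",  # placeholder
            "SourceCidrForRDP": "0.0.0.0/0",  # open to everyone, demo only
        },
    )
    print(request["OnFailure"])
```

With an SDK such as boto3 you would hand this dictionary to the CloudFormation client's create_stack call as keyword arguments. That step is left out here so the sketch stays runnable anywhere.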
So, notice you can go in and fine-tune the rollback configuration settings and whatnot, but we will just go with the default settings here, and we will scroll down and say, ‘Go ahead and create the stack.’ Now watch what happens. It’s just amazing to me that our troubleshooting is now going to be made so incredibly simple. Notice, look at this. The NetBIOS domain name must match a certain pattern. So, I’ve got characters that are not acceptable in that either. So, now I’m gonna go in and edit the stack details. You see what’s happening here? We made a mistake in the configuration of this domain controller, and what happens? Well, that mistake gets found, gets really spelled out for us. And now all I’m going to do is… Hey, now wait a minute. What is going on? I think that change that I made did not take. Let’s see. DomainNetBIOSName: labajsnetcom. That should work just fine. So, let’s go and create that stack, please. There we go. So, not sure why. I guess what happened there was I was clicking a little bit too quickly.
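Both of the mistakes in this demo, the underscore in the stack name and the underscores in the NetBIOS name, are pattern failures, and you can catch that kind of thing yourself before you ever click Create. Here is a small sketch. The stack name rule (start with a letter, then letters, digits, and hyphens, up to 128 characters) is the documented CloudFormation rule. The NetBIOS pattern is my assumption about what the template enforces, so treat it as illustrative.

```python
import re

# Stack names: start with a letter, then letters, digits, or hyphens, max 128.
STACK_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9-]{0,127}")
# Assumed template constraint: 1 to 15 letters or digits, nothing else.
NETBIOS_NAME_RE = re.compile(r"[A-Za-z0-9]{1,15}")


def validate_inputs(stack_name, netbios_name):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if not STACK_NAME_RE.fullmatch(stack_name):
        errors.append(f"stack name {stack_name!r} has characters that are not allowed")
    if not NETBIOS_NAME_RE.fullmatch(netbios_name):
        errors.append(f"NetBIOS name {netbios_name!r} does not match the expected pattern")
    return errors


if __name__ == "__main__":
    print(validate_inputs("demo_dc", "lab_ajsnet_com"))  # the values I typed first
    print(validate_inputs("demodc", "labajsnetcom"))     # the cleaned-up values
```

The service does the same kind of check for you, as you just saw. Running it locally simply means you find out a few clicks earlier.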
So, now, think again from a troubleshooting standpoint. We have all of these tasks going on, all of these automated tasks that, thanks to the wonderful CloudFormation in AWS, quickly build for us an Active Directory domain controller in the cloud. And I can sit here and hit Refresh, and we can see what is being created and what was successful. And look at this, something crashed. Sure enough, what happened was I was not authorized to run this certain AMI, and I knew this would happen, because in order to get this template to run you have to edit it with an updated AMI. But I wanted to show you this. Remember, one of the things that we said is, from a troubleshooting perspective, if something breaks with all of these automated tasks that are being orchestrated by this wonderful CloudFormation tool, roll everything back. I love it. So, what I’m gonna do is Refresh, and let’s see what happened.
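Hitting Refresh and reading the event list is something you can also do in code. The sketch below walks a list of stack events and pulls out the first failure and its reason. The field names (Timestamp, LogicalResourceId, ResourceStatus, ResourceStatusReason) match what CloudFormation reports for stack events, but the sample events, the resource names, and the reason text are invented for illustration.

```python
# Hypothetical stack events, newest first, which is how the console lists them.
SAMPLE_STACK_EVENTS = [
    {"Timestamp": "2024-01-05T10:03:00Z", "LogicalResourceId": "demodc",
     "ResourceStatus": "ROLLBACK_IN_PROGRESS",
     "ResourceStatusReason": "The following resource(s) failed to create: [DomainController]."},
    {"Timestamp": "2024-01-05T10:02:00Z", "LogicalResourceId": "DomainController",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Not authorized for image (illustrative message)"},
    {"Timestamp": "2024-01-05T10:01:00Z", "LogicalResourceId": "DomainControllerSecurityGroup",
     "ResourceStatus": "CREATE_COMPLETE"},
    {"Timestamp": "2024-01-05T10:00:00Z", "LogicalResourceId": "demodc",
     "ResourceStatus": "CREATE_IN_PROGRESS"},
]


def first_failure(events):
    """Return (logical_id, reason) for the earliest *_FAILED event, or None."""
    for event in sorted(events, key=lambda e: e["Timestamp"]):
        if event["ResourceStatus"].endswith("_FAILED"):
            return event["LogicalResourceId"], event.get("ResourceStatusReason", "")
    return None


if __name__ == "__main__":
    print(first_failure(SAMPLE_STACK_EVENTS))
```

The earliest failure is usually the one that matters. Everything after it tends to be the rollback reacting to that first problem.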
Look at that. The rollback is complete. So, what happened? Everything was going along fine. Oh, can’t find the image to build the domain controller with. Error. We better clean everything up.
So, what is beautiful here, and what really did just happen? There’s nothing up my sleeve. You really did just see it. I had my orchestration fail with the automations it was trying to carry out. So, it cleaned up all those automations and rolled everything back. There’s nothing I have to worry about cleaning up here. Nothing that’s gonna be hanging around costing me money. These types of tools now in the cloud are making troubleshooting of automation and orchestration an absolute breeze.