1. 38. prepare an operational performance baseline – Metrics Explorer
Hello, and in this section we're going to be monitoring activity and performance, starting by preparing an operational performance baseline. One way we can do this is by using metrics. Metrics are numerical values that are collected at regular intervals; each one has a timestamp, a name, a value, and one or more defining labels. They're stored in a time series database, which is suitable for alerting and fast detection of issues. It's lightweight and allows for near-real-time alerting; by lightweight, I mean it doesn't use up a huge number of resources. So let's take a look at metrics in Metrics Explorer. I can go down to the monitoring section and take a look at Metrics. This is in a particular SQL database. Here I can select the scope; in this particular case I'm selecting a database, but you can select other things as well. You can select the metric namespace (for an Azure SQL database there is only the one), and then you can select the metric.
So I'm going to have a look at the data space used, but notice there are quite a number of other ones. Then I can select the aggregation: Min, Max or Average, and for other metrics it could be Sum. In a couple of videos' time we'll have a look and see what other metrics are available. So we can see the metric, data space used, here. At the moment it covers the last 24 hours. We can change that, anywhere from "last 30 minutes" all the way up to "last 30 days", and there is also a custom period. We can also show the time in local time, Greenwich Mean Time, or UTC, Coordinated Universal Time. So let's say 30 days. Here I've got the last 30 days, and 30 days is the maximum window you can see at once. However, we can use these arrows to go backwards and forwards, and we can go up to 93 days into the past if we so choose. So I'm going to change this back so that it is the last 12 hours. Notice what's happened: the start date and end date have now become custom. Then there's the time granularity, which is how frequently it performs the aggregation; in this case, the aggregation is Max.
So does it do it every five minutes? Every 30 minutes? Every 12 hours? Every month? I'm just going to leave that on automatic, but if you've got a particular reason for changing it, say you want to look at a whole day but see the maximum per hour or per six hours, you can do that. Now what else can you do here? Well, you can change the colour of the line. You don't do that by clicking on the line itself, but on the legend; there is a line next to the legend, and if I click on that, I can change the colour of the line, and I can also edit the title. So I'm just going to put the word "database" here. I can split or filter a metric if it has a dimension; that's not applicable to Azure SQL Database, but we can add a second metric to the same chart. For example, I will add not data space used but data space allocated. As you can see, this is the amount allotted, and again, if I wanted to, I could change the colour.
I could change the chart type. At the moment it's a standard line chart; I could do an area chart, a bar chart, a scatter chart, or a grid if I so wished. There are a few more settings over here in the ellipsis (...) menu. I can move a chart up or down if I've got more than one chart, and I can add a new chart here. So maybe I'll add a new chart for the data space used percentage, and then I could say, "no, I want to move this bottom one up", and I can do that. I can delete a chart, and I can clone it if I want to make changes, especially on a temporary basis. I can have a look at the chart settings as well, which brings together in a panel a good number of the things we've previously looked at, and where we can also change the Y axis. And I can share this by downloading it to Excel or copying a link. So hopefully you can see from that how you can use this to create an operational performance baseline. In the next videos, we'll take a look at some other ways we can do this.
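By the way, if you prefer T-SQL to the portal, you can pull a similar baseline from the sys.resource_stats catalog view, which keeps around 14 days of history in five-minute chunks and is queried from the master database of your logical server. Here's a minimal sketch; the database name is a placeholder:

    -- Run in the master database of the logical server.
    -- sys.resource_stats keeps roughly 14 days of 5-minute aggregates.
    SELECT start_time,
           end_time,
           avg_cpu_percent,
           avg_data_io_percent,
           avg_log_write_percent,
           storage_in_megabytes,             -- roughly "data space used"
           allocated_storage_in_megabytes    -- roughly "data space allocated"
    FROM   sys.resource_stats
    WHERE  database_name = 'MyDatabase'      -- placeholder
    AND    start_time >= DATEADD(day, -14, GETUTCDATE())
    ORDER BY start_time;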
2. 77. create event notifications for Azure resources
So now you've got this operational baseline. How can you be notified when things change? Well, we can create an alert. Suppose we have this data space metric and I want to generate an alert from it. I can go, in Metrics, to "New alert rule", and that will take me to the alert page. Now I should point out that if I go back to my database and go to Alerts, which is also in the monitoring section, I can click on "New alert rule" there as well. Coming from the metric, though, a condition has already been started for me from my current metric, so I don't really have to create the condition from scratch.
It says "whenever the maximum data space used is greater than...", and I just have to fill in the "greater than" value. If I hadn't started from the metric, then I would have to start from scratch and choose a particular signal type: metrics or activity logs. An activity log signal is useful when you want to be notified when, for instance, restore points are created. Here, for instance, I can pick a platform metric such as data space allocated, and that gets me back to the same sort of condition page. So we've got this metric, data space used, in this case. Okay, but when do I want to be notified? Well, I've got this threshold, and it can be static or dynamic. If it's static, then it's based on a single figure that you provide. So for instance, the operator could be "greater than" with the aggregation "maximum", and then you choose the units: bytes, kilobytes, megabytes, gigabytes, or terabytes. Well, it's currently at between 28 and 31 megabytes. So let's say if it goes above 40 megabytes; for some reason I can't actually click on "megabytes", so I'll just have to type the value in.
So you can see that is my threshold up there, and that is where I am currently. And as you can see in the English description, "whenever the maximum data space used is greater than...". I must also choose an aggregation granularity: how often is this maximum going to be checked? Or if it was an average, over what period is it averaged: every minute, every five minutes, every 15 minutes? In other words, how frequently are measures grouped together? So let's say we group them together every 30 minutes. And then, how often are they going to be checked? Let's say check every hour, and you can see that doesn't work; if they're grouped every 30 minutes, then we probably need to check every 30 minutes or so. So that is an example of a static threshold, where I put the actual value in.
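To make the static-threshold logic concrete, here's a hedged T-SQL sketch that performs the same comparison by hand: it estimates the current data space used (reserved pages at 8 KB each) and checks it against the 40 megabyte figure from the example. The alert rule simply automates this sort of test:

    -- Estimate data space used as reserved pages * 8 KB, in megabytes,
    -- and compare it against the static threshold from the example.
    DECLARE @threshold_mb decimal(10, 2) = 40;

    SELECT SUM(reserved_page_count) * 8.0 / 1024 AS used_mb,
           CASE WHEN SUM(reserved_page_count) * 8.0 / 1024 > @threshold_mb
                THEN 'Above threshold: alert would fire'
                ELSE 'Within threshold'
           END AS alert_state
    FROM   sys.dm_db_partition_stats;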
Now let's change that to dynamic. Dynamic lets the computer decide: the threshold is calculated using historical data, and it can take account of things like seasonality; maybe you need more space during a particular sales period. You can select the operator: it's either going to be higher than the normal range or lower than the normal range. So we're not talking about a single figure; we're talking about a range. We can change the aggregation, for example average or maximum, and then the threshold sensitivity. This specifies how frequently you will be alerted, based on the size of the range. If the threshold sensitivity is high, the range is reduced, so you will get more alerts; if it's low, then the range is expanded, so you'll get fewer alerts. Medium is the default. So if you're getting too many alerts, change it from high to medium or from medium to low. Now, there are advanced settings. You can say how many violations are needed to trigger the alert. Maybe you've got a metric for your DTU and it reaches 100%. Is that something you want to be notified of every single time?
Or would it be that you want to be notified only if it happens four times over two hours, in other words, in every 30-minute period for two hours? If you're using the maximum, the metric just needs to breach at one point in each 30-minute period; if you're using the average, then it wouldn't need to be at 100% throughout. And with a dynamic threshold, the minimum, maximum or average would need to be outside of the range at some point in each 30-minute period, and that would need to happen every 30 minutes for two hours, because that's four violations. Alternatively, I could say, "you've got four time periods; let me know if it happens in two of those time periods." And I can also say to ignore data before a certain point. So there is my condition. You can see it's going to cost around 20 cents per month, so it's not free, but it's not too expensive either; it automates the checking away from you, and you get notified if something happens, so you don't have to check every single day. Now, turning to the action settings, you can use an existing action group or you can click to create a new action group. So, a new action group; here we go, "Create an action group". I put in the action group name, so "my action group", and the display name, which is limited to just twelve characters. And here is the main part of this: the notifications. You can select email, SMS (text message), push, and voice.
And here you can see the configuration. I need to provide a unique name, so "send me a text message", and you can enter the country code, so I'll choose 34, and put in my phone number. Alternatively, I could enter an email address, metrics at something. In addition to this, there's also the option to email an Azure Resource Manager role; I just deleted that, I don't want that. And we've also got actions. This is where you can integrate the alert with other parts of Azure: you can select an automation runbook, an Azure function, an ITSM ticket, a logic app, or a webhook, whether secure or not. So I will create this, and if you don't know what any of those mean, don't worry; it's only if you're using those bits of Azure that you might say, "oh, I want to put this into my logic app", or "I want this as a webhook". And then finally, you provide some alert rule details. You put in a name, a description, a subscription, a resource group, and the severity, which goes from critical all the way down to verbose; that's from zero to four, and lower numbers are more critical. You also choose whether to enable the alert rule upon creation and whether to automatically resolve alerts. So what does that mean? Well, here is an example of a demo alert.
So this is the actual metric, and based on the history, this lighter blue is the range it could be expected to stay within. If it goes outside, as you can see here, it becomes an active alert, and later it is resolved. The alert period is shown in a different colour: while it is unresolved, the line changes from a blue dot to a red dot, and the background turns light red as well. So that's what "resolved" and "unresolved" mean. Now, there are other places you can create an alert rule from. For instance, if I was looking at logs, at a particular query, then I could also add a new alert rule from there. Whenever you do that, you specify the scope, so what you want to monitor; the condition, so specifically what you are monitoring and when to fire, remembering that you can have a dynamic as well as a static threshold; the actions to take; and some alert rule details. So this is how you can create event notifications for Azure resources.
3. 39. determine sources for performance metrics
In this video, we're going to have a look at what sources there are for metrics, specifically performance metrics, but going a bit wider as well. First of all, you've got the Azure tenant. There are some services that are tenant-wide, such as Azure Active Directory. Then you've got the subscription level. There we've got, for instance, the Azure activity log, which includes service health records and records of configuration changes. And we've also got Azure Service Health, which has, as the name suggests, information about the health of your Azure services.
Then we have your resources. Most Azure resources submit platform metrics to the metrics database, and resource logs are created about the internal operation of an Azure resource. So in SQL Database, for instance, the reason we have all of these metrics and logs is that they are being generated and sent automatically. If you have a guest operating system, whether in Azure, in another cloud, or on-premises, there will be some sources there as well. There is the Azure Diagnostics extension for Azure virtual machines; when enabled, it submits logs and metrics. We've also got the Log Analytics agents, which can be installed on your Windows or Linux virtual machines, whether running in Azure, in another cloud, or on-premises.
And there's also something called VM insights; it's in preview at the moment. It provides additional Azure Monitor functionality on Windows and Linux VMs. Then there are sources in the application code: you can enable Application Insights to collect metrics and logs relating to the performance and operation of the app. You can have monitoring solutions and insights, which provide additional insight into a particular service or app. If you have containers, then Container insights provides data about the Azure Kubernetes Service, and as I've already said, VM insights allows for monitoring of VMs. You can also have a look at the Windows Performance Monitor, also called PerfMon, and there are specific counters available for SQL Server. In the previous video, we took a look at the metrics.
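Incidentally, those SQL Server Performance Monitor counters can also be read with T-SQL, through the sys.dm_os_performance_counters DMV (on SQL Server and managed instance). A small sketch; the two counter names are just examples:

    -- SQL Server exposes its PerfMon counters through a DMV as well.
    SELECT object_name,
           counter_name,
           instance_name,
           cntr_value
    FROM   sys.dm_os_performance_counters
    WHERE  counter_name LIKE 'Page life expectancy%'
       OR  counter_name LIKE 'Batch Requests/sec%';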
Now, available metrics for a managed instance include the average CPU percentage over a specified time period, I/O bytes read or written, I/O request counts, storage space reserved and used, and the virtual core count, which can be anything from 4 to 80 vCores. Metrics available for Azure SQL Database include: blocked by firewall; deadlocks; CPU percentage; DTU limit, DTU used, and DTU percentage; failed and successful connections; and in-memory OLTP storage percentage, which is what's being used by your in-memory online transaction processing, if you've enabled it. We've also got sessions percentage and workers percentage, so the number of requests; SQL Server process core and memory percentage; tempdb data and log file size in kilobytes; and the percentage of the tempdb log used. So those are your metrics for Azure SQL Database. For managed instances, as I said, there are a few different ones, like storage space reserved and used, and the average CPU percentage at a particular time, which we've also got for Azure SQL Database. And then there are other methods of getting performance metrics from virtual machines, including the Log Analytics agents and VM insights.
4. 40. interpret performance metrics
In this video, we're going to say, "okay, we've got all of these performance metrics, but what can we do with them?" What is the reason for collecting all of these metrics? Well, it could be that you have too much happening for your particular specification; if you're on Basic, you might need to upgrade to Standard, and so forth. Alternatively, it may be that you have plenty of headroom and should consider a downgrade. So we're going to have a look at some of these particular metrics, and I'm going to do that by looking at sys.dm_db_resource_stats. So here is an example of that.
If any of the following seems to be on the high side, let's say close to 100%, then you might want to consider upgrading to the next service tier. First of all, let's take a look at the resources used: things like CPU, memory, and I/O for vCores, which can also be seen rolled up in things like the DTU percentage. If this is too high, you might want to upgrade; if it's consistently low, then you might save money by downgrading. You won't be able to use the DTU percentage if it's a database that uses the vCore model. Additionally, we've got the CPU percentage, avg_cpu_percent. When this is high, you might find that query latency increases and queries might time out; maybe you need to increase the compute size, or maybe you need to optimise queries to reduce the CPU requirements. You've also got the data I/O percentage, avg_data_io_percent, and the log I/O percentage as well, avg_log_write_percent. We've also got the in-memory OLTP storage percent, xtp_storage_percent. If you're not using in-memory OLTP, that is, memory-optimised tables, indexes, and table variables, this will return zero. If you are using them and this hits 100%, then insert, update, alter, and create operations will fail; select and delete, however, will be fine.
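Pulling those columns together, here's a minimal sketch against sys.dm_db_resource_stats, which holds around an hour of history in 15-second rows, taking the worst value of each percentage over that hour:

    -- Worst-case resource usage over the last hour (15-second samples).
    SELECT MAX(avg_cpu_percent)          AS max_cpu_percent,
           MAX(avg_data_io_percent)      AS max_data_io_percent,
           MAX(avg_log_write_percent)    AS max_log_write_percent,
           MAX(avg_memory_usage_percent) AS max_memory_percent,
           MAX(xtp_storage_percent)      AS max_inmemory_oltp_percent,
           MAX(max_worker_percent)       AS max_worker_percent,
           MAX(max_session_percent)      AS max_session_percent
    FROM   sys.dm_db_resource_stats;
    -- Values close to 100% suggest upgrading the tier or tuning queries;
    -- values consistently low suggest you could downgrade and save money.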
Now, the data space used percentage: if this is getting too high, you might want to consider upgrading to the next service tier, shrinking the database, or scaling out using sharding. If it's in an elastic pool, then you might want to consider moving the database out of the pool and giving it its own resources. So that is the data space used percent. If we look at the average memory usage percent, this is memory used for caching. If you get out-of-memory errors, then you might want to increase the service tier or compute size, or optimise the queries. Next, we'll look at the connections requested and used, which we see in the sessions percentage and the workers percentage: max_session_percent and max_worker_percent. The max session percent is the maximum number of concurrent sessions divided by the service tier limit, and the max worker percent is the maximum number of concurrent requests divided by the service tier limit. If either gets close to 100, then you might want to increase the service tier or compute size, or optimise queries. So which queries might need to be optimised? Well, if we go into the Intelligent Performance section here, and Query Performance Insight, what we can do is take a look at the top five queries by CPU, data I/O, and log I/O. If I scroll down to the bottom, you'll see these are the top queries, and here, if I click on this query, you can see what it actually is; in this case, it's a system query. Going back, we can also have a look at long-running queries, in other words, which ones run the longest. And we can do a custom view, so we can have a metric of CPU, data I/O, log I/O, duration, or execution count.
We can select any of those within a particular time period, we can choose the number of queries and the aggregation, and we can also get some recommendations. You may have seen some icons earlier that weren't actually on my version of the page, but in your version you may see icons showing performance recommendations. So if there are individual queries that are using a huge amount of resources, then this is a good place to find out which they are. There are other places too, and we'll take a look at the Query Store in later videos. So this is how you can interpret performance metrics: we have the DTU percentage, CPU percentage, data I/O percentage, log I/O percentage, in-memory OLTP storage percentage, average memory usage percentage, sessions percentage, and workers percentage. If any of those get close to 100%, then you might want to consider upgrading. If they are all consistently low, then you might have a higher specification than you need and be able to save some money by downgrading. So this is how we can interpret performance metrics.
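Since Query Performance Insight is built on top of the Query Store, you can approximate its top-queries-by-CPU view directly in T-SQL. A hedged sketch (CPU time here is in microseconds):

    -- Approximate "top 5 queries by CPU" using the Query Store views.
    SELECT TOP (5)
           qt.query_sql_text,
           SUM(rs.count_executions)                   AS executions,
           SUM(rs.avg_cpu_time * rs.count_executions) AS total_cpu_time_us
    FROM   sys.query_store_runtime_stats AS rs
    JOIN   sys.query_store_plan          AS p  ON p.plan_id = rs.plan_id
    JOIN   sys.query_store_query         AS q  ON q.query_id = p.query_id
    JOIN   sys.query_store_query_text    AS qt ON qt.query_text_id = q.query_text_id
    GROUP BY qt.query_sql_text
    ORDER BY total_cpu_time_us DESC;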
5. 41. assess database performance by using Intelligent Insights
In this video, we're going to assess database performance by using Intelligent Insights for Azure SQL Database and managed instances. What this does is compare the current database workload for the last hour with the workload over the last seven days, for example the most repeated and largest queries. It uses data from the Query Store, which we'll be talking about in later videos; that is enabled by default in Azure SQL Database, but you need to enable it for an Azure SQL managed instance. It uses artificial intelligence to monitor operational thresholds, and it detects issues with high wait times, critical exceptions, and query parameterisation. Impacted metrics include an increase in query duration, excessive waiting, and timed-out or errored-out requests.
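As a hedged sketch of that enabling step, assuming a database name of MyDatabase, switching the Query Store on and confirming its state looks like this:

    -- Query Store is on by default in Azure SQL Database; on a managed
    -- instance (or SQL Server) enable it per database.
    ALTER DATABASE [MyDatabase]
    SET QUERY_STORE = ON (OPERATION_MODE = READ_WRITE);

    -- Confirm it is collecting data (run inside the database):
    SELECT actual_state_desc, desired_state_desc
    FROM   sys.database_query_store_options;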
Just to note, it's not available in some regions: it's not available in West Europe, North Europe, West US 1, or East US 1. That doesn't mean it's not available in the US; there are other US regions, and similarly for Europe. It's also not available for SQL Server on virtual machines. Now, I'm going to scroll down on the left-hand side to Intelligent Performance, where we can take a look at some of the intelligent insights by looking at Query Performance Insight, which we had a look at in the previous video. We can also, in the diagnostic settings, add additional diagnostic logs that we might want to capture, for example SQLInsights, automatic tuning (which we'll have a look at in later videos), and timeouts. And we can send them to a Log Analytics workspace, archive them in a storage account, or stream them to an event hub.
So those are the three destinations we had previously, and we can also send them to a partner solution. This is my diagnostic setting, and I'm going to send the logs to the Log Analytics workspace I created earlier, so I'll save that. Now, Intelligent Insights looks for things that could affect the database's performance. For example, reaching resource limits: CPU reaching resource limits for managed instances, or DTUs, worker threads, or login sessions reaching resource limits for Azure SQL Database. There's an increase in workload, and memory pressure, that is, workers waiting for memory allocations. There's data locking, and whether you need to increase the maximum degree of parallelism, MAXDOP, if queries use more parallel workers than they should. There are missing indexes, new queries affecting performance, increased wait statistics, multiple threads using the same tempdb resource, a new plan or a change to an existing plan, and a downgrade of your pricing tier. And for the DTU model, if you're in an elastic pool, there may be a shortage of available eDTUs. If you have any of those, you may get a recommendation in Performance Recommendations. So Intelligent Insights uses artificial intelligence to compare what's currently happening with the last seven days, monitoring operational thresholds and detecting issues with higher wait times, critical exceptions, and query parameterisation.
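One of the items in that list, the maximum degree of parallelism, can be inspected and changed per database with a database-scoped configuration. A small sketch; the value 4 is only an example, not a recommendation:

    -- Check the current database-scoped MAXDOP setting...
    SELECT name, value
    FROM   sys.database_scoped_configurations
    WHERE  name = 'MAXDOP';

    -- ...and adjust it if parallelism pressure is indicated.
    ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;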
6. 42. configure and monitor activity and performance
In this video, we're just going to wrap up this section by looking at how to configure and monitor activity and performance at the infrastructure, server, service, and database levels. We've already had a really good look at how we can do this for SQL databases: we can have a metric, we can create an alert on the metric, we can look at logs, and we can query the logs using a language called Kusto. We can have a look at the performance overview and performance recommendations, and we can also have a look at Query Performance Insight. We can also monitor using the dynamic management views we discussed earlier. But I just want to complete this section by mentioning a few other things you can use. First of all, we've got extended events. This is a lightweight monitoring system, so it uses very little in terms of performance, and it allows you to create, modify, display, and analyse your session data. You can use it for trend analysis and ad hoc monitoring; there's a graphical interface, and you can also use it within your own custom applications.
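As a quick hedged sketch of what an extended events session looks like in T-SQL, this one captures batches running longer than a second; the session name and filter value are just examples. In Azure SQL Database, sessions are scoped ON DATABASE; on SQL Server or a managed instance you would use ON SERVER:

    -- A lightweight session capturing batches that take over 1 second.
    CREATE EVENT SESSION [LongRunningBatches] ON DATABASE
    ADD EVENT sqlserver.sql_batch_completed (
        ACTION (sqlserver.sql_text)
        WHERE duration > 1000000          -- duration is in microseconds
    )
    ADD TARGET package0.ring_buffer;

    ALTER EVENT SESSION [LongRunningBatches] ON DATABASE STATE = START;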
Next, SQL Server Profiler. This tracks process events, such as the start of a batch or transaction, so it allows you to monitor server and database activity. You can save the data to a SQL Server table or a file for later analysis, and you can also replay the events step by step to see exactly what happened. You can do that on just one computer; if you want to replay from multiple computers, then something called Distributed Replay should be used instead. That allows you to use multiple computers to replay trace data, simulating a mission-critical workload. Then there's System Monitor. This primarily tracks resource usage, such as the number of buffer manager page requests in use, so it enables you to monitor server performance and activity using objects and counters to track events. System Monitor, which is also called Performance Monitor, collects counts and rates rather than data about the events themselves. The primary distinction between SQL Server Profiler and System Monitor is that Profiler examines database engine events, while System Monitor monitors resource usage. The Activity Monitor in SSMS is useful for ad hoc views of current activity, and graphically displays information about processes running on an instance of SQL Server; you can also look at blocked processes, locks, and user activity. We've also got some T-SQL for monitoring; we've had a fair look at T-SQL already, but in a moment we'll look at a few more procedures. We've got error logs: the Windows application event log provides a picture of events on your Windows operating system, so that's for VMs. And we've got the performance dashboard in SSMS, which can help you identify whether there's a current performance bottleneck in SQL Server. So if you encounter any of these terms, you'll know roughly what they're for: for replaying captured events, SQL Server Profiler runs on a single computer, and Distributed Replay replays from multiple machines; for trend analysis, you can use extended events, SQL Server Profiler, and System Monitor. So here are four commands in T-SQL that might be of interest. sp_who provides snapshot information about current SQL Server users and processes, including whether a process is blocked.
sp_lock gives information about locks, including the object ID and the index ID, the type of lock, and so forth. sp_spaceused displays an estimate of the current amount of disc space used by a table or the entire database. And the last one, sp_monitor, doesn't actually work in Azure SQL Database, but it displays statistics including CPU usage, I/O usage, and the amount of idle time since it was last executed. So if I go into my on-premises database and run that here, you can see the output. So in this video, we've had a look at some terminology that you might encounter when you configure and monitor activity and performance at the infrastructure, server, service, and database levels, including extended events, SQL Server Profiler, Distributed Replay, System Monitor (Performance Monitor), Activity Monitor, T-SQL, error logs, and the performance dashboard in SSMS. Now, you might want to know that none of these are asked for by name in the DP-300, so that's why we're not spending a number of videos looking at how to configure each one. But it's useful to know, if you come across the terminology, what each one can do.
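For reference, all four are run like any other stored procedure; a quick sketch, remembering that the last one won't work in Azure SQL Database:

    EXEC sp_who;        -- current users and processes, including blocking
    EXEC sp_lock;       -- current locks: object ID, index ID, lock type
    EXEC sp_spaceused;  -- space used by the database (or pass a table name)
    EXEC sp_monitor;    -- CPU, I/O and idle time since last run (not Azure SQL DB)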