
Nerdio Fundamentals: Performance Monitoring

July 06, 2019 | Videos

Joseph Landes
… In this session, we are going to dive deep into performance monitoring. We will look at how performance monitoring works in Microsoft Azure, and also at how it is implemented in Nerdio for Azure with desktop visual synthetic monitoring. Enjoy the session.

Joseph Landes
All right, today we’re going to talk about performance monitoring, and we’re going to look at it both from an Azure perspective and through the NAP, which does visual synthetic monitoring of desktop workloads. So, let’s start with Azure performance monitoring. Microsoft has recently enhanced Azure monitoring to include some really interesting metrics that are very useful in troubleshooting performance, even retroactively.

Joseph Landes
Let me walk you through what’s available. If you go into the virtual machine view and open up any virtual machine, you can click on ‘metrics’ under ‘monitoring’ and you will see a screen like this. This screen allows you to select multiple different metrics, graph them, select the timeframe, et cetera. Let’s go through a couple of examples.

Joseph Landes
We’re currently looking at RDSH01 in a demo account. There’s not a lot of usage on it, but we’ll use it as an example. Next, we select a metric namespace, which in this case is the virtual machine. Then we look at the metrics. An important thing to note here: some metrics will have nothing in parentheses next to them, and some will be marked deprecated or preview. The deprecated metrics are obviously the ones that are going away; the preview metrics are the ones that are coming in.

Joseph Landes
There’s going to be some interesting preview metrics that we’re going to look at. But first, let’s start with the simplest possible metric: Percentage CPU. We’re all familiar with CPU percentage, so we can put that on the chart. We can select average, which is the aggregation algorithm. What this tells you is that for every sample data point on here, it looks at the CPU utilization and averages it out over the sampling period.

Joseph Landes
The timeframe is listed here, and you can select different timeframes and different granularities. Let’s see what it looks like if we change it to one minute, so we have more frequent data points. But the thing I want to point out is this aggregation: average is what we’re all used to looking at. However, when you’re looking back in time retroactively and trying to pinpoint certain constraints, what you really care about is the peaks, not the average.

Joseph Landes
If you have a peak that maxes out the CPU for a very short amount of time and it gets averaged out with a few data points around it that are not maxing out the CPU, it’s going to skew the result, and everything is going to look normal even though there may be very high, persistent peaks causing performance issues. What’s really interesting is that you have this max aggregation here. By setting the aggregation to max, you’re actually looking at the peaks the CPU is hitting. Right now, we’re looking at the 24-hour interval. Let’s say we’re going to look at the last 30 days. With the last 30 days, it gives you a limit, because we have the granularity set to one minute, which is too frequent for that range.
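
As a rough illustration, here is a minimal sketch of pulling the same Percentage CPU metric programmatically with the azure-monitor-query Python SDK, requesting both average and maximum aggregations; the resource ID is a placeholder for a VM like RDSH01.

```python
# A sketch, not production code: query Percentage CPU for a VM with both
# average and maximum aggregations so short peaks are not smoothed away.
# Requires: pip install azure-identity azure-monitor-query
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

# Placeholder resource ID for a VM such as RDSH01.
VM_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Compute/virtualMachines/RDSH01"
)

client = MetricsQueryClient(DefaultAzureCredential())
response = client.query_resource(
    VM_ID,
    metric_names=["Percentage CPU"],
    timespan=timedelta(days=1),        # last 24 hours
    granularity=timedelta(minutes=1),  # one data point per minute
    aggregations=[MetricAggregationType.AVERAGE, MetricAggregationType.MAXIMUM],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            # 'maximum' surfaces the short peaks that 'average' smooths out.
            print(point.timestamp, "avg:", point.average, "max:", point.maximum)
```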

Joseph Landes
Let’s see if we can do five minutes. We can. All right, good. Basically, what max is showing me is where all the peaks are. Obviously, this machine looks fine. There are occasional peaks. If we investigate, we’ll probably find that this is either user logins in the morning or maybe updates running in the afternoon, things like that. It’s good to know that you have a metric for CPU usage, that you can look at peaks by selecting max, and that you can define your timeframe and time granularity. All of these are pretty powerful.

Joseph Landes
You can also have different types of charts. You could have an area chart, for example, if you’re plotting two different metrics, or the same metric from two different hosts. You can have RDSH01 and RDSH02, or RDSH01 and DC01, overlaid on the same graph. If you see a spike in one, you can readily see if there was a corresponding spike in the other and find correlations visually. There are various other visual representations you can modify as well.

Joseph Landes
What I want to point out is a couple of other metrics we can add. Let’s go ahead and add another metric for the same host. We looked at CPU, which is a common source of performance issues; the other common source of performance issues is disk I/O. The metric that summarizes disk I/O pretty well is known as queue depth. Queue depth is now a metric that’s available here. There is OS Disk Queue Depth, and there’s also Data Disk Queue Depth. RDSH01 doesn’t have data disks; it only has the OS disk, which is the C: drive.

Joseph Landes
Let’s select this metric and again look at max, and let’s get rid of the CPU so we can see this at a scale that makes sense. You can see what the maximum queue depth on that particular VM is, and this VM obviously looks very healthy over the last 24 hours. The queue depth is pretty shallow. There was one peak to 1.2, but even that is really not a big deal.
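
A similar sketch, assuming the same placeholder resource ID as before, lists the metric definitions the VM exposes so you can find the exact queue depth metric names (which may differ while in preview), and then queries OS Disk Queue Depth with the max aggregation.

```python
# Sketch: discover the queue depth metrics a VM exposes, then query
# "OS Disk Queue Depth" (the exact metric name may vary while in preview)
# with the max aggregation, as in the portal example above.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

VM_ID = ".../virtualMachines/RDSH01"  # placeholder resource ID, as before

client = MetricsQueryClient(DefaultAzureCredential())

for definition in client.list_metric_definitions(VM_ID):
    if "Queue Depth" in (definition.name or ""):
        print(definition.name)  # e.g. OS Disk Queue Depth, Data Disk Queue Depth

response = client.query_resource(
    VM_ID,
    metric_names=["OS Disk Queue Depth"],
    timespan=timedelta(days=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.MAXIMUM],
)
```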

Joseph Landes
If you had a machine that was struggling with I/O on the disk side, what you would see is the queue depth building up. Queue depth is basically a measure of how many operations are pending to be processed by the disk subsystem. As the disk subsystem becomes a bottleneck, that queue starts building up. The operating system is basically buffering all of those requests and storing them in a queue; then, as the disk becomes available, they get processed. When the disk is really busy, the queue is going to start building up and you’ll see consistently elevated numbers, at least in the single digits: two, three, four, five.

Joseph Landes
Certainly, if you get into the teens or hundreds, then you have pretty terrible disk performance, because the queue is building up that high. This is a really valuable metric to look at if you suspect storage is the issue. Let’s look at it on a 30-day basis … It’s a lot of data, so it has to pull it all in and process it, but here we go. All right, we have the disk queue depth, and you can see we’re again looking at the max of it, not the average.

Joseph Landes
Let’s see what the difference is going to be if we do average. Actually, we’ll switch in a second. What you can see is that there were some times when the disk queue depth jumped up. This one, for example, is pretty high. If this is just a single data point, it may be nothing, but if you look at the exact timeframe of this data point and there were corresponding user performance complaints, you can correlate that pretty well. Again, this looks fairly healthy overall. There have been some spikes. However, let’s see what the average would look like on the same metric.

Joseph Landes
Right now, our maximum is about 10 and we have a handful of points reaching four or five, so I would expect the average to smooth a lot of these out. They are lower, yes. You can see this big one is much smaller, and overall things look a lot better, even though there are peaks during which performance is affected. Again, these are important metrics to look at.
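
A toy calculation shows why: if six one-minute samples containing one brief spike are rolled up into a single bucket, the average looks nearly normal while the max preserves the spike.

```python
# Toy numbers: six one-minute CPU samples with one brief spike to 100%.
samples = [5, 6, 100, 7, 5, 6]

# Rolled up into a single five-minute-style bucket:
print("average:", sum(samples) / len(samples))  # 21.5 -- looks almost normal
print("max:    ", max(samples))                 # 100  -- the spike is preserved
```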

Joseph Landes
One other thing to point out is that there is now alerting you can do in Azure. If you click on ‘new alert rule’, you can define the resource, the condition, and the action group and action that get executed. What this is useful for is setting alarms for performance metrics exceeding thresholds you predefine: maybe the queue depth exceeding five, or CPU utilization exceeding 95, or whatever metrics are relevant. That way, a proactive alert goes to a group of users who can then try to remediate the issue, or at least be aware that it’s happening and how frequently, and adjust the infrastructure.
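
For reference, here is a hedged sketch of creating an equivalent alert rule with the azure-mgmt-monitor Python SDK; the subscription, resource group, VM, and action group IDs are placeholders, and the criteria mirror the CPU-over-95 example.

```python
# A hedged sketch using the azure-mgmt-monitor SDK: alert when the max
# CPU of RDSH01 exceeds 95%, notifying an existing action group.
# Requires: pip install azure-identity azure-mgmt-monitor
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertAction,
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
)

SUBSCRIPTION_ID = "<sub-id>"                     # placeholders throughout
RESOURCE_GROUP = "<rg>"
VM_ID = ".../virtualMachines/RDSH01"
ACTION_GROUP_ID = ".../actionGroups/<ops-team>"

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

alert = MetricAlertResource(
    location="global",
    severity=2,
    enabled=True,
    scopes=[VM_ID],
    evaluation_frequency="PT5M",  # evaluate every 5 minutes
    window_size="PT15M",          # over a 15-minute window
    criteria=MetricAlertSingleResourceMultipleMetricCriteria(
        all_of=[
            MetricCriteria(
                name="HighCpu",
                metric_name="Percentage CPU",
                time_aggregation="Maximum",  # alert on peaks, not averages
                operator="GreaterThan",
                threshold=95,
            )
        ]
    ),
    actions=[MetricAlertAction(action_group_id=ACTION_GROUP_ID)],
)

client.metric_alerts.create_or_update(RESOURCE_GROUP, "rdsh01-cpu-over-95", alert)
```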

Joseph Landes
The important thing to keep in mind is that there is an associated cost with configuring proactive alerts. It looks like it’s probably about 10 cents per month, per alert. If you have a hundred alerts, you’re going to pay about $10.00 per month. So there is a cost associated with alerts, but there is no cost associated with the type of monitoring I was showing you on the previous screen. That’s the difference: Azure monitoring, using various metrics, is free, while Azure alerting, based on those same metrics, has an associated cost.

Joseph Landes
All right. The next thing I want to talk about is the desktop visual synthetic monitoring that the NAP does. To introduce the concept, let me define what we mean by visual synthetic monitoring. Visual synthetic monitoring is a process that simulates a user action or activity and then measures the time to perform various predefined tests or actions that a user may be taking. Instead of looking at the system back end, measuring things like CPU, RAM usage, disk queue, et cetera, and trying to extrapolate how that impacts the user experience, visual synthetic monitoring comes at it from the user’s perspective: it actually looks at the screen and measures the amount of time a particular action takes.

Joseph Landes
This is built into the NAP, and it’s implemented in the following way. Let’s start from the top. It’s under the ‘optimize’ menu, under performance monitoring, and it’s turned off by default. When you turn it on, a new VM gets provisioned, called a probe VM. It’s a small virtual machine that sits in the environment and, as you’ll see in a minute, executes these tests on a regular basis and reports the results back to the NAP.

Joseph Landes
Here’s the VM that gets created. It’s called perfprobe01, for performance probe 01. It is a very small, single-core, two-gig VM. It doesn’t really have access to anything on the network; it sits on the LAN, and all it does is run scheduled tasks that execute various actions as if it were a user logging in. How does that work? This is the VM itself. It’s set up from a predefined template with a predefined resolution, and it literally looks at the screen to see what happens.

Joseph Landes
When the VM boots up, and every hour after that, it runs a script. The script looks in a folder called ‘targets’, which has a listing of RDP files for the various desktop sources. If you look here, you’ll see we have RDSH01, which is a possible desktop source. We also have a collection, and we don’t have any users with VDI desktops. For every VDI desktop and every RDS session host, there will be an RDP file in this folder.

Joseph Landes
Just for the sake of time, let’s narrow it down to just one file. These get automatically created and removed as desktops are assigned to and unassigned from users. What happens is, it runs a script that looks like this. I’m actually going to let this run, so you can see what happens. It runs this operation, which logs in to each and every desktop VM, whether it’s a session host or an individual VDI machine. Then it goes through a set of tasks.
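
Conceptually, the loop looks something like the following sketch; the folder path, delay, and helpers are hypothetical stand-ins, not Nerdio’s actual script.

```python
# Hypothetical sketch of the hourly probe loop; the folder path, delay,
# and test battery are stand-ins, not Nerdio's actual implementation.
import subprocess
import time
from pathlib import Path

TARGETS_DIR = Path(r"C:\PerfProbe\targets")  # assumed location of 'targets'

def run_test_battery() -> dict:
    """Placeholder for the Outlook/Word/Excel/PowerPoint/PDF/browser tests;
    it would return the measured duration of each test in seconds."""
    return {}

for rdp_file in sorted(TARGETS_DIR.glob("*.rdp")):
    # One file per desktop source: a session host (RDSH01.rdp) or an
    # individual VDI desktop assigned to a user.
    session = subprocess.Popen(["mstsc", str(rdp_file)])  # open the session
    time.sleep(30)            # predefined delay for the logon to complete
    results = run_test_battery()
    session.terminate()       # close the RDP session before the next target
```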

Joseph Landes
Now, I’m not touching my keyboard or mouse; I’m not really doing anything. This is the preconfigured set of tests. It starts out with an Outlook test. It launches Outlook, and if Outlook prompts for credentials or for activation, it knows how to react and activates Outlook. Then it goes through and sends a new email to itself. You can see on the bottom right that it’s currently in online mode, not cached mode.

Joseph Landes
In a second, it’s going to click the ‘new mail’ button and send itself an email. First, it empties the inbox to make sure it starts clean; then it sends the email. There are certain predefined delays to allow things to run their course. Okay, there we go.

Joseph Landes
It’s going to empty out the folder. There we go. Now it’s going to send a new email. It sends the mail to itself and gives it a subject. Okay, it’s sending the mail. Now it waits until that email shows up in the inbox. There it is. It has now measured how long it took from the time it clicked ‘new email’ to the time the email showed up, and it deleted that email. That’s the measurement for the send-email action. The next thing it’s going to do is launch a Word document with predefined content.
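
The measurement pattern for the send-email test can be sketched like this; the two callbacks are hypothetical stand-ins for the probe’s visual automation.

```python
# Sketch of the measurement pattern: click 'new email', then watch the
# screen until the message shows up, recording the elapsed time. The two
# callbacks are hypothetical stand-ins for the probe's visual automation.
import time

POLL_INTERVAL = 0.5  # seconds between screen checks
TIMEOUT = 120        # give up after two minutes

def measure_send_email(click_new_email, message_visible_in_inbox) -> float:
    """Return seconds from clicking 'new email' to the mail appearing."""
    start = time.monotonic()
    click_new_email()                      # simulated click
    while not message_visible_in_inbox():  # visual check of the screen
        if time.monotonic() - start > TIMEOUT:
            raise TimeoutError("email never appeared in the inbox")
        time.sleep(POLL_INTERVAL)
    return time.monotonic() - start
```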

Joseph Landes
It’s going to look at this screen and identify certain elements. Once it sees them, it says, “Okay, that document must be on the screen.” Now it’s going to do the same thing for Excel: launch a predefined spreadsheet, look for the screen elements it’s expecting, identify them, and move on to the next test. It tested PDF, it tested PowerPoint, it tested browsing the internet with IE and Chrome, and at the end it showed the results on the screen for a few seconds before the screen disappeared.

Joseph Landes
It’s going to go through this test every hour. An hour from now, at 12:25 P.M., it’s going to do the same thing. The results get recorded in a SQL database that sits on the probe VM itself. Then that information gets pulled in on an hourly basis by the NAP. If we go back into ‘optimize’ and go to performance monitoring, you’ll see that there is a little ‘refresh’ button. It tells you the last time it pulled in the data, so I’m going to click this button to refresh.
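
As an illustration of the storage and roll-up side, here is a sketch using sqlite3 as a stand-in for the SQL database on the probe VM; the schema and names are assumptions, but it shows one row per test per hourly run, and the daily figure computed as the average of the hourlies.

```python
# Illustration only: sqlite3 standing in for the SQL database on the
# probe VM. One row per test per hourly run; the daily figure is the
# average of the hourlies, as in the NAP table.
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect("perfprobe.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS results ("
    " run_at TEXT,"    # hourly run timestamp, UTC
    " target TEXT,"    # desktop source, e.g. RDSH01
    " test TEXT,"      # login, ie, chrome, send_email, word, ...
    " seconds REAL)"   # measured duration
)

def record(target: str, test: str, seconds: float) -> None:
    run_at = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    db.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
               (run_at, target, test, seconds))
    db.commit()

record("RDSH01", "send_email", 5.0)

# Daily view = average of the hourlies over the last 24 hours.
avg = db.execute(
    "SELECT AVG(seconds) FROM results WHERE target = ? AND test = ?"
    " AND run_at >= datetime('now', '-1 day')",
    ("RDSH01", "send_email"),
).fetchone()[0]
print("24h average:", avg)
```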

Joseph Landes
It’s going to pull in the latest data, and here’s where it summarizes the results. The results are for RDSH01, which currently has the following users on it. These are the various tests we were looking at. There is a test for logging into the desktop, from the time you type in the password to the time the icons come up on the screen. There is visiting a website through IE and through Chrome.

Joseph Landes
There is sending an email, opening a Word document, opening a spreadsheet, opening a PowerPoint, and opening a PDF file. Then there are multiple data points. This is the last data point: it took 11 seconds to log in, three seconds to browse with IE, five seconds with Chrome, five seconds to send an email, two seconds, two seconds, two seconds, and one second to open a PDF, so you get the idea. Then there are the results for the day, which are the average of all the hourlies for the last 24 hours.

Joseph Landes
Then there are the results for the past week, which are the average of all the dailies over the past week. If you click on ‘details’, it will give you a more detailed breakdown, with graphs, of what the results were. You can see we ran a test in the 11:00 A.M. timeframe; since it runs every hour, there is one data point per hour. We ran this at 11, and you can see how long it took. You can also see there is a green threshold line and a yellow threshold line; I’ll explain those in a minute. All of the tests that we saw the system perform are listed here.

Joseph Landes
You can see that performance has been pretty stable for all of these. Again, the system is not under load, so that would be expected. Then you can go through and look at it over a longer interval. You can see that for the past month, whatever the reason was, we had some really high login times between January 22nd and February 6th. We could go through the system and figure out what changes were made here and there to resolve that issue. Somehow, this affected logins in a pretty significant way over that period of time, but it looks like everything else wasn’t really affected.

Joseph Landes
This is the monthly view; there is a weekly view, a daily view, et cetera. That’s the way the data is collected. The way the data is presented is in this little table with green, yellow, or red boxes. These boxes depend on preconfigured thresholds. Each box is a test of a particular action, and each test has a corresponding threshold. When you configure a threshold, you basically set below what value a result is considered green, between which values it’s considered yellow, and above what value it’s considered red.

Joseph Landes
Let’s take logging in as an example. Anything under 20 seconds is green, between 20 and 40 is yellow, and over 40 is red. That obviously can be customized. If we want to be super aggressive on our thresholds for login, let’s say anything under five seconds is green, between five and 15 is yellow, and 15 and above is red. You can see that all of these logins that are 11 seconds fall between five and 15, which is why they’ve now been marked as yellow.
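
That classification is easy to sketch; the login thresholds below are the aggressive ones from the example, and the send-email values are made up for illustration.

```python
# Sketch of the green/yellow/red classification. The login thresholds are
# the aggressive ones from the example; send_email values are made up.
THRESHOLDS = {
    # test: (green below this, yellow below this; red at or above)
    "login": (5.0, 15.0),
    "send_email": (10.0, 20.0),  # assumed, for illustration only
}

def classify(test: str, seconds: float) -> str:
    green_under, yellow_under = THRESHOLDS[test]
    if seconds < green_under:
        return "green"
    if seconds < yellow_under:
        return "yellow"
    return "red"

print(classify("login", 11.0))  # 'yellow' -- matches the 11-second logins
```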

Joseph Landes
It’s just a quick way, at a glance, to tell how the system is performing. That’s one way to get to the data: from this optimize performance monitoring view. The other way to get to it is from the user screen. For instance, let’s say we have a user called Andy IT Admin, or let’s say Adam Citron, on RDSH01, and we want to take a look at that user’s performance. We can click on ‘view performance’ right from here. This gives us that familiar table, showing the data for all the different tests, just filtered down to this one particular user. Then, if we want to see more, we can click ‘more’, and that takes us into the more detailed view that we saw previously.

Joseph Landes
The way performance monitoring works, because it’s visual synthetic, is that it’s actually executing those tasks as if a user were executing them. Sometimes it works and sometimes it doesn’t. For instance, sometimes Outlook will not open; it’ll crash. The system is pretty resilient, though: it will retry on the next hour, if monitoring is enabled. Let me just reset this back real quick. The system updates every hour and expects certain results to come in.

Joseph Landes
If results are not coming in even though this box is set to ‘on’, it will automatically restart the performance probe. It could be that it ran an update, or the VM crashed, or whatever the case is. If it sees that the performance probe is on but it’s not receiving data, it will proactively reboot the perfprobe01 VM, I believe once every day or once every three days, something like that. There’s a certain algorithm by which it tries to self-recover from an inability to receive data for one reason or another. Also, because it’s visual synthetic, the tests need maintenance as the UI of the different Office products and internet browsers changes. For example, the ‘new mail’ button recently changed in the current version of Outlook, and we actually had to adjust the Outlook test so it knows how to visually recognize the new button and can run the test. So there’s occasional maintenance to the various tests as the UI elements of the various programs change over time.
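
A hypothetical sketch of that self-recovery check might look like this; the three-hour staleness window and the restart callback are assumptions, since the exact algorithm isn’t specified.

```python
# Hypothetical sketch of the self-recovery check; the staleness window
# and restart callback are assumptions, not Nerdio's actual algorithm.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=3)  # assumed tolerance before intervening

def watchdog(monitoring_enabled: bool, last_result_at: datetime, restart_vm) -> None:
    """Reboot the probe VM if monitoring is on but results have gone stale."""
    if not monitoring_enabled:
        return
    if datetime.now(timezone.utc) - last_result_at > STALE_AFTER:
        restart_vm("perfprobe01")  # hypothetical callback to reboot the probe
```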
