If there is one trend I’ve noticed in my years as a developer and solution architect, it’s that having solid instrumentation in an application is one of the leading indicators of a successful project. Even more so for solutions that are built for and hosted in Microsoft Azure. Projects that failed or were full of “issues” almost certainly did not have good instrumentation or, at best, added it as an afterthought – as in, after problems started to emerge. In production. Definitely not the time when you want to start thinking about how to support your application.

Before we dive into the details of how diagnostics work in Microsoft Azure, let’s step back and take a quick look at some common practices for on-premises solutions. It’s fairly common to use tools like the Windows Event Viewer, IIS logs, Performance Monitor, and custom application logs (e.g. log4net, NLog) to gain insights into our applications. We can also use Remote Desktop (RDP) to access a server and manually inspect the data, or perhaps that data is collected by an agent and sent to a central reporting system.

Having a well thought-out plan for diagnostic data is important for on-premises applications, but it is arguably more important for distributed, highly scalable cloud applications. After all, we can’t walk over to that server, kick it, and hope the right bits get jostled into place.

And before we continue: you may see Microsoft Azure Diagnostics referred to by the acronym ‘WAD’, both in this article and further afield. Recall that the previous brand name of Microsoft Azure was “Windows Azure” . . . thus ‘WAD’ is short for ‘Windows Azure Diagnostics’. Naturally, it’s still the same thing.

Not so different after all

The sources of diagnostic data for Microsoft Azure Cloud Services (web and worker roles) are in many ways the same as for on-premises solutions. We can continue to use Windows event logs, IIS logs, performance counters, and custom logs. We can even use RDP to connect to a machine instance and view the data. But (and there’s always a “but”), that only works if the machine instance hasn’t been reimaged, and if we’re only dealing with a few machines. In practice, this approach doesn’t scale well. If the machine is reimaged, which is not uncommon, all that potentially important data is lost. Cloud applications need a way to persist data outside of the machine instance.

At its core, Microsoft Azure diagnostics builds on the tools we’re already familiar with for on-premises solutions. The change comes in how those tools are configured and where the data is stored. There are some important differences from how we might configure diagnostics for on-premises applications:

  • It is better, and often easier, to configure diagnostics when the application is first deployed, because Azure collects nothing unless we explicitly tell it which data to gather. If we don’t, our diagnostics process is broken from the start, and going back later to make the correct configuration change, while possible, is often not the experience we’re after (especially when trying to get a service working again).
  • What gets stored is essentially just raw data, in semi-structured tables or files. Additional tools, such as Cerebrata Azure Management Studio, are needed to search and visualize the data.
  • There is a chance that diagnostic data which hasn’t yet been transferred to storage could be lost when the machine instance is reimaged. (Helpful tip – always plan for the worst and you’ll be happy you did when the time comes.)
  • Since Azure diagnostic data is stored in an Azure storage account, there is a cost for storing that data – the more data that is stored, the more we pay. Thus, we want to store only the data we really need, and only for as long as it is needed – no sense paying for data we’re not using.

The Microsoft Azure Diagnostic Agent

In a Cloud Service application, we’re most likely not dealing with a single machine, but rather multiple virtual machine instances, each identical to all the others. Therefore, we need a way to collect data from all the instances within the role, which is where the Microsoft Azure Diagnostic Agent enters the picture. It’s the agent’s job to periodically gather up all the diagnostic data (based on a configuration you choose, but more on that later), buffer the data on the local instance, and then send the data off to an Azure storage account for safe keeping. Once the data is in Azure storage, it can easily be downloaded or queried to help you understand more about the application.


Figure 1 – Diagnostic Agent and data flow
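To make that a little more concrete, here is a minimal sketch of what telling the agent what to collect might look like in a role’s OnStart method, using the Azure Diagnostics 1.0 API that ships with the Azure SDK. The counter, sample rate, and transfer period below are purely illustrative – we’ll cover configuration properly in the next article.

    using System;
    using Microsoft.WindowsAzure.Diagnostics;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WebRole : RoleEntryPoint
    {
        public override bool OnStart()
        {
            // Start from the default configuration provided by the Diagnostics module.
            DiagnosticMonitorConfiguration config =
                DiagnosticMonitor.GetDefaultInitialConfiguration();

            // Sample CPU utilization every 5 seconds (illustrative values).
            config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
            {
                CounterSpecifier = @"\Processor(_Total)\% Processor Time",
                SampleRate = TimeSpan.FromSeconds(5)
            });

            // Buffer locally, then transfer to the storage account once a minute.
            config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
            config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

            // Start the agent using the connection string configured via Visual Studio.
            DiagnosticMonitor.Start(
                "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);

            return base.OnStart();
        }
    }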

The diagnostic agent is added as a role module to each web or worker role in your Cloud Service. As can be seen in Figure 2 below, if you were to use Remote Desktop to log into your running role instance, you would find the DiagnosticAgent.exe process in there, doing its thing.


Figure 2 – The DiagnosticAgent.exe process

Enabling the diagnostic agent is usually done via Visual Studio by editing the properties of the desired role. When you check the “Enable Diagnostics” option and set the storage account credentials, you’re telling Visual Studio to import and configure the “Diagnostics” module, which is responsible for enabling the collection of diagnostic data.


Figure 3 – Enabling diagnostics via Visual Studio

If you were to look at the Cloud Service’s ServiceDefinition.csdef, you would see the “Diagnostics” module being imported into each role in the service.


Figure 4 – ServiceDefinition.csdef – the “Diagnostics” module being imported into each role in the service
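If you don’t have the file in front of you, the relevant portion of ServiceDefinition.csdef looks roughly like the snippet below (the service and role names are just placeholders).

    <ServiceDefinition name="MyCloudService"
                       xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
      <WebRole name="MyWebRole" vmsize="Small">
        <Imports>
          <!-- This is the line the "Enable Diagnostics" checkbox adds for you. -->
          <Import moduleName="Diagnostics" />
        </Imports>
      </WebRole>
    </ServiceDefinition>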

WAD uses a storage account as a central location to persist all collected diagnostic data. Based on your configuration, diagnostic data is periodically transferred from the local machine instance to specific tables or blobs in Azure storage. If you were to look at the ServiceConfiguration.cscfg file, you would find the Azure storage account connection string, which was set previously via Visual Studio.


Figure 5 – ServiceConfiguration.cscfg – the Azure storage account connection string, which was set previously via Visual Studio
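And here, roughly, is the corresponding setting in ServiceConfiguration.cscfg (the account name and key below are placeholders, not real credentials).

    <ServiceConfiguration serviceName="MyCloudService"
                          xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
      <Role name="MyWebRole">
        <Instances count="2" />
        <ConfigurationSettings>
          <Setting name="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString"
                   value="DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=..." />
        </ConfigurationSettings>
      </Role>
    </ServiceConfiguration>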

Great, but where’s my data?

Glad you asked! Trace log messages, event log messages, and performance counters are saved into table storage (WADLogsTable, WADWindowsEventLogsTable, and WADPerformanceCountersTable, respectively). Everything else – IIS logs, failed request logs, crash dumps, and so on – goes into blob storage. It’s easy to identify Microsoft Azure diagnostic tables and blob containers, because their names are prefixed with “WAD” or “wad-”.


Table 1 – Diagnostic items and where to find them

You’re probably asking yourself why all this is necessary. Good question! As I’ve implied earlier, the reason is that we need a persistent, durable, highly available storage location outside of the machine instance, and Microsoft Azure storage provides that location. It provides a common place for all machine instances to write (via the aptly named DiagnosticAgent.exe).

If we were to open one of those tables, we’d see just the raw data – it’s fairly well structured, but the sheer volume of data can be daunting. Figure 6 below shows what the performance counter data in the WADPerformanceCountersTable looks like. And that’s just 1,000 records out of some as-yet-unknown number in the table. Try finding a CPU spike in that!


Figure 6 – WADPerformanceCountersTable
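If you’d rather pull the raw data yourself, a minimal sketch of querying that table with the .NET storage client library might look like the following. The connection string, counter name, and one-hour window are illustrative, and the partition key trick relies on WAD prefixing a tick count with “0”, so treat it as a starting point rather than gospel.

    using System;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    class PerformanceCounterQuery
    {
        static void Main()
        {
            // Storage account the diagnostic agent writes to (placeholder connection string).
            CloudStorageAccount account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
            CloudTable table = account.CreateCloudTableClient()
                                      .GetTableReference("WADPerformanceCountersTable");

            // WAD partitions rows by time ("0" + tick count), so a range filter on
            // PartitionKey keeps the query to roughly the last hour.
            string fromKey = "0" + DateTime.UtcNow.AddHours(-1).Ticks;
            string filter = TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, fromKey),
                TableOperators.And,
                TableQuery.GenerateFilterCondition("CounterName", QueryComparisons.Equal,
                    @"\Processor(_Total)\% Processor Time"));

            foreach (DynamicTableEntity row in table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter)))
            {
                Console.WriteLine("{0}  {1}  {2}",
                    row.Timestamp,
                    row.Properties["RoleInstance"].StringValue,
                    row.Properties["CounterValue"].DoubleValue);
            }
        }
    }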

On the plus side, tools like Cerebrata Azure Management Studio can make quick work of visualizing this important data. Ah, that’s much easier on the eyes!


Figure 7 – Performance Counter chart in Azure Management Studio

The data is buffered and then sent to storage because this is a very efficient, not to mention cost-effective, way to send potentially large volumes of data. When it comes to distributed cloud architectures, such as you’d find with Microsoft Azure, it’s generally more efficient to send fewer, larger chunks of data to storage than to continually send a high volume of relatively small data sets. Additionally, since transactions against Azure storage are billed at the staggeringly low rate of $0.005 per 100,000 transactions (as of the writing of this article – see the current rate on the official Azure pricing page), sending fewer transactions saves us money. Yes, Microsoft really was looking out for our wallets when designing the diagnostic agent. Not so evil after all!
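To put some purely hypothetical numbers on that: ten role instances each pushing one batched transfer per minute comes to roughly 432,000 transactions a month – about two cents at the rate above – whereas writing every 5-second performance counter sample as its own transaction would be twelve times as many transactions, per counter, before you even count trace and event log entries. Batching clearly scales better as instances, counters, and log sources multiply.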

Let’s wrap it up already

As you can see, Microsoft Azure Diagnostics provides us with much of the same functionality we are already using for our on-premises applications. The majority of the diagnostic code you’re used to writing should work equally well when running in Microsoft Azure. The changes we’ll need to make are primarily related to configuration and how we go about visualizing the data. The Microsoft Azure Diagnostic agent handles buffering the data on each machine and then periodically transferring it to a safe location in Microsoft Azure storage. We just need to tell the diagnostic agent what to collect, how often to collect it, and how often to transfer it to storage.
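For example, the plain System.Diagnostics tracing you already have keeps working as-is; the cloud service project templates register a DiagnosticMonitorTraceListener for each role, and the agent ships the messages off to the WADLogsTable on whatever transfer schedule you configure. The class and message below are, of course, just for illustration.

    using System.Diagnostics;

    public class OrderProcessor
    {
        public void Process(int orderId)
        {
            // Ordinary trace calls – no Azure-specific API required. The
            // DiagnosticMonitorTraceListener picks these up, and the diagnostic
            // agent transfers them to the WADLogsTable in the storage account.
            Trace.TraceInformation("Processing order {0}", orderId);
            Trace.TraceWarning("Order {0} took longer than expected", orderId);
        }
    }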

The information and steps described in this article relate to Azure Diagnostics v1.0, which is included in the Azure SDK. Future articles will introduce Azure Diagnostics v1.2, which is not included in the Azure SDK; while similar, there are some important differences. In the next article we’ll take a look at how to configure diagnostics, both via code and via a configuration file, as well as the pros and cons of each approach. Further along in the series we’ll look at more advanced configuration techniques, along with some practical advice on how to best configure – and, more importantly, use – the diagnostic data you’ve collected.