Azure Virtual Desktop (AVD) usage is growing fast, and AVD has become a mission-critical component of many IT environments. Making AVD resilient is an important design consideration when relying on the service for access to corporate data and applications.
Since AVD deployments consist of several inter-dependent components, we will consider each one individually in the configuration of Business Continuity and Disaster Recovery (BCDR) for AVD.
Azure Virtual Desktop Components
The table below lists the various AVD components with their associated DR considerations.
Disaster Recovery (DR) Scenarios
When planning for AVD disaster recovery, it is important to identify the possible outage scenarios and decide on the ones to protect against. Some DR strategies will cover multiple scenarios as we’ll see below.
Scenario #1: Corruption of data, metadata, or resources, but no underlying data center or region outage
In this situation, restoring from backup or rebuilding session host VMs is the best approach. Let’s review how this applies to each AVD environment component:
- AVD service – Because this service is hosted, managed, and backed up by Microsoft, there is nothing for you to do. The AVD service will fail over automatically and Microsoft is responsible for getting everything back up and running within the provided SLA.
- Identity / Directory – If using native Azure AD joined VMs, no action is necessary. Microsoft is responsible for keeping this service operational within the provided SLA. If using Active Directory, functional AD domain controllers must always be accessible. By default, Azure AD DS operates two domain controllers, placed in separate Availability Zones where the region supports them.
- Recommendation: Use native Azure AD or Azure AD DS, or, if using Active Directory, create multiple AD domain controllers. Back up the AD system state and restore it if needed.
- Desktop images - Changes are often made to desktop images during the normal course of AVD maintenance. Maintaining backups of desktop images is important to be able to quickly recover from any corruption.
- Recommendation: Use Shared Image Gallery with image versioning. Leverage Nerdio Manager’s built-in desktop image backup functionality to version the images prior to making any changes.
- Session host VMs - Hosts can become unavailable or corrupted in the normal course of operation.
- Recommendation: Enable Nerdio Manager’s Auto-Heal functionality to automatically repair broken session hosts.
- FSLogix profiles - Corruption of profile containers can be resolved by restoring the corrupted VHD(X) files from backup.
- Recommendation: Depending on your FSLogix storage technology, configure Azure Backup for Azure Files shares, use Azure NetApp Files snapshots, or use any backup or versioning method for file server VMs (e.g. Volume Shadow Copies). Restore corrupted profile containers as needed.
Scenario #2: Single datacenter or Availability Zone failure within an Azure region
Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more datacenters equipped with independent power, cooling, and networking. To ensure resiliency, there’s a minimum of three separate zones in all enabled regions. The physical separation of Availability Zones within a region protects applications and data from datacenter failures. Zone-redundant services replicate your applications and data across Availability Zones to protect from single-points-of-failure. With Availability Zones, Azure offers an industry-best 99.99% VM uptime SLA. Learn more here.
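To put the 99.99% figure in perspective, the arithmetic below converts an uptime percentage into the downtime it still permits. This is only an illustrative calculation: the function name and the 30-day and 365-day periods are assumptions made for the example, not terms of the SLA itself.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Maximum downtime, in minutes, that an uptime SLA permits over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Assumed periods for illustration: a 30-day month and a 365-day year.
monthly = allowed_downtime_minutes(99.99, 30)
yearly = allowed_downtime_minutes(99.99, 365)

print(f"{monthly:.2f} min/month, {yearly:.2f} min/year")
# prints "4.32 min/month, 52.56 min/year"
```

In other words, even at 99.99% a few minutes of VM downtime per month remain within the SLA, which is why the component-level design still matters.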
In the case of a datacenter or Availability Zone failure, most components of the AVD environment will automatically fail over to another Availability Zone with no user intervention required.
NOTE: Not all Azure regions support Availability Zones for all products. Review the Regions that support Availability Zones before deploying your AVD environment to select the region that addresses your availability requirements. Pay special attention to Premium Files Storage if using Azure Files for FSLogix profiles.
To protect against Availability Zone failure, the initial AVD architecture and design must take zone redundancy into account. Let’s review this on a component-by-component basis.
- AVD service – Because this service is hosted, managed, and backed up by Microsoft, there is nothing for you to do. The AVD service will fail over automatically and Microsoft is responsible for getting everything back up and running within the provided SLA.
- Identity / Directory – If using native Azure AD joined VMs, no action is necessary. Microsoft is responsible for keeping this service operational within the provided SLA. If using Active Directory, functional AD domain controllers must always be accessible. By default, Azure AD DS operates two domain controllers, placed in separate Availability Zones where the region supports them.
- Recommendation: Use native Azure AD or Azure AD DS, or, if using Active Directory, create multiple AD domain controllers in different Availability Zones.
- Desktop images – Desktop images stored using ZRS (Zone Redundant Storage) will be available during Availability Zone failure.
- Recommendation: Store images with ZRS storage.
- Session host VMs – Session host VMs running in the datacenter where an outage occurs will go offline.
- Recommendation: When deploying session hosts, distribute them across the Azure region’s Availability Zones using Nerdio Manager’s automation.
- FSLogix profiles – FSLogix profiles stored on Azure Files Premium ZRS storage won’t be impacted by an Availability Zone failure.
- Recommendation: Use ZRS storage with Azure Files Premium to store FSLogix profiles.
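The effect of spreading session hosts across zones can be sketched with a simple round-robin assignment. This is a conceptual illustration only, not how Nerdio Manager performs the placement; the host names and the helper functions `assign_zones` and `surviving_hosts` are hypothetical, and the zone labels "1", "2", and "3" assume a region with three Availability Zones.

```python
from collections import Counter
from itertools import cycle


def assign_zones(host_names, zones=("1", "2", "3")):
    """Assign each session host to an Availability Zone in round-robin order."""
    return dict(zip(host_names, cycle(zones)))


def surviving_hosts(placement, failed_zone):
    """Return the hosts that stay online when one zone fails."""
    return [host for host, zone in placement.items() if zone != failed_zone]


# Hypothetical host pool of six session hosts.
hosts = [f"avd-host-{i}" for i in range(6)]
placement = assign_zones(hosts)

print(Counter(placement.values()))       # two hosts land in each zone
print(surviving_hosts(placement, "2"))   # four of six hosts remain available
```

With an even spread over three zones, a single zone failure removes roughly one third of the pool's capacity, so size the host pool so that the remaining hosts can carry the expected user load.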
Scenario #3: Entire Azure region outage
An Azure region is a set of datacenters deployed within a latency-defined perimeter and connected through a dedicated, regional low-latency network. Azure gives you the flexibility to deploy applications where you need to, including across multiple regions to deliver cross-region resiliency. Failure of a complete Azure region is rare. For more information, see Overview of the resiliency pillar.
Failure of an entire Azure region is the most severe scenario. The best way to protect against this situation is by automatically distributing AVD session host VMs across two Azure regions and replicating FSLogix profile data, thereby creating an Active/Active DR configuration. If one of the regions becomes unavailable, VMs in the second region can continue servicing users. Learn more about host pool DR in our video below and read further for considerations regarding the different components involved in Scenario 3.
- AVD service – Because this service is hosted, managed, and backed up by Microsoft, there is nothing for you to do. The AVD service will fail over automatically and Microsoft is responsible for getting everything back up and running within the provided SLA.
- Identity / Directory – If using native Azure AD joined VMs, no action is necessary. Microsoft is responsible for keeping this service operational within the provided SLA. If using Active Directory, functional AD domain controllers must always be accessible. By default, Azure AD DS operates two domain controllers, placed in separate Availability Zones where the region supports them.
- Recommendation: Use native Azure AD or Azure AD DS replica sets, or, if using Active Directory, create multiple AD domain controllers in two Azure regions.
- Desktop images – Desktop images stored in Shared Image Gallery and replicated to multiple regions will be available during a single region outage.
- Recommendation: Geo-replicate desktop images with Nerdio Manager and Shared Image Gallery.
- Session host VMs – Session host VMs running in the Azure region where an outage occurs will go offline. If there are available session host VMs in a secondary region, users will be able to reconnect and continue working.
- Recommendation: Leverage Nerdio Manager’s Active/Active host pool DR to automatically distribute session hosts across two selected Azure regions.
- FSLogix profiles – Users won’t be able to work without access to their FSLogix user profiles. Profiles must be continuously replicated to more than one region.
- Recommendation: Use Nerdio Manager’s FSLogix Cloud Cache functionality to replicate user profiles across two Azure regions.
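For readers configuring Cloud Cache by hand rather than through Nerdio Manager, FSLogix reads its list of storage providers from the `CCDLocations` setting, with entries separated by semicolons. The sketch below only assembles such a value for two SMB shares, one per region. The share paths and the helper `build_ccd_locations` are hypothetical, and the exact entry syntax should be checked against the FSLogix documentation before use.

```python
def build_ccd_locations(smb_paths):
    """Join SMB share paths into a semicolon-separated CCDLocations value.

    Assumes the 'type=smb,connectionString=<path>' entry syntax; verify it
    against the FSLogix Cloud Cache documentation before deploying.
    """
    if len(smb_paths) < 2:
        raise ValueError("Cross-region replication needs at least two locations")
    return ";".join(f"type=smb,connectionString={path}" for path in smb_paths)


# Hypothetical shares, one in each of the two Azure regions.
primary = r"\\profiles-region1.example.com\fslogix"
secondary = r"\\profiles-region2.example.com\fslogix"

ccd_locations = build_ccd_locations([primary, secondary])
print(ccd_locations)
```

The order of the entries matters in practice: listing the share closest to the session host first keeps profile reads local to that region.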
Configuring an AVD environment to be resilient to an Azure region failure (scenario #3) will also cover Azure Availability Zone failure (scenario #2). The outlined approach works best for pooled AVD deployments. Personal desktops can also be protected, but the approach is different. Protecting personal desktops involves using Azure Site Recovery in an active/passive configuration.
Summary Table
For more information on Nerdio Manager for Enterprise, click here.