AWS AZs: Not all are Equal

March 10, 2021

(Short version: there’s more context below, but for readers who just want to know which AZ doesn’t support Nitro, it’s use1-az3.)


What this is about

  1. Not all AZs support all instance types
  2. Availability zone names aren’t consistent across AWS accounts
  3. Availability zone IDs are consistent across AWS accounts!
  4. How to find out which zone IDs support a given instance type
  5. How to automatically exclude those unsupported AZs using Terraform

Not all AZs are equal

When choosing which AWS region to use, practitioners often discover the AWS Regional Services List, which describes the services each region supports. What’s less well-known is that even within a given region, not all availability zones are equal. This is most visible for services that offer a choice of instance types, such as EC2, RDS, and ElastiCache. That’s because some AZs are older than others and don’t support newer Nitro or Graviton2 instances.

This seems like a weird limitation at first; shouldn’t it be as easy as racking new hardware and switching it on? I mean, maybe, but probably not. When NLBs were new, there was a limitation (now resolved) that prevented internal NLBs from passing traffic between Nitro and non-Nitro instances. This hinted that Nitro hypervisors didn’t just have a bigger network pipe; the networking improvements were likely coupled to the operation of the DC network fabric itself. Older AWS datacentres would need to do more than just rack servers to support Nitro, and that’s likely what got us here today.

Where I’ve seen this cause the most grief is in autoscaling groups that span multiple AZs, and yet instances will stubbornly not launch in one of them. You might think you’re redundant across 3 AZs only to find that instances are launching in only 2; or, you might be running Nitro instances in us-east-1a and us-east-1b but an older generation RDS instance in us-east-1c, pretty much guaranteeing cross-AZ network traffic and its bandwidth charges (itself a complicated topic). Or maybe you’re real unlucky and have inherited a single-AZ VPC built to save on cross-AZ bandwidth, but for some reason it won’t let you launch Nitro or Graviton2 instances because, yep, it’s an old AZ.
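One way to catch this before it bites: diff the AZs an autoscaling group spans against the AZs that actually offer its instance type. The lists below are hardcoded samples standing in for the output of aws autoscaling describe-auto-scaling-groups and aws ec2 describe-instance-type-offerings; in practice you’d populate them from the real API calls:

```shell
# Sample stand-ins for the two API calls (values illustrative)
ASG_AZS=$(printf 'us-east-1a\nus-east-1b\nus-east-1c')
SUPPORTED_AZS=$(printf 'us-east-1a\nus-east-1b')

# Lines in the first list but not the second: AZs the ASG will never launch into
comm -23 <(echo "${ASG_AZS}" | sort) <(echo "${SUPPORTED_AZS}" | sort)
# us-east-1c
```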

Now that you know, how do you pick the right AZs?

Availability zones have two names

Something else non-obvious is that availability zones are not consistently named across AWS accounts. That is, the AZ labelled us-east-1a in your AWS account is probably different from the us-east-1a in my AWS account. Amazon “shuffles” the AZ names for each new account to make sure that their customers are spread across the whole region. If us-east-1a were the same across all accounts, it would be heavily over-represented in customer workloads.

That makes it a bit harder to know which AZs to avoid if you want to balance Nitro workloads across your whole VPC, so fortunately AZs have another identifier: Zone IDs! The zone ID use1-az1 points to the same physical datacentre across all AWS accounts, even if it points to one named us-east-1a in one AWS account and us-east-1e in another. Knowing that, you can enumerate which zone IDs support Nitro instances by asking the API:

  1. For every AWS region
  2. For every zone ID in every AWS region
  3. Whether each zone supports a given instance type
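To see the name shuffle concretely first, you can map names to IDs in a single account with aws ec2 describe-availability-zones. The jq filter below runs against a trimmed, illustrative sample response; your account’s mapping will differ:

```shell
# Trimmed sample of: aws ec2 describe-availability-zones --region us-east-1
SAMPLE='{"AvailabilityZones":[
  {"ZoneName":"us-east-1a","ZoneId":"use1-az4"},
  {"ZoneName":"us-east-1b","ZoneId":"use1-az6"},
  {"ZoneName":"us-east-1c","ZoneId":"use1-az1"}]}'

# ZoneName is per-account; ZoneId is the same for everyone
echo "${SAMPLE}" | jq -r '.AvailabilityZones[] | "\(.ZoneName) -> \(.ZoneId)"'
# us-east-1a -> use1-az4
# us-east-1b -> use1-az6
# us-east-1c -> use1-az1
```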

Here’s a script that does that using the m5.large instance type to find out which zones support Nitro:

#!/bin/bash

# Instance type for which to test availability
INSTANCE_TYPE=m5.large

# Formatting
BOLD=$(tput bold)
RESET=$(tput sgr0)

# Get all regions
REGIONS=$(aws ec2 describe-regions --region us-east-1 | jq -r '.Regions[].RegionName' | sort)

for REGION in $REGIONS ; do

  echo "${BOLD}AZs in ${REGION}:${RESET}"

  # Get all AZs in a region
  ALL_AZS=$(aws ec2 describe-availability-zones --region "${REGION}" | jq -r '.AvailabilityZones[].ZoneId' | sort)

  # Get all AZs that support the instance type
  SUPPORTED_AZS=$(aws ec2 describe-instance-type-offerings \
    --location-type availability-zone-id \
    --filters Name=instance-type,Values="${INSTANCE_TYPE}" \
    --region "${REGION}" \
    | jq -r '.InstanceTypeOfferings[].Location' | sort)

  # Annotate which AZs do not support the instance type
  for AZ in $ALL_AZS ; do
    if echo "${SUPPORTED_AZS}" | grep -q "${AZ}" ; then
      echo "${AZ}"
    else
      echo "${AZ} (${INSTANCE_TYPE} not supported)"
    fi
  done

done

You can set INSTANCE_TYPE=m6g.large and run it again to find out where Graviton2 instances are supported.

Shortly after publishing, Mia pointed out that the reverse is also true: older generation instances are often not supported in newer regions, and the older regions that do support them may not provide that support across all AZs.

You can set INSTANCE_TYPE=m3.large in the above script to see this, and while m3s might seem like a peculiar thing to launch in 2021, who among us hasn’t found m3s in old prebaked CloudFormation templates, bespoke up.sh scripts, or that bizarre cross-section of maximum system requirements to run legacy apps?

How to exclude unsupported AZs

Here’s a pretty common approach to creating a multi-AZ VPC using Terraform:

provider "aws" {
  region = "us-east-1"
}

resource "aws_vpc" "vpc01" {
  cidr_block = "10.0.0.0/22"
}

data "aws_availability_zones" "azs" {
  state = "available"
}

resource "aws_subnet" "public" {
  count                   = 3

  availability_zone       = data.aws_availability_zones.azs.names[count.index]
  cidr_block              = cidrsubnet("10.0.0.0/22", 3, count.index)
  map_public_ip_on_launch = true
  vpc_id                  = aws_vpc.vpc01.id
}

This creates a VPC with three public subnets, each in a different availability zone. As for which three, state = "available" means the first three that are, well, available. Exactly which three is an implementation detail that probably shouldn’t matter, which makes it a good abstraction (except it does matter, because you’re reading this).
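As an aside, the cidrsubnet("10.0.0.0/22", 3, count.index) call carves the /22 into eight /25s and picks the count.index-th one. A quick bash sketch reproduces the arithmetic for the three indices used above (Terraform does this for you; this is just to show the math):

```shell
# cidrsubnet("10.0.0.0/22", 3, i): /22 plus 3 new bits = /25 blocks of 128 addresses
for i in 0 1 2; do
  echo "10.0.$(( (i * 128) / 256 )).$(( (i * 128) % 256 ))/25"
done
# 10.0.0.0/25
# 10.0.0.128/25
# 10.0.1.0/25
```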

Which three didn’t use to matter much, but now if your autoscaling group wants to launch Nitro or Graviton2 instances into those subnets, you might see this error:

“Your requested instance type is not supported in your requested Availability Zone.”

Right. One of those three “available” AZs maps to use1-az3, which doesn’t support newer generation instances. So, fine, let’s exclude it in the aws_availability_zones data source:

--- main.tf
+++ main.tf
@@ -8,6 +8,7 @@

 data "aws_availability_zones" "azs" {
   state            = "available"
+  exclude_zone_ids = ["use1-az3"]
 }

 resource "aws_subnet" "public" {

Now when the VPC and subnets are created, Terraform will pick the first three available AZs that are not use1-az3.
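After applying, you can confirm where the subnets actually landed by listing their zone IDs. The jq filter below runs against a trimmed sample response (the zone IDs and CIDRs shown are illustrative); in practice you’d pipe the real aws ec2 describe-subnets output through it:

```shell
# Trimmed sample of: aws ec2 describe-subnets --filters Name=vpc-id,Values=<your-vpc-id>
SAMPLE='{"Subnets":[
  {"AvailabilityZoneId":"use1-az1","CidrBlock":"10.0.0.0/25"},
  {"AvailabilityZoneId":"use1-az2","CidrBlock":"10.0.0.128/25"},
  {"AvailabilityZoneId":"use1-az4","CidrBlock":"10.0.1.0/25"}]}'

# One line per subnet, zone ID first, so use1-az3 is easy to spot (hopefully absent)
echo "${SAMPLE}" | jq -r '.Subnets[] | "\(.AvailabilityZoneId) \(.CidrBlock)"'
# use1-az1 10.0.0.0/25
# use1-az2 10.0.0.128/25
# use1-az4 10.0.1.0/25
```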

A word of caution: do not make this change to existing VPCs/subnets, or at least not in an environment with any workloads in it. The best-case scenario is that Terraform will refuse to apply because it would require destroying in-use subnets; the worst-case scenario is that you’ve architected your Terraform to support destroying in-use subnets by also destroying everything in them, like the production database.

By the way, you can exclude AZs even if they are not part of the region you’re targeting. For example, you can exclude use1-az3 even if you’re deploying to ca-central-1. This might seem silly and unnecessary, but consider: you can add that exclusion globally to all of your boilerplate without having to worry about what region someone is going to launch into. If you’re keen on using a certain instance type, then blanket exclude all the AZs that don’t support it regardless of region, and get on with your day!
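If you want to generate that blanket exclusion list, subtract the supported zone IDs from all zone IDs and format the remainder as a list. The two sample lists below stand in for the per-region output of the script earlier in the post:

```shell
# Sample stand-ins: all zone IDs in a region vs. those offering the instance type
ALL_AZS=$(printf 'use1-az1\nuse1-az2\nuse1-az3\nuse1-az4')
SUPPORTED_AZS=$(printf 'use1-az1\nuse1-az2\nuse1-az4')

# Zone IDs to exclude, formatted as a list ready to paste into Terraform
comm -23 <(echo "${ALL_AZS}" | sort) <(echo "${SUPPORTED_AZS}" | sort) \
  | jq -R . | jq -s -c .
# ["use1-az3"]
```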

TL;DR

  1. Not all AZs support all instance types
  2. AZ names are not consistent across AWS accounts
  3. AZ zone IDs are consistent across AWS accounts!
  4. You can find out which zone IDs support a given instance type
  5. You can exclude these AZs using something like Terraform