// DATABASE OPERATIONS · 13 MIN READ

Aurora Global Database RPO: What the Marketing Number Doesn't Include

Aurora Global Database advertises typical replication lag under one second. That is accurate under normal operating conditions. It is not your RPO. Your actual recovery point after a cross-region failover depends on several factors AWS documentation treats as footnotes.

// PUBLISHED 2023-10-17 · LANIAKEA TEAM

What Aurora Global Database Actually Promises

The AWS documentation for Aurora Global Database states: "Aurora Global Database replicates changes to a secondary AWS Region with typical latency of less than 1 second." It also cites an RTO of under one minute, both for managed planned failovers and for recovery from an unplanned failure of the primary region.

What neither the marketing page nor the primary documentation surfaces clearly is the relationship between replication lag, the write acknowledgment model, and your actual data loss window at the moment of a regional failure. Those three things together determine your real RPO — and they interact in ways that can make it meaningfully worse than one second under specific conditions.

The Storage-Level Replication Architecture

Aurora Global Database replication operates at the storage layer, not at the database engine layer. This is different from logical replication, DMS, or Debezium-style CDC. The Aurora storage layer writes redo log records to six storage nodes distributed across three AZs in the primary region. Those same redo records are asynchronously forwarded to storage nodes in the secondary region.

The key word is asynchronous. Aurora's primary write path does not wait for the secondary region to acknowledge before confirming a transaction commit to the application. The six-node quorum for commit confirmation is entirely within the primary region. The secondary region receives redo asynchronously, typically within one second, but without a synchronous guarantee.

This is the correct design for cross-region replication — synchronous cross-region commits would add 50-100ms of round-trip latency to every write, which would be unacceptable for OLTP workloads. But it means that at any given moment, the secondary region may be up to N seconds behind the primary, where N is the current replication lag.

Measuring Real Replication Lag

The CloudWatch metric AuroraGlobalDBReplicationLag reports replication lag in milliseconds. Under normal conditions, this sits at 300-800ms for most inter-region pairs. But "normal conditions" excludes the situations that cause it to spike: bulk imports, batch jobs and end-of-month processing, long-running write-heavy transactions, and transient network degradation between regions.

Query the lag history to understand your environment's actual behavior:

# CloudWatch CLI: AuroraGlobalDBReplicationLag statistics (values are in milliseconds)
# The lag metric is emitted by the secondary cluster, so query the secondary region
# and use the secondary DB cluster identifier as the dimension value.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name AuroraGlobalDBReplicationLag \
  --dimensions Name=DBClusterIdentifier,Value=your-secondary-cluster-id \
  --start-time 2023-10-01T00:00:00Z \
  --end-time 2023-10-17T00:00:00Z \
  --period 3600 \
  --statistics Maximum Average \
  --region us-west-2

Look at the Maximum, not just the Average. The Average will be comfortably under one second. The Maximum tells you what your worst-case lag looked like over the period. If the maximum is 45 seconds, you had a 45-second RPO exposure window — not a sub-second one.
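If you want the single worst datapoint rather than eyeballing the full result set, a JMESPath --query can pull it out directly. The cluster identifier and time window below are placeholders, and the returned value is in milliseconds:

# Extract the worst-case replication lag observed over the window.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name AuroraGlobalDBReplicationLag \
  --dimensions Name=DBClusterIdentifier,Value=your-secondary-cluster-id \
  --start-time 2023-10-01T00:00:00Z \
  --end-time 2023-10-17T00:00:00Z \
  --period 3600 \
  --statistics Maximum \
  --region us-west-2 \
  --query 'max_by(Datapoints, &Maximum).Maximum' \
  --output text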

The Write Acknowledgment Gap

Beyond replication lag, there is a second RPO component that is rarely discussed: the write acknowledgment gap between when Aurora confirms a commit to your application and when that committed write reaches the secondary.

When your application executes a COMMIT and Aurora returns success, the data is durable in the primary region. It is not yet in the secondary. At exactly that microsecond, if the primary region fails, the committed transaction has not been replicated. This is the fundamental guarantee of asynchronous replication — commits are fast precisely because they do not wait for remote durability.

The practical consequence: your RPO in a sudden primary region failure scenario is not "the last replication lag reading." It is "however much write activity occurred in the primary between the last redo record the secondary received and the moment of failure." Under high write load, this can be several seconds even when average replication lag is below one second.
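One way to put a rough number on that exposure is to combine the worst observed lag with the commit rate on the writer over the same period. The sketch below multiplies peak AuroraGlobalDBReplicationLag by peak CommitThroughput; it is an order-of-magnitude estimate only, and the cluster and instance identifiers are placeholders:

# Rough estimate of commits at risk: peak lag (seconds) x peak commit rate (commits/second).
LAG_MS=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name AuroraGlobalDBReplicationLag \
  --dimensions Name=DBClusterIdentifier,Value=your-secondary-cluster-id \
  --start-time 2023-10-01T00:00:00Z --end-time 2023-10-17T00:00:00Z \
  --period 3600 --statistics Maximum --region us-west-2 \
  --query 'max_by(Datapoints, &Maximum).Maximum' --output text)

COMMITS_PER_SEC=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CommitThroughput \
  --dimensions Name=DBInstanceIdentifier,Value=your-writer-instance-id \
  --start-time 2023-10-01T00:00:00Z --end-time 2023-10-17T00:00:00Z \
  --period 3600 --statistics Maximum --region us-east-1 \
  --query 'max_by(Datapoints, &Maximum).Maximum' --output text)

# Committed-but-unreplicated transactions at the worst observed moment.
echo "$LAG_MS $COMMITS_PER_SEC" | awk '{ printf "~%.0f commits at risk\n", ($1 / 1000) * $2 }'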

Failover Procedure Timing

The third RPO factor is the failover procedure itself. When the primary region fails, you have choices about how to initiate failover:

Managed Planned Failover (Switchover)

For maintenance, testing, or voluntary failover between regions, Aurora provides a managed planned failover that handles the transition cleanly. This is the scenario AWS targets with its "less than one minute RTO" claim. The process:

  1. The primary cluster stops accepting new writes
  2. Aurora waits for all in-flight replication to drain to zero lag
  3. The secondary is promoted to primary
  4. The DNS endpoint is updated

In this scenario, RPO is genuinely zero — no data is lost because replication is fully caught up before promotion. RTO is the time for steps 3 and 4, which is typically 30-90 seconds.
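The trigger for this path is a single CLI call; the identifiers below are placeholders. Because no data loss is allowed here, Aurora holds the promotion until replication has fully drained (newer CLI releases also expose this planned path as switchover-global-cluster):

# Initiate a managed planned failover (switchover) to the secondary cluster.
# Writes stop on the current primary and promotion waits for replication lag to reach zero.
aws rds failover-global-cluster \
  --global-cluster-identifier your-global-cluster-id \
  --target-db-cluster-identifier arn:aws:rds:us-west-2:123456789012:cluster:your-secondary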

Unmanaged Failover (Regional Failure)

When the primary region fails ungracefully — infrastructure failure, not a planned switchover — the sequence is different:

  1. You detect the primary is unreachable (CloudWatch alerts, application errors, manual observation)
  2. You decide to initiate failover
  3. You run the promote command or use the console
  4. Aurora promotes the secondary and updates DNS

Steps 1-2 are human time, not automated. Unless you have automated failover detection and promotion wired up, you have an additional 5-15 minutes of decision time on top of the RPO from replication lag. This is your actual RTO in the unplanned scenario — and it is well beyond one minute for most teams without a tested runbook.

# Promote the secondary cluster to primary (Aurora Global Database).
# Issue the call against the surviving secondary region. --allow-data-loss (recent CLI
# versions) acknowledges that unreplicated writes in the failed primary will be lost;
# without it the call attempts a synchronized switchover, which needs a healthy primary.
aws rds failover-global-cluster \
  --global-cluster-identifier your-global-cluster-id \
  --target-db-cluster-identifier arn:aws:rds:us-west-2:123456789:cluster:your-secondary \
  --allow-data-loss \
  --region us-west-2

# Monitor promotion status until the target cluster reports IsWriter: true
aws rds describe-global-clusters \
  --global-cluster-identifier your-global-cluster-id \
  --query 'GlobalClusters[0].GlobalClusterMembers'

The Application-Side RPO Factor

There is a fourth component that is entirely outside Aurora's control: in-flight application transactions at the moment of failure. If your application has an open transaction that has written to the primary but not yet committed when the failure occurs, that data is lost regardless of replication state. The database never acknowledged the commit, so discarding that work is consistent with its guarantees; from the application's perspective, though, work was still lost.

For write-heavy OLTP applications, this means the effective loss at failover also includes whatever work was in flight: roughly the number of concurrent uncommitted transactions times their average duration's worth of work at the failure moment. For most transactional workloads with sub-second transactions, this is negligible. For batch processes running 10-minute transactions, it can be significant.
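To see how much uncommitted work is typically in flight, check the writer directly. The sketch below assumes Aurora PostgreSQL (Aurora MySQL exposes the same information through information_schema.innodb_trx); the endpoint, user, and database names are placeholders:

# List open transactions on the writer and how long they have been running (Aurora PostgreSQL).
# Anything shown here would be lost outright if the region failed at this moment.
psql -h your-primary-writer-endpoint -U your_user -d your_db -c "
  SELECT pid,
         now() - xact_start AS transaction_age,
         state,
         left(query, 60)    AS current_query
  FROM pg_stat_activity
  WHERE xact_start IS NOT NULL
    AND backend_type = 'client backend'
  ORDER BY xact_start;
"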

What You Can Actually Measure and Commit To

Given all of the above, here is a framework for honest RPO characterization for an Aurora Global Database deployment:

Steady-State RPO

The RPO under normal operating conditions, measured as the 99th percentile replication lag over a 30-day period. Query CloudWatch for the p99 of AuroraGlobalDBReplicationLag:

# 30-day p99 of replication lag (GNU date syntax shown for the time window).
# Percentile statistics go through --extended-statistics, not --statistics.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name AuroraGlobalDBReplicationLag \
  --dimensions Name=DBClusterIdentifier,Value=your-secondary-cluster-id \
  --start-time $(date -u -d '30 days ago' '+%Y-%m-%dT%H:%M:%SZ') \
  --end-time $(date -u '+%Y-%m-%dT%H:%M:%SZ') \
  --period 86400 \
  --extended-statistics p99 \
  --region us-west-2

If your p99 lag is 3 seconds, your steady-state RPO is approximately 3 seconds. This is the number to commit to in your SLA.

Burst RPO

The RPO during high-write-activity windows (batch jobs, end-of-month processing, large imports). Measure replication lag during those specific windows and document the maximum observed. If your monthly batch generates 45 seconds of lag, your RPO during that window is 45 seconds — not sub-second.

Detection-to-Failover Time

Measure and document how long it actually takes your team to detect a regional failure and initiate promotion. Run a failover drill against your secondary region with a test cluster. Time the entire process from simulated failure detection to application successfully connecting to the promoted secondary. For most teams without automation, this is 10-25 minutes.
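The drill can be timed with a small wrapper script. The sketch below assumes a disposable test global cluster running Aurora PostgreSQL; every identifier and endpoint in it is a placeholder, credentials come from PGPASSWORD or ~/.pgpass, and it should never be pointed at production:

# Time a failover drill end to end: from the "go" decision to a usable writer in the secondary region.
DRILL_START=$(date +%s)

aws rds failover-global-cluster \
  --global-cluster-identifier your-test-global-cluster \
  --target-db-cluster-identifier arn:aws:rds:us-west-2:123456789012:cluster:your-test-secondary \
  --allow-data-loss \
  --region us-west-2

# Poll the promoted cluster's endpoint until it comes out of recovery, i.e. accepts writes.
until [ "$(psql -h your-test-secondary-writer-endpoint -U your_user -d your_db \
  -tAc 'SELECT pg_is_in_recovery()' 2>/dev/null)" = "f" ]; do
  sleep 5
done

echo "Promotion-to-usable-writer time: $(( $(date +%s) - DRILL_START )) seconds"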

Architectural Options to Tighten RPO

Lag Alerting and Auto-Failover

Set a CloudWatch alarm on AuroraGlobalDBReplicationLag that triggers an SNS notification when lag exceeds your RPO threshold. Pair this with a Lambda function that can initiate automated failover promotion when both the primary is unreachable AND lag was acceptably low before failure. This narrows detection-to-failover time from minutes to seconds.
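A sketch of the alarm half, assuming a 5-second lag threshold (the metric is reported in milliseconds) and placeholder SNS topic and cluster identifiers:

# Alarm when replication lag exceeds 5,000 ms for three consecutive minutes.
# Missing data is treated as breaching, since a gap in the lag metric is itself a warning sign.
aws cloudwatch put-metric-alarm \
  --alarm-name aurora-global-replication-lag-high \
  --namespace AWS/RDS \
  --metric-name AuroraGlobalDBReplicationLag \
  --dimensions Name=DBClusterIdentifier,Value=your-secondary-cluster-id \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 5000 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:dr-alerts \
  --treat-missing-data breaching \
  --region us-west-2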

Application-Level Write Fencing

For applications with extremely tight RPO requirements, implement write fencing at the application layer: before confirming a write to downstream systems, verify that the write has been confirmed by a secondary-region reader. This adds latency (you wait for the secondary read to confirm the data is there) but gives you synchronous-equivalent durability. This is appropriate for financial ledger operations, not for general OLTP.
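A minimal sketch of the pattern, assuming Aurora PostgreSQL and a hypothetical ledger_entries table with a unique token column; all endpoints and credentials are placeholders:

# Write fencing sketch: commit on the primary, then block until a secondary-region
# reader can see the row before acknowledging the operation downstream.
TOKEN=$(uuidgen)

# 1. Commit the business write on the primary, tagged with a fencing token.
psql -h your-primary-writer-endpoint -U your_user -d your_db -c \
  "INSERT INTO ledger_entries (entry_token, amount) VALUES ('$TOKEN', 100.00);"

# 2. Poll a reader in the secondary region until the token is visible there.
until psql -h your-secondary-reader-endpoint -U your_user -d your_db -tAc \
  "SELECT 1 FROM ledger_entries WHERE entry_token = '$TOKEN';" | grep -q 1; do
  sleep 0.2
done

# Only now confirm the operation to downstream systems.
echo "write $TOKEN is durable in both regions"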

Global Database vs. Multi-Region Architecture

Aurora Global Database is optimized for read scaling and DR with an acceptable asynchronous RPO. If your RPO requirement is genuinely zero (no data loss tolerated), you need a synchronous replication architecture. Options include application-managed synchronous dual writes that acknowledge only after both regions have persisted the change, a distributed database engine that commits through a cross-region consensus quorum, or writing through a synchronously replicated cross-region commit log before applying changes to the database.

Each of these trades write latency for durability guarantees that Aurora Global Database does not provide.

The Honest Conversation

Aurora Global Database is a well-engineered product that delivers genuine cross-region replication with excellent operational characteristics. The sub-second typical replication lag is real. The managed failover RTO under one minute is real.

What it does not deliver is a zero-RPO guarantee or a fully automated unplanned failover within one minute. Teams that have bought Aurora Global Database specifically to hit a regulatory or SLA requirement for sub-second RPO should verify what their actual steady-state and burst lag looks like — and whether their detection-to-failover procedure can actually meet the window they have committed to.

Measure the real numbers. Document the burst cases. Test the failover procedure against a non-production clone. The platform is solid; the gap is almost always in the operating procedures, not the technology.

Need a second opinion on your database DR posture?

We'll review your Aurora Global Database configuration, measure actual replication lag patterns, and assess your failover procedures as part of a free database assessment.