The Silent Replication Gap
Aurora MySQL clusters use GTID (Global Transaction Identifier) replication to keep read replicas synchronized with the writer. Each transaction on the writer gets a unique GTID — a combination of the server UUID and a sequence number. Replicas track which GTIDs they've applied and request the next ones from the writer's binlog.
The problem: replication gaps. A gap occurs when the replica's executed GTID set has a hole — a range of sequence numbers that are missing. This happens most commonly during maintenance operations: restoring from a snapshot, importing data directly to a replica, running DDL that bypasses binlog logging, or a replica that was stopped and restarted incorrectly. The replica may continue running and accepting read queries, but it stops applying new transactions from the writer. The RDS console reports it as Available and may show low or zero replication lag, because lag and gap detection are measured by different mechanisms.
Applications routing reads to that replica get silently stale data. If your application caches aggressively or doesn't have strong consistency requirements, this can go undetected for hours or days.
How GTIDs Work in Aurora MySQL
When GTID mode is enabled (gtid_mode=ON), every committed transaction is assigned a GTID in the format source_uuid:transaction_id. The writer maintains a gtid_executed set — all GTIDs committed so far. Each replica maintains its own gtid_executed set. The difference between the writer's set and the replica's set is the replication lag (in GTID terms).
A healthy replica's gtid_executed looks like a contiguous range:
-- On a healthy replica
SELECT @@gtid_executed;
-- Result: 3E11FA47-71CA-11E1-9E33-C80AA9429562:1-12345678
-- On a replica with a gap (sequences 100000-100050 are missing)
-- Result: 3E11FA47-71CA-11E1-9E33-C80AA9429562:1-99999:100051-12345678
The split into two colon-separated intervals in the second example is the signature of a gap: transactions 100000 through 100050 were never applied to this replica. Depending on what those transactions contained, the replica may have an inconsistent view of the data — or replication may have stopped entirely.
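MySQL's built-in GTID_SUBTRACT() function makes the hole explicit. Subtracting the gapped set from the writer's set returns exactly the missing range:
-- Worked example using the two sets above
SELECT GTID_SUBTRACT(
  '3E11FA47-71CA-11E1-9E33-C80AA9429562:1-12345678',
  '3E11FA47-71CA-11E1-9E33-C80AA9429562:1-99999:100051-12345678'
) AS missing;
-- Result: 3E11FA47-71CA-11E1-9E33-C80AA9429562:100000-100050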
The Three Most Common Causes of Aurora GTID Gaps
1. Snapshot restore with data import. You restore an Aurora snapshot to create a new read replica, then import additional data directly via SQL (bypassing replication) to populate it faster. The imported transactions don't exist in the writer's binlog — the replica has applied transactions the writer never knows about. When the replica tries to replicate, GTID conflicts cause replication to stop or produce errors.
2. DDL executed on the replica directly. Someone connects directly to a read replica and runs a DDL statement — CREATE INDEX, ALTER TABLE, TRUNCATE. In some Aurora MySQL versions, read replicas do not enforce read_only for every statement type. The DDL creates a new GTID on the replica that doesn't exist on the writer. Now the replica has a GTID the writer doesn't know about, and the two have diverged.
3. Incorrect replica restart after binlog purge. The writer's binlog retention is set too low (the RDS default is NULL, which lets Aurora purge binlogs as soon as it no longer needs them). If a replica falls behind by more than the binlog retention window, the writer has already purged the binlogs the replica needs. When the replica restarts and tries to resume from where it left off, the required binlog files are gone. Aurora may automatically re-initialize from a snapshot, but if the snapshot predates the purge point, the replica ends up with a gap. You can check your exposure with the queries below.
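Two quick checks on the writer show how close you are to this failure mode: gtid_purged is the set of committed GTIDs whose binlog events are already gone, and SHOW BINARY LOGS lists the binlog files still on disk.
-- On the writer: GTIDs whose binlog events have been purged.
-- A replica that still needs any of these cannot resume binlog replication.
SELECT @@GLOBAL.gtid_purged;
-- On the writer: binlog files still available (the oldest file
-- bounds how far back a replica can resume from)
SHOW BINARY LOGS;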
Detecting the Problem
Don't rely on the RDS console for replication health. Add these checks to your monitoring stack:
-- Connect to the read replica and check replication status.
-- On Aurora MySQL 3 (MySQL 8.0), the equivalent is SHOW REPLICA STATUS\G,
-- with Replica_* field names in place of Slave_*.
SHOW SLAVE STATUS\G
-- Key fields to examine:
-- Slave_IO_Running: Yes
-- Slave_SQL_Running: Yes
-- Seconds_Behind_Master: lag in seconds (NULL when the SQL thread is stopped)
-- Retrieved_Gtid_Set: GTIDs received from writer
-- Executed_Gtid_Set: GTIDs actually applied
-- Last_IO_Error: any IO thread errors
-- Last_SQL_Error: any SQL thread errors (this is where gap errors appear)
-- Compare executed GTIDs on replica vs writer
-- On the writer:
SELECT @@gtid_executed AS writer_executed;
-- On the replica:
SELECT @@gtid_executed AS replica_executed;
-- Compute the gap: transactions on writer not yet on replica
-- On the replica:
SELECT GTID_SUBTRACT(
'3E11FA47-71CA-11E1-9E33-C80AA9429562:1-12345678', -- writer gtid_executed
@@gtid_executed -- this replica's gtid_executed
) AS replication_gap;
If GTID_SUBTRACT returns anything other than an empty string, there's a gap. The returned value is the set of GTIDs the writer has committed that this replica has not applied.
Also check for unexpected GTIDs on the replica — transactions the writer doesn't have:
-- GTIDs on the replica that the writer doesn't have (the dangerous direction)
-- Run on the replica, passing in the writer's gtid_executed value
SELECT GTID_SUBTRACT(
@@gtid_executed,
'3E11FA47-71CA-11E1-9E33-C80AA9429562:1-12345678' -- writer gtid_executed
) AS extra_on_replica;
If this returns anything non-empty, the replica has applied transactions the writer never had. This is a data consistency problem, and the only safe resolution is to rebuild the replica from a fresh writer snapshot.
Resolving a Replication Gap
The resolution depends on the gap type and severity.
Gap where the replica is missing writer transactions (most common):
-- Option 1: Skip the gap by telling the replica to treat
-- the missing GTIDs as already executed.
-- Use ONLY if you've verified the skipped transactions
-- are safe to ignore (e.g., they were operational/maintenance transactions
-- not affecting application data).
-- On the replica. On RDS and Aurora, the master user typically lacks
-- the privilege to run STOP SLAVE directly; use the stored procedure:
CALL mysql.rds_stop_replication;
-- Inject the missing GTID range as empty transactions.
-- Note: SET GTID_NEXT needs SUPER (or REPLICATION_APPLIER on MySQL 8.0).
-- On engine versions that support it, CALL mysql.rds_skip_transaction_with_gtid
-- is the managed equivalent of each injection below.
-- Replace with your actual gap range
SET GTID_NEXT = '3E11FA47-71CA-11E1-9E33-C80AA9429562:100000';
BEGIN; COMMIT;
SET GTID_NEXT = '3E11FA47-71CA-11E1-9E33-C80AA9429562:100001';
BEGIN; COMMIT;
-- ... repeat for each GTID in the gap ...
SET GTID_NEXT = 'AUTOMATIC';
CALL mysql.rds_start_replication;
-- Verify replication resumed
SHOW SLAVE STATUS\G
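For more than a handful of missing GTIDs, typing the statements by hand is error-prone. A minimal shell sketch of the same injection, assuming the connecting user is permitted to set GTID_NEXT; the host, user, and range values are placeholders:
#!/usr/bin/env bash
# Sketch: inject empty transactions for every GTID in a gap range.
# HOST, UUID, START, and END are placeholders -- substitute your own values.
HOST="my-aurora-replica-1.example.us-east-1.rds.amazonaws.com"
UUID="3E11FA47-71CA-11E1-9E33-C80AA9429562"
START=100000
END=100050

SQL=""
for seq in $(seq "$START" "$END"); do
  SQL+="SET GTID_NEXT='${UUID}:${seq}'; BEGIN; COMMIT; "
done
SQL+="SET GTID_NEXT='AUTOMATIC';"

# Prompts for the password; runs the whole batch in a single session
mysql -h "$HOST" -u admin -p -e "$SQL"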
For large gaps or gaps involving application data transactions: Skipping is unsafe. The only correct resolution is to drop the replica and rebuild it from a current snapshot of the writer. In Aurora, this means creating a new read replica from the cluster — Aurora will automatically provision it from the shared storage volume, which is already synchronized with the writer.
# Aurora cluster rebuild via AWS CLI
# Delete the diverged replica
aws rds delete-db-instance \
--db-instance-identifier my-aurora-replica-1 \
--skip-final-snapshot
# Create a new replica from the cluster
# It inherits the writer's GTID state automatically
aws rds create-db-instance \
--db-instance-identifier my-aurora-replica-1 \
--db-cluster-identifier my-aurora-cluster \
--db-instance-class db.r6g.2xlarge \
--engine aurora-mysql
Preventing Gaps in the First Place
Three configuration changes eliminate the most common gap scenarios:
Set binlog retention appropriately. Aurora's default manages binlog retention automatically (often aggressively short). For clusters with read replicas that might lag during heavy batch operations, set an explicit minimum retention:
-- Set minimum binlog retention to 48 hours
-- Run on the Aurora writer instance
CALL mysql.rds_set_configuration('binlog retention hours', 48);
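To confirm the change took effect, read the configuration back with the companion procedure:
-- Run on the writer; look for 'binlog retention hours' = 48
CALL mysql.rds_show_configuration;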
Never write directly to read replicas. Enforce this with database privileges rather than network rules alone — give application accounts read-only grants, since a security group cannot tell a connection to the cluster reader endpoint apart from one to an instance endpoint. Use the cluster reader endpoint exclusively for application read traffic, and reserve direct replica connections for DBA diagnostics.
Monitor replication health independently of the console. Set up a CloudWatch custom metric or an external monitoring check that queries SHOW SLAVE STATUS on each replica every minute and alerts on Slave_SQL_Running != Yes, Seconds_Behind_Master > threshold, or any non-empty Last_SQL_Error. The RDS console health checks are not a substitute for application-level replication monitoring.
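A minimal sketch of such a check as a cron-driven shell script. The host, credentials, namespace, and metric names are placeholders, and it assumes the mysql client and AWS CLI are available on the monitoring host:
#!/usr/bin/env bash
# Sketch: publish replica SQL-thread health and lag to CloudWatch.
REPLICA_HOST="my-aurora-replica-1.example.us-east-1.rds.amazonaws.com"

# SHOW SLAVE STATUS\G prints one "Field: value" pair per line
STATUS=$(mysql -h "$REPLICA_HOST" -u monitor -p"$MONITOR_PW" -e 'SHOW SLAVE STATUS\G')

SQL_RUNNING=$(echo "$STATUS" | awk -F': ' '/Slave_SQL_Running:/ {print $2}')
LAG=$(echo "$STATUS" | awk -F': ' '/Seconds_Behind_Master:/ {print $2}')
[ "$LAG" = "NULL" ] && LAG=-1   # a stopped SQL thread reports NULL

# 1 = SQL thread applying, 0 = stopped (alert on 0)
aws cloudwatch put-metric-data \
  --namespace "Custom/AuroraReplication" \
  --metric-name ReplicaSQLRunning \
  --dimensions InstanceId=my-aurora-replica-1 \
  --value "$([ "$SQL_RUNNING" = "Yes" ] && echo 1 || echo 0)"

aws cloudwatch put-metric-data \
  --namespace "Custom/AuroraReplication" \
  --metric-name ReplicaLagSeconds \
  --dimensions InstanceId=my-aurora-replica-1 \
  --value "$LAG"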
Aurora-specific behavior: Aurora MySQL uses its own storage layer for cluster replication between writer and reader instances — GTID replication applies to binlog-based replicas (cross-region replicas, external MySQL instances). Within a single Aurora cluster, reader instances share storage with the writer and don't use traditional binlog replication. GTID gaps are most relevant for cross-region Aurora replica clusters and Aurora-to-external-MySQL replication topologies.
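A quick way to tell which kind of replica you are connected to, assuming default Aurora behavior: in-cluster readers are not binlog replicas and should return an empty result from the status command, while binlog-based replicas return a populated row.
-- On the instance in question:
SHOW SLAVE STATUS\G
-- Empty set: an in-cluster Aurora reader (shared-storage replication;
-- GTID gaps do not apply here)
-- Populated row: a binlog replica (cross-region or external target;
-- everything in this article applies)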
Cross-Region Replica Topology Considerations
Cross-region Aurora replicas use binlog-based replication and are where GTID gaps most commonly manifest in production. The binlog must travel from the source region to the target region over the AWS network. Latency and occasional network interruptions mean the cross-region replica can fall behind during peak write periods.
For cross-region replicas used as DR targets, monitor the gap in terms of both seconds of lag and transaction count. A replica that's 30 seconds behind a writer processing 10,000 TPS has a much larger data exposure window than one that's 30 seconds behind a 100 TPS writer. Size your cross-region replica at least as large as the writer instance — replication SQL thread throughput is roughly proportional to instance compute, and an undersized replica will fall progressively further behind under sustained write load.
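To watch the backlog in GTID terms rather than seconds, compare what the replica has received against what it has applied. A sketch assuming performance_schema is enabled (it is by default on Aurora MySQL):
-- On the replica: GTIDs received from the source but not yet applied
SELECT GTID_SUBTRACT(
  (SELECT RECEIVED_TRANSACTION_SET
     FROM performance_schema.replication_connection_status
    LIMIT 1),
  @@gtid_executed
) AS received_not_applied;
-- Counting the transactions in the returned ranges gives the
-- transaction-count lag described above.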