// DATABASE OPERATIONS · 13 MIN READ

Postgres Streaming Replication Lag: Three Root Causes and How to Tell Them Apart

Replication lag in pg_stat_replication can mean three completely different things, and the fix for each is different. Treating them all as a network problem is the most common mistake — and it makes two of the three root causes worse.

// PUBLISHED 2024-07-16 · LANIAKEA TEAM

The Three Columns Nobody Distinguishes

When someone reports replication lag, the first query they run is usually this:

SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

PostgreSQL exposes three lag measurements: write_lag, flush_lag, and replay_lag. Each represents a different phase of the replication pipeline and points to a different class of problem. Most teams look at replay_lag alone, see a number they don't like, and start tuning the wrong thing.

This article walks through each root cause, how to distinguish them from the output of pg_stat_replication and OS-level metrics, and what actually fixes each one.

Understanding the Replication Pipeline

PostgreSQL streaming replication moves WAL from primary to standby through a sequence of steps:

  1. The primary generates WAL; its WAL sender process streams it over the network to the standby
  2. The standby's WAL receiver writes the incoming WAL to its local WAL files with an OS-level write (write_lsn)
  3. The WAL receiver fsyncs those writes to durable storage (flush_lsn)
  4. The startup process replays the WAL into the standby's data files (replay_lsn)

The lag columns in pg_stat_replication measure the time between the primary confirming a WAL record and each of these phases completing on the standby. They are wall-clock durations, not byte offsets — which is an important distinction when you're diagnosing the cause.
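You can read the same pipeline from the standby's side as well. A minimal sketch, run on the standby, using the built-in recovery functions (the column aliases are illustrative):

-- Run on the standby: how far replay trails the WAL received so far
SELECT
  pg_last_wal_receive_lsn()                 AS received_lsn,
  pg_last_wal_replay_lsn()                  AS replayed_lsn,
  pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                  pg_last_wal_replay_lsn()) AS replay_backlog_bytes,
  now() - pg_last_xact_replay_timestamp()   AS time_since_last_replay;

Note that time_since_last_replay also grows when the primary is simply idle, so read it alongside the byte backlog rather than on its own.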

Root Cause 1: Network Throughput Bottleneck

The network bottleneck is the easiest root cause to identify and the one teams actually have least often. The signature: write_lag is elevated, and flush_lag and replay_lag track it closely, because the standby cannot receive the WAL fast enough and every later phase inherits the delay.

When the network is the bottleneck, WAL accumulates on the primary faster than the standby can receive it. You can confirm this by measuring how much WAL is pending at each stage of the pipeline:

SELECT
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS pending_send_bytes, -- generated but not yet sent
  pg_wal_lsn_diff(sent_lsn, write_lsn)            AS send_queue_bytes,   -- sent but not yet written on the standby
  pg_wal_lsn_diff(write_lsn, flush_lsn)           AS flush_queue_bytes,  -- written but not yet fsynced
  pg_wal_lsn_diff(flush_lsn, replay_lsn)          AS replay_queue_bytes  -- fsynced but not yet replayed
FROM pg_stat_replication;

Large pending_send_bytes and send_queue_bytes with small flush_queue_bytes and replay_queue_bytes point to the network. Confirm with network throughput metrics on both the primary and standby — look for the standby's NIC approaching saturation during WAL-heavy write windows.

On AWS, this appears most often when an RDS for PostgreSQL or self-managed EC2 primary is in one Availability Zone and the standby is in another, with a sustained write workload generating 200–500 MB/s of WAL. Cross-AZ bandwidth is shared infrastructure. The fix is not to tune PostgreSQL — it's to reduce WAL volume (by addressing write amplification) or accept the lag as a structural constraint of your write pattern.
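Before concluding the network is the constraint, it helps to know how much WAL the primary actually generates. A sketch assuming PostgreSQL 14 or later, where pg_stat_wal is available (on older versions, sample pg_current_wal_lsn() twice and diff the values with pg_wal_lsn_diff):

-- Run on the primary: cumulative WAL statistics since the last stats reset (PG 14+)
SELECT
  wal_records,
  wal_fpi,                            -- full-page images, often the largest WAL amplifier
  pg_size_pretty(wal_bytes) AS wal_generated,
  stats_reset
FROM pg_stat_wal;

Sample it twice a few minutes apart and divide the wal_bytes delta by the interval to get a MB/s rate you can compare against available cross-AZ throughput.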

Root Cause 2: Standby I/O Saturation

I/O saturation on the standby is more common than a network bottleneck and harder to diagnose, because the lag metrics look similar from the PostgreSQL side. The signature: write_lag stays low, flush_lag climbs, and replay_lag follows flush_lag.

The standby is receiving WAL from the network without delay, but the local disk can't flush it fast enough. This happens when the standby's storage is under-provisioned relative to the primary's write throughput — a very common situation when teams size standby replicas smaller than the primary to save cost.

Check the standby's disk write throughput and IOPS against the provisioned limits:

# On the standby host — monitor WAL write throughput
iostat -xz 1 30 | grep -E "Device|$(ls /dev/nvme*n1 | head -1 | xargs basename)"

# For gp3/io2 on AWS, compare actual write throughput to provisioned limits
# CloudWatch metric: VolumeWriteBytes for the standby EBS volume

On EC2-hosted standbys, this often appears when the primary is on an io2 volume with 64,000 provisioned IOPS and the standby is on gp3 with 3,000 IOPS. The WAL flush on the standby creates a write queue, which shows up as elevated flush_lag.
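The same backlog is visible from the standby itself through pg_stat_wal_receiver. A sketch assuming PostgreSQL 13 or later, where the view exposes both written_lsn and flushed_lsn:

-- Run on the standby: WAL received and written but not yet fsynced
SELECT
  written_lsn,
  flushed_lsn,
  pg_wal_lsn_diff(written_lsn, flushed_lsn) AS unflushed_bytes,
  last_msg_receipt_time
FROM pg_stat_wal_receiver;

A persistently large unflushed_bytes while iostat shows the volume at its provisioned limit confirms that the standby's storage, not the network, is the constraint.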

The fix is either to match standby storage provisioning to the primary, or — if the standby is a read replica where you don't require synchronous durability — to evaluate whether synchronous_commit = remote_write is acceptable for your RPO requirements. Remote write acknowledges after the standby has written WAL to its OS buffer but before flushing to disk, which eliminates I/O wait from the synchronous commit path.

The synchronous_commit Tradeoff

If you are running synchronous replication (synchronous_standby_names is set), the flush lag on the standby directly adds to your primary's commit latency. The primary's COMMIT does not return until the standby reports at the commit level you specified:

synchronous_commit = on           -- wait for standby flush to disk
synchronous_commit = remote_write -- wait for standby OS write (not fsync)
synchronous_commit = remote_apply -- wait for standby replay (highest latency)
synchronous_commit = local        -- don't wait for standby at all

Reducing synchronous_commit from on to remote_write on a standby with I/O-limited storage can reduce primary commit latency by 15–40 ms at peak write load. The RPO implication: WAL that the standby has written to its OS cache but not yet fsynced can be lost if the standby crashes, which only costs you data if you promote that standby before it has caught back up from the primary, or lose the primary at the same time.
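synchronous_commit can also be relaxed per session or per transaction rather than cluster-wide, which keeps flush-level durability for critical writes and applies remote_write only where the RPO tradeoff is acceptable. A sketch (the audit_events table is purely illustrative):

-- Relax durability for one low-value transaction; the cluster default stays 'on'
BEGIN;
SET LOCAL synchronous_commit = 'remote_write';  -- reverts automatically at COMMIT/ROLLBACK
INSERT INTO audit_events (payload) VALUES ('non-critical event');  -- hypothetical table
COMMIT;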

Root Cause 3: Replay CPU or Lock Contention

Replay lag without significant flush lag is the most insidious root cause because it doesn't appear as a hardware bottleneck at all. The signature: write_lag and flush_lag stay small while replay_lag grows, sometimes without bound.

The standby is receiving and flushing WAL without issue, but the startup process applying WAL to data files is falling behind. This happens for three specific reasons.

Reason A: Long-Running Queries on the Standby

The most common cause of replay lag on read replicas is long-running queries conflicting with replay. A streaming replication standby cannot apply WAL records that would remove row versions a standby query's snapshot still needs, or that require locks held by standby sessions. PostgreSQL calls these recovery conflicts.

When a query on the standby is reading a relation that the primary is vacuuming, truncating, or heavily updating, the startup process must either wait for the query to finish or cancel it. The default behavior is to wait up to max_standby_streaming_delay (default: 30 seconds), then cancel the conflicting query.

You can observe this with:

-- On the standby: is the startup (replay) process stuck behind a recovery conflict?
SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event LIKE 'RecoveryConflict%';

-- And which standby queries have been running long enough to cause one?
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND backend_type = 'client backend'
ORDER BY query_start
LIMIT 10;

If the startup process shows a RecoveryConflictSnapshot wait, replay is paused until the conflicting standby queries finish or are cancelled. The system view pg_stat_database_conflicts tracks cumulative conflict counts:

SELECT
  datname,
  confl_tablespace,
  confl_lock,
  confl_snapshot,
  confl_bufferpin,
  confl_deadlock
FROM pg_stat_database_conflicts;

A high confl_snapshot count means the primary's vacuum activity is regularly conflicting with read queries on the standby. The fix is not to tune replication — it's to tune autovacuum on the primary to run less aggressively during query-heavy windows on the standby, or to set hot_standby_feedback = on to let the standby inform the primary about its oldest active transaction.

Reason B: hot_standby_feedback and Bloat

hot_standby_feedback = on prevents recovery conflicts by telling the primary the oldest transaction ID active on the standby, which prevents the primary from vacuuming rows that the standby's queries still need. This solves the conflict problem but introduces a new one: if the standby has a long-running query, the primary's vacuum cannot clean up dead tuples for as long as that query runs, causing table bloat on the primary.
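If you enable feedback, watch the cost it imposes on the primary. A sketch: the ALTER runs on the standby (the parameter is reloadable), the query runs on the primary, and backend_xmin is the oldest transaction the standby has reported back:

-- On the standby: report the oldest running query's xmin back to the primary
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();

-- On the primary: how far back is each standby pinning the vacuum horizon?
SELECT
  application_name,
  client_addr,
  backend_xmin,
  age(backend_xmin) AS xmin_age   -- transactions vacuum cannot yet clean up
FROM pg_stat_replication;

A steadily climbing xmin_age during long analytical queries on the replica is the bloat cost of feedback made visible.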

This is a genuine tradeoff. For OLAP workloads running long analytical queries on replicas, hot_standby_feedback = on is usually the right call. For OLTP read replicas serving short queries, it's often safer to leave it off and set max_standby_streaming_delay to a reasonable value (5–10 seconds) and accept the occasional query cancellation.
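For the OLTP read-replica case, the corresponding sketch tightens the cancellation window instead; both parameters take effect on reload (run on the standby):

-- On the standby: cancel conflicting reads quickly instead of stalling replay
ALTER SYSTEM SET hot_standby_feedback = off;           -- the default
ALTER SYSTEM SET max_standby_streaming_delay = '10s';  -- default is 30s
SELECT pg_reload_conf();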

Reason C: Single-Threaded Replay on High-Volume WAL

PostgreSQL's WAL replay process is single-threaded. On very high-write primaries generating 300+ MB/s of WAL, a single startup process replaying WAL can become CPU-bound — not because the CPU is saturated overall, but because one core is pegged at 100% while the rest of the system is idle.

Diagnose this by checking CPU usage per-process on the standby:

top -p $(pgrep -f "startup") -b -n 5 -d 1

If the startup process is consistently at 95–100% of a single core while overall system CPU load is low, you have hit a replay throughput ceiling. PostgreSQL 16 added parallel apply for large transactions in logical replication, but physical streaming replication replay is still single-threaded, so the only real mitigation is to reduce WAL volume at the source.

Reducing WAL volume for replay performance means auditing wal_level. If you're running wal_level = logical for CDC consumers but the standby is a physical hot standby, the logical WAL is being generated and shipped but is useless for physical replay. Switch to wal_level = replica if logical consumers can be served from a separate logical replication source.
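Auditing this is quick. A sketch that checks the current wal_level and whether any logical replication slots still depend on it (slot names and plugins are whatever your CDC tooling created):

-- On the primary: is logical decoding still needed?
SHOW wal_level;

SELECT slot_name, plugin, active, restart_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';

If the second query returns no rows, nothing on this primary consumes logical decoding, and dropping to wal_level = replica (a restart is required) removes that overhead.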

Diagnostic Flowchart

Start by running this query on the primary every 5 seconds for a minute:

SELECT
  now() AS ts,
  client_addr,
  pg_wal_lsn_diff(sent_lsn, write_lsn)  AS send_q_bytes,
  pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_q_bytes,
  pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_q_bytes,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

Decision tree:

  1. send_q_bytes large and growing, flush_q_bytes and replay_q_bytes small: the network is the bottleneck (Root Cause 1).
  2. flush_q_bytes large while send_q_bytes stays small: standby I/O saturation (Root Cause 2).
  3. replay_q_bytes large while send_q_bytes and flush_q_bytes stay small: recovery conflicts or single-threaded replay on the standby (Root Cause 3).

Monitoring Setup for Ongoing Visibility

For production systems, export these metrics to your monitoring stack continuously. A custom query definition for postgres_exporter looks like this:

# prometheus/postgres_exporter custom queries
pg_replication_lag:
  query: |
    SELECT
      client_addr,
      EXTRACT(EPOCH FROM write_lag)::float  AS write_lag_seconds,
      EXTRACT(EPOCH FROM flush_lag)::float  AS flush_lag_seconds,
      EXTRACT(EPOCH FROM replay_lag)::float AS replay_lag_seconds,
      pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_queue_bytes
    FROM pg_stat_replication
  metrics:
    - client_addr:
        usage: LABEL
    - write_lag_seconds:
        usage: GAUGE
    - flush_lag_seconds:
        usage: GAUGE
    - replay_lag_seconds:
        usage: GAUGE
    - replay_queue_bytes:
        usage: GAUGE

Alert separately on each lag type with different thresholds. A 5-second replay_lag on a read replica with long-running analytics queries is expected and may be acceptable. A 5-second flush_lag on a synchronous standby is a direct hit to your primary's commit latency and needs immediate attention.

What This Means for Aurora

Aurora PostgreSQL has its own replication architecture that bypasses WAL streaming entirely — Aurora replicas receive storage-layer log records, not PostgreSQL WAL. The pg_stat_replication view is not populated on Aurora. Instead, use the CloudWatch AuroraReplicaLag metric, and look at AuroraReplicaLagMaximum and AuroraReplicaLagMinimum across the cluster.

Aurora replica lag is primarily a storage-layer phenomenon and is almost always under 20ms under normal load. When Aurora replica lag spikes to seconds, the cause is almost always a writer-side commit storm generating storage I/O that the replica storage nodes can't process fast enough. This is not tunable via replication parameters — it requires reducing write concurrency or scaling storage throughput via instance class upgrades.

For self-managed PostgreSQL on EC2 or RDS PostgreSQL (not Aurora), the three-root-cause model applies directly.

Diagnosing persistent replication lag on a production Postgres cluster?

We instrument streaming replication metrics, identify the actual root cause, and fix the right layer — not the wrong one. Free assessment, no obligation.