The Incident Pattern
The alert comes in at 2 AM: disk utilization on the primary is at 94% and climbing. The DBA checks tablespace sizes — none of them grew unexpectedly. The database itself is the same size it was yesterday. But df -h shows the volume is almost full. The culprit is almost always pg_wal/.
Running a quick check reveals the problem immediately:
SELECT slot_name,
plugin,
active,
restart_lsn,
confirmed_flush_lsn,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn ASC NULLS FIRST;
The output shows a slot named debezium_prod with active = false and retained_wal of 187 GB. The Debezium CDC connector that consumed this slot was retired three weeks ago. Nobody dropped the slot. Postgres has been faithfully retaining every WAL segment since then, because the slot's restart_lsn has not moved.
This is not a rare edge case. It is one of the most consistent disk-related incidents we encounter on Postgres environments that have ever run CDC, logical replication, or replication tooling like pglogical, Debezium, or AWS DMS.
How Logical Replication Slots Work
To understand why this happens, you need to understand what a logical replication slot actually does at the storage level.
Postgres WAL (Write-Ahead Log) is a sequential record of every change made to the database. Normally, WAL is recycled aggressively: once a segment is no longer needed for crash recovery and has been archived (if archiving is enabled), it is deleted or overwritten. The checkpoint process advances redo_lsn forward and marks older WAL as eligible for recycling.
A logical replication slot changes this calculus. When a slot is created, Postgres records the slot's restart_lsn — the earliest LSN the slot's consumer might still need. Postgres guarantees that no WAL segment containing data at or after restart_lsn will be recycled, regardless of checkpoints. The slot essentially places a hold on WAL deletion.
While the slot is active and its consumer is running, the consumer continuously advances confirmed_flush_lsn, signaling to Postgres that it has processed changes up to that point. Postgres can then recycle WAL behind that watermark. The slot acts as a moving bookmark, and WAL accumulation stays bounded.
When the consumer stops — whether it crashes, is decommissioned, or simply disconnects — the confirmed_flush_lsn freezes. Postgres continues writing new WAL, but it also continues retaining all WAL from the frozen LSN forward. Every megabyte of WAL your database writes from that moment on is also a megabyte that cannot be recycled.
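This mechanic is easy to see for yourself on a scratch database. A minimal sketch using the built-in test_decoding demo plugin (the slot and table names here are throwaway illustrations, not anything from production):

```sql
-- Create a throwaway slot; its restart_lsn is pinned at creation time
SELECT slot_name, lsn
FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Generate some WAL, then watch retention grow: nothing is consuming the slot
CREATE TABLE demo_t (id int);
INSERT INTO demo_t SELECT generate_series(1, 100000);

SELECT pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained
FROM pg_replication_slots
WHERE slot_name = 'demo_slot';

-- Consuming the changes advances the slot's position and releases the hold
SELECT count(*) FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- Clean up
SELECT pg_drop_replication_slot('demo_slot');
```

Rerunning the retained-WAL query after the consume step shows the figure drop back toward zero, which is exactly what a healthy, continuously connected consumer does for you in the background.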
Quantifying the Accumulation Rate
The rate at which a stale slot accumulates WAL equals your database's WAL generation rate. On a busy OLTP database generating 5 GB/hour of WAL, a slot left inactive for a week accumulates 840 GB of retained WAL. On a write-heavy database generating 20 GB/hour, a slot left inactive over a weekend can retain 960 GB.
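The arithmetic is deliberately simple; a tiny helper (hypothetical, not part of any Postgres tooling) makes the relationship explicit and lets you plug in your own numbers:

```python
def retained_wal_gb(wal_gb_per_hour: float, inactive_hours: float) -> float:
    """WAL a stale slot pins: generation rate times time inactive."""
    return wal_gb_per_hour * inactive_hours

# 5 GB/hour for one week (168 hours)
print(retained_wal_gb(5, 7 * 24))   # 840
# 20 GB/hour over a weekend (48 hours)
print(retained_wal_gb(20, 48))      # 960
```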
You can measure your current WAL generation rate:
-- Average WAL generation rate since statistics were last reset
-- (pg_stat_wal is available on PostgreSQL 14+; it is a single-row view,
-- so the aggregates simply collapse to that row's values)
SELECT
round(
sum(wal_bytes) / extract(epoch from (now() - min(stats_reset)))
/ 1024 / 1024,
2
) AS wal_mb_per_second
FROM pg_stat_wal;
-- Or check the WAL position delta over a short interval
SELECT pg_current_wal_lsn() AS current_lsn;
-- Wait 60 seconds, then substitute the LSN recorded above
SELECT pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), '0/1A3F2C10')  -- LSN from the first query
) AS wal_in_60s;
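The same delta can be computed outside the database by parsing the LSN strings directly: an LSN is a 64-bit byte position written as two hex words, high/low. These helpers are a standalone sketch (not part of psycopg or any Postgres client library):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN like '0/1A3F2C10' to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_diff_bytes(later: str, earlier: str) -> int:
    """Equivalent of pg_wal_lsn_diff(later, earlier) on modern Postgres,
    where the LSN space is a continuous byte stream."""
    return lsn_to_int(later) - lsn_to_int(earlier)

print(lsn_diff_bytes("0/1A400000", "0/1A3F2C10"))  # 54256 bytes of WAL
```

Note that one unit in the high word is 2^32 bytes, so a slot whose restart_lsn trails the current LSN by a whole high-word increment is holding 4 GiB of WAL.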
On Aurora PostgreSQL, you can observe retained WAL via CloudWatch's TransactionLogsDiskUsage metric, but it does not break down retention by slot. You still need to query pg_replication_slots to identify the specific slot responsible.
Detection: What to Monitor
The Core Query
This query should run as a scheduled job and alert when any slot retains more than a threshold you define (commonly 10 GB or 5% of available disk):
SELECT
slot_name,
slot_type,
plugin,
active,
active_pid,
database,
restart_lsn,
confirmed_flush_lsn,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal_size,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes,
-- Upper bound on inactivity: a slot cannot have been idle longer than
-- the server has been up; track true inactivity externally (see below)
CASE
WHEN active THEN NULL
ELSE now() - pg_postmaster_start_time()
END AS rough_inactive_since
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY retained_wal_bytes DESC NULLS LAST;
Alerting Thresholds
Two metrics are worth alerting on independently:
- Retained WAL absolute size: Alert at 10 GB on small instances, 50 GB on large ones. This is a direct disk consumption risk.
- Slot inactive duration: Alert when a slot has active = false for more than 30 minutes. An inactive slot for 30 minutes is unusual in most environments; an inactive slot for 6 hours is a problem.
The second metric requires either a monitoring table that records when each slot was last seen active, or an external alerting system that tracks the gap between when a slot's confirmed_flush_lsn last advanced and the current time.
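The two thresholds compose into a simple alert rule. A sketch of the decision logic a monitoring job might apply to each row of pg_replication_slots (the SlotState shape and the threshold constants are illustrative assumptions, not an existing tool):

```python
from dataclasses import dataclass

@dataclass
class SlotState:
    name: str
    active: bool
    retained_bytes: int
    inactive_minutes: float  # tracked externally, e.g. from a history table

RETAINED_LIMIT_BYTES = 10 * 1024 ** 3  # 10 GB; raise on large instances
INACTIVE_LIMIT_MINUTES = 30

def slot_alerts(slot: SlotState) -> list[str]:
    """Return alert reasons for one slot; empty list means healthy."""
    reasons = []
    if slot.retained_bytes > RETAINED_LIMIT_BYTES:
        gb = slot.retained_bytes / 1024 ** 3
        reasons.append(f"{slot.name}: retains {gb:.1f} GB of WAL")
    if not slot.active and slot.inactive_minutes > INACTIVE_LIMIT_MINUTES:
        reasons.append(f"{slot.name}: inactive for {slot.inactive_minutes:.0f} minutes")
    return reasons

# The incident from the opening: 187 GB retained, inactive for three weeks
print(slot_alerts(SlotState("debezium_prod", False, 187 * 1024 ** 3, 21 * 24 * 60)))
```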
Aurora-Specific Behavior
On Aurora PostgreSQL, the parameter rds.logical_replication enables logical replication slots. AWS has added some guardrails: if a slot's retained WAL exceeds a threshold relative to allocated storage, Aurora will log warnings in the PostgreSQL error log. But it will not drop the slot automatically. You still hit disk exhaustion if you do not intervene.
Aurora also has a hard operational limit: logical replication slots do not survive a failover to a reader instance. If your primary fails over, all logical slots must be recreated on the new primary, and your consumers must reconnect and resume either from an externally tracked checkpoint or by re-snapshotting from scratch. This is a significant operational consideration for CDC-based pipelines.
Prevention: The Slot Lifecycle Problem
The root cause of stale slot accumulation is not a Postgres design flaw — it is a lifecycle management gap. Slots are created when CDC tools or replication pipelines are set up, but teams rarely have a formal process for dropping them when those pipelines are decommissioned.
Ownership and Runbooks
Every logical replication slot should have an owner — the team or service responsible for consuming it. Document this in a runbook or in the slot name itself:
-- Naming convention that embeds ownership
-- (pg_create_logical_replication_slot is the SQL-level interface;
-- CREATE_REPLICATION_SLOT is the equivalent replication-protocol command)
SELECT pg_create_logical_replication_slot('debezium_orders_team_prod', 'pgoutput');
-- Or use a descriptive naming scheme
-- {consumer}_{database}_{environment}_{created_date}
-- debezium_inventory_prod_20230901
When you retire a CDC pipeline or replication consumer, dropping its slot should be part of the decommission checklist, not an afterthought.
max_slot_wal_keep_size
PostgreSQL 13 introduced max_slot_wal_keep_size, a parameter that caps how much WAL a slot can retain; beyond the cap, Postgres invalidates the slot rather than continuing to consume disk. This is a safety valve, not a solution:
-- In postgresql.conf or via ALTER SYSTEM
max_slot_wal_keep_size = '10GB'
-- Check current setting
SHOW max_slot_wal_keep_size;
When a slot hits this limit, Postgres marks it as invalid:
-- Check for invalidated slots
-- (wal_status is available from PG13; the more detailed
-- invalidation_reason column was added in PG17)
SELECT slot_name, wal_status
FROM pg_replication_slots
WHERE wal_status = 'lost';
An invalidated slot cannot be resumed — its consumer must be reconfigured to start a new slot from scratch. For Debezium, this means rerunning the initial snapshot. For DMS CDC tasks, it means resetting the task to reload. max_slot_wal_keep_size prevents disk exhaustion but does not save you from data re-ingestion work. Set it as a backstop, but treat it as a last resort, not a primary protection strategy.
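On PostgreSQL 13 and later, pg_replication_slots also exposes safe_wal_size, the WAL headroom remaining before a slot crosses the max_slot_wal_keep_size limit and is invalidated. It is worth folding into the monitoring query so you see invalidation coming rather than discovering it after the fact:

```sql
-- Headroom before each slot hits max_slot_wal_keep_size (PG13+)
-- safe_wal_size is NULL when max_slot_wal_keep_size is -1 (unlimited)
SELECT slot_name,
       wal_status,
       pg_size_pretty(safe_wal_size) AS headroom_before_invalidation
FROM pg_replication_slots
WHERE slot_type = 'logical';
```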
Automated Slot Monitoring with pg_cron
If you run pg_cron on your Postgres instance, you can automate slot health checks and alert generation:
-- Create a table to track slot state over time
CREATE TABLE dba_slot_history (
check_time timestamptz DEFAULT now(),
slot_name text,
active boolean,
retained_bytes bigint
);
-- Schedule hourly checks
SELECT cron.schedule(
'slot-health-check',
'0 * * * *',
$$
INSERT INTO dba_slot_history (slot_name, active, retained_bytes)
SELECT
slot_name,
active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
FROM pg_replication_slots
WHERE slot_type = 'logical';
$$
);
-- Query to find slots inactive for more than 2 hours
-- (assumes the slot has not flapped between active and inactive)
SELECT
slot_name,
min(check_time) AS first_seen_inactive,
now() - min(check_time) AS inactive_duration,
max(retained_bytes) AS max_retained_bytes
FROM dba_slot_history
WHERE active = false
GROUP BY slot_name
HAVING min(check_time) < now() - interval '2 hours';
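The history table grows without bound if left alone. A companion pg_cron job that keeps a rolling window is a reasonable sketch (the schedule and the 30-day retention are arbitrary choices, tune to taste):

```sql
-- Purge slot history older than 30 days, nightly at 03:00
SELECT cron.schedule(
    'slot-history-purge',
    '0 3 * * *',
    $$DELETE FROM dba_slot_history
      WHERE check_time < now() - interval '30 days'$$
);
```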
Remediation: Safely Dropping a Stale Slot
Before dropping a slot, confirm the consumer is truly gone and not temporarily disconnected. Check for active connections from the consumer's application:
-- Check for any connections from the consumer application
SELECT pid, usename, application_name, client_addr, state, wait_event
FROM pg_stat_activity
WHERE application_name ILIKE '%debezium%'
OR application_name ILIKE '%dms%'
OR application_name ILIKE '%replication%';
-- Verify the slot has not advanced recently
-- (compare restart_lsn to what you recorded earlier)
SELECT slot_name, restart_lsn, confirmed_flush_lsn, active
FROM pg_replication_slots
WHERE slot_name = 'debezium_prod';
If the slot is inactive and the consumer is confirmed gone, drop it:
-- Drop the slot
SELECT pg_drop_replication_slot('debezium_prod');
-- Verify it is gone
SELECT slot_name FROM pg_replication_slots WHERE slot_name = 'debezium_prod';
-- Should return 0 rows
After dropping the slot, Postgres will immediately begin recycling the accumulated WAL segments during the next checkpoint cycle. Disk space recovery is not instantaneous — the segments are eligible for deletion, but actual reclamation depends on checkpoint timing and archiver behavior.
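If you want to accelerate reclamation and confirm it is happening, you can force a checkpoint and watch the WAL directory shrink. Both commands require elevated privileges (CHECKPOINT needs superuser or, from PG15, the pg_checkpoint role; pg_ls_waldir needs superuser or pg_monitor), and on RDS/Aurora you would watch the CloudWatch storage metrics instead of listing the directory:

```sql
-- Force a checkpoint so recyclable segments are removed sooner
CHECKPOINT;

-- Count and total size of segments currently in pg_wal/ (PG10+)
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS total_wal
FROM pg_ls_waldir();
```

Running the second query before and after the checkpoint gives you a direct before/after picture of how much space the dropped slot was holding.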
If archiving is enabled (archive_mode = on), the archiver must successfully archive each WAL segment before Postgres marks it for deletion. If your archive destination is slow or unreachable, WAL segments will pile up even after the slot is dropped. Check archiver status:
SELECT last_archived_wal,
last_archived_time,
last_failed_wal,
last_failed_time,
archived_count,
failed_count
FROM pg_stat_archiver;
The RDS / Aurora Complication
On RDS PostgreSQL and Aurora PostgreSQL, you cannot directly access the filesystem to check pg_wal/ directory size. You depend on CloudWatch metrics and the pg_replication_slots view. A few RDS-specific considerations:
- RDS storage autoscaling: If enabled, autoscaling will grow the volume when WAL accumulation threatens to fill it. This masks the problem — you pay for more storage while the real issue (a stale slot) goes undetected. Watch for unexpected storage growth in CloudWatch even if disk alerts do not fire.
- Read replica slots: On RDS, logical replication slots can only be created on the primary. Read replicas do not participate in logical replication slots. If you are using physical read replicas alongside logical slots, the replicas do not hold WAL independently — only the primary slot does.
- Parameter group changes: Enabling logical replication on RDS requires setting rds.logical_replication = 1 in a parameter group and rebooting the instance. This is a one-time setup. Once enabled, slot creation and management work the same as on self-managed Postgres.
Slots and Logical Replication Publications
Logical replication in Postgres separates the publication (what changes to capture) from the subscription (who receives them). When you create a subscription, Postgres automatically creates a replication slot on the publisher:
-- On the publisher
CREATE PUBLICATION orders_pub FOR TABLE orders, order_items;
-- On the subscriber (this creates a slot on the publisher automatically)
CREATE SUBSCRIPTION orders_sub
CONNECTION 'host=primary.internal dbname=prod user=replicator'
PUBLICATION orders_pub;
-- Verify the slot was created on the publisher
SELECT slot_name, active FROM pg_replication_slots;
-- orders_sub | f (initially inactive until subscriber connects)
When you drop a subscription, Postgres also drops the slot on the publisher, provided the subscriber can still connect to the publisher at drop time. If the publisher is unreachable when the subscription is dropped, or if the subscription was first disassociated from its slot with ALTER SUBSCRIPTION ... SET (slot_name = NONE), the slot on the publisher remains and will accumulate WAL.
-- Drop subscription and its slot cleanly
DROP SUBSCRIPTION orders_sub; -- drops slot on publisher automatically
-- If the publisher is unreachable, disassociate the slot first
-- (then you MUST manually drop the slot on the publisher later)
ALTER SUBSCRIPTION orders_sub DISABLE;
ALTER SUBSCRIPTION orders_sub SET (slot_name = NONE);
DROP SUBSCRIPTION orders_sub;
Summary: The Operational Checklist
Logical replication slots are safe and reliable when managed. The failures happen when management is absent. The checklist that prevents the 2 AM alert:
- Monitor pg_replication_slots for inactive slots with retained WAL above your threshold. Automate this alert.
- Set max_slot_wal_keep_size as a backstop: 10–50 GB depending on your available disk and WAL generation rate.
- Include slot cleanup in decommission runbooks for any CDC pipeline or replication consumer.
- Name slots with enough context to identify their owner without needing to trace back through Terraform or application configs.
- On Aurora, do not rely on storage autoscaling to absorb slot-driven WAL accumulation — it will, but at your expense, and it will not alert you to the underlying problem.
- Verify archiver health after dropping a stale slot if archiving is enabled. Recycling depends on the archiver clearing segments first.
The fix is always the same: drop the stale slot. Getting there without a 2 AM disk-full incident is purely a monitoring and process discipline question.
Need a second opinion on your Postgres environment?
We'll review your replication slot configuration, WAL retention, and monitoring gaps as part of a free database assessment. Findings delivered in 5–7 business days.