// DATABASE OPERATIONS · 12 MIN READ

Postgres Logical Replication Slots and the Disk Bloat They Leave Behind

Inactive logical replication slots pin WAL on disk indefinitely. Postgres will not recycle a single WAL segment until the slot's consumer advances its LSN — and if that consumer is gone, your disk fills up quietly until the database crashes.

// PUBLISHED 2023-09-12 · LANIAKEA TEAM

The Incident Pattern

The alert comes in at 2 AM: disk utilization on the primary is at 94% and climbing. The DBA checks tablespace sizes — none of them grew unexpectedly. The database itself is the same size it was yesterday. But df -h shows the volume is almost full. The culprit is almost always pg_wal/.
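
On a self-managed instance you can confirm from SQL that the WAL directory is what grew, even without shell access. A quick sketch using pg_ls_waldir() (PostgreSQL 10+; restricted to superusers and pg_monitor members by default):

-- Total size and segment count of pg_wal/
SELECT pg_size_pretty(sum(size)) AS wal_dir_size,
       count(*) AS segment_count
FROM pg_ls_waldir();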

Running a quick check reveals the problem immediately:

SELECT slot_name,
       plugin,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn ASC NULLS FIRST;

The output shows a slot named debezium_prod with active = false and retained_wal of 187 GB. The Debezium CDC connector that consumed this slot was retired three weeks ago. Nobody dropped the slot. Postgres has been faithfully retaining every WAL segment since then, because the slot's restart_lsn has not moved.

This is not a rare edge case. It is one of the most consistent disk-related incidents we encounter in Postgres environments that have ever run CDC or logical replication tooling such as pglogical, Debezium, or AWS DMS.

How Logical Replication Slots Work

To understand why this happens, you need to understand what a logical replication slot actually does at the storage level.

Postgres WAL (Write-Ahead Log) is a sequential record of every change made to the database. Normally, WAL is recycled aggressively: once a segment is no longer needed for crash recovery and has been archived (if archiving is enabled), it is deleted or overwritten. The checkpoint process advances redo_lsn forward and marks older WAL as eligible for recycling.

A logical replication slot changes this calculus. When a slot is created, Postgres records the slot's restart_lsn — the earliest LSN the slot's consumer might still need. Postgres guarantees that no WAL segment containing data at or after restart_lsn will be recycled, regardless of checkpoints. The slot essentially places a hold on WAL deletion.

While the slot is active and its consumer is running, the consumer continuously advances confirmed_flush_lsn, signaling to Postgres that it has processed changes up to that point. Postgres can then recycle WAL behind that watermark. The slot acts as a moving bookmark, and WAL accumulation stays bounded.

When the consumer stops — whether it crashes, is decommissioned, or simply disconnects — the confirmed_flush_lsn freezes. Postgres continues writing new WAL, but it also continues retaining all WAL from the frozen LSN forward. Every megabyte of WAL your database writes from that moment on is also a megabyte that cannot be recycled.
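
You can watch this bookmark mechanism from a plain psql session. A minimal sketch using the SQL-level slot functions and the test_decoding output plugin that ships in contrib; demo_slot and demo_t are throwaway names:

-- Create a throwaway logical slot
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Generate some WAL the slot must now retain
CREATE TABLE demo_t (id int);
INSERT INTO demo_t SELECT generate_series(1, 10000);

-- restart_lsn has not moved: the slot is pinning that WAL
SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;

-- Consume the changes; this advances confirmed_flush_lsn
-- (restart_lsn follows lazily), releasing the WAL behind it
SELECT count(*) FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- Clean up
SELECT pg_drop_replication_slot('demo_slot');
DROP TABLE demo_t;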

Quantifying the Accumulation Rate

The rate at which a stale slot accumulates WAL equals your database's WAL generation rate. On a busy OLTP database generating 5 GB/hour of WAL, a slot left inactive for a week accumulates 840 GB of retained WAL. On a write-heavy database generating 20 GB/hour, a slot left inactive over a weekend can retain 960 GB.

You can measure your current WAL generation rate:

-- Average WAL generation rate since WAL statistics were last reset
-- (pg_stat_wal requires PostgreSQL 14+; reset via pg_stat_reset_shared('wal'))
SELECT
  round(
    wal_bytes / extract(epoch from (now() - stats_reset))
    / 1024 / 1024,
    2
  ) AS wal_mb_per_second
FROM pg_stat_wal;

-- Or check the WAL position delta over a short interval
SELECT pg_current_wal_lsn() AS current_lsn;
-- Wait 60 seconds, then substitute the LSN captured above
SELECT pg_size_pretty(
  pg_wal_lsn_diff(pg_current_wal_lsn(), '0/1A3F2C10')
) AS wal_in_60s;
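
Once you know the rate, you can project how long a stale slot has before it blows through a disk budget. A back-of-the-envelope sketch; the 500 GB budget and 5 GB/hour rate are illustrative assumptions, substitute your own numbers:

-- Hours until each inactive slot's retained WAL reaches an assumed
-- 500 GB budget, at an assumed constant rate of 5 GB/hour
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained,
       round(
         (500 * 1024.0 ^ 3 - pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
         / (5 * 1024.0 ^ 3),
         1
       ) AS hours_until_budget
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND NOT active;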

On Aurora PostgreSQL, you can observe retained WAL via CloudWatch's TransactionLogsDiskUsage metric, but it does not break down retention by slot. You still need to query pg_replication_slots to identify the specific slot responsible.

Detection: What to Monitor

The Core Query

This query should run as a scheduled job and alert when any slot retains more than a threshold you define (commonly 10 GB or 5% of available disk):

SELECT
  slot_name,
  slot_type,
  plugin,
  active,
  active_pid,
  database,
  restart_lsn,
  confirmed_flush_lsn,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal_size,
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes,
  -- Coarse proxy only: Postgres does not record when a slot went inactive,
  -- so the widest in-core bound is the time since the server started
  CASE
    WHEN active THEN NULL
    ELSE now() - pg_postmaster_start_time()
  END AS rough_inactive_since
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY retained_wal_bytes DESC NULLS LAST;

Alerting Thresholds

Two metrics are worth alerting on independently:

1. Retained WAL per slot: pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn), compared against an absolute threshold or a percentage of the volume.
2. Slot inactivity duration: how long a slot has gone without advancing its confirmed_flush_lsn.

The second metric requires either a monitoring table that records when each slot was last seen active (the pg_cron approach below is one implementation), or an external alerting system that tracks the gap between when a slot's confirmed_flush_lsn last advanced and the current time.

Aurora-Specific Behavior

On Aurora PostgreSQL, the parameter rds.logical_replication enables logical replication slots. AWS has added some guardrails: if a slot's retained WAL exceeds a threshold relative to allocated storage, Aurora will log warnings in the PostgreSQL error log. But it will not drop the slot automatically. You still hit disk exhaustion if you do not intervene.

Aurora also has a hard limit: logical replication slots do not survive a failover to a reader instance. If your primary fails over, all logical slots must be recreated on the new primary, and consumers must reconnect and either re-snapshot from scratch or resume from an externally stored checkpoint. This is a significant operational consideration for CDC-based pipelines.

Prevention: The Slot Lifecycle Problem

The root cause of stale slot accumulation is not a Postgres design flaw — it is a lifecycle management gap. Slots are created when CDC tools or replication pipelines are set up, but teams rarely have a formal process for dropping them when those pipelines are decommissioned.

Ownership and Runbooks

Every logical replication slot should have an owner — the team or service responsible for consuming it. Document this in a runbook or in the slot name itself:

-- Naming convention that embeds ownership
-- (CREATE_REPLICATION_SLOT is a replication-protocol command; from a
-- normal SQL session, use the function form)
SELECT pg_create_logical_replication_slot('debezium_orders_team_prod', 'pgoutput');

-- Or use a descriptive naming scheme
-- {consumer}_{database}_{environment}_{created_date}
-- debezium_inventory_prod_20230901

When you retire a CDC pipeline or replication consumer, dropping its slot should be part of the decommission checklist, not an afterthought.

max_slot_wal_keep_size

PostgreSQL 13 introduced max_slot_wal_keep_size, a parameter that caps how much WAL a slot can retain; beyond the cap, Postgres invalidates the slot rather than letting WAL accumulate. This is a safety valve, not a solution:

-- In postgresql.conf or via ALTER SYSTEM
max_slot_wal_keep_size = '10GB'

-- Check current setting
SHOW max_slot_wal_keep_size;
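
PostgreSQL 13 also added two columns to pg_replication_slots that show how close each slot is to this limit: wal_status and safe_wal_size. A quick headroom check:

-- How much more WAL each slot may retain before invalidation
-- (safe_wal_size is NULL when max_slot_wal_keep_size is -1, i.e. unlimited)
SELECT slot_name,
       wal_status,
       pg_size_pretty(safe_wal_size) AS headroom
FROM pg_replication_slots;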

When a slot hits this limit, Postgres marks it as invalid:

-- Check for slots invalidated by the limit
SELECT slot_name, wal_status
FROM pg_replication_slots
WHERE wal_status = 'lost';

An invalidated slot cannot be resumed; its consumer must start over with a fresh slot. For Debezium, that means rerunning the initial snapshot. For DMS CDC tasks, it means restarting the task with a full reload. max_slot_wal_keep_size prevents disk exhaustion but does not save you from data re-ingestion work. Set it as a backstop, but treat it as a last resort, not a primary protection strategy.

Automated Slot Monitoring with pg_cron

If you run pg_cron on your Postgres instance, you can automate slot health checks and alert generation:

-- Create a table to track slot state over time
CREATE TABLE dba_slot_history (
  check_time     timestamptz DEFAULT now(),
  slot_name      text,
  active         boolean,
  retained_bytes bigint
);

-- Schedule hourly checks
SELECT cron.schedule(
  'slot-health-check',
  '0 * * * *',
  $$
  INSERT INTO dba_slot_history (slot_name, active, retained_bytes)
  SELECT
    slot_name,
    active,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  FROM pg_replication_slots
  WHERE slot_type = 'logical';
  $$
);

-- Query to find slots not seen active for more than 2 hours
-- (grouping on the last active sighting avoids flagging slots that
-- went inactive briefly and then recovered)
SELECT
  slot_name,
  max(check_time) FILTER (WHERE active) AS last_seen_active,
  now() - max(check_time) FILTER (WHERE active) AS inactive_duration,
  max(retained_bytes) AS max_retained_bytes
FROM dba_slot_history
GROUP BY slot_name
HAVING coalesce(max(check_time) FILTER (WHERE active), '-infinity'::timestamptz)
       < now() - interval '2 hours';
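
One housekeeping note: dba_slot_history grows without bound unless pruned. A companion job, assuming a 30-day retention window is enough for trend analysis:

-- Prune slot history older than 30 days, daily at 03:00
SELECT cron.schedule(
  'slot-history-prune',
  '0 3 * * *',
  $$ DELETE FROM dba_slot_history WHERE check_time < now() - interval '30 days'; $$
);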

Remediation: Safely Dropping a Stale Slot

Before dropping a slot, confirm the consumer is truly gone and not temporarily disconnected. Check for active connections from the consumer's application:

-- Check for any connections from the consumer application
SELECT pid, usename, application_name, client_addr, state, wait_event
FROM pg_stat_activity
WHERE application_name ILIKE '%debezium%'
   OR application_name ILIKE '%dms%'
   OR application_name ILIKE '%replication%';

-- Verify the slot has not advanced recently
-- (compare restart_lsn to what you recorded earlier)
SELECT slot_name, restart_lsn, confirmed_flush_lsn, active
FROM pg_replication_slots
WHERE slot_name = 'debezium_prod';

If the slot is inactive and the consumer is confirmed gone, drop it:

-- Drop the slot
SELECT pg_drop_replication_slot('debezium_prod');

-- Verify it is gone
SELECT slot_name FROM pg_replication_slots WHERE slot_name = 'debezium_prod';
-- Should return 0 rows

After you drop the slot, the retained segments become eligible for recycling, typically at the next checkpoint. Disk space recovery is therefore not instantaneous: actual reclamation depends on checkpoint timing and archiver behavior.
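
If you need the space back promptly and can absorb the I/O burst, you can force the issue as a superuser (or, on PostgreSQL 15+, as a member of the pg_checkpoint role):

-- Force an immediate checkpoint so eligible WAL segments are recycled now
CHECKPOINT;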

If archiving is enabled (archive_mode = on), the archiver must successfully archive each WAL segment before Postgres marks it for deletion. If your archive destination is slow or unreachable, WAL segments will pile up even after the slot is dropped. Check archiver status:

SELECT last_archived_wal,
       last_archived_time,
       last_failed_wal,
       last_failed_time,
       archived_count,
       failed_count
FROM pg_stat_archiver;

The RDS / Aurora Complication

On RDS PostgreSQL and Aurora PostgreSQL, you cannot directly access the filesystem to check the size of pg_wal/. You depend on CloudWatch metrics and the pg_replication_slots view. A few RDS-specific considerations:

1. rds.logical_replication must be enabled in the parameter group before any logical slot can exist, so audit every instance where it is set.
2. TransactionLogsDiskUsage tracks total retained WAL but does not attribute it to a slot; pair the metric with the pg_replication_slots queries above.
3. Logical slots do not survive an Aurora failover, so consumer resume logic has to live outside the slot itself.

Slots and Logical Replication Publications

Logical replication in Postgres separates the publication (what changes to capture) from the subscription (who receives them). When you create a subscription, Postgres automatically creates a replication slot on the publisher:

-- On the publisher
CREATE PUBLICATION orders_pub FOR TABLE orders, order_items;

-- On the subscriber (this creates a slot on the publisher automatically)
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=primary.internal dbname=prod user=replicator'
  PUBLICATION orders_pub;

-- Verify the slot was created on the publisher
SELECT slot_name, active FROM pg_replication_slots;
-- orders_sub  | f  (initially inactive until subscriber connects)

When you drop a subscription, the slot on the publisher is dropped automatically, as long as the subscriber can still reach the publisher at drop time. If the publisher is unreachable, or if the subscription has been dissociated from its slot with ALTER SUBSCRIPTION ... SET (slot_name = NONE), the slot on the publisher remains and keeps accumulating WAL.

-- Drop subscription and its slot cleanly
DROP SUBSCRIPTION orders_sub;  -- drops the slot on the publisher automatically

-- If the publisher is unreachable: disable the subscription, dissociate
-- its slot, then drop (you MUST manually drop the slot on the publisher later)
ALTER SUBSCRIPTION orders_sub DISABLE;
ALTER SUBSCRIPTION orders_sub SET (slot_name = NONE);
DROP SUBSCRIPTION orders_sub;

-- Later, on the publisher, once it is reachable again:
SELECT pg_drop_replication_slot('orders_sub');

Summary: The Operational Checklist

Logical replication slots are safe and reliable when managed. The failures happen when management is absent. The checklist that prevents the 2 AM alert:

1. Name every slot so its consumer and owner are obvious, and document who is responsible for it.
2. Make dropping the slot an explicit step in every CDC or replication decommission runbook.
3. Alert on retained WAL per slot and on slot inactivity duration, not just on total disk usage.
4. Set max_slot_wal_keep_size as a backstop against disk exhaustion, knowing invalidation forces re-ingestion.

The fix is always the same: drop the stale slot. Getting there without a 2 AM disk-full incident is purely a question of monitoring and process discipline.

Need a second opinion on your Postgres environment?

We'll review your replication slot configuration, WAL retention, and monitoring gaps as part of a free database assessment. Findings delivered in 5–7 business days.