// DATABASE PERFORMANCE · 11 MIN READ

DB2 LUW HADR: Auto-Start Across Multi-Server Clusters

A real-world DB2 LUW HADR auto-start architecture for multi-server clusters — using LSN comparison, shared-storage coordination files, and state verification rather than hard-coded start order. With the specific failure modes we see in production and the scripts that solve them.

// PUBLISHED 2026-04-11 · LANIAKEA TEAM

The Failure Mode This Architecture Solves

Four-server DB2 LUW HADR cluster. Primary in DC-A. Principal standby in DC-B. Two auxiliary standbys. Quarterly planned-failover tests pass. Monitoring green for 18 months.

A power event hits DC-A on a Saturday night. All four servers reboot simultaneously. The hard-coded startup scripts bring up DB2 on each server in a fixed order — primary first, then standbys — regardless of which instance actually holds the newest LSN.

The former primary comes up, but has a partial write loss at the storage layer from the abrupt shutdown. Its LSN is now 40,000 behind the principal standby, which had successfully replicated up to the instant before the outage. The scripts promote the primary anyway because that's what they always do. The cluster resumes writes. Forty minutes of transactions on the standby are silently lost.

This article walks through the architecture that prevents this class of failure — coordinated startup based on LSN, not on hard-coded order, with verification at every step.

Core Principles

  1. Do not assume the instance that was primary before the outage is still the right primary after
  2. Compare LSN across all surviving instances before promoting anyone
  3. Use a shared-storage coordination file (or a small Zookeeper/etcd cluster) as the source of truth during startup
  4. Verify HADR state after takeover, not just the takeover command's return code
  5. Fail safe: if coordination cannot be established, leave the cluster in STANDBY mode and page an operator

Startup Architecture

Step 1 — Identify Self, Read Coordination File

Each server, on boot, runs a startup script that identifies its hostname and reads a coordination file on shared storage (NFS, GPFS, or a small EFS volume in AWS). The coordination file is JSON with per-instance state: last known role, last known LSN, last graceful shutdown timestamp.
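
A minimal sketch of this first step, assuming a Python startup script and a coordination file at /hadr/coordination.json on the shared mount; the path and field names are illustrative:

# hadr_startup.py, step 1 sketch: identify self, read the shared coordination file.
# COORD_FILE and the JSON layout are illustrative; adapt to your shared-storage path.
import json
import socket
from pathlib import Path

COORD_FILE = Path("/hadr/coordination.json")   # shared NFS / GPFS / EFS mount

def read_coordination():
    """Return (own hostname, parsed coordination dict or None)."""
    hostname = socket.getfqdn()
    try:
        state = json.loads(COORD_FILE.read_text())
    except (OSError, json.JSONDecodeError):
        # Unreadable or corrupt file: later steps refuse to auto-promote.
        state = None
    return hostname, state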

Step 2 — Start DB2 in STANDBY Mode First

Every instance starts in STANDBY mode. No one starts as primary yet. This is the most important design decision: you never assume primary role until the cluster has coordinated.
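
A sketch of the standby-only startup, assuming the script runs as the DB2 instance owner with the instance environment sourced; PRODDB is a placeholder database name:

# Step 2 sketch: every node activates HADR as a standby, never as primary.
import subprocess

def start_as_standby(dbname: str = "PRODDB") -> bool:
    """Start the instance and bring the database up in HADR standby role."""
    # Note: db2start returns a warning if the instance is already running;
    # handle that case however your environment requires.
    for cmd in ("db2start",
                f"db2 start hadr on database {dbname} as standby"):
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            # Fail safe: stay down rather than guess at a role.
            return False
    return True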

Step 3 — Gather Current LSN From All Surviving Instances

After a configurable stabilization window (we use 60 seconds), the startup script queries the current replay log position (STANDBY_REPLAY_LOG_POS from the MON_GET_HADR table function) on every reachable instance. Unreachable instances are excluded. A minimum quorum (typically 2 of 4 for a 4-node cluster, 2 of 3 for a 3-node cluster) must be achieved.
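
A sketch of the gather-and-quorum step. query_log_pos is a hypothetical helper that runs "SELECT STANDBY_REPLAY_LOG_POS FROM TABLE(MON_GET_HADR(NULL))" on a remote node (over SSH or a cataloged connection) and returns the position as an integer:

# Step 3 sketch: collect replay log positions and enforce the quorum rule.
from typing import Callable, Optional

CLUSTER = ["db-node-a.corp.local", "db-node-b.corp.local",
           "db-node-c.corp.local", "db-node-d.corp.local"]
MIN_QUORUM = 2   # 2 of 4 for this cluster

def gather_log_positions(query_log_pos: Callable[[str], int]) -> dict:
    positions: dict[str, Optional[int]] = {}
    for host in CLUSTER:
        try:
            positions[host] = query_log_pos(host)
        except Exception:
            positions[host] = None        # unreachable: excluded from the quorum
    reachable = [h for h, pos in positions.items() if pos is not None]
    if len(reachable) < MIN_QUORUM:
        raise RuntimeError("quorum not met: refusing to elect a primary")
    return positions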

Step 4 — Elect the Instance with the Highest LSN

The instance with the highest replay log position across the quorum is elected primary. This is written to the coordination file with a timestamp and a term counter (borrowed from Raft; it prevents split-brain if coordination files diverge).
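
A sketch of the election and the coordination-file write, assuming the integer log positions from the previous step and an atomic write via temp-file-plus-rename on the shared mount:

# Step 4 sketch: highest replay log position wins; the result is written to the
# coordination file with a bumped term counter. The rename keeps readers from
# ever seeing a half-written file.
import json
import os
import time
from pathlib import Path

COORD_FILE = Path("/hadr/coordination.json")

def elect_primary(positions: dict, current_term: int) -> str:
    reachable = {h: p for h, p in positions.items() if p is not None}
    winner = max(reachable, key=reachable.get)           # highest log position
    record = {
        "term": current_term + 1,                        # new term per election
        "elected_primary": winner,
        "elected_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "elected_lsn": reachable[winner],
        "quorum_members": [{"host": h, "lsn": p, "reachable": p is not None}
                           for h, p in positions.items()],
        "peer_achieved": False,
    }
    tmp = COORD_FILE.with_suffix(".tmp")
    tmp.write_text(json.dumps(record, indent=2))
    os.replace(tmp, COORD_FILE)                          # atomic within the directory
    return winner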

Step 5 — Promote and Verify

The elected instance runs TAKEOVER HADR ON DATABASE ... BY FORCE. The script then polls HADR_STATE until it reaches PEER with at least one standby, or a 120-second timeout expires.

If PEER state is not reached, the script does NOT accept the primary role. It reverts to STANDBY, writes a failure indicator to the coordination file, and pages on-call. Refusing to take over badly is better than taking over into a split-brain.
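
A sketch of the promote-and-verify step. run_db2 is a hypothetical helper that executes a statement through the db2 CLP and returns the return code and output:

# Step 5 sketch: force takeover, then trust HADR_STATE, not the command's rc.
import time

def takeover_and_verify(run_db2, dbname: str = "PRODDB", timeout_s: int = 120) -> bool:
    rc, _ = run_db2(f"takeover hadr on database {dbname} by force")
    if rc != 0:
        return False
    query = ("SELECT HADR_STATE FROM TABLE(MON_GET_HADR(NULL)) "
             "WHERE HADR_STATE = 'PEER'")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        rc, out = run_db2(query)
        if rc == 0 and "PEER" in out:
            return True        # at least one standby is applying logs in PEER
        time.sleep(5)
    return False               # caller reverts to standby and pages on-call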

Step 6 — Other Instances Configure as Standbys

Once the primary is elected and in PEER state, remaining instances read the coordination file and configure their HADR_TARGET_LIST to match. They start as standbys pointing to the new primary.
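
A sketch of the standby reconfiguration, using the same hypothetical run_db2 helper; the :4000 HADR ports are placeholders, and on some DB2 versions HADR must be stopped before the HADR_TARGET_LIST change takes effect:

# Step 6 sketch: point HADR_TARGET_LIST at the new primary and rejoin as standby.
def configure_standby(run_db2, coord: dict, self_host: str, dbname: str = "PRODDB"):
    primary = coord["elected_primary"]
    others = [m["host"] for m in coord["quorum_members"]
              if m["host"] not in (self_host, primary)]
    target_list = "|".join(f"{h}:4000" for h in [primary] + others)
    run_db2(f'update db cfg for {dbname} using HADR_TARGET_LIST "{target_list}"')
    run_db2(f"start hadr on database {dbname} as standby")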

HADR_STATE vs HADR_ROLE

The single most important monitoring distinction in DB2 HADR:

A takeover command that returns success sets HADR_ROLE = PRIMARY. That does NOT mean HADR_STATE is PEER. An instance can be PRIMARY but in REMOTE_CATCHUP_PENDING — which means no standby is actually applying logs and you have no HA protection.

Monitor HADR_STATE. Alert on HADR_STATE. Automate based on HADR_STATE.

From a real incident: takeover command returned success. HADR_ROLE = PRIMARY. Monitoring dashboard: green. Application: completely down for 47 minutes. Because HADR_STATE was REMOTE_CATCHUP_PENDING and no standby was reachable. The cluster was up as a single node with no HA. Monitor HADR_STATE, or live through this.
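
A sketch of the check worth automating: count standbys in PEER instead of trusting HADR_ROLE. Assumes the ibm_db driver and an open connection conn to the primary:

# Monitoring sketch: how many standbys does the primary actually see in PEER?
import ibm_db

def standbys_in_peer(conn) -> int:
    sql = ("SELECT HADR_ROLE, STANDBY_ID, HADR_STATE "
           "FROM TABLE(MON_GET_HADR(NULL))")
    stmt = ibm_db.exec_immediate(conn, sql)
    peers = 0
    row = ibm_db.fetch_assoc(stmt)
    while row:
        if row["HADR_STATE"] == "PEER":
            peers += 1
        row = ibm_db.fetch_assoc(stmt)
    return peers   # 0 means PRIMARY with no HA protection, whatever HADR_ROLE says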

Coordination File Format

A minimal coordination file (updated only during cluster state transitions, never during steady-state):

{
  "term": 47,
  "elected_primary": "db-node-b.corp.local",
  "elected_at": "2026-04-11T21:14:03Z",
  "elected_lsn": "0x00003A7B:0000FE42:0000",
  "quorum_members": [
    {"host": "db-node-a.corp.local", "lsn": "0x00003A7A:0000AC11:0000", "reachable": true},
    {"host": "db-node-b.corp.local", "lsn": "0x00003A7B:0000FE42:0000", "reachable": true},
    {"host": "db-node-c.corp.local", "lsn": "0x00003A7A:00009E88:0000", "reachable": true},
    {"host": "db-node-d.corp.local", "lsn": null, "reachable": false}
  ],
  "peer_achieved": true,
  "peer_achieved_at": "2026-04-11T21:15:21Z"
}

The term counter prevents stale coordination files from being used after a network partition. Any instance that finds its local view of the cluster term less than the coordination file's term must re-read and re-elect.
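
A sketch of that term check; coord is the parsed coordination file and local_term is the last term this node participated in:

# Term-check sketch: a node whose local view of the cluster term is behind the
# coordination file discards that view and re-runs the gather/elect steps.
def reconcile_term(local_term: int, coord: dict) -> str:
    """Return the action this node should take based on the term comparison."""
    if coord is None:
        return "hold-standby-and-page"       # unreadable file: never auto-promote
    if local_term < coord["term"]:
        return "re-read-and-re-elect"        # our view is stale after a partition
    return "proceed"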

Failure Modes and Responses

Split Brain Attempt

If two instances simultaneously try to elect themselves primary (rare, but possible in a severe network partition), the term counter catches it. Only one term number wins. The other instance detects the conflict and reverts to standby.

All Instances Down Then All Come Up Simultaneously

The LSN election picks the right primary regardless of which instance was primary before. No hard-coded order. No silent transaction loss.

Coordination File Corruption

If the coordination file is unreadable, the cluster refuses to auto-promote. All instances stay in standby. On-call gets paged. Manual intervention required. This is the correct behavior — better an outage than a wrong promotion.

Partial Recovery — 2 of 4 Instances

With a 2-of-4 quorum minimum, the cluster can elect a primary with only two survivors. With a single survivor, it cannot — and should not — auto-promote. That path requires operator intervention.

Monitoring That Actually Helps

Three metrics we alert on for every HADR cluster:

  1. HADR_STATE — must be PEER on primary, REMOTE_CATCHUP or PEER on each standby
  2. HADR_LOG_GAP — standby lag in bytes; alert if >10 MB sustained for >5 minutes
  3. Coordination file age — should not change during steady state; unexpected changes indicate election churn

We do not alert on the return code of individual takeover commands. The post-takeover state verification is the real signal.
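
A sketch of the HADR_LOG_GAP rule from the list above, paging only when the gap stays over 10 MB for five consecutive minutes; sample_log_gap and page_oncall are hypothetical hooks into your monitoring stack, with sample_log_gap returning HADR_LOG_GAP in bytes:

# Log-gap alert sketch: require a sustained breach before paging.
import time

GAP_THRESHOLD_BYTES = 10 * 1024 * 1024
SUSTAINED_SECONDS = 5 * 60

def watch_log_gap(sample_log_gap, page_oncall, poll_interval: int = 30):
    breach_started = None
    while True:
        if sample_log_gap() > GAP_THRESHOLD_BYTES:
            breach_started = breach_started or time.time()
            if time.time() - breach_started >= SUSTAINED_SECONDS:
                page_oncall("HADR_LOG_GAP above 10 MB for 5 minutes")
                breach_started = None        # re-arm after paging
        else:
            breach_started = None
        time.sleep(poll_interval)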

Testing the Architecture

The only proof that HADR auto-start works is destroying it. Quarterly chaos tests we run on every HADR cluster we're responsible for:

  1. Simultaneous hard power-off and reboot of every node (the scenario from the opening incident)
  2. Kill of the current primary mid-workload, verifying the LSN-based election picks the instance with the newest logs
  3. Deliberate corruption of the coordination file, verifying every node stays in standby and pages on-call
  4. Network partition between data centers, verifying the term counter blocks a split-brain promotion
  5. Recovery with only two of four instances up, verifying the quorum rules hold

Every test produces a runbook update. Every runbook update gets reviewed at the next quarterly test.

The Bottom Line

DB2 LUW HADR is a mature, well-documented technology. The auto-start tooling IBM ships is adequate for single-pair HADR with a single primary and a single standby. For multi-node clusters with auxiliary standbys and coordination across data centers, you have to build the cluster coordination layer yourself — and most teams build it naively with hard-coded start order.

The architecture above — LSN-based election, coordination file with term counter, state verification at every step, failure-safe defaults — is the pattern we deploy in every multi-node HADR cluster. It's not more complex than the naive version; it's differently-complex, and it fails in ways the business can tolerate.

Audit your HADR architecture?

We review HADR setups for the failure modes above and help you build the coordination layer correctly. 30-minute scoping call, written findings in 5–7 business days.