// DATABASE OPERATIONS · 12 MIN READ

Oracle ASM Rebalance Operations During Production Hours: Risk Model

Running an ASM disk group rebalance while production workloads are active is possible — but the I/O impact is non-linear and depends on power limit, extent size, and current disk utilization in ways that most teams don't model before they start.

// PUBLISHED 2022-11-15 · LANIAKEA TEAM

Why This Comes Up

There are three scenarios that force ASM rebalance operations: adding disks to expand capacity, replacing failed or degraded disks, and migrating from one disk group redundancy level to another. In an ideal world, all of these happen during scheduled maintenance windows at 2am on Sunday. In the real world, a disk fails Thursday morning, your SAN vendor delivers the replacement array Friday afternoon, and your operations team wants it incorporated before the weekend batch run starts Saturday night.

The question is never really "can we rebalance during production hours?" — ASM will let you. The question is: at what power level, for how long, and what is the measurable impact on transaction throughput and response time while it runs? Most production incidents we've investigated involving rebalance operations didn't fail because someone ran the wrong command. They failed because nobody quantified the I/O overhead before starting, and the database hit a throughput wall 45 minutes in during peak morning trading.

This is a risk model, not a prohibition. Some ASM rebalance operations are safe to run during business hours. Others will saturate your I/O subsystem. Knowing which is which requires understanding what rebalance actually does to your storage.

What ASM Rebalance Actually Does at the I/O Level

ASM stores data as extents, each built from one or more fixed-size allocation units (AUs) distributed across the disks in the disk group. When you add or remove a disk, ASM must redistribute extents so that data is evenly spread across all disks in the group. This redistribution is the rebalance operation.

The rebalance process runs in two passes. In the first (reported as REBALANCE in the PASS column of v$asm_operation), ASM relocates extents until data is evenly spread across all disks in the group. In the second (COMPACT), ASM consolidates extents toward the lower-offset regions of each disk.

The critical detail: extent migration is not a sequential read-then-write. ASM issues parallel read and write I/Os simultaneously, interleaved with your production database I/O on the same physical disks. At power limit 1 (the lowest), rebalance generates roughly 1-2 MB/s of additional I/O per disk. At power limit 11 (the maximum), it can generate 100+ MB/s per disk — effectively saturating most storage configurations.

The Power Limit: What It Controls and What It Doesn't

The ASM rebalance power limit is a concurrency throttle, not a bandwidth cap. Setting POWER 2 doesn't guarantee that rebalance will consume no more than some fixed number of MB/s of I/O. It controls the number of parallel rebalance slave processes (ARBn) that ASM spawns. More processes mean more concurrent extent moves, which means more I/O.

-- Check current rebalance activity and power level
SELECT inst_id, group_number, pass, state, power, actual, sofar, est_work, est_rate, est_minutes
FROM gv$asm_operation
WHERE operation = 'REBAL';

-- Modify power limit of running rebalance
ALTER DISKGROUP data REBALANCE POWER 3;

-- Start rebalance with explicit power limit
ALTER DISKGROUP data REBALANCE POWER 4;

-- Check disk group free space and rebalance needs
SELECT name, state, type, total_mb, free_mb,
       ROUND(100 - (free_mb / total_mb * 100), 1) AS pct_used
FROM v$asm_diskgroup;

Power limit maps only approximately to rebalance throughput on common storage configurations: expect something on the order of 1-2 MB/s per disk at POWER 1, scaling toward 100+ MB/s per disk at POWER 11.

These are illustrative ranges. Your actual throughput depends on disk type, RAID configuration, ASM extent size, and the redundancy level of the disk group (normal redundancy doubles the write I/O because ASM maintains two copies of each extent; high redundancy triples it).
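For rough planning, the two per-disk figures above can be turned into a crude estimator. This is a sketch only: linear scaling between power levels is an assumption, not documented Oracle behavior, and the anchor numbers are the illustrative figures from this article.

```python
# Crude throughput model: interpolate linearly between the two per-disk
# figures quoted above (about 1.5 MB/s at POWER 1, about 100 MB/s at
# POWER 11). Linearity is an assumption, not Oracle-documented behavior.

def est_per_disk_mbps(power: int) -> float:
    """Rough per-disk rebalance throughput estimate for a power limit."""
    lo_power, lo_rate = 1, 1.5
    hi_power, hi_rate = 11, 100.0
    return lo_rate + (power - lo_power) * (hi_rate - lo_rate) / (hi_power - lo_power)

print(round(est_per_disk_mbps(1), 1))   # 1.5
print(round(est_per_disk_mbps(3), 1))   # 21.2
```

Use it only to decide which order of magnitude you are in, then validate against the actual EST_RATE once the rebalance is running.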

Measuring Your Storage Headroom Before You Start

Before starting any production-hours rebalance, establish your current I/O baseline and headroom. You need to know how much spare I/O capacity exists at your current workload level.

-- AWR-based I/O summary for the last 12 snapshot intervals
-- (assumes an 8 KB block size; counters in dba_hist_filestatxs are
--  cumulative, so rates must be computed as deltas between snapshots)
SELECT snap_id,
       ROUND(SUM(d_phyrds)  / MAX(elapsed_seconds), 0) AS read_iops,
       ROUND(SUM(d_phywrts) / MAX(elapsed_seconds), 0) AS write_iops,
       ROUND(SUM(d_phyblkrd  * 8192) / MAX(elapsed_seconds) / 1048576, 1) AS read_mbps,
       ROUND(SUM(d_phyblkwrt * 8192) / MAX(elapsed_seconds) / 1048576, 1) AS write_mbps
FROM (
  SELECT f.snap_id,
         f.phyrds    - LAG(f.phyrds)    OVER (PARTITION BY f.dbid, f.instance_number, f.file# ORDER BY f.snap_id) AS d_phyrds,
         f.phywrts   - LAG(f.phywrts)   OVER (PARTITION BY f.dbid, f.instance_number, f.file# ORDER BY f.snap_id) AS d_phywrts,
         f.phyblkrd  - LAG(f.phyblkrd)  OVER (PARTITION BY f.dbid, f.instance_number, f.file# ORDER BY f.snap_id) AS d_phyblkrd,
         f.phyblkwrt - LAG(f.phyblkwrt) OVER (PARTITION BY f.dbid, f.instance_number, f.file# ORDER BY f.snap_id) AS d_phyblkwrt,
         EXTRACT(DAY    FROM s.end_interval_time - s.begin_interval_time) * 86400
       + EXTRACT(HOUR   FROM s.end_interval_time - s.begin_interval_time) * 3600
       + EXTRACT(MINUTE FROM s.end_interval_time - s.begin_interval_time) * 60
       + EXTRACT(SECOND FROM s.end_interval_time - s.begin_interval_time) AS elapsed_seconds
  FROM dba_hist_filestatxs f
  JOIN dba_hist_snapshot s
    ON s.snap_id = f.snap_id AND s.dbid = f.dbid
   AND s.instance_number = f.instance_number
  WHERE f.snap_id >= (SELECT MAX(snap_id) - 13 FROM dba_hist_snapshot)
)
WHERE d_phyrds IS NOT NULL
GROUP BY snap_id
ORDER BY snap_id DESC
FETCH FIRST 12 ROWS ONLY;

Compare your peak I/O against your storage throughput ceiling. If you're running at 70% of your SAN's rated throughput during peak hours, adding POWER 3 rebalance (which adds another 15-20%) puts you at 85-90% utilization — dangerously close to the point where latency starts climbing non-linearly.
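The headroom check above reduces to one line of arithmetic, sketched here as a helper. The 85% ceiling is an assumed rule of thumb for where latency starts climbing non-linearly, not an Oracle-defined constant.

```python
# Headroom arithmetic from the text: current peak I/O utilization plus
# the estimated rebalance overhead, against an assumed latency-cliff
# ceiling (85% here is a rule of thumb, not an Oracle constant).

def safe_to_rebalance(peak_util_pct: float, rebalance_overhead_pct: float,
                      ceiling_pct: float = 85.0) -> bool:
    """True if peak I/O plus rebalance overhead stays below the ceiling."""
    return peak_util_pct + rebalance_overhead_pct < ceiling_pct

# The article's example: 70% peak plus the ~15-20% a POWER 3 rebalance
# adds lands at 85-90% utilization.
print(safe_to_rebalance(70, 15))  # False: 85% already touches the ceiling
print(safe_to_rebalance(50, 15))  # True
```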

The Redundancy Multiplier

Normal redundancy disk groups write each extent to two disks (mirroring). High redundancy writes to three. This means a rebalance operation in a normal redundancy group generates twice the write I/O of the user data volume being moved. A high redundancy group generates three times.

This catches many teams off guard. They calculate: "We need to move 500 GB of data, at 50 MB/s that's under 3 hours." The actual I/O load is 1 TB of reads plus 1 TB of writes for normal redundancy, not 500 GB total. At 50 MB/s of rebalance write throughput, it's over 5.5 hours — and that 50 MB/s write load is competing with your production write path.
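The arithmetic in that example can be checked directly:

```python
# Worked version of the example above. Redundancy multiplies the write
# volume: normal redundancy writes two copies of every extent, high
# redundancy three, external redundancy one.

WRITE_COPIES = {"EXTERNAL": 1, "NORMAL": 2, "HIGH": 3}

def rebalance_write_hours(user_data_gb: float, write_mbps: float,
                          redundancy: str = "NORMAL") -> float:
    """Hours of rebalance write I/O for a data volume and write rate."""
    write_gb = user_data_gb * WRITE_COPIES[redundancy]
    seconds = write_gb * 1024 / write_mbps   # GB -> MB, then MB / (MB/s)
    return seconds / 3600

print(round(rebalance_write_hours(500, 50, "EXTERNAL"), 1))  # 2.8, the optimistic plan
print(round(rebalance_write_hours(500, 50, "NORMAL"), 1))    # 5.7, mirroring doubles it
```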

-- Check disk group redundancy and estimate worst-case rebalance I/O.
-- Note: total_mb and free_mb in v$asm_diskgroup are raw figures that
-- already include the mirror copies, so redundancy is baked into
-- used_raw_mb: a full rebalance reads and writes each allocated MB
-- roughly once.
SELECT
  name,
  type AS redundancy,
  total_mb,
  (total_mb - free_mb) AS used_raw_mb,
  (total_mb - free_mb) * 2 AS estimated_rebalance_read_write_mb
FROM v$asm_diskgroup;

The Rebalance Interruption Problem

You can pause and resume ASM rebalance. However, "pause" is not atomic. If you issue ALTER DISKGROUP data REBALANCE POWER 0 to stop rebalance, ASM will complete any in-flight extent moves before stopping. In-flight extent moves can number in the hundreds at higher power levels, meaning the stop is not immediate — it may take 30-60 seconds to fully quiesce.

More importantly: if your Oracle instance crashes during a rebalance operation, ASM does not lose data. Rebalance is designed to be crash-safe. ASM will resume the operation from the last checkpoint when the instance restarts. This means a rebalance that was 40% complete when a node failed will still be 40% complete when it comes back online — not 0%.

-- Pause rebalance (drains in-flight moves first)
ALTER DISKGROUP data REBALANCE POWER 0;

-- Monitor drain progress
SELECT state, sofar, est_work, est_minutes
FROM v$asm_operation
WHERE operation = 'REBAL';
-- The REBAL row disappears from v$asm_operation once the drain is complete

-- Check per-disk balance after resume (pct_used should converge across disks)
SELECT disk_number, name, total_mb, free_mb,
       ROUND(100 - (free_mb / total_mb * 100), 1) AS pct_used
FROM v$asm_disk
WHERE group_number = (SELECT group_number FROM v$asm_diskgroup WHERE name = 'DATA')
ORDER BY disk_number;
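The drain-then-proceed step is easy to get wrong in automation, so it is worth sketching the wait loop explicitly. `fetch_rebal_state` here is a hypothetical callable you would supply (e.g. wrapping python-oracledb and querying v$asm_operation); it should return the STATE of the REBAL row, or None once no rebalance row remains.

```python
# Sketch of the drain-wait logic: after POWER 0, poll v$asm_operation
# until no active REBAL operation remains. fetch_rebal_state is a
# hypothetical callable (not part of any Oracle library) that returns
# the STATE column, or None when the REBAL row has disappeared.
import time

def wait_for_quiesce(fetch_rebal_state, timeout_s: int = 120,
                     poll_s: int = 5, sleep=time.sleep) -> bool:
    """Poll until the rebalance has fully stopped; False on timeout."""
    waited = 0
    while waited <= timeout_s:
        if fetch_rebal_state() is None:   # no active REBAL operation left
            return True
        sleep(poll_s)
        waited += poll_s
    return False
```

In a real script you would issue ALTER DISKGROUP data REBALANCE POWER 0 first, then call this before starting any maintenance that assumes the I/O load is gone.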

ASM Extent Size and Its Effect on Rebalance Duration

ASM disk groups have a configurable AU (allocation unit) size, set at creation time and not changeable without recreating the disk group. Common values are 1 MB (default), 4 MB, 8 MB, and 64 MB. The AU size determines extent size, which directly affects rebalance behavior: larger AUs mean fewer, larger extent moves, so rebalance does more sequential I/O with less metadata overhead per gigabyte moved, while smaller AUs mean many more individual moves, each a smaller I/O burst competing with production reads and writes.

-- Check current AU size for disk groups
SELECT name, block_size, allocation_unit_size / 1048576 AS au_size_mb, state
FROM v$asm_diskgroup;
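The scale of the difference is easy to see with back-of-envelope numbers, assuming the common case of one extent per AU (ASM switches to variable-sized extents only for very large files):

```python
# Back-of-envelope for why AU size matters: the number of extent
# relocations a rebalance must perform scales inversely with AU size.
# Assumes one extent per AU, the common case.

def extent_moves(data_gb: int, au_mb: int) -> int:
    """Approximate number of extent relocations to move data_gb of data."""
    return data_gb * 1024 // au_mb

print(extent_moves(500, 1))   # 512000 moves at the 1 MB default AU
print(extent_moves(500, 4))   # 128000 moves at a 4 MB AU
```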

The Safe Production-Hours Rebalance Procedure

When you must rebalance during production hours, this is the procedure we use:

Step 1: Establish I/O baseline. Pull the AWR I/O summary for the last 2 hours. Calculate current throughput as a percentage of storage rated IOPS and MB/s. If you're above 60% utilization at peak, rebalance should wait for a maintenance window or be done at POWER 1 only.

Step 2: Start at POWER 1 and observe. Begin rebalance at the minimum power level. Wait 10 minutes and check two things: the rate reported in v$asm_operation (the EST_RATE column, in allocation units moved per minute; multiply by the AU size for MB per minute), and your current database I/O wait events.

-- Start rebalance at minimum power
ALTER DISKGROUP data REBALANCE POWER 1;

-- Monitor every 2 minutes (est_rate is in allocation units per minute,
-- not MB/s; multiply by the AU size to convert)
SELECT pass, state, power, actual, sofar, est_work,
       est_rate, est_minutes
FROM v$asm_operation
WHERE operation = 'REBAL';

-- Check I/O wait events on the database simultaneously
-- (average_wait is in centiseconds; multiply by 10 for milliseconds)
SELECT event, total_waits, time_waited_micro / 1000000 AS time_waited_sec,
       ROUND(average_wait * 10, 2) AS avg_wait_ms
FROM v$system_event
WHERE event LIKE '%db file%' OR event LIKE '%direct path%'
ORDER BY time_waited_micro DESC
FETCH FIRST 10 ROWS ONLY;

Step 3: Increase incrementally if headroom permits. If after 10 minutes at POWER 1 there's no measurable change in your I/O wait events, you can move to POWER 2 and repeat the check. Never jump more than one power level at a time during production hours.

Step 4: Set a production-hours ceiling. For most OLTP environments, POWER 3 is the practical ceiling during business hours. Above that, you risk creating visible latency for end users on I/O-bound workloads.

Step 5: Plan the transition to off-hours. Even if rebalance is safe at POWER 3 during business hours, it completes faster at POWER 8 or higher during off-hours. Start it during the day if needed, then increase power once the evening batch window opens.

-- Scheduled power increase for evening window (cron-driven or scheduler job)
-- Run at 10pm to accelerate rebalance overnight
ALTER DISKGROUP data REBALANCE POWER 8;

-- Throttle back before morning peak
-- Run at 7am
ALTER DISKGROUP data REBALANCE POWER 2;

Disk Failure Scenario: The Forced Rebalance

When a disk fails in a normal redundancy disk group, ASM takes the disk offline and, once the DISK_REPAIR_TIME grace period expires (or the disk is explicitly dropped), begins a rebalance to restore redundancy. This happens automatically and you cannot prevent it. The default power level for this automatic rebalance is controlled by the ASM parameter ASM_POWER_LIMIT, which defaults to 1.

The important difference between a manual rebalance (adding capacity) and a failure-triggered rebalance (restoring redundancy) is urgency. A manual rebalance can be paused and resumed. A failure-triggered rebalance running at POWER 1 may take 8-12 hours to complete on a large disk group. During that window, you're running without redundancy protection — a second disk failure in the same failure group means data loss.

For production environments, consider increasing ASM_POWER_LIMIT from 1 to 3 as a baseline. This accepts slightly more I/O impact from failure-triggered rebalances in exchange for significantly faster redundancy restoration.
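The trade-off is simple arithmetic, under the assumption that the redundancy-restoration rate scales roughly linearly with power (power controls the slave-process count; true scaling depends on where your storage saturates):

```python
# Rough exposure-window arithmetic for the ASM_POWER_LIMIT trade-off.
# Assumes rebalance rate scales roughly linearly with power, which is
# an approximation: real scaling flattens as the storage saturates.

def unprotected_hours(hours_at_power_1: float, power: int) -> float:
    """Estimated time running without redundancy after a disk failure."""
    return hours_at_power_1 / power

print(unprotected_hours(12, 1))  # 12.0, the default exposure window
print(unprotected_hours(12, 3))  # 4.0, the same rebalance at POWER 3
```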

-- Check current ASM_POWER_LIMIT
SHOW PARAMETER asm_power_limit;

-- Increase default power limit for faster failure recovery
-- Set in ASM spfile, takes effect immediately for new rebalance operations
ALTER SYSTEM SET asm_power_limit = 3 SCOPE = BOTH;

-- Monitor disk group health
SELECT name, state, type, total_mb, free_mb, usable_file_mb, offline_disks
FROM v$asm_diskgroup;

What to Watch While Rebalance Runs

Three metrics tell you whether rebalance is having production impact: the average latency of your foreground I/O wait events (db file sequential read climbing is the first warning sign), the per-disk service times in v$asm_disk_iostat (one hot disk can drag the whole group), and the EST_RATE in v$asm_operation (a falling rate means rebalance itself is losing the contention battle, a sign the disks are saturated).

-- Real-time I/O contention monitor during rebalance
-- (read_time/write_time are cumulative seconds, so deltas between two
--  samples are more useful than lifetime totals on a long-running instance)
SELECT
  d.name AS disk_name,
  a.reads,
  a.writes,
  ROUND(a.read_time * 1000 / NULLIF(a.reads, 0), 2) AS avg_read_ms,
  ROUND(a.write_time * 1000 / NULLIF(a.writes, 0), 2) AS avg_write_ms,
  a.bytes_read / 1048576 AS total_read_mb,
  a.bytes_written / 1048576 AS total_write_mb
FROM v$asm_disk_iostat a
JOIN v$asm_disk d ON a.group_number = d.group_number AND a.disk_number = d.disk_number
ORDER BY a.bytes_written DESC;

The Real Risk: What Can Go Wrong

Running ASM rebalance during production hours at an appropriate power level is generally safe. The failure modes we've seen are: starting at a high power level with no I/O baseline and saturating the storage at peak load; underestimating the redundancy write multiplier, so a rebalance planned for an overnight window is still running at morning peak; pausing with POWER 0 during an incident and forgetting to resume, leaving the disk group unbalanced or, after a disk failure, under-protected; and leaving ASM_POWER_LIMIT at its default of 1, stretching the unprotected window after a disk failure to many hours.

The procedure above keeps you out of all of these failure modes. The core discipline is: establish your baseline, start conservatively, observe before escalating power, and have a plan for the off-hours window when you can run at full power.

Need a second opinion on your stack?

We'll review your environment and share findings in 5–7 business days. No sales pitch, no obligation.