The Problem Nobody Looks at First
When a production Oracle database starts showing intermittent throughput problems — response times that spike every few minutes then recover, batch jobs that take 20 minutes one run and 45 the next — the first instinct is to look at the SQL. Check the top wait events. Review the AWR report. Run an ASH query against the problem window.
What most teams miss: if your AWR shows log file switch (checkpoint incomplete) or log file switch completion in the top wait events, the problem isn't your SQL. It's that your redo log groups are too small for the write volume your database is producing, and Oracle is spending significant time waiting for log switches to complete before it can continue writing redo.
This is one of the oldest Oracle performance problems in the book, and it's still surprisingly common. The default redo log size when you install Oracle or create a database via DBCA is often 50MB or 200MB — a number that made sense for 1995 workloads, not for a modern OLTP system processing thousands of transactions per second against a terabyte-class database.
How Redo Logging Works and Why Size Matters
Oracle writes all changes to the redo log stream before they're committed. The Log Writer (LGWR) process writes redo buffer entries to the current online redo log group. When that group fills, Oracle performs a log switch: LGWR begins writing to the next group, and the Archiver (ARCn) process starts copying the filled group to the archive log destination.
The problem occurs when log switches happen too frequently. Every log switch triggers a checkpoint — the Database Writer (DBWR) must flush dirty buffers to disk to advance the checkpoint position. If your log groups are only 50MB and your system is generating 200MB of redo per minute, you're switching every 15 seconds. Each switch potentially blocks foreground processes waiting for LGWR to confirm the switch is complete and the next group is ready.
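As a quick sanity check, the switch interval is just group size divided by redo rate. A minimal sketch of that arithmetic (Python, illustrative only):

```python
def log_switch_interval_seconds(group_size_mb: float, redo_mb_per_min: float) -> float:
    """Seconds between log switches for a given group size and redo rate."""
    return group_size_mb / redo_mb_per_min * 60

# 50MB groups at 200MB of redo per minute: a switch every 15 seconds
print(log_switch_interval_seconds(50, 200))  # 15.0
```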
The serialization point is critical: at a log switch, Oracle must confirm the next group is not still being archived before it can begin writing to it. If archiving is slow — either due to I/O contention or a slow archive destination — the next group may not be free when the current one fills. Foreground processes wait. Throughput drops.
Diagnosing the Problem
The fastest diagnostic is the log switch frequency from V$LOG_HISTORY:
-- Log switch frequency over the last 24 hours
SELECT
TO_CHAR(FIRST_TIME, 'YYYY-MM-DD HH24') AS hour,
COUNT(*) AS switches,
ROUND(COUNT(*) / 60, 1) AS switches_per_minute
FROM V$LOG_HISTORY
WHERE FIRST_TIME > SYSDATE - 1
GROUP BY TO_CHAR(FIRST_TIME, 'YYYY-MM-DD HH24')
ORDER BY 1;
The general rule of thumb: you want no more than 4-6 log switches per hour during normal operations, and even at peak load no more than one switch every 15 minutes or so. If you're seeing 20+ switches per hour at peak, your log groups are undersized for your workload.
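The relationship also runs in reverse: switches per hour times group size gives a rough redo generation rate. This sketch assumes each switch writes a full group (in practice logs switch slightly before they are full, so it overestimates a little):

```python
def implied_redo_mb_per_hour(switches_per_hour: float, group_size_mb: float) -> float:
    """Rough redo rate implied by an observed log switch frequency."""
    return switches_per_hour * group_size_mb

# 20 switches/hour against 200MB groups implies roughly 4GB of redo per hour
print(implied_redo_mb_per_hour(20, 200))  # 4000.0
```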
Correlate this with the AWR top wait events. Query V$SYSTEM_EVENT for the log switch waits:
-- Current wait events related to log switching
SELECT
EVENT,
TOTAL_WAITS,
TOTAL_TIMEOUTS,
TIME_WAITED_MICRO / 1000000 AS time_waited_secs,
TIME_WAITED_MICRO / NULLIF(TOTAL_WAITS, 0) / 1000 AS avg_wait_ms
FROM V$SYSTEM_EVENT
WHERE EVENT IN (
'log file switch completion',
'log file switch (checkpoint incomplete)',
'log file switch (archiving needed)',
'log file sync'
)
ORDER BY TIME_WAITED_MICRO DESC;
log file switch (checkpoint incomplete) is the most damaging. It means Oracle is trying to switch to the next log group but DBWR hasn't finished advancing the checkpoint far enough — the group still contains redo that covers dirty blocks in the buffer cache. Foreground processes wait until DBWR catches up. This is a direct throughput tax on your transaction rate.
Also check current redo log sizes:
-- Current redo log configuration
SELECT
GROUP#,
MEMBERS,
BYTES / 1024 / 1024 AS size_mb,
STATUS,
ARCHIVED
FROM V$LOG
ORDER BY GROUP#;
How to Size Redo Log Groups Correctly
The target is for log switches to occur no more than once every 15-20 minutes during peak load. Work backwards from your redo generation rate to determine the right size.
Measure your peak redo generation rate from AWR, or take a quick cumulative reading directly from V$SYSSTAT:
-- Cumulative redo statistics since instance startup
-- (V$SYSSTAT is cumulative, not per-hour; only 'redo size' is in bytes,
-- the other two are a request count and a centisecond total)
SELECT
NAME,
VALUE,
CASE WHEN NAME = 'redo size' THEN ROUND(VALUE / 1024 / 1024, 1) END AS value_mb
FROM V$SYSSTAT
WHERE NAME IN (
'redo size',
'redo log space requests',
'redo log space wait time'
);
-- Better: redo generation from AWR for a specific snapshot interval
SELECT
TO_CHAR(s.BEGIN_INTERVAL_TIME, 'HH24:MI') AS interval_start,
ROUND((e.VALUE - b.VALUE) / 1024 / 1024, 1) AS redo_mb_per_interval
FROM DBA_HIST_SYSSTAT e
JOIN DBA_HIST_SYSSTAT b
ON b.SNAP_ID = e.SNAP_ID - 1
AND b.DBID = e.DBID
AND b.INSTANCE_NUMBER = e.INSTANCE_NUMBER
AND b.STAT_NAME = e.STAT_NAME
JOIN DBA_HIST_SNAPSHOT s
ON s.SNAP_ID = e.SNAP_ID
AND s.DBID = e.DBID
AND s.INSTANCE_NUMBER = e.INSTANCE_NUMBER
WHERE e.STAT_NAME = 'redo size'
AND s.BEGIN_INTERVAL_TIME > SYSDATE - 1
ORDER BY s.BEGIN_INTERVAL_TIME;
If your peak redo generation is, say, 1.2GB per hour, that's 20MB per minute. A 15-minute log switch interval requires 300MB log groups. Round up for headroom — 500MB or 1GB is reasonable for this workload.
General sizing guidelines based on redo generation rate:
- Under 100MB/hour peak: 50-200MB groups (the defaults may be fine)
- 100MB-500MB/hour peak: 500MB-1GB groups
- 500MB-2GB/hour peak: 1GB-2GB groups
- Over 2GB/hour peak: 2GB-4GB groups; also consider adding more groups so the archiver has extra runway
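The sizing arithmetic above can be wrapped in a few lines. This is a sketch; the 15-minute target and the 1.5x headroom factor are assumptions to adjust for your environment:

```python
def recommended_group_size_mb(peak_redo_mb_per_hour: float,
                              target_switch_minutes: float = 15,
                              headroom: float = 1.5) -> float:
    """Smallest group size keeping switches at or below the target interval,
    padded by a headroom factor to absorb redo bursts."""
    redo_per_minute = peak_redo_mb_per_hour / 60
    return redo_per_minute * target_switch_minutes * headroom

# 1.2GB/hour peak: 300MB bare minimum, 450MB with headroom; round up to 512MB or 1GB
print(recommended_group_size_mb(1200))  # 450.0
```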
For RAC databases, size each instance's thread independently for that instance's peak redo rate — every instance has its own log groups, but all threads archive to the same destination, so archiver throughput must accommodate the aggregate redo of the whole cluster.
Adding and Dropping Redo Log Groups Online
You cannot resize existing redo log groups directly. The procedure is to add new groups at the correct size, then drop the old undersized ones. This can be done online without downtime.
-- Step 1: Add new log groups at correct size
-- Adjust size and group numbers for your environment
ALTER DATABASE ADD LOGFILE GROUP 4 ('/u03/oradata/PRODDB/redo04a.log') SIZE 1024M;
ALTER DATABASE ADD LOGFILE GROUP 5 ('/u03/oradata/PRODDB/redo05a.log') SIZE 1024M;
ALTER DATABASE ADD LOGFILE GROUP 6 ('/u03/oradata/PRODDB/redo06a.log') SIZE 1024M;
-- Step 2: Check current group status — cannot drop CURRENT or ACTIVE groups
SELECT GROUP#, STATUS, ARCHIVED FROM V$LOG ORDER BY GROUP#;
-- Step 3: Force a log switch to cycle away from groups you want to drop
ALTER SYSTEM SWITCH LOGFILE;
-- Wait for ARCHIVED = YES on the groups you want to drop
-- Then drop them
ALTER DATABASE DROP LOGFILE GROUP 1;
ALTER DATABASE DROP LOGFILE GROUP 2;
ALTER DATABASE DROP LOGFILE GROUP 3;
-- Step 4: Physically remove the old redo files from the OS
-- Oracle does NOT delete the physical files on drop
-- rm /u03/oradata/PRODDB/redo01a.log
-- rm /u03/oradata/PRODDB/redo02a.log
-- rm /u03/oradata/PRODDB/redo03a.log
-- Step 5: Verify final configuration
SELECT GROUP#, MEMBERS, BYTES/1024/1024 AS size_mb, STATUS FROM V$LOG;
Important: You cannot drop a log group with status CURRENT (Oracle is actively writing to it) or ACTIVE (Oracle needs it for crash recovery — the checkpoint hasn't advanced past it). Run ALTER SYSTEM CHECKPOINT followed by ALTER SYSTEM SWITCH LOGFILE and wait for the status to become INACTIVE before dropping. In a busy system, cycle through all groups with multiple log switches if needed.
Multiplexing: Protecting Redo Logs from Media Failure
Redo log multiplexing — maintaining multiple copies (members) of each log group on separate physical disks — is standard practice for any production database. If a group's only member is lost, the database can hang or crash, and if the lost group was the current one, committed transactions may be unrecoverable.
-- Add a second member to each group for multiplexing
-- Members should be on different disks than the primary member
ALTER DATABASE ADD LOGFILE MEMBER
'/u04/oradata/PRODDB/redo04b.log' TO GROUP 4;
ALTER DATABASE ADD LOGFILE MEMBER
'/u04/oradata/PRODDB/redo05b.log' TO GROUP 5;
ALTER DATABASE ADD LOGFILE MEMBER
'/u04/oradata/PRODDB/redo06b.log' TO GROUP 6;
-- Verify members
SELECT l.GROUP#, lm.MEMBER, lm.STATUS
FROM V$LOG l
JOIN V$LOGFILE lm ON lm.GROUP# = l.GROUP#
ORDER BY l.GROUP#, lm.MEMBER;
On AWS EC2, place primary members on one EBS volume and mirror members on a separate EBS volume. Do not multiplex to the same EBS volume — you gain no protection against volume failure. On ASM deployments, ASM handles mirroring automatically if you use NORMAL or HIGH redundancy disk groups.
RAC-Specific Considerations
In an Oracle RAC environment, each instance has its own set of online redo log groups (its thread). The log sizing principle is the same — each thread should be sized for that instance's peak redo generation rate. But there's an additional consideration: each instance runs its own ARCn processes, yet all threads archive to the same destination, so the destination's I/O bandwidth is shared across the cluster.
If one RAC node is generating significantly more redo than others (common in unbalanced workloads), the archiver can become a bottleneck. Symptoms include log file switch (archiving needed) wait events on the heavy node, even if the log groups are adequately sized. The fix is either to increase the number of ARCn processes (ALTER SYSTEM SET log_archive_max_processes=8) or to address the workload imbalance.
-- Check redo generation per RAC instance over the last 7 days
-- (deltas of the cumulative 'redo size' statistic between snapshots;
-- deltas spanning an instance restart can be negative, filter those if needed)
SELECT
e.INSTANCE_NUMBER,
ROUND(SUM(e.VALUE - b.VALUE) / 1024 / 1024 / 1024, 1) AS total_redo_gb,
ROUND(MAX(e.VALUE - b.VALUE) / 1024 / 1024, 1) AS peak_redo_mb_per_interval
FROM DBA_HIST_SYSSTAT e
JOIN DBA_HIST_SYSSTAT b
ON b.SNAP_ID = e.SNAP_ID - 1
AND b.DBID = e.DBID
AND b.INSTANCE_NUMBER = e.INSTANCE_NUMBER
AND b.STAT_NAME = e.STAT_NAME
JOIN DBA_HIST_SNAPSHOT s
ON s.SNAP_ID = e.SNAP_ID
AND s.DBID = e.DBID
AND s.INSTANCE_NUMBER = e.INSTANCE_NUMBER
WHERE e.STAT_NAME = 'redo size'
AND s.BEGIN_INTERVAL_TIME > SYSDATE - 7
GROUP BY e.INSTANCE_NUMBER
ORDER BY e.INSTANCE_NUMBER;
The Archive Log Destination: The Other Half of the Problem
Properly sized redo log groups only help if the archive destination can keep up. If ARCn is slow — because the archive destination is on a slow NFS mount, a network file system with high latency, or a shared disk that's contended — log groups will back up even with correct sizing. The group can't be reused until it's archived.
Check archiver performance from V$ARCHIVED_LOG:
-- Archive log write times over the last 24 hours
SELECT
THREAD#,
SEQUENCE#,
BLOCKS * BLOCK_SIZE / 1024 / 1024 AS size_mb,
(COMPLETION_TIME - NEXT_TIME) * 86400 AS archive_seconds, -- switch time to archive complete
ROUND(
(BLOCKS * BLOCK_SIZE / 1024 / 1024) /
NULLIF((COMPLETION_TIME - NEXT_TIME) * 86400, 0),
1
) AS archive_mb_per_sec
FROM V$ARCHIVED_LOG
WHERE FIRST_TIME > SYSDATE - 1
AND STANDBY_DEST = 'NO'
ORDER BY FIRST_TIME DESC
FETCH FIRST 50 ROWS ONLY;
If archive throughput is consistently below your redo generation rate, you have a bottleneck in the archive destination pipeline regardless of log group sizing. The fix is typically a faster local staging disk for archives, separate from the database volumes, with periodic sync to long-term storage (S3 or equivalent).
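The bottleneck test itself is a one-line comparison. A sketch: the 2x safety factor is an assumption (the archiver needs spare capacity to catch up after bursts), and the inputs come from the redo-rate and archive-rate queries above:

```python
def archiver_is_bottleneck(peak_redo_mb_per_sec: float,
                           archive_mb_per_sec: float,
                           safety_factor: float = 2.0) -> bool:
    """True if sustained archive throughput cannot comfortably cover peak redo."""
    return archive_mb_per_sec < peak_redo_mb_per_sec * safety_factor

# Peak redo 0.5 MB/s, archiver sustaining 0.8 MB/s: flagged (needs >= 1.0 MB/s)
print(archiver_is_bottleneck(0.5, 0.8))  # True
```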
What the Right Sizing Looks Like After the Fix
After resizing redo log groups on a financial services OLTP database that was generating 1.8GB of redo per hour at peak — going from 50MB groups to 1GB groups — the change in behavior was immediate:
- Log switches dropped from 120+ per hour at peak to 3-4 per hour
- log file switch (checkpoint incomplete) wait time dropped by 94%
- Average transaction response time improved by 18% during peak batch processing windows
- DBWR write I/O became smoother — fewer emergency checkpoint bursts, more consistent write patterns
The fix took 20 minutes to implement and required no application changes, no downtime, and no code review cycle. It's one of those tuning interventions where the impact-to-effort ratio is so high that it should be in every Oracle DBA's standard health check checklist.
Putting It in Your Standard Health Check
The queries above should run as part of any Oracle environment health assessment. Specifically:
- Check V$LOG for current group sizes and count
- Pull log switch frequency from V$LOG_HISTORY for the last 7 days
- Check V$SYSTEM_EVENT for log switch and log file sync wait totals
- Compare peak redo generation rate from DBA_HIST_SYSSTAT against current group sizing
If peak switches exceed 6 per hour or log switch waits appear in the top 10 AWR wait events, resize the groups. The calculation is straightforward. The implementation is online. The impact is immediate.
Need a second opinion on your Oracle environment?
We review production Oracle databases and deliver findings within 5-7 business days. No pitch, no obligation.