Understanding What the Wait Events Actually Mean
Before running a single diagnostic query, understand how the four primary RAC interconnect wait events differ:
- gc cr request: A session on this node requested a consistent read (CR) copy of a block from another node's cache. The requesting session is waiting for the remote node to ship the block image. This is normal RAC behavior — it becomes a problem when the average wait time is high (above 1-2ms) or when the volume is disproportionate to the workload.
- gc current request: A session requested the current version of a block (needed for a write) from another node. Requires the holding node to flush any dirty state and transfer ownership. Higher overhead than a CR request.
- gc buffer busy acquire: A session is waiting to acquire a block that another session on the same node is already waiting to receive from a remote node. This is a queue behind another waiter — the root cause is still the cross-node transfer, but this wait measures the local contention for the same block.
- gc buffer busy release: A session is waiting for a local session to finish with a block before it can be shipped to a remote requester. Local block contention delaying interconnect transfer.
High gc cr request waits with low average wait time (under 1ms) usually indicate an application design issue — too many cross-node block requests, suggesting data that is accessed from all nodes should be pinned or routed to a single node. High average wait time (above 2-3ms on InfiniBand, above 5ms on 10GbE) indicates a network or OS configuration problem.
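To see these waits live rather than cumulatively, a quick look at current waiters is often useful before anything else. A minimal sketch against gv$session (for gc block waits, P1 and P2 generally carry the file and block number):
-- Sessions currently stuck on a global cache wait, with the block identity
SELECT inst_id,
       sid,
       event,
       p1 AS file#,
       p2 AS block#,
       seconds_in_wait
FROM gv$session
WHERE event IN ('gc cr request', 'gc current request',
                'gc buffer busy acquire', 'gc buffer busy release')
  AND state = 'WAITING';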
Step 1: Establish a Baseline Wait Time Threshold
The first diagnostic query separates "many requests" from "slow requests":
-- RAC interconnect wait event summary with average times
SELECT inst_id,
event,
total_waits,
time_waited_micro / 1000000 AS time_waited_sec,
ROUND(average_wait * 10, 2) AS avg_wait_ms,
-- average_wait is in centiseconds; convert to ms
ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 3) AS avg_wait_ms_precise
FROM gv$system_event
WHERE event IN (
'gc cr request',
'gc current request',
'gc buffer busy acquire',
'gc buffer busy release',
'gc cr block 2-way',
'gc current block 2-way',
'gc cr block 3-way',
'gc current block 3-way'
)
ORDER BY inst_id, time_waited_micro DESC;
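Note that gv$system_event is cumulative since instance startup, so a quiet week can mask a bad hour. One way to get interval figures without AWR is to snapshot and diff; a rough sketch (the scratch table gc_wait_snap is our own naming):
-- Capture a baseline snapshot (gc_wait_snap is a throwaway scratch table)
CREATE TABLE gc_wait_snap AS
SELECT inst_id, event, total_waits, time_waited_micro
FROM gv$system_event
WHERE event LIKE 'gc %';
-- After 10-15 minutes of representative load, diff against the snapshot
SELECT e.inst_id,
       e.event,
       e.total_waits - s.total_waits AS waits_in_interval,
       ROUND((e.time_waited_micro - s.time_waited_micro)
             / NULLIF(e.total_waits - s.total_waits, 0) / 1000, 3) AS avg_wait_ms
FROM gv$system_event e
JOIN gc_wait_snap s ON s.inst_id = e.inst_id AND s.event = e.event
WHERE e.event LIKE 'gc %'
ORDER BY e.inst_id, waits_in_interval DESC;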
Reference thresholds for a healthy RAC cluster:
- InfiniBand interconnect: gc cr request avg < 0.5ms, gc current request avg < 1ms
- 10GbE interconnect: gc cr request avg < 1ms, gc current request avg < 2ms
- 1GbE interconnect (legacy): gc cr request avg < 3ms, gc current request avg < 5ms
If average wait times are within threshold but total wait time is high, the problem is application-level block contention — too many cross-node block transfers for hot data. If average wait times exceed the threshold, the problem is network-level.
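On clusters with three or more nodes, it is also worth comparing the 2-way and 3-way events already captured above: a 3-way transfer adds a hop through the block's master node, so a high 3-way share raises average latency even on a healthy network. A sketch:
-- Share of 3-way vs 2-way block transfers per instance
SELECT inst_id,
       SUM(CASE WHEN event LIKE '%3-way' THEN total_waits ELSE 0 END) AS three_way,
       SUM(CASE WHEN event LIKE '%2-way' THEN total_waits ELSE 0 END) AS two_way
FROM gv$system_event
WHERE event IN ('gc cr block 2-way', 'gc current block 2-way',
                'gc cr block 3-way', 'gc current block 3-way')
GROUP BY inst_id
ORDER BY inst_id;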
Step 2: Identify the Hot Blocks Causing Cross-Node Traffic
-- Find objects with the highest GCS (Global Cache Service) block transfers
-- gv$segment_statistics holds one row per statistic per segment, so pivot
-- with CASE; it already carries owner/object_name, so no dba_objects join
SELECT s.owner,
       s.object_name,
       s.object_type,
       SUM(CASE WHEN s.statistic_name = 'gc cr blocks received'
                THEN s.value ELSE 0 END) AS gc_cr_blocks_received,
       SUM(CASE WHEN s.statistic_name = 'gc current blocks received'
                THEN s.value ELSE 0 END) AS gc_current_blocks_received,
       SUM(s.value) AS total_gc_blocks
FROM gv$segment_statistics s
WHERE s.statistic_name IN ('gc cr blocks received', 'gc current blocks received')
  AND s.value > 0
GROUP BY s.owner, s.object_name, s.object_type
ORDER BY total_gc_blocks DESC
FETCH FIRST 20 ROWS ONLY;
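If you captured a specific file and block number from gv$session P1/P2 (see the first sketch above), dba_extents maps it back to a segment. This scan can be slow on databases with many extents, so use it sparingly; :file_no and :block_no are illustrative binds:
-- Map a hot (file#, block#) pair back to its owning segment
SELECT owner, segment_name, segment_type
FROM dba_extents
WHERE file_id = :file_no
  AND :block_no BETWEEN block_id AND block_id + blocks - 1;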
-- Find the specific SQL spending the most time on global cache waits
-- gv$sql.cluster_wait_time (microseconds) isolates GC waits directly,
-- which is a better signal than raw buffer_gets
SELECT sq.inst_id,
       sq.sql_id,
       sq.executions,
       sq.buffer_gets,
       sq.rows_processed,
       sq.cluster_wait_time / 1000000 AS cluster_wait_sec,
       sq.elapsed_time / 1000000 AS elapsed_sec,
       SUBSTR(sq.sql_text, 1, 100) AS sql_text
FROM gv$sql sq
WHERE sq.cluster_wait_time > 0
ORDER BY sq.cluster_wait_time DESC
FETCH FIRST 20 ROWS ONLY;
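Once a suspect sql_id surfaces, its execution plan usually explains the cross-node traffic, for example a full scan of a table hammered by all nodes. The standard plan dump (&sql_id is a SQL*Plus substitution variable):
-- Show the cached execution plan for a suspect statement
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', NULL, 'TYPICAL'));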
High GCS traffic on a specific table or index, combined with that object being accessed from multiple nodes, indicates an application routing problem — not a network problem. The fix is connection affinity: route all sessions that access this object to the same node using Oracle's Services framework or application-level connection pool partitioning.
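As one concrete shape for that fix, a singleton service keeps all work against the hot object on one instance while retaining failover. A sketch with hypothetical database, service, and instance names:
# Create a service preferring one instance, with another as failover only
srvctl add service -db PRODDB -service hot_orders_svc \
    -preferred PRODDB1 -available PRODDB2
srvctl start service -db PRODDB -service hot_orders_svc
# Point the relevant application connection pool at hot_orders_svc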
Step 3: Measure Actual Network Latency on the Interconnect
If average GCS wait times exceed the thresholds above, measure raw network latency on the interconnect interfaces directly. This requires OS-level access to the RAC nodes.
# On each RAC node, identify the interconnect interface
# Oracle Clusterware registers the interconnect interface
oifcfg getif
# Output example:
# bond0 10.0.1.0 global public
# ib0 192.168.1.0 global cluster_interconnect
# Measure round-trip latency between nodes on the interconnect
# Run from node1 to node2's interconnect IP (intervals under 0.2s need root)
ping -I ib0 -c 1000 -i 0.01 192.168.1.2 | tail -5
# For InfiniBand, ibping measures RDMA-level round trips, but it needs a
# responder: start "ibping -S" on the remote node first, then from this node
# target the remote HCA port GUID (shown by ibstat on the remote node)
ibping -c 1000 -G <remote_port_guid>
# Target: p99 latency under 200 microseconds for InfiniBand
# p99 latency under 500 microseconds for 10GbE
# Check for packet loss on the interconnect interface
netstat -s | grep -E "retransmit|error|drop"
cat /proc/net/dev | grep ib0
# Check interface errors specifically
ethtool -S ib0 | grep -E "error|drop|miss"
# For InfiniBand, check for port errors
perfquery # from the infiniband-diags package
# Look for SymbolErrorCounter, LinkErrorRecoveryCounter, PortRcvErrors
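While you are at the OS level, also verify MTU consistency: a jumbo-frame mismatch between nodes or switch ports is a classic cause of fragmentation and lost gc blocks. A quick check, assuming a 9000-byte interconnect MTU:
# Confirm the configured MTU on the interconnect interface
ip link show ib0 | grep -o "mtu [0-9]*"
# Verify jumbo frames pass end-to-end unfragmented
# (8972 = 9000 minus 28 bytes of IPv4 + ICMP headers)
ping -M do -s 8972 -c 4 -I ib0 192.168.1.2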
Step 4: Check OS Interrupt Handling and CPU Affinity
A common but overlooked cause of RAC interconnect latency sits at the OS layer: the interconnect NIC's interrupts land on a CPU core that is already saturated, or interrupt coalescing settings batch interrupts and add per-packet latency.
# Check which CPU is handling interconnect NIC interrupts
cat /proc/interrupts | grep ib0
# The CPU column shows which CPU cores are handling IRQs
# Check interrupt rate on the interconnect NIC
watch -n 1 "cat /proc/interrupts | grep ib0"
# If a single CPU core is handling all interconnect IRQs under heavy load,
# it becomes the bottleneck. Enable IRQ balancing:
service irqbalance start   # or: systemctl start irqbalance
# Or manually set IRQ affinity to spread across multiple CPUs
# (stop irqbalance first; it will otherwise rewrite manual settings)
# Multiqueue NICs expose one IRQ per queue, so loop over every ib0 IRQ
# Affinity mask 3c = 0b00111100 = CPUs 2-5
for irq in $(grep ib0 /proc/interrupts | awk '{print $1}' | tr -d ':'); do
    echo 3c > /proc/irq/${irq}/smp_affinity
done
# Check interrupt coalescing settings
# High coalescing reduces CPU overhead but adds latency per packet
ethtool -c ib0
# For latency-sensitive RAC interconnects, reduce rx-usecs
ethtool -C ib0 rx-usecs 50 # reduce from default (often 200+)
# Test the impact on ping latency before making permanent
# Check for CPU frequency scaling (can cause variable latency)
cpupower frequency-info | grep "current CPU frequency"
# Oracle RAC nodes should use performance governor, not powersave
cpupower frequency-set -g performance
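None of these runtime changes survive a reboot on their own. On RHEL and Oracle Linux, a tuned profile is the usual way to persist the governor and related latency settings (verify the profile's other effects against your standards before adopting it):
# Apply and confirm a low-latency profile that pins the performance governor
tuned-adm profile latency-performance
tuned-adm active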
Step 5: Examine GCS Statistics in AWR/Statspack
If you have access to AWR (requires Diagnostic Pack license) or Statspack, the Global Cache and Enqueue Statistics section provides the most complete picture of interconnect health over a time period:
-- Global Cache metrics from AWR for the most recent snapshot
-- (requires Diagnostic Pack license)
-- dba_hist_sysmetric_summary exposes AVERAGE/MAXVAL, not a single VALUE
SELECT instance_number,
       metric_name,
       ROUND(average, 3) AS avg_value,
       ROUND(maxval, 3) AS max_value
FROM dba_hist_sysmetric_summary
WHERE metric_name IN (
    'Global Cache Average CR Get Time',
    'Global Cache Average Current Get Time',
    'Global Cache Blocks Corrupted',
    'Global Cache Blocks Lost'
)
AND snap_id = (SELECT MAX(snap_id) FROM dba_hist_snapshot)
ORDER BY instance_number, metric_name;
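A single snapshot can hide periodic spikes (a nightly batch, a backup window). Trending the same metric across the last day makes the pattern visible; a sketch:
-- 24-hour trend of average CR get time per instance
SELECT begin_time,
       instance_number,
       ROUND(average, 3) AS avg_cr_get_time
FROM dba_hist_sysmetric_summary
WHERE metric_name = 'Global Cache Average CR Get Time'
  AND begin_time > SYSDATE - 1
ORDER BY instance_number, begin_time;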
-- Without AWR: use gv$sysstat for cumulative statistics
SELECT inst_id,
name,
value
FROM gv$sysstat
WHERE name IN (
'gc cr blocks received',
'gc current blocks received',
'gc cr blocks served',
'gc current blocks served',
'gc blocks lost',
'gc blocks corrupt',
'gcs messages sent',
'ges messages sent'
)
ORDER BY inst_id, name;
gc blocks lost and gc blocks corrupt are critical — any non-zero value indicates a network-level problem (packet loss or corruption on the interconnect) that requires hardware or switch investigation. These do not occur under normal conditions.
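To put a nonzero figure in proportion, compare lost blocks to total blocks received per instance; even a fraction of a percent is worth escalating. A sketch:
-- Lost blocks relative to total blocks received, per instance
SELECT inst_id,
       MAX(CASE WHEN name = 'gc blocks lost' THEN value END) AS blocks_lost,
       MAX(CASE WHEN name = 'gc cr blocks received' THEN value END)
         + MAX(CASE WHEN name = 'gc current blocks received' THEN value END)
         AS blocks_received
FROM gv$sysstat
WHERE name IN ('gc blocks lost',
               'gc cr blocks received',
               'gc current blocks received')
GROUP BY inst_id
ORDER BY inst_id;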
Step 6: Check the Clusterware Alert Log
# Clusterware logs interconnect health events
# Location varies by Oracle version and OS
# Pre-12c Grid Infrastructure:
#   $GRID_HOME/log/<hostname>/alertlog/alert_<hostname>.log
# Oracle 12c+ moved it into the ADR:
#   $ORACLE_BASE/diag/crs/<hostname>/crs/trace/alert.log
# Look for these patterns indicating interconnect issues:
grep -E "NCRQ|RCFG|CSS|misscount|reconfig|evict" \
$GRID_HOME/log/$(hostname)/alertlog/alert_$(hostname).log \
| tail -50
# Node evictions triggered by missed heartbeats are
# the most serious symptom of interconnect problems
grep "evict" $GRID_HOME/log/$(hostname)/alertlog/alert_$(hostname).log
The Diagnosis Decision Tree
After running these steps, the cause falls into one of three categories:
Application-layer problem (high GCS volume, normal latency): Hot blocks accessed from multiple nodes. Fix: Oracle Services with connection affinity, application-level node routing, or in-memory cache for frequently-read reference data.
OS configuration problem (elevated latency, no packet loss): IRQ handling, interrupt coalescing, CPU frequency scaling, or OS scheduler configuration. Fix: IRQ affinity, ethtool coalescing adjustments, CPU governor setting.
Network hardware problem (elevated latency AND/OR packet loss, gc blocks lost > 0): Switch port errors, cable degradation, NIC errors, or InfiniBand fabric issues. Fix: Physical inspection, switch port replacement, NIC replacement. This is the case that genuinely requires the infrastructure team — and now you can tell them exactly which interface and which metric to investigate, rather than opening a support ticket and waiting.
Oracle RAC performance problems slowing you down?
We diagnose RAC interconnect issues, gc wait event root causes, and application connection architecture as part of a free Oracle database assessment.