// DATABASE OPERATIONS · 13 MIN READ

Diagnosing Oracle RAC Interconnect Latency Without Oracle Support

RAC interconnect problems show up as gc buffer busy, gc cr request, and gc current request waits dominating the top wait events. Oracle Support will ask for a full system dump. You can narrow the problem down to hardware, OS configuration, or application access patterns in a few hours using tools already on the system.

// PUBLISHED 2024-05-14 · LANIAKEA TEAM

Understanding What the Wait Events Actually Mean

Before running a single diagnostic query, understand the distinction between the four primary RAC interconnect wait events:

gc cr request: a session is waiting for a consistent-read copy of a block to arrive from another instance's buffer cache.

gc current request: a session is waiting for the current version of a block held by another instance.

gc buffer busy acquire: a session is waiting because another session on the same instance has already requested the same block from a remote cache.

gc buffer busy release: a session is waiting because a session on another instance must release the block before it can be shipped.

High gc cr request waits with low average wait time (under 1ms) usually indicate an application design issue — too many cross-node block requests, suggesting data that is accessed from all nodes should be pinned or routed to a single node. High average wait time (above 2-3ms on InfiniBand, above 5ms on 10GbE) indicates a network or OS configuration problem.

Step 1: Establish a Baseline Wait Time Threshold

The first diagnostic query separates "many requests" from "slow requests":

-- RAC interconnect wait event summary with average times
SELECT inst_id,
       event,
       total_waits,
       time_waited_micro / 1000000 AS time_waited_sec,
       ROUND(average_wait * 10, 2)  AS avg_wait_ms,
       -- average_wait is in centiseconds; convert to ms
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 3) AS avg_wait_ms_precise
FROM gv$system_event
WHERE event IN (
  'gc cr request',
  'gc current request',
  'gc buffer busy acquire',
  'gc buffer busy release',
  'gc cr block 2-way',
  'gc current block 2-way',
  'gc cr block 3-way',
  'gc current block 3-way'
)
ORDER BY inst_id, time_waited_micro DESC;

Reference thresholds for a healthy RAC cluster:

Average gc cr request / gc current request wait time: under 2-3 ms on InfiniBand, under 5 ms on 10GbE (the same figures used above).

Raw interconnect round-trip latency (measured directly in Step 3): p99 under 200 microseconds on InfiniBand, under 500 microseconds on 10GbE.

If average wait times are within threshold but total wait time is high, the problem is application-level block contention — too many cross-node block transfers for hot data. If average wait times exceed the threshold, the problem is network-level.
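To confirm the contention is happening right now, and to see which instances the waiting sessions are on, a quick point-in-time check against gv$session complements the cumulative totals above. A minimal sketch (it only reflects sessions waiting at the instant you run it):

-- Sessions currently waiting on global cache events, per instance
SELECT inst_id,
       event,
       COUNT(*) AS sessions_waiting
FROM gv$session
WHERE state = 'WAITING'
  AND event LIKE 'gc %'
GROUP BY inst_id, event
ORDER BY sessions_waiting DESC;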

Step 2: Identify the Hot Blocks Causing Cross-Node Traffic

-- Find objects with the highest GCS (Global Cache Service) block transfers
-- gv$segment_statistics keeps one row per (object, statistic), so pivot the
-- two "blocks received" statistics into columns per object
SELECT s.owner,
       s.object_name,
       s.object_type,
       SUM(CASE WHEN s.statistic_name = 'gc cr blocks received'
                THEN s.value ELSE 0 END) AS gc_cr_blocks_received,
       SUM(CASE WHEN s.statistic_name = 'gc current blocks received'
                THEN s.value ELSE 0 END) AS gc_current_blocks_received,
       SUM(s.value) AS total_gc_blocks
FROM gv$segment_statistics s
WHERE s.statistic_name IN ('gc cr blocks received', 'gc current blocks received')
  AND s.value > 0
GROUP BY s.owner, s.object_name, s.object_type
ORDER BY total_gc_blocks DESC
FETCH FIRST 20 ROWS ONLY;

-- Find the SQL statements spending the most time on cluster (global cache) waits
SELECT sq.inst_id,
       sq.sql_id,
       sq.executions,
       sq.buffer_gets,
       sq.rows_processed,
       sq.cluster_wait_time / 1000000 AS cluster_wait_sec,
       sq.elapsed_time / 1000000 AS elapsed_sec,
       SUBSTR(sq.sql_text, 1, 100) AS sql_text
FROM gv$sql sq
WHERE sq.cluster_wait_time > 0
ORDER BY sq.cluster_wait_time DESC
FETCH FIRST 20 ROWS ONLY;

High GCS traffic on a specific table or index, combined with that object being accessed from multiple nodes, indicates an application routing problem — not a network problem. The fix is connection affinity: route all sessions that access this object to the same node using Oracle's Services framework or application-level connection pool partitioning.
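As a sketch of the services approach: the commands below pin a service to one preferred instance so that sessions using it, and therefore their block accesses, stay on a single node. The database name ORCL, instance names orcl1 and orcl2, and service name batch_svc are placeholders for your own environment, and the long-form srvctl options assume 12c or later.

# Create a service whose sessions run on instance orcl1,
# failing over to orcl2 only if orcl1 is down
srvctl add service -db ORCL -service batch_svc \
  -preferred orcl1 -available orcl2

srvctl start service -db ORCL -service batch_svc

# Point the application's connection pool (or TNS alias) at batch_svc
# instead of the default database service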

Step 3: Measure Actual Network Latency on the Interconnect

If average GCS wait times exceed the thresholds above, measure raw network latency on the interconnect interfaces directly. This requires OS-level access to the RAC nodes.

# On each RAC node, identify the interconnect interface
# Oracle Clusterware registers the interconnect interface
oifcfg getif

# Output example:
# bond0  10.0.1.0  global  public
# ib0    192.168.1.0  global  cluster_interconnect

# Measure round-trip latency between nodes on the interconnect
# Run from node1 to node2's interconnect IP
ping -I ib0 -c 1000 -i 0.01 192.168.1.2 | tail -5

# For InfiniBand, use ibping for more accurate measurements
ibping -c 1000 -G <target_port_GUID>   # requires "ibping -S" running on the target node

# Target: p99 latency under 200 microseconds for InfiniBand
#         p99 latency under 500 microseconds for 10GbE

# Check for packet loss on the interconnect interface
netstat -s | grep -E "retransmit|error|drop"
cat /proc/net/dev | grep ib0

# Check interface errors specifically
ethtool -S ib0 | grep -E "error|drop|miss"

# For InfiniBand, check for port errors
perfquery  # from the infiniband-diags package
# Look for SymbolErrors, LinkErrorRecovery, PortRcvErrors

Step 4: Check OS Interrupt Handling and CPU Affinity

A common and overlooked cause of RAC interconnect latency is interrupt handling on the interconnect NIC being assigned to a CPU that is already saturated, or interrupt coalescing settings that batch interrupts and add latency.

# Check which CPU is handling interconnect NIC interrupts
cat /proc/interrupts | grep ib0
# The CPU column shows which CPU cores are handling IRQs

# Check interrupt rate on the interconnect NIC
watch -n 1 "cat /proc/interrupts | grep ib0"

# If a single CPU core is handling all interconnect IRQs under heavy load,
# it becomes the bottleneck. Enable IRQ balancing:
service irqbalance start

# Or manually set IRQ affinity to spread across multiple CPUs
# Find the IRQ number for ib0 (multi-queue NICs expose several IRQs;
# this takes the first line, so repeat the affinity step for each IRQ)
IRQ_NUM=$(grep ib0 /proc/interrupts | head -1 | awk '{print $1}' | tr -d ':')
# Set affinity to CPUs 2-5 (bitmask 0x3c = 0b00111100)
echo 3c > /proc/irq/${IRQ_NUM}/smp_affinity

# Check interrupt coalescing settings
# High coalescing reduces CPU overhead but adds latency per packet
ethtool -c ib0

# For latency-sensitive RAC interconnects, reduce rx-usecs
ethtool -C ib0 rx-usecs 50  # reduce from default (often 200+)
# Test the impact on ping latency before making permanent

# Check for CPU frequency scaling (can cause variable latency)
cpupower frequency-info | grep "current CPU frequency"
# Oracle RAC nodes should use performance governor, not powersave
cpupower frequency-set -g performance

Step 5: Examine GCS Statistics in AWR/Statspack

If you have access to AWR (requires Diagnostic Pack license) or Statspack, the Global Cache and Enqueue Statistics section provides the most complete picture of interconnect health over a time period:

-- Global Cache latency and loss metrics from AWR
-- (requires Diagnostic Pack license)
SELECT instance_number,
       metric_name,
       ROUND(average, 2) AS avg_value,
       ROUND(maxval, 2)  AS max_value,
       metric_unit
FROM dba_hist_sysmetric_summary
WHERE metric_name IN (
  'Global Cache Average CR Get Time',
  'Global Cache Average Current Get Time',
  'Global Cache Blocks Corrupted',
  'Global Cache Blocks Lost'
)
AND snap_id = (SELECT MAX(snap_id) FROM dba_hist_snapshot)
ORDER BY instance_number, metric_name;

-- Without AWR: use gv$sysstat for cumulative statistics
SELECT inst_id,
       name,
       value
FROM gv$sysstat
WHERE name IN (
  'gc cr blocks received',
  'gc current blocks received',
  'gc cr blocks served',
  'gc current blocks served',
  'gc blocks lost',
  'gc blocks corrupt',
  'gcs messages sent',
  'ges messages sent'
)
ORDER BY inst_id, name;

gc blocks lost and gc blocks corrupt are critical — any non-zero value indicates a network-level problem (packet loss or corruption on the interconnect) that requires hardware or switch investigation. These do not occur under normal conditions.
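If the Diagnostic Pack license mentioned above is available, a query along these lines trends gc blocks lost across AWR snapshots and shows when the losses started. This is a sketch: the counters are cumulative since instance startup, so the delta goes negative (and should be ignored) across an instance restart.

-- Per-snapshot delta of 'gc blocks lost' from AWR history
SELECT sn.instance_number,
       TO_CHAR(sn.end_interval_time, 'YYYY-MM-DD HH24:MI') AS snap_end,
       st.value
         - LAG(st.value) OVER (PARTITION BY st.instance_number
                               ORDER BY st.snap_id) AS blocks_lost_in_interval
FROM dba_hist_sysstat st
JOIN dba_hist_snapshot sn
  ON  sn.snap_id = st.snap_id
  AND sn.dbid = st.dbid
  AND sn.instance_number = st.instance_number
WHERE st.stat_name = 'gc blocks lost'
ORDER BY sn.instance_number, sn.snap_id;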

Step 6: Check the Clusterware Alert Log

# Clusterware logs interconnect health events
# Location varies by Oracle version and OS
# Pre-12c Grid Infrastructure path (12c and later also write it to the ADR
# under $ORACLE_BASE/diag/crs):
$GRID_HOME/log/$(hostname)/alertlog/alert_$(hostname).log

# Look for these patterns indicating interconnect issues:
grep -E "NCRQ|RCFG|CSS|misscount|reconfig|evict" \
  $GRID_HOME/log/$(hostname)/alertlog/alert_$(hostname).log \
  | tail -50

# Node evictions triggered by missed heartbeats are
# the most serious symptom of interconnect problems
grep "evict" $GRID_HOME/log/$(hostname)/alertlog/alert_$(hostname).log

The Diagnosis Decision Tree

After running these steps, the cause falls into one of three categories:

Application-layer problem (high GCS volume, normal latency): Hot blocks accessed from multiple nodes. Fix: Oracle Services with connection affinity, application-level node routing, or in-memory cache for frequently-read reference data.

OS configuration problem (elevated latency, no packet loss): IRQ handling, interrupt coalescing, CPU frequency scaling, or OS scheduler configuration. Fix: IRQ affinity, ethtool coalescing adjustments, CPU governor setting.

Network hardware problem (elevated latency AND/OR packet loss, gc blocks lost > 0): Switch port errors, cable degradation, NIC errors, or InfiniBand fabric issues. Fix: Physical inspection, switch port replacement, NIC replacement. This is the case that genuinely requires the infrastructure team — and now you can tell them exactly which interface and which metric to investigate, rather than opening a support ticket and waiting.

Oracle RAC performance problems slowing you down?

We diagnose RAC interconnect issues, gc wait event root causes, and application connection architecture as part of a free Oracle database assessment.