// DATABASE OPERATIONS · 13 MIN READ

Diagnosing Oracle RAC Interconnect Latency Without Oracle Support

RAC interconnect problems show up as gc buffer busy, gc cr request, and gc current request waits dominating the top wait events. Oracle Support will ask for a full system dump. You can narrow the problem down to hardware, OS configuration, or application access patterns in a few hours using tools already on the system.

// PUBLISHED 2024-05-14 · LANIAKEA TEAM

Understanding What the Wait Events Actually Mean

Before running a single diagnostic query, understand the distinction between the four primary RAC interconnect wait events:

gc cr request: a session is waiting for a consistent-read copy of a block to arrive from another instance's buffer cache.

gc current request: a session is waiting for the current version of a block held by another instance.

gc buffer busy acquire: a session is waiting because another session on the same instance has already requested the same block from a remote cache.

gc buffer busy release: a session is waiting because a session on another instance must release the block before it can be shipped.

High gc cr request waits with low average wait time (under 1ms) usually indicate an application design issue — too many cross-node block requests, suggesting data that is accessed from all nodes should be pinned or routed to a single node. High average wait time (above 2-3ms on InfiniBand, above 5ms on 10GbE) indicates a network or OS configuration problem.

Step 1: Establish a Baseline Wait Time Threshold

The first diagnostic query separates "many requests" from "slow requests":

-- RAC interconnect wait event summary with average times
SELECT inst_id,
       event,
       total_waits,
       time_waited_micro / 1000000 AS time_waited_sec,
       ROUND(average_wait * 10, 2)  AS avg_wait_ms,
       -- average_wait is in centiseconds; convert to ms
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 3) AS avg_wait_ms_precise
FROM gv$system_event
WHERE event IN (
  'gc cr request',
  'gc current request',
  'gc buffer busy acquire',
  'gc buffer busy release',
  'gc cr block 2-way',
  'gc current block 2-way',
  'gc cr block 3-way',
  'gc current block 3-way'
)
ORDER BY inst_id, time_waited_micro DESC;

Reference thresholds for a healthy RAC cluster:

Average gc cr request / gc current request wait time: under 2-3 ms on InfiniBand, under 5 ms on 10GbE (the same figures used above).

Raw interconnect round-trip latency (measured directly in Step 3): p99 under 200 microseconds on InfiniBand, under 500 microseconds on 10GbE.

If average wait times are within threshold but total wait time is high, the problem is application-level block contention — too many cross-node block transfers for hot data. If average wait times exceed the threshold, the problem is network-level.
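To confirm the contention is happening right now, and to see which instances the waiting sessions are on, a quick point-in-time check against gv$session complements the cumulative totals above. A minimal sketch (it only reflects sessions waiting at the instant you run it):

-- Sessions currently waiting on global cache events, per instance
SELECT inst_id,
       event,
       COUNT(*) AS sessions_waiting
FROM gv$session
WHERE state = 'WAITING'
  AND event LIKE 'gc %'
GROUP BY inst_id, event
ORDER BY sessions_waiting DESC;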

Step 2: Identify the Hot Blocks Causing Cross-Node Traffic

-- Find objects with the highest GCS (Global Cache Service) block transfers
-- gv$segment_statistics keeps one row per (object, statistic), so pivot the
-- two "blocks received" statistics into columns per object
SELECT s.owner,
       s.object_name,
       s.object_type,
       SUM(CASE WHEN s.statistic_name = 'gc cr blocks received'
                THEN s.value ELSE 0 END) AS gc_cr_blocks_received,
       SUM(CASE WHEN s.statistic_name = 'gc current blocks received'
                THEN s.value ELSE 0 END) AS gc_current_blocks_received,
       SUM(s.value) AS total_gc_blocks
FROM gv$segment_statistics s
WHERE s.statistic_name IN ('gc cr blocks received', 'gc current blocks received')
  AND s.value > 0
GROUP BY s.owner, s.object_name, s.object_type
ORDER BY total_gc_blocks DESC
FETCH FIRST 20 ROWS ONLY;

-- Find the SQL statements spending the most time on cluster (global cache) waits
SELECT sq.inst_id,
       sq.sql_id,
       sq.executions,
       sq.buffer_gets,
       sq.rows_processed,
       sq.cluster_wait_time / 1000000 AS cluster_wait_sec,
       sq.elapsed_time / 1000000 AS elapsed_sec,
       SUBSTR(sq.sql_text, 1, 100) AS sql_text
FROM gv$sql sq
WHERE sq.cluster_wait_time > 0
ORDER BY sq.cluster_wait_time DESC
FETCH FIRST 20 ROWS ONLY;

High GCS traffic on a specific table or index, combined with that object being accessed from multiple nodes, indicates an application routing problem — not a network problem. The fix is connection affinity: route all sessions that access this object to the same node using Oracle's Services framework or application-level connection pool partitioning.
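As a sketch of the services approach: the commands below pin a service to one preferred instance so that sessions using it, and therefore their block accesses, stay on a single node. The database name ORCL, instance names orcl1 and orcl2, and service name batch_svc are placeholders for your own environment, and the long-form srvctl options assume 12c or later.

# Create a service whose sessions run on instance orcl1,
# failing over to orcl2 only if orcl1 is down
srvctl add service -db ORCL -service batch_svc \
  -preferred orcl1 -available orcl2

srvctl start service -db ORCL -service batch_svc

# Point the application's connection pool (or TNS alias) at batch_svc
# instead of the default database service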

Step 3: Measure Actual Network Latency on the Interconnect

If average GCS wait times exceed the thresholds above, measure raw network latency on the interconnect interfaces directly. This requires OS-level access to the RAC nodes.

# On each RAC node, identify the interconnect interface
# Oracle Clusterware registers the interconnect interface
oifcfg getif

# Output example:
# bond0  10.0.1.0  global  public
# ib0    192.168.1.0  global  cluster_interconnect

# Measure round-trip latency between nodes on the interconnect
# Run from node1 to node2's interconnect IP
ping -I ib0 -c 1000 -i 0.01 192.168.1.2 | tail -5

# For InfiniBand, use ibping for more accurate measurements
ibping -c 1000 -G <target_port_GUID>   # requires "ibping -S" running on the target node

# Target: p99 latency under 200 microseconds for InfiniBand
#         p99 latency under 500 microseconds for 10GbE

# Check for packet loss on the interconnect interface
netstat -s | grep -E "retransmit|error|drop"
cat /proc/net/dev | grep ib0

# Check interface errors specifically
ethtool -S ib0 | grep -E "error|drop|miss"

# For InfiniBand, check for port errors
perfquery  # from the infiniband-diags package
# Look for SymbolErrors, LinkErrorRecovery, PortRcvErrors

Step 4: Check OS Interrupt Handling and CPU Affinity

A common and overlooked cause of RAC interconnect latency is interrupt handling on the interconnect NIC being assigned to a CPU that is already saturated, or interrupt coalescing settings that batch interrupts and add latency.

# Check which CPU is handling interconnect NIC interrupts
cat /proc/interrupts | grep ib0
# The CPU column shows which CPU cores are handling IRQs

# Check interrupt rate on the interconnect NIC
watch -n 1 "cat /proc/interrupts | grep ib0"

# If a single CPU core is handling all interconnect IRQs under heavy load,
# it becomes the bottleneck. Enable IRQ balancing:
service irqbalance start

# Or manually set IRQ affinity to spread across multiple CPUs
# Find the IRQ number for ib0 (multi-queue NICs expose several IRQs;
# this takes the first line, so repeat the affinity step for each IRQ)
IRQ_NUM=$(grep ib0 /proc/interrupts | head -1 | awk '{print $1}' | tr -d ':')
# Set affinity to CPUs 2-5 (bitmask 0x3c = 0b00111100)
echo 3c > /proc/irq/${IRQ_NUM}/smp_affinity

# Check interrupt coalescing settings
# High coalescing reduces CPU overhead but adds latency per packet
ethtool -c ib0

# For latency-sensitive RAC interconnects, reduce rx-usecs
ethtool -C ib0 rx-usecs 50  # reduce from default (often 200+)
# Test the impact on ping latency before making permanent

# Check for CPU frequency scaling (can cause variable latency)
cpupower frequency-info | grep "current CPU frequency"
# Oracle RAC nodes should use performance governor, not powersave
cpupower frequency-set -g performance

Step 5: Examine GCS Statistics in AWR/Statspack

If you have access to AWR (requires Diagnostic Pack license) or Statspack, the Global Cache and Enqueue Statistics section provides the most complete picture of interconnect health over a time period:

-- Global Cache latency and loss metrics from AWR
-- (requires Diagnostic Pack license)
SELECT instance_number,
       metric_name,
       ROUND(average, 2) AS avg_value,
       ROUND(maxval, 2)  AS max_value,
       metric_unit
FROM dba_hist_sysmetric_summary
WHERE metric_name IN (
  'Global Cache Average CR Get Time',
  'Global Cache Average Current Get Time',
  'Global Cache Blocks Corrupted',
  'Global Cache Blocks Lost'
)
AND snap_id = (SELECT MAX(snap_id) FROM dba_hist_snapshot)
ORDER BY instance_number, metric_name;

-- Without AWR: use gv$sysstat for cumulative statistics
SELECT inst_id,
       name,
       value
FROM gv$sysstat
WHERE name IN (
  'gc cr blocks received',
  'gc current blocks received',
  'gc cr blocks served',
  'gc current blocks served',
  'gc blocks lost',
  'gc blocks corrupt',
  'gcs messages sent',
  'ges messages sent'
)
ORDER BY inst_id, name;

gc blocks lost and gc blocks corrupt are critical — any non-zero value indicates a network-level problem (packet loss or corruption on the interconnect) that requires hardware or switch investigation. These do not occur under normal conditions.
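If the Diagnostic Pack license mentioned above is available, a query along these lines trends gc blocks lost across AWR snapshots and shows when the losses started. This is a sketch: the counters are cumulative since instance startup, so the delta goes negative (and should be ignored) across an instance restart.

-- Per-snapshot delta of 'gc blocks lost' from AWR history
SELECT sn.instance_number,
       TO_CHAR(sn.end_interval_time, 'YYYY-MM-DD HH24:MI') AS snap_end,
       st.value
         - LAG(st.value) OVER (PARTITION BY st.instance_number
                               ORDER BY st.snap_id) AS blocks_lost_in_interval
FROM dba_hist_sysstat st
JOIN dba_hist_snapshot sn
  ON  sn.snap_id = st.snap_id
  AND sn.dbid = st.dbid
  AND sn.instance_number = st.instance_number
WHERE st.stat_name = 'gc blocks lost'
ORDER BY sn.instance_number, sn.snap_id;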

Step 6: Check the Clusterware Alert Log

# Clusterware logs interconnect health events
# Location varies by Oracle version and OS
# Pre-12c Grid Infrastructure path (12c and later also write it to the ADR
# under $ORACLE_BASE/diag/crs):
$GRID_HOME/log/$(hostname)/alertlog/alert_$(hostname).log

# Look for these patterns indicating interconnect issues:
grep -E "NCRQ|RCFG|CSS|misscount|reconfig|evict" \
  $GRID_HOME/log/$(hostname)/alertlog/alert_$(hostname).log \
  | tail -50

# Node evictions triggered by missed heartbeats are
# the most serious symptom of interconnect problems
grep "evict" $GRID_HOME/log/$(hostname)/alertlog/alert_$(hostname).log

The Diagnosis Decision Tree

After running these steps, the cause falls into one of three categories:

Application-layer problem (high GCS volume, normal latency): Hot blocks accessed from multiple nodes. Fix: Oracle Services with connection affinity, application-level node routing, or in-memory cache for frequently-read reference data.

OS configuration problem (elevated latency, no packet loss): IRQ handling, interrupt coalescing, CPU frequency scaling, or OS scheduler configuration. Fix: IRQ affinity, ethtool coalescing adjustments, CPU governor setting.

Network hardware problem (elevated latency AND/OR packet loss, gc blocks lost > 0): Switch port errors, cable degradation, NIC errors, or InfiniBand fabric issues. Fix: Physical inspection, switch port replacement, NIC replacement. This is the case that genuinely requires the infrastructure team — and now you can tell them exactly which interface and which metric to investigate, rather than opening a support ticket and waiting.

Oracle RAC performance problems slowing you down?

We diagnose RAC interconnect issues, gc wait event root causes, and application connection architecture as part of a free Oracle database assessment.