Oracle - Real Application Cluster Troubleshooting

🚑 Oracle RAC Troubleshooting Guide 🌈

Troubleshooting Oracle Real Application Clusters (RAC) is challenging because a problem can originate in any of several layers: Clusterware, the private interconnect, ASM and shared storage, or the database itself. This guide gives practical tips for identifying and resolving common RAC issues, one layer at a time.

🔍 Diagnosing Cluster Issues

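  • Check resource status with crsctl stat res -t.
  • Review CRS logs under $GRID_HOME/log/<node>/crs (11.2); from 12.1.0.2 onward they are kept in the ADR, typically under $ORACLE_BASE/diag/crs/<node>/crs/trace.
  • Confirm node membership using olsnodes -n (add -s to see whether each node is active).
  • Inspect voting disk and OCR consistency with crsctl query css votedisk and ocrcheck.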
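
The checks above can be run in one pass. The following is a minimal sketch, assuming the Grid Infrastructure bin directory is on PATH for the current user; run_check and rac_cluster_check are hypothetical helper names, and some commands may need the grid owner or root to return full output.

```shell
#!/usr/bin/env bash
# Sketch: run the basic Clusterware checks in order and label each section.
# Assumption: crsctl, olsnodes and ocrcheck are on PATH (Grid home bin).

run_check() {
  # $1 = label, remaining arguments = command to run
  local label=$1
  shift
  echo "== ${label}: $*"
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "   skipped: $1 not found on PATH"
    return 0
  fi
  # Keep going on failure so one broken check does not hide the others.
  "$@" || echo "   WARNING: '$*' exited with status $?"
}

rac_cluster_check() {
  run_check "Resource status"   crsctl stat res -t
  run_check "Node membership"   olsnodes -n -s
  run_check "Clusterware stack" crsctl check cluster -all
  run_check "Voting disks"      crsctl query css votedisk
  run_check "OCR integrity"     ocrcheck
}
```

Source the file and call rac_cluster_check on any cluster node. Redirecting the output to a timestamped file gives a baseline to compare against during the next incident.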

🌐 Network & Connectivity Problems

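  • Verify the private interconnect with ping or traceroute between the nodes' private addresses; oifcfg getif shows which interfaces the cluster is using.
  • Check for network packet drops in OS logs and in the interface counters (for example netstat -s or ip -s link).
  • Ensure SCAN listeners are running: srvctl status scan_listener.
  • Look for DNS misconfigurations affecting node resolution, for example a SCAN name that does not resolve to all of its configured addresses.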
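
The interconnect check above can be scripted so that every private address is tested the same way. This is a minimal sketch, assuming Linux ping (the -c and -W options) and hypothetical private hostnames such as node1-priv; check_interconnect is an illustrative helper, not an Oracle tool.

```shell
#!/usr/bin/env bash
# Sketch: ping each private interconnect hostname and report the result.
# Assumption: Linux iputils ping (-c = packet count, -W = reply timeout in seconds).

check_interconnect() {
  local failed=0
  local host
  for host in "$@"; do
    if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failed=$((failed + 1))
    fi
  done
  # Non-zero exit status if any host was unreachable.
  [ "$failed" -eq 0 ]
}
```

A call might look like check_interconnect node1-priv node2-priv, with the names replaced by the private hostnames of your own cluster. Run it from every node, since a link can fail in one direction only.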

💽 ASM & Storage Issues

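  • Check disk group health: asmcmd lsdg (look at the State and Offline_disks columns).
  • Review the ASM alert log (alert_+ASM<n>.log) for errors.
  • Confirm shared storage is accessible across all nodes, with the same device paths and permissions on each.
  • Address ORA-150xx errors by validating disk paths, device ownership and permissions, and the asm_diskstring setting.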
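
To see disk group state and per-disk status together, something like the following can help. It is a sketch only, assuming the environment points at the local ASM instance (ORACLE_HOME set to the Grid home, ORACLE_SID set to something like +ASM1) and that the OS user can connect as SYSASM; asm_health is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Sketch: show disk group summary, then per-disk status from V$ASM_DISK.
# Assumption: ORACLE_HOME / ORACLE_SID are set for the local ASM instance.

asm_health() {
  # Disk group level: state, redundancy type, free space, offline disks.
  asmcmd lsdg || return 1

  # Disk level: a path with an unexpected header or mount status is a
  # common starting point when chasing ORA-150xx errors.
  sqlplus -s / as sysasm <<'SQL'
set linesize 200 pagesize 100
column path format a40
select group_number, name, path, header_status, mount_status, mode_status
  from v$asm_disk
 order by group_number, name;
exit
SQL
}
```

The here-document delimiter is quoted so the shell does not try to expand the $ in v$asm_disk.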

🖥️ Node Eviction Problems

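  • Review the CSS daemon log (ocssd.log in 11.2, ocssd.trc in later releases) for eviction reasons.
  • Check time synchronization (NTP/Chrony, or Oracle CTSS): clock jumps or drift can contribute to node instability and make it hard to correlate logs across nodes.
  • Inspect memory/CPU pressure on evicted nodes; a starved node may be unable to send heartbeats in time.
  • Evaluate interconnect latency affecting the network heartbeat, and compare it with the configured timeout (crsctl get css misscount).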
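
When reviewing the CSS log after an eviction, a quick keyword scan narrows down where to start reading. The sketch below assumes you pass the path to the CSS log yourself; the keyword list (evict, heartbeat, reboot) is a guess at useful search terms rather than an official message catalog, and scan_css_log is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: list the most recent lines in a CSS log that mention likely
# eviction-related keywords, with their line numbers.
# Assumption: the keyword list is illustrative; adjust it to your release.

scan_css_log() {
  local log=$1
  if [ ! -r "$log" ]; then
    echo "cannot read $log" >&2
    return 2
  fi
  # -n prints line numbers, -i ignores case, -E enables alternation.
  grep -n -i -E 'evict|heartbeat|reboot' "$log" | tail -n 50
}
```

Use the line numbers to open the log at the matching point and read the surrounding entries. Run the scan on the surviving nodes as well as the evicted one, because each side records its own view of the event.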

🚨 Database Service Failures

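  • Restart database/service with srvctl (for example srvctl stop database -d <db_name> followed by srvctl start database -d <db_name>).
  • Check the database alert log of every instance for ORA- errors.
  • Validate service placement with srvctl config service -d <db_name>, and compare it with where the service is actually running (srvctl status service -d <db_name>).
  • Ensure each service's load balancing and failover settings (preferred and available instances, connection failover options) match what the application expects.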
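
The srvctl steps above can be wrapped in two small helpers. This is a sketch under the assumption that srvctl is on PATH and that the short -d and -s options are accepted by your release; the function names and the orcl / oltp_svc example values are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report on, and restart, a single database service with srvctl.
# Assumption: run as a user allowed to manage the database resources.

service_report() {
  local db=$1
  local svc=$2
  srvctl status database -d "$db"
  srvctl status service -d "$db" -s "$svc"
  srvctl config service -d "$db" -s "$svc"
}

service_restart() {
  local db=$1
  local svc=$2
  # A failed stop usually means the service was already down; carry on.
  srvctl stop service -d "$db" -s "$svc" ||
    echo "stop failed or service already down" >&2
  # Only report status if the start itself succeeded.
  srvctl start service -d "$db" -s "$svc" &&
    srvctl status service -d "$db" -s "$svc"
}
```

For example, service_report orcl oltp_svc shows the current state and configuration, and service_restart orcl oltp_svc restarts the service. Run the report first so you know the placement before you change anything.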

💡 Pro Tips

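  • Enable OSWatcher for continuous system diagnostics.
  • Use Cluster Health Monitor (CHM) for detailed analysis; its data can be queried with oclumon.
  • Collect diagnostic data with diagcollection.pl, or with tfactl diagcollect on releases that ship Trace File Analyzer.
  • Keep OCR and voting disk backups up to date; ocrconfig -showbackup lists the existing OCR backups, and since 11.2 voting disk data is backed up automatically along with the OCR.
  • Test failover scenarios regularly in non-production environments.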
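
Two of these tips are easy to turn into a routine check. The sketch below assumes ocrconfig and oclumon are on PATH and that the current user may run them (ocrconfig can require elevated privileges); housekeeping_report is a hypothetical name and the 15-minute window is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: list OCR backups and dump recent Cluster Health Monitor data.
# Assumption: Grid home bin directory is on PATH.

housekeeping_report() {
  echo "== OCR backups"
  ocrconfig -showbackup

  echo "== CHM node view, last 15 minutes"
  # -last takes a duration in "HH:MM:SS" form.
  oclumon dumpnodeview -allnodes -last "00:15:00"
}
```

Scheduling this once a day and keeping the output makes it obvious when automatic OCR backups have stopped.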

✨ Conclusion

Troubleshooting Oracle RAC requires a systematic approach: start with cluster diagnostics, then move through network verification, ASM checks, and service validation. By working through these layers in order, DBAs can identify root causes quickly and restore high availability in mission-critical systems.