🚑 Oracle RAC Troubleshooting Guide 🌈
Oracle Real Application Clusters (RAC) troubleshooting can be challenging due to its complexity and multiple layers of infrastructure. This guide provides colorful and practical tips for identifying and resolving common RAC issues effectively.
🔍 Diagnosing Cluster Issues
- Check resource status with `crsctl stat res -t`.
- Review CRS logs under `$GRID_HOME/log/<node>/crsd` (on 12.1.0.2 and later, under `$ORACLE_BASE/diag/crs/<node>/crs/trace`).
- Confirm node membership using `olsnodes -n`.
- Inspect voting disk and OCR consistency.
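The checks above can be captured in one pass. Here is a minimal sketch, assuming the Grid Infrastructure binaries are on `PATH` for the current user. The `collect` helper, the `REPORT_DIR` variable, and the output file names are made up for illustration. The voting disk and OCR checks use `crsctl query css votedisk` and `ocrcheck`.

```shell
#!/bin/sh
# collect <outfile> <command...>
# Runs the command and saves its output, or records that the tool is missing.
collect() {
  out=$1
  shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >"$out" 2>&1
  else
    echo "skipped: $1 not found on PATH" >"$out"
  fi
}

# REPORT_DIR is a hypothetical override; by default a timestamped directory is used.
report_dir=${REPORT_DIR:-/tmp/rac_diag_$(date +%Y%m%d_%H%M%S)}
mkdir -p "$report_dir"

collect "$report_dir/resources.txt" crsctl stat res -t          # resource status
collect "$report_dir/nodes.txt"     olsnodes -n                 # node membership
collect "$report_dir/votedisk.txt"  crsctl query css votedisk   # voting disks
collect "$report_dir/ocrcheck.txt"  ocrcheck                    # OCR integrity

echo "snapshot written to $report_dir"
```

Keeping one snapshot per incident makes it easier to compare cluster state before and after a fix.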
🌐 Network & Connectivity Problems
- Verify the interconnect with `ping` or `traceroute`.
- Check for network packet drops in OS logs.
- Ensure SCAN listeners are running: `srvctl status scan_listener`.
- Look for DNS misconfigurations affecting node resolution.
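A rough sketch of these network checks follows. `PEERS`, `PRIVATE_IF`, and `SCAN_NAME` are hypothetical variables you would set for your own cluster, and `count_unique_addresses` is an illustrative helper. It assumes a Linux host with `getent` and `ip` available. A SCAN name is commonly set up to resolve to three addresses, so a lower count is worth a look at DNS.

```shell
#!/bin/sh
# Reads "address ..." lines on stdin and prints how many distinct addresses appear.
count_unique_addresses() {
  awk '{print $1}' | sort -u | wc -l | tr -d ' '
}

PEERS=${PEERS:-}            # hypothetical: space-separated private interconnect hostnames
PRIVATE_IF=${PRIVATE_IF:-}  # hypothetical: private interconnect interface name
SCAN_NAME=${SCAN_NAME:-}    # hypothetical: the cluster SCAN name

# Basic reachability over the interconnect.
for peer in $PEERS; do
  if ping -c 3 "$peer" >/dev/null 2>&1; then
    echo "interconnect ok: $peer"
  else
    echo "interconnect FAILED: $peer"
  fi
done

# Interface counters include dropped and errored packets.
if [ -n "$PRIVATE_IF" ] && command -v ip >/dev/null 2>&1; then
  ip -s link show dev "$PRIVATE_IF"
fi

# How many addresses does the SCAN name resolve to?
if [ -n "$SCAN_NAME" ] && command -v getent >/dev/null 2>&1; then
  n=$(getent ahosts "$SCAN_NAME" | count_unique_addresses)
  echo "$SCAN_NAME resolves to $n address(es)"
fi

if command -v srvctl >/dev/null 2>&1; then
  srvctl status scan_listener
fi
```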
💽 ASM & Storage Issues
- Check disk group health: `asmcmd lsdg`.
- Review the ASM `alert.log` for errors.
- Confirm shared storage is accessible across all nodes.
- Address ORA-150xx errors by validating disk paths.
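The `lsdg` output can be filtered so that only disk groups needing attention are shown. This is a sketch, not a definitive parser: it assumes the header row contains the column names `State`, `Offline_disks`, and `Name`, and that every row has a value in every column. `flag_unhealthy_diskgroups` is a name invented here.

```shell
#!/bin/sh
# Reads `asmcmd lsdg` output on stdin and prints disk groups that are
# not MOUNTED or that report offline disks. Columns are located by header
# name (assumed: State, Offline_disks, Name) rather than by fixed position.
flag_unhealthy_diskgroups() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    {
      state   = $(col["State"])
      offline = $(col["Offline_disks"])
      name    = $(col["Name"])
      if (state != "MOUNTED" || offline + 0 > 0)
        printf "%s state=%s offline_disks=%s\n", name, state, offline
    }
  '
}

# Run as the Grid Infrastructure owner with the ASM environment set.
if command -v asmcmd >/dev/null 2>&1; then
  asmcmd lsdg | flag_unhealthy_diskgroups
fi
```

No output means every listed disk group is mounted with no offline disks.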
🖥️ Node Eviction Problems
- Review CSS logs (`ocssd.log`) for eviction reasons.
- Check time synchronization (NTP/Chrony); large clock drift or jumps can lead to evictions.
- Inspect memory/CPU pressure on evicted nodes.
- Evaluate interconnect latency affecting heartbeat.
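A possible first pass over an evicted node is sketched below. `ALERT_LOG` is a hypothetical variable pointing at the Clusterware alert log, and `eviction_messages` is an illustrative helper. The message IDs it searches for (CRS-1607 through CRS-1612, covering eviction notices and missed network heartbeat warnings) are an assumption from memory, so confirm them against the message reference for your release.

```shell
#!/bin/sh
# Prints log lines carrying the CRS message IDs assumed to relate to
# evictions and missed network heartbeats (CRS-1607 .. CRS-1612).
eviction_messages() {
  grep -E 'CRS-16(07|08|09|10|11|12)' "$1"
}

ALERT_LOG=${ALERT_LOG:-}   # hypothetical: path to the Clusterware alert log

if [ -n "$ALERT_LOG" ] && [ -r "$ALERT_LOG" ]; then
  eviction_messages "$ALERT_LOG" | tail -n 20
fi

# Heartbeat thresholds and cluster time synchronization state.
if command -v crsctl >/dev/null 2>&1; then
  crsctl get css misscount
  crsctl get css disktimeout
  crsctl check ctss
fi

# OS-level clock source, when chrony is in use.
if command -v chronyc >/dev/null 2>&1; then
  chronyc tracking
fi
```

Comparing the timestamps of the matched lines with OS metrics from the same window helps separate network causes from CPU or memory starvation.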
🚨 Database Service Failures
- Restart the database or service with `srvctl`.
- Check database alert logs for ORA errors.
- Validate service placement with `srvctl config service`.
- Ensure load balancing and failover policies are configured properly.
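The service checks above could be scripted along these lines. This is a sketch with a dry-run default, so nothing touches the cluster until you opt in. `DB_NAME`, `SERVICE_NAME`, `DRY_RUN`, and the function names are illustrative, and the values `orcl` and `oltp_svc` are placeholders.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default) commands are only printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# check_service <db_unique_name> <service_name>
check_service() {
  db=$1
  svc=$2
  run srvctl status database -d "$db"
  run srvctl status service -d "$db" -s "$svc"
  run srvctl config service -d "$db" -s "$svc"   # placement and failover settings
}

# restart_service <db_unique_name> <service_name>
restart_service() {
  db=$1
  svc=$2
  run srvctl stop service -d "$db" -s "$svc"
  run srvctl start service -d "$db" -s "$svc"
}

DB_NAME=${DB_NAME:-orcl}                # placeholder database name
SERVICE_NAME=${SERVICE_NAME:-oltp_svc}  # placeholder service name

check_service "$DB_NAME" "$SERVICE_NAME"
```

Review the printed commands first, then rerun with `DRY_RUN=0` on a node where `srvctl` is available. `restart_service` is defined but deliberately not called.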
💡 Pro Tips
- Enable OSWatcher for continuous system diagnostics.
- Use Cluster Health Monitor (CHM) for detailed analysis.
- Collect diagnostic data with `diagcollection.pl`.
- Keep OCR and voting disk backups up to date.
- Test failover scenarios regularly in non-production environments.
✨ Conclusion
Troubleshooting Oracle RAC requires a systematic approach: start with cluster diagnostics, then move through network verification and ASM checks to service validation. By mastering these steps, DBAs can quickly identify root causes and restore high availability in mission-critical systems.