
Operations: AIOps & RCA

Goal: Reduce MTTR (Mean Time To Resolution) from hours to minutes. We shift from “staring at dashboards” to “reviewing AI diagnoses.”

graph TD
    App[Application] -->|Logs| Observability["Azure Monitor / Datadog"]
    Observability -->|"Alert (High CPU)"| Alerter[Alert Manager]
    Alerter -->|Trigger| RCA[RCA Agent]
    RCA -->|Query| Logs[Log Store]
    RCA -->|Query| Code[GitHub Repo]
    RCA -->|Analysis| Draft[Incident Report]
    Draft -->|Suggest| Fix[Remediation Script]
    
    style RCA fill:#b3e5fc
    style Draft fill:#fff9c4
    style Fix fill:#c8e6c9
  • Observability: Azure Monitor, Datadog.
  • LLM Integration: LangChain (connecting logs to LLM), Langfuse (tracing AI apps).
  • Log Analytics: KQL (Kusto Query Language) generated by AI.
  1. Log Collection: Ensure structured logging (JSON) so the AI can parse it easily.
  2. Anomaly Detection: Use the built-in AI in Azure Monitor/Datadog to find “unknown unknowns” (e.g., an unusual latency pattern).
  3. RCA Agent: Build a simple agent that triggers on P1 alerts.
  4. Automated Remediation: Allow the agent to restart pods or clear caches (with safeguards).
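Step 1 can be illustrated with a minimal sketch of JSON logging using Python's standard `logging` module. The logger name `orders-api` and the `context` convention for extra fields are assumptions made for this example, not part of any specific stack.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: structured fields arrive via extra={"context": {...}}.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            payload.update(context)
        if record.exc_info:
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name: str = "orders-api") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (hypothetical name)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage:
#   log = build_logger()
#   log.warning("slow query", extra={"context": {"duration_ms": 2000}})
```

With one JSON object per line, the agent can filter on fields such as `level` or `duration_ms` instead of parsing free text.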

Alert: “SQL Query duration > 2s on high load.”

Process: The RCA Agent receives the alert payload.

Agent Action:

  1. Queries the DB logs around the alert timestamp.
  2. Identifies the slow query: SELECT * FROM Orders WHERE ...
  3. Checks git log for recent changes to that query.
  4. Analyzes the execution plan.
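Step 3 could be approximated with git's pickaxe search (`git log -S<string>`), which lists commits that added or removed a given string. The sketch below assumes the agent has a local clone of the repository; the function names and the two-week window are illustrative choices.

```python
import subprocess
from typing import List


def git_pickaxe_command(fragment: str, since: str = "2 weeks ago") -> List[str]:
    """Build a git command listing recent commits that added or removed `fragment`."""
    return ["git", "log", f"--since={since}", "--oneline", f"-S{fragment}"]


def recent_commits_touching(
    fragment: str, repo_dir: str, since: str = "2 weeks ago"
) -> List[str]:
    """Run the pickaxe search in `repo_dir` and return one line per matching commit."""
    result = subprocess.run(
        git_pickaxe_command(fragment, since),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Usage (assumes ./repo is a git checkout):
#   recent_commits_touching("FROM Orders", "./repo")
```

Passing the command as an argument list rather than a shell string keeps a query fragment taken from logs from being interpreted by the shell.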

Agent Output (Slack Message):

Incident: #1234 - High Latency
Cause: Missing Index on OrderDate column.
Evidence: Query execution time jumped from 20ms to 2000ms after Deployment #55.
Proposed Fix: CREATE INDEX idx_order_date ON Orders(OrderDate);
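A message like this could be built as a Slack Block Kit payload with approve and reject buttons. This is a sketch only: the `action_id` values `approve_fix` and `reject_fix` and the function name are hypothetical, and posting the payload (for example via an incoming webhook or `chat.postMessage`) is left out.

```python
from typing import Any, Dict


def build_incident_message(
    incident_id: int, title: str, cause: str, evidence: str, proposed_fix: str
) -> Dict[str, Any]:
    """Return a Slack message payload summarising the diagnosis."""
    summary = (
        f"*Incident:* #{incident_id} - {title}\n"
        f"*Cause:* {cause}\n"
        f"*Evidence:* {evidence}\n"
        f"*Proposed Fix:* `{proposed_fix}`"
    )

    def button(label: str, action_id: str, style: str) -> Dict[str, Any]:
        return {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "style": style,
            "action_id": action_id,  # hypothetical IDs handled by your Slack app
            "value": str(incident_id),
        }

    return {
        "text": f"Incident #{incident_id}: {title}",  # fallback for notifications
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
            {
                "type": "actions",
                "elements": [
                    button("Approve", "approve_fix", "primary"),
                    button("Reject", "reject_fix", "danger"),
                ],
            },
        ],
    }
```

Carrying the incident number in each button's `value` lets the approval handler look up which proposed fix the human actually approved.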

Human: “Approved.” (Clicks the button in Slack.)

Agent: Executes the script against the DB (or creates a Hotfix PR).
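One way to add safeguards to this last step is to execute only after explicit approval, and only when the statement matches a narrow allow-list. The sketch below assumes a single allowed shape (`CREATE INDEX`); the pattern, function names, and the injected `execute` callable are illustrative, not a complete SQL validator.

```python
import re
from typing import Callable

# Assumed allow-list: a single CREATE INDEX statement on named columns.
ALLOWED_FIX = re.compile(
    r"^CREATE\s+INDEX\s+\w+\s+ON\s+\w+\s*\(\s*\w+(\s*,\s*\w+)*\s*\)\s*;?$",
    re.IGNORECASE,
)


def is_safe_fix(sql: str) -> bool:
    """Accept only one allow-listed statement; reject anything chained after it."""
    statement = sql.strip()
    if ";" in statement.rstrip(";"):
        return False  # more than one statement
    return bool(ALLOWED_FIX.match(statement))


def apply_fix(sql: str, approved: bool, execute: Callable[[str], None]) -> str:
    """Run `sql` through `execute` only if a human approved it and it is allow-listed."""
    if not approved:
        return "skipped: awaiting human approval"
    if not is_safe_fix(sql):
        return "escalated: statement is not on the allow-list"
    execute(sql)
    return "executed"
```

Anything outside the allow-list is better routed to the Hotfix PR path, where it gets a normal code review.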

  • Sanitize Logs: Before sending logs to an LLM, ensure PII masking is active. You don’t want to send customer emails to OpenAI.
  • Context Window: You can’t send all logs. Send the error stack trace and the 50 lines preceding it.
  • Feedback Loop: If the AI’s diagnosis is wrong, tell it. “No, it wasn’t the index, it was the disk I/O.”
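The first two points can be combined into a small pre-processing step: cut the log down to the stack trace plus the 50 lines before it, then mask emails. This is a rough sketch; the `Traceback` marker, the `trace_lines` cap, and the email-only regex are assumptions, and real PII masking would need to cover more than email addresses.

```python
import re
from typing import List

# Simplified email pattern for illustration; not a full PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_pii(line: str) -> str:
    """Replace email addresses with a placeholder."""
    return EMAIL.sub("<email>", line)


def build_llm_context(
    lines: List[str],
    error_marker: str = "Traceback",  # assumed marker for the start of a stack trace
    preceding: int = 50,
    trace_lines: int = 40,  # assumed cap on how much of the trace to keep
) -> List[str]:
    """Return the first stack trace plus the lines preceding it, with emails masked."""
    for index, line in enumerate(lines):
        if error_marker in line:
            start = max(0, index - preceding)
            window = lines[start : index + trace_lines]
            return [mask_pii(entry) for entry in window]
    return []  # no error found: send nothing rather than everything
```

Masking runs after slicing, so only the lines that will actually leave your network are scanned.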

Noise

If you alert on everything, the AI will hallucinate causes for non-issues. Only trigger RCA on clear, actionable signals.
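A simple gate in front of the agent can enforce this. The sketch below assumes alerts carry a severity label and a firing duration; the `Alert` fields and the five-minute threshold are hypothetical and would map onto whatever your alert payload actually contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str  # e.g. "P1", "P2", ... (assumed labelling scheme)
    firing_minutes: int  # how long the alert has been firing


def should_trigger_rca(alert: Alert, min_firing_minutes: int = 5) -> bool:
    """Trigger RCA only for P1 alerts that have persisted, not for brief blips."""
    return alert.severity == "P1" and alert.firing_minutes >= min_firing_minutes
```

For example, a P1 "SQL Query duration > 2s" alert that has been firing for ten minutes passes the gate, while a P3 CPU blip does not.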

  1. Sleep Better: AIOps handles the “3 AM” alerts that genuinely just need a restart.
  2. Knowledge Capture: The “Incident Report” generated by AI becomes part of your knowledge base for future training.
  3. Proactive vs Reactive: AI can predict “Disk will fill up in 2 days” based on trends, allowing you to fix it before it breaks.
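The disk prediction in point 3 does not necessarily need an LLM; a straight-line fit over recent usage samples is often enough for a first estimate. The sketch below uses ordinary least squares and assumes roughly linear growth; the function name and sample format are illustrative.

```python
from typing import List, Optional, Tuple


def days_until_full(
    samples: List[Tuple[float, float]], capacity_gb: float
) -> Optional[float]:
    """Estimate days until the disk is full from (day, used_gb) samples.

    Fits a least-squares line through the samples. Returns None when there is
    too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None  # all samples taken at the same time
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None  # usage is flat or shrinking
    intercept = mean_y - slope * mean_x
    full_day = (capacity_gb - intercept) / slope
    latest_day = max(x for x, _ in samples)
    return max(0.0, full_day - latest_day)


# Usage: four daily samples growing 5 GB/day on a 105 GB disk
#   days_until_full([(0, 80), (1, 85), (2, 90), (3, 95)], 105)  # about 2 days
```

A linear fit will mislead on bursty or seasonal usage, so treat the estimate as a prompt to investigate rather than a deadline.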