feat: complete RAG runbook workflow and release docs

2026-05-06 04:48:41 +02:00
parent 450de24d28
commit 57f4c0efaa
26 changed files with 2510 additions and 137 deletions
--- a/README.md
+++ b/README.md
@@ -1,112 +1,202 @@
-# tai — Linux AI Troubleshooting Agent
+# tai - Linux AI Troubleshooting Agent

-`tai` is an agentic AI-driven troubleshooting tool for Linux systems. It autonomously investigates issues on remote hosts via SSH, analyzes relevant logs and configuration files, and provides a clear diagnosis along with suggested remediation steps — all without making any changes to the target system.
+`tai` is a read-only Linux troubleshooting assistant that connects to remote hosts via SSH, collects diagnostics, and runs grounded AI analysis using local models.

-## Overview
+The project is designed for operators who want AI speed without losing operational safety or evidence traceability.

-Given a problem description and a target hostname, `tai` connects to the remote system over SSH, gathers relevant data (logs, configuration files, service status, etc.), and uses a locally-hosted AI model to reason about the root cause and recommend solutions.
+## What tai Does

-The agent operates in **read-only mode at all times**. It will never modify the target system under any circumstances — all suggestions are presented to the human troubleshooter for review and action.
+- Runs safe, read-only remote checks over SSH
+- Builds a diagnostics collection plan from issue text
+- Supports one-shot analysis and interactive follow-up mode
+- Uses local AI backends (OpenAI-compatible endpoint, typically Ollama)
+- Uses RAG over collected diagnostics (Tier 1)
+- Uses persistent runbook retrieval with ChromaDB (Tier 2)
+- Emits structured Markdown analysis with evidence and actions
+- Can log session and retrieval telemetry locally as JSONL

-## Supported Distributions
+## Safety Model

- Ubuntu
- Debian
- RHEL
- Rocky Linux
+`tai` enforces read-only command policy on all remote commands.

-## Example Workflow
+- Allowlist based command validation
+- Blocked shell operators (`>`, `>>`, `<`, `|`, `&&`, `||`, `;`)
+- No write/mutation actions are executed on target hosts

-A troubleshooter receives a ticket reporting that the Apache service on a remote server has failed to start. They provide `tai` with:
+The tool may suggest remediation commands in output, but does not execute them.

-1. The ticket description or error message
-1. The hostname of the affected system
-1. Any relevant directories to focus on
+## Current Feature Set

-`tai` then connects to the host, reads through system logs, service configurations, and any other related files, and returns a structured analysis of the likely cause along with recommended next steps.
+### Core CLI

-## Suggested Tooling
+- `tai run ...` main troubleshooting entrypoint
+- SSH options: host, port, identity file, jump host, SSH config control
+- Live probe mode (`uname -a`)
+- Diagnostics collection mode
+- AI analysis mode
+- Interactive loop with `/collect`, `/analyze`, `/help`, `/quit`

-| Component | Tool |
-|-----------|------|
-| AI inference backend | [Ollama](https://ollama.com) |
-| Chat model | `gemma3:4b`, `llama3.1:8b`, or `qwen2.5:7b` |
-| Embedding model | `nomic-embed-text` (via Ollama) |
-| Vector store | [ChromaDB](https://www.trychroma.com) (embedded, local) |
-| Language | Python 3.11+ |
+### AI and Prompting

-______________________________________________________________________
+- OpenAI-compatible AI client
+- Configurable model, timeout, token budget
+- Guardrails to keep responses evidence-based
+- Initial and follow-up prompts grounded in collected diagnostics
+- Non-streaming completion path for local backend reliability

-## How-To: Setting Up the AI Backend (Arch Linux + RTX 3080)
+### RAG and Knowledge

-`tai` uses [Ollama](https://ollama.com) as its local AI backend. It exposes an OpenAI-compatible HTTP API that `tai` talks to — no cloud services, no data leaving your machine.
+- Tier 1: semantic retrieval of diagnostic chunks per question
+- Tier 2: persistent runbook knowledge base with ChromaDB
+- Runbook retrieval injected as separate prompt context
+- Retrieval debug output (`--rag-debug`)
+- Full-context fallback if retrieval/indexing fails

-An RTX 3080 (10 GB VRAM) comfortably runs 7–8B parameter models at 4-bit quantisation.
+### Runbook Management

-### 1. Install CUDA and Ollama
+- `tai runbooks sync --path ./runbooks --store ~/.tai/runbooks`
+- `tai runbooks list --store ~/.tai/runbooks`
+- `tai runbooks add <file> --store ~/.tai/runbooks`

-```bash
-# CUDA runtime (skip if already installed)
-sudo pacman -S cuda
+### Presence and Absence Signals

-# Ollama with CUDA support from the AUR
-yay -S ollama-cuda
-# or: paru -S ollama-cuda
+For recognized services/subsystems (for example `sssd`, `docker`, `x2go`, `xorg`, `wayland`, `selinux`, `apparmor`), collection includes:

-# Enable and start the service
-sudo systemctl enable --now ollama
+- service unit-file discovery (`systemctl list-unit-files ...`)
+- binary presence checks via `ls -l <expected path>`
+- service status and journals
+- selected config path probes where defined
+
+This improves analysis quality for "component missing/not installed" scenarios.
+
+## Repository Layout
+
+```text
+src/tai/
+  cli.py                # CLI commands and orchestration
+  ssh_client.py         # SSH execution + read-only policy
+  collectors.py         # execution of collection plans
+  plan.py               # issue -> command plan builder
+  ai_client.py          # OpenAI-compatible AI + embeddings client
+  ai_guardrails.py      # response guardrails/validation
+  prompt_builder.py     # prompt composition
+  rag_retriever.py      # diagnostic chunk retrieval
+  runbook_store.py      # persistent ChromaDB runbook index/query
+  chroma_telemetry.py   # no-op Chroma telemetry client
+  session_log.py        # JSONL session logging
+  input_parser.py       # CLI input validation
+  models.py             # domain request models
+
+runbooks/
+  *.md                  # Markdown runbooks with frontmatter
+
+tests/
+  test_*.py             # unit and CLI coverage
 ```

-### 2. Pull a chat model
+## Installation

 ```bash
-ollama pull gemma3:4b       # ~3 GB — fast, good for sysadmin tasks
-ollama pull llama3.1:8b     # ~5 GB — stronger reasoning
-ollama pull qwen2.5:7b      # ~4.5 GB — strong structured output
+python -m venv .venv
+source .venv/bin/activate
+pip install -e .
 ```

-### 3. Pull the embedding model
-
-`tai` uses `nomic-embed-text` to embed diagnostic data and runbooks for semantic retrieval (RAG). Pull it on the same host as Ollama:
+RAG runbook storage requires optional dependencies:

 ```bash
-ollama pull nomic-embed-text   # ~274 MB
+pip install -e .[rag]
 ```

-Verify it loaded:
+Development dependencies:

 ```bash
-curl http://localhost:11434/api/embeddings \
-  -d '{"model":"nomic-embed-text","prompt":"test"}'
+pip install -e .[dev]
 ```

-A JSON response with an `"embedding"` array confirms it is ready.
+## AI Backend Setup (Ollama)

-### 4. Verify the chat model works
+`tai` expects an OpenAI-compatible API endpoint, defaulting to `http://localhost:11434/v1`.

 ```bash
-ollama run gemma3:4b "what causes a systemd service to enter failed state?"
+ollama pull gemma3:4b
+ollama pull nomic-embed-text
 ```

-### 5. Verify the HTTP API is running
-
-`tai` communicates with Ollama over its OpenAI-compatible REST API:
+Quick backend check:

 ```bash
 curl http://localhost:11434/api/generate \
  -d '{"model":"gemma3:4b","prompt":"hello","stream":false}'
 ```

-A JSON response with a `response` field confirms everything is working.
+## Usage

-### 6. Point tai at your Ollama instance
-
-Once `tai` AI integration is complete, use these flags:
+### Basic Probe and Collect

 ```bash
-tai "nginx failing to start" --host web01 \
-  --ai-host http://localhost:11434 \
-  --model gemma3:4b
+tai run "nginx failing to start" \
+  --host web01 \
+  --probe \
+  --collect
 ```

-The default values for `--ai-host` and `--model` will be `http://localhost:11434` and `gemma3:4b` respectively, so for local use you won't need to specify them explicitly.
+### Analyze with RAG and Runbooks
+
+```bash
+tai run "why isnt sssd working?" \
+  --host ssh.archflux.net \
+  --port 5566 \
+  --probe --collect --analyze \
+  --runbooks ~/.tai/runbooks \
+  --rag-debug \
+  --ai-timeout-seconds 45 \
+  --ai-max-tokens 300
+```
+
+### Interactive Session
+
+```bash
+tai run "docker daemon keeps failing" \
+  --host app01 \
+  --collect \
+  --interactive \
+  --runbooks ~/.tai/runbooks
+```
+
+## Runbook Workflow
+
+1. Write Markdown runbooks in `runbooks/` with frontmatter keys: `service`, `symptoms`, `tags`.
+1. Sync the store.
+1. Pass `--runbooks <store-path>` to `tai run`.
+
+Example:
+
+```bash
+tai runbooks sync --path ./runbooks --store ~/.tai/runbooks
+tai runbooks list --store ~/.tai/runbooks
+```
+
+## Testing
+
+```bash
+pytest
+```
+
+Focused suites:
+
+```bash
+pytest tests/test_plan.py tests/test_ai.py tests/test_cli.py
+```
+
+## Known Limits
+
+- Service-specific presence checks currently apply to recognized service/subsystem names.
+- Package-manager-level presence checks are not yet in the default read-only command allowlist.
+- Tier 3 persistent session memory is not implemented yet.
+
+## Changelog and Roadmap
+
+- See `CHANGELOG.md` for release history.
+- See `ROADMAP.md` for phase status and next milestones.
+- See `docs/ARCHITECTURE.md` for module-level architecture and data flow.