feat: complete RAG runbook workflow and release docs

2026-05-06 04:48:41 +02:00
parent 450de24d28
commit 57f4c0efaa
26 changed files with 2510 additions and 137 deletions

214
README.md

@@ -1,112 +1,202 @@
# tai Linux AI Troubleshooting Agent
# tai - Linux AI Troubleshooting Agent
`tai` is an agentic AI-driven troubleshooting tool for Linux systems. It autonomously investigates issues on remote hosts via SSH, analyzes relevant logs and configuration files, and provides a clear diagnosis along with suggested remediation steps — all without making any changes to the target system.
`tai` is a read-only Linux troubleshooting assistant that connects to remote hosts via SSH, collects diagnostics, and runs grounded AI analysis using local models.
## Overview
The project is designed for operators who want AI speed without losing operational safety or evidence traceability.
Given a problem description and a target hostname, `tai` connects to the remote system over SSH, gathers relevant data (logs, configuration files, service status, etc.), and uses a locally-hosted AI model to reason about the root cause and recommend solutions.
## What tai Does
The agent operates in **read-only mode at all times**. It will never modify the target system under any circumstances — all suggestions are presented to the human troubleshooter for review and action.
- Runs safe, read-only remote checks over SSH
- Builds a diagnostics collection plan from issue text
- Supports one-shot analysis and interactive follow-up mode
- Uses local AI backends (OpenAI-compatible endpoint, typically Ollama)
- Uses RAG over collected diagnostics (Tier 1)
- Uses persistent runbook retrieval with ChromaDB (Tier 2)
- Emits structured Markdown analysis with evidence and actions
- Can log session and retrieval telemetry locally as JSONL
## Supported Distributions
## Safety Model
- Ubuntu
- Debian
- RHEL
- Rocky Linux
`tai` enforces read-only command policy on all remote commands.
## Example Workflow
- Allowlist-based command validation
- Blocked shell operators (`>`, `>>`, `<`, `|`, `&&`, `||`, `;`)
- No write/mutation actions are executed on target hosts
A troubleshooter receives a ticket reporting that the Apache service on a remote server has failed to start. They provide `tai` with:
The tool may suggest remediation commands in output, but does not execute them.
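The policy above can be pictured as two checks: reject any blocked shell operator, then require the command to start with an allowlisted prefix. The following is a minimal sketch under that assumption, not the code in `ssh_client.py`; `ALLOWED_PREFIXES` and `is_read_only` are hypothetical names, and the allowlist entries are illustrative only.

```python
import shlex

# Hypothetical allowlist of command prefixes; the real list may differ.
ALLOWED_PREFIXES = {
    ("uname",),
    ("ls",),
    ("journalctl",),
    ("systemctl", "status"),
    ("systemctl", "list-unit-files"),
}

# Shell operators the README lists as blocked.
BLOCKED_OPERATORS = (">>", ">", "<", "&&", "||", "|", ";")


def is_read_only(command: str) -> bool:
    """Return True only if the command passes both policy checks."""
    # Check 1: no blocked operator anywhere in the raw string.
    if any(op in command for op in BLOCKED_OPERATORS):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: refuse rather than guess at the intent.
        return False
    # Check 2: the leading tokens must match an allowlisted prefix.
    return any(tuple(tokens[: len(p)]) == p for p in ALLOWED_PREFIXES)


if __name__ == "__main__":
    for cmd in ("uname -a", "systemctl restart sssd", "ls /etc; id"):
        print(f"{cmd!r}: {'allowed' if is_read_only(cmd) else 'blocked'}")
```

Matching on prefixes rather than on the executable alone keeps `systemctl status` allowed while `systemctl restart` stays blocked.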
1. The ticket description or error message
1. The hostname of the affected system
1. Any relevant directories to focus on
## Current Feature Set
`tai` then connects to the host, reads through system logs, service configurations, and any other related files, and returns a structured analysis of the likely cause along with recommended next steps.
### Core CLI
## Suggested Tooling
- `tai run ...` main troubleshooting entrypoint
- SSH options: host, port, identity file, jump host, SSH config control
- Live probe mode (`uname -a`)
- Diagnostics collection mode
- AI analysis mode
- Interactive loop with `/collect`, `/analyze`, `/help`, `/quit`
| Component | Tool |
|-----------|------|
| AI inference backend | [Ollama](https://ollama.com) |
| Chat model | `gemma3:4b`, `llama3.1:8b`, or `qwen2.5:7b` |
| Embedding model | `nomic-embed-text` (via Ollama) |
| Vector store | [ChromaDB](https://www.trychroma.com) (embedded, local) |
| Language | Python 3.11+ |
### AI and Prompting
______________________________________________________________________
- OpenAI-compatible AI client
- Configurable model, timeout, token budget
- Guardrails to keep responses evidence-based
- Initial and follow-up prompts grounded in collected diagnostics
- Non-streaming completion path for local backend reliability
## How-To: Setting Up the AI Backend (Arch Linux + RTX 3080)
### RAG and Knowledge
`tai` uses [Ollama](https://ollama.com) as its local AI backend. It exposes an OpenAI-compatible HTTP API that `tai` talks to — no cloud services, no data leaving your machine.
- Tier 1: semantic retrieval of diagnostic chunks per question
- Tier 2: persistent runbook knowledge base with ChromaDB
- Runbook retrieval injected as separate prompt context
- Retrieval debug output (`--rag-debug`)
- Full-context fallback if retrieval/indexing fails
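As a rough illustration of Tier 1 retrieval with the full-context fallback, the sketch below ranks diagnostic chunks by cosine similarity to the question embedding and returns every chunk if ranking fails. It assumes embeddings are plain lists of floats; `cosine` and `retrieve` are hypothetical helpers, not the functions in `rag_retriever.py`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(question_vec, chunks, top_k=2):
    """Return the top_k chunk texts most similar to the question.

    chunks is a list of (text, embedding) pairs. If an embedding is
    missing or malformed, fall back to returning every chunk.
    """
    try:
        ranked = sorted(
            chunks, key=lambda c: cosine(question_vec, c[1]), reverse=True
        )
        return [text for text, _ in ranked[:top_k]]
    except (TypeError, ValueError):
        # Full-context fallback: hand the model everything collected.
        return [text for text, _ in chunks]


if __name__ == "__main__":
    demo = [("sssd journal", [1.0, 0.0]), ("docker journal", [0.0, 1.0])]
    print(retrieve([1.0, 0.0], demo, top_k=1))
```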
An RTX 3080 (10 GB VRAM) comfortably runs 7-8B parameter models at 4-bit quantisation.
### Runbook Management
### 1. Install CUDA and Ollama
- `tai runbooks sync --path ./runbooks --store ~/.tai/runbooks`
- `tai runbooks list --store ~/.tai/runbooks`
- `tai runbooks add <file> --store ~/.tai/runbooks`
```bash
# CUDA runtime (skip if already installed)
sudo pacman -S cuda
### Presence and Absence Signals
# Ollama with CUDA support from the AUR
yay -S ollama-cuda
# or: paru -S ollama-cuda
For recognized services/subsystems (for example `sssd`, `docker`, `x2go`, `xorg`, `wayland`, `selinux`, `apparmor`), collection includes:
# Enable and start the service
sudo systemctl enable --now ollama
- service unit-file discovery (`systemctl list-unit-files ...`)
- binary presence checks via `ls -l <expected path>`
- service status and journals
- selected config path probes where defined
This improves analysis quality for "component missing/not installed" scenarios.
## Repository Layout
```text
src/tai/
cli.py # CLI commands and orchestration
ssh_client.py # SSH execution + read-only policy
collectors.py # execution of collection plans
plan.py # issue -> command plan builder
ai_client.py # OpenAI-compatible AI + embeddings client
ai_guardrails.py # response guardrails/validation
prompt_builder.py # prompt composition
rag_retriever.py # diagnostic chunk retrieval
runbook_store.py # persistent ChromaDB runbook index/query
chroma_telemetry.py # no-op Chroma telemetry client
session_log.py # JSONL session logging
input_parser.py # CLI input validation
models.py # domain request models
runbooks/
*.md # Markdown runbooks with frontmatter
tests/
test_*.py # unit and CLI coverage
```
### 2. Pull a chat model
## Installation
```bash
ollama pull gemma3:4b # ~3 GB — fast, good for sysadmin tasks
ollama pull llama3.1:8b # ~5 GB — stronger reasoning
ollama pull qwen2.5:7b # ~4.5 GB — strong structured output
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
### 3. Pull the embedding model
`tai` uses `nomic-embed-text` to embed diagnostic data and runbooks for semantic retrieval (RAG). Pull it on the same host as Ollama:
RAG runbook storage requires optional dependencies:
```bash
ollama pull nomic-embed-text # ~274 MB
pip install -e .[rag]
```
Verify it loaded:
Development dependencies:
```bash
curl http://localhost:11434/api/embeddings \
-d '{"model":"nomic-embed-text","prompt":"test"}'
pip install -e .[dev]
```
A JSON response with an `"embedding"` array confirms it is ready.
## AI Backend Setup (Ollama)
### 4. Verify the chat model works
`tai` expects an OpenAI-compatible API endpoint, defaulting to `http://localhost:11434/v1`.
```bash
ollama run gemma3:4b "what causes a systemd service to enter failed state?"
ollama pull gemma3:4b
ollama pull nomic-embed-text
```
### 5. Verify the HTTP API is running
`tai` communicates with Ollama over its OpenAI-compatible REST API:
Quick backend check:
```bash
curl http://localhost:11434/api/generate \
-d '{"model":"gemma3:4b","prompt":"hello","stream":false}'
```
A JSON response with a `response` field confirms everything is working.
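The same check can be scripted. The sketch below uses only the standard library and mirrors the `curl` call above; `build_generate_request` and `backend_ok` are hypothetical helper names, and the base URL is assumed to be the local default.

```python
import json
import urllib.request


def build_generate_request(model, prompt, base_url="http://localhost:11434"):
    """Build the POST request that the curl example sends."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def backend_ok(model, timeout=10.0):
    """Return True if the backend answers with a JSON `response` field."""
    request = build_generate_request(model, "hello")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection errors, timeouts, or a non-JSON body all count as failure.
        return False
    return isinstance(payload, dict) and "response" in payload
```

Calling `backend_ok("gemma3:4b")` before a long session gives a quick yes/no answer without parsing `curl` output by hand.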
## Usage
### 6. Point tai at your Ollama instance
Once `tai` AI integration is complete, use these flags:
### Basic Probe and Collect
```bash
tai "nginx failing to start" --host web01 \
--ai-host http://localhost:11434 \
--model gemma3:4b
tai run "nginx failing to start" \
--host web01 \
--probe \
--collect
```
The default values for `--ai-host` and `--model` will be `http://localhost:11434` and `gemma3:4b` respectively, so for local use you won't need to specify them explicitly.
### Analyze with RAG and Runbooks
```bash
tai run "why isnt sssd working?" \
--host ssh.archflux.net \
--port 5566 \
--probe --collect --analyze \
--runbooks ~/.tai/runbooks \
--rag-debug \
--ai-timeout-seconds 45 \
--ai-max-tokens 300
```
### Interactive Session
```bash
tai run "docker daemon keeps failing" \
--host app01 \
--collect \
--interactive \
--runbooks ~/.tai/runbooks
```
## Runbook Workflow
1. Write Markdown runbooks in `runbooks/` with frontmatter keys: `service`, `symptoms`, `tags`.
1. Sync the store.
1. Pass `--runbooks <store-path>` to `tai run`.
Example:
```bash
tai runbooks sync --path ./runbooks --store ~/.tai/runbooks
tai runbooks list --store ~/.tai/runbooks
```
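Before syncing, it can help to confirm that each runbook carries the three frontmatter keys. The sketch below assumes the frontmatter is a block of simple `key: value` lines between `---` delimiters, which the README does not spell out; `parse_frontmatter` and `missing_keys` are hypothetical helpers, not part of `tai`.

```python
REQUIRED_KEYS = {"service", "symptoms", "tags"}


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse `key: value` lines between leading `---` delimiters.

    Returns an empty dict if the block is absent or never closed.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter found


def missing_keys(text: str) -> list[str]:
    """List the required frontmatter keys a runbook lacks, sorted."""
    return sorted(REQUIRED_KEYS - set(parse_frontmatter(text)))


if __name__ == "__main__":
    sample = "---\nservice: sssd\ntags: auth\n---\n# SSSD login failures\n"
    print(missing_keys(sample))
```

Real runbooks may use richer YAML (lists, nested values), in which case a full YAML parser would be the better choice.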
## Testing
```bash
pytest
```
Focused suites:
```bash
pytest tests/test_plan.py tests/test_ai.py tests/test_cli.py
```
## Known Limits
- Service-specific presence checks currently apply to recognized service/subsystem names.
- Package-manager-level presence checks are not yet in the default read-only command allowlist.
- Tier 3 persistent session memory is not implemented yet.
## Changelog and Roadmap
- See `CHANGELOG.md` for release history.
- See `ROADMAP.md` for phase status and next milestones.
- See `docs/ARCHITECTURE.md` for module-level architecture and data flow.