feat: complete RAG runbook workflow and release docs
Some checks failed
CI / test (push) Failing after 15s
Some checks failed
CI / test (push) Failing after 15s
This commit is contained in:
214
README.md
214
README.md
@@ -1,112 +1,202 @@
|
||||
# tai — Linux AI Troubleshooting Agent
|
||||
# tai - Linux AI Troubleshooting Agent
|
||||
|
||||
`tai` is an agentic AI-driven troubleshooting tool for Linux systems. It autonomously investigates issues on remote hosts via SSH, analyzes relevant logs and configuration files, and provides a clear diagnosis along with suggested remediation steps — all without making any changes to the target system.
|
||||
`tai` is a read-only Linux troubleshooting assistant that connects to remote hosts via SSH, collects diagnostics, and runs grounded AI analysis using local models.
|
||||
|
||||
## Overview
|
||||
The project is designed for operators who want AI speed without losing operational safety or evidence traceability.
|
||||
|
||||
Given a problem description and a target hostname, `tai` connects to the remote system over SSH, gathers relevant data (logs, configuration files, service status, etc.), and uses a locally-hosted AI model to reason about the root cause and recommend solutions.
|
||||
## What tai Does
|
||||
|
||||
The agent operates in **read-only mode at all times**. It will never modify the target system under any circumstances — all suggestions are presented to the human troubleshooter for review and action.
|
||||
- Runs safe, read-only remote checks over SSH
|
||||
- Builds a diagnostics collection plan from issue text
|
||||
- Supports one-shot analysis and interactive follow-up mode
|
||||
- Uses local AI backends (OpenAI-compatible endpoint, typically Ollama)
|
||||
- Uses RAG over collected diagnostics (Tier 1)
|
||||
- Uses persistent runbook retrieval with ChromaDB (Tier 2)
|
||||
- Emits structured Markdown analysis with evidence and actions
|
||||
- Can log session and retrieval telemetry locally as JSONL
|
||||
|
||||
## Supported Distributions
|
||||
## Safety Model
|
||||
|
||||
- Ubuntu
|
||||
- Debian
|
||||
- RHEL
|
||||
- Rocky Linux
|
||||
`tai` enforces read-only command policy on all remote commands.
|
||||
|
||||
## Example Workflow
|
||||
- Allowlist based command validation
|
||||
- Blocked shell operators (`>`, `>>`, `<`, `|`, `&&`, `||`, `;`)
|
||||
- No write/mutation actions are executed on target hosts
|
||||
|
||||
A troubleshooter receives a ticket reporting that the Apache service on a remote server has failed to start. They provide `tai` with:
|
||||
The tool may suggest remediation commands in output, but does not execute them.
|
||||
|
||||
1. The ticket description or error message
|
||||
1. The hostname of the affected system
|
||||
1. Any relevant directories to focus on
|
||||
## Current Feature Set
|
||||
|
||||
`tai` then connects to the host, reads through system logs, service configurations, and any other related files, and returns a structured analysis of the likely cause along with recommended next steps.
|
||||
### Core CLI
|
||||
|
||||
## Suggested Tooling
|
||||
- `tai run ...` main troubleshooting entrypoint
|
||||
- SSH options: host, port, identity file, jump host, SSH config control
|
||||
- Live probe mode (`uname -a`)
|
||||
- Diagnostics collection mode
|
||||
- AI analysis mode
|
||||
- Interactive loop with `/collect`, `/analyze`, `/help`, `/quit`
|
||||
|
||||
| Component | Tool |
|
||||
|-----------|------|
|
||||
| AI inference backend | [Ollama](https://ollama.com) |
|
||||
| Chat model | `gemma3:4b`, `llama3.1:8b`, or `qwen2.5:7b` |
|
||||
| Embedding model | `nomic-embed-text` (via Ollama) |
|
||||
| Vector store | [ChromaDB](https://www.trychroma.com) (embedded, local) |
|
||||
| Language | Python 3.11+ |
|
||||
### AI and Prompting
|
||||
|
||||
______________________________________________________________________
|
||||
- OpenAI-compatible AI client
|
||||
- Configurable model, timeout, token budget
|
||||
- Guardrails to keep responses evidence-based
|
||||
- Initial and follow-up prompts grounded in collected diagnostics
|
||||
- Non-streaming completion path for local backend reliability
|
||||
|
||||
## How-To: Setting Up the AI Backend (Arch Linux + RTX 3080)
|
||||
### RAG and Knowledge
|
||||
|
||||
`tai` uses [Ollama](https://ollama.com) as its local AI backend. It exposes an OpenAI-compatible HTTP API that `tai` talks to — no cloud services, no data leaving your machine.
|
||||
- Tier 1: semantic retrieval of diagnostic chunks per question
|
||||
- Tier 2: persistent runbook knowledge base with ChromaDB
|
||||
- Runbook retrieval injected as separate prompt context
|
||||
- Retrieval debug output (`--rag-debug`)
|
||||
- Full-context fallback if retrieval/indexing fails
|
||||
|
||||
An RTX 3080 (10 GB VRAM) comfortably runs 7–8B parameter models at 4-bit quantisation.
|
||||
### Runbook Management
|
||||
|
||||
### 1. Install CUDA and Ollama
|
||||
- `tai runbooks sync --path ./runbooks --store ~/.tai/runbooks`
|
||||
- `tai runbooks list --store ~/.tai/runbooks`
|
||||
- `tai runbooks add <file> --store ~/.tai/runbooks`
|
||||
|
||||
```bash
|
||||
# CUDA runtime (skip if already installed)
|
||||
sudo pacman -S cuda
|
||||
### Presence and Absence Signals
|
||||
|
||||
# Ollama with CUDA support from the AUR
|
||||
yay -S ollama-cuda
|
||||
# or: paru -S ollama-cuda
|
||||
For recognized services/subsystems (for example `sssd`, `docker`, `x2go`, `xorg`, `wayland`, `selinux`, `apparmor`), collection includes:
|
||||
|
||||
# Enable and start the service
|
||||
sudo systemctl enable --now ollama
|
||||
- service unit-file discovery (`systemctl list-unit-files ...`)
|
||||
- binary presence checks via `ls -l <expected path>`
|
||||
- service status and journals
|
||||
- selected config path probes where defined
|
||||
|
||||
This improves analysis quality for "component missing/not installed" scenarios.
|
||||
|
||||
## Repository Layout
|
||||
|
||||
```text
|
||||
src/tai/
|
||||
cli.py # CLI commands and orchestration
|
||||
ssh_client.py # SSH execution + read-only policy
|
||||
collectors.py # execution of collection plans
|
||||
plan.py # issue -> command plan builder
|
||||
ai_client.py # OpenAI-compatible AI + embeddings client
|
||||
ai_guardrails.py # response guardrails/validation
|
||||
prompt_builder.py # prompt composition
|
||||
rag_retriever.py # diagnostic chunk retrieval
|
||||
runbook_store.py # persistent ChromaDB runbook index/query
|
||||
chroma_telemetry.py # no-op Chroma telemetry client
|
||||
session_log.py # JSONL session logging
|
||||
input_parser.py # CLI input validation
|
||||
models.py # domain request models
|
||||
|
||||
runbooks/
|
||||
*.md # Markdown runbooks with frontmatter
|
||||
|
||||
tests/
|
||||
test_*.py # unit and CLI coverage
|
||||
```
|
||||
|
||||
### 2. Pull a chat model
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
ollama pull gemma3:4b # ~3 GB — fast, good for sysadmin tasks
|
||||
ollama pull llama3.1:8b # ~5 GB — stronger reasoning
|
||||
ollama pull qwen2.5:7b # ~4.5 GB — strong structured output
|
||||
python -m venv .venv
|
||||
source .venv/bin/activate
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
### 3. Pull the embedding model
|
||||
|
||||
`tai` uses `nomic-embed-text` to embed diagnostic data and runbooks for semantic retrieval (RAG). Pull it on the same host as Ollama:
|
||||
RAG runbook storage requires optional dependencies:
|
||||
|
||||
```bash
|
||||
ollama pull nomic-embed-text # ~274 MB
|
||||
pip install -e .[rag]
|
||||
```
|
||||
|
||||
Verify it loaded:
|
||||
Development dependencies:
|
||||
|
||||
```bash
|
||||
curl http://localhost:11434/api/embeddings \
|
||||
-d '{"model":"nomic-embed-text","prompt":"test"}'
|
||||
pip install -e .[dev]
|
||||
```
|
||||
|
||||
A JSON response with an `"embedding"` array confirms it is ready.
|
||||
## AI Backend Setup (Ollama)
|
||||
|
||||
### 4. Verify the chat model works
|
||||
`tai` expects an OpenAI-compatible API endpoint, defaulting to `http://localhost:11434/v1`.
|
||||
|
||||
```bash
|
||||
ollama run gemma3:4b "what causes a systemd service to enter failed state?"
|
||||
ollama pull gemma3:4b
|
||||
ollama pull nomic-embed-text
|
||||
```
|
||||
|
||||
### 5. Verify the HTTP API is running
|
||||
|
||||
`tai` communicates with Ollama over its OpenAI-compatible REST API:
|
||||
Quick backend check:
|
||||
|
||||
```bash
|
||||
curl http://localhost:11434/api/generate \
|
||||
-d '{"model":"gemma3:4b","prompt":"hello","stream":false}'
|
||||
```
|
||||
|
||||
A JSON response with a `response` field confirms everything is working.
|
||||
## Usage
|
||||
|
||||
### 6. Point tai at your Ollama instance
|
||||
|
||||
Once `tai` AI integration is complete, use these flags:
|
||||
### Basic Probe and Collect
|
||||
|
||||
```bash
|
||||
tai "nginx failing to start" --host web01 \
|
||||
--ai-host http://localhost:11434 \
|
||||
--model gemma3:4b
|
||||
tai run "nginx failing to start" \
|
||||
--host web01 \
|
||||
--probe \
|
||||
--collect
|
||||
```
|
||||
|
||||
The default values for `--ai-host` and `--model` will be `http://localhost:11434` and `gemma3:4b` respectively, so for local use you won't need to specify them explicitly.
|
||||
### Analyze with RAG and Runbooks
|
||||
|
||||
```bash
|
||||
tai run "why isnt sssd working?" \
|
||||
--host ssh.archflux.net \
|
||||
--port 5566 \
|
||||
--probe --collect --analyze \
|
||||
--runbooks ~/.tai/runbooks \
|
||||
--rag-debug \
|
||||
--ai-timeout-seconds 45 \
|
||||
--ai-max-tokens 300
|
||||
```
|
||||
|
||||
### Interactive Session
|
||||
|
||||
```bash
|
||||
tai run "docker daemon keeps failing" \
|
||||
--host app01 \
|
||||
--collect \
|
||||
--interactive \
|
||||
--runbooks ~/.tai/runbooks
|
||||
```
|
||||
|
||||
## Runbook Workflow
|
||||
|
||||
1. Write Markdown runbooks in `runbooks/` with frontmatter keys: `service`, `symptoms`, `tags`.
|
||||
1. Sync the store.
|
||||
1. Pass `--runbooks <store-path>` to `tai run`.
|
||||
|
||||
Example:
|
||||
|
||||
```bash
|
||||
tai runbooks sync --path ./runbooks --store ~/.tai/runbooks
|
||||
tai runbooks list --store ~/.tai/runbooks
|
||||
```
|
||||
|
||||
## Testing
|
||||
|
||||
```bash
|
||||
pytest
|
||||
```
|
||||
|
||||
Focused suites:
|
||||
|
||||
```bash
|
||||
pytest tests/test_plan.py tests/test_ai.py tests/test_cli.py
|
||||
```
|
||||
|
||||
## Known Limits
|
||||
|
||||
- Service-specific presence checks currently apply to recognized service/subsystem names.
|
||||
- Package-manager-level presence checks are not yet in the default read-only command allowlist.
|
||||
- Tier 3 persistent session memory is not implemented yet.
|
||||
|
||||
## Changelog and Roadmap
|
||||
|
||||
- See `CHANGELOG.md` for release history.
|
||||
- See `ROADMAP.md` for phase status and next milestones.
|
||||
- See `docs/ARCHITECTURE.md` for module-level architecture and data flow.
|
||||
|
||||
Reference in New Issue
Block a user