tai/ROADMAP.md
zphinx 7749a02706
feat: add history UX and expand retention-focused roadmap
2026-05-11 21:07:39 +02:00

# Roadmap
This document outlines the major decisions, milestones, and development phases required to bring `tai` from concept to a working tool.
______________________________________________________________________
## Phase 0 — Decisions & Prerequisites
These must be resolved before meaningful development can begin.
### Language Selection
- [x] **Decision: Python**
  - Key factors: native vLLM integration, mature SSH libraries (`paramiko` / `asyncssh`), strong text/log parsing, rapid development
  - Single binary distribution will be achieved via **Nuitka** (preferred for true compilation) or **PyInstaller** as a fallback
- [ ] Evaluate Nuitka vs PyInstaller for binary output quality and CI reproducibility
- [ ] Add binary build step to CI pipeline
### AI Backend & Model
- [x] OpenAI-compatible backend client implemented (`AIClient`)
- [x] Default local backend profile wired for Ollama (`http://localhost:11434/v1`)
- [x] Default model profile set to `gemma3:4b` (override via `--model`)
- [ ] Define minimum hardware requirements for running the model locally
- [x] AI backend is user-supplied/self-hosted
### SSH Strategy
- [x] **Decision: keypair authentication only** — no password auth; eliminates credential storage risk
  - Default key resolution: `~/.ssh/id_ed25519`, `~/.ssh/id_rsa` (in order of preference)
  - CLI override via `--identity-file <path>`
  - No SSH agent forwarding needed; a shared key is distributed to all managed hosts via Puppet
- [x] **Known hosts: auto-accept new hosts; reject on key mismatch** — a changed host key triggers a hard stop with a MITM warning; unknown/new hosts are accepted silently on first connect
- [x] **Bastion/jump host: `--jump-host <host>` flag** — delegates to SSH's native ProxyJump functionality
- [x] **SSH config behavior: respect existing `~/.ssh/config` by default; allow CLI override**
  - Default: follow host settings from `~/.ssh/config` (for `User`, `Port`, `ProxyJump`, etc.)
  - Override switch: `--ignore-ssh-config` to bypass local SSH config when required
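The SSH decisions above map fairly directly onto OpenSSH client options. The following is a minimal sketch, assuming `tai` shells out to the system `ssh` binary; the function name `build_ssh_argv` and its parameters are hypothetical and not taken from the codebase. `StrictHostKeyChecking=accept-new` accepts unknown hosts on first connect but refuses a changed host key, and `BatchMode=yes` prevents any interactive password prompt.

```python
from __future__ import annotations


def build_ssh_argv(
    host: str,
    remote_command: str,
    identity_file: str | None = None,
    jump_host: str | None = None,
    ignore_ssh_config: bool = False,
) -> list[str]:
    """Assemble an ssh command line reflecting the decisions above (hypothetical helper)."""
    argv = [
        "ssh",
        "-o", "BatchMode=yes",                     # never prompt for a password
        "-o", "PasswordAuthentication=no",         # keypair authentication only
        "-o", "StrictHostKeyChecking=accept-new",  # accept new hosts, reject changed keys
    ]
    if ignore_ssh_config:
        argv += ["-F", "/dev/null"]                # --ignore-ssh-config: skip ~/.ssh/config
    if identity_file:
        argv += ["-i", identity_file]              # --identity-file <path>
    if jump_host:
        argv += ["-J", jump_host]                  # --jump-host <host> via native ProxyJump
    argv += [host, remote_command]
    return argv
```

For example, `build_ssh_argv("web01", "systemctl status nginx", jump_host="bastion")` yields an argv list that can be passed to `subprocess.run` without going through a local shell.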
### Scope & Constraints
- [ ] Define the supported scope of issues (services, network, disk, kernel, etc.)
- [x] Read-only guarantee implemented with command allowlist + blocked shell operator policy
- [x] **Decision: interactive REPL mode for v0.1, full TUI for v0.2+**
  - v0.1: chat-loop REPL launched from CLI; human can follow up, correct, and redirect the agent
  - v0.2+: `textual`-based TUI with split panes (collected data | AI output | input bar)
  - Built-in slash commands: `/collect`, `/show logs`, `/clear`, `/host <hostname>`, `/help`, `/quit`
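The read-only guarantee in this section (command allowlist plus blocked shell operators) could be checked along the following lines. This is a sketch under stated assumptions: the names `ALLOWED_COMMANDS`, `BLOCKED_OPERATORS`, and `is_command_allowed` are hypothetical, and the allowlist contents are illustrative rather than the project's actual policy.

```python
import shlex

# Hypothetical allowlist: only the executable name is checked here.
ALLOWED_COMMANDS = frozenset(
    {"journalctl", "systemctl", "cat", "df", "du", "dmesg", "ip", "ss"}
)

# Substrings that would allow chaining, redirection, or command substitution.
BLOCKED_OPERATORS = (";", "&", "|", ">", "<", "`", "$(", "\n")


def is_command_allowed(command: str) -> bool:
    """Return True only for a single allowlisted command with no shell operators."""
    if any(op in command for op in BLOCKED_OPERATORS):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse errors
        return False
    return bool(argv) and argv[0] in ALLOWED_COMMANDS
```

Checking only the executable name is not sufficient for tools such as `systemctl`, which also has mutating subcommands; a real policy would likely need per-command subcommand rules as well.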
______________________________________________________________________
## Phase 1 — Project Foundation
Basic project scaffolding and connectivity.
- [x] Finalise repository structure and language toolchain
- [x] Set up CI pipeline (linting, tests)
- [x] Implement SSH connection module
  - [x] Define SSH config model and probe interface scaffold
  - [x] Connect to remote host
  - [x] Execute read-only commands (e.g. `journalctl`, `systemctl status`, `cat`)
  - [x] Stream or collect command output safely (byte-limited output with truncation marker)
- [x] Implement basic input parsing (ticket text, hostname, target directories)
- [x] Write unit tests for SSH and input modules
  - [x] Input parser and CLI tests added
  - [x] SSH module tests added for command policy and SSH argv behavior
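The "byte-limited output with truncation marker" behavior mentioned in this phase can be illustrated with a small helper. This is a sketch only; the function name, the default limit, and the marker text are assumptions rather than the values used in the SSH module.

```python
def truncate_output(
    data: bytes,
    limit: int = 65536,
    marker: bytes = b"\n[output truncated]\n",
) -> bytes:
    """Return at most `limit` bytes of command output, appending a marker if cut."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    if len(data) <= limit:
        return data
    return data[:limit] + marker
```

Truncating on raw bytes can split a multi-byte UTF-8 character, so a caller that decodes the result may want `errors="replace"`.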
______________________________________________________________________
## Phase 2 — Data Collection Layer
Define what information the agent gathers and how.
- [x] Identify a baseline canonical set of data sources per issue type:
  - Service failures: `journalctl`, `systemctl`, service config files
  - Network issues: `ip`, `ss`, `netstat`, firewall rules
  - Disk issues: `df`, `du`, `dmesg`, `smartctl`
  - General: `/var/log/syslog`, `/var/log/messages`, `dmesg`
- [x] Implement collectors and plan builder for baseline issue categories
- [x] Implement directory traversal for user-specified paths (read-only)
- [ ] Add support for per-distro variations (Ubuntu vs RHEL path differences, etc.)
- [x] Write tests with mocked SSH output
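One way to picture the plan builder is as a mapping from issue category to an ordered list of read-only commands. The sketch below assumes hypothetical names (`BASELINE_COMMANDS`, `build_plan`) and an illustrative subset of commands; the actual collectors may use different commands and flags.

```python
from __future__ import annotations

# Illustrative baseline only; categories and commands are assumptions.
BASELINE_COMMANDS: dict[str, list[str]] = {
    "service": ["systemctl --failed --no-pager", "journalctl -p err -n 200 --no-pager"],
    "network": ["ip addr", "ss -tulpn"],
    "disk": ["df -h", "dmesg"],
    "general": ["dmesg"],
}


def build_plan(categories: list[str]) -> list[str]:
    """Merge the command lists for the given categories, dropping duplicates in order."""
    plan: list[str] = []
    seen: set[str] = set()
    for category in categories:
        for command in BASELINE_COMMANDS.get(category, []):
            if command not in seen:
                seen.add(command)
                plan.append(command)
    return plan
```

Deduplicating while preserving order means `dmesg` runs once even when both the disk and general categories apply.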
______________________________________________________________________
## Phase 3 — AI Integration
Wire collected data into the local AI model.
- [x] Implement OpenAI-compatible AI client module
- [x] Design prompt templates for initial and follow-up analysis
- [x] Implement response guardrail checks and structured response headings
- [x] Tune context usage with RAG retrieval and chunk/runbook truncation budgets
- [x] Implement reliable non-streaming completion path for local backends
- [ ] Continue output quality tuning and grounding evaluation on real hosts
______________________________________________________________________
## Phase 4 — CLI & User Experience
Polish the interface for real-world use.
- [x] Design CLI interface with run command, interactive prompts, and runbook subcommands
- [x] Implement structured output sections (Root Cause, Evidence, Recommended Actions)
- [x] Add RAG debug mode (`--rag-debug`) showing retrieval scores
- [ ] Support output to file or clipboard
- [x] Provide comprehensive `--help` command documentation via Typer options
______________________________________________________________________
## Phase 5 — Hardening & Distribution
Prepare for broader use.
- [ ] Security review of SSH handling and credential storage
- [ ] Ensure no data is written to the remote system under any path
- [ ] Package for distribution (binary release, container image, or distro packages)
- [ ] Write installation and quickstart documentation
- [ ] End-to-end integration tests against a test VM
______________________________________________________________________
## Phase 6 — RAG & Knowledge Layer
Introduce Retrieval-Augmented Generation to ground AI responses in evidence rather than
model weights alone. Three tiers of increasing capability, each buildable independently.
### Goals
- Eliminate prompt flooding on hosts with large log output
- Ground recommendations in version-controlled runbooks, not model improvisation
- Build compounding institutional memory from past troubleshooting sessions
- Keep all data local — no embeddings or session content leaves the network
______________________________________________________________________
### Technology Decisions Required
| Decision | Options | Recommendation | Status |
|---|---|---|---|
| Embedding model | `nomic-embed-text`, `mxbai-embed-large`, `all-minilm` | `nomic-embed-text` via Ollama (local, 274 MB, strong performance) | ✅ Implemented |
| Vector store — Tier 1 | In-memory numpy cosine, `faiss-cpu` | numpy (zero deps) for session scope | ✅ Implemented |
| Vector store — Tier 2/3 | `chromadb`, `qdrant`, `weaviate`, `pgvector` | `chromadb` embedded mode | ✅ Tier 2 Implemented |
| Chunking strategy | Fixed token, sentence-aware, command-boundary | Command-boundary splitting (natural unit for diagnostics) | ✅ Implemented |
| Hybrid retrieval | Semantic only, BM25 only, hybrid | Hybrid (BM25 keyword + cosine semantic) for best recall | ⬜ Pending |
| Reranking | None, cross-encoder (`ms-marco-MiniLM`), LLM-as-judge | Cross-encoder rerank pass before prompt injection | ⬜ Pending |
| Runbook format | Markdown, YAML, JSON | Markdown (human-editable, version-controllable) | ✅ Implemented |
| Session index storage | Local `~/.tai/`, configurable path | `~/.tai/sessions/` with ChromaDB collection | ✅ Implemented (core) |
______________________________________________________________________
### Tier 1 — Diagnostic Chunk Retrieval (in-memory, per-session)
Status: ✅ Implemented
**Problem:** Current flow injects all collected output into the prompt as one block.
On busy hosts this floods the context window with irrelevant output, degrading quality.
**Approach:**
- After collection, split each command's output into overlapping token chunks (e.g. 512 tokens, 64 overlap)
- Embed all chunks using `nomic-embed-text` via Ollama embeddings API
- On each question (initial + follow-up), embed the question and retrieve top-k chunks by cosine similarity
- Inject only retrieved chunks into the prompt, not the full dump
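The overlapping-chunk step in the approach above can be sketched as a sliding window. This example assumes the input is already tokenized (for instance by whitespace splitting, which is only a rough stand-in for model tokens); the function name `chunk_tokens` is hypothetical.

```python
from __future__ import annotations


def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the input
    return chunks
```

With the suggested 512/64 setting, each new chunk starts 448 tokens after the previous one, so evidence that straddles a boundary appears in full in at least one chunk as long as it is shorter than the overlap.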
**New module:** `src/tai/rag_retriever.py`
- `chunk_report(report) -> list[Chunk]`
- `embed_chunks(chunks) -> list[EmbeddedChunk]`
- `retrieve(question, embedded_chunks, top_k) -> list[Chunk]`
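A minimal sketch of top-k retrieval by cosine similarity follows. The roadmap selects numpy for this tier; the example below uses only the standard library so that it runs without extra dependencies, and it takes a precomputed query vector instead of a question string. The `(text, vector)` tuple representation is an assumption, not the module's actual `EmbeddedChunk` type.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: list[float],
    embedded: list[tuple[str, list[float]]],
    top_k: int,
) -> list[str]:
    """Return the texts of the top_k chunks most similar to the query vector."""
    if top_k <= 0:
        return []
    ranked = sorted(embedded, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]
```

For the chunk counts of a single session, a full sort is usually cheap enough; a heap-based selection would only matter for much larger indexes.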
**Changes to existing code:**
- `prompt_builder.py`: accept `retrieved_chunks` instead of full `CollectionReport` for RAG-mode prompts
- `cli.py`: embed report after collection, pass retriever to `_run_analysis` and `_run_followup_analysis`
- `ai_client.py`: add `embed(text) -> list[float]` method using Ollama `/api/embeddings`
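The `embed` method could look roughly like the sketch below, assuming the Ollama `/api/embeddings` endpoint, which accepts a JSON body with `model` and `prompt` and returns an `embedding` array. The helper names and the default base URL without the `/v1` suffix are assumptions; the real `AIClient` may structure this differently.

```python
from __future__ import annotations

import json
import urllib.request


def build_embed_request(
    text: str,
    model: str = "nomic-embed-text",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build the POST request for Ollama's embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embed_response(body: bytes) -> list[float]:
    """Extract the embedding vector from the JSON response body."""
    data = json.loads(body)
    return [float(x) for x in data["embedding"]]


def embed(text: str, **kwargs: str) -> list[float]:
    """Send the request to a running Ollama instance and return the vector."""
    with urllib.request.urlopen(build_embed_request(text, **kwargs), timeout=30) as resp:
        return parse_embed_response(resp.read())
```

Splitting request construction from response parsing keeps both halves testable without a running backend.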
**Companion features buildable at same time:**
- `--no-rag` flag to bypass retrieval and use full dump (backwards compat)
- Token budget display: show user how many tokens are being sent vs. saved
- Per-chunk source attribution in AI response (which command produced the evidence)
**Tests:**
- `tests/test_rag_retriever.py`: chunk splitting, cosine similarity ranking, top-k retrieval
- `tests/test_ai.py`: add `test_embed_returns_float_list()`
______________________________________________________________________
### Tier 2 — Runbook Knowledge Base (persistent, ChromaDB)
Status: ✅ Implemented
**Problem:** AI improvises remediation steps from training data, which may be wrong for
specific environments, distros, or internal conventions.
**Approach:**
- Maintain a version-controlled corpus of Markdown runbooks in `runbooks/` directory
- On first run (or `tai runbooks --sync`), embed all runbooks and persist to ChromaDB collection
- On each analysis, retrieve top-3 relevant runbook chunks alongside diagnostic chunks
- Inject as a separate `## Runbook Context` section in the prompt
**New module:** `src/tai/runbook_store.py`
- `RunbookStore`: wraps ChromaDB collection
  - `sync(runbooks_dir) -> int`: embed and upsert all runbooks
  - `query(question, top_k) -> list[RunbookChunk]`
**New directory:** `runbooks/`
- `ssh.md`, `nginx.md`, `postgres.md`, `disk.md`, `kernel.md`, etc.
- Each runbook: YAML frontmatter (`service`, `symptoms`, `tags`) + Markdown body
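To illustrate the runbook layout (frontmatter plus Markdown body), here is a deliberately simplified splitter. It handles only flat `key: value` lines and is not a YAML parser; a real implementation would more likely use a YAML library. The function name `split_frontmatter` is hypothetical.

```python
from __future__ import annotations


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a runbook into (metadata, body); supports flat `key: value` lines only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text  # no closing delimiter: treat everything as body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])
```

The metadata fields (`service`, `symptoms`, `tags`) could then be stored alongside each embedded chunk to support filtering at query time.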
**New CLI command:** `tai runbooks --sync [--path ./runbooks]`
**Changes to existing code:**
- `prompt_builder.py`: add `build_message_with_runbooks(retrieved_chunks, runbook_chunks)`
- `cli.py`: optionally load `RunbookStore`, query it per analysis turn
**Companion features buildable at same time:**
- `tai runbooks --list` — show indexed runbooks and last sync time
- `tai runbooks --add <file>` — index a single runbook
- `/runbooks` slash command in interactive mode — show which runbooks were retrieved
- Runbook citation in AI output: "Based on runbook: `ssh.md#AuthenticationFailures`"
______________________________________________________________________
### Tier 3 — Session Memory Index (institutional learning)
Status: ✅ Implemented (core retrieval/indexing) / ⬜ UX commands pending
**Problem:** Every session starts from zero. Repeat incidents on the same host or
same issue type get no benefit from past work.
**Implemented now:**
- On session end, embed the session summary (issue + root cause + actions) and upsert into a persistent ChromaDB collection (`~/.tai/sessions/`)
- On session start, query for similar past sessions by issue text + hostname
- Inject top-2 past sessions as `## Prior Sessions` context
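The way retrieved context is layered into the prompt can be sketched as follows. The `## Runbook Context` and `## Prior Sessions` headings and the top-3 and top-2 limits come from this roadmap; the `## Diagnostic Evidence` heading and the function name are assumptions made for illustration.

```python
from __future__ import annotations


def build_context_sections(
    diagnostic_chunks: list[str],
    runbook_chunks: list[str],
    prior_sessions: list[str],
) -> str:
    """Join non-empty context groups into Markdown sections for the prompt."""
    sections: list[str] = []
    if diagnostic_chunks:
        # Heading name is hypothetical; the roadmap does not specify it.
        sections.append("## Diagnostic Evidence\n" + "\n\n".join(diagnostic_chunks))
    if runbook_chunks:
        sections.append("## Runbook Context\n" + "\n\n".join(runbook_chunks[:3]))
    if prior_sessions:
        sections.append("## Prior Sessions\n" + "\n\n".join(prior_sessions[:2]))
    return "\n\n".join(sections)
```

Omitting empty sections avoids prompting the model with headings that have no evidence beneath them.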
**Pending UX layer:**
- `/history` command in interactive mode to surface past sessions explicitly
**New module:** `src/tai/session_store.py`
- `SessionStore`: wraps ChromaDB collection at `~/.tai/sessions/`
  - `index_session(host, issue, summary, ai)`: embed and store completed session
  - `query(question, host, ai, top_k) -> list[PastSession]`
**Changes to existing code:**
- `cli.py`: query `SessionStore` during analysis turns and index final responses at session end
**Companion features buildable at same time:**
- `tai history` CLI subcommand — search past sessions by keyword
- `tai history --host <hostname>` — all sessions for a host
- `tai history --export <file>` — export session summaries as Markdown report
- Auto-suggest: "Similar issue found from 2 weeks ago — load context? [y/N]"
______________________________________________________________________
### Implementation Order
```
Tier 1 (diagnostic chunks) ← Start here. Zero new infra. Immediate prompt quality gain.
Tier 2 (runbook KB) ← After Tier 1. Requires ChromaDB dep + runbook authoring.
Tier 3 (session memory) ← Builds on Tier 2 infrastructure. Minimal extra work.
```
**Estimated effort:**
- Tier 1: 2-3 days (new module + prompt builder changes + tests)
- Tier 2: 3-4 days (ChromaDB + runbook authoring + CLI command + tests)
- Tier 3: 1-2 days (reuses Tier 2 infrastructure)
### New Dependencies
```
# Tier 1 (zero new runtime deps — uses Ollama HTTP API already in use)
# No additions needed
# Tier 2 + 3
chromadb>=0.5,<1.0 # embedded vector store, no separate server
# OR
qdrant-client>=1.9,<2.0 # if self-hosted Qdrant preferred
sentence-transformers>=3.0,<4.0 # optional: cross-encoder reranking
```
### New pyproject.toml optional group
```toml
[project.optional-dependencies]
rag = [
"chromadb>=0.5,<1.0",
"sentence-transformers>=3.0,<4.0",
]
```
______________________________________________________________________
## Decisions Log
| Date | Decision | Outcome |
|------|----------|---------|
| 2026-05-04 | Implementation language | Python — with single distributable binary via Nuitka |
| 2026-05-04 | AI backend API | OpenAI-compatible API endpoint (local Ollama by default) |
| 2026-05-04 | Default model | `gemma3:4b` |
| 2026-05-04 | SSH auth methods | Keypair only (ed25519/RSA); auto-accept new hosts; reject on key change (MITM) |
| 2026-05-04 | Bastion host support | `--jump-host` flag via SSH native ProxyJump |
| 2026-05-04 | SSH config behavior | Use `~/.ssh/config` by default; allow override via `--ignore-ssh-config` |
| 2026-05-04 | CLI vs interactive mode | Interactive: REPL for v0.1, `textual` TUI for v0.2+ |
| 2026-05-04 | RAG embedding model | `nomic-embed-text` via Ollama (local, air-gapped safe) |
| 2026-05-04 | RAG vector store (Tier 1) | In-memory numpy cosine similarity — zero deps, session-scoped |
| 2026-05-04 | RAG vector store (Tier 2/3) | `chromadb` embedded mode (default) or `qdrant` self-hosted |
| 2026-05-04 | RAG chunking unit | Command-boundary splitting — each collected command = one or more chunks |
| 2026-05-04 | Runbook format | Markdown with YAML frontmatter, version-controlled in `runbooks/` directory |
______________________________________________________________________
## End-State UX Goal
After the current CLI and memory roadmap phases are stable, the long-term UX goal is a full-screen terminal TUI with an ncurses-style workflow.
### Target End-State
- Split-pane troubleshooting workspace (diagnostics, AI output, and command/input area)
- Live command/probe status with clear success/failure indicators
- In-session history browser for prior questions, retrieved evidence, and related past sessions
- Keyboard-first navigation for operators running in SSH-only environments
### Delivery Approach
- Keep shipping incremental CLI features first (current roadmap order remains unchanged)
- Promote stable workflows into TUI panels once behavior is proven in CLI mode
- Treat the TUI as a final UX consolidation milestone, not a blocker for core troubleshooting capabilities
______________________________________________________________________
## Container Distribution Goal (Docker)
After core CLI/TUI workflows stabilize, provide an official Docker image as an additional distribution target.
### Container Execution Model (Decision)
- Docker is a one-shot invocation target, not a daemon/service mode
- Each run executes a single `tai` command and exits
- State is persisted only through mounted host volumes
### Why Docker Is Valuable Here
- Reproducible runtime: pin Python and dependency versions to remove host-level drift
- Faster operator onboarding: run with one command instead of local Python setup
- Cleaner CI/CD release path: publish versioned images aligned with git tags
- Safer local footprint: isolate dependencies from the host OS package manager
### Subgoals
1. Base image and runtime hardening
   - Multi-stage Dockerfile with slim runtime image
   - Non-root runtime user and minimal filesystem permissions
   - Healthcheck for CLI startup and version command
2. Runtime integration for SSH workflows
   - Documented mounts for `~/.ssh` (read-only where possible) and known-hosts handling
   - Pass-through for SSH config when needed (`--ignore-ssh-config` behavior documented)
   - Clear guidance for jump-host and bastion scenarios from inside the container
   - Documented one-shot run examples for `tai run` and `tai history`
3. Persistent data strategy
   - Required volume mount guidance for runbook store (`~/.tai/runbooks`)
   - Required volume mount guidance for session memory/history (`~/.tai/sessions`)
   - Optional bind mount for JSONL logs and report export artifacts
   - Clear defaults for container paths and equivalent host path mappings
4. Release and quality gates
   - Build and publish image on tagged releases
   - Smoke tests in CI: probe mode, collect mode, and history command against mocked endpoints
   - Version labeling (image tags and OCI metadata) tied to changelog/release tags
### Data Retention and Lifecycle Policy
Retention behavior must be explicit and configurable at runtime. Defaults should be conservative and documented.
1. Retention classes
   - Session memory store (`~/.tai/sessions`): keep semantically indexed summaries for troubleshooting continuity
   - Runbook store (`~/.tai/runbooks`): retain until explicitly replaced or pruned by sync policy
   - JSONL logs and exported reports: operator-controlled retention with optional TTL cleanup
2. Retention controls
   - Add CLI controls for age-based pruning (for example `--retain-days` on cleanup commands)
   - Add host-scoped cleanup (delete history for one host) and full-store cleanup (all hosts)
   - Add dry-run cleanup mode to show what would be deleted before applying changes
3. No-persist mode
   - Add a documented ephemeral mode where no session memory or logs are written
   - Ensure one-shot diagnostics can run in read-only operational contexts
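The retention controls above (age-based pruning, host scoping, dry-run) could be combined as in this sketch. The `SessionRecord` type and the `select_expired` and `prune` functions are hypothetical and operate on an in-memory list; a real implementation would act on the persistent session store.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SessionRecord:
    session_id: str
    host: str
    ended_at: datetime


def select_expired(
    records: list[SessionRecord],
    retain_days: int,
    now: datetime,
    host: str | None = None,
) -> list[SessionRecord]:
    """Return records older than retain_days, optionally limited to one host."""
    cutoff = now - timedelta(days=retain_days)
    return [
        r for r in records
        if r.ended_at < cutoff and (host is None or r.host == host)
    ]


def prune(
    records: list[SessionRecord],
    retain_days: int,
    now: datetime,
    host: str | None = None,
    dry_run: bool = True,
) -> tuple[list[SessionRecord], list[SessionRecord]]:
    """Return (kept, expired); with dry_run=True nothing is removed from kept."""
    expired = select_expired(records, retain_days, now, host)
    if dry_run:
        return list(records), expired
    expired_ids = {r.session_id for r in expired}
    return [r for r in records if r.session_id not in expired_ids], expired
```

Defaulting `dry_run` to `True` matches the conservative-defaults requirement: an operator has to opt in before anything is deleted.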
### Configuration and State Persistence Model
Configuration and retained state should be predictable across container upgrades and host environments.
1. Mount and path contract
   - Define canonical container paths for `~/.tai/runbooks`, `~/.tai/sessions`, and optional log/export paths
   - Document required versus optional mounts and expected permissions for each
   - Document UID/GID mapping guidance to prevent host volume ownership issues
2. Schema and compatibility
   - Introduce explicit storage schema version metadata for persistent stores
   - Define upgrade behavior for older stores (migrate, re-index, or fail with clear guidance)
   - Add compatibility notes for image upgrades and rollback expectations
3. Backup and recovery
   - Provide export/import workflows for session memory and runbook indexes
   - Document minimal backup set and restore order for disaster recovery
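The schema-version item above leaves the upgrade policy open. One possible policy, shown purely as an illustration, is sketched below: migrate older stores, re-index stores with no version metadata, and fail on stores written by a newer release (the rollback case). The constant, enum, and function names are all hypothetical.

```python
from __future__ import annotations

from enum import Enum

CURRENT_SCHEMA_VERSION = 2  # hypothetical value for illustration


class UpgradeAction(Enum):
    NONE = "none"
    MIGRATE = "migrate"
    REINDEX = "re-index"
    FAIL = "fail"


def decide_upgrade(
    stored_version: int | None,
    current: int = CURRENT_SCHEMA_VERSION,
) -> UpgradeAction:
    """Pick an upgrade action for a persistent store based on its schema version."""
    if stored_version is None:
        return UpgradeAction.REINDEX   # no metadata: rebuild the index from source data
    if stored_version == current:
        return UpgradeAction.NONE
    if stored_version < current:
        return UpgradeAction.MIGRATE   # older store: run forward migrations
    return UpgradeAction.FAIL          # store is newer than this binary (rollback)
```

Failing on a newer store, instead of guessing at its layout, keeps an image rollback from silently corrupting retained data.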
### Security and Privacy for Retained Data
Persisted troubleshooting evidence can include sensitive operational data and must be handled accordingly.
1. Data minimization
   - Add optional redaction hooks for common sensitive patterns before persistence
   - Keep prompt-only transient data separate from persisted summary/index content
2. Runtime hardening
   - Target non-root container execution with read-only root filesystem by default
   - Require explicit writable mounts only for retained data locations
3. Auditable behavior
   - Log retention-affecting operations (cleanup, purge, export/import) with timestamps and scope
   - Define stable exit codes for cleanup and retention workflows to support automation
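A redaction hook of the kind listed under data minimization might be a small list of regular-expression substitutions applied before a summary is persisted. The two patterns below are illustrative assumptions, not a vetted set; real deployments would need patterns tuned to their own logs.

```python
import re

# Illustrative patterns only: key=value style secrets and HTTP bearer tokens.
REDACTION_PATTERNS = [
    (re.compile(r"(?i)\b(password|passwd|secret|token)\s*=\s*\S+"), r"\1=[REDACTED]"),
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer [REDACTED]"),
]


def redact(text: str) -> str:
    """Apply each redaction pattern in order and return the scrubbed text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Pattern-based redaction is best-effort and will miss secrets that do not match a known shape, which is one reason the no-persist mode remains useful for sensitive hosts.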
### Kubernetes Position
Kubernetes is out of scope for this delivery plan.
- `tai` is currently an operator-invoked troubleshooting client, not a long-running service
- AI inference is external to `tai` (OpenAI-compatible endpoint), reducing the need for in-cluster model orchestration
- SSH key/config handling and per-operator context are simpler with local or single-container execution
Kubernetes can be revisited only if `tai` evolves into a centralized multi-user service with queueing, RBAC, and shared tenancy requirements.
______________________________________________________________________
## Final Long-Term Goal: Full Rust Migration
This is a final-stage roadmap goal and remains explicitly out of near-term scope.
It should begin only after the Python implementation, TUI direction, Docker one-shot model,
and retention/persistence policies are stable and proven in production usage.
### Why This Is the Final Goal
- Improve execution latency and startup speed for both native runs and container one-shot invocations
- Produce a single, portable native binary with minimal runtime dependency footprint
- Strengthen reliability and memory safety under heavy log parsing and concurrent workflows
- Simplify long-term packaging and distribution across Linux targets
### Migration Objectives
1. Preserve feature parity first
   - Match existing CLI behavior, interactive workflows, RAG integration, runbook management, and history/session-memory features
   - Keep command semantics and safety boundaries equivalent during transition
2. Target both distribution modes
   - Native Rust binary for direct operator use
   - Docker image built around the Rust binary for one-shot execution with mounted persistent volumes
3. Keep compatibility guardrails
   - Define persistent data format compatibility or migration tooling for runbook/session stores
   - Preserve operator-visible flags where practical to reduce migration friction
### Suggested Delivery Phases
1. Build baseline Rust CLI scaffold with feature-flagged parity checkpoints
2. Port SSH execution and read-only policy enforcement modules
3. Port planner, collectors, prompt composition, and AI client adapters
4. Port session memory/history and runbook workflows with migration tests
5. Port interactive UX/TUI layer and deprecate Python runtime path
### Rust Toolchain End-State
- Standardize on Cargo-based build/test/lint pipeline (`cargo fmt`, `cargo clippy`, `cargo test`)
- Add release profile optimization and reproducible build settings
- Publish signed native artifacts and Docker images derived from Rust release binaries
### Decision Gate Before Starting
Begin Rust migration only when:
- Python roadmap milestones are complete and stable
- Container distribution and retention policy workflows are operationally validated
- A parity test matrix exists to prove behavior equivalence during migration