From e49670a6642e11839731e96e6b196412e0ce858b Mon Sep 17 00:00:00 2001
From: zphinx <cban@gmx.com>
Date: Mon, 4 May 2026 18:23:33 +0200
Subject: [PATCH] docs(roadmap): add Phase 6 RAG & Knowledge Layer plan

- Three-tier RAG architecture: diagnostic chunks, runbook KB, session memory
- Technology decisions table with options and recommendations
- Per-tier: approach, new modules, changes to existing code, companion features
- Implementation order and effort estimates
- New dependencies and optional pyproject.toml group
- Decisions log entries for RAG choices pending confirmation
---
 ROADMAP.md | 169 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 169 insertions(+)

diff --git a/ROADMAP.md b/ROADMAP.md
index c20e242..208ae12 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -117,6 +117,170 @@ Prepare for broader use.
 
 ______________________________________________________________________
 
+## Phase 6 — RAG & Knowledge Layer
+
+Introduce Retrieval-Augmented Generation to ground AI responses in evidence rather than
+model weights alone. Three tiers of increasing capability, each buildable independently.
+
+### Goals
+
+- Eliminate prompt flooding on hosts with large log output
+- Ground recommendations in version-controlled runbooks, not model improvisation
+- Build compounding institutional memory from past troubleshooting sessions
+- Keep all data local — no embeddings or session content leaves the network
+
+---
+
+### Technology Decisions Required
+
+| Decision | Options | Recommendation | Status |
+|---|---|---|---|
+| Embedding model | `nomic-embed-text`, `mxbai-embed-large`, `all-minilm` | `nomic-embed-text` via Ollama (local, 274MB, strong perf) | ⬜ Pending |
+| Vector store — Tier 1 | In-memory numpy cosine, `faiss-cpu` | numpy (zero deps) for session scope | ⬜ Pending |
+| Vector store — Tier 2/3 | `chromadb`, `qdrant`, `weaviate`, `pgvector` | `chromadb` (embedded mode, no server needed) or `qdrant` (self-hosted, REST API, production-grade) | ⬜ Pending |
+| Chunking strategy | Fixed token, sentence-aware, command-boundary | Command-boundary splitting (natural unit for diagnostics) | ⬜ Pending |
+| Hybrid retrieval | Semantic only, BM25 only, hybrid | Hybrid (BM25 keyword + cosine semantic) for best recall | ⬜ Pending |
+| Reranking | None, cross-encoder (`ms-marco-MiniLM`), LLM-as-judge | Cross-encoder rerank pass before prompt injection | ⬜ Pending |
+| Runbook format | Markdown, YAML, JSON | Markdown (human-editable, version-controllable) | ⬜ Pending |
+| Session index storage | Local `~/.tai/`, configurable path | `~/.tai/sessions/` with ChromaDB collection | ⬜ Pending |
+
+---
+
+### Tier 1 — Diagnostic Chunk Retrieval (in-memory, per-session)
+
+**Problem:** Current flow injects all collected output into the prompt as one block.
+On busy hosts this floods the context window with irrelevant output, degrading quality.
+
+**Approach:**
+- After collection, split each command's output into overlapping token chunks (e.g. 512 tokens, 64 overlap)
+- Embed all chunks using `nomic-embed-text` via Ollama embeddings API
+- On each question (initial + follow-up), embed the question and retrieve top-k chunks by cosine similarity
+- Inject only retrieved chunks into the prompt, not the full dump
+
+**New module:** `src/tai/rag_retriever.py`
+- `chunk_report(report) -> list[Chunk]`
+- `embed_chunks(chunks) -> list[EmbeddedChunk]`
+- `retrieve(question, embedded_chunks, top_k) -> list[Chunk]`
+
+**Changes to existing code:**
+- `prompt_builder.py`: accept `retrieved_chunks` instead of full `CollectionReport` for RAG-mode prompts
+- `cli.py`: embed report after collection, pass retriever to `_run_analysis` and `_run_followup_analysis`
+- `ai_client.py`: add `embed(text) -> list[float]` method using Ollama `/api/embeddings`
+
+**Companion features buildable at same time:**
+- `--no-rag` flag to bypass retrieval and use full dump (backwards compat)
+- Token budget display: show user how many tokens are being sent vs. saved
+- Per-chunk source attribution in AI response (which command produced the evidence)
+
+**Tests:**
+- `tests/test_rag_retriever.py`: chunk splitting, cosine similarity ranking, top-k retrieval
+- `tests/test_ai.py`: add `test_embed_returns_float_list()`
+
+---
+
+### Tier 2 — Runbook Knowledge Base (persistent, ChromaDB)
+
+**Problem:** AI improvises remediation steps from training data, which may be wrong for
+specific environments, distros, or internal conventions.
+
+**Approach:**
+- Maintain a version-controlled corpus of Markdown runbooks in `runbooks/` directory
+- On first run (or `tai runbooks --sync`), embed all runbooks and persist to ChromaDB collection
+- On each analysis, retrieve top-3 relevant runbook chunks alongside diagnostic chunks
+- Inject as a separate `## Runbook Context` section in the prompt
+
+**New module:** `src/tai/runbook_store.py`
+- `RunbookStore`: wraps ChromaDB collection
+- `sync(runbooks_dir) -> int` — embed and upsert all runbooks
+- `query(question, top_k) -> list[RunbookChunk]`
+
+**New directory:** `runbooks/`
+- `ssh.md`, `nginx.md`, `postgres.md`, `disk.md`, `kernel.md`, etc.
+- Each runbook: YAML frontmatter (`service`, `symptoms`, `tags`) + Markdown body
+
+**New CLI command:** `tai runbooks --sync [--path ./runbooks]`
+
+**Changes to existing code:**
+- `prompt_builder.py`: add `build_message_with_runbooks(retrieved_chunks, runbook_chunks)`
+- `cli.py`: optionally load `RunbookStore`, query it per analysis turn
+
+**Companion features buildable at same time:**
+- `tai runbooks --list` — show indexed runbooks and last sync time
+- `tai runbooks --add <file>` — index a single runbook
+- `/runbooks` slash command in interactive mode — show which runbooks were retrieved
+- Runbook citation in AI output: "Based on runbook: `ssh.md#AuthenticationFailures`"
+
+---
+
+### Tier 3 — Session Memory Index (institutional learning)
+
+**Problem:** Every session starts from zero. Repeat incidents on the same host or
+same issue type get no benefit from past work.
+
+**Approach:**
+- On session end, embed the session summary (issue + root cause + actions) and upsert into a persistent ChromaDB collection (`~/.tai/sessions/`)
+- On session start, query for similar past sessions by issue text + hostname
+- Inject top-2 past sessions as `## Prior Sessions` context
+- Optionally: `/history` command in interactive mode to surface past sessions explicitly
+
+**New module:** `src/tai/session_store.py`
+- `SessionStore`: wraps ChromaDB collection at `~/.tai/sessions/`
+- `index_session(session_log_path)` — embed and store completed session
+- `query_similar(issue, host, top_k) -> list[PastSession]`
+
+**Changes to existing code:**
+- `session_log.py`: add `summarise() -> str` method (issue + final AI response)
+- `cli.py`: query `SessionStore` at session start, index at session end
+
+**Companion features buildable at same time:**
+- `tai history` CLI subcommand — search past sessions by keyword
+- `tai history --host <hostname>` — all sessions for a host
+- `tai history --export <file>` — export session summaries as Markdown report
+- Auto-suggest: "Similar issue found from 2 weeks ago — load context? [y/N]"
+
+---
+
+### Implementation Order
+
+```
+Tier 1 (diagnostic chunks)     ← Start here. Zero new infra. Immediate prompt quality gain.
+       ↓
+Tier 2 (runbook KB)            ← After Tier 1. Requires ChromaDB dep + runbook authoring.
+       ↓
+Tier 3 (session memory)        ← Builds on Tier 2 infrastructure. Minimal extra work.
+```
+
+**Estimated effort:**
+- Tier 1: 2–3 days (new module + prompt builder changes + tests)
+- Tier 2: 3–4 days (ChromaDB + runbook authoring + CLI command + tests)
+- Tier 3: 1–2 days (reuses Tier 2 infrastructure)
+
+### New Dependencies
+
+```
+# Tier 1 (zero new runtime deps — uses Ollama HTTP API already in use)
+# No additions needed
+
+# Tier 2 + 3
+chromadb>=0.5,<1.0          # embedded vector store, no separate server
+# OR
+qdrant-client>=1.9,<2.0     # if self-hosted Qdrant preferred
+
+sentence-transformers>=3.0  # optional: cross-encoder reranking
+```
+
+### New pyproject.toml optional group
+
+```toml
+[project.optional-dependencies]
+rag = [
+  "chromadb>=0.5,<1.0",
+  "sentence-transformers>=3.0,<4.0",
+]
+```
+
+______________________________________________________________________
+
 ## Decisions Log
 
 | Date | Decision | Outcome |
@@ -128,3 +292,8 @@ ______________________________________________________________________
 | 2026-05-04 | Bastion host support | `--jump-host` flag via SSH native ProxyJump |
 | 2026-05-04 | SSH config behavior | Use `~/.ssh/config` by default; allow override via `--ignore-ssh-config` |
 | 2026-05-04 | CLI vs interactive mode | Interactive: REPL for v0.1, `textual` TUI for v0.2+ |
+| 2026-05-04 | RAG embedding model | `nomic-embed-text` via Ollama (local, air-gapped safe) — ⬜ pending confirmation |
+| 2026-05-04 | RAG vector store (Tier 1) | In-memory numpy cosine similarity — zero deps, session-scoped |
+| 2026-05-04 | RAG vector store (Tier 2/3) | `chromadb` embedded mode (default) or `qdrant` self-hosted — ⬜ pending confirmation |
+| 2026-05-04 | RAG chunking unit | Command-boundary splitting — each collected command = one or more chunks |
+| 2026-05-04 | Runbook format | Markdown with YAML frontmatter, version-controlled in `runbooks/` directory |