zphinx/tai

zphinx 2d8a5a66ca feat(cli): add clean analysis export with markdown/json output

2026-05-11 21:54:21 +02:00

Roadmap

This document outlines the major decisions, milestones, and development phases required to bring tai from concept to a working tool.

Phase 0 — Decisions & Prerequisites

These must be resolved before meaningful development can begin.

Language Selection

Decision: Python
Key factors: native vLLM integration, mature SSH libraries (paramiko / asyncssh), strong text/log parsing, rapid development
Single binary distribution will be achieved via Nuitka (preferred for true compilation) or PyInstaller as a fallback
Evaluate Nuitka vs PyInstaller for binary output quality and CI reproducibility
Add binary build step to CI pipeline

AI Backend & Model

OpenAI-compatible backend client implemented (AIClient)
Default local backend profile wired for Ollama (http://localhost:11434/v1)
Default model profile set to gemma3:4b (override via --model)
Define minimum hardware requirements for running the model locally
AI backend is user-supplied/self-hosted

SSH Strategy

Decision: keypair authentication only — no password auth; eliminates credential storage risk
- Default key resolution: ~/.ssh/id_ed25519, ~/.ssh/id_rsa (in order of preference)
- CLI override via --identity-file <path>
- No SSH agent forwarding needed — a shared key is distributed to all managed hosts via Puppet
Known hosts: unknown hosts are accepted silently on first connect; a changed host key is treated as a possible MITM attack and triggers a hard stop with a warning
Bastion/jump host: --jump-host <host> flag — delegates to SSH's native ProxyJump functionality
SSH config behavior: respect existing ~/.ssh/config by default; allow CLI override
- Default: follow host settings from ~/.ssh/config (for User, Port, ProxyJump, etc.)
- Override switch: --ignore-ssh-config to bypass local SSH config when required
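
The SSH decisions above map fairly directly onto OpenSSH client options. The following is a minimal sketch of how the argv could be assembled, not the project's actual implementation: `SSHOptions` and `build_ssh_argv` are hypothetical names, and it assumes the tool shells out to the system `ssh` binary (OpenSSH 7.6 or later for `accept-new`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SSHOptions:
    """Hypothetical container for the CLI flags described in this section."""
    host: str
    identity_file: Optional[str] = None  # --identity-file <path>
    jump_host: Optional[str] = None      # --jump-host <host>
    ignore_ssh_config: bool = False      # --ignore-ssh-config


def build_ssh_argv(opts: SSHOptions, remote_command: str) -> list[str]:
    """Build an ssh argv list that reflects the policy choices above."""
    argv = ["ssh"]
    if opts.ignore_ssh_config:
        # Pointing -F at /dev/null makes ssh skip the user and system config.
        argv += ["-F", "/dev/null"]
    argv += [
        "-o", "BatchMode=yes",                     # never prompt interactively
        "-o", "PasswordAuthentication=no",         # keypair auth only
        "-o", "StrictHostKeyChecking=accept-new",  # accept new, reject changed
    ]
    if opts.identity_file:
        argv += ["-i", opts.identity_file]
    if opts.jump_host:
        argv += ["-J", opts.jump_host]             # native ProxyJump
    argv += [opts.host, remote_command]
    return argv
```

Passing the result to `subprocess.run` as a list (rather than a shell string) keeps the local side free of shell interpolation.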

Scope & Constraints

Define the supported scope of issues (services, network, disk, kernel, etc.)
Read-only guarantee implemented with command allowlist + blocked shell operator policy
Decision: interactive REPL mode for v0.1, full TUI for v0.2+
- v0.1: chat-loop REPL launched from CLI; human can follow up, correct, and redirect the agent
- v0.2+: textual-based TUI with split panes (collected data | AI output | input bar)
- Built-in slash commands: /collect, /show logs, /clear, /host <hostname>, /help, /quit
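
The read-only guarantee combines a command allowlist with a blocked shell operator policy. A rough sketch of such a check is shown below; the specific allowlist entries, the `systemctl` verb list, and the function name `is_read_only` are illustrative assumptions, and a real policy would need to be considerably stricter (for example about per-command flags).

```python
import shlex

# Illustrative values only; the real allowlist is not specified in this roadmap.
ALLOWED_COMMANDS = {"journalctl", "systemctl", "cat", "df", "du", "ip", "ss"}
ALLOWED_SYSTEMCTL_VERBS = {"status", "show", "is-active", "is-enabled", "list-units"}
BLOCKED_TOKENS = (";", "&", "|", ">", "<", "`", "$(", "\n")


def is_read_only(command: str) -> bool:
    """Return True only if the command passes both policy layers."""
    # Layer 1: refuse anything containing shell chaining or redirection.
    if any(token in command for token in BLOCKED_TOKENS):
        return False
    try:
        parts = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes and similar parse errors
    # Layer 2: the executable must be on the allowlist.
    if not parts or parts[0] not in ALLOWED_COMMANDS:
        return False
    # systemctl can mutate state, so restrict it to inspection verbs.
    if parts[0] == "systemctl":
        return len(parts) > 1 and parts[1] in ALLOWED_SYSTEMCTL_VERBS
    return True
```

Checking the raw string for operators before tokenizing is deliberate: it rejects `cat /etc/hosts; rm -rf /` even though the first token is allowed.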

Phase 1 — Project Foundation

Basic project scaffolding and connectivity.

Finalise repository structure and language toolchain
Set up CI pipeline (linting, tests)
Implement SSH connection module
- Define SSH config model and probe interface scaffold
- Connect to remote host
- Execute read-only commands (e.g. journalctl, systemctl status, cat)
- Stream or collect command output safely (byte-limited output with truncation marker)
Implement basic input parsing (ticket text, hostname, target directories)
Write unit tests for SSH and input modules
- Input parser and CLI tests added
- SSH module tests added for command policy and SSH argv behavior
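
Byte-limited collection with a truncation marker can be sketched as follows. The marker text and the function name `limit_output` are assumptions for illustration; the roadmap only states that output is byte-limited and marked when cut.

```python
from typing import Iterable

TRUNCATION_MARKER = b"\n[output truncated]\n"  # hypothetical marker text


def limit_output(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Collect byte chunks up to max_bytes, then append a marker and stop."""
    collected = bytearray()
    for chunk in chunks:
        remaining = max_bytes - len(collected)
        if len(chunk) > remaining:
            # Keep what still fits, flag the cut, and stop reading.
            collected += chunk[:remaining]
            collected += TRUNCATION_MARKER
            return bytes(collected)
        collected += chunk
    return bytes(collected)
```

Because the function returns as soon as the limit is hit, a caller streaming from a remote process can stop consuming output early instead of buffering an unbounded log.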

Phase 2 — Data Collection Layer

Define what information the agent gathers and how.

Identify a baseline canonical set of data sources per issue type:
- Service failures: journalctl, systemctl, service config files
- Network issues: ip, ss, netstat, firewall rules
- Disk issues: df, du, dmesg, smartctl
- General: /var/log/syslog, /var/log/messages, dmesg
Implement collectors and plan builder for baseline issue categories
Implement directory traversal for user-specified paths (read-only)
Add support for per-distro variations (Ubuntu vs RHEL path differences, etc.)
Write tests with mocked SSH output
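
As an illustration of the plan builder idea, the sketch below maps issue categories to a de-duplicated command list and picks the general syslog path per distro family. The category keys, the exact command flags, and the `build_plan` name are assumptions; only the tools and log paths themselves come from the list above.

```python
# Hypothetical baseline mapping; flags are examples, not the project's choices.
BASELINE_COMMANDS = {
    "service": ["systemctl --failed --no-pager", "journalctl -p err -n 200 --no-pager"],
    "network": ["ip addr show", "ip route show", "ss -tulpn"],
    "disk": ["df -h", "dmesg"],
}

# Ubuntu/Debian and RHEL-family hosts keep the general system log in different files.
SYSLOG_PATHS = {"debian": "/var/log/syslog", "rhel": "/var/log/messages"}


def build_plan(categories: list[str], distro_family: str) -> list[str]:
    """Return an ordered, de-duplicated list of read-only commands to run."""
    plan: list[str] = []
    for category in categories:
        for command in BASELINE_COMMANDS.get(category, []):
            if command not in plan:
                plan.append(command)
    # General sources are always appended, with the distro-specific log path.
    log_path = SYSLOG_PATHS.get(distro_family)
    if log_path:
        plan.append(f"tail -n 200 {log_path}")
    if "dmesg" not in plan:
        plan.append("dmesg")
    return plan
```

For example, `build_plan(["disk"], "rhel")` yields the disk commands followed by a tail of `/var/log/messages`.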

Phase 3 — AI Integration

Wire collected data into the local AI model.

Implement OpenAI-compatible AI client module
Design prompt templates for initial and follow-up analysis
Implement response guardrail checks and structured response headings
Tune context usage with RAG retrieval and chunk/runbook truncation budgets
Implement reliable non-streaming completion path for local backends
Continue output quality tuning and grounding evaluation on real hosts

Phase 4 — CLI & User Experience

Polish the interface for real-world use.

Design CLI interface with run command, interactive prompts, and runbook subcommands
Implement structured output sections (Root Cause, Evidence, Recommended Actions)
Add RAG debug mode (--rag-debug) showing retrieval scores
Support output to file (--output-file)
Provide comprehensive --help command documentation via Typer options
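
A guardrail for the structured output sections could be as simple as checking that each expected heading is present. This sketch assumes the model is asked to emit Markdown headings; the function name `missing_sections` is hypothetical.

```python
REQUIRED_SECTIONS = ("Root Cause", "Evidence", "Recommended Actions")


def missing_sections(response: str) -> list[str]:
    """Return the required section names that have no matching heading."""
    headings = set()
    for line in response.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Normalise "## Root Cause" to "root cause" for comparison.
            headings.add(stripped.lstrip("#").strip().lower())
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

An empty result means the response is structurally complete; a non-empty one could trigger a retry or a warning to the operator.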

Phase 5 — Hardening & Distribution

Prepare for broader use.

Security review of SSH handling and credential storage
Ensure no data is written to the remote system under any path
Package for distribution (binary release, container image, or distro packages)
Write installation and quickstart documentation
End-to-end integration tests against a test VM

Phase 6 — RAG & Knowledge Layer

Introduce Retrieval-Augmented Generation to ground AI responses in evidence rather than model weights alone. The work is split into three tiers of increasing capability that can be delivered one at a time.

Goals

Eliminate prompt flooding on hosts with large log output
Ground recommendations in version-controlled runbooks, not model improvisation
Build compounding institutional memory from past troubleshooting sessions
Keep all data local — no embeddings or session content leaves the network

Technology Decisions Required

| Decision | Options | Recommendation | Status |
| --- | --- | --- | --- |
| Embedding model | `nomic-embed-text`, `mxbai-embed-large`, `all-minilm` | `nomic-embed-text` via Ollama (local, 274 MB, strong perf) | ✅ Implemented |
| Vector store (Tier 1) | In-memory numpy cosine, `faiss-cpu` | numpy (zero deps) for session scope | ✅ Implemented |
| Vector store (Tier 2/3) | `chromadb`, `qdrant`, `weaviate`, `pgvector` | `chromadb` embedded mode | ✅ Tier 2 Implemented |
| Chunking strategy | Fixed token, sentence-aware, command-boundary | Command-boundary splitting (natural unit for diagnostics) | ✅ Implemented |
| Hybrid retrieval | Semantic only, BM25 only, hybrid | Hybrid (BM25 keyword + cosine semantic) for best recall | ⬜ Pending |
| Reranking | None, cross-encoder (`ms-marco-MiniLM`), LLM-as-judge | Cross-encoder rerank pass before prompt injection | ⬜ Pending |
| Runbook format | Markdown, YAML, JSON | Markdown (human-editable, version-controllable) | ✅ Implemented |
| Session index storage | Local `~/.tai/`, configurable path | `~/.tai/sessions/` with ChromaDB collection | ✅ Implemented (core) |

Tier 1 — Diagnostic Chunk Retrieval (in-memory, per-session)

Status: ✅ Implemented

Problem: Without retrieval, all collected output is injected into the prompt as one block. On busy hosts this floods the context window with irrelevant output and degrades answer quality.

Approach:

After collection, split each command's output into overlapping token chunks (e.g. 512 tokens, 64 overlap)
Embed all chunks using nomic-embed-text via Ollama embeddings API
On each question (initial + follow-up), embed the question and retrieve top-k chunks by cosine similarity
Inject only retrieved chunks into the prompt, not the full dump
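
The overlapping split in the first step can be sketched as below. Whitespace-separated words stand in for tokens here, which is an approximation: real token counts depend on the model's tokenizer. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of `size` tokens, each sharing `overlap` tokens
    with the previous window. Tokens are approximated by whitespace splitting."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

With `size=4` and `overlap=1`, ten tokens produce three windows, and each window starts with the last token of the previous one so evidence that straddles a boundary is not lost.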

New module: src/tai/rag_retriever.py

chunk_report(report) -> list[Chunk]
embed_chunks(chunks) -> list[EmbeddedChunk]
retrieve(question, embedded_chunks, top_k) -> list[Chunk]
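
To make the ranking step concrete, here is a dependency-free sketch of top-k retrieval by cosine similarity. It is a pure-Python stand-in for the numpy version named in the decisions table, and it takes an already embedded question vector, so it is named `retrieve_by_vector` (hypothetical) to avoid confusion with the `retrieve` signature above. Chunks are represented as `(text, vector)` pairs purely for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_by_vector(
    question_vec: list[float],
    embedded_chunks: list[tuple[str, list[float]]],
    top_k: int,
) -> list[str]:
    """Return the texts of the top_k chunks most similar to the question."""
    ranked = sorted(
        embedded_chunks,
        key=lambda item: cosine(question_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]
```

For a session-scoped index of a few hundred chunks, a linear scan like this is typically fast enough that no vector database is required.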

Changes to existing code:

prompt_builder.py: accept retrieved_chunks instead of full CollectionReport for RAG-mode prompts
cli.py: embed report after collection, pass retriever to _run_analysis and _run_followup_analysis
ai_client.py: add embed(text) -> list[float] method using Ollama /api/embeddings
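
For reference, a minimal standard-library sketch of an `embed` call against Ollama's `/api/embeddings` endpoint is shown below. It assumes the request body uses `model` and `prompt` fields and the response carries an `embedding` array, which matches that endpoint as documented at the time of writing; the helper names and the split into build/parse functions are illustrative, not the project's actual `AIClient` code.

```python
import json
import urllib.request


def build_embedding_request(
    text: str,
    model: str = "nomic-embed-text",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Prepare the POST request for a single embedding."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding_response(body: bytes) -> list[float]:
    """Extract the vector, failing loudly if the shape is unexpected."""
    embedding = json.loads(body).get("embedding")
    if not isinstance(embedding, list) or not embedding:
        raise ValueError("response did not contain a non-empty 'embedding' list")
    return [float(value) for value in embedding]


def embed(text: str, **kwargs) -> list[float]:
    """Send the request to a running Ollama instance and return the vector."""
    request = build_embedding_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_embedding_response(response.read())
```

Keeping request construction and response parsing separate from the network call makes both halves testable with mocked data, in line with `test_embed_returns_float_list()`.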

Companion features buildable at same time:

--no-rag flag to bypass retrieval and use full dump (backwards compat)
Token budget display: show user how many tokens are being sent vs. saved
Per-chunk source attribution in AI response (which command produced the evidence)

Tests:

tests/test_rag_retriever.py: chunk splitting, cosine similarity ranking, top-k retrieval
tests/test_ai.py: add test_embed_returns_float_list()

Tier 2 — Runbook Knowledge Base (persistent, ChromaDB)

Status: ✅ Implemented

Problem: AI improvises remediation steps from training data, which may be wrong for specific environments, distros, or internal conventions.

Approach:

Maintain a version-controlled corpus of Markdown runbooks in runbooks/ directory
On first run (or tai runbooks --sync), embed all runbooks and persist to ChromaDB collection
On each analysis, retrieve top-3 relevant runbook chunks alongside diagnostic chunks
Inject as a separate ## Runbook Context section in the prompt

New module: src/tai/runbook_store.py

RunbookStore: wraps ChromaDB collection
sync(runbooks_dir) -> int — embed and upsert all runbooks
query(question, top_k) -> list[RunbookChunk]

New directory: runbooks/

ssh.md, nginx.md, postgres.md, disk.md, kernel.md, etc.
Each runbook: YAML frontmatter (service, symptoms, tags) + Markdown body
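
The runbook layout (YAML frontmatter plus Markdown body) could be split as sketched below. This deliberately handles only a flat subset of YAML (`key: value` and inline `[a, b]` lists) so it runs without extra dependencies; a real implementation would more likely use a YAML library. The function name `split_frontmatter` is hypothetical.

```python
def split_frontmatter(text: str) -> tuple[dict, str]:
    """Return (metadata, body) for a runbook with '---' delimited frontmatter.
    Only flat 'key: value' pairs and inline lists are understood."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = lines.index("---", 1)  # closing delimiter must be exactly '---'
    except ValueError:
        return {}, text
    meta: dict = {}
    for line in lines[1:end]:
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            # Inline list such as: tags: [ssh, pam]
            meta[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```

The metadata fields can then be stored alongside each embedded chunk so retrieval results carry their `service` and `tags`.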

New CLI command: tai runbooks --sync [--path ./runbooks]

Changes to existing code:

prompt_builder.py: add build_message_with_runbooks(retrieved_chunks, runbook_chunks)
cli.py: optionally load RunbookStore, query it per analysis turn

Companion features buildable at same time:

tai runbooks --list — show indexed runbooks and last sync time
tai runbooks --add <file> — index a single runbook
/runbooks slash command in interactive mode — show which runbooks were retrieved
Runbook citation in AI output: "Based on runbook: ssh.md#AuthenticationFailures"

Tier 3 — Session Memory Index (institutional learning)

Status: ✅ Implemented (core retrieval/indexing) / ⬜ UX commands pending

Problem: Every session starts from zero. Repeat incidents on the same host or same issue type get no benefit from past work.

Implemented now:

On session end, embed the session summary (issue + root cause + actions) and upsert into a persistent ChromaDB collection (~/.tai/sessions/)
On session start, query for similar past sessions by issue text + hostname
Inject top-2 past sessions as ## Prior Sessions context
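
How the retrieved sessions might be rendered into the prompt is sketched below. The `PastSession` fields and the line layout are assumptions made for illustration; only the `## Prior Sessions` heading and the top-2 limit come from the description above.

```python
from dataclasses import dataclass


@dataclass
class PastSession:
    """Hypothetical shape of a stored session summary."""
    host: str
    issue: str
    root_cause: str
    actions: str


def format_prior_sessions(sessions: list[PastSession], limit: int = 2) -> str:
    """Render up to `limit` sessions as a prompt section; empty string if none."""
    if not sessions:
        return ""
    lines = ["## Prior Sessions"]
    for session in sessions[:limit]:
        lines.append(f"- Host: {session.host} | Issue: {session.issue}")
        lines.append(f"  Root cause: {session.root_cause}")
        lines.append(f"  Actions: {session.actions}")
    return "\n".join(lines)
```

Returning an empty string when nothing was retrieved lets the prompt builder omit the section entirely instead of sending an empty heading.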

Pending UX layer:

/history command in interactive mode to surface past sessions explicitly

New module: src/tai/session_store.py

SessionStore: wraps ChromaDB collection at ~/.tai/sessions/
index_session(host, issue, summary, ai) — embed and store completed session
query(question, host, ai, top_k) -> list[PastSession]

Changes to existing code:

cli.py: query SessionStore during analysis turns and index final responses at session end

Companion features buildable at same time:

tai history CLI subcommand — search past sessions by keyword
tai history --host <hostname> — all sessions for a host
tai history --export <file> — export session summaries as Markdown report
Auto-suggest: "Similar issue found from 2 weeks ago — load context? [y/N]"

Implementation Order

Tier 1 (diagnostic chunks)     ← Start here. Zero new infra. Immediate prompt quality gain.
       ↓
Tier 2 (runbook KB)            ← After Tier 1. Requires ChromaDB dep + runbook authoring.
       ↓
Tier 3 (session memory)        ← Builds on Tier 2 infrastructure. Minimal extra work.

Estimated effort:

Tier 1: 2–3 days (new module + prompt builder changes + tests)
Tier 2: 3–4 days (ChromaDB + runbook authoring + CLI command + tests)
Tier 3: 1–2 days (reuses Tier 2 infrastructure)

New Dependencies

# Tier 1 (zero new runtime deps — uses Ollama HTTP API already in use)
# No additions needed

# Tier 2 + 3
chromadb>=0.5,<1.0          # embedded vector store, no separate server
# OR
qdrant-client>=1.9,<2.0     # if self-hosted Qdrant preferred

sentence-transformers>=3.0,<4.0  # optional: cross-encoder reranking

New pyproject.toml optional group

[project.optional-dependencies]
rag = [
  "chromadb>=0.5,<1.0",
  "sentence-transformers>=3.0,<4.0",
]

Decisions Log

| Date | Decision | Outcome |
| --- | --- | --- |
| 2026-05-04 | Implementation language | Python, with a single distributable binary via Nuitka |
| 2026-05-04 | AI backend API | OpenAI-compatible API endpoint (local Ollama by default) |
| 2026-05-04 | Default model | `gemma3:4b` |
| 2026-05-04 | SSH auth methods | Keypair only (ed25519/RSA); auto-accept new hosts; reject on key change (MITM) |
| 2026-05-04 | Bastion host support | `--jump-host` flag via SSH native ProxyJump |
| 2026-05-04 | SSH config behavior | Use `~/.ssh/config` by default; allow override via `--ignore-ssh-config` |
| 2026-05-04 | CLI vs interactive mode | Interactive: REPL for v0.1, `textual` TUI for v0.2+ |
| 2026-05-04 | RAG embedding model | `nomic-embed-text` via Ollama (local, air-gapped safe) |
| 2026-05-04 | RAG vector store (Tier 1) | In-memory numpy cosine similarity; zero deps, session-scoped |
| 2026-05-04 | RAG vector store (Tier 2/3) | `chromadb` embedded mode (default) or `qdrant` self-hosted |
| 2026-05-04 | RAG chunking unit | Command-boundary splitting: each collected command = one or more chunks |
| 2026-05-04 | Runbook format | Markdown with YAML frontmatter, version-controlled in `runbooks/` directory |

End-State UX Goal

After the current CLI and memory roadmap phases are stable, the long-term UX goal is a full-screen TUI with an ncurses-style workflow.

Target End-State

Split-pane troubleshooting workspace (diagnostics, AI output, and command/input area)
Live command/probe status with clear success/failure indicators
In-session history browser for prior questions, retrieved evidence, and related past sessions
Keyboard-first navigation for operators running in SSH-only environments

Delivery Approach

Keep shipping incremental CLI features first (current roadmap order remains unchanged)
Promote stable workflows into TUI panels once behavior is proven in CLI mode
Treat the TUI as a final UX consolidation milestone, not a blocker for core troubleshooting capabilities

Container Distribution Goal (Docker)

After core CLI/TUI workflows stabilize, provide an official Docker image as an additional distribution target.

Container Execution Model (Decision)

Docker is a one-shot invocation target, not a daemon/service mode
Each run executes a single tai command and exits
State is persisted only through mounted host volumes

Why Docker Is Valuable Here

Reproducible runtime: pin Python and dependency versions to remove host-level drift
Faster operator onboarding: run with one command instead of local Python setup
Cleaner CI/CD release path: publish versioned images aligned with git tags
Safer local footprint: isolate dependencies from the host OS package manager

Subgoals

Base image and runtime hardening

Multi-stage Dockerfile with slim runtime image
Non-root runtime user and minimal filesystem permissions
Healthcheck for CLI startup and version command

Runtime integration for SSH workflows

Documented mounts for ~/.ssh (read-only where possible) and known-hosts handling
Pass-through for SSH config when needed (--ignore-ssh-config behavior documented)
Clear guidance for jump-host and bastion scenarios from inside the container
Documented one-shot run examples for tai run and tai history

Persistent data strategy

Required volume mount guidance for runbook store (~/.tai/runbooks)
Required volume mount guidance for session memory/history (~/.tai/sessions)
Optional bind mount for JSONL logs and report export artifacts
Clear defaults for container paths and equivalent host path mappings

Release and quality gates

Build and publish image on tagged releases
Smoke tests in CI: probe mode, collect mode, and history command against mocked endpoints
Version labeling (image tags and OCI metadata) tied to changelog/release tags

Data Retention and Lifecycle Policy

Retention behavior must be explicit and configurable at runtime. Defaults should be conservative and documented.

Retention classes

Session memory store (~/.tai/sessions): keep semantically indexed summaries for troubleshooting continuity
Runbook store (~/.tai/runbooks): retain until explicitly replaced or pruned by sync policy
JSONL logs and exported reports: operator-controlled retention with optional TTL cleanup

Retention controls

Add CLI controls for age-based pruning (for example --retain-days on cleanup commands)
Add host-scoped cleanup (delete history for one host) and full-store cleanup (all hosts)
Add dry-run cleanup mode to show what would be deleted before applying changes
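
The three retention controls (age-based pruning, host-scoped cleanup, dry-run) compose naturally into a single selection step. The sketch below works on plain in-memory records to show the logic only; `SessionRecord`, `select_for_cleanup`, and `cleanup` are hypothetical names, and the actual store would be the ChromaDB collection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SessionRecord:
    """Hypothetical minimal view of a stored session."""
    session_id: str
    host: str
    created_at: datetime


def select_for_cleanup(records, now, retain_days: Optional[int] = None,
                       host: Optional[str] = None):
    """Pick records matching the host filter and older than the retention window.
    With no filters at all, every record is selected (full-store cleanup)."""
    cutoff = now - timedelta(days=retain_days) if retain_days is not None else None
    selected = []
    for record in records:
        if host is not None and record.host != host:
            continue
        if cutoff is not None and record.created_at >= cutoff:
            continue
        selected.append(record)
    return selected


def cleanup(records, now, retain_days=None, host=None, dry_run=True):
    """Return (to_delete, remaining). In dry-run mode nothing is removed."""
    doomed = select_for_cleanup(records, now, retain_days, host)
    if dry_run:
        return doomed, list(records)
    doomed_ids = {record.session_id for record in doomed}
    return doomed, [r for r in records if r.session_id not in doomed_ids]
```

Defaulting `dry_run` to `True` is a conservative choice in this sketch: a destructive run has to be requested explicitly.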

No-persist mode

Add a documented ephemeral mode where no session memory or logs are written
Ensure one-shot diagnostics can run in read-only operational contexts

Configuration and State Persistence Model

Configuration and retained state should be predictable across container upgrades and host environments.

Mount and path contract

Define canonical container paths for ~/.tai/runbooks, ~/.tai/sessions, and optional log/export paths
Document required versus optional mounts and expected permissions for each
Document UID/GID mapping guidance to prevent host volume ownership issues

Schema and compatibility

Introduce explicit storage schema version metadata for persistent stores
Define upgrade behavior for older stores (migrate, re-index, or fail with clear guidance)
Add compatibility notes for image upgrades and rollback expectations
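
One way to express the upgrade behavior is a small decision function evaluated when a store is opened. The version number, the action names, and the choice to re-index unversioned stores are all assumptions in this sketch; the roadmap leaves the actual policy open.

```python
from typing import Optional

CURRENT_SCHEMA_VERSION = 2  # hypothetical value for illustration


def plan_store_upgrade(found_version: Optional[int]) -> str:
    """Decide what to do with a persistent store based on its schema version."""
    if found_version is None:
        return "reindex"  # no metadata: rebuild from source data
    if found_version == CURRENT_SCHEMA_VERSION:
        return "ok"
    if found_version < CURRENT_SCHEMA_VERSION:
        return "migrate"  # older store: run migrations in order
    return "fail"         # newer than this build understands (e.g. after rollback)
```

Failing on a newer-than-known version covers the image rollback case, where silently writing with an older schema could corrupt the store.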

Backup and recovery

Provide export/import workflows for session memory and runbook indexes
Document minimal backup set and restore order for disaster recovery

Security and Privacy for Retained Data

Persisted troubleshooting evidence can include sensitive operational data and must be handled accordingly.

Data minimization

Add optional redaction hooks for common sensitive patterns before persistence
Keep prompt-only transient data separate from persisted summary/index content
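
A redaction hook could be a list of pattern and replacement pairs applied before anything is persisted. The two patterns below are examples only and are far from exhaustive; the names `REDACTION_PATTERNS` and `redact` are hypothetical.

```python
import re

# Example patterns only; real deployments would need a broader, reviewed set.
REDACTION_PATTERNS = [
    # key=value or key: value style secrets
    (
        re.compile(r"(?i)\b(password|passwd|secret|token|api[_-]?key)\s*[=:]\s*\S+"),
        r"\1=[REDACTED]",
    ),
    # PEM-encoded private key blocks
    (
        re.compile(
            r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
            re.DOTALL,
        ),
        "[REDACTED PRIVATE KEY]",
    ),
]


def redact(text: str) -> str:
    """Apply every redaction pattern in order and return the scrubbed text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Running this on the session summary just before indexing keeps the transient prompt data untouched while limiting what reaches disk.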

Runtime hardening

Target non-root container execution with read-only root filesystem by default
Require explicit writable mounts only for retained data locations

Auditable behavior

Log retention-affecting operations (cleanup, purge, export/import) with timestamps and scope
Define stable exit codes for cleanup and retention workflows to support automation

Kubernetes Position

Kubernetes is out of scope for this delivery plan.

tai is currently an operator-invoked troubleshooting client, not a long-running service
AI inference is external to tai (OpenAI-compatible endpoint), reducing the need for in-cluster model orchestration
SSH key/config handling and per-operator context are simpler with local or single-container execution

Kubernetes can be revisited only if tai evolves into a centralized multi-user service with queueing, RBAC, and shared tenancy requirements.

Final Long-Term Goal: Full Rust Migration

This is a final-stage roadmap goal and remains explicitly out of near-term scope. It should begin only after the Python implementation, TUI direction, Docker one-shot model, and retention/persistence policies are stable and proven in production usage.

Why This Is the Final Goal

Improve execution latency and startup speed for both native runs and container one-shot invocations
Produce a single, portable native binary with minimal runtime dependency footprint
Strengthen reliability and memory safety under heavy log parsing and concurrent workflows
Simplify long-term packaging and distribution across Linux targets

Migration Objectives

Preserve feature parity first

Match existing CLI behavior, interactive workflows, RAG integration, runbook management, and history/session-memory features
Keep command semantics and safety boundaries equivalent during transition

Target both distribution modes

Native Rust binary for direct operator use
Docker image built around the Rust binary for one-shot execution with mounted persistent volumes

Keep compatibility guardrails

Define persistent data format compatibility or migration tooling for runbook/session stores
Preserve operator-visible flags where practical to reduce migration friction

Suggested Delivery Phases

Build baseline Rust CLI scaffold with feature-flagged parity checkpoints
Port SSH execution and read-only policy enforcement modules
Port planner, collectors, prompt composition, and AI client adapters
Port session memory/history and runbook workflows with migration tests
Port interactive UX/TUI layer and deprecate Python runtime path

Rust Toolchain End-State

Standardize on Cargo-based build/test/lint pipeline (cargo fmt, cargo clippy, cargo test)
Add release profile optimization and reproducible build settings
Publish signed native artifacts and Docker images derived from Rust release binaries

Decision Gate Before Starting

Begin Rust migration only when:

Python roadmap milestones are complete and stable
Container distribution and retention policy workflows are operationally validated
A parity test matrix exists to prove behavior equivalence during migration
