22 KiB
Roadmap
This document outlines the major decisions, milestones, and development phases required to bring tai from concept to a working tool.
Phase 0 — Decisions & Prerequisites
These must be resolved before meaningful development can begin.
Language Selection
- Decision: Python
- Key factors: native vLLM integration, mature SSH libraries (
paramiko/asyncssh), strong text/log parsing, rapid development - Single binary distribution will be achieved via Nuitka (preferred for true compilation) or PyInstaller as a fallback
- Evaluate Nuitka vs PyInstaller for binary output quality and CI reproducibility
- Add binary build step to CI pipeline
AI Backend & Model
- OpenAI-compatible backend client implemented (
AIClient) - Default local backend profile wired for Ollama (
http://localhost:11434/v1) - Default model profile set to
gemma3:4b(override via--model) - Define minimum hardware requirements for running the model locally
- AI backend is user-supplied/self-hosted
SSH Strategy
- Decision: keypair authentication only — no password auth; eliminates credential storage risk
- Default key resolution:
~/.ssh/id_ed25519,~/.ssh/id_rsa(in order of preference) - CLI override via
--identity-file <path> - No SSH agent forwarding needed — a shared key is distributed to all managed hosts via Puppet
- Default key resolution:
- Known hosts: auto-accept new hosts; reject on key mismatch — a changed host key triggers a hard stop with a MITM warning; unknown/new hosts are accepted silently on first connect
- Bastion/jump host:
--jump-host <host>flag — delegates to SSH's native ProxyJump functionality - SSH config behavior: respect existing
~/.ssh/configby default; allow CLI override- Default: follow host settings from
~/.ssh/config(forUser,Port,ProxyJump, etc.) - Override switch:
--ignore-ssh-configto bypass local SSH config when required
- Default: follow host settings from
Scope & Constraints
- Define the supported scope of issues (services, network, disk, kernel, etc.)
- Read-only guarantee implemented with command allowlist + blocked shell operator policy
- Decision: interactive REPL mode for v0.1, full TUI for v0.2+
- v0.1: chat-loop REPL launched from CLI; human can follow up, correct, and redirect the agent
- v0.2+:
textual-based TUI with split panes (collected data | AI output | input bar) - Built-in slash commands:
/collect,/show logs,/clear,/host <hostname>,/help,/quit
Phase 1 — Project Foundation
Basic project scaffolding and connectivity.
- Finalise repository structure and language toolchain
- Set up CI pipeline (linting, tests)
- Implement SSH connection module
- Define SSH config model and probe interface scaffold
- Connect to remote host
- Execute read-only commands (e.g.
journalctl,systemctl status,cat) - Stream or collect command output safely (byte-limited output with truncation marker)
- Implement basic input parsing (ticket text, hostname, target directories)
- Write unit tests for SSH and input modules
- Input parser and CLI tests added
- SSH module tests added for command policy and SSH argv behavior
Phase 2 — Data Collection Layer
Define what information the agent gathers and how.
- Identify a baseline canonical set of data sources per issue type:
- Service failures:
journalctl,systemctl, service config files - Network issues:
ip,ss,netstat, firewall rules - Disk issues:
df,du,dmesg,smartctl - General:
/var/log/syslog,/var/log/messages,dmesg
- Service failures:
- Implement collectors and plan builder for baseline issue categories
- Implement directory traversal for user-specified paths (read-only)
- Add support for per-distro variations (Ubuntu vs RHEL path differences, etc.)
- Write tests with mocked SSH output
Phase 3 — AI Integration
Wire collected data into the local AI model.
- Implement OpenAI-compatible AI client module
- Design prompt templates for initial and follow-up analysis
- Implement response guardrail checks and structured response headings
- Tune context usage with RAG retrieval and chunk/runbook truncation budgets
- Implement reliable non-streaming completion path for local backends
- Continue output quality tuning and grounding evaluation on real hosts
Phase 4 — CLI & User Experience
Polish the interface for real-world use.
- Design CLI interface with run command, interactive prompts, and runbook subcommands
- Implement structured output sections (Root Cause, Evidence, Recommended Actions)
- Add RAG debug mode (
--rag-debug) showing retrieval scores - Support output to file (
--output-file) - Provide comprehensive
--helpcommand documentation via Typer options
Phase 5 — Hardening & Distribution
Prepare for broader use.
- Security review of SSH handling and credential storage
- Ensure no data is written to the remote system under any path
- Package for distribution (binary release, container image, or distro packages)
- Write installation and quickstart documentation
- End-to-end integration tests against a test VM
Phase 6 — RAG & Knowledge Layer
Introduce Retrieval-Augmented Generation to ground AI responses in evidence rather than model weights alone. Three tiers of increasing capability, each buildable independently.
Goals
- Eliminate prompt flooding on hosts with large log output
- Ground recommendations in version-controlled runbooks, not model improvisation
- Build compounding institutional memory from past troubleshooting sessions
- Keep all data local — no embeddings or session content leaves the network
Technology Decisions Required
| Decision | Options | Recommendation | Status |
|---|---|---|---|
| Embedding model | nomic-embed-text, mxbai-embed-large, all-minilm |
nomic-embed-text via Ollama (local, 274MB, strong perf) |
✅ Implemented |
| Vector store — Tier 1 | In-memory numpy cosine, faiss-cpu |
numpy (zero deps) for session scope | ✅ Implemented |
| Vector store — Tier 2/3 | chromadb, qdrant, weaviate, pgvector |
chromadb embedded mode |
✅ Tier 2 Implemented |
| Chunking strategy | Fixed token, sentence-aware, command-boundary | Command-boundary splitting (natural unit for diagnostics) | ✅ Implemented |
| Hybrid retrieval | Semantic only, BM25 only, hybrid | Hybrid (BM25 keyword + cosine semantic) for best recall | ⬜ Pending |
| Reranking | None, cross-encoder (ms-marco-MiniLM), LLM-as-judge |
Cross-encoder rerank pass before prompt injection | ⬜ Pending |
| Runbook format | Markdown, YAML, JSON | Markdown (human-editable, version-controllable) | ✅ Implemented |
| Session index storage | Local ~/.tai/, configurable path |
~/.tai/sessions/ with ChromaDB collection |
✅ Implemented (core) |
Tier 1 — Diagnostic Chunk Retrieval (in-memory, per-session)
Status: ✅ Implemented
Problem: Current flow injects all collected output into the prompt as one block. On busy hosts this floods the context window with irrelevant output, degrading quality.
Approach:
- After collection, split each command's output into overlapping token chunks (e.g. 512 tokens, 64 overlap)
- Embed all chunks using
nomic-embed-textvia Ollama embeddings API - On each question (initial + follow-up), embed the question and retrieve top-k chunks by cosine similarity
- Inject only retrieved chunks into the prompt, not the full dump
New module: src/tai/rag_retriever.py
chunk_report(report) -> list[Chunk]embed_chunks(chunks) -> list[EmbeddedChunk]retrieve(question, embedded_chunks, top_k) -> list[Chunk]
Changes to existing code:
prompt_builder.py: acceptretrieved_chunksinstead of fullCollectionReportfor RAG-mode promptscli.py: embed report after collection, pass retriever to_run_analysisand_run_followup_analysisai_client.py: addembed(text) -> list[float]method using Ollama/api/embeddings
Companion features buildable at same time:
--no-ragflag to bypass retrieval and use full dump (backwards compat)- Token budget display: show user how many tokens are being sent vs. saved
- Per-chunk source attribution in AI response (which command produced the evidence)
Tests:
tests/test_rag_retriever.py: chunk splitting, cosine similarity ranking, top-k retrievaltests/test_ai.py: addtest_embed_returns_float_list()
Tier 2 — Runbook Knowledge Base (persistent, ChromaDB)
Status: ✅ Implemented
Problem: AI improvises remediation steps from training data, which may be wrong for specific environments, distros, or internal conventions.
Approach:
- Maintain a version-controlled corpus of Markdown runbooks in
runbooks/directory - On first run (or
tai runbooks --sync), embed all runbooks and persist to ChromaDB collection - On each analysis, retrieve top-3 relevant runbook chunks alongside diagnostic chunks
- Inject as a separate
## Runbook Contextsection in the prompt
New module: src/tai/runbook_store.py
RunbookStore: wraps ChromaDB collectionsync(runbooks_dir) -> int— embed and upsert all runbooksquery(question, top_k) -> list[RunbookChunk]
New directory: runbooks/
ssh.md,nginx.md,postgres.md,disk.md,kernel.md, etc.- Each runbook: YAML frontmatter (
service,symptoms,tags) + Markdown body
New CLI command: tai runbooks --sync [--path ./runbooks]
Changes to existing code:
prompt_builder.py: addbuild_message_with_runbooks(retrieved_chunks, runbook_chunks)cli.py: optionally loadRunbookStore, query it per analysis turn
Companion features buildable at same time:
tai runbooks --list— show indexed runbooks and last sync timetai runbooks --add <file>— index a single runbook/runbooksslash command in interactive mode — show which runbooks were retrieved- Runbook citation in AI output: "Based on runbook:
ssh.md#AuthenticationFailures"
Tier 3 — Session Memory Index (institutional learning)
Status: ✅ Implemented (core retrieval/indexing) / ⬜ UX commands pending
Problem: Every session starts from zero. Repeat incidents on the same host or same issue type get no benefit from past work.
Implemented now:
- On session end, embed the session summary (issue + root cause + actions) and upsert into a persistent ChromaDB collection (
~/.tai/sessions/) - On session start, query for similar past sessions by issue text + hostname
- Inject top-2 past sessions as
## Prior Sessionscontext
Pending UX layer:
/historycommand in interactive mode to surface past sessions explicitly
New module: src/tai/session_store.py
SessionStore: wraps ChromaDB collection at~/.tai/sessions/index_session(host, issue, summary, ai)— embed and store completed sessionquery(question, host, ai, top_k) -> list[PastSession]
Changes to existing code:
cli.py: querySessionStoreduring analysis turns and index final responses at session end
Companion features buildable at same time:
tai historyCLI subcommand — search past sessions by keywordtai history --host <hostname>— all sessions for a hosttai history --export <file>— export session summaries as Markdown report- Auto-suggest: "Similar issue found from 2 weeks ago — load context? [y/N]"
Implementation Order
Tier 1 (diagnostic chunks) ← Start here. Zero new infra. Immediate prompt quality gain.
↓
Tier 2 (runbook KB) ← After Tier 1. Requires ChromaDB dep + runbook authoring.
↓
Tier 3 (session memory) ← Builds on Tier 2 infrastructure. Minimal extra work.
Estimated effort:
- Tier 1: 2–3 days (new module + prompt builder changes + tests)
- Tier 2: 3–4 days (ChromaDB + runbook authoring + CLI command + tests)
- Tier 3: 1–2 days (reuses Tier 2 infrastructure)
New Dependencies
# Tier 1 (zero new runtime deps — uses Ollama HTTP API already in use)
# No additions needed
# Tier 2 + 3
chromadb>=0.5,<1.0 # embedded vector store, no separate server
# OR
qdrant-client>=1.9,<2.0 # if self-hosted Qdrant preferred
sentence-transformers>=3.0 # optional: cross-encoder reranking
New pyproject.toml optional group
[project.optional-dependencies]
rag = [
"chromadb>=0.5,<1.0",
"sentence-transformers>=3.0,<4.0",
]
Decisions Log
| Date | Decision | Outcome |
|---|---|---|
| 2026-05-04 | Implementation language | Python — with single distributable binary via Nuitka |
| 2026-05-04 | AI backend API | OpenAI-compatible API endpoint (local Ollama by default) |
| 2026-05-04 | Default model | gemma3:4b |
| 2026-05-04 | SSH auth methods | Keypair only (ed25519/RSA); auto-accept new hosts; reject on key change (MITM) |
| 2026-05-04 | Bastion host support | --jump-host flag via SSH native ProxyJump |
| 2026-05-04 | SSH config behavior | Use ~/.ssh/config by default; allow override via --ignore-ssh-config |
| 2026-05-04 | CLI vs interactive mode | Interactive: REPL for v0.1, textual TUI for v0.2+ |
| 2026-05-04 | RAG embedding model | nomic-embed-text via Ollama (local, air-gapped safe) |
| 2026-05-04 | RAG vector store (Tier 1) | In-memory numpy cosine similarity — zero deps, session-scoped |
| 2026-05-04 | RAG vector store (Tier 2/3) | chromadb embedded mode (default) or qdrant self-hosted |
| 2026-05-04 | RAG chunking unit | Command-boundary splitting — each collected command = one or more chunks |
| 2026-05-04 | Runbook format | Markdown with YAML frontmatter, version-controlled in runbooks/ directory |
End-State UX Goal
After the current CLI and memory roadmap phases are stable, the long-term UX goal is a full-screen terminal TUI with an ncurses-style workflow.
Target End-State
- Split-pane troubleshooting workspace (diagnostics, AI output, and command/input area)
- Live command/probe status with clear success/failure indicators
- In-session history browser for prior questions, retrieved evidence, and related past sessions
- Keyboard-first navigation for operators running in SSH-only environments
Delivery Approach
- Keep shipping incremental CLI features first (current roadmap order remains unchanged)
- Promote stable workflows into TUI panels once behavior is proven in CLI mode
- Treat the TUI as a final UX consolidation milestone, not a blocker for core troubleshooting capabilities
Container Distribution Goal (Docker)
After core CLI/TUI workflows stabilize, provide an official Docker image as an additional distribution target.
Container Execution Model (Decision)
- Docker is a one-shot invocation target, not a daemon/service mode
- Each run executes a single
taicommand and exits - State is persisted only through mounted host volumes
Why Docker Is Valuable Here
- Reproducible runtime: pin Python and dependency versions to remove host-level drift
- Faster operator onboarding: run with one command instead of local Python setup
- Cleaner CI/CD release path: publish versioned images aligned with git tags
- Safer local footprint: isolate dependencies from the host OS package manager
Subgoals
- Base image and runtime hardening
- Multi-stage Dockerfile with slim runtime image
- Non-root runtime user and minimal filesystem permissions
- Healthcheck for CLI startup and version command
- Runtime integration for SSH workflows
- Documented mounts for
~/.ssh(read-only where possible) and known-hosts handling - Pass-through for SSH config when needed (
--ignore-ssh-configbehavior documented) - Clear guidance for jump-host and bastion scenarios from inside the container
- Documented one-shot run examples for
tai runandtai history
- Persistent data strategy
- Required volume mount guidance for runbook store (
~/.tai/runbooks) - Required volume mount guidance for session memory/history (
~/.tai/sessions) - Optional bind mount for JSONL logs and report export artifacts
- Clear defaults for container paths and equivalent host path mappings
- Release and quality gates
- Build and publish image on tagged releases
- Smoke tests in CI: probe mode, collect mode, and history command against mocked endpoints
- Version labeling (image tags and OCI metadata) tied to changelog/release tags
Data Retention and Lifecycle Policy
Retention behavior must be explicit and configurable at runtime. Defaults should be conservative and documented.
- Retention classes
- Session memory store (
~/.tai/sessions): keep semantically indexed summaries for troubleshooting continuity - Runbook store (
~/.tai/runbooks): retain until explicitly replaced or pruned by sync policy - JSONL logs and exported reports: operator-controlled retention with optional TTL cleanup
- Retention controls
- Add CLI controls for age-based pruning (for example
--retain-dayson cleanup commands) - Add host-scoped cleanup (delete history for one host) and full-store cleanup (all hosts)
- Add dry-run cleanup mode to show what would be deleted before applying changes
- No-persist mode
- Add a documented ephemeral mode where no session memory or logs are written
- Ensure one-shot diagnostics can run in read-only operational contexts
Configuration and State Persistence Model
Configuration and retained state should be predictable across container upgrades and host environments.
- Mount and path contract
- Define canonical container paths for
~/.tai/runbooks,~/.tai/sessions, and optional log/export paths - Document required versus optional mounts and expected permissions for each
- Document UID/GID mapping guidance to prevent host volume ownership issues
- Schema and compatibility
- Introduce explicit storage schema version metadata for persistent stores
- Define upgrade behavior for older stores (migrate, re-index, or fail with clear guidance)
- Add compatibility notes for image upgrades and rollback expectations
- Backup and recovery
- Provide export/import workflows for session memory and runbook indexes
- Document minimal backup set and restore order for disaster recovery
Security and Privacy for Retained Data
Persisted troubleshooting evidence can include sensitive operational data and must be handled accordingly.
- Data minimization
- Add optional redaction hooks for common sensitive patterns before persistence
- Keep prompt-only transient data separate from persisted summary/index content
- Runtime hardening
- Target non-root container execution with read-only root filesystem by default
- Require explicit writable mounts only for retained data locations
- Auditable behavior
- Log retention-affecting operations (cleanup, purge, export/import) with timestamps and scope
- Define stable exit codes for cleanup and retention workflows to support automation
Kubernetes Position
Kubernetes is out of scope for this delivery plan.
taiis currently an operator-invoked troubleshooting client, not a long-running service- AI inference is external to
tai(OpenAI-compatible endpoint), reducing the need for in-cluster model orchestration - SSH key/config handling and per-operator context are simpler with local or single-container execution
Kubernetes can be revisited only if tai evolves into a centralized multi-user service with queueing, RBAC, and shared tenancy requirements.
Final Long-Term Goal: Full Rust Migration
This is a final-stage roadmap goal and remains explicitly out of near-term scope. It should begin only after the Python implementation, TUI direction, Docker one-shot model, and retention/persistence policies are stable and proven in production usage.
Why This Is the Final Goal
- Improve execution latency and startup speed for both native runs and container one-shot invocations
- Produce a single, portable native binary with minimal runtime dependency footprint
- Strengthen reliability and memory safety under heavy log parsing and concurrent workflows
- Simplify long-term packaging and distribution across Linux targets
Migration Objectives
- Preserve feature parity first
- Match existing CLI behavior, interactive workflows, RAG integration, runbook management, and history/session-memory features
- Keep command semantics and safety boundaries equivalent during transition
- Target both distribution modes
- Native Rust binary for direct operator use
- Docker image built around the Rust binary for one-shot execution with mounted persistent volumes
- Keep compatibility guardrails
- Define persistent data format compatibility or migration tooling for runbook/session stores
- Preserve operator-visible flags where practical to reduce migration friction
Suggested Delivery Phases
- Build baseline Rust CLI scaffold with feature-flagged parity checkpoints
- Port SSH execution and read-only policy enforcement modules
- Port planner, collectors, prompt composition, and AI client adapters
- Port session memory/history and runbook workflows with migration tests
- Port interactive UX/TUI layer and deprecate Python runtime path
Rust Toolchain End-State
- Standardize on Cargo-based build/test/lint pipeline (
cargo fmt,cargo clippy,cargo test) - Add release profile optimization and reproducible build settings
- Publish signed native artifacts and Docker images derived from Rust release binaries
Decision Gate Before Starting
Begin Rust migration only when:
- Python roadmap milestones are complete and stable
- Container distribution and retention policy workflows are operationally validated
- A parity test matrix exists to prove behavior equivalence during migration