Roadmap

This document outlines the major decisions, milestones, and development phases required to bring tai from concept to a working tool.


Phase 0 — Decisions & Prerequisites

These must be resolved before meaningful development can begin.

Language Selection

  • Decision: Python
  • Key factors: native vLLM integration, mature SSH libraries (paramiko / asyncssh), strong text/log parsing, rapid development
  • Single binary distribution will be achieved via Nuitka (preferred for true compilation) or PyInstaller as a fallback
  • Evaluate Nuitka vs PyInstaller for binary output quality and CI reproducibility
  • Add binary build step to CI pipeline

AI Backend & Model

  • OpenAI-compatible backend client implemented (AIClient)
  • Default local backend profile wired for Ollama (http://localhost:11434/v1)
  • Default model profile set to gemma3:4b (override via --model)
  • Define minimum hardware requirements for running the model locally
  • AI backend is user-supplied/self-hosted

SSH Strategy

  • Decision: keypair authentication only — no password auth; eliminates credential storage risk
    • Default key resolution: ~/.ssh/id_ed25519, ~/.ssh/id_rsa (in order of preference)
    • CLI override via --identity-file <path>
    • No SSH agent forwarding needed — a shared key is distributed to all managed hosts via Puppet
  • Known hosts: auto-accept new hosts; reject on key mismatch — a changed host key triggers a hard stop with a MITM warning; unknown/new hosts are accepted silently on first connect
  • Bastion/jump host: --jump-host <host> flag — delegates to SSH's native ProxyJump functionality
  • SSH config behavior: respect existing ~/.ssh/config by default; allow CLI override
    • Default: follow host settings from ~/.ssh/config (for User, Port, ProxyJump, etc.)
    • Override switch: --ignore-ssh-config to bypass local SSH config when required
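
The SSH decisions above could translate into an `ssh` argument list roughly as follows. This is a minimal sketch, not tai's actual module: the function names `resolve_identity` and `build_ssh_argv` are hypothetical, and it assumes the system OpenSSH client is invoked (version 7.6 or later, for `StrictHostKeyChecking=accept-new`).

```python
from pathlib import Path
from typing import List, Optional

# Default key search order from the roadmap: ed25519 first, then RSA.
DEFAULT_KEYS = ("id_ed25519", "id_rsa")


def resolve_identity(override: Optional[str], ssh_dir: Path) -> Optional[Path]:
    """Return the --identity-file override, else the first default key that exists."""
    if override:
        return Path(override)
    for name in DEFAULT_KEYS:
        candidate = ssh_dir / name
        if candidate.is_file():
            return candidate
    return None


def build_ssh_argv(
    host: str,
    command: str,
    identity: Optional[Path] = None,
    jump_host: Optional[str] = None,
    ignore_ssh_config: bool = False,
) -> List[str]:
    """Build an argv for the OpenSSH client (hypothetical helper)."""
    argv = [
        "ssh",
        "-o", "BatchMode=yes",                     # never prompt interactively
        "-o", "PasswordAuthentication=no",          # keypair authentication only
        "-o", "StrictHostKeyChecking=accept-new",   # accept new hosts, reject changed keys
    ]
    if ignore_ssh_config:
        argv += ["-F", "/dev/null"]                 # bypass ~/.ssh/config
    if identity is not None:
        argv += ["-i", str(identity)]
    if jump_host:
        argv += ["-J", jump_host]                   # OpenSSH ProxyJump shorthand
    argv += [host, command]
    return argv
```

For example, `build_ssh_argv("web01", "uptime", jump_host="bastion")` yields an argv that ends in `["-J", "bastion", "web01", "uptime"]`.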

Scope & Constraints

  • Define the supported scope of issues (services, network, disk, kernel, etc.)
  • Read-only guarantee implemented with command allowlist + blocked shell operator policy
  • Decision: interactive REPL mode for v0.1, full TUI for v0.2+
    • v0.1: chat-loop REPL launched from CLI; human can follow up, correct, and redirect the agent
    • v0.2+: textual-based TUI with split panes (collected data | AI output | input bar)
    • Built-in slash commands: /collect, /show logs, /clear, /host <hostname>, /help, /quit
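
The read-only guarantee (command allowlist plus blocked shell operators) might look something like the sketch below. The allowlist contents, the permitted `systemctl` verbs, and the name `is_read_only` are illustrative assumptions; a production policy would also need per-command argument checks, since some allowlisted tools have mutating flags.

```python
import shlex

# Illustrative allowlist only; the real policy is defined by the project.
ALLOWED_COMMANDS = {"journalctl", "systemctl", "cat", "df", "du", "dmesg", "ip", "ss"}
# systemctl can change state, so only inspection verbs are permitted here.
ALLOWED_SYSTEMCTL_VERBS = {"status", "show", "is-active", "is-enabled", "list-units"}
# Shell operators that allow chaining, redirection, or substitution.
BLOCKED_TOKENS = (";", "&", "|", ">", "<", "`", "$(", "\n")


def is_read_only(command: str) -> bool:
    """Return True only if the command passes both policy checks."""
    # Conservative: reject operators anywhere, even inside quoted arguments.
    if any(token in command for token in BLOCKED_TOKENS):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes and similar parse errors
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return False
    if argv[0] == "systemctl":
        return len(argv) > 1 and argv[1] in ALLOWED_SYSTEMCTL_VERBS
    return True
```

Checking the raw string before parsing is deliberately strict: it rejects some harmless commands (a quoted `|` in an argument, say) in exchange for a policy that is easy to audit.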

Phase 1 — Project Foundation

Basic project scaffolding and connectivity.

  • Finalise repository structure and language toolchain
  • Set up CI pipeline (linting, tests)
  • Implement SSH connection module
    • Define SSH config model and probe interface scaffold
    • Connect to remote host
    • Execute read-only commands (e.g. journalctl, systemctl status, cat)
    • Stream or collect command output safely (byte-limited output with truncation marker)
  • Implement basic input parsing (ticket text, hostname, target directories)
  • Write unit tests for SSH and input modules
    • Input parser and CLI tests added
    • SSH module tests added for command policy and SSH argv behavior
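
Byte-limited collection with a truncation marker can be sketched in a few lines. The marker text and the name `limit_output` are assumptions made for illustration.

```python
# Hypothetical marker text; the project may use a different string.
TRUNCATION_MARKER = "\n[... output truncated ...]"


def limit_output(data: bytes, max_bytes: int) -> str:
    """Decode command output, keeping at most max_bytes and flagging any cut."""
    if len(data) <= max_bytes:
        return data.decode("utf-8", errors="replace")
    # A cut in the middle of a multi-byte character decodes to U+FFFD
    # rather than raising, so truncation can never crash collection.
    return data[:max_bytes].decode("utf-8", errors="replace") + TRUNCATION_MARKER
```

Appending the marker only when bytes were dropped lets later stages (and the model) tell complete output from partial output.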

Phase 2 — Data Collection Layer

Define what information the agent gathers and how.

  • Identify a baseline canonical set of data sources per issue type:
    • Service failures: journalctl, systemctl, service config files
    • Network issues: ip, ss, netstat, firewall rules
    • Disk issues: df, du, dmesg, smartctl
    • General: /var/log/syslog, /var/log/messages, dmesg
  • Implement collectors and plan builder for baseline issue categories
  • Implement directory traversal for user-specified paths (read-only)
  • Add support for per-distro variations (Ubuntu vs RHEL path differences, etc.)
  • Write tests with mocked SSH output
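
A plan builder over the baseline categories could be as simple as a lookup table with de-duplication. In this sketch the exact command lines, the category keys, and the `build_plan` name are assumptions; only the tools and log paths come from the list above.

```python
from typing import Dict, List

# Illustrative command lines for each baseline category.
BASELINE_COMMANDS: Dict[str, List[str]] = {
    "service": ["journalctl -n 200 --no-pager", "systemctl --failed --no-pager"],
    "network": ["ip addr show", "ss -tulpn"],
    "disk": ["df -h", "dmesg"],
    "general": ["dmesg", "tail -n 200 {syslog}"],
}

# Per-distro variation: Ubuntu logs to syslog, RHEL to messages.
SYSLOG_PATHS = {"ubuntu": "/var/log/syslog", "rhel": "/var/log/messages"}


def build_plan(categories: List[str], distro: str) -> List[str]:
    """Return an ordered, de-duplicated command list; 'general' always runs last."""
    syslog = SYSLOG_PATHS.get(distro, "/var/log/messages")
    plan: List[str] = []
    for category in [*categories, "general"]:
        for template in BASELINE_COMMANDS.get(category, []):
            command = template.format(syslog=syslog)
            if command not in plan:
                plan.append(command)
    return plan
```

With these tables, `build_plan(["disk"], "ubuntu")` runs `dmesg` once even though both the disk and general categories list it.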

Phase 3 — AI Integration

Wire collected data into the local AI model.

  • Implement OpenAI-compatible AI client module
  • Design prompt templates for initial and follow-up analysis
  • Implement response guardrail checks and structured response headings
  • Tune context usage with RAG retrieval and chunk/runbook truncation budgets
  • Implement reliable non-streaming completion path for local backends
  • Continue output quality tuning and grounding evaluation on real hosts
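
One possible guardrail check verifies that a model response contains the expected section headings. The sketch assumes Markdown-style `#` headings and uses the three section names listed under Phase 4; `missing_sections` is a hypothetical helper name.

```python
import re
from typing import List

# Section names from the Phase 4 structured-output item.
REQUIRED_SECTIONS = ("Root Cause", "Evidence", "Recommended Actions")

# Matches Markdown headings such as "## Root Cause" (an assumed output format).
_HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(response: str) -> List[str]:
    """Return the required section names that do not appear as headings."""
    found = {match.group(1).lower() for match in _HEADING.finditer(response)}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]
```

A non-empty result could trigger a retry or a warning to the operator before the response is shown.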

Phase 4 — CLI & User Experience

Polish the interface for real-world use.

  • Design CLI interface with run command, interactive prompts, and runbook subcommands
  • Implement structured output sections (Root Cause, Evidence, Recommended Actions)
  • Add RAG debug mode (--rag-debug) showing retrieval scores
  • Support output to file (--output-file)
  • Provide comprehensive --help command documentation via Typer options

Phase 5 — Hardening & Distribution

Prepare for broader use.

  • Security review of SSH handling and credential storage
  • Ensure no data is written to the remote system under any path
  • Package for distribution (binary release, container image, or distro packages)
  • Write installation and quickstart documentation
  • End-to-end integration tests against a test VM

Phase 6 — RAG & Knowledge Layer

Introduce Retrieval-Augmented Generation to ground AI responses in evidence rather than model weights alone. Three tiers of increasing capability, each buildable independently.

Goals

  • Eliminate prompt flooding on hosts with large log output
  • Ground recommendations in version-controlled runbooks, not model improvisation
  • Build compounding institutional memory from past troubleshooting sessions
  • Keep all data local — no embeddings or session content leaves the network

Technology Decisions Required

| Decision | Options | Recommendation | Status |
| --- | --- | --- | --- |
| Embedding model | nomic-embed-text, mxbai-embed-large, all-minilm | nomic-embed-text via Ollama (local, 274 MB, strong perf) | Implemented |
| Vector store (Tier 1) | In-memory numpy cosine, faiss-cpu | numpy (zero deps) for session scope | Implemented |
| Vector store (Tier 2/3) | chromadb, qdrant, weaviate, pgvector | chromadb embedded mode | Tier 2 Implemented |
| Chunking strategy | Fixed token, sentence-aware, command-boundary | Command-boundary splitting (natural unit for diagnostics) | Implemented |
| Hybrid retrieval | Semantic only, BM25 only, hybrid | Hybrid (BM25 keyword + cosine semantic) for best recall | Pending |
| Reranking | None, cross-encoder (ms-marco-MiniLM), LLM-as-judge | Cross-encoder rerank pass before prompt injection | Pending |
| Runbook format | Markdown, YAML, JSON | Markdown (human-editable, version-controllable) | Implemented |
| Session index storage | Local ~/.tai/, configurable path | ~/.tai/sessions/ with ChromaDB collection | Implemented (core) |

Tier 1 — Diagnostic Chunk Retrieval (in-memory, per-session)

Status: Implemented

Problem: Current flow injects all collected output into the prompt as one block. On busy hosts this floods the context window with irrelevant output, degrading quality.

Approach:

  • After collection, split each command's output into overlapping token chunks (e.g. 512 tokens, 64 overlap)
  • Embed all chunks using nomic-embed-text via Ollama embeddings API
  • On each question (initial + follow-up), embed the question and retrieve top-k chunks by cosine similarity
  • Inject only retrieved chunks into the prompt, not the full dump
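
The overlapping-chunk step can be illustrated as follows. This sketch approximates tokens with whitespace-separated words, which is an assumption (a real implementation would likely count model tokens), and the function names are hypothetical.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def chunk_command_output(command: str, output: str,
                         size: int = 512, overlap: int = 64) -> List[Dict[str, str]]:
    """Chunk one command's output, keeping the command name for source attribution."""
    # Whitespace splitting is a stand-in for real tokenization.
    return [
        {"command": command, "text": " ".join(window)}
        for window in chunk_tokens(output.split(), size, overlap)
    ]
```

Because chunking happens per command, no chunk ever mixes output from two commands, which keeps per-chunk source attribution straightforward.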

New module: src/tai/rag_retriever.py

  • chunk_report(report) -> list[Chunk]
  • embed_chunks(chunks) -> list[EmbeddedChunk]
  • retrieve(question, embedded_chunks, top_k) -> list[Chunk]
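
Cosine top-k ranking, the core of the retrieval step, is sketched below in pure Python so it runs without dependencies; the roadmap's own choice for Tier 1 is numpy. The name `retrieve_top_k` and the `(text, vector)` pair layout are assumptions, and unlike the `retrieve` signature above it takes a question that has already been embedded.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    question_vec: Sequence[float],
    embedded_chunks: List[Tuple[str, Sequence[float]]],
    top_k: int = 5,
) -> List[str]:
    """Return the texts of the top_k chunks most similar to the question vector."""
    ranked = sorted(
        embedded_chunks,
        key=lambda item: cosine(question_vec, item[1]),
        reverse=True,
    )
    return [text for text, _vector in ranked[:top_k]]
```

For a single session's chunks, a full sort is adequate; an approximate index only becomes worthwhile at much larger scales.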

Changes to existing code:

  • prompt_builder.py: accept retrieved_chunks instead of full CollectionReport for RAG-mode prompts
  • cli.py: embed report after collection, pass retriever to _run_analysis and _run_followup_analysis
  • ai_client.py: add embed(text) -> list[float] method using Ollama /api/embeddings
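
An `embed` call against Ollama's `/api/embeddings` endpoint might look like the sketch below, using only the standard library. The `base_url`, `timeout`, and injectable `opener` parameter are assumptions added so the function can be tested without a running server; the real `AIClient` method may differ.

```python
import json
import urllib.request
from typing import Callable, List


def embed(
    text: str,
    model: str = "nomic-embed-text",
    base_url: str = "http://localhost:11434",
    opener: Callable = urllib.request.urlopen,  # injectable for tests (assumption)
) -> List[float]:
    """Request one embedding vector from Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request, timeout=30) as response:
        body = json.loads(response.read().decode("utf-8"))
    # The endpoint responds with {"embedding": [...]}; coerce entries to float.
    return [float(value) for value in body["embedding"]]
```

Passing a fake `opener` that returns an in-memory response is one way to write the `test_embed_returns_float_list()` test without network access.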

Companion features buildable at same time:

  • --no-rag flag to bypass retrieval and use full dump (backwards compat)
  • Token budget display: show user how many tokens are being sent vs. saved
  • Per-chunk source attribution in AI response (which command produced the evidence)

Tests:

  • tests/test_rag_retriever.py: chunk splitting, cosine similarity ranking, top-k retrieval
  • tests/test_ai.py: add test_embed_returns_float_list()

Tier 2 — Runbook Knowledge Base (persistent, ChromaDB)

Status: Implemented

Problem: AI improvises remediation steps from training data, which may be wrong for specific environments, distros, or internal conventions.

Approach:

  • Maintain a version-controlled corpus of Markdown runbooks in runbooks/ directory
  • On first run (or tai runbooks --sync), embed all runbooks and persist to ChromaDB collection
  • On each analysis, retrieve top-3 relevant runbook chunks alongside diagnostic chunks
  • Inject as a separate ## Runbook Context section in the prompt

New module: src/tai/runbook_store.py

  • RunbookStore: wraps ChromaDB collection
  • sync(runbooks_dir) -> int — embed and upsert all runbooks
  • query(question, top_k) -> list[RunbookChunk]

New directory: runbooks/

  • ssh.md, nginx.md, postgres.md, disk.md, kernel.md, etc.
  • Each runbook: YAML frontmatter (service, symptoms, tags) + Markdown body
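
Splitting a runbook into frontmatter and body can be sketched as below. This version handles only flat `key: value` lines and is not a YAML parser; a real implementation would probably use a YAML library. `split_frontmatter` is a hypothetical name.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split '---'-delimited frontmatter from the Markdown body.

    Only flat 'key: value' lines are understood; nested YAML is not parsed.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text  # no closing delimiter: treat everything as body
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body
```

The metadata (service, symptoms, tags) could then be stored alongside each embedded chunk so that retrieval results can be filtered or cited by runbook.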

New CLI command: tai runbooks --sync [--path ./runbooks]

Changes to existing code:

  • prompt_builder.py: add build_message_with_runbooks(retrieved_chunks, runbook_chunks)
  • cli.py: optionally load RunbookStore, query it per analysis turn

Companion features buildable at same time:

  • tai runbooks --list — show indexed runbooks and last sync time
  • tai runbooks --add <file> — index a single runbook
  • /runbooks slash command in interactive mode — show which runbooks were retrieved
  • Runbook citation in AI output: "Based on runbook: ssh.md#AuthenticationFailures"

Tier 3 — Session Memory Index (institutional learning)

Status: Implemented (core retrieval/indexing) / UX commands pending

Problem: Every session starts from zero. Repeat incidents on the same host or same issue type get no benefit from past work.

Implemented now:

  • On session end, embed the session summary (issue + root cause + actions) and upsert into a persistent ChromaDB collection (~/.tai/sessions/)
  • On session start, query for similar past sessions by issue text + hostname
  • Inject top-2 past sessions as ## Prior Sessions context
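
Assembling the retrieved material into separate prompt sections could be sketched like this. The `## Runbook Context` and `## Prior Sessions` headings and the top-3 and top-2 limits come from this roadmap; the `## Diagnostic Evidence` heading and the `build_context` name are assumptions.

```python
from typing import List


def build_context(
    diagnostic_chunks: List[str],
    runbook_chunks: List[str],
    prior_sessions: List[str],
) -> str:
    """Join retrieved material into labelled prompt sections, skipping empty ones."""
    sections = [
        ("Diagnostic Evidence", diagnostic_chunks),  # heading name is an assumption
        ("Runbook Context", runbook_chunks[:3]),      # top-3 runbook chunks
        ("Prior Sessions", prior_sessions[:2]),       # top-2 past sessions
    ]
    parts = [
        f"## {title}\n" + "\n\n".join(items)
        for title, items in sections
        if items
    ]
    return "\n\n".join(parts)
```

Keeping each source under its own heading makes it easier for the model, and for an operator reading a debug dump, to tell live evidence from recalled context.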

Pending UX layer:

  • /history command in interactive mode to surface past sessions explicitly

New module: src/tai/session_store.py

  • SessionStore: wraps ChromaDB collection at ~/.tai/sessions/
  • index_session(host, issue, summary, ai) — embed and store completed session
  • query(question, host, ai, top_k) -> list[PastSession]

Changes to existing code:

  • cli.py: query SessionStore during analysis turns and index final responses at session end

Companion features buildable at same time:

  • tai history CLI subcommand — search past sessions by keyword
  • tai history --host <hostname> — all sessions for a host
  • tai history --export <file> — export session summaries as Markdown report
  • Auto-suggest: "Similar issue found from 2 weeks ago — load context? [y/N]"

Implementation Order

Tier 1 (diagnostic chunks)     ← Start here. Zero new infra. Immediate prompt quality gain.
       ↓
Tier 2 (runbook KB)            ← After Tier 1. Requires ChromaDB dep + runbook authoring.
       ↓
Tier 3 (session memory)        ← Builds on Tier 2 infrastructure. Minimal extra work.

Estimated effort:

  • Tier 1: 2-3 days (new module + prompt builder changes + tests)
  • Tier 2: 3-4 days (ChromaDB + runbook authoring + CLI command + tests)
  • Tier 3: 1-2 days (reuses Tier 2 infrastructure)

New Dependencies

# Tier 1 (zero new runtime deps — uses Ollama HTTP API already in use)
# No additions needed

# Tier 2 + 3
chromadb>=0.5,<1.0          # embedded vector store, no separate server
# OR
qdrant-client>=1.9,<2.0     # if self-hosted Qdrant preferred

sentence-transformers>=3.0,<4.0  # optional: cross-encoder reranking

New pyproject.toml optional group

[project.optional-dependencies]
rag = [
  "chromadb>=0.5,<1.0",
  "sentence-transformers>=3.0,<4.0",
]

Decisions Log

| Date | Decision | Outcome |
| --- | --- | --- |
| 2026-05-04 | Implementation language | Python, with a single distributable binary via Nuitka |
| 2026-05-04 | AI backend API | OpenAI-compatible API endpoint (local Ollama by default) |
| 2026-05-04 | Default model | gemma3:4b |
| 2026-05-04 | SSH auth methods | Keypair only (ed25519/RSA); auto-accept new hosts; reject on key change (MITM) |
| 2026-05-04 | Bastion host support | --jump-host flag via SSH native ProxyJump |
| 2026-05-04 | SSH config behavior | Use ~/.ssh/config by default; allow override via --ignore-ssh-config |
| 2026-05-04 | CLI vs interactive mode | Interactive: REPL for v0.1, textual TUI for v0.2+ |
| 2026-05-04 | RAG embedding model | nomic-embed-text via Ollama (local, air-gapped safe) |
| 2026-05-04 | RAG vector store (Tier 1) | In-memory numpy cosine similarity (zero deps, session-scoped) |
| 2026-05-04 | RAG vector store (Tier 2/3) | chromadb embedded mode (default) or qdrant self-hosted |
| 2026-05-04 | RAG chunking unit | Command-boundary splitting: each collected command yields one or more chunks |
| 2026-05-04 | Runbook format | Markdown with YAML frontmatter, version-controlled in runbooks/ directory |

End-State UX Goal

After the current CLI and memory roadmap phases are stable, the long-term UX goal is a full-screen terminal TUI with an ncurses-style workflow.

Target End-State

  • Split-pane troubleshooting workspace (diagnostics, AI output, and command/input area)
  • Live command/probe status with clear success/failure indicators
  • In-session history browser for prior questions, retrieved evidence, and related past sessions
  • Keyboard-first navigation for operators running in SSH-only environments

Delivery Approach

  • Keep shipping incremental CLI features first (current roadmap order remains unchanged)
  • Promote stable workflows into TUI panels once behavior is proven in CLI mode
  • Treat the TUI as a final UX consolidation milestone, not a blocker for core troubleshooting capabilities

Container Distribution Goal (Docker)

After core CLI/TUI workflows stabilize, provide an official Docker image as an additional distribution target.

Container Execution Model (Decision)

  • Docker is a one-shot invocation target, not a daemon/service mode
  • Each run executes a single tai command and exits
  • State is persisted only through mounted host volumes

Why Docker Is Valuable Here

  • Reproducible runtime: pin Python and dependency versions to remove host-level drift
  • Faster operator onboarding: run with one command instead of local Python setup
  • Cleaner CI/CD release path: publish versioned images aligned with git tags
  • Safer local footprint: isolate dependencies from the host OS package manager

Subgoals

  1. Base image and runtime hardening
    • Multi-stage Dockerfile with slim runtime image
    • Non-root runtime user and minimal filesystem permissions
    • Healthcheck for CLI startup and version command
  2. Runtime integration for SSH workflows
    • Documented mounts for ~/.ssh (read-only where possible) and known-hosts handling
    • Pass-through for SSH config when needed (--ignore-ssh-config behavior documented)
    • Clear guidance for jump-host and bastion scenarios from inside the container
    • Documented one-shot run examples for tai run and tai history
  3. Persistent data strategy
    • Required volume mount guidance for runbook store (~/.tai/runbooks)
    • Required volume mount guidance for session memory/history (~/.tai/sessions)
    • Optional bind mount for JSONL logs and report export artifacts
    • Clear defaults for container paths and equivalent host path mappings
  4. Release and quality gates
    • Build and publish image on tagged releases
    • Smoke tests in CI: probe mode, collect mode, and history command against mocked endpoints
    • Version labeling (image tags and OCI metadata) tied to changelog/release tags

Data Retention and Lifecycle Policy

Retention behavior must be explicit and configurable at runtime. Defaults should be conservative and documented.

  1. Retention classes
    • Session memory store (~/.tai/sessions): keep semantically indexed summaries for troubleshooting continuity
    • Runbook store (~/.tai/runbooks): retain until explicitly replaced or pruned by sync policy
    • JSONL logs and exported reports: operator-controlled retention with optional TTL cleanup
  2. Retention controls
    • Add CLI controls for age-based pruning (for example --retain-days on cleanup commands)
    • Add host-scoped cleanup (delete history for one host) and full-store cleanup (all hosts)
    • Add dry-run cleanup mode to show what would be deleted before applying changes
  3. No-persist mode
    • Add a documented ephemeral mode where no session memory or logs are written
    • Ensure one-shot diagnostics can run in read-only operational contexts
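
The retention controls (age-based pruning, host scoping, and dry-run) could fit together as in the following sketch. The session record layout (`id`, `host`, `ended_at`) and the function names are assumptions for illustration, not the actual store schema.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

Session = Dict[str, object]  # assumed keys: "id", "host", "ended_at" (datetime)


def select_expired(sessions: List[Session], retain_days: int, now: datetime,
                   host: Optional[str] = None) -> List[str]:
    """Return ids of sessions older than retain_days, optionally for one host only."""
    cutoff = now - timedelta(days=retain_days)
    return [
        str(s["id"]) for s in sessions
        if s["ended_at"] < cutoff and (host is None or s["host"] == host)
    ]


def prune(sessions: List[Session], retain_days: int, now: datetime,
          host: Optional[str] = None, dry_run: bool = True) -> Tuple[List[Session], List[str]]:
    """Return (remaining sessions, ids selected for deletion).

    With dry_run=True nothing is removed; the ids only show what would go.
    """
    expired = set(select_expired(sessions, retain_days, now, host))
    if dry_run:
        return sessions, sorted(expired)
    kept = [s for s in sessions if str(s["id"]) not in expired]
    return kept, sorted(expired)
```

Defaulting `dry_run` to `True` matches the conservative stance described above: deletion has to be requested explicitly.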

Configuration and State Persistence Model

Configuration and retained state should be predictable across container upgrades and host environments.

  1. Mount and path contract
    • Define canonical container paths for ~/.tai/runbooks, ~/.tai/sessions, and optional log/export paths
    • Document required versus optional mounts and expected permissions for each
    • Document UID/GID mapping guidance to prevent host volume ownership issues
  2. Schema and compatibility
    • Introduce explicit storage schema version metadata for persistent stores
    • Define upgrade behavior for older stores (migrate, re-index, or fail with clear guidance)
    • Add compatibility notes for image upgrades and rollback expectations
  3. Backup and recovery
    • Provide export/import workflows for session memory and runbook indexes
    • Document minimal backup set and restore order for disaster recovery
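
The schema-version check could reduce to a small decision function like the one below. The version number, the action names, and the choice to re-index stores that carry no version metadata are all assumptions; the roadmap leaves the actual upgrade policy open.

```python
from typing import Optional

CURRENT_SCHEMA_VERSION = 2  # hypothetical value for illustration


def upgrade_action(stored_version: Optional[int]) -> str:
    """Decide what to do with a persistent store based on its schema version."""
    if stored_version is None:
        return "reindex"   # no version metadata: rebuild from source (assumed policy)
    if stored_version == CURRENT_SCHEMA_VERSION:
        return "ok"
    if stored_version < CURRENT_SCHEMA_VERSION:
        return "migrate"   # older store: run migrations forward
    return "fail"          # store written by a newer tai: refuse with guidance
```

Failing on a newer-than-expected store is the case that matters for image rollbacks, where an older binary meets data written by a newer one.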

Security and Privacy for Retained Data

Persisted troubleshooting evidence can include sensitive operational data and must be handled accordingly.

  1. Data minimization
    • Add optional redaction hooks for common sensitive patterns before persistence
    • Keep prompt-only transient data separate from persisted summary/index content
  2. Runtime hardening
    • Target non-root container execution with read-only root filesystem by default
    • Require explicit writable mounts only for retained data locations
  3. Auditable behavior
    • Log retention-affecting operations (cleanup, purge, export/import) with timestamps and scope
    • Define stable exit codes for cleanup and retention workflows to support automation
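
A redaction hook applied before persistence might be sketched as follows. The two patterns are illustrative assumptions and far from exhaustive; real deployments would need patterns tuned to their own environments.

```python
import re

# Illustrative patterns only; not a complete secret-detection ruleset.
REDACTION_PATTERNS = [
    # PEM-style private key blocks, removed as a whole.
    (
        re.compile(
            r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
            re.DOTALL,
        ),
        "[REDACTED PRIVATE KEY]",
    ),
    # key=value or key: value pairs with a sensitive-looking key name.
    (
        re.compile(r"(?i)\b(password|passwd|secret|token|api[_-]?key)\s*[=:]\s*\S+"),
        r"\1=[REDACTED]",
    ),
]


def redact(text: str) -> str:
    """Apply each redaction pattern in order and return the scrubbed text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Running the hook on session summaries just before they are embedded and stored keeps secrets out of both the vector store and any later exports.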

Kubernetes Position

Kubernetes is out of scope for this delivery plan.

  • tai is currently an operator-invoked troubleshooting client, not a long-running service
  • AI inference is external to tai (OpenAI-compatible endpoint), reducing the need for in-cluster model orchestration
  • SSH key/config handling and per-operator context are simpler with local or single-container execution

Kubernetes can be revisited only if tai evolves into a centralized multi-user service with queueing, RBAC, and shared tenancy requirements.


Final Long-Term Goal: Full Rust Migration

This is a final-stage roadmap goal and remains explicitly out of near-term scope. It should begin only after the Python implementation, TUI direction, Docker one-shot model, and retention/persistence policies are stable and proven in production usage.

Why This Is the Final Goal

  • Improve execution latency and startup speed for both native runs and container one-shot invocations
  • Produce a single, portable native binary with minimal runtime dependency footprint
  • Strengthen reliability and memory safety under heavy log parsing and concurrent workflows
  • Simplify long-term packaging and distribution across Linux targets

Migration Objectives

  1. Preserve feature parity first
    • Match existing CLI behavior, interactive workflows, RAG integration, runbook management, and history/session-memory features
    • Keep command semantics and safety boundaries equivalent during transition
  2. Target both distribution modes
    • Native Rust binary for direct operator use
    • Docker image built around the Rust binary for one-shot execution with mounted persistent volumes
  3. Keep compatibility guardrails
    • Define persistent data format compatibility or migration tooling for runbook/session stores
    • Preserve operator-visible flags where practical to reduce migration friction

Suggested Delivery Phases

  1. Build baseline Rust CLI scaffold with feature-flagged parity checkpoints
  2. Port SSH execution and read-only policy enforcement modules
  3. Port planner, collectors, prompt composition, and AI client adapters
  4. Port session memory/history and runbook workflows with migration tests
  5. Port interactive UX/TUI layer and deprecate Python runtime path

Rust Toolchain End-State

  • Standardize on Cargo-based build/test/lint pipeline (cargo fmt, cargo clippy, cargo test)
  • Add release profile optimization and reproducible build settings
  • Publish signed native artifacts and Docker images derived from Rust release binaries

Decision Gate Before Starting

Begin Rust migration only when:

  • Python roadmap milestones are complete and stable
  • Container distribution and retention policy workflows are operationally validated
  • A parity test matrix exists to prove behavior equivalence during migration