feat: complete RAG runbook workflow and release docs

2026-05-06 04:48:41 +02:00
parent 450de24d28
commit 57f4c0efaa
26 changed files with 2510 additions and 137 deletions
--- a/runbooks/apparmor.md
+++ b/runbooks/apparmor.md
@@ -0,0 +1,86 @@
+---
+service: apparmor
+symptoms: permission denied despite correct unix permissions, apparmor deny logs, service blocked by profile, executable transition denied, path access denied, snap confinement issue, profile in complain mode
+tags: apparmor, security, profile, aa-status, audit, confinement, complain, enforce, snap
+---
+
+## Symptoms
+
+- Application gets `Permission denied` even though Unix permissions look correct
+- Service starts in complain mode but fails in enforce mode
+- Log shows AppArmor `DENIED` entries
+- Binary works when profile is disabled but fails when confinement is enabled
+- Snap or packaged app cannot access expected files or sockets
+
+## Diagnostics
+
+### Check AppArmor status and loaded profiles
+
+```
+aa-status
+systemctl status apparmor
+```
+
+Confirm whether the profile is loaded and whether it is in enforce or complain mode.
+
+### Check denial logs
+
+```
+journalctl -k | grep -i apparmor
+journalctl -b | grep -i DENIED
+dmesg | grep -i apparmor
+```
+
+AppArmor denials usually identify the profile, operation, and path that was blocked.
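+
+A minimal sketch for summarizing denials, assuming the kernel audit format in which `operation=`, `profile=`, and `name=` appear in that order. `aa_denials` is a hypothetical helper name, and denials without a `name=` field (for example capability denials) are skipped:
+
+```shell
+# Hypothetical helper: read log lines on stdin and print
+# "<profile> <operation> <path>" for each AppArmor DENIED entry.
+aa_denials() {
+  grep 'apparmor="DENIED"' |
+    sed -n 's/.*operation="\([^"]*\)".*profile="\([^"]*\)".*name="\([^"]*\)".*/\2 \1 \3/p'
+}
+```
+
+Usage: `journalctl -k | aa_denials | sort | uniq -c | sort -rn` ranks the most frequent denials.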
+
+### Inspect the active profile
+
+```
+find /etc/apparmor.d -maxdepth 2 -type f | sort
+cat /etc/apparmor.d/<profile>
+```
+
+Look for missing file path rules, capability rules, and `ix`/`px` execution transitions.
+
+### Check complain vs enforce mode
+
+```
+aa-status | grep complain
+```
+
+If the issue occurs only in enforce mode, the profile is most likely too restrictive rather than the application being broken.
+
+### Check profile parser and reload
+
+```
+apparmor_parser -r /etc/apparmor.d/<profile>
+aa-status
+```
+
+Syntax or include errors can prevent an updated profile from loading.
+
+## Remediation
+
+**Profile too restrictive:**
+Add the missing path, capability, or network rule to the profile, then reload AppArmor.
+
+If the denial pattern is repetitive, use AppArmor tooling to review and refine the profile instead of disabling confinement globally.
+
+**Need to observe without blocking:**
+Temporarily switch the profile to complain mode:
+```
+aa-complain /etc/apparmor.d/<profile>
+```
+
+**Return to enforcement after fixing rules:**
+```
+aa-enforce /etc/apparmor.d/<profile>
+```
+
+**Profile reload after changes:**
+```
+apparmor_parser -r /etc/apparmor.d/<profile>
+systemctl reload apparmor
+```
+
+Do not disable AppArmor globally when the issue is isolated to a single profile.
--- a/runbooks/disk.md
+++ b/runbooks/disk.md
@@ -0,0 +1,106 @@
+---
+service: disk
+symptoms: no space left on device, disk full, inode exhaustion, df shows 100%, du large files, write failed, cannot create file, filesystem read-only, ext4 error
+tags: disk, filesystem, storage, inodes, df, du, ext4, xfs, lvm, partition, full, space
+---
+
+## Symptoms
+
+- `No space left on device` — disk or inode exhaustion
+- `df -h` shows a filesystem at 100% (or near 100%)
+- `df -i` shows inode usage at 100% — file count exhausted even if byte space is free
+- Filesystem remounted read-only — kernel detected errors and protected itself
+- Services failing to write logs, create temp files, or open sockets
+
+## Diagnostics
+
+### Overall disk usage
+
+```
+df -h
+df -i
+```
+
+`df -h` shows byte space; `df -i` shows inode usage. Both can be independently exhausted.
+Note which filesystem is full (`/`, `/var`, `/tmp`, `/home`, etc.).
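+
+A minimal sketch for flagging nearly full filesystems, assuming POSIX-format `df -P` output (capacity in column 5, mount point in column 6) and mount points without spaces. `full_filesystems` is a hypothetical helper name and the default threshold of 90 is an arbitrary choice:
+
+```shell
+# Hypothetical helper: read `df -P` or `df -Pi` output on stdin and print
+# "<mountpoint> <use%>" for entries at or above the threshold (default 90).
+full_filesystems() {
+  awk -v limit="${1:-90}" 'NR > 1 {
+    pct = $5; sub(/%/, "", pct)          # strip the trailing percent sign
+    if (pct + 0 >= limit) print $6, $5
+  }'
+}
+```
+
+Usage: `df -P | full_filesystems 90` for byte space, `df -Pi | full_filesystems 90` for inodes.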
+
+### Find the large directories
+
+```
+du -sh /* 2>/dev/null | sort -rh | head -20
+du -sh /var/* 2>/dev/null | sort -rh | head -20
+du -sh /var/log/* 2>/dev/null | sort -rh | head -20
+```
+
+### Find large individual files
+
+```
+find / -xdev -type f -size +100M -printf '%s\t%p\n' 2>/dev/null | sort -rn | head -20
+find /var/log -type f -size +50M 2>/dev/null
+```
+
+### Find deleted-but-open files holding space
+
+```
+lsof +L1 2>/dev/null | grep -v "^COMMAND"
+```
+
+Files deleted while a process still has them open do not free space until the process releases the file descriptor.
+
+### Inode exhaustion — find directories with many small files
+
+```
+find / -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -20
+```
+
+### Filesystem errors (after a crash or read-only remount)
+
+```
+dmesg | grep -i 'ext4\|xfs\|btrfs\|error\|corrupt'
+journalctl -k | grep -i 'filesystem\|disk\|io error'
+```
+
+### LVM / partition layout
+
+```
+lsblk
+pvs
+vgs
+lvs
+```
+
+## Remediation
+
+**Large log files — truncate safely (do NOT rm while in use):**
+```
+truncate -s 0 /var/log/<logfile>
+```
+Or configure log rotation in `/etc/logrotate.d/`.
+
+**Old journal logs eating space:**
+```
+journalctl --disk-usage
+journalctl --vacuum-size=500M
+journalctl --vacuum-time=30d
+```
+
+**Deleted-but-open files — restart the holding process to release space:**
+Identify the PID from `lsof +L1`, then:
+```
+systemctl restart <service>
+```
+
+**Inode exhaustion — remove many small files:**
+Common culprits: PHP session files in `/var/lib/php/sessions/`, old apt cache, tmp dirs.
+```
+find /var/lib/php/sessions -type f -mtime +7 -delete
+apt-get clean
+find /tmp -type f -mtime +3 -delete
+```
+
+**Extend LVM volume (if free extents exist in the volume group):**
+```
+lvextend -l +100%FREE /dev/<vg>/<lv>
+resize2fs /dev/<vg>/<lv>      # ext4
+xfs_growfs /mountpoint         # xfs
+```
--- a/runbooks/docker.md
+++ b/runbooks/docker.md
@@ -0,0 +1,120 @@
+---
+service: docker
+symptoms: cannot connect to docker daemon, docker daemon failed to start, docker socket permission denied, containers cannot resolve dns, docker network broken, daemon.json conflict, docker oom, unable to remove filesystem
+tags: docker, dockerd, containerd, container, daemon, daemon.json, cgroup, dns, docker0, socket, compose
+---
+
+## Symptoms
+
+- `Cannot connect to the Docker daemon. Is the docker daemon running on this host?`
+- `permission denied` on `/var/run/docker.sock`
+- `dockerd` fails to start after a `daemon.json` change
+- Containers cannot resolve DNS or pull images
+- Docker bridge/network disappears or container networking breaks after boot
+- Container or daemon is killed by the kernel OOM killer
+- `Error: Unable to remove filesystem` when removing a container
+
+## Diagnostics
+
+### Check daemon health and client target
+
+```
+docker info
+systemctl is-active docker
+systemctl status docker
+ps -ef | grep dockerd
+env | grep DOCKER_HOST
+```
+
+If `DOCKER_HOST` is set incorrectly, the CLI may be talking to the wrong daemon.
+
+### Check daemon logs and startup failures
+
+```
+journalctl -u docker -n 200
+journalctl -u containerd -n 100
+cat /etc/docker/daemon.json
+systemctl cat docker
+```
+
+Look for conflicts between `daemon.json` keys and systemd startup flags, especially duplicate `hosts` settings.
+
+### Check socket permissions and group access
+
+```
+ls -la /var/run/docker.sock
+id
+getent group docker
+ls -la ~/.docker/
+```
+
+If the user was added to the `docker` group recently, a new login shell may be required.
+
+### Check kernel, cgroups, and memory pressure
+
+```
+uname -r
+free -h
+dmesg | grep -i -E 'docker|cgroup|oom|killed process'
+```
+
+Low memory, missing kernel features, or cgroup issues can stop containers or the daemon.
+
+### Check Docker networking and DNS
+
+```
+docker network ls
+ip addr show docker0
+sysctl net.ipv4.ip_forward
+cat /etc/resolv.conf
+ps aux | grep dnsmasq
+```
+
+Loopback DNS resolvers in `/etc/resolv.conf` often break container DNS unless Docker is given explicit nameservers.
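+
+A minimal sketch for spotting loopback resolvers such as the `127.0.0.53` stub used by systemd-resolved. `loopback_nameservers` is a hypothetical helper name; it assumes the standard `nameserver <address>` line format:
+
+```shell
+# Hypothetical helper: print loopback nameserver addresses found in a
+# resolv.conf-style file (defaults to /etc/resolv.conf).
+loopback_nameservers() {
+  awk '$1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2 }' \
+    "${1:-/etc/resolv.conf}"
+}
+```
+
+Any output means containers may inherit a resolver they cannot reach from their own network namespace.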
+
+### Check storage and stuck mounts
+
+```
+df -h /var/lib/docker
+docker system df
+lsof /var/lib/docker
+```
+
+Bind-mounting `/var/lib/docker` into other containers can keep container filesystems busy and block removal.
+
+## Remediation
+
+**Daemon not running or client aimed at the wrong host:**
+Unset an incorrect `DOCKER_HOST`, then restart the daemon:
+```
+unset DOCKER_HOST
+systemctl restart docker
+```
+
+**`daemon.json` conflicts with systemd flags:**
+Remove duplicate settings or create a systemd override so `dockerd` is started without conflicting flags.
+
+**Permission denied on Docker socket:**
+Add the user to the `docker` group, then log out and back in (or run `newgrp docker` to apply the group in the current shell only):
+```
+sudo usermod -aG docker "$USER"
+newgrp docker
+```
+
+If `~/.docker/` was created by `sudo`, fix ownership:
+```
+sudo chown "$USER":"$USER" "$HOME/.docker" -R
+sudo chmod g+rwx "$HOME/.docker" -R
+```
+
+**Container DNS broken:**
+Configure explicit DNS servers in `/etc/docker/daemon.json`, then restart Docker.
+
+**Docker networking disappears after boot:**
+Stop the host network manager from managing Docker interfaces and confirm `net.ipv4.ip_forward=1`.
+
+**OOM kills:**
+Treat this as host memory pressure first; reduce workload, add memory, or enforce container memory limits.
+
+**Unable to remove filesystem:**
+Find the process holding the path open with `lsof`, then stop that process or the container bind-mounting `/var/lib/docker`.
--- a/runbooks/kernel.md
+++ b/runbooks/kernel.md
@@ -0,0 +1,117 @@
+---
+service: kernel
+symptoms: OOM kill, out of memory, high load average, kernel panic, segfault, soft lockup, CPU steal, system unresponsive, zombie processes, NMI watchdog
+tags: kernel, oom, memory, load, cpu, panic, dmesg, segfault, lockup, swap, zombie
+---
+
+## Symptoms
+
+- `Out of memory: Kill process <pid>` in dmesg — OOM killer fired
+- Load average far above CPU count — system overloaded or I/O blocked
+- `kernel: BUG: soft lockup` — CPU stuck in kernel code
+- `segfault at ...` in dmesg — process crashed due to invalid memory access
+- `kernel panic` — unrecoverable kernel error (visible only on console or serial)
+- Many zombie (`Z`) processes in `ps` output
+- High `%steal` in `top`/`vmstat` — hypervisor CPU contention
+
+## Diagnostics
+
+### Recent kernel messages
+
+```
+dmesg -T | tail -100
+dmesg -T | grep -iE 'error|warn|oom|kill|panic|oops|fault|hung|lockup'
+journalctl -k -n 200
+```
+
+### OOM events
+
+```
+dmesg -T | grep -i 'out of memory\|oom_kill\|killed process'
+```
+
+The log shows which process was killed, its RSS at time of kill, and available memory.
+
+### Memory usage
+
+```
+free -h
+cat /proc/meminfo | head -30
+vmstat -s
+```
+
+`MemAvailable` is the key metric. If it is near zero and swap is also exhausted, OOM kills are imminent.
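+
+A minimal sketch for turning `MemAvailable` into a percentage, assuming a kernel that exposes the `MemAvailable` field in `/proc/meminfo`. `mem_available_pct` is a hypothetical helper name:
+
+```shell
+# Hypothetical helper: print MemAvailable as a whole-number percentage of
+# MemTotal, reading a meminfo-style file (defaults to /proc/meminfo).
+mem_available_pct() {
+  awk '/^MemTotal:/     { total = $2 }
+       /^MemAvailable:/ { avail = $2 }
+       END { if (total > 0) printf "%d\n", avail * 100 / total }' \
+    "${1:-/proc/meminfo}"
+}
+```
+
+A single-digit result together with exhausted swap matches the "OOM kills are imminent" condition described above.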
+
+### Swap
+
+```
+swapon --show
+cat /proc/swaps
+vmstat 1 5
+```
+
+High `si`/`so` (swap-in/swap-out) in `vmstat` indicates active swapping and likely memory pressure.
+
+### Load average and CPU
+
+```
+uptime
+top -b -n1 | head -30
+mpstat -P ALL 1 3
+```
+
+Load average above 2× CPU count sustained over 15 minutes is concerning.
+High `%iowait` indicates processes blocked on disk I/O, not CPU-bound load.
+
+### Process memory usage
+
+```
+ps aux --sort=-%mem | head -20
+ps aux --sort=-%cpu | head -20
+```
+
+### Zombie processes
+
+```
+ps aux | awk '$8 ~ /^Z/'
+```
+
+Zombies cannot be killed; the parent must `wait()` for them, or the parent must exit so that `init` (PID 1) adopts and reaps them.
+
+### I/O wait and disk health
+
+```
+iostat -x 1 3
+dmesg -T | grep -iE 'i/o error|hard resetting link|ata.*error|blk_update_request'
+```
+
+Persistent I/O errors alongside high load suggest failing storage.
+
+## Remediation
+
+**Memory pressure / frequent OOM kills:**
+Identify the largest memory consumers from `ps aux --sort=-%mem`.
+Consider increasing swap, adding RAM, tuning `vm.overcommit_memory`, or scaling the workload.
+Do NOT just raise `vm.overcommit_ratio` without understanding the root consumer.
+
+**Adjust OOM killer scoring for critical services (temporary, resets on reboot):**
+```
+echo -17 > /proc/<pid>/oom_adj        # legacy
+echo -1000 > /proc/<pid>/oom_score_adj  # current kernels
+```
+
+**Swap exhausted — add a swapfile:**
+```
+fallocate -l 2G /swapfile
+chmod 600 /swapfile
+mkswap /swapfile
+swapon /swapfile
+```
+
+**High I/O wait — find the I/O-heavy process:**
+```
+iotop -a -o -b -n3
+```
+
+**Zombie reaping — if parent is stuck:**
+Kill the parent process; its zombie children are then re-parented to `init` (PID 1), which reaps them. Verify that the zombies disappear.
--- a/runbooks/nginx.md
+++ b/runbooks/nginx.md
@@ -0,0 +1,99 @@
+---
+service: nginx
+symptoms: 502 Bad Gateway, 504 Gateway Timeout, upstream connection refused, nginx not starting, failed to bind socket, permission denied reading config, configuration test failed
+tags: nginx, web, http, https, proxy, upstream, reverse-proxy, load-balancer
+---
+
+## Symptoms
+
+- `502 Bad Gateway` — nginx reached the upstream but got an invalid response, or upstream is down
+- `504 Gateway Timeout` — upstream took too long to respond
+- `111: Connection refused` in nginx error log — upstream process is not running or not on the expected port
+- `nginx.service: Start request repeated too quickly` — crash-loop; check error log
+- `[emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)` — port conflict
+- `[emerg] open() ... failed (13: Permission denied)` — file permission issue
+
+## Diagnostics
+
+### Service status
+
+```
+systemctl status nginx
+```
+
+### Config test
+
+```
+nginx -t
+```
+
+A config error is the most common reason for nginx failing to start or reload.
+
+### Error log
+
+```
+journalctl -u nginx -n 100
+tail -n 100 /var/log/nginx/error.log
+```
+
+For 502/504 errors look for: `connect() failed`, `upstream timed out`, `no live upstreams`.
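+
+A minimal sketch for counting those three messages, assuming the default error log path. `upstream_errors` is a hypothetical helper name:
+
+```shell
+# Hypothetical helper: count each upstream failure message in an nginx
+# error log (defaults to /var/log/nginx/error.log), most frequent first.
+upstream_errors() {
+  grep -oE 'connect\(\) failed|upstream timed out|no live upstreams' \
+    "${1:-/var/log/nginx/error.log}" |
+    sort | uniq -c | sort -rn
+}
+```
+
+A dominant `connect() failed` count points at a stopped upstream, while `upstream timed out` points at a slow one.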
+
+### Access log — recent requests
+
+```
+tail -n 50 /var/log/nginx/access.log
+```
+
+### Check upstream services
+
+For `proxy_pass` targets, verify the upstream is running:
+```
+systemctl status <upstream-service>
+ss -tlnp | grep <upstream-port>
+```
+
+Common upstreams: `gunicorn`, `uwsgi`, `node`, `puma`, `php-fpm`.
+
+### Port binding conflicts
+
+```
+ss -tlnp | grep ':80\|:443'
+```
+
+### Config files
+
+```
+cat /etc/nginx/nginx.conf
+ls /etc/nginx/sites-enabled/
+cat /etc/nginx/sites-enabled/<vhost>
+```
+
+Check `proxy_pass`, `upstream` blocks, `proxy_connect_timeout`, `proxy_read_timeout`.
+
+## Remediation
+
+**Upstream service not running:**
+Start the upstream service, then verify nginx resumes proxying.
+
+**Config syntax error:**
+Fix the error shown by `nginx -t`, then reload (use `systemctl start nginx` instead if nginx failed to start):
+```
+systemctl reload nginx
+```
+
+**Port already in use:**
+Find the conflicting process with `ss -tlnp | grep :80`, stop it, then restart nginx.
+
+**Upstream timeouts — increase timeouts (caution: treat the slow upstream as the root cause):**
+```nginx
+proxy_connect_timeout 10s;
+proxy_read_timeout 60s;
+proxy_send_timeout 60s;
+```
+
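+**Permission denied on log or socket file:**
+Check ownership, then restore it for the account nginx runs as (`www-data` on Debian/Ubuntu, typically `nginx` on RHEL-family systems):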
+```
+ls -la /var/log/nginx/
+ls -la /run/nginx.pid
+chown -R www-data:www-data /var/log/nginx/
+```
--- a/runbooks/postgres.md
+++ b/runbooks/postgres.md
@@ -0,0 +1,107 @@
+---
+service: postgres
+symptoms: connection refused port 5432, FATAL password authentication failed, replication lag, disk full, out of shared memory, too many connections, relation does not exist, could not connect to the primary
+tags: postgres, postgresql, database, replication, pg, psql, disk, connections
+---
+
+## Symptoms
+
+- `could not connect to server: Connection refused` — postgres not running or not on port 5432
+- `FATAL:  password authentication failed for user "<user>"` — wrong credentials or pg_hba mismatch
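+- `FATAL:  sorry, too many clients already` — `max_connections` reached
+- `ERROR:  could not resize shared memory segment` — `/dev/shm` too small for dynamic shared memory (common in containers)
+- `ERROR:  out of shared memory` — lock table exhausted; check `max_locks_per_transaction`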
+- `PANIC:  could not write to file "pg_wal/..."` — disk full on WAL directory
+- Replication lag growing — standby falling behind primary
+- `FATAL:  could not connect to the primary server` — standby cannot reach primary
+
+## Diagnostics
+
+### Service status
+
+```
+systemctl status postgresql
+systemctl status postgresql@<version>-main
+```
+
+### PostgreSQL logs
+
+```
+journalctl -u postgresql -n 100
+tail -n 100 /var/log/postgresql/postgresql-*.log
+```
+
+### Is postgres listening?
+
+```
+ss -tlnp | grep 5432
+```
+
+### Disk space (WAL and data directory are the critical paths)
+
+```
+df -h
+du -sh /var/lib/postgresql/
+du -sh /var/lib/postgresql/*/main/pg_wal/
+```
+
+A full disk on the pg_wal partition causes a PANIC and hard crash.
+
+### Connection count
+
+```sql
+SELECT count(*), state FROM pg_stat_activity GROUP BY state;
+SELECT setting FROM pg_settings WHERE name = 'max_connections';
+```
+
+### Replication lag (run on primary)
+
+```sql
+SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
+       (sent_lsn - replay_lsn) AS lag_bytes
+FROM pg_stat_replication;
+```
+
+### pg_hba.conf — authentication rules
+
+```
+cat /etc/postgresql/*/main/pg_hba.conf
+```
+
+Entries are matched top-to-bottom and the first match wins. A `reject` rule or a missing entry for the client IP causes authentication failure even with correct credentials.
+
+### Shared memory / kernel settings
+
+```
+cat /proc/sys/kernel/shmmax
+cat /etc/postgresql/*/main/postgresql.conf | grep shared_buffers
+```
+
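+A common starting point for `shared_buffers` is about 25% of RAM; values above roughly 40% rarely help. Since PostgreSQL 9.3 the main shared memory area is allocated with `mmap`, so kernel `shmmax` matters mainly on older versions.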
+
+## Remediation
+
+**Postgres not running:**
+```
+systemctl start postgresql
+```
+Check logs immediately after start for the failure reason.
+
+**Authentication failure (pg_hba mismatch):**
+Add or update the correct entry in `pg_hba.conf`, then reload:
+```
+systemctl reload postgresql
+```
+
+**Too many connections — increase limit (requires restart):**
+In `postgresql.conf`:
+```
+max_connections = 200
+```
+Or deploy a connection pooler (`pgbouncer`).
+
+**Disk full on WAL:**
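+WAL accumulates when archiving fails or a replication slot is inactive: check the log for `archive_command` errors and `pg_replication_slots` for stale slots. Free space elsewhere on the volume first (old base backups, logs).
+Do NOT delete files under `/var/lib/postgresql/*/main/pg_wal/` directly; let archiving or the standby catch up, and use `pg_archivecleanup` only on the WAL archive directory.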
+
+**Replication lag — standby too far behind:**
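+Check network bandwidth and I/O on the standby. If the primary logs `terminating walsender process due to replication timeout`, raise `wal_sender_timeout` temporarily; this keeps the connection alive but does not reduce the lag itself.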
--- a/runbooks/selinux.md
+++ b/runbooks/selinux.md
@@ -0,0 +1,112 @@
+---
+service: selinux
+symptoms: permission denied despite correct unix permissions, service blocked by selinux, avc denied, file context mismatch, port binding denied, boolean missing, domain transition failure
+tags: selinux, avc, enforcing, security, policy, restorecon, audit, sealert, semanage
+---
+
+## Symptoms
+
+- Service gets `Permission denied` even though file ownership and mode look correct
+- Process cannot bind to a port or open a file after a config change
+- AVC denials appear in audit logs
+- App works when SELinux is permissive but fails in enforcing mode
+- Newly created files under custom paths are inaccessible to a confined service
+
+## Diagnostics
+
+### Confirm SELinux mode and policy
+
+```
+getenforce
+sestatus
+cat /etc/selinux/config
+```
+
+If SELinux is `Permissive`, denials are logged but not enforced.
+
+### Check AVC denials
+
+```
+auditctl -s
+ausearch -m AVC,USER_AVC,SELINUX_ERR,USER_SELINUX_ERR -ts recent
+journalctl -t setroubleshoot -n 50
+dmesg | grep -i -e type=1300 -e type=1400
+```
+
+AVC denials are the primary source of truth for SELinux policy failures.
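+
+A minimal sketch for condensing raw AVC records, assuming the usual audit layout in which `scontext=`, `tcontext=`, and `tclass=` follow each other. `avc_summary` is a hypothetical helper name:
+
+```shell
+# Hypothetical helper: read audit records on stdin and print
+# "<source context> -> <target context> [<class>: <permissions>]".
+avc_summary() {
+  sed -n 's/.*denied  *{ \([^}]*\) }.* scontext=\([^ ]*\) tcontext=\([^ ]*\) tclass=\([^ ]*\).*/\2 -> \3 [\4: \1]/p'
+}
+```
+
+Usage: `ausearch -m AVC -ts recent | avc_summary | sort | uniq -c` groups repeated denials by source and target type.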
+
+If AVCs are missing but SELinux still appears involved, temporarily disable `dontaudit` rules to expose hidden denials:
+```
+semodule -DB
+```
+Re-enable them after reproducing the issue:
+```
+semodule -B
+```
+
+### Inspect file contexts
+
+```
+ls -lZ /path/to/file
+ps -eZ | grep <service>
+matchpathcon -V /path/to/file
+```
+
+A service can have correct Unix permissions and still fail if the SELinux context is wrong.
+
+### Check port labeling and booleans
+
+```
+semanage port -l | grep <port>
+getsebool -a | grep <service-or-feature>
+semanage boolean -l | grep <service-or-feature>
+```
+
+Custom ports often require explicit SELinux port labels.
+
+### Check for relabeling needs
+
+```
+restorecon -nRv /path
+matchpathcon /path/to/file
+sealert -l "*"
+```
+
+`restorecon -n` shows what would change without modifying labels.
+
+`sealert` is often the fastest way to turn a raw AVC into a concrete fix, but treat `audit2allow` suggestions as a last resort, not a first response.
+
+## Remediation
+
+**Wrong file context:**
+Restore the default context:
+```
+restorecon -Rv /path
+```
+
+**Custom application path needs persistent labeling:**
+```
+semanage fcontext -a -t <type> '/custom/path(/.*)?'
+restorecon -Rv /custom/path
+```
+
+**Custom port binding denied:**
+Add the port label required by the service type:
+```
+semanage port -a -t <port_type> -p tcp <port>
+```
+
+**Boolean disabled:**
+Enable the needed boolean persistently:
+```
+setsebool -P <boolean_name> on
+```
+
+**Still unsure whether SELinux is the blocker:**
+Temporarily switch to permissive mode and reproduce the issue:
+```
+setenforce 0
+```
+If the problem still occurs, SELinux is not the root cause. Either way, switch back with `setenforce 1` once the test is done.
+
+Do not disable SELinux or generate custom policy modules as a first response. Fix labels, booleans, or port mappings first.
--- a/runbooks/ssh.md
+++ b/runbooks/ssh.md
@@ -0,0 +1,100 @@
+---
+service: ssh
+symptoms: connection refused, authentication failed, host key mismatch, permission denied, timeout connecting, no route to host
+tags: ssh, sshd, openssh, authentication, network, connectivity
+---
+
+## Symptoms
+
+- `ssh: connect to host <hostname> port 22: Connection refused`
+- `Permission denied (publickey)` — key not accepted or wrong user
+- `WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!` — host key mismatch
+- `Connection timed out` — firewall blocking or host unreachable
+- `No route to host` — routing issue or host is down
+
+## Diagnostics
+
+### Is sshd running?
+
+```
+systemctl status sshd
+systemctl status ssh
+```
+
+A stopped or failed sshd is the most common cause of "connection refused".
+
+### Check sshd configuration
+
+```
+sshd -t
+cat /etc/ssh/sshd_config
+```
+
+Look for: `PasswordAuthentication`, `PubkeyAuthentication yes`, `AuthorizedKeysFile`.
+
+### Check authorized keys
+
+```
+ls -la ~/.ssh/
+cat ~/.ssh/authorized_keys
+```
+
+Permissions must be: `~/.ssh` → `700`, `authorized_keys` → `600`.
+Wrong permissions make sshd (with the default `StrictModes yes`) ignore the key: the client only sees `Permission denied`, and the reason appears only in the server log.
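+
+A minimal sketch for checking both modes at once, assuming GNU `stat` (`stat -c %a`). `check_ssh_perms` is a hypothetical helper name, and it flags anything other than the exact modes listed above, which is stricter than what sshd itself requires:
+
+```shell
+# Hypothetical helper: print a suggested fix for each ~/.ssh entry whose
+# mode differs from 700 (directory) or 600 (authorized_keys).
+check_ssh_perms() {
+  dir="${1:-$HOME/.ssh}"
+  [ "$(stat -c %a "$dir")" = "700" ] || echo "fix: chmod 700 $dir"
+  [ "$(stat -c %a "$dir/authorized_keys")" = "600" ] || echo "fix: chmod 600 $dir/authorized_keys"
+}
+```
+
+No output means both modes already match.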
+
+### Check sshd logs
+
+```
+journalctl -u sshd -n 100
+journalctl -u ssh -n 100
+grep sshd /var/log/auth.log | tail -50
+```
+
+Look for: `Invalid user`, `Failed publickey`, `Connection reset by peer`, `Too many authentication failures`.
+
+### Check listening port
+
+```
+ss -tlnp | grep sshd
+netstat -tlnp | grep :22
+```
+
+If sshd is running but not listening on the expected port, check `Port` in `/etc/ssh/sshd_config`.
+
+### Firewall rules
+
+```
+iptables -L INPUT -n -v
+nft list ruleset
+ufw status verbose
+```
+
+A DROP rule on port 22 causes silent timeouts, not "connection refused".
+
+## Remediation
+
+**sshd not running:**
+```
+systemctl enable --now sshd    # unit is named ssh on Debian/Ubuntu
+```
+
+**Wrong permissions on authorized_keys:**
+```
+chmod 700 ~/.ssh
+chmod 600 ~/.ssh/authorized_keys
+chown -R $USER:$USER ~/.ssh
+```
+
+**sshd config error:**
+Fix the error reported by `sshd -t`, then:
+```
+systemctl restart sshd    # unit is named ssh on Debian/Ubuntu
+```
+
+**Host key mismatch (expected after reinstall/reprovisioning):**
+Remove the old key from the client:
+```
+ssh-keygen -R <hostname>
+```
+Only do this if you are certain the host was intentionally reprovisioned.
+If the key change is unexpected, treat as a potential MITM and investigate before connecting.
--- a/runbooks/sssd.md
+++ b/runbooks/sssd.md
@@ -0,0 +1,115 @@
+---
+service: sssd
+symptoms: login denied, user not found, id command hangs, sudo rules missing, ldap auth failure, kerberos failure, cache stale, offline authentication not working
+tags: sssd, ldap, kerberos, ad, identity, auth, pam, nss, sudo
+---
+
+## Symptoms
+
+- `id <user>` hangs or returns no such user for a domain account
+- SSH or console login fails for directory-backed users
+- Group membership is missing or incomplete
+- `sudo` rules from LDAP/AD do not appear
+- Authentication works intermittently or only after cache flush
+- Offline authentication fails when the directory is unreachable
+
+## Diagnostics
+
+### Check service health
+
+```
+systemctl status sssd
+sssctl domain-list
+sssctl config-check
+cat /etc/nsswitch.conf
+```
+
+A running daemon with a valid config and `sss` present in `nsswitch.conf` are the first prerequisites.
+
+### Check identity resolution
+
+```
+id <user>
+getent passwd <user>
+getent group <group>
+```
+
+If NSS lookups fail, the issue is often in SSSD configuration, connectivity, or cache.
+
+### Check SSSD logs
+
+```
+journalctl -u sssd -n 100
+ls -la /var/log/sssd/
+tail -n 100 /var/log/sssd/*.log
+sssctl logs-fetch
+```
+
+Look for: backend offline, LDAP bind failures, Kerberos errors, TLS problems, and access provider denials.
+
+If the issue is unclear, raise `debug_level=6` in the relevant `[nss]`, `[pam]`, and `[domain/<name>]` sections. Raising debug only in `[sssd]` is not enough for most real failures.
+
+### Check domain reachability
+
+```
+sssctl domain-status <domain>
+ping <ldap-or-ad-host>
+dig -t SRV _ldap._tcp.<domain>
+cat /etc/resolv.conf
+```
+
+If the identity provider is unreachable, SSSD may serve cached data only or fail entirely.
+
+### Check Kerberos and LDAP configuration
+
+```
+cat /etc/sssd/sssd.conf
+cat /etc/krb5.conf
+kinit <user>
+klist
+ldapsearch -ZZ -x -H ldap://<server> -b <base-dn>
+```
+
+Look for wrong realm names, bad server addresses, TLS settings, and access filters.
+
+For AD or IPA providers, Kerberos and DNS are often the real dependency chain: broken SRV lookup, keytab issues, or a slow KDC will surface as SSSD failures.
+
+### Check cache and permissions
+
+```
+ls -la /var/lib/sss/db/
+sssctl user-show <user>
+sssctl cache-expire -E    # invalidates all cached entries; not read-only
+```
+
+`/etc/sssd/sssd.conf` must usually be mode `600` or SSSD will refuse to start.
+
+Do not wipe cache files blindly on an offline system that depends on cached logins.
+
+## Remediation
+
+**Config syntax or permission issue:**
+Fix `sssd.conf`, set secure permissions, then restart:
+```
+chmod 600 /etc/sssd/sssd.conf
+systemctl restart sssd
+```
+
+**Stale cache:**
+Invalidate the cache (entries are marked expired, not deleted), then repopulate with a fresh lookup:
+```
+sss_cache -E
+id <user>
+```
+
+**Kerberos failure:**
+Validate time sync, realm, keytab credentials, and KDC reachability before changing LDAP settings.
+
+**Backend offline or `sdap_async_sys_connect request failed`:**
+Treat as DNS/network first. Validate SRV records and TLS handshake before increasing `ldap_network_timeout` or `ldap_search_timeout`.
+
+**Access denied despite successful lookup:**
+Check `access_provider`, LDAP filters, HBAC rules, or AD group-based access restrictions.
+
+**No `pam_sss` messages at all:**
+The PAM stack is likely misconfigured. Fix the PAM/authselect profile before changing SSSD itself.
--- a/runbooks/wayland.md
+++ b/runbooks/wayland.md
@@ -0,0 +1,89 @@
+---
+service: wayland
+symptoms: wayland session fails, gdm falls back to xorg, black screen on login, fractional scaling broken, screen sharing broken, remote desktop broken, wlroots crash, compositor crash
+tags: wayland, compositor, gnome, kde, mutter, wlroots, pipewire, xwayland, graphics
+---
+
+## Symptoms
+
+- User selects a Wayland session but is returned to login
+- GDM or another display manager falls back to Xorg
+- Screen sharing, remote desktop, or clipboard integration is broken
+- Apps requiring XWayland fail while native Wayland apps work
+- Fractional scaling or multi-monitor layout behaves incorrectly
+- Wayland compositor crashes after login
+
+## Diagnostics
+
+### Confirm the active session type
+
+```
+echo $XDG_SESSION_TYPE
+loginctl show-session $XDG_SESSION_ID -p Type
+echo $WAYLAND_DISPLAY
+```
+
+If the session type is `x11`, you are not debugging an active Wayland session.
+
+### Check display manager and compositor logs
+
+```
+systemctl status gdm
+journalctl -b | grep -iE 'wayland|mutter|kwin|wlroots|xwayland'
+journalctl -b | grep -i 'renderer for'
+```
+
+Look for compositor crashes, GPU driver incompatibilities, and forced Xorg fallback messages.
+
+### Check XWayland and PipeWire components
+
+```
+which Xwayland
+systemctl --user status pipewire
+systemctl --user status xdg-desktop-portal
+systemctl --user status xdg-desktop-portal-gnome
+systemctl --user status xdg-desktop-portal-kde
+xlsclients -l
+```
+
+Broken screen sharing is often a PipeWire or portal issue, not a compositor issue.
+
+`xlsclients -l` helps identify apps that are actually running under XWayland rather than native Wayland.
+
+### Check GPU compatibility
+
+```
+lspci -k | grep -A3 -E 'VGA|3D|Display'
+lsmod | grep -E 'nvidia|nouveau|amdgpu|i915'
+```
+
+Wayland support quality depends heavily on the GPU driver stack.
+
+### Check environment and session overrides
+
+```
+env | grep -E 'WAYLAND|XDG|GDK_BACKEND|QT_QPA_PLATFORM'
+cat /etc/gdm/custom.conf
+wayland-info
+```
+
+Environment overrides can force apps onto X11 or disable Wayland entirely.
+
+For NVIDIA systems, confirm the compositor is using a supported buffer path (GBM on current drivers is the expected default).
+
+## Remediation
+
+**Wayland disabled in display manager config:**
+Check `/etc/gdm/custom.conf` (`/etc/gdm3/custom.conf` on Debian/Ubuntu) for `WaylandEnable=false` or similar settings and remove the override if unintended.
+
+**Fallback to Xorg on unsupported GPU stack:**
+Upgrade or change the graphics driver; Wayland stability is often limited by the driver, not the compositor.
+
+**Screen sharing broken:**
+Fix PipeWire and `xdg-desktop-portal` services before changing compositor settings.
+
+**XWayland-only app failures:**
+Treat them separately from native Wayland issues; confirm `Xwayland` is installed and launching.
+
+**Remote desktop, VM, or game input grabbing is broken:**
+This is often a Wayland protocol/compositor support limitation, not a generic keyboard bug. Check compositor support for pointer constraints, relative pointer, and keyboard shortcut inhibit protocols.
--- a/runbooks/x2go.md
+++ b/runbooks/x2go.md
@@ -0,0 +1,106 @@
+---
+service: x2go
+symptoms: x2go session fails to start, x2go black screen, x2go disconnects immediately, no desktop in session, authentication failure, x2go agent not starting, sound forwarding broken
+tags: x2go, nx, remote-desktop, x2goserver, x2goclient, session, desktop, xauth
+---
+
+## Symptoms
+
+- X2Go login succeeds but the session immediately disconnects
+- Black screen after login
+- Session is created but no desktop appears
+- `x2goruncommand error` or `X2Go Agent got stuck in state`
+- Sound, clipboard, or drive sharing fails while login itself works
+- Authentication works over SSH but X2Go session startup fails
+
+## Diagnostics
+
+### Check X2Go services and packages
+
+```
+systemctl status x2goserver
+systemctl status sshd
+rpm -qa | grep x2go
+apt list --installed | grep x2go
+which x2golistsessions
+```
+
+X2Go depends on working SSH plus installed `x2goserver` and `x2goserver-xsession` components.
+
+### Check X2Go logs
+
+```
+journalctl -u x2goserver -n 100
+journalctl -u sshd -n 100
+ls -la ~/.x2go/
+find ~/.x2go -maxdepth 2 -type f -print
+x2golistsessions
+```
+
+Look for session startup failures, agent crashes, and auth helper errors.
+
+### Check desktop environment startup command
+
+```
+cat /etc/x2go/Xsession
+cat ~/.xsession
+cat ~/.Xclients
+```
+
+A missing or broken desktop session command is a common cause of black screens.
+
+### Check X11 and xauth availability
+
+```
+which xauth
+xauth -V
+ls -la ~/.Xauthority
+which sshfs
+```
+
+X2Go requires a working X11 session setup. Missing `xauth` or a bad `.Xauthority` often breaks startup.
+
+Filesystem and folder-sharing features may also depend on `sshfs` being installed.
+
+### Check session limits and stale sessions
+
+```
+x2golistsessions
+x2gocleansessions
+ulimit -a
+loginctl list-sessions
+```
+
+Stale sessions or per-user process limits can prevent a new desktop from starting.
+
+### Check desktop dependencies
+
+```
+which startxfce4
+which mate-session
+which startplasma-x11
+env | grep -E 'DESKTOP|XDG'
+```
+
+If the selected desktop command does not exist, X2Go may connect and then terminate immediately.
+
+## Remediation
+
+**Missing or broken desktop startup command:**
+Set the session to a known-good desktop such as XFCE and verify the binary exists.
+
+**Corrupt Xauthority or stale X2Go session files:**
+Remove stale session state and regenerate auth files:
+```
+rm -f ~/.Xauthority
+rm -rf ~/.x2go/C-*
+```
+
+**Missing `xauth` or X11 helpers:**
+Install the missing X11 packages, then retry the session.
+
+**Required server packages missing:**
+Install `x2goserver` and `x2goserver-xsession` first, then retry before debugging desktop startup.
+
+**SSH works but X2Go session fails:**
+Treat it as a desktop startup or X11 auth problem, not an SSH transport problem.
--- a/runbooks/xorg.md
+++ b/runbooks/xorg.md
@@ -0,0 +1,94 @@
+---
+service: xorg
+symptoms: xorg black screen, display manager loop, no screens found, failed to start X server, GPU driver error, xrandr missing outputs, login screen not appearing
+tags: xorg, x11, display, gpu, drm, xrandr, gdm, sddm, lightdm
+---
+
+## Symptoms
+
+- Black screen after graphical boot
+- Display manager loops back to login
+- `no screens found` in Xorg log
+- External monitors are missing or not detected
+- X server fails after a driver update
+- `startx` exits immediately with display or device errors
+
+## Diagnostics
+
+### Check display manager and Xorg service path
+
+```
+systemctl status display-manager
+systemctl status gdm
+systemctl status sddm
+systemctl status lightdm
+```
+
+If the display manager is failing, inspect its logs before focusing on Xorg itself.
+
+### Check Xorg logs
+
+```
+find /var/log -name 'Xorg*.log*'
+grep -E '\(EE\)|\(WW\)' /var/log/Xorg.0.log
+journalctl -b | grep -iE 'xorg|gdm|sddm|lightdm'
+ls -la ~/.local/share/xorg/
+```
+
+Look for: `no screens found`, GPU module load failures, and permission/device access errors.
+
+On rootless Xorg, logs are often under `~/.local/share/xorg/Xorg.0.log` instead of `/var/log/`.
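+
+A minimal sketch for pulling error lines from whichever log location exists, assuming log lines of the form `[ timestamp] (EE) message`. `xorg_errors` is a hypothetical helper name; the anchored pattern skips the marker legend at the top of the log, which also contains `(EE)`:
+
+```shell
+# Hypothetical helper: print (EE) lines, prefixed with the file name,
+# from each readable Xorg log passed as an argument.
+xorg_errors() {
+  for log in "$@"; do
+    if [ -r "$log" ]; then
+      grep -HE '^(\[[^]]*\] )?\(EE\)' "$log" || true
+    fi
+  done
+}
+```
+
+Usage: `xorg_errors /var/log/Xorg.0.log "$HOME/.local/share/xorg/Xorg.0.log"`.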
+
+### Check DRM and GPU driver state
+
+```
+lspci -k | grep -A3 -E 'VGA|3D|Display'
+lsmod | grep -E 'nouveau|nvidia|amdgpu|i915'
+dmesg | grep -iE 'drm|gpu|nvidia|amdgpu|i915'
+```
+
+Driver mismatches after kernel updates are a common cause of X startup failures.
+
+### Check monitor detection and permissions
+
+```
+loginctl session-status
+xrandr --query
+ls -la /dev/dri/
+ps -o user= -C Xorg
+```
+
+If `/dev/dri/*` permissions or seat assignment are wrong, X may fail to access the GPU.
+
+### Check X configuration files
+
+```
+find /etc/X11 -maxdepth 3 -type f
+cat /etc/X11/xorg.conf
+cat /etc/X11/xorg.conf.d/*.conf
+ls -la ~/.xinitrc ~/.xserverrc
+```
+
+Custom `Device`, `Monitor`, or `Screen` sections often break auto-detection.
+
+An empty or broken `.xinitrc` can produce a black screen even when the X server itself started correctly.
+
+## Remediation
+
+**Bad static Xorg config:**
+Move custom config aside and let auto-detection work unless the hardware truly needs manual config.
+
+**Driver mismatch after update:**
+Reinstall the GPU driver package matching the running kernel and reboot or restart the display manager.
+
+**`no screens found`:**
+Check whether the correct DRM module loaded and whether the display manager is running on the expected seat.
+
+**Display manager loop:**
+Correlate Xorg errors with PAM/auth logs; some loops are session startup failures, not graphics failures.
+
+**Framebuffer mode failure:**
+If X falls back to `fbdev` and errors with framebuffer/bus ID messages, remove the generic `fbdev` driver package and let Xorg use the proper modesetting or vendor driver.
+
+**`SocketCreateListener() failed`:**
+Check for stale sockets in `/tmp/.X11-unix`, especially after previous root-run Xorg sessions.