feat: complete RAG runbook workflow and release docs
2026-05-06 04:48:41 +02:00
parent 450de24d28
commit 57f4c0efaa
26 changed files with 2510 additions and 137 deletions

86
runbooks/apparmor.md Normal file

@@ -0,0 +1,86 @@
---
service: apparmor
symptoms: permission denied despite correct unix permissions, apparmor deny logs, service blocked by profile, executable transition denied, path access denied, snap confinement issue, profile in complain mode
tags: apparmor, security, profile, aa-status, audit, confinement, complain, enforce, snap
---
## Symptoms
- Application gets `Permission denied` even though Unix permissions look correct
- Service starts in complain mode but fails in enforce mode
- Log shows AppArmor `DENIED` entries
- Binary works when profile is disabled but fails when confinement is enabled
- Snap or packaged app cannot access expected files or sockets
## Diagnostics
### Check AppArmor status and loaded profiles
```
aa-status
systemctl status apparmor
```
Confirm whether the profile is loaded and whether it is in enforce or complain mode.
### Check denial logs
```
journalctl -k | grep -i apparmor
journalctl -b | grep -i DENIED
dmesg | grep -i apparmor
```
AppArmor denials usually identify the profile, operation, and path that was blocked.
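When many denials scroll past, a small filter can condense them into one line per profile, operation, and path. The following is only a sketch: `aa_denial_summary` is a hypothetical helper name, and it assumes the usual kernel audit layout in which `operation=`, `profile=`, and `name=` appear in that order on each `apparmor="DENIED"` line. Capability denials would be reported by their `capname` value, and lines with neither field are skipped.

```shell
# aa_denial_summary: hypothetical helper, not part of the AppArmor tooling.
# Reads kernel log lines on stdin and prints "count profile operation path",
# most frequent first. Assumes operation=, profile=, name= appear in that order.
aa_denial_summary() {
  grep 'apparmor="DENIED"' |
    sed -n 's/.*operation="\([^"]*\)".*profile="\([^"]*\)".*name="\([^"]*\)".*/\2 \1 \3/p' |
    sort | uniq -c | sort -rn
}
```

A typical invocation would be `journalctl -k | aa_denial_summary`.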
### Inspect the active profile
```
find /etc/apparmor.d -maxdepth 2 -type f | sort
cat /etc/apparmor.d/<profile>
```
Look for missing file path rules, capability rules, and `ix`/`px` execution transitions.
### Check complain vs enforce mode
```
aa-status | grep complain
```
If the issue occurs only in enforce mode, the profile is most likely too restrictive rather than the application being broken.
### Check profile parser and reload
```
apparmor_parser -r /etc/apparmor.d/<profile>
aa-status
```
Syntax or include errors can prevent an updated profile from loading.
## Remediation
**Profile too restrictive:**
Add the missing path, capability, or network rule to the profile, then reload AppArmor.
If the denial pattern is repetitive, use AppArmor tooling to review and refine the profile instead of disabling confinement globally.
**Need to observe without blocking:**
Temporarily switch the profile to complain mode:
```
aa-complain /etc/apparmor.d/<profile>
```
**Return to enforcement after fixing rules:**
```
aa-enforce /etc/apparmor.d/<profile>
```
**Profile reload after changes:**
```
apparmor_parser -r /etc/apparmor.d/<profile>
systemctl reload apparmor
```
Do not disable AppArmor globally when the issue is isolated to a single profile.

106
runbooks/disk.md Normal file

@@ -0,0 +1,106 @@
---
service: disk
symptoms: no space left on device, disk full, inode exhaustion, df shows 100%, du large files, write failed, cannot create file, filesystem read-only, ext4 error
tags: disk, filesystem, storage, inodes, df, du, ext4, xfs, lvm, partition, full, space
---
## Symptoms
- `No space left on device` — disk or inode exhaustion
- `df -h` shows a filesystem at 100% (or near 100%)
- `df -i` shows inode usage at 100% — file count exhausted even if byte space is free
- Filesystem remounted read-only — kernel detected errors and protected itself
- Services failing to write logs, create temp files, or open sockets
## Diagnostics
### Overall disk usage
```
df -h
df -i
```
`df -h` shows byte space; `df -i` shows inode usage. Both can be independently exhausted.
Note which filesystem is full (`/`, `/var`, `/tmp`, `/home`, etc.).
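Because byte and inode usage share the same column layout in POSIX-format `df` output, one filter can flag both. This is a minimal sketch: `over_threshold` is a hypothetical helper name, the 90% default is an arbitrary assumption, and it relies on the usage percentage being the fifth column of `df -P` and `df -Pi` output.

```shell
# over_threshold: hypothetical helper. Reads `df -P` or `df -Pi` output on stdin
# and prints "mountpoint usage" for rows at or above the given percentage.
# Mount points containing spaces are truncated at the first space.
over_threshold() {
  awk -v limit="${1:-90}" '
    NR > 1 {
      use = $5
      sub(/%$/, "", use)
      if (use ~ /^[0-9]+$/ && use + 0 >= limit + 0) print $6, $5
    }'
}
```

It could be run as `df -P | over_threshold 90` for byte space and `df -Pi | over_threshold 90` for inodes.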
### Find the large directories
```
du -sh /* 2>/dev/null | sort -rh | head -20
du -sh /var/* 2>/dev/null | sort -rh | head -20
du -sh /var/log/* 2>/dev/null | sort -rh | head -20
```
### Find large individual files
```
find / -xdev -type f -size +100M -printf '%s\t%p\n' 2>/dev/null | sort -rn | head -20
find /var/log -type f -size +50M 2>/dev/null
```
### Find deleted-but-open files holding space
```
lsof +L1 2>/dev/null | grep -v "^COMMAND"
```
Files deleted while a process still has them open do not free space until the process releases the file descriptor.
### Inode exhaustion — find directories with many small files
```
find / -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -20
```
### Filesystem errors (after a crash or read-only remount)
```
dmesg | grep -i 'ext4\|xfs\|btrfs\|error\|corrupt'
journalctl -k | grep -i 'filesystem\|disk\|io error'
```
### LVM / partition layout
```
lsblk
pvs
vgs
lvs
```
## Remediation
**Large log files — truncate safely (do NOT rm while in use):**
```
truncate -s 0 /var/log/<logfile>
```
Or configure log rotation in `/etc/logrotate.d/`.
**Old journal logs eating space:**
```
journalctl --disk-usage
journalctl --vacuum-size=500M
journalctl --vacuum-time=30d
```
**Deleted-but-open files — restart the holding process to release space:**
Identify the PID from `lsof +L1`, then:
```
systemctl restart <service>
```
**Inode exhaustion — remove many small files:**
Common culprits: PHP session files in `/var/lib/php/sessions/`, old apt cache, tmp dirs.
```
find /var/lib/php/sessions -type f -mtime +7 -delete
apt-get clean
find /tmp -type f -mtime +3 -delete
```
**Extend LVM volume (if free extents exist in the volume group):**
```
lvextend -l +100%FREE /dev/<vg>/<lv>
resize2fs /dev/<vg>/<lv> # ext4 only
xfs_growfs /mountpoint # XFS only, use instead of resize2fs
```

120
runbooks/docker.md Normal file

@@ -0,0 +1,120 @@
---
service: docker
symptoms: cannot connect to docker daemon, docker daemon failed to start, docker socket permission denied, containers cannot resolve dns, docker network broken, daemon.json conflict, docker oom, unable to remove filesystem
tags: docker, dockerd, containerd, container, daemon, daemon.json, cgroup, dns, docker0, socket, compose
---
## Symptoms
- `Cannot connect to the Docker daemon. Is the docker daemon running on this host?`
- `permission denied` on `/var/run/docker.sock`
- `dockerd` fails to start after a `daemon.json` change
- Containers cannot resolve DNS or pull images
- Docker bridge/network disappears or container networking breaks after boot
- Container or daemon is killed by the kernel OOM killer
- `Error: Unable to remove filesystem` when removing a container
## Diagnostics
### Check daemon health and client target
```
docker info
systemctl is-active docker
systemctl status docker
ps -ef | grep dockerd
env | grep DOCKER_HOST
```
If `DOCKER_HOST` is set incorrectly, the CLI may be talking to the wrong daemon.
### Check daemon logs and startup failures
```
journalctl -u docker -n 200
journalctl -u containerd -n 100
cat /etc/docker/daemon.json
systemctl cat docker
```
Look for conflicts between `daemon.json` keys and systemd startup flags, especially duplicate `hosts` settings.
### Check socket permissions and group access
```
ls -la /var/run/docker.sock
id
getent group docker
ls -la ~/.docker/
```
If the user was added to the `docker` group recently, a new login shell may be required.
### Check kernel, cgroups, and memory pressure
```
uname -r
free -h
dmesg | grep -i -E 'docker|cgroup|oom|killed process'
```
Low memory, missing kernel features, or cgroup issues can stop containers or the daemon.
### Check Docker networking and DNS
```
docker network ls
ip addr show docker0
sysctl net.ipv4.ip_forward
cat /etc/resolv.conf
ps aux | grep dnsmasq
```
Loopback DNS resolvers in `/etc/resolv.conf` often break container DNS unless Docker is given explicit nameservers.
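To spot loopback resolvers quickly, the `nameserver` lines can be filtered. This is an illustrative sketch only: `loopback_nameservers` is a hypothetical helper name, and it treats any `127.x.x.x` address or `::1` as loopback.

```shell
# loopback_nameservers: hypothetical helper. Reads resolv.conf content on stdin
# and prints every nameserver that points at a loopback address.
loopback_nameservers() {
  awk '$1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2 }'
}
```

Running `loopback_nameservers < /etc/resolv.conf` with any output at all suggests containers may need explicit DNS servers.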
### Check storage and stuck mounts
```
df -h /var/lib/docker
docker system df
lsof /var/lib/docker
```
Bind-mounting `/var/lib/docker` into other containers can keep container filesystems busy and block removal.
## Remediation
**Daemon not running or client aimed at the wrong host:**
Unset an incorrect `DOCKER_HOST`, then start the daemon:
```
unset DOCKER_HOST
systemctl restart docker
```
**`daemon.json` conflicts with systemd flags:**
Remove duplicate settings or create a systemd override so `dockerd` is started without conflicting flags.
**Permission denied on Docker socket:**
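Add the user to the `docker` group (as root or with `sudo`), then log out and back in; `newgrp docker` applies the new group to the current shell only: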
```
usermod -aG docker $USER
newgrp docker
```
If `~/.docker/` was created by `sudo`, fix ownership:
```
sudo chown "$USER":"$USER" "$HOME/.docker" -R
sudo chmod g+rwx "$HOME/.docker" -R
```
**Container DNS broken:**
Configure explicit DNS servers in `/etc/docker/daemon.json`, then restart Docker.
**Docker networking disappears after boot:**
Stop the host network manager from managing Docker interfaces and confirm `net.ipv4.ip_forward=1`.
**OOM kills:**
Treat this as host memory pressure first; reduce workload, add memory, or enforce container memory limits.
**Unable to remove filesystem:**
Find the process holding the path open with `lsof`, then stop that process or the container bind-mounting `/var/lib/docker`.

117
runbooks/kernel.md Normal file

@@ -0,0 +1,117 @@
---
service: kernel
symptoms: OOM kill, out of memory, high load average, kernel panic, segfault, soft lockup, CPU steal, system unresponsive, zombie processes, NMI watchdog
tags: kernel, oom, memory, load, cpu, panic, dmesg, segfault, lockup, swap, zombie
---
## Symptoms
- `Out of memory: Kill process <pid>` in dmesg — OOM killer fired
- Load average far above CPU count — system overloaded or I/O blocked
- `kernel: BUG: soft lockup` — CPU stuck in kernel code
- `segfault at ...` in dmesg — process crashed due to invalid memory access
- `kernel panic` — unrecoverable kernel error (visible only on console or serial)
- Many zombie (`Z`) processes in `ps` output
- High `%steal` in `top`/`vmstat` — hypervisor CPU contention
## Diagnostics
### Recent kernel messages
```
dmesg -T | tail -100
dmesg -T | grep -iE 'error|warn|oom|kill|panic|oops|fault|hung|lockup'
journalctl -k -n 200
```
### OOM events
```
dmesg -T | grep -i 'out of memory\|oom_kill\|killed process'
```
The log shows which process was killed, its RSS at time of kill, and available memory.
### Memory usage
```
free -h
cat /proc/meminfo | head -30
vmstat -s
```
`MemAvailable` is the key metric. If it is near zero and swap is also exhausted, OOM kills are imminent.
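As a rough way to turn `MemAvailable` into a single number, the sketch below computes it as a percentage of `MemTotal`. The helper name `mem_available_pct` is hypothetical, and the result is truncated to a whole percent.

```shell
# mem_available_pct: hypothetical helper. Reads /proc/meminfo content on stdin
# and prints MemAvailable as a whole-number percentage of MemTotal.
mem_available_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { if (total > 0) printf "%d\n", avail * 100 / total }'
}
```

It would be called as `mem_available_pct < /proc/meminfo`.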
### Swap
```
swapon --show
cat /proc/swaps
vmstat 1 5
```
High `si`/`so` (swap-in/swap-out) in `vmstat` indicates active swapping and likely memory pressure.
### Load average and CPU
```
uptime
top -b -n1 | head -30
mpstat -P ALL 1 3
```
Load average above 2× CPU count sustained over 15 minutes is concerning.
High `%iowait` indicates processes blocked on disk I/O, not CPU-bound load.
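The 2× CPU count rule of thumb above can be expressed as a small comparison. This is a sketch under that stated threshold only; `load_status` is a hypothetical helper name and the caller supplies both numbers.

```shell
# load_status: hypothetical helper. Arguments: <load average> <cpu count>.
# Prints "high" when the load exceeds twice the CPU count, otherwise "ok".
load_status() {
  awk -v load="$1" -v cpus="$2" 'BEGIN {
    if (load + 0 > 2 * cpus) print "high"; else print "ok"
  }'
}
```

On a Linux host the inputs could come from `load_status "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)"`, which uses the 15-minute average.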
### Process memory usage
```
ps aux --sort=-%mem | head -20
ps aux --sort=-%cpu | head -20
```
### Zombie processes
```
ps aux | awk '$8 ~ /^Z/'
```
Zombies cannot be killed; the parent must `wait()` for them or be killed itself.
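Since the parent is what matters, it can help to group zombies by parent PID. The sketch below assumes input in the form produced by `ps -eo ppid=,stat=`; `zombie_parents` is a hypothetical helper name.

```shell
# zombie_parents: hypothetical helper. Reads "PPID STAT" pairs on stdin and
# prints "zombie-count parent-pid", largest count first.
zombie_parents() {
  awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn
}
```

Usage would be `ps -eo ppid=,stat= | zombie_parents`, followed by inspecting the listed parent PIDs.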
### I/O wait and disk health
```
iostat -x 1 3
dmesg -T | grep -iE 'i/o error|hard resetting link|ata.*error|blk_update_request'
```
Persistent I/O errors alongside high load suggest failing storage.
## Remediation
**Memory pressure / frequent OOM kills:**
Identify the largest memory consumers from `ps aux --sort=-%mem`.
Consider increasing swap, adding RAM, tuning `vm.overcommit_memory`, or scaling the workload.
Do NOT just raise `vm.overcommit_ratio` without understanding the root consumer.
**Adjust OOM killer scoring for critical services (temporary, resets on reboot):**
```
echo -17 > /proc/<pid>/oom_adj # legacy
echo -1000 > /proc/<pid>/oom_score_adj # current kernels
```
**Swap exhausted — add a swapfile:**
```
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
```
**High I/O wait — find the I/O-heavy process:**
```
iotop -a -o -b -n3
```
**Zombie reaping — if parent is stuck:**
Kill the parent process; its zombie children are then reparented to init (PID 1) or a subreaper, which reaps them. Verify that the zombies disappear.

99
runbooks/nginx.md Normal file

@@ -0,0 +1,99 @@
---
service: nginx
symptoms: 502 Bad Gateway, 504 Gateway Timeout, upstream connection refused, nginx not starting, failed to bind socket, permission denied reading config, configuration test failed
tags: nginx, web, http, https, proxy, upstream, reverse-proxy, load-balancer
---
## Symptoms
- `502 Bad Gateway` — nginx reached the upstream but got an invalid response, or upstream is down
- `504 Gateway Timeout` — upstream took too long to respond
- `111: Connection refused` in nginx error log — upstream process is not running or not on the expected port
- `nginx.service: Start request repeated too quickly` — crash-loop; check error log
- `[emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)` — port conflict
- `[emerg] open() ... failed (13: Permission denied)` — file permission issue
## Diagnostics
### Service status
```
systemctl status nginx
```
### Config test
```
nginx -t
```
A config error is the most common reason for nginx failing to start or reload.
### Error log
```
journalctl -u nginx -n 100
tail -n 100 /var/log/nginx/error.log
```
For 502/504 errors look for: `connect() failed`, `upstream timed out`, `no live upstreams`.
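To see which of those three patterns dominates, the error log can be tallied. This is a minimal sketch; `upstream_error_counts` is a hypothetical helper name and it only counts the three strings listed above.

```shell
# upstream_error_counts: hypothetical helper. Reads nginx error log lines on
# stdin and prints one count per upstream failure pattern, in a fixed order.
upstream_error_counts() {
  awk '
    /connect\(\) failed/ { refused++ }
    /upstream timed out/ { timeout++ }
    /no live upstreams/  { nolive++ }
    END {
      printf "%d connect() failed\n", refused
      printf "%d upstream timed out\n", timeout
      printf "%d no live upstreams\n", nolive
    }'
}
```

For example: `tail -n 1000 /var/log/nginx/error.log | upstream_error_counts`.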
### Access log — recent requests
```
tail -n 50 /var/log/nginx/access.log
```
### Check upstream services
For `proxy_pass` targets, verify the upstream is running:
```
systemctl status <upstream-service>
ss -tlnp | grep <upstream-port>
```
Common upstreams: `gunicorn`, `uwsgi`, `node`, `puma`, `php-fpm`.
### Port binding conflicts
```
ss -tlnp | grep ':80\|:443'
```
### Config files
```
cat /etc/nginx/nginx.conf
ls /etc/nginx/sites-enabled/
cat /etc/nginx/sites-enabled/<vhost>
```
Check `proxy_pass`, `upstream` blocks, `proxy_connect_timeout`, `proxy_read_timeout`.
## Remediation
**Upstream service not running:**
Start the upstream service, then verify nginx resumes proxying.
**Config syntax error:**
Fix the error shown by `nginx -t`, then:
```
systemctl reload nginx
```
**Port already in use:**
Find the conflicting process with `ss -tlnp | grep :80`, stop it, then restart nginx.
**Upstream timeouts — increase timeouts (caution: treat the slow upstream as the root cause):**
```nginx
proxy_connect_timeout 10s;
proxy_read_timeout 60s;
proxy_send_timeout 60s;
```
**Permission denied on log or socket file:**
```
ls -la /var/log/nginx/
ls -la /run/nginx.pid
chown -R www-data:www-data /var/log/nginx/
```

107
runbooks/postgres.md Normal file

@@ -0,0 +1,107 @@
---
service: postgres
symptoms: connection refused port 5432, FATAL password authentication failed, replication lag, disk full, out of shared memory, too many connections, relation does not exist, could not connect to the primary
tags: postgres, postgresql, database, replication, pg, psql, disk, connections
---
## Symptoms
- `could not connect to server: Connection refused` — postgres not running or not on port 5432
- `FATAL: password authentication failed for user "<user>"` — wrong credentials or pg_hba mismatch
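- `FATAL: sorry, too many clients already`: `max_connections` has been reached
- `ERROR: could not resize shared memory segment` / `out of shared memory`: shared memory exhausted, typically an undersized `/dev/shm` (first message) or a full lock table, see `max_locks_per_transaction` (second message)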
- `PANIC: could not write to file "pg_wal/..."` — disk full on WAL directory
- Replication lag growing — standby falling behind primary
- `FATAL: could not connect to the primary server` — standby cannot reach primary
## Diagnostics
### Service status
```
systemctl status postgresql
systemctl status postgresql@<version>-main
```
### PostgreSQL logs
```
journalctl -u postgresql -n 100
tail -n 100 /var/log/postgresql/postgresql-*.log
```
### Is postgres listening?
```
ss -tlnp | grep 5432
```
### Disk space (WAL and data directory are the critical paths)
```
df -h
du -sh /var/lib/postgresql/
du -sh /var/lib/postgresql/*/main/pg_wal/
```
A full disk on the pg_wal partition causes a PANIC and hard crash.
### Connection count
```sql
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
SELECT setting FROM pg_settings WHERE name = 'max_connections';
```
### Replication lag (run on primary)
```sql
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
(sent_lsn - replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```
### pg_hba.conf — authentication rules
```
cat /etc/postgresql/*/main/pg_hba.conf
```
Entries are matched top-to-bottom. `reject` or missing entry for the client IP causes auth failure even with correct credentials.
### Shared memory / kernel settings
```
cat /proc/sys/kernel/shmmax
cat /etc/postgresql/*/main/postgresql.conf | grep shared_buffers
```
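As a starting point, set `shared_buffers` to about 25% of RAM; values above ~40% rarely help. On PostgreSQL 9.3 and later the main shared memory area is allocated with `mmap`, so kernel `shmmax` is seldom the limiting factor.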
## Remediation
**Postgres not running:**
```
systemctl start postgresql
```
Check logs immediately after start for the failure reason.
**Authentication failure (pg_hba mismatch):**
Add or update the correct entry in `pg_hba.conf`, then reload:
```
systemctl reload postgresql
```
**Too many connections — increase limit (requires restart):**
In `postgresql.conf`:
```
max_connections = 200
```
Or deploy a connection pooler (`pgbouncer`).
**Disk full on WAL:**
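Free space on the WAL volume by removing old base backups or other non-WAL data, and check for a failing `archive_command` or an inactive replication slot that forces the primary to retain WAL.
Do NOT delete files under `/var/lib/postgresql/*/main/pg_wal/` by hand. Use `pg_archivecleanup` on the WAL archive directory, or let archiving catch up.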
**Replication lag — standby too far behind:**
Check network bandwidth and I/O on standby. If the primary logs `terminating walsender process due to replication timeout`, increase `wal_sender_timeout` temporarily.

112
runbooks/selinux.md Normal file

@@ -0,0 +1,112 @@
---
service: selinux
symptoms: permission denied despite correct unix permissions, service blocked by selinux, avc denied, file context mismatch, port binding denied, boolean missing, domain transition failure
tags: selinux, avc, enforcing, security, policy, restorecon, audit, sealert, semanage
---
## Symptoms
- Service gets `Permission denied` even though file ownership and mode look correct
- Process cannot bind to a port or open a file after a config change
- AVC denials appear in audit logs
- App works when SELinux is permissive but fails in enforcing mode
- Newly created files under custom paths are inaccessible to a confined service
## Diagnostics
### Confirm SELinux mode and policy
```
getenforce
sestatus
cat /etc/selinux/config
```
If SELinux is `Permissive`, denials are logged but not enforced.
### Check AVC denials
```
auditctl -s
ausearch -m AVC,USER_AVC,SELINUX_ERR,USER_SELINUX_ERR -ts recent
journalctl -t setroubleshoot -n 50
dmesg | grep -i -e type=1300 -e type=1400
```
AVC denials are the primary source of truth for SELinux policy failures.
If AVCs are missing but SELinux still appears involved, temporarily disable `dontaudit` rules to expose hidden denials:
```
semodule -DB
```
Re-enable them after reproducing the issue:
```
semodule -B
```
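Raw AVC records are long, so a filter that keeps only the permission, command, contexts, and class can make patterns easier to see. The sketch below is illustrative: `avc_summary` is a hypothetical helper name, and it assumes the raw record layout in which `comm=`, `scontext=`, `tcontext=`, and `tclass=` follow the `denied { ... }` clause in that order. Records with several permissions in the braces will print them space-separated.

```shell
# avc_summary: hypothetical helper, not part of the SELinux tooling.
# Reads raw AVC records on stdin and prints
# "count permission comm scontext tcontext tclass", most frequent first.
avc_summary() {
  sed -n 's/.*denied  *{ \([^}]*\) }.*comm="\([^"]*\)".*scontext=\([^ ]*\).*tcontext=\([^ ]*\).*tclass=\([^ ]*\).*/\1 \2 \3 \4 \5/p' |
    sort | uniq -c | sort -rn
}
```

It could be fed from `ausearch -m AVC -ts recent | avc_summary`.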
### Inspect file contexts
```
ls -lZ /path/to/file
ps -eZ | grep <service>
matchpathcon -V /path/to/file
```
A service can have correct Unix permissions and still fail if the SELinux context is wrong.
### Check port labeling and booleans
```
semanage port -l | grep <port>
getsebool -a | grep <service-or-feature>
semanage boolean -l | grep <service-or-feature>
```
Custom ports often require explicit SELinux port labels.
### Check for relabeling needs
```
restorecon -nRv /path
matchpathcon /path/to/file
sealert -l "*"
```
`restorecon -n` shows what would change without modifying labels.
`sealert` is often the fastest way to turn a raw AVC into a concrete fix, but treat `audit2allow` suggestions as a last resort, not a first response.
## Remediation
**Wrong file context:**
Restore the default context:
```
restorecon -Rv /path
```
**Custom application path needs persistent labeling:**
```
semanage fcontext -a -t <type> '/custom/path(/.*)?'
restorecon -Rv /custom/path
```
**Custom port binding denied:**
Add the port label required by the service type:
```
semanage port -a -t <port_type> -p tcp <port>
```
**Boolean disabled:**
Enable the needed boolean persistently:
```
setsebool -P <boolean_name> on
```
**Still unsure whether SELinux is the blocker:**
Temporarily switch to permissive mode and reproduce the issue:
```
setenforce 0
```
If the problem still occurs, SELinux is not the root cause. Either way, restore enforcing mode afterwards with `setenforce 1`.
Do not disable SELinux or generate custom policy modules as a first response. Fix labels, booleans, or port mappings first.

100
runbooks/ssh.md Normal file

@@ -0,0 +1,100 @@
---
service: ssh
symptoms: connection refused, authentication failed, host key mismatch, permission denied, timeout connecting, no route to host
tags: ssh, sshd, openssh, authentication, network, connectivity
---
## Symptoms
- `ssh: connect to host <hostname> port 22: Connection refused`
- `Permission denied (publickey)` — key not accepted or wrong user
- `WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!` — host key mismatch
- `Connection timed out` — firewall blocking or host unreachable
- `No route to host` — routing issue or host is down
## Diagnostics
### Is sshd running?
```
systemctl status sshd
systemctl status ssh
```
A stopped or failed sshd is the most common cause of "connection refused".
### Check sshd configuration
```
sshd -t
cat /etc/ssh/sshd_config
```
Look for: `PasswordAuthentication`, `PubkeyAuthentication yes`, `AuthorizedKeysFile`.
### Check authorised keys
```
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```
Permissions must be: `~/.ssh` → `700`, `authorized_keys` → `600`.
Wrong permissions cause silent auth failure even with the correct key.
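Because the failure is silent, it can be worth checking the modes explicitly rather than reading `ls` output by eye. This is a minimal sketch assuming GNU `stat` (the `-c '%a'` format option); `check_mode` is a hypothetical helper name.

```shell
# check_mode: hypothetical helper. Arguments: <path> <expected octal mode>.
# Prints "ok <path>" on a match, otherwise a "bad ..." line and returns 1.
# Assumes GNU stat; returns 2 if the path cannot be inspected.
check_mode() {
  _mode=$(stat -c '%a' "$1") || return 2
  if [ "$_mode" = "$2" ]; then
    echo "ok $1"
  else
    echo "bad $1: mode $_mode, expected $2"
    return 1
  fi
}
```

For the paths above that would be `check_mode ~/.ssh 700` and `check_mode ~/.ssh/authorized_keys 600`.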
### Check sshd logs
```
journalctl -u sshd -n 100
journalctl -u ssh -n 100
grep sshd /var/log/auth.log | tail -50
```
Look for: `Invalid user`, `Failed publickey`, `Connection reset by peer`, `Too many authentication failures`.
### Check listening port
```
ss -tlnp | grep sshd
netstat -tlnp | grep :22
```
If sshd is running but not listening on the expected port, check `Port` in `/etc/ssh/sshd_config`.
### Firewall rules
```
iptables -L INPUT -n -v
nft list ruleset
ufw status verbose
```
A DROP rule on port 22 causes silent timeouts, not "connection refused".
## Remediation
**sshd not running:**
```
systemctl enable --now sshd
```
**Wrong permissions on authorized_keys:**
```
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R $USER:$USER ~/.ssh
```
**sshd config error:**
Fix the error reported by `sshd -t`, then:
```
systemctl restart sshd
```
**Host key mismatch (expected after reinstall/reprovisioning):**
Remove the old key from the client:
```
ssh-keygen -R <hostname>
```
Only do this if you are certain the host was intentionally reprovisioned.
If the key change is unexpected, treat as a potential MITM and investigate before connecting.

115
runbooks/sssd.md Normal file

@@ -0,0 +1,115 @@
---
service: sssd
symptoms: login denied, user not found, id command hangs, sudo rules missing, ldap auth failure, kerberos failure, cache stale, offline authentication not working
tags: sssd, ldap, kerberos, ad, identity, auth, pam, nss, sudo
---
## Symptoms
- `id <user>` hangs or returns no such user for a domain account
- SSH or console login fails for directory-backed users
- Group membership is missing or incomplete
- `sudo` rules from LDAP/AD do not appear
- Authentication works intermittently or only after cache flush
- Offline authentication fails when the directory is unreachable
## Diagnostics
### Check service health
```
systemctl status sssd
sssctl domain-list
sssctl config-check
cat /etc/nsswitch.conf
```
A running daemon with a valid config and `sss` present in `nsswitch.conf` are the first prerequisites.
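The `nsswitch.conf` prerequisite can be checked mechanically. The sketch below is an assumption-laden example: `nss_missing_sss` is a hypothetical helper name, and it inspects only the `passwd` and `group` databases.

```shell
# nss_missing_sss: hypothetical helper. Reads nsswitch.conf content on stdin
# and prints each of passwd/group whose source list does not include "sss".
nss_missing_sss() {
  awk -F: '$1 ~ /^(passwd|group)$/ && $2 !~ /(^|[ \t])sss([ \t]|$)/ { print $1 }'
}
```

Running `nss_missing_sss < /etc/nsswitch.conf` should print nothing on a host where SSSD is wired into NSS for both databases.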
### Check identity resolution
```
id <user>
getent passwd <user>
getent group <group>
```
If NSS lookups fail, the issue is often in SSSD configuration, connectivity, or cache.
### Check SSSD logs
```
journalctl -u sssd -n 100
ls -la /var/log/sssd/
tail -n 100 /var/log/sssd/*.log
sssctl logs-fetch <output-archive.tar.gz>
```
Look for: backend offline, LDAP bind failures, Kerberos errors, TLS problems, and access provider denials.
If the issue is unclear, raise `debug_level=6` in the relevant `[nss]`, `[pam]`, and `[domain/<name>]` sections. Raising debug only in `[sssd]` is not enough for most real failures.
### Check domain reachability
```
sssctl domain-status <domain>
ping <ldap-or-ad-host>
dig -t SRV _ldap._tcp.<domain>
cat /etc/resolv.conf
```
If the identity provider is unreachable, SSSD may serve cached data only or fail entirely.
### Check Kerberos and LDAP configuration
```
cat /etc/sssd/sssd.conf
cat /etc/krb5.conf
kinit <user>
klist
ldapsearch -ZZ -x -H ldap://<server> -b <base-dn>
```
Look for wrong realm names, bad server addresses, TLS settings, and access filters.
For AD or IPA providers, Kerberos and DNS are often the real dependency chain: broken SRV lookup, keytab issues, or a slow KDC will surface as SSSD failures.
### Check cache and permissions
```
ls -la /var/lib/sss/db/
sssctl cache-status
sssctl cache-expire -E
```
`/etc/sssd/sssd.conf` must usually be mode `600` or SSSD will refuse to start.
Do not wipe cache files blindly on an offline system that depends on cached logins.
## Remediation
**Config syntax or permission issue:**
Fix `sssd.conf`, set secure permissions, then restart:
```
chmod 600 /etc/sssd/sssd.conf
systemctl restart sssd
```
**Stale cache:**
Clear cache carefully, then repopulate with a fresh lookup:
```
sss_cache -E
id <user>
```
**Kerberos failure:**
Validate time sync, realm, keytab credentials, and KDC reachability before changing LDAP settings.
**Backend offline or `sdap_async_sys_connect request failed`:**
Treat as DNS/network first. Validate SRV records and TLS handshake before increasing `ldap_network_timeout` or `ldap_search_timeout`.
**Access denied despite successful lookup:**
Check `access_provider`, LDAP filters, HBAC rules, or AD group-based access restrictions.
**No `pam_sss` messages at all:**
The PAM stack is likely misconfigured. Fix the PAM/authselect profile before changing SSSD itself.

89
runbooks/wayland.md Normal file

@@ -0,0 +1,89 @@
---
service: wayland
symptoms: wayland session fails, gdm falls back to xorg, black screen on login, fractional scaling broken, screen sharing broken, remote desktop broken, wlroots crash, compositor crash
tags: wayland, compositor, gnome, kde, mutter, wlroots, pipewire, xwayland, graphics
---
## Symptoms
- User selects a Wayland session but is returned to login
- GDM or another display manager falls back to Xorg
- Screen sharing, remote desktop, or clipboard integration is broken
- Apps requiring XWayland fail while native Wayland apps work
- Fractional scaling or multi-monitor layout behaves incorrectly
- Wayland compositor crashes after login
## Diagnostics
### Confirm the active session type
```
echo $XDG_SESSION_TYPE
loginctl show-session $XDG_SESSION_ID -p Type
echo $WAYLAND_DISPLAY
```
If the session type is `x11`, you are not debugging an active Wayland session.
### Check display manager and compositor logs
```
systemctl status gdm
journalctl -b | grep -iE 'wayland|mutter|kwin|wlroots|xwayland'
journalctl -b | grep -i 'renderer for'
```
Look for compositor crashes, GPU driver incompatibilities, and forced Xorg fallback messages.
### Check XWayland and PipeWire components
```
which Xwayland
systemctl --user status pipewire
systemctl --user status xdg-desktop-portal
systemctl --user status xdg-desktop-portal-gnome
systemctl --user status xdg-desktop-portal-kde
xlsclients -l
```
Broken screen sharing is often a PipeWire or portal issue, not a compositor issue.
`xlsclients -l` helps identify apps that are actually running under XWayland rather than native Wayland.
### Check GPU compatibility
```
lspci -k | grep -A3 -E 'VGA|3D|Display'
lsmod | grep -E 'nvidia|nouveau|amdgpu|i915'
```
Wayland support quality depends heavily on the GPU driver stack.
### Check environment and session overrides
```
env | grep -E 'WAYLAND|XDG|GDK_BACKEND|QT_QPA_PLATFORM'
cat /etc/gdm/custom.conf
wayland-info
```
Environment overrides can force apps onto X11 or disable Wayland entirely.
For NVIDIA systems, confirm the compositor is using a supported buffer path (GBM on current drivers is the expected default).
## Remediation
**Wayland disabled in display manager config:**
Check `WaylandEnable=false` or similar settings and remove the override if unintended.
**Fallback to Xorg on unsupported GPU stack:**
Upgrade or change the graphics driver; Wayland stability is often limited by the driver, not the compositor.
**Screen sharing broken:**
Fix PipeWire and `xdg-desktop-portal` services before changing compositor settings.
**XWayland-only app failures:**
Treat them separately from native Wayland issues; confirm `Xwayland` is installed and launching.
**Remote desktop, VM, or game input grabbing is broken:**
This is often a Wayland protocol/compositor support limitation, not a generic keyboard bug. Check compositor support for pointer constraints, relative pointer, and keyboard shortcut inhibit protocols.

106
runbooks/x2go.md Normal file

@@ -0,0 +1,106 @@
---
service: x2go
symptoms: x2go session fails to start, x2go black screen, x2go disconnects immediately, no desktop in session, authentication failure, x2go agent not starting, sound forwarding broken
tags: x2go, nx, remote-desktop, x2goserver, x2goclient, session, desktop, xauth
---
## Symptoms
- X2Go login succeeds but the session immediately disconnects
- Black screen after login
- Session is created but no desktop appears
- `x2goruncommand error` or `X2Go Agent got stuck in state`
- Sound, clipboard, or drive sharing fails while login itself works
- Authentication works over SSH but X2Go session startup fails
## Diagnostics
### Check X2Go services and packages
```
systemctl status x2goserver
systemctl status sshd
rpm -qa | grep x2go
apt list --installed | grep x2go
which x2golistsessions
```
X2Go depends on working SSH plus installed `x2goserver` and `x2goserver-xsession` components.
### Check X2Go logs
```
journalctl -u x2goserver -n 100
journalctl -u sshd -n 100
ls -la ~/.x2go/
find ~/.x2go -maxdepth 2 -type f -print
x2golistsessions
```
Look for session startup failures, agent crashes, and auth helper errors.
### Check desktop environment startup command
```
cat /etc/x2go/Xsession
cat ~/.xsession
cat ~/.Xclients
```
A missing or broken desktop session command is a common cause of black screens.
### Check X11 and xauth availability
```
which xauth
xauth -V
ls -la ~/.Xauthority
which sshfs
```
X2Go requires a working X11 session setup. Missing `xauth` or a bad `.Xauthority` often breaks startup.
Filesystem and folder-sharing features may also depend on `sshfs` being installed.
### Check session limits and stale sessions
```
x2golistsessions
x2gocleansessions
ulimit -a
loginctl list-sessions
```
Stale sessions or per-user process limits can prevent a new desktop from starting.
### Check desktop dependencies
```
which startxfce4
which mate-session
which startplasma-x11
env | grep -E 'DESKTOP|XDG'
```
If the selected desktop command does not exist, X2Go may connect and then terminate immediately.
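Rather than running `which` for each candidate by hand, a loop can report the first desktop command that is actually installed. This is only a sketch; `first_available` is a hypothetical helper name and the candidate list is whatever the caller passes in.

```shell
# first_available: hypothetical helper. Prints the first argument that resolves
# to a command on PATH and returns 0; returns 1 if none of them exist.
first_available() {
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      return 0
    fi
  done
  return 1
}
```

With the commands checked above: `first_available startxfce4 mate-session startplasma-x11`.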
## Remediation
**Missing or broken desktop startup command:**
Set the session to a known-good desktop such as XFCE and verify the binary exists.
**Corrupt Xauthority or stale X2Go session files:**
Remove stale session state and regenerate auth files:
```
rm -f ~/.Xauthority
rm -rf ~/.x2go/C-*
```
**Missing `xauth` or X11 helpers:**
Install the missing X11 packages, then retry the session.
**Required server packages missing:**
Install `x2goserver` and `x2goserver-xsession` first, then retry before debugging desktop startup.
**SSH works but X2Go session fails:**
Treat it as a desktop startup or X11 auth problem, not an SSH transport problem.

94
runbooks/xorg.md Normal file

@@ -0,0 +1,94 @@
---
service: xorg
symptoms: xorg black screen, display manager loop, no screens found, failed to start X server, GPU driver error, xrandr missing outputs, login screen not appearing
tags: xorg, x11, display, gpu, drm, xrandr, gdm, sddm, lightdm
---
## Symptoms
- Black screen after graphical boot
- Display manager loops back to login
- `no screens found` in Xorg log
- External monitors are missing or not detected
- X server fails after a driver update
- `startx` exits immediately with display or device errors
## Diagnostics
### Check display manager and Xorg service path
```
systemctl status display-manager
systemctl status gdm
systemctl status sddm
systemctl status lightdm
```
If the display manager is failing, inspect its logs before focusing on Xorg itself.
### Check Xorg logs
```
find /var/log -name 'Xorg*.log'
grep -E '\(EE\)|\(WW\)' /var/log/Xorg.0.log
journalctl -b | grep -iE 'xorg|gdm|sddm|lightdm'
ls -la ~/.local/share/xorg/
```
Look for: `no screens found`, GPU module load failures, and permission/device access errors.
On rootless Xorg, logs are often under `~/.local/share/xorg/Xorg.0.log` instead of `/var/log/`.
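A plain `grep '(EE)'` also matches the marker legend near the top of the log. The sketch below narrows the match to timestamped error lines; `xorg_errors` is a hypothetical helper name, and it assumes each real log entry starts with a bracketed timestamp such as `[    12.346]`, which may not hold for older Xorg versions.

```shell
# xorg_errors: hypothetical helper. Reads an Xorg log on stdin and prints only
# timestamped (EE) entries, skipping the untimestamped marker legend lines.
xorg_errors() {
  grep -E '^\[ *[0-9]+\.[0-9]+\] \(EE\)'
}
```

It can be pointed at either location, for example `xorg_errors < /var/log/Xorg.0.log` or `xorg_errors < ~/.local/share/xorg/Xorg.0.log`.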
### Check DRM and GPU driver state
```
lspci -k | grep -A3 -E 'VGA|3D|Display'
lsmod | grep -E 'nouveau|nvidia|amdgpu|i915'
dmesg | grep -iE 'drm|gpu|nvidia|amdgpu|i915'
```
Driver mismatches after kernel updates are a common cause of X startup failures.
### Check monitor detection and permissions
```
loginctl session-status
xrandr --query
ls -la /dev/dri/
ps -o user= -C Xorg
```
If `/dev/dri/*` permissions or seat assignment are wrong, X may fail to access the GPU.
### Check X configuration files
```
find /etc/X11 -maxdepth 3 -type f
cat /etc/X11/xorg.conf
cat /etc/X11/xorg.conf.d/*.conf
ls -la ~/.xinitrc ~/.xserverrc
```
Custom `Device`, `Monitor`, or `Screen` sections often break auto-detection.
An empty or broken `.xinitrc` can produce a black screen even when the X server itself started correctly.
## Remediation
**Bad static Xorg config:**
Move custom config aside and let auto-detection work unless the hardware truly needs manual config.
**Driver mismatch after update:**
Reinstall the GPU driver package matching the running kernel and reboot or restart the display manager.
**`no screens found`:**
Check whether the correct DRM module loaded and whether the display manager is running on the expected seat.
**Display manager loop:**
Correlate Xorg errors with PAM/auth logs; some loops are session startup failures, not graphics failures.
**Framebuffer mode failure:**
If X falls back to `fbdev` and errors with framebuffer/bus ID messages, remove the generic `fbdev` driver package and let Xorg use the proper modesetting or vendor driver.
**`SocketCreateListener() failed`:**
Check for stale sockets in `/tmp/.X11-unix`, especially after previous root-run Xorg sessions.