feat: complete RAG runbook workflow and release docs
2026-05-06 04:48:41 +02:00
parent 450de24d28
commit 57f4c0efaa
26 changed files with 2510 additions and 137 deletions

runbooks/kernel.md

@@ -0,0 +1,117 @@
---
service: kernel
symptoms: OOM kill, out of memory, high load average, kernel panic, segfault, soft lockup, CPU steal, system unresponsive, zombie processes, NMI watchdog
tags: kernel, oom, memory, load, cpu, panic, dmesg, segfault, lockup, swap, zombie
---
## Symptoms
- `Out of memory: Kill process <pid>` in dmesg — OOM killer fired
- Load average far above CPU count — system overloaded or I/O blocked
- `kernel: BUG: soft lockup` — CPU stuck in kernel code
- `segfault at ...` in dmesg — process crashed due to invalid memory access
- `kernel panic` — unrecoverable kernel error (visible only on console or serial)
- Many zombie (`Z`) processes in `ps` output
- High `%steal` in `top`/`vmstat` — hypervisor CPU contention
## Diagnostics
### Recent kernel messages
```
dmesg -T | tail -100
dmesg -T | grep -iE 'error|warn|oom|kill|panic|oops|fault|hung|lockup'
journalctl -k -n 200
```
### OOM events
```
dmesg -T | grep -iE 'out of memory|oom[-_]kill|killed process'
```
The log shows which process was killed, its RSS at time of kill, and available memory.
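To pull the victim details out of a long log, the kill lines can be reduced to one row per event. This is a minimal sketch: `oom_victims` is a hypothetical helper name, and it assumes the `Killed process <pid> (<name>) ... anon-rss:<n>kB` layout printed by many kernels. The exact wording varies between kernel versions, so check a raw line first.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<pid> <name> <anon-rss in kB>" for each OOM kill line.
# Assumes the "Killed process PID (NAME) ... anon-rss:NkB" layout.
oom_victims() {
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*anon-rss:([0-9]+)kB.*/\1 \2 \3/p'
}
```

Usage: `dmesg -T | oom_victims`. Lines that do not match the assumed layout are dropped silently, so an empty result does not prove that no OOM kill happened.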
### Memory usage
```
free -h
head -30 /proc/meminfo
vmstat -s
```
`MemAvailable` is the key metric. If it is near zero and swap is also exhausted, OOM kills are imminent.
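For a quick reading of how close `MemAvailable` is to zero, it can be expressed as a percentage of `MemTotal`. A minimal sketch, where `mem_available_pct` is a hypothetical helper name and the input is assumed to be in `/proc/meminfo` format:

```shell
# Hypothetical helper: print MemAvailable as an integer percentage of MemTotal.
# Reads /proc/meminfo by default; pass a file name, or "-" for stdin.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}
```

Usage: `mem_available_pct`. The result is truncated to a whole number. What counts as "too low" depends on the workload, so no threshold is built in.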
### Swap
```
swapon --show
cat /proc/swaps
vmstat 1 5
```
High `si`/`so` (swap-in/swap-out) in `vmstat` indicates active swapping and likely memory pressure.
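The `si`/`so` columns can also be totalled over a sampling run. This sketch assumes the default `vmstat` layout, in which `si` and `so` are the 7th and 8th columns, and it skips the first sample because that row reports averages since boot. `swap_activity` is a hypothetical helper name.

```shell
# Hypothetical helper: sum si/so from default-layout vmstat output on stdin.
# Header rows are skipped (first field is not a number), and so is the first
# data row, which holds since-boot averages rather than a live sample.
swap_activity() {
  awk '$1 ~ /^[0-9]+$/ { n++; if (n > 1) { si += $7; so += $8 } }
       END { printf "si=%d so=%d\n", si, so }'
}
```

Usage: `vmstat 1 5 | swap_activity`. Non-zero totals across several runs point to ongoing swapping rather than a one-off burst.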
### Load average and CPU
```
uptime
top -b -n1 | head -30
mpstat -P ALL 1 3
```
Load average above 2× CPU count sustained over 15 minutes is concerning.
High `%iowait` indicates processes blocked on disk I/O, not CPU-bound load.
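The 2× rule of thumb above can be checked mechanically. A minimal sketch, where `load_is_high` is a hypothetical helper name that takes a load average and a CPU count and succeeds when the load exceeds twice the CPU count:

```shell
# Hypothetical helper: exit 0 if load ($1) is above 2x the CPU count ($2),
# exit 1 otherwise. awk is used because the load average is a decimal.
load_is_high() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load + 0 > 2 * cpus) }'
}
```

Usage: `load_is_high "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)" && echo "15-minute load is above 2x CPU count"`. The third field of `/proc/loadavg` is the 15-minute average.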
### Process memory and CPU usage
```
ps aux --sort=-%mem | head -20
ps aux --sort=-%cpu | head -20
```
### Zombie processes
```
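ps aux | awk '$8 ~ /^Z/'                          # STAT may be Z, Zs, Z+, ...
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'       # same, with the parent PID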
```
Zombies cannot be killed, because they have already exited; the parent must `wait()` for them, or the parent must exit so that PID 1 (or a subreaper) adopts and reaps them.
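Since the fix is always on the parent side, it helps to group zombies by parent PID. A minimal sketch, where `zombie_parents` is a hypothetical helper name and the input is assumed to be the output of `ps -eo pid,ppid,stat,comm`:

```shell
# Hypothetical helper: read "ps -eo pid,ppid,stat,comm" output on stdin and
# print "<parent pid> <zombie count>", sorted by parent PID.
zombie_parents() {
  awk 'NR > 1 && $3 ~ /^Z/ { count[$2]++ }
       END { for (p in count) print p, count[p] }' | sort -n
}
```

Usage: `ps -eo pid,ppid,stat,comm | zombie_parents`. A parent with a steadily growing count is the process that is failing to call `wait()`.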
### I/O wait and disk health
```
iostat -x 1 3
dmesg -T | grep -iE 'i/o error|hard resetting link|ata.*error|blk_update_request'
```
Persistent I/O errors alongside high load suggest failing storage.
## Remediation
**Memory pressure / frequent OOM kills:**
Identify the largest memory consumers from `ps aux --sort=-%mem`.
Consider increasing swap, adding RAM, tuning `vm.overcommit_memory`, or scaling the workload.
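Do NOT just raise `vm.overcommit_ratio` without understanding the root consumer; it only takes effect when `vm.overcommit_memory=2`, and it changes the commit limit, not actual usage.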
**Adjust OOM killer scoring for critical services (temporary, per-process; lost when the process exits or restarts):**
```
echo -17 > /proc/<pid>/oom_adj            # legacy interface (range -17..15), deprecated
echo -1000 > /proc/<pid>/oom_score_adj    # current kernels (range -1000..1000); -1000 exempts the process
```
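Because an out-of-range value is rejected by the kernel, a small wrapper can validate the value before writing. This is a sketch only: `set_oom_score_adj` is a hypothetical helper name, and lowering the score normally requires root.

```shell
# Hypothetical helper: set oom_score_adj for PID $1 to value $2.
# Returns 2 without touching /proc if the value is not an integer
# in the documented range [-1000, 1000].
set_oom_score_adj() {
  pid=$1
  value=$2
  if ! { [ "$value" -ge -1000 ] && [ "$value" -le 1000 ]; } 2>/dev/null; then
    echo "oom_score_adj must be an integer in [-1000, 1000]: $value" >&2
    return 2
  fi
  printf '%s\n' "$value" > "/proc/$pid/oom_score_adj"
}
```

Usage: `set_oom_score_adj "$(pidof -s sshd)" -1000`, where `sshd` is only an example target. Exempting too many processes leaves the OOM killer with poor candidates, so keep the list short.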
**Swap exhausted — add a swapfile:**
```
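fallocate -l 2G /swapfile   # if swapon rejects the file on this filesystem, create it with dd instead
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# not persistent: add '/swapfile none swap sw 0 0' to /etc/fstab to keep it across reboots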
```
**High I/O wait — find the I/O-heavy process:**
```
iotop -a -o -b -n3
```
**Zombie reaping — if parent is stuck:**
Kill the parent process. Its zombie children are then reparented to PID 1 (or a subreaper), which reaps them. Verify afterwards that the zombies have disappeared.