tai/runbooks/kernel.md
zphinx 57f4c0efaa
feat: complete RAG runbook workflow and release docs
2026-05-06 04:48:41 +02:00

---
service: kernel
symptoms: OOM kill, out of memory, high load average, kernel panic, segfault, soft lockup, CPU steal, system unresponsive, zombie processes, NMI watchdog
tags: kernel, oom, memory, load, cpu, panic, dmesg, segfault, lockup, swap, zombie
---
## Symptoms
- `Out of memory: Kill process <pid>` in dmesg — OOM killer fired
- Load average far above CPU count — system overloaded or I/O blocked
- `kernel: BUG: soft lockup` — CPU stuck in kernel code
- `segfault at ...` in dmesg — process crashed due to invalid memory access
- `kernel panic` — unrecoverable kernel error (visible only on console or serial)
- Many zombie (`Z`) processes in `ps` output
- High `%steal` in `top`/`vmstat` — hypervisor CPU contention
## Diagnostics
### Recent kernel messages
```
dmesg -T | tail -100
dmesg -T | grep -iE 'error|warn|oom|kill|panic|oops|fault|hung|lockup'
journalctl -k -n 200
```
### OOM events
```
dmesg -T | grep -iE 'out of memory|oom[-_]kill|killed process'
```
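The `Killed process` line names the victim and its memory footprint at the time of the kill (`total-vm`, `anon-rss`, `file-rss`); the memory summary the kernel dumps just before it shows how much memory was still available.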
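To scan a long log quickly, the `Killed process` lines can be reduced to one row per kill. This is a minimal sketch, assuming the line contains `Killed process <pid> (<name>)` followed by an `anon-rss:<n>kB` field; `oom_victims` is a hypothetical helper name, and kernels that format the line differently will simply produce no output.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints
# "<pid> <name> <anon-rss in kB>" for each OOM kill line it recognises.
oom_victims() {
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*anon-rss:([0-9]+)kB.*/\1 \2 \3/p'
}

# Usage: dmesg -T | oom_victims
```

Repeated kills of the same process name usually point at the root consumer faster than reading the full dump.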
### Memory usage
```
free -h
head -30 /proc/meminfo
vmstat -s
```
`MemAvailable` is the key metric. If it is near zero and swap is also exhausted, OOM kills are imminent.
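"Near zero" is easier to judge as a percentage of total memory. The sketch below derives it from the `MemTotal` and `MemAvailable` fields; `mem_available_pct` is a hypothetical helper name, and the optional file argument exists only so the function can be run against a saved copy of `/proc/meminfo`.

```shell
# Hypothetical helper: prints MemAvailable as a whole-number percentage
# of MemTotal. Reads /proc/meminfo unless another file is given.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Usage: mem_available_pct
```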
### Swap
```
swapon --show
cat /proc/swaps
vmstat 1 5
```
High `si`/`so` (swap-in/swap-out) in `vmstat` indicates active swapping and likely memory pressure.
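The first row `vmstat` prints is an average since boot, so it can hide or exaggerate current swapping. The sketch below sums `si` and `so` over the remaining samples only; `swap_activity` is a hypothetical helper name, and the column positions are looked up from the header row rather than assumed.

```shell
# Hypothetical helper: reads `vmstat <interval> <count>` output on stdin and
# prints the summed si/so values, skipping the first (since-boot) sample.
swap_activity() {
  awk '
    $1 == "r" && $2 == "b" {            # header row: locate the si/so columns
      for (i = 1; i <= NF; i++) {
        if ($i == "si") si = i
        if ($i == "so") so = i
      }
      next
    }
    si && $1 ~ /^[0-9]+$/ {             # data row
      n++
      if (n > 1) { tin += $si; tout += $so }
    }
    END { printf "si=%d so=%d\n", tin, tout }
  '
}

# Usage: vmstat 1 5 | swap_activity
```

Non-zero totals across several consecutive runs are a stronger signal than a single burst.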
### Load average and CPU
```
uptime
top -b -n1 | head -30
mpstat -P ALL 1 3
```
Load average above 2× CPU count sustained over 15 minutes is concerning.
High `%iowait` indicates processes blocked on disk I/O, not CPU-bound load.
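The 2× rule of thumb above can be checked mechanically. This is a minimal sketch; `load_status` is a hypothetical helper name, and it takes the load average and CPU count as arguments so the threshold logic stays separate from how the values are collected.

```shell
# Hypothetical helper: prints "high" when the load average ($1) exceeds
# twice the CPU count ($2), otherwise "ok".
load_status() {
  awk -v load="$1" -v cpus="$2" \
    'BEGIN { print (((load + 0) > 2 * cpus) ? "high" : "ok") }'
}

# Usage (Linux; the third field of /proc/loadavg is the 15-minute average):
#   load_status "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)"
```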
### Process memory and CPU usage
```
ps aux --sort=-%mem | head -20
ps aux --sort=-%cpu | head -20
```
### Zombie processes
```
ps aux | awk '$8 ~ /^Z/'   # STAT may carry suffixes such as Z+ or Zs
```
Zombies cannot be killed; the parent must `wait()` for them or be killed itself.
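Because the fix lies with the parent, it helps to group zombies by parent PID. A minimal sketch, assuming a `ps` that accepts `-eo ppid=,stat=` (as procps does); `zombies_by_parent` is a hypothetical helper name.

```shell
# Hypothetical helper: reads "<ppid> <stat>" pairs on stdin and prints
# "<zombie count> <ppid>", busiest parent first.
zombies_by_parent() {
  awk '$2 ~ /^Z/ { count[$1]++ }
       END { for (p in count) print count[p], p }' | sort -rn
}

# Usage: ps -eo ppid=,stat= | zombies_by_parent
```

A single parent holding most of the zombies is the process to investigate first.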
### I/O wait and disk health
```
iostat -x 1 3
dmesg -T | grep -iE 'i/o error|hard resetting link|ata.*error|blk_update_request'
```
Persistent I/O errors alongside high load suggest failing storage.
## Remediation
**Memory pressure / frequent OOM kills:**
Identify the largest memory consumers from `ps aux --sort=-%mem`.
Consider increasing swap, adding RAM, tuning `vm.overcommit_memory`, or scaling the workload.
Do NOT just raise `vm.overcommit_ratio` without understanding the root consumer.
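For context, `vm.overcommit_ratio` only matters when `vm.overcommit_memory` is set to 2; in that mode the kernel refuses allocations once `Committed_AS` would pass `CommitLimit`. The sketch below shows the remaining headroom from those two `/proc/meminfo` fields; `commit_headroom_kb` is a hypothetical helper name.

```shell
# Hypothetical helper: prints CommitLimit minus Committed_AS in kB.
# Reads /proc/meminfo unless another file is given.
commit_headroom_kb() {
  awk '/^CommitLimit:/  { limit = $2 }
       /^Committed_AS:/ { used = $2 }
       END { print limit - used }' "${1:-/proc/meminfo}"
}

# Usage: sysctl vm.overcommit_memory vm.overcommit_ratio; commit_headroom_kb
```

A negative result is normal in overcommit modes 0 and 1, where `CommitLimit` is not enforced.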
**Adjust OOM killer scoring for critical services (per-process setting, lost when the process restarts):**
```
echo -17 > /proc/<pid>/oom_adj           # legacy interface (deprecated); -17 disables OOM kill
echo -1000 > /proc/<pid>/oom_score_adj   # current kernels; -1000 exempts the process
```
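Before and after changing the adjustment, it is worth reading back what the kernel currently holds for the process. A minimal sketch; `oom_info` is a hypothetical helper name, and its second argument (an alternative proc root) is an assumption added only to make the function easy to exercise outside a live system.

```shell
# Hypothetical helper: prints "<oom_score_adj> <oom_score>" for a PID.
# $1 = PID, $2 = proc root (defaults to /proc).
oom_info() {
  proc_root="${2:-/proc}"
  printf '%s %s\n' \
    "$(cat "$proc_root/$1/oom_score_adj")" \
    "$(cat "$proc_root/$1/oom_score")"
}

# Usage: oom_info 1234
```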
**Swap exhausted — add a swapfile:**
```
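# If swapon rejects a fallocate'd file (some filesystems), create it with:
#   dd if=/dev/zero of=/swapfile bs=1M count=2048
fallocate -l 2G /swapfile
chmod 600 /swapfile    # swap contents must not be readable by other users
mkswap /swapfile
swapon /swapfile       # active until reboot; add "/swapfile none swap sw 0 0" to /etc/fstab to persist
swapon --show          # confirm the new swap area is listed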
```
**High I/O wait — find the I/O-heavy process:**
```
iotop -a -o -b -n3
```
**Zombie reaping — if parent is stuck:**
Kill the parent process. Its zombie children are then re-parented to init (PID 1) or the nearest subreaper, which reaps them. Verify afterwards that the zombies have disappeared.