feat: complete RAG runbook workflow and release docs
Some checks failed
CI / test (push) Failing after 15s
117
runbooks/kernel.md
Normal file
@@ -0,0 +1,117 @@
---
service: kernel
symptoms: OOM kill, out of memory, high load average, kernel panic, segfault, soft lockup, CPU steal, system unresponsive, zombie processes, NMI watchdog
tags: kernel, oom, memory, load, cpu, panic, dmesg, segfault, lockup, swap, zombie
---

## Symptoms

- `Out of memory: Killed process <pid>` (`Kill process` on older kernels) in dmesg: the OOM killer fired
- Load average far above CPU count: system overloaded or blocked on I/O
- `kernel: BUG: soft lockup`: a CPU is stuck in kernel code
- `segfault at ...` in dmesg: a process crashed on an invalid memory access
- `kernel panic`: unrecoverable kernel error (often visible only on the console or serial line)
- Many zombie (`Z`) processes in `ps` output
- High steal time (`st` in `top`/`vmstat`, `%steal` in `mpstat`): hypervisor CPU contention

## Diagnostics

### Recent kernel messages

```
dmesg -T | tail -100
dmesg -T | grep -iE 'error|warn|oom|kill|panic|oops|fault|hung|lockup'
journalctl -k -n 200
```

### OOM events

```
dmesg -T | grep -iE 'out of memory|oom_kill|killed process'
```

The log names the killed process and reports its RSS at the time of the kill, along with a summary of system memory.
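
To get a quick list of victims out of a long kernel log, the kill lines can be reduced to PID and process name. This is a minimal sketch: `oom_victims` is a hypothetical helper name, and the pattern assumes the common message shape `Killed process <pid> (<name>)` (or `Kill process` on older kernels), which may differ on some kernel versions.

```shell
# Extract "<pid> <name>" for each OOM victim from kernel log text on stdin.
# Assumption: kill lines look like "... Killed process <pid> (<name>) ...";
# the optional "(ed)?" also accepts the older "Kill process" wording.
oom_victims() {
  sed -nE 's/.*Kill(ed)? process ([0-9]+) \(([^)]*)\).*/\2 \3/p'
}
```

Typical use would be `dmesg -T | oom_victims`. Lines that do not match the pattern are dropped silently, so an empty result is not proof that no OOM kill occurred.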

### Memory usage

```
free -h
head -30 /proc/meminfo
vmstat -s
```

`MemAvailable` is the key metric. If it is near zero and swap is also exhausted, OOM kills are imminent.
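
To judge how close `MemAvailable` is to zero, it can help to express it as a share of `MemTotal`. The sketch below assumes the standard `/proc/meminfo` field names; `mem_available_pct` is a hypothetical helper, and the optional path argument exists only so it can be run against a saved copy of the file.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# Optional argument: path to a meminfo-style file (defaults to /proc/meminfo).
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END {
         if (total > 0) printf "%d\n", avail * 100 / total
         else exit 1   # MemTotal missing: not a usable meminfo file
       }' "${1:-/proc/meminfo}"
}
```

Running `mem_available_pct` with no arguments reads the live system. What counts as "near zero" depends on the workload, so no threshold is built in.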

### Swap

```
swapon --show
cat /proc/swaps
vmstat 1 5
```

Sustained nonzero `si`/`so` (swap-in/swap-out) in `vmstat` indicates active swapping and likely memory pressure. The first row reports averages since boot, so read the later rows.
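
When scanning several `vmstat` samples, the peak swap-in and swap-out values are usually what matters. The following sketch finds the `si` and `so` columns from the header row instead of hard-coding positions; `max_swap_activity` is a hypothetical helper, and it assumes the default `vmstat` layout where the column header row starts with `r`.

```shell
# Read `vmstat` output on stdin and print the largest si and so values seen.
# Assumption: the column header row starts with "r" and names "si" and "so".
max_swap_activity() {
  awk '
    $1 == "r" {                      # header row: remember column positions
      for (i = 1; i <= NF; i++) {
        if ($i == "si") s = i
        if ($i == "so") o = i
      }
      next
    }
    s && o && $1 ~ /^[0-9]+$/ {      # data rows only
      if ($s + 0 > mi + 0) mi = $s
      if ($o + 0 > mo + 0) mo = $o
    }
    END { printf "si=%d so=%d\n", mi, mo }
  '
}
```

For example, `vmstat 1 5 | max_swap_activity`. Note that this includes the first since-boot row, so a short sample on a long-running host may be dominated by historical averages.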

### Load average and CPU

```
uptime
top -b -n1 | head -30
mpstat -P ALL 1 3
```

A load average that stays above 2× the CPU count across the 15-minute window is concerning.
High `%iowait` means the load comes from processes blocked on disk I/O rather than from CPU-bound work.
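
The 2× rule of thumb above can be checked directly by dividing the 15-minute load average by the CPU count. This is a sketch under the assumption that `/proc/loadavg` and `nproc` are available; `load_per_cpu` is a hypothetical helper, and both arguments are optional overrides intended for offline use.

```shell
# Print the 15-minute load average divided by the CPU count, to two decimals.
# $1: loadavg-style file (default /proc/loadavg); $2: CPU count (default nproc).
load_per_cpu() {
  cpus="${2:-$(nproc)}"
  # Field 3 of /proc/loadavg is the 15-minute average.
  awk -v cpus="$cpus" '{ printf "%.2f\n", $3 / cpus }' "${1:-/proc/loadavg}"
}
```

A result above `2.00` corresponds to the threshold described above. Pair it with `%iowait` from `mpstat` to tell CPU saturation apart from I/O stalls.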

### Process memory and CPU usage

```
ps aux --sort=-%mem | head -20
ps aux --sort=-%cpu | head -20
```

### Zombie processes

```
ps aux | awk '$8 ~ /^Z/'
```

Zombies cannot be killed; the parent must `wait()` for them, or the parent itself must exit so that init adopts and reaps them.
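
Because the fix always involves the parent, it is useful to group zombies by parent PID. The sketch below assumes input in the column order produced by `ps -eo pid,ppid,stat,comm`; `zombie_parents` is a hypothetical helper name.

```shell
# Read `ps -eo pid,ppid,stat,comm` output on stdin and print one line per
# parent PID that has zombie children: "<ppid> <zombie count>".
zombie_parents() {
  # NR > 1 skips the header row; STAT may carry suffixes such as "Z+" or "Zs".
  awk 'NR > 1 && $3 ~ /^Z/ { n[$2]++ }
       END { for (p in n) print p, n[p] }' | sort -n
}
```

Usage: `ps -eo pid,ppid,stat,comm | zombie_parents`. A parent with a steadily growing count is the process to investigate.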

### I/O wait and disk health

```
iostat -x 1 3
dmesg -T | grep -iE 'i/o error|hard resetting link|ata.*error|blk_update_request'
```

Persistent I/O errors alongside high load suggest failing storage.

## Remediation

**Memory pressure / frequent OOM kills:**
Identify the largest memory consumers from `ps aux --sort=-%mem`.
Consider increasing swap, adding RAM, tuning `vm.overcommit_memory`, or scaling the workload.
Do NOT just raise `vm.overcommit_ratio` (which only takes effect when `vm.overcommit_memory=2`) without understanding the root consumer.

**Adjust OOM killer scoring for critical services (applies to the running PID only; lost when the process restarts):**
```
echo -17 > /proc/<pid>/oom_adj           # legacy, deprecated interface; -17 exempts the process
echo -1000 > /proc/<pid>/oom_score_adj   # current kernels; -1000 exempts the process
```

**Swap exhausted (add a swapfile):**
```
fallocate -l 2G /swapfile   # if swapon rejects the file on this filesystem, create it with dd instead
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile            # active until reboot; add an /etc/fstab entry to persist
```

**High I/O wait (find the I/O-heavy process):**
```
iotop -a -o -b -n3
```

**Zombie reaping (if the parent is stuck):**
Kill the parent process; its zombie children are then re-parented to init (PID 1), which reaps them. Verify afterwards that the zombies are gone.