tai/runbooks/kernel.md
zphinx 57f4c0efaa
feat: complete RAG runbook workflow and release docs
2026-05-06 04:48:41 +02:00


service: kernel
symptoms: OOM kill, out of memory, high load average, kernel panic, segfault, soft lockup, CPU steal, system unresponsive, zombie processes, NMI watchdog
tags: kernel, oom, memory, load, cpu, panic, dmesg, segfault, lockup, swap, zombie

Symptoms

  • Out of memory: Kill process <pid> in dmesg — OOM killer fired
  • Load average far above CPU count — system overloaded or I/O blocked
  • kernel: BUG: soft lockup — CPU stuck in kernel code
  • segfault at ... in dmesg — process crashed due to invalid memory access
  • kernel panic — unrecoverable kernel error (visible only on console or serial)
  • Many zombie (Z) processes in ps output
  • High %steal in top/vmstat — hypervisor CPU contention

Diagnostics

Recent kernel messages

dmesg -T | tail -100
dmesg -T | grep -iE 'error|warn|oom|kill|panic|oops|fault|hung|lockup'
journalctl -k -n 200

OOM events

dmesg -T | grep -iE 'out of memory|oom[-_]kill|killed process'

The log shows which process was killed, its RSS at time of kill, and available memory.
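
To pull those fields out of a long log, a small filter can help. The following is a minimal sketch: the function name `oom_summary` is hypothetical, and it assumes the kill line contains `Killed process <pid> (<name>)` followed later by `anon-rss:<n>kB`. The exact layout varies between kernel versions, so lines that do not match are skipped rather than reported.

```shell
# Sketch: summarize OOM kills from kernel log text on stdin.
# Assumed line shape (varies by kernel version):
#   ... Killed process <pid> (<name>) ... anon-rss:<n>kB ...
# Lines that do not match this shape produce no output.
oom_summary() {
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*anon-rss:([0-9]+)kB.*/pid=\1 name=\2 anon_rss_kb=\3/p'
}

# Typical use (may need root if dmesg access is restricted):
#   dmesg -T | oom_summary
```

If the function prints nothing although the grep above shows OOM events, the kernel's message format probably differs from the assumed one and the pattern needs adjusting.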

Memory usage

free -h
head -30 /proc/meminfo
vmstat -s

MemAvailable is the key metric. If it is near zero and swap is also exhausted, OOM kills are imminent.
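
To track MemAvailable as a single number, it can be expressed as a percentage of MemTotal. This is a minimal sketch; the function name `mem_available_pct` is hypothetical, and it assumes the standard `MemTotal:` and `MemAvailable:` lines are present in the meminfo file.

```shell
# Sketch: print MemAvailable as a whole-number percentage of MemTotal.
# $1 = path to a meminfo-format file (defaults to /proc/meminfo).
# Prints nothing if MemTotal is missing or zero.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Typical use:
#   mem_available_pct
```

Which percentage counts as "near zero" depends on the host; this sketch deliberately sets no threshold.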

Swap

swapon --show
cat /proc/swaps
vmstat 1 5

High si/so (swap-in/swap-out) in vmstat indicates active swapping and likely memory pressure.
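
The si/so check can be reduced to one number by summing both columns over the sampled intervals. This is a sketch under assumptions: the function name `swap_activity` is hypothetical, and it assumes the default vmstat layout with two header lines and si and so in columns 7 and 8. The first data row is skipped because it reports averages since boot rather than current activity.

```shell
# Sketch: sum si + so over the interval samples of `vmstat <interval> <count>`.
# Assumes the default layout: 2 header lines, then data rows with
# si in column 7 and so in column 8.
# Row 3 (the first data row) is an average since boot, so it is skipped.
swap_activity() {
  awk 'NR > 3 { total += $7 + $8 } END { print total + 0 }'
}

# Typical use:
#   vmstat 1 5 | swap_activity
```

A result of 0 means no swap traffic was seen during the sampled intervals; a persistently large value across runs matches the "active swapping" case described above.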

Load average and CPU

uptime
top -b -n1 | head -30
mpstat -P ALL 1 3

Load average above 2× CPU count sustained over 15 minutes is concerning. High %iowait indicates processes blocked on disk I/O, not CPU-bound load.
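
The 2× CPU count rule of thumb can be checked mechanically. A minimal sketch follows; the function name `load_exceeds` is hypothetical, and the factor of 2 is taken from the guideline above rather than from any standard.

```shell
# Sketch: succeed (exit 0) if the given load average is above 2x the CPU count.
# $1 = load average (for example the 15-minute value), $2 = number of CPUs.
# awk handles the floating-point comparison that plain sh cannot.
load_exceeds() {
  awk -v load="$1" -v ncpu="$2" 'BEGIN { exit !(load > 2 * ncpu) }'
}

# Typical use on Linux (third field of /proc/loadavg is the 15-minute average):
#   read -r _ _ load15 _ < /proc/loadavg
#   if load_exceeds "$load15" "$(nproc)"; then echo "load is high"; fi
```

The check says nothing about the cause; %iowait from mpstat still has to be read to tell I/O-bound load from CPU-bound load.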

Process memory usage

ps aux --sort=-%mem | head -20
ps aux --sort=-%cpu | head -20

Zombie processes

ps aux | awk '$8 ~ /^Z/'   # STAT can be Z, Zs, Z+, ..., so match the prefix

Zombies cannot be killed; the parent must wait() for them or be killed itself.
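
Because the parent is what matters, it helps to group zombies by parent PID. The following is a sketch; the function name `zombies_by_parent` is hypothetical, and it assumes input in the column order produced by `ps -eo pid,ppid,stat,comm`, with one header line.

```shell
# Sketch: count zombie processes per parent PID.
# stdin: output of `ps -eo pid,ppid,stat,comm` (header line + one row per process).
# Output: "<ppid> <zombie count>" per line, sorted by parent PID.
zombies_by_parent() {
  awk 'NR > 1 && $3 ~ /^Z/ { count[$2]++ }
       END { for (ppid in count) print ppid, count[ppid] }' | sort -n
}

# Typical use:
#   ps -eo pid,ppid,stat,comm | zombies_by_parent
```

A parent with a steadily growing count is the process to investigate.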

I/O wait and disk health

iostat -x 1 3
dmesg -T | grep -iE 'i/o error|hard resetting link|ata.*error|blk_update_request'

Persistent I/O errors alongside high load suggest failing storage.

Remediation

Memory pressure / frequent OOM kills: Identify the largest memory consumers from ps aux --sort=-%mem. Consider increasing swap, adding RAM, tuning vm.overcommit_memory, or scaling the workload. Do NOT just raise vm.overcommit_ratio without understanding the root consumer.

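Adjust OOM killer scoring for critical services (not persistent: the value applies only to the running process and is lost when it restarts; for systemd-managed services, OOMScoreAdjust= in the unit file makes it permanent):

echo -17 > /proc/<pid>/oom_adj          # legacy, deprecated interface; -17 exempts the process
echo -1000 > /proc/<pid>/oom_score_adj  # current kernels; range -1000..1000, -1000 exempts the process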
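
Since an out-of-range or malformed value is rejected by the kernel, it can be worth validating before writing. This is a minimal sketch; both function names are hypothetical, and the write itself needs root privileges.

```shell
# Sketch: validate an oom_score_adj value (integer in -1000..1000).
valid_oom_score_adj() {
  case "$1" in
    ''|-|*[!0-9-]*|?*-*) return 1 ;;   # empty, bare "-", non-digits, or misplaced "-"
  esac
  [ "$1" -ge -1000 ] && [ "$1" -le 1000 ]
}

# Sketch: write a validated value for one process.
# $1 = PID, $2 = score adjustment.
set_oom_score_adj() {
  valid_oom_score_adj "$2" || { echo "invalid oom_score_adj: $2" >&2; return 1; }
  echo "$2" > "/proc/$1/oom_score_adj"
}

# Typical use (as root; 1234 is a placeholder PID):
#   set_oom_score_adj 1234 -1000
```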

Swap exhausted — add a swapfile:

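# If swapon rejects the file (some filesystems treat fallocate output as having holes), create it with dd instead.
fallocate -l 2G /swapfile
chmod 600 /swapfile   # swap contents must not be readable by other users
mkswap /swapfile
swapon /swapfile      # active until reboot; add an /etc/fstab entry to persist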
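
The swapfile activated above does not survive a reboot unless it is listed in /etc/fstab. The following sketch adds the entry only if it is missing, so it can be run repeatedly. The function name `ensure_fstab_swap` is hypothetical, and the mount options (`sw 0 0`) are a common default, not something this runbook prescribes.

```shell
# Sketch: append a swap entry to fstab unless one for the same path exists.
# $1 = swapfile path, $2 = fstab path (defaults to /etc/fstab).
ensure_fstab_swap() {
  if ! awk -v p="$1" '$1 == p && $3 == "swap" { found = 1 } END { exit !found }' "${2:-/etc/fstab}"; then
    printf '%s none swap sw 0 0\n' "$1" >> "${2:-/etc/fstab}"
  fi
}

# Typical use (as root):
#   ensure_fstab_swap /swapfile
```

Passing a second argument makes it possible to try the function against a copy of fstab before touching the real file.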

High I/O wait — find the I/O-heavy process:

iotop -a -o -b -n3

Zombie reaping when the parent is stuck: Kill the parent process. Its zombie children are then re-parented to init (PID 1) or a designated subreaper, which reaps them. Verify afterwards that the zombies are gone.