tai/kernel.md at feature/history-ux-session-memory

zphinx 57f4c0efaa

feat: complete RAG runbook workflow and release docs

2026-05-06 04:48:41 +02:00

service	symptoms	tags
kernel	OOM kill, out of memory, high load average, kernel panic, segfault, soft lockup, CPU steal, system unresponsive, zombie processes, NMI watchdog	kernel, oom, memory, load, cpu, panic, dmesg, segfault, lockup, swap, zombie

Symptoms

Out of memory: Kill process <pid> in dmesg — OOM killer fired
Load average far above CPU count — system overloaded or I/O blocked
kernel: BUG: soft lockup — CPU stuck in kernel code
segfault at ... in dmesg — process crashed due to invalid memory access
kernel panic — unrecoverable kernel error (visible only on console or serial)
Many zombie (Z) processes in ps output
High %steal in top/vmstat — hypervisor CPU contention

Diagnostics

Recent kernel messages

dmesg -T | tail -100
dmesg -T | grep -iE 'error|warn|oom|kill|panic|oops|fault|hung|lockup'
journalctl -k -n 200

OOM events

dmesg -T | grep -i 'out of memory\|oom_kill\|killed process'

The log shows which process was killed, its RSS at time of kill, and available memory.
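
The fields mentioned above can be pulled out of the kernel log with a small filter. This is a minimal sketch that assumes the common `Killed process <pid> (<name>) ... anon-rss:<n>kB` message layout; the wording varies between kernel versions, and `oom_victims` is a hypothetical helper name, not part of any tool.

```shell
# Extract PID, process name and anonymous RSS from OOM kill messages.
# Reads log text on stdin; lines that do not match the assumed layout are skipped.
oom_victims() {
    sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/pid=\1 name=\2 anon_rss_kb=\3/p'
}

# Usage on a live system (may need root, depending on kernel.dmesg_restrict):
# dmesg -T | oom_victims
```

If the filter prints nothing while the plain grep above finds matches, the kernel is probably using a different message layout and the pattern needs adjusting.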

Memory usage

free -h
head -30 /proc/meminfo
vmstat -s

MemAvailable is the key metric. If it is near zero and swap is also exhausted, OOM kills are imminent.
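
To put a number on "near zero", MemAvailable can be expressed as a share of MemTotal. The sketch below is one way to do that; `mem_available_pct` is a hypothetical helper, and the 5% threshold in the usage comment is an arbitrary example, not a recommendation from this runbook.

```shell
# Print MemAvailable as a percentage of MemTotal.
# Reads /proc/meminfo by default; pass another file to try it on sample data.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%.1f\n", avail * 100 / total
            else exit 1   # MemTotal missing: input is not meminfo-shaped
        }
    ' "${1:-/proc/meminfo}"
}

# Usage: warn when less than 5% of memory is available.
# pct=$(mem_available_pct) && awk -v p="$pct" 'BEGIN { exit !(p < 5) }' && echo "low memory: ${pct}%"
```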

Swap

swapon --show
cat /proc/swaps
vmstat 1 5

High si/so (swap-in/swap-out) in vmstat indicates active swapping and likely memory pressure.
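
Reading si/so by eye is error-prone because the first vmstat row reports averages since boot rather than current activity. The following sketch counts how many of the later samples show swap traffic; `swap_activity` is a hypothetical helper, and it locates the si/so columns from the header row instead of assuming fixed positions.

```shell
# Count vmstat samples with non-zero swap-in or swap-out.
# Reads vmstat output on stdin and ignores the first data row (since-boot average).
swap_activity() {
    awk '
        # Header row: remember which columns hold si and so.
        /(^| )si( |$)/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        # Data rows: numeric first field, only after the header has been seen.
        si && $1 ~ /^[0-9]+$/ {
            n++
            if (n == 1) next
            if ($si > 0 || $so > 0) active++
        }
        END { printf "samples=%d swapping=%d\n", (n > 0 ? n - 1 : 0), active }
    '
}

# Usage on a live system:
# vmstat 1 5 | swap_activity
```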

Load average and CPU

uptime
top -b -n1 | head -30
mpstat -P ALL 1 3

Load average above 2× CPU count sustained over 15 minutes is concerning. High %iowait indicates processes blocked on disk I/O, not CPU-bound load.
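
The 2× rule of thumb above can be checked mechanically. This is a minimal sketch, assuming a Linux host with /proc/loadavg; `load_check` is a hypothetical helper, and its two optional arguments exist only so the comparison can be tried with sample values.

```shell
# Compare the 15-minute load average against 2x the online CPU count.
# $1 = load average override, $2 = CPU count override (both optional).
load_check() {
    load15=${1:-$(awk '{ print $3 }' /proc/loadavg)}
    cpus=${2:-$(getconf _NPROCESSORS_ONLN)}
    awk -v l="$load15" -v c="$cpus" 'BEGIN {
        status = (l > 2 * c) ? "HIGH" : "ok"
        printf "%s load15=%s cpus=%s\n", status, l, c
    }'
}

# Usage on a live system:
# load_check
```

A "HIGH" result says nothing about the cause; check %iowait in mpstat before assuming the load is CPU-bound.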

Process memory and CPU usage

ps aux --sort=-%mem | head -20
ps aux --sort=-%cpu | head -20

Zombie processes

ps aux | awk '$8 ~ /^Z/'    # STAT may carry modifiers (Z+, Zs), so match the prefix

Zombies are already dead, so signals have no effect on them. The parent must wait() for them, or exit so that init adopts and reaps them.
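
Because the fix always involves the parent, it helps to list each zombie next to its parent PID. A minimal sketch, assuming procps-style `ps` output with pid, ppid, stat and comm columns; `zombie_parents` is a hypothetical helper, and only the first word of the command name is printed.

```shell
# List zombie processes together with their parent PID.
# Reads "pid ppid stat comm" rows on stdin.
zombie_parents() {
    awk '$3 ~ /^Z/ { print "zombie pid=" $1 " ppid=" $2 " cmd=" $4 }'
}

# Usage on a live system (the trailing "=" suppresses the header row):
# ps -eo pid=,ppid=,stat=,comm= | zombie_parents
```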

I/O wait and disk health

iostat -x 1 3
dmesg -T | grep -iE 'i/o error|hard resetting link|ata.*error|blk_update_request'

Persistent I/O errors alongside high load suggest failing storage.

Remediation

Memory pressure / frequent OOM kills: Identify the largest memory consumers from ps aux --sort=-%mem. Consider increasing swap, adding RAM, tuning vm.overcommit_memory, or scaling the workload. Do NOT just raise vm.overcommit_ratio without understanding the root consumer.

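Adjust OOM killer scoring for critical services (per-process setting, lost when the process restarts; for systemd units, OOMScoreAdjust= in the unit file makes it persistent):

echo -17 > /proc/<pid>/oom_adj          # legacy interface (deprecated)
echo -1000 > /proc/<pid>/oom_score_adj  # current kernels; -1000 exempts the process from OOM kills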

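Swap exhausted: add a swapfile (active until reboot; add an /etc/fstab entry to make it permanent):

fallocate -l 2G /swapfile    # if swapon rejects the file (e.g. on Btrfs), create it with dd from /dev/zero instead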
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

High I/O wait — find the I/O-heavy process:

iotop -a -o -b -n3

Zombie reaping, if the parent is stuck: kill the parent process. Its zombie children are then re-parented to init (PID 1) or the nearest subreaper, which reaps them. Verify afterwards that the zombies are gone.
