---
service: kernel
symptoms: OOM kill, out of memory, high load average, kernel panic, segfault, soft lockup, CPU steal, system unresponsive, zombie processes, NMI watchdog
tags: kernel, oom, memory, load, cpu, panic, dmesg, segfault, lockup, swap, zombie
---

## Symptoms

- `Out of memory: Kill process` in dmesg: the OOM killer fired
- Load average far above CPU count: system overloaded or blocked on I/O
- `kernel: BUG: soft lockup`: a CPU is stuck in kernel code
- `segfault at ...` in dmesg: a process crashed on an invalid memory access
- `kernel panic`: unrecoverable kernel error (often visible only on the console or serial port)
- Many zombie (`Z`) processes in `ps` output
- High `%steal` in `top`/`vmstat`: hypervisor CPU contention

## Diagnostics

### Recent kernel messages

```
dmesg -T | tail -100
dmesg -T | grep -iE 'error|warn|oom|kill|panic|oops|fault|hung|lockup'
journalctl -k -n 200
```

### OOM events

```
dmesg -T | grep -i 'out of memory\|oom_kill\|killed process'
```

The log shows which process was killed, its RSS at the time of the kill, and the memory available.

### Memory usage

```
free -h
cat /proc/meminfo | head -30
vmstat -s
```

`MemAvailable` is the key metric. If it is near zero and swap is also exhausted, OOM kills are imminent.

### Swap

```
swapon --show
cat /proc/swaps
vmstat 1 5
```

High `si`/`so` (swap-in/swap-out) values in `vmstat` indicate active swapping and likely memory pressure.

### Load average and CPU

```
uptime
top -b -n1 | head -30
mpstat -P ALL 1 3
```

A load average sustained above 2× the CPU count for 15 minutes or more is concerning. High `%iowait` means processes are blocked on disk I/O rather than competing for CPU.

### Process memory usage

```
ps aux --sort=-%mem | head -20
ps aux --sort=-%cpu | head -20
```

### Zombie processes

```
ps aux | awk '$8 ~ /^Z/'
```

The pattern match also catches states such as `Z+` and `Zs`. Zombies cannot be killed; the parent must `wait()` for them or be killed itself.
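Because a zombie only goes away once its parent acts, it usually helps to know which parent owns the most zombies. The following is a minimal sketch of that grouping, not part of the original runbook: `zombie_parents` is a hypothetical helper name, and it assumes the input has the column layout produced by `ps -eo pid,ppid,stat,comm`.

```shell
# Hypothetical helper (assumed name): reads `ps -eo pid,ppid,stat,comm`
# output on stdin and prints "<zombie count> <parent PID>" per parent,
# largest count first. The header row is skipped.
zombie_parents() {
    awk 'NR > 1 && $3 ~ /^Z/ { n[$2]++ }
         END { for (p in n) print n[p], p }' | sort -k1,1nr -k2,2n
}

# Usage on a live system:
#   ps -eo pid,ppid,stat,comm | zombie_parents
#   ps -o pid,stat,etime,args -p <parent PID>   # inspect one parent from the list
```

Reading from stdin rather than calling `ps` inside the function keeps the helper usable on saved `ps` output from an incident.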
### I/O wait and disk health

```
iostat -x 1 3
dmesg -T | grep -iE 'i/o error|hard resetting link|ata.*error|blk_update_request'
```

Persistent I/O errors alongside high load suggest failing storage.

## Remediation

**Memory pressure / frequent OOM kills:**
Identify the largest memory consumers with `ps aux --sort=-%mem`. Consider increasing swap, adding RAM, tuning `vm.overcommit_memory`, or scaling the workload. Do NOT just raise `vm.overcommit_ratio` without understanding the root consumer.

**Adjust OOM killer scoring for critical services (temporary, resets on reboot):**

```
echo -17 > /proc/<pid>/oom_adj            # legacy
echo -1000 > /proc/<pid>/oom_score_adj    # current kernels
```

**Swap exhausted (add a swapfile):**

```
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
```

**High I/O wait (find the I/O-heavy process):**

```
iotop -a -o -b -n3
```

**Zombie reaping (if the parent is stuck):**
Kill the parent process. Its zombie children are then re-parented to init (PID 1) or the nearest subreaper, which reaps them. Afterwards, verify that the zombies have disappeared.
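To confirm that a remediation step actually helped, it can be useful to capture the same few numbers before and after the change. The following is a minimal sketch under stated assumptions: `health_snapshot` is a hypothetical helper name, and its two optional file arguments are an illustration-only convenience so the function can be pointed at saved copies of `/proc/meminfo` and `/proc/loadavg`.

```shell
# Hypothetical helper (assumed name): prints MemAvailable, SwapFree and the
# 1-minute load average on a single line. Arguments default to the live
# /proc files; pass saved copies to analyse a past state.
health_snapshot() {
    meminfo=${1:-/proc/meminfo}
    loadavg=${2:-/proc/loadavg}
    # /proc/meminfo lines look like "MemAvailable:  2000000 kB"
    awk '$1 == "MemAvailable:" { avail = $2 }
         $1 == "SwapFree:"     { swap = $2 }
         END { printf "mem_available_kb=%s swap_free_kb=%s ", avail, swap }' "$meminfo"
    # The first field of /proc/loadavg is the 1-minute load average
    awk '{ print "load1=" $1 }' "$loadavg"
}

# Usage: run once before and once after the fix, then compare the two lines.
#   health_snapshot
```

A single-line output keeps the before/after comparison easy to paste into an incident log.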