tai/runbooks/docker.md
zphinx 57f4c0efaa
feat: complete RAG runbook workflow and release docs
2026-05-06 04:48:41 +02:00

service: docker
symptoms: cannot connect to docker daemon, docker daemon failed to start, docker socket permission denied, containers cannot resolve dns, docker network broken, daemon.json conflict, docker oom, unable to remove filesystem
tags: docker, dockerd, containerd, container, daemon, daemon.json, cgroup, dns, docker0, socket, compose

Symptoms

  • Cannot connect to the Docker daemon. Is the docker daemon running on this host?
  • permission denied on /var/run/docker.sock
  • dockerd fails to start after a daemon.json change
  • Containers cannot resolve DNS or pull images
  • Docker bridge/network disappears or container networking breaks after boot
  • Container or daemon is killed by the kernel OOM killer
  • Error: Unable to remove filesystem when removing a container

Diagnostics

Check daemon health and client target

docker info
systemctl is-active docker
systemctl status docker
pgrep -a dockerd
env | grep DOCKER_HOST

If DOCKER_HOST is set incorrectly, the CLI may be talking to the wrong daemon.
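The DOCKER_HOST check above can be wrapped in a small helper that states which endpoint the CLI will use. This is a minimal sketch: the function name docker_cli_target is hypothetical, and it assumes the CLI falls back to the local socket unix:///var/run/docker.sock when DOCKER_HOST is unset. It does not inspect Docker contexts, which can also redirect the client.

```shell
# Hypothetical helper: report which daemon endpoint the docker CLI would use.
# Assumption: the default local socket applies when DOCKER_HOST is unset or empty.
docker_cli_target() {
  if [ -n "${DOCKER_HOST:-}" ]; then
    printf 'DOCKER_HOST override: %s\n' "$DOCKER_HOST"
  else
    printf 'default: unix:///var/run/docker.sock\n'
  fi
}
```

Calling docker_cli_target before docker info makes it obvious when a leftover remote endpoint is the reason the local daemon appears unreachable.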

Check daemon logs and startup failures

journalctl -u docker -n 200
journalctl -u containerd -n 100
cat /etc/docker/daemon.json
systemctl cat docker

Look for conflicts between daemon.json keys and systemd startup flags, especially duplicate hosts settings.
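The duplicate hosts check can be done mechanically. The sketch below is one possible approach, not a complete validator: the function name has_hosts_conflict is hypothetical, it only looks at the hosts key, and it treats the last ExecStart= line in the saved systemctl cat docker output as the effective command.

```shell
# Hypothetical helper: flag the "hosts" conflict between daemon.json and systemd.
# Args: $1 = path to daemon.json, $2 = file holding `systemctl cat docker` output.
# Returns 0 when hosts is set in daemon.json AND -H/--host is on the effective ExecStart.
has_hosts_conflict() {
  json_file="$1"
  unit_file="$2"
  # Is a "hosts" key present in daemon.json?
  grep -Eq '"hosts"[[:space:]]*:' "$json_file" || return 1
  # Assumption: the last ExecStart= line wins (drop-ins reset earlier ones).
  effective="$(grep '^ExecStart=' "$unit_file" | tail -n 1)"
  # Does the effective command pass -H or --host?
  printf '%s\n' "$effective" | grep -Eq '[[:space:]](-H|--host)([[:space:]=]|$)'
}
```

A possible invocation: save the unit with systemctl cat docker > /tmp/docker.unit, then run has_hosts_conflict /etc/docker/daemon.json /tmp/docker.unit && echo "hosts set in both places".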

Check socket permissions and group access

ls -la /var/run/docker.sock
id
getent group docker
ls -la ~/.docker/

If the user was added to the docker group recently, a new login shell may be required.
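Whether a new login is still needed can be told by comparing the groups active in the current shell with the groups recorded for the user. A minimal sketch, with hypothetical function names, assuming the group is literally named docker:

```shell
# Hypothetical helper: succeed if "docker" appears in a space-separated list of group names.
in_docker_group() {
  case " $1 " in
    *" docker "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: succeed when the user is in the docker group according to
# the group database but the current session has not picked it up yet.
docker_group_pending_login() {
  session_groups="$(id -nG)"              # groups of the running shell
  db_groups="$(id -nG "$(id -un)")"       # groups looked up for the user name
  in_docker_group "$db_groups" && ! in_docker_group "$session_groups"
}
```

If docker_group_pending_login succeeds, the membership exists but the session predates it, so logging in again (or newgrp docker) is the missing step.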

Check kernel, cgroups, and memory pressure

uname -r
free -h
sudo dmesg | grep -i -E 'docker|cgroup|oom|killed process'

Low memory, missing kernel features, or cgroup issues can stop containers or the daemon.
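To see at a glance which processes the OOM killer chose, the kernel log can be reduced to victim names. This is a sketch under an assumption: it relies on the kernel's usual "Killed process PID (name)" wording, which can differ between kernel versions. The function name oom_victims is hypothetical.

```shell
# Hypothetical helper: read kernel log lines on stdin and print "name (pid N)"
# for each OOM kill. Assumes the "Killed process PID (name)" message format.
oom_victims() {
  sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*/\2 (pid \1)/p'
}
```

For example, sudo dmesg | oom_victims shows whether dockerd itself or a container process was killed.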

Check Docker networking and DNS

docker network ls
ip addr show docker0
sysctl net.ipv4.ip_forward
cat /etc/resolv.conf
pgrep -a dnsmasq

Loopback DNS resolvers in /etc/resolv.conf often break container DNS unless Docker is given explicit nameservers.
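The loopback resolver check can be scripted. A minimal sketch: the function name loopback_nameservers is hypothetical, and it treats any 127.x.x.x address or ::1 as loopback.

```shell
# Hypothetical helper: print loopback nameservers found in a resolv.conf-style file.
# Arg: $1 = file to scan (defaults to /etc/resolv.conf).
loopback_nameservers() {
  awk '$1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2 }' \
    "${1:-/etc/resolv.conf}"
}
```

Any output from loopback_nameservers means the host relies on a local resolver that containers cannot reach at that address.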

Check storage and stuck mounts

df -h /var/lib/docker
docker system df
lsof /var/lib/docker

Bind-mounting /var/lib/docker into other containers can keep container filesystems busy and block removal.

Remediation

Daemon not running or client aimed at the wrong host: Unset an incorrect DOCKER_HOST (and remove it from any shell profile that exports it), then restart the daemon:

unset DOCKER_HOST
sudo systemctl restart docker

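daemon.json conflicts with systemd flags: Keep each setting in one place only. Either remove the duplicated key (typically hosts) from /etc/docker/daemon.json, or create a systemd drop-in override so dockerd is started without the conflicting flag. After changing unit files, run systemctl daemon-reload and restart Docker.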
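One way to apply the override option is a drop-in that clears the packaged ExecStart and starts dockerd with no flags, leaving daemon.json as the only source of options. The sketch below makes several assumptions: the function name write_dockerd_override and the file name override.conf are hypothetical, and /usr/bin/dockerd is assumed to be the daemon path on the host.

```shell
# Hypothetical helper: write a systemd drop-in that starts dockerd without flags.
# Arg: $1 = drop-in directory (defaults to the docker.service drop-in directory).
# Assumptions: file name override.conf, daemon binary at /usr/bin/dockerd.
write_dockerd_override() {
  dropin_dir="${1:-/etc/systemd/system/docker.service.d}"
  mkdir -p "$dropin_dir" || return 1
  # The empty ExecStart= line resets the command inherited from the packaged unit.
  cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd
EOF
}
```

Run it as root, then apply the change with sudo systemctl daemon-reload followed by sudo systemctl restart docker.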

Permission denied on Docker socket: Add the user to the docker group, then log out and back in, or run newgrp docker to pick up the group in the current shell:

sudo usermod -aG docker "$USER"
newgrp docker

If ~/.docker/ was created by sudo, fix ownership:

sudo chown "$USER":"$USER" "$HOME/.docker" -R
sudo chmod g+rwx "$HOME/.docker" -R

Container DNS broken: Configure explicit DNS servers in /etc/docker/daemon.json, then restart Docker.
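The DNS setting lives under the dns key of daemon.json. The sketch below only prints a fragment for review; the function name dns_daemon_json is hypothetical, and the addresses in the usage note are placeholders to be replaced with resolvers reachable from the containers.

```shell
# Hypothetical helper: print a daemon.json fragment with a "dns" list
# built from the server addresses given as arguments.
dns_daemon_json() {
  list=""
  for server in "$@"; do
    # Append each address as a quoted JSON string, comma-separated.
    list="${list:+$list, }\"$server\""
  done
  printf '{\n  "dns": [%s]\n}\n' "$list"
}
```

For example, dns_daemon_json 10.0.0.2 8.8.8.8 prints a fragment listing those two servers. If /etc/docker/daemon.json already contains other keys, merge the dns entry into it by hand rather than overwriting the file.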

Docker networking disappears after boot: Configure the host network manager (for example NetworkManager or systemd-networkd) to leave docker0 and other Docker-created interfaces unmanaged, and confirm that net.ipv4.ip_forward=1.

OOM kills: Treat this as host memory pressure first; reduce workload, add memory, or enforce container memory limits.

Unable to remove filesystem: Find the process holding the path open with lsof, then stop that process or the container bind-mounting /var/lib/docker.