tai/runbooks/postgres.md
zphinx 57f4c0efaa
feat: complete RAG runbook workflow and release docs
2026-05-06 04:48:41 +02:00

service: postgres
symptoms: connection refused port 5432, FATAL password authentication failed, replication lag, disk full, out of shared memory, too many connections, relation does not exist, could not connect to the primary
tags: postgres, postgresql, database, replication, pg, psql, disk, connections

Symptoms

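  • could not connect to server: Connection refused - postgres is not running, or is not listening on the expected address or port (default 5432)
  • FATAL: password authentication failed for user "<user>" - wrong credentials or pg_hba mismatch
  • FATAL: sorry, too many clients already (too many connections) - max_connections reached
  • ERROR: could not resize shared memory segment - dynamic shared memory is exhausted, typically because /dev/shm is too small (common in containers)
  • ERROR: out of shared memory - usually the lock table is full (the server hint points at max_locks_per_transaction), not shared_buffers
  • FATAL: could not create shared memory segment (at startup) - shared_buffers exceeds what the kernel allows
  • PANIC: could not write to file "pg_wal/..." - disk full on WAL directory
  • Replication lag growing - standby falling behind primary
  • FATAL: could not connect to the primary server - standby cannot reach primary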

Diagnostics

Service status

systemctl status postgresql
systemctl status postgresql@<version>-main

PostgreSQL logs

journalctl -u postgresql -n 100
tail -n 100 /var/log/postgresql/postgresql-*.log

Is postgres listening?

ss -tlnp | grep 5432
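
A plain grep for 5432 also matches ports such as 15432 and any PID that happens to contain those digits. As a minimal sketch (the function name is_listening is hypothetical and the sample line is illustrative, not captured output), the check can match the port exactly in the Local Address:Port column:

```shell
# Succeed if `ss -tln` output on stdin shows a listener on exactly the given port.
is_listening() {
    # Column 4 of `ss -tln` is Local Address:Port; anchor the port at the end.
    awk -v port="$1" '$1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
                      END { exit !found }'
}

# Illustrative sample line, not captured from a real host.
sample='LISTEN 0 244 127.0.0.1:5432 0.0.0.0:*'
if printf '%s\n' "$sample" | is_listening 5432; then
    echo "listening on 5432"
fi
```

On a live host the same function would be fed with: ss -tln | is_listening 5432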

Disk space (WAL and data directory are the critical paths)

df -h
du -sh /var/lib/postgresql/
du -sh /var/lib/postgresql/*/main/pg_wal/

A full disk on the pg_wal partition causes a PANIC and hard crash.
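
Because pg_wal may sit on its own partition, it helps to check the filesystem that actually holds the directory rather than scanning the whole df table. A rough sketch follows; the function names and the 80/90 thresholds are assumptions, not PostgreSQL settings:

```shell
# Print the usage percentage (0-100) of the filesystem that holds the given path.
fs_usage_pct() {
    # -P forces the POSIX one-line-per-filesystem format; field 5 is Capacity.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Classify a usage percentage; 80 and 90 are assumed thresholds.
disk_level() {
    if [ "$1" -ge 90 ]; then echo critical
    elif [ "$1" -ge 80 ]; then echo warning
    else echo ok
    fi
}

# "/" is a stand-in; point this at the real pg_wal directory on the host.
disk_level "$(fs_usage_pct /)"
```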

Connection count

SELECT count(*), state FROM pg_stat_activity GROUP BY state;
SELECT setting FROM pg_settings WHERE name = 'max_connections';
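
The two results above are most useful as a ratio. A small sketch of that arithmetic, with hypothetical function names and assumed 80/95 thresholds:

```shell
# Percentage of max_connections in use (integer, rounded down).
# $1 = current connection count, $2 = max_connections
conn_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Classify usage; the 80 and 95 thresholds are assumptions, tune as needed.
conn_usage_level() {
    pct=$(conn_usage_pct "$1" "$2")
    if [ "$pct" -ge 95 ]; then echo critical
    elif [ "$pct" -ge 80 ]; then echo warning
    else echo ok
    fi
}

# Illustrative numbers: 170 of 200 slots in use is 85%.
conn_usage_level 170 200   # prints "warning"
```

The inputs could come from psql -Atc with the two queries above. Note that superuser_reserved_connections slots are held back from ordinary users, so regular clients hit the limit slightly before 100%.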

Replication lag (run on primary)

SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS lag_bytes
FROM pg_stat_replication;
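
The LSN columns are 64-bit WAL positions printed as two hexadecimal halves (high/low), which is why subtracting them yields a byte count. When only LSN strings from logs or monitoring output are at hand, the same arithmetic can be sketched in shell (function names are hypothetical, values illustrative):

```shell
# Convert a PostgreSQL LSN ("hi/lo", both hexadecimal) to a byte position.
lsn_to_bytes() {
    hi=${1%%/*}   # part before the slash: high 32 bits
    lo=${1##*/}   # part after the slash: low 32 bits
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Lag in bytes between two LSNs, e.g. sent_lsn and replay_lsn.
lsn_lag_bytes() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Illustrative values only, not taken from a real server.
lsn_lag_bytes "0/3000060" "0/3000000"   # prints 96
```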

pg_hba.conf — authentication rules

cat /etc/postgresql/*/main/pg_hba.conf

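Entries are matched top to bottom, and the first entry that matches the connection type, database, user, and client address is the one used; later entries are not consulted. A reject entry, or no matching entry at all for the client IP, causes authentication to fail even with correct credentials.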
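
Since order decides which rule applies, it can help to view only the active rules with their original line numbers. A minimal sketch, assuming no "#" characters inside quoted values (the function name hba_rules and the sample rules are hypothetical):

```shell
# Print the non-comment, non-blank lines of a pg_hba.conf with line numbers,
# in the order PostgreSQL evaluates them.
hba_rules() {
    awk '{ sub(/#.*/, "") } NF { print NR ": " $0 }' "$1"
}

# Illustrative file: the reject on line 3 shadows the broader rule on line 4.
sample=$(mktemp)
printf '%s\n' \
    '# TYPE DATABASE USER ADDRESS METHOD' \
    '' \
    'host all all 10.0.0.5/32 reject' \
    'host all all 10.0.0.0/8 scram-sha-256' > "$sample"
hba_rules "$sample"
rm -f "$sample"
```

In this sample a client at 10.0.0.5 is rejected even though the broader 10.0.0.0/8 rule below would have accepted it.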

Shared memory / kernel settings

cat /proc/sys/kernel/shmmax
grep shared_buffers /etc/postgresql/*/main/postgresql.conf

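The PostgreSQL documentation suggests about 25% of RAM as a starting point for shared_buffers, and values above ~40% of RAM are unlikely to help. On PostgreSQL 9.3 and later the main shared memory area is allocated with mmap by default, so kernel shmmax rarely limits it; shmmax matters mainly on older versions or when shared_memory_type is set to sysv.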
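
As a worked example of the sizing arithmetic: the PostgreSQL documentation suggests roughly 25% of RAM as a starting value for shared_buffers, with settings beyond about 40% unlikely to work better. The helper names below are hypothetical and the input is an illustrative 16 GiB machine:

```shell
# Suggested starting shared_buffers in MiB (25% of RAM).
# $1 = total memory in KiB, as reported by MemTotal in /proc/meminfo
suggest_shared_buffers_mb() {
    echo $(( $1 / 1024 / 4 ))
}

# Rough upper bound in MiB (40% of RAM, rounded down).
shared_buffers_ceiling_mb() {
    echo $(( $1 / 1024 * 40 / 100 ))
}

# 16 GiB of RAM expressed in KiB (illustrative value).
suggest_shared_buffers_mb 16777216   # prints 4096
shared_buffers_ceiling_mb 16777216   # prints 6553
```

On a Linux host the input can be read with: awk '/^MemTotal:/ { print $2 }' /proc/meminfo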

Remediation

Postgres not running:

systemctl start postgresql

Check logs immediately after start for the failure reason.

Authentication failure (pg_hba mismatch): Add or update the correct entry in pg_hba.conf, then reload:

systemctl reload postgresql

Too many connections: raise the limit in postgresql.conf (requires a restart, a reload is not enough):

max_connections = 200

Each connection costs server memory, so for many short-lived clients a connection pooler such as PgBouncer is often the better fix.

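Disk full on WAL: Free space outside pg_wal first, for example by removing old base backups or already-archived WAL segments from the archive location (pg_archivecleanup is the tool for pruning a WAL archive). Do NOT delete files under /var/lib/postgresql/*/main/pg_wal/ by hand. PostgreSQL recycles them itself once archiving has caught up and no replication slot still needs them, so check for a failing archive_command and for inactive replication slots.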

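Replication lag (standby too far behind): Check network bandwidth between the nodes and I/O on the standby. If the primary's log shows the replication connection being terminated due to replication timeout on a slow link, temporarily increase wal_sender_timeout on the primary. Note that wal_receiver_status_interval only controls how often the standby reports its progress; it is not a measure of lag.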