tai/postgres.md at bbc75b1559c1df57e8bce0116ea227ccc7246399

zphinx 57f4c0efaa

feat: complete RAG runbook workflow and release docs

2026-05-06 04:48:41 +02:00

service	symptoms	tags
postgres	connection refused port 5432, FATAL password authentication failed, replication lag, disk full, out of shared memory, too many connections, relation does not exist, could not connect to the primary	postgres, postgresql, database, replication, pg, psql, disk, connections

Symptoms

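could not connect to server: Connection refused - postgres not running or not listening on port 5432
FATAL: password authentication failed for user "<user>" - wrong credentials or pg_hba mismatch
FATAL: sorry, too many clients already - max_connections reached (often an exhausted or missing connection pool)
FATAL: could not map anonymous shared memory - shared_buffers too high for the system; the server fails at startup
ERROR: could not resize shared memory segment - /dev/shm too small for dynamic shared memory (common in containers)
ERROR: out of shared memory - usually the lock table is exhausted; check the accompanying hint about max_locks_per_transaction
PANIC: could not write to file "pg_wal/..." - disk full on WAL directory
Replication lag growing - standby falling behind primary
FATAL: could not connect to the primary server - standby cannot reach primary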
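
For quick triage, the mapping above can be sketched as a small shell function. The category labels it prints are hypothetical names made up for this sketch, not PostgreSQL terminology, and the patterns only cover the messages listed here (plus "sorry, too many clients already", the wording the server normally uses when max_connections is reached).

```shell
# Map a PostgreSQL log line to a short hint about where to look first.
# The labels are illustrative only.
classify_pg_error() {
  case "$1" in
    *"Connection refused"*)               echo "service-down-or-wrong-port" ;;
    *"password authentication failed"*)   echo "credentials-or-pg_hba" ;;
    *"too many clients"*|*"too many connections"*)
                                          echo "connection-limit" ;;
    *"shared memory"*)                    echo "shared-memory" ;;
    *"pg_wal"*)                           echo "wal-disk" ;;
    *"could not connect to the primary"*) echo "standby-cannot-reach-primary" ;;
    *)                                    echo "unknown" ;;
  esac
}
```

Example: `classify_pg_error "$(journalctl -u postgresql -n 1 -o cat)"` classifies the most recent log line.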

Diagnostics

Service status

systemctl status postgresql
systemctl status postgresql@<version>-main

PostgreSQL logs

journalctl -u postgresql -n 100
tail -n 100 /var/log/postgresql/postgresql-*.log

Is postgres listening?

ss -tlnp | grep 5432
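
A listening socket does not prove the server accepts connections. pg_isready (shipped with the PostgreSQL client tools) checks that, and its exit status can be decoded as in this sketch. The function name is hypothetical; host and port in the usage comment are assumptions.

```shell
# Translate a pg_isready exit status into a short description.
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, still starting up)" ;;
    2) echo "no response" ;;
    3) echo "no attempt made (bad parameters)" ;;
    *) echo "unexpected status $1" ;;
  esac
}

# Usage (host and port are assumptions):
#   pg_isready -h 127.0.0.1 -p 5432 -q; describe_pg_isready $?
```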

Disk space (WAL and data directory are the critical paths)

df -h
du -sh /var/lib/postgresql/
du -sh /var/lib/postgresql/*/main/pg_wal/

A full disk on the pg_wal partition causes a PANIC and hard crash.
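
To catch this before the PANIC, a threshold check on the WAL volume might look like the following. The 85% default and both function names are assumptions for this sketch; adjust the path to the actual pg_wal location.

```shell
# Extract the usage percentage (as a bare integer) from `df -P` output on stdin.
df_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn if the filesystem holding path $1 is at or above $2 percent (default 85).
check_disk() {
  path=$1
  limit=${2:-85}
  pct=$(df -P "$path" | df_use_pct)
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN: $path at ${pct}% (limit ${limit}%)"
  else
    echo "OK: $path at ${pct}%"
  fi
}

# Usage (path is an assumption):
#   check_disk /var/lib/postgresql 85
```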

Connection count

SELECT count(*), state FROM pg_stat_activity GROUP BY state;
SELECT setting FROM pg_settings WHERE name = 'max_connections';
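
The two results are easier to act on as a single percentage. A minimal sketch, assuming psql can connect without a password prompt and PostgreSQL 10 or later (for the backend_type column); the function name is hypothetical.

```shell
# Integer percentage of max_connections currently in use.
conn_usage_pct() {
  used=$1
  max=$2
  echo $(( used * 100 / max ))
}

# Usage (connection settings are assumptions; counts client backends only):
#   used=$(psql -XAtc "SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend'")
#   max=$(psql -XAtc "SHOW max_connections")
#   echo "$(conn_usage_pct "$used" "$max")% of max_connections in use"
```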

Replication lag (run on primary)

SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
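
When the primary is unreachable from the standby side, or LSNs were captured separately on each host (for example pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on the standby), the byte distance can be computed outside the database. An LSN is two hexadecimal numbers: the high and low 32 bits of a 64-bit WAL position. The function names below are hypothetical; this is a sketch of the arithmetic, and inside the server pg_wal_lsn_diff() does the same job.

```shell
# Convert a pg_lsn string such as "16/B374D848" to an absolute byte position.
lsn_to_bytes() {
  hi=${1%/*}   # part before the slash: high 32 bits (hex)
  lo=${1#*/}   # part after the slash: low 32 bits (hex)
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# Byte distance between two LSNs (first minus second).
lsn_diff_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

Example: `lsn_diff_bytes 1/10 0/FFFFFFF0` prints 32.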

pg_hba.conf — authentication rules

cat /etc/postgresql/*/main/pg_hba.conf

Entries are matched top-to-bottom and the first matching entry wins. A reject rule, or no matching entry at all for the client IP, causes an authentication failure even with correct credentials.
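
Because order matters, it helps to view only the active rules with their line numbers. A minimal sketch; the function name is hypothetical.

```shell
# Print the non-comment, non-blank lines of a pg_hba.conf with line numbers,
# so the match order is easy to read.
hba_active_rules() {
  grep -nEv '^[[:space:]]*(#|$)' "$1"
}

# Usage (path follows the Debian/Ubuntu layout used above):
#   hba_active_rules /etc/postgresql/*/main/pg_hba.conf
```

On PostgreSQL 10 and later, the pg_hba_file_rules view should give a similar picture from inside the server, including parse errors in the file.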

Shared memory / kernel settings

cat /proc/sys/kernel/shmmax
grep shared_buffers /etc/postgresql/*/main/postgresql.conf

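The PostgreSQL documentation suggests about 25% of RAM as a starting point for shared_buffers on a dedicated server, and notes that more than 40% rarely helps. kernel.shmmax only matters on PostgreSQL 9.2 and earlier, or when shared_memory_type = sysv; newer versions allocate the main shared memory area with mmap.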
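
To sanity-check a configured value against the machine, a starting figure can be derived from /proc/meminfo. This sketch assumes a Linux host; the 25% default follows the PostgreSQL documentation's suggested starting point for a dedicated server, and the function names are hypothetical.

```shell
# Read MemTotal (in kB) from /proc/meminfo-formatted text on stdin.
mem_total_kb() {
  awk '$1 == "MemTotal:" { print $2 }'
}

# Suggest shared_buffers in MB as a percentage of RAM (default 25).
suggest_shared_buffers_mb() {
  total_kb=$1
  pct=${2:-25}
  echo $(( total_kb * pct / 100 / 1024 ))
}

# Usage:
#   suggest_shared_buffers_mb "$(mem_total_kb < /proc/meminfo)"
```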

Remediation

Postgres not running:

systemctl start postgresql

Check logs immediately after start for the failure reason.

Authentication failure (pg_hba mismatch): Add or update the correct entry in pg_hba.conf, then reload:

systemctl reload postgresql

Too many connections: raise the limit in postgresql.conf (requires a restart):

max_connections = 200

Or deploy a connection pooler (pgbouncer).

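Disk full on WAL: Free space on the volume first by removing old base backups, old server logs, or other non-database files. Then find out why WAL is accumulating: a failing archive_command and an inactive replication slot both prevent PostgreSQL from recycling segments. Do NOT delete files under /var/lib/postgresql/*/main/pg_wal/ by hand; once archiving or the slot is fixed, PostgreSQL removes old segments itself at the next checkpoint. pg_archivecleanup is for pruning a separate WAL archive directory, not pg_wal.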

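Replication lag (standby too far behind): Check network bandwidth between primary and standby, and disk I/O on the standby. If the primary's log shows "terminating walsender process due to replication timeout", the connection is being dropped before the standby reports back: temporarily increase wal_sender_timeout on the primary, and keep wal_receiver_status_interval on the standby well below it.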
