108 lines
3.0 KiB
Markdown
108 lines
3.0 KiB
Markdown
---
|
|
service: postgres
|
|
symptoms: connection refused port 5432, FATAL password authentication failed, replication lag, disk full, out of shared memory, too many connections, relation does not exist, could not connect to the primary
|
|
tags: postgres, postgresql, database, replication, pg, psql, disk, connections
|
|
---
|
|
|
|
## Symptoms
|
|
|
|
- `could not connect to server: Connection refused` — postgres not running or not on port 5432
|
|
- `FATAL: password authentication failed for user "<user>"` — wrong credentials or pg_hba mismatch
|
|
- `FATAL: too many connections` — connection pool exhausted
|
|
- `ERROR: could not resize shared memory segment` / `out of shared memory` — shared_buffers too high for system
|
|
- `PANIC: could not write to file "pg_wal/..."` — disk full on WAL directory
|
|
- Replication lag growing — standby falling behind primary
|
|
- `FATAL: could not connect to the primary server` — standby cannot reach primary
|
|
|
|
## Diagnostics
|
|
|
|
### Service status
|
|
|
|
```
|
|
systemctl status postgresql
|
|
systemctl status postgresql@<version>-main
|
|
```
|
|
|
|
### PostgreSQL logs
|
|
|
|
```
|
|
journalctl -u postgresql -n 100
|
|
tail -n 100 /var/log/postgresql/postgresql-*.log
|
|
```
|
|
|
|
### Is postgres listening?
|
|
|
|
```
|
|
ss -tlnp | grep 5432
|
|
```
|
|
|
|
### Disk space (WAL and data directory are the critical paths)
|
|
|
|
```
|
|
df -h
|
|
du -sh /var/lib/postgresql/
|
|
du -sh /var/lib/postgresql/*/main/pg_wal/
|
|
```
|
|
|
|
A full disk on the pg_wal partition causes a PANIC and hard crash.
|
|
|
|
### Connection count
|
|
|
|
```sql
|
|
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
|
|
SELECT setting FROM pg_settings WHERE name = 'max_connections';
|
|
```
|
|
|
|
### Replication lag (run on primary)
|
|
|
|
```sql
|
|
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
|
|
(sent_lsn - replay_lsn) AS lag_bytes
|
|
FROM pg_stat_replication;
|
|
```
|
|
|
|
### pg_hba.conf — authentication rules
|
|
|
|
```
|
|
cat /etc/postgresql/*/main/pg_hba.conf
|
|
```
|
|
|
|
Entries are matched top-to-bottom. `reject` or missing entry for the client IP causes auth failure even with correct credentials.
|
|
|
|
### Shared memory / kernel settings
|
|
|
|
```
|
|
cat /proc/sys/kernel/shmmax
|
|
cat /etc/postgresql/*/main/postgresql.conf | grep shared_buffers
|
|
```
|
|
|
|
`shared_buffers` must not exceed ~40% of RAM; kernel `shmmax` must accommodate it.
|
|
|
|
## Remediation
|
|
|
|
**Postgres not running:**
|
|
```
|
|
systemctl start postgresql
|
|
```
|
|
Check logs immediately after start for the failure reason.
|
|
|
|
**Authentication failure (pg_hba mismatch):**
|
|
Add or update the correct entry in `pg_hba.conf`, then reload:
|
|
```
|
|
systemctl reload postgresql
|
|
```
|
|
|
|
**Too many connections — increase limit (requires restart):**
|
|
In `postgresql.conf`:
|
|
```
|
|
max_connections = 200
|
|
```
|
|
Or deploy a connection pooler (`pgbouncer`).
|
|
|
|
**Disk full on WAL:**
|
|
Identify and remove old base backups or archived WAL segments under `/var/lib/postgresql/*/main/pg_wal/`.
|
|
Do NOT delete pg_wal files directly — use `pg_archivecleanup` or let archiving catch up.
|
|
|
|
**Replication lag — standby too far behind:**
|
|
Check network bandwidth and I/O on standby. If `wal_receiver_status_interval` lag is large, increase `wal_sender_timeout` temporarily.
|