feat: complete RAG runbook workflow and release docs
Some checks failed
CI / test (push) Failing after 15s
Some checks failed
CI / test (push) Failing after 15s
This commit is contained in:
107
runbooks/postgres.md
Normal file
107
runbooks/postgres.md
Normal file
@@ -0,0 +1,107 @@
|
||||
---
|
||||
service: postgres
|
||||
symptoms: connection refused port 5432, FATAL password authentication failed, replication lag, disk full, out of shared memory, too many connections, relation does not exist, could not connect to the primary
|
||||
tags: postgres, postgresql, database, replication, pg, psql, disk, connections
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- `could not connect to server: Connection refused` — postgres not running or not on port 5432
|
||||
- `FATAL: password authentication failed for user "<user>"` — wrong credentials or pg_hba mismatch
|
||||
- `FATAL: too many connections` — connection pool exhausted
|
||||
- `ERROR: could not resize shared memory segment` / `out of shared memory` — shared_buffers too high for system
|
||||
- `PANIC: could not write to file "pg_wal/..."` — disk full on WAL directory
|
||||
- Replication lag growing — standby falling behind primary
|
||||
- `FATAL: could not connect to the primary server` — standby cannot reach primary
|
||||
|
||||
## Diagnostics
|
||||
|
||||
### Service status
|
||||
|
||||
```
|
||||
systemctl status postgresql
|
||||
systemctl status postgresql@<version>-main
|
||||
```
|
||||
|
||||
### PostgreSQL logs
|
||||
|
||||
```
|
||||
journalctl -u postgresql -n 100
|
||||
tail -n 100 /var/log/postgresql/postgresql-*.log
|
||||
```
|
||||
|
||||
### Is postgres listening?
|
||||
|
||||
```
|
||||
ss -tlnp | grep 5432
|
||||
```
|
||||
|
||||
### Disk space (WAL and data directory are the critical paths)
|
||||
|
||||
```
|
||||
df -h
|
||||
du -sh /var/lib/postgresql/
|
||||
du -sh /var/lib/postgresql/*/main/pg_wal/
|
||||
```
|
||||
|
||||
A full disk on the pg_wal partition causes a PANIC and hard crash.
|
||||
|
||||
### Connection count
|
||||
|
||||
```sql
|
||||
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
|
||||
SELECT setting FROM pg_settings WHERE name = 'max_connections';
|
||||
```
|
||||
|
||||
### Replication lag (run on primary)
|
||||
|
||||
```sql
|
||||
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
|
||||
(sent_lsn - replay_lsn) AS lag_bytes
|
||||
FROM pg_stat_replication;
|
||||
```
|
||||
|
||||
### pg_hba.conf — authentication rules
|
||||
|
||||
```
|
||||
cat /etc/postgresql/*/main/pg_hba.conf
|
||||
```
|
||||
|
||||
Entries are matched top-to-bottom. `reject` or missing entry for the client IP causes auth failure even with correct credentials.
|
||||
|
||||
### Shared memory / kernel settings
|
||||
|
||||
```
|
||||
cat /proc/sys/kernel/shmmax
|
||||
cat /etc/postgresql/*/main/postgresql.conf | grep shared_buffers
|
||||
```
|
||||
|
||||
`shared_buffers` must not exceed ~40% of RAM; kernel `shmmax` must accommodate it.
|
||||
|
||||
## Remediation
|
||||
|
||||
**Postgres not running:**
|
||||
```
|
||||
systemctl start postgresql
|
||||
```
|
||||
Check logs immediately after start for the failure reason.
|
||||
|
||||
**Authentication failure (pg_hba mismatch):**
|
||||
Add or update the correct entry in `pg_hba.conf`, then reload:
|
||||
```
|
||||
systemctl reload postgresql
|
||||
```
|
||||
|
||||
**Too many connections — increase limit (requires restart):**
|
||||
In `postgresql.conf`:
|
||||
```
|
||||
max_connections = 200
|
||||
```
|
||||
Or deploy a connection pooler (`pgbouncer`).
|
||||
|
||||
**Disk full on WAL:**
|
||||
Identify and remove old base backups or archived WAL segments under `/var/lib/postgresql/*/main/pg_wal/`.
|
||||
Do NOT delete pg_wal files directly — use `pg_archivecleanup` or let archiving catch up.
|
||||
|
||||
**Replication lag — standby too far behind:**
|
||||
Check network bandwidth and I/O on standby. If `wal_receiver_status_interval` lag is large, increase `wal_sender_timeout` temporarily.
|
||||
Reference in New Issue
Block a user