feat: complete RAG runbook workflow and release docs

2026-05-06 04:48:41 +02:00
parent 450de24d28
commit 57f4c0efaa
26 changed files with 2510 additions and 137 deletions

runbooks/postgres.md Normal file

@@ -0,0 +1,107 @@
---
service: postgres
symptoms: connection refused port 5432, FATAL password authentication failed, replication lag, disk full, out of shared memory, too many connections, relation does not exist, could not connect to the primary
tags: postgres, postgresql, database, replication, pg, psql, disk, connections
---
## Symptoms
- `could not connect to server: Connection refused` — postgres not running or not on port 5432
- `FATAL: password authentication failed for user "<user>"` — wrong credentials or pg_hba mismatch
- `FATAL: sorry, too many clients already` / `FATAL: too many connections for role "<role>"` — `max_connections` or a per-role connection limit reached
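- `ERROR: could not resize shared memory segment` / `out of shared memory` — `/dev/shm` too small for parallel queries, or lock table exhausted (`max_locks_per_transaction`); an oversized `shared_buffers` usually fails at startup instead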
- `PANIC: could not write to file "pg_wal/..."` — disk full on WAL directory
- Replication lag growing — standby falling behind primary
- `FATAL: could not connect to the primary server` — standby cannot reach primary
## Diagnostics
### Service status
```
systemctl status postgresql
systemctl status postgresql@<version>-main
```
### PostgreSQL logs
```
journalctl -u postgresql -n 100
tail -n 100 /var/log/postgresql/postgresql-*.log
```
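Most of the symptoms listed earlier appear as `ERROR`, `FATAL`, or `PANIC` lines, so filtering the log by severity narrows things quickly. A minimal sketch: the function name `pg_severe_lines` is hypothetical, and it assumes the severity is printed as `LEVEL:` in each log line.

```shell
# Keep only log lines carrying an ERROR, FATAL, or PANIC severity tag.
# Reads log text on stdin; assumes the severity appears as "LEVEL:".
pg_severe_lines() {
  grep -E '(ERROR|FATAL|PANIC):'
}

# Example (not run here):
#   journalctl -u postgresql -n 500 --no-pager | pg_severe_lines
```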
### Is postgres listening?
```
ss -tlnp | grep 5432
```
### Disk space (WAL and data directory are the critical paths)
```
df -h
du -sh /var/lib/postgresql/
du -sh /var/lib/postgresql/*/main/pg_wal/
```
A full disk on the pg_wal partition causes a PANIC and hard crash.
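Because a full WAL partition is fatal, it can help to flag filesystems that are close to full before they get there. The sketch below is one possible check: `mounts_over_threshold` is a hypothetical helper, the 90% default is an assumed value, and it expects `df -P` output with mount points that contain no spaces.

```shell
# Print "mountpoint usage%" for every filesystem at or above a threshold.
# Reads `df -P` output on stdin; the first argument is the threshold in
# percent (assumed default: 90).
mounts_over_threshold() {
  threshold="${1:-90}"
  awk -v t="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)            # "95%" -> "95"
      if (use + 0 >= t) print $6 " " $5
    }'
}

# Example (not run here):
#   df -P | mounts_over_threshold 85
```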
### Connection count
```sql
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
SELECT setting FROM pg_settings WHERE name = 'max_connections';
```
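The two queries above are easier to act on when combined into a single utilisation figure. A small sketch, assuming `psql` can connect without prompting; `conn_usage_pct` is a hypothetical helper name.

```shell
# Print current connections as an integer percentage of max_connections.
# Arguments: <current connections> <max_connections>
conn_usage_pct() {
  current="$1"
  max="$2"
  if [ "$max" -le 0 ]; then
    echo "max_connections must be positive" >&2
    return 1
  fi
  echo $(( current * 100 / max ))
}

# Example (not run here); -A and -t make psql print the bare value:
#   current=$(psql -Atc "SELECT count(*) FROM pg_stat_activity")
#   max=$(psql -Atc "SHOW max_connections")
#   conn_usage_pct "$current" "$max"
```

On PostgreSQL 10 and later, `pg_stat_activity` also lists background processes, so the count slightly overstates client usage unless it is filtered on `backend_type`.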
### Replication lag (run on primary)
```sql
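-- lag_bytes: WAL generated on the primary but not yet replayed on the standby
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;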
```
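When the primary is unreachable from where you are working, lag can also be estimated from two LSN values captured separately, for example `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the standby. An LSN is printed as two hexadecimal numbers, `high/low`, forming a 64-bit byte position. The helper names below are hypothetical; this is a sketch of the arithmetic only.

```shell
# Convert an LSN such as "0/3000060" to an absolute byte position.
# The part before the slash is the high 32 bits, the part after is the low 32.
lsn_to_bytes() {
  hi="${1%%/*}"
  lo="${1##*/}"
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# Print how many bytes the second LSN is behind the first.
lsn_lag_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Example:
#   lsn_lag_bytes 0/3000060 0/3000000
```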
### pg_hba.conf — authentication rules
```
cat /etc/postgresql/*/main/pg_hba.conf
```
Entries are matched top-to-bottom and the first matching entry wins. A `reject` entry, or no entry at all for the client IP, causes an authentication failure even with correct credentials.
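Since rule order decides the outcome, it helps to view only the active rules with their line numbers, leaving out comments and blank lines. A minimal sketch; `hba_active_rules` is a hypothetical helper name.

```shell
# List the non-comment, non-blank lines of a pg_hba.conf file, each
# prefixed with its line number, in the order PostgreSQL evaluates them.
hba_active_rules() {
  grep -nEv '^[[:space:]]*(#|$)' "$1"
}

# Example (not run here):
#   hba_active_rules /etc/postgresql/<version>/main/pg_hba.conf
```

On PostgreSQL 10 and later, the `pg_hba_file_rules` view shows the same rules as the server parses them, including any parse errors.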
### Shared memory / kernel settings
```
cat /proc/sys/kernel/shmmax
grep shared_buffers /etc/postgresql/*/main/postgresql.conf
```
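A common starting point for `shared_buffers` is about 25% of RAM, and values above ~40% rarely help. Kernel `shmmax` must accommodate it only on PostgreSQL 9.2 and earlier; newer versions allocate the main shared memory area with `mmap`.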
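To compare the configured value against the host's memory, the figures can be derived from `MemTotal`. The sketch below prints the ~40% ceiling mentioned above alongside a 25% starting point; the 25% figure is the commonly cited guideline and is an assumption here, as is the helper name `shared_buffers_hint_mb`.

```shell
# Print a suggested starting value (25% of RAM, assumed guideline) and a
# ceiling (40% of RAM) for shared_buffers, both in MB.
# Argument: total memory in kB, as reported by MemTotal in /proc/meminfo.
shared_buffers_hint_mb() {
  total_kb="$1"
  start_mb=$(( total_kb * 25 / 100 / 1024 ))
  ceiling_mb=$(( total_kb * 40 / 100 / 1024 ))
  echo "start=${start_mb}MB ceiling=${ceiling_mb}MB"
}

# Example (not run here):
#   shared_buffers_hint_mb "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
```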
## Remediation
**Postgres not running:**
```
systemctl start postgresql
```
Check logs immediately after start for the failure reason.
**Authentication failure (pg_hba mismatch):**
Add or update the correct entry in `pg_hba.conf`, then reload:
```
systemctl reload postgresql
```
**Too many connections — increase limit (requires restart):**
In `postgresql.conf`:
```
max_connections = 200
```
Or deploy a connection pooler (`pgbouncer`).
**Disk full on WAL:**
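Free space by removing old base backups or archived WAL segments from the backup and archive locations, not from `pg_wal` itself.
Do NOT delete files under `/var/lib/postgresql/*/main/pg_wal/` directly; use `pg_archivecleanup` on the archive directory, or let archiving catch up so PostgreSQL can recycle the segments itself.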
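`pg_archivecleanup` takes the archive directory and the oldest file to keep, and the backup history file (`*.backup`) of a base backup is a conventional choice for that second argument. The sketch below picks the most recent one; `latest_backup_marker` is a hypothetical helper and `/path/to/archive` is a placeholder. It assumes a WAL archive directory that is separate from `pg_wal`, and that only the newest base backup needs to stay restorable.

```shell
# Print the name of the newest backup history file in a WAL archive
# directory. File names are fixed-width uppercase hex, so a byte-wise
# sort orders them by WAL position within a timeline.
latest_backup_marker() {
  ls -1 "$1" \
    | grep -E '^[0-9A-F]{24}\.[0-9A-F]{8}\.backup$' \
    | LC_ALL=C sort \
    | tail -n 1
}

# Example (not run here); -n is a dry run that only lists what would be removed:
#   pg_archivecleanup -n /path/to/archive "$(latest_backup_marker /path/to/archive)"
```

Run with `-n` first and review the list before repeating the command without it.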
**Replication lag — standby too far behind:**
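Check network bandwidth and disk I/O on the standby. If the primary keeps dropping the replication connection before the standby reports progress, temporarily increase `wal_sender_timeout` on the primary; `wal_receiver_status_interval` on the standby controls how often that progress is reported.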