sciagent/docs/deploy-production-docker.md
Thinh Lam 688fac73e9
sciagent code + Gitea Actions CI/CD
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:38:30 +07:00

# Production Docker deployment (`docker-compose.prod.yml`)
This guide walks through **common failures** when running the prod-style stack locally or on a VPS, in a fixed order: validate environment, reconcile Postgres credentials with the Docker volume, then confirm frontend wiring.
**Stack topology (frontend → backend → DB → MinIO):** [deploy-stack-overview.md](./deploy-stack-overview.md)
Related files: `.env.example` (copy to `.env`), `scripts/deploy-prod.sh`, `scripts/verify-prod-env.sh`.
---
## 1. `.env` in the repo root (cloud / VPS)
Docker Compose substitutes `${PUBLIC_HOST}`, `${POSTGRES_USER}`, etc. from a file named `.env` in the **same directory** as `docker-compose.prod.yml` (or from `--env-file` when you use the deploy script).
### It may already be there: plain `ls` hides it
Plain `ls` does **not** list dotfiles, so a file named `.env` stays hidden unless you pass `-a` or test for it directly:
```bash
ls -a # lists .env alongside . ..
test -f .env && echo ok # exits 0 if the file exists
```
### Create it when it is missing
From the repo root on the server:
```bash
cp .env.example .env
nano .env # or vim / your editor; set PUBLIC_HOST, secrets, Postgres identifiers (see section 3)
chmod 600 .env # optional: restrict reads to your user/root
```
`./scripts/deploy-prod.sh` refuses to run if `.env` is absent. If you start Compose by hand **without** a `.env` file, every `${POSTGRES_*}` reference expands to an empty string, so Postgres health checks and connections can fail. Always keep a populated `.env` next to the compose file.
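The same guard can be approximated by hand before a manual `docker compose` run. The sketch below is not taken from `deploy-prod.sh`; `require_env_file` and `env_value` are hypothetical helper names, and it assumes simple `KEY=value` lines without quoting or `export` prefixes.

```bash
# Hypothetical helpers (not part of the repo scripts): confirm .env exists and read one key.

# Succeeds only if the given file exists and is non-empty.
require_env_file() {
  [ -s "$1" ]
}

# Prints the value of KEY from a simple KEY=value file (last assignment wins).
env_value() {
  local file=$1 key=$2
  grep -E "^${key}=" "$file" | tail -n 1 | cut -d= -f2-
}
```

For example, `require_env_file .env && [ -n "$(env_value .env POSTGRES_USER)" ] || echo "fix .env first"` prints the warning when the file or the key is missing.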
---
## 2. Run validation before compose
Always fix script failures before restarting containers.
```bash
./scripts/verify-prod-env.sh
```
`verify-prod-env.sh` rejects:
- Empty `PUBLIC_HOST`, ports, MinIO or Postgres variables.
- `POSTGRES_USER` / `POSTGRES_DB` that are not plain SQL identifiers (letters, digits, and underscores only; no `!`, spaces, or non-ASCII characters).
- `POSTGRES_PASSWORD` containing `@`, `:`, `/`, or `%`, which breaks `INITIATIVE_DATABASE_URL` in Compose (assembled without URL-encoding).
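The identifier and password rules above can be expressed as two small predicates. This is a sketch of the described checks, not the contents of `verify-prod-env.sh`, which may implement them differently; the function names are hypothetical.

```bash
# Sketch of the rules listed above; function names are illustrative assumptions.

# True when the value is non-empty and contains only ASCII letters, digits, underscores.
is_plain_identifier() {
  local LC_ALL=C   # keep the A-Z / a-z ranges strictly ASCII
  case $1 in
    ''|*[!A-Za-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}

# True when the password is non-empty and has none of the characters that would
# need URL-encoding inside postgresql://user:password@host/db.
password_is_url_safe() {
  case $1 in
    ''|*@*|*:*|*/*|*%*) return 1 ;;
    *) return 0 ;;
  esac
}
```

A password such as `p@ss` fails because the unencoded `@` would be read as the separator between credentials and host when the URL is assembled.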
If `deploy-prod.sh` exits early, rerun `verify-prod-env.sh` and edit `.env` until it prints `OK`.
---
## 3. Postgres — `FATAL: role "<name>" does not exist`
### Why it happens
The official Postgres image **creates `POSTGRES_USER` and `POSTGRES_DB` only when the data directory is empty** (first start of the named volume). After that, changing `.env` does **not** rename or recreate roles inside the volume.
Typical triggers:
| Situation | Result |
|-----------|--------|
| Volume was initialized with `POSTGRES_USER=initiative`; `.env` now uses a different username | Existing DB has role `initiative`, not your new name. |
| Username with special characters (`user_pkhcn2025!`) | The role may never have been created cleanly. Use plain identifiers (see section 2). |
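The first-start-only behaviour can be pictured with a small check against a local directory. The sketch assumes that an initialized cluster is recognized by a non-empty `PG_VERSION` file in the data directory, which is how the official entrypoint script appears to decide; `would_initialize` is a hypothetical name and the function is illustrative only.

```bash
# Illustrative only: mimics the "initialize only when the data directory is empty" rule.
# Assumption: an initialized cluster is recognized by a non-empty PG_VERSION file.

would_initialize() {
  local pgdata=$1
  if [ -s "$pgdata/PG_VERSION" ]; then
    echo "skip: cluster exists, POSTGRES_USER/POSTGRES_DB from .env are ignored"
  else
    echo "init: role and database are created from POSTGRES_USER/POSTGRES_DB"
  fi
}
```

Once the volume reports `skip`, only the fixes below (aligning `.env`, re-initializing, or altering roles) change which logins work.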
### Fix (pick one track)
**A. Keep existing data — align `.env` with the roles that already exist**
1. Discover the logical volume name Compose uses:
```bash
docker compose --env-file .env -f docker-compose.prod.yml down
docker volume ls | grep initiative_pg_data
```
The name looks like `<project>_initiative_pg_data`: Compose prefixes the volume with the project name, which defaults to the name of the project directory.
2. Start only Postgres temporarily with `.env` that matches credentials you **know** worked on first bootstrap (often your dev values from `docker-compose.yml`: user `initiative`, DB `initiatives`):
```bash
docker compose --env-file .env -f docker-compose.prod.yml up -d postgres
```
3. List roles inside the cluster (substitute `-U`/`-d`/`PGPASSWORD` to match credentials that succeed):
```bash
docker compose --env-file .env -f docker-compose.prod.yml exec postgres \
psql -U initiative -d initiatives -c '\du'
```
Set `POSTGRES_USER` / `POSTGRES_DB` / `POSTGRES_PASSWORD` in `.env` to match an existing role and database. Do **not** change only the username without aligning to an existing login.
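For step 1, the volume name can be predicted before running `docker volume ls`. The sketch below approximates Compose's naming: the project name comes from `COMPOSE_PROJECT_NAME` if set, otherwise from the directory name, lowercased and stripped of characters outside `a-z0-9_-`. That normalization is an approximation, `expected_volume_name` is a hypothetical helper, and an explicit `name:` on the volume in the compose file would override the result.

```bash
# Approximates how Compose derives "<project>_<volume>"; not an official algorithm.
# $1 = project directory, $2 = volume key from the compose file.
expected_volume_name() {
  local project=${COMPOSE_PROJECT_NAME:-$(basename "$1")}
  project=$(printf '%s' "$project" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z0-9_-')
  printf '%s_%s\n' "$project" "$2"
}
```

Calling `expected_volume_name "$PWD" initiative_pg_data` from the repo root gives a name to compare against the `docker volume ls` output.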
**Password-only mismatch:** If the role and database names are already correct but someone changed `POSTGRES_PASSWORD` in `.env` after the volume was first created, run from the repo root (with `postgres` running):
```bash
./scripts/sync-postgres-app-password.sh
```
That executes `ALTER ROLE … PASSWORD` to match `.env` when `psql` inside the container can connect without the old password (typical with the official image's local socket rules). If it fails, use the `psql` steps above with credentials that still work, or re-init the volume (**B**). Optional: set `POSTGRES_SUPERUSER` in `.env` if you must connect as another superuser (e.g. `postgres`).
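If the statement has to be issued by hand, quoting matters as soon as the password contains a single quote. The helpers below are a sketch under that concern, not the contents of `sync-postgres-app-password.sh`; all three function names are hypothetical.

```bash
# Sketch: build a safely quoted ALTER ROLE statement. Names are illustrative assumptions.

# Escape a value for use inside a single-quoted SQL string literal.
sql_literal() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

# Escape a name for use as a double-quoted SQL identifier.
sql_ident() {
  printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# $1 = role name, $2 = new password.
alter_role_password_sql() {
  printf 'ALTER ROLE %s PASSWORD %s;\n' "$(sql_ident "$1")" "$(sql_literal "$2")"
}
```

The output can be piped into `psql` inside the `postgres` container, using whichever superuser login still works.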
**B. You can afford to lose Postgres data — re-init the volume**
1. Stop stack; remove volume (this **deletes** all DB data):
```bash
docker compose --env-file .env -f docker-compose.prod.yml down
docker volume rm <project>_initiative_pg_data # exact name from `docker volume ls`
```
2. Ensure `./scripts/verify-prod-env.sh` passes.
3. Bring stack up fresh so scripts in `docker-entrypoint-initdb.d/` run:
```bash
./scripts/deploy-prod.sh
```
**C. Rename or add roles without wiping data (advanced)**
Connect as your **currently working** database superuser, then:
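- `ALTER ROLE initiative RENAME TO new_name;` (if the role's password is stored as an MD5 hash, the rename clears it, so set the password again afterwards).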
- Create a parallel role/password with matching grants if your app expects a dedicated user only.
Operational details vary with your retention and backup policy; follow your DBA playbook if you have one.
---
## 4. Frontend (`fe0`) — port mismatch (host cannot reach UI)
Compose maps **`${FE_PORT}:8080`**: traffic to the container must hit **port 8080** inside `fe0`.
Vite defaults to **5173** if nothing overrides it. When the dev server listens there, the mapped port forwards to nothing or to the wrong listener.
### Required state
[Vite](../fe0/vite.config.ts) must set:
- `server.port: 8080`
- `server.host: '0.0.0.0'`
Compose/Dockerfile also pass `npm run dev -- --host 0.0.0.0 --port 8080`, so bind-mounted trees without an updated `vite.config.ts` still match `${FE_PORT}:8080`.
If logs show:
```text
Local: http://localhost:5173/
```
fix `vite.config.ts` so the dev server uses **8080**, then recreate or restart `fe0`.
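The port check can be scripted instead of read off the logs by eye. The sketch assumes the log lines are plain text (no ANSI colour codes around the port) and that the banner format matches the `Local:` line shown above; `vite_local_port` is a hypothetical helper.

```bash
# Extract the port from the first "Local:" banner line read on stdin.
# Assumption: uncoloured output such as "Local: http://localhost:5173/".
vite_local_port() {
  sed -n 's|.*Local:[[:space:]]*https\{0,1\}://[^:/]*:\([0-9][0-9]*\).*|\1|p' | head -n 1
}
```

Piping `docker compose --env-file .env -f docker-compose.prod.yml logs fe0` into `vite_local_port` should print `8080` once the configuration is correct.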
After that, browsers use:
```text
http://${PUBLIC_HOST}:${FE_PORT}
```
---
## 5. Different IPs in logs (`fe0` vs MinIO)
This is usually **correct**, not contradictory:
| Log line | Meaning |
|----------|---------|
| `fe0` “Network”: `http://10.5.0.x:…` | **Static container IP** on Compose bridge `profyt-net` (`docker-compose.prod.yml` `ipv4_address`). |
| MinIO banner: `http://<PUBLIC_HOST>:19000` | **Public/browser URL**, from `MINIO_SERVER_URL` using `PUBLIC_HOST` and `MINIO_API_PORT`. |
`be0` still talks to MinIO as `http://minio:9000` internally; browsers use `${PUBLIC_HOST}` unless you override presign with **`S3_PUBLIC_ENDPOINT_URL`**.
When the UI is **`https://`**, embedding plain **`http://…:${MINIO_API_PORT}`** presigned URLs is blocked (**mixed content**). In-app PDF preview can use **`GET …/evidence/content`**; for direct presigned links in the browser, terminate TLS on the MinIO API host and set **`S3_PUBLIC_ENDPOINT_URL`** / **`MINIO_SERVER_URL`** to that **`https://…`** base — see **[minio-behind-https.md](./minio-behind-https.md)** and **`deploy/nginx/minio-s3-proxy.conf.example`**.
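The mixed-content condition reduces to comparing URL schemes. The sketch below flags the blocked combination (an `https://` page embedding an `http://` resource); the function names and the example hostnames are hypothetical.

```bash
# Print the scheme of a URL ("https" for https://host/path).
url_scheme() {
  printf '%s\n' "${1%%://*}"
}

# True when an https page ($1) would embed a plain-http resource ($2).
is_mixed_content() {
  [ "$(url_scheme "$1")" = "https" ] && [ "$(url_scheme "$2")" = "http" ]
}
```

For instance, `is_mixed_content "https://app.example.test" "http://app.example.test:19000"` succeeds, which is the case where `S3_PUBLIC_ENDPOINT_URL` needs an `https://…` base.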
---
## 6. Operational checklist after changes
```bash
./scripts/verify-prod-env.sh
docker compose --env-file .env -f docker-compose.prod.yml config >/dev/null
./scripts/deploy-prod.sh # or: up without -d for foreground logs
docker compose --env-file .env -f docker-compose.prod.yml ps
```
For Postgres persistence issues, skim **section 3** before editing `.env` again.
---
## 7. Postgres — `relation "audit_events" does not exist`
### Why it happens
`docker-entrypoint-initdb.d` on the Postgres image runs **only when the data volume is empty**. If the volume was created **before** `008_audit_events.sql` existed in compose, that migration never ran. **`be0`** then fails when it tries to write audit rows.
### Fix
**After pulling a current `be0` image / repo:** restart **`be0`**. On startup, `scripts/apply_initiative_migrations.py` applies **`008_audit_events.sql`** automatically if `public.audit_events` is missing (same pattern as migration 009).
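To confirm whether the table exists before or after the restart, a query on `to_regclass` is enough. This is a sketch, not what `apply_initiative_migrations.py` necessarily runs; `run_psql` and `relation_exists` are hypothetical helpers, and `POSTGRES_USER` / `POSTGRES_DB` are assumed to be exported in the calling shell.

```bash
# Runs one SQL statement inside the postgres service and prints the bare result.
# Assumes POSTGRES_USER and POSTGRES_DB are exported in the calling shell.
run_psql() {
  docker compose --env-file .env -f docker-compose.prod.yml exec -T postgres \
    psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -tAc "$1"
}

# True when the named relation exists; to_regclass returns NULL for unknown names.
relation_exists() {
  [ "$(run_psql "SELECT to_regclass('$1') IS NOT NULL")" = "t" ]
}
```

`relation_exists public.audit_events && echo present || echo missing` then reports the state without touching any data.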
Or apply by hand from the repo root on the server (adjust user/db to match `.env`):
```bash
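# ${POSTGRES_USER} / ${POSTGRES_DB} expand in the host shell: export them first,
# or replace them with the literal values from .env.
docker compose --env-file .env -f docker-compose.prod.yml exec -T postgres \
  psql -U "${POSTGRES_USER}" -d "${POSTGRES_DB}" \
  < be0/migrations/008_audit_events.sql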
```
---
## 8. Large uploads — `413 Request Entity Too Large` (evidence PDF, etc.)
The app allows evidence up to **50 MB** end-to-end, but **HTTPS reverse proxies** (for example nginx in front of `www.rcc-ump.com`) often keep nginx's default of **`client_max_body_size 1m`**, which rejects a **multi-megabyte** PDF **before** Docker sees the request. The browser console may show nginx's HTML error page, which ends with padding comments mentioning a “friendly error page”.
### Fix (nginx)
In the `server { }` block (or the `location` that proxies to your `fe0` port), set at least:
```nginx
client_max_body_size 64m;
```
Reload nginx after editing. If uploads are slow, you may also need longer timeouts on the same `location`:
```nginx
proxy_read_timeout 300s;
proxy_connect_timeout 300s;
proxy_send_timeout 300s;
```
If **Cloudflare** (or another CDN) sits in front of the origin, confirm it does not impose a smaller upload limit than nginx.
**Note:** Browsers hit **`fe0`** (Vite proxy `/api` → `be0`). The body limit must allow the full multipart upload on the **first** hop (usually nginx → origin), not only inside Docker.
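A configured limit can be sanity-checked against the 50 MB ceiling with a little arithmetic. The sketch handles only the `k` and `m` suffixes and treats `0` as "no limit" (nginx disables the body-size check for that value); both function names are hypothetical.

```bash
# Convert an nginx size such as "64m" or "512k" to bytes (plain numbers are bytes).
nginx_size_to_bytes() {
  local value=$1 number unit
  number=${value%[kKmM]}
  unit=${value#"$number"}
  case $unit in
    '')  echo "$number" ;;
    k|K) echo $(( number * 1024 )) ;;
    m|M) echo $(( number * 1024 * 1024 )) ;;
  esac
}

# True when limit $1 (nginx syntax) admits an upload of $2 megabytes.
body_limit_allows() {
  local bytes
  bytes=$(nginx_size_to_bytes "$1")
  [ "$bytes" -eq 0 ] || [ "$bytes" -ge $(( $2 * 1024 * 1024 )) ]
}
```

`body_limit_allows 1m 50` fails while `body_limit_allows 64m 50` succeeds; the headroom above 50 MB leaves space for multipart framing around the file itself.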