sciagent/docs/feedback-data-management.md

# Data Management Review — Admin Backup Feature

Reviewing `application-files-persistence.md` from the perspective of building an admin endpoint that produces a **trustworthy, complete archive** of an initiative: all evidence attachments, the submitted PDF, and the submitted DOCX.

The persistence layer is **functional** and the documentation is unusually candid about its rough edges (best-effort MinIO upload, polymorphic `storage_uri`, three different identifiers, an unwired schema file). That candor is the right starting point. But several of those rough edges become **load-bearing** the moment you add a backup feature, because a backup is essentially a contract that says *"these bytes are what was submitted"* — and right now the system cannot honestly make that claim for one of the three artifact types you want to back up.

The notes below are ordered by impact on the backup feature, not by where they appear in the document.

---

## 1. Critical: the DOCX is not stored — a "backup" that regenerates it is not a backup

This is the single biggest issue and it blocks the feature you want to ship.

> *"The binary DOCX is not stored as a file in MinIO."* … *"For backup, … call server-side preview endpoints for each version to regenerate DOCX/PDF if you need binaries in the archive."*

Regenerating the DOCX at backup time means the bytes you hand to an admin are produced by:

- **The current template** (`template_application_form.docx`), which will change.
- **The current `docxtpl` version**, which will change.
- **The current LibreOffice version**, which has known rendering drift across releases.
- **The current font set installed in the container**, which will change with every base-image upgrade.

A backup taken in 2026 of a 2024 submission may not be byte-identical, or even visually identical, to what the user actually submitted in 2024. For an approvals workflow that may have **legal or audit weight**, this is a real problem: you cannot prove what the applicant signed off on. If a dispute arises ("I never agreed to that section"), your backup is evidence of what the system *would produce today*, not what was submitted.

**Recommendation — do this before building the backup endpoint:**

1. At submit time, render the official DOCX **and** the official PDF, hash both, and write them immutably to MinIO with their SHA-256 in object metadata and in `application_artifacts`. Treat them the same way you (mostly) treat `full_pdf` today.
2. Keep the `application_review_documents` JSON bundle. It's still useful — for re-rendering with newer templates, for diffing, for analytics. But it stops being the source of truth for what was submitted.
3. The backup endpoint then *just streams stored bytes*, never invokes LibreOffice. This also removes a slow, fragile dependency from the admin request path.

Without this fix, anything else you build is a backup of *some* of the artifacts plus a regeneration of the rest, which is a different and weaker product than what you described.
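As a minimal sketch of recommendation 1 above: hash the rendered bytes once, write them under a key derived from that hash, and record the same digest in both object metadata and the artifact row. The names `ObjectStore`, `InMemoryStore`, `ArtifactRecord`, and `persist_submitted_artifact` are hypothetical, the key layout is an assumption, and the in-memory store only stands in for a MinIO client wrapper.

```python
import hashlib
from dataclasses import dataclass
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical interface for a MinIO client wrapper."""

    def put(self, key: str, data: bytes, metadata: dict[str, str]) -> None: ...


class InMemoryStore:
    """Test double that refuses overwrites, mimicking write-once behavior."""

    def __init__(self) -> None:
        self.objects: dict[str, tuple[bytes, dict[str, str]]] = {}

    def put(self, key: str, data: bytes, metadata: dict[str, str]) -> None:
        if key in self.objects:
            raise FileExistsError(f"refusing to overwrite {key}")
        self.objects[key] = (data, metadata)


@dataclass(frozen=True)
class ArtifactRecord:
    """Assumed shape of the row written to application_artifacts."""

    role: str
    storage_uri: str
    sha256: str
    size_bytes: int


def persist_submitted_artifact(
    store: ObjectStore, initiative_id: str, role: str, ext: str, data: bytes
) -> ArtifactRecord:
    """Hash the rendered bytes, store them once, and return the row to insert."""
    digest = hashlib.sha256(data).hexdigest()
    # The digest is part of the key, so different bytes can never share a key.
    key = f"{initiative_id}/{role}/{digest}.{ext}"
    store.put(key, data, {"sha256": digest})
    return ArtifactRecord(role=role, storage_uri=key, sha256=digest, size_bytes=len(data))
```

Calling this once for the DOCX and once for the PDF inside the submit handler would give the backup endpoint two stored, hashed objects to stream, with no rendering step involved.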

---

## 2. Critical: the submitted PDF lives in two places, and MinIO is best-effort

> *"If MinIO fails, the artifact still points at the filesystem URL only."*

This means at any moment, for any given submission, the canonical bytes of the submitted PDF live in **one of three states**:

- Filesystem only (MinIO upload failed, or feature was off).
- MinIO only (would happen if filesystem cleanup ever ran — does it?).
- Both (happy path, but with no guarantee the bytes match if anything ever rewrote one side).

A backup endpoint must handle all three, *and* it must know which to trust when both exist. The string-prefix logic in `_enrich_application_detail_full_pdf_presign` ("if it looks like a MinIO key, presign it; otherwise treat as filesystem URL") is too fragile to be the answer here.

There's also a deployment hazard hiding in the filesystem path. `SUBMITTED_INITIATIVES_DIR` defaults to `assets/submitted-initiatives` or `fe0/public/submitted-initiatives`. If that path is **not on a persistent volume** in the cloud deployment (easy to miss in a Docker setup), then container restarts silently lose data — and the artifact row still points at a now-404 URL. Your backup endpoint would happily produce a ZIP with a missing file and an admin would not know until they tried to open it.

**Recommendations:**

1. **Make MinIO upload synchronous and required at submit time.** If MinIO is down, fail the submission and let the user retry — don't silently degrade to a single-host filesystem copy. The current "best effort" pattern trades a visible error today for an invisible data-loss event later.
2. **Treat the filesystem location as a cache, not a store.** It's fine to keep it for the dev-mode static-file flow, but it should never be the only copy.
3. **Audit `SUBMITTED_INITIATIVES_DIR` mounting in every environment** before backup ships. If it's not on a persistent volume, fix it.
4. **Backfill** existing filesystem-only PDFs into MinIO with a one-time job. After that, every artifact row should resolve to a MinIO key.
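The backfill in recommendation 4 needs to know which state each submission is in before it copies anything. One possible shape for that classification is sketched below; `PdfState`, `classify_pdf`, and the two helper predicates are hypothetical names, and the sketch assumes the caller has already tried to read both copies (passing `None` for a copy that does not exist).

```python
import hashlib
from enum import Enum
from typing import Optional


class PdfState(Enum):
    MISSING = "missing"
    FILESYSTEM_ONLY = "filesystem_only"
    MINIO_ONLY = "minio_only"
    BOTH_MATCH = "both_match"
    BOTH_DIFFER = "both_differ"


def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def classify_pdf(fs_bytes: Optional[bytes], minio_bytes: Optional[bytes]) -> PdfState:
    """Classify where the submitted PDF currently lives and whether copies agree."""
    if fs_bytes is None and minio_bytes is None:
        return PdfState.MISSING
    if minio_bytes is None:
        return PdfState.FILESYSTEM_ONLY
    if fs_bytes is None:
        return PdfState.MINIO_ONLY
    if _sha256(fs_bytes) == _sha256(minio_bytes):
        return PdfState.BOTH_MATCH
    return PdfState.BOTH_DIFFER


def needs_backfill(state: PdfState) -> bool:
    """Only filesystem-only rows are safe to copy into MinIO automatically."""
    return state is PdfState.FILESYSTEM_ONLY


def needs_human_review(state: PdfState) -> bool:
    """Missing or disagreeing copies should stop the job, not be papered over."""
    return state in (PdfState.MISSING, PdfState.BOTH_DIFFER)
```

Running the classifier as a dry run first gives a count per state, which is also a cheap way to find out whether the "MinIO only" and "both differ" cases actually occur in production.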

---

## 3. High: `storage_uri` is polymorphic and parsed by string prefix

> *"`storage_uri` is either a **MinIO key** (under exports bucket) or a **relative URL** to static files"* … *"when `full_pdf.storage_uri` looks like a **MinIO key** (not `/submitted-initiatives` or `http`), …"*

Detecting storage type by string-shape is the kind of code that works for a year and then breaks the day someone changes a URL prefix, deploys behind a CDN, or introduces a second exports bucket. It also makes the backup logic harder to reason about: every code path that reads `storage_uri` re-implements the same brittle dispatch.

**Recommendation:** add an explicit `storage_kind` column (enum: `minio_exports`, `minio_attachments`, `filesystem`, `external_url`) to `application_artifacts`, populate it on every write, and dispatch on it. The migration is small. The clarity is permanent. Once #2 above is done, you'd expect almost all rows to be `minio_exports`, but the column lets you migrate confidently and lets the backup endpoint fail loudly when it sees something it doesn't know how to handle.
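A sketch of what dispatching on `storage_kind` could look like, using the four enum values proposed above. `StorageKind`, `UnknownStorageKind`, and `make_reader` are hypothetical names; the reader callables are placeholders for whatever actually fetches bytes from MinIO or disk.

```python
from enum import Enum
from typing import Callable, Dict


class StorageKind(str, Enum):
    MINIO_EXPORTS = "minio_exports"
    MINIO_ATTACHMENTS = "minio_attachments"
    FILESYSTEM = "filesystem"
    EXTERNAL_URL = "external_url"


class UnknownStorageKind(ValueError):
    """Raised when a row carries a kind the caller cannot handle."""


Reader = Callable[[str], bytes]


def make_reader(readers: Dict[StorageKind, Reader]) -> Callable[[str, str], bytes]:
    """Build a read function that dispatches on storage_kind, never on URI shape."""

    def read(storage_kind: str, storage_uri: str) -> bytes:
        try:
            kind = StorageKind(storage_kind)
        except ValueError as exc:
            raise UnknownStorageKind(f"unrecognized storage_kind {storage_kind!r}") from exc
        handler = readers.get(kind)
        if handler is None:
            # Known kind, but this caller has deliberately not registered it.
            raise UnknownStorageKind(f"no reader registered for {kind.value}")
        return handler(storage_uri)

    return read
```

The backup endpoint would register only the kinds it is prepared to archive, so a stray `external_url` row fails the request instead of silently producing an incomplete ZIP.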

---

## 4. High: integrity is recorded but apparently not verified on read

`application_artifacts.sha256` is stored. Nothing in the document mentions verifying it when the file is read back. For a backup endpoint, this needs to be **mandatory**:

- Compute SHA-256 while streaming bytes into the ZIP.
- Compare against the recorded value.
- If it mismatches, fail the entire backup loudly and log it as a P1 — silent corruption is worse than a missing file.
- Include the verified SHA in the manifest (see #10).

This is cheap to implement and turns a passive integrity field into an active guarantee. It also catches MinIO storage corruption, accidental object overwrites, and bugs that double-encode bytes — all of which have precedent in real systems.
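The verify-while-streaming step can be sketched as a generator that hashes each chunk as it passes through and raises once the source is exhausted. `stream_verified` and `IntegrityError` are hypothetical names; the source is assumed to be any binary file-like object, such as a MinIO response body.

```python
import hashlib
import io
from typing import BinaryIO, Iterator


class IntegrityError(Exception):
    """Raised when streamed bytes do not match the recorded SHA-256."""


def stream_verified(
    source: BinaryIO, expected_sha256: str, chunk_size: int = 64 * 1024
) -> Iterator[bytes]:
    """Yield chunks from source, then raise if the overall digest mismatches."""
    hasher = hashlib.sha256()
    while chunk := source.read(chunk_size):
        hasher.update(chunk)
        yield chunk
    actual = hasher.hexdigest()
    if actual != expected_sha256.lower():
        raise IntegrityError(f"expected {expected_sha256}, got {actual}")


if __name__ == "__main__":
    payload = b"example evidence bytes"
    recorded = hashlib.sha256(payload).hexdigest()
    copied = b"".join(stream_verified(io.BytesIO(payload), recorded))
    assert copied == payload
```

One consequence of hashing in a single pass: the mismatch is only known after every chunk has already been yielded. A synchronous endpoint therefore has to abort the response (leaving a truncated ZIP) rather than return a clean error, which is one more argument for the async job pattern for anything large.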

---

## 5. High: three identifiers and scan-based resolution

> *"`get_application_by_id` … scans submitted initiatives and matches when either `_submission_display_id(...) == applicationId`, or `initiative.case_code == applicationId`."*

A linear scan is fine at hundreds of records, slow at thousands, and unworkable at tens of thousands. It is also ambiguous: `_submission_display_id` is a derived value, so two submissions with subtly different metadata could in principle derive the same ID, and the scan would then match more than one record.

For the backup feature this matters more than usual because:

- Admins often want **bulk** backups (a date range, a status, an owner). Resolving N IDs with an O(N) scan each is O(N²) work.
- The endpoint is admin-facing, so slowness is less likely to be reported as "the app is broken" and more likely to silently get worse.

**Recommendations:**

1. Add a `submission_public_id` column (the `sub-...` value) on `initiatives`, indexed and unique. Compute it once at submit time, store it, never re-derive.
2. Change `get_application_by_id` to a single indexed lookup against either `submission_public_id` or `case_code`.
3. Document the resolution order explicitly. Right now the contract — "admins can deep-link with `sub-…` or sometimes `CASE-…`" — has a "sometimes" in it, which is exactly the kind of language that produces support tickets.
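The three recommendations above can be sketched together. The example uses an in-memory SQLite database purely so it runs standalone; the real table lives in Postgres, and the column types, the uniqueness of `case_code`, and the function names `create_schema` and `resolve_initiative_id` are all assumptions.

```python
import sqlite3
from typing import Optional


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a reduced initiatives table; UNIQUE also creates the index."""
    conn.executescript(
        """
        CREATE TABLE initiatives (
            id INTEGER PRIMARY KEY,
            submission_public_id TEXT NOT NULL UNIQUE,  -- stored once at submit time
            case_code TEXT UNIQUE                       -- assumed unique when present
        );
        """
    )


def resolve_initiative_id(conn: sqlite3.Connection, application_id: str) -> Optional[int]:
    """Resolve an applicationId. Order: submission_public_id first, then case_code."""
    row = conn.execute(
        "SELECT id FROM initiatives WHERE submission_public_id = ?",
        (application_id,),
    ).fetchone()
    if row is None:
        row = conn.execute(
            "SELECT id FROM initiatives WHERE case_code = ?",
            (application_id,),
        ).fetchone()
    return row[0] if row else None
```

Because the public ID is stored rather than re-derived, a collision surfaces as a constraint violation at submit time instead of an ambiguous match at lookup time.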

---

## 6. High: dead schema in `database/schema.sql`

> *"The root file `database/schema.sql` describes a separate **integer `applications`** domain (attachments table with `application_id` INT); that schema is **not** wired into `be0` today."*

This file will trick at least one future engineer into writing code against the wrong schema. It will also confuse any backup-related tooling, since "applicationId" in that file means an integer and in the running system means a `sub-...` string.

**Recommendation:** delete it, or move it to `database/unused/` with a `README.md` explaining why it's there. Don't leave authoritative-looking schema files next to the real ones. If there's a reason it can't be deleted (historical reference, planned migration), say so in a header comment at the top of the file.

---

## 7. Medium: evidence files have no apparent versioning

> *"One row per **`(initiative_id, role)`** … `research_evidence` | `textbook_evidence` | `technical_evidence`"*

If a user uploads research evidence, then re-uploads a corrected version before submitting, what happens to the old file? Three possibilities, all bad if not addressed:

- The MinIO object is overwritten — old bytes gone, no audit trail of the change.
- A new MinIO object is written but the old one is orphaned — storage grows forever, and nothing references the old bytes.
- The old `application_artifacts` row is updated in place — Postgres has no record that a previous version existed.

For backup integrity, this matters because **what gets reviewed and what gets archived may not be the same thing**. A reviewer might approve based on version 2 of an evidence file, but if version 3 was uploaded post-review, your backup would archive version 3.

**Recommendation:** decide explicitly. Either:

- Make evidence uploads append-only (new row per upload, old rows marked superseded), and have the backup capture the version that was current at submit time, *or*
- Document clearly that only the latest evidence is archived and that re-uploads after review are not tracked. Then add a UI guardrail preventing re-upload after review-locked status.

The first option is more work and almost certainly the right call for an approvals system.
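Under the append-only option, picking the version to archive reduces to "latest upload for this role at or before the submit timestamp". A minimal sketch, where `EvidenceVersion` and `version_at` are hypothetical names and the fields are an assumed subset of what an append-only artifacts table would hold:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class EvidenceVersion:
    role: str            # e.g. research_evidence
    sha256: str
    uploaded_at: datetime


def version_at(
    versions: Iterable[EvidenceVersion], role: str, cutoff: datetime
) -> Optional[EvidenceVersion]:
    """Return the newest version of a role uploaded at or before the cutoff."""
    candidates = [v for v in versions if v.role == role and v.uploaded_at <= cutoff]
    return max(candidates, key=lambda v: v.uploaded_at, default=None)
```

Passing the submit timestamp as `cutoff` archives what was submitted; passing the review timestamp instead would archive what the reviewer saw. Which of the two the backup should capture is the decision this section asks the team to make explicitly.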

---

## 8. Medium: `application_submit_snapshots.full_pdf_uri` can disagree with the artifact

> *"`full_pdf_uri` (today this records the **URL passed at submit time**, typically `/submitted-initiatives/...`, not necessarily the MinIO key)."*

So the snapshot table holds one URL form and the artifacts table can hold another. Which is canonical for the backup? The doc implies artifacts wins, but this isn't enforced anywhere. If the two ever drift, debugging "why does the snapshot say one thing and the backup contains another" will be painful.

**Recommendation:** treat `application_artifacts` as the single source of truth for "where the bytes live" and treat `application_submit_snapshots` as an immutable audit log of "what the request looked like at submit time." Document this distinction. Don't read `full_pdf_uri` from the snapshot for any operational purpose, including backup — it's history, not state.

---

## 9. Medium: MinIO bucket policy is not described

The document covers what's stored where, but not:

- **Versioning**: are buckets MinIO-versioned? If not, an accidental overwrite is unrecoverable.
- **Object lock / WORM**: for an approvals system with audit requirements, write-once on `initiative-exports` would protect against silent tampering, including from an admin with bucket credentials.
- **Lifecycle**: does anything age out? If retention rules apply (GDPR-style, contractual), the backup endpoint is exactly where they'll be tested.
- **Encryption at rest**: SSE config?
- **Backup of MinIO itself**: who backs up the backups?

These are not blockers for shipping the feature, but they're questions an auditor will ask the day after you ship it. Better to have answers in writing now. A short `MINIO_OPERATIONS.md` covering bucket policies, retention, and disaster recovery would close most of these in an afternoon.

---

## 10. Medium: backup endpoint design — stream, manifest, audit

A few concrete design points for the endpoint itself, since the doc's outline is sparse:

- **Stream the ZIP, never buffer it.** A single submission might have a few hundred MB of evidence. Buffering in memory will OOM the API container under modest concurrency. Use `zipstream-ng` (Python) or equivalent and write directly to the response.
- **Include a `manifest.json` at the root** of the ZIP, containing: `applicationId`, `case_code`, `initiative_id`, submitted-at timestamp, owner, status at backup time, list of files with their roles, original filenames, MIME types, byte sizes, recorded SHA-256, and *verified* SHA-256 (computed during streaming). The manifest is what makes the ZIP a self-describing archive rather than a folder of mystery bytes.
- **Use a clear directory structure inside the ZIP**, e.g. `submitted/full.pdf`, `submitted/full.docx`, `evidence/research/...`, `evidence/textbook/...`, `evidence/technical/...`, `manifest.json`. Avoid non-ASCII (for example Vietnamese) filenames as ZIP entry names: record the original filenames in `manifest.json` and name the entries with an ASCII-safe pattern such as `{role}-{n}-{sha-prefix}.{ext}`, so Windows and older zip tools don't choke on UTF-8.
- **For long downloads or bulk exports, switch to an async job pattern.** Admin requests a backup → server creates a job → job streams the ZIP into `initiative-exports` (or a dedicated `initiative-backups` bucket) → admin gets a presigned URL when ready. This isolates long-running work from the request lifecycle and survives reverse-proxy timeouts. For single-initiative backups a synchronous endpoint is fine; for "back up everything from Q1" it isn't.
- **Audit log every backup download.** Who, when, which `applicationId`, IP, user-agent, bytes streamed, success/failure. This is admin access to user-submitted content — it should be at least as well-logged as any other privileged action.
- **Consider a "verify-only" mode** that re-downloads from MinIO, recomputes SHAs, and reports discrepancies without producing a ZIP. Cheap to implement once #4 is in place, very useful for periodic data-integrity audits.
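The manifest, ASCII-safe naming, and verified-hash points above can be sketched in one function. This uses the standard-library `zipfile` module rather than `zipstream-ng`, so it illustrates the archive layout and not the HTTP streaming; `BackupFile`, `archive_name`, and `write_backup` are hypothetical names, the manifest carries only a subset of the fields listed above, and file contents are held as `bytes` here where a real endpoint would read a stream from MinIO.

```python
import hashlib
import io
import json
import zipfile
from dataclasses import dataclass
from typing import BinaryIO, Iterable


@dataclass(frozen=True)
class BackupFile:
    role: str
    original_filename: str
    recorded_sha256: str
    data: bytes


def archive_name(role: str, index: int, sha256: str, original_filename: str) -> str:
    """Build an ASCII-only entry name; the original name goes in the manifest."""
    ext = original_filename.rsplit(".", 1)[-1].lower() if "." in original_filename else "bin"
    if not (ext.isascii() and ext.isalnum()):
        ext = "bin"
    return f"evidence/{role}-{index}-{sha256[:8]}.{ext}"


def write_backup(out: BinaryIO, application_id: str, files: Iterable[BackupFile]) -> dict:
    """Write files plus a root manifest.json; raise on any SHA-256 mismatch."""
    manifest = {"applicationId": application_id, "files": []}
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for index, f in enumerate(files, start=1):
            name = archive_name(f.role, index, f.recorded_sha256, f.original_filename)
            hasher = hashlib.sha256()
            with zf.open(name, "w") as entry:
                # Hash and write chunk by chunk, as a streaming source would.
                for offset in range(0, len(f.data), 65536):
                    chunk = f.data[offset:offset + 65536]
                    hasher.update(chunk)
                    entry.write(chunk)
            verified = hasher.hexdigest()
            if verified != f.recorded_sha256:
                raise ValueError(f"SHA-256 mismatch for {name}")
            manifest["files"].append({
                "role": f.role,
                "archive_name": name,
                "original_filename": f.original_filename,
                "size_bytes": len(f.data),
                "recorded_sha256": f.recorded_sha256,
                "verified_sha256": verified,
            })
        zf.writestr("manifest.json", json.dumps(manifest, ensure_ascii=False, indent=2))
    return manifest
```

Writing the manifest last is deliberate: it can only list hashes that were actually verified while the bytes went into the archive. The same loop without the `zf.open` write is close to what a verify-only mode would run.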

---

## 11. Low: the quarantine bucket is undocumented

> *"`initiative-quarantine` (`S3_BUCKET_QUARANTINE`) — Reserved for quarantine flows (not detailed here)"*

If files can land in quarantine (presumably for AV scanning or content review), the backup endpoint needs a defined behavior for them: include and label, exclude entirely, or fail the backup until they're cleared. Pick one and document it. Otherwise this becomes the kind of edge case discovered by a real incident.
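Whichever behavior is chosen, making it an explicit setting keeps it out of ad hoc `if` statements. A sketch of the three options named above, where `QuarantinePolicy`, `Evidence`, `QuarantineBlocked`, and `select_for_backup` are hypothetical names and the `quarantined` flag is assumed to be available per file:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class QuarantinePolicy(Enum):
    INCLUDE_AND_LABEL = "include_and_label"
    EXCLUDE = "exclude"
    FAIL = "fail"


@dataclass(frozen=True)
class Evidence:
    name: str
    quarantined: bool


class QuarantineBlocked(Exception):
    """Raised when policy forbids backing up while files are quarantined."""


def select_for_backup(files: List[Evidence], policy: QuarantinePolicy) -> List[Evidence]:
    """Return the files to archive; the quarantined flag feeds the manifest label."""
    held = [f for f in files if f.quarantined]
    if not held or policy is QuarantinePolicy.INCLUDE_AND_LABEL:
        return list(files)
    if policy is QuarantinePolicy.EXCLUDE:
        return [f for f in files if not f.quarantined]
    raise QuarantineBlocked(f"{len(held)} file(s) still in quarantine")
```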

---

## What's already good

To be fair to the existing design — several decisions in this document are correct and worth preserving:

- **SHA-256 captured at upload time** is the right foundation; it just needs to be actively verified on read.
- **Append-only `application_submit_snapshots`** is exactly the right shape for an audit table. Don't ever let it become mutable.
- **Separate buckets per concern** (`attachments`, `exports`, `quarantine`) makes lifecycle and access policies straightforward.
- **`application_review_documents` storing JSON** is genuinely useful for re-rendering and analytics — the issue isn't that it exists, it's that it's currently being asked to also serve as the source of truth for "what was submitted," which is a job for stored bytes.
- **Public/internal endpoint split for MinIO** is the right pattern for presigned URLs. Most teams get this wrong on the first try.
- **Documenting the dual-identifier confusion explicitly** is rare and valuable. Don't lose this institutional knowledge.

---

## Suggested order of work

A pragmatic sequencing if the team can only do this incrementally:

1. **Week 1 — unblock the backup feature's core promise.** Persist the rendered DOCX and official PDF as immutable bytes in MinIO at submit time, with SHA-256. This is what makes "backup" actually mean backup. Until this lands, do not ship the backup endpoint — you'll have to break the contract later when you fix it.
2. **Week 2 — make storage canonical.** Make MinIO upload synchronous and required for the submitted PDF. Add `storage_kind` to `application_artifacts`. Backfill any filesystem-only rows. Verify `SUBMITTED_INITIATIVES_DIR` is on a persistent volume in every environment, or stop relying on it.
3. **Week 3 — build the endpoint.** Streaming ZIP, manifest with verified SHAs, audit log, async job pattern for bulk. Single-initiative download first; bulk later.
4. **Following sprint — clean up the foundations.** Delete or quarantine `database/schema.sql`. Add `submission_public_id` indexed column and remove the scan-based lookup. Decide and document evidence versioning. Write `MINIO_OPERATIONS.md`.
5. **Quarter horizon: harden.** Periodic verify-only sweeps. MinIO bucket versioning and object lock if compliance requires. Backup of MinIO itself (off-cluster).

The order matters: shipping the endpoint before step 1 produces a backup that lies about what it contains, and that's worse than not having a backup at all. Admins will rely on it, and you'll find out it's wrong only when something goes wrong elsewhere.