sciagent/docs/application-files-persistence-and-backup.md
Thinh Lam 688fac73e9
sciagent code + Gitea Actions CI/CD
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:38:30 +07:00

# Application files: persistence, retrieval by `applicationId`, and backup notes
This document describes how the running **initiative** stack stores and loads:
- Evidence attachments (minh chứng 2.1 / 2.2 / kỹ thuật)
- The **submitted full-package PDF** (đơn + báo cáo from the « Xem lại » flow)
- The **filled DOCX / official PDF** derived from the Word template
It focuses on what **PostgreSQL** and **MinIO** hold. The root file [`database/schema.sql`](../database/schema.sql) describes a separate **integer `applications`** domain (attachments table with `application_id` INT); that schema is **not** wired into `be0` today. Production behavior is driven by **`be0/migrations/*.sql`** and **`INITIATIVE_DATABASE_URL`**.
**Implementation planning:** The phased backup and storage-hardening plan below is **refined against** the review in [`feedback-data-management.md`](feedback-data-management.md) (canonical bytes, `storage_kind`, SHA verification on pack, streaming ZIP + manifest, indexed IDs, evidence versioning, and sequencing).
---
## Identifiers: what “applicationId” means
The UI and APIs expose a **public submission id** shaped like `sub-{16 hex chars}` (see `save_submitted_application` in `be0/src/initiative_db/submissions.py`). Internally, persistence is keyed by:
| Concept | Example | Where |
|--------|---------|--------|
| **Public `applicationId`** (list/detail) | `sub-abc123def4567890` | `drafts.payload.submissionRecord.id`, API responses |
| **Draft / case code** | `CASE-…` or `SUB-…` | `initiatives.case_code`, `draft_case_id` on API rows |
| **Initiative primary key** | UUID | `initiatives.id`, MinIO key prefix, `application_artifacts.initiative_id` |
**Resolving a row:** `get_application_by_id` (`be0/src/initiative_db/submissions.py`) scans submitted initiatives and matches when either:
- `_submission_display_id(initiative, submissionRecord) == applicationId`, or
- `initiative.case_code == applicationId`.
So admins can deep-link with **`sub-…`** or, when the initiative's `case_code` is known, with that code (e.g. **`CASE-…`**). For backups, always persist **`initiatives.id`**, **`case_code`**, and **`sub-…`** together.
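The matching rule above can be sketched as follows. This is a minimal illustration, not the code in `submissions.py`: `SubmittedRow`, `is_public_submission_id`, and `resolve_application` are hypothetical names, the display id is assumed to be precomputed per row, and the `sub-` pattern assumes lowercase hex.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed shape of the public id: "sub-" followed by 16 lowercase hex characters.
PUBLIC_ID_RE = re.compile(r"sub-[0-9a-f]{16}")


@dataclass(frozen=True)
class SubmittedRow:
    """Hypothetical flattened view of one submitted initiative."""
    initiative_id: str  # initiatives.id (UUID as text)
    case_code: str      # initiatives.case_code
    display_id: str     # public id taken from submissionRecord


def is_public_submission_id(value: str) -> bool:
    """True when the value looks like a public sub-... id."""
    return PUBLIC_ID_RE.fullmatch(value) is not None


def resolve_application(rows: Iterable[SubmittedRow],
                        application_id: str) -> Optional[SubmittedRow]:
    """Linear scan: match on the public display id or on case_code."""
    for row in rows:
        if row.display_id == application_id or row.case_code == application_id:
            return row
    return None
```

Because the lookup is a scan, its cost grows with the number of submitted initiatives, which is the motivation for the indexed `submission_public_id` column proposed in Phase 4.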
---
## MinIO
Configured in Docker via `S3_*` env vars (`docker-compose.yml`):
| Bucket (env) | Purpose |
|----------------|---------|
| **`initiative-attachments`** (`S3_BUCKET_ATTACHMENTS`) | Evidence uploads for Đơn (research / textbook / technical) |
| **`initiative-exports`** (`S3_BUCKET_EXPORTS`) | Optional copy of the **submitted full PDF** after successful submit |
| **`initiative-quarantine`** (`S3_BUCKET_QUARANTINE`) | Reserved for quarantine flows (not detailed here) |
**Object key layout** (`be0/src/minio/storage.py`):
- Evidence and export artifacts use **`build_key_for_initiative`**:
`initiatives/{initiative_uuid_no_hyphens}/{yyyy}/{mm}/{uuid}-{safe_filename}`
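As a rough illustration of that key layout, the sketch below builds a key from its parts. It is not the code in `storage.py`: `build_key` and `safe_filename` are hypothetical names, the sanitizing rule is an assumption, and the `{uuid}` segment is assumed to be a hyphenated UUID.

```python
import re
import uuid
from datetime import datetime, timezone
from typing import Optional


def safe_filename(name: str) -> str:
    """Assumed sanitizer: keep ASCII letters, digits, dot, dash, underscore."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("._")
    return cleaned or "file"


def build_key(initiative_id: uuid.UUID, filename: str,
              now: Optional[datetime] = None,
              object_id: Optional[uuid.UUID] = None) -> str:
    """Return initiatives/{uuid_no_hyphens}/{yyyy}/{mm}/{uuid}-{safe_filename}."""
    now = now or datetime.now(timezone.utc)
    object_id = object_id or uuid.uuid4()
    return (f"initiatives/{initiative_id.hex}/{now:%Y}/{now:%m}/"
            f"{object_id}-{safe_filename(filename)}")
```

The random per-object UUID keeps two uploads with the same filename from colliding under one initiative prefix.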
The API uses the **internal endpoint** for the server (`S3_ENDPOINT_URL`, e.g. `http://minio:9000`) and **`S3_PUBLIC_ENDPOINT_URL`** for presigned URLs the browser can open (e.g. `http://localhost:19000`).
**Integrity:** uploads compute SHA-256 and store it in object metadata and/or Postgres (`application_artifacts.sha256`).
---
## PostgreSQL (initiative database)
Core tables (`be0/migrations/001_initiative_schema.sql`, `002_application_storage_extensions.sql`, plus review-doc extensions):
### `initiatives`
- `id` (UUID), `case_code` (unique text), `owner_id`, `status`, `submitted_at`, etc.
- Submitted applications have `status != 'draft'` (e.g. `submitted`).
### `drafts`
- `payload` JSONB holds the live bundle: tab data, `submissionRecord`, `submissionFile`, etc.
After submit, important keys include:
- `payload.submissionRecord` — metadata including public `id` (`sub-…`)
- `payload.submissionFile` — e.g. `{ "url": "/submitted-initiatives/sub-….pdf", "type": "pdf" }`
### `application_artifacts`
One row per **`(initiative_id, role)`** (`002_application_storage_extensions.sql`). **Planned (Phase 1):** add roles for the **printable application form** binaries (e.g. **`official_form_docx`**, **`official_form_pdf`**) — distinct from **`full_pdf`** (the **client-uploaded** full hồ sơ PDF).
| `role` | Meaning |
|--------|---------|
| `full_pdf` | Submitted package PDF — **`storage_uri`** is either a **MinIO key** (under exports bucket) or a **relative URL** to static files |
| `research_evidence` | Minh chứng 2.1 (nghiên cứu) |
| `textbook_evidence` | Minh chứng 2.2 (giáo trình) |
| `technical_evidence` | Minh chứng kỹ thuật (nhóm 1) |
Columns: `storage_uri`, `original_name`, `mime_type`, `byte_size`, `sha256`, `uploaded_by`, `uploaded_at`, plus review fields for evidence.
### `application_submit_snapshots`
Append-only rows: merged tabs, submit metadata, and **`full_pdf_uri`** (today this records the **URL passed at submit time**, typically `/submitted-initiatives/...`, not necessarily the MinIO key).
Treat this table as **historical audit** of the submit request, not as the driver for backup byte locations: **`application_artifacts`** (and `storage_kind` once added) is the operational source of truth ([`feedback-data-management.md`](feedback-data-management.md) §8).
### `application_review_documents`
Versioned JSON used to regenerate the Word template output:
- `official_bieu_mau`, `template_data`, `full_bundle` (JSONB)
- Tied to **`initiative_id`** and `case_id`
**Today:** the binary filled DOCX is **not** stored in MinIO; this table is the only server-side input to regeneration. **Target (for a trustworthy admin backup):** treat this JSON as **supporting data** (re-render, analytics, diffing). The **canonical bytes** for “what the applicant signed off on” for the printable mẫu should be **immutable objects in MinIO** plus rows in `application_artifacts` (see [Implementation plan — Phase 1](#phase-1-canonical-bytes-for-printable-docx--pdf-before-backup-ships)).
### Other useful tables
- `draft_tab_snapshots` — history of tab JSON (`report` / `application` / `contribution`)
---
## Backend flows
### Evidence upload & download
- **POST** `/api/v1/application-drafts/{case_id}/evidence` — multipart upload; stores object in **`initiative-attachments`**; upserts `application_artifacts` with role `research_evidence` | `textbook_evidence` | `technical_evidence` (`be0/main.py`).
- **GET** `/api/v1/application-drafts/{case_id}/evidence` — returns metadata plus **presigned** `downloadUrl` / `viewUrl` for staff or owner.
`case_id` is normalized to the initiative's **`case_code`** (e.g. `CASE-…`).
### Submit full PDF
- **POST** `/api/applications/submit` — receives PDF + JSON `metadata` (`be0/main.py`).
- Always writes the file to **`SUBMITTED_INITIATIVES_DIR`** (default: repo `assets/submitted-initiatives` or `fe0/public/submitted-initiatives` in dev), served under **`/submitted-initiatives/{sub-….pdf}`**.
- If PostgreSQL is enabled: **`save_submitted_application`** updates `initiatives` / `drafts`, writes **`application_submit_snapshots`**, **`application_taxonomy`**, **`application_workflow`**, and **`upsert_artifact_full_pdf`**.
- **MinIO copy:** `_maybe_upload_submitted_pdf_to_exports_minio` uploads the same bytes to **`initiative-exports`** and, on success, sets **`application_artifacts.full_pdf.storage_uri`** to the **object key** (not the `/submitted-initiatives/...` URL). If MinIO fails, the artifact still points at the **filesystem URL** only — **this is slated to become a hard failure** once canonical storage is enforced ([Phase 2](#phase-2-canonical-storage-for-submitted-full-pdf)).
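The choice of `storage_uri` described in the last bullet can be sketched as below. This is an illustration under assumptions, not the code in `be0/main.py`: `choose_full_pdf_storage_uri` and `SubmitStorageError` are hypothetical names, the uploader is an injected callable, and the `required` flag models the stricter behaviour the text says is planned.

```python
from typing import Callable


class SubmitStorageError(RuntimeError):
    """Hypothetical retryable error raised when a required upload fails."""


def choose_full_pdf_storage_uri(
    pdf_bytes: bytes,
    filesystem_url: str,
    upload_to_exports: Callable[[bytes], str],
    required: bool = False,
) -> str:
    """Return the storage_uri to record for the full_pdf artifact.

    upload_to_exports uploads the bytes and returns the object key;
    it may raise on failure.
    """
    try:
        return upload_to_exports(pdf_bytes)
    except Exception as exc:
        if required:
            # Strict mode: surface the failure instead of silently degrading.
            raise SubmitStorageError("full_pdf upload to exports bucket failed") from exc
        # Best-effort mode: fall back to the filesystem URL only.
        return filesystem_url
```

With `required=False` a MinIO outage leaves the artifact pointing at `/submitted-initiatives/...`; with `required=True` the same outage fails the submit so the caller can retry.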
### Filled DOCX / official PDF (preview; persistence plan)
- **POST** `/api/v1/docx/preview-application-form` — renders `template_application_form.docx` with **docxtpl**; returns bytes (**no DB/MinIO write** today).
- **POST** `/api/v1/docx/preview-application-form-pdf` — same merge, then **LibreOffice** conversion to PDF; returns bytes.
The client builds `officialBieuMau` from draft state; **`persistReviewDocumentBundle`** (**POST** `/api/v1/review-documents`) saves the JSON bundle to **`application_review_documents`**.
**Preview endpoints remain useful** for staff “what-if” and for regenerating with newer templates. **They must not** be the only path that feeds the admin backup ZIP once Phase 1 is done — backups should stream **stored** printable DOCX/PDF bytes unless a legacy row has no stored object (then document explicit fallback or backfill).
### Admin detail: presigned full PDF
For **GET** `/api/applications/{application_id}`, when `full_pdf.storage_uri` looks like a **MinIO key** (not `/submitted-initiatives` or `http`), **`_enrich_application_detail_full_pdf_presign`** adds `files.fullText.viewUrl` (presigned GET on **`initiative-exports`**).
---
## Frontend
| Concern | Location |
|---------|----------|
| Submit PDF | `fe0/src/components/applicant/submitInitiativePdf.ts` → **POST** `/api/applications/submit` with `FormData` + JWT; metadata includes **`initiativeCaseId`** (must match Postgres `case_code`). |
| Draft load/save | `fe0/src/components/applicant/applicationDrafts.ts` → **GET/POST** `/api/v1/application-drafts/...`. |
| DOCX/PDF from template | `fe0/src/lib/applicationFormDocxApi.ts` → preview endpoints; `ApplicationFormDocxPreview.tsx` orchestrates save + review bundle persistence. |
| Evidence UI | e.g. `ApplicationEvidenceManagePage.tsx` — uses **GET** `/api/v1/application-drafts/{caseId}/evidence` with presigned URLs. |
| Admin list/detail | Uses **GET** `/api/applications` (list) and **GET** `/api/applications/{application_id}` (detail); detail exposes `draft_case_id` for loading drafts/evidence. |
Important: **`sub-…`** is the list id; **draft/evidence APIs use `case_code` (`CASE-…`)**. The API surfaces `draft_case_id` on submission rows to bridge the two.
---
## Applicant honesty checkboxes, complete tabs & PDF minh chứng (engineering guide)
Goal: applicants cannot tick the **cam kết trung thực** checkboxes at the end of **Báo cáo**, **Đơn**, and **Xác nhận đóng góp** until the workflow rules below are satisfied; the UI shows a **Sonner** toast listing missing items. **PDF minh chứng** means the classification-specific evidence file for Đơn (research / textbook / technical), stored in **MinIO** via `POST /api/v1/application-drafts/{case_id}/evidence` (see [Evidence upload & download](#evidence-upload--download)).
### Intended behaviour (product)
| Control | When it may be ticked |
|--------|------------------------|
| **Báo cáo** (`InitiativeReportForm`) | All required fields on the report tab are non-empty (§1–§6 narrative + hiệu quả fields exposed in the UI). |
| **Đơn** (`InitiativeApplicationForm`) | All required Đơn fields are complete **and** the correct **PDF minh chứng** slot is filled for the chosen classification (local `File`, or `FileHandle` with `serverStorageKey` after MinIO upload). Sub-forms (bản cam kết / biểu xác nhận) must match the selected nhóm. |
| **Xác nhận đóng góp** (`ContributionConfirmationForm`) | Same checks as Đơn **and** Báo cáo, **and** the applicant has already ticked honesty on **Báo cáo** and **Đơn**. |
| **Xem lại — Gửi** (`ApplicationFormDocxPreview`) | Same as contribution gate **plus** `contribution.digitalSignatureConfirmed` in the persisted contribution JSON. |
Implementation reference:
- Shared validators + messages: `fe0/src/lib/applicantHonestyPrerequisites.ts` (`collectReportTabHonestyGaps`, `collectApplicationTabHonestyGaps`, `collectContributionDigitalSignaturePrerequisiteGaps`, `collectApplicantSubmitToAdminPrerequisiteGaps`, `formatApplicantPrerequisiteToastDescription`).
- Checkbox handlers toast with `toast.error(..., { description })` and **do not** flip state when prerequisites fail.
Staff / council flows without `DraftProvider` skip the contribution-tab signature gate (no full draft in context); fields stay **`readOnly`** as today.
### Frontend (detailed)
1. **Single source of truth for messages** — Keep gap strings in `applicantHonestyPrerequisites.ts` so DOCX preview and forms stay aligned.
2. **Evidence PDF** — Treat as present if `applicantEvidencePdfPresent(file)` is true: `File` with non-zero size, or `FileHandle` with `serverStorageKey` (MinIO) or positive `size` (IndexedDB). Matches hydration in `DraftContext` after `getApplicationEvidence(caseId)`.
3. **Contribution tab** — Uses `draft.report` and `draft.application` from `DraftContext`; authors/% totals are validated on Đơn; contribution UI mirrors `authors` when connected to Postgres drafts.
4. **Review submit** — Besides tab JSON, enforce contribution signature flag on the object passed into `ApplicationFormDocxPreview` (from `draftTabs.contribution`).
### Backend (recommended)
The honesty gates in the UI run **client-side**. For integrity, the server re-checks at these endpoints:
- **`POST /api/applications/submit`** — Implemented in `be0/src/initiative_db/submission_readiness.py`, invoked from `save_submitted_application` **before** the initiative is marked submitted. Loads merged `drafts.payload.tabs` (with snapshot fallback), reads **`application_artifacts`** for `research_evidence` / `textbook_evidence` / `technical_evidence` (non-empty `storage_uri`), and validates tab JSON + honesty flags to match the applicant UI. On failure: **400** with `detail: { "message": "…", "missing": ["…", …] }` (see `ApplicationSubmissionNotReadyError` handling in `be0/main.py`). The client maps this in `fe0/src/components/applicant/submitInitiativePdf.ts`. Partial PDF written on disk is removed when Postgres validation fails.
- **`POST /api/v1/application-drafts/{case_id}/evidence`** — Already the canonical upload path; reject non-PDF or oversize files (existing behaviour).
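A server-side readiness check of the kind described for submit might look like the sketch below. It is not the content of `submission_readiness.py`: `SubmissionNotReady`, `collect_readiness_gaps`, and `ensure_ready` are hypothetical names, and the evidence rule is simplified to "at least one evidence artifact with a non-empty `storage_uri`" rather than the classification-specific slot.

```python
from typing import Any, Dict, Iterable, List, Mapping

EVIDENCE_ROLES = frozenset(
    {"research_evidence", "textbook_evidence", "technical_evidence"}
)

# (tab, flag) pairs that must be True before submit.
REQUIRED_FLAGS = (
    ("report", "honestyConfirmed"),
    ("application", "honestyConfirmed"),
    ("contribution", "digitalSignatureConfirmed"),
)


class SubmissionNotReady(Exception):
    """Hypothetical stand-in for the readiness error described in the text."""

    def __init__(self, missing: List[str]) -> None:
        super().__init__("Application is not ready to submit")
        self.missing = missing

    def to_detail(self) -> Dict[str, Any]:
        # Shape of the documented 400 body: {"message": ..., "missing": [...]}
        return {"message": str(self), "missing": self.missing}


def collect_readiness_gaps(tabs: Mapping[str, Any],
                           artifacts: Iterable[Mapping[str, Any]]) -> List[str]:
    """List what is still missing; an empty list means ready."""
    gaps = [f"{tab}.{flag}" for tab, flag in REQUIRED_FLAGS
            if (tabs.get(tab) or {}).get(flag) is not True]
    has_evidence = any(
        a.get("role") in EVIDENCE_ROLES and a.get("storage_uri") for a in artifacts
    )
    if not has_evidence:
        gaps.append("evidence")
    return gaps


def ensure_ready(tabs: Mapping[str, Any],
                 artifacts: Iterable[Mapping[str, Any]]) -> None:
    """Raise SubmissionNotReady when any prerequisite is missing."""
    gaps = collect_readiness_gaps(tabs, artifacts)
    if gaps:
        raise SubmissionNotReady(gaps)
```

Returning the full list of gaps, rather than failing on the first one, lets the client show every missing item in a single toast.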
### PostgreSQL
- Tab JSON lives under **`drafts.payload`** (and/or tab snapshots). Honesty flags are plain booleans: `report.honestyConfirmed`, `application.honestyConfirmed`, `contribution.digitalSignatureConfirmed`. No migration is required for gating unless you add a **server-side** “submission readiness” snapshot column.
### MinIO
- Required PDF for Đơn is stored under **`initiative-attachments`** with keys from `build_key_for_initiative`; metadata is reflected in **`application_artifacts`** (`research_evidence` | `textbook_evidence` | `technical_evidence`). Frontend readiness should agree with **either** the draft file handle (`serverStorageKey`) **or** a fresh **`GET .../evidence`** bundle (see `collectDocxTemplateCompletenessGaps` in admin review for a related pattern).
---
## Retrieving everything for one submission (interim checklist)
Until Phases 1–2 are done, a reader resolving **`applicationId`** (`sub-…`) should:
1. **Postgres:** Resolve `initiatives` + latest `drafts` (today: `get_application_by_id` scan; target: indexed `submission_public_id` — [Phase 4](#phase-4-identifiers--schema-hygiene)).
2. **Submitted full-package PDF (`full_pdf` artifact):** Read `application_artifacts` with `role = 'full_pdf'`. Dispatch on **`storage_kind`** once added; until then, avoid relying only on string-prefix heuristics for production backups.
3. **Evidence:** Roles `research_evidence`, `textbook_evidence`, `technical_evidence` → keys in **`initiative-attachments`**.
4. **Printable mẫu DOCX/PDF:** After Phase 1, stream from MinIO using new artifact roles; until then see **legacy** note in Phase 3.
Optional ZIP extras: latest `application_review_documents` JSON, `draft_tab_snapshots`, read-only copies of `application_submit_snapshots` for audit.
**Related rationale and risks** (regeneration vs backup, polymorphic `storage_uri`, integrity): [`feedback-data-management.md`](feedback-data-management.md).
---
## Implementation plan: admin backup (database + document management)
Goal: **admin downloads one ZIP** containing **all evidence attachments**, the **submitted full-package PDF**, and the **printable application DOCX + PDF** (mẫu), with **verifiable integrity** and **no reliance on regenerating** printable documents at download time (after prerequisites).
Phasing follows the sequencing in [`feedback-data-management.md`](feedback-data-management.md) §“Suggested order of work”, expanded into concrete schema and API work.
### Phase 0 — Decisions & prerequisites
| Item | Action |
|------|--------|
| **Canonical bytes for printable mẫu** | Store immutable DOCX + PDF in MinIO at submit (or immediately pre-submit in the same transaction as finalize), not only JSON. |
| **Evidence versioning** | Decide: append-only evidence history vs “latest only”. For approvals, prefer **versioned or append-only** so backup matches what was reviewed ([`feedback-data-management.md`](feedback-data-management.md) §7). |
| **Quarantine bucket** | Define behavior if objects exist in **`initiative-quarantine`**: include/exclude/fail backup ([`feedback-data-management.md`](feedback-data-management.md) §11). |
| **MinIO operations** | Document versioning, lifecycle, retention, DR (suggested spin-off: `MINIO_OPERATIONS.md` per feedback §9). |
| **Dead schema** | Move or clearly label [`database/schema.sql`](../database/schema.sql) so tooling does not confuse INT `application_id` with `sub-…` ([`feedback-data-management.md`](feedback-data-management.md) §6). |
### Phase 1 — Canonical bytes for printable DOCX + PDF (before backup ships)
**Problem:** Regenerating DOCX/PDF at backup time uses **current** template, docxtpl, LibreOffice, and fonts — not provably what the applicant saw ([`feedback-data-management.md`](feedback-data-management.md) §1).
**Database**
- Extend `application_artifacts.role` CHECK (new migration) with two roles, e.g. **`official_form_docx`** and **`official_form_pdf`** (names TBD; must be distinct from **`full_pdf`**, which is the **client-uploaded full hồ sơ** PDF).
- On successful submit (or single “finalize” step server-side): compute SHA-256 for each file; **`INSERT`/upsert** rows with `storage_uri` = MinIO key, `sha256`, `byte_size`, `mime_type`, `original_name`, **`storage_kind = 'minio_exports'`** (once column exists).
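The row written for each printable binary could be assembled roughly as follows. This is a sketch under assumptions: `build_official_form_artifact` is a hypothetical helper, the role names are the tentative ones from the text (still TBD), and `storage_kind` only applies once that column exists.

```python
import hashlib
from typing import Any, Dict

# Tentative role names from the plan; the final names are still TBD.
OFFICIAL_FORM_ROLES = frozenset({"official_form_docx", "official_form_pdf"})


def build_official_form_artifact(initiative_id: str, role: str, object_key: str,
                                 data: bytes, original_name: str,
                                 mime_type: str) -> Dict[str, Any]:
    """Build the application_artifacts values for one stored binary."""
    if role not in OFFICIAL_FORM_ROLES:
        raise ValueError(f"unexpected role: {role}")
    return {
        "initiative_id": initiative_id,
        "role": role,
        "storage_uri": object_key,          # MinIO key, not a URL
        "storage_kind": "minio_exports",    # assumes the Phase 2 column
        "original_name": original_name,
        "mime_type": mime_type,
        "byte_size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Computing `sha256` and `byte_size` from the same bytes that are uploaded is what lets the backup step verify the object later without trusting the template pipeline.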
**Application logic**
- Server: build `officialBieuMau` from the same snapshot used for submission (bundle already available in draft + review document path), call existing **`fill_application_form_docx`** → bytes; call **`convert_docx_bytes_to_pdf`** → bytes; upload both to **`initiative-exports`** using `build_key_for_initiative`.
- **Do not** put LibreOffice on the admin **download** path after this; optional background **verify-only** job may re-read objects.
**JSON**
- Keep saving **`application_review_documents`** for re-render/diff; it is **not** the sole legal snapshot of the printable files once binaries exist.
**Gate:** Do **not** release the admin backup endpoint that promises “printable DOCX/PDF” until this phase is done for **new** submits; for **legacy** rows without these artifacts, define policy (backfill job vs manifest flag `missing_official_form: true`).
### Phase 2 — Canonical storage for submitted full-package PDF
**Problem:** `full_pdf` may point at filesystem-only, MinIO-only, or both; best-effort upload risks silent loss ([`feedback-data-management.md`](feedback-data-management.md) §2).
**Database**
- Add **`storage_kind`** on **`application_artifacts`** (enum/text): e.g. `minio_exports`, `minio_attachments`, `filesystem`, `external_url`. Backfill from existing `storage_uri` shape; default new rows explicitly.
- Optionally add **`content_sha256_verified_at`** or rely on manifest at backup time only.
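A backfill of `storage_kind` from the existing `storage_uri` shape might use a classifier like the one below. It is a sketch, not a migration: `infer_storage_kind` is a hypothetical name, and the rules assume that keys without a leading `/` or an `http` scheme are MinIO keys, with evidence roles living in the attachments bucket and everything else in exports.

```python
EVIDENCE_ROLES = frozenset(
    {"research_evidence", "textbook_evidence", "technical_evidence"}
)


def infer_storage_kind(role: str, storage_uri: str) -> str:
    """Map a legacy (role, storage_uri) pair to an explicit storage_kind."""
    uri = storage_uri.strip()
    if not uri:
        raise ValueError("empty storage_uri cannot be classified")
    if uri.startswith(("http://", "https://")):
        return "external_url"
    if uri.startswith("/"):
        # e.g. /submitted-initiatives/sub-....pdf served from disk
        return "filesystem"
    if role in EVIDENCE_ROLES:
        return "minio_attachments"
    return "minio_exports"
```

Running the classifier once at backfill time keeps the prefix heuristic out of the backup path, which can then dispatch on the stored column alone.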
**Application logic**
- Make **MinIO upload of `full_pdf` synchronous and required** when persistence is enabled: if upload fails, **fail submit** with retryable error.
- Treat filesystem write as **cache** for dev/static serving if desired, not sole store.
- **Backfill job:** filesystem-only historical PDFs → **`initiative-exports`**, then update artifact row + `storage_kind`.
**Infrastructure**
- Ensure **`SUBMITTED_INITIATIVES_DIR`** is on a **persistent volume** in every environment, or stop relying on it for production.
### Phase 3 — Admin backup endpoint + ZIP contract
**Authorization:** admin-only; **audit** every request: actor, `applicationId`, timestamp, outcome, bytes streamed ([`feedback-data-management.md`](feedback-data-management.md) §10).
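One possible shape for the audit record is sketched below. The field set follows the list above; `BackupAuditEvent`, `make_audit_event`, `to_log_line`, and the example outcome values are hypothetical, and where the record is persisted (table or log stream) is left open.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class BackupAuditEvent:
    actor: str            # admin user id
    application_id: str   # the requested applicationId
    outcome: str          # illustrative values: "ok", "sha_mismatch", "forbidden"
    bytes_streamed: int
    timestamp: str        # ISO 8601, UTC


def make_audit_event(actor: str, application_id: str, outcome: str,
                     bytes_streamed: int,
                     now: Optional[datetime] = None) -> BackupAuditEvent:
    """Create one audit record, stamping the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return BackupAuditEvent(actor, application_id, outcome,
                            bytes_streamed, now.isoformat())


def to_log_line(event: BackupAuditEvent) -> str:
    """Serialize the event as one JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)
```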
**Resolution:** load initiative by **`submission_public_id`** or **`case_code`** (indexed) after Phase 4; until then use existing lookup with awareness of scan cost for **bulk** exports.
**Integrity**
- While streaming each file into the ZIP, **compute SHA-256** and **compare** to `application_artifacts.sha256`. On mismatch: **fail entire export**, log at high severity ([`feedback-data-management.md`](feedback-data-management.md) §4).
- Optional **`POST /admin/…/backup/verify`** (verify-only, no ZIP) for periodic audits.
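The verify-while-streaming step can be sketched with a generator that hashes each chunk as it passes through. `verified_chunks` and `IntegrityError` are hypothetical names; the chunk source is assumed to be any iterable of bytes, such as an object-store read loop.

```python
import hashlib
from typing import Iterable, Iterator


class IntegrityError(Exception):
    """Raised when streamed bytes do not match the stored SHA-256."""


def verified_chunks(chunks: Iterable[bytes], expected_sha256: str) -> Iterator[bytes]:
    """Yield chunks unchanged, then compare the running digest at the end."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
        yield chunk
    actual = digest.hexdigest()
    if actual != expected_sha256.lower():
        raise IntegrityError(f"sha256 mismatch: expected {expected_sha256}, got {actual}")
```

The mismatch is only detectable after the last chunk has been yielded, so the caller has to abort the response or discard the partially written ZIP when `IntegrityError` is raised. The verify-only endpoint could consume the same generator and throw the chunks away.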
**ZIP layout** (suggested; ASCII-safe entry names, original names in manifest):
```text
manifest.json
submitted/full-package.pdf
submitted/official-form.docx
submitted/official-form.pdf
evidence/research/{safe-name-or-id}
evidence/textbook/…
evidence/technical/…
metadata/application_review_documents.json # optional
```
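One way to derive ASCII-safe entry names for that layout is sketched below. The fixed paths come from the layout above; `zip_entry_name`, `ascii_safe`, the sanitizing rule, and the fallback to an artifact id are assumptions for illustration.

```python
import re
import unicodedata

FIXED_ENTRIES = {
    "full_pdf": "submitted/full-package.pdf",
    "official_form_docx": "submitted/official-form.docx",
    "official_form_pdf": "submitted/official-form.pdf",
}
EVIDENCE_DIRS = {
    "research_evidence": "evidence/research",
    "textbook_evidence": "evidence/textbook",
    "technical_evidence": "evidence/technical",
}


def ascii_safe(name: str, fallback: str) -> str:
    """Drop accents and unsafe characters; use fallback if nothing is left."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "-", ascii_only).strip("-.")
    return cleaned or fallback


def zip_entry_name(role: str, original_name: str, artifact_id: str) -> str:
    """Return the ZIP path for one artifact, keyed by its role."""
    if role in FIXED_ENTRIES:
        return FIXED_ENTRIES[role]
    if role in EVIDENCE_DIRS:
        return f"{EVIDENCE_DIRS[role]}/{ascii_safe(original_name, artifact_id)}"
    raise ValueError(f"no ZIP location defined for role: {role}")
```

Replacing `/` and stripping leading dots also keeps an uploaded filename from escaping its folder inside the archive; the untouched original name still travels in the manifest.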
**`manifest.json`** (minimum fields): `applicationId`, `case_code`, `initiative_id`, submitted timestamps, owner id, **list of files** with `role`, `original_name`, `mime_type`, `byte_size`, **stored** `sha256`, **verified** `sha256` (computed during ZIP build), `storage_kind`.
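A manifest with those minimum fields could be assembled as in the sketch below. The JSON key names (`stored_sha256`, `verified_sha256`, `submitted_at`, `owner_id`, `files`) and the `build_manifest` helper are assumptions; only the field list itself comes from the text.

```python
import json
from typing import Any, Iterable, Mapping

FILE_FIELDS = ("role", "original_name", "mime_type", "byte_size",
               "stored_sha256", "verified_sha256", "storage_kind")


def build_manifest(application_id: str, case_code: str, initiative_id: str,
                   submitted_at: str, owner_id: str,
                   files: Iterable[Mapping[str, Any]]) -> str:
    """Return manifest.json content; every file entry must carry all fields."""
    entries = []
    for f in files:
        missing = [k for k in FILE_FIELDS if k not in f]
        if missing:
            raise ValueError(f"file entry missing fields: {missing}")
        entries.append({k: f[k] for k in FILE_FIELDS})
    manifest = {
        "applicationId": application_id,
        "case_code": case_code,
        "initiative_id": initiative_id,
        "submitted_at": submitted_at,
        "owner_id": owner_id,
        "files": entries,
    }
    # ensure_ascii=False keeps Vietnamese original names readable in the file.
    return json.dumps(manifest, ensure_ascii=False, indent=2, sort_keys=True)
```

Since `verified_sha256` is only known after each file has been streamed, a streaming packer would write `manifest.json` as the last ZIP entry.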
**Transport**
- **Stream ZIP** with a streaming library (e.g. `zipstream-ng`); **do not** buffer whole archives in memory.
- Single-initiative: synchronous response acceptable.
- **Bulk** (date range, many rows): **async job** → write ZIP to **`initiative-exports`** or **`initiative-backups`** → presigned URL when ready (avoids proxy timeouts).
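To make the streaming idea concrete, the sketch below yields ZIP bytes incrementally using only the standard-library `zipfile` module rather than `zipstream-ng`, so that it stays self-contained. `stream_zip` and `_ChunkSink` are hypothetical names. It relies on `zipfile` accepting a non-seekable output, in which case entry sizes are written after the data.

```python
import io
import zipfile
from typing import Iterable, Iterator, List, Tuple


class _ChunkSink(io.RawIOBase):
    """Write-only, non-seekable buffer that hands back whatever was written."""

    def __init__(self) -> None:
        super().__init__()
        self._parts: List[bytes] = []

    def writable(self) -> bool:
        return True

    def write(self, data) -> int:
        self._parts.append(bytes(data))
        return len(data)

    def drain(self) -> bytes:
        out = b"".join(self._parts)
        self._parts.clear()
        return out


def stream_zip(entries: Iterable[Tuple[str, Iterable[bytes]]]) -> Iterator[bytes]:
    """Yield ZIP bytes as (name, chunks) entries are consumed."""
    sink = _ChunkSink()
    with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, chunks in entries:
            with zf.open(name, mode="w") as dest:
                for chunk in chunks:
                    dest.write(chunk)
                    data = sink.drain()
                    if data:
                        yield data
            # Closing the entry flushes the compressor and its trailer.
            data = sink.drain()
            if data:
                yield data
    # Closing the archive writes the central directory.
    tail = sink.drain()
    if tail:
        yield tail
```

Memory use is bounded by the chunk size rather than the archive size. Entries larger than about 2 GiB would additionally need `force_zip64=True` on `zf.open`, which this sketch omits.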
**Sources for each ZIP entry**
| Content | Source |
|--------|--------|
| Full hồ sơ PDF | `application_artifacts.full_pdf` → MinIO **`initiative-exports`** (after Phase 2) |
| Printable DOCX / PDF | `official_form_docx` / `official_form_pdf` → MinIO **`initiative-exports`** |
| Evidence | `research_*`, `textbook_*`, `technical_*` → MinIO **`initiative-attachments`** |
| Structured snapshot | Optional: latest `application_review_documents` JSON |
**Legacy:** If `official_form_*` missing, either skip with manifest flags or run **one-time backfill** using frozen template policy — **document** that backfilled bytes are “as-of backfill date” not original submit date.
### Phase 4 — Identifiers & schema hygiene
- Add **`submission_public_id`** (unique, indexed) on **`initiatives`**, set once at submit; replace linear scan in `get_application_by_id` with indexed lookup ([`feedback-data-management.md`](feedback-data-management.md) §5).
- Document resolution: **`sub-…`** vs **`CASE-…`** explicitly (remove “sometimes” from ops docs).
### Phase 5 — Hardening (ongoing)
- MinIO **versioning** / **object lock** if compliance requires; off-cluster backup of MinIO; periodic verify-only sweeps ([`feedback-data-management.md`](feedback-data-management.md) §9, §10, quarter roadmap).
---
### Frontend (admin)
- New **“Tải bản sao lưu”** (or similar) on application detail: call backup endpoint, handle long downloads (progress if async + poll).
- For async pattern: show job id, link when presigned URL ready.
- Ensure **admin audit** expectations match backend logging.
---
### Summary
| Layer | Current summary | After plan |
|--------|-----------------|------------|
| **Postgres** | Artifacts + polymorphic `storage_uri` | Explicit `storage_kind`, optional `submission_public_id`, new artifact roles for official DOCX/PDF |
| **MinIO** | Evidence + best-effort full PDF | Required `full_pdf` + official form binaries on **`initiative-exports`**; evidence on **`initiative-attachments`** |
| **Admin backup** | Would require regeneration / fragile dispatch | Streaming ZIP + manifest + verified SHA + audit; optional async for bulk |
This aligns the **database and document management system** with a backup that **admins can trust**: **stored bytes**, **verified at pack time**, and **operationally grounded** in explicit storage metadata.