sciagent code + Gitea Actions CI/CD

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thinh Lam
2026-06-30 09:38:30 +07:00
commit 688fac73e9
1167 changed files with 158244 additions and 0 deletions
# Application files: persistence, retrieval by `applicationId`, and backup notes
This document describes how the running **initiative** stack stores and loads:
- Evidence attachments (minh chứng 2.1 / 2.2 / kỹ thuật: research, textbook, and technical evidence)
- The **submitted full-package PDF** (đơn + báo cáo, i.e. application form and report, from the « Xem lại » review flow)
- The **filled DOCX / official PDF** derived from the Word template
It focuses on what **PostgreSQL** and **MinIO** hold. The root file [`database/schema.sql`](../database/schema.sql) describes a separate **integer `applications`** domain (attachments table with `application_id` INT); that schema is **not** wired into `be0` today. Production behavior is driven by **`be0/migrations/*.sql`** and **`INITIATIVE_DATABASE_URL`**.
**Implementation planning:** The phased backup and storage-hardening plan below is **refined against** the review in [`feedback-data-management.md`](feedback-data-management.md) (canonical bytes, `storage_kind`, SHA verification on pack, streaming ZIP + manifest, indexed IDs, evidence versioning, and sequencing).
---
## Identifiers: what “applicationId” means
The UI and APIs expose a **public submission id** shaped like `sub-{16 hex chars}` (see `save_submitted_application` in `be0/src/initiative_db/submissions.py`). Internally, persistence is keyed by:
| Concept | Example | Where |
|--------|---------|--------|
| **Public `applicationId`** (list/detail) | `sub-abc123def4567890` | `drafts.payload.submissionRecord.id`, API responses |
| **Draft / case code** | `CASE-…` or `SUB-…` | `initiatives.case_code`, `draft_case_id` on API rows |
| **Initiative primary key** | UUID | `initiatives.id`, MinIO key prefix, `application_artifacts.initiative_id` |
**Resolving a row:** `get_application_by_id` (`be0/src/initiative_db/submissions.py`) scans submitted initiatives and matches when either:
- `_submission_display_id(initiative, submissionRecord) == applicationId`, or
- `initiative.case_code == applicationId`.
So admins can deep-link with **`sub-…`** or sometimes **`CASE-…`**. For backups, always persist **`initiatives.id`**, **`case_code`**, and **`sub-…`** together.
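
The matching rule above can be sketched as a linear scan. This is an illustrative stand-in, not the code in `submissions.py`: the row type `SubmittedRow`, the function `resolve_application`, and the lowercase-only id pattern are assumptions, and the display id is taken as already computed for each row.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional

# Public ids are shaped like sub-{16 hex chars}; lowercase-only is an assumption.
PUBLIC_ID_RE = re.compile(r"^sub-[0-9a-f]{16}$")


@dataclass(frozen=True)
class SubmittedRow:
    """Hypothetical flattened view of one submitted initiative."""
    initiative_id: str  # initiatives.id (UUID)
    case_code: str      # initiatives.case_code
    display_id: str     # value that _submission_display_id would return


def resolve_application(rows: Iterable[SubmittedRow], application_id: str) -> Optional[SubmittedRow]:
    """Return the first row whose display id or case_code equals application_id."""
    for row in rows:
        if row.display_id == application_id or row.case_code == application_id:
            return row
    return None


# Example values are made up for illustration.
rows = [
    SubmittedRow(
        initiative_id="11111111-2222-3333-4444-555555555555",
        case_code="CASE-0001",
        display_id="sub-abc123def4567890",
    )
]
by_public_id = resolve_application(rows, "sub-abc123def4567890")
by_case_code = resolve_application(rows, "CASE-0001")
```

Both lookups land on the same row, which is why a backup record should carry all three identifiers side by side.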
---
## MinIO
Configured in Docker via `S3_*` env vars (`docker-compose.yml`):
| Bucket (env) | Purpose |
|----------------|---------|
| **`initiative-attachments`** (`S3_BUCKET_ATTACHMENTS`) | Evidence uploads for Đơn (research / textbook / technical) |
| **`initiative-exports`** (`S3_BUCKET_EXPORTS`) | Optional copy of the **submitted full PDF** after successful submit |
| **`initiative-quarantine`** (`S3_BUCKET_QUARANTINE`) | Reserved for quarantine flows (not detailed here) |
**Object key layout** (`be0/src/minio/storage.py`):
- Evidence and export artifacts use **`build_key_for_initiative`**:
`initiatives/{initiative_uuid_no_hyphens}/{yyyy}/{mm}/{uuid}-{safe_filename}`
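
A minimal sketch of how a key with this layout could be assembled. It is not the body of `build_key_for_initiative`: the helper names `safe_filename` and `build_key_sketch`, the sanitization rule, and the hyphenated form of the per-object UUID are all assumptions.

```python
import re
import uuid
from datetime import datetime, timezone
from typing import Optional


def safe_filename(name: str) -> str:
    """Assumed rule: keep the basename and replace runs outside [A-Za-z0-9._-] with '_'."""
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", base).strip("._")
    return cleaned or "file"


def build_key_sketch(initiative_id: uuid.UUID, filename: str, now: Optional[datetime] = None) -> str:
    """Build initiatives/{uuid_no_hyphens}/{yyyy}/{mm}/{uuid}-{safe_filename}."""
    now = now or datetime.now(timezone.utc)
    object_id = uuid.uuid4()  # random per-object id; hyphenated form is an assumption
    return (
        f"initiatives/{initiative_id.hex}/{now:%Y}/{now:%m}/"
        f"{object_id}-{safe_filename(filename)}"
    )


example_key = build_key_sketch(
    uuid.UUID("11111111-2222-3333-4444-555555555555"),
    "report v2.pdf",
    now=datetime(2026, 6, 30, tzinfo=timezone.utc),
)
```

Because the prefix is derived from `initiatives.id`, every object for one initiative can be listed under a single prefix regardless of which bucket holds it.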
The API uses the **internal endpoint** for the server (`S3_ENDPOINT_URL`, e.g. `http://minio:9000`) and **`S3_PUBLIC_ENDPOINT_URL`** for presigned URLs the browser can open (e.g. `http://localhost:19000`).
**Integrity:** uploads compute SHA-256 and store it in object metadata and/or Postgres (`application_artifacts.sha256`).
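
For reference, a digest of this kind can be computed without loading the whole upload into memory. The helper name `sha256_and_size` and the 1 MiB chunk size are assumptions for this sketch; the actual upload path may compute the hash differently.

```python
import hashlib
import io
from typing import BinaryIO, Tuple


def sha256_and_size(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Tuple[str, int]:
    """Read the stream in chunks and return (hex digest, total bytes read)."""
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        total += len(chunk)
    return digest.hexdigest(), total


# The pair maps naturally onto application_artifacts.sha256 and byte_size.
payload = b"%PDF-1.7 example bytes"
hex_digest, byte_size = sha256_and_size(io.BytesIO(payload), chunk_size=8)
```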
---
## PostgreSQL (initiative database)
Core tables (`be0/migrations/001_initiative_schema.sql`, `002_application_storage_extensions.sql`, plus review-doc extensions):
### `initiatives`
- `id` (UUID), `case_code` (unique text), `owner_id`, `status`, `submitted_at`, etc.
- Submitted applications have `status != 'draft'` (e.g. `submitted`).
### `drafts`
- `payload` JSONB holds the live bundle: tab data, `submissionRecord`, `submissionFile`, etc.
After submit, important keys include:
- `payload.submissionRecord` — metadata including public `id` (`sub-…`)
- `payload.submissionFile` — e.g. `{ "url": "/submitted-initiatives/sub-….pdf", "type": "pdf" }`
### `application_artifacts`
One row per **`(initiative_id, role)`** (`002_application_storage_extensions.sql`). **Planned (Phase 1):** add roles for the **printable application form** binaries (e.g. **`official_form_docx`**, **`official_form_pdf`**) — distinct from **`full_pdf`** (the **client-uploaded** full hồ sơ PDF).
| `role` | Meaning |
|--------|---------|
| `full_pdf` | Submitted package PDF — **`storage_uri`** is either a **MinIO key** (under exports bucket) or a **relative URL** to static files |
| `research_evidence` | Minh chứng 2.1 (nghiên cứu) |
| `textbook_evidence` | Minh chứng 2.2 (giáo trình) |
| `technical_evidence` | Minh chứng kỹ thuật (nhóm 1) |
Columns: `storage_uri`, `original_name`, `mime_type`, `byte_size`, `sha256`, `uploaded_by`, `uploaded_at`, plus review fields for evidence.
### `application_submit_snapshots`
Append-only rows: merged tabs, submit metadata, and **`full_pdf_uri`** (today this records the **URL passed at submit time**, typically `/submitted-initiatives/...`, not necessarily the MinIO key).
Treat this table as **historical audit** of the submit request, not as the driver for backup byte locations: **`application_artifacts`** (and `storage_kind` once added) is the operational source of truth ([`feedback-data-management.md`](feedback-data-management.md) §8).
### `application_review_documents`
Versioned JSON used to regenerate the Word template output:
- `official_bieu_mau`, `template_data`, `full_bundle` (JSONB)
- Tied to **`initiative_id`** and `case_id`
**Today:** the binary filled DOCX is **not** stored in MinIO; this table is the only server-side input to regeneration. **Target (for a trustworthy admin backup):** treat this JSON as **supporting data** (re-render, analytics, diffing). The **canonical bytes** for “what the applicant signed off on” for the printable mẫu should be **immutable objects in MinIO** plus rows in `application_artifacts` (see [Implementation plan — Phase 1](#phase-1-canonical-bytes-for-printable-docx--pdf-before-backup-ships)).
### Other useful tables
- `draft_tab_snapshots` — history of tab JSON (`report` / `application` / `contribution`)
---
## Backend flows
### Evidence upload & download
- **POST** `/api/v1/application-drafts/{case_id}/evidence` — multipart upload; stores object in **`initiative-attachments`**; upserts `application_artifacts` with role `research_evidence` | `textbook_evidence` | `technical_evidence` (`be0/main.py`).
- **GET** `/api/v1/application-drafts/{case_id}/evidence` — returns metadata plus **presigned** `downloadUrl` / `viewUrl` for staff or owner.
`case_id` is normalized to the initiative's **`case_code`** (e.g. `CASE-…`).
### Submit full PDF
- **POST** `/api/applications/submit` — receives PDF + JSON `metadata` (`be0/main.py`).
- Always writes the file to **`SUBMITTED_INITIATIVES_DIR`** (default: repo `assets/submitted-initiatives` or `fe0/public/submitted-initiatives` in dev), served under **`/submitted-initiatives/{sub-….pdf}`**.
- If PostgreSQL is enabled: **`save_submitted_application`** updates `initiatives` / `drafts`, writes **`application_submit_snapshots`**, **`application_taxonomy`**, **`application_workflow`**, and **`upsert_artifact_full_pdf`**.
- **MinIO copy:** `_maybe_upload_submitted_pdf_to_exports_minio` uploads the same bytes to **`initiative-exports`** and, on success, sets **`application_artifacts.full_pdf.storage_uri`** to the **object key** (not the `/submitted-initiatives/...` URL). If MinIO fails, the artifact still points at the **filesystem URL** only — **this is slated to become a hard failure** once canonical storage is enforced ([Phase 2](#phase-2-canonical-storage-for-submitted-full-pdf)).
### Filled DOCX / official PDF (preview; persistence plan)
- **POST** `/api/v1/docx/preview-application-form` — renders `template_application_form.docx` with **docxtpl**; returns bytes (**no DB/MinIO write** today).
- **POST** `/api/v1/docx/preview-application-form-pdf` — same merge, then **LibreOffice** conversion to PDF; returns bytes.
The client builds `officialBieuMau` from draft state; **`persistReviewDocumentBundle`** (**POST** `/api/v1/review-documents`) saves the JSON bundle to **`application_review_documents`**.
**Preview endpoints remain useful** for staff “what-if” and for regenerating with newer templates. **They must not** be the only path that feeds the admin backup ZIP once Phase 1 is done — backups should stream **stored** printable DOCX/PDF bytes unless a legacy row has no stored object (then document explicit fallback or backfill).
### Admin detail: presigned full PDF
For **GET** `/api/applications/{application_id}`, when `full_pdf.storage_uri` looks like a **MinIO key** (not `/submitted-initiatives` or `http`), **`_enrich_application_detail_full_pdf_presign`** adds `files.fullText.viewUrl` (presigned GET on **`initiative-exports`**).
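
The "looks like a MinIO key" test is a string-prefix heuristic. A sketch of such a check follows; the function name `looks_like_minio_key` is hypothetical and the real helper may apply different or additional rules.

```python
def looks_like_minio_key(storage_uri: str) -> bool:
    """Heuristic: anything that is not a static /submitted-initiatives path or an http(s) URL."""
    uri = (storage_uri or "").strip()
    if not uri:
        return False
    if uri.startswith("/submitted-initiatives"):
        return False  # served from the filesystem, no presign needed
    if uri.lower().startswith(("http://", "https://")):
        return False  # already an absolute URL
    return True
```

A check like this is easy to get wrong as new storage locations appear, which is the motivation for the explicit `storage_kind` column planned in Phase 2.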
---
## Frontend
| Concern | Location |
|---------|----------|
| Submit PDF | `fe0/src/components/applicant/submitInitiativePdf.ts` → **POST** `/api/applications/submit` with `FormData` + JWT; metadata includes **`initiativeCaseId`** (must match Postgres `case_code`). |
| Draft load/save | `fe0/src/components/applicant/applicationDrafts.ts` → **GET/POST** `/api/v1/application-drafts/...`. |
| DOCX/PDF from template | `fe0/src/lib/applicationFormDocxApi.ts` → preview endpoints; `ApplicationFormDocxPreview.tsx` orchestrates save + review bundle persistence. |
| Evidence UI | e.g. `ApplicationEvidenceManagePage.tsx` — uses **GET** `/api/v1/application-drafts/{caseId}/evidence` with presigned URLs. |
| Admin list/detail | Uses **GET** `/api/applications` (list) and **GET** `/api/applications/{application_id}` (detail); detail exposes `draft_case_id` for loading drafts/evidence. |
Important: **`sub-…`** is the list id; **draft/evidence APIs use `case_code` (`CASE-…`)**. The API surfaces `draft_case_id` on submission rows to bridge the two.
---
## Applicant honesty checkboxes, complete tabs & PDF minh chứng (engineering guide)
Goal: applicants cannot tick the **cam kết trung thực** checkboxes at the end of **Báo cáo**, **Đơn**, and **Xác nhận đóng góp** until the workflow rules below are satisfied; the UI shows a **Sonner** toast listing missing items. **PDF minh chứng** means the classification-specific evidence file for Đơn (research / textbook / technical), stored in **MinIO** via `POST /api/v1/application-drafts/{case_id}/evidence` (see [Evidence upload & download](#evidence-upload--download)).
### Intended behaviour (product)
| Control | When it may be ticked |
|--------|------------------------|
| **Báo cáo** (`InitiativeReportForm`) | All required fields on the report tab are non-empty (§1–§6 narrative + hiệu quả fields exposed in the UI). |
| **Đơn** (`InitiativeApplicationForm`) | All required Đơn fields are complete **and** the correct **PDF minh chứng** slot is filled for the chosen classification (local `File`, or `FileHandle` with `serverStorageKey` after MinIO upload). Sub-forms (bản cam kết / biểu xác nhận) must match the selected nhóm. |
| **Xác nhận đóng góp** (`ContributionConfirmationForm`) | Same checks as Đơn **and** Báo cáo, **and** the applicant has already ticked honesty on **Báo cáo** and **Đơn**. |
| **Xem lại — Gửi** (`ApplicationFormDocxPreview`) | Same as contribution gate **plus** `contribution.digitalSignatureConfirmed` in the persisted contribution JSON. |
Implementation reference:
- Shared validators + messages: `fe0/src/lib/applicantHonestyPrerequisites.ts` (`collectReportTabHonestyGaps`, `collectApplicationTabHonestyGaps`, `collectContributionDigitalSignaturePrerequisiteGaps`, `collectApplicantSubmitToAdminPrerequisiteGaps`, `formatApplicantPrerequisiteToastDescription`).
- Checkbox handlers toast with `toast.error(..., { description })` and **do not** flip state when prerequisites fail.
Staff / council flows without `DraftProvider` skip the contribution-tab signature gate (no full draft in context); fields stay **`readOnly`** as today.
### Frontend (detailed)
1. **Single source of truth for messages** — Keep gap strings in `applicantHonestyPrerequisites.ts` so DOCX preview and forms stay aligned.
2. **Evidence PDF** — Treat as present if `applicantEvidencePdfPresent(file)` is true: `File` with non-zero size, or `FileHandle` with `serverStorageKey` (MinIO) or positive `size` (IndexedDB). Matches hydration in `DraftContext` after `getApplicationEvidence(caseId)`.
3. **Contribution tab** — Uses `draft.report` and `draft.application` from `DraftContext`; authors/% totals are validated on Đơn; contribution UI mirrors `authors` when connected to Postgres drafts.
4. **Review submit** — Besides tab JSON, enforce contribution signature flag on the object passed into `ApplicationFormDocxPreview` (from `draftTabs.contribution`).
### Backend (recommended)
Today, gates are **client-side** only. For integrity:
- **`POST /api/applications/submit`** — Implemented in `be0/src/initiative_db/submission_readiness.py`, invoked from `save_submitted_application` **before** the initiative is marked submitted. Loads merged `drafts.payload.tabs` (with snapshot fallback), reads **`application_artifacts`** for `research_evidence` / `textbook_evidence` / `technical_evidence` (non-empty `storage_uri`), and validates tab JSON + honesty flags to match the applicant UI. On failure: **400** with `detail: { "message": "…", "missing": ["…", …] }` (see `ApplicationSubmissionNotReadyError` handling in `be0/main.py`). The client maps this in `fe0/src/components/applicant/submitInitiativePdf.ts`. Partial PDF written on disk is removed when Postgres validation fails.
- **`POST /api/v1/application-drafts/{case_id}/evidence`** — Already the canonical upload path; reject non-PDF or oversize files (existing behaviour).
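
The readiness check and its error payload can be sketched as below. This is a simplified illustration, not the contents of `submission_readiness.py`: the function names, the `artifact:` prefix in the missing list, the message text, and passing the required evidence role as a parameter are assumptions. Only the three flag names and the `detail` shape come from this document.

```python
from typing import Any, Dict, List, Mapping

# Flags named in this document: tab name -> boolean key inside that tab's JSON.
REQUIRED_FLAGS = (
    ("report", "honestyConfirmed"),
    ("application", "honestyConfirmed"),
    ("contribution", "digitalSignatureConfirmed"),
)


def collect_missing(
    tabs: Mapping[str, Mapping[str, Any]],
    artifact_uris: Mapping[str, str],
    required_evidence_role: str,
) -> List[str]:
    """List unmet prerequisites; an empty list means the submission may proceed."""
    missing: List[str] = []
    for tab, flag in REQUIRED_FLAGS:
        if tabs.get(tab, {}).get(flag) is not True:
            missing.append(f"{tab}.{flag}")
    # Evidence counts as present only when the artifact row has a non-empty storage_uri.
    if not (artifact_uris.get(required_evidence_role) or "").strip():
        missing.append(f"artifact:{required_evidence_role}")
    return missing


def not_ready_detail(missing: List[str]) -> Dict[str, Any]:
    """Body for the 400 response: detail = {"message": ..., "missing": [...]}."""
    return {"message": "Application is not ready to submit", "missing": list(missing)}
```

Returning the full list rather than the first failure lets the client show every gap in one toast, mirroring the client-side validators.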
### PostgreSQL
- Tab JSON lives under **`drafts.payload`** (and/or tab snapshots). Honesty flags are plain booleans: `report.honestyConfirmed`, `application.honestyConfirmed`, `contribution.digitalSignatureConfirmed`. No migration is required for gating unless you add a **server-side** “submission readiness” snapshot column.
### MinIO
- Required PDF for Đơn is stored under **`initiative-attachments`** with keys from `build_key_for_initiative`; metadata is reflected in **`application_artifacts`** (`research_evidence` | `textbook_evidence` | `technical_evidence`). Frontend readiness should agree with **either** the draft file handle (`serverStorageKey`) **or** a fresh **`GET .../evidence`** bundle (see `collectDocxTemplateCompletenessGaps` in admin review for a related pattern).
---
## Retrieving everything for one submission (interim checklist)
Until Phases 1 and 2 are done, a reader resolving **`applicationId`** (`sub-…`) should:
1. **Postgres:** Resolve `initiatives` + latest `drafts` (today: `get_application_by_id` scan; target: indexed `submission_public_id` — [Phase 4](#phase-4-identifiers--schema-hygiene)).
2. **Submitted full-package PDF (`full_pdf` artifact):** Read `application_artifacts` with `role = 'full_pdf'`. Dispatch on **`storage_kind`** once added; until then, avoid relying only on string-prefix heuristics for production backups.
3. **Evidence:** Roles `research_evidence`, `textbook_evidence`, `technical_evidence` → keys in **`initiative-attachments`**.
4. **Printable mẫu DOCX/PDF:** After Phase 1, stream from MinIO using new artifact roles; until then see **legacy** note in Phase 3.
Optional ZIP extras: latest `application_review_documents` JSON, `draft_tab_snapshots`, read-only copies of `application_submit_snapshots` for audit.
**Related rationale and risks** (regeneration vs backup, polymorphic `storage_uri`, integrity): [`feedback-data-management.md`](feedback-data-management.md).
---
## Implementation plan: admin backup (database + document management)
Goal: **admin downloads one ZIP** containing **all evidence attachments**, the **submitted full-package PDF**, and the **printable application DOCX + PDF** (mẫu), with **verifiable integrity** and **no reliance on regenerating** printable documents at download time (after prerequisites).
Phasing follows the sequencing in [`feedback-data-management.md`](feedback-data-management.md) §“Suggested order of work”, expanded into concrete schema and API work.
### Phase 0 — Decisions & prerequisites
| Item | Action |
|------|--------|
| **Canonical bytes for printable mẫu** | Store immutable DOCX + PDF in MinIO at submit (or immediately pre-submit in the same transaction as finalize), not only JSON. |
| **Evidence versioning** | Decide: append-only evidence history vs “latest only”. For approvals, prefer **versioned or append-only** so backup matches what was reviewed ([`feedback-data-management.md`](feedback-data-management.md) §7). |
| **Quarantine bucket** | Define behavior if objects exist in **`initiative-quarantine`**: include/exclude/fail backup ([`feedback-data-management.md`](feedback-data-management.md) §11). |
| **MinIO operations** | Document versioning, lifecycle, retention, DR (suggested spin-off: `MINIO_OPERATIONS.md` per feedback §9). |
| **Dead schema** | Move or clearly label [`database/schema.sql`](../database/schema.sql) so tooling does not confuse INT `application_id` with `sub-…` ([`feedback-data-management.md`](feedback-data-management.md) §6). |
### Phase 1 — Canonical bytes for printable DOCX + PDF (before backup ships)
**Problem:** Regenerating DOCX/PDF at backup time uses **current** template, docxtpl, LibreOffice, and fonts — not provably what the applicant saw ([`feedback-data-management.md`](feedback-data-management.md) §1).
**Database**
- Extend `application_artifacts.role` CHECK (new migration) with two roles, e.g. **`official_form_docx`** and **`official_form_pdf`** (names TBD; must be distinct from **`full_pdf`**, which is the **client-uploaded full hồ sơ** PDF).
- On successful submit (or single “finalize” step server-side): compute SHA-256 for each file; **`INSERT`/upsert** rows with `storage_uri` = MinIO key, `sha256`, `byte_size`, `mime_type`, `original_name`, **`storage_kind = 'minio_exports'`** (once column exists).
**Application logic**
- Server: build `officialBieuMau` from the same snapshot used for submission (bundle already available in draft + review document path), call existing **`fill_application_form_docx`** → bytes; call **`convert_docx_bytes_to_pdf`** → bytes; upload both to **`initiative-exports`** using `build_key_for_initiative`.
- **Do not** put LibreOffice on the admin **download** path after this; optional background **verify-only** job may re-read objects.
**JSON**
- Keep saving **`application_review_documents`** for re-render/diff; it is **not** the sole legal snapshot of the printable files once binaries exist.
**Gate:** Do **not** release the admin backup endpoint that promises “printable DOCX/PDF” until this phase is done for **new** submits; for **legacy** rows without these artifacts, define policy (backfill job vs manifest flag `missing_official_form: true`).
### Phase 2 — Canonical storage for submitted full-package PDF
**Problem:** `full_pdf` may point at filesystem-only, MinIO-only, or both; best-effort upload risks silent loss ([`feedback-data-management.md`](feedback-data-management.md) §2).
**Database**
- Add **`storage_kind`** on **`application_artifacts`** (enum/text): e.g. `minio_exports`, `minio_attachments`, `filesystem`, `external_url`. Backfill from existing `storage_uri` shape; default new rows explicitly.
- Optionally add **`content_sha256_verified_at`** or rely on manifest at backup time only.
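
A backfill could derive `storage_kind` from the role and the shape of `storage_uri` along these lines. The function name `infer_storage_kind` and the exact rules are assumptions for illustration; the real backfill should be checked against actual rows before it is run.

```python
EVIDENCE_ROLES = frozenset({"research_evidence", "textbook_evidence", "technical_evidence"})


def infer_storage_kind(role: str, storage_uri: str) -> str:
    """Map an existing artifact row to one of the proposed storage_kind values."""
    uri = (storage_uri or "").strip()
    if not uri:
        raise ValueError(f"artifact with role {role!r} has an empty storage_uri")
    if uri.lower().startswith(("http://", "https://")):
        return "external_url"
    if uri.startswith("/"):
        # e.g. /submitted-initiatives/... served from SUBMITTED_INITIATIVES_DIR
        return "filesystem"
    if role in EVIDENCE_ROLES:
        return "minio_attachments"
    # Remaining object keys (full_pdf today) are assumed to live in the exports bucket.
    return "minio_exports"
```

Raising on an empty `storage_uri` keeps the backfill from silently assigning a kind to rows that have no bytes behind them.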
**Application logic**
- Make **MinIO upload of `full_pdf` synchronous and required** when persistence is enabled: if upload fails, **fail submit** with retryable error.
- Treat filesystem write as **cache** for dev/static serving if desired, not sole store.
- **Backfill job:** filesystem-only historical PDFs → **`initiative-exports`**, then update artifact row + `storage_kind`.
**Infrastructure**
- Ensure **`SUBMITTED_INITIATIVES_DIR`** is on a **persistent volume** in every environment, or stop relying on it for production.
### Phase 3 — Admin backup endpoint + ZIP contract
**Authorization:** admin-only; **audit** every request: actor, `applicationId`, timestamp, outcome, bytes streamed ([`feedback-data-management.md`](feedback-data-management.md) §10).
**Resolution:** load initiative by **`submission_public_id`** or **`case_code`** (indexed) after Phase 4; until then use existing lookup with awareness of scan cost for **bulk** exports.
**Integrity**
- While streaming each file into the ZIP, **compute SHA-256** and **compare** to `application_artifacts.sha256`. On mismatch: **fail entire export**, log at high severity ([`feedback-data-management.md`](feedback-data-management.md) §4).
- Optional **`POST /admin/…/backup/verify`** (verify-only, no ZIP) for periodic audits.
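
The hash-while-packing step can be sketched with the standard library. This example uses `zipfile` writing to a file-like object rather than a dedicated streaming library, and the names `pack_verified` and `IntegrityError` are hypothetical; it shows only the compare-and-fail logic, not the HTTP response handling.

```python
import hashlib
import io
import zipfile
from typing import BinaryIO, Dict, Iterable, Tuple


class IntegrityError(Exception):
    """Raised when computed bytes do not match the stored sha256."""


def pack_verified(
    out: BinaryIO,
    entries: Iterable[Tuple[str, BinaryIO, str]],  # (arcname, source stream, stored sha256)
    chunk_size: int = 1024 * 1024,
) -> Dict[str, str]:
    """Copy each source into the ZIP in chunks, hashing as it goes; fail on any mismatch."""
    verified: Dict[str, str] = {}
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for arcname, source, stored_sha256 in entries:
            digest = hashlib.sha256()
            with zf.open(arcname, "w") as dest:
                for chunk in iter(lambda: source.read(chunk_size), b""):
                    digest.update(chunk)
                    dest.write(chunk)
            actual = digest.hexdigest()
            if actual != stored_sha256.lower():
                # The caller must discard whatever was already written to `out`.
                raise IntegrityError(f"{arcname}: stored {stored_sha256}, computed {actual}")
            verified[arcname] = actual
    return verified


data = b"example evidence bytes"
stored = hashlib.sha256(data).hexdigest()
archive = io.BytesIO()
verified = pack_verified(archive, [("evidence/research/a.pdf", io.BytesIO(data), stored)])
```

The returned mapping is what would populate the verified `sha256` field in `manifest.json`. Since a mismatch is only detected after bytes have been written, a synchronous HTTP response cannot be retracted at that point; the async job pattern avoids handing out a partial archive.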
**ZIP layout** (suggested; ASCII-safe entry names, original names in manifest):
```text
manifest.json
submitted/full-package.pdf
submitted/official-form.docx
submitted/official-form.pdf
evidence/research/{safe-name-or-id}
evidence/textbook/…
evidence/technical/…
metadata/application_review_documents.json # optional
```
**`manifest.json`** (minimum fields): `applicationId`, `case_code`, `initiative_id`, submitted timestamps, owner id, **list of files** with `role`, `original_name`, `mime_type`, `byte_size`, **stored** `sha256`, **verified** `sha256` (computed during ZIP build), `storage_kind`.
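A sketch of how the manifest could be assembled from those fields. The JSON key names `submitted_at`, `owner_id`, `files`, `stored_sha256`, and `verified_sha256`, and the function `build_manifest`, are assumptions; only the list of required information comes from the paragraph above.

```python
import json
from typing import Any, Dict, List


def build_manifest(
    application_id: str,
    case_code: str,
    initiative_id: str,
    submitted_at: str,
    owner_id: str,
    files: List[Dict[str, Any]],
) -> str:
    """Serialize the minimum manifest fields as UTF-8 friendly, stable JSON."""
    manifest = {
        "applicationId": application_id,
        "case_code": case_code,
        "initiative_id": initiative_id,
        "submitted_at": submitted_at,
        "owner_id": owner_id,
        "files": files,
    }
    # ensure_ascii=False keeps original (possibly Vietnamese) file names readable.
    return json.dumps(manifest, ensure_ascii=False, indent=2, sort_keys=True)


# Example values are made up for illustration.
example_manifest = build_manifest(
    application_id="sub-abc123def4567890",
    case_code="CASE-0001",
    initiative_id="11111111-2222-3333-4444-555555555555",
    submitted_at="2026-06-30T09:38:30+07:00",
    owner_id="user-1",
    files=[
        {
            "role": "full_pdf",
            "original_name": "full-package.pdf",
            "mime_type": "application/pdf",
            "byte_size": 12345,
            "stored_sha256": "0" * 64,
            "verified_sha256": "0" * 64,
            "storage_kind": "minio_exports",
        }
    ],
)
```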
**Transport**
- **Stream ZIP** with a streaming library (e.g. `zipstream-ng`); **do not** buffer whole archives in memory.
- Single-initiative: synchronous response acceptable.
- **Bulk** (date range, many rows): **async job** → write ZIP to **`initiative-exports`** or **`initiative-backups`** → presigned URL when ready (avoids proxy timeouts).
**Sources for each ZIP entry**
| Content | Source |
|--------|--------|
| Full hồ sơ PDF | `application_artifacts.full_pdf` → MinIO **`initiative-exports`** (after Phase 2) |
| Printable DOCX / PDF | `official_form_docx` / `official_form_pdf` → **`initiative-exports`** |
| Evidence | `research_*`, `textbook_*`, `technical_*` → **`initiative-attachments`** |
| Structured snapshot | Optional: latest `application_review_documents` JSON |
**Legacy:** If `official_form_*` missing, either skip with manifest flags or run **one-time backfill** using frozen template policy — **document** that backfilled bytes are “as-of backfill date” not original submit date.
### Phase 4 — Identifiers & schema hygiene
- Add **`submission_public_id`** (unique, indexed) on **`initiatives`**, set once at submit; replace linear scan in `get_application_by_id` with indexed lookup ([`feedback-data-management.md`](feedback-data-management.md) §5).
- Document resolution: **`sub-…`** vs **`CASE-…`** explicitly (remove “sometimes” from ops docs).
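
The indexed lookup can be illustrated with an in-memory SQLite database standing in for PostgreSQL. The table here is a cut-down stand-in for `initiatives`, the function name `find_initiative_id` is hypothetical, and the real migration would be written in PostgreSQL DDL under `be0/migrations/`.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
# UNIQUE constraints create the indexes that replace the linear scan.
conn.execute(
    """
    CREATE TABLE initiatives (
        id TEXT PRIMARY KEY,
        case_code TEXT NOT NULL UNIQUE,
        submission_public_id TEXT UNIQUE  -- NULL until the initiative is submitted
    )
    """
)
conn.execute(
    "INSERT INTO initiatives (id, case_code, submission_public_id) VALUES (?, ?, ?)",
    ("11111111-2222-3333-4444-555555555555", "CASE-0001", None),
)
# Set once at submit: the IS NULL guard makes a second assignment a no-op.
conn.execute(
    "UPDATE initiatives SET submission_public_id = ? "
    "WHERE case_code = ? AND submission_public_id IS NULL",
    ("sub-abc123def4567890", "CASE-0001"),
)


def find_initiative_id(db: sqlite3.Connection, application_id: str) -> Optional[str]:
    """Resolve either a sub-... public id or a case_code with one indexed query."""
    row = db.execute(
        "SELECT id FROM initiatives WHERE submission_public_id = ? OR case_code = ? LIMIT 1",
        (application_id, application_id),
    ).fetchone()
    return row[0] if row else None
```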
### Phase 5 — Hardening (ongoing)
- MinIO **versioning** / **object lock** if compliance requires; off-cluster backup of MinIO; periodic verify-only sweeps ([`feedback-data-management.md`](feedback-data-management.md) §9, §10, quarter roadmap).
---
### Frontend (admin)
- New **“Tải bản sao lưu”** ("Download backup", or similar) on application detail: call backup endpoint, handle long downloads (progress if async + poll).
- For async pattern: show job id, link when presigned URL ready.
- Ensure **admin audit** expectations match backend logging.
---
### Summary
| Layer | Current summary | After plan |
|--------|-----------------|------------|
| **Postgres** | Artifacts + polymorphic `storage_uri` | Explicit `storage_kind`, optional `submission_public_id`, new artifact roles for official DOCX/PDF |
| **MinIO** | Evidence + best-effort full PDF | Required `full_pdf` + official form binaries on **`initiative-exports`**; evidence on **`initiative-attachments`** |
| **Admin backup** | Would require regeneration / fragile dispatch | Streaming ZIP + manifest + verified SHA + audit; optional async for bulk |
This aligns the **database and document management system** with a backup that **admins can trust**: **stored bytes**, **verified at pack time**, and **operationally grounded** in explicit storage metadata.