sciagent/docs/application-files-persistence-and-backup.md
Thinh Lam 688fac73e9
sciagent code + Gitea Actions CI/CD
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:38:30 +07:00


Application files: persistence, retrieval by applicationId, and backup notes

This document describes how the running initiative stack stores and loads:

  • Evidence attachments (minh chứng 2.1 / 2.2 / kỹ thuật, i.e. research, textbook, and technical evidence)
  • The submitted full-package PDF (đơn + báo cáo, i.e. application form + report, from the « Xem lại » (review) flow)
  • The filled DOCX / official PDF derived from the Word template

It focuses on what PostgreSQL and MinIO hold. The root file database/schema.sql describes a separate integer applications domain (attachments table with application_id INT); that schema is not wired into be0 today. Production behavior is driven by be0/migrations/*.sql and INITIATIVE_DATABASE_URL.

Implementation planning: The phased backup and storage-hardening plan below is refined against the review in feedback-data-management.md (canonical bytes, storage_kind, SHA verification on pack, streaming ZIP + manifest, indexed IDs, evidence versioning, and sequencing).


Identifiers: what “applicationId” means

The UI and APIs expose a public submission id shaped like sub-{16 hex chars} (see save_submitted_application in be0/src/initiative_db/submissions.py). Internally, persistence is keyed by:

| Concept | Example | Where |
|---|---|---|
| Public applicationId (list/detail) | sub-abc123def4567890 | drafts.payload.submissionRecord.id, API responses |
| Draft / case code | CASE-… or SUB-… | initiatives.case_code, draft_case_id on API rows |
| Initiative primary key | UUID | initiatives.id, MinIO key prefix, application_artifacts.initiative_id |

Resolving a row: get_application_by_id (be0/src/initiative_db/submissions.py) scans submitted initiatives and matches when either:

  • _submission_display_id(initiative, submissionRecord) == applicationId, or
  • initiative.case_code == applicationId.

Admins can therefore deep-link with sub-…, or with CASE-… when that value equals the row's initiatives.case_code. For backups, always persist initiatives.id, case_code, and sub-… together.
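
The two match rules above can be sketched as follows. This is a minimal illustration, not the code in submissions.py: the SubmittedInitiative dataclass and resolve_application are hypothetical names, and the display id is assumed to have been computed already (the real code derives it via _submission_display_id).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SubmittedInitiative:
    """Hypothetical, simplified view of one submitted initiative row."""

    id: str          # initiatives.id (UUID as text)
    case_code: str   # initiatives.case_code
    display_id: str  # public sub-... id taken from submissionRecord


def resolve_application(
    rows: Iterable[SubmittedInitiative], application_id: str
) -> Optional[SubmittedInitiative]:
    """Linear scan: match on the public id first, then on the case code."""
    for row in rows:
        if row.display_id == application_id or row.case_code == application_id:
            return row
    return None
```

Because either identifier can resolve the same row, a backup record that stores only one of them may be hard to trace later, which is why all three ids are kept together.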


MinIO

Configured in Docker via S3_* env vars (docker-compose.yml):

| Bucket (env) | Purpose |
|---|---|
| initiative-attachments (S3_BUCKET_ATTACHMENTS) | Evidence uploads for Đơn (research / textbook / technical) |
| initiative-exports (S3_BUCKET_EXPORTS) | Optional copy of the submitted full PDF after successful submit |
| initiative-quarantine (S3_BUCKET_QUARANTINE) | Reserved for quarantine flows (not detailed here) |

Object key layout (be0/src/minio/storage.py):

  • Evidence and export artifacts use build_key_for_initiative:
    initiatives/{initiative_uuid_no_hyphens}/{yyyy}/{mm}/{uuid}-{safe_filename}
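
As an illustration of the key layout, the sketch below builds a key of the same shape. It is not the body of build_key_for_initiative: the function names, the filename sanitizing rule, and the hyphenated form of the per-object UUID are all assumptions made for the example.

```python
import re
import uuid
from datetime import datetime, timezone
from typing import Optional


def safe_filename(name: str) -> str:
    """Hypothetical sanitizer: keep ASCII letters, digits, dot, dash, underscore."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("._")
    return cleaned or "file"


def sketch_key_for_initiative(
    initiative_id: uuid.UUID,
    filename: str,
    now: Optional[datetime] = None,
    object_id: Optional[uuid.UUID] = None,
) -> str:
    """Build initiatives/{uuid_no_hyphens}/{yyyy}/{mm}/{uuid}-{safe_filename}."""
    now = now or datetime.now(timezone.utc)
    object_id = object_id or uuid.uuid4()  # assumed: a fresh UUID per object
    return (
        f"initiatives/{initiative_id.hex}/{now:%Y}/{now:%m}/"
        f"{object_id}-{safe_filename(filename)}"
    )
```

The per-object UUID keeps two uploads with the same filename from overwriting each other, and the year/month segments keep listings under one initiative prefix small.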

The API uses the internal endpoint for the server (S3_ENDPOINT_URL, e.g. http://minio:9000) and S3_PUBLIC_ENDPOINT_URL for presigned URLs the browser can open (e.g. http://localhost:19000).

Integrity: uploads compute SHA-256 and store it in object metadata and/or Postgres (application_artifacts.sha256).
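
A minimal sketch of computing the digest and size for an upload without loading the whole file into memory. The helper name sha256_of_stream and the chunk size are assumptions; the actual upload code may compute the hash differently.

```python
import hashlib
from typing import BinaryIO, Tuple


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Tuple[str, int]:
    """Return (hex digest, byte count) for a readable binary stream."""
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        total += len(chunk)
    return digest.hexdigest(), total
```

The returned pair maps onto application_artifacts.sha256 and byte_size, so both values come from the same pass over the bytes.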


PostgreSQL (initiative database)

Core tables (be0/migrations/001_initiative_schema.sql, 002_application_storage_extensions.sql, plus review-doc extensions):

initiatives

  • id (UUID), case_code (unique text), owner_id, status, submitted_at, etc.
  • Submitted applications have status != 'draft' (e.g. submitted).

drafts

  • payload JSONB holds the live bundle: tab data, submissionRecord, submissionFile, etc.

After submit, important keys include:

  • payload.submissionRecord — metadata including public id (sub-…)
  • payload.submissionFile — e.g. { "url": "/submitted-initiatives/sub-….pdf", "type": "pdf" }

application_artifacts

One row per (initiative_id, role) (002_application_storage_extensions.sql). Planned (Phase 1): add roles for the printable application form binaries (e.g. official_form_docx, official_form_pdf) — distinct from full_pdf (the client-uploaded full hồ sơ PDF).

| role | Meaning |
|---|---|
| full_pdf | Submitted package PDF; storage_uri is either a MinIO key (under the exports bucket) or a relative URL to static files |
| research_evidence | Evidence 2.1, research (minh chứng 2.1, nghiên cứu) |
| textbook_evidence | Evidence 2.2, textbook (minh chứng 2.2, giáo trình) |
| technical_evidence | Technical evidence, group 1 (minh chứng kỹ thuật, nhóm 1) |

Columns: storage_uri, original_name, mime_type, byte_size, sha256, uploaded_by, uploaded_at, plus review fields for evidence.

application_submit_snapshots

Append-only rows: merged tabs, submit metadata, and full_pdf_uri (today this records the URL passed at submit time, typically /submitted-initiatives/..., not necessarily the MinIO key).

Treat this table as historical audit of the submit request, not as the driver for backup byte locations: application_artifacts (and storage_kind once added) is the operational source of truth (feedback-data-management.md §8).

application_review_documents

Versioned JSON used to regenerate the Word template output:

  • official_bieu_mau, template_data, full_bundle (JSONB)
  • Tied to initiative_id and case_id

Today: the binary filled DOCX is not stored in MinIO; this table is the only server-side input to regeneration. Target (for a trustworthy admin backup): treat this JSON as supporting data (re-render, analytics, diffing). The canonical bytes for “what the applicant signed off on” for the printable mẫu should be immutable objects in MinIO plus rows in application_artifacts (see Implementation plan — Phase 1).

Other useful tables

  • draft_tab_snapshots — history of tab JSON (report / application / contribution)

Backend flows

Evidence upload & download

  • POST /api/v1/application-drafts/{case_id}/evidence — multipart upload; stores object in initiative-attachments; upserts application_artifacts with role research_evidence | textbook_evidence | technical_evidence (be0/main.py).
  • GET /api/v1/application-drafts/{case_id}/evidence — returns metadata plus presigned downloadUrl / viewUrl for staff or owner.

case_id is normalized to the initiatives case_code (e.g. CASE-…).

Submit full PDF

  • POST /api/applications/submit — receives PDF + JSON metadata (be0/main.py).
  • Always writes the file to SUBMITTED_INITIATIVES_DIR (default: repo assets/submitted-initiatives or fe0/public/submitted-initiatives in dev), served under /submitted-initiatives/{sub-….pdf}.
  • If PostgreSQL is enabled: save_submitted_application updates initiatives / drafts, writes application_submit_snapshots, application_taxonomy, application_workflow, and upsert_artifact_full_pdf.
  • MinIO copy: _maybe_upload_submitted_pdf_to_exports_minio uploads the same bytes to initiative-exports and, on success, sets application_artifacts.full_pdf.storage_uri to the object key (not the /submitted-initiatives/... URL). If MinIO fails, the artifact still points at the filesystem URL only — this is slated to become a hard failure once canonical storage is enforced (Phase 2).

Filled DOCX / official PDF (preview; persistence plan)

  • POST /api/v1/docx/preview-application-form — renders template_application_form.docx with docxtpl; returns bytes (no DB/MinIO write today).
  • POST /api/v1/docx/preview-application-form-pdf — same merge, then LibreOffice conversion to PDF; returns bytes.

The client builds officialBieuMau from draft state; persistReviewDocumentBundle (POST /api/v1/review-documents) saves the JSON bundle to application_review_documents.

Preview endpoints remain useful for staff “what-if” and for regenerating with newer templates. They must not be the only path that feeds the admin backup ZIP once Phase 1 is done — backups should stream stored printable DOCX/PDF bytes unless a legacy row has no stored object (then document explicit fallback or backfill).

Admin detail: presigned full PDF

For GET /api/applications/{application_id}, when full_pdf.storage_uri looks like a MinIO key (not /submitted-initiatives or http), _enrich_application_detail_full_pdf_presign adds files.fullText.viewUrl (presigned GET on initiative-exports).


Frontend

Concern Location
| Concern | Location |
|---|---|
| Submit PDF | fe0/src/components/applicant/submitInitiativePdf.ts → POST /api/applications/submit with FormData + JWT; metadata includes initiativeCaseId (must match Postgres case_code). |
| Draft load/save | fe0/src/components/applicant/applicationDrafts.ts → GET/POST /api/v1/application-drafts/.... |
| DOCX/PDF from template | fe0/src/lib/applicationFormDocxApi.ts → preview endpoints; ApplicationFormDocxPreview.tsx orchestrates save + review bundle persistence. |
| Evidence UI | e.g. ApplicationEvidenceManagePage.tsx; uses GET /api/v1/application-drafts/{caseId}/evidence with presigned URLs. |
| Admin list/detail | GET /api/applications (list) and GET /api/applications/{application_id} (detail, keyed by applicationId); detail exposes draft_case_id for loading drafts/evidence. |

Important: sub-… is the list id; draft/evidence APIs use case_code (CASE-…). The API surfaces draft_case_id on submission rows to bridge the two.


Applicant honesty checkboxes, complete tabs & PDF minh chứng (engineering guide)

Goal: applicants cannot tick the honesty-commitment (cam kết trung thực) checkboxes at the end of the Báo cáo (report), Đơn (application), and Xác nhận đóng góp (contribution confirmation) tabs until the workflow rules below are satisfied; when a rule fails, the UI shows a Sonner toast listing the missing items. PDF minh chứng (evidence PDF) means the classification-specific evidence file for Đơn (research / textbook / technical), stored in MinIO via POST /api/v1/application-drafts/{case_id}/evidence (see Evidence upload & download).

Intended behaviour (product)

| Control | When it may be ticked |
|---|---|
| Báo cáo (InitiativeReportForm) | All required fields on the report tab are non-empty (§1–§6 narrative + hiệu quả (effectiveness) fields exposed in the UI). |
| Đơn (InitiativeApplicationForm) | All required Đơn fields are complete and the correct PDF minh chứng slot is filled for the chosen classification (local File, or FileHandle with serverStorageKey after MinIO upload). Sub-forms (bản cam kết / biểu xác nhận, i.e. commitment form / confirmation form) must match the selected nhóm (group). |
| Xác nhận đóng góp (ContributionConfirmationForm) | Same checks as Đơn and Báo cáo, and the applicant has already ticked honesty on Báo cáo and Đơn. |
| Xem lại: Gửi (ApplicationFormDocxPreview) | Same as the contribution gate, plus contribution.digitalSignatureConfirmed in the persisted contribution JSON. |

Implementation reference:

  • Shared validators + messages: fe0/src/lib/applicantHonestyPrerequisites.ts (collectReportTabHonestyGaps, collectApplicationTabHonestyGaps, collectContributionDigitalSignaturePrerequisiteGaps, collectApplicantSubmitToAdminPrerequisiteGaps, formatApplicantPrerequisiteToastDescription).
  • Checkbox handlers toast with toast.error(..., { description }) and do not flip state when prerequisites fail.

Staff / council flows without DraftProvider skip the contribution-tab signature gate (no full draft in context); fields stay readOnly as today.

Frontend (detailed)

  1. Single source of truth for messages — Keep gap strings in applicantHonestyPrerequisites.ts so DOCX preview and forms stay aligned.
  2. Evidence PDF — Treat as present if applicantEvidencePdfPresent(file) is true: File with non-zero size, or FileHandle with serverStorageKey (MinIO) or positive size (IndexedDB). Matches hydration in DraftContext after getApplicationEvidence(caseId).
  3. Contribution tab — Uses draft.report and draft.application from DraftContext; authors/% totals are validated on Đơn; contribution UI mirrors authors when connected to Postgres drafts.
  4. Review submit — Besides tab JSON, enforce contribution signature flag on the object passed into ApplicationFormDocxPreview (from draftTabs.contribution).

The checkbox gates above run client-side. For integrity, the server re-checks them:

  • POST /api/applications/submit — Implemented in be0/src/initiative_db/submission_readiness.py, invoked from save_submitted_application before the initiative is marked submitted. Loads merged drafts.payload.tabs (with snapshot fallback), reads application_artifacts for research_evidence / textbook_evidence / technical_evidence (non-empty storage_uri), and validates tab JSON + honesty flags to match the applicant UI. On failure: 400 with detail: { "message": "…", "missing": ["…", …] } (see ApplicationSubmissionNotReadyError handling in be0/main.py). The client maps this in fe0/src/components/applicant/submitInitiativePdf.ts. Partial PDF written on disk is removed when Postgres validation fails.
  • POST /api/v1/application-drafts/{case_id}/evidence — Already the canonical upload path; reject non-PDF or oversize files (existing behaviour).

PostgreSQL

  • Tab JSON lives under drafts.payload (and/or tab snapshots). Honesty flags are plain booleans: report.honestyConfirmed, application.honestyConfirmed, contribution.digitalSignatureConfirmed. No migration is required for gating unless you add a server-side “submission readiness” snapshot column.
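
The server-side readiness check can be pictured roughly as below. This is a hedged sketch, not the contents of submission_readiness.py: collect_missing_items, not_ready_detail, the message text, and the required_role parameter (standing in for however the classification selects an evidence role) are hypothetical. Only the three flag names, the evidence roles, and the { message, missing } response shape come from this document.

```python
from typing import Dict, List

EVIDENCE_ROLES = ("research_evidence", "textbook_evidence", "technical_evidence")

# (tab, flag) pairs as stored under drafts.payload.tabs
HONESTY_FLAGS = (
    ("report", "honestyConfirmed"),
    ("application", "honestyConfirmed"),
    ("contribution", "digitalSignatureConfirmed"),
)


def collect_missing_items(
    tabs: Dict[str, dict], artifact_uris: Dict[str, str], required_role: str
) -> List[str]:
    """List what still blocks submission; an empty list means ready."""
    missing: List[str] = []
    for tab, flag in HONESTY_FLAGS:
        if (tabs.get(tab) or {}).get(flag) is not True:
            missing.append(f"{tab}.{flag}")
    if required_role not in EVIDENCE_ROLES:
        missing.append(f"unknown evidence role: {required_role}")
    elif not (artifact_uris.get(required_role) or "").strip():
        # The artifact row must carry a non-empty storage_uri.
        missing.append(f"evidence: {required_role}")
    return missing


def not_ready_detail(missing: List[str]) -> dict:
    """Body for the 400 response; the message wording is illustrative."""
    return {"message": "Application is not ready to submit", "missing": missing}
```

Checking `is not True` rather than truthiness means a stray string such as "false" in the JSON does not count as a confirmed flag.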

MinIO

  • Required PDF for Đơn is stored under initiative-attachments with keys from build_key_for_initiative; metadata is reflected in application_artifacts (research_evidence | textbook_evidence | technical_evidence). Frontend readiness should agree with either the draft file handle (serverStorageKey) or a fresh GET .../evidence bundle (see collectDocxTemplateCompletenessGaps in admin review for a related pattern).

Retrieving everything for one submission (interim checklist)

Until Phases 1 and 2 are done, a reader resolving applicationId (sub-…) should:

  1. Postgres: Resolve initiatives + latest drafts (today: get_application_by_id scan; target: indexed submission_public_id, Phase 4).
  2. Submitted full-package PDF (full_pdf artifact): Read application_artifacts with role = 'full_pdf'. Dispatch on storage_kind once added; until then, avoid relying only on string-prefix heuristics for production backups.
  3. Evidence: Roles research_evidence, textbook_evidence, technical_evidence → keys in initiative-attachments.
  4. Printable mẫu DOCX/PDF: After Phase 1, stream from MinIO using new artifact roles; until then see legacy note in Phase 3.

Optional ZIP extras: latest application_review_documents JSON, draft_tab_snapshots, read-only copies of application_submit_snapshots for audit.

Related rationale and risks (regeneration vs backup, polymorphic storage_uri, integrity): feedback-data-management.md.


Implementation plan: admin backup (database + document management)

Goal: admin downloads one ZIP containing all evidence attachments, the submitted full-package PDF, and the printable application DOCX + PDF (mẫu), with verifiable integrity and no reliance on regenerating printable documents at download time (after prerequisites).

Phasing follows the sequencing in feedback-data-management.md §“Suggested order of work”, expanded into concrete schema and API work.

Phase 0 — Decisions & prerequisites

| Item | Action |
|---|---|
| Canonical bytes for printable mẫu | Store immutable DOCX + PDF in MinIO at submit (or immediately pre-submit in the same transaction as finalize), not only JSON. |
| Evidence versioning | Decide: append-only evidence history vs “latest only”. For approvals, prefer versioned or append-only so backup matches what was reviewed (feedback-data-management.md §7). |
| Quarantine bucket | Define behavior if objects exist in initiative-quarantine: include/exclude/fail backup (feedback-data-management.md §11). |
| MinIO operations | Document versioning, lifecycle, retention, DR (suggested spin-off: MINIO_OPERATIONS.md per feedback §9). |
| Dead schema | Move or clearly label database/schema.sql so tooling does not confuse INT application_id with sub-… (feedback-data-management.md §6). |

Phase 1 — Canonical bytes for printable DOCX + PDF (before backup ships)

Problem: Regenerating DOCX/PDF at backup time uses current template, docxtpl, LibreOffice, and fonts — not provably what the applicant saw (feedback-data-management.md §1).

Database

  • Extend application_artifacts.role CHECK (new migration) with two roles, e.g. official_form_docx and official_form_pdf (names TBD; must be distinct from full_pdf, which is the client-uploaded full hồ sơ PDF).
  • On successful submit (or single “finalize” step server-side): compute SHA-256 for each file; INSERT/upsert rows with storage_uri = MinIO key, sha256, byte_size, mime_type, original_name, storage_kind = 'minio_exports' (once column exists).

Application logic

  • Server: build officialBieuMau from the same snapshot used for submission (bundle already available in draft + review document path), call existing fill_application_form_docx → bytes; call convert_docx_bytes_to_pdf → bytes; upload both to initiative-exports using build_key_for_initiative.
  • Do not put LibreOffice on the admin download path after this; optional background verify-only job may re-read objects.

JSON

  • Keep saving application_review_documents for re-render/diff; it is not the sole legal snapshot of the printable files once binaries exist.

Gate: Do not release the admin backup endpoint that promises “printable DOCX/PDF” until this phase is done for new submits; for legacy rows without these artifacts, define policy (backfill job vs manifest flag missing_official_form: true).

Phase 2 — Canonical storage for submitted full-package PDF

Problem: full_pdf may point at filesystem-only, MinIO-only, or both; best-effort upload risks silent loss (feedback-data-management.md §2).

Database

  • Add storage_kind on application_artifacts (enum/text): e.g. minio_exports, minio_attachments, filesystem, external_url. Backfill from existing storage_uri shape; default new rows explicitly.
  • Optionally add content_sha256_verified_at or rely on manifest at backup time only.
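
For the backfill, the storage_uri shape can be classified along these lines. This is a sketch under stated assumptions: infer_storage_kind is a hypothetical helper, the bucket is guessed from the artifact role (evidence roles to attachments, everything else to exports), and any value that fits no known shape is rejected so a person can review it instead of being silently mislabelled.

```python
EVIDENCE_ROLES = frozenset(
    {"research_evidence", "textbook_evidence", "technical_evidence"}
)


def infer_storage_kind(role: str, storage_uri: str) -> str:
    """Map an existing (role, storage_uri) pair to a storage_kind value."""
    uri = (storage_uri or "").strip()
    if not uri:
        raise ValueError("empty storage_uri")
    if uri.lower().startswith(("http://", "https://")):
        return "external_url"
    if uri.startswith("/submitted-initiatives"):
        return "filesystem"
    if uri.startswith("/"):
        # Unknown absolute path: do not guess during backfill.
        raise ValueError(f"unrecognized storage_uri shape: {uri!r}")
    # Anything else is treated as a MinIO object key; bucket assumed from role.
    if role in EVIDENCE_ROLES:
        return "minio_attachments"
    return "minio_exports"
```

Running this once in a migration and then writing storage_kind explicitly on every new row removes the need for prefix heuristics at read time.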

Application logic

  • Make MinIO upload of full_pdf synchronous and required when persistence is enabled: if upload fails, fail submit with retryable error.
  • Treat filesystem write as cache for dev/static serving if desired, not sole store.
  • Backfill job: filesystem-only historical PDFs → initiative-exports, then update artifact row + storage_kind.

Infrastructure

  • Ensure SUBMITTED_INITIATIVES_DIR is on a persistent volume in every environment, or stop relying on it for production.

Phase 3 — Admin backup endpoint + ZIP contract

Authorization: admin-only; audit every request: actor, applicationId, timestamp, outcome, bytes streamed (feedback-data-management.md §10).

Resolution: load initiative by submission_public_id or case_code (indexed) after Phase 4; until then use existing lookup with awareness of scan cost for bulk exports.

Integrity

  • While streaming each file into the ZIP, compute SHA-256 and compare to application_artifacts.sha256. On mismatch: fail entire export, log at high severity (feedback-data-management.md §4).
  • Optional POST /admin/…/backup/verify (verify-only, no ZIP) for periodic audits.
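
A minimal sketch of verify-while-streaming: chunks pass through unchanged while a running SHA-256 is kept, and a mismatch raises once the last chunk has been read. The names verified_chunks and ChecksumMismatch are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


class ChecksumMismatch(Exception):
    """Raised when streamed bytes do not match the stored digest."""


def verified_chunks(chunks: Iterable[bytes], expected_sha256: str) -> Iterator[bytes]:
    """Yield chunks as-is, then compare the digest with the stored value."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
        yield chunk
    actual = digest.hexdigest()
    if actual != expected_sha256.strip().lower():
        raise ChecksumMismatch(f"expected {expected_sha256}, got {actual}")
```

Note the trade-off: the mismatch is only known after the file's bytes have already been handed to the ZIP writer, so the caller has to abort the response (or discard the async job output) rather than finish the archive. The verify-only endpoint avoids this by never emitting bytes.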

ZIP layout (suggested; ASCII-safe entry names, original names in manifest):

manifest.json
submitted/full-package.pdf
submitted/official-form.docx
submitted/official-form.pdf
evidence/research/{safe-name-or-id}
evidence/textbook/…
evidence/technical/…
metadata/application_review_documents.json   # optional

manifest.json (minimum fields): applicationId, case_code, initiative_id, submitted timestamps, owner id, list of files with role, original_name, mime_type, byte_size, stored sha256, verified sha256 (computed during ZIP build), storage_kind.
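
One possible shape for manifest.json, built from the minimum fields listed above. The key names (for example sha256_stored, sha256_verified, path) and the build_manifest helper are assumptions for illustration; the final contract may name them differently.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ManifestFile:
    path: str             # entry name inside the ZIP (ASCII-safe), assumed field
    role: str
    original_name: str    # may contain non-ASCII characters
    mime_type: str
    byte_size: int
    sha256_stored: str    # from application_artifacts.sha256
    sha256_verified: str  # computed while building the ZIP
    storage_kind: str


def build_manifest(
    application_id: str,
    case_code: str,
    initiative_id: str,
    submitted_at: str,
    owner_id: str,
    files: List[ManifestFile],
) -> str:
    """Serialize the manifest as UTF-8 friendly, stable JSON."""
    doc = {
        "applicationId": application_id,
        "case_code": case_code,
        "initiative_id": initiative_id,
        "submitted_at": submitted_at,
        "owner_id": owner_id,
        "files": [asdict(f) for f in files],
    }
    return json.dumps(doc, ensure_ascii=False, indent=2, sort_keys=True)
```

Keeping original_name only in the manifest lets entry names stay ASCII-safe while the Vietnamese filenames remain recoverable.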

Transport

  • Stream ZIP with a streaming library (e.g. zipstream-ng); do not buffer whole archives in memory.
  • Single-initiative: synchronous response acceptable.
  • Bulk (date range, many rows): async job → write ZIP to initiative-exports or initiative-backups → presigned URL when ready (avoids proxy timeouts).
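
To show what streaming means here, the sketch below produces a ZIP incrementally using only the standard library's zipfile, writing into a non-seekable sink and yielding bytes as they are produced. It is a stand-in for illustration, not the zipstream-ng API suggested above; stream_zip and _ChunkSink are hypothetical names, and entries are stored uncompressed (the zipfile default).

```python
import zipfile
from typing import Iterable, Iterator, List, Tuple


class _ChunkSink:
    """Write-only sink without tell/seek, so zipfile treats it as a stream."""

    def __init__(self) -> None:
        self._pending: List[bytes] = []

    def write(self, data) -> int:
        self._pending.append(bytes(data))
        return len(data)

    def flush(self) -> None:
        pass

    def drain(self) -> bytes:
        out = b"".join(self._pending)
        self._pending.clear()
        return out


def stream_zip(entries: Iterable[Tuple[str, Iterable[bytes]]]) -> Iterator[bytes]:
    """Yield ZIP bytes piece by piece; holds at most one chunk in memory."""
    sink = _ChunkSink()
    with zipfile.ZipFile(sink, mode="w") as archive:
        for arcname, chunks in entries:
            with archive.open(arcname, mode="w") as member:
                for chunk in chunks:
                    member.write(chunk)
                    data = sink.drain()
                    if data:
                        yield data
    # Closing the archive writes the central directory; emit what is left.
    tail = sink.drain()
    if tail:
        yield tail
```

In a web handler, the generator would be passed to whatever streaming response type the framework provides, with each entry's chunk iterator reading from MinIO.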

Sources for each ZIP entry

| Content | Source |
|---|---|
| Full hồ sơ PDF | application_artifacts.full_pdf → MinIO initiative-exports (after Phase 2) |
| Printable DOCX / PDF | official_form_docx / official_form_pdf → initiative-exports |
| Evidence | research_*, textbook_*, technical_* → initiative-attachments |
| Structured snapshot | Optional: latest application_review_documents JSON |

Legacy: If official_form_* missing, either skip with manifest flags or run one-time backfill using frozen template policy — document that backfilled bytes are “as-of backfill date” not original submit date.

Phase 4 — Identifiers & schema hygiene

  • Add submission_public_id (unique, indexed) on initiatives, set once at submit; replace linear scan in get_application_by_id with indexed lookup (feedback-data-management.md §5).
  • Document resolution: sub-… vs CASE-… explicitly (remove “sometimes” from ops docs).

Phase 5 — Hardening (ongoing)

  • MinIO versioning / object lock if compliance requires; off-cluster backup of MinIO; periodic verify-only sweeps (feedback-data-management.md §9, §10, quarter roadmap).

Frontend (admin)

  • New “Tải bản sao lưu” (“Download backup”, or similar) action on application detail: call the backup endpoint and handle long downloads (show progress by polling if the export is async).
  • For async pattern: show job id, link when presigned URL ready.
  • Ensure admin audit expectations match backend logging.

Summary

| Layer | Current summary | After plan |
|---|---|---|
| Postgres | Artifacts + polymorphic storage_uri | Explicit storage_kind, optional submission_public_id, new artifact roles for official DOCX/PDF |
| MinIO | Evidence + best-effort full PDF | Required full_pdf + official form binaries on initiative-exports; evidence on initiative-attachments |
| Admin backup | Would require regeneration / fragile dispatch | Streaming ZIP + manifest + verified SHA + audit; optional async for bulk |

This aligns the database and document management system with a backup that admins can trust: stored bytes, verified at pack time, and operationally grounded in explicit storage metadata.