sciagent code + Gitea Actions CI/CD

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:38:30 +07:00
commit 688fac73e9
1167 changed files with 158244 additions and 0 deletions
@@ -0,0 +1,363 @@
+# Specification: Browser-Based DOCX-to-PDF Converter
+
+**Status:** Ready for implementation
+**Audience:** Frontend engineer (React + TypeScript)
+**Estimated effort:** 1–2 days for a working component, +1 day for polish and tests
+
+---
+
+## 1. Overview
+
+This document specifies a React component, `DocxToPdfViewer`, that accepts a `.docx` file in the browser, renders it on screen with layout fidelity close to that of Microsoft Word, and produces a downloadable PDF that matches the on-screen rendering page for page. The component runs entirely in the browser; no document content ever leaves the user's machine.
+
+The component is intended for use cases where users need to view a Word document and obtain a PDF copy without installing Word, opening a desktop converter, or trusting a third-party cloud service. Typical scenarios include legal forms, application packets, internal templates, and document submission flows where PDF is the required output format.
+
+## 2. Goals and Non-Goals
+
+### 2.1 Goals
+
+The component must preserve the document's page size, margins, fonts (where embedded or system-available), paragraph alignment, tables, inline and floating images, headers, footers, footnotes, bullet and numbered lists, and basic text formatting (bold, italic, underline, color, size). It must correctly render documents containing non-Latin scripts, with Vietnamese diacritics, CJK characters, and right-to-left scripts as concrete test cases. It must work on the current versions of Chromium-based browsers, Firefox, and Safari without server assistance. It must expose a clear TypeScript API and emit lifecycle events suitable for integration into larger applications.
+
+### 2.2 Non-Goals
+
+The output PDF is **rasterised**: each page is a JPEG image embedded in a PDF page of matching dimensions. Text in the output is therefore not selectable or searchable. If selectable text is required, the implementer should use a server-side converter (LibreOffice headless, Aspose, or a paid API) instead — this is documented in Section 12.
+
+The component does not edit, sign, redact, fill forms in, or otherwise modify the source document. It does not support `.doc` (legacy binary format); callers must convert to `.docx` upstream. It does not attempt to be a general-purpose Word viewer with comments, track changes, or revision history rendering; only the final accepted state is rendered.
+
+## 3. System Context
+
+The pipeline has three stages, executed in order:
+
+```
+┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
+│  .docx file │ -> │ docx-preview │ -> │ html2canvas  │ -> │    jsPDF     │ -> Blob
+│  (Blob)     │    │   (HTML)     │    │  (Canvas[])  │    │  (PDF Blob)  │
+└─────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
+                          │
+                          └─> visible to user as the on-screen preview
+```
+
+The rendered HTML serves a dual purpose: it is both the on-screen preview shown to the user *and* the source material from which the PDF is rasterised. There is no separate hidden render pass. This is a deliberate architectural choice; see Rule 3 in Section 7.
+
+## 4. Dependencies
+
+The implementation requires three runtime dependencies and their type definitions:
+
+| Package | Version | Purpose |
+|---|---|---|
+| `docx-preview` | `^0.3.5` | Parses `.docx` and renders to HTML with high layout fidelity. |
+| `html2canvas` | `^1.4.1` | Rasterises a DOM subtree to an HTMLCanvasElement. |
+| `jspdf` | `^2.5.1` | Assembles canvas images into a multi-page PDF. |
+
+`docx-preview` has a transitive runtime dependency on `jszip`, which it imports via its package; no direct install is required when bundling with npm. When loading via CDN, `jszip` must be loaded as a separate `<script>` tag *before* `docx-preview`.
+
+The React peer dependency is React 18 or later. Install:
+
+```bash
+npm install docx-preview jspdf html2canvas
+```
+
+TypeScript declarations ship with all three packages (`html2canvas` bundles its own in recent versions), so no separate `@types/*` installs are needed.
+
+## 5. Public API
+
+The component is the default export of a single file, `DocxToPdfViewer.tsx`. Its props are:
+
+```ts
+type ConverterStatus = "idle" | "rendering" | "capturing" | "ready" | "error";
+
+interface DocxToPdfViewerProps {
+  /** Pre-supplied .docx file. If omitted, the built-in file picker is shown. */
+  file?: File | null;
+
+  /** Hide the built-in file picker. Use when `file` is controlled externally. */
+  hideFilePicker?: boolean;
+
+  /** Hide the inline HTML preview. Use when only the PDF blob is needed. */
+  hidePreview?: boolean;
+
+  /** Called when the PDF blob is ready. */
+  onPdfReady?: (pdfBlob: Blob, sourceFile: File) => void;
+
+  /** Called on every stage of the conversion lifecycle. */
+  onStatusChange?: (status: ConverterStatus) => void;
+
+  /** Rendering scale passed to html2canvas. Default 2. Range 1–4. */
+  renderScale?: number;
+
+  /** JPEG quality 0–1 for embedded page images. Default 0.95. */
+  imageQuality?: number;
+
+  /** Use PNG (lossless, larger files) instead of JPEG. Default false. */
+  losslessImages?: boolean;
+
+  className?: string;
+  style?: React.CSSProperties;
+}
+```
+
+The component is fully self-contained: with no props, dropping it into a tree produces a working drag-and-drop converter. With `file` supplied externally, conversion starts automatically whenever the prop changes. `onPdfReady` is the integration seam for callers who need to upload, store, or further process the PDF.
+
+## 6. Implementation Guide
+
+### 6.1 Project Structure
+
+A single-file component is sufficient. Place `DocxToPdfViewer.tsx` in your component directory. No additional CSS files, context providers, or build configuration are required beyond what a standard Vite/Next/CRA React project already provides.
+
+### 6.2 The Rendering Stage
+
+The component holds a `ref` to a single visible `<div>`. When a file is received, the implementation calls `docx-preview`'s `renderAsync` with that ref as the body container. The library injects a `<div class="docx-wrapper">` containing one `<section class="docx">` element per page, plus a `<style>` block of derived CSS at the top of the container.
+
+```ts
+await renderAsync(source, container, undefined, {
+  inWrapper: true,
+  breakPages: true,
+  ignoreLastRenderedPageBreak: false,
+  useBase64URL: true,
+  experimental: true,
+  renderHeaders: true,
+  renderFooters: true,
+  renderFootnotes: true,
+});
+```
+
+`breakPages: true` is essential — it causes the library to emit one section per page rather than a single continuous flow, which is what makes per-page capture possible later. `useBase64URL: true` inlines images and fonts as data URLs, which avoids cross-origin issues during canvas capture (see Section 8). `experimental: true` enables tab-stop calculation; the option name is misleading but the feature is stable in practice.
+
+After `renderAsync` resolves, wait one animation frame plus a short `setTimeout` before measuring page dimensions. Browsers do not guarantee that injected styles have been applied and font metrics finalised by the time the promise resolves; measuring too early produces zero-width pages.
+
+```ts
+await new Promise<void>(r => requestAnimationFrame(() => r()));
+await new Promise<void>(r => setTimeout(r, 50));
+```
+
+### 6.3 The Capture Stage
+
+Once rendered, locate the page elements:
+
+```ts
+let pages = Array.from(
+  container.querySelectorAll<HTMLElement>("section.docx")
+);
+if (pages.length === 0) {
+  pages = Array.from(container.querySelectorAll<HTMLElement>("section"));
+}
+if (pages.length === 0) {
+  throw new Error("docx-preview produced no page sections.");
+}
+```
+
+The fallback selector exists to defend against future `docx-preview` versions that might change the section class name; it has no cost when the primary selector succeeds.
+
+For each page, call `html2canvas` with the page element as the target. The recommended configuration:
+
+```ts
+const canvas = await html2canvas(page, {
+  scale: renderScale,        // 2 for crisp output
+  useCORS: true,             // honour CORS headers on any external images
+  backgroundColor: "#ffffff",// avoid transparent pages
+  logging: false,
+  windowWidth:  page.offsetWidth,
+  windowHeight: page.offsetHeight,
+});
+```
+
+`scale: 2` is the sweet spot. `scale: 1` produces visibly blurry text. Canvas memory grows with the square of the scale, so `scale: 3` needs 2.25× the memory of `scale: 2` per page and `scale: 4` needs 4×, with diminishing visual return except for print output.
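+
+As a rough illustration of how the `renderScale` prop might be validated before it reaches `html2canvas`, the sketch below clamps a requested value into the 1–4 range from Section 5 and shows the quadratic growth in pixel count. The helper names `resolveRenderScale` and `pixelFactor` are hypothetical and not part of any library.
+
+```typescript
+// Hypothetical helpers: names and structure are illustrative only.
+const DEFAULT_RENDER_SCALE = 2;
+const MIN_RENDER_SCALE = 1;
+const MAX_RENDER_SCALE = 4;
+
+/** Clamp a caller-supplied scale into the documented 1-4 range. */
+function resolveRenderScale(requested?: number): number {
+  if (requested === undefined || !Number.isFinite(requested)) {
+    return DEFAULT_RENDER_SCALE;
+  }
+  return Math.min(MAX_RENDER_SCALE, Math.max(MIN_RENDER_SCALE, requested));
+}
+
+/** Pixel count (and therefore canvas memory) relative to scale 1. */
+function pixelFactor(scale: number): number {
+  return scale * scale;
+}
+
+console.log(resolveRenderScale(9));             // 4
+console.log(pixelFactor(3) / pixelFactor(2));   // 2.25
+```
+
+Falling back to the default for `undefined` or non-finite input keeps a bad prop value from producing a zero-sized or enormous canvas.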
+
+### 6.4 The PDF Assembly Stage
+
+Initialise `jsPDF` once, using the first page's dimensions. Convert CSS pixels to millimetres using the constant `25.4 / 96` (millimetres per inch divided by CSS pixels per inch at the standard 96 DPI):
+
+```ts
+const PX_TO_MM = 25.4 / 96;
+const widthMm  = firstPage.offsetWidth  * PX_TO_MM;
+const heightMm = firstPage.offsetHeight * PX_TO_MM;
+
+const pdf = new jsPDF({
+  orientation: widthMm > heightMm ? "landscape" : "portrait",
+  unit: "mm",
+  format: [widthMm, heightMm],
+  compress: true,
+});
+```
+
+For each captured canvas, derive that page's own dimensions (a document may mix portrait and landscape sections) and add it. The first page is implicit; subsequent pages require explicit `addPage`:
+
+```ts
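+for (let i = 0; i < pages.length; i++) {
+  const page = pages[i];
+  const pwMm = page.offsetWidth  * PX_TO_MM;
+  const phMm = page.offsetHeight * PX_TO_MM;
+
+  // Capture this page; `captureOptions` stands for the options object from Section 6.3.
+  const canvas = await html2canvas(page, captureOptions);
+  const imgData = canvas.toDataURL("image/jpeg", imageQuality); // prop, default 0.95
+
+  // jsPDF creates the first page in its constructor; add the rest explicitly.
+  if (i > 0) {
+    pdf.addPage([pwMm, phMm], pwMm > phMm ? "landscape" : "portrait");
+  }
+  pdf.addImage(imgData, "JPEG", 0, 0, pwMm, phMm, undefined, "FAST");
+}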
+
+const blob = pdf.output("blob");
+```
+
+The `"FAST"` compression mode is the correct choice for embedded JPEGs. The image is already compressed; asking jsPDF to re-compress with `"SLOW"` or `"MEDIUM"` adds significant CPU time and no file-size benefit. For the lossless variant (`losslessImages: true`), substitute `"image/png"` and `"PNG"`; expect 5–10× larger output.
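+
+The pixel-to-millimetre conversion used in this section can be checked in isolation. The sketch below is a minimal, DOM-free version; `pageSizeMm` and its return type are hypothetical names introduced for illustration, and the inputs stand in for a page element's `offsetWidth` and `offsetHeight`.
+
+```typescript
+// CSS defines 96 px per inch; one inch is 25.4 mm.
+const PX_TO_MM = 25.4 / 96;
+
+type Orientation = "portrait" | "landscape";
+
+interface PageSizeMm {
+  widthMm: number;
+  heightMm: number;
+  orientation: Orientation;
+}
+
+/** Hypothetical helper: convert a page's CSS-pixel box to millimetres for jsPDF. */
+function pageSizeMm(widthPx: number, heightPx: number): PageSizeMm {
+  const widthMm = widthPx * PX_TO_MM;
+  const heightMm = heightPx * PX_TO_MM;
+  return {
+    widthMm,
+    heightMm,
+    orientation: widthMm > heightMm ? "landscape" : "portrait",
+  };
+}
+
+// US Letter at 96 DPI is 816 x 1056 px, i.e. 8.5 x 11 inches.
+const letter = pageSizeMm(816, 1056);
+console.log(letter.widthMm.toFixed(1), letter.heightMm.toFixed(1), letter.orientation);
+// 215.9 279.4 portrait
+```
+
+Because `offsetWidth` and `offsetHeight` are rounded to whole pixels, an A4 page typically measures 794 × 1123 px and converts to about 210.1 × 297.1 mm rather than exactly 210 × 297 mm.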
+
+### 6.5 The Component Shell
+
+The UI surface comprises four elements: a file picker that doubles as a drop zone, a status line, a download button that appears when the PDF is ready, and the preview container that `docx-preview` renders into. Detailed visual design is out of scope for this spec — the component should accept `className` and `style` props and ship with neutral default styles that integrate into any application without requiring a CSS reset.
+
+The drop zone must accept both click-to-browse and drag-and-drop. On drag-over, prevent the default to enable drop. On drop, validate that the file has a `.docx` extension before passing it to the conversion pipeline.
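+
+The extension check on drop could look like the sketch below. It is a minimal version that depends only on the file's `name`; `NamedFile`, `hasDocxExtension`, and `validateDroppedFile` are hypothetical names, and the error strings are placeholders.
+
+```typescript
+/** Minimal shape of the browser File object that this check needs. */
+interface NamedFile {
+  name: string;
+}
+
+/** Hypothetical helper: true when the file name ends in ".docx" (case-insensitive). */
+function hasDocxExtension(file: NamedFile): boolean {
+  return /\.docx$/i.test(file.name.trim());
+}
+
+/** Returns an error message for rejected files, or null when the file is acceptable. */
+function validateDroppedFile(file: NamedFile | null | undefined): string | null {
+  if (!file) {
+    return "No file was provided.";
+  }
+  if (!hasDocxExtension(file)) {
+    return `"${file.name}" is not a .docx file.`;
+  }
+  return null;
+}
+
+console.log(validateDroppedFile({ name: "Report.DOCX" })); // null
+console.log(validateDroppedFile({ name: "legacy.doc" }));  // "legacy.doc" is not a .docx file.
+```
+
+A browser `File` satisfies `NamedFile` structurally, so the same function can be called with `event.dataTransfer.files[0]` in a drop handler.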
+
+### 6.6 Lifecycle Management
+
+The PDF blob is held in a ref rather than React state, because re-renders triggered by other state changes (progress updates, status changes) should not re-create the URL or re-trigger downstream consumers. A separate boolean state (`pdfReady`) controls the visibility of the Download button.
+
+Object URLs are created lazily, at the moment the user clicks Download, and revoked after a short delay sufficient for the browser to initiate the download (4 seconds is a conservative value):
+
+```ts
+const url = URL.createObjectURL(blob);
+const a = document.createElement("a");
+a.href = url;
+a.download = sourceFile.name.replace(/\.docx$/i, "") + ".pdf";
+document.body.appendChild(a);
+a.click();
+a.remove();
+setTimeout(() => URL.revokeObjectURL(url), 4000);
+```
+
+Creating the URL only at click time avoids holding a long-lived blob URL in memory for users who never download.
+
+When the `file` prop changes or the user selects a new file via the picker, the conversion pipeline restarts and the previous blob is discarded. The previous preview DOM is cleared by setting `container.innerHTML = ""` before the next `renderAsync` call.
+
+## 7. Critical Implementation Rules
+
+The following four rules each correspond to a non-obvious failure mode that has cost real engineering time. They are not stylistic preferences — they will cause the component to fail or produce blank output if violated.
+
+### Rule 1: Do not override `docx-preview`'s `className` option
+
+The option is documented as "class name/prefix for default and document style classes". In practice, it controls the **literal class name applied to each page section**. If `className: "my-pages"` is passed, the sections come out as `<section class="my-pages">`, not `<section class="docx">`. Any selector that looks for `section.docx` will return zero pages, and the implementation will throw "no page sections" despite a successful render.
+
+Leave the option at its default. If the page selector needs to be defensive against future library changes, query both `section.docx` and `section` as fallbacks, but do not solve the problem by changing `className`.
+
+### Rule 2: Do not hide the capture target with CSS `visibility`, `display`, or `opacity`
+
+It is tempting to render `docx-preview`'s output into a hidden off-screen container and only show the resulting PDF. This does not work. `html2canvas` respects computed CSS visibility: an element with `visibility: hidden`, `display: none`, or `opacity: 0` (or any ancestor with those properties) will be rasterised as blank or transparent pixels. The capture stage will complete without error, and the resulting PDF will have the correct page count and dimensions but be entirely empty.
+
+If the rendered HTML must not be visible to the user, position it off-screen with `position: fixed; left: -100000px;` *without* applying any visibility, display, or opacity rules. Mark it `aria-hidden="true"` and `inert` for accessibility. In practice, however, see Rule 3 — the rendered HTML should usually be the visible preview.
+
+### Rule 3: Do not preview the generated PDF in an `<iframe>`
+
+Browsers' built-in PDF viewers are unreliable inside sandboxed iframes, embedded extension contexts, and certain CSP-restricted hosts. A `blob:` URL pointing to a valid PDF will load into a top-level tab without issue but stay blank in `<iframe src="blob:...">` inside a sandbox. The conversion will succeed, the blob will be valid, the download will work, but the inline preview will be empty.
+
+The architectural fix is to recognise that `docx-preview` is already producing a high-fidelity, paginated, **selectable** HTML rendering of the document. That rendering is the preview. The PDF is a derivative artefact that only needs to materialise at download time. The implementation should render `docx-preview` directly into the visible preview container — never into a hidden stage that is then mirrored into an iframe. This is also better UX outside sandboxed contexts: the HTML preview has selectable text, is scrollable, and renders faster than asking the browser to display a PDF.
+
+### Rule 4: Choose CDN sources deliberately when loading without a bundler
+
+`docx-preview` is **not** published on cdnjs. It is available on npm, jsDelivr, and unpkg. Hosts that enforce a strict CSP allowing only cdnjs (such as Claude's artifact iframe, Chrome extension contexts, and some enterprise application shells) will block loading from jsDelivr and unpkg alike with a `script-src` violation. The library script never executes, the global `window.docx` is undefined, and the first call into the pipeline throws `TypeError: Cannot read properties of undefined (reading 'renderAsync')`.
+
+For browser-only HTML deployments whose CSP permits it, use jsDelivr's `/npm/` path, loading `jszip` first as described in Section 4:
+
+```html
+<script src="https://cdn.jsdelivr.net/npm/docx-preview@0.3.5/dist/docx-preview.min.js"></script>
+```
+
+For bundled React applications, install via npm; CDN choice does not apply. Verify each library's presence on `window` before the first conversion call and surface a clear error to the user if any failed to load.
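+
+One way to perform that presence check is sketched below. The global names are assumptions: `docx` is the global named in this section, while `html2canvas` and `jspdf` are the names the UMD builds of those libraries are assumed to register, so verify them against the builds actually loaded. The helper names are hypothetical.
+
+```typescript
+/** Globals the UMD builds are assumed to register; verify against the loaded builds. */
+const REQUIRED_GLOBALS: readonly string[] = ["docx", "html2canvas", "jspdf"];
+
+/** Hypothetical helper: list the required globals that are absent from `scope`. */
+function missingGlobals(
+  scope: Record<string, unknown>,
+  required: readonly string[] = REQUIRED_GLOBALS,
+): string[] {
+  return required.filter((name) => scope[name] === undefined || scope[name] === null);
+}
+
+/** Throws a readable error instead of a later "Cannot read properties of undefined". */
+function assertLibrariesLoaded(scope: Record<string, unknown>): void {
+  const missing = missingGlobals(scope);
+  if (missing.length > 0) {
+    throw new Error(`Failed to load: ${missing.join(", ")}. Check the CDN URLs and CSP.`);
+  }
+}
+
+// Simulated scope in which only two of the three scripts executed.
+console.log(missingGlobals({ docx: {}, html2canvas: () => undefined })); // [ 'jspdf' ]
+```
+
+In a browser the call would be `assertLibrariesLoaded(window as unknown as Record<string, unknown>)`, made once before the first conversion.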
+
+## 8. Error Handling and Edge Cases
+
+The implementation must handle the following scenarios gracefully:
+
+**Wrong file type.** When the user drops a `.pdf`, `.txt`, `.doc`, or any non-`.docx` file, the component shows an inline error message and does not enter the rendering stage. Validation is by file extension; MIME-type sniffing is unreliable across browsers.
+
+**Corrupted or malformed `.docx`.** `docx-preview` will throw during `renderAsync` if the file is not a valid OOXML package or contains unparseable XML. The error must be caught, the status set to `"error"`, and the error message surfaced to the user. The component must remain in a state where another file can be selected.
+
+**Empty document.** A valid `.docx` containing no content will produce an empty wrapper with no `<section>` elements. The implementation throws an explicit error rather than producing an empty PDF.
+
+**Images with restrictive CORS.** With `useBase64URL: true`, `docx-preview` inlines embedded images as data URLs and CORS does not apply. If the option is changed to `false`, externally hosted images will taint the canvas and cause `toDataURL` to throw a `SecurityError`. Do not change this option.
+
+**Very large documents.** Documents with more than ~50 pages may exhaust memory at `scale: 2` because each captured canvas is held in memory before being added to the PDF. For documents this large, the implementation should release each canvas immediately after `addImage` returns (drop every reference to it and set its `width` and `height` to 0 so the browser can free the backing store), and consider lowering `renderScale` to 1.5 when page count exceeds a threshold.
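+
+The scale reduction for long documents might be expressed as the small policy function below. This is a sketch only: the 50-page cutoff and the 1.5 fallback are taken from the paragraph above, but suitable values depend on the target devices, and `scaleForPageCount` is a hypothetical name.
+
+```typescript
+// Hypothetical thresholds; tune them for the devices the component must support.
+const LARGE_DOCUMENT_PAGES = 50;
+const LARGE_DOCUMENT_SCALE = 1.5;
+
+/** Lower the capture scale for long documents; never raise it above the request. */
+function scaleForPageCount(pageCount: number, requestedScale: number): number {
+  if (pageCount > LARGE_DOCUMENT_PAGES) {
+    return Math.min(requestedScale, LARGE_DOCUMENT_SCALE);
+  }
+  return requestedScale;
+}
+
+console.log(scaleForPageCount(10, 2)); // 2
+console.log(scaleForPageCount(80, 2)); // 1.5
+console.log(scaleForPageCount(80, 1)); // 1
+```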
+
+**Mixed page orientations.** Documents that switch from portrait to landscape mid-flow are handled by the per-page dimension calculation in Section 6.4. Do not assume all pages share the first page's dimensions.
+
+**Rapid file changes.** If the user drops a second file while the first is still converting, the in-flight conversion must be cancelled or its results discarded. The simplest approach is to track an incrementing conversion ID; results from a non-current ID are ignored on completion. Without this guard, a slow first conversion can finish after the second and overwrite its PDF blob, and stale progress updates can confuse the status display.
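+
+The incrementing-ID approach can be sketched without any React or DOM code. `ConversionTracker` and `runLatestOnly` below are hypothetical names; in the component the counter would most likely live in a ref so that it survives re-renders.
+
+```typescript
+/** Hypothetical helper that hands out conversion IDs and remembers the newest one. */
+class ConversionTracker {
+  private currentId = 0;
+
+  /** Start a new conversion; any earlier ID becomes stale. */
+  begin(): number {
+    this.currentId += 1;
+    return this.currentId;
+  }
+
+  /** True only for the most recently started conversion. */
+  isCurrent(id: number): boolean {
+    return id === this.currentId;
+  }
+}
+
+/** Run `work` and deliver its result only if no newer conversion has started. */
+async function runLatestOnly<T>(
+  tracker: ConversionTracker,
+  work: () => Promise<T>,
+  onResult: (result: T) => void,
+): Promise<void> {
+  const id = tracker.begin();
+  const result = await work();
+  if (tracker.isCurrent(id)) {
+    onResult(result);
+  }
+  // Otherwise the result is stale and is silently dropped.
+}
+```
+
+The same `isCurrent` check can guard progress and status callbacks inside the capture loop, so a superseded conversion stops updating the UI as soon as a newer one begins.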
+
+## 9. Performance Considerations
+
+For a typical 5-page A4 document, end-to-end conversion on mid-range 2024 hardware takes 1.5–3 seconds. The dominant cost is `html2canvas` capture, which scales roughly linearly with page count and quadratically with `renderScale`. The `docx-preview` rendering stage typically takes 100–300 ms regardless of page count. PDF assembly is negligible.
+
+Memory peaks during the capture loop, which holds one canvas worth of pixels per page until it is added to the PDF. A US Letter page is 816 × 1056 CSS pixels at 96 DPI, so at `scale: 2` a single canvas is 1632 × 2112 pixels, or approximately 13.8 MB of RGBA data. A 20-page document whose canvases are all retained briefly holds ~276 MB before garbage collection.
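+
+Per-canvas memory can be sanity-checked from the page size alone. The sketch below assumes 4 bytes per pixel (RGBA) and ignores any extra copies the browser or `html2canvas` may keep internally, so treat it as a lower bound; the function names are hypothetical.
+
+```typescript
+const BYTES_PER_PIXEL = 4; // RGBA, one byte per channel
+
+/** Bytes of pixel data for one page canvas at the given html2canvas scale. */
+function canvasBytes(widthPx: number, heightPx: number, scale: number): number {
+  return Math.round(widthPx * scale) * Math.round(heightPx * scale) * BYTES_PER_PIXEL;
+}
+
+/** Worst case: every page canvas is still referenced at the same time. */
+function peakCaptureBytes(
+  pageCount: number,
+  widthPx: number,
+  heightPx: number,
+  scale: number,
+): number {
+  return pageCount * canvasBytes(widthPx, heightPx, scale);
+}
+
+const toMB = (bytes: number): string => (bytes / 1e6).toFixed(1);
+
+// US Letter is 816 x 1056 CSS px at 96 DPI.
+console.log(toMB(canvasBytes(816, 1056, 2)));          // 13.8
+console.log(toMB(peakCaptureBytes(20, 816, 1056, 2))); // 275.7
+```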
+
+Output PDF file sizes for a 5-page document at default settings are approximately 1.5–3 MB. Lowering `imageQuality` from 0.95 to 0.85 typically reduces output by 30% with no visible degradation; lowering below 0.80 introduces visible JPEG artifacts on text edges.
+
+## 10. Browser Support
+
+The component targets the current and one prior major version of Chrome, Edge, Firefox, and Safari. Internet Explorer is not supported. The relevant browser features are:
+
+- `File` and `FileReader` APIs (universal since 2014)
+- `Blob` and `URL.createObjectURL` (universal since 2014)
+- Canvas `toDataURL` with JPEG support (universal since 2012)
+- ES2020 language features (set `target` to `ES2020` or later in `tsconfig.json`)
+
+`html2canvas` has known limitations rendering certain CSS features — `mix-blend-mode`, `backdrop-filter`, complex `clip-path` — that may affect documents using heavy graphical design. For Word documents this is rarely relevant; standard business documents do not invoke these features.
+
+## 11. Testing
+
+Implementations should be verified against the following test corpus:
+
+| Test document | Asserts |
+|---|---|
+| Plain prose, 3 pages, A4 | Basic flow; page count and dimensions match |
+| Document with one table per page | Tables render with borders and cell shading |
+| Mixed portrait and landscape sections | Each PDF page matches its source orientation |
+| Document with embedded PNG and JPEG images | Images appear in correct positions |
+| Vietnamese-language document with diacritics | All characters render; no missing glyphs |
+| Document with header and footer including page numbers | Headers/footers appear on every page |
+| Document with bulleted and numbered lists | List markers render with correct indentation |
+| 30-page document | Memory does not exceed 500 MB during capture |
+| Corrupted .docx (truncated zip) | Component shows error and remains usable |
+
+Beyond visually comparing the rendered preview against the source `.docx` opened in Word, open the generated PDF in a separate PDF reader (Acrobat, Preview, or Firefox's built-in viewer) to confirm that page dimensions, page count, and rendered content match. Programmatic visual regression testing of the PDF output is beyond the scope of this spec, but it can be implemented by rasterising each PDF page (for example with `pdfjs-dist`) and comparing the images with `pixelmatch`.
+
+## 12. Known Limitations and Alternatives
+
+The text in the output PDF is rasterised and therefore not selectable, searchable, copyable, or screen-readable. Users who need any of these properties, particularly accessibility for visually impaired users, must use a server-side converter that emits real PDF text objects. Recommended alternatives, roughly in increasing order of cost:
+
+1. **LibreOffice headless** (`soffice --headless --convert-to pdf`): free, self-hosted, very high fidelity, requires a server with LibreOffice installed. ~1–3 seconds per document.
+2. **Aspose.Words Cloud or self-hosted**: paid, very high fidelity, native PDF text output, requires license.
+3. **CloudConvert, ConvertAPI, or similar SaaS**: paid per-document, simple HTTP API, sends document contents to a third party.
+
+The HTML preview produced by `docx-preview` *is* accessible — screen readers can navigate it, text is selectable, and users can zoom — so the component's accessibility story is intact for users who don't need the PDF artifact itself.
+
+This component cannot edit, sign, redact, or annotate documents. For those features, evaluate `pdf-lib` (PDF mutation) or `docx` (DOCX generation; a different package from `docx-preview`).
+
+## 13. Appendix: Algorithm Pseudocode
+
+For reference, the complete conversion algorithm in roughly 20 lines:
+
+```
+function convert(file, container):
+  clear container
+  await renderAsync(file, container, undefined, {
+    inWrapper: true,
+    breakPages: true,
+    useBase64URL: true,
+    experimental: true,
+    renderHeaders: true, renderFooters: true, renderFootnotes: true,
+  })
+  await rAF; await sleep(50)
+
+  pages = container.querySelectorAll("section.docx")
+  if pages is empty: pages = container.querySelectorAll("section")
+  if pages is empty: throw
+
+  pdf = new jsPDF using pages[0] dimensions in mm
+  for each page in pages:
+    canvas = await html2canvas(page, scale=2, useCORS=true, bg=white)
+    if not first page: pdf.addPage(page dimensions)
+    pdf.addImage(canvas.toDataURL("image/jpeg", 0.95), "JPEG", 0, 0, w_mm, h_mm)
+
+  return pdf.output("blob")
+```
+
+The pseudocode omits error handling, lifecycle management, and progress reporting, all of which are required in the production implementation per Sections 6.6 and 8.
+
+---
+
+*End of specification.*