# Specification: Browser-Based DOCX-to-PDF Converter

**Status:** Ready for implementation

**Audience:** Frontend engineer (React + TypeScript)

**Estimated effort:** 1–2 days for a working component, +1 day for polish and tests

---

## 1. Overview

This document specifies a React component, `DocxToPdfViewer`, that accepts a `.docx` file in the browser, renders it on screen with layout fidelity equivalent to Microsoft Word, and produces a downloadable PDF that matches the rendering page-for-page. The component runs entirely in the browser; no document content ever leaves the user's machine.

The component is intended for use cases where users need to view a Word document and obtain a PDF copy without installing Word, opening a desktop converter, or trusting a third-party cloud service. Typical scenarios include legal forms, application packets, internal templates, and document submission flows where PDF is the required output format.

## 2. Goals and Non-Goals

### 2.1 Goals

The component must preserve the document's page size, margins, fonts (where embedded or system-available), paragraph alignment, tables, inline and floating images, headers, footers, footnotes, bullet and numbered lists, and basic text formatting (bold, italic, underline, color, size). It must correctly render documents containing non-Latin scripts, with Vietnamese diacritics, CJK characters, and right-to-left scripts as concrete test cases. It must work on the current versions of Chromium-based browsers, Firefox, and Safari without server assistance. It must expose a clear TypeScript API and emit lifecycle events suitable for integration into larger applications.

### 2.2 Non-Goals

The output PDF is **rasterised**: each page is a JPEG image embedded in a PDF page of matching dimensions. Text in the output is therefore not selectable or searchable.
If selectable text is required, the implementer should use a server-side converter (LibreOffice headless, Aspose, or a paid API) instead; this is documented in Section 12.

The component does not edit, sign, redact, fill forms in, or otherwise modify the source document. It does not support `.doc` (legacy binary format); callers must convert to `.docx` upstream. It does not attempt to be a general-purpose Word viewer with comments, track changes, or revision history rendering; only the final accepted state is rendered.

## 3. System Context

The pipeline has three stages, executed in order:

```
┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ .docx file  │ -> │ docx-preview │ -> │ html2canvas  │ -> │ jsPDF        │ -> Blob
│ (Blob)      │    │ (HTML)       │    │ (Canvas[])   │    │ (PDF Blob)   │
└─────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                          │
                          └─> visible to user as the on-screen preview
```

The rendered HTML serves a dual purpose: it is both the on-screen preview shown to the user *and* the source material from which the PDF is rasterised. There is no separate hidden render pass. This is a deliberate architectural choice; see Rule 3 in Section 7.

## 4. Dependencies

The implementation requires three runtime dependencies and their type definitions:

| Package | Version | Purpose |
|---|---|---|
| `docx-preview` | `^0.3.5` | Parses `.docx` and renders to HTML with high layout fidelity. |
| `html2canvas` | `^1.4.1` | Rasterises a DOM subtree to an `HTMLCanvasElement`. |
| `jspdf` | `^2.5.1` | Assembles canvas images into a multi-page PDF. |

`docx-preview` has a transitive runtime dependency on `jszip`, which it imports via its package; no direct install is required when bundling with npm. When loading via CDN, `jszip` must be loaded as a separate `<script>` tag.

For bundled React applications, install via npm; CDN choice does not apply. Verify each library's presence on `window` before the first conversion call and surface a clear error to the user if any failed to load.
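
The presence check described above could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the helper names `findMissingLibraries` and `assertLibrariesLoaded` are hypothetical, and the global names (`JSZip`, `docx`, `html2canvas`, `jspdf`) are assumptions about what the UMD builds of each library attach to `window`; confirm them against the builds actually loaded.

```typescript
// Hypothetical helper for CDN-loaded builds. The global names below are
// assumptions about the UMD bundles, not something this spec defines.
type GlobalScope = Record<string, unknown>;

const REQUIRED_GLOBALS: readonly string[] = ["JSZip", "docx", "html2canvas", "jspdf"];

/** Returns the names of required globals that are absent from `scope`. */
function findMissingLibraries(
  scope: GlobalScope,
  required: readonly string[] = REQUIRED_GLOBALS,
): string[] {
  return required.filter((name) => scope[name] === undefined || scope[name] === null);
}

/** Throws a single readable error listing every library that failed to load. */
function assertLibrariesLoaded(scope: GlobalScope): void {
  const missing = findMissingLibraries(scope);
  if (missing.length > 0) {
    throw new Error(`Required libraries failed to load: ${missing.join(", ")}`);
  }
}
```

In a browser the check would be called as `assertLibrariesLoaded(window as unknown as GlobalScope)` before the first conversion, with the thrown message shown in the component's error state. Taking the scope as a parameter keeps the helper testable without a DOM.
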
## 8. Error Handling and Edge Cases

The implementation must handle the following scenarios gracefully:

**Wrong file type.** When the user drops a `.pdf`, `.txt`, `.doc`, or any non-`.docx` file, the component shows an inline error message and does not enter the rendering stage. Validation is by file extension; MIME-type sniffing is unreliable across browsers.

**Corrupted or malformed `.docx`.** `docx-preview` will throw during `renderAsync` if the file is not a valid OOXML package or contains unparseable XML. The error must be caught, the status set to `"error"`, and the error message surfaced to the user. The component must remain in a state where another file can be selected.

**Empty document.** A valid `.docx` containing no content will produce an empty wrapper with no `<section>` elements. The implementation throws an explicit error rather than producing an empty PDF.

**Images with restrictive CORS.** With `useBase64URL: true`, `docx-preview` inlines embedded images as data URLs and CORS does not apply. If the option is changed to `false`, externally hosted images will taint the canvas and cause `toDataURL` to throw a `SecurityError`. Do not change this option.

**Very large documents.** Documents with more than ~50 pages may exhaust memory at `scale: 2` because each captured canvas is held in memory before being added to the PDF. For documents this large, the implementation should release each canvas (by setting its reference to null) immediately after `addImage` returns, and consider lowering `renderScale` to 1.5 when page count exceeds a threshold.

**Mixed page orientations.** Documents that switch from portrait to landscape mid-flow are handled by the per-page dimension calculation in Section 6.4. Do not assume all pages share the first page's dimensions.

**Rapid file changes.** If the user drops a second file while the first is still converting, the in-flight conversion must be cancelled or its results discarded. The simplest approach is to track an incrementing conversion ID; results from a non-current ID are ignored on completion. When conversions finish in the order they started, the second result simply overwrites the first, so the guard mainly prevents stale progress updates from confusing the status display. It also protects against a slow first conversion finishing after the second and replacing the newer result.

## 9. Performance Considerations

For a typical 5-page A4 document, end-to-end conversion on mid-range 2024 hardware takes 1.5–3 seconds. The dominant cost is `html2canvas` capture, which scales roughly linearly with page count and quadratically with `renderScale`. The `docx-preview` rendering stage typically takes 100–300 ms regardless of page count. PDF assembly is negligible.

Memory peaks during the capture loop, holding one canvas worth of pixels per page until added to the PDF.
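
The capture-loop memory cost can be estimated with simple arithmetic. The sketch below is illustrative only and rests on assumptions not stated in this spec: 96 CSS pixels per inch, 4 bytes per RGBA pixel, and a canvas backing store of exactly page size times scale. The function names `canvasBytes`, `peakCaptureBytes`, and `toMiB` are hypothetical.

```typescript
// Assumptions (not from the spec): 96 CSS px per inch, 4 bytes per RGBA
// pixel, and a canvas whose backing store is exactly page size × scale.
const CSS_PX_PER_INCH = 96;
const BYTES_PER_PIXEL = 4;

/** Bytes of pixel data for one captured page of the given size in inches. */
function canvasBytes(widthIn: number, heightIn: number, scale: number): number {
  const widthPx = Math.round(widthIn * CSS_PX_PER_INCH * scale);
  const heightPx = Math.round(heightIn * CSS_PX_PER_INCH * scale);
  return widthPx * heightPx * BYTES_PER_PIXEL;
}

/**
 * Estimated peak pixel memory for a whole document. When each canvas is
 * released right after it is added to the PDF, at most one page is live.
 */
function peakCaptureBytes(
  pageCount: number,
  widthIn: number,
  heightIn: number,
  scale: number,
  releaseEachCanvas: boolean,
): number {
  const livePages = releaseEachCanvas ? Math.min(pageCount, 1) : pageCount;
  return livePages * canvasBytes(widthIn, heightIn, scale);
}

/** Converts a byte count to mebibytes for display. */
const toMiB = (bytes: number): number => bytes / (1024 * 1024);
```

For example, `toMiB(peakCaptureBytes(20, 8.5, 11, 2, false))` estimates a 20-page US Letter document with every canvas retained. Because both pixel dimensions are multiplied by `scale`, doubling it quadruples the result, which is the quadratic behaviour noted above.
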
At `scale: 2` with US Letter pages (816 × 1056 CSS pixels at 96 px per inch, so 1632 × 2112 canvas pixels), a single canvas is approximately 14 MB of RGBA data. A 20-page document whose canvases are all retained briefly holds ~275 MB before garbage collection.

Output PDF file sizes for a 5-page document at default settings are approximately 1.5–3 MB. Lowering `imageQuality` from 0.95 to 0.85 typically reduces output size by about 30% with no visible degradation; lowering it below 0.80 introduces visible JPEG artifacts on text edges.

## 10. Browser Support

The component targets the current and one prior major version of Chrome, Edge, Firefox, and Safari. Internet Explorer is not supported. The relevant browser features are:

- `File` and `FileReader` APIs (universal since 2014)
- `Blob` and `URL.createObjectURL` (universal since 2014)
- Canvas `toDataURL` with JPEG support (universal since 2012)
- ES2020 syntax, set as the compile target in `tsconfig.json`

`html2canvas` has known limitations rendering certain CSS features (`mix-blend-mode`, `backdrop-filter`, complex `clip-path`) that may affect documents using heavy graphical design. For Word documents this is rarely relevant; standard business documents do not invoke these features.

## 11. Testing

Implementations should be verified against the following test corpus:

| Test document | Asserts |
|---|---|
| Plain prose, 3 pages, A4 | Basic flow; page count and dimensions match |
| Document with one table per page | Tables render with borders and cell shading |
| Mixed portrait and landscape sections | Each PDF page matches its source orientation |
| Document with embedded PNG and JPEG images | Images appear in correct positions |
| Vietnamese-language document with diacritics | All characters render; no missing glyphs |
| Document with header and footer including page numbers | Headers/footers appear on every page |
| Document with bulleted and numbered lists | List markers render with correct indentation |
| 30-page document | Memory does not exceed 500 MB during capture |
| Corrupted `.docx` (truncated zip) | Component shows error and remains usable |

Beyond visual diffing of the rendered preview against the source `.docx` opened in Word, the captured PDF should be opened in a separate PDF reader (Acrobat, Preview, or Firefox's built-in viewer) to confirm that page dimensions, count, and rendered content match. Programmatic visual regression testing of the PDF output is beyond the scope of this spec but can be implemented if needed, for example by rasterising each PDF page with `pdfjs-dist` and comparing the images with `pixelmatch`.

## 12. Known Limitations and Alternatives

The text in the output PDF is rasterised and therefore not selectable, searchable, copyable, or screen-readable. Users who need any of these properties, particularly accessibility for visually impaired users, must use a server-side converter that emits real PDF text objects. Recommended alternatives in decreasing order of fidelity and increasing order of cost:

1. **LibreOffice headless** (`soffice --headless --convert-to pdf`): free, self-hosted, very high fidelity, requires a Linux server with LibreOffice installed. ~1–3 seconds per document.
2. **Aspose.Words Cloud or self-hosted**: paid, very high fidelity, native PDF text output, requires a license.
3. **CloudConvert, ConvertAPI, or similar SaaS**: paid per-document, simple HTTP API, sends document contents to a third party.

The HTML preview produced by `docx-preview` *is* accessible: screen readers can navigate it, text is selectable, and users can zoom. The component's accessibility story is therefore intact for users who don't need the PDF artifact itself.

This component cannot edit, sign, redact, or annotate documents. For those features, evaluate `pdf-lib` (PDF mutation) or `docx` (DOCX generation, which is a different package from `docx-preview`).

## 13. Appendix: Algorithm Pseudocode

For reference, the complete conversion algorithm in about 20 lines:

```
function convert(file, container):
  clear container
  # third argument is docx-preview's optional style container
  await renderAsync(file, container, null, {
    inWrapper: true, breakPages: true, useBase64URL: true,
    experimental: true, renderHeaders: true, renderFooters: true,
    renderFootnotes: true,
  })
  await rAF; await sleep(50)

  pages = container.querySelectorAll("section.docx")
  if pages is empty: pages = container.querySelectorAll("section")
  if pages is empty: throw

  pdf = new jsPDF using pages[0] dimensions in mm
  for each page in pages:
    canvas = await html2canvas(page, scale=2, useCORS=true, bg=white)
    if not first page: pdf.addPage(page dimensions)
    pdf.addImage(canvas.toDataURL("image/jpeg", 0.95), "JPEG", 0, 0, w_mm, h_mm)

  return pdf.output("blob")
```

The pseudocode omits error handling, lifecycle management, and progress reporting, all of which are required in the production implementation per Sections 6.6 and 8.

---

*End of specification.*