sciagent/docs/architecture/platform_architecture.md
Thinh Lam 688fac73e9
sciagent code + Gitea Actions CI/CD
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:38:30 +07:00


ImageHub — Architecture

"GitHub for medical-imaging research datasets." A self-hosted platform for versioning, viewing, de-identifying, and collaborating on imaging datasets (DICOM / NIfTI / WSI), modeled on Gitea's architecture but rebuilt on a Python-centric stack suited to the imaging + ML ecosystem.

"ImageHub" is a placeholder name — rename freely.

This document describes (1) the Gitea patterns we are reproducing, (2) how each maps to the imaging domain, (3) the recommended stack, (4) the subsystems and data model, and (5) an MVP-first roadmap.


1. Design philosophy (inherited from Gitea)

Gitea is worth copying for five structural decisions. We keep all five:

  1. Modular monolith, not microservices. One deployable core app with clear internal layers. You can scale the heavy parts out later (we do — the worker tier) without paying distributed-systems tax up front.
  2. Strict downward layering. cli → api → services → models → core. Dependencies only point down. Business logic lives in services, never in models or HTTP handlers.
  3. Server-rendered UI + progressive enhancement, not a SPA. Pages are rendered server-side; rich client behavior (the image viewer) is embedded as self-contained widgets. This is faster to build, easy to deep-link, and friendly to search engines and printing.
  4. Pluggable infrastructure behind interfaces. Storage, queue, search, cache, and auth are interfaces with swappable drivers (local disk ↔ S3, in-proc ↔ Redis, Postgres FTS ↔ OpenSearch). Same idea as Gitea's modules/storage, modules/queue, modules/indexer.
  5. The domain engine is a first-class subsystem. For Gitea that engine is Git. For us it is the Dataset Versioning Engine — a content-addressed, Merkle-DAG version control system specialized for large imaging files. This is the single most important component and the heart of the product.

What we deliberately change from Gitea:

  • Workers are externalized. Gitea runs background jobs in-process. Imaging jobs (de-identification, format conversion, thumbnailing, ML) are heavy, Python-bound, and sometimes need GPUs — so they run in a separate, scalable worker tier driven by a real queue.
  • All "files" are large binaries. Gitea bolts on Git-LFS for large files; for us large-file handling is the default and only path — every blob is content-addressed and stored in object storage.
  • De-identification & audit are core, not afterthoughts (domain requirement).

2. Concept mapping: Gitea → ImageHub

| Gitea concept | ImageHub equivalent | Notes |
|---|---|---|
| Repository | Dataset | A versioned collection of imaging studies/series + metadata + labels. |
| Git commit | Version (commit) | Immutable snapshot = a content-addressed manifest + parent links. |
| Branch / tag | Branch / tag | e.g. raw, deidentified, train-split-v3; tags for citable releases. |
| Blob / tree | Blob / manifest | Blob = one file (DICOM instance, NIfTI, label). Manifest = the tree of a version. |
| Git-LFS | (native) | Every blob is large; content-addressed object store is the only path. |
| Git transport (SSH/HTTP) | Transport API + CLI/SDK | Resumable chunked upload/download; "have/want" blob negotiation like LFS batch. |
| Pull Request | Change Proposal | Review added/changed/relabeled data before merging into a branch. |
| Diff / code review | Dataset diff + image diff | Added/removed/changed series and label diffs, viewed side-by-side. |
| Issues | Issues / annotation tasks | QC findings, labeling tasks, discussions. |
| Releases | Dataset releases | Frozen, citable snapshots (DOI-friendly); key for research reproducibility. |
| Wiki | Datasheet / data dictionary | Dataset documentation, "Datasheets for Datasets". |
| Actions / act_runner | Pipelines / runners | Event-driven compute: de-id, QC, train/eval; pins exact data version. |
| Webhooks | Webhooks | Same. |
| Code search indexer | Metadata + tag search | Faceted search over modality/body-part/labels; optional image-embedding search. |
| Org / Team / User / RBAC | Org / Team / User / RBAC | Nearly identical; plus dataset access requests / data-use agreements. |
| app.ini + modules/setting | Config system | Typed config from file + env. |
| XORM migrations | Alembic migrations | Ordered, append-only schema migrations. |
| Storage (local/minio/s3) | Object storage | Same abstraction; blobs live here. |
| (minimal in Gitea) | Audit & compliance log | First-class, append-only PHI-access trail. |
| (none) | De-identification engine | Domain-specific; no Gitea analogue. |

3. Recommended stack

Rationale: the medical-imaging and ML ecosystems (pydicom, SimpleITK, nibabel, dcm2niix, highdicom, MONAI, the de-id tooling) are overwhelmingly Python. A single-language core + worker stack removes the model-duplication friction you'd get from a Go core calling Python workers.

| Layer | Choice | Gitea analogue |
|---|---|---|
| Core web/API | Python 3.12 + FastAPI (uvicorn/gunicorn) | routers/ (chi) |
| Templating | Jinja2 + HTMX for progressive enhancement | templates/ |
| Frontend build | Vite + TypeScript | web_src/ + Vite |
| DICOM viewer | Cornerstone3D (DICOM), NiiVue (NIfTI) | embedded widgets |
| ORM / migrations | SQLAlchemy 2.0 + Alembic | XORM + migrations |
| Primary DB | PostgreSQL (single target) | multi-DB → standardize on PG |
| Queue / workers | Redis + Arq (async) or Celery | modules/queue + workers |
| Object storage | S3 / MinIO (self-host) | modules/storage |
| Search | OpenSearch (or Postgres FTS to start) | modules/indexer |
| Cache / pubsub / sessions | Redis | modules/cache, eventsource |
| Auth | Authlib (OIDC/OAuth2) + sessions + API tokens | services/auth |
| Imaging libs | pydicom, highdicom, SimpleITK, nibabel, dcm2niix, Pillow; OpenSlide for WSI | |
| ML integration | MONAI / PyTorch dataset adapters via the SDK | |
| De-id | pydicom + deid (CTP rules) + Presidio (text) + OCR (burned-in PHI) | |
| Client | Python SDK + CLI (imagehub clone/pull/push/commit) | the git client |

Alternative if you want Gitea-grade transport performance: keep a Go core for the API/transport/auth layer and use Python only in the worker tier. Faithful to Gitea, but you maintain two languages and duplicate the dataset/manifest types across the boundary. Recommended only if the upload/download path is your dominant bottleneck. Default to all-Python.


4. Layered architecture

cli/         Admin & ops commands (Typer): serve, migrate, doctor, deid-batch, user-admin
  └─ api/        FastAPI routers — UI pages + REST API + transport endpoints (thin: parse → service → render)
       └─ services/   Business logic: dataset ops, versioning workflows, review, pipelines, de-id orchestration
            └─ models/     SQLAlchemy entities + queries (one module per domain: user, dataset, version, annotation…)
                 └─ core/       Leaf infra & domain engines — MUST NOT import the layers above
                                 ├─ vcs/        ← the Dataset Versioning Engine (the "Git")
                                 ├─ storage/    ← content-addressed blob store over S3/MinIO
                                 ├─ imaging/    ← DICOM/NIfTI parsing, metadata, thumbnails, conversion
                                 ├─ deid/       ← de-identification pipeline stages
                                 ├─ queue/      ← Redis/Arq job abstraction
                                 ├─ index/      ← search abstraction (OpenSearch / PG FTS)
                                 ├─ audit/      ← append-only audit log
                                 ├─ config/     ← typed settings
                                 └─ auth/       ← tokens, sessions, OIDC, permissions

Layer rules (enforce with import-linter, the analogue of Gitea's depguard):

  • core/ is the foundation; it may not import models/, services/, or api/.
  • Cross-entity business logic goes in services/, never in models/.
  • api/ handlers stay thin — no business logic, no direct DB-engine access.
  • Every DB query takes a session/context so it enlists in the request transaction.

5. Core subsystems

5.1 Dataset Versioning Engine (core/vcs) — the heart

A content-addressed Merkle DAG, like Git, specialized for large imaging files.

  • Blob store. Every file is hashed (SHA-256) and stored once in object storage at blobs/<aa>/<bb>/<hash>. Identical files across versions/datasets dedupe for free (huge win — imaging datasets share many instances).
  • Manifest (tree). A version's manifest lists logical_path → {blob_hash, size, media_type, imaging_meta}. The manifest is itself content-addressed.
  • Commit. {manifest_hash, parents[], author, timestamp, message}. The parent chain is the history DAG.
  • Refs. Branches/tags map name → commit_id, stored in Postgres (not in object storage) so they're transactional and queryable.
  • Transport / negotiation. On push, the client hashes locally and asks the server which blobs are missing ("have/want", like the LFS batch API), uploads only those (resumable, chunked), then posts the commit. Pull is the reverse.
  • Diff. Compare two manifests → added / removed / modified entries; surfaced in the UI as a dataset diff and, per-image, as a viewer side-by-side.
  • Merge. Three-way path-level merge of manifests; conflicts when the same path changed on both sides. Label/annotation merges can be semantic.
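
The blob, manifest, and diff bullets above can be sketched with the standard library alone. This is an illustration under stated assumptions: `<aa>` and `<bb>` are read as the first and second pairs of hex characters of the hash, manifests are hashed as canonical JSON, and an entry counts as modified when its `blob_hash` differs. The function names are hypothetical.

```python
import hashlib
import json


def blob_hash(data: bytes) -> str:
    """SHA-256 content hash of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def blob_key(digest: str) -> str:
    """Object-store key in the blobs/<aa>/<bb>/<hash> layout (assumed prefix split)."""
    return f"blobs/{digest[:2]}/{digest[2:4]}/{digest}"


def manifest_hash(manifest: dict) -> str:
    """Hash a manifest via canonical JSON so key order does not matter."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_manifests(old: dict, new: dict) -> dict:
    """Compare two manifests keyed by logical path."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(
        path
        for path in old.keys() & new.keys()
        if old[path]["blob_hash"] != new[path]["blob_hash"]
    )
    return {"added": added, "removed": removed, "modified": modified}
```

Because the key is derived only from content, two versions that contain the same file resolve to the same object, which is where the deduplication comes from.
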

Build vs. buy: building this custom gives full control and the cleanest domain fit (recommended). If you need to move faster, back it with lakeFS (git-like branches/commits/merge over S3) or DVC, and keep your manifest API as the stable interface so you can swap the backend later.
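
Whichever backend is chosen, the path-level three-way merge from the list above is small enough to sketch. The version below is a simplification: each manifest maps a logical path directly to a blob hash, a missing path means the file is absent, and conflicted paths are reported rather than resolved. The function name is hypothetical.

```python
def merge_manifests(base: dict, ours: dict, theirs: dict):
    """Return (merged, conflicts) for a path-level three-way merge."""
    merged = {}
    conflicts = []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o  # both sides agree (including both deleted)
        elif o == b:
            result = t  # only theirs changed this path
        elif t == b:
            result = o  # only ours changed this path
        else:
            conflicts.append(path)  # both sides changed it differently
            continue
        if result is not None:
            merged[path] = result  # None means the path was deleted
    return merged, conflicts
```

Conflicted paths are left out of `merged` so the caller has to decide explicitly; a semantic label merge would replace the `else` branch for annotation paths.
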

5.2 Object storage (core/storage)

Driver interface (put/get/stat/delete/presign) with local and s3/minio implementations, following Gitea's modules/storage pattern. Stores blobs, manifests, thumbnails, pipeline artifacts. Presigned URLs let clients upload and download directly against S3 for large transfers, bypassing the app.
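
A minimal sketch of what such a driver interface could look like, with a local-disk implementation. The five method names come from the text; the signatures, the `StorageDriver` and `LocalStorage` class names, and the choice to return a `file://` URI from the local `presign` are assumptions for illustration. Key validation (for example rejecting `..`) is omitted.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Protocol


class StorageDriver(Protocol):
    """Hypothetical driver contract; signatures are assumed, not specified."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def stat(self, key: str) -> int | None: ...
    def delete(self, key: str) -> None: ...
    def presign(self, key: str, expires_s: int = 3600) -> str: ...


class LocalStorage:
    """Local-disk driver for development and tests."""

    def __init__(self, root: str | os.PathLike) -> None:
        self.root = Path(root)

    def _path(self, key: str) -> Path:
        return self.root / key

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_bytes(data)
        os.replace(tmp, path)  # rename into place so readers never see a partial file

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()

    def stat(self, key: str) -> int | None:
        """Return the size in bytes, or None when the key does not exist."""
        path = self._path(key)
        return path.stat().st_size if path.exists() else None

    def delete(self, key: str) -> None:
        self._path(key).unlink(missing_ok=True)

    def presign(self, key: str, expires_s: int = 3600) -> str:
        # No real signing on local disk; a file URI stands in for a presigned URL.
        return self._path(key).resolve().as_uri()
```

An S3/MinIO driver would satisfy the same protocol, with `presign` delegating to the object store's own presigned-URL mechanism.
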

5.3 Ingestion & processing pipeline (core/queue + workers)

On upload, enqueue jobs; workers (Arq) process them:

  1. Verify checksums, store blobs (dedup).
  2. Extract metadata (pydicom/nibabel): modality, body part, study/series UIDs, dimensions, acquisition params → indexed + linked to blobs.
  3. Thumbnails / previews for the browse UI.
  4. De-identification (§5.4).
  5. Format normalization (optional: DICOM→NIfTI via dcm2niix for ML).
  6. Commit the resulting version; update search index; write audit entries.

Workers scale independently; GPU nodes handle ML jobs.

5.4 De-identification engine (core/deid) — compliance must-have

A configurable, multi-stage pipeline producing a deidentified branch from a raw/PHI version:

  • Tag de-id per DICOM PS3.15 Annex E confidentiality profiles: remove/replace PHI tags, regenerate UIDs consistently (so series stay linked), handle private tags.
  • Date shifting: consistent per-patient offset to preserve intervals.
  • Burned-in pixel PHI: OCR (Tesseract/EasyOCR) to detect text in pixels, redact, and flag for human review.
  • Free-text / report de-id: Presidio NER over any text fields/reports.
  • Re-identification map (only if policy allows): the original↔pseudonym mapping is encrypted, access-restricted, and fully audited; otherwise the PHI source is dropped.
  • Verification stage emits a report of exactly what changed.

Tooling: pydicom, Stanford deid / MIRC CTP rule sets, Presidio, an OCR engine. Profiles are configurable per org/dataset.
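
Two of the stages above, consistent date shifting and consistent UID regeneration, can be sketched with keyed hashes. This is one possible approach, not the behavior of any of the listed tools: the per-deployment `secret`, the 365-day window, and the function names are assumptions. The regenerated UIDs use the `2.25.` root followed by a 128-bit integer, which keeps them within the 64-character DICOM UID limit.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset_days(secret: bytes, patient_id: str, max_days: int = 365) -> int:
    """Derive a stable offset in [-max_days, -1] from a keyed hash of the patient ID."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    return -(int.from_bytes(digest[:4], "big") % max_days + 1)


def shift_date(secret: bytes, patient_id: str, d: date) -> date:
    """Shift a date by the patient's offset; intervals within a patient are preserved."""
    return d + timedelta(days=patient_offset_days(secret, patient_id))


def remap_uid(secret: bytes, uid: str) -> str:
    """Map an original UID to a pseudonymous one; the same input always maps the same way."""
    digest = hmac.new(secret, uid.encode("ascii"), hashlib.sha256).digest()
    return "2.25." + str(int.from_bytes(digest[:16], "big"))
```

Because the same original UID always maps to the same replacement, instances that shared a study or series UID before de-identification still share one afterwards.
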

5.5 Web viewer (api + embedded TS widgets)

Progressive-enhancement widgets (not a separate SPA), true to Gitea:

  • Cornerstone3D for DICOM (multi-frame, MPR, windowing, measurements, segmentation overlays).
  • NiiVue for NIfTI volumes (great for neuro/research).
  • OpenSlide-backed deep-zoom tiles for whole-slide pathology (optional).

The server exposes a frame/tile API (a WADO-RS-like read path even without full DICOMweb). Annotations are structured objects (DICOM SR or JSON), versioned with the dataset.

5.6 Search & discovery (core/index)

Index extracted metadata + labels → faceted search ("brain MRI, T1, age<40, has tumor label"). Start on Postgres FTS; graduate to OpenSearch for scale. Optional later: compute image embeddings (a foundation model) → pgvector for "find similar studies/lesions".

5.7 Collaboration (services)

Change Proposals (PRs), reviews, issues, comments, annotation tasks, releases, datasheets — the GitHub social layer, mapped to datasets. A reviewer of a Change Proposal sees the dataset diff and can open the viewer on changed series.

5.8 Pipelines & runners (Actions analogue, optional/advanced)

Event-driven compute (on: push | proposal | tag | schedule) executed by runners (containers that poll for jobs, à la act_runner). Use cases: auto de-id, QC/validation, dataset statistics, training/eval with MONAI. Each run pins the dataset version hash, giving reproducible ML by construction.

5.9 Auth, permissions, audit (core/auth, core/audit)

  • OIDC/OAuth2 login, sessions, scoped API tokens.
  • Org → Team → permission model; dataset visibility private | internal | public; dataset-level access requests / data-use agreements.
  • Audit log: append-only Postgres table (actor, action, object, dataset, version, IP, purpose-of-use, timestamp). Every PHI-bearing access (view original, download) is logged; optional hash-chaining for tamper-evidence; retention + legal-hold support.
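
The optional hash-chaining mentioned above can be sketched as follows: each row stores the hash of the previous row, so editing or removing an earlier entry breaks every later link. The row layout, the all-zero genesis value, and the function names are assumptions; in the real system the rows would live in the Postgres audit table rather than in a list.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first row's prev_hash


def entry_hash(prev_hash: str, entry: dict) -> str:
    """Hash the previous row's hash together with this entry's canonical JSON."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list, entry: dict) -> None:
    """Append an entry, linking it to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "prev_hash": prev, "hash": entry_hash(prev, entry)})


def verify(log: list) -> bool:
    """Recompute the chain from the start; False means some row was altered."""
    prev = GENESIS
    for row in log:
        if row["prev_hash"] != prev or row["hash"] != entry_hash(prev, row["entry"]):
            return False
        prev = row["hash"]
    return True
```

The chain makes tampering detectable, not impossible; anchoring the latest hash somewhere outside the database is what gives the check teeth.
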

5.10 API, SDK, CLI

  • REST API (FastAPI, OpenAPI-documented — the swagger analogue).
  • Python SDK (the most important client for ML users): pull a pinned version straight into a torch/MONAI Dataset.
  • CLI (imagehub clone/pull/push/checkout/commit/diff) — the git/dvc analogue for data engineers.
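
From the client's side, the "have/want" negotiation behind `push` might look like the sketch below. The server is an in-memory stand-in, and the method names `missing` and `upload` are hypothetical; chunking, resumption, and the final commit post are left out.

```python
import hashlib


class FakeServer:
    """In-memory stand-in for the transport API (hypothetical method names)."""

    def __init__(self) -> None:
        self.blobs = {}

    def missing(self, hashes: list) -> list:
        """Return the subset of hashes the server does not already hold."""
        return [h for h in hashes if h not in self.blobs]

    def upload(self, digest: str, data: bytes) -> None:
        """Store a blob after verifying its checksum."""
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("checksum mismatch")
        self.blobs[digest] = data


def push(server: FakeServer, files: dict):
    """Hash locally, upload only missing blobs, return (path->hash map, upload count)."""
    local = {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}
    by_hash = {local[path]: files[path] for path in files}  # dedupe identical content
    wanted = server.missing(sorted(by_hash))
    for digest in wanted:
        server.upload(digest, by_hash[digest])
    return local, len(wanted)
```

Pushing a version that shares most files with an earlier one transfers only the new blobs, since the server reports the rest as already present.
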

6. Data model (core tables)

user, organization, team, team_membership, team_access
dataset(id, owner_id, name, visibility, default_branch, description)
ref(dataset_id, name, type[branch|tag], commit_id)             -- transactional refs
commit(id, dataset_id, manifest_hash, parent_ids[], author_id, message, created_at)
blob(hash PK, size, storage_key, media_type, refcount)         -- content-addressed, deduped
manifest(hash PK, storage_key)                                 -- stored in object store, hash in DB
instance_meta(blob_hash, dataset_id, study_uid, series_uid, modality, body_part, dims, params…)
annotation(id, dataset_id, commit_id, target, type, payload, author_id)
label_schema(id, dataset_id, spec)         label(id, schema_id, value)
change_proposal(id, dataset_id, src_ref, dst_ref, status)  review, comment
issue(id, dataset_id, …)  issue_comment
release(id, dataset_id, tag, notes, doi?)
pipeline(id, dataset_id, spec)  pipeline_run(id, pipeline_id, commit_id, status, artifacts)  runner
webhook  webhook_delivery
audit_log(id, actor_id, action, object_type, object_id, dataset_id, ip, purpose, created_at)  -- append-only
access_request, data_use_agreement
phi_map(dataset_id, original_ref, pseudonym, …)               -- encrypted, restricted, audited
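
One reason to keep `ref` in the database is that moving a branch can be a single compare-and-swap statement. The sketch below uses the standard-library `sqlite3` module as a stand-in for Postgres; the column set follows the `ref` row above, while the primary key choice and the `update_ref` helper are assumptions.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a ref table (SQLite stand-in for Postgres)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE ref ("
        " dataset_id INTEGER NOT NULL,"
        " name TEXT NOT NULL,"
        " type TEXT NOT NULL,"
        " commit_id TEXT NOT NULL,"
        " PRIMARY KEY (dataset_id, name))"
    )
    return db


def update_ref(db: sqlite3.Connection, dataset_id: int, name: str,
               expected: str, new: str) -> bool:
    """Move a ref from `expected` to `new`; False means it was moved by someone else."""
    with db:  # commits on success, rolls back if an exception is raised
        cur = db.execute(
            "UPDATE ref SET commit_id = ? "
            "WHERE dataset_id = ? AND name = ? AND commit_id = ?",
            (new, dataset_id, name, expected),
        )
        return cur.rowcount == 1
```

A push that loses the race gets `False` back and can re-fetch, merge, and retry, much like a rejected non-fast-forward push in Git.
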

7. Key flows

  1. Ingest & de-identify: upload → blobs stored (deduped) → metadata extracted → de-id pipeline → new commit on deidentified branch → indexed → audited.
  2. Browse & view: datasets list → dataset → series list → Cornerstone3D/NiiVue streams frames → annotation overlays.
  3. Curate an ML subset (zero-copy): faceted query → new branch/dataset whose manifest references existing blobs (no data copied) → commit → tag a release → sdk.pull(tag) in training.
  4. Propose a change (PR): push new/relabeled data to a branch → open Change Proposal → reviewer sees dataset diff + image diff → approve → merge.
  5. Reproducible training: tag triggers a pipeline that pins the version hash, runs MONAI train/eval, and links metrics + model artifact to that exact data version.
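
Flow 3 is "zero-copy" because a curated subset is just a new manifest pointing at blobs that already exist. A minimal sketch, assuming manifest entries carry `blob_hash` and `imaging_meta` as described in §5.1; the helper names and the `modality` key are hypothetical.

```python
def curate(manifest: dict, predicate) -> dict:
    """Select manifest entries by metadata without touching any blob data."""
    return {path: entry for path, entry in manifest.items() if predicate(entry)}


def new_blobs(subset: dict, source: dict) -> set:
    """Blob hashes the subset needs that the source manifest does not already hold."""
    have = {entry["blob_hash"] for entry in source.values()}
    return {entry["blob_hash"] for entry in subset.values()} - have


# Example: keep only MR instances from a mixed-modality version.
source = {
    "s1/001.dcm": {"blob_hash": "h1", "imaging_meta": {"modality": "MR"}},
    "s2/001.dcm": {"blob_hash": "h2", "imaging_meta": {"modality": "CT"}},
    "s3/001.dcm": {"blob_hash": "h3", "imaging_meta": {"modality": "MR"}},
}
mr_only = curate(source, lambda e: e["imaging_meta"]["modality"] == "MR")
```

Here `new_blobs(mr_only, source)` is empty, so committing the subset stores one new manifest and no image data.
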

8. Deployment topology

            ┌──────────── reverse proxy (Caddy/Traefik) + TLS ────────────┐
            │                                                              │
   ┌────────▼────────┐    ┌──────────────────┐    ┌───────────────────────▼─┐
   │  Core app (N×)  │    │  Worker tier (M×) │    │  Runners (K×, GPU opt.)  │
   │  FastAPI/uvicorn│    │  Arq + imaging/ML │    │  pipelines (train/eval)  │
   └───┬─────┬───┬───┘    └───┬─────────┬─────┘    └────────────┬────────────┘
       │     │   │            │         │                       │
   ┌───▼─┐ ┌─▼─┐ │        ┌───▼───┐ ┌───▼────────┐         ┌────▼────┐
   │ PG  │ │Redis│ └──────▶│ Redis │ │ Object store│◀────────┤Object st.│
   │(state│ │queue│        │ queue │ │ S3 / MinIO  │         │ (blobs) │
   │ refs)│ │cache│        └───────┘ │ (blobs)     │         └─────────┘
   └─────┘ └────┘                    └─────────────┘
                          ┌───────────────┐
                          │  OpenSearch    │  (metadata/label search)
                          └───────────────┘
  • Core app: stateless, horizontally scalable.
  • Worker tier: scales independently; CPU for de-id/convert, GPU for ML.
  • Postgres: state, refs, metadata, audit. Redis: queue, cache, sessions, server-sent events. Object storage: all blobs. OpenSearch: search.
  • Dev / small self-host: a single docker-compose (app + worker + PG + Redis + MinIO + OpenSearch). Scale: Kubernetes with separate node pools.
  • Contrast with Gitea (one binary, in-proc workers): we externalize workers and object storage because imaging/ML work is heavy, Python-bound, and GPU-hungry.

9. Build-vs-buy summary

| Component | Recommendation |
|---|---|
| Versioning engine | Build the manifest/commit model (custom), or back it with lakeFS/DVC behind your API to ship faster. |
| Viewer | Adopt Cornerstone3D + NiiVue (+ OpenSlide for WSI). Don't build. |
| De-identification | Assemble from pydicom + deid/CTP rules + Presidio + OCR. Don't build from scratch. |
| Search | Postgres FTS first → OpenSearch at scale. |
| Auth | Authlib (OIDC). |
| Queue | Arq (async) or Celery. |
| Object storage | MinIO self-host / S3 cloud. |

10. MVP-first roadmap

Ordered for the chosen must-haves (versioning + viewer + de-id + audit):

  • Phase 0 — Skeleton. Layered project structure, config, Postgres + Alembic, object-storage driver, auth (user/org/team), dataset CRUD.
  • Phase 1 — Versioning engine. Blobs, manifests, commits, branches; push/pull via CLI + SDK; dataset diff. (This is the product's spine — invest here.)
  • Phase 2 — Ingestion + de-id + audit. Worker tier, metadata extraction, de-identification pipeline, append-only audit log. (The compliance core.)
  • Phase 3 — Viewer + search. Cornerstone3D/NiiVue widgets, thumbnails, faceted metadata search, browse UI.
  • Phase 4 — Collaboration. Change Proposals, reviews, issues, annotations, citable releases, datasheets.
  • Phase 5 — Pipelines. Runners, event triggers, reproducible MONAI train/eval, webhooks.
  • Later / optional. DICOMweb + PACS adapter (QIDO/WADO/STOW), image-embedding similarity search (pgvector), whole-slide pathology.

Appendix — naming parallels for orientation

git clone → imagehub clone · repository → dataset · commit → version · push/pull → push/pull · PR → change proposal · .git/objects → content-addressed blob store · act_runner → pipeline runner · app.ini → config · XORM → SQLAlchemy.