# ImageHub: Architecture

> **"GitHub for medical-imaging research datasets."** A self-hosted platform for
> versioning, viewing, de-identifying, and collaborating on imaging datasets
> (DICOM / NIfTI / WSI), modeled on Gitea's architecture but rebuilt on a
> Python-centric stack suited to the imaging + ML ecosystem.
>
> *"ImageHub" is a placeholder name; rename freely.*

This document describes (1) the Gitea patterns we are reproducing, (2) how each
maps to the imaging domain, (3) the recommended stack, (4) the subsystems and
data model, and (5) an MVP-first roadmap.

---

## 1. Design philosophy (inherited from Gitea)

Gitea is worth copying for five structural decisions. We keep all five:

1. **Modular monolith, not microservices.** One deployable core app with clear
   internal layers. The heavy parts can be scaled out later (we do this with the
   worker tier) without paying the distributed-systems tax up front.
2. **Strict downward layering.** `cli → api → services → models → core`.
   Dependencies only point down. Business logic lives in `services`, never in
   models or HTTP handlers.
3. **Server-rendered UI + progressive enhancement, not a SPA.** Pages are
   rendered server-side; rich client behavior (the image viewer) is embedded as
   self-contained widgets. This is faster to build, easy to deep-link, and
   friendly to search engines and printing.
4. **Pluggable infrastructure behind interfaces.** Storage, queue, search,
   cache, and auth are interfaces with swappable drivers (local disk ↔ S3,
   in-proc ↔ Redis, Postgres FTS ↔ OpenSearch). Same idea as Gitea's
   `modules/storage`, `modules/queue`, `modules/indexer`.
5. **The domain engine is a first-class subsystem.** For Gitea that engine is
   Git. For us it is the **Dataset Versioning Engine**: a content-addressed,
   Merkle-DAG version control system specialized for large imaging files. This is
   the single most important component and the heart of the product.

What we deliberately change from Gitea:

- **Workers are externalized.** Gitea runs background jobs in-process. Imaging
  jobs (de-identification, format conversion, thumbnailing, ML) are heavy,
  Python-bound, and sometimes need GPUs, so they run in a separate, scalable
  worker tier driven by a real queue.
- **All "files" are large binaries.** Gitea bolts on Git-LFS for large files; for
  us large-file handling is the *default and only* path: every blob is
  content-addressed and stored in object storage.
- **De-identification & audit are core**, not afterthoughts (domain requirement).

---

## 2. Concept mapping: Gitea → ImageHub

| Gitea concept | ImageHub equivalent | Notes |
|---|---|---|
| Repository | **Dataset** | A versioned collection of imaging studies/series + metadata + labels. |
| Git commit | **Version** (commit) | Immutable snapshot = a content-addressed manifest + parent links. |
| Branch / tag | **Branch / tag** | e.g. `raw`, `deidentified`, `train-split-v3`; tags for citable releases. |
| Blob / tree | **Blob / manifest** | Blob = one file (DICOM instance, NIfTI, label). Manifest = the tree of a version. |
| Git-LFS | *(native)* | Every blob is large; content-addressed object store is the only path. |
| Git transport (SSH/HTTP) | **Transport API + CLI/SDK** | Resumable chunked upload/download; "have/want" blob negotiation like LFS batch. |
| Pull Request | **Change Proposal** | Review added/changed/relabeled data before merging into a branch. |
| Diff / code review | **Dataset diff + image diff** | Added/removed/changed series and label diffs, viewed side-by-side. |
| Issues | **Issues / annotation tasks** | QC findings, labeling tasks, discussions. |
| Releases | **Dataset releases** | Frozen, citable snapshots (DOI-friendly); key for research reproducibility. |
| Wiki | **Datasheet / data dictionary** | Dataset documentation, "Datasheets for Datasets". |
| Actions / act_runner | **Pipelines / runners** | Event-driven compute: de-id, QC, train/eval; pins exact data version. |
| Webhooks | **Webhooks** | Same. |
| Code search indexer | **Metadata + tag search** | Faceted search over modality/body-part/labels; optional image-embedding search. |
| Org / Team / User / RBAC | **Org / Team / User / RBAC** | Nearly identical; plus dataset access requests / data-use agreements. |
| `app.ini` + `modules/setting` | **Config system** | Typed config from file + env. |
| XORM migrations | **Alembic migrations** | Ordered, append-only schema migrations. |
| Storage (local/minio/s3) | **Object storage** | Same abstraction; blobs live here. |
| *(minimal in Gitea)* | **Audit & compliance log** | First-class, append-only PHI-access trail. |
| *(none)* | **De-identification engine** | Domain-specific; no Gitea analogue. |

---

## 3. Recommended stack ("own stack", Python-centric)

Rationale: the medical-imaging and ML ecosystems (pydicom, SimpleITK, nibabel,
dcm2niix, highdicom, MONAI, the de-id tooling) are overwhelmingly Python. A
single-language core + worker stack removes the model-duplication friction you'd
get from a Go core calling Python workers.

| Layer | Choice | Gitea analogue |
|---|---|---|
| Core web/API | **Python 3.12 + FastAPI** (uvicorn/gunicorn) | `routers/` (chi) |
| Templating | **Jinja2 + HTMX** for progressive enhancement | `templates/` |
| Frontend build | **Vite + TypeScript** | `web_src/` + Vite |
| Image viewer | **Cornerstone3D** (DICOM), **NiiVue** (NIfTI) | embedded widgets |
| ORM / migrations | **SQLAlchemy 2.0 + Alembic** | XORM + migrations |
| Primary DB | **PostgreSQL** (single target) | multi-DB → standardize on PG |
| Queue / workers | **Redis + Arq** (async) or Celery | `modules/queue` + workers |
| Object storage | **S3 / MinIO** (self-host) | `modules/storage` |
| Search | **OpenSearch** (or Postgres FTS to start) | `modules/indexer` |
| Cache / pubsub / sessions | **Redis** | `modules/cache`, eventsource |
| Auth | **Authlib** (OIDC/OAuth2) + sessions + API tokens | `services/auth` |
| Imaging libs | pydicom, highdicom, SimpleITK, nibabel, dcm2niix, Pillow; OpenSlide for WSI | *(none)* |
| ML integration | MONAI / PyTorch dataset adapters via the SDK | *(none)* |
| De-id | pydicom + `deid` / CTP rule sets + Presidio (text) + OCR (burned-in PHI) | *(none)* |
| Client | **Python SDK + CLI** (`imagehub clone/pull/push/commit`) | the `git` client |

> **Alternative if you want Gitea-grade transport performance:** keep a **Go
> core** for the API/transport/auth layer and use **Python only in the worker
> tier**. Faithful to Gitea, but you maintain two languages and duplicate the
> dataset/manifest types across the boundary. Recommended only if the
> upload/download path is your dominant bottleneck. Default to all-Python.

---

## 4. Layered architecture

```
cli/                      Admin & ops commands (Typer): serve, migrate, doctor, deid-batch, user-admin
 └─ api/                  FastAPI routers: UI pages + REST API + transport endpoints (thin: parse → service → render)
     └─ services/         Business logic: dataset ops, versioning workflows, review, pipelines, de-id orchestration
         └─ models/       SQLAlchemy entities + queries (one module per domain: user, dataset, version, annotation…)
             └─ core/     Leaf infra & domain engines; MUST NOT import the layers above
                 ├─ vcs/       ← the Dataset Versioning Engine (the "Git")
                 ├─ storage/   ← content-addressed blob store over S3/MinIO
                 ├─ imaging/   ← DICOM/NIfTI parsing, metadata, thumbnails, conversion
                 ├─ deid/      ← de-identification pipeline stages
                 ├─ queue/     ← Redis/Arq job abstraction
                 ├─ index/     ← search abstraction (OpenSearch / PG FTS)
                 ├─ audit/     ← append-only audit log
                 ├─ config/    ← typed settings
                 └─ auth/      ← tokens, sessions, OIDC, permissions
```

**Layer rules (enforce with import-linter, the analogue of Gitea's depguard):**
- `core/` is the foundation; it may not import `models/`, `services/`, or `api/`.
- Cross-entity business logic goes in `services/`, never in `models/`.
- `api/` handlers stay thin: no business logic, no direct DB-engine access.
- Every DB query takes a `session`/context so it enlists in the request transaction.

---

## 5. Core subsystems

### 5.1 Dataset Versioning Engine (`core/vcs`): the heart

A content-addressed Merkle DAG, like Git, specialized for large imaging files.

- **Blob store.** Every file is hashed (SHA-256) and stored once in object
  storage at `blobs/<aa>/<bb>/<hash>`. Identical files across versions/datasets
  dedupe for free (a large win, since imaging datasets share many instances).
- **Manifest (tree).** A version's manifest lists `logical_path → {blob_hash,
  size, media_type, imaging_meta}`. The manifest is itself content-addressed.
- **Commit.** `{manifest_hash, parents[], author, timestamp, message}`. The
  parent chain is the history DAG.
- **Refs.** Branches/tags map `name → commit_id`, stored in **Postgres** (not in
  object storage) so they're transactional and queryable.
- **Transport / negotiation.** On push, the client hashes locally and asks the
  server which blobs are missing ("have/want", like the LFS batch API), uploads
  only those (resumable, chunked), then posts the commit. Pull is the reverse.
- **Diff.** Compare two manifests → added / removed / modified entries; surfaced
  in the UI as a dataset diff and, per image, as a side-by-side view in the viewer.
- **Merge.** Three-way path-level merge of manifests; a conflict arises when the
  same path changed on both sides. Label/annotation merges can be semantic.

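The blob, manifest, commit, and negotiation bullets above can be sketched with the standard library alone. This is an illustration under assumptions, not the engine's actual format: it assumes `<aa>` and `<bb>` are the first and second pairs of hex characters of the digest, that manifests and commits are hashed as canonical JSON, and that a commit id is the hash of the commit body. All function names are hypothetical.

```python
import hashlib
import json


def blob_key(data: bytes) -> tuple[str, str]:
    """Hash file bytes and derive the object-store key blobs/<aa>/<bb>/<hash>."""
    digest = hashlib.sha256(data).hexdigest()
    return digest, f"blobs/{digest[:2]}/{digest[2:4]}/{digest}"


def canonical_hash(obj: dict) -> str:
    """Hash a JSON-serializable object in a canonical (sorted, compact) form."""
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_commit(manifest: dict, parents: list[str], author: str,
                timestamp: str, message: str) -> dict:
    """Build a commit record whose id is the hash of its own contents."""
    body = {
        "manifest_hash": canonical_hash(manifest),
        "parents": parents,
        "author": author,
        "timestamp": timestamp,
        "message": message,
    }
    return {"id": canonical_hash(body), **body}


def missing_blobs(client_hashes: set[str], server_hashes: set[str]) -> set[str]:
    """Server side of have/want: the client's blobs that still need uploading."""
    return client_hashes - server_hashes
```
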
**Build vs. buy:** building this custom gives full control and the cleanest
domain fit (recommended). If you need to move faster, back it with **lakeFS**
(git-like branches/commits/merge over S3) or **DVC**, and keep your manifest API
as the stable interface so you can swap the backend later.

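Whichever backend sits behind the manifest API, the diff and merge rules described for `core/vcs` operate on manifests only. A minimal sketch, assuming a manifest is simplified to a `path → blob_hash` mapping and that a deletion on one side combined with no change on the other resolves to a deletion (function names are hypothetical):

```python
def diff_manifests(old: dict, new: dict) -> dict[str, list[str]]:
    """Classify paths as added, removed, or modified between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }


def merge_manifests(base: dict, ours: dict, theirs: dict) -> tuple[dict, list[str]]:
    """Three-way path-level merge; returns (merged manifest, conflicting paths)."""
    merged, conflicts = {}, []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:      # both sides agree (including both deleted)
            winner = o
        elif o == b:    # only theirs changed
            winner = t
        elif t == b:    # only ours changed
            winner = o
        else:           # both sides changed the same path differently
            conflicts.append(path)
            continue
        if winner is not None:  # None means the path was deleted
            merged[path] = winner
    return merged, conflicts
```

Conflicting paths are left out of the merged manifest so a reviewer can resolve them explicitly; a semantic label merge would replace the final `else` branch.
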
### 5.2 Object storage (`core/storage`)
Driver interface (`put/get/stat/delete/presign`) with `local` and `s3/minio`
implementations, following Gitea's `modules/storage` pattern. It stores blobs,
manifests, thumbnails, and pipeline artifacts. Presigned URLs let clients upload
and download directly against S3 for large transfers, bypassing the app.

### 5.3 Ingestion & processing pipeline (`core/queue` + workers)
On upload, enqueue jobs; workers (Arq) process them:
1. Verify checksums, store blobs (dedup).
2. **Extract metadata** (pydicom/nibabel): modality, body part, study/series UIDs,
   dimensions, acquisition params → indexed + linked to blobs.
3. **Thumbnails / previews** for the browse UI.
4. **De-identification** (§5.4).
5. **Format normalization** (optional: DICOM→NIfTI via dcm2niix for ML).
6. Commit the resulting version; update search index; write audit entries.

Workers scale independently; GPU nodes handle ML jobs.

### 5.4 De-identification engine (`core/deid`): compliance must-have
A configurable, multi-stage pipeline producing a `deidentified` branch from a
`raw`/PHI version:
- **Tag de-id** per **DICOM PS3.15 Annex E** confidentiality profiles:
  remove/replace PHI tags, regenerate UIDs *consistently* (so series stay linked),
  handle private tags.
- **Date shifting**: consistent per-patient offset to preserve intervals.
- **Burned-in pixel PHI**: OCR (Tesseract/EasyOCR) to detect text in pixels,
  redact, and flag for human review.
- **Free-text / report de-id**: Presidio NER over any text fields/reports.
- **Re-identification map** (only if policy allows): the original↔pseudonym
  mapping is encrypted, access-restricted, and fully audited; otherwise the PHI
  source is dropped.
- **Verification stage** emits a report of exactly what changed.

Tooling: pydicom, Stanford `deid` / MIRC CTP rule sets, Presidio, an OCR engine.
Profiles are configurable per org/dataset.

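Two of the stages above, date shifting and consistent UID regeneration, both rely on a deterministic mapping. One possible way to get that (an assumption here, not a statement of how `deid` or CTP do it) is a keyed hash: the same secret and input always yield the same output, so intervals and series links survive. The function names and the 365-day bound are illustrative; the `2.25.` prefix is the DICOM root for UUID-derived UIDs, and the sketch simply treats 128 bits of the hash as the numeric suffix.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset_days(secret: bytes, patient_id: str, max_days: int = 365) -> int:
    """Derive a stable per-patient shift in [1, max_days] from a keyed hash."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % max_days + 1


def shift_date(secret: bytes, patient_id: str, original: date) -> date:
    """Shift a date back by the patient's offset; intervals are preserved."""
    return original - timedelta(days=patient_offset_days(secret, patient_id))


def remap_uid(secret: bytes, original_uid: str) -> str:
    """Map a DICOM UID to a stable pseudonymous UID under the 2.25 root."""
    digest = hmac.new(secret, original_uid.encode("utf-8"), hashlib.sha256).digest()
    return "2.25." + str(int.from_bytes(digest[:16], "big"))
```

Because every date of one patient moves by the same number of days, the gap between two scans is unchanged, and because the UID mapping is a pure function of the secret and the original UID, every instance of a series receives the same new SeriesInstanceUID.
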
### 5.5 Web viewer (`api` + embedded TS widgets)
Progressive-enhancement widgets (not a separate SPA), true to Gitea:
- **Cornerstone3D** for DICOM (multi-frame, MPR, windowing, measurements,
  segmentation overlays).
- **NiiVue** for NIfTI volumes (great for neuro/research).
- **OpenSlide**-backed deep-zoom tiles for whole-slide pathology (optional).

The server exposes a frame/tile API (a WADO-RS-like read path even without full
DICOMweb). Annotations are structured objects (DICOM SR or JSON), **versioned
with the dataset**.

### 5.6 Search & discovery (`core/index`)
Index extracted metadata + labels → faceted search ("brain MRI, T1, age<40, has
tumor label"). Start on **Postgres FTS**; graduate to **OpenSearch** for scale.
Optional later: compute image embeddings (a foundation model) → **pgvector** for
"find similar studies/lesions".

### 5.7 Collaboration (`services`)
Change Proposals (PRs), reviews, issues, comments, annotation tasks, releases,
datasheets: the GitHub social layer, mapped to datasets. A reviewer of a Change
Proposal sees the dataset diff and can open the viewer on changed series.

### 5.8 Pipelines & runners (Actions analogue, optional/advanced)
Event-driven compute (`on: push | proposal | tag | schedule`) executed by
**runners** (containers that poll for jobs, à la `act_runner`). Use cases: auto
de-id, QC/validation, dataset statistics, **training/eval** with MONAI. Each run
**pins the dataset version hash**, giving reproducible ML by construction.

### 5.9 Auth, permissions, audit (`core/auth`, `core/audit`)
- OIDC/OAuth2 login, sessions, scoped API tokens.
- Org → Team → permission model; dataset visibility `private | internal | public`;
  dataset-level access requests / data-use agreements.
- **Audit log**: append-only Postgres table (actor, action, object, dataset,
  version, IP, purpose-of-use, timestamp). Every PHI-bearing access (view
  original, download) is logged; optional hash-chaining for tamper-evidence;
  retention + legal-hold support.

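The optional hash-chaining mentioned above can be sketched as follows. This is an in-memory illustration, not the Postgres schema: each record stores the previous record's hash, so editing an earlier row invalidates the chain from that point on. `GENESIS`, `append_entry`, and `verify_chain` are hypothetical names, and entries are assumed not to use the keys `prev` or `hash`.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(fields: dict) -> str:
    """Hash a record's fields in a canonical key order."""
    return hashlib.sha256(json.dumps(fields, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], entry: dict) -> dict:
    """Append an audit entry whose hash covers its fields and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    fields = {**entry, "prev": prev_hash}
    record = {**fields, "hash": _digest(fields)}
    log.append(record)
    return record


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash and link; an edited row or broken link fails."""
    prev_hash = GENESIS
    for record in log:
        fields = {k: v for k, v in record.items() if k != "hash"}
        if record["prev"] != prev_hash or _digest(fields) != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True
```

Note that a chain alone does not detect truncation of the newest rows; that would need the latest hash to be anchored somewhere outside the table.
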
### 5.10 API, SDK, CLI
- **REST API** (FastAPI, OpenAPI-documented; the swagger analogue).
- **Python SDK** (the most important client for ML users): pull a pinned version
  straight into a `torch`/MONAI `Dataset`.
- **CLI** (`imagehub clone/pull/push/checkout/commit/diff`): the `git`/`dvc`
  analogue for data engineers.

---

## 6. Data model (core tables)

```
user, organization, team, team_membership, team_access
dataset(id, owner_id, name, visibility, default_branch, description)
ref(dataset_id, name, type[branch|tag], commit_id)            -- transactional refs
commit(id, dataset_id, manifest_hash, parent_ids[], author_id, message, created_at)
blob(hash PK, size, storage_key, media_type, refcount)        -- content-addressed, deduped
manifest(hash PK, storage_key)                                -- stored in object store, hash in DB
instance_meta(blob_hash, dataset_id, study_uid, series_uid, modality, body_part, dims, params…)
annotation(id, dataset_id, commit_id, target, type, payload, author_id)
label_schema(id, dataset_id, spec)   label(id, schema_id, value)
change_proposal(id, dataset_id, src_ref, dst_ref, status)   review, comment
issue(id, dataset_id, …)   issue_comment
release(id, dataset_id, tag, notes, doi?)
pipeline(id, dataset_id, spec)   pipeline_run(id, pipeline_id, commit_id, status, artifacts)   runner
webhook   webhook_delivery
audit_log(id, actor_id, action, object_type, object_id, dataset_id, ip, purpose, created_at)  -- append-only
access_request, data_use_agreement
phi_map(dataset_id, original_ref, pseudonym, …)               -- encrypted, restricted, audited
```

---

## 7. Key flows

1. **Ingest & de-identify:** upload → blobs stored (deduped) → metadata extracted
   → de-id pipeline → new commit on `deidentified` branch → indexed → audited.
2. **Browse & view:** datasets list → dataset → series list → Cornerstone3D/NiiVue
   streams frames → annotation overlays.
3. **Curate an ML subset (zero-copy):** faceted query → new branch/dataset whose
   manifest *references existing blobs* (no data copied) → commit → tag a release
   → `sdk.pull(tag)` in training.
4. **Propose a change (PR):** push new/relabeled data to a branch → open Change
   Proposal → reviewer sees dataset diff + image diff → approve → merge.
5. **Reproducible training:** tag triggers a pipeline that pins the version hash,
   runs MONAI train/eval, and links metrics + model artifact to that exact data
   version.

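Flow 3 works because a manifest holds references, not data. A minimal sketch, assuming each manifest entry carries a `blob_hash` and an `imaging_meta` dict as in §5.1; `curate_subset`, the paths, the hashes, and the facet names are all made up for illustration.

```python
def curate_subset(manifest: dict[str, dict], **facets: str) -> dict[str, dict]:
    """Select manifest entries whose imaging_meta matches every facet value."""
    return {
        path: entry
        for path, entry in manifest.items()
        if all(entry["imaging_meta"].get(k) == v for k, v in facets.items())
    }


# Hypothetical source manifest: three files, two of them brain MR.
source = {
    "sub-01/t1.nii.gz": {"blob_hash": "aa11", "imaging_meta": {"modality": "MR", "body_part": "BRAIN"}},
    "sub-02/t1.nii.gz": {"blob_hash": "bb22", "imaging_meta": {"modality": "MR", "body_part": "BRAIN"}},
    "sub-03/ct.nii.gz": {"blob_hash": "cc33", "imaging_meta": {"modality": "CT", "body_part": "CHEST"}},
}

# The subset manifest reuses the same blob hashes; no file bytes are copied.
subset = curate_subset(source, modality="MR", body_part="BRAIN")
```
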
---

## 8. Deployment topology

```
           ┌─────── reverse proxy (Caddy/Traefik) + TLS ───────┐
           │                                                   │
┌──────────▼─────────┐    ┌──────────────────┐    ┌────────────▼───────────┐
│ Core app (N×)      │    │ Worker tier (M×) │    │ Runners (K×, GPU opt.) │
│ FastAPI/uvicorn    │    │ Arq + imaging/ML │    │ pipelines (train/eval) │
└────┬─────────────┬─┘    └──┬────────────┬──┘    └───┬────────────────────┘
     │             │         │            │           │
┌────▼───────┐   ┌─▼─────────▼──┐     ┌───▼───────────▼───┐
│ Postgres   │   │ Redis        │     │ Object store      │
│ state/refs │   │ queue, cache │     │ S3 / MinIO, blobs │
└────────────┘   └──────────────┘     └───────────────────┘

┌───────────────┐
│ OpenSearch    │  (metadata/label search)
└───────────────┘
```

- **Core app**: stateless, horizontally scalable.
- **Worker tier**: scales independently; CPU for de-id/convert, GPU for ML.
- **Postgres**: state, refs, metadata, audit. **Redis**: queue, cache, sessions,
  server-sent events. **Object storage**: all blobs. **OpenSearch**: search.
- **Dev / small self-host**: a single `docker-compose` (app + worker + PG +
  Redis + MinIO + OpenSearch). **Scale**: Kubernetes with separate node pools.
- Contrast with Gitea (one binary, in-proc workers): we externalize workers and
  object storage because imaging/ML work is heavy, Python-bound, and GPU-hungry.

---

## 9. Build-vs-buy summary

| Component | Recommendation |
|---|---|
| Versioning engine | **Build** the manifest/commit model (custom), or back it with **lakeFS/DVC** behind your API to ship faster. |
| Viewer | **Adopt** Cornerstone3D + NiiVue (+ OpenSlide for WSI). Don't build. |
| De-identification | **Assemble** from pydicom + `deid`/CTP rules + Presidio + OCR. Don't build from scratch. |
| Search | **Postgres FTS** first → **OpenSearch** at scale. |
| Auth | **Authlib** (OIDC). |
| Queue | **Arq** (async) or **Celery**. |
| Object storage | **MinIO** self-host / **S3** cloud. |

---

## 10. MVP-first roadmap

Ordered for the chosen must-haves (versioning + viewer + de-id + audit):

- **Phase 0: Skeleton.** Layered project structure, config, Postgres + Alembic,
  object-storage driver, auth (user/org/team), dataset CRUD.
- **Phase 1: Versioning engine.** Blobs, manifests, commits, branches; push/pull
  via CLI + SDK; dataset diff. *(This is the product's spine; invest here.)*
- **Phase 2: Ingestion + de-id + audit.** Worker tier, metadata extraction,
  de-identification pipeline, append-only audit log. *(The compliance core.)*
- **Phase 3: Viewer + search.** Cornerstone3D/NiiVue widgets, thumbnails,
  faceted metadata search, browse UI.
- **Phase 4: Collaboration.** Change Proposals, reviews, issues, annotations,
  citable releases, datasheets.
- **Phase 5: Pipelines.** Runners, event triggers, reproducible MONAI train/eval,
  webhooks.
- **Later / optional.** DICOMweb + PACS adapter (QIDO/WADO/STOW), image-embedding
  similarity search (pgvector), whole-slide pathology.

---

## Appendix: naming parallels for orientation

`git clone` → `imagehub clone` · repository → dataset · commit → version ·
push/pull → push/pull · PR → change proposal · `.git/objects` → content-addressed
blob store · act_runner → pipeline runner · `app.ini` → config · XORM → SQLAlchemy.