
# ImageHub — Architecture

> **"GitHub for medical-imaging research datasets."** A self-hosted platform for
> versioning, viewing, de-identifying, and collaborating on imaging datasets
> (DICOM / NIfTI / WSI), modeled on Gitea's architecture but rebuilt on a
> Python-centric stack suited to the imaging + ML ecosystem.
>
> *"ImageHub" is a placeholder name — rename freely.*

This document describes (1) the Gitea patterns we are reproducing, (2) how each
maps to the imaging domain, (3) the recommended stack, (4) the subsystems and
data model, and (5) an MVP-first roadmap.

---

## 1. Design philosophy (inherited from Gitea)

Gitea is worth copying for five structural decisions. We keep all five:

1. **Modular monolith, not microservices.** One deployable core app with clear
   internal layers. You can scale the heavy parts out later (we do — the worker
   tier) without paying distributed-systems tax up front.
2. **Strict downward layering.** `cli → api → services → models → core`.
   Dependencies only point down. Business logic lives in `services`, never in
   models or HTTP handlers.
3. **Server-rendered UI + progressive enhancement, not a SPA.** Pages are
   rendered server-side; rich client behavior (the image viewer) is embedded as
   self-contained widgets. This is faster to build, and every page stays
   deep-linkable, indexable, and printable.
4. **Pluggable infrastructure behind interfaces.** Storage, queue, search,
   cache, and auth are interfaces with swappable drivers (local disk ↔ S3,
   in-proc ↔ Redis, Postgres FTS ↔ OpenSearch). Same idea as Gitea's
   `modules/storage`, `modules/queue`, `modules/indexer`.
5. **The domain engine is a first-class subsystem.** For Gitea that engine is
   Git. For us it is the **Dataset Versioning Engine** — a content-addressed,
   Merkle-DAG version control system specialized for large imaging files. This is
   the single most important component and the heart of the product.

What we deliberately change from Gitea:

- **Workers are externalized.** Gitea runs background jobs in-process. Imaging
  jobs (de-identification, format conversion, thumbnailing, ML) are heavy,
  Python-bound, and sometimes need GPUs — so they run in a separate, scalable
  worker tier driven by a real queue.
- **All "files" are large binaries.** Gitea bolts on Git-LFS for large files; for
  us large-file handling is the *default and only* path — every blob is
  content-addressed and stored in object storage.
- **De-identification & audit are core**, not afterthoughts (domain requirement).

---

## 2. Concept mapping: Gitea → ImageHub

| Gitea concept | ImageHub equivalent | Notes |
|---|---|---|
| Repository | **Dataset** | A versioned collection of imaging studies/series + metadata + labels. |
| Git commit | **Version** (commit) | Immutable snapshot = a content-addressed manifest + parent links. |
| Branch / tag | **Branch / tag** | e.g. `raw`, `deidentified`, `train-split-v3`; tags for citable releases. |
| Blob / tree | **Blob / manifest** | Blob = one file (DICOM instance, NIfTI, label). Manifest = the tree of a version. |
| Git-LFS | *(native)* | Every blob is large; content-addressed object store is the only path. |
| Git transport (SSH/HTTP) | **Transport API + CLI/SDK** | Resumable chunked upload/download; "have/want" blob negotiation like LFS batch. |
| Pull Request | **Change Proposal** | Review added/changed/relabeled data before merging into a branch. |
| Diff / code review | **Dataset diff + image diff** | Added/removed/changed series and label diffs, viewed side-by-side. |
| Issues | **Issues / annotation tasks** | QC findings, labeling tasks, discussions. |
| Releases | **Dataset releases** | Frozen, citable snapshots (DOI-friendly) — key for research reproducibility. |
| Wiki | **Datasheet / data dictionary** | Dataset documentation, "Datasheets for Datasets". |
| Actions / act_runner | **Pipelines / runners** | Event-driven compute: de-id, QC, train/eval; pins exact data version. |
| Webhooks | **Webhooks** | Same. |
| Code search indexer | **Metadata + tag search** | Faceted search over modality/body-part/labels; optional image-embedding search. |
| Org / Team / User / RBAC | **Org / Team / User / RBAC** | Nearly identical; plus dataset access requests / data-use agreements. |
| `app.ini` + `modules/setting` | **Config system** | Typed config from file + env. |
| XORM migrations | **Alembic migrations** | Ordered, append-only schema migrations. |
| Storage (local/minio/s3) | **Object storage** | Same abstraction; blobs live here. |
| *(minimal in Gitea)* | **Audit & compliance log** | First-class, append-only PHI-access trail. |
| *(none)* | **De-identification engine** | Domain-specific; no Gitea analogue. |

---

## 3. Recommended stack ("own stack", Python-centric)

Rationale: the medical-imaging and ML ecosystems (pydicom, SimpleITK, nibabel,
dcm2niix, highdicom, MONAI, the de-id tooling) are overwhelmingly Python. A
single-language core + worker stack removes the model-duplication friction you'd
get from a Go core calling Python workers.

| Layer | Choice | Gitea analogue |
|---|---|---|
| Core web/API | **Python 3.12 + FastAPI** (uvicorn/gunicorn) | `routers/` (chi) |
| Templating | **Jinja2 + HTMX** for progressive enhancement | `templates/` |
| Frontend build | **Vite + TypeScript** | `web_src/` + Vite |
| Image viewers | **Cornerstone3D** (DICOM), **NiiVue** (NIfTI) | embedded widgets |
| ORM / migrations | **SQLAlchemy 2.0 + Alembic** | XORM + migrations |
| Primary DB | **PostgreSQL** (single target) | multi-DB → standardize on PG |
| Queue / workers | **Redis + Arq** (async) or Celery | `modules/queue` + workers |
| Object storage | **S3 / MinIO** (self-host) | `modules/storage` |
| Search | **OpenSearch** (or Postgres FTS to start) | `modules/indexer` |
| Cache / pubsub / sessions | **Redis** | `modules/cache`, eventsource |
| Auth | **Authlib** (OIDC/OAuth2) + sessions + API tokens | `services/auth` |
| Imaging libs | pydicom, highdicom, SimpleITK, nibabel, dcm2niix, Pillow; OpenSlide for WSI | — |
| ML integration | MONAI / PyTorch dataset adapters via the SDK | — |
| De-id | pydicom + `deid` recipes / CTP rule sets + Presidio (text) + OCR (burned-in PHI) | — |
| Client | **Python SDK + CLI** (`imagehub clone/pull/push/commit`) | the `git` client |

> **Alternative if you want Gitea-grade transport performance:** keep a **Go
> core** for the API/transport/auth layer and use **Python only in the worker
> tier**. Faithful to Gitea, but you maintain two languages and duplicate the
> dataset/manifest types across the boundary. Recommended only if the upload/
> download path is your dominant bottleneck. Default to all-Python.

---

## 4. Layered architecture

```
cli/         Admin & ops commands (Typer): serve, migrate, doctor, deid-batch, user-admin
  └─ api/        FastAPI routers — UI pages + REST API + transport endpoints (thin: parse → service → render)
       └─ services/   Business logic: dataset ops, versioning workflows, review, pipelines, de-id orchestration
            └─ models/     SQLAlchemy entities + queries (one module per domain: user, dataset, version, annotation…)
                 └─ core/       Leaf infra & domain engines — MUST NOT import the layers above
                                 ├─ vcs/        ← the Dataset Versioning Engine (the "Git")
                                 ├─ storage/    ← content-addressed blob store over S3/MinIO
                                 ├─ imaging/    ← DICOM/NIfTI parsing, metadata, thumbnails, conversion
                                 ├─ deid/       ← de-identification pipeline stages
                                 ├─ queue/      ← Redis/Arq job abstraction
                                 ├─ index/      ← search abstraction (OpenSearch / PG FTS)
                                 ├─ audit/      ← append-only audit log
                                 ├─ config/     ← typed settings
                                 └─ auth/       ← tokens, sessions, OIDC, permissions
```

**Layer rules (enforce with import-linter, the analogue of Gitea's depguard):**
- `core/` is the foundation; it may not import `models/`, `services/`, or `api/`.
- Cross-entity business logic goes in `services/`, never in `models/`.
- `api/` handlers stay thin — no business logic, no direct DB-engine access.
- Every DB query takes a `session`/context so it enlists in the request transaction.
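
In practice import-linter enforces these rules from its own configuration. To make the rule concrete, here is a rough stdlib-only sketch of what a downward-only import check does. The package name `imagehub` and the helper names are assumptions for illustration, and relative imports are ignored for brevity.

```python
import ast
from typing import Optional

# Layers from highest to lowest; a module may import its own layer or lower ones.
LAYERS = ["cli", "api", "services", "models", "core"]
PACKAGE = "imagehub"  # hypothetical top-level package name


def layer_of(module: str) -> Optional[str]:
    """Return the layer a dotted module path belongs to, or None if outside the package."""
    parts = module.split(".")
    if len(parts) >= 2 and parts[0] == PACKAGE and parts[1] in LAYERS:
        return parts[1]
    return None


def upward_imports(module: str, source: str) -> list[str]:
    """List absolute imports in `source` that target a layer above `module`'s layer."""
    own = layer_of(module)
    if own is None:
        return []
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]
        else:
            continue  # not an import, or a relative import (skipped in this sketch)
        for target in targets:
            other = layer_of(target)
            if other is not None and LAYERS.index(other) < LAYERS.index(own):
                violations.append(target)
    return violations
```

A `models` module importing from `services` would be reported, while the same import from an `api` module would pass.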

---

## 5. Core subsystems

### 5.1 Dataset Versioning Engine (`core/vcs`) — the heart

A content-addressed Merkle DAG, like Git, specialized for large imaging files.

- **Blob store.** Every file is hashed (SHA-256) and stored once in object
  storage at `blobs/<aa>/<bb>/<hash>`. Identical files dedupe automatically
  across versions and datasets, which saves substantial storage because imaging
  datasets share many instances.
- **Manifest (tree).** A version's manifest lists `logical_path → {blob_hash,
  size, media_type, imaging_meta}`. The manifest is itself content-addressed.
- **Commit.** `{manifest_hash, parents[], author, timestamp, message}`. The
  parent chain is the history DAG.
- **Refs.** Branches/tags map `name → commit_id`, stored in **Postgres** (not in
  object storage) so they're transactional and queryable.
- **Transport / negotiation.** On push, the client hashes locally and asks the
  server which blobs are missing ("have/want", like the LFS batch API), uploads
  only those (resumable, chunked), then posts the commit. Pull is the reverse.
- **Diff.** Compare two manifests → added / removed / modified entries; surfaced
  in the UI as a dataset diff and, per-image, as a viewer side-by-side.
- **Merge.** Three-way path-level merge of manifests against the common
  ancestor; a path conflicts when both sides changed it to different content.
  Label/annotation merges can be semantic.
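
The blob, manifest, commit, and diff pieces above can be sketched in a few lines. This is a minimal illustration, not the engine itself: the canonical-JSON serialization, the reading of `<aa>/<bb>` as the first two byte pairs of the hex digest, and all function names are assumptions.

```python
import hashlib
import json


def blob_hash(data: bytes) -> str:
    """SHA-256 hex digest; this is the blob's identity."""
    return hashlib.sha256(data).hexdigest()


def blob_key(digest: str) -> str:
    """Object-store key, assuming <aa>/<bb> are the first two hex byte pairs."""
    return f"blobs/{digest[:2]}/{digest[2:4]}/{digest}"


def canonical_hash(obj) -> str:
    """Hash a JSON-serializable object via a canonical encoding (assumed format)."""
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_commit(manifest: dict, parents: list[str], author: str,
                timestamp: str, message: str) -> dict:
    """Build a commit whose id is the hash of its own content."""
    commit = {
        "manifest_hash": canonical_hash(manifest),
        "parents": parents,
        "author": author,
        "timestamp": timestamp,
        "message": message,
    }
    return {"id": canonical_hash(commit), **commit}


def diff_manifests(old: dict, new: dict) -> dict:
    """Compare two manifests (logical_path -> entry) by blob hash."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(
            path for path in old.keys() & new.keys()
            if old[path]["blob_hash"] != new[path]["blob_hash"]
        ),
    }
```

Because the commit id covers the manifest hash and the parent ids, changing any file or any ancestor changes every descendant commit id, which is what makes the history tamper-evident.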

**Build vs. buy:** building this custom gives full control and the cleanest
domain fit (recommended). If you need to move faster, back it with **lakeFS**
(git-like branches/commits/merge over S3) or **DVC**, and keep your manifest API
as the stable interface so you can swap the backend later.

### 5.2 Object storage (`core/storage`)
Driver interface (`put/get/stat/delete/presign`) with `local` and `s3/minio`
implementations, the same pattern as Gitea's `modules/storage`. Stores blobs,
manifests, thumbnails, and pipeline artifacts. Presigned URLs let clients upload
to and download from S3 directly for large transfers, bypassing the app.
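
One way to express that driver interface is a `typing.Protocol`. The sketch below is an assumption about shape only: the method signatures, `ObjectStat`, and the in-memory driver (a test stand-in, not one of the real `local` or `s3/minio` drivers) are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class ObjectStat:
    key: str
    size: int


class StorageDriver(Protocol):
    """The five operations named above; signatures are assumed."""
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def stat(self, key: str) -> Optional[ObjectStat]: ...
    def delete(self, key: str) -> None: ...
    def presign(self, key: str, method: str, expires_s: int) -> str: ...


class MemoryDriver:
    """In-memory stand-in for tests; real drivers would wrap disk or S3/MinIO."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def stat(self, key: str) -> Optional[ObjectStat]:
        data = self._objects.get(key)
        return None if data is None else ObjectStat(key=key, size=len(data))

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def presign(self, key: str, method: str, expires_s: int) -> str:
        # Placeholder URL only; a real driver returns a signed S3 URL.
        return f"memory://{key}?method={method}&expires={expires_s}"


def store_blob(driver: StorageDriver, data: bytes) -> str:
    """Store data under its content-addressed key, skipping the write if present."""
    digest = hashlib.sha256(data).hexdigest()
    key = f"blobs/{digest[:2]}/{digest[2:4]}/{digest}"
    if driver.stat(key) is None:
        driver.put(key, data)
    return key
```

Callers depend only on `StorageDriver`, so swapping local disk for S3 should not touch the versioning engine.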

### 5.3 Ingestion & processing pipeline (`core/queue` + workers)
On upload, enqueue jobs; workers (Arq) process them:
1. Verify checksums, store blobs (dedup).
2. **Extract metadata** (pydicom/nibabel): modality, body part, study/series UIDs,
   dimensions, acquisition params → indexed + linked to blobs.
3. **Thumbnails / previews** for the browse UI.
4. **De-identification** (§5.4).
5. **Format normalization** (optional: DICOM→NIfTI via dcm2niix for ML).
6. Commit the resulting version; update search index; write audit entries.

Workers scale independently; GPU nodes handle ML jobs.

### 5.4 De-identification engine (`core/deid`) — compliance must-have
A configurable, multi-stage pipeline producing a `deidentified` branch from a
`raw`/PHI version:
- **Tag de-id** per **DICOM PS3.15 Annex E** confidentiality profiles: remove/
  replace PHI tags, regenerate UIDs *consistently* (so series stay linked),
  handle private tags.
- **Date shifting**: consistent per-patient offset to preserve intervals.
- **Burned-in pixel PHI**: OCR (Tesseract/EasyOCR) to detect text in pixels,
  redact, and flag for human review.
- **Free-text / report de-id**: Presidio NER over any text fields/reports.
- **Re-identification map** (only if policy allows): the original↔pseudonym
  mapping is encrypted, access-restricted, and fully audited; otherwise the PHI
  source is dropped.
- **Verification stage** emits a report of exactly what changed.

Tooling: pydicom, Stanford `deid` / MIRC CTP rule sets, Presidio, an OCR engine.
Profiles are configurable per org/dataset.
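
Two of the stages above, consistent UID regeneration and per-patient date shifting, can be made deterministic by deriving values from a secret key. The sketch below is one possible approach, not the `deid` or CTP behavior: the HMAC construction, the use of the `2.25.` UID root, the 365-day window, and all names are assumptions.

```python
import hashlib
import hmac
from datetime import datetime, timedelta


def _keyed_int(secret: bytes, label: str, value: str, bits: int = 128) -> int:
    """Derive a stable integer from (secret, label, value) with HMAC-SHA256."""
    digest = hmac.new(secret, f"{label}:{value}".encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def remap_uid(secret: bytes, original_uid: str) -> str:
    """Map a UID to a pseudonymous one; equal inputs give equal outputs,
    so study/series/instance references stay linked after de-id."""
    # 128-bit integer under the 2.25 root keeps the result within 64 characters.
    return f"2.25.{_keyed_int(secret, 'uid', original_uid)}"


def patient_offset_days(secret: bytes, patient_id: str, max_days: int = 365) -> int:
    """Per-patient shift in [-max_days, -1]; the same patient always gets the same shift."""
    return -(1 + _keyed_int(secret, "date", patient_id) % max_days)


def shift_date(da: str, offset_days: int) -> str:
    """Shift a DICOM DA value (YYYYMMDD) by a number of days."""
    shifted = datetime.strptime(da, "%Y%m%d") + timedelta(days=offset_days)
    return shifted.strftime("%Y%m%d")
```

Since every date of one patient moves by the same offset, intervals between that patient's studies are preserved. The secret must be protected like the re-identification map, because anyone holding it can recompute the mapping for a known identifier.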

### 5.5 Web viewer (`api` + embedded TS widgets)
Progressive-enhancement widgets (not a separate SPA), true to Gitea:
- **Cornerstone3D** for DICOM (multi-frame, MPR, windowing, measurements,
  segmentation overlays).
- **NiiVue** for NIfTI volumes (great for neuro/research).
- **OpenSlide**-backed deep-zoom tiles for whole-slide pathology (optional).

The server exposes a frame/tile API (a WADO-RS-like read path even without full
DICOMweb). Annotations are structured objects (DICOM SR or JSON), **versioned
with the dataset**.

### 5.6 Search & discovery (`core/index`)
Index extracted metadata + labels → faceted search ("brain MRI, T1, age<40, has
tumor label"). Start on **Postgres FTS**; graduate to **OpenSearch** for scale.
Optional later: compute image embeddings (a foundation model) → **pgvector** for
"find similar studies/lesions".

### 5.7 Collaboration (`services`)
Change Proposals (PRs), reviews, issues, comments, annotation tasks, releases,
datasheets — the GitHub social layer, mapped to datasets. A reviewer of a Change
Proposal sees the dataset diff and can open the viewer on changed series.
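
Merging an approved Change Proposal relies on the three-way path-level merge from §5.1. A minimal sketch, assuming each manifest is simplified to `logical_path → blob_hash` and that `base` is the common ancestor of the two branch heads (the function name is hypothetical):

```python
def merge_manifests(base: dict, ours: dict, theirs: dict) -> tuple[dict, list[str]]:
    """Three-way merge by path. Returns (merged manifest, conflicting paths)."""
    merged: dict = {}
    conflicts: list[str] = []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            choice = o            # both sides agree (same change, or both deleted)
        elif o == b:
            choice = t            # only their side changed this path
        elif t == b:
            choice = o            # only our side changed this path
        else:
            conflicts.append(path)  # both changed it, to different content
            continue
        if choice is not None:    # None means the path was deleted
            merged[path] = choice
    return merged, conflicts
```

Conflicting paths are left out of the merged manifest so a reviewer can resolve them explicitly, for example by opening both versions in the viewer.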

### 5.8 Pipelines & runners (Actions analogue, optional/advanced)
Event-driven compute (`on: push | proposal | tag | schedule`) executed by
**runners** (containers that poll for jobs, à la `act_runner`). Use cases: auto
de-id, QC/validation, dataset statistics, **training/eval** with MONAI. Each run
**pins the dataset version hash**, giving reproducible ML by construction.

### 5.9 Auth, permissions, audit (`core/auth`, `core/audit`)
- OIDC/OAuth2 login, sessions, scoped API tokens.
- Org → Team → permission model; dataset visibility `private | internal | public`;
  dataset-level access requests / data-use agreements.
- **Audit log**: append-only Postgres table (actor, action, object, dataset,
  version, IP, purpose-of-use, timestamp). Every PHI-bearing access (view
  original, download) is logged; optional hash-chaining for tamper-evidence;
  retention + legal-hold support.
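
The optional hash-chaining can be illustrated as follows. This is a sketch over an in-memory list rather than the Postgres table, and the row layout, the all-zero genesis value, and the function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel for "no previous entry"


def entry_hash(prev_hash: str, entry: dict) -> str:
    """Hash an entry together with the hash of the entry before it."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, entry: dict) -> None:
    """Append an entry, linking it to the current tail of the chain."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "prev_hash": prev, "hash": entry_hash(prev, entry)})


def verify_chain(log: list) -> bool:
    """Recompute every link; any edited or reordered row breaks the chain."""
    prev = GENESIS
    for row in log:
        if row["prev_hash"] != prev or row["hash"] != entry_hash(prev, row["entry"]):
            return False
        prev = row["hash"]
    return True
```

The chain alone detects edits and reordering but not truncation of the newest rows, so the latest hash would also need to be recorded somewhere outside the table.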

### 5.10 API, SDK, CLI
- **REST API** (FastAPI, OpenAPI-documented — the swagger analogue).
- **Python SDK** (the most important client for ML users): pull a pinned version
  straight into a `torch`/MONAI `Dataset`.
- **CLI** (`imagehub clone/pull/push/checkout/commit/diff`) — the `git`/`dvc`
  analogue for data engineers.
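
Behind a push, the have/want exchange described in §5.1 might look roughly like this. `FakeServer` is an in-memory stand-in for the transport API, and its method names are hypothetical rather than real endpoints; chunking and resumability are left out.

```python
import hashlib


class FakeServer:
    """In-memory stand-in for the transport API (method names are hypothetical)."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}
        self.commits: list[dict] = []

    def missing(self, hashes: list[str]) -> list[str]:
        """Answer the client's 'have' list with the hashes the server wants."""
        return [h for h in hashes if h not in self.blobs]

    def upload(self, digest: str, data: bytes) -> None:
        """Accept a blob only if it matches the hash it was announced under."""
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("checksum mismatch")
        self.blobs[digest] = data

    def post_commit(self, manifest: dict, message: str) -> None:
        self.commits.append({"manifest": manifest, "message": message})


def push(server: FakeServer, files: dict[str, bytes], message: str) -> list[str]:
    """Hash locally, upload only the blobs the server lacks, then post the commit."""
    manifest = {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}
    by_hash = {digest: files[path] for path, digest in manifest.items()}
    wanted = server.missing(sorted(by_hash))
    for digest in wanted:
        server.upload(digest, by_hash[digest])
    server.post_commit(manifest, message)
    return wanted
```

A second push that adds one file to an existing dataset transfers only that file's blob; everything already on the server is skipped.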

---

## 6. Data model (core tables)

```
user, organization, team, team_membership, team_access
dataset(id, owner_id, name, visibility, default_branch, description)
ref(dataset_id, name, type[branch|tag], commit_id)             -- transactional refs
commit(id, dataset_id, manifest_hash, parent_ids[], author_id, message, created_at)
blob(hash PK, size, storage_key, media_type, refcount)         -- content-addressed, deduped
manifest(hash PK, storage_key)                                 -- stored in object store, hash in DB
instance_meta(blob_hash, dataset_id, study_uid, series_uid, modality, body_part, dims, params…)
annotation(id, dataset_id, commit_id, target, type, payload, author_id)
label_schema(id, dataset_id, spec)         label(id, schema_id, value)
change_proposal(id, dataset_id, src_ref, dst_ref, status)  review, comment
issue(id, dataset_id, …)  issue_comment
release(id, dataset_id, tag, notes, doi?)
pipeline(id, dataset_id, spec)  pipeline_run(id, pipeline_id, commit_id, status, artifacts)  runner
webhook  webhook_delivery
audit_log(id, actor_id, action, object_type, object_id, dataset_id, commit_id, ip, purpose, created_at)  -- append-only
access_request, data_use_agreement
phi_map(dataset_id, original_ref, pseudonym, …)               -- encrypted, restricted, audited
```
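
Keeping `ref` in the database is what allows a branch to be advanced with a compare-and-swap, so two concurrent pushes cannot silently overwrite each other. The sketch below uses stdlib `sqlite3` purely as a stand-in for Postgres; the column types, the primary key, and the function names are assumptions.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a simplified ref table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE ref (
            dataset_id INTEGER NOT NULL,
            name       TEXT    NOT NULL,
            type       TEXT    NOT NULL CHECK (type IN ('branch', 'tag')),
            commit_id  TEXT    NOT NULL,
            PRIMARY KEY (dataset_id, type, name)
        )
        """
    )
    return db


def create_ref(db: sqlite3.Connection, dataset_id: int, name: str,
               ref_type: str, commit_id: str) -> None:
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO ref (dataset_id, name, type, commit_id) VALUES (?, ?, ?, ?)",
            (dataset_id, name, ref_type, commit_id),
        )


def advance_branch(db: sqlite3.Connection, dataset_id: int, name: str,
                   expected: str, new: str) -> bool:
    """Move a branch only if it still points at `expected`; True on success."""
    with db:
        cur = db.execute(
            "UPDATE ref SET commit_id = ? "
            "WHERE dataset_id = ? AND name = ? AND type = 'branch' AND commit_id = ?",
            (new, dataset_id, name, expected),
        )
    return cur.rowcount == 1
```

A client whose `expected` value is stale gets `False` back and has to pull and merge before retrying, much like a rejected non-fast-forward push in Git.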

---

## 7. Key flows

1. **Ingest & de-identify:** upload → blobs stored (deduped) → metadata extracted
   → de-id pipeline → new commit on `deidentified` branch → indexed → audited.
2. **Browse & view:** datasets list → dataset → series list → Cornerstone3D/NiiVue
   streams frames → annotation overlays.
3. **Curate an ML subset (zero-copy):** faceted query → new branch/dataset whose
   manifest *references existing blobs* (no data copied) → commit → tag a release
   → `sdk.pull(tag)` in training.
4. **Propose a change (PR):** push new/relabeled data to a branch → open Change
   Proposal → reviewer sees dataset diff + image diff → approve → merge.
5. **Reproducible training:** tag triggers a pipeline that pins the version hash,
   runs MONAI train/eval, and links metrics + model artifact to that exact data
   version.
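
Flow 3 is cheap because a subset is only a new manifest over blobs that already exist. A minimal sketch, assuming manifest entries shaped like those in §5.1 and a hypothetical `curate_subset` helper standing in for the faceted query:

```python
from typing import Callable


def curate_subset(manifest: dict, predicate: Callable[[dict], bool]) -> dict:
    """Build a new manifest from the entries whose imaging metadata matches.
    Entries are reused as-is, so no blob is copied or re-uploaded."""
    return {
        path: entry
        for path, entry in manifest.items()
        if predicate(entry["imaging_meta"])
    }


# Illustrative data only; paths, hashes, and metadata keys are made up.
source = {
    "sub-01/t1.nii.gz": {"blob_hash": "aa11", "size": 10,
                         "imaging_meta": {"modality": "MR", "body_part": "BRAIN"}},
    "sub-01/ct.nii.gz": {"blob_hash": "bb22", "size": 20,
                         "imaging_meta": {"modality": "CT", "body_part": "CHEST"}},
}
subset = curate_subset(
    source, lambda meta: meta["modality"] == "MR" and meta["body_part"] == "BRAIN"
)
```

Committing `subset` to a new branch and tagging it yields a release that the SDK can pull by tag, while the object store holds the underlying files only once.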

---

## 8. Deployment topology

```
         ┌────── reverse proxy (Caddy/Traefik) + TLS ───────┐
         │                                                  │
┌────────▼─────────┐    ┌──────────────────┐    ┌───────────▼────────────┐
│ Core app (N×)    │    │ Worker tier (M×) │    │ Runners (K×, GPU opt.) │
│ FastAPI/uvicorn  │    │ Arq + imaging/ML │    │ pipelines (train/eval) │
└───┬───────────┬──┘    └──┬────────────┬──┘    └─────┬──────────────────┘
    │           │          │            │             │
┌───▼───────┐ ┌─▼──────────▼─┐      ┌───▼─────────────▼──┐
│ Postgres  │ │ Redis        │      │ Object store       │
│ state/refs│ │ queue, cache │      │ S3 / MinIO (blobs) │
└───────────┘ └──────────────┘      └────────────────────┘
              ┌──────────────┐
              │ OpenSearch   │  (metadata/label search)
              └──────────────┘
```

- **Core app**: stateless, horizontally scalable.
- **Worker tier**: scales independently; CPU for de-id/convert, GPU for ML.
- **Postgres**: state, refs, metadata, audit. **Redis**: queue, cache, sessions,
  server-sent events. **Object storage**: all blobs. **OpenSearch**: search.
- **Dev / small self-host**: a single `docker-compose` (app + worker + PG + Redis
  + MinIO + OpenSearch). **Scale**: Kubernetes with separate node pools.
- Contrast with Gitea (one binary, in-proc workers): we externalize workers and
  object storage because imaging/ML work is heavy, Python-bound, and GPU-hungry.

---

## 9. Build-vs-buy summary

| Component | Recommendation |
|---|---|
| Versioning engine | **Build** the manifest/commit model (custom) — or back it with **lakeFS/DVC** behind your API to ship faster. |
| Viewer | **Adopt** Cornerstone3D + NiiVue (+ OpenSlide for WSI). Don't build. |
| De-identification | **Assemble** from pydicom + `deid`/CTP rules + Presidio + OCR. Don't build from scratch. |
| Search | **Postgres FTS** first → **OpenSearch** at scale. |
| Auth | **Authlib** (OIDC). |
| Queue | **Arq** (async) or **Celery**. |
| Object storage | **MinIO** self-host / **S3** cloud. |

---

## 10. MVP-first roadmap

Ordered for the chosen must-haves (versioning + viewer + de-id + audit):

- **Phase 0 — Skeleton.** Layered project structure, config, Postgres + Alembic,
  object-storage driver, auth (user/org/team), dataset CRUD.
- **Phase 1 — Versioning engine.** Blobs, manifests, commits, branches; push/pull
  via CLI + SDK; dataset diff. *(This is the product's spine — invest here.)*
- **Phase 2 — Ingestion + de-id + audit.** Worker tier, metadata extraction,
  de-identification pipeline, append-only audit log. *(The compliance core.)*
- **Phase 3 — Viewer + search.** Cornerstone3D/NiiVue widgets, thumbnails,
  faceted metadata search, browse UI.
- **Phase 4 — Collaboration.** Change Proposals, reviews, issues, annotations,
  citable releases, datasheets.
- **Phase 5 — Pipelines.** Runners, event triggers, reproducible MONAI train/eval,
  webhooks.
- **Later / optional.** DICOMweb + PACS adapter (QIDO/WADO/STOW), image-embedding
  similarity search (pgvector), whole-slide pathology.

---

## Appendix — naming parallels for orientation

`git clone` → `imagehub clone` · repository → dataset · commit → version ·
push/pull → push/pull · PR → change proposal · `.git/objects` → content-addressed
blob store · act_runner → pipeline runner · `app.ini` → config · XORM → SQLAlchemy.