sciagent code + Gitea Actions CI/CD

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:38:30 +07:00
commit 688fac73e9
1167 changed files with 158244 additions and 0 deletions
@@ -0,0 +1,356 @@
+# ImageHub — Architecture
+
+> **"GitHub for medical-imaging research datasets."** A self-hosted platform for
+> versioning, viewing, de-identifying, and collaborating on imaging datasets
+> (DICOM / NIfTI / WSI), modeled on Gitea's architecture but rebuilt on a
+> Python-centric stack suited to the imaging + ML ecosystem.
+>
+> *"ImageHub" is a placeholder name — rename freely.*
+
+This document describes (1) the Gitea patterns we are reproducing, (2) how each
+maps to the imaging domain, (3) the recommended stack, (4) the subsystems and
+data model, and (5) an MVP-first roadmap.
+
+---
+
+## 1. Design philosophy (inherited from Gitea)
+
+Gitea is worth copying for five structural decisions. We keep all five:
+
+1. **Modular monolith, not microservices.** One deployable core app with clear
+   internal layers. You can scale the heavy parts out later (we do — the worker
+   tier) without paying distributed-systems tax up front.
+2. **Strict downward layering.** `cli → api → services → models → core`.
+   Dependencies only point down. Business logic lives in `services`, never in
+   models or HTTP handlers.
+3. **Server-rendered UI + progressive enhancement, not a SPA.** Pages are
+   rendered server-side; rich client behavior (the image viewer) is embedded as
+   self-contained widgets. This is faster to build, easy to deep-link, and
+   friendly to search engines and printing.
+4. **Pluggable infrastructure behind interfaces.** Storage, queue, search,
+   cache, and auth are interfaces with swappable drivers (local disk ↔ S3,
+   in-proc ↔ Redis, Postgres FTS ↔ OpenSearch). Same idea as Gitea's
+   `modules/storage`, `modules/queue`, `modules/indexer`.
+5. **The domain engine is a first-class subsystem.** For Gitea that engine is
+   Git. For us it is the **Dataset Versioning Engine** — a content-addressed,
+   Merkle-DAG version control system specialized for large imaging files. This is
+   the single most important component and the heart of the product.
+
+What we deliberately change from Gitea:
+
+- **Workers are externalized.** Gitea runs background jobs in-process. Imaging
+  jobs (de-identification, format conversion, thumbnailing, ML) are heavy,
+  Python-bound, and sometimes need GPUs — so they run in a separate, scalable
+  worker tier driven by a real queue.
+- **All "files" are large binaries.** Gitea bolts on Git-LFS for large files; for
+  us large-file handling is the *default and only* path — every blob is
+  content-addressed and stored in object storage.
+- **De-identification & audit are core**, not afterthoughts (domain requirement).
+
+---
+
+## 2. Concept mapping: Gitea → ImageHub
+
+| Gitea concept | ImageHub equivalent | Notes |
+|---|---|---|
+| Repository | **Dataset** | A versioned collection of imaging studies/series + metadata + labels. |
+| Git commit | **Version** (commit) | Immutable snapshot = a content-addressed manifest + parent links. |
+| Branch / tag | **Branch / tag** | e.g. `raw`, `deidentified`, `train-split-v3`; tags for citable releases. |
+| Blob / tree | **Blob / manifest** | Blob = one file (DICOM instance, NIfTI, label). Manifest = the tree of a version. |
+| Git-LFS | *(native)* | Every blob is large; content-addressed object store is the only path. |
+| Git transport (SSH/HTTP) | **Transport API + CLI/SDK** | Resumable chunked upload/download; "have/want" blob negotiation like LFS batch. |
+| Pull Request | **Change Proposal** | Review added/changed/relabeled data before merging into a branch. |
+| Diff / code review | **Dataset diff + image diff** | Added/removed/changed series and label diffs, viewed side-by-side. |
+| Issues | **Issues / annotation tasks** | QC findings, labeling tasks, discussions. |
+| Releases | **Dataset releases** | Frozen, citable snapshots (DOI-friendly) — key for research reproducibility. |
+| Wiki | **Datasheet / data dictionary** | Dataset documentation, "Datasheets for Datasets". |
+| Actions / act_runner | **Pipelines / runners** | Event-driven compute: de-id, QC, train/eval; pins exact data version. |
+| Webhooks | **Webhooks** | Same. |
+| Code search indexer | **Metadata + tag search** | Faceted search over modality/body-part/labels; optional image-embedding search. |
+| Org / Team / User / RBAC | **Org / Team / User / RBAC** | Nearly identical; plus dataset access requests / data-use agreements. |
+| `app.ini` + `modules/setting` | **Config system** | Typed config from file + env. |
+| XORM migrations | **Alembic migrations** | Ordered, append-only schema migrations. |
+| Storage (local/minio/s3) | **Object storage** | Same abstraction; blobs live here. |
+| *(minimal in Gitea)* | **Audit & compliance log** | First-class, append-only PHI-access trail. |
+| *(none)* | **De-identification engine** | Domain-specific; no Gitea analogue. |
+
+---
+
+## 3. Recommended stack ("own stack", Python-centric)
+
+Rationale: the medical-imaging and ML ecosystems (pydicom, SimpleITK, nibabel,
+dcm2niix, highdicom, MONAI, the de-id tooling) are overwhelmingly Python. A
+single-language core + worker stack removes the model-duplication friction you'd
+get from a Go core calling Python workers.
+
+| Layer | Choice | Gitea analogue |
+|---|---|---|
+| Core web/API | **Python 3.12 + FastAPI** (uvicorn/gunicorn) | `routers/` (chi) |
+| Templating | **Jinja2 + HTMX** for progressive enhancement | `templates/` |
+| Frontend build | **Vite + TypeScript** | `web_src/` + Vite |
+| Image viewer | **Cornerstone3D** (DICOM), **NiiVue** (NIfTI) | embedded widgets |
+| ORM / migrations | **SQLAlchemy 2.0 + Alembic** | XORM + migrations |
+| Primary DB | **PostgreSQL** (single target) | multi-DB → standardize on PG |
+| Queue / workers | **Redis + Arq** (async) or Celery | `modules/queue` + workers |
+| Object storage | **S3 / MinIO** (self-host) | `modules/storage` |
+| Search | **OpenSearch** (or Postgres FTS to start) | `modules/indexer` |
+| Cache / pubsub / sessions | **Redis** | `modules/cache`, eventsource |
+| Auth | **Authlib** (OIDC/OAuth2) + sessions + API tokens | `services/auth` |
+| Imaging libs | pydicom, highdicom, SimpleITK, nibabel, dcm2niix, Pillow; OpenSlide for WSI | — |
+| ML integration | MONAI / PyTorch dataset adapters via the SDK | — |
+| De-id | pydicom + `deid` (CTP rules) + Presidio (text) + OCR (burned-in PHI) | — |
+| Client | **Python SDK + CLI** (`imagehub clone/pull/push/commit`) | the `git` client |
+
+> **Alternative if you want Gitea-grade transport performance:** keep a **Go
+> core** for the API/transport/auth layer and use **Python only in the worker
+> tier**. Faithful to Gitea, but you maintain two languages and duplicate the
+> dataset/manifest types across the boundary. Recommended only if the
+> upload/download path is your dominant bottleneck. Default to all-Python.
+
+---
+
+## 4. Layered architecture
+
+```
+cli/         Admin & ops commands (Typer): serve, migrate, doctor, deid-batch, user-admin
+  └─ api/        FastAPI routers — UI pages + REST API + transport endpoints (thin: parse → service → render)
+       └─ services/   Business logic: dataset ops, versioning workflows, review, pipelines, de-id orchestration
+            └─ models/     SQLAlchemy entities + queries (one module per domain: user, dataset, version, annotation…)
+                 └─ core/       Leaf infra & domain engines — MUST NOT import the layers above
+                                 ├─ vcs/        ← the Dataset Versioning Engine (the "Git")
+                                 ├─ storage/    ← content-addressed blob store over S3/MinIO
+                                 ├─ imaging/    ← DICOM/NIfTI parsing, metadata, thumbnails, conversion
+                                 ├─ deid/       ← de-identification pipeline stages
+                                 ├─ queue/      ← Redis/Arq job abstraction
+                                 ├─ index/      ← search abstraction (OpenSearch / PG FTS)
+                                 ├─ audit/      ← append-only audit log
+                                 ├─ config/     ← typed settings
+                                 └─ auth/       ← tokens, sessions, OIDC, permissions
+```
+
+**Layer rules (enforce with import-linter, the analogue of Gitea's depguard):**
+- `core/` is the foundation; it may not import `models/`, `services/`, or `api/`.
+- Cross-entity business logic goes in `services/`, never in `models/`.
+- `api/` handlers stay thin — no business logic, no direct DB-engine access.
+- Every DB query takes a `session`/context so it enlists in the request transaction.
+
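The layer rules above can be checked mechanically. The sketch below is not import-linter itself; it is a minimal stand-in built on the standard `ast` module, and the package name `imagehub` and the helper names `layer_of` / `upward_imports` are assumptions made for illustration.

```python
import ast
from typing import List, Optional

# Layer order from the diagram above, top to bottom.
# The root package name "imagehub" is an assumption for this sketch.
LAYERS = ["cli", "api", "services", "models", "core"]
ROOT = "imagehub"


def layer_of(module: str) -> Optional[str]:
    """Return the layer a dotted module path belongs to, or None if external."""
    parts = module.split(".")
    if len(parts) >= 2 and parts[0] == ROOT and parts[1] in LAYERS:
        return parts[1]
    return None


def upward_imports(module: str, source: str) -> List[str]:
    """List absolute imports in `source` that target a layer above `module`'s own."""
    own = layer_of(module)
    if own is None:
        return []
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]
        else:
            continue  # not an import, or a relative import (ignored here)
        for target in targets:
            dep = layer_of(target)
            # A smaller index means a higher layer, which must not be imported.
            if dep is not None and LAYERS.index(dep) < LAYERS.index(own):
                violations.append(target)
    return sorted(violations)
```

For example, a module under `imagehub.core` that imports `imagehub.services.dataset` is reported, while the same import from `imagehub.api` is allowed. A real project would more likely express this as an import-linter contract and run it in CI.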
+---
+
+## 5. Core subsystems
+
+### 5.1 Dataset Versioning Engine (`core/vcs`) — the heart
+
+A content-addressed Merkle DAG, like Git, specialized for large imaging files.
+
+- **Blob store.** Every file is hashed (SHA-256) and stored once in object
+  storage at `blobs/<aa>/<bb>/<hash>`. Identical files across versions/datasets
+  dedupe for free, a large saving because imaging datasets share many instances.
+- **Manifest (tree).** A version's manifest lists `logical_path → {blob_hash,
+  size, media_type, imaging_meta}`. The manifest is itself content-addressed.
+- **Commit.** `{manifest_hash, parents[], author, timestamp, message}`. The
+  parent chain is the history DAG.
+- **Refs.** Branches/tags map `name → commit_id`, stored in **Postgres** (not in
+  object storage) so they're transactional and queryable.
+- **Transport / negotiation.** On push, the client hashes locally and asks the
+  server which blobs are missing ("have/want", like the LFS batch API), uploads
+  only those (resumable, chunked), then posts the commit. Pull is the reverse.
+- **Diff.** Compare two manifests → added / removed / modified entries; surfaced
+  in the UI as a dataset diff and, per-image, as a viewer side-by-side.
+- **Merge.** Three-way path-level merge of manifests against the common
+  ancestor; a path conflicts when both sides changed it to different contents.
+  Label/annotation merges can be semantic.
+
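The blob, manifest, commit, and negotiation pieces listed above can be sketched with the standard library alone. This is a minimal illustration, not the engine's actual design: the canonical-JSON encoding, the reading of `<aa>/<bb>` as the first two byte pairs of the hex digest, and all function names are assumptions.

```python
import hashlib
import json
from typing import Dict, List, Set


def blob_key(data: bytes) -> str:
    """Storage key `blobs/<aa>/<bb>/<hash>` derived from the SHA-256 of the content."""
    digest = hashlib.sha256(data).hexdigest()
    return f"blobs/{digest[:2]}/{digest[2:4]}/{digest}"


def canonical_hash(obj) -> str:
    """Hash a JSON-serializable object through a sorted, compact encoding,
    so that key order does not change the resulting address."""
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_commit(manifest: Dict[str, dict], parents: List[str],
                author: str, timestamp: str, message: str) -> dict:
    """Build a commit record whose id is the hash of its own fields."""
    body = {
        "manifest_hash": canonical_hash(manifest),
        "parents": parents,
        "author": author,
        "timestamp": timestamp,
        "message": message,
    }
    return {"id": canonical_hash(body), **body}


def missing_blobs(manifest: Dict[str, dict], server_has: Set[str]) -> List[str]:
    """"Have/want" negotiation: blob hashes the manifest references
    that the server does not hold yet, i.e. the only ones to upload."""
    wanted = {entry["blob_hash"] for entry in manifest.values()}
    return sorted(wanted - server_has)
```

Because the commit id covers the manifest hash and the parent ids, changing any file or any ancestor changes every descendant id, which is the Merkle-DAG property the engine relies on.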
+**Build vs. buy:** building this custom gives full control and the cleanest
+domain fit (recommended). If you need to move faster, back it with **lakeFS**
+(git-like branches/commits/merge over S3) or **DVC**, and keep your manifest API
+as the stable interface so you can swap the backend later.
+
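The diff and merge operations described in this section reduce to dictionary comparisons over manifests. The sketch below assumes a manifest is a mapping from logical path to an entry that supports equality (a blob hash string here, for brevity); the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def diff_manifests(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify paths as added, removed, or modified between two manifests."""
    return {
        "added": sorted(p for p in new if p not in old),
        "removed": sorted(p for p in old if p not in new),
        "modified": sorted(p for p in new if p in old and new[p] != old[p]),
    }


def merge_manifests(base: Dict[str, str], ours: Dict[str, str],
                    theirs: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Three-way path-level merge. Returns (merged manifest, conflicting paths).

    A missing path is represented as None, so deletions are handled
    by the same comparisons as modifications.
    """
    merged: Dict[str, str] = {}
    conflicts: List[str] = []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (including "both deleted")
            winner = o
        elif o == b:      # only their side changed this path
            winner = t
        elif t == b:      # only our side changed this path
            winner = o
        else:             # both changed it, to different results
            conflicts.append(path)
            continue
        if winner is not None:
            merged[path] = winner
    return merged, conflicts
```

Conflicting paths are left out of the merged manifest so that a caller must resolve them explicitly, for instance through a Change Proposal review.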
+### 5.2 Object storage (`core/storage`)
+Driver interface (`put/get/stat/delete/presign`) with `local` and `s3/minio`
+implementations — exactly Gitea's `modules/storage` pattern. Stores blobs,
+manifests, thumbnails, pipeline artifacts. Presigned URLs let clients upload to
+and download from object storage directly for large transfers, bypassing the app.
+
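One way to express the driver interface from §5.2 is a `typing.Protocol` with interchangeable implementations. The method names come from the text; the signatures, the `MemoryStore` test driver, and the `store_blob` helper are assumptions made for this sketch.

```python
import hashlib
from typing import Dict, Optional, Protocol


class BlobStore(Protocol):
    """Driver interface; the signatures here are illustrative assumptions."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def stat(self, key: str) -> Optional[int]: ...
    def delete(self, key: str) -> None: ...
    def presign(self, key: str, expires_s: int) -> str: ...


class MemoryStore:
    """In-memory driver, useful for tests; a real one would wrap disk or S3/MinIO."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def stat(self, key: str) -> Optional[int]:
        """Return the object size in bytes, or None if the key is absent."""
        data = self._objects.get(key)
        return None if data is None else len(data)

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def presign(self, key: str, expires_s: int) -> str:
        # Placeholder URL scheme; an S3 driver would return a signed HTTPS URL.
        return f"memory://{key}?expires={expires_s}"


def store_blob(store: BlobStore, data: bytes) -> str:
    """Write content under its hash-derived key, skipping the write if present."""
    digest = hashlib.sha256(data).hexdigest()
    key = f"blobs/{digest[:2]}/{digest[2:4]}/{digest}"
    if store.stat(key) is None:
        store.put(key, data)
    return key
```

Code in `services/` would depend only on `BlobStore`, so swapping the local driver for an S3/MinIO one is a configuration change rather than a code change.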
+### 5.3 Ingestion & processing pipeline (`core/queue` + workers)
+On upload, enqueue jobs; workers (Arq) process them:
+1. Verify checksums, store blobs (dedup).
+2. **Extract metadata** (pydicom/nibabel): modality, body part, study/series UIDs,
+   dimensions, acquisition params → indexed + linked to blobs.
+3. **Thumbnails / previews** for the browse UI.
+4. **De-identification** (§5.4).
+5. **Format normalization** (optional: DICOM→NIfTI via dcm2niix for ML).
+6. Commit the resulting version; update search index; write audit entries.
+
+Workers scale independently; GPU nodes handle ML jobs.
+
+### 5.4 De-identification engine (`core/deid`) — compliance must-have
+A configurable, multi-stage pipeline producing a `deidentified` branch from a
+`raw`/PHI version:
+- **Tag de-id** per **DICOM PS3.15 Annex E** confidentiality profiles:
+  remove/replace PHI tags, regenerate UIDs *consistently* (so series stay
+  linked), handle private tags.
+- **Date shifting**: consistent per-patient offset to preserve intervals.
+- **Burned-in pixel PHI**: OCR (Tesseract/EasyOCR) to detect text in pixels,
+  redact, and flag for human review.
+- **Free-text / report de-id**: Presidio NER over any text fields/reports.
+- **Re-identification map** (only if policy allows): the original↔pseudonym
+  mapping is encrypted, access-restricted, and fully audited; otherwise the PHI
+  source is dropped.
+- **Verification stage** emits a report of exactly what changed.
+
+Tooling: pydicom, Stanford `deid` / MIRC CTP rule sets, Presidio, an OCR engine.
+Profiles are configurable per org/dataset.
+
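Two of the stages above, consistent UID regeneration and per-patient date shifting, depend on a deterministic keyed mapping. The sketch below uses an HMAC for both; it is one possible approach under stated assumptions, not the method prescribed by PS3.15 or used by `deid`/CTP. The secret handling, the `2.25.`-style UID form, the 365-day window, and all names are assumptions.

```python
import hashlib
import hmac
from datetime import date, timedelta


def _keyed_int(secret: bytes, label: str, value: str) -> int:
    """128-bit integer derived from an HMAC; `label` separates the two uses."""
    msg = f"{label}:{value}".encode("utf-8")
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:16], "big")


def pseudonymous_uid(secret: bytes, original_uid: str) -> str:
    """Replacement UID in the style of the UUID-derived `2.25.` root.

    The same input always yields the same output, so instances that shared
    a study or series UID before de-identification still share one after.
    """
    return f"2.25.{_keyed_int(secret, 'uid', original_uid)}"


def patient_offset_days(secret: bytes, patient_id: str, max_days: int = 365) -> int:
    """Deterministic per-patient shift between 1 and `max_days` days."""
    return 1 + _keyed_int(secret, "date", patient_id) % max_days


def shift_date(original: date, offset_days: int) -> date:
    """Move a date into the past by the patient's offset."""
    return original - timedelta(days=offset_days)
```

Because every date for a patient moves by the same offset, the interval between two scans is unchanged while the absolute dates are not recoverable without the secret.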
+### 5.5 Web viewer (`api` + embedded TS widgets)
+Progressive-enhancement widgets (not a separate SPA), true to Gitea:
+- **Cornerstone3D** for DICOM (multi-frame, MPR, windowing, measurements,
+  segmentation overlays).
+- **NiiVue** for NIfTI volumes (great for neuro/research).
+- **OpenSlide**-backed deep-zoom tiles for whole-slide pathology (optional).
+
+The server exposes a frame/tile API (a WADO-RS-like read path even without full
+DICOMweb). Annotations are structured objects (DICOM SR or JSON), **versioned
+with the dataset**.
+
+### 5.6 Search & discovery (`core/index`)
+Index extracted metadata + labels → faceted search ("brain MRI, T1, age<40, has
+tumor label"). Start on **Postgres FTS**; graduate to **OpenSearch** for scale.
+Optional later: compute image embeddings (a foundation model) → **pgvector** for
+"find similar studies/lesions".
+
+### 5.7 Collaboration (`services`)
+Change Proposals (PRs), reviews, issues, comments, annotation tasks, releases,
+datasheets — the GitHub social layer, mapped to datasets. A reviewer of a Change
+Proposal sees the dataset diff and can open the viewer on changed series.
+
+### 5.8 Pipelines & runners (Actions analogue, optional/advanced)
+Event-driven compute (`on: push | proposal | tag | schedule`) executed by
+**runners** (containers that poll for jobs, à la `act_runner`). Use cases: auto
+de-id, QC/validation, dataset statistics, **training/eval** with MONAI. Each run
+**pins the dataset version hash**, giving reproducible ML by construction.
+
+### 5.9 Auth, permissions, audit (`core/auth`, `core/audit`)
+- OIDC/OAuth2 login, sessions, scoped API tokens.
+- Org → Team → permission model; dataset visibility `private | internal | public`;
+  dataset-level access requests / data-use agreements.
+- **Audit log**: append-only Postgres table (actor, action, object, dataset,
+  version, IP, purpose-of-use, timestamp). Every PHI-bearing access (view
+  original, download) is logged; optional hash-chaining for tamper-evidence;
+  retention + legal-hold support.
+
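The optional hash-chaining mentioned for the audit log can be sketched as follows: each record stores the hash of its predecessor, so altering or removing an earlier record invalidates every later one. The field names `prev_hash` and `entry_hash`, the all-zero genesis value, and the in-memory list standing in for the Postgres table are assumptions.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"


def _digest(body: Dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log: List[Dict], entry: Dict) -> Dict:
    """Append an audit entry linked to the hash of the previous record."""
    prev = log[-1]["entry_hash"] if log else GENESIS
    body = {**entry, "prev_hash": prev}
    record = {**body, "entry_hash": _digest(body)}
    log.append(record)
    return record


def verify_chain(log: List[Dict]) -> bool:
    """Recompute every hash and link; False if any record was altered."""
    prev = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "entry_hash"}
        if body.get("prev_hash") != prev or _digest(body) != record.get("entry_hash"):
            return False
        prev = record["entry_hash"]
    return True
```

This makes tampering evident rather than impossible: someone with write access could rebuild the whole chain, so the latest hash would typically also be anchored somewhere outside the database.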
+### 5.10 API, SDK, CLI
+- **REST API** (FastAPI, OpenAPI-documented — the swagger analogue).
+- **Python SDK** (the most important client for ML users): pull a pinned version
+  straight into a `torch`/MONAI `Dataset`.
+- **CLI** (`imagehub clone/pull/push/checkout/commit/diff`) — the `git`/`dvc`
+  analogue for data engineers.
+
+---
+
+## 6. Data model (core tables)
+
+```
+user, organization, team, team_membership, team_access
+dataset(id, owner_id, name, visibility, default_branch, description)
+ref(dataset_id, name, type[branch|tag], commit_id)             -- transactional refs
+commit(id, dataset_id, manifest_hash, parent_ids[], author_id, message, created_at)
+blob(hash PK, size, storage_key, media_type, refcount)         -- content-addressed, deduped
+manifest(hash PK, storage_key)                                 -- stored in object store, hash in DB
+instance_meta(blob_hash, dataset_id, study_uid, series_uid, modality, body_part, dims, params…)
+annotation(id, dataset_id, commit_id, target, type, payload, author_id)
+label_schema(id, dataset_id, spec)         label(id, schema_id, value)
+change_proposal(id, dataset_id, src_ref, dst_ref, status)  review, comment
+issue(id, dataset_id, …)  issue_comment
+release(id, dataset_id, tag, notes, doi?)
+pipeline(id, dataset_id, spec)  pipeline_run(id, pipeline_id, commit_id, status, artifacts)  runner
+webhook  webhook_delivery
+audit_log(id, actor_id, action, object_type, object_id, dataset_id, commit_id, ip, purpose, created_at)  -- append-only
+access_request, data_use_agreement
+phi_map(dataset_id, original_ref, pseudonym, …)               -- encrypted, restricted, audited
+```
+
+---
+
+## 7. Key flows
+
+1. **Ingest & de-identify:** upload → blobs stored (deduped) → metadata extracted
+   → de-id pipeline → new commit on `deidentified` branch → indexed → audited.
+2. **Browse & view:** datasets list → dataset → series list → Cornerstone3D/NiiVue
+   streams frames → annotation overlays.
+3. **Curate an ML subset (zero-copy):** faceted query → new branch/dataset whose
+   manifest *references existing blobs* (no data copied) → commit → tag a release
+   → `sdk.pull(tag)` in training.
+4. **Propose a change (PR):** push new/relabeled data to a branch → open Change
+   Proposal → reviewer sees dataset diff + image diff → approve → merge.
+5. **Reproducible training:** tag triggers a pipeline that pins the version hash,
+   runs MONAI train/eval, and links metrics + model artifact to that exact data
+   version.
+
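Flow 3 (zero-copy curation) follows directly from content addressing: a subset is a new manifest whose entries point at blobs that already exist. A minimal sketch, assuming manifest entries carry the `blob_hash` and `imaging_meta` fields described in §5.1; the `curate_subset` name and the metadata keys are hypothetical.

```python
from typing import Callable, Dict


def curate_subset(manifest: Dict[str, dict],
                  keep: Callable[[dict], bool]) -> Dict[str, dict]:
    """Build a new manifest from the entries that satisfy `keep`.

    Only manifest entries are selected; no blob content is read or copied,
    so the subset costs a manifest's worth of storage.
    """
    return {path: entry for path, entry in manifest.items() if keep(entry)}


source = {
    "s1/t1.nii.gz": {"blob_hash": "h1", "imaging_meta": {"modality": "MR"}},
    "s2/ct.nii.gz": {"blob_hash": "h2", "imaging_meta": {"modality": "CT"}},
    "s3/t1.nii.gz": {"blob_hash": "h3", "imaging_meta": {"modality": "MR"}},
}

# Keep only MR entries, as a faceted query might.
mr_only = curate_subset(source, lambda e: e["imaging_meta"]["modality"] == "MR")
```

Committing `mr_only` on a new branch and tagging it yields a release that training code can pull by tag, with every blob shared with the source dataset.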
+---
+
+## 8. Deployment topology
+
+```
+            ┌──────────── reverse proxy (Caddy/Traefik) + TLS ────────────┐
+            │                                                              │
+   ┌────────▼────────┐    ┌──────────────────┐    ┌───────────────────────▼─┐
+   │  Core app (N×)  │    │  Worker tier (M×) │    │  Runners (K×, GPU opt.)  │
+   │  FastAPI/uvicorn│    │  Arq + imaging/ML │    │  pipelines (train/eval)  │
+   └───┬─────┬───┬───┘    └───┬─────────┬─────┘    └────────────┬────────────┘
+       │     │   │            │         │                       │
+   ┌───▼─┐ ┌─▼─┐ │        ┌───▼───┐ ┌───▼────────┐         ┌────▼────┐
+   │ PG  │ │Redis│ └──────▶│ Redis │ │ Object store│◀────────┤Object st.│
+   │(state│ │queue│        │ queue │ │ S3 / MinIO  │         │ (blobs) │
+   │ refs)│ │cache│        └───────┘ │ (blobs)     │         └─────────┘
+   └─────┘ └────┘                    └─────────────┘
+                          ┌───────────────┐
+                          │  OpenSearch    │  (metadata/label search)
+                          └───────────────┘
+```
+
+- **Core app**: stateless, horizontally scalable.
+- **Worker tier**: scales independently; CPU for de-id/convert, GPU for ML.
+- **Postgres**: state, refs, metadata, audit. **Redis**: queue, cache, sessions,
+  server-sent events. **Object storage**: all blobs. **OpenSearch**: search.
+- **Dev / small self-host**: a single `docker-compose` (app + worker + PG + Redis
+  + MinIO + OpenSearch). **Scale**: Kubernetes with separate node pools.
+- Contrast with Gitea (one binary, in-proc workers): we externalize workers and
+  object storage because imaging/ML work is heavy, Python-bound, and GPU-hungry.
+
+---
+
+## 9. Build-vs-buy summary
+
+| Component | Recommendation |
+|---|---|
+| Versioning engine | **Build** the manifest/commit model (custom) — or back it with **lakeFS/DVC** behind your API to ship faster. |
+| Viewer | **Adopt** Cornerstone3D + NiiVue (+ OpenSlide for WSI). Don't build. |
+| De-identification | **Assemble** from pydicom + `deid`/CTP rules + Presidio + OCR. Don't build from scratch. |
+| Search | **Postgres FTS** first → **OpenSearch** at scale. |
+| Auth | **Authlib** (OIDC). |
+| Queue | **Arq** (async) or **Celery**. |
+| Object storage | **MinIO** self-host / **S3** cloud. |
+
+---
+
+## 10. MVP-first roadmap
+
+Ordered for the chosen must-haves (versioning + viewer + de-id + audit):
+
+- **Phase 0 — Skeleton.** Layered project structure, config, Postgres + Alembic,
+  object-storage driver, auth (user/org/team), dataset CRUD.
+- **Phase 1 — Versioning engine.** Blobs, manifests, commits, branches; push/pull
+  via CLI + SDK; dataset diff. *(This is the product's spine — invest here.)*
+- **Phase 2 — Ingestion + de-id + audit.** Worker tier, metadata extraction,
+  de-identification pipeline, append-only audit log. *(The compliance core.)*
+- **Phase 3 — Viewer + search.** Cornerstone3D/NiiVue widgets, thumbnails,
+  faceted metadata search, browse UI.
+- **Phase 4 — Collaboration.** Change Proposals, reviews, issues, annotations,
+  citable releases, datasheets.
+- **Phase 5 — Pipelines.** Runners, event triggers, reproducible MONAI train/eval,
+  webhooks.
+- **Later / optional.** DICOMweb + PACS adapter (QIDO/WADO/STOW), image-embedding
+  similarity search (pgvector), whole-slide pathology.
+
+---
+
+## Appendix — naming parallels for orientation
+
+`git clone` → `imagehub clone` · repository → dataset · commit → version ·
+push/pull → push/pull · PR → change proposal · `.git/objects` → content-addressed
+blob store · act_runner → pipeline runner · `app.ini` → config · XORM → SQLAlchemy.