tlam89/sciagent

Thinh Lam 688fac73e9

sciagent code + Gitea Actions CI/CD

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 09:38:30 +07:00

ImageHub — Architecture

"GitHub for medical-imaging research datasets." A self-hosted platform for versioning, viewing, de-identifying, and collaborating on imaging datasets (DICOM / NIfTI / WSI), modeled on Gitea's architecture but rebuilt on a Python-centric stack suited to the imaging + ML ecosystem.

"ImageHub" is a placeholder name — rename freely.

This document describes (1) the Gitea patterns we are reproducing, (2) how each maps to the imaging domain, (3) the recommended stack, (4) the subsystems and data model, and (5) an MVP-first roadmap.

1. Design philosophy (inherited from Gitea)

Gitea is worth copying for five structural decisions. We keep all five:

Modular monolith, not microservices. One deployable core app with clear internal layers. You can scale the heavy parts out later (we do — the worker tier) without paying distributed-systems tax up front.
Strict downward layering. cli → api → services → models → core. Dependencies only point down. Business logic lives in services, never in models or HTTP handlers.
Server-rendered UI + progressive enhancement, not a SPA. Pages are rendered server-side; rich client behavior (the image viewer) is embedded as self-contained widgets. This is faster to build, easy to deep-link, and friendly to search engines and printing.
Pluggable infrastructure behind interfaces. Storage, queue, search, cache, and auth are interfaces with swappable drivers (local disk ↔ S3, in-proc ↔ Redis, Postgres FTS ↔ OpenSearch). Same idea as Gitea's modules/storage, modules/queue, modules/indexer.
The domain engine is a first-class subsystem. For Gitea that engine is Git. For us it is the Dataset Versioning Engine — a content-addressed, Merkle-DAG version control system specialized for large imaging files. This is the single most important component and the heart of the product.

What we deliberately change from Gitea:

Workers are externalized. Gitea runs background jobs in-process. Imaging jobs (de-identification, format conversion, thumbnailing, ML) are heavy, Python-bound, and sometimes need GPUs — so they run in a separate, scalable worker tier driven by a real queue.
All "files" are large binaries. Gitea bolts on Git-LFS for large files; for us large-file handling is the default and only path — every blob is content-addressed and stored in object storage.
De-identification & audit are core, not afterthoughts (domain requirement).

2. Concept mapping: Gitea → ImageHub

Gitea concept	ImageHub equivalent	Notes
Repository	Dataset	A versioned collection of imaging studies/series + metadata + labels.
Git commit	Version (commit)	Immutable snapshot = a content-addressed manifest + parent links.
Branch / tag	Branch / tag	e.g. `raw`, `deidentified`, `train-split-v3`; tags for citable releases.
Blob / tree	Blob / manifest	Blob = one file (DICOM instance, NIfTI, label). Manifest = the tree of a version.
Git-LFS	(native)	Every blob is large; content-addressed object store is the only path.
Git transport (SSH/HTTP)	Transport API + CLI/SDK	Resumable chunked upload/download; "have/want" blob negotiation like LFS batch.
Pull Request	Change Proposal	Review added/changed/relabeled data before merging into a branch.
Diff / code review	Dataset diff + image diff	Added/removed/changed series and label diffs, viewed side-by-side.
Issues	Issues / annotation tasks	QC findings, labeling tasks, discussions.
Releases	Dataset releases	Frozen, citable snapshots (DOI-friendly) — key for research reproducibility.
Wiki	Datasheet / data dictionary	Dataset documentation, "Datasheets for Datasets".
Actions / act_runner	Pipelines / runners	Event-driven compute: de-id, QC, train/eval; pins exact data version.
Webhooks	Webhooks	Same.
Code search indexer	Metadata + tag search	Faceted search over modality/body-part/labels; optional image-embedding search.
Org / Team / User / RBAC	Org / Team / User / RBAC	Nearly identical; plus dataset access requests / data-use agreements.
`app.ini` + `modules/setting`	Config system	Typed config from file + env.
XORM migrations	Alembic migrations	Ordered, append-only schema migrations.
Storage (local/minio/s3)	Object storage	Same abstraction; blobs live here.
(minimal in Gitea)	Audit & compliance log	First-class, append-only PHI-access trail.
(none)	De-identification engine	Domain-specific; no Gitea analogue.

3. Recommended stack ("own stack", Python-centric)

Rationale: the medical-imaging and ML ecosystems (pydicom, SimpleITK, nibabel, dcm2niix, highdicom, MONAI, the de-id tooling) are overwhelmingly Python. A single-language core + worker stack removes the model-duplication friction you'd get from a Go core calling Python workers.

Layer	Choice	Gitea analogue
Core web/API	Python 3.12 + FastAPI (uvicorn/gunicorn)	`routers/` (chi)
Templating	Jinja2 + HTMX for progressive enhancement	`templates/`
Frontend build	Vite + TypeScript	`web_src/` + Vite
DICOM viewer	Cornerstone3D (DICOM), NiiVue (NIfTI)	embedded widgets
ORM / migrations	SQLAlchemy 2.0 + Alembic	XORM + migrations
Primary DB	PostgreSQL (single target)	multi-DB → standardize on PG
Queue / workers	Redis + Arq (async) or Celery	`modules/queue` + workers
Object storage	S3 / MinIO (self-host)	`modules/storage`
Search	OpenSearch (or Postgres FTS to start)	`modules/indexer`
Cache / pubsub / sessions	Redis	`modules/cache`, eventsource
Auth	Authlib (OIDC/OAuth2) + sessions + API tokens	`services/auth`
Imaging libs	pydicom, highdicom, SimpleITK, nibabel, dcm2niix, Pillow; OpenSlide for WSI	—
ML integration	MONAI / PyTorch dataset adapters via the SDK	—
De-id	pydicom + `deid` (CTP rules) + Presidio (text) + OCR (burned-in PHI)	—
Client	Python SDK + CLI (`imagehub clone/pull/push/commit`)	the `git` client

Alternative if you want Gitea-grade transport performance: keep a Go core for the API/transport/auth layer and use Python only in the worker tier. Faithful to Gitea, but you maintain two languages and duplicate the dataset/manifest types across the boundary. Recommended only if the upload/download path is your dominant bottleneck. Default to all-Python.

4. Layered architecture

cli/         Admin & ops commands (Typer): serve, migrate, doctor, deid-batch, user-admin
  └─ api/        FastAPI routers — UI pages + REST API + transport endpoints (thin: parse → service → render)
       └─ services/   Business logic: dataset ops, versioning workflows, review, pipelines, de-id orchestration
            └─ models/     SQLAlchemy entities + queries (one module per domain: user, dataset, version, annotation…)
                 └─ core/       Leaf infra & domain engines — MUST NOT import the layers above
                                 ├─ vcs/        ← the Dataset Versioning Engine (the "Git")
                                 ├─ storage/    ← content-addressed blob store over S3/MinIO
                                 ├─ imaging/    ← DICOM/NIfTI parsing, metadata, thumbnails, conversion
                                 ├─ deid/       ← de-identification pipeline stages
                                 ├─ queue/      ← Redis/Arq job abstraction
                                 ├─ index/      ← search abstraction (OpenSearch / PG FTS)
                                 ├─ audit/      ← append-only audit log
                                 ├─ config/     ← typed settings
                                 └─ auth/       ← tokens, sessions, OIDC, permissions

Layer rules (enforce with import-linter, the analogue of Gitea's depguard):

core/ is the foundation; it may not import models/, services/, or api/.
Cross-entity business logic goes in services/, never in models/.
api/ handlers stay thin — no business logic, no direct DB-engine access.
Every DB query takes a session/context so it enlists in the request transaction.
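
To make the downward-only rule concrete, here is a minimal sketch of the kind of check a tool such as import-linter performs. It is not import-linter itself, only a standard-library illustration. The package name `imagehub` and the function `forbidden_imports` are assumptions made for this example.

```python
import ast

# Layers ordered top to bottom; a module may import only its own layer or layers below it.
LAYERS = ["cli", "api", "services", "models", "core"]


def forbidden_imports(layer: str, source: str, package: str = "imagehub") -> list[str]:
    """Return imported modules that sit above `layer` in the stack.

    Only absolute imports of the form `<package>.<layer>...` are inspected;
    relative imports and `from <package> import <layer>` are ignored in this sketch.
    """
    rank = LAYERS.index(layer)
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            parts = name.split(".")
            if len(parts) >= 2 and parts[0] == package and parts[1] in LAYERS:
                if LAYERS.index(parts[1]) < rank:
                    violations.append(name)
    return violations


# A core/ module that reaches up into services/ breaks the first rule.
core_module = "import imagehub.services.review\nfrom imagehub.core.storage import put\n"
print(forbidden_imports("core", core_module))
```

In practice the same constraint would be declared once in the linter's configuration and run in CI rather than hand-rolled.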

5. Core subsystems

5.1 Dataset Versioning Engine (`core/vcs`) — the heart

A content-addressed Merkle DAG, like Git, specialized for large imaging files.

Blob store. Every file is hashed (SHA-256) and stored once in object storage at blobs/<aa>/<bb>/<hash>. Identical files across versions/datasets dedupe for free (huge win — imaging datasets share many instances).
Manifest (tree). A version's manifest lists logical_path → {blob_hash, size, media_type, imaging_meta}. The manifest is itself content-addressed.
Commit. {manifest_hash, parents[], author, timestamp, message}. The parent chain is the history DAG.
Refs. Branches/tags map name → commit_id, stored in Postgres (not in object storage) so they're transactional and queryable.
Transport / negotiation. On push, the client hashes locally and asks the server which blobs are missing ("have/want", like the LFS batch API), uploads only those (resumable, chunked), then posts the commit. Pull is the reverse.
Diff. Compare two manifests → added / removed / modified entries; surfaced in the UI as a dataset diff and, per-image, as a viewer side-by-side.
Merge. Three-way path-level merge of manifests; conflicts when the same path changed on both sides. Label/annotation merges can be semantic.
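
The blob, manifest, commit, and diff concepts above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: `<aa>` and `<bb>` are read as the first two hex pairs of the hash, manifests and commits are hashed as canonical JSON, and the commit id is taken to be the hash of the commit object. All function names are hypothetical.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def blob_key(blob_hash: str) -> str:
    """Object-store key in the blobs/<aa>/<bb>/<hash> layout (fan-out by hash prefix)."""
    return f"blobs/{blob_hash[:2]}/{blob_hash[2:4]}/{blob_hash}"


def canonical_hash(obj) -> str:
    """Hash a JSON-serialisable object in a canonical form (sorted keys, no whitespace)."""
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return sha256_hex(encoded)


def make_manifest(files: dict[str, bytes]) -> dict[str, dict]:
    """Map logical_path -> {blob_hash, size}; media type and imaging metadata omitted here."""
    return {path: {"blob_hash": sha256_hex(data), "size": len(data)}
            for path, data in files.items()}


def make_commit(manifest, parents, author, message, timestamp):
    """Build a commit object and derive its id from its own content."""
    commit = {
        "manifest_hash": canonical_hash(manifest),
        "parents": list(parents),
        "author": author,
        "timestamp": timestamp,
        "message": message,
    }
    return canonical_hash(commit), commit


def diff_manifests(old: dict, new: dict) -> dict[str, list[str]]:
    """Compare two manifests path by path."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(p for p in old.keys() & new.keys()
                           if old[p]["blob_hash"] != new[p]["blob_hash"]),
    }
```

Because an unchanged file hashes to the same value in every version, two manifests that share it point at the same stored object, which is where the deduplication comes from.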

Build vs. buy: building this custom gives full control and the cleanest domain fit (recommended). If you need to move faster, back it with lakeFS (git-like branches/commits/merge over S3) or DVC, and keep your manifest API as the stable interface so you can swap the backend later.
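
Whichever backend is chosen, the path-level three-way merge has the same shape. The sketch below simplifies each manifest to `path -> blob_hash`. The function name and the choice to leave conflicting paths out of the merged result are assumptions, and semantic label merges are not covered.

```python
def merge_manifests(base: dict, ours: dict, theirs: dict):
    """Three-way merge of manifests keyed by logical path.

    Returns (merged, conflicts). A path conflicts when both sides changed it
    relative to `base` and ended up with different results.
    """
    merged, conflicts = {}, []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (this includes "both deleted")
            result = o
        elif o == b:      # only their side changed the path
            result = t
        elif t == b:      # only our side changed the path
            result = o
        else:             # both changed it, differently
            conflicts.append(path)
            continue
        if result is not None:   # None means the path was deleted
            merged[path] = result
    return merged, conflicts


base = {"a": "h1", "b": "h1", "c": "h1", "d": "h1"}
ours = {"a": "h2", "b": "h1", "c": "h2"}                            # edits a and c, deletes d
theirs = {"a": "h1", "b": "h3", "c": "h4", "d": "h1", "e": "h5"}    # edits b and c, adds e
merged, conflicts = merge_manifests(base, ours, theirs)
```

Here `c` is the only conflict, since both sides rewrote it to different blobs; every other path resolves automatically.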

5.2 Object storage (`core/storage`)

Driver interface (put/get/stat/delete/presign) with local and s3/minio implementations — exactly Gitea's modules/storage pattern. Stores blobs, manifests, thumbnails, pipeline artifacts. Presigned URLs let clients up/download directly to S3 for large transfers, bypassing the app.
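
As a rough illustration of the driver interface, the sketch below declares the five operations as a `typing.Protocol` and gives a local-disk implementation. The class names, the `ObjectInfo` type, and the `expires_in` parameter are assumptions. The local `presign` returns a plain file URI because a local disk has nothing to sign, whereas an S3/MinIO driver would return a time-limited URL.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Protocol


@dataclass(frozen=True)
class ObjectInfo:
    key: str
    size: int


class StorageDriver(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def stat(self, key: str) -> Optional[ObjectInfo]: ...
    def delete(self, key: str) -> None: ...
    def presign(self, key: str, expires_in: int = 3600) -> str: ...


class LocalStorage:
    """Local-disk driver. Keys are treated as relative paths under `root`.

    This sketch does not guard against keys containing '..'.
    """

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def _path(self, key: str) -> Path:
        return self.root / key

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()

    def stat(self, key: str) -> Optional[ObjectInfo]:
        path = self._path(key)
        return ObjectInfo(key, path.stat().st_size) if path.is_file() else None

    def delete(self, key: str) -> None:
        self._path(key).unlink(missing_ok=True)

    def presign(self, key: str, expires_in: int = 3600) -> str:
        # Nothing to sign locally; expires_in is accepted only for interface parity.
        return self._path(key).resolve().as_uri()


with tempfile.TemporaryDirectory() as root:
    store: StorageDriver = LocalStorage(root)
    store.put("blobs/ab/cd/abcd", b"pixels")
    data = store.get("blobs/ab/cd/abcd")
    info = store.stat("blobs/ab/cd/abcd")
    url = store.presign("blobs/ab/cd/abcd")
    store.delete("blobs/ab/cd/abcd")
    gone = store.stat("blobs/ab/cd/abcd")
```

Services depend only on `StorageDriver`, so switching from local disk to S3 or MinIO would be a configuration change rather than a code change.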

5.3 Ingestion & processing pipeline (`core/queue` + workers)

On upload, enqueue jobs; workers (Arq) process them:

Verify checksums, store blobs (dedup).
Extract metadata (pydicom/nibabel): modality, body part, study/series UIDs, dimensions, acquisition params → indexed + linked to blobs.
Thumbnails / previews for the browse UI.
De-identification (§5.4).
Format normalization (optional: DICOM→NIfTI via dcm2niix for ML).
Commit the resulting version; update search index; write audit entries.

Workers scale independently; GPU nodes handle ML jobs.
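
The first stage of the pipeline (verify checksums, store blobs with dedup) can be sketched as plain functions. A dict stands in for the object store, and the function names and the shape of the upload batch are assumptions. In the real system each stage would run as a queued worker job.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when uploaded bytes do not match the hash the client claimed."""


def verify_and_store(blobs: dict[str, bytes], claimed_hash: str, data: bytes) -> bool:
    """Verify the claimed SHA-256 and store the blob once.

    Returns True if the blob was newly stored, False if it was already present.
    """
    actual = hashlib.sha256(data).hexdigest()
    if actual != claimed_hash:
        raise ChecksumMismatch(f"expected {claimed_hash}, got {actual}")
    if actual in blobs:
        return False
    blobs[actual] = data
    return True


def ingest(uploads: dict[str, tuple[str, bytes]], blobs: dict[str, bytes]):
    """Process an upload batch of path -> (claimed_hash, data).

    Returns the manifest entries plus the hashes of blobs that are new, which are
    the only ones that need metadata extraction and thumbnails downstream.
    """
    manifest, new_blobs = {}, []
    for path, (claimed_hash, data) in uploads.items():
        if verify_and_store(blobs, claimed_hash, data):
            new_blobs.append(claimed_hash)
        manifest[path] = {"blob_hash": claimed_hash, "size": len(data)}
    return manifest, new_blobs


def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


store: dict[str, bytes] = {}
batch = {
    "study1/0001.dcm": (h(b"slice-1"), b"slice-1"),
    "study1/0002.dcm": (h(b"slice-2"), b"slice-2"),
    "study1/copy.dcm": (h(b"slice-1"), b"slice-1"),   # duplicate content
}
manifest, new_blobs = ingest(batch, store)
```

Three paths land in the manifest but only two blobs are stored, so the later stages run once per unique blob rather than once per path.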

5.4 De-identification engine (`core/deid`) — compliance must-have

A configurable, multi-stage pipeline producing a deidentified branch from a raw/PHI version:

Tag de-id per DICOM PS3.15 Annex E confidentiality profiles: remove/replace PHI tags, regenerate UIDs consistently (so series stay linked), handle private tags.
Date shifting: consistent per-patient offset to preserve intervals.
Burned-in pixel PHI: OCR (Tesseract/EasyOCR) to detect text in pixels, redact, and flag for human review.
Free-text / report de-id: Presidio NER over any text fields/reports.
Re-identification map (only if policy allows): the original↔pseudonym mapping is encrypted, access-restricted, and fully audited; otherwise the PHI source is dropped.
Verification stage emits a report of exactly what changed.

Tooling: pydicom, Stanford deid / MIRC CTP rule sets, Presidio, an OCR engine. Profiles are configurable per org/dataset.
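
Two of the stages above depend on determinism: regenerated UIDs must map the same input to the same output so series stay linked, and date shifts must use one offset per patient so intervals survive. One way to get both is a keyed hash, sketched below with the standard library. The function names, the per-dataset secret, and the one-year shift window are assumptions. New UIDs are formatted in the style of UUID-derived UIDs under the `2.25` root.

```python
import hashlib
import hmac
from datetime import date, timedelta


def _keyed_int(secret: bytes, value: str) -> int:
    """Derive a stable 128-bit integer from a value using HMAC-SHA256."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:16], "big")


def pseudonymize_uid(secret: bytes, original_uid: str) -> str:
    """Map a UID to a replacement; the same input always yields the same output."""
    return f"2.25.{_keyed_int(secret, 'uid:' + original_uid)}"


def date_offset_days(secret: bytes, patient_id: str, max_days: int = 365) -> int:
    """Per-patient shift, always backwards by between 1 and max_days days."""
    return -(1 + _keyed_int(secret, "date:" + patient_id) % max_days)


def shift_date(secret: bytes, patient_id: str, dicom_date: str) -> str:
    """Shift a DICOM DA value (YYYYMMDD) by the patient's fixed offset."""
    original = date(int(dicom_date[:4]), int(dicom_date[4:6]), int(dicom_date[6:8]))
    shifted = original + timedelta(days=date_offset_days(secret, patient_id))
    return shifted.strftime("%Y%m%d")


secret = b"per-dataset secret, kept with the re-identification material"
baseline = shift_date(secret, "PAT001", "20200110")
followup = shift_date(secret, "PAT001", "20200120")
```

Both dates move by the same amount, so the ten-day gap between baseline and follow-up is preserved. Anyone holding the secret can recompute the mapping, so it needs the same protection as the re-identification map.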

5.5 Web viewer (`api` + embedded TS widgets)

Progressive-enhancement widgets (not a separate SPA), true to Gitea:

Cornerstone3D for DICOM (multi-frame, MPR, windowing, measurements, segmentation overlays).
NiiVue for NIfTI volumes (great for neuro/research).
OpenSlide-backed deep-zoom tiles for whole-slide pathology (optional).

The server exposes a frame/tile API (a WADO-RS-like read path even without full DICOMweb). Annotations are structured objects (DICOM SR or JSON), versioned with the dataset.

5.6 Search & discovery (`core/index`)

Index extracted metadata + labels → faceted search ("brain MRI, T1, age<40, has tumor label"). Start on Postgres FTS; graduate to OpenSearch for scale. Optional later: compute image embeddings (a foundation model) → pgvector for "find similar studies/lesions".
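
The semantics of a faceted query like the one quoted above can be shown in memory. This is only an illustration of filtering plus facet counts, not the Postgres FTS or OpenSearch implementation. The record fields and the function names are assumptions.

```python
from collections import Counter


def matches(record: dict, filters: dict) -> bool:
    """A filter value is either an exact value or a predicate over the field."""
    for field, condition in filters.items():
        value = record.get(field)
        if callable(condition):
            if value is None or not condition(value):
                return False
        elif value != condition:
            return False
    return True


def facet_search(records: list[dict], filters: dict, facet_fields: list[str]):
    """Return the matching records and, per facet field, value counts among the hits."""
    hits = [r for r in records if matches(r, filters)]
    facets = {
        field: Counter(r[field] for r in hits if r.get(field) is not None)
        for field in facet_fields
    }
    return hits, facets


records = [
    {"id": 1, "modality": "MR", "body_part": "BRAIN", "sequence": "T1", "age": 35, "labels": ["tumor"]},
    {"id": 2, "modality": "MR", "body_part": "BRAIN", "sequence": "T1", "age": 62, "labels": ["tumor"]},
    {"id": 3, "modality": "MR", "body_part": "BRAIN", "sequence": "T2", "age": 29, "labels": []},
    {"id": 4, "modality": "CT", "body_part": "CHEST", "sequence": None, "age": 31, "labels": ["nodule"]},
]

# "brain MRI, T1, age<40, has tumor label"
query = {
    "modality": "MR",
    "body_part": "BRAIN",
    "sequence": "T1",
    "age": lambda age: age < 40,
    "labels": lambda labels: "tumor" in labels,
}
hits, facets = facet_search(records, query, ["modality", "sequence"])
```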

5.7 Collaboration (`services`)

Change Proposals (PRs), reviews, issues, comments, annotation tasks, releases, datasheets — the GitHub social layer, mapped to datasets. A reviewer of a Change Proposal sees the dataset diff and can open the viewer on changed series.

5.8 Pipelines & runners (Actions analogue, optional/advanced)

Event-driven compute (on: push | proposal | tag | schedule) executed by runners (containers that poll for jobs, à la act_runner). Use cases: auto de-id, QC/validation, dataset statistics, training/eval with MONAI. Each run pins the dataset version hash, giving reproducible ML by construction.

5.9 Auth, permissions, audit (`core/auth`, `core/audit`)

OIDC/OAuth2 login, sessions, scoped API tokens.
Org → Team → permission model; dataset visibility private | internal | public; dataset-level access requests / data-use agreements.
Audit log: append-only Postgres table (actor, action, object, dataset, version, IP, purpose-of-use, timestamp). Every PHI-bearing access (view original, download) is logged; optional hash-chaining for tamper-evidence; retention + legal-hold support.
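
The optional hash-chaining can be sketched as follows: each entry stores the hash of its predecessor, so editing or removing an earlier row breaks every later link. A list stands in for the Postgres table, and the field names `prev_hash` and `entry_hash` are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64   # prev_hash of the first entry


def _hash_body(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], entry: dict) -> dict:
    """Append an audit entry linked to the previous one by hash."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS
    body = dict(entry, prev_hash=prev_hash)
    record = dict(body, entry_hash=_hash_body(body))
    log.append(record)
    return record


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash and check each link back to the genesis value."""
    prev_hash = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "entry_hash"}
        if body["prev_hash"] != prev_hash or _hash_body(body) != record["entry_hash"]:
            return False
        prev_hash = record["entry_hash"]
    return True


log: list[dict] = []
append_entry(log, {"actor": "alice", "action": "view_original", "object": "series/42",
                   "purpose": "QC", "at": "2026-06-30T09:00:00Z"})
append_entry(log, {"actor": "bob", "action": "download", "object": "series/42",
                   "purpose": "research", "at": "2026-06-30T09:05:00Z"})
intact = verify_chain(log)

log[0]["actor"] = "mallory"       # simulate tampering with an earlier row
tampered_ok = verify_chain(log)
```

The chain makes tampering detectable, not impossible. Someone who can rewrite the whole table can also recompute it, so the latest hash should be anchored somewhere the database administrator cannot edit.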

5.10 API, SDK, CLI

REST API (FastAPI, OpenAPI-documented — the swagger analogue).
Python SDK (the most important client for ML users): pull a pinned version straight into a torch/MONAI Dataset.
CLI (imagehub clone/pull/push/checkout/commit/diff) — the git/dvc analogue for data engineers.

6. Data model (core tables)

user, organization, team, team_membership, team_access
dataset(id, owner_id, name, visibility, default_branch, description)
ref(dataset_id, name, type[branch|tag], commit_id)             -- transactional refs
commit(id, dataset_id, manifest_hash, parent_ids[], author_id, message, created_at)
blob(hash PK, size, storage_key, media_type, refcount)         -- content-addressed, deduped
manifest(hash PK, storage_key)                                 -- stored in object store, hash in DB
instance_meta(blob_hash, dataset_id, study_uid, series_uid, modality, body_part, dims, params…)
annotation(id, dataset_id, commit_id, target, type, payload, author_id)
label_schema(id, dataset_id, spec)         label(id, schema_id, value)
change_proposal(id, dataset_id, src_ref, dst_ref, status)  review, comment
issue(id, dataset_id, …)  issue_comment
release(id, dataset_id, tag, notes, doi?)
pipeline(id, dataset_id, spec)  pipeline_run(id, pipeline_id, commit_id, status, artifacts)  runner
webhook  webhook_delivery
audit_log(id, actor_id, action, object_type, object_id, dataset_id, ip, purpose, created_at)  -- append-only
access_request, data_use_agreement
phi_map(dataset_id, original_ref, pseudonym, …)               -- encrypted, restricted, audited
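
The `ref` table is where transactional refs pay off: a branch should move only if it still points at the commit the writer last saw. The sketch below shows that compare-and-swap update, with `sqlite3` standing in for Postgres so the example is self-contained. The column types, the primary key, and the function names are assumptions.

```python
import sqlite3
from typing import Optional


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE ref ("
        " dataset_id INTEGER NOT NULL,"
        " name TEXT NOT NULL,"
        " type TEXT NOT NULL,"
        " commit_id TEXT NOT NULL,"
        " PRIMARY KEY (dataset_id, name))"
    )
    return db


def update_ref(db: sqlite3.Connection, dataset_id: int, name: str,
               expected: Optional[str], new: str) -> bool:
    """Move a branch only if it still points at `expected`.

    `expected=None` means "create the branch; fail if it already exists".
    Returns True when the ref was created or moved.
    """
    if expected is None:
        try:
            with db:   # commits on success, rolls back on error
                db.execute("INSERT INTO ref VALUES (?, ?, 'branch', ?)",
                           (dataset_id, name, new))
            return True
        except sqlite3.IntegrityError:
            return False
    with db:
        cursor = db.execute(
            "UPDATE ref SET commit_id = ? "
            "WHERE dataset_id = ? AND name = ? AND commit_id = ?",
            (new, dataset_id, name, expected),
        )
    return cursor.rowcount == 1


db = open_db()
created = update_ref(db, 1, "main", None, "c1")
moved = update_ref(db, 1, "main", "c1", "c2")
stale = update_ref(db, 1, "main", "c1", "c3")   # another writer already moved main
current = db.execute("SELECT commit_id FROM ref WHERE dataset_id = 1 AND name = 'main'").fetchone()[0]
```

A rejected update tells the client its view of the branch is stale, which is the same signal as a non-fast-forward push in Git.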

7. Key flows

Ingest & de-identify: upload → blobs stored (deduped) → metadata extracted → de-id pipeline → new commit on deidentified branch → indexed → audited.
Browse & view: datasets list → dataset → series list → Cornerstone3D/NiiVue streams frames → annotation overlays.
Curate an ML subset (zero-copy): faceted query → new branch/dataset whose manifest references existing blobs (no data copied) → commit → tag a release → sdk.pull(tag) in training.
Propose a change (PR): push new/relabeled data to a branch → open Change Proposal → reviewer sees dataset diff + image diff → approve → merge.
Reproducible training: tag triggers a pipeline that pins the version hash, runs MONAI train/eval, and links metrics + model artifact to that exact data version.
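
The zero-copy curation flow follows directly from content addressing. A subset is just a new manifest whose entries point at blobs the server already holds, so the have/want exchange finds nothing to upload. A minimal sketch, with hypothetical function names and a simplified manifest entry:

```python
def curate_subset(manifest: dict[str, dict], predicate) -> dict[str, dict]:
    """Build a new manifest from the entries that match; no blob data is copied."""
    return {path: entry for path, entry in manifest.items() if predicate(entry)}


def missing_blobs(server_blobs: set[str], wanted: list[str]) -> list[str]:
    """Server side of "have/want": the hashes the client still has to upload."""
    seen: set[str] = set()
    missing = []
    for blob_hash in wanted:
        if blob_hash not in server_blobs and blob_hash not in seen:
            seen.add(blob_hash)
            missing.append(blob_hash)
    return missing


full = {
    "ct/0001.dcm": {"blob_hash": "h1", "modality": "CT"},
    "mr/0001.nii": {"blob_hash": "h2", "modality": "MR"},
    "ct/0002.dcm": {"blob_hash": "h3", "modality": "CT"},
}
server_has = {"h1", "h2", "h3"}

ct_only = curate_subset(full, lambda entry: entry["modality"] == "CT")
to_upload = missing_blobs(server_has, [entry["blob_hash"] for entry in ct_only.values()])
```

Here `to_upload` is empty, so committing the curated subset costs one small manifest and one commit record, however large the underlying images are.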

8. Deployment topology

            ┌──────────── reverse proxy (Caddy/Traefik) + TLS ────────────┐
            │                                                              │
   ┌────────▼────────┐    ┌──────────────────┐    ┌───────────────────────▼─┐
   │  Core app (N×)  │    │  Worker tier (M×) │    │  Runners (K×, GPU opt.)  │
   │  FastAPI/uvicorn│    │  Arq + imaging/ML │    │  pipelines (train/eval)  │
   └───┬─────┬───┬───┘    └───┬─────────┬─────┘    └────────────┬────────────┘
       │     │   │            │         │                       │
   ┌───▼─┐ ┌─▼─┐ │        ┌───▼───┐ ┌───▼────────┐         ┌────▼────┐
   │ PG  │ │Redis│ └──────▶│ Redis │ │ Object store│◀────────┤Object st.│
   │(state│ │queue│        │ queue │ │ S3 / MinIO  │         │ (blobs) │
   │ refs)│ │cache│        └───────┘ │ (blobs)     │         └─────────┘
   └─────┘ └────┘                    └─────────────┘
                          ┌───────────────┐
                          │  OpenSearch    │  (metadata/label search)
                          └───────────────┘

Core app: stateless, horizontally scalable.
Worker tier: scales independently; CPU for de-id/convert, GPU for ML.
Postgres: state, refs, metadata, audit. Redis: queue, cache, sessions, server-sent events. Object storage: all blobs. OpenSearch: search.
Dev / small self-host: a single docker-compose (app + worker + PG + Redis + MinIO + OpenSearch). Scale: Kubernetes with separate node pools.
Contrast with Gitea (one binary, in-proc workers): we externalize workers and object storage because imaging/ML work is heavy, Python-bound, and GPU-hungry.

9. Build-vs-buy summary

Component	Recommendation
Versioning engine	Build the manifest/commit model (custom) — or back it with lakeFS/DVC behind your API to ship faster.
Viewer	Adopt Cornerstone3D + NiiVue (+ OpenSlide for WSI). Don't build.
De-identification	Assemble from pydicom + `deid`/CTP rules + Presidio + OCR. Don't build from scratch.
Search	Postgres FTS first → OpenSearch at scale.
Auth	Authlib (OIDC).
Queue	Arq (async) or Celery.
Object storage	MinIO self-host / S3 cloud.

10. MVP-first roadmap

Ordered for the chosen must-haves (versioning + viewer + de-id + audit):

Phase 0 — Skeleton. Layered project structure, config, Postgres + Alembic, object-storage driver, auth (user/org/team), dataset CRUD.
Phase 1 — Versioning engine. Blobs, manifests, commits, branches; push/pull via CLI + SDK; dataset diff. (This is the product's spine — invest here.)
Phase 2 — Ingestion + de-id + audit. Worker tier, metadata extraction, de-identification pipeline, append-only audit log. (The compliance core.)
Phase 3 — Viewer + search. Cornerstone3D/NiiVue widgets, thumbnails, faceted metadata search, browse UI.
Phase 4 — Collaboration. Change Proposals, reviews, issues, annotations, citable releases, datasheets.
Phase 5 — Pipelines. Runners, event triggers, reproducible MONAI train/eval, webhooks.
Later / optional. DICOMweb + PACS adapter (QIDO/WADO/STOW), image-embedding similarity search (pgvector), whole-slide pathology.

Appendix — naming parallels for orientation

git clone → imagehub clone · repository → dataset · commit → version · push/pull → push/pull · PR → change proposal · .git/objects → content-addressed blob store · act_runner → pipeline runner · app.ini → config · XORM → SQLAlchemy.
