Files
sciagent/docs/user-profile-manager-db-review.md
Thinh Lam 688fac73e9
CI/CD / backend (push) Failing after 2m8s
CI/CD / frontend (push) Failing after 1m40s
CI/CD / deploy (push) Has been skipped
sciagent code + Gitea Actions CI/CD
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:38:30 +07:00

18 KiB

Database Engineering Review — User Profile Manager State Machine

Reviewer role: Senior database engineer Document under review: User profile manager — integration analysis and implementation state machine Audience: Backend / platform engineering team Status: Constructive feedback before implementation kickoff


1. Overall assessment

The document is well-organized and shows good systems thinking. In particular, three things are already right and should not be relitigated during implementation:

  • The MinIO boundary is correctly drawn. Profile scalars belong in PostgreSQL; binary artifacts (proof documents, exports) are the only thing that should ever touch object storage. Keep that line bright.
  • The naming separation between RBAC roles (admin, viewer, …) and the institutional job_title / position_title is essential and easy to get wrong. Holding to it prevents a class of authorization bugs.
  • Treating is_active (account disabled) and profile_verification_status (HR data trusted) as orthogonal axes is correct. They answer different questions and must not collapse into one column.

That said, the document is largely an interface-level plan. Several decisions that look like product-policy knobs are actually database design decisions with long-term cost — schema shape, constraint placement, transition enforcement, index strategy, and migration mechanics. The rest of this review focuses there.


2. Schema shape: pick the 1:1 child table, not extra columns on users

The document offers two options and leans neutral. The team should pick user_staff_profiles as a 1:1 child table keyed by user_id, and commit to it. Reasons:

  1. Hot-row contention. The users row is read on every authenticated request (token validation, /me, role checks). Every column added to that table widens the row, increases TOAST pressure for nullable text columns, and pulls more bytes through the cache for traffic that doesn't care about HR data. Profile fields are read on profile pages and admin queues — different access pattern, different table.
  2. Clear ownership. Auth code owns users + user_roles. HR/profile code owns user_staff_profiles. The boundary lines up with the API boundary you're already drawing (/auth/* vs /admin/users/*).
  3. Migration safety. Adding columns to users in a live system means coordinating with every read path that does SELECT * (hopefully none, but in practice some). A new table is additive and isolated.
  4. Permissions. Postgres-level GRANTs and row-level security policies, if you ever want them, are far easier to write per-table than per-column.

The one tradeoff is an extra LEFT JOIN on profile reads. That is cheap and indexable; it is not a reason to denormalize.

Recommendation: create user_staff_profiles with user_id PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE. The 1:1 is enforced by the PK.


3. Field-by-field schema critique

3.1 employee_id

The document says "uniqueness policy TBD". This must be decided before the migration is written, because it determines the constraint shape:

  • If globally unique when present, use a partial unique index: CREATE UNIQUE INDEX ... ON user_staff_profiles (employee_id) WHERE employee_id IS NOT NULL; A plain UNIQUE would forbid more than one NULL on some configurations and is the wrong tool here.
  • If required at registration, make it NOT NULL and use a regular UNIQUE.
  • Decide format validation (length, charset, leading zeros) and enforce with a CHECK constraint, not just in Pydantic. The DB is the last line of defense when a future script writes directly.

3.2 academic_title + academic_title_other

The enum + free-text "other" pattern works but has two failure modes worth pre-empting:

  • Drift between code and DB. If academic_title is a Postgres ENUM, adding a value requires ALTER TYPE, which is fine but easy to forget in code review. Prefer a lookup table (academic_titles(code, label_vi, label_en, sort_order, active)) referenced by FK. Adding a value is an INSERT, not a DDL change. Translations and ordering live next to the value.
  • other invariant. Enforce at the DB level: CHECK ((academic_title = 'other') = (academic_title_other IS NOT NULL AND length(trim(academic_title_other)) > 0)). Without this, you will eventually see rows where academic_title='professor' and academic_title_other='Something' from a buggy client.

3.3 department vs the existing users.unit_id

This is the most important inconsistency in the proposal. User already has an optional unit_id. The new design proposes department as free text. Pick one:

  • If units is a real catalog, drop department from the new design and reuse unit_id. Free text on top of a catalog guarantees data quality decay.
  • If the catalog is incomplete and free text is a transitional necessity, model it as unit_id NULLABLE + unit_name_freetext NULLABLE with a CHECK that exactly one is set. Plan a backfill story.

Don't ship two parallel notions of "where this person works."

3.4 job_title

Free text is fine here, but:

  • Length cap (e.g. VARCHAR(120) or TEXT with a CHECK length()). Without one, this is a future abuse vector in admin UI.
  • Keep it nullable. Some accounts (system, integration) won't have one.

3.5 Verification fields

profile_verification_status, verification_submitted_at, verified_at, verified_by_user_id, rejection_reason — fine, but enforce the invariants between them in the DB, not just in Python. See §4.

3.6 Timestamps

Every _at column should be TIMESTAMPTZ, never TIMESTAMP. You will eventually have admins in different time zones, and "naive timestamps interpreted as UTC by some code paths and as local time by others" is a recurring bug class. Make the DB authoritative.


4. Enforce the state machine in the database, not only in the application

The Mermaid diagram is clear, but right now nothing prevents a buggy code path or a one-off SQL fix from putting a row in an impossible state — e.g. status='verified' with verified_at IS NULL, or status='rejected' with no rejection_reason. Three layers of defense, in order of cost:

4.1 Status as a constrained type

Use either a Postgres ENUM or a CHECK (profile_verification_status IN ('draft','pending','verified','rejected')). Lookup table is also acceptable. Whichever you pick, don't store free-text statuses — they always rot.

4.2 Cross-column invariants

Express the state machine's value invariants as CHECK constraints. Concretely:

CONSTRAINT verified_requires_metadata CHECK (
  profile_verification_status <> 'verified'
  OR (verified_at IS NOT NULL AND verified_by_user_id IS NOT NULL)
),
CONSTRAINT rejected_requires_reason CHECK (
  profile_verification_status <> 'rejected'
  OR (rejection_reason IS NOT NULL AND length(trim(rejection_reason)) > 0)
),
CONSTRAINT pending_clears_verification CHECK (
  profile_verification_status NOT IN ('draft','pending')
  OR (verified_at IS NULL AND verified_by_user_id IS NULL)
)

These are cheap, write-time-only, and they turn an entire category of "how did we get here" production incidents into impossible-to-commit transactions.

4.3 Transition legality

The richer rule — "verified cannot transition to draft directly" — is harder to express as a column constraint. Two options:

  • Application-level only, but always wrap transitions in UPDATE … WHERE id = $1 AND profile_verification_status = $expected_prior_state and check the affected-row count. This gives you optimistic concurrency for free and rejects illegal transitions implicitly because the WHERE won't match.
  • Trigger-based transition table. Heavier; only worth it if multiple services write to the table. For a single backend, the conditional UPDATE pattern is enough.

Whichever you pick, document and test the legal transitions explicitly. The diagram is the spec; the test suite should encode it directly (one test per transition, including illegal ones expecting failure).


5. Concurrency — name the races now

Two scenarios will happen in production. Decide the answer before writing code:

  1. Two admins click "Verify" on the same pending profile within the same second. With the conditional-UPDATE pattern above, the second one's UPDATE affects 0 rows. The handler must return a clean 409 Conflict (or treat as idempotent no-op — pick one and document it). Don't return 200 with a stale read; that's how audit logs end up with two "verified" entries for one transition.
  2. Applicant PATCHes their profile while an admin is reviewing it. If the policy is "edits after verification revert to pending", this resolves itself. If the policy is "freeze after verified", you need a check at PATCH time. In both cases, the admin's verification UPDATE should be conditional on the content the admin reviewed, not just the prior state — consider an etag / version integer column bumped on every applicant edit, and require admins to send back the version they reviewed.

The document mentions idempotency as a "pick one behavior" — agreed, but the team should write down which one and link to it from the admin endpoint code.


6. Audit logging — beyond "extend audit_log"

§3.5 of the source doc says "extend audit_log with entity_type='user_profile', before/after snapshot." Some sharper guidance:

  • Store snapshots as JSONB, not stringified JSON. JSONB lets you query later (e.g. "show me all profiles where the verifier was admin X" or "find profiles whose job_title changed in Q3") without parsing.
  • Whitelist fields included in snapshots, don't blacklist. The whitelist won't accidentally include password_hash or future PII columns the auditor didn't think about.
  • Add a monotonic event ordering column (created_at TIMESTAMPTZ DEFAULT now() plus an event_id BIGSERIAL) and index by (entity_type, entity_id, event_id). Querying "the history of this profile in order" is the most common audit access pattern; make it fast.
  • Audit rows must be written in the same transaction as the state change. If the audit insert fails, the state change must roll back. Otherwise the log is not trustworthy.
  • Plan for growth. Audit logs are append-only and grow without bound. Decide retention (e.g. 7 years for HR-relevant entries, 2 years for cosmetic edits) and put a partitioning strategy on the table from day one — PARTITION BY RANGE (created_at) monthly is a low-effort default and means you can drop old partitions cheaply later.

7. Indexing strategy for the admin queue

The admin "Users" page will have predictable hot queries. Plan indexes for them up front, not after the page is slow:

  • Pending queue: CREATE INDEX ix_usp_pending ON user_staff_profiles (verification_submitted_at) WHERE profile_verification_status = 'pending'; Partial indexes are tiny and exactly match the query.
  • Filter by department/unit: composite index covering the typical filter + sort, e.g. (unit_id, profile_verification_status, verification_submitted_at).
  • Lookup by employee_id: covered by the unique index above.
  • Admin activity: (verified_by_user_id, verified_at DESC) for "what did admin X approve recently".

Skip speculative indexes. Add them when the access pattern is real. But the four above will be real on day one.


8. Migration mechanics

The document says "Alembic/SQL migration consistent with repo conventions" — fine, but a few points the team should agree on before writing the migration:

  1. Backfill plan. New columns on users (or new user_staff_profiles rows for existing accounts) need a defined initial state. Recommended: for every existing user, INSERT a user_staff_profiles row with profile_verification_status = 'draft' and all institutional fields NULL. Do this in the same migration that creates the table, in a single transaction if the user count is small (<100k); otherwise batch it.
  2. NOT NULL columns must be added in two steps for live tables: add nullable → backfill → set NOT NULL. The Alembic migration should do all three, and the second step should be batched.
  3. Reversibility. Every migration needs a working downgrade(). Verification status data will be lost on downgrade — that's fine, but make it explicit (DROP TABLE user_staff_profiles) rather than half-reversible.
  4. Lock duration. ALTER TABLE users ADD COLUMN with a default takes an ACCESS EXCLUSIVE lock; on a busy auth table this is user-visible downtime. Adding a separate table sidesteps this entirely — another reason to prefer user_staff_profiles.

9. Suggested DDL skeleton

Provided as a starting point for the migration author, not as the final word. Field list mirrors §5 of the source doc.

CREATE TYPE profile_verification_status AS ENUM
  ('draft', 'pending', 'verified', 'rejected');

CREATE TABLE academic_titles (
  code         TEXT PRIMARY KEY,
  label_vi     TEXT NOT NULL,
  label_en     TEXT NOT NULL,
  sort_order   INTEGER NOT NULL DEFAULT 0,
  active       BOOLEAN NOT NULL DEFAULT TRUE
);

CREATE TABLE user_staff_profiles (
  user_id                       UUID PRIMARY KEY
                                  REFERENCES users(id) ON DELETE CASCADE,

  employee_id                   TEXT,
  academic_title_code           TEXT REFERENCES academic_titles(code),
  academic_title_other          TEXT,
  unit_name_freetext            TEXT,        -- only if catalog is incomplete; see §3.3
  job_title                     TEXT,

  profile_verification_status   profile_verification_status
                                  NOT NULL DEFAULT 'draft',
  verification_submitted_at     TIMESTAMPTZ,
  verified_at                   TIMESTAMPTZ,
  verified_by_user_id           UUID REFERENCES users(id),
  rejection_reason              TEXT,

  version                       INTEGER NOT NULL DEFAULT 1,
  created_at                    TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at                    TIMESTAMPTZ NOT NULL DEFAULT now(),

  CONSTRAINT employee_id_shape
    CHECK (employee_id IS NULL OR employee_id ~ '^[A-Z0-9-]{3,32}$'),

  CONSTRAINT academic_title_other_invariant CHECK (
    (academic_title_code = 'other')
      = (academic_title_other IS NOT NULL
         AND length(trim(academic_title_other)) > 0)
  ),

  CONSTRAINT verified_requires_metadata CHECK (
    profile_verification_status <> 'verified'
    OR (verified_at IS NOT NULL AND verified_by_user_id IS NOT NULL)
  ),

  CONSTRAINT rejected_requires_reason CHECK (
    profile_verification_status <> 'rejected'
    OR (rejection_reason IS NOT NULL
        AND length(trim(rejection_reason)) > 0)
  ),

  CONSTRAINT non_terminal_clears_verification CHECK (
    profile_verification_status NOT IN ('draft','pending')
    OR (verified_at IS NULL AND verified_by_user_id IS NULL)
  ),

  CONSTRAINT job_title_length CHECK (
    job_title IS NULL OR length(job_title) <= 120
  )
);

CREATE UNIQUE INDEX ix_usp_employee_id_unique
  ON user_staff_profiles (employee_id)
  WHERE employee_id IS NOT NULL;

CREATE INDEX ix_usp_pending_queue
  ON user_staff_profiles (verification_submitted_at)
  WHERE profile_verification_status = 'pending';

CREATE INDEX ix_usp_verifier_activity
  ON user_staff_profiles (verified_by_user_id, verified_at DESC)
  WHERE verified_by_user_id IS NOT NULL;

The version column supports the optimistic concurrency pattern in §5. Bump it in every UPDATE; require it in admin verify/reject calls.


10. Testing the database layer specifically

API tests (mentioned in source §5) are necessary but not sufficient. Add:

  • Constraint tests that attempt to INSERT impossible rows and assert the CHECK rejects them. These guard the DDL itself.
  • Transition tests that walk every legal edge in the state machine and assert success, plus every illegal edge and assert rejection. The Mermaid diagram is the source of truth — generate the test matrix from it.
  • Concurrency tests using two real connections (not mocks) attempting the same verify simultaneously. Pytest can do this with threading + a proper test DB.
  • Migration round-trip tests: upgrade → downgrade → upgrade on a populated test DB to catch reversibility bugs early.

11. Open questions the team must answer before merge

These are not blockers for design, but they are blockers for "is the migration done":

  1. Is employee_id required at registration, optional, or required-before-pending? (Determines NULL policy and CHECK shape.)
  2. Strict or pragmatic re-verification on edit? (§4.1.B in source doc.) Pick one; encode it in tests.
  3. Verifying an already-verified profile: idempotent 200, or 409? (§3.4 in source doc.)
  4. Catalog vs free-text department: is units populated and trustworthy? If yes, drop unit_name_freetext. If no, what is the backfill plan?
  5. Audit log retention period and partitioning cadence?
  6. Does verification ever expire (e.g. annual re-verification for HR compliance)? If yes, schema needs verified_until and a scheduled job; better to design for it now than retrofit.

12. Summary checklist

Topic Recommendation
Schema shape Separate user_staff_profiles 1:1 table; do not widen users.
employee_id uniqueness Partial unique index, NULL policy decided before migration.
Academic title Lookup table over Postgres ENUM; CHECK enforces the other invariant.
Department Reconcile with existing unit_id; do not ship parallel notions.
State enforcement Postgres ENUM/CHECK for status; cross-column CHECKs for invariants.
Transitions Conditional UPDATE on prior state; reject illegal transitions implicitly.
Concurrency version column for optimistic locking; admin verify carries expected version.
Audit JSONB snapshots, whitelist fields, same-transaction writes, partitioned by month.
Indexes Partial index on pending queue; composite on admin activity; unique on employee_id.
Timestamps TIMESTAMPTZ everywhere.
Migration Two-phase NOT NULL; backfill in same migration; reversible downgrade.
Tests Constraint tests, transition matrix, concurrency tests, migration round-trip.

The plan in the source document is sound. The notes above are the difference between a design that survives one release and one that survives five.


Prepared as a database engineering review, complementary to the original integration analysis. Any path references mirror the source document and should be re-verified against the current tree before code changes land.