Files
sciagent/docs/feedback.md
T
Thinh Lam 688fac73e9
CI/CD / backend (push) Failing after 2m8s
CI/CD / frontend (push) Failing after 1m40s
CI/CD / deploy (push) Has been skipped
sciagent code + Gitea Actions CI/CD
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:38:30 +07:00

12 KiB
Raw Blame History

Frontend Stability Review — Dashboard Data Refresh

Reviewing fe0-dashboard-data-refresh-architecture.md from the perspective of stability, predictability, and operability. The current design is reasonable in spirit (polling for "good enough" freshness, invalidation where it's natural) but has several latent footguns that will eventually cause incidents or flakiness as the user base and feature set grow.

The notes below are ordered by stability impact, not by where they appear in the document.


1. Critical: two App.tsx entrypoints with divergent QueryClient defaults

This is the single most dangerous thing in the document.

"There is also fe0/src/app/App.tsx, which configures different defaults (refetchOnWindowFocus: false, staleTime: 5 minutes)... If you ever switch entrypoints, global refetch behavior would change without touching feature code."

This is a silent global config switch waiting to be flipped. A junior engineer cleaning up imports, a refactor that "consolidates the App shell," or a bad rebase can change every cache and refetch policy in the app without a single line of feature code being modified. Worse, the per-query refetchInterval: 10s would still appear correct, masking the change. PR review will not catch this — diff readers don't reason about which App.tsx main.tsx resolves to.

Action — do this first, before anything else on this list:

  1. Pick one entrypoint. Delete the other in the same PR. Do not leave a TODO.
  2. Add a CI check or ESLint rule (no-restricted-imports) that forbids importing the other path, in case anyone restores it.
  3. The surviving file should set explicit defaultOptions on the QueryClient, even if the values match library defaults. Implicit defaults are a version-upgrade hazard — TanStack Query has changed defaults across major versions before. You do not want a pnpm up to silently change refetch semantics.
const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      refetchOnWindowFocus: false,
      refetchOnReconnect: true,
      staleTime: 30_000,
      gcTime: 5 * 60_000,
      retry: (failureCount, err) => {
        if (isAuthError(err)) return false;   // never retry 401/403
        return failureCount < 2;
      },
      retryDelay: (attempt) => Math.min(1000 * 2 ** attempt, 8000),
    },
    mutations: { retry: false },
  },
});

Then per-query options (refetchInterval, refetchOnWindowFocus: "always", etc.) become deliberate opt-ins, which is what they should be.


2. Critical: refetchOnWindowFocus: "always" on the admin inbox

"always" means refetch on every focus event regardless of staleness. Combined with 10s polling, the practical effect is:

  • An admin with the tab focused: ~6 requests/min from polling.
  • An admin who tabs in/out a few times per minute: polling plus every focus refetch — easily 1220 requests/min per tab.
  • N admins doing this at 9am Monday: a synchronized burst that the backend has no way to absorb gracefully.

"always" is rarely the right answer. It exists for cases where stale data is genuinely dangerous (e.g. a trading screen). An approvals inbox is not that. Use true (refetch only if stale) and pair it with a sensible staleTime. The user experience difference will be imperceptible; the server load difference will not.


3. High: polling is not visibility-aware

A background tab still polls (browsers throttle timers but do not stop them, and some don't throttle aggressively when audio/SSE/etc. is active). For a dashboard people leave open all day in a second tab, this is pure waste — and it's waste that scales linearly with headcount.

The document already mentions the fix as a "common direction":

refetchInterval: (query) =>
  document.visibilityState === "visible" ? 10_000 : false,

This should not be in the "if you want fewer calls" section. It should be the default pattern for every polling query in the app, captured in a small wrapper hook so engineers don't have to remember:

function useVisibilityAwareInterval(ms: number) {
  return (query: Query) =>
    document.visibilityState === "visible" ? ms : false;
}

4. High: isFetching is leaking into unrelated UI

"Stabilizing export used export-only loading state so the button does not follow list refetch"

The fact that this fix was needed tells me the codebase has a pattern problem, not just one buggy button. Anywhere isLoading / isFetching from a list query is wired into a control that is conceptually independent (export, filters, "new application" button, pagination), the same bug exists — it just hasn't been noticed because dev latency hides it.

Recommendation: audit every consumer of the ["applications", ...] query for isFetching / isLoading usage. As a rule:

  • isLoading (true only on first load, no cached data) is usually fine to gate UI on.
  • isFetching (true on every refetch including background polls) should almost never disable user-initiated controls. It is for showing a subtle indicator at most.

Consider a lint rule or a code review checklist item: "If you're disabling a button on isFetching, justify it in a comment."

This bug profile — "works locally, broken in cloud" — is exactly the kind of thing that erodes trust in the frontend. Slow networks shouldn't reveal latent bugs; they should reveal honest slowness.


5. High: query key stability for ["applications", filters]

Not in the document, but worth checking: if filters is an object literal constructed in render ({ status, page, q }), its reference changes every render and TanStack Query may re-key on every render, producing far more requests than intended. Symptoms: requests fire on keystrokes, on unrelated state changes, etc.

Mitigations:

  • Memoize the filters object with useMemo, or
  • Pass primitive fields directly into the key: ["applications", status, page, q]. This is more verbose but eliminates a whole class of bugs and makes the key serializable for debugging.

If this isn't already enforced, add it to the team's TanStack Query conventions doc.


6. Medium: inconsistent refresh strategy between admin and council

Admin uses time-based polling. Council uses event-driven invalidation via reportSyncQuery. These are two different mental models for the same surface (an "approved applications list"), maintained in two different files.

This costs the team in three ways:

  1. Cognitive load — every engineer has to learn both patterns and remember which file uses which.
  2. Bug asymmetry — a bug fixed in one file will not be fixed in the other. Already visible: the isFetching export issue likely needs to be checked in both.
  3. Drift — over time the two implementations diverge in subtle ways (column order, filter semantics, error handling) and the union of behaviors becomes the de facto contract, which nobody actually documented.

Pick one strategy and apply it everywhere applications are listed. My recommendation:

  • Primary: invalidation on mutations (approve, reject, submit, assign) plus invalidation on a lightweight "version" or "report sync" signal.
  • Secondary: a slow safety-net poll (60120s, visibility-aware) so a missed invalidation doesn't leave the screen permanently stale.

This gets you near-instant updates after user actions (which is what people actually notice) and removes the 10s drumbeat against the backend.

If true realtime ever becomes a product requirement, SSE is the natural next step — one long-lived connection per tab is dramatically cheaper than 6 polls/minute, and it integrates cleanly behind the existing apiClient. WebSockets are overkill unless the server needs to push high-frequency updates.


7. Medium: no documented retry, timeout, or auth-failure policy

The document covers the happy path thoroughly but says nothing about:

  • Timeouts — Axios has no default timeout. A hung backend will leave requests pending indefinitely, which compounds with 10s polling: stuck requests stack up, the browser hits its per-host concurrency limit, and the UI appears frozen.
  • Retries — TanStack Query retries 3× by default. For a 10s poll, that means a transient 500 produces 4 requests in quick succession before the next poll, amplifying load exactly when the backend is already struggling.
  • 401/403 — does the response interceptor force logout, redirect, or attempt refresh? Silent 401 loops (auth expired → poll → 401 → poll → 401) are a classic source of "the dashboard is broken" reports that turn out to be expired sessions.
  • Network offlinerefetchOnReconnect: true is good, but is there UI feedback during the offline window? Otherwise users see "blank inbox" and assume data loss.

Add an explicit section to the architecture doc covering these. They are stability features even when they don't fire.


8. Medium: dev logging interceptor

Logging every successful response in dev is fine, but two small upgrades:

  1. Sample or summarize rather than logging full data payloads. Large list responses make the console unusable and hide the actual interesting events. console.debug (silenced by default in DevTools) is often a better choice than console.log.
  2. Guard against accidental production leakage with a build-time assertion, not just import.meta.env.DEV. A misconfigured deploy that ships dev mode to production would log user data into browser consoles — minor but real privacy/compliance concern.

9. Low: polling intervals are scattered magic numbers

10s, 30s, 30s, 60s, 60s across five files. None are configurable. None share a constant. If product asks "can we double polling intervals across the board for a load test?", the answer requires touching five files and hoping you found them all.

Centralize:

// fe0/src/shared/config/polling.ts
export const POLL_INTERVALS = {
  adminInbox: 10_000,
  notificationsCount: 60_000,
  notificationsList: 60_000,
  adminOverview: 30_000,
  aiHealth: 30_000,
} as const;

Bonus: makes it trivial to make them env-configurable later, and gives ops a single place to tune under load.


10. Low: thundering herd / no jitter

All polls fire at fixed intervals from mount time. With many admins logging in around the same wall-clock window (start of shift, morning standup), poll cycles tend to align. Adding ±1020% jitter to intervals smooths backend load with zero UX cost:

const jitter = (ms: number) => ms * (0.9 + Math.random() * 0.2);
refetchInterval: jitter(10_000)

Worth doing once you have more than a handful of concurrent admins.


Suggested order of work

A pragmatic sequencing if the team can only do this incrementally:

  1. This week — Delete one of the two App.tsx files. Set explicit QueryClient defaults. (Items 1.)
  2. Next sprint — Audit isFetching consumers; introduce visibility-aware polling helper; replace "always" with true. (Items 2, 3, 4.)
  3. Following sprint — Centralize polling intervals; document retry/timeout/auth policy and implement gaps; verify query key stability. (Items 5, 7, 9.)
  4. Quarter horizon — Converge admin and council on one refresh strategy (invalidation primary, slow poll as safety net). Consider SSE if/when realtime becomes a product ask. (Item 6.)

What's already good

To be fair to the existing design — the document describes several decisions that are correct and worth preserving:

  • placeholderData: (previous) => previous to avoid table flicker is the right call.
  • Invalidating notifications-unread-count after mutations rather than waiting for the next poll tick is exactly the pattern more of the app should follow.
  • A single shared apiClient is a good foundation; the issues above are about what to put on top of it, not about replacing it.
  • Documenting local-vs-cloud behavioral differences (latency hiding isFetching bugs) is the kind of institutional knowledge that usually only lives in people's heads. Keep doing this.

The bones are fine. The work is mostly about removing implicit behavior, consolidating duplicated patterns, and making polling polite.