Files
sciagent/docs/feedback.md
T
Thinh Lam 688fac73e9
CI/CD / backend (push) Failing after 2m8s
CI/CD / frontend (push) Failing after 1m40s
CI/CD / deploy (push) Has been skipped
sciagent code + Gitea Actions CI/CD
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:38:30 +07:00

208 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Frontend Stability Review — Dashboard Data Refresh
Reviewing `fe0-dashboard-data-refresh-architecture.md` from the perspective of stability, predictability, and operability. The current design is **reasonable in spirit** (polling for "good enough" freshness, invalidation where it's natural) but has several **latent footguns** that will eventually cause incidents or flakiness as the user base and feature set grow.
The notes below are ordered by stability impact, not by where they appear in the document.
---
## 1. Critical: two `App.tsx` entrypoints with divergent QueryClient defaults
This is the single most dangerous thing in the document.
> *"There is also `fe0/src/app/App.tsx`, which configures **different** defaults (`refetchOnWindowFocus: false`, `staleTime: 5 minutes`)... If you ever switch entrypoints, global refetch behavior would change without touching feature code."*
This is a **silent global config switch** waiting to be flipped. A junior engineer cleaning up imports, a refactor that "consolidates the App shell," or a bad rebase can change every cache and refetch policy in the app without a single line of feature code being modified. Worse, the per-query `refetchInterval: 10s` would still appear correct, masking the change. PR review will not catch this — diff readers don't reason about which `App.tsx` `main.tsx` resolves to.
**Action — do this first, before anything else on this list:**
1. Pick one entrypoint. Delete the other in the same PR. Do not leave a TODO.
2. Add a CI check or ESLint rule (`no-restricted-imports`) that forbids importing the other path, in case anyone restores it.
3. The surviving file should set **explicit** `defaultOptions` on the `QueryClient`, even if the values match library defaults. Implicit defaults are a version-upgrade hazard — TanStack Query has changed defaults across major versions before. You do not want a `pnpm up` to silently change refetch semantics.
```ts
const queryClient = new QueryClient({
defaultOptions: {
queries: {
refetchOnWindowFocus: false,
refetchOnReconnect: true,
staleTime: 30_000,
gcTime: 5 * 60_000,
retry: (failureCount, err) => {
if (isAuthError(err)) return false; // never retry 401/403
return failureCount < 2;
},
retryDelay: (attempt) => Math.min(1000 * 2 ** attempt, 8000),
},
mutations: { retry: false },
},
});
```
Then per-query options (`refetchInterval`, `refetchOnWindowFocus: "always"`, etc.) become **deliberate opt-ins**, which is what they should be.
---
## 2. Critical: `refetchOnWindowFocus: "always"` on the admin inbox
`"always"` means *refetch on every focus event regardless of staleness*. Combined with 10s polling, the practical effect is:
- An admin with the tab focused: ~6 requests/min from polling.
- An admin who tabs in/out a few times per minute: polling **plus** every focus refetch — easily 1220 requests/min per tab.
- N admins doing this at 9am Monday: a synchronized burst that the backend has no way to absorb gracefully.
`"always"` is rarely the right answer. It exists for cases where stale data is genuinely dangerous (e.g. a trading screen). An approvals inbox is not that. Use `true` (refetch only if stale) and pair it with a sensible `staleTime`. The user experience difference will be imperceptible; the server load difference will not.
---
## 3. High: polling is not visibility-aware
A background tab still polls (browsers throttle timers but do not stop them, and some don't throttle aggressively when audio/SSE/etc. is active). For a dashboard people leave open all day in a second tab, this is pure waste — and it's waste that scales linearly with headcount.
The document already mentions the fix as a "common direction":
```ts
refetchInterval: (query) =>
document.visibilityState === "visible" ? 10_000 : false,
```
This should not be in the "if you want fewer calls" section. It should be the **default pattern** for every polling query in the app, captured in a small wrapper hook so engineers don't have to remember:
```ts
function useVisibilityAwareInterval(ms: number) {
return (query: Query) =>
document.visibilityState === "visible" ? ms : false;
}
```
---
## 4. High: `isFetching` is leaking into unrelated UI
> *"Stabilizing export used export-only loading state so the button does not follow list refetch"*
The fact that this fix was needed tells me the codebase has a **pattern problem**, not just one buggy button. Anywhere `isLoading` / `isFetching` from a list query is wired into a control that is conceptually independent (export, filters, "new application" button, pagination), the same bug exists — it just hasn't been noticed because dev latency hides it.
**Recommendation:** audit every consumer of the `["applications", ...]` query for `isFetching` / `isLoading` usage. As a rule:
- `isLoading` (true only on first load, no cached data) is *usually* fine to gate UI on.
- `isFetching` (true on every refetch including background polls) should **almost never** disable user-initiated controls. It is for showing a subtle indicator at most.
Consider a lint rule or a code review checklist item: "If you're disabling a button on `isFetching`, justify it in a comment."
This bug profile — "works locally, broken in cloud" — is exactly the kind of thing that erodes trust in the frontend. Slow networks shouldn't reveal latent bugs; they should reveal honest slowness.
---
## 5. High: query key stability for `["applications", filters]`
Not in the document, but worth checking: if `filters` is an object literal constructed in render (`{ status, page, q }`), its reference changes every render and TanStack Query may re-key on every render, producing far more requests than intended. Symptoms: requests fire on keystrokes, on unrelated state changes, etc.
Mitigations:
- Memoize the filters object with `useMemo`, *or*
- Pass primitive fields directly into the key: `["applications", status, page, q]`. This is more verbose but eliminates a whole class of bugs and makes the key serializable for debugging.
If this isn't already enforced, add it to the team's TanStack Query conventions doc.
---
## 6. Medium: inconsistent refresh strategy between admin and council
Admin uses **time-based polling**. Council uses **event-driven invalidation via `reportSyncQuery`**. These are two different mental models for the same surface (an "approved applications list"), maintained in two different files.
This costs the team in three ways:
1. **Cognitive load** — every engineer has to learn both patterns and remember which file uses which.
2. **Bug asymmetry** — a bug fixed in one file will not be fixed in the other. Already visible: the `isFetching` export issue likely needs to be checked in both.
3. **Drift** — over time the two implementations diverge in subtle ways (column order, filter semantics, error handling) and the union of behaviors becomes the de facto contract, which nobody actually documented.
Pick one strategy and apply it everywhere applications are listed. My recommendation:
- **Primary**: invalidation on mutations (`approve`, `reject`, `submit`, `assign`) plus invalidation on a lightweight "version" or "report sync" signal.
- **Secondary**: a slow safety-net poll (60120s, visibility-aware) so a missed invalidation doesn't leave the screen permanently stale.
This gets you near-instant updates after user actions (which is what people actually notice) and removes the 10s drumbeat against the backend.
If true realtime ever becomes a product requirement, SSE is the natural next step — *one* long-lived connection per tab is dramatically cheaper than 6 polls/minute, and it integrates cleanly behind the existing `apiClient`. WebSockets are overkill unless the server needs to push high-frequency updates.
---
## 7. Medium: no documented retry, timeout, or auth-failure policy
The document covers the happy path thoroughly but says nothing about:
- **Timeouts** — Axios has no default timeout. A hung backend will leave requests pending indefinitely, which compounds with 10s polling: stuck requests stack up, the browser hits its per-host concurrency limit, and the UI appears frozen.
- **Retries** — TanStack Query retries 3× by default. For a 10s poll, that means a transient 500 produces 4 requests in quick succession before the next poll, amplifying load exactly when the backend is already struggling.
- **401/403** — does the response interceptor force logout, redirect, or attempt refresh? Silent 401 loops (auth expired → poll → 401 → poll → 401) are a classic source of "the dashboard is broken" reports that turn out to be expired sessions.
- **Network offline** — `refetchOnReconnect: true` is good, but is there UI feedback during the offline window? Otherwise users see "blank inbox" and assume data loss.
Add an explicit section to the architecture doc covering these. They are stability features even when they don't fire.
---
## 8. Medium: dev logging interceptor
Logging every successful response in dev is fine, but two small upgrades:
1. **Sample or summarize** rather than logging full `data` payloads. Large list responses make the console unusable and hide the actual interesting events. `console.debug` (silenced by default in DevTools) is often a better choice than `console.log`.
2. **Guard against accidental production leakage** with a build-time assertion, not just `import.meta.env.DEV`. A misconfigured deploy that ships dev mode to production would log user data into browser consoles — minor but real privacy/compliance concern.
---
## 9. Low: polling intervals are scattered magic numbers
`10s`, `30s`, `30s`, `60s`, `60s` across five files. None are configurable. None share a constant. If product asks "can we double polling intervals across the board for a load test?", the answer requires touching five files and hoping you found them all.
Centralize:
```ts
// fe0/src/shared/config/polling.ts
export const POLL_INTERVALS = {
adminInbox: 10_000,
notificationsCount: 60_000,
notificationsList: 60_000,
adminOverview: 30_000,
aiHealth: 30_000,
} as const;
```
Bonus: makes it trivial to make them env-configurable later, and gives ops a single place to tune under load.
---
## 10. Low: thundering herd / no jitter
All polls fire at fixed intervals from mount time. With many admins logging in around the same wall-clock window (start of shift, morning standup), poll cycles tend to align. Adding ±1020% jitter to intervals smooths backend load with zero UX cost:
```ts
const jitter = (ms: number) => ms * (0.9 + Math.random() * 0.2);
refetchInterval: jitter(10_000)
```
Worth doing once you have more than a handful of concurrent admins.
---
## Suggested order of work
A pragmatic sequencing if the team can only do this incrementally:
1. **This week** — Delete one of the two `App.tsx` files. Set explicit `QueryClient` defaults. (Items 1.)
2. **Next sprint** — Audit `isFetching` consumers; introduce visibility-aware polling helper; replace `"always"` with `true`. (Items 2, 3, 4.)
3. **Following sprint** — Centralize polling intervals; document retry/timeout/auth policy and implement gaps; verify query key stability. (Items 5, 7, 9.)
4. **Quarter horizon** — Converge admin and council on one refresh strategy (invalidation primary, slow poll as safety net). Consider SSE if/when realtime becomes a product ask. (Item 6.)
---
## What's already good
To be fair to the existing design — the document describes several decisions that are correct and worth preserving:
- `placeholderData: (previous) => previous` to avoid table flicker is the right call.
- Invalidating `notifications-unread-count` after mutations rather than waiting for the next poll tick is exactly the pattern more of the app should follow.
- A single shared `apiClient` is a good foundation; the issues above are about what to put on top of it, not about replacing it.
- Documenting local-vs-cloud behavioral differences (latency hiding `isFetching` bugs) is the kind of institutional knowledge that usually only lives in people's heads. Keep doing this.
The bones are fine. The work is mostly about removing implicit behavior, consolidating duplicated patterns, and making polling polite.