208 lines
12 KiB
Markdown
208 lines
12 KiB
Markdown
# Frontend Stability Review — Dashboard Data Refresh
|
||
|
||
Reviewing `fe0-dashboard-data-refresh-architecture.md` from the perspective of stability, predictability, and operability. The current design is **reasonable in spirit** (polling for "good enough" freshness, invalidation where it's natural) but has several **latent footguns** that will eventually cause incidents or flakiness as the user base and feature set grow.
|
||
|
||
The notes below are ordered by stability impact, not by where they appear in the document.
|
||
|
||
---
|
||
|
||
## 1. Critical: two `App.tsx` entrypoints with divergent QueryClient defaults
|
||
|
||
This is the single most dangerous thing in the document.
|
||
|
||
> *"There is also `fe0/src/app/App.tsx`, which configures **different** defaults (`refetchOnWindowFocus: false`, `staleTime: 5 minutes`)... If you ever switch entrypoints, global refetch behavior would change without touching feature code."*
|
||
|
||
This is a **silent global config switch** waiting to be flipped. A junior engineer cleaning up imports, a refactor that "consolidates the App shell," or a bad rebase can change every cache and refetch policy in the app without a single line of feature code being modified. Worse, the per-query `refetchInterval: 10s` would still appear correct, masking the change. PR review will not catch this — diff readers don't reason about which `App.tsx` `main.tsx` resolves to.
|
||
|
||
**Action — do this first, before anything else on this list:**
|
||
|
||
1. Pick one entrypoint. Delete the other in the same PR. Do not leave a TODO.
|
||
2. Add a CI check or ESLint rule (`no-restricted-imports`) that forbids importing the other path, in case anyone restores it.
|
||
3. The surviving file should set **explicit** `defaultOptions` on the `QueryClient`, even if the values match library defaults. Implicit defaults are a version-upgrade hazard — TanStack Query has changed defaults across major versions before. You do not want a `pnpm up` to silently change refetch semantics.
|
||
|
||
```ts
|
||
const queryClient = new QueryClient({
|
||
defaultOptions: {
|
||
queries: {
|
||
refetchOnWindowFocus: false,
|
||
refetchOnReconnect: true,
|
||
staleTime: 30_000,
|
||
gcTime: 5 * 60_000,
|
||
retry: (failureCount, err) => {
|
||
if (isAuthError(err)) return false; // never retry 401/403
|
||
return failureCount < 2;
|
||
},
|
||
retryDelay: (attempt) => Math.min(1000 * 2 ** attempt, 8000),
|
||
},
|
||
mutations: { retry: false },
|
||
},
|
||
});
|
||
```
|
||
|
||
Then per-query options (`refetchInterval`, `refetchOnWindowFocus: "always"`, etc.) become **deliberate opt-ins**, which is what they should be.
|
||
|
||
---
|
||
|
||
## 2. Critical: `refetchOnWindowFocus: "always"` on the admin inbox
|
||
|
||
`"always"` means *refetch on every focus event regardless of staleness*. Combined with 10s polling, the practical effect is:
|
||
|
||
- An admin with the tab focused: ~6 requests/min from polling.
|
||
- An admin who tabs in/out a few times per minute: polling **plus** every focus refetch — easily 12–20 requests/min per tab.
|
||
- N admins doing this at 9am Monday: a synchronized burst that the backend has no way to absorb gracefully.
|
||
|
||
`"always"` is rarely the right answer. It exists for cases where stale data is genuinely dangerous (e.g. a trading screen). An approvals inbox is not that. Use `true` (refetch only if stale) and pair it with a sensible `staleTime`. The user experience difference will be imperceptible; the server load difference will not.
|
||
|
||
---
|
||
|
||
## 3. High: polling is not visibility-aware
|
||
|
||
A background tab still polls (browsers throttle timers but do not stop them, and some don't throttle aggressively when audio/SSE/etc. is active). For a dashboard people leave open all day in a second tab, this is pure waste — and it's waste that scales linearly with headcount.
|
||
|
||
The document already mentions the fix as a "common direction":
|
||
|
||
```ts
|
||
refetchInterval: (query) =>
|
||
document.visibilityState === "visible" ? 10_000 : false,
|
||
```
|
||
|
||
This should not be in the "if you want fewer calls" section. It should be the **default pattern** for every polling query in the app, captured in a small wrapper hook so engineers don't have to remember:
|
||
|
||
```ts
|
||
function useVisibilityAwareInterval(ms: number) {
|
||
return (query: Query) =>
|
||
document.visibilityState === "visible" ? ms : false;
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 4. High: `isFetching` is leaking into unrelated UI
|
||
|
||
> *"Stabilizing export used export-only loading state so the button does not follow list refetch"*
|
||
|
||
The fact that this fix was needed tells me the codebase has a **pattern problem**, not just one buggy button. Anywhere `isLoading` / `isFetching` from a list query is wired into a control that is conceptually independent (export, filters, "new application" button, pagination), the same bug exists — it just hasn't been noticed because dev latency hides it.
|
||
|
||
**Recommendation:** audit every consumer of the `["applications", ...]` query for `isFetching` / `isLoading` usage. As a rule:
|
||
|
||
- `isLoading` (true only on first load, no cached data) is *usually* fine to gate UI on.
|
||
- `isFetching` (true on every refetch including background polls) should **almost never** disable user-initiated controls. It is for showing a subtle indicator at most.
|
||
|
||
Consider a lint rule or a code review checklist item: "If you're disabling a button on `isFetching`, justify it in a comment."
|
||
|
||
This bug profile — "works locally, broken in cloud" — is exactly the kind of thing that erodes trust in the frontend. Slow networks shouldn't reveal latent bugs; they should reveal honest slowness.
|
||
|
||
---
|
||
|
||
## 5. High: query key stability for `["applications", filters]`
|
||
|
||
Not in the document, but worth checking: if `filters` is an object literal constructed in render (`{ status, page, q }`), its reference changes every render and TanStack Query may re-key on every render, producing far more requests than intended. Symptoms: requests fire on keystrokes, on unrelated state changes, etc.
|
||
|
||
Mitigations:
|
||
|
||
- Memoize the filters object with `useMemo`, *or*
|
||
- Pass primitive fields directly into the key: `["applications", status, page, q]`. This is more verbose but eliminates a whole class of bugs and makes the key serializable for debugging.
|
||
|
||
If this isn't already enforced, add it to the team's TanStack Query conventions doc.
|
||
|
||
---
|
||
|
||
## 6. Medium: inconsistent refresh strategy between admin and council
|
||
|
||
Admin uses **time-based polling**. Council uses **event-driven invalidation via `reportSyncQuery`**. These are two different mental models for the same surface (an "approved applications list"), maintained in two different files.
|
||
|
||
This costs the team in three ways:
|
||
|
||
1. **Cognitive load** — every engineer has to learn both patterns and remember which file uses which.
|
||
2. **Bug asymmetry** — a bug fixed in one file will not be fixed in the other. Already visible: the `isFetching` export issue likely needs to be checked in both.
|
||
3. **Drift** — over time the two implementations diverge in subtle ways (column order, filter semantics, error handling) and the union of behaviors becomes the de facto contract, which nobody actually documented.
|
||
|
||
Pick one strategy and apply it everywhere applications are listed. My recommendation:
|
||
|
||
- **Primary**: invalidation on mutations (`approve`, `reject`, `submit`, `assign`) plus invalidation on a lightweight "version" or "report sync" signal.
|
||
- **Secondary**: a slow safety-net poll (60–120s, visibility-aware) so a missed invalidation doesn't leave the screen permanently stale.
|
||
|
||
This gets you near-instant updates after user actions (which is what people actually notice) and removes the 10s drumbeat against the backend.
|
||
|
||
If true realtime ever becomes a product requirement, SSE is the natural next step — *one* long-lived connection per tab is dramatically cheaper than 6 polls/minute, and it integrates cleanly behind the existing `apiClient`. WebSockets are overkill unless the server needs to push high-frequency updates.
|
||
|
||
---
|
||
|
||
## 7. Medium: no documented retry, timeout, or auth-failure policy
|
||
|
||
The document covers the happy path thoroughly but says nothing about:
|
||
|
||
- **Timeouts** — Axios has no default timeout. A hung backend will leave requests pending indefinitely, which compounds with 10s polling: stuck requests stack up, the browser hits its per-host concurrency limit, and the UI appears frozen.
|
||
- **Retries** — TanStack Query retries 3× by default. For a 10s poll, that means a transient 500 produces 4 requests in quick succession before the next poll, amplifying load exactly when the backend is already struggling.
|
||
- **401/403** — does the response interceptor force logout, redirect, or attempt refresh? Silent 401 loops (auth expired → poll → 401 → poll → 401) are a classic source of "the dashboard is broken" reports that turn out to be expired sessions.
|
||
- **Network offline** — `refetchOnReconnect: true` is good, but is there UI feedback during the offline window? Otherwise users see "blank inbox" and assume data loss.
|
||
|
||
Add an explicit section to the architecture doc covering these. They are stability features even when they don't fire.
|
||
|
||
---
|
||
|
||
## 8. Medium: dev logging interceptor
|
||
|
||
Logging every successful response in dev is fine, but two small upgrades:
|
||
|
||
1. **Sample or summarize** rather than logging full `data` payloads. Large list responses make the console unusable and hide the actual interesting events. `console.debug` (silenced by default in DevTools) is often a better choice than `console.log`.
|
||
2. **Guard against accidental production leakage** with a build-time assertion, not just `import.meta.env.DEV`. A misconfigured deploy that ships dev mode to production would log user data into browser consoles — minor but real privacy/compliance concern.
|
||
|
||
---
|
||
|
||
## 9. Low: polling intervals are scattered magic numbers
|
||
|
||
`10s`, `30s`, `30s`, `60s`, `60s` across five files. None are configurable. None share a constant. If product asks "can we double polling intervals across the board for a load test?", the answer requires touching five files and hoping you found them all.
|
||
|
||
Centralize:
|
||
|
||
```ts
|
||
// fe0/src/shared/config/polling.ts
|
||
export const POLL_INTERVALS = {
|
||
adminInbox: 10_000,
|
||
notificationsCount: 60_000,
|
||
notificationsList: 60_000,
|
||
adminOverview: 30_000,
|
||
aiHealth: 30_000,
|
||
} as const;
|
||
```
|
||
|
||
Bonus: makes it trivial to make them env-configurable later, and gives ops a single place to tune under load.
|
||
|
||
---
|
||
|
||
## 10. Low: thundering herd / no jitter
|
||
|
||
All polls fire at fixed intervals from mount time. With many admins logging in around the same wall-clock window (start of shift, morning standup), poll cycles tend to align. Adding ±10–20% jitter to intervals smooths backend load with zero UX cost:
|
||
|
||
```ts
|
||
const jitter = (ms: number) => ms * (0.9 + Math.random() * 0.2);
|
||
refetchInterval: jitter(10_000)
|
||
```
|
||
|
||
Worth doing once you have more than a handful of concurrent admins.
|
||
|
||
---
|
||
|
||
## Suggested order of work
|
||
|
||
A pragmatic sequencing if the team can only do this incrementally:
|
||
|
||
1. **This week** — Delete one of the two `App.tsx` files. Set explicit `QueryClient` defaults. (Items 1.)
|
||
2. **Next sprint** — Audit `isFetching` consumers; introduce visibility-aware polling helper; replace `"always"` with `true`. (Items 2, 3, 4.)
|
||
3. **Following sprint** — Centralize polling intervals; document retry/timeout/auth policy and implement gaps; verify query key stability. (Items 5, 7, 9.)
|
||
4. **Quarter horizon** — Converge admin and council on one refresh strategy (invalidation primary, slow poll as safety net). Consider SSE if/when realtime becomes a product ask. (Item 6.)
|
||
|
||
---
|
||
|
||
## What's already good
|
||
|
||
To be fair to the existing design — the document describes several decisions that are correct and worth preserving:
|
||
|
||
- `placeholderData: (previous) => previous` to avoid table flicker is the right call.
|
||
- Invalidating `notifications-unread-count` after mutations rather than waiting for the next poll tick is exactly the pattern more of the app should follow.
|
||
- A single shared `apiClient` is a good foundation; the issues above are about what to put on top of it, not about replacing it.
|
||
- Documenting local-vs-cloud behavioral differences (latency hiding `isFetching` bugs) is the kind of institutional knowledge that usually only lives in people's heads. Keep doing this.
|
||
|
||
The bones are fine. The work is mostly about removing implicit behavior, consolidating duplicated patterns, and making polling polite.
|