Architecture rework: fix protocol bugs, add origin-side cache, document spec

8-expert review identified 3 bugs in shipped code (Vary header hallucination, fn/function wire key mismatch, max-age=0 defeating PSR), all fixed with tests updated across Python and TypeScript. Added: manifest version field, affects validation, wire format convention, origin-side cache module (HMAC key derivation, MemoryCache + RedisCache backends, reverse index for scoped invalidation, executor integration). 16 known issues from the expert review are documented in cache/KNOWN_ISSUES.md; critical items (user_id not passed, purge race condition, no Redis error handling) are to be fixed in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 22:40:55 -04:00
parent 97237ed1a4
commit b2f990b4e5
20 changed files with 1162 additions and 43 deletions
--- a/ARCHITECTURE-REWORK.md
+++ b/ARCHITECTURE-REWORK.md
@@ -0,0 +1,336 @@
+# Architecture Rework: Cache Keying & Invalidation
+
+**Date:** 2026-04-06
+**Source:** 8 independent expert reviews (Cloudflare, Enterprise Backend, Django, SaaS Founder, Next.js, React Query, Framework Authoring, Serverless Architecture)
+
+---
+
+## Status Key
+
+- [x] Fixed
+- [ ] **BUG** — broken in shipped code
+- [ ] **DESIGN** — must resolve before implementing cache layer
+- [ ] **SPEC** — needs specification before building
+- [ ] **OPS** — operational gap for production readiness
+- [ ] **DX** — developer experience issue
+- [ ] **BUSINESS** — product/pricing concern
+
+---
+
+## Bugs in Shipped Code
+
+### [x] BUG: `Vary: Authorization, Cookie` does nothing on Cloudflare
+**Files:** `executor.py:787`, `dispatch.ts:77`, `edge-compat.test.ts:75-79`
+
+Cloudflare ignores all Vary values except `Accept-Encoding` and `Accept` (images only). This header creates a false sense of security — someone reading the code assumes different Authorization headers produce different cache entries. They do not. The edge-compat tests assert the presence of this non-functional header, reinforcing the illusion.
+
+**Origin:** Claude hallucination in a prior session. Not a design decision.
+
+**Fix:** Remove the header from both Python and TypeScript. Remove test assertions. Add a code comment explaining why Vary is not used and pointing to the HMAC cache key strategy (when implemented).
+
+### [x] BUG: `fn` vs `function` wire protocol key mismatch
+**Files:** `executor.py:619`, `runtime/index.ts:128`
+
+The Django executor reads `body.get("fn")`. The TypeScript runtime sends `{ function: functionName }`. The keys do not match, so the first real call from the new TS runtime to the Django backend would fail.
+
+**Fix:** Align on one key name. Whichever is chosen, document it as stable wire format.
+
+### [x] BUG: `max-age=0` defeats the PSR caching model
+**File:** `executor.py:786`
+
+`Cache-Control: public, max-age=0, stale-while-revalidate=300` means the origin gets hit on every request for background revalidation. This conflicts with PSR's purge-based freshness model, where content should be cached until explicitly invalidated.
+
+**Fix:** For PSR-eligible contexts, emit `Cache-Control: public, s-maxage=31536000`. The CDN caches for up to one year (effectively forever), so purge is the only freshness mechanism. Reserve `max-age=0, stale-while-revalidate` for contexts that opt out of PSR or use time-based revalidation.
+
+---
+
+## Critical Design Flaws
+
+### [ ] DESIGN: HMAC concatenation without delimiter
+**Severity:** Security vulnerability — cache key collisions across different logical entries
+
+`HMAC(secret, context + user_id + params)` without structured separation means `"user" + "12" + "3"` collides with `"user1" + "2" + "3"`.
+
+**Fix:** Use null-byte delimiters: `HMAC(secret, context + "\x00" + user_id + "\x00" + canonical_sorted_params)`. Or HMAC over a JSON-canonical form. Document the canonical form as part of the AFI protocol spec.
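The collision is easy to reproduce. A minimal Python sketch, assuming a placeholder secret and two hypothetical helper names (`naive_key`, `delimited_key`); it only illustrates the delimiter fix described above and is not the shipped implementation:

```python
import hashlib
import hmac

SECRET = b"example-secret"  # placeholder, not a real key


def naive_key(context: str, user_id: str, params: str) -> str:
    """Concatenate fields with no separator (the flawed form)."""
    message = (context + user_id + params).encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def delimited_key(context: str, user_id: str, params: str) -> str:
    """Join fields with a null byte so field boundaries are unambiguous.

    Assumes no field can itself contain a null byte.
    """
    message = "\x00".join((context, user_id, params)).encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


# Without a delimiter both inputs flatten to the same bytes, "user123".
assert naive_key("user", "12", "3") == naive_key("user1", "2", "3")
# With a delimiter the two logical entries get distinct keys.
assert delimited_key("user", "12", "3") != delimited_key("user1", "2", "3")
```

The delimiter form is only safe while inputs are validated to exclude the delimiter itself; HMAC over a canonical JSON form avoids that precondition.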
+
+### [ ] DESIGN: Full context flush on deploy = thundering herd
+**Severity:** Operational — self-inflicted DDoS on every deploy that changes a decorator
+
+Every deploy that changes any `@client` decorator nukes all cached content for affected contexts. For teams deploying 3-5x/day, the Edge cache goes cold 3-5x/day. 100K concurrent users × 10 contexts = 1M origin requests in seconds post-deploy.
+
+**Preferred fix:** Versioned cache keys. Include a manifest content hash in the cache key. Old and new entries coexist during transition. No purge, no thundering herd. 2x cache storage during transition (negligible). Old entries expire naturally via TTL or LRU eviction.
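One way to picture versioned keys, as a sketch only: the manifest shape (`contexts` with per-context `params`), the `/ctx/...` path layout, and the helper names below are all assumptions made for illustration, not the real manifest format or URL scheme.

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """Short content hash of the manifest, independent of key ordering."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def versioned_cache_path(context: str, manifest: dict) -> str:
    """Hypothetical cache path that changes whenever the manifest changes."""
    return f"/ctx/{context}/{manifest_hash(manifest)}"


# Two deploys: the second adds a param to the "products" context.
old = {"version": 1, "contexts": {"products": {"params": ["product_id"]}}}
new = {"version": 1, "contexts": {"products": {"params": ["product_id", "locale"]}}}

# Entries written under the old manifest stay addressable while new ones fill in.
assert versioned_cache_path("products", old) != versioned_cache_path("products", new)
```

Hashing the whole manifest rolls every context to a new key on any change; hashing only the affected context's entry would narrow that, at the cost of a slightly more involved key.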
+
+**Alternative fix:** Granular per-context diffing. Only flush contexts whose function signatures, params, or auth requirements actually changed. The manifest already contains per-context param lists to support this.
+
+### [ ] DESIGN: Purge token in customer Workers exposes shared cache
+**Severity:** Security — one compromised customer can purge all customers' cache
+
+Every customer Edge Worker deployment carries a Cloudflare API token with `Zone:Cache Purge` permission for `render.mizan.cloud`.
+
+**Fix:** Build a purge proxy Worker on the Mizan zone. Validates purge requests (HMAC signature + customer-scoped URL pattern matching) before forwarding to the Cloudflare purge API. No customer Worker ever holds a direct zone API token.
+
+### [ ] DESIGN: Permission key race condition
+**Severity:** Data correctness — stale content served for duration of JWT lifetime
+
+User permission changes (e.g., tier upgrade) don't take effect until JWT expires because: (1) cache key uses only `user_id`, not tier, and (2) permission key comparison uses the JWT-derived value, which is stale until refresh.
+
+**Options:**
+- (a) Make permission-relevant attributes part of the cache key (increases cardinality).
+- (b) Accept the JWT-lifetime staleness window, document as known constraint.
+- (c) Add short-TTL revalidation for permission-sensitive contexts.
+
+**Decision needed before implementation.**
+
+### [ ] DESIGN: No `waitUntil()` in purge/warm flow
+**Severity:** Latency — client blocks on cache management operations
+
+If a mutation invalidates N URLs, the Edge Worker must complete all purge API calls before responding. Each call is 50-200ms.
+
+**Fix:** Return mutation response immediately. Fire all purge and warming fetches inside `waitUntil()`. Same Worker invocation, no extra billing, client doesn't block.
+
+---
+
+## Missing Specifications
+
+### [ ] SPEC: Secret rotation protocol
+No rotation mechanism, no dual-secret acceptance window, no compromise recovery procedure. Rotating the single secret invalidates every HMAC globally.
+
+**Need:** Key derivation hierarchy (master secret -> per-context derived keys). Rotation at context level. Dual-secret acceptance window during rotation. Document compromise recovery procedure.
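A rough sketch of what a derivation hierarchy with a dual-secret window could look like. The function names and the `ctx:` label are hypothetical, and plain HMAC is used as the derivation step for brevity; a production design might prefer HKDF (RFC 5869).

```python
import hashlib
import hmac
from typing import Optional


def derive_context_key(master_secret: bytes, context: str) -> bytes:
    """Derive a per-context key so a leaked context key does not expose others."""
    return hmac.new(master_secret, b"ctx:" + context.encode("utf-8"), hashlib.sha256).digest()


def sign(key: bytes, message: bytes) -> str:
    """HMAC-SHA256 signature as a hex string."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_with_rotation(message: bytes, signature: str,
                         current: bytes, previous: Optional[bytes] = None) -> bool:
    """Accept the current key, or the previous key while the rotation window is open."""
    candidates = [current] if previous is None else [current, previous]
    return any(hmac.compare_digest(sign(key, message), signature) for key in candidates)


old_key = derive_context_key(b"master-v1", "products")
new_key = derive_context_key(b"master-v2", "products")
signature = sign(old_key, b"payload")

# During the window the old signature still verifies; afterwards it does not.
assert verify_with_rotation(b"payload", signature, new_key, previous=old_key)
assert not verify_with_rotation(b"payload", signature, new_key)
```

Closing the window is then just a matter of dropping `previous`, which bounds how long a compromised key remains usable.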
+
+### [ ] SPEC: GDPR right-to-erasure for cached content
+HMAC keys make targeted per-user cache purge difficult: purging one user requires reconstructing every possible HMAC for every context × param combination.
+
+**Need:** `purge_by_user(user_id)` operation that iterates manifest contexts to reconstruct all HMACs. Tractable if context count is bounded. Audit trail for compliance proof.
+
+### [ ] SPEC: Cache adapter conformance requirements
+
+Every Mizan backend adapter (Python, TypeScript, and future: PHP, C#, Go, etc.) must
+implement the origin-side cache protocol. This is NOT a binary ABI or pluggable backend
+interface. It is a set of operations each adapter implements in its own language, backed
+by Redis. Conformance is verified by a shared test suite (same model as the existing
+edge-compat tests that prove Python and TypeScript produce identical protocol output).
+
+**Storage:** Redis. Not pluggable. Not in-memory-only. Redis handles persistence,
+cross-worker sharing, and crash recovery. The adapter is a thin protocol layer over
+Redis commands.
+
+**Required operations:**
+
+```
+cache_get(context: string, params: dict, user_id: string | null, rev: int) -> CachedResponse | null
+```
+Derives HMAC key from inputs using JSON-canonical form, fetches from Redis.
+
+```
+cache_put(context: string, params: dict, user_id: string | null, rev: int, response: CachedResponse) -> void
+```
+Derives HMAC key, stores response in Redis. Also maintains a reverse index
+(context + params -> HMAC keys) so `cache_purge` can find entries to delete.
+
+```
+cache_purge(context: string, params: dict | null) -> int
+```
+Looks up the reverse index for matching entries, deletes them from Redis.
+Returns number of entries purged. When `params` is null, purges entire context.
+
+```
+cache_purge_user(user_id: string) -> int
+```
+Iterates all contexts in the manifest, reconstructs HMAC keys for the given
+user_id across all param combinations in the reverse index, deletes them.
+Required for GDPR right-to-erasure.
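To make the reverse index concrete, here is an in-memory sketch of `cache_put`, `cache_get`, and `cache_purge`. Plain dicts stand in for Redis purely so the example runs on its own (the spec itself requires Redis), the secret is a placeholder, and the key derivation follows the form given in the next subsection with serialization details that are assumptions.

```python
import hashlib
import hmac
import json
from collections import defaultdict
from typing import Optional

SECRET = b"example-secret"  # placeholder, not a real key

store = {}  # HMAC key -> cached response (stand-in for Redis string keys)
# context -> canonical params -> set of HMAC keys (stand-in for Redis sets)
reverse_index = defaultdict(lambda: defaultdict(set))


def canonical(value) -> str:
    """Deterministic JSON: sorted keys, no insignificant whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def derive_key(context, params, user_id, rev) -> str:
    payload = {"c": context, "p": params, "r": rev}
    if user_id is not None:
        payload["u"] = user_id  # omitted for public content
    return hmac.new(SECRET, canonical(payload).encode("utf-8"), hashlib.sha256).hexdigest()


def cache_put(context, params, user_id, rev, response) -> None:
    key = derive_key(context, params, user_id, rev)
    store[key] = response
    # Record the key under (context, params) so purge can find it later.
    reverse_index[context][canonical(params)].add(key)


def cache_get(context, params, user_id, rev):
    return store.get(derive_key(context, params, user_id, rev))


def cache_purge(context, params: Optional[dict]) -> int:
    """Delete entries for one param set, or the whole context when params is None."""
    by_params = reverse_index.get(context)
    if by_params is None:
        return 0
    targets = list(by_params) if params is None else [canonical(params)]
    purged = 0
    for target in targets:
        for key in by_params.pop(target, ()):
            if key in store:
                del store[key]
                purged += 1
    return purged


cache_put("products", {"product_id": 42}, "u1", 1, "response-a")
cache_put("products", {"product_id": 42}, "u2", 1, "response-b")
cache_put("products", {"product_id": 7}, None, 1, "response-c")
```

With that state, `cache_purge("products", {"product_id": 42})` removes both user-scoped entries for product 42 and leaves the public entry for product 7 in place.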
+
+**HMAC key derivation (must be identical across all adapters):**
+
+```
+key = HMAC-SHA256(secret, JSON.stringify({
+    "c": context,
+    "p": sorted_params,
+    "r": rev,
+    "u": user_id    // omitted for public content
+}, sort_keys=True))
+```
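A possible Python rendering of the derivation above. The serialization details are assumptions the pseudocode leaves open: compact separators and `ensure_ascii=False` are chosen here because JavaScript's `JSON.stringify` emits no whitespace and does not escape non-ASCII characters, while Python's `json.dumps` defaults do both. The hex output encoding and the function name are likewise illustrative.

```python
import hashlib
import hmac
import json
from typing import Optional


def derive_cache_key(secret: bytes, context: str, params: dict, rev: int,
                     user_id: Optional[str] = None) -> str:
    """HMAC-SHA256 over a canonical JSON payload with short field names."""
    payload = {"c": context, "p": params, "r": rev}
    if user_id is not None:
        payload["u"] = user_id  # omitted entirely for public content
    canonical = json.dumps(
        payload,
        sort_keys=True,         # recursive, so nested params are ordered too
        separators=(",", ":"),  # assumption: no insignificant whitespace
        ensure_ascii=False,     # assumption: raw UTF-8 rather than \u escapes
    )
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


secret = b"example-secret"  # placeholder
key_a = derive_cache_key(secret, "products", {"a": 1, "b": 2}, 3, "u1")
key_b = derive_cache_key(secret, "products", {"b": 2, "a": 1}, 3, "u1")
assert key_a == key_b  # param order does not affect the key
```

A TypeScript adapter would have to sort object keys itself, since `JSON.stringify` preserves insertion order, and number formatting (for example `1.0` versus `1`) would also need to be pinned down by the conformance suite.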
+
+**MWT validation (must be identical across all adapters):**
+
+Validate the `X-Mizan-Token` header as a standard JWT (HMAC-SHA256). Extract `sub`
+(user_id) for cache key derivation, check `exp` for token freshness.
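For reference, HS256 validation needs nothing beyond the standard library. The sketch below assumes the token is a three-part compact JWT; `issue_mwt` exists only so the example can be exercised, and all function names are hypothetical rather than Mizan's API.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url_decode(segment: str) -> bytes:
    """Decode base64url, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def issue_mwt(claims: dict, secret: bytes) -> str:
    """Build an HS256 token; included only to drive the example."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    payload = b64url_encode(json.dumps(claims).encode("utf-8"))
    signature = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(signature)}"


def validate_mwt(token: str, secret: bytes, now: Optional[float] = None) -> Optional[str]:
    """Return `sub` for a valid, unexpired HS256 token, otherwise None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
        signature = b64url_decode(signature_b64)
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    except ValueError:  # wrong segment count, bad base64, bad JSON, non-ASCII
        return None
    # Pin the algorithm: never let the token choose "none" or another scheme.
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return None
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return None
    # Claims are trusted only from this point, after the signature check.
    if not isinstance(claims, dict):
        return None
    current = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= current:
        return None
    sub = claims.get("sub")
    return sub if isinstance(sub, str) else None


token = issue_mwt({"sub": "u1", "exp": 2000}, b"example-secret")
assert validate_mwt(token, b"example-secret", now=1000) == "u1"
```

Passing `now` explicitly keeps expiry checks deterministic, which helps when the same vectors are replayed across adapters in a conformance suite.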
+
+**Conformance test suite:**
+
+Each adapter must pass a shared set of protocol conformance tests verifying:
+- Identical HMAC output for identical inputs (cross-language determinism)
+- Identical MWT validation behavior
+- Correct purge semantics (scoped and broad)
+- Correct reverse index maintenance
+- Correct `cache_purge_user` behavior
+
+### [ ] SPEC: Client-side cache lifecycle
+Runtime is ~95 lines. No `staleTime`, `isFetching`/`isLoading` distinction, garbage collection, retry logic, optimistic updates, `refetchOnWindowFocus`.
+
+**Minimum viable:**
+- Loading/fetching state distinction (don't throw on missing data)
+- Error return shape: `{ data, isLoading, isFetching, error }`
+- `refetchOnWindowFocus` as default
+- Mutation lifecycle with rollback support for optimistic updates
+- Garbage collection for unmounted context data (configurable delay)
+
+### [x] SPEC: Per-context cache policy
+
+`cache=` on `@client` accepts three forms:
+
+- **Omitted (default):** Invalidation-based. Emits `s-maxage=31536000`. Cache forever,
+  purge on mutation. Use when your backend is the source of truth.
+- **`cache=60` (integer seconds):** TTL-based. Emits `s-maxage=60`. Accept bounded
+  staleness. Use for unobservable mutations — when your backend mirrors external data
+  (third-party APIs, aggregations, upstream services) and cannot know when it changes.
+- **`cache=False`:** Never cache. Emits `Cache-Control: no-store`. Use for
+  non-deterministic functions (`random()`, `datetime.now()`).
+
+This is the escape hatch for data the backend doesn't own the mutation scope for.
+Positioned in docs as: "Are you the source of truth, or a mirror? Source of truth →
+use `affects=`. Mirror → use `cache=N`."
+
+The `cache=int` value flows into the edge manifest per-context, so the Edge Worker
+and CDN respect it without special handling (`s-maxage` is standard CDN behavior).
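The three forms map to headers roughly as follows. This is a sketch with a hypothetical function name; the `public` directive on the TTL form mirrors the invalidation-based header and is an assumption, as is rejecting `cache=True` and non-positive integers.

```python
from typing import Union

ONE_YEAR_SECONDS = 31536000


def cache_control_for(cache: Union[int, bool, None] = None) -> str:
    """Translate the `cache=` argument of `@client` into a Cache-Control value."""
    # bool is a subclass of int in Python, so test for False before the int branch.
    if cache is False:
        return "no-store"
    if cache is None:
        return f"public, s-maxage={ONE_YEAR_SECONDS}"
    if isinstance(cache, int) and not isinstance(cache, bool) and cache > 0:
        return f"public, s-maxage={cache}"
    raise ValueError(f"unsupported cache policy: {cache!r}")


assert cache_control_for() == "public, s-maxage=31536000"
assert cache_control_for(60) == "public, s-maxage=60"
assert cache_control_for(False) == "no-store"
```

The explicit `bool` check matters because `False == 0` in Python; without it, `cache=False` and `cache=0` would be indistinguishable by value.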
+
+### [ ] SPEC: Extension points for cache/invalidation lifecycle
+Zero hooks for third-party code. No pre-invalidation hook, no custom cache key function, no invalidation transport plugin.
+
+**Minimum viable:**
+- `CacheBackend` protocol (third parties implement custom backends)
+- `on_invalidate(context, params)` event hook (monitoring/debugging)
+- Document these as public API from day one
+
+### [x] SPEC: Manifest versioning
+The manifest has no version field. When the schema evolves, Edge Workers can't distinguish v1 from v2 format.
+
+**Fix:** Add `"version": 1` to manifest root before anyone deploys it. Edge Workers check version and fail fast on unknown versions.
+
+### [x] SPEC: Wire format convention
+Python emits `snake_case` params (`user_id`). TypeScript conventionally uses `camelCase` (`userId`). The `USER_SCOPED_PARAMS` set in `manifest.ts` contains both conventions. Invalidation headers from Python won't match TypeScript keys expecting `camelCase`.
+
+**Fix:** Document `snake_case` as the wire format convention. TypeScript adapters convert at the boundary.
+
+---
+
+## Operational Gaps
+
+### [ ] OPS: No cache observability
+No hit/miss metrics, no cache key debugging, no invalidation audit trail, no manifest version tracking.
+
+**Need:** `X-Mizan-Cache-Status` response header (HIT/MISS/BYPASS/STALE/PURGED/DYNAMIC). Structured logging in Edge Worker. Console-level invalidation event log for devtools.
+
+### [ ] OPS: Purge rate limits at scale
+Cloudflare zone purge API: 500 req/10s (free/pro), 2500/10s (Enterprise). Bulk operations can exceed this.
+
+**Need:** Batch purge requests (up to 30 URLs per API call). Document rate limits. Design Cache Tags upgrade path for Enterprise.
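Batching itself is small. A sketch, assuming the 30-URL-per-call figure quoted above (worth checking against current Cloudflare limits for the plan in use); the helper name and example URLs are hypothetical, and the actual purge API call is left out:

```python
from typing import Iterable, Iterator, List

MAX_URLS_PER_PURGE = 30  # per-call limit quoted above


def batch_purge_urls(urls: Iterable[str], size: int = MAX_URLS_PER_PURGE) -> Iterator[List[str]]:
    """Yield de-duplicated URLs, in first-seen order, in chunks of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for url in dict.fromkeys(urls):  # drops duplicates, keeps order
        batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


urls = [f"https://example.com/ctx/products/{i}" for i in range(65)]
urls.append(urls[0])  # a duplicate that should not cost an extra slot
batches = list(batch_purge_urls(urls))
assert [len(b) for b in batches] == [30, 30, 5]
```

Each yielded batch becomes one purge API call, so 65 URLs cost three requests against the rate limit instead of 65.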
+
+### [ ] OPS: Purge-then-warm race condition
+Warming fetch arriving at a PoP before purge propagates gets a cache HIT on stale data.
+
+**Fix:** Use `Cache-Control: no-cache` or `cf: { cacheTtl: 0 }` on warming requests to force revalidation.
+
+### [ ] OPS: PSR warming only warms one colo
+Warming fetch from a Worker runs in a single datacenter. Only warms that colo's cache (+ upper-tier if Tiered Cache active). Does not warm all 300+ PoPs.
+
+**Document:** PSR warming reduces origin load by warming the shield tier. First request from each edge PoP is still a cache miss to the shield. Not zero-latency for all users.
+
+---
+
+## Django Integration Concerns
+
+### [ ] DX: `@client` breaks decorator stacking
+`@client` returns a class (`FunctionWrapper`), not a callable. `@login_required`, `@csrf_exempt`, `@cache_page` cannot compose with it.
+
+**Options:**
+- (a) Make `@client` return a `functools.wraps`-compatible callable that also carries metadata (Django Ninja approach).
+- (b) Document incompatibility prominently. Provide Mizan-native equivalents. State that `@client` replaces `@login_required` (via `auth=`), `@cache_page` (via context caching), etc.
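Option (a) could look something like the sketch below. This `client` is a hypothetical stand-in, not Mizan's decorator; the `mizan_meta` attribute name is invented for illustration, and `log_calls` plays the role of any ordinary decorator such as `@login_required`.

```python
import functools
from typing import Callable, Optional


def client(func: Optional[Callable] = None, *, affects: tuple = (), auth: Optional[str] = None):
    """Stand-in for `@client` that returns a plain function carrying metadata."""
    def decorate(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        wrapper.mizan_meta = {"affects": tuple(affects), "auth": auth}
        return wrapper
    # Support both bare `@client` and parameterized `@client(...)`.
    return decorate(func) if func is not None else decorate


def log_calls(fn: Callable) -> Callable:
    """An ordinary function decorator that counts invocations."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@log_calls
@client(affects=("products",))
def update_price(product_id: int, price: float) -> dict:
    return {"product_id": product_id, "price": price}


assert update_price(1, 9.5) == {"product_id": 1, "price": 9.5}
assert update_price.mizan_meta["affects"] == ("products",)
```

Because `functools.wraps` copies `__dict__` as well as `__name__`, the metadata survives further stacking, which is what lets outer decorators compose without knowing about Mizan.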
+
+### [ ] DX: `JWTUser` too thin for complex auth checks
+Works for `is_staff`/`is_superuser`. Fails for allauth relations, DRF permissions, `request.user.groups.all()`, user model relations.
+
+**Need:** Document limitation. Provide `get_full_user()` helper that does DB lookup when needed. Or optionally expand JWT claims.
+
+### [ ] DX: Transaction safety of invalidation
+Invalidation in response body is optimistic — fires before `ATOMIC_REQUESTS` commits. If transaction rolls back, invalidation was already sent.
+
+**Need:** Document as known behavior. Recommend `transaction.on_commit()` for critical paths. When building `mizan-cache`, consider two-phase: mark for invalidation during request, execute purge on commit.
+
+### [ ] DX: Admin/ORM writes invisible to invalidation
+Only `@client(affects=...)` functions trigger invalidation. Django admin saves, management commands, direct ORM writes are invisible.
+
+**Need:** Document clearly. Provide manual purge API: `purge_context('products', params={'product_id': 42})`.
+
+### [ ] DX: Cache adapter integration for Django
+The Python cache adapter is a thin protocol layer over Redis (not a Django cache backend).
+Django developers call `mizan.cache.get(context, params, user_id, rev)` directly.
+Provide a `mizan.cache.clear()` for test fixture teardown. Document that this is
+separate from Django's `CACHES` framework — Mizan owns its own cache protocol.
+
+---
+
+## Business/Product Concerns
+
+### [ ] BUSINESS: Free tier + Cloudflare free = 80% of paid product
+Existing `Cache-Control` headers on context fetches are CDN-ready. A developer puts Cloudflare free tier in front and gets stale-while-revalidate at 300+ PoPs for $0. The 20% gap (user-scoped HMAC keying, PSR, render Workers) doesn't exist in code yet.
+
+### [ ] BUSINESS: $20/seat wrong pricing model
+"Seat" is undefined for a framework. Usage-based ($0.50/100K requests with generous free tier) or flat-per-project ($29/month) converts better for infrastructure products.
+
+### [ ] BUSINESS: Ship framework first, cloud second
+The framework has working code. The cloud product has zero. Risk: building both depletes runway before either has adoption. Recommended: get 500 devs using `@client` + `affects=` on their VPS first, then build the Edge product for the gap they actually hit.
+
+---
+
+## Validated Design Decisions (No Changes Needed)
+
+These were confirmed sound by multiple reviewers:
+
+- **Declarative invalidation graph** (`affects=` + auto-scoping) — unanimously praised as genuinely novel
+- **Two-zone `fetch()` pattern** — correct architecture for global CDN caching from Workers
+- **Cross-language protocol** — Python/TS with identical manifests, proven by parallel test suites
+- **Manifest-driven URL resolution** — eliminates need for cache inventory state (no KV/DOs needed)
+- **Typed `ReactContext` for `affects` targeting** — prevents the string-fragility concern (string form is escape hatch only)
+- **Replacing React Query** — correct decision given context bundling + transport transparency goals
+- **Cost model** — ~$5/month Cloudflare at 10K DAU, ~$20/month at 10x. Origin infra is the real cost.
+- **Origin-side Redis cache as L2** — viable fallback behind CDN, same protocol as Edge
+
+---
+
+## Unique Expert Insights
+
+**Cloudflare Expert:**
+- Add `cf.cacheTtl` and `cf.cacheEverything` to all `fetch()` subrequests — don't rely solely on response headers
+- Consider Cache Tags (`Cache-Tag` response header) from day one for Enterprise upgrade path
+- Consider Durable Objects for per-user cache coordination as alternative to HMAC-in-URL
+
+**Enterprise Architect:**
+- Key derivation hierarchy: master secret derives per-context keys. Compromise of one context doesn't affect others.
+- `X-Mizan-Cache-Version` header on every response for self-healing on version mismatch
+
+**Serverless Expert:**
+- Use `renderToReadableStream` (streaming SSR) in Render Worker, not `renderToString`. Memory and CPU budget are tight (128MB / 50ms).
+- Cache manifest in `globalThis` in Edge Worker — do not read from KV per-request
+- AWS portability: CloudFront invalidation pricing is 10-100x more expensive. Design TTL-based alternative.
+
+**Next.js Expert:**
+- PSR doesn't address cold-start pages (initial population before any mutation) or render fan-out (10K parameterized variants re-rendering on one mutation)
+- No streaming/Suspense/progressive delivery — entire context response blocks on slowest function
+
+**React Query Expert:**
+- Wire existing WebSocket push infrastructure to emit invalidation events for named contexts
+- Generated hooks should return `{ data, isLoading, isFetching, error }`, not throw on missing data
+
+**Django Architect:**
+- DRF `TokenAuthentication` collision: both read the `Authorization` header. DRF's default keyword is `Token`, but where it is reconfigured to `Bearer`, Mizan's JWT decode rejects DRF tokens with a 401
+- `mizan-cache` as Django cache backend, not separate system
+
+**Framework Authoring:**
+- Define `CacheBackend` protocol before implementing — the abstraction is cheaper to get right before users exist
+- Add `"version": 1` to manifest root now — adding it later is harder
+- `@client` is approaching parameter overload — if `cache` becomes extensible, use `CachePolicy` object pattern, not more kwargs
+
+**SaaS Founder:**
+- The debugging UX for HMAC cache is a black box — invest in an invalidation graph debugging UI as a paid feature
+- The `affects=` auto-refetch is the "wow" moment — optimize time-to-that-moment in onboarding