mizan/ARCHITECTURE-REWORK.md
Ryth Azhur b2f990b4e5 Architecture rework: fix protocol bugs, add origin-side cache, document spec
8-expert review identified 3 bugs in shipped code (Vary header hallucination,
fn/function wire key mismatch, max-age=0 defeating PSR) — all fixed with
tests updated across Python and TypeScript.

Added: manifest version field, affects validation, wire format convention,
origin-side cache module (HMAC key derivation, MemoryCache + RedisCache
backends, reverse index for scoped invalidation, executor integration).

16 known issues documented in cache/KNOWN_ISSUES.md from expert review —
critical items (user_id not passed, purge race condition, no Redis error
handling) to be fixed in follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 22:40:55 -04:00

Architecture Rework: Cache Keying & Invalidation

Date: 2026-04-06

Source: 8 independent expert reviews (Cloudflare, Enterprise Backend, Django, SaaS Founder, Next.js, React Query, Framework Authoring, Serverless Architecture)


Status Key

  • Fixed
  • BUG — broken in shipped code
  • DESIGN — must resolve before implementing cache layer
  • SPEC — needs specification before building
  • OPS — operational gap for production readiness
  • DX — developer experience issue
  • BUSINESS — product/pricing concern

Bugs in Shipped Code

[x] BUG: Vary header has no effect on Cloudflare

Files: executor.py:787, dispatch.ts:77, edge-compat.test.ts:75-79

Cloudflare ignores all Vary values except Accept-Encoding and Accept (images only). The Vary header emitted here therefore creates a false sense of security: someone reading the code assumes that different Authorization headers produce different cache entries. They do not. The edge-compat tests assert the presence of this non-functional header, reinforcing the illusion.

Origin: Claude hallucination in a prior session. Not a design decision.

Fix: Remove the header from both Python and TypeScript. Remove test assertions. Add a code comment explaining why Vary is not used and pointing to the HMAC cache key strategy (when implemented).

[x] BUG: fn vs function wire protocol key mismatch

Files: executor.py:619, runtime/index.ts:128

The Django executor reads body.get("fn"). The TypeScript runtime sends { function: functionName }. The keys do not match, so the first real call from the new TS runtime to the Django backend would fail.

Fix: Align on one key name. Whichever is chosen, document it as stable wire format.

[x] BUG: max-age=0 defeats the PSR caching model

File: executor.py:786

Cache-Control: public, max-age=0, stale-while-revalidate=300 means the origin gets hit on every request for background revalidation. This conflicts with PSR's purge-based freshness model, where content should be cached until explicitly invalidated.

Fix: For PSR-eligible contexts, emit Cache-Control: public, s-maxage=31536000. The CDN may then hold the response for up to a year, so in practice purge is the only freshness mechanism. Reserve max-age=0, stale-while-revalidate for contexts that opt out of PSR or use time-based revalidation.


Critical Design Flaws

[ ] DESIGN: HMAC concatenation without delimiter

Severity: Security vulnerability — cache key collisions across different logical entries

HMAC(secret, context + user_id + params) without structured separation means "user" + "12" + "3" collides with "user1" + "2" + "3".

Fix: Use null-byte delimiters: HMAC(secret, context + "\x00" + user_id + "\x00" + canonical_sorted_params). Or HMAC over a JSON-canonical form. Document the canonical form as part of the AFI protocol spec.
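A minimal sketch of the collision and the delimited fix, assuming every field is a string that never contains a null byte itself. The names `naive_key`, `delimited_key`, and `SECRET` are hypothetical and exist only for illustration.

```python
import hashlib
import hmac

SECRET = b"example-secret"  # hypothetical; a real secret comes from configuration


def naive_key(context: str, user_id: str, params: str) -> str:
    """Ambiguous: field boundaries are lost once the parts are concatenated."""
    message = (context + user_id + params).encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def delimited_key(context: str, user_id: str, params: str) -> str:
    """Null-byte delimiters preserve field boundaries, provided no field contains one."""
    parts = (context, user_id, params)
    if any("\x00" in part for part in parts):
        raise ValueError("fields must not contain the delimiter")
    message = "\x00".join(parts).encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()
```

With these definitions, `naive_key("user", "12", "3")` and `naive_key("user1", "2", "3")` both hash the string `user123`, while the delimited variant hashes two different messages.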

[ ] DESIGN: Full context flush on deploy = thundering herd

Severity: Operational — self-inflicted DDoS on every deploy that changes a decorator

Every deploy that changes any @client decorator nukes all cached content for the affected contexts. For a team deploying 3-5x/day, the Edge cache goes cold 3-5x/day. With 100K concurrent users each refetching 10 contexts, that is up to 1M origin requests within seconds of a deploy.

Preferred fix: Versioned cache keys. Include a manifest content hash in the cache key. Old and new entries coexist during transition. No purge, no thundering herd. 2x cache storage during transition (negligible). Old entries expire naturally via TTL or LRU eviction.
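One way the versioned-key idea could look, as a sketch only. The manifest shape, the 12-character hash prefix, and the function names `manifest_hash` and `versioned_cache_key` are assumptions, not part of the existing code.

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """Content hash of the manifest; it changes whenever any manifest field changes."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


def versioned_cache_key(manifest: dict, context: str, base_key: str) -> str:
    """Prefix the entry key with the manifest hash so old and new entries coexist."""
    return f"{manifest_hash(manifest)}:{context}:{base_key}"
```

Because the hash is computed over a key-sorted serialization, two manifests with the same content but different key order map to the same prefix, and a deploy that does not change the manifest keeps the cache warm.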

Alternative fix: Granular per-context diffing. Only flush contexts whose function signatures, params, or auth requirements actually changed. The manifest already contains per-context param lists to support this.

[ ] DESIGN: Purge token in customer Workers exposes shared cache

Severity: Security — one compromised customer can purge all customers' cache

Every customer Edge Worker deployment carries a Cloudflare API token with Zone:Cache Purge permission for render.mizan.cloud.

Fix: Build a purge proxy Worker on the Mizan zone. Validates purge requests (HMAC signature + customer-scoped URL pattern matching) before forwarding to the Cloudflare purge API. No customer Worker ever holds a direct zone API token.
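The validation step of such a proxy might look like the sketch below. It assumes a per-customer shared secret and a URL layout with one path prefix per customer under render.mizan.cloud; both are hypothetical, as are the names `sign_purge` and `is_purge_allowed`. It is written in Python for consistency with the other examples, although the proxy itself would run as a Worker.

```python
import hashlib
import hmac
from urllib.parse import urlsplit

ZONE_HOST = "render.mizan.cloud"


def sign_purge(customer_secret: bytes, customer_id: str, url: str) -> str:
    """Signature a customer Worker attaches to a purge request (assumed scheme)."""
    message = f"{customer_id}\x00{url}".encode()
    return hmac.new(customer_secret, message, hashlib.sha256).hexdigest()


def is_purge_allowed(customer_secret: bytes, customer_id: str, url: str, signature: str) -> bool:
    """Accept a purge only if it is correctly signed and stays inside the customer's scope."""
    expected = sign_purge(customer_secret, customer_id, url)
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False
    parts = urlsplit(url)
    return (
        parts.scheme == "https"
        and parts.netloc == ZONE_HOST
        # Assumed layout: https://render.mizan.cloud/<customer_id>/...
        and parts.path.startswith(f"/{customer_id}/")
        # Reject path traversal out of the customer prefix.
        and ".." not in parts.path.split("/")
    )
```

Only requests that pass this check would be forwarded to the Cloudflare purge API with the zone token, which stays inside the proxy.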

[ ] DESIGN: Permission key race condition

Severity: Data correctness — stale content served for duration of JWT lifetime

User permission changes (e.g., tier upgrade) don't take effect until JWT expires because: (1) cache key uses only user_id, not tier, and (2) permission key comparison uses the JWT-derived value, which is stale until refresh.

Options:

  • (a) Make permission-relevant attributes part of the cache key (increases cardinality).
  • (b) Accept the JWT-lifetime staleness window, document as known constraint.
  • (c) Add short-TTL revalidation for permission-sensitive contexts.

Decision needed before implementation.

[ ] DESIGN: No waitUntil() in purge/warm flow

Severity: Latency — client blocks on cache management operations

If a mutation invalidates N URLs, the Edge Worker must complete all purge API calls before responding. Each call is 50-200ms.

Fix: Return mutation response immediately. Fire all purge and warming fetches inside waitUntil(). Same Worker invocation, no extra billing, client doesn't block.


Missing Specifications

[ ] SPEC: Secret rotation protocol

No rotation mechanism, no dual-secret acceptance window, no compromise recovery procedure. Rotating the single secret invalidates every HMAC globally.

Need: Key derivation hierarchy (master secret -> per-context derived keys). Rotation at context level. Dual-secret acceptance window during rotation. Document compromise recovery procedure.
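A rough sketch of the hierarchy and the dual-secret window, assuming HMAC-SHA256 is used as a simple key-derivation function and that each context carries a rotation counter. The `generation` parameter and all function names are assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Optional


def derive_context_key(master_secret: bytes, context: str, generation: int = 0) -> bytes:
    """Derive a per-context key; bumping `generation` rotates only that context."""
    label = f"mizan-cache\x00{context}\x00{generation}".encode()
    return hmac.new(master_secret, label, hashlib.sha256).digest()


def sign(key: bytes, message: bytes) -> str:
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_with_rotation(message: bytes, signature: str, current_key: bytes,
                         previous_key: Optional[bytes] = None) -> bool:
    """Accept the current key, and the previous one while the rotation window is open."""
    keys = [current_key] if previous_key is None else [current_key, previous_key]
    return any(
        hmac.compare_digest(sign(key, message).encode(), signature.encode())
        for key in keys
    )
```

Closing the window is then a matter of no longer passing `previous_key`. Recovery from a compromised master secret still requires replacing the master and re-deriving every context key.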

[ ] SPEC: GDPR right-to-erasure for cached content

HMAC keys make targeted per-user cache purge difficult: every possible HMAC for every context × param combination must be reconstructed for the given user.

Need: purge_by_user(user_id) operation that iterates manifest contexts to reconstruct all HMACs. Tractable if context count is bounded. Audit trail for compliance proof.

[ ] SPEC: Cache adapter conformance requirements

Every Mizan backend adapter (Python, TypeScript, and future: PHP, C#, Go, etc.) must implement the origin-side cache protocol. This is NOT a binary ABI or pluggable backend interface. It is a set of operations each adapter implements in its own language, backed by Redis. Conformance is verified by a shared test suite (same model as the existing edge-compat tests that prove Python and TypeScript produce identical protocol output).

Storage: Redis. Not pluggable. Not in-memory-only. Redis handles persistence, cross-worker sharing, and crash recovery. The adapter is a thin protocol layer over Redis commands.

Required operations:

cache_get(context: string, params: dict, user_id: string | null, rev: int) -> CachedResponse | null

Derives HMAC key from inputs using JSON-canonical form, fetches from Redis.

cache_put(context: string, params: dict, user_id: string | null, rev: int, response: CachedResponse) -> void

Derives HMAC key, stores response in Redis. Also maintains a reverse index (context + params -> HMAC keys) so cache_purge can find entries to delete.

cache_purge(context: string, params: dict | null) -> int

Looks up the reverse index for matching entries, deletes them from Redis. Returns number of entries purged. When params is null, purges entire context.

cache_purge_user(user_id: string) -> int

Iterates all contexts in the manifest, reconstructs HMAC keys for the given user_id across all param combinations in the reverse index, deletes them. Required for GDPR right-to-erasure.

HMAC key derivation (must be identical across all adapters):

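key = HMAC-SHA256(secret, canonical_json({
    "c": context,
    "p": sorted_params,
    "r": rev,
    "u": user_id    // omitted for public content
}))

Here canonical_json means JSON serialization with object keys sorted. The exact byte-level form (separators, escaping) must be fixed by the spec so that every adapter hashes the same bytes.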
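The required operations and the key derivation can be sketched together as follows. This uses an in-memory dict as a stand-in for Redis purely to keep the example self-contained. The class name `MemoryCacheSketch`, the `SECRET` constant, and the exact canonical form (sorted keys, compact separators, no `\u` escaping) are assumptions, not the shipped implementation.

```python
import hashlib
import hmac
import json
from collections import defaultdict
from typing import Any, Optional

SECRET = b"example-secret"  # hypothetical; load from configuration in practice


def canonical_json(value: Any) -> str:
    # Assumed canonical form: sorted keys, compact separators, UTF-8 without \u escapes.
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def derive_key(context: str, params: dict, user_id: Optional[str], rev: int) -> str:
    payload = {"c": context, "p": params, "r": rev}
    if user_id is not None:  # "u" is omitted for public content
        payload["u"] = user_id
    return hmac.new(SECRET, canonical_json(payload).encode(), hashlib.sha256).hexdigest()


class MemoryCacheSketch:
    def __init__(self) -> None:
        self._store: dict = {}
        # Reverse index: (context, canonical params) -> {(rev, key), ...}
        self._reverse = defaultdict(set)

    def cache_get(self, context, params, user_id, rev):
        return self._store.get(derive_key(context, params, user_id, rev))

    def cache_put(self, context, params, user_id, rev, response) -> None:
        key = derive_key(context, params, user_id, rev)
        self._store[key] = response
        self._reverse[(context, canonical_json(params))].add((rev, key))

    def cache_purge(self, context, params=None) -> int:
        purged = 0
        for (ctx, canon), entries in list(self._reverse.items()):
            if ctx != context or (params is not None and canon != canonical_json(params)):
                continue
            for _rev, key in entries:
                if key in self._store:
                    del self._store[key]
                    purged += 1
            del self._reverse[(ctx, canon)]
        return purged

    def cache_purge_user(self, user_id: str) -> int:
        purged = 0
        for (ctx, canon), entries in self._reverse.items():
            params = json.loads(canon)
            # Reconstruct the user's key for every rev seen for this context + params.
            for rev in {rev for rev, _ in entries}:
                key = derive_key(ctx, params, user_id, rev)
                if key in self._store:
                    del self._store[key]
                    entries.discard((rev, key))
                    purged += 1
        return purged
```

Storing the rev next to each key in the reverse index is a design choice of this sketch: it lets `cache_purge_user` reconstruct a user's keys without keeping a separate per-user index.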

MWT validation (must be identical across all adapters):

Validate the X-Mizan-Token header as a standard JWT (HMAC-SHA256). Extract sub (user_id) for cache key derivation, check exp for token freshness.
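As an illustration of the validation steps, here is a standard-library-only sketch of HS256 verification. The function names `validate_mwt` and `sign_mwt` are hypothetical, and a production adapter would more likely use an established JWT library. It checks the algorithm, the signature, and `exp`, then returns `sub`.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign_mwt(payload: dict, secret: bytes) -> str:
    """Helper for tests and examples: build an HS256 token."""
    header_b64 = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload_b64 = _b64url_encode(json.dumps(payload).encode())
    sig = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    return f"{header_b64}.{payload_b64}.{_b64url_encode(sig)}"


def validate_mwt(token: str, secret: bytes, now: Optional[float] = None) -> Optional[str]:
    """Return the `sub` claim of a valid, unexpired HS256 token, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except (ValueError, TypeError):
        return None
    if not isinstance(header, dict) or not isinstance(payload, dict):
        return None
    if header.get("alg") != "HS256":  # rejects "none" and other algorithms
        return None
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return None
    exp = payload.get("exp")
    current = time.time() if now is None else now
    if isinstance(exp, bool) or not isinstance(exp, (int, float)) or exp <= current:
        return None
    sub = payload.get("sub")
    return sub if isinstance(sub, str) else None
```

Treating a missing `exp` as invalid is an assumption of this sketch; whichever rule is chosen has to be the same in every adapter for the conformance suite to pass.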

Conformance test suite:

Each adapter must pass a shared set of protocol conformance tests verifying:

  • Identical HMAC output for identical inputs (cross-language determinism)
  • Identical MWT validation behavior
  • Correct purge semantics (scoped and broad)
  • Correct reverse index maintenance
  • Correct cache_purge_user behavior

[ ] SPEC: Client-side cache lifecycle

Runtime is ~95 lines. No staleTime, isFetching/isLoading distinction, garbage collection, retry logic, optimistic updates, refetchOnWindowFocus.

Minimum viable:

  • Loading/fetching state distinction (don't throw on missing data)
  • Error return shape: { data, isLoading, isFetching, error }
  • refetchOnWindowFocus as default
  • Mutation lifecycle with rollback support for optimistic updates
  • Garbage collection for unmounted context data (configurable delay)

[x] SPEC: Per-context cache policy

cache= on @client accepts three forms:

  • Omitted (default): Invalidation-based. Emits s-maxage=31536000. Cache forever, purge on mutation. Use when your backend is the source of truth.
  • cache=60 (integer seconds): TTL-based. Emits s-maxage=60. Accept bounded staleness. Use for unobservable mutations — when your backend mirrors external data (third-party APIs, aggregations, upstream services) and cannot know when it changes.
  • cache=False: Never cache. Emits Cache-Control: no-store. Use for non-deterministic functions (random(), datetime.now()).

This is the escape hatch for data the backend doesn't own the mutation scope for. Positioned in docs as: "Are you the source of truth, or a mirror? Source of truth → use affects=. Mirror → use cache=N."

The cache=int value flows into the edge manifest per-context, so the Edge Worker and CDN respect it without special handling (s-maxage is standard CDN behavior).
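The three forms map onto response headers roughly as in the sketch below. The function name `cache_control_for` is hypothetical, and adding `public` to the TTL form is an assumption carried over from the invalidation-based header.

```python
from typing import Union


def cache_control_for(cache: Union[int, bool, None] = None) -> str:
    """Translate the cache= argument of @client into a Cache-Control value."""
    if cache is None:
        # Omitted: invalidation-based, cached until purged.
        return "public, s-maxage=31536000"
    if cache is False:
        # Never cache.
        return "no-store"
    if isinstance(cache, bool) or not isinstance(cache, int) or cache <= 0:
        raise ValueError("cache must be omitted, False, or a positive integer of seconds")
    # TTL-based: bounded staleness.
    return f"public, s-maxage={cache}"
```

Rejecting `cache=True` explicitly avoids Python treating it as the integer 1 and silently emitting a one-second TTL.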

[ ] SPEC: Extension points for cache/invalidation lifecycle

Zero hooks for third-party code. No pre-invalidation hook, no custom cache key function, no invalidation transport plugin.

Minimum viable:

  • CacheBackend protocol (third parties implement custom backends)
  • on_invalidate(context, params) event hook (monitoring/debugging)
  • Document these as public API from day one

[x] SPEC: Manifest versioning

The manifest has no version field. When the schema evolves, Edge Workers can't distinguish v1 from v2 format.

Fix: Add "version": 1 to manifest root before anyone deploys it. Edge Workers check version and fail fast on unknown versions.
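A fail-fast check could be as small as the following sketch. It is written in Python for consistency with the other examples, although Edge Workers would implement the same check in TypeScript. `SUPPORTED_MANIFEST_VERSIONS` and `check_manifest_version` are hypothetical names.

```python
SUPPORTED_MANIFEST_VERSIONS = frozenset({1})


class UnsupportedManifestVersion(ValueError):
    """Raised when the manifest is missing a version or uses an unknown one."""


def check_manifest_version(manifest: dict) -> int:
    version = manifest.get("version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if (
        isinstance(version, bool)
        or not isinstance(version, int)
        or version not in SUPPORTED_MANIFEST_VERSIONS
    ):
        raise UnsupportedManifestVersion(f"unsupported manifest version: {version!r}")
    return version
```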

[x] SPEC: Wire format convention

Python emits snake_case params (user_id). TypeScript conventionally uses camelCase (userId). The USER_SCOPED_PARAMS set in manifest.ts contains both conventions. Invalidation headers from Python won't match TypeScript keys expecting camelCase.

Fix: Document snake_case as the wire format convention. TypeScript adapters convert at the boundary.


Operational Gaps

[ ] OPS: No cache observability

No hit/miss metrics, no cache key debugging, no invalidation audit trail, no manifest version tracking.

Need: X-Mizan-Cache-Status response header (HIT/MISS/BYPASS/STALE/PURGED/DYNAMIC). Structured logging in Edge Worker. Console-level invalidation event log for devtools.

[ ] OPS: Purge rate limits at scale

Cloudflare zone purge API: 500 req/10s (free/pro), 2500/10s (Enterprise). Bulk operations can exceed this.

Need: Batch purge requests (up to 30 URLs per API call). Document rate limits. Design Cache Tags upgrade path for Enterprise.
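Batching is straightforward to sketch. The helper below de-duplicates URLs and yields groups of at most 30, leaving the actual API call and rate limiting to the caller. The name `purge_batches` is hypothetical.

```python
from typing import Iterator, List, Sequence

MAX_URLS_PER_PURGE = 30  # per-call URL limit stated above


def purge_batches(urls: Sequence[str], size: int = MAX_URLS_PER_PURGE) -> Iterator[List[str]]:
    """Yield de-duplicated URLs in order, in batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    unique = list(dict.fromkeys(urls))  # drops duplicates, keeps first-seen order
    for start in range(0, len(unique), size):
        yield unique[start:start + size]
```

For 65 distinct URLs this yields three batches of 30, 30, and 5, which is three API calls instead of 65.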

[ ] OPS: Purge-then-warm race condition

Warming fetch arriving at a PoP before purge propagates gets a cache HIT on stale data.

Fix: Use Cache-Control: no-cache or cf: { cacheTtl: 0 } on warming requests to force revalidation.

[ ] OPS: PSR warming only warms one colo

Warming fetch from a Worker runs in a single datacenter. Only warms that colo's cache (+ upper-tier if Tiered Cache active). Does not warm all 300+ PoPs.

Document: PSR warming reduces origin load by warming the shield tier. First request from each edge PoP is still a cache miss to the shield. Not zero-latency for all users.


Django Integration Concerns

[ ] DX: @client breaks decorator stacking

@client returns a FunctionWrapper instance rather than a plain function, so standard Django decorators such as @login_required, @csrf_exempt, and @cache_page cannot compose with it.

Options:

  • (a) Make @client return a wraps-compatible callable that also carries metadata (Django Ninja approach).
  • (b) Document incompatibility prominently. Provide Mizan-native equivalents. State that @client replaces @login_required (via auth=), @cache_page (via context caching), etc.
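Option (a) could look roughly like the sketch below: the decorator returns an ordinary function wrapped with `functools.wraps` and attaches its metadata as an attribute. The keyword arguments mirror `auth=` and `affects=` from this document, but the signature and the `mizan_meta` attribute name are assumptions, not the current @client API.

```python
import functools


def client(*, auth=None, affects=()):
    """Hypothetical @client variant that returns a plain callable carrying metadata."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)

        # Metadata lives on the function object instead of a wrapper class.
        wrapper.mizan_meta = {"auth": auth, "affects": tuple(affects)}
        return wrapper

    return decorate
```

Because `functools.wraps` copies the wrapped function's `__dict__`, any outer decorator that also uses `functools.wraps` preserves `mizan_meta`, so the manifest generator can still find it after stacking.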

[ ] DX: JWTUser too thin for complex auth checks

Works for is_staff/is_superuser. Fails for allauth relations, DRF permissions, request.user.groups.all(), user model relations.

Need: Document limitation. Provide get_full_user() helper that does DB lookup when needed. Or optionally expand JWT claims.

[ ] DX: Transaction safety of invalidation

Invalidation in response body is optimistic — fires before ATOMIC_REQUESTS commits. If transaction rolls back, invalidation was already sent.

Need: Document as known behavior. Recommend transaction.on_commit() for critical paths. When building mizan-cache, consider two-phase: mark for invalidation during request, execute purge on commit.
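The two-phase idea can be sketched without Django as a small collector: invalidations are only recorded during the request and executed once the transaction commits. The class `PendingInvalidations` and its `purge` callback are hypothetical. In a Django integration, `commit` is the kind of function one would register through `transaction.on_commit()`.

```python
from typing import Callable, List, Optional, Tuple


class PendingInvalidations:
    """Collect invalidations during a request; run them only after commit."""

    def __init__(self, purge: Callable[[str, Optional[dict]], None]) -> None:
        self._purge = purge
        self._pending: List[Tuple[str, Optional[dict]]] = []

    def mark(self, context: str, params: Optional[dict] = None) -> None:
        # Phase 1: record the intent, do not touch the cache yet.
        self._pending.append((context, params))

    def commit(self) -> int:
        # Phase 2: the transaction succeeded, so execute the purges.
        pending, self._pending = self._pending, []
        for context, params in pending:
            self._purge(context, params)
        return len(pending)

    def rollback(self) -> None:
        # The transaction failed: discard everything that was marked.
        self._pending.clear()
```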

[ ] DX: Admin/ORM writes invisible to invalidation

Only @client(affects=...) functions trigger invalidation. Django admin saves, management commands, direct ORM writes are invisible.

Need: Document clearly. Provide manual purge API: purge_context('products', params={'product_id': 42}).

[ ] DX: Cache adapter integration for Django

The Python cache adapter is a thin protocol layer over Redis (not a Django cache backend). Django developers call mizan.cache.get(context, params, user_id, rev) directly. Provide a mizan.cache.clear() for test fixture teardown. Document that this is separate from Django's CACHES framework — Mizan owns its own cache protocol.


Business/Product Concerns

[ ] BUSINESS: Free tier + Cloudflare free = 80% of paid product

Existing Cache-Control headers on context fetches are CDN-ready. A developer puts Cloudflare free tier in front and gets stale-while-revalidate at 300+ PoPs for $0. The 20% gap (user-scoped HMAC keying, PSR, render Workers) doesn't exist in code yet.

[ ] BUSINESS: $20/seat wrong pricing model

"Seat" is undefined for a framework. Usage-based ($0.50/100K requests with generous free tier) or flat-per-project ($29/month) converts better for infrastructure products.

[ ] BUSINESS: Ship framework first, cloud second

The framework has working code. The cloud product has none. Risk: building both depletes runway before either has adoption. Recommended: get 500 devs using @client + affects= on their VPS first, then build the Edge product for the gap they actually hit.


Validated Design Decisions (No Changes Needed)

These were confirmed sound by multiple reviewers:

  • Declarative invalidation graph (affects= + auto-scoping) — unanimously praised as genuinely novel
  • Two-zone fetch() pattern — correct architecture for global CDN caching from Workers
  • Cross-language protocol — Python/TS with identical manifests, proven by parallel test suites
  • Manifest-driven URL resolution — eliminates need for cache inventory state (no KV/DOs needed)
  • Typed ReactContext for affects targeting — prevents the string-fragility concern (string form is escape hatch only)
  • Replacing React Query — correct decision given context bundling + transport transparency goals
  • Cost model: ~$5/month Cloudflare at 10K DAU, ~$20/month at 10x that load. Origin infra is the real cost.
  • Origin-side Redis cache as L2 — viable fallback behind CDN, same protocol as Edge

Unique Expert Insights

Cloudflare Expert:

  • Add cf.cacheTtl and cf.cacheEverything to all fetch() subrequests — don't rely solely on response headers
  • Consider Cache Tags (Cache-Tag response header) from day one for Enterprise upgrade path
  • Consider Durable Objects for per-user cache coordination as alternative to HMAC-in-URL

Enterprise Architect:

  • Key derivation hierarchy: master secret derives per-context keys. Compromise of one context doesn't affect others.
  • X-Mizan-Cache-Version header on every response for self-healing on version mismatch

Serverless Expert:

  • Use renderToReadableStream (streaming SSR) in Render Worker, not renderToString. Memory and CPU budget are tight (128MB / 50ms).
  • Cache manifest in globalThis in Edge Worker — do not read from KV per-request
  • AWS portability: CloudFront invalidation pricing is 10-100x more expensive. Design TTL-based alternative.

Next.js Expert:

  • PSR doesn't address cold-start pages (initial population before any mutation) or render fan-out (10K parameterized variants re-rendering on one mutation)
  • No streaming/Suspense/progressive delivery — entire context response blocks on slowest function

React Query Expert:

  • Wire existing WebSocket push infrastructure to emit invalidation events for named contexts
  • Generated hooks should return { data, isLoading, isFetching, error }, not throw on missing data

Django Architect:

  • DRF TokenAuthentication collision: both use the Authorization header (DRF's default scheme keyword is Token, though projects often switch it to Bearer), and Mizan's JWT decode rejects DRF tokens with a 401
  • mizan-cache as Django cache backend, not separate system

Framework Authoring:

  • Define CacheBackend protocol before implementing — the abstraction is cheaper to get right before users exist
  • Add "version": 1 to manifest root now — adding it later is harder
  • @client is approaching parameter overload — if cache becomes extensible, use CachePolicy object pattern, not more kwargs

SaaS Founder:

  • The debugging UX for HMAC cache is a black box — invest in an invalidation graph debugging UI as a paid feature
  • The affects= auto-refetch is the "wow" moment — optimize time-to-that-moment in onboarding