8-expert review identified 3 bugs in shipped code (Vary header hallucination, fn/function wire key mismatch, max-age=0 defeating PSR) — all fixed with tests updated across Python and TypeScript. Added: manifest version field, affects validation, wire format convention, origin-side cache module (HMAC key derivation, MemoryCache + RedisCache backends, reverse index for scoped invalidation, executor integration). 16 known issues documented in cache/KNOWN_ISSUES.md from expert review — critical items (user_id not passed, purge race condition, no Redis error handling) to be fixed in follow-up. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Architecture Rework: Cache Keying & Invalidation
Date: 2026-04-06
Source: 8 independent expert reviews (Cloudflare, Enterprise Backend, Django, SaaS Founder, Next.js, React Query, Framework Authoring, Serverless Architecture)
Status Key
- Fixed
- BUG — broken in shipped code
- DESIGN — must resolve before implementing cache layer
- SPEC — needs specification before building
- OPS — operational gap for production readiness
- DX — developer experience issue
- BUSINESS — product/pricing concern
Bugs in Shipped Code
[x] BUG: Vary: Authorization, Cookie does nothing on Cloudflare
Files: executor.py:787, dispatch.ts:77, edge-compat.test.ts:75-79
Cloudflare ignores all Vary values except Accept-Encoding and Accept (images only). This header creates a false sense of security — someone reading the code assumes different Authorization headers produce different cache entries. They do not. The edge-compat tests assert the presence of this non-functional header, reinforcing the illusion.
Origin: Claude hallucination in a prior session. Not a design decision.
Fix: Remove the header from both Python and TypeScript. Remove test assertions. Add a code comment explaining why Vary is not used and pointing to the HMAC cache key strategy (when implemented).
[x] BUG: fn vs function wire protocol key mismatch
Files: executor.py:619, runtime/index.ts:128
The Django executor reads body.get("fn"), but the TypeScript runtime sends { function: functionName }. The keys don't match, so the first real use of the new TS runtime against the Django backend would fail.
Fix: Align on one key name. Whichever is chosen, document it as stable wire format.
[x] BUG: max-age=0 defeats the PSR caching model
File: executor.py:786
Cache-Control: public, max-age=0, stale-while-revalidate=300 means the origin gets hit on every request for background revalidation. This conflicts with PSR's purge-based freshness model, where content should be cached until explicitly invalidated.
Fix: For PSR-eligible contexts, emit Cache-Control: public, s-maxage=31536000 (one year). The CDN effectively caches forever; purge is the only freshness mechanism. Reserve max-age=0, stale-while-revalidate for contexts that opt out of PSR or use time-based revalidation.
Critical Design Flaws
[ ] DESIGN: HMAC concatenation without delimiter
Severity: Security vulnerability — cache key collisions across different logical entries
HMAC(secret, context + user_id + params) without structured separation means "user" + "12" + "3" collides with "user1" + "2" + "3".
Fix: Use null-byte delimiters: HMAC(secret, context + "\x00" + user_id + "\x00" + canonical_sorted_params). Or HMAC over a JSON-canonical form. Document the canonical form as part of the AFI protocol spec.
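The collision and the delimiter fix can be reproduced in a few lines. This is a minimal sketch: the secret value and the function names (naive_key, delimited_key) are placeholders for illustration, not names from the Mizan codebase.

```python
import hashlib
import hmac

SECRET = b"example-secret"  # placeholder value, illustration only


def naive_key(context: str, user_id: str, params: str) -> str:
    """Concatenate fields with no separator (the flawed scheme)."""
    message = (context + user_id + params).encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def delimited_key(context: str, user_id: str, params: str) -> str:
    """Join fields with a null byte so field boundaries are part of the message."""
    for field in (context, user_id, params):
        if "\x00" in field:
            raise ValueError("fields must not contain the delimiter")
    message = "\x00".join((context, user_id, params)).encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


# Both inputs flatten to "user123", so the naive keys are identical.
collides = naive_key("user", "12", "3") == naive_key("user1", "2", "3")
# With delimiters the two messages differ, so the keys differ.
distinct = delimited_key("user", "12", "3") != delimited_key("user1", "2", "3")
```

The delimiter only helps if no field can contain it, which is why the sketch rejects embedded null bytes. A JSON-canonical form sidesteps that concern because the encoder escapes field contents.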
[ ] DESIGN: Full context flush on deploy = thundering herd
Severity: Operational — self-inflicted DDoS on every deploy that changes a decorator
Every deploy that changes any @client decorator nukes all cached content for affected contexts. For teams deploying 3-5x/day, the Edge cache goes cold 3-5x/day. 100K concurrent users × 10 contexts = 1M origin requests in seconds post-deploy.
Preferred fix: Versioned cache keys. Include a manifest content hash in the cache key. Old and new entries coexist during transition. No purge, no thundering herd. 2x cache storage during transition (negligible). Old entries expire naturally via TTL or LRU eviction.
Alternative fix: Granular per-context diffing. Only flush contexts whose function signatures, params, or auth requirements actually changed. The manifest already contains per-context param lists to support this.
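A rough sketch of the versioned-key idea, assuming the manifest is a JSON-serializable dict. The manifest shape, the 12-character truncation, and the helper names are assumptions made for illustration.

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """Short content hash that changes whenever the manifest content changes."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def versioned_key(base_key: str, manifest: dict) -> str:
    """Prefix a cache key with the manifest hash so a deploy never overwrites old entries."""
    return f"{manifest_hash(manifest)}:{base_key}"


# Hypothetical manifests before and after a deploy that adds a param.
old_manifest = {"version": 1, "contexts": {"products": {"params": ["product_id"]}}}
new_manifest = {"version": 1, "contexts": {"products": {"params": ["product_id", "locale"]}}}

old_key = versioned_key("products/42", old_manifest)
new_key = versioned_key("products/42", new_manifest)
```

Old and new keys differ, so both generations of entries can sit in the cache until the old ones are evicted. Hashing each context's manifest entry separately, rather than the whole manifest, would combine this with the per-context diffing alternative.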
[ ] DESIGN: Purge token in customer Workers exposes shared cache
Severity: Security — one compromised customer can purge all customers' cache
Every customer Edge Worker deployment carries a Cloudflare API token with Zone:Cache Purge permission for render.mizan.cloud.
Fix: Build a purge proxy Worker on the Mizan zone. Validates purge requests (HMAC signature + customer-scoped URL pattern matching) before forwarding to the Cloudflare purge API. No customer Worker ever holds a direct zone API token.
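The validation step inside such a proxy might look like the sketch below. The per-customer secret table, the path-prefix scoping rule, and both function names are hypothetical; the real scoping pattern is still to be designed.

```python
import hashlib
import hmac
from urllib.parse import urlsplit

# Hypothetical per-customer signing secrets and path scopes (illustration only).
CUSTOMER_SECRETS = {"acme": b"acme-signing-secret"}
CUSTOMER_PATH_PREFIX = {"acme": "/acme/"}
RENDER_HOST = "render.mizan.cloud"


def sign_purge(secret: bytes, urls: list) -> str:
    """Signature a customer Worker would attach to its purge request."""
    message = "\n".join(sorted(urls)).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_purge_allowed(customer: str, urls: list, signature: str) -> bool:
    """Accept only correctly signed requests whose URLs stay inside the customer's scope."""
    secret = CUSTOMER_SECRETS.get(customer)
    if secret is None or not urls:
        return False
    expected = sign_purge(secret, urls)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
        return False
    prefix = CUSTOMER_PATH_PREFIX[customer]
    for url in urls:
        parts = urlsplit(url)
        in_scope = (
            parts.scheme == "https"
            and parts.netloc == RENDER_HOST
            and parts.path.startswith(prefix)
            and ".." not in parts.path.split("/")  # reject path traversal
        )
        if not in_scope:
            return False
    return True
```

Only requests that pass this check would be forwarded to the Cloudflare purge API, using a token that lives solely in the proxy.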
[ ] DESIGN: Permission key race condition
Severity: Data correctness — stale content served for duration of JWT lifetime
User permission changes (e.g., tier upgrade) don't take effect until JWT expires because: (1) cache key uses only user_id, not tier, and (2) permission key comparison uses the JWT-derived value, which is stale until refresh.
Options:
- (a) Make permission-relevant attributes part of the cache key (increases cardinality).
- (b) Accept the JWT-lifetime staleness window, document as known constraint.
- (c) Add short-TTL revalidation for permission-sensitive contexts.
Decision needed before implementation.
[ ] DESIGN: No waitUntil() in purge/warm flow
Severity: Latency — client blocks on cache management operations
If a mutation invalidates N URLs, the Edge Worker must complete all purge API calls before responding. Each call is 50-200ms.
Fix: Return mutation response immediately. Fire all purge and warming fetches inside waitUntil(). Same Worker invocation, no extra billing, client doesn't block.
Missing Specifications
[ ] SPEC: Secret rotation protocol
No rotation mechanism, no dual-secret acceptance window, no compromise recovery procedure. Rotating the single secret invalidates every HMAC globally.
Need: Key derivation hierarchy (master secret -> per-context derived keys). Rotation at context level. Dual-secret acceptance window during rotation. Document compromise recovery procedure.
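One possible shape for the hierarchy and the dual-secret window, as a sketch. It uses a single HMAC-SHA256 step as the derivation function (HKDF would be a more standard choice), and the label string and function names are assumptions.

```python
import hashlib
import hmac
from typing import List, Optional


def derive_context_key(master_secret: bytes, context: str) -> bytes:
    """Derive a per-context key; leaking one derived key does not expose the others."""
    label = b"mizan-cache-key\x00" + context.encode("utf-8")  # hypothetical label
    return hmac.new(master_secret, label, hashlib.sha256).digest()


def cache_tag(context_key: bytes, message: bytes) -> str:
    """HMAC of the canonical cache-key message under a derived context key."""
    return hmac.new(context_key, message, hashlib.sha256).hexdigest()


def candidate_tags(current: bytes, previous: Optional[bytes],
                   context: str, message: bytes) -> List[str]:
    """Tags to try on lookup: the current secret first, then the previous one
    while the dual-secret acceptance window is open."""
    secrets = [current] if previous is None else [current, previous]
    return [cache_tag(derive_context_key(s, context), message) for s in secrets]
```

Writes would always use the first tag. Once the acceptance window closes, previous is dropped and entries keyed under the old secret age out.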
[ ] SPEC: GDPR right-to-erasure for cached content
HMAC keys make targeted per-user cache purge difficult: the purge must reconstruct every possible HMAC for every context × param combination for a given user.
Need: purge_by_user(user_id) operation that iterates manifest contexts to reconstruct all HMACs. Tractable if context count is bounded. Audit trail for compliance proof.
[ ] SPEC: Cache adapter conformance requirements
Every Mizan backend adapter (Python, TypeScript, and future: PHP, C#, Go, etc.) must implement the origin-side cache protocol. This is NOT a binary ABI or pluggable backend interface. It is a set of operations each adapter implements in its own language, backed by Redis. Conformance is verified by a shared test suite (same model as the existing edge-compat tests that prove Python and TypeScript produce identical protocol output).
Storage: Redis. Not pluggable. Not in-memory-only. Redis handles persistence, cross-worker sharing, and crash recovery. The adapter is a thin protocol layer over Redis commands.
Required operations:
cache_get(context: string, params: dict, user_id: string | null, rev: int) -> CachedResponse | null
Derives HMAC key from inputs using JSON-canonical form, fetches from Redis.
cache_put(context: string, params: dict, user_id: string | null, rev: int, response: CachedResponse) -> void
Derives HMAC key, stores response in Redis. Also maintains a reverse index
(context + params -> HMAC keys) so cache_purge can find entries to delete.
cache_purge(context: string, params: dict | null) -> int
Looks up the reverse index for matching entries, deletes them from Redis.
Returns number of entries purged. When params is null, purges entire context.
cache_purge_user(user_id: string) -> int
Iterates all contexts in the manifest, reconstructs HMAC keys for the given user_id across all param combinations in the reverse index, deletes them. Required for GDPR right-to-erasure.
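To make the reverse-index bookkeeping concrete, here is a dict-backed stand-in for the four operations. It is not the Redis-backed adapter the spec requires. The class name is hypothetical, and the per-user index bucket is a simplification assumed by this sketch rather than the manifest-iteration approach described above.

```python
import hashlib
import hmac
import json
from collections import defaultdict


class MemoryCacheSketch:
    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._entries = {}              # HMAC key -> cached response
        self._index = defaultdict(set)  # reverse index: scope tuple -> HMAC keys

    @staticmethod
    def _canon(value) -> str:
        return json.dumps(value, sort_keys=True, separators=(",", ":"))

    def _key(self, context, params, user_id, rev) -> str:
        payload = {"c": context, "p": params, "r": rev}
        if user_id is not None:
            payload["u"] = user_id
        message = self._canon(payload).encode("utf-8")
        return hmac.new(self._secret, message, hashlib.sha256).hexdigest()

    def cache_put(self, context, params, user_id, rev, response) -> None:
        key = self._key(context, params, user_id, rev)
        self._entries[key] = response
        self._index[("scope", context, self._canon(params))].add(key)
        self._index[("context", context)].add(key)
        if user_id is not None:
            self._index[("user", user_id)].add(key)

    def cache_get(self, context, params, user_id, rev):
        return self._entries.get(self._key(context, params, user_id, rev))

    def _delete(self, keys) -> int:
        removed = 0
        for key in keys:
            if key in self._entries:  # may already be gone via another bucket
                del self._entries[key]
                removed += 1
        return removed

    def cache_purge(self, context, params=None) -> int:
        bucket = (("context", context) if params is None
                  else ("scope", context, self._canon(params)))
        return self._delete(self._index.pop(bucket, set()))

    def cache_purge_user(self, user_id) -> int:
        return self._delete(self._index.pop(("user", user_id), set()))
```

In a Redis-backed adapter the index buckets would plausibly map to Redis sets, with the delete step issued as a batch.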
HMAC key derivation (must be identical across all adapters):
key = HMAC-SHA256(secret, JSON.stringify({
"c": context,
"p": sorted_params,
"r": rev,
"u": user_id // omitted for public content
}, sort_keys=True))
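In Python the derivation above could be written as follows. This is a sketch: the function name is hypothetical, and the separators and ensure_ascii settings are assumptions chosen to resemble the compact output of JSON.stringify. The exact byte format is something the spec still has to pin down, since any difference between adapters changes the HMAC.

```python
import hashlib
import hmac
import json
from typing import Optional


def derive_cache_key(secret: bytes, context: str, params: dict, rev: int,
                     user_id: Optional[str] = None) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the cache key fields."""
    payload = {"c": context, "p": params, "r": rev}
    if user_id is not None:
        payload["u"] = user_id  # omitted entirely for public content
    # sort_keys=True sorts keys at every nesting level, so params order is irrelevant.
    canonical = json.dumps(payload, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


k1 = derive_cache_key(b"secret", "products", {"b": 2, "a": 1}, rev=3, user_id="u1")
k2 = derive_cache_key(b"secret", "products", {"a": 1, "b": 2}, rev=3, user_id="u1")
```

k1 and k2 are equal because key order is normalized before hashing. Number formatting (for example floats) is a likely source of cross-language drift and is worth covering in the conformance suite.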
MWT validation (must be identical across all adapters):
Validate the X-Mizan-Token header as a standard JWT (HMAC-SHA256). Extract sub
(user_id) for cache key derivation, check exp for token freshness.
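A standard-library sketch of that validation, to show the checks involved. The function names are hypothetical, and a production adapter would more likely rely on a maintained JWT library than on hand-rolled parsing.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_mwt(sub: str, exp: int, secret: bytes) -> str:
    """Build an HS256 token; included only so the sketch can be exercised."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    payload = _b64url_encode(json.dumps({"sub": sub, "exp": exp}).encode("utf-8"))
    sig = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def validate_mwt(token: str, secret: bytes, now: Optional[float] = None) -> Optional[str]:
    """Return the sub claim if the token is a valid, unexpired HS256 JWT, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if not isinstance(header, dict) or header.get("alg") != "HS256":
            return None  # reject every other algorithm, including "none"
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
    except (ValueError, TypeError):
        return None  # malformed structure, base64, or JSON
    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or current >= exp:
        return None
    sub = claims.get("sub")
    return sub if isinstance(sub, str) else None
```

The signature is checked before the claims are trusted, and the expiry check uses an injectable clock so the conformance tests can be deterministic.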
Conformance test suite:
Each adapter must pass a shared set of protocol conformance tests verifying:
- Identical HMAC output for identical inputs (cross-language determinism)
- Identical MWT validation behavior
- Correct purge semantics (scoped and broad)
- Correct reverse index maintenance
- Correct cache_purge_user behavior
[ ] SPEC: Client-side cache lifecycle
Runtime is ~95 lines. No staleTime, isFetching/isLoading distinction, garbage collection, retry logic, optimistic updates, refetchOnWindowFocus.
Minimum viable:
- Loading/fetching state distinction (don't throw on missing data)
- Error return shape: { data, isLoading, isFetching, error }
- refetchOnWindowFocus as default
- Mutation lifecycle with rollback support for optimistic updates
- Garbage collection for unmounted context data (configurable delay)
[x] SPEC: Per-context cache policy
cache= on @client accepts three forms:
- Omitted (default): Invalidation-based. Emits s-maxage=31536000. Cache forever, purge on mutation. Use when your backend is the source of truth.
- cache=60 (integer seconds): TTL-based. Emits s-maxage=60. Accept bounded staleness. Use for unobservable mutations: when your backend mirrors external data (third-party APIs, aggregations, upstream services) and cannot know when it changes.
- cache=False: Never cache. Emits Cache-Control: no-store. Use for non-deterministic functions (random(), datetime.now()).
This is the escape hatch for data the backend doesn't own the mutation scope for.
Positioned in docs as: "Are you the source of truth, or a mirror? Source of truth →
use affects=. Mirror → use cache=N."
The cache=int value flows into the edge manifest per-context, so the Edge Worker
and CDN respect it without special handling (s-maxage is standard CDN behavior).
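The mapping from the three cache= forms to a header value is small enough to sketch directly. The function name is hypothetical, and the public prefix on the TTL form is an assumption carried over from the PSR header in the max-age=0 fix.

```python
from typing import Union


def cache_control_for(cache: Union[int, bool, None] = None) -> str:
    """Translate a per-context cache= setting into a Cache-Control value."""
    if cache is False:
        return "no-store"                   # never cache
    if cache is None:
        return "public, s-maxage=31536000"  # invalidation-based: purge on mutation
    # bool is a subclass of int, so cache=True must be rejected explicitly.
    if isinstance(cache, bool) or not isinstance(cache, int) or cache <= 0:
        raise ValueError("cache must be omitted, False, or a positive integer")
    return f"public, s-maxage={cache}"      # TTL-based: bounded staleness
```

For example, cache_control_for(60) yields "public, s-maxage=60", while cache_control_for(True) raises rather than silently caching for one second.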
[ ] SPEC: Extension points for cache/invalidation lifecycle
Zero hooks for third-party code. No pre-invalidation hook, no custom cache key function, no invalidation transport plugin.
Minimum viable:
- CacheBackend protocol (third parties implement custom backends)
- on_invalidate(context, params) event hook (monitoring/debugging)
- Document these as public API from day one
[x] SPEC: Manifest versioning
The manifest has no version field. When the schema evolves, Edge Workers can't distinguish v1 from v2 format.
Fix: Add "version": 1 to manifest root before anyone deploys it. Edge Workers check version and fail fast on unknown versions.
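The fail-fast check is language-neutral. It is sketched here in Python, with a hypothetical exception type and function name; the Edge Worker itself would carry an equivalent check in its own language.

```python
SUPPORTED_MANIFEST_VERSIONS = frozenset({1})


class UnsupportedManifestVersion(ValueError):
    """Raised when a manifest declares a version this worker does not understand."""


def check_manifest_version(manifest: dict) -> int:
    """Return the manifest version, or raise before any other field is read."""
    version = manifest.get("version")
    is_int = isinstance(version, int) and not isinstance(version, bool)
    if not is_int or version not in SUPPORTED_MANIFEST_VERSIONS:
        raise UnsupportedManifestVersion(f"unsupported manifest version: {version!r}")
    return version
```

A missing field is treated the same as an unknown version, so manifests produced before the field existed are rejected rather than misread.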
[x] SPEC: Wire format convention
Python emits snake_case params (user_id). TypeScript conventionally uses camelCase (userId). The USER_SCOPED_PARAMS set in manifest.ts contains both conventions. Invalidation headers from Python won't match TypeScript keys expecting camelCase.
Fix: Document snake_case as the wire format convention. TypeScript adapters convert at the boundary.
Operational Gaps
[ ] OPS: No cache observability
No hit/miss metrics, no cache key debugging, no invalidation audit trail, no manifest version tracking.
Need: X-Mizan-Cache-Status response header (HIT/MISS/BYPASS/STALE/PURGED/DYNAMIC). Structured logging in Edge Worker. Console-level invalidation event log for devtools.
[ ] OPS: Purge rate limits at scale
Cloudflare zone purge API: 500 req/10s (free/pro), 2500/10s (Enterprise). Bulk operations can exceed this.
Need: Batch purge requests (up to 30 URLs per API call). Document rate limits. Design Cache Tags upgrade path for Enterprise.
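Batching is mostly a chunking problem. A minimal sketch, with the batch size taken from the 30-URL figure above and a hypothetical function name:

```python
from typing import Iterator, List

MAX_URLS_PER_CALL = 30  # batch limit quoted above


def batch_urls(urls: List[str], size: int = MAX_URLS_PER_CALL) -> Iterator[List[str]]:
    """Yield de-duplicated URLs in order, at most `size` per purge API call."""
    unique = list(dict.fromkeys(urls))  # drops repeats, keeps first-seen order
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


# 65 distinct URLs need three purge calls instead of 65.
urls = [f"https://render.mizan.cloud/products/{i}" for i in range(65)]
batches = list(batch_urls(urls))
```

Each batch counts as a single request against the zone rate limit, so a bulk invalidation of a few hundred URLs stays well inside it. Pacing between batches is not shown.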
[ ] OPS: Purge-then-warm race condition
Warming fetch arriving at a PoP before purge propagates gets a cache HIT on stale data.
Fix: Use Cache-Control: no-cache or cf: { cacheTtl: 0 } on warming requests to force revalidation.
[ ] OPS: PSR warming only warms one colo
Warming fetch from a Worker runs in a single datacenter. Only warms that colo's cache (+ upper-tier if Tiered Cache active). Does not warm all 300+ PoPs.
Document: PSR warming reduces origin load by warming the shield tier. First request from each edge PoP is still a cache miss to the shield. Not zero-latency for all users.
Django Integration Concerns
[ ] DX: @client breaks decorator stacking
@client returns a class (FunctionWrapper), not a callable. @login_required, @csrf_exempt, @cache_page cannot compose with it.
Options:
- (a) Make @client return a wraps-compatible callable that also carries metadata (Django Ninja approach).
- (b) Document incompatibility prominently. Provide Mizan-native equivalents. State that @client replaces @login_required (via auth=), @cache_page (via context caching), etc.
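Option (a) could look roughly like the sketch below. The client signature, the mizan_meta attribute name, and the audit decorator are all hypothetical and only illustrate the pattern; the real @client accepts more than this.

```python
import functools
from typing import Callable, Optional, Tuple


def client(*, context: Optional[str] = None, affects: Tuple[str, ...] = ()) -> Callable:
    """Return a plain callable that carries Mizan metadata as an attribute."""
    def decorate(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        wrapper.mizan_meta = {"context": context, "affects": tuple(affects)}
        return wrapper
    return decorate


def audit(fn: Callable) -> Callable:
    """Stand-in for any third-party decorator built with functools.wraps."""
    @functools.wraps(fn)
    def inner(*args, **kwargs):
        return fn(*args, **kwargs)
    return inner


@audit
@client(context="products", affects=("products",))
def rename_product(product_id: int, name: str) -> dict:
    return {"product_id": product_id, "name": name}
```

Because functools.wraps copies the wrapped function's attribute dictionary, mizan_meta stays visible on the outermost callable even when other wraps-based decorators are stacked on top.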
[ ] DX: JWTUser too thin for complex auth checks
Works for is_staff/is_superuser. Fails for allauth relations, DRF permissions, request.user.groups.all(), user model relations.
Need: Document limitation. Provide get_full_user() helper that does DB lookup when needed. Or optionally expand JWT claims.
[ ] DX: Transaction safety of invalidation
Invalidation in response body is optimistic — fires before ATOMIC_REQUESTS commits. If transaction rolls back, invalidation was already sent.
Need: Document as known behavior. Recommend transaction.on_commit() for critical paths. When building mizan-cache, consider two-phase: mark for invalidation during request, execute purge on commit.
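The two-phase idea can be sketched without Django: collect invalidations during the request, then run them only once the transaction is known to have committed. The class name and the purge callback are hypothetical. In a Django integration, commit would plausibly be registered through transaction.on_commit().

```python
from typing import Callable, List, Optional, Tuple


class PendingInvalidations:
    """Phase 1: mark during the request. Phase 2: purge on commit."""

    def __init__(self, purge: Callable[[str, Optional[dict]], None]) -> None:
        self._purge = purge
        self._pending: List[Tuple[str, Optional[dict]]] = []

    def mark(self, context: str, params: Optional[dict] = None) -> None:
        """Record an invalidation without executing it yet."""
        self._pending.append((context, params))

    def commit(self) -> int:
        """Execute every marked purge; call only after the transaction commits."""
        pending, self._pending = self._pending, []
        for context, params in pending:
            self._purge(context, params)
        return len(pending)

    def rollback(self) -> None:
        """Discard marked purges because the transaction did not commit."""
        self._pending.clear()
```

A rolled-back request therefore sends no purge at all, and calling commit twice is harmless because the pending list is emptied on the first call.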
[ ] DX: Admin/ORM writes invisible to invalidation
Only @client(affects=...) functions trigger invalidation. Django admin saves, management commands, direct ORM writes are invisible.
Need: Document clearly. Provide manual purge API: purge_context('products', params={'product_id': 42}).
[ ] DX: Cache adapter integration for Django
The Python cache adapter is a thin protocol layer over Redis (not a Django cache backend).
Django developers call mizan.cache.get(context, params, user_id, rev) directly.
Provide a mizan.cache.clear() for test fixture teardown. Document that this is
separate from Django's CACHES framework — Mizan owns its own cache protocol.
Business/Product Concerns
[ ] BUSINESS: Free tier + Cloudflare free = 80% of paid product
Existing Cache-Control headers on context fetches are CDN-ready. A developer puts Cloudflare free tier in front and gets stale-while-revalidate at 300+ PoPs for $0. The 20% gap (user-scoped HMAC keying, PSR, render Workers) doesn't exist in code yet.
[ ] BUSINESS: $20/seat wrong pricing model
"Seat" is undefined for a framework. Usage-based ($0.50/100K requests with generous free tier) or flat-per-project ($29/month) converts better for infrastructure products.
[ ] BUSINESS: Ship framework first, cloud second
The framework has working code. The cloud product has zero. Risk: building both depletes runway before either has adoption. Recommended: get 500 devs using @client + affects= on their VPS first, then build the Edge product for the gap they actually hit.
Validated Design Decisions (No Changes Needed)
These were confirmed sound by multiple reviewers:
- Declarative invalidation graph (affects= + auto-scoping): unanimously praised as genuinely novel
- Two-zone fetch() pattern: correct architecture for global CDN caching from Workers
- Cross-language protocol: Python/TS with identical manifests, proven by parallel test suites
- Manifest-driven URL resolution: eliminates need for cache inventory state (no KV/DOs needed)
- Typed ReactContext for affects targeting: prevents the string-fragility concern (string form is escape hatch only)
- Replacing React Query: correct decision given context bundling + transport transparency goals
- Cost model: ~$5/month Cloudflare at 10K DAU, ~$20/month at 10x. Origin infra is the real cost.
- Origin-side Redis cache as L2: viable fallback behind CDN, same protocol as Edge
Unique Expert Insights
Cloudflare Expert:
- Add cf.cacheTtl and cf.cacheEverything to all fetch() subrequests; don't rely solely on response headers
- Consider Cache Tags (Cache-Tag response header) from day one for Enterprise upgrade path
- Consider Durable Objects for per-user cache coordination as alternative to HMAC-in-URL
Enterprise Architect:
- Key derivation hierarchy: master secret derives per-context keys. Compromise of one context doesn't affect others.
- X-Mizan-Cache-Version header on every response for self-healing on version mismatch
Serverless Expert:
- Use renderToReadableStream (streaming SSR) in Render Worker, not renderToString. Memory and CPU budget are tight (128MB / 50ms).
- Cache manifest in globalThis in Edge Worker; do not read from KV per-request
- AWS portability: CloudFront invalidation pricing is 10-100x more expensive. Design TTL-based alternative.
Next.js Expert:
- PSR doesn't address cold-start pages (initial population before any mutation) or render fan-out (10K parameterized variants re-rendering on one mutation)
- No streaming/Suspense/progressive delivery — entire context response blocks on slowest function
React Query Expert:
- Wire existing WebSocket push infrastructure to emit invalidation events for named contexts
- Generated hooks should return { data, isLoading, isFetching, error }, not throw on missing data
Django Architect:
- DRF TokenAuthentication collision: both use Authorization: Bearer, Mizan's JWT decode rejects DRF tokens with a 401
- mizan-cache as Django cache backend, not separate system
Framework Authoring:
- Define CacheBackend protocol before implementing: the abstraction is cheaper to get right before users exist
- Add "version": 1 to manifest root now: adding it later is harder
- @client is approaching parameter overload: if cache becomes extensible, use CachePolicy object pattern, not more kwargs
SaaS Founder:
- The debugging UX for HMAC cache is a black box — invest in an invalidation graph debugging UI as a paid feature
- The affects= auto-refetch is the "wow" moment: optimize time-to-that-moment in onboarding