Quartz v5.25

Phase 3b Investigation: tc_free is a no-op for a different reason than expected

Status: Investigation complete. No code-path fix shipped this session — the original Phase 3b plan (“free TC state after typecheck, free AST after MIR”) turns out to be aimed at the wrong target. The actual leak is elsewhere, and fixing it requires a different approach (next session).

What the original handoff assumed:

  1. tc_free is a no-op because libc holds freed pages in its arena.
  2. Adding ast_free after MIR lowering will drop ~7 GB of AST data.
  3. Together these cut peak RSS by 5–10 GB.

What we actually found:

tc_free’s coverage is tiny (~1 MB)

Instrumented tc_dump_sizes(tc) to sum the entry counts of every Vec field on TypecheckState (including nested Vec<Vec<Int>> registry tables). Result on a full self-compile:

[mem] tc Vec entries (sum of sizes): 18276
[mem] tc nested Vec<Vec> entries:    72001
[mem] tc registry: structs=47 funcs=2140 traits=6 ptypes=106
[mem] tc interner.strings: 3042
[mem] global interner.strings: 32964

Total tracked entries: ~90k. At ~16 bytes per entry that is ~1.4 MB. The “typecheck adds 6.5 GB” delta we have been chasing is NOT IN tc_free’s covered Vecs. Even fixing every page-allocation issue under tc_free would recover at most a few MB.

The 6.3 GB lives in tc_function’s body walk

Bisected with a definitive test: commenting out the call to typecheck_walk::tc_function in the per-function loop in self-hosted/quartz.qz (line 742) drops the typecheck phase delta from +6552 MB to +208 MB, a 6.3 GB difference.

Then commenting out only the tc_stmt(tc, ast_storage, body) call inside tc_function reproduces the same 6.3 GB drop. So the leak lives inside the tc_stmt body walk (or its callee tc_expr), not in scope/registry bookkeeping or borrow-summary refinement.

Across 2140 functions, that is ~3 MB per function. None of it ends up in tc-tracked Vecs by the time the loop exits, which means it is in:

  • Global state mutated during the walk. liveness::g_func_handles / g_func_infos (per-function liveness info structs, never freed) — partial candidate but probably <1 GB total.
  • AST mutations. ast_set_str1 / ast_set_str2 interns strings into the global interner, but the interner only holds 33k strings (~1 MB).
  • Per-walk closure allocations. each(stmts, stmt: Int -> tc_stmt(...)) in the NODE_BLOCK handler allocates a fresh closure environment on every block walked. ~32 bytes each, but blocks are dense — bounded above ~10 MB total though, not 6 GB.
  • Substring/type-name allocations in tc_parse_type. Each call to tc_parse_type allocates Vecs and substring slices that are never freed. The per-call leak is small (~tens of bytes), but tc_parse_type is called O(node-count) times during the walk. Plausible candidate.
  • tc_tv_fresh_with_origin desc strings. Each fresh type variable created during inference allocates a new String for its origin description (e.g. "let x"). tc_tv_reset pops the entries between functions but never frees the underlying String objects.

The real culprit is almost certainly multiple of the above accumulating silently. The investigation needs to keep bisecting inside tc_stmt / tc_expr to identify the dominant allocator.
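The tc_tv_fresh_with_origin pattern above can be mirrored in a short C sketch (hypothetical names; the real code is Quartz): resetting the pool length pops the entries, but the desc Strings they own stay allocated.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>

// Each fresh type variable owns a heap-allocated origin description.
typedef struct { char *desc; } TypeVar;

static TypeVar pool[1024];
static int pool_len = 0;

void tv_fresh_with_origin(const char *origin) {
    pool[pool_len++].desc = strdup(origin);   // one heap String per tv
}

// The leak shape: length reset between functions, payloads never freed.
void tv_reset_leaky(void) {
    pool_len = 0;
}

// The fix shape: free the owned Strings before dropping the entries.
void tv_reset_fixed(void) {
    for (int i = 0; i < pool_len; i++) free(pool[i].desc);
    pool_len = 0;
}
```

Per function compiled, tv_reset_leaky leaks one allocation per type variable created during inference, which compounds across thousands of functions.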

macOS libc hides the problem

Even when we DO free things via vec_free, libc’s malloc keeps the pages in its arena rather than returning them to the OS. Verified in C:

void *p = malloc(400 * 1024 * 1024);  // 400 MB
memset(p, 0xab, 400 * 1024 * 1024);   // RSS = 401 MB
free(p);                               // RSS = 401 MB (still!)

malloc_zone_pressure_relief(NULL, 0) returns 0 — there is nothing pooled to release. The pages are tracked in libc’s free lists and not handed back until the process exits.

madvise(MADV_FREE) and madvise(MADV_FREE_REUSABLE) on malloc-allocated memory also do not drop RSS — verified empirically. The only way to drop RSS on macOS for a freed block is to allocate via mmap directly and free via munmap. mimalloc linked via DYLD_INSERT_LIBRARIES has the same problem.
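A counterpart to the malloc experiment above, showing the mmap/munmap behavior this paragraph describes. The helper name and the smaller size are illustrative; the write-up's experiment used 400 MB.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

// Allocate, touch, and free a block via mmap/munmap. Unlike malloc/free,
// munmap hands the pages straight back to the OS, so RSS drops after the
// free (per the measurements described in this section).
static int mmap_touch_free(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    memset(p, 0xab, len);          // RSS grows by ~len while mapped
    return munmap(p, len);         // returns 0 on success; RSS drops here
}
```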

Why mmap-backed Vec helpers were attempted and reverted

We built __qz_vec_alloc_data / __qz_vec_realloc_data / __qz_vec_free_data runtime helpers that route allocations >= 64 KB through mmap (with a hidden 16-byte size prefix) and free them via munmap. Verified to work for large single-vec allocate-and-free: malloc 400 MB then free leaves RSS at 401; the mmap-backed equivalent leaves RSS at 1.

But the helpers had a transition leak: when a Vec grows past the 64 KB threshold via vec_grow, the OLD malloc-backed buffer is freed via libc (stays in libc’s pool, never released to OS), and a NEW mmap-backed buffer is allocated. Net peak excursion: +500 MB on the self-compile, with no corresponding drop because the malloc-pool half is unreclaimable. Reverted.

The right fix would be to make Vec’s data buffer mmap-backed FROM THE START for any vec that might grow large: lower the threshold to ~one page, which eliminates transition leaks at the cost of page-padding overhead per small vec. This is a Vec architecture change and was out of scope for this investigation.
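The reverted helpers can be sketched roughly like this (names follow the handoff; the 16-byte size-prefix layout is an assumption about the real implementation):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

// Sketch of __qz_vec_alloc_data / __qz_vec_free_data: route the buffer
// through mmap and stash the mapped length in a hidden 16-byte prefix so
// the free side can munmap the exact size.
#define QZ_PREFIX 16

void *qz_vec_alloc_data(size_t bytes) {
    size_t total = bytes + QZ_PREFIX;
    void *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    *(size_t *)p = total;               // hidden size prefix
    return (char *)p + QZ_PREFIX;       // caller sees only the data area
}

void qz_vec_free_data(void *data) {
    if (!data) return;
    void *base = (char *)data - QZ_PREFIX;
    munmap(base, *(size_t *)base);      // pages return to the OS immediately
}
```

The transition leak described above happens exactly at the malloc-to-mmap boundary: vec_grow frees the old malloc buffer (stays pooled in libc) before calling qz_vec_alloc_data for the new one.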

What shipped this session

  1. mem_release intrinsic (mem_release(): Int)

    • macOS: calls malloc_zone_pressure_relief(NULL, 0).
    • Linux: calls malloc_trim(0).
    • WASI: returns 0.
    • Returns the number of bytes the runtime claims to have released. (On macOS this routinely returns 0 because the pool is empty; on Linux malloc_trim actually does work for the small-block heap.)
    • Useful to call after batch-free phases. Currently a no-op on macOS, useful on Linux, designed for future __qz_vec_*_data integration.
  2. This handoff document.
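A hedged C sketch of the per-platform dispatch the intrinsic description implies (the function name is illustrative; the real lowering lives in the Quartz runtime). Note glibc's malloc_trim only reports a released-anything flag, not a byte count, so the "bytes released" return value is best-effort on Linux.

```c
#include <stdlib.h>
#if defined(__APPLE__)
#include <malloc/malloc.h>
#elif defined(__GLIBC__)
#include <malloc.h>
#endif

// Per-platform "give pages back to the OS" request.
long mem_release_sketch(void) {
#if defined(__APPLE__)
    // Returns bytes released; routinely 0 because the pool is empty.
    return (long)malloc_zone_pressure_relief(NULL, 0);
#elif defined(__GLIBC__)
    // Returns 1 if any memory was returned to the system, else 0.
    return (long)malloc_trim(0);
#else
    return 0;   // WASI and other targets: no-op
#endif
}
```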

No changes to vec/sb/string allocators. No changes to the AST. No changes to tc_free’s coverage. The Phase 3b mmap-helper experiment was reverted to keep the self-compile baseline clean (no regression).

Next session — Phase 3b.next

Goal: localize the 3 MB-per-function leak inside tc_stmt / tc_expr and fix it.

Bisection plan

Inside tc_stmt (typecheck_walk.qz line 1363 onwards):

  1. First narrow by node kind. tc_stmt has dozens of elsif node_kind == NODE_X branches. Pick the most common kinds — NODE_LET, NODE_EXPR_STMT, NODE_BLOCK, NODE_RETURN, NODE_IF, NODE_FOR, NODE_WHILE — and disable them one at a time. Whichever removal drops the typecheck phase delta the most is the leak source.

  2. Inside the leaking handler, look for:

    • vec_new() calls that don’t have a paired vec_free.
    • String concatenation via #{...} interpolation in hot paths.
    • ast_set_* calls that intern into the global interner.
    • Calls to tc_parse_type (which allocates substring Vecs/slices).
    • Closures created via lambda literals.
  3. Verify with tc_dump_sizes. After the fix, run with --memory-stats and check that the typecheck phase delta drops by the expected amount.

Likely actual fixes

  • Cache type parses. tc_parse_type is called O(node-count) times with many duplicate annotation strings. A Map<String, Int> cache from annotation → resolved type ID would make the lookup O(1) and eliminate allocation.
  • Free liveness info per function after typecheck consumes it. The g_func_infos global vec accumulates per-function tables; each one is read once during tc_function and never again.
  • Reuse closures. The block-walking pattern each(stmts, stmt -> tc_stmt (tc, ast_storage, stmt)) should not allocate a fresh closure per call. Either use a top-level helper function (no captures) or pre-create one closure and reuse it.
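The type-parse cache idea can be sketched as follows. Everything here is illustrative: a linear map stands in for Map<String, Int>, and a counting stub stands in for the real allocating parse.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>

// Stand-in for the allocating tc_parse_type; counts how often it runs.
static int parse_calls = 0;
static int tc_parse_type_uncached(const char *ann) {
    parse_calls++;
    return (int)strlen(ann);   // fake resolved type id
}

#define CACHE_CAP 4096
static char *cache_keys[CACHE_CAP];
static int cache_vals[CACHE_CAP];
static int cache_len = 0;

// Annotation -> resolved type id, memoized. Duplicate annotation strings
// hit the cache and skip the parse (and its allocations) entirely.
int tc_parse_type_cached(const char *ann) {
    for (int i = 0; i < cache_len; i++)
        if (strcmp(cache_keys[i], ann) == 0)
            return cache_vals[i];              // hit: no parse, no alloc
    int t = tc_parse_type_uncached(ann);
    if (cache_len < CACHE_CAP) {
        cache_keys[cache_len] = strdup(ann);   // one copy per distinct string
        cache_vals[cache_len++] = t;
    }
    return t;
}
```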

What success looks like

  • Self-compile peak RSS: 15 GB → ~10 GB (~33% reduction from current).
  • typecheck phase delta: 6.5 GB → <1 GB (most of the 6.3 GB recovered).
  • mir phase delta: likely also drops because mir works on top of the typecheck baseline.

What is NOT in scope for the next session

  • Vec mmap-backing rewrite. Out of scope until we hit a leak that is purely in libc’s pool (current bottleneck is upstream of that).
  • ast_free. The handoff originally targeted ast_free but the AST is at most ~370 MB based on the resolve-phase total — even fully freeing it only saves ~370 MB.
  • Phase 3c (@cfg gating). Tracked separately. Worth doing if Phase 3b.next ships.

Pointers

  • self-hosted/quartz.qz — main pipeline. Line 682 (for i in 0..func_count) is the per-function tc_function loop; mem_report("typecheck") at line 759.
  • self-hosted/middle/typecheck_walk.qz — tc_function (line 3198), tc_stmt (line 1363), tc_expr (search file).
  • self-hosted/middle/typecheck.qz — tc_parse_type (line 929).
  • self-hosted/middle/typecheck_util.qz — tc_tv_reset, tc_tv_fresh_with_origin, tc_free.
  • self-hosted/middle/liveness.qz — g_func_handles / g_func_infos.
  • self-hosted/backend/codegen_runtime.qz — __qz_mem_release at the appropriate spot.

Prime directives this session

  • D1 (highest impact): the original Phase 3b plan would not have moved the needle — the data is not in tc_free’s covered Vecs. Discovering this is load-bearing; pursuing the wrong fix would have wasted the next session too. The bisection cost was justified.
  • D2 (research first): validated the libc pool / madvise / mimalloc / mmap behavior empirically before committing to any design.
  • D5 (report reality): shipping the investigation honestly without a fix, rather than shipping a “looks like it does something” mmap helper that net-regressed self-compile peak by 500 MB.
  • D6 (holes get filled or filed): the leak is filed here with bisection data and a continuation plan.

Phase 3b.next Session — CLOSED (Apr 15, 2026)

Status: FIX SHIPPED. Root cause found via multi-level bisection and fixed. Typecheck phase 5401 MB → 737 MB (-86%). Peak RSS 12.47 GB → 7.81 GB (-37%). Wall time 21.0s → 18.3s (-13%). Session target of 8 GB beaten.

Root cause (for the next handoff reader): 9 sites in typecheck_registry.qz (tc_lookup_function_return, ..._return_annotation, ..._params, ..._param_fn_ann, tc_find_matching_overload, tc_lookup_struct, tc_lookup_enum, tc_lookup_trait, tc_lookup_function_index) implemented the suffix-fallback scan using str_byte_slice(name, len-suf, len).eq(suffix). That pattern allocates a fresh substring for every candidate in the scan — roughly 2100 substrings (one per registered function) per fallback call.

The fallback fires whenever a function is referenced without its module prefix, which is the normal case for module-internal calls. In the self-hosted compiler, about 10% of NODE_CALL typechecks hit the fallback (the other 90% resolve through the primary intern-id match). That’s 67k calls × 10% × 2100 substring allocs = 14 million allocations × ~30 bytes each = ~420 MB per typecheck run, amplified by libc’s inability to return pages across repeated calls.

Fix: replace all 9 sites with name.ends_with(suffix). The runtime’s qz_str_ends_with uses memcmp — zero allocations. The name_len >= suffix_len guard is left in place (harmless; removing it for cleanliness is a separate style pass). Confirmed by measurement: typecheck delta dropped from +5401 MB to +737 MB in a single build-measure cycle.
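The shape of the fix in C (function name illustrative; the real runtime primitive is qz_str_ends_with): compare the tail bytes in place with memcmp instead of materializing a substring per scanned candidate, as the old str_byte_slice(...).eq(suffix) pattern did.

```c
#include <stdbool.h>
#include <string.h>

// Zero-allocation suffix check: no substring is ever materialized.
static bool ends_with(const char *name, const char *suffix) {
    size_t nl = strlen(name), sl = strlen(suffix);
    return nl >= sl && memcmp(name + nl - sl, suffix, sl) == 0;
}
```

At ~2100 registered functions per fallback scan, this turns ~2100 substring allocations per call into 2100 bounded memcmp calls.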


Bisection trail (for future similar problems)

This took 4 levels of bisection to find. Worth documenting because the techniques generalize.

Level 1 — per-function RSS delta. Instrumented the tc_function loop in quartz.qz to print RSS delta per function > 8 MB. Found 153 functions leaking, distributed roughly proportional to body size. Ruled out “one hot function is leaking” — the leak was diffuse.

Level 2 — per-node-kind delta inside tc_stmt and tc_expr. Wrapped tc_stmt and tc_expr with entry/exit RSS snapshots, bucketing totals by NODE_X kind. Found NODE_CALL dominant at 5.76 GB / 69k calls (~87 KB/call). NODE_IDENT was 15 bytes/call — ruling out borrow/scope/move checks as the culprit. Localized the leak to tc_expr_call.

Level 3 — internal checkpoints in tc_expr_call. Added 7 RSS checkpoints through tc_expr_call’s ~1800-line body. Found:

  • UFCS rewrite block: 3 KB/call (2.9%) — small
  • Function lookup / open UFCS / arity resolution: 26 KB/call (32%) — big
  • Main arg walk loop: 11 KB/call (13%)
  • Borrow/lambda/trait/return/container/linear/move section: 42 KB/call (51%) — big

Gotcha: checkpoints declared with var X = 0 inside conditional blocks get reassigned with X = mem_current_rss() only if the branch runs. For branches that don’t run, the uninitialized delta yields astronomic bogus numbers. Declare all checkpoint vars at function top, reassign (not re-declare) inside branches. Takes one false-positive cycle to learn.
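The gotcha in miniature, as a hedged C sketch (rss_stub stands in for mem_current_rss; values are arbitrary):

```c
// Stand-in for mem_current_rss(); the exact value is irrelevant to the bug.
static long rss_stub(void) { return 1000; }

// WRONG: the checkpoint is only snapshotted when the branch runs, so the
// delta is computed against 0 (an "astronomic" bogus number) otherwise.
long delta_buggy(int branch_ran) {
    long checkpoint = 0;
    if (branch_ran) checkpoint = rss_stub();
    return rss_stub() - checkpoint;
}

// RIGHT: snapshot at function top; branches only reassign.
long delta_fixed(int branch_ran) {
    long checkpoint = rss_stub();
    if (branch_ran) checkpoint = rss_stub();
    return rss_stub() - checkpoint;
}
```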

Level 4 — drill into the arity-check section. Added two more checkpoints splitting Open UFCS (~1 MB, negligible) vs arity resolution. Then disabled tc_function_param_count and tc_function_required_count entirely — typecheck dropped from 5334 MB to 4883 MB (-451 MB). That gave a concrete target: the arity lookups were leaking 6.6 KB/call. Examined their source → tc_lookup_function_params → suffix fallback → str_byte_slice(...).eq(suffix) — found the allocation.

What shipped this session

  1. 9-site ends_with fix in typecheck_registry.qz (the main fix, -4.6 GB).
  2. tc.parse_type_cache field — negligible immediate win (~12 MB) but the cache is in place for when tc_parse_type becomes the next hot path.
  3. First-arg cache in tc_expr_call — on-demand _fa_cache_type / _fa_cache_computed locals that thread the first-arg type through the UFCS resolution cascade. Saves ~90 MB by eliminating redundant tc_expr walks of args[0]. Invalidated after tc_try_resolve_map_key_value_types which mutates the binding type and thus the first-arg type.
  4. Updated ROADMAP #19 with the new numbers and root-cause cite.
  5. This handoff closing out the investigation.

What’s left on the table (for a future session if needed to hit 6 GB)

From the Level 3 bisection data:

  • tc_expr_call post-arg-walk section: still leaks 42 KB/call (~2.8 GB total of the remaining leak). Candidates: per-call tc_lookup_borrow_summary, lambda arity validation loops, trait bound validation, return-type lookup, container type propagation. Each may have smaller but cumulative allocations.
  • MIR phase still leaks 6.45 GB. Not touched this session — Phase 3b.next was scoped to typecheck. The MIR lowering pass has its own allocation patterns; a similar bisection there would be the next lever.
  • Linux path not re-measured. This session was macOS-only. Linux is presumably also much better given the same code path.

Stretch: drilling tc_expr_call further could push peak RSS to ~6 GB. MIR phase optimization is the next big lever after that.


Phase 3b.next Session — PRIOR findings (Apr 15, 2026)

Status: Primary fix target missed. tc_parse_type cache landed but only saved ~12 MB (0.2%) — the tc_parse_type hypothesis was wrong. Deeper bisection narrowed the leak to NODE_CALL (5.7 GB of the 5.4 GB net typecheck delta, ~87 KB average allocation per NODE_CALL through tc_expr_call). Next session should drill into tc_expr_call’s sub-steps rather than guess from the outside.

What this session measured — trustworthy data

1. Per-function distribution (size-proportional)

Instrumented the per-function loop in quartz.qz:685 to print RSS delta for every tc_function call exceeding 8 MB. Top leakers on a full self-compile:

[mem-fn] +136 MB: mir_lower$mir_lower_expr
[mem-fn] +134 MB: lexer$lexer_tokenize
[mem-fn] +114 MB: mir_lower_expr_handlers$mir_lower_call
[mem-fn] +93  MB: typecheck_expr_handlers$tc_expr_call
[mem-fn] +87  MB: mir_lower$mir_lower_actor_poll
[mem-fn] +73  MB: typecheck_walk$tc_stmt
[mem-fn] +68  MB: compile
[mem-fn] +67  MB: mir_lower$mir_lower_all
[mem-fn] +63  MB: typecheck$tc_parse_type
[mem-fn] +59  MB: typecheck_util$tc_free
[mem-fn] +56  MB: mir_lower_expr_handlers$mir_lower_concurrency
...153 total functions leaked >8 MB each...

The leak is size-proportional to the function body being checked. Big functions (1000+ LOC) leak 100+ MB each during their tc_function call. Small functions leak <8 MB. The cumulative distribution matches the 5.4 GB net typecheck delta.

2. Per-node-kind distribution inside tc_stmt

Instrumented tc_stmt with a per-kind RSS delta bucket:

Node kind            Count    RSS delta    Per-stmt avg
NODE_BLOCK (32)      13,132   +14,041 MB   ~1 MB/stmt (composite — includes descent)
NODE_IF (23)          8,986    +9,749 MB   ~1 MB/stmt (composite)
NODE_EXPR_STMT (29)  50,091    +2,372 MB   ~47 KB/stmt (leaf)
NODE_LET (25)        15,117    +1,411 MB   ~93 KB/stmt (leaf)
NODE_FOR (44)         1,008    +1,457 MB   ~1.4 MB/stmt
NODE_WHILE (24)         418      +795 MB   ~2 MB/stmt
NODE_RETURN (22)      5,065      +211 MB   ~40 KB/stmt

The composite statements’ delta includes their recursive descent cost. The true leaf contributors are NODE_EXPR_STMT, NODE_LET, and NODE_RETURN, each leaking ~40–95 KB per statement.

3. Per-node-kind distribution inside tc_expr

Instrumented tc_expr the same way. The signal is overwhelming:

Expr kind                Count     RSS delta   Per-call avg
NODE_CALL (9)            69,186    +5,764 MB   ~87 KB/call
NODE_INTERP_STRING (43)   1,088      +450 MB   ~413 KB/call (descent cost)
NODE_BINARY (8)          14,117      +731 MB   ~52 KB/call
NODE_IDENT (6)          209,691        +3 MB   15 bytes/call (trivial!)
NODE_UNARY (7)              349       +32 MB   ~93 KB/call
NODE_ARRAY (10)             184       +11 MB   ~60 KB/call
NODE_INDEX (13)           4,369        +2 MB   ~480 bytes/call
NODE_STRUCT_INIT (14)        67       +10 MB   ~150 KB/call
NODE_FIELD_ACCESS (15)    9,920        +9 MB   ~900 bytes/call

NODE_CALL is the dominant allocator at 5.76 GB (more than the 5.4 GB net delta — the over-count is libc oscillation: RSS grows and shrinks across the tc_expr calls, but the NET held memory is 5.4 GB). NODE_IDENT is surprisingly cheap at 15 bytes per call, which rules out the borrow-check and scope-lookup hot path. NODE_FIELD_ACCESS is also cheap.

Conclusion: the leak is concentrated in tc_expr_call (typecheck_expr_handlers.qz:1319) — NOT distributed across tc_expr’s dispatcher branches. Next session’s bisection should start there.

4. Hypotheses that were tested and failed

  • tc_parse_type cache (wrapper around tc_parse_type_impl keyed on annotation string, skipping forall/impl). Saved ~12 MB of 5,401 MB (0.2%). Confirms tc_parse_type is not called often enough to matter. The cache is KEPT in the commit as a small but measurable win and future groundwork — when tc_parse_type does become a hot path (e.g. after tc_expr_call is fixed and its mass gets out of the way), the cache will already be there.
  • Replacing each(stmts, stmt -> tc_stmt(...)) with plain for loops in NODE_BLOCK and NODE_INTERP_STRING. Zero change in typecheck delta. Closure allocation is not a contributor at this scale.
  • NODE_IDENT safety-check disable (would have tested per-ident borrow/ linear/move/partial-move checks). NOT tested this session because the 15 bytes/call signal already ruled out NODE_IDENT before I got to it.

What SHIPPED this session

  1. tc_parse_type cache — tc.parse_type_cache: Map<String, Int>, populated for all scope-independent annotations (skipping forall and impl). ~12 MB savings measured. Kept because it’s a cheap, correct, reusable improvement that will pay off more once the NODE_CALL leak is fixed and tc_parse_type becomes relatively more important.
  2. This updated handoff. Full bisection data for next session.

Not shipped: a fix for the actual leak. Per the handoff’s failure-mode guidance (>6 bisection cycles without conclusive narrowing), continuing to guess at the fix this session would have been cowardice masquerading as pragmatism.

Next session — Phase 3b.next.2 — drill into tc_expr_call

Primary target

typecheck_expr_handlers.qz:1319 — the tc_expr_call function. ~600 lines, handles all function call typing including UFCS rewrite, trait method resolution, arity validation, argument type checking, named args, default values, type parameter inference.

Suggested bisection order

  1. First pass — disable whole sub-blocks and measure. Add return type_constants::TYPE_INT at strategic points and measure typecheck delta. In order:

    • (a) Return TYPE_INT right after var func_name = ast::ast_get_str1(...). Skips everything. This establishes the baseline “how much do the children of a NODE_CALL contribute vs. tc_expr_call itself” — i.e. separates descent cost from tc_expr_call’s OWN cost.
    • (b) Keep arg walking but disable UFCS rewrite (lines 1715–1960+). If delta drops significantly, UFCS rewrite is the leak. Strong candidate: UFCS calls tc_try_resolve_vec_element_type and tc_try_resolve_map_key_value_types which do recursive re-walks of the first arg.
    • (c) Keep UFCS rewrite but disable tc_try_resolve_*. These re-walk args redundantly.
    • (d) Disable the arity-check loop and the type parameter inference blocks.
  2. Second pass — within the leaking sub-block, look for:

    • Allocations proportional to argument count (most function calls have 1–5 args, so a per-arg allocation × 69k calls × 3 args avg = 207k allocations).
    • str_byte_slice calls that materialize substrings.
    • vec_new() calls that don’t have paired vec_free.
    • Interpolated error strings "#{...}" in the hot path (unlikely — errors are rare).
    • ast_set_str1/str2 mutations that intern new strings.
  3. Third pass — fix in place. Likely shapes:

    • Deduplicate arg walks. tc_try_resolve_vec_element_type etc. walk first_arg which was ALREADY walked at line 1721. Caching the per-node type via a Map<AstNodeId, Int> or annotating the AST with the resolved type would eliminate the second walk.
    • Build the mangled name lazily: tc_mangle always allocates "#{struct_name}$#{func_name}" even when the lookup will fail. Cache with a Map<(String, String), String> or intern it.
    • Stop re-checking arg types across the UFCS rewrite. The rewrite changes func_name but the arg types don’t change.
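The "deduplicate arg walks" shape can be sketched like this. All names are hypothetical; a flat array indexed by dense node ids stands in for the suggested Map<AstNodeId, Int>, and a counting stub stands in for the expensive tc_expr re-walk.

```c
#include <stdlib.h>

#define TYPE_UNRESOLVED (-1)

// Stand-in for the expensive tc_expr walk; counts how often it runs.
static int walk_count = 0;
static int expensive_walk(int node) { walk_count++; return node * 2; }

static int *type_memo;

// Size the memo to the node count and mark everything unresolved.
void memo_init(int node_count) {
    type_memo = malloc(sizeof(int) * (size_t)node_count);
    for (int i = 0; i < node_count; i++) type_memo[i] = TYPE_UNRESOLVED;
}

// Memoized walk: the UFCS helpers' re-walk of args[0] hits the cache
// instead of re-running the full expression walk.
int tc_expr_memo(int node) {
    if (type_memo[node] != TYPE_UNRESOLVED)
        return type_memo[node];
    return type_memo[node] = expensive_walk(node);
}
```

One caveat from the shipped first-arg cache: any mutation that changes a node's type (e.g. the map key/value resolution mentioned above) must invalidate the memo entry.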

Pre-flight checklist

cd /Users/mathisto/projects/quartz
git log --oneline -6  # verify Phase 3b.next progress commit is on top
./self-hosted/bin/quake guard:check
./self-hosted/bin/quake smoke 2>&1 | tail -6

# Capture baseline (with tc_parse_type cache):
./self-hosted/bin/quartz --no-cache --memory-stats \
  -I self-hosted/frontend -I self-hosted/middle -I self-hosted/backend \
  -I self-hosted/shared -I std -I tools \
  self-hosted/quartz.qz > /dev/null 2>/tmp/mem_baseline.txt
grep '\[mem\]' /tmp/mem_baseline.txt
# Expected (post-cache): typecheck ~5389 MB, peak ~12438 MB

# Fix-specific backup:
cp self-hosted/bin/quartz self-hosted/bin/backups/quartz-pre-mem3b-next2-golden

Instrumentation you can reuse

These are deleted as part of the Phase 3b.next commit, but here is the shape for quick re-addition:

# Add near top of typecheck_walk.qz:
var _mem_bisect_totals = vec_new<Int>()
var _mem_bisect_counts = vec_new<Int>()

def _mem_bisect_record(kind: Int, delta: Int): Void
  while _mem_bisect_totals.size <= kind
    _mem_bisect_totals.push(0)
    _mem_bisect_counts.push(0)
  end
  _mem_bisect_totals[kind] = _mem_bisect_totals[kind] + delta
  _mem_bisect_counts[kind] = _mem_bisect_counts[kind] + 1
end

def _mem_bisect_report(): Void
  var i = 0
  while i < _mem_bisect_totals.size
    var total = _mem_bisect_totals[i]
    if total > 1048576
      eputs("[kind #{i}] +#{total / 1048576} MB / #{_mem_bisect_counts[i]} stmts")
    end
    i += 1
  end
end

# Wrap tc_expr via an _tc_expr_dispatch helper:
def tc_expr(tc: TypecheckState, ast_storage: ast::AstStorage, node: AstNodeId): Int
  if node < 0
    return type_constants::TYPE_ERROR
  end
  var _k = ast::ast_get_kind(ast_storage, node)
  var _r = mem_current_rss()
  var _result = _tc_expr_dispatch(tc, ast_storage, node)
  _mem_bisect_record(_k + 1000, mem_current_rss() - _r)
  return _result
end
# ...and call _mem_bisect_report() right before mem_report("typecheck").

Context for why tc_expr_call is likely the leak

tc_expr_call uniquely among tc_expr handlers:

  • Does SEVERAL recursive tc_expr calls on args (primary + UFCS rewrite’s first-arg re-walk + tc_try_resolve_*'s walks).
  • Mutates the AST (ast_set_str1, ast_set_str2, ast_set_kind) — these intern into the global string interner.
  • Calls tc_parse_type on type_arg (cached now, but used to be hot).
  • Does fuzzy name lookups into the function registry (string compares).
  • Does UFCS container-type inference via tc_try_resolve_vec_element_type / tc_try_resolve_map_key_value_types which are non-trivial helpers.

All of these compound. A single NODE_CALL can trigger 5+ tc_expr calls (one per arg, one receiver re-walk, plus resolve helpers) and multiple string allocations.

Don’t drift into

  • Scheduler park/wake refactor
  • Async Mutex/RwLock
  • Any unrelated roadmap items

Stay focused on tc_expr_call. One session. Fix it or document progress.