Session 4 Handoff — Iomap Fix + Lessons Learned
Date: Apr 11, 2026 (overnight session)
Starting state: trunk at 96ad7a95 (session 3 handoff)
Ending state: trunk at 38c88277 — clean, 2 commits ahead
What shipped
Two commits on trunk, neither pushed:
38c88277 Regression tests for iomap fixes + refresh Linux golden binary
b86bde04 Guard iomap access in channel codegen: sync-context crash + sched_shutdown UAF
b86bde04 — iomap null/dangling guard
Three related bugs in channel codegen that all involved @__qz_sched[12]
(the iomap base pointer) being either zero or dangling:
- try_send in sync code crashed. select { send(ch, 42) => 0 end } from a pure-synchronous program segfaulted on the first successful send: the inlined try_send intrinsic unconditionally read the iomap base and indexed it to wake any io-suspended peer. In sync code the scheduler never initializes, so slot 12 is zero and the xchg deref’d null+fd*8.
- channel_close had the same pattern (different label prefix) and crashed the same way from sync code.
- sched_shutdown freed the iomap via call void @free(...) but never zeroed slot 12. Subsequent channel_close chased the freed pointer.
Fixes:
- cg_intrinsic_conc_channel_try.qz: insert an .iomap.has null check before the xchg in try_send’s io_wake block. Branches to ts_io_skip_ if iomap is null.
- cg_intrinsic_conc_channel.qz: same guard for channel_close (cc_ prefix).
- codegen_runtime.qz: zero slot 12 after free(%io.map.raw) in __qz_sched_shutdown.
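The two halves of the fix (null guard at the use site, zeroing at the free site) can be illustrated with a toy model. This is a Python sketch of the pattern only, not the emitted LLVM IR; ToyHeap, the 64-word allocation, and the function signatures are invented for illustration, and only the slot index 12 comes from the notes above.

```python
IOMAP_SLOT = 12  # slot index taken from the notes; everything else is a toy


class ToyHeap:
    """Tiny allocator model: any access outside a live block raises."""

    def __init__(self):
        self._next_addr = 0x1000
        self._live = {}

    def malloc(self, n_words):
        addr = self._next_addr
        self._next_addr += 0x1000
        self._live[addr] = [0] * n_words
        return addr

    def free(self, addr):
        del self._live[addr]

    def store(self, addr, index, value):
        if addr not in self._live:
            raise RuntimeError(f"null or dangling access at {addr:#x}")
        self._live[addr][index] = value


def sched_init(heap, sched):
    sched[IOMAP_SLOT] = heap.malloc(64)


def sched_shutdown(heap, sched, zero_slot):
    heap.free(sched[IOMAP_SLOT])
    if zero_slot:
        sched[IOMAP_SLOT] = 0  # the half of the fix that lives in shutdown


def channel_close(heap, sched, fd, guarded):
    base = sched[IOMAP_SLOT]
    if guarded and base == 0:
        return "skipped"  # the null guard: no iomap, nothing to wake
    heap.store(base, fd, 0)  # stands in for the xchg on base + fd*8
    return "woke"


def crashes(fn, *args):
    """Return True if the call hits the toy heap's null/dangling check."""
    try:
        fn(*args)
        return False
    except RuntimeError:
        return True
```

In this model the guard alone handles the sync-code case (slot is zero), but a guarded channel_close after an unzeroed shutdown still faults, which is why both changes were needed.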
Impact: unlocked concurrency_spec.qz (57 tests) and
channel_result_spec.qz (6 tests), plus async_io_spec.qz (6 tests) and
scheduler_spec.qz (3/4 — test 4 is documented flaky pre-existing). 72
tests newly passing with zero regressions in existing async/closure/
concurrency specs.
38c88277 — regression tests + golden refresh
Adds three new tests to async_spill_regression_spec.qz:
- select-send on empty buffered channel from sync code
- select-send with default arm picks send when ready
- channel_close after sched_shutdown
And refreshes the Linux golden binary. Total 11 regression tests, all green.
ROADMAP updated to reflect the two new “Known Bugs → Fixed” entries and move
concurrency_spec / channel_result_spec / async_io_spec / scheduler_spec
off the “Pre-existing failures still open” list (leaving only
backpressure_spec 1/7 and tls_async_spec 0/6, both confirmed pre-existing
on the pre-fix golden).
What was attempted and reverted
Several commits (c43a5416, b6c929ad, 20cf7fb2, 54a771ee) were made and
then reset via git reset --hard 38c88277. The pattern:
- c43a5416: Call/vec/tuple spill. Attempted to extend session 3’s binary-op spill/reload fix (3440903f) to also cover:
  - NODE_CALL argument lists (f("pre", await h, "post"))
  - NODE_ARRAY literals ([1, await h, 3])
  - NODE_TUPLE_LITERAL ((7, await h, 99))

  All three crash llc with “Instruction does not dominate all uses” pre-fix, so the compiler-emission bug is real and waiting to be fixed.

  Why it was reverted: the fix’s binary broke actor_spec.qz compilation. The compiler emitted truncated IR (8370 lines instead of ~63440), stopping mid-function at define i64 @ with no function name. I root-caused it partway (a default-arg slot in arg_nodes carried a sentinel value that wasn’t a valid AST node ID; my new mir_any_sibling_contains_await helper walked into ast_get_kind with a garbage index → size-37 OOB) but even after adding a defensive bounds check (node >= ast_node_count(s)), the rebuilt compiler still produced truncated IR for actor_spec specifically.

  The root cause is still open. Actor codegen emits __Future_<pointer>$new and __Future_<pointer>$poll functions whose name contains a raw pointer value; that’s weird enough to warrant its own investigation.
- b6c929ad: Call/vec/tuple regression tests. Depended on c43a5416.
- 20cf7fb2: Zero completion_map + task_locals slots in sched_shutdown. Same bug pattern as the iomap fix in b86bde04. After sched_shutdown, spawn+await hung forever because @__qz_completion_map[3] was freed but not nulled, and the await logic uses that slot as a “scheduler active” proxy.

  I verified the fix worked in isolation (a minimal sched_init + go + await + sched_shutdown + spawn + await hung pre-fix, ran clean post-fix). But the binary that tested it was built atop the broken c43a5416 binary, so the rebuild cycle tangled.

  The completion_map source fix is worth re-applying in a fresh session. It’s a ~20-line diff to codegen_runtime.qz that adds store i64 0, ptr ... after each free(...) for @__qz_completion_map[0/1/2/3] and @__qz_task_locals[0/1/2/3]. Patch reproduced under suggested target A below.
- 54a771ee: ROADMAP. Depended on 20cf7fb2.
Why the rebuild got tangled
The Linux binary at self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden
is the ONLY working Linux compiler — macOS binaries don’t help and there’s no
alternative build path. Every compiler source edit requires rebuilding via that
golden, which then OVERWRITES the golden with the rebuilt binary (the whole
point is to verify fixpoint).
If the rebuilt binary has a subtle bug that only surfaces when compiling
certain programs (like actor_spec), the bug is hidden by the fixpoint check
(which only compares self-compilation output, not arbitrary-program output).
And once you’re on the bad binary, rebuilding from source doesn’t fix you
unless you have a known-good older golden in hand — which, because I kept
overwriting it, I ultimately had to recover from git history
(git show 38c88277:self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden).
Lesson for next session:
- Before any compiler change, save a copy of the current golden under a fix-specific name (quartz-pre-<fix-name>-golden) and do NOT overwrite that copy under any circumstances.
- After the rebuild, run quake fixpoint AND compile a few non-self programs (actor_spec.qz, concurrency_spec.qz, stream_skip_while-style generator tests) to catch program-specific codegen regressions that fixpoint misses.
- If the rebuild fails those smoke tests, restore the saved pre-fix golden IMMEDIATELY; do not iterate on fixes from a suspect binary.
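One cheap smoke check for the non-self programs is to scan the emitted IR for the truncation signature seen this session (a define line ending at @ with no name, and far fewer lines than expected). The following Python sketch is a heuristic under stated assumptions: the function name, the min_lines threshold, and the brace-counting rule are mine, and how the IR text is obtained from the compiler is left to the caller.

```python
import re

# A define line whose '@' is not followed by a name, e.g. "define i64 @".
_NAMELESS_DEFINE = re.compile(r"^define\b.*@\s*$")


def ir_truncation_problems(ir_text, min_lines=0):
    """Return a list of human-readable problems; an empty list means no red flags.

    Heuristic only: it assumes each function body opens on a line starting
    with 'define' and ending with '{', and closes on a line that is just '}'.
    """
    problems = []
    lines = ir_text.splitlines()
    if len(lines) < min_lines:
        problems.append(f"only {len(lines)} lines, expected at least {min_lines}")

    opened = closed = 0
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip()
        if _NAMELESS_DEFINE.match(stripped):
            problems.append(f"line {number}: define with no function name")
        if stripped.startswith("define") and stripped.endswith("{"):
            opened += 1
        elif stripped == "}":
            closed += 1
    if opened != closed:
        problems.append(f"{opened} function bodies opened, {closed} closed")
    return problems
```

For actor_spec.qz a min_lines floor somewhere well below the ~63440 lines of a healthy build would have flagged the 8370-line output immediately; the exact floor is a judgment call.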
What I learned (deep insights worth carrying forward)
- “Free without zero” is a recurring compiler pattern. Every global slot in @__qz_sched/@__qz_completion_map/@__qz_task_locals that holds a heap pointer has the same latent UAF: freeing without zeroing leaves the slot looking “active” to the ptr != 0 proxy that various codegen sites use. The iomap fix only closed one hole; completion_map and task_locals have the same shape (still unfixed).
- The “active” proxy check itself is fragile. Using @__qz_sched[12] != 0 as “is the scheduler running?” is cute but brittle: it relies on teardown code being disciplined about zeroing. A dedicated @__qz_sched_active: i64 that sched_init/sched_shutdown toggle explicitly would be more robust but touches every check site. Worth doing eventually.
- Default-arg placeholders in arg_nodes look like AST node IDs but aren’t. When na_default_mask != 0, the compiler leaves sentinel values in the children vec for skipped slots. Any traversal that walks children without consulting the mask will feed garbage to ast_get_kind → size-N array OOB. My mir_expr_contains_await walker needed a bounds check (node >= ast_node_count(s)) as defense in depth; that’s the right fix regardless of the default-mask issue.
- __Future_<raw_pointer>$new/$poll names are a codegen fossil. actor_spec emits functions named with literal pointer values (e.g. @__Future_1082221776$new). Every build produces different pointer values, making the IR non-deterministic. This smells like actor/future construction using a pointer-as-identity hash instead of a stable identifier. The fact that my compiler broke on THIS specific file suggests the pointer-name path interacts badly with something in my attempted fix, and also that the whole pointer-naming approach is technical debt worth cleaning up.
- scheduler_spec test 4 is flaky, failing about 60% of the time. I ran it 10+ times and it passes ~4/10. It’s the “pipe-based async task wakes via I/O poller” test that earlier handoffs called “1/4 flaky - OOM-killed exit 137”. Environment-dependent timing.
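The default-arg lesson above can be sketched as a mask-aware traversal with a bounds check as the second line of defense. This is an illustrative Python model, not the Quartz source: the Node type, the "await" kind string, and the assumption that bit i of the mask marks slot i as a placeholder are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    children: list = field(default_factory=list)


def expr_contains_await(nodes, node_id):
    """Recursively look for an await under node_id."""
    # Defense in depth: anything that is not a valid index is treated as
    # "no await here" instead of being used to index the node table.
    if node_id < 0 or node_id >= len(nodes):
        return False
    node = nodes[node_id]
    if node.kind == "await":
        return True
    return any(expr_contains_await(nodes, child) for child in node.children)


def any_arg_contains_await(nodes, arg_nodes, default_mask):
    """Check call arguments, skipping slots the mask marks as placeholders."""
    for index, arg in enumerate(arg_nodes):
        if default_mask & (1 << index):
            continue  # placeholder slot: its value is a sentinel, not a node ID
        if expr_contains_await(nodes, arg):
            return True
    return False
```

Consulting the mask is the primary fix; the bounds check only keeps a missed case from turning into an out-of-bounds read.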
Suggested next targets
A. Re-apply the completion_map fix cleanly (high ROI)
~20-line addition to codegen_runtime.qz. Purely additive: it only adds
store i64 0, ptr %slot lines after existing free(...) calls. Fixes the
post-shutdown spawn+await hang documented above. The regression test is
straightforward:
def rs_spawn_after_shutdown(): Int
  sched_init(0)
  var h1 = go rs_double(5)
  var r1 = await h1
  sched_shutdown()
  var h2 = spawn rs_double(9)
  var r2 = await h2
  return r1 + r2
end
Patch:
# After each free(...) call in __qz_sched_shutdown, add:
codegen_util::cg_emit_line(out, " store i64 0, ptr %cm.arr.p2")
codegen_util::cg_emit_line(out, " %cm.cnt.p2 = getelementptr [4 x i64], [4 x i64]* @__qz_completion_map, i64 0, i64 1")
codegen_util::cg_emit_line(out, " store i64 0, ptr %cm.cnt.p2")
codegen_util::cg_emit_line(out, " %cm.cap.p2 = getelementptr [4 x i64], [4 x i64]* @__qz_completion_map, i64 0, i64 2")
codegen_util::cg_emit_line(out, " store i64 0, ptr %cm.cap.p2")
# ... after the mutex free:
codegen_util::cg_emit_line(out, " store i64 0, ptr %cm.mtx.p2")
# Same pattern for @__qz_task_locals slots 0/1/2/3
Full diff in stash (see git stash show -p and grep for completion_map).
B. Call/vec/tuple spill fix — retry, with smoke tests
The pattern is right; my implementation had subtle issues. Next attempt:
- Start with the vec/tuple portions only. Call-args has extra complications (default masks, sentinel values).
- Skip the default-mask case entirely: if na_default_mask != 0, no spill.
- Keep the defensive bounds check I added to mir_expr_contains_await. That’s a good invariant regardless.
- Smoke-test aggressively before committing. After the rebuild, compile:
  - actor_spec.qz (full IR lines, not truncated)
  - concurrency_spec.qz
  - A file with generators + lambda args
  - The whole regression spec suite
C. Investigate the __Future_<pointer>$<name> naming
Search for where the __Future_ prefix is emitted with a raw pointer value
in the suffix. Probably in actor/future lowering. Replace with a stable
counter or content hash. Deterministic names unlock IR snapshot testing for
actor code.
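The two replacement schemes named above (stable counter, content hash) might look like the following. This is a Python sketch of the naming idea only; what the compiler would actually use as the key (here a made-up string such as "Counter.increment") and the 12-hex-digit truncation are assumptions, not facts about the actor lowering code.

```python
import hashlib


class FutureNamer:
    """Counter scheme: IDs follow first-seen order, so names are stable
    as long as the compiler visits futures in a deterministic order."""

    def __init__(self):
        self._ids = {}

    def name_for(self, key, suffix):
        if key not in self._ids:
            self._ids[key] = len(self._ids)
        return f"__Future_{self._ids[key]}${suffix}"


def content_hash_name(signature, suffix):
    """Hash scheme: the name depends only on the signature text, so it is
    stable even if visit order changes between builds."""
    digest = hashlib.sha256(signature.encode("utf-8")).hexdigest()[:12]
    return f"__Future_{digest}${suffix}"
```

The counter gives short, readable names but shifts every later ID when a future is added earlier in the file; the hash avoids that at the cost of opaque names, which matters if IR snapshots are diffed by hand.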
D. Task-local storage spec (tls_async_spec) 0/6
0/6 is a wall, not a flake. Worth a real investigation — could be another free-without-zero or a genuine missing feature. Start by reading the spec and one of its failing tests to understand what’s expected.
Known flaky or pre-existing (NOT tonight’s regression)
| Spec | State | Note |
|---|---|---|
| scheduler_spec.qz test 4 | ~40% pass | “pipe-based async task wakes via I/O poller”; environment timing |
| backpressure_spec.qz | 1/7 | Compiles now (improvement from pre-fix golden), but policy tests fail |
| tls_async_spec.qz | 0/6 | Task-local storage; pre-existing, reproduces on old golden |
| actor_spec.qz | 21/21 | Works on 38c88277 golden; any compiler edit must preserve this |
| async_spill_regression_spec.qz | 11/11 | The session 3 + session 4 regression file |
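To put a firmer number on a flaky row such as scheduler_spec test 4, the repeated runs can be scripted and the exit codes tallied. This is a generic Python sketch: the command used to run a single spec is not given in these notes, so cmd is a placeholder the caller supplies, and the run count and timeout are arbitrary defaults.

```python
import subprocess
import sys
from collections import Counter


def exit_code_histogram(cmd, runs=10, timeout_s=60):
    """Run cmd repeatedly and count how often each exit code occurs.

    Note: on POSIX, Python reports a child killed by signal N as return
    code -N (so SIGKILL shows up as -9), whereas a shell reports 128+N (137).
    """
    codes = Counter()
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
            codes[result.returncode] += 1
        except subprocess.TimeoutExpired:
            codes["timeout"] += 1  # hung runs are tallied separately
    return codes


def pass_rate(codes):
    """Fraction of runs that exited 0."""
    total = sum(codes.values())
    return codes[0] / total if total else 0.0
```

Keeping the full histogram rather than a single pass/fail ratio helps separate a timing flake from an OOM kill or a hang, which earlier handoffs appear to have conflated.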
How to resume
cd /home/mathisto/projects/quartz-git
git log --oneline trunk -3 # 38c88277 should be at top
# Recover a working Linux golden from git (in case backups/ is contaminated)
git show 38c88277:self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden \
> /tmp/q_known_good && chmod +x /tmp/q_known_good
/tmp/q_known_good --version # expect: quartz 5.12.21-alpha
# Before starting any compiler edit, save the current golden:
cp self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden \
self-hosted/bin/backups/quartz-pre-<fix-name>-golden
# DO NOT overwrite quartz-pre-<fix-name>-golden until the fix is committed.
The worktree at /home/mathisto/projects/quartz-head/ was heavily tangled
during debugging and contains uncommitted bootstrap stubs plus misc files.
Don’t trust it — reset hard to a clean state before using it:
cd /home/mathisto/projects/quartz-head
git stash drop # drops only the newest stash (stash@{0}); repeat per session 4 stash, after saving the completion_map diff if it is stashed here
git restore --source=HEAD --worktree .
Then re-apply the bootstrap stubs from the main repo as needed (the
ast_func_is_cconv_c references in mir_lower.qz and resolver.qz need to
be if false # bootstrap stub: ... for the Linux golden to compile the
main-repo source).