Quartz v5.25

Overnight Session Summary — Ready for macOS merge

Started: trunk at 96ad7a95 (session 3 handoff)
Ended: trunk at cbc8274a, 7 commits ahead of origin/trunk, unpushed

cbc8274a ROADMAP: record channel_new urem fix + remove backpressure_spec from open list
020e8194 Channel capacity: use exact user-requested cap, not next power of 2
cf0da41d ROADMAP: record completion_map fix + document actor_spec non-determinism
73dd0479 Zero completion_map + task_locals slots in __qz_sched_shutdown
48a94793 Session 4 handoff: iomap fix wins + rebuild-cycle lessons
38c88277 Regression tests for iomap fixes + refresh Linux golden binary
b86bde04 Guard iomap access in channel codegen: sync-context crash + sched_shutdown UAF

Headline numbers

  • 4 real compiler bugs fixed (all in concurrency/channel codegen)
  • ~85 newly-passing tests unlocked in spec suite
  • Trunk is source-clean — no stray diffs, no broken files
  • Regression coverage added: async_spill_regression_spec grows from 8 tests (session 3 end) to 12 tests (session 4)
  • Linux golden at self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden stays at 38c88277-era state — intentionally NOT rebuilt due to a pre-existing Linux non-determinism bug (see §“Known caveat” below). The macOS golden (self-hosted/bin/quartz) should pick up all source fixes on the next quake build you run on macOS.

The 4 bugs fixed

1. b86bde04 — iomap null + dangling guard (try_send / channel_close / sched_shutdown)

Three related bugs in channel codegen, all involving @__qz_sched[12] (the scheduler’s iomap base pointer):

  • try_send in pure-synchronous code crashed on the first successful send because it unconditionally dereferenced @__qz_sched[12] to wake io-suspended peers. In sync code the scheduler never initializes, so slot 12 is zero → null+fd*8 crash.
  • channel_close had the same pattern at a different label prefix.
  • sched_shutdown freed the iomap via free(...) but never zeroed slot 12, so a subsequent channel_close chased the freed pointer.
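
The three failure modes above share one shape, which a toy model can make concrete. The sketch below is a Python model, not the compiler's emitted IR: the Heap class, wake_io_peers, and the two shutdown functions are hypothetical names invented for illustration. Only slot index 12 and the base + fd*8 access pattern come from the notes above.

```python
IOMAP_SLOT = 12  # slot index of the iomap base pointer, per the notes above


class Heap:
    """Toy allocator (hypothetical): tracks which base addresses are live."""

    def __init__(self):
        self.live = {}
        self.next_addr = 0x1000

    def malloc(self, n_words):
        addr = self.next_addr
        self.next_addr += n_words * 8
        self.live[addr] = [0] * n_words
        return addr

    def free(self, addr):
        del self.live[addr]

    def load(self, base, index):
        # Stands in for a load at base + index*8. A null or freed base
        # would be a crash or use-after-free in real code.
        if base not in self.live:
            raise MemoryError(f"bad access at base {base:#x}")
        return self.live[base][index]


def wake_io_peers(heap, sched, fd):
    """Guarded access: skip the wake-up when no iomap exists."""
    base = sched[IOMAP_SLOT]
    if base == 0:  # sync code: the scheduler never initialized
        return None
    return heap.load(base, fd)


def shutdown_buggy(heap, sched):
    heap.free(sched[IOMAP_SLOT])  # slot 12 still holds the freed address


def shutdown_fixed(heap, sched):
    heap.free(sched[IOMAP_SLOT])
    sched[IOMAP_SLOT] = 0  # zero after free so the guard sees "no iomap"
```

In this model the null guard alone covers the sync-code case, but after shutdown_buggy the guard passes and the load hits freed memory. After shutdown_fixed the same call returns None, which is why the guard and the zero-after-free have to land together.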

Unlocked: concurrency_spec (57 tests, was SEGV-at-startup), channel_result_spec (6 tests, was SEGV-at-startup), async_io_spec (6 tests, was silent hang), scheduler_spec (3/4 — test 4 pre-existing flake). 72 tests newly passing.

2. 73dd0479 — completion_map + task_locals zero-after-free in sched_shutdown

Same bug shape as the iomap fix. After sched_shutdown, @__qz_completion_map[3] (mutex pointer used as a “scheduler active” proxy) was freed but not nulled. Subsequent spawn + await took the scheduler-backed completion-pipe path on a dead scheduler and hung forever.

Fix: zero slots 0/1/2/3 of both @__qz_completion_map and @__qz_task_locals after their respective free(...) calls in __qz_sched_shutdown.
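
A stale pointer here produces a hang rather than a crash because the slot is only tested, not dereferenced, when the path is chosen. The toy Python model below sketches that selection. Only the slot indices 0 through 3 and the use of slot 3 as the "scheduler active" proxy come from the notes; the function names, the made-up pointer values, and the "non-scheduler" label for the other path are assumptions.

```python
def choose_await_path(completion_map):
    """Slot 3 (the mutex pointer) doubles as the 'scheduler active' flag."""
    return "scheduler-pipe" if completion_map[3] != 0 else "non-scheduler"


def shutdown(completion_map, task_locals, zero_after_free):
    # The real shutdown calls free(...) on the backing allocations here;
    # this model only tracks what is left in the slots afterwards.
    if zero_after_free:
        for table in (completion_map, task_locals):
            for slot in range(4):  # slots 0/1/2/3
                table[slot] = 0


def path_after_shutdown(zero_after_free):
    completion_map = [0x10, 0x20, 0x30, 0x40]  # made-up live pointers
    task_locals = [0x50, 0x60, 0x70, 0x80]
    shutdown(completion_map, task_locals, zero_after_free)
    return choose_await_path(completion_map)
```

Without the zeroing, path_after_shutdown(False) still selects the scheduler-backed path even though nothing is left to service it; with the zeroing, the proxy reads as inactive.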

Unlocked: post-shutdown spawn + await works cleanly. Regression test rs_spawn_after_shutdown (hangs pre-fix, returns 14 post-fix).

3. 020e8194 — Channel capacity uses exact user-requested cap (not next power of 2)

channel_new(cap) silently rounded capacity up to the next power of 2 so that ring-buffer indexing could use a fast mask, and count, (cap-1). User-visible consequences:

  • channel_new(10) actually held 16 items before blocking (!)
  • channel_pressure(ch) on a half-full channel_new(10) reported 31% instead of 50%
  • channel_remaining(ch) reported 11 instead of 5
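
The numbers in the list above follow directly from the round-up. The short Python check below reproduces them, assuming (this is not stated in the notes) that pressure is an integer percentage of held items over capacity, truncated toward zero. next_pow2 is a hypothetical helper, not the compiler's bit-smearing code.

```python
def next_pow2(n):
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


requested = 10
held = 5  # "half-full" relative to the requested capacity

rounded = next_pow2(requested)
print(rounded)                   # physical capacity after the round-up
print(held * 100 // rounded)     # pressure measured against the rounded cap
print(rounded - held)            # remaining measured against the rounded cap
print(held * 100 // requested)   # pressure measured against the requested cap
print(requested - held)          # remaining measured against the requested cap
```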

Fix: remove the bit-smearing round-up from channel_new, and switch all 8 ring-buffer indexing sites from the mask (and count, (cap-1)) to an unsigned remainder (urem count, cap). LLVM's urem instruction costs more raw cycles than and, but both the send and recv paths already involve pthread_mutex_lock, so the perf impact is in the noise.
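
The two halves of the fix have to ship together, because the mask is only a valid ring index when the capacity is a power of 2. A small Python sketch of the index arithmetic (hypothetical helper names, not the emitted IR) shows what happens if the round-up is removed but the mask is kept:

```python
def slots_with_mask(cap, n):
    """Ring slots visited by counters 0..n-1 using count & (cap - 1)."""
    return [count & (cap - 1) for count in range(n)]


def slots_with_mod(cap, n):
    """Ring slots visited by counters 0..n-1 using count % cap (urem)."""
    return [count % cap for count in range(n)]


# With cap = 10 the mask is 0b1001, so most slots are never reached
# and distinct counters collide on the same slot.
print(sorted(set(slots_with_mask(10, 10))))
# The remainder visits every slot exactly once per lap.
print(slots_with_mod(10, 10))
# For a power-of-2 capacity the two schemes agree.
print(slots_with_mask(16, 40) == slots_with_mod(16, 40))
```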

Unlocked: backpressure_spec (7/7 passing, was 1/7). Also fixes the subtle “channels hold more than their declared capacity” semantic bug that was never noticed until the backpressure tests exposed it.

4. Cumulative: no regressions in existing concurrency/async specs

After all three fixes, 130 tests across 8 key specs are green on a freshly rebuilt Linux compiler:

concurrency_spec: 57/57
channel_result_spec: 6/6
backpressure_spec: 7/7      ← newly unlocked
async_spill_regression_spec: 12/12
async_channel_spec: 6/6
closure_capture_spec: 24/24
async_mutex_spec: 8/8
task_group_spec: 10/10

What’s in the regression spec now

spec/qspec/async_spill_regression_spec.qz protects every fix from session 3 AND session 4:

Fix | Commit | Tests
Binary-op async spill/reload | 3440903f (session 3) | rs_string_concat_across_await, rs_literals_bracket_await, rs_arith_chain, rs_nested_await
MIR_AWAIT double-use UAF | e2f829fd (session 3) | rs_await_in_while_cond, rs_await_reuse
Closure capture walker for async | 73a14d56 (session 3) | rs_lambda_captures_handle, rs_zero_arg_lambda
try_send/channel_close iomap guard | b86bde04 (session 4) | rs_select_send_ready, rs_select_send_multi
iomap dangling after sched_shutdown | b86bde04 (session 4) | rs_close_after_shutdown
completion_map dangling after sched_shutdown | 73dd0479 (session 4) | rs_spawn_after_shutdown

12 tests total, all green in ~6ms. Run this as a smoke test after any channel/async codegen touch.

Known caveat for Linux rebuild (not a blocker for macOS)

While refreshing the Linux golden binary, I discovered a pre-existing non-determinism bug in actor codegen:

  • actor_spec.qz exercises actor classes (actor Counter, etc.).
  • The compiler emits __Future_<name>$new and __Future_<name>$poll functions for each actor, where <name> is derived from a string stored in _async_func_names via as_string(...).
  • On the 38c88277-era Linux golden, the string resolves correctly and you get deterministic names like __Future_Counter$__poll$new.
  • On any freshly rebuilt Linux binary, the same source instead puts a raw pointer value into the name, like __Future_369251536$new. The value differs across runs, and the IR ends up truncated mid-function at define i64 @ (empty name), so llc rejects it.

The bug is environmental (Linux rebuild only) and pre-existing (reproduces without any of my session 4 changes). I did not refresh the Linux golden in this session to avoid regressing actor_spec.qz from 21/21 to 0/21.
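
Both symptoms are visible in the emitted IR text, so a rebuilt binary can be checked before llc ever runs. The Python sketch below is a hypothetical diagnostic, not part of the Quartz tooling. It assumes the IR is available as text and that the symptoms look exactly as described above: a decimal number directly after __Future_, and a define i64 @ with no name after it.

```python
import re

# Symptom 1: a future name built from a decimal number, not an identifier.
POINTER_NAME = re.compile(r"__Future_\d+\$")
# Symptom 2: a definition whose name is missing after the '@'.
EMPTY_DEFINE = re.compile(r"^\s*define\s+i64\s+@\s*(\(|$)")


def scan_ir(ir_text):
    """Return (line number, symptom) pairs found in the IR text."""
    findings = []
    for lineno, line in enumerate(ir_text.splitlines(), start=1):
        if POINTER_NAME.search(line):
            findings.append((lineno, "numeric future name"))
        if EMPTY_DEFINE.search(line):
            findings.append((lineno, "empty function name"))
    return findings


sample = (
    "define i64 @__Future_Counter$new() {\n"
    "  ret i64 0\n"
    "}\n"
    "define i64 @__Future_369251536$new() {\n"
    "  ret i64 0\n"
    "}\n"
    "define i64 @"
)
for lineno, symptom in scan_ir(sample):
    print(lineno, symptom)
```

An empty result from scan_ir on the IR for actor_spec.qz would be one cheap signal that a rebuilt binary is not affected, though it does not replace running the spec itself.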

Why this is fine for you: the canonical binary per CLAUDE.md is the macOS arm64 self-hosted/bin/quartz. When you run quake build on macOS, it’ll compile from the same (now-fixed) source and produce a fresh macOS golden. The Linux non-determinism is a separate bug specific to this machine’s bootstrap chain and can be investigated in its own session.

If you hit it on macOS (you shouldn’t), the rescue is git show 38c88277:self-hosted/bin/backups/quartz-linux-x64-7ba5fa12-golden > /tmp/rescue && chmod +x /tmp/rescue — that’s the known-good binary from before the non-determinism regression.

Other work notes

  • Session 4 handoff doc is at HANDOFF_SESSION_4.md — describes the rebuild-cycle debugging that led to this summary.
  • Session 3 handoff doc is at HANDOFF_CONCURRENCY_FIXES.md.
  • Linux bootstrap handoff is at HANDOFF_LINUX_BOOTSTRAP.md.
  • ROADMAP at docs/Roadmap/ROADMAP.md — Known Bugs → Fixed section now has 5 session-4 entries (the 4 fixes above plus the Apr 11 session 3 entries).
  • Worktree at /home/mathisto/projects/quartz-head/ was used for all Linux builds this session and is in a somewhat tangled state (bootstrap stubs + partial source overrides). Do not trust its state — reset it with git restore --source=HEAD --worktree . if you ever come back to it.

How to proceed on macOS

cd /path/to/your/quartz/checkout   # (macOS working copy)
git fetch                          # pull tonight's 7 commits (they were unpushed at session end, so push them from the Linux machine first)
git log --oneline origin/trunk -10 # confirm you see cbc8274a at top
git merge origin/trunk             # or rebase, whichever your flow uses

# Rebuild macOS golden with the fixes:
./self-hosted/bin/quake build
./self-hosted/bin/quake fixpoint

# Smoke tests — all should be green:
./self-hosted/bin/quartz spec/qspec/async_spill_regression_spec.qz | llc -filetype=obj -o /tmp/asr.o
clang /tmp/asr.o -o /tmp/asr -lm -lpthread
/tmp/asr  # expect: 12 tests, 12 passed

# Key wins to verify:
# 1. select { send(ch, 42) => 0 end } in sync code no longer crashes
# 2. channel_new(10) holds exactly 10 items
# 3. channel_pressure/channel_remaining return correct logical values
# 4. sched_init + go + await + sched_shutdown + spawn + await no longer hangs
# 5. backpressure_spec: 7/7 (was 1/7)

If macOS rebuild succeeds + fixpoint holds + regression spec is 12/12, you’re good to develop on top of these fixes.

What I’d touch next (handoff targets)

If you want to keep pushing in a future session, the highest-ROI items:

  1. Investigate the __Future_<pointer>$<name> Linux non-determinism. The value stored in _async_func_names[i] is a String but resolves to a raw int during interpolation on rebuilt Linux binaries. The root cause is probably in how as_string / existential-type re-tagging interacts with binary layout at certain addresses. Fix would unblock clean Linux rebuilds forever.
  2. tls_async_spec (0/6) is actually TLS networking, not task-local storage — it depends on OpenSSL and real sockets. Not a compiler bug; decide whether it’s in scope for the test suite or should be gated behind a feature flag.
  3. Audit __qz_sched_shutdown for any remaining free-without-zero patterns. The iomap, completion_map, and task_locals slots are now clean, but other slots (worker state, priority queues at slots 20/24/28, the timer_peers) could have the same shape. Same zero-after-free discipline applies.

Good morning. Everything is committed, trunk is clean, and the work is ready to merge back to your macOS main.