Quartz v5.25

Next Session — Tier S Bugs: Actor Task Loss (S1) + Unwrap-In-Loop Miscompile (S2)

Baseline: 7d2548f9 (post Apr 15-16 sprint, trunk)

Scope: Two Tier S bugs — the only items on the roadmap with “silent wrong result” or “25% crash repro” severity. Both are pre-launch blockers that users WILL hit.

Session shape: S2 first (more constrained, better narrowed, multiple viable fix paths). S1 second (deeper I/O poller work, benefits from warm-up on S2).


TL;DR

  1. S2 (B4-UNWRAP-IN-LOOP, 2-4h): step! inside a while loop with reassignment returns constant 0 instead of the Option payload. Explicit match in the same loop works. The bug is specifically in how the $unwrap macro-generated AST interacts with MIR match lowering. Two fix attempts already failed (documented). Three viable fix paths remain. Detailed handoff at docs/handoff/b4-unwrap-in-loop-handoff.md.

  2. S1 (15b: Actor crash+restart task loss, 3-6h): Actor that crashes (panic → longjmp → state-0 restart → re-enqueue) successfully processes subsequent messages (bump, get) but then vanishes from the scheduler. b.stop() hangs forever — all workers idle, I/O poller idle, task in limbo. 25% repro rate. Root-caused to the interaction between stale pipe bytes in channel notify_wfd, EV_ONESHOT auto-removal, and an io_map clear/register race window. The conditional pipe-write fix was tested and didn’t help — the deeper issue is in how the I/O poller handles rapid re-registration after actor crash restart.


Pre-flight (5 min)

cd /Users/mathisto/projects/quartz
git log --oneline -5
# Expected: 7d2548f9 at top
git status                              # clean
./self-hosted/bin/quake guard:check     # "Fixpoint stamp valid"
./self-hosted/bin/quake smoke           # 4/4 + 22/22

# Fix-specific backup
cp self-hosted/bin/quartz self-hosted/bin/backups/quartz-pre-s1-s2-golden

S2: B4-UNWRAP-IN-LOOP — step! miscompile in while loop

The bug (11-line reproducer)

def main(): Int
  var step = Option::Some(10)
  var sum = 0
  var i = 0
  while i < 3
    sum += step!           # ← returns 0 every iteration
    step = Option::Some(i + 100)
    i += 1
  end
  return 0 if sum == 211   # expected: 10+100+101=211
  return 1                 # actual: 0+0+0=0
end

What fires the bug (all three required)

  1. $unwrap macro expansion (postfix ! or explicit $unwrap())
  2. Inside a while loop body
  3. Subject reassigned inside the loop

Remove ANY ONE → works. Explicit match step { Some(x) => x; None => 0; end } in the same loop works perfectly.
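For reference, the intended semantics can be modeled outside Quartz. A minimal Python sketch (the `unwrap` helper is a stand-in for `step!`, not a Quartz API) of what the loop should compute:

```python
# Python model of the semantics the reproducer expects: `step!` must
# re-read the CURRENT value of `step` each iteration, so the sum is
# 10 + 100 + 101 = 211. The miscompile instead yields a constant 0.
from typing import Optional

def unwrap(opt: Optional[int]) -> int:
    # Stand-in for `step!`: panic on None, otherwise yield the payload.
    if opt is None:
        raise RuntimeError("unwrap on None")
    return opt

step: Optional[int] = 10
total = 0
i = 0
while i < 3:
    total += unwrap(step)   # reads the current binding, not a cached one
    step = i + 100          # subject reassigned inside the loop
    i += 1

assert total == 211
```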

What’s been tried and failed

  • Attempt 1: Changed the macro’s payloads = vec_new() to payloads = 0 (matching the parser convention). No effect.
  • Attempt 2: Hoisted match subject into var __subj = expr block. No effect.
  • Both documented in commit 258468f5 and docs/handoff/b4-unwrap-in-loop-handoff.md.

IR symptom

The step! expansion emits %v0 = add i64 0, 0 (constant 0) defined once at function entry. The match body references this constant instead of loading from %step’s alloca. The match subject isn’t being re-evaluated per iteration.

Phase 1 (best path, 30 min): Side-by-side AST dump of the macro-generated match vs a hand-written match. Add a temp debug print in mir_lower_match_expr (mir_lower_expr_handlers.qz). The ASTs look identical on the surface — there must be a subtle slot difference in str1, str2, extras, children, int_val, or ops. Diff the dumps. The difference IS the bug.

Phase 2 (if Phase 1 is clean): Check MirContext state at match emission time. The macro runs at parse time, but MIR lowering happens in the loop body. Check mir_ctx_lookup_var("step") returns the right value, check scope stack matches the explicit-match path.

Phase 3 (if Phase 2 is clean): Check MIR constant folding. The %v0 = add i64 0, 0 at entry screams “constant fold.” Grep for anything that short-circuits a match with a single-ident-body arm to a constant. If found, check whether it accounts for gensym binding scope.
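If such a fold exists, the failure shape to grep for might look like the following: a hypothetical, scope-unaware fold that substitutes a default 0 when the gensym binding isn't visible at fold time, which would be consistent with the constant %v0 = add i64 0, 0 defined once at entry. All names here are illustrative, not Quartz's actual MIR API.

```python
# Hypothetical sketch of a scope-unaware fold of a single-ident-body
# match arm: it resolves the arm body by consulting a binding map, and
# silently falls back to constant 0 when the gensym isn't in scope.
def fold_single_ident_arm(visible_bindings, gensym_name):
    # Wrong behavior: a failed lookup becomes a folded constant 0
    # instead of a per-iteration load from the subject's slot.
    return visible_bindings.get(gensym_name, 0)

# Gensym missing from the consulted scope: constant 0 (the symptom).
assert fold_single_ident_arm({}, "__unwrap_x_g1") == 0
# Correct scope handling would find the binding instead.
assert fold_single_ident_arm({"__unwrap_x_g1": 10}, "__unwrap_x_g1") == 10
```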

Phase 4 (escape hatch): Rewrite $unwrap to desugar to something other than a match:

  • Option C (simplest): Add option_unwrap as a builtin intrinsic in cg_intrinsic_core.qz — tag check + payload load + panic. ! lowers to option_unwrap(e). Matches opt.unwrap() UFCS which already works.
  • Option B: Lower to if-else with is narrowing + load_offset.
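As a semantic model of Option C (a Python stand-in, not the Quartz intrinsic itself; the tag values and tuple layout are illustrative, not Quartz's actual Option ABI), the tag check + payload load + panic shape is:

```python
# Model of the Option C desugar: `e!` becomes one intrinsic call that
# checks the tag, loads the payload, and panics on None.
SOME, NONE = 1, 0   # illustrative tag values

def option_unwrap(opt):
    tag, payload = opt          # tag word + payload word (illustrative)
    if tag != SOME:
        raise RuntimeError("unwrap on None")   # panic path
    return payload                              # payload load

assert option_unwrap((SOME, 42)) == 42
try:
    option_unwrap((NONE, 0))
    raise AssertionError("expected panic")
except RuntimeError:
    pass
```

Because the whole operation is one call expression, there is no match AST for MIR lowering to mishandle, which is why this is the escape hatch.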

Key files

File                                            Lines                 What
self-hosted/frontend/macro_expand.qz            ~1220                 expand_builtin_unwrap — generates the match AST
self-hosted/backend/mir_lower_expr_handlers.qz  mir_lower_match_expr  Match lowering — where the constant substitution happens
spec/qspec/unwrap_in_loop_spec.qz               all                   3 passing + 3 tests documenting the broken cases

Exit criteria

  • unwrap_in_loop_spec.qz — all 6 tests pass (currently 3 pass, 3 fail)
  • Minimal reproducer returns 211
  • quake guard + quake smoke green
  • ROADMAP row B4-UNWRAP-IN-LOOP updated to RESOLVED

S1: Actor crash+restart+stop hang — scheduler task loss

The bug (isolated reproducer)

actor Crasher
  var n: Int = 0
  def bump(): Void
    n += 1
  end
  def get(): Int
    return n
  end
  def crash(): Void
    var v = vec_new()
    var bad = v[999]   # intentional panic
  end
end

actor Pinger
  var count: Int = 0
  def ping(): Void
    count += 1
  end
  def get(): Int
    return count
  end
end

# This sequence hangs 25% of the time:
var a = Pinger()
var b = Crasher()
a.link(b)
usleep(5000)
b.crash()        # fire-and-forget (void method)
usleep(20000)
a.ping()         # ✓ works
a.get()          # ✓ works (returns 1)
b.bump()         # ✓ works (actor restarted after crash)
b.get()          # ✓ works (returns 1)
a.stop()         # ✓ works
b.stop()         # ← HANGS HERE — recv(reply_ch) blocks forever

Debugger findings (from Apr 16 investigation)

Thread dump during hang:

  • Thread 1 (main): Crasher$stop → pthread_cond_wait on reply_ch (waiting for a reply that never comes)
  • Thread 2 (I/O poller): kevent — blocked, no events pending
  • Threads 3-12 (10 scheduler workers): cond_wait — idle, no tasks in queue

The actor task is NOT on any scheduler queue, NOT in io_map[notify_rfd], NOT consuming a worker. It has vanished from the system. Progress tracing confirmed: crash→restart→bump→get all succeed. The task disappears between b.get() returning and b.stop() sending.

Root cause analysis (from Apr 16 session)

The chain of events:

  1. Actor processes messages via try_recv(inbox) in async poll loop
  2. try_recv returns empty → io_suspend(notify_rfd) → poll returns PENDING
  3. Worker: register_io(notify_rfd, task) → stores in io_map[fd], calls EV_ADD|EV_ONESHOT
  4. send to inbox writes to channel buffer AND writes a byte to notify_wfd pipe
  5. EV_ONESHOT fires → I/O poller loads io_map[fd], clears it, re-enqueues task

The failure mode: The send intrinsic writes a pipe byte on EVERY send (unconditionally — lines 814-860 of cg_intrinsic_conc_channel.qz). But try_recv reads from the channel buffer, NOT from the pipe. Pipe bytes accumulate without being drained. After crash+restart, the actor rapidly cycles through: wake (stale pipe byte) → try_recv (empty) → io_suspend → EV_ADD → fires immediately (more stale bytes) → wake → … This hot loop races with the io_map clear/register window. At some point the task falls through the cracks.
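The accumulation precondition is easy to reproduce outside Quartz. A Python sketch of the same shape (an unconditional notify byte per send, and a receiver that only drains the buffer, never the pipe):

```python
# Demonstrates the precondition for the hot loop: notify bytes pile up
# on the pipe because the receive path consumes only the buffer.
import collections
import fcntl
import os

rfd, wfd = os.pipe()
buffer = collections.deque()

def send(msg):
    buffer.append(msg)
    os.write(wfd, b"\x01")      # unconditional notify, as in the bug

def try_recv():
    # Reads from the channel buffer, NOT from the pipe.
    return buffer.popleft() if buffer else None

for i in range(5):
    send(i)
while try_recv() is not None:
    pass

# Buffer is empty, but 5 stale notify bytes remain readable on the pipe,
# so any fresh EV_ADD on rfd would fire immediately.
flags = fcntl.fcntl(rfd, fcntl.F_GETFL)
fcntl.fcntl(rfd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
stale = os.read(rfd, 64)
assert len(stale) == 5
```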

What was tested and didn’t work:

  • Conditional pipe write (only when cnt==0): didn’t fix the hang (tested, 116/500 hangs)
  • Removing EV_DELETE for EVFILT_READ from ev_enqueue cleanup: didn’t fix (tested, 23/500 hangs)

Phase 1: Research what the big players do (D2, 30 min)

Web-search how these runtimes handle channel-backed actor mailbox notification:

  • Go: how does the Go scheduler wake a goroutine blocked on a channel? (netpoller + goready)
  • Tokio (Rust): how does tokio’s io_driver handle re-registration after task wake?
  • Erlang BEAM: how does the BEAM VM handle process mailbox notification? (no fd at all — shared-memory flag + scheduler scan)

The pattern to look for: do they use a pipe per channel? Or a different notification mechanism? How do they handle the “stale notification” problem?

Phase 2: Design the fix (1h)

Three viable approaches, ranked by correctness:

A. Drain pipe bytes on task wake (simplest, might not be sufficient): In the I/O poller’s ev_enqueue path, after loading io_map[fd] and before re-enqueueing, do read(fd, buf, sizeof(buf)) with O_NONBLOCK to drain all pending bytes. This eliminates the stale-byte hot loop. Risk: the fd might not be a pipe (could be a real I/O fd for non-channel use). Need a way to distinguish channel notify pipes from real I/O fds.
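A minimal sketch of the drain, assuming the fd is a plain pipe (a Python stand-in for the poller-side change, not the actual ev_enqueue code):

```python
# Option A sketch: drain all pending notify bytes with non-blocking
# reads before re-enqueueing the task, so stale bytes can't re-fire.
import fcntl
import os

def drain_notify_pipe(fd):
    # Ensure non-blocking so an empty pipe returns EAGAIN, not a hang.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    drained = 0
    while True:
        try:
            chunk = os.read(fd, 4096)
        except BlockingIOError:
            return drained          # pipe is empty: done
        if not chunk:
            return drained          # write end closed
        drained += len(chunk)

rfd, wfd = os.pipe()
os.write(wfd, b"\x01" * 7)          # 7 stale notify bytes
assert drain_notify_pipe(rfd) == 7
assert drain_notify_pipe(rfd) == 0  # nothing left to drain
```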

B. Switch channel notify_rfd from EV_ONESHOT to EV_CLEAR (edge-triggered): EV_CLEAR resets the event state after each delivery, so kevent fires once per arrival of new data rather than once per registration, and stale bytes that were already reported don’t re-fire. But this changes the semantics: the I/O poller would need to handle “event fired but task already re-registered” gracefully. The io_map[fd]!=0 check already does this.
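Since kqueue isn't portable for a quick demo, the same one-shot vs edge-triggered distinction can be shown with Linux epoll, whose EPOLLONESHOT/EPOLLET flags are close analogs of EV_ONESHOT/EV_CLEAR (an illustrative analog, not the Quartz poller itself):

```python
# EPOLLONESHOT disarms after one delivery until explicitly re-armed,
# like EV_ONESHOT; edge-triggered EPOLLET re-fires only when NEW bytes
# arrive, so already-reported stale bytes stop waking the poller.
import os
import select

def fires(ep, timeout=0.05):
    # True if the epoll instance reports any event within the timeout.
    return len(ep.poll(timeout)) > 0

rfd, wfd = os.pipe()

ep1 = select.epoll()
ep1.register(rfd, select.EPOLLIN | select.EPOLLONESHOT)
os.write(wfd, b"\x01")
assert fires(ep1)        # first delivery
os.write(wfd, b"\x01")
assert not fires(ep1)    # one-shot: disarmed until explicitly re-armed

ep2 = select.epoll()
ep2.register(rfd, select.EPOLLIN | select.EPOLLET)
assert fires(ep2)        # pending bytes reported once at registration
assert not fires(ep2)    # no new edge: stale bytes don't re-fire
os.write(wfd, b"\x01")
assert fires(ep2)        # new byte = new edge
```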

C. Replace pipe notification with recv_q (the channel’s own waiter queue): The channel already has a recv_q (head@168, tail@176) for parking async waiters. The actor’s async recv could enqueue itself in recv_q instead of using io_suspend(notify_rfd). Then send dequeues from recv_q and calls sched_wake directly — no pipe, no kevent, no race. This is the cleanest fix but requires changes to the actor poll’s MIR lowering to use channel_park_recv instead of try_recv + io_suspend.
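A Python model of the recv_q flow (park and wake names follow the handoff's sketch; the class layout is illustrative, not the channel's actual struct):

```python
# Option C model: no pipe, no kevent. The channel parks the waiting
# task on its own recv_q; send pops a waiter and wakes it directly,
# so there are no stale notifications and no re-registration race.
import collections
import threading

class Channel:
    def __init__(self):
        self.buf = collections.deque()
        self.recv_q = collections.deque()   # parked async waiters
        self.lock = threading.Lock()

    def park_recv(self, waker):
        # Models channel_park_recv: take data if ready, else park.
        with self.lock:
            if self.buf:
                return self.buf.popleft()
            self.recv_q.append(waker)       # park: no fd involved
            return None

    def send(self, msg):
        with self.lock:
            self.buf.append(msg)
            if self.recv_q:
                self.recv_q.popleft()()     # direct wake of one waiter

woken = []
ch = Channel()
assert ch.park_recv(lambda: woken.append("task")) is None  # parks
ch.send("hello")
assert woken == ["task"]                    # woken directly by send
assert ch.park_recv(lambda: None) == "hello"
```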

Phase 3: Implement + stress test (2-4h)

Whichever approach is chosen, stress gate is:

  • actor_link_spec: must be 0/500 hangs (was 28/500)
  • Isolated crash-only test: must be 0/500 hangs (was ~125/500)
  • async_mutex_spec: must remain 0/2000
  • async_rwlock_spec: must remain 0/1500
  • quake guard + quake smoke green

Key files

File                                              Lines      What
self-hosted/backend/cg_intrinsic_conc_channel.qz  237-870    send intrinsic — pipe write at 814-860
self-hosted/backend/codegen_runtime.qz            2890-2910  register_io — EV_ADD|EV_ONESHOT
self-hosted/backend/codegen_runtime.qz            3310-3510  __qz_sched_io_poller — event loop
self-hosted/backend/codegen_runtime.qz            3435-3471  ev_enqueue — io_map lookup + cleanup
self-hosted/backend/codegen_runtime.qz            2148-2163  panic_caught → pc_actor_restart
self-hosted/backend/mir_lower.qz                  4482-5000  mir_lower_actor_poll — actor async state machine
spec/qspec/actor_link_spec.qz                     all        10 tests; test 10 (crash+restart+stop) triggers the hang

Timeline

Time       Work
0:00-0:15  Pre-flight + read both handoff sections
0:15-2:30  S2: B4-UNWRAP-IN-LOOP — Phase 1 AST dump, fix, or Phase 4 escape hatch
2:30-3:00  S2 commit + guard + roadmap update
3:00-3:30  S1: Research (Go/Tokio/BEAM mailbox notification)
3:30-4:00  S1: Design fix (pick A/B/C)
4:00-6:00  S1: Implement + stress test
6:00-6:30  S1 commit + guard + roadmap update + handoff if incomplete

Failure mode: S2 should ship in one sitting — multiple escape hatches exist. S1 is harder — if the implementation proves complex, commit the research + design and hand off the implementation. The actor crash+restart scenario is niche enough that filing it with a clear fix plan is D6-compliant. Don’t leave the session with a half-done I/O poller refactor.


Prime directives check

  • D1 (highest impact): Both are silent-wrong-result / user-visible-hang bugs. Highest-impact class.
  • D2 (research first): S1 needs Go/Tokio/BEAM research on mailbox notification. S2 needs side-by-side AST analysis.
  • D3 (pragmatism): S2 has an escape hatch (Option C: intrinsic). S1’s Option C (recv_q) is the right thing but may be too large for one session — pragmatic path is Option A or B.
  • D4 (multi-session): S1 may span sessions. Hand off cleanly if needed.
  • D5 (report reality): Stress numbers are the gate. 0/500 or it’s not fixed.
  • D6 (fill or file): If S1 can’t be fixed, the design + research must be committed.
  • D7 (delete freely): Both fixes should delete the wrong thing and replace it.
  • D8 (binary discipline): quake guard before every commit. Fix-specific backup before touching codegen_runtime.qz.
  • D9 (quartz-time): S2: 2-4h. S1: 3-6h. Total: 5-10h = one focused session.
  • D10 (corrections): If the fix approach is wrong, pivot. Don’t defend sunk cost.