Quartz v5.25

Next Session — Tier S Bugs: Actor Task Loss (S1) + Unwrap-In-Loop Miscompile (S2)

Baseline: 7d2548f9 (post Apr 15-16 sprint, trunk)

Scope: Two Tier S bugs — the only items on the roadmap with “silent wrong result” or “25% crash repro” severity. Both are pre-launch blockers that users WILL hit.

Session shape: S2 first (more constrained, better narrowed, multiple viable fix paths). S1 second (deeper I/O poller work, benefits from warm-up on S2).


TL;DR

  1. S2 (B4-UNWRAP-IN-LOOP, 2-4h): step! inside a while loop with reassignment returns constant 0 instead of the Option payload. Explicit match in the same loop works. The bug is specifically in how the $unwrap macro-generated AST interacts with MIR match lowering. Two fix attempts already failed (documented). Three viable fix paths remain. Detailed handoff at docs/handoff/b4-unwrap-in-loop-handoff.md.

  2. S1 (15b: Actor crash+restart task loss, 3-6h): Actor that crashes (panic → longjmp → state-0 restart → re-enqueue) successfully processes subsequent messages (bump, get) but then vanishes from the scheduler. b.stop() hangs forever — all workers idle, I/O poller idle, task in limbo. 25% repro rate. Root-caused to the interaction between stale pipe bytes in channel notify_wfd, EV_ONESHOT auto-removal, and an io_map clear/register race window. The conditional pipe-write fix was tested and didn’t help — the deeper issue is in how the I/O poller handles rapid re-registration after actor crash restart.


Pre-flight (5 min)

cd /Users/mathisto/projects/quartz
git log --oneline -5
# Expected: 7d2548f9 at top
git status                              # clean
./self-hosted/bin/quake guard:check     # "Fixpoint stamp valid"
./self-hosted/bin/quake smoke           # 4/4 + 22/22

# Fix-specific backup
cp self-hosted/bin/quartz self-hosted/bin/backups/quartz-pre-s1-s2-golden

S2: B4-UNWRAP-IN-LOOP — step! miscompile in while loop

The bug (11-line reproducer)

def main(): Int
  var step = Option::Some(10)
  var sum = 0
  var i = 0
  while i < 3
    sum += step!           # ← returns 0 every iteration
    step = Option::Some(i + 100)
    i += 1
  end
  return 0 if sum == 211   # expected: 10+100+101=211
  return 1                 # actual: 0+0+0=0
end

What fires the bug (all three required)

  1. $unwrap macro expansion (postfix ! or explicit $unwrap())
  2. Inside a while loop body
  3. Subject reassigned inside the loop

Remove ANY ONE → works. Explicit match step { Some(x) => x; None => 0; end } in the same loop works perfectly.
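For reference, the intended semantics can be modeled outside Quartz. A minimal Python sketch (the `unwrap` helper is a stand-in for `step!`, not a Quartz API) of what the loop should compute:

```python
# Python model of the semantics the reproducer expects: `step!` must
# re-read the CURRENT value of `step` each iteration, so the sum is
# 10 + 100 + 101 = 211. The miscompile instead yields a constant 0.
from typing import Optional

def unwrap(opt: Optional[int]) -> int:
    # Stand-in for `step!`: panic on None, otherwise yield the payload.
    if opt is None:
        raise RuntimeError("unwrap on None")
    return opt

step: Optional[int] = 10
total = 0
i = 0
while i < 3:
    total += unwrap(step)   # reads the current binding, not a cached one
    step = i + 100          # subject reassigned inside the loop
    i += 1

assert total == 211
```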

What’s been tried and failed

  • Attempt 1: Changed the macro’s payloads = vec_new() to payloads = 0 (matching the parser convention). No effect.
  • Attempt 2: Hoisted match subject into var __subj = expr block. No effect.
  • Both documented in commit 258468f5 and docs/handoff/b4-unwrap-in-loop-handoff.md.

IR symptom

The step! expansion emits %v0 = add i64 0, 0 (constant 0) defined once at function entry. The match body references this constant instead of loading from %step’s alloca. The match subject isn’t being re-evaluated per iteration.

Phase 1 (best path, 30 min): Side-by-side AST dump of the macro-generated match vs a hand-written match. Add a temp debug print in mir_lower_match_expr (mir_lower_expr_handlers.qz). The ASTs look identical on the surface — there must be a subtle slot difference in str1, str2, extras, children, int_val, or ops. Diff the dumps. The difference IS the bug.

Phase 2 (if Phase 1 is clean): Check MirContext state at match emission time. The macro runs at parse time, but MIR lowering happens in the loop body. Check mir_ctx_lookup_var("step") returns the right value, check scope stack matches the explicit-match path.

Phase 3 (if Phase 2 is clean): Check MIR constant folding. The %v0 = add i64 0, 0 at entry screams “constant fold.” Grep for anything that short-circuits a match with a single-ident-body arm to a constant. If found, check whether it accounts for gensym binding scope.
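If such a fold exists, the failure shape to grep for might look like the following: a hypothetical, scope-unaware fold that substitutes a default 0 when the gensym binding isn't visible at fold time, which would be consistent with the constant %v0 = add i64 0, 0 defined once at entry. All names here are illustrative, not Quartz's actual MIR API.

```python
# Hypothetical sketch of a scope-unaware fold of a single-ident-body
# match arm: it resolves the arm body by consulting a binding map, and
# silently falls back to constant 0 when the gensym isn't in scope.
def fold_single_ident_arm(visible_bindings, gensym_name):
    # Wrong behavior: a failed lookup becomes a folded constant 0
    # instead of a per-iteration load from the subject's slot.
    return visible_bindings.get(gensym_name, 0)

# Gensym missing from the consulted scope: constant 0 (the symptom).
assert fold_single_ident_arm({}, "__unwrap_x_g1") == 0
# Correct scope handling would find the binding instead.
assert fold_single_ident_arm({"__unwrap_x_g1": 10}, "__unwrap_x_g1") == 10
```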

Phase 4 (escape hatch): Rewrite $unwrap to desugar to something other than a match:

  • Option C (simplest): Add option_unwrap as a builtin intrinsic in cg_intrinsic_core.qz — tag check + payload load + panic. ! lowers to option_unwrap(e). Matches opt.unwrap() UFCS which already works.
  • Option B: Lower to if-else with is narrowing + load_offset.
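As a semantic model of Option C (a Python stand-in, not the Quartz intrinsic itself; the tag values and tuple layout are illustrative, not Quartz's actual Option ABI), the tag check + payload load + panic shape is:

```python
# Model of the Option C desugar: `e!` becomes one intrinsic call that
# checks the tag, loads the payload, and panics on None.
SOME, NONE = 1, 0   # illustrative tag values

def option_unwrap(opt):
    tag, payload = opt          # tag word + payload word (illustrative)
    if tag != SOME:
        raise RuntimeError("unwrap on None")   # panic path
    return payload                              # payload load

assert option_unwrap((SOME, 42)) == 42
try:
    option_unwrap((NONE, 0))
    raise AssertionError("expected panic")
except RuntimeError:
    pass
```

Because the whole operation is one call expression, there is no match AST for MIR lowering to mishandle, which is why this is the escape hatch.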

Key files

File                                            Lines                 What
self-hosted/frontend/macro_expand.qz            ~1220                 expand_builtin_unwrap — generates the match AST
self-hosted/backend/mir_lower_expr_handlers.qz  mir_lower_match_expr  Match lowering — where the constant substitution happens
spec/qspec/unwrap_in_loop_spec.qz               all                   3 passing + 3 tests documenting the broken cases

Exit criteria

  • unwrap_in_loop_spec.qz — all 6 tests pass (currently 3 pass, 3 fail)
  • Minimal reproducer returns 211
  • quake guard + quake smoke green
  • ROADMAP row B4-UNWRAP-IN-LOOP updated to RESOLVED

S1: Actor crash+restart+stop hang — scheduler task loss

The bug (isolated reproducer)

actor Crasher
  var n: Int = 0
  def bump(): Void
    n += 1
  end
  def get(): Int
    return n
  end
  def crash(): Void
    var v = vec_new()
    var bad = v[999]   # intentional panic
  end
end

actor Pinger
  var count: Int = 0
  def ping(): Void
    count += 1
  end
  def get(): Int
    return count
  end
end

# This sequence hangs 25% of the time:
var a = Pinger()
var b = Crasher()
a.link(b)
usleep(5000)
b.crash()        # fire-and-forget (void method)
usleep(20000)
a.ping()         # ✓ works
a.get()          # ✓ works (returns 1)
b.bump()         # ✓ works (actor restarted after crash)
b.get()          # ✓ works (returns 1)
a.stop()         # ✓ works
b.stop()         # ← HANGS HERE — recv(reply_ch) blocks forever

Debugger findings (from Apr 16 investigation)

Thread dump during hang:

  • Thread 1 (main): Crasher$stop → pthread_cond_wait on reply_ch (waiting for a reply that never comes)
  • Thread 2 (I/O poller): kevent — blocked, no events pending
  • Threads 3-12 (10 scheduler workers): cond_wait — idle, no tasks in queue

The actor task is NOT on any scheduler queue, NOT in io_map[notify_rfd], NOT consuming a worker. It has vanished from the system. Progress tracing confirmed: crash→restart→bump→get all succeed. The task disappears between b.get() returning and b.stop() sending.

Root cause analysis (from Apr 16 session)

The chain of events:

  1. Actor processes messages via try_recv(inbox) in async poll loop
  2. try_recv returns empty → io_suspend(notify_rfd) → poll returns PENDING
  3. Worker: register_io(notify_rfd, task) → stores in io_map[fd], calls EV_ADD|EV_ONESHOT
  4. send to inbox writes to channel buffer AND writes a byte to notify_wfd pipe
  5. EV_ONESHOT fires → I/O poller loads io_map[fd], clears it, re-enqueues task

The failure mode: The send intrinsic writes a pipe byte on EVERY send (unconditionally — lines 814-860 of cg_intrinsic_conc_channel.qz). But try_recv reads from the channel buffer, NOT from the pipe. Pipe bytes accumulate without being drained. After crash+restart, the actor rapidly cycles through: wake (stale pipe byte) → try_recv (empty) → io_suspend → EV_ADD → fires immediately (more stale bytes) → wake → … This hot loop races with the io_map clear/register window. At some point the task falls through the cracks.
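The accumulation precondition is easy to reproduce outside Quartz. A Python sketch of the same shape (an unconditional notify byte per send, and a receiver that only drains the buffer, never the pipe):

```python
# Demonstrates the precondition for the hot loop: notify bytes pile up
# on the pipe because the receive path consumes only the buffer.
import collections
import fcntl
import os

rfd, wfd = os.pipe()
buffer = collections.deque()

def send(msg):
    buffer.append(msg)
    os.write(wfd, b"\x01")      # unconditional notify, as in the bug

def try_recv():
    # Reads from the channel buffer, NOT from the pipe.
    return buffer.popleft() if buffer else None

for i in range(5):
    send(i)
while try_recv() is not None:
    pass

# Buffer is empty, but 5 stale notify bytes remain readable on the pipe,
# so any fresh EV_ADD on rfd would fire immediately.
flags = fcntl.fcntl(rfd, fcntl.F_GETFL)
fcntl.fcntl(rfd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
stale = os.read(rfd, 64)
assert len(stale) == 5
```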

What was tested and didn’t work:

  • Conditional pipe write (only when cnt==0): didn’t fix the hang (tested, 116/500 hangs)
  • Removing EV_DELETE for EVFILT_READ from ev_enqueue cleanup: didn’t fix (tested, 23/500 hangs)

Phase 1: Research what the big players do (D2, 30 min)

Web-search how these runtimes handle channel-backed actor mailbox notification:

  • Go: how does the Go scheduler wake a goroutine blocked on a channel? (netpoller + goready)
  • Tokio (Rust): how does tokio’s io_driver handle re-registration after task wake?
  • Erlang BEAM: how does the BEAM VM handle process mailbox notification? (no fd at all — shared-memory flag + scheduler scan)

The pattern to look for: do they use a pipe per channel? Or a different notification mechanism? How do they handle the “stale notification” problem?

Phase 2: Design the fix (1h)

Three viable approaches, ranked by correctness:

A. Drain pipe bytes on task wake (simplest, might not be sufficient): In the I/O poller’s ev_enqueue path, after loading io_map[fd] and before re-enqueueing, do read(fd, buf, sizeof(buf)) with O_NONBLOCK to drain all pending bytes. This eliminates the stale-byte hot loop. Risk: the fd might not be a pipe (could be a real I/O fd for non-channel use). Need a way to distinguish channel notify pipes from real I/O fds.
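A minimal sketch of the drain, assuming the fd is a plain pipe (a Python stand-in for the poller-side change, not the actual ev_enqueue code):

```python
# Option A sketch: drain all pending notify bytes with non-blocking
# reads before re-enqueueing the task, so stale bytes can't re-fire.
import fcntl
import os

def drain_notify_pipe(fd):
    # Ensure non-blocking so an empty pipe returns EAGAIN, not a hang.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    drained = 0
    while True:
        try:
            chunk = os.read(fd, 4096)
        except BlockingIOError:
            return drained          # pipe is empty: done
        if not chunk:
            return drained          # write end closed
        drained += len(chunk)

rfd, wfd = os.pipe()
os.write(wfd, b"\x01" * 7)          # 7 stale notify bytes
assert drain_notify_pipe(rfd) == 7
assert drain_notify_pipe(rfd) == 0  # nothing left to drain
```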

B. Switch channel notify_rfd from EV_ONESHOT to EV_CLEAR (edge-triggered): EV_CLEAR resets the event state after each delivery, so kevent fires once per arrival of new data rather than once per registration, and stale bytes that were already reported don’t re-fire. But this changes the semantics: the I/O poller would need to handle “event fired but task already re-registered” gracefully. The io_map[fd]!=0 check already does this.
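Since kqueue isn't portable for a quick demo, the same one-shot vs edge-triggered distinction can be shown with Linux epoll, whose EPOLLONESHOT/EPOLLET flags are close analogs of EV_ONESHOT/EV_CLEAR (an illustrative analog, not the Quartz poller itself):

```python
# EPOLLONESHOT disarms after one delivery until explicitly re-armed,
# like EV_ONESHOT; edge-triggered EPOLLET re-fires only when NEW bytes
# arrive, so already-reported stale bytes stop waking the poller.
import os
import select

def fires(ep, timeout=0.05):
    # True if the epoll instance reports any event within the timeout.
    return len(ep.poll(timeout)) > 0

rfd, wfd = os.pipe()

ep1 = select.epoll()
ep1.register(rfd, select.EPOLLIN | select.EPOLLONESHOT)
os.write(wfd, b"\x01")
assert fires(ep1)        # first delivery
os.write(wfd, b"\x01")
assert not fires(ep1)    # one-shot: disarmed until explicitly re-armed

ep2 = select.epoll()
ep2.register(rfd, select.EPOLLIN | select.EPOLLET)
assert fires(ep2)        # pending bytes reported once at registration
assert not fires(ep2)    # no new edge: stale bytes don't re-fire
os.write(wfd, b"\x01")
assert fires(ep2)        # new byte = new edge
```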

C. Replace pipe notification with recv_q (the channel’s own waiter queue): The channel already has a recv_q (head@168, tail@176) for parking async waiters. The actor’s async recv could enqueue itself in recv_q instead of using io_suspend(notify_rfd). Then send dequeues from recv_q and calls sched_wake directly — no pipe, no kevent, no race. This is the cleanest fix but requires changes to the actor poll’s MIR lowering to use channel_park_recv instead of try_recv + io_suspend.
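A Python model of the recv_q flow (park and wake names follow the handoff's sketch; the class layout is illustrative, not the channel's actual struct):

```python
# Option C model: no pipe, no kevent. The channel parks the waiting
# task on its own recv_q; send pops a waiter and wakes it directly,
# so there are no stale notifications and no re-registration race.
import collections
import threading

class Channel:
    def __init__(self):
        self.buf = collections.deque()
        self.recv_q = collections.deque()   # parked async waiters
        self.lock = threading.Lock()

    def park_recv(self, waker):
        # Models channel_park_recv: take data if ready, else park.
        with self.lock:
            if self.buf:
                return self.buf.popleft()
            self.recv_q.append(waker)       # park: no fd involved
            return None

    def send(self, msg):
        with self.lock:
            self.buf.append(msg)
            if self.recv_q:
                self.recv_q.popleft()()     # direct wake of one waiter

woken = []
ch = Channel()
assert ch.park_recv(lambda: woken.append("task")) is None  # parks
ch.send("hello")
assert woken == ["task"]                    # woken directly by send
assert ch.park_recv(lambda: None) == "hello"
```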

Phase 3: Implement + stress test (2-4h)

Whichever approach is chosen, stress gate is:

  • actor_link_spec: must be 0/500 hangs (was 28/500)
  • Isolated crash-only test: must be 0/500 hangs (was ~125/500)
  • async_mutex_spec: must remain 0/2000
  • async_rwlock_spec: must remain 0/1500
  • quake guard + quake smoke green

Key files

File                                              Lines      What
self-hosted/backend/cg_intrinsic_conc_channel.qz  237-870    send intrinsic — pipe write at 814-860
self-hosted/backend/codegen_runtime.qz            2890-2910  register_io — EV_ADD|EV_ONESHOT
self-hosted/backend/codegen_runtime.qz            3310-3510  __qz_sched_io_poller — event loop
self-hosted/backend/codegen_runtime.qz            3435-3471  ev_enqueue — io_map lookup + cleanup
self-hosted/backend/codegen_runtime.qz            2148-2163  panic_caught → pc_actor_restart
self-hosted/backend/mir_lower.qz                  4482-5000  mir_lower_actor_poll — actor async state machine
spec/qspec/actor_link_spec.qz                     all        10 tests; test 10 (crash+restart+stop) triggers the hang

Timeline

Time       Work
0:00-0:15  Pre-flight + read both handoff sections
0:15-2:30  S2: B4-UNWRAP-IN-LOOP — Phase 1 AST dump, fix, or Phase 4 escape hatch
2:30-3:00  S2 commit + guard + roadmap update
3:00-3:30  S1: Research (Go/Tokio/BEAM mailbox notification)
3:30-4:00  S1: Design fix (pick A/B/C)
4:00-6:00  S1: Implement + stress test
6:00-6:30  S1 commit + guard + roadmap update + handoff if incomplete

Failure mode: S2 should ship in one sitting — multiple escape hatches exist. S1 is harder — if the implementation proves complex, commit the research + design and hand off the implementation. The actor crash+restart scenario is niche enough that filing it with a clear fix plan is D6-compliant. Don’t leave the session with a half-done I/O poller refactor.


Prime directives check

  • D1 (highest impact): Both are silent-wrong-result / user-visible-hang bugs. Highest-impact class.
  • D2 (research first): S1 needs Go/Tokio/BEAM research on mailbox notification. S2 needs side-by-side AST analysis.
  • D3 (pragmatism): S2 has an escape hatch (Option C: intrinsic). S1’s Option C (recv_q) is the right thing but may be too large for one session — pragmatic path is Option A or B.
  • D4 (multi-session): S1 may span sessions. Hand off cleanly if needed.
  • D5 (report reality): Stress numbers are the gate. 0/500 or it’s not fixed.
  • D6 (fill or file): If S1 can’t be fixed, the design + research must be committed.
  • D7 (delete freely): Both fixes should delete the wrong thing and replace it.
  • D8 (binary discipline): quake guard before every commit. Fix-specific backup before touching codegen_runtime.qz.
  • D9 (quartz-time): S2: 2-4h. S1: 3-6h. Total: 5-10h = one focused session.
  • D10 (corrections): If the fix approach is wrong, pivot. Don’t defend sunk cost.