# Next Session — Tier S Bugs: Actor Task Loss (S1) + Unwrap-In-Loop Miscompile (S2)

**Baseline:** 7d2548f9 (post Apr 15-16 sprint, trunk)
**Scope:** Two Tier S bugs — the only items on the roadmap with "silent wrong result" or "25% crash repro" severity. Both are pre-launch blockers that users WILL hit.
**Session shape:** S2 first (more constrained, better narrowed, multiple viable fix paths). S1 second (deeper I/O poller work; benefits from warm-up on S2).
## TL;DR

- **S2 (B4-UNWRAP-IN-LOOP, 2-4h):** `step!` inside a `while` loop with reassignment returns constant 0 instead of the Option payload. An explicit `match` in the same loop works. The bug is specifically in how the `$unwrap` macro-generated AST interacts with MIR match lowering. Two fix attempts already failed (documented). Three viable fix paths remain. Detailed handoff at `docs/handoff/b4-unwrap-in-loop-handoff.md`.
- **S1 (15b: actor crash+restart task loss, 3-6h):** An actor that crashes (panic → longjmp → state-0 restart → re-enqueue) successfully processes subsequent messages (`bump`, `get`) but then vanishes from the scheduler. `b.stop()` hangs forever — all workers idle, I/O poller idle, task in limbo. 25% repro rate. Root-caused to the interaction between stale pipe bytes in the channel `notify_wfd`, EV_ONESHOT auto-removal, and an `io_map` clear/register race window. The conditional pipe-write fix was tested and didn't help — the deeper issue is in how the I/O poller handles rapid re-registration after actor crash restart.
## Pre-flight (5 min)

```sh
cd /Users/mathisto/projects/quartz
git log --oneline -5                  # Expected: 7d2548f9 at top
git status                            # clean
./self-hosted/bin/quake guard:check   # "Fixpoint stamp valid"
./self-hosted/bin/quake smoke         # 4/4 + 22/22

# Fix-specific backup
cp self-hosted/bin/quartz self-hosted/bin/backups/quartz-pre-s1-s2-golden
```
## S2: B4-UNWRAP-IN-LOOP — step! miscompile in while loop

### The bug (11-line reproducer)

```quartz
def main(): Int
  var step = Option::Some(10)
  var sum = 0
  var i = 0
  while i < 3
    sum += step!              # ← returns 0 every iteration
    step = Option::Some(i + 100)
    i += 1
  end
  return 0 if sum == 211      # expected: 10+100+101=211
  return 1                    # actual: 0+0+0=0
end
```
### What fires the bug (all three required)

- `$unwrap` macro expansion (postfix `!` or explicit `$unwrap()`)
- Inside a `while` loop body
- Subject reassigned inside the loop

Remove ANY ONE → works. An explicit `match step { Some(x) => x; None => 0; end }` in the same loop works perfectly.
### What's been tried and failed

- Attempt 1: Changed the macro's `payloads = vec_new()` to `payloads = 0` (matching parser convention). No effect.
- Attempt 2: Hoisted the match subject into a `var __subj = expr` block. No effect.
- Both documented in commit `258468f5` and `docs/handoff/b4-unwrap-in-loop-handoff.md`.
### IR symptom

The `step!` expansion emits `%v0 = add i64 0, 0` (constant 0), defined once at function entry. The match body references this constant instead of loading from `%step`'s alloca. The match subject isn't being re-evaluated per iteration.
### Recommended fix approach (read docs/handoff/b4-unwrap-in-loop-handoff.md for full detail)

**Phase 1 (best path, 30 min):** Side-by-side AST dump of the macro-generated match vs a hand-written match. Add a temp debug print in `mir_lower_match_expr` (`mir_lower_expr_handlers.qz`). The ASTs look identical at the surface — there must be a subtle slot difference in `str1`, `str2`, `extras`, `children`, `int_val`, or `ops`. Diff the dumps. The difference IS the bug.

**Phase 2 (if Phase 1 is clean):** Check `MirContext` state at match emission time. The macro runs at parse time, but MIR lowering happens in the loop body. Check that `mir_ctx_lookup_var("step")` returns the right value, and that the scope stack matches the explicit-match path.

**Phase 3 (if Phase 2 is clean):** Check MIR constant folding. The `%v0 = add i64 0, 0` at entry screams "constant fold." Grep for anything that short-circuits a match with a single-ident-body arm to a constant. If found, check whether it accounts for gensym binding scope.
**Phase 4 (escape hatch):** Rewrite `$unwrap` to desugar to something other than a match:

- Option C (simplest): Add `option_unwrap` as a builtin intrinsic in `cg_intrinsic_core.qz` — tag check + payload load + panic. `!` lowers to `option_unwrap(e)`. Matches the `opt.unwrap()` UFCS path, which already works.
- Option B: Lower to `if`/`else` with `is` narrowing + `load_offset`.
### Key files

| File | Lines | What |
|---|---|---|
| `self-hosted/frontend/macro_expand.qz` | ~1220 | `expand_builtin_unwrap` — generates the match AST |
| `self-hosted/backend/mir_lower_expr_handlers.qz` | `mir_lower_match_expr` | Match lowering — where the constant substitution happens |
| `spec/qspec/unwrap_in_loop_spec.qz` | all | 3 passing + 3 failing tests documenting the broken cases |
### Exit criteria

- `unwrap_in_loop_spec.qz` — all 6 tests pass (currently 3 pass, 3 fail)
- Minimal reproducer returns 211
- `quake guard` + `quake smoke` green
- ROADMAP row B4-UNWRAP-IN-LOOP updated to RESOLVED
## S1: Actor crash+restart+stop hang — scheduler task loss
### The bug (isolated reproducer)

```quartz
actor Crasher
  var n: Int = 0
  def bump(): Void
    n += 1
  end
  def get(): Int
    return n
  end
  def crash(): Void
    var v = vec_new()
    var bad = v[999]   # intentional panic
  end
end

actor Pinger
  var count: Int = 0
  def ping(): Void
    count += 1
  end
  def get(): Int
    return count
  end
end

# This sequence hangs 25% of the time:
var a = Pinger()
var b = Crasher()
a.link(b)
usleep(5000)
b.crash()      # fire-and-forget (void method)
usleep(20000)
a.ping()       # ✓ works
a.get()        # ✓ works (returns 1)
b.bump()       # ✓ works (actor restarted after crash)
b.get()        # ✓ works (returns 1)
a.stop()       # ✓ works
b.stop()       # ← HANGS HERE — recv(reply_ch) blocks forever
```
### Debugger findings (from Apr 16 investigation)

Thread dump during hang:

- Thread 1 (main): `Crasher$stop` → `pthread_cond_wait` on reply_ch (waiting for a reply that never comes)
- Thread 2 (I/O poller): `kevent` — blocked, no events pending
- Threads 3-12 (10 scheduler workers): `cond_wait` — idle, no tasks in queue

The actor task is NOT on any scheduler queue, NOT in `io_map[notify_rfd]`, NOT consuming a worker. It has vanished from the system. Progress tracing confirmed: crash → restart → bump → get all succeed. The task disappears between `b.get()` returning and `b.stop()` sending.
### Root cause analysis (from Apr 16 session)

The chain of events:

1. Actor processes messages via `try_recv(inbox)` in the async poll loop
2. `try_recv` returns empty → `io_suspend(notify_rfd)` → poll returns PENDING
3. Worker: `register_io(notify_rfd, task)` → stores the task in `io_map[fd]`, registers with `EV_ADD|EV_ONESHOT`
4. `send` to the inbox writes to the channel buffer AND writes a byte to the `notify_wfd` pipe
5. `EV_ONESHOT` fires → I/O poller loads `io_map[fd]`, clears it, re-enqueues the task

**The failure mode:** The send intrinsic writes a pipe byte on EVERY send (unconditionally — lines 814-860 of `cg_intrinsic_conc_channel.qz`). But `try_recv` reads from the channel buffer, NOT from the pipe. Pipe bytes accumulate without being drained. After crash+restart, the actor rapidly cycles through: wake (stale pipe byte) → `try_recv` (empty) → `io_suspend` → `EV_ADD` → fires immediately (more stale bytes) → wake → … This hot loop races with the `io_map` clear/register window. At some point the task falls through the cracks.
What was tested and didn't work:

- Conditional pipe write (only when `cnt == 0`): didn't fix the hang (tested, 116/500 hangs)
- Removing `EV_DELETE` for `EVFILT_READ` from `ev_enqueue` cleanup: didn't fix (tested, 23/500 hangs)
### Recommended fix approach

#### Phase 1: Research what the big players do (D2, 30 min)
Web-search how these runtimes handle channel-backed actor mailbox notification:
- Go: how does the Go scheduler wake a goroutine blocked on a channel? (netpoller + goready)
- Tokio (Rust): how does tokio’s io_driver handle re-registration after task wake?
- Erlang BEAM: how does the BEAM VM handle process mailbox notification? (no fd at all — shared-memory flag + scheduler scan)
The pattern to look for: do they use a pipe per channel? Or a different notification mechanism? How do they handle the “stale notification” problem?
#### Phase 2: Design the fix (1h)
Three viable approaches, ranked by correctness:
**A. Drain pipe bytes on task wake** (simplest, might not be sufficient): In the I/O poller's `ev_enqueue` path, after loading `io_map[fd]` and before re-enqueueing, do `read(fd, buf, sizeof(buf))` on the `O_NONBLOCK` fd until it would block, draining all pending bytes. This eliminates the stale-byte hot loop. Risk: the fd might not be a pipe (it could be a real I/O fd for non-channel use), so we need a way to distinguish channel notify pipes from real I/O fds.
**B. Switch channel `notify_rfd` from EV_ONESHOT to EV_CLEAR** (edge-triggered): `EV_CLEAR` makes kevent fire once per new data rather than once per registration, so stale bytes don't re-fire. But this changes the semantics: the I/O poller would need to handle "event fired but task already re-registered" gracefully. The `io_map[fd] != 0` check already does this.
**C. Replace pipe notification with `recv_q` (the channel's own waiter queue):** The channel already has a `recv_q` (head@168, tail@176) for parking async waiters. The actor's async recv could enqueue itself in `recv_q` instead of using `io_suspend(notify_rfd)`. Then `send` dequeues from `recv_q` and calls `sched_wake` directly — no pipe, no kevent, no race. This is the cleanest fix, but it requires changing the actor poll's MIR lowering to use `channel_park_recv` instead of `try_recv` + `io_suspend`.
#### Phase 3: Implement + stress test (2-4h)

Whichever approach is chosen, the stress gate is:

- `actor_link_spec`: must be 0/500 hangs (was 28/500)
- Isolated crash-only test: must be 0/500 hangs (was ~125/500)
- `async_mutex_spec`: must remain 0/2000
- `async_rwlock_spec`: must remain 0/1500
- `quake guard` + `quake smoke` green
### Key files

| File | Lines | What |
|---|---|---|
| `self-hosted/backend/cg_intrinsic_conc_channel.qz` | 237-870 | send intrinsic — pipe write at 814-860 |
| `self-hosted/backend/codegen_runtime.qz` | 2890-2910 | `register_io` — `EV_ADD` + `EV_ONESHOT` |
| `self-hosted/backend/codegen_runtime.qz` | 3310-3510 | `__qz_sched_io_poller` — event loop |
| `self-hosted/backend/codegen_runtime.qz` | 3435-3471 | `ev_enqueue` — io_map lookup + cleanup |
| `self-hosted/backend/codegen_runtime.qz` | 2148-2163 | `panic_caught` → `pc_actor_restart` |
| `self-hosted/backend/mir_lower.qz` | 4482-5000 | `mir_lower_actor_poll` — actor async state machine |
| `spec/qspec/actor_link_spec.qz` | all | 10 tests; test 10 (crash+restart+stop) triggers the hang |
### Recommended sprint shape
| Time | Work |
|---|---|
| 0:00-0:15 | Pre-flight + read both handoff sections |
| 0:15-2:30 | S2: B4-UNWRAP-IN-LOOP — Phase 1 AST dump, fix, or Phase 4 escape hatch |
| 2:30-3:00 | S2 commit + guard + roadmap update |
| 3:00-3:30 | S1: Research (Go/Tokio/BEAM mailbox notification) |
| 3:30-4:00 | S1: Design fix (pick A/B/C) |
| 4:00-6:00 | S1: Implement + stress test |
| 6:00-6:30 | S1 commit + guard + roadmap update + handoff if incomplete |
Failure mode: S2 should ship in one sitting — multiple escape hatches exist. S1 is harder — if the implementation proves complex, commit the research + design and hand off the implementation. The actor crash+restart scenario is niche enough that filing it with a clear fix plan is D6-compliant. Don’t leave the session with a half-done I/O poller refactor.
## Prime directives check
- D1 (highest impact): Both are silent-wrong-result / user-visible-hang bugs. Highest-impact class.
- D2 (research first): S1 needs Go/Tokio/BEAM research on mailbox notification. S2 needs side-by-side AST analysis.
- D3 (pragmatism): S2 has an escape hatch (Option C: intrinsic). S1’s Option C (recv_q) is the right thing but may be too large for one session — pragmatic path is Option A or B.
- D4 (multi-session): S1 may span sessions. Hand off cleanly if needed.
- D5 (report reality): Stress numbers are the gate. 0/500 or it’s not fixed.
- D6 (fill or file): If S1 can’t be fixed, the design + research must be committed.
- D7 (delete freely): Both fixes should delete the wrong thing and replace it.
- D8 (binary discipline): `quake guard` before every commit. Fix-specific backup before touching `codegen_runtime.qz`.
- D9 (quartz-time): S2: 2-4h. S1: 3-6h. Total: 5-10h = one focused session.
- D10 (corrections): If the fix approach is wrong, pivot. Don’t defend sunk cost.