Quartz v5.25

Next Session — Scheduler Park/Wake Race + Dead-Code Cleanup

Baseline: 2baa99d2 (post-Phase 3b.next.2, trunk)
Primary target: fix the ~1% hang in sched_park_spec.qz test 4, un-pend it, and delete the dead spin-park codegen.
Secondary: enable async Mutex/RwLock (roadmap item #15), which is blocked purely on this work.
Scope: one focused session in its own context window — the scheduler hot path is dense and error-prone.

Prior handoff: docs/HANDOFF_PRIORITY_SPRINT.md (Apr 12, 2026). Items 2-4 of that doc are all complete; only Item 1 (this one) remains. The prior doc’s hypothesis (spin-park CAS race) is partially wrong — see §Investigation below. Don’t trust the implementation plan in that doc verbatim.


TL;DR

  1. Test 4 fails ~1% of runs (2 hangs / 200 runs, confirmed Apr 15) — not the 3% the prior handoff claimed.
  2. The async path IS being used. I was surprised too. go bouncer_a() compiles to __Future_bouncer_a$poll which calls sched_set_park_pending + emits an async_suspend_point. No spin-park IR executes on the hot path.
  3. The spin-park codegen is dead code. cg_intrinsic_conc_sched.qz:2988-3086 emits spin-park IR inside plain @bouncer_a() functions that are never called (verified: grep shows zero call i64 @bouncer_a in the test 4 binary). Delete it per Prime Directive 7.
  4. The real race lives in the async task_parked path — the CAS protocol in codegen_runtime.qz:2100-2124 + __qz_sched_wake at 2480-2516. Needs careful reading. My best-guess race is §Race Hypothesis below.
  5. Fix direction is Rust-model: stackless state machines, park is async-only, race lives in the handoff between task-returns-from-$poll and worker-reads-TLS-flag. Not Go-stackful.
  6. After the fix: un-pend test 4, delete spin-park dead code, delete plain-function emission for async targets (if confirmed dead), unblock async Mutex/RwLock.

Pre-flight (≤ 5 min)

cd /Users/mathisto/projects/quartz
git log --oneline -5
# Expected top 3:
#   2baa99d2 ROADMAP: sync header peak RSS to 7.81 GB
#   c68dceaa Phase 3b.next.2: fix tc_lookup suffix-fallback substring leak
#   35dea910 Phase 3b.next: localize leak to tc_expr_call
git status
./self-hosted/bin/quake guard:check
./self-hosted/bin/quake smoke 2>&1 | tail -6

# Baseline: test 4 currently pending
./self-hosted/bin/quake qspec_file FILE=spec/qspec/sched_park_spec.qz 2>&1 | tail -8
# Expected: 3 passing, 1 pending

# Fix-specific backup
cp self-hosted/bin/quartz self-hosted/bin/backups/quartz-pre-park-wake-golden

Reproduce the hang

# 1. Un-pend test 4 locally (don't commit yet)
sed -i '' 's/it_pending("repeated park\/wake on same task (blocked: needs async go closures)")/it("repeated park\/wake") do -> assert_eq(test_repeated_park_wake(), 3) end/' spec/qspec/sched_park_spec.qz

# 2. Build ONE binary and stress it
./self-hosted/bin/quartz spec/qspec/sched_park_spec.qz -I self-hosted -I tools -I std 2>/dev/null > /tmp/spk.ll
llc /tmp/spk.ll -o /tmp/spk.s && clang /tmp/spk.s -o /tmp/spk_bin -lpthread -lm

# 3. Run 200 times, count hangs
pass=0; fail=0; hang=0
for i in $(seq 1 200); do
  result=$(timeout 5 /tmp/spk_bin 2>&1); code=$?
  if [ $code -eq 124 ]; then hang=$((hang+1))
  elif echo "$result" | grep -q "All green"; then pass=$((pass+1))
  else fail=$((fail+1)); fi
done
echo "pass=$pass fail=$fail hang=$hang / 200"
# Expected: ~1% hang rate (2 hangs / 200 during this investigation)

# 4. Capture a hang stack trace via sample(1)
for attempt in $(seq 1 200); do
  /tmp/spk_bin > /tmp/out 2>&1 &
  BINPID=$!
  sleep 2
  if kill -0 $BINPID 2>/dev/null; then
    sample $BINPID 2 > /tmp/hang_stack.txt
    kill -9 $BINPID
    echo "Hung at attempt $attempt"
    break
  fi
  wait
done

# 5. Restore spec before doing anything else
git checkout spec/qspec/sched_park_spec.qz

When you capture the stack, you’re looking for:

  • Threads stuck in __qz_sched_io_poller (normal) — idle poller
  • A worker thread stuck in pthread_cond_wait — parked on the global mutex (normal)
  • A thread stuck in cmpxchg loop (if the race is in a CAS retry)
  • Any thread still inside @__Future_bouncer_a$poll or the worker loop between task_yield and task_parked or dy_reenqueue
  • Whether both bouncer_a and bouncer_b are parked with no runnable task

Investigation findings (Apr 15 session — before writing this handoff)

The spin-park codegen is dead code

The prior handoff focused on the spin-park codegen at cg_intrinsic_conc_sched.qz:2988-3086. That code IS still emitted, but it’s emitted inside plain @park_worker() / @bouncer_a() functions. Those plain functions are never called:

$ grep -c "call i64 @bouncer_a\|call i64 @park_worker\|call i64 @pc_consumer" /tmp/spk.ll
0

go bouncer_a() goes through the lowering chain at mir_lower.qz:989-996:

# Named function: reuse existing async lowering
ast::ast_set_kind(s, node, node_constants::NODE_ASYNC_CALL)
var frame = ctx.mir_lower_expr(s, node)  # creates __Future_bouncer_a$new call
ast::ast_set_kind(s, node, node_constants::NODE_GO)
var spawn_args = vec_new()
spawn_args.push(frame)
ctx.mir_emit_intrinsic("sched_spawn", spawn_args)

This produces __Future_bouncer_a$new() + __Future_bouncer_a$poll(). The $poll function is generated by mir_lower_gen::mir_lower_async_poll which sets _gen_active = 2 and lowers the AST body as a state machine. Inside that lowering, sched_park matches the async branch at mir_lower_expr_handlers.qz:1897-1903:

if func_name == "sched_park" and mir_lower_gen::mir_gen_is_active() >= 2
  var park_args = vec_new()
  ctx.mir_emit_intrinsic("sched_set_park_pending", as_int(park_args))
  mir_lower_gen::mir_emit_async_suspend_point(ctx, -1, 0)
  return ctx.mir_emit_const_int(0)
end

Confirmed by IR inspection: @__Future_bouncer_a$poll contains 2 sched_set_park_pending stores and 4 resume-block labels, and zero spk. (spin-park register prefix) instances. The async path IS the hot path.

But @bouncer_a() is still emitted as a plain function with the full spin-park IR. It’s dead weight — zero callers. Delete it or suppress emission when a function is an async target.

Why was the prior handoff wrong?

The Apr 12 handoff was written before the go-lambda state machine (commit 10bc4539, landed Mar 27) had been fully audited against named-function go. The Apr 12 author saw the spin-park codegen and assumed it was live. It’s not. The state machine path shipped in mir_lower_go_lambda_poll + the NODE_GO → NODE_ASYNC_CALL rewrite at mir_lower.qz:990.

Current park/wake protocol (full detail)

Frame layout (relevant slots):

  • frame[0] = state (int): -1=done, 0=start, 1..N=resume point after Nth suspend
  • frame[5] = park_state (atomic): 0=RUNNING, 1=PARKED, 2=WAKE_PENDING, 3=SPIN_PARKED

State 3 (SPIN_PARKED) is currently only reachable via the dead spin-park path. After cleanup it can be retired.

Async sched_park (hot path):

mir_lower_expr_handlers.qz:1897-1903:

  1. Emit sched_set_park_pending intrinsic → stores 1 to @__qz_sched_park_pending (TLS).
  2. Emit mir_emit_async_suspend_point(ctx, -1, 0) which:
    • Records this function as a suspendable leaf.
    • Increments _gen_yield_counter to get next resume state.
    • Saves all locals to the frame.
    • Stores state = next_state to frame[0].
    • Sets block terminator to RETURN 0 (yield signal).
    • Creates a yield_resume resume block for later re-entry.

The $poll function returns 0 (yield) to the worker loop.

Worker loop task_yield handler (codegen_runtime.qz:2100-2147):

task_yield:
  %pk.flag = load i64, ptr @__qz_sched_park_pending  ; TLS
  %pk.is_park = icmp ne i64 %pk.flag, 0
  br i1 %pk.is_park, label %task_parked, label %do_yield

task_parked:
  store i64 0, ptr @__qz_sched_park_pending          ; clear TLS flag
  %pk.ps.p = getelementptr i64, ptr %task.p, i64 5
  %pk.cas = cmpxchg ptr %pk.ps.p, i64 0, i64 1 acq_rel monotonic  ; 0→1
  %pk.old = extractvalue {i64, i1} %pk.cas, 0
  %pk.was_wp = icmp eq i64 %pk.old, 2               ; was it WAKE_PENDING(2)?
  br i1 %pk.was_wp, label %task_park_woken, label %task_park_done

task_park_woken:
  ; wake arrived between sched_park's set-flag and here → reset + re-enqueue
  store atomic i64 0, ptr %pk.ps.p monotonic, align 8
  call void @__qz_sched_local_push(i64 %wid, i64 %task.i)
  br label %loop

task_park_done:
  ; successfully parked — task is dormant, worker goes back to loop
  call void @__qz_trace_emit(i64 7, i64 %task.i, i64 0)
  br label %loop

__qz_sched_wake(task) runtime (codegen_runtime.qz:2480-2516):

entry:
  %wk.frame = inttoptr i64 %task to ptr
  %wk.ps.p = getelementptr i64, ptr %wk.frame, i64 5
  %wk.cas1 = cmpxchg ptr %wk.ps.p, i64 1, i64 0 acq_rel monotonic  ; PARKED(1)→RUNNING(0)
  %wk.ok1 = extractvalue {i64, i1} %wk.cas1, 1
  br i1 %wk.ok1, label %wk_reenqueue, label %wk_try_pending

wk_reenqueue:
  ; task was PARKED, now RUNNING — push onto worker queue
  %wk.wid = load i64, ptr @__qz_current_worker_id
  %wk.is_worker = icmp sge i64 %wk.wid, 0
  br i1 %wk.is_worker, label %wk_local, label %wk_global
wk_local:
  call void @__qz_sched_local_push(i64 %wk.wid, i64 %task)
  ret void
wk_global:
  call void @__qz_sched_reenqueue(i64 %task)
  ret void

wk_try_pending:
  %wk.cas3 = cmpxchg ptr %wk.ps.p, i64 3, i64 0 acq_rel monotonic  ; SPIN_PARKED(3)→RUNNING(0)
  %wk.ok3 = extractvalue {i64, i1} %wk.cas3, 1
  br i1 %wk.ok3, label %wk_spin_done, label %wk_set_pending
wk_spin_done:
  ret void                         ; spin-park path — never fires post-cleanup

wk_set_pending:
  ; task not yet parked → set WAKE_PENDING so worker loop re-enqueues
  %wk.cas2 = cmpxchg ptr %wk.ps.p, i64 0, i64 2 acq_rel monotonic
  ret void                         ; note: doesn't check success!

Race hypothesis (verify this with the stack trace)

The park protocol interleaves task-side and wake-side state transitions. The task side proceeds in five steps:

  1. Task: emit sched_set_park_pending → TLS flag = 1.
  2. Task: emit mir_emit_async_suspend_point → saves locals, stores state, returns 0.
  3. Worker loop: reads ret=0, enters task_yield.
  4. Worker loop: reads TLS flag, goes to task_parked.
  5. Worker loop: CAS frame[5] 0→1.

Meanwhile, another task B can call sched_wake(A) at any of five points:

B’s wake at         | A’s state  | wake does                               | Result
--------------------|------------|-----------------------------------------|-------
Before (1)          | frame[5]=0 | CAS 1→0 fail, CAS 3→0 fail, CAS 0→2 ok  | WAKE_PENDING; A’s task_parked will re-enqueue — OK
Between (1) and (2) | frame[5]=0 | same as above                           | OK
Between (2) and (3) | frame[5]=0 | same as above                           | OK (TLS flag is still 1)
Between (3) and (5) | frame[5]=0 | same as above                           | Race!
After (5)           | frame[5]=1 | CAS 1→0 ok, re-enqueue                  | OK

The race window is between (3) and (5) — specifically between the store of 0 to @__qz_sched_park_pending at line 2107 and the CAS at line 2111. Between these two instructions, B can observe frame[5]=0 (A hasn’t parked yet) and CAS 0→2 (WAKE_PENDING). A’s CAS at 2111 then reads %pk.old=2, takes the task_park_woken branch, stores 0, and re-enqueues. That looks correct.

But there’s a subtler issue. Look at line 2106-2107:

task_parked:
  store i64 0, ptr @__qz_sched_park_pending   ; clear TLS flag

This store is not atomic and not marked release. If the compiler or CPU reorders it with the subsequent CAS, or if the TLS storage isn’t memory-ordered against the CAS, another thread observing frame[5]=1 could still see the stale park_pending=1. That’s unlikely to cause A itself to hang since A is the only writer of A’s own TLS flag, but it could cause an incorrect re-park if A is somehow re-scheduled before the CAS commits.

More likely hypothesis: the wk_set_pending CAS at line 2514 does not check its success. If A’s worker loop has already transitioned frame[5] from 0 to 1 (task_parked branch) BETWEEN wake’s CAS 1→0 attempt (line 2489) and wake’s CAS 0→2 attempt (line 2514), then wake’s 0→2 CAS fails silently — the task is parked with no one to wake it.

Sequence:

  1. A runs $poll, calls sched_park → sets TLS flag, returns 0.
  2. A’s worker reads TLS flag = 1.
  3. A’s worker clears TLS flag (store 0).
  4. B calls sched_wake(A). wake: CAS 1→0 — FAILS (frame[5] is 0). Branch wk_try_pending.
  5. A’s worker executes CAS 0→1 — SUCCEEDS. frame[5] = 1. Branch task_park_done. “Parked” trace. Loop.
  6. B: CAS 3→0 — FAILS (frame[5] is 1). Branch wk_set_pending.
  7. B: CAS 0→2 — FAILS (frame[5] is 1, not 0). Wake is lost.
  8. A is parked forever. Hang.

Fix: wk_set_pending must handle the CAS-0→2-failure case. Specifically: if the CAS fails because frame[5] is now 1 (PARKED), retry as CAS 1→0 + reenqueue. Loop until one of the CASes succeeds or frame[5] is already 2/0.

wk_set_pending:
  ; retry loop: state may have raced from 0 to 1
  %wk.state = load atomic i64, ptr %wk.ps.p acquire, align 8
  %wk.is1 = icmp eq i64 %wk.state, 1
  br i1 %wk.is1, label %wk_retry_parked, label %wk_try_cas0
wk_try_cas0:
  %wk.cas2 = cmpxchg ptr %wk.ps.p, i64 0, i64 2 acq_rel acquire
  %wk.ok2 = extractvalue {i64, i1} %wk.cas2, 1
  br i1 %wk.ok2, label %wk_done, label %wk_set_pending  ; CAS failed → state raced to 1; retry
wk_retry_parked:
  ; state is 1 (just parked) — retry CAS 1→0 + reenqueue
  %wk.cas1r = cmpxchg ptr %wk.ps.p, i64 1, i64 0 acq_rel monotonic
  %wk.ok1r = extractvalue {i64, i1} %wk.cas1r, 1
  br i1 %wk.ok1r, label %wk_reenqueue, label %wk_set_pending
wk_done:
  ret void

This is a standard CAS-retry pattern — loop until state is in a stable terminal form (RUNNING with WAKE_PENDING set, or PARKED that we then unpark). Rust’s parking_lot crate uses this shape.

Verify: after writing the retry loop, re-run the 200-iteration stress. Must be 200/200 pass across multiple trials.


Research: how Rust, Go, Zig, Erlang handle park/wake

Why this matters (D2): every mature concurrent runtime has ironed out CAS protocols for park/wake. We should not invent our own protocol if a well-researched design exists.

Rust / Tokio — stackless state machines + Waker

  • Park = return Poll::Pending. The task yields to the runtime. Cannot park outside async fn.
  • Wake = call Waker::wake() which enqueues the task.
  • Waker is Send/Sync — safe to call from any thread.
  • Memory model: Waker::wake uses AtomicUsize with seq_cst ordering around the enqueue.
  • The race: Rust solves it via AtomicUsize state on the task with values [NOTIFIED, RUNNING, PARKED] and a CAS retry loop in both park and wake. When park loses the race (state became NOTIFIED), it returns immediately. When wake loses (state transitioned), it retries or drops depending on semantics.
  • Key design: the Waker state machine is tokio::runtime::task::state — 180 lines, worth reading. It has separate bits for NOTIFIED, RUNNING, COMPLETE, CANCELLED, and uses AtomicUsize + CAS retries for every transition.
  • parking_lot::Parker: simpler primitive. Uses 3 states (EMPTY, PARKED, NOTIFIED) + atomic CAS retry. About 200 lines of Rust. Source: https://github.com/Amanieu/parking_lot

Go runtime — stackful coroutines (gopark/goready)

  • Park = gopark(unlockf, lock, reason, trace, skip). Saves G context to the G struct via assembly, switches to the M’s scheduler stack. The unlockf callback provides atomicity against the wake.
  • Wake = goready(g, skip). Sets G’s status to Grunnable, puts it on the runqueue.
  • Atomicity trick: gopark takes a waitlock + unlockf. The runtime holds waitlock while transitioning G from Grunning to Gwaiting, THEN calls unlockf which releases waitlock. This creates a linearizable park-or-wake-loses-race window.
  • State machine: G status atomics — Grunning, Gwaiting, Grunnable. All transitions via atomic.Store / atomic.Cas.
  • Why Quartz can’t do this cleanly: Quartz is stackless (Rust model). No context switch, no “scheduler stack”. The Go approach requires stackful goroutines.
  • What Quartz can borrow: the unlockf pattern. In Quartz terms: sched_park takes an optional “check” callback that runs inside the worker loop AFTER the CAS but BEFORE blocking. If the check says “don’t park, wake already arrived,” the worker re-enqueues instead of parking. This is a cleaner alternative to WAKE_PENDING.

Zig (current) — explicit state machines, Thread.ResetEvent

  • std.Thread.ResetEvent is the park primitive: wait() blocks until set(), and a set() that lands before the wait() is never lost — the event stays set until reset(). Futex-backed on Linux.
  • Zig dropped language-level async/await in the 0.11 self-hosted compiler (a redesign is planned), so suspension is expressed as explicit state machines plus event primitives like ResetEvent.

Erlang / BEAM — stackful processes with preemption

  • Each process has its own heap + stack + message queue.
  • “Park” = waiting on a message receive. Implemented via setjmp/longjmp save of the scheduler thread’s C stack + a process-level wait flag.
  • Wake = scheduler signals the wait flag and re-queues the process.
  • Takeaway: BEAM has no “park lost” race because all suspension goes through message receive with a mailbox. The mailbox IS the wake notification channel. The wake state is the presence of a message.
  • Design lesson for Quartz: if sched_park’s race is too hard to fix directly, consider routing park/wake through a per-task wait semaphore (atomic counter + futex/cond). Park = decrement-or-block. Wake = increment-and-signal. This is a proven pattern — it’s the shape underneath futex-based mutexes and semaphores.

libuv — never blocks the worker

  • libuv workers never park. If work needs to wait, it registers a callback and the loop resumes. epoll/kqueue/IOCP dispatch.
  • Not directly applicable but reinforces the M:N invariant: workers should never spin-wait for other tasks to finish. If they do, you’ve lost the point of M:N.

Parking-lot crate — the gold standard for Rust-side

  • ~2400 lines of CAS-retry state machines. Handles 3 states (EMPTY, PARKED, NOTIFIED), rich futex integration, signal safety.
  • Uses AtomicUsize with Release-Acquire ordering — weaker than seq_cst but enough for single-producer-single-consumer park/wake.
  • Key insight: wake must retry if the observed state isn’t what was expected. CAS failure is not an error, it’s a signal to re-read and try again.

Synthesis — what to do for Quartz

Quartz is stackless (Rust model). The right design is:

  1. park_state is an atomic with exactly 3 values: RUNNING(0), PARKED(1), NOTIFIED(2). Drop SPIN_PARKED(3).
  2. All CAS operations live in a retry loop. Never fail silently.
  3. Release-Acquire ordering is sufficient. No seq_cst.
  4. The task_parked branch in the worker loop: after the CAS 0→1 succeeds, the task is parked and the worker is free. If CAS 0→1 fails with observed value=2, the task is NOTIFIED — reset to 0 and re-enqueue.
  5. sched_wake: retry loop. Load state. If 0, CAS 0→2 (set NOTIFIED). If 1, CAS 1→0 + re-enqueue. If 2, already notified, nothing to do. On CAS failure, re-read and retry.

This matches parking_lot exactly. Don’t invent a fourth state.


Implementation plan

Phase 0 — Capture a hang + validate race hypothesis (1-2 hours)

Before writing code, reproduce the hang under sample(1) and confirm the race is in the wk_set_pending branch (or wherever sample points). Do not trust my hypothesis above without data.

Steps:

  1. Un-pend test 4, build, run in a loop capturing hangs (pre-flight script above).
  2. Get 3+ hang stack traces with sample or lldb.
  3. Correlate stack PC against the IR in /tmp/spk.ll — find exactly which basic block each thread is stuck in.
  4. If the hang is NOT in the wake CAS protocol, the hypothesis is wrong — pivot. Read the captured stacks and find the real race.

If you confirm the race, go to Phase 1. If not, write down the new hypothesis and restart Phase 0 with the right target.

Phase 1 — Fix the wake CAS retry (2-4 hours)

Edit codegen_runtime.qz:2480-2516 (__qz_sched_wake). Replace the single-attempt wk_try_pending → wk_set_pending chain with a retry loop that handles all three states:

wk_retry:
  %wk.st = load atomic i64, ptr %wk.ps.p acquire, align 8
  switch i64 %wk.st, label %wk_unknown [
    i64 0, label %wk_try_notify
    i64 1, label %wk_try_unpark
    i64 2, label %wk_already_notified
  ]
wk_try_notify:
  ; CAS 0→2 (SSA names must be unique per function, hence the suffixes)
  %wk.cas.n = cmpxchg ptr %wk.ps.p, i64 0, i64 2 acq_rel acquire
  %wk.ok.n = extractvalue {i64, i1} %wk.cas.n, 1
  br i1 %wk.ok.n, label %wk_done, label %wk_retry  ; retry on CAS failure
wk_try_unpark:
  ; CAS 1→0 + reenqueue
  %wk.cas.u = cmpxchg ptr %wk.ps.p, i64 1, i64 0 acq_rel acquire
  %wk.ok.u = extractvalue {i64, i1} %wk.cas.u, 1
  br i1 %wk.ok.u, label %wk_reenqueue, label %wk_retry
wk_already_notified:
  ret void                          ; another wake already set NOTIFIED
wk_unknown:
  ; state 3 (SPIN_PARKED) — should never fire post-cleanup
  ret void
wk_reenqueue:
  ; (existing local-push / global-reenqueue code)
wk_done:
  ret void

Rebuild, test 4 un-pended, stress 200+ runs. Must be 200/200 pass across at least 3 independent builds.

Phase 2 — Fix the task_parked symmetric race (1-2 hours)

The worker loop task_parked handler at codegen_runtime.qz:2100-2124 has a parallel race. The fix is symmetric:

Current:

task_parked:
  store i64 0, ptr @__qz_sched_park_pending
  cmpxchg frame[5], 0, 1  ; only tries ONE transition
  ...

The current code handles the only expected cases (CAS succeeds = parked; CAS fails because the value was 2 = NOTIFIED). But if multiple wakes interleave and the state ends up at an unexpected value, the invariant is violated. Audit: are there ANY sequences where frame[5] can be something other than 0 or 2 at this point?

  • Could be 3 (SPIN_PARKED) — only if dead code ran. Verify it’s fully dead after Phase 3.
  • Could be 1 — only if someone else parked the task. Impossible: one task has one state machine.

So current code is fine assuming 3 is dead. If Phase 3 kills the dead path, no race here. Document the invariant in the IR comment.

Phase 3 — Delete dead code (1-2 hours)

Delete:

  1. cg_intrinsic_conc_sched.qz:2988-3086 — spin-park codegen. Replace with:

    if name == "sched_park"
      # sched_park is only valid in async $poll context. The MIR-level handler
      # at mir_lower_expr_handlers.qz:1897 intercepts the async case. If we
      # reach here, it means sched_park was called from sync code — unreachable
      # after Phase 3b/4 of this sprint. Emit a trap.
      codegen_util::cg_emit_line(out, "  call void @__qz_abort_with(i64 ptrtoint (ptr @.str.sched_park_sync to i64))")
      codegen_util::cg_emit_line(out, "  unreachable")
      return 1
    end
  2. cg_intrinsic_conc_sched.qz:3089-3097sched_set_park_pending is internal and should not be user-callable. Mark it @internal or remove from the builtin table in typecheck_builtins.qz:279. User-code callers → compile error.

  3. codegen_runtime.qz:2504-2510 — the wk_try_pending → wk_spin_done path (CAS 3→0). Dead once spin-park is gone. Delete and fold wk_try_pending into wk_set_pending / Phase 1 retry loop.

  4. Plain function emission for async-target functions. Grep for define i64 @bouncer_a etc. in a test 4 build — they’re emitted but never called. Find the codegen path that emits plain bodies for functions that are also async targets, and suppress. Saves ~1% binary size on every program that uses go. Not a correctness fix but a code-cleanliness one (D1 / D7).

  5. frame[5] SPIN_PARKED(3) constant. Comment it out / delete from the state docs at codegen_runtime.qz:903.

Don’t forget: update the comment at codegen_runtime.qz:903 to reflect the new 3-state model (0=RUNNING, 1=PARKED, 2=NOTIFIED).

Phase 4 — Reject sched_park outside async context (1-2 hours)

In typecheck_walk.qz, when type-checking a NODE_CALL to sched_park, check whether the enclosing function is an async target. If not, emit a compile error:

QZ0210: sched_park() can only be called from async contexts.
       Rewrite the caller as `async def` (via the EFFECT_SUSPEND taint)
       or call sched_park inside a go lambda body.

The taint analysis at mir_lower_async_registry.qz:76-133 (mir_mark_suspendable) already propagates EFFECT_SUSPEND through the call graph. Piggyback on it: if a user function calls sched_park (or calls a function that calls sched_park), it must itself be an async target. This is already ensured by the go fn()NODE_ASYNC_CALL rewrite at mir_lower.qz:990fn automatically becomes an async target. So in practice the new compile error fires only for “called sched_park from main() without a surrounding go,” which is already a logic error.

Rationale: this codifies the Rust model — park is an async-only primitive. It also prevents the “I forgot to wrap in go()” footgun.

Phase 5 — Unblock async Mutex/RwLock (roadmap item #15, 2-3 hours)

Roadmap item #15 is blocked purely on “scheduler refactor.” After Phases 0-4, it’s unblocked. async_mutex_lock and async_rwlock_read/write already exist at mir_lower_expr_handlers.qz:1905-1918. They use the same _gen_active >= 2 gate and route through mir_emit_async_mutex_lock etc.

What’s likely missing:

  • Verification that they compose correctly with the fixed park/wake protocol.
  • Spec coverage: async_mutex_spec.qz, async_rwlock_spec.qz. Check if these exist or need writing.

Don’t scope-creep into building these if Phases 0-4 eat the session. File as follow-up.

Phase 6 — Un-pend test 4 + add stress tests (30 min)

# spec/qspec/sched_park_spec.qz
it("repeated park/wake on same task") do ->
  assert_eq(test_repeated_park_wake(), 3)
end

# Add stress variants
it("200 iterations of repeated park/wake") do ->
  # Run the bouncer loop with iter_count=200
  assert_eq(test_repeated_park_wake_stress(200), 200)
end

it("4 parallel bouncer pairs") do ->
  # 8 tasks, 4 park, 4 wake — exercise multi-worker wake races
  assert_eq(test_parallel_bouncer(4), 4)
end

Verification gates (run after every fix attempt)

# 1. Full self-compile measurement (ensure no memory regression from Phase 3b.next.2)
./self-hosted/bin/quartz --no-cache --memory-stats \
  -I self-hosted/frontend -I self-hosted/middle -I self-hosted/backend \
  -I self-hosted/shared -I std -I tools \
  self-hosted/quartz.qz > /dev/null 2>/tmp/mem_post.txt
grep '\[mem\]' /tmp/mem_post.txt
# Expected: typecheck ~737 MB, peak ~7810 MB. No regression.

# 2. Guard (mandatory)
./self-hosted/bin/quake guard 2>&1 | tail -10

# 3. Smoke
./self-hosted/bin/quake smoke 2>&1 | tail -6

# 4. Scheduler regression sweep
for spec in sched_park_spec sched_lifecycle_spec scheduler_spec spawn_await_spec \
            sched_idle_hook_spec sched_sleep_spec concurrency_spec colorblind_async_spec \
            async_channel_spec; do
  echo "=== $spec ==="
  FILE=spec/qspec/${spec}.qz ./self-hosted/bin/quake qspec_file 2>&1 | tail -2
done

# 5. Park stress — THE gate (at least 3 independent 200-run trials)
./self-hosted/bin/quartz spec/qspec/sched_park_spec.qz -I self-hosted -I tools -I std 2>/dev/null > /tmp/spk.ll
llc /tmp/spk.ll -o /tmp/spk.s && clang /tmp/spk.s -o /tmp/spk_bin -lpthread -lm
for trial in 1 2 3; do
  pass=0; hang=0
  for i in $(seq 1 200); do
    result=$(timeout 5 /tmp/spk_bin 2>&1); code=$?
    if [ $code -eq 124 ]; then hang=$((hang+1)); else pass=$((pass+1)); fi
  done
  echo "trial $trial: pass=$pass hang=$hang"
done
# Expected: 600 pass, 0 hang across 3 trials.

# 6. Concurrency stress specs (if 3 trials green)
for spec in concurrency_stress_spec backpressure_spec fairness_spec; do
  echo "=== $spec ==="
  FILE=spec/qspec/${spec}.qz timeout 180 ./self-hosted/bin/quake qspec_file 2>&1 | tail -3
done
# Note: backpressure_spec and semaphore_spec are in the roadmap as "blocked on scheduler refactor."
# If they now PASS, update the roadmap. If they still TIMEOUT, that's a separate bug.

Success criteria

Minimum viable:

  • Test 4 passes 600/600 runs across 3 independent trials.
  • All existing scheduler specs still pass.
  • quake guard + quake smoke green.
  • No memory regression in self-compile (peak ~7.81 GB).
  • Commit message explains the race + the fix.

Target:

  • All of the above, PLUS:
  • Spin-park dead code deleted (Phase 3).
  • backpressure_spec and/or semaphore_spec go from TIMEOUT to PASS (they’re roadmap-blocked on this).
  • Compile error on sched_park called from non-async context (Phase 4).

Stretch:

  • All of the above, PLUS:
  • Async Mutex/RwLock specs shipped (Phase 5 → roadmap item #15 closed).

Failure mode — when to hand off to the next session

If Phase 0 (capturing a hang + understanding the race) takes more than 4 hours without clear root cause, STOP and write up what you found. The honest report is more valuable than a forced fix. Specifically:

  1. Commit the stack traces and any analysis to a docs/handoff/sched-park-investigation.md.
  2. File the race as SCHED-PARK-RACE in the roadmap.
  3. Move on to Phase 4 (typecheck error) anyway — it doesn’t depend on the fix landing, and it codifies the invariant that sched_park is async-only. This ships useful work even without the race fix.
  4. Do NOT attempt Phase 1 blindly without confirmed hypothesis — CAS protocols are the wrong place for “try something and see.” You will introduce different races.

Key files quick reference

Area                                     | File                                             | Lines
-----------------------------------------|--------------------------------------------------|------
sched_park MIR handler (async path)      | self-hosted/backend/mir_lower_expr_handlers.qz   | 1897-1903
sched_park codegen (dead spin path)      | self-hosted/backend/cg_intrinsic_conc_sched.qz   | 2988-3086
sched_set_park_pending codegen           | self-hosted/backend/cg_intrinsic_conc_sched.qz   | 3089-3097
sched_wake codegen                       | self-hosted/backend/cg_intrinsic_conc_sched.qz   | 3099-3112
__qz_sched_wake runtime                  | self-hosted/backend/codegen_runtime.qz           | 2480-2516
Worker loop task_yield / task_parked     | self-hosted/backend/codegen_runtime.qz           | 2100-2147
Frame layout docs                        | self-hosted/backend/codegen_runtime.qz           | 895-920
Async suspend point emission             | self-hosted/backend/mir_lower_gen.qz             | 220-268
Suspendable taint analysis               | self-hosted/backend/mir_lower_async_registry.qz  | 39-133
go → async lowering                      | self-hosted/backend/mir_lower.qz                 | 904-996
Test file                                | spec/qspec/sched_park_spec.qz                    | 1-229
sched_park typecheck rejection (Phase 4) | self-hosted/middle/typecheck_walk.qz             | NODE_CALL case around func_name == "sched_park" (doesn’t exist yet)

Prime directives check

  • D1 (highest impact): this is the #1 P0 work. 1% hang rate is blocking async Mutex/RwLock + 4 timeout specs. Worth eating a session.
  • D2 (research first): parking_lot and tokio::runtime::task::state are the references. Read them BEFORE touching the CAS protocol. 30-60 min of reading saves hours of debugging.
  • D3 (pragmatism ≠ cowardice): Phase 1 (CAS retry loop) is the minimum correct fix. Phase 3 (delete dead code) + Phase 4 (typecheck reject) are the cleanup that makes the invariant hold forward. Phase 5 is the credit — don’t shortcut.
  • D4 (work spans sessions): if Phase 0 shows the hypothesis is wrong, pivot. Don’t force a fix within this session if the evidence doesn’t support it. File + hand off.
  • D5 (report reality): the 3-trial 600-run stress gate is load-bearing. “It worked on my one run” is not evidence. Run 3 independent trials with fresh binaries.
  • D6 (holes get filled or filed): if you find OTHER races in the scheduler while doing this (likely — scheduler code is dense), file them in the roadmap immediately, don’t silently move on.
  • D7 (delete freely): Phase 3 is load-bearing. The spin-park codegen, the SPIN_PARKED state, and the plain-function emission for async targets are dead code. Remove them in the same commit as the fix. No “legacy” comments.
  • D8 (binary discipline): quake guard before every commit. Fix-specific backup (quartz-pre-park-wake-golden) before touching anything. The 200-run stress is the smoke test equivalent for this work — the standard quake smoke won’t catch 1% races.
  • D9 (quartz-time): one session, 6-12 hours of focused work. Don’t pad.
  • D10 (corrections are calibration): if the stack trace shows the race is NOT where I hypothesized, update and move. Don’t perform “I was right actually” — just refocus on the real target.

Pointers to background

  • docs/HANDOFF_PRIORITY_SPRINT.md — prior handoff (Apr 12). Items 2-4 complete. Item 1 = this work. Implementation plan inside is partially wrong — see §Investigation.
  • commit 10bc4539 — “Concurrency V4: park/wake, async mutex/rwlock, async generators.” Original landing of sched_park infrastructure. First place to look for the original design intent.
  • commit b61da0f4 — “B2: Scheduler state machine + free-without-zero audit.” Recent scheduler lifecycle refactor. Introduced @__qz_sched_state atomic. Model for how to do clean atomic state machines in Quartz codegen.
  • commit ab443188 — “proc_suspend(pid) intrinsic — pidfd / EVFILT_PROC child exit.” Another recent example of a scheduler-integrated suspend primitive done right.
  • tokio::runtime::task::state source — canonical stackless task state machine. Read this.
  • parking_lot::Parker source — canonical CAS retry park protocol. Simpler than tokio. Read this.
  • Go runtime proc.go gopark — for reference only; stackful, not directly applicable but the unlockf pattern is clever.