Sprint 1 Handoff — Soundness: json SIGSEGV + cache-pattern miscompile
Baseline: 134152b5 (trunk, clean, fixpoint 2026 functions, guard passed)
Estimated effort: 2-3 quartz-days (1 session each for the two items, or one long session for both)
Session goal: Close two S-tier soundness bugs that block the “compiler completeness” tier. Both are real bugs (not test infrastructure), both have clean reproducers, both unblock downstream polish work.
Handoff Prompt (copy-paste to start next session)
Read `docs/handoff/next-session-sprint-1-soundness.md` and execute Sprint 1: fix
the json_spec SIGSEGV and root-cause the cache-pattern miscompile. Both are
S-tier soundness bugs tracked on ROADMAP.md.
Priority Philosophy (from ROADMAP header): compiler completeness comes FIRST.
Docs, package manager, demos, VS Code extension, blog post — all ceremony,
all DEFERRED until the compiler is bug-free and every spec passes. The
anti-toy-language rule: don't write docs for a compiler with open holes.
Prime Directives v2 compact form:
1. Pick highest-impact, not easiest.
2. Design before building; research Rust/Go/Zig/Haskell first.
3. Pragmatism ≠ cowardice; shortcuts = cowardice. Name the path.
4. Work spans sessions. Don't compromise a design because context is ending.
5. Report reality, not optimism. No weasel words.
6. Holes get filled or filed.
7. Delete freely, no compat layers. Pre-launch, zero users.
8. Binary discipline = source discipline. quake guard mandatory. Fixpoint
not optional. Smoke tests not optional. Fix-specific backups not optional.
9. Quartz-time estimation. Traditional ÷ 4.
10. Corrections are calibration, not conflict.
Tree is clean at 134152b5. Guard stamp valid. Smoke (brainfuck + style_demo)
green. Maranget exhaustiveness checker just shipped and validated.
Task 1: json_spec SIGSEGV (priority: soundness bug in stdlib)
What we know (from fresh crash report, Apr 16 20:39):
Signal: SIGSEGV (EXC_BAD_ACCESS, KERN_INVALID_ADDRESS)
Faulting PC: json_stringify_object_pretty + 1832
Faulting addr: 0x0000000000000018 (offset 24 of a null/bad pointer)
Stack (most-recent first):
json_stringify_object_pretty + 1832 ← crash
json_stringify_object_pretty + 1768 ← recursive/re-entry call
json_stringify_pretty_inner + 540
__lambda_21 + 244 ← an `it("...") do -> ... end` body
it + 588
__lambda_18 + 272 ← a `describe() do -> ... end` body
describe + 392
qz_main + 420
main + 32
Where this lives: std/json.qz:1185-1216 (function json_stringify_object_pretty).
The likely trigger (only pretty-print tests hit this code path):
spec/qspec/json_spec.qz:121-143 — the describe("json pretty printing") block. Most suspicious test:
it("pretty prints simple object") do ->
var obj = json_object()
json_object_set(obj, "name", json_string("test"))
var result = json_stringify_pretty(obj, 2)
assert_eq(str_contains(result, "\"name\""), true)
assert_eq(str_contains(result, "\"test\""), true)
end
Reproducer (ONE command):
./self-hosted/bin/quartz spec/qspec/json_spec.qz 2>/dev/null > /tmp/js.ll \
&& llc -filetype=obj /tmp/js.ll -o /tmp/js.o \
&& clang /tmp/js.o -o /tmp/js -lm -lpthread \
&& /tmp/js; echo "exit: $?" # prints "exit: 139"
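Output is empty: the binary crashes before any ✓ or ✗ reaches the terminal. The crash report pins the fault to one specific it lambda (__lambda_21, reached via qz_main → describe → it → pretty-print), which is not necessarily the first test to run overall. If earlier tests did run, their output was presumably still buffered when the process died.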
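The "exit: 139" in the reproducer follows the POSIX shell convention of reporting a signal death as 128 plus the signal number. A minimal Python sketch of that decoding is below, as a quick way to confirm that a later run still dies of SIGSEGV rather than some other signal. The helper name describe_exit_status is hypothetical and not part of the Quartz tooling, and the "greater than 128" rule is a shell convention, not a guarantee.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Decode a shell-reported exit status (the value of $?).

    Assumes the usual shell convention: a status above 128 means the
    child was killed by signal number (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited normally with code {status}"


if __name__ == "__main__":
    print(describe_exit_status(139))
    print(describe_exit_status(0))
```

After a fix, the same helper applied to the new exit status should report a normal exit with code 0.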
Investigation starting points (ranked):
1. Does map_new() without type args create the right map? std/json.qz:121-123:

       def json_object(): JsonValue
         return JsonValue::Object(map_new())
       end

   No <K, V> annotation. String keys are used downstream (json_object_set(obj, "name", ...)). If map_new() defaults to intmap and gets String-coerced keys, map_keys would return Int values that aren’t valid String pointers → crash when used as String in json_escape_string(key) or passed to map_get. Grep for how map_new() resolves without annotation in typecheck_expr_handlers.qz:1487-1571 (the dispatch rewrite block).

2. Which line in json_stringify_object_pretty is PC+1832? Disassemble:

       otool -tvV /tmp/js | awk '/_json_stringify_object_pretty:/{f=1} f{print} /^$/{if(f)exit}' > /tmp/js_obj.s
       # Then compute function start address from otool, add 1832 (0x728), and
       # find the instruction. It's likely a `ldr x?, [x?, #24]` or similar
       # offset-24 load.

   Offset 0x18 reads often come from: Vec struct field access (cap is at offset 24 in some layouts), hashmap header fields, or struct field at index 3. Identify the exact load to isolate which variable is null.

3. Does adding explicit type annotation fix it? Patch std/json.qz:122 to map_new<String, JsonValue>() and rebuild. If the crash disappears, the bug is map_new() without annotation producing a wrong-shape map. That’s a real compiler bug (needs filing) but a stdlib patch is a legitimate workaround.

4. lldb walkthrough (if #1-3 don’t isolate it):

       lldb /tmp/js
       (lldb) b json.qz:1186          # map_keys(fields) line
       (lldb) r
       (lldb) frame variable          # inspect fields, keys
       (lldb) p (i64)*(i64*)fields    # dump map header
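The address arithmetic in the disassembly step can be sketched in a few lines of Python. This is only a worked example under stated assumptions: the function start address below is a made-up placeholder (read the real one from the otool output), and the field-index calculation assumes a 64-bit target with 8-byte fields and a null base pointer.

```python
WORD_SIZE = 8  # assumption: 64-bit target, every field is 8 bytes wide


def faulting_pc(function_start: int, pc_offset: int) -> int:
    """Absolute address of the faulting instruction inside a function."""
    return function_start + pc_offset


def field_index(fault_addr: int, word_size: int = WORD_SIZE) -> int:
    """Which word-sized field a near-null access touched.

    Assumes the base pointer was null (0), so the faulting address is
    just the field offset.
    """
    return fault_addr // word_size


if __name__ == "__main__":
    start = 0x100003A40  # hypothetical placeholder, NOT taken from a real binary
    print(hex(faulting_pc(start, 1832)))  # address to look up in /tmp/js_obj.s
    print(field_index(0x18))              # index of the field at offset 24
```

Under those assumptions, a faulting address of 0x18 corresponds to the field at index 3, which matches the "struct field at index 3" reading above.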
Success criteria:
- /tmp/js exits 0 after a fix (all 17 tests pass)
- No SIGSEGV in any of the 17 tests
- Fixpoint verified via quake guard
- Smoke tests (brainfuck + style_demo + expr_eval) still green
- If the root cause is a compiler bug (not a stdlib bug), file it with a minimal reproducer and write a regression spec
Memory entries worth reading before starting:
- feedback_crash_reports_first.md: crash report reading protocol
- feedback_quartz_module_quirks.md: PSQ-4 Vec element type loss (hit twice recently; similar shape, since “map without type args” could be another instance)
Task 2: Cache-pattern miscompile (priority: latent soundness bug, blocks perf work)
What we know (from ROADMAP “Open compiler / bootstrap issues” + docs/BOOTSTRAP_RECOVERY.md:320):
There’s a latent codegen bug in how the as_int(string) → intmap → as_string(cached) round-trip behaves once the cache grows large. When a binary compiled FROM that caching pattern tries to self-compile, cross-module struct type lookups silently fail with “no struct type for X” after the [mem] resolve_pass1 figure grows past ~9.7 GB.
The workaround (currently active in trunk): both sites use raw string interpolation instead of the intmap cache.
# self-hosted/shared/string_intern.qz:269 — currently:
def mangle(prefix: String, suffix: String): String = "#{prefix}$#{suffix}"
# With the comment at line 257-268 explaining why the cache is disabled.
# self-hosted/middle/typecheck_registry.qz:32 — currently:
def _cached_suffix(tc: TypecheckState, name: String): String = "$#{name}"
# 8 call sites all using this raw-interp version.
# self-hosted/resolver.qz — calls string_intern::mangle at 3 sites
# (lines 242, 1644, 1657). All go through the raw-interp mangle.
What the old cache looked like (from commits afad28c0 and 77d968d5, which you’ll want to git show):
- Keyed by as_int(prefix) * 31 + as_int(suffix) (or similar) in an intmap
- Value was as_int(result_string)
- Read-back via as_string(cached_int) returned the original pointer
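To make the key shape concrete, here is a small Python sketch of the formula as described. It is an illustration only: the handoff says "or similar", so the exact formula in afad28c0 / 77d968d5 is unconfirmed, and the addresses below are invented. One property worth checking when reading the real commits is that a key of this shape is not unique per (prefix, suffix) pair.

```python
def cache_key(prefix_addr: int, suffix_addr: int) -> int:
    # Key shape as described in the handoff; the real formula is an
    # assumption until verified with `git show`.
    return prefix_addr * 31 + suffix_addr


if __name__ == "__main__":
    # Two different pairs of invented, 8-byte-aligned addresses.
    pair_a = (0x1000, 0x2000 + 31 * 8)
    pair_b = (0x1000 + 8, 0x2000)

    key_a = cache_key(*pair_a)
    key_b = cache_key(*pair_b)

    print(pair_a != pair_b)  # the inputs differ
    print(key_a == key_b)    # yet they map to the same cache key
```

If the old cache really used this key without also comparing the original strings on a hit, a collision would return another pair's cached result. That is a separate possibility to rule out alongside the stale-pointer hypothesis, not a confirmed cause.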
Root cause hypothesis (from the recovery doc): The as_int(string) → as_string(int) boundary in codegen produces a stale pointer when the cache is large enough. Somewhere between storing as_int(s) and retrieving as_string(cached), the string moves/is freed, but the cache still holds the old pointer. The 9.7 GB threshold is probably when a vec_reserve/map_rehash relocates the backing store and invalidates all previous interior pointers.
Candidate failure modes (ranked by likelihood):
1. Map rehash relocates strings. Quartz strings are heap-allocated. If the intmap keys/values are stored directly (not interned via g_interner), then a rehash that moves the string data invalidates every as_int(string) that was stored. The cache round-trip reads a dangling pointer.

   Test: store intern_id(s) (Int) instead of as_int(s) (pointer). intern_id returns a stable integer handle.

2. as_int/as_string is not pointer-identity in codegen. The claim is that these are no-ops (same i64 representation). If codegen emits a different IR for pointer-vs-int contexts, the round-trip might go through a cast that’s not bitwise identity.

   Test: emit --dump-mir on a tiny program that calls mangle, verify no fptoui/ptrtoint/inttoptr on the round-trip. Expect direct i64 pass-through.

3. GC / memory pressure moves strings. Quartz doesn’t have a GC, but vec_free / sb_free / explicit frees on temporaries could invalidate stored as_int pointers. The cache would be storing pointers into freed memory.

   Test: Add an explicit string_clone(s) before as_int on store, and string_clone on retrieve. If crashes disappear, it’s ownership-related.
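The difference between caching a raw pointer and caching a stable handle (the idea behind failure modes 1 and 3) can be modelled in a few lines of Python. This is a toy simulation, not Quartz's allocator or interner: ToyHeap and ToyInterner are hypothetical classes, and the method names intern_id / intern_string are borrowed from this handoff only for readability.

```python
class ToyHeap:
    """Toy heap whose objects can move; models relocation or free-and-reuse."""

    def __init__(self):
        self._objects = {}
        self._next_addr = 0x1000

    def alloc(self, value: str) -> int:
        addr = self._next_addr
        self._next_addr += 0x10
        self._objects[addr] = value
        return addr

    def relocate_all(self) -> None:
        """Move every object to a fresh address, invalidating old ones."""
        for old_addr, value in list(self._objects.items()):
            del self._objects[old_addr]
            self.alloc(value)

    def load(self, addr: int):
        # None stands in for "dangling pointer" in this model.
        return self._objects.get(addr)


class ToyInterner:
    """Maps strings to small integer handles that never change."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern_id(self, s: str) -> int:
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def intern_string(self, handle: int) -> str:
        return self._strings[handle]


heap, interner = ToyHeap(), ToyInterner()
mangled = "Point$new"
key = 42  # stand-in for the real cache key

pointer_cache = {key: heap.alloc(mangled)}        # stores a raw address
handle_cache = {key: interner.intern_id(mangled)}  # stores a stable handle

heap.relocate_all()  # models a rehash, realloc, or free of the backing store

stale = heap.load(pointer_cache[key])
stable = interner.intern_string(handle_cache[key])
print(stale, stable)
```

In this model the pointer cache reads back nothing after relocation while the handle cache still returns the original string. Whether Quartz's real failure follows this pattern is exactly what the tests above are meant to establish.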
Reproducer (DESTRUCTIVE — only do after quake guard is green and you have a binary backup):
# 1. Save a fix-specific golden BEFORE any changes (per CLAUDE.md Rule 1)
cp self-hosted/bin/quartz self-hosted/bin/backups/quartz-pre-cache-pattern-golden
# 2. Re-enable the cache in both sites. Start with ONE at a time.
# string_intern.qz:269 — revert to the caching version from afad28c0.
# (Check `git log --oneline -- self-hosted/shared/string_intern.qz` and
# `git show afad28c0 -- self-hosted/shared/string_intern.qz` to see the
# original cache implementation.)
# 3. Rebuild with current binary
./self-hosted/bin/quake build # should succeed — old binary's codegen
# is stable; it produces the new source's IR
# 4. Try self-compile with the NEW binary
./self-hosted/bin/quake build # the NEW binary now runs the cache logic.
# Watch for [mem] resolve_pass1 growing.
# If it reaches ~9.7 GB and fails with
# "no struct type for X" — that's the bug
# reproducing on gen2.
# 5. Recover if broken
cp self-hosted/bin/backups/quartz-pre-cache-pattern-golden self-hosted/bin/quartz
Investigation strategy:
1. Re-enable one site at a time (string_intern first: it has 3 callers, much lower call frequency than _cached_suffix with 8 callers). If that alone triggers the bug, the fix target is clearer. If it doesn’t, add the _cached_suffix cache next and see if the bug scales with cache size.

2. Measure cache size at the failure point. Add an eputs("[mangle-cache] size=#{map_size(cache)}") before the lookup. If the failure threshold is consistent across runs (~9.7 GB), that’s a memory layout invariant.

3. Test the fix. The most likely fix is to store interned IDs instead of raw pointers:

       # OLD (broken):
       map_set(cache, key, as_int(result))
       ...
       as_string(map_get(cache, key).unwrap())

       # NEW (fix candidate):
       map_set(cache, key, intern_id(tc.interner, result))
       ...
       intern_string(tc.interner, map_get(cache, key).unwrap())

   The TODO at string_intern.qz:263-268 mentions this was attempted and failed with “index-out-of-bounds from intern_id returning pointer-sized values.” Investigate WHY intern_id returned a pointer when the contract is “return a stable small integer handle.” Likely the interner table grew past INT_MAX boundary OR the function’s return type was being miscompiled through the intmap<Int, Int> layer.

4. If the fix sticks, measure the win. Roadmap notes ~2 GB RSS recovery (“dedup loses ~2 GB peak RSS that Phase 2 memory optimization had recovered”). Re-enabling the cache should drop peak RSS from 7.81 GB back toward ~5.8 GB.
Success criteria:
- Cache re-enabled at BOTH sites (string_intern.qz:269 and typecheck_registry.qz:32)
- Compiler self-compiles (gen2 succeeds) with cache active
- Fixpoint verified (quake guard)
- No regression in memory: peak RSS stays below the current 7.81 GB (ideally drops ~2 GB per the TODO comment)
- Smoke tests + full spec run (quake qspec) show no new failures
- If the root cause is in codegen (not just “need to store intern IDs”), the codegen bug gets a regression spec
Files to touch:
- self-hosted/shared/string_intern.qz:246-272 (the cache + mangle)
- self-hosted/middle/typecheck_registry.qz:29-32, 121-1022 (the suffix cache + 8 call sites)
- Possibly self-hosted/backend/codegen.qz or codegen_instr.qz if root cause is in IR emission
Memory entries worth reading before starting:
- feedback_binary_backup.md: mandatory backup protocol (Rule 1)
- feedback_fixpoint_enforcement.md: why skipping fixpoint costs 100+ commits
- project_memory_optimization.md: where the original ~2 GB loss came from
- feedback_no_git_checkout_uncommitted.md: stash before checkout
Infrastructure setup (do this first in the next session)
cd /Users/mathisto/projects/quartz
# Verify baseline is clean
git status # should be clean at 134152b5
git log --oneline -5 # verify recent Maranget commits are present
./self-hosted/bin/quake guard:check # should report stamp valid
# Verify smoke passes before starting work
./self-hosted/bin/quartz examples/brainfuck.qz 2>/dev/null \
| llc -filetype=obj -o /tmp/bf.o \
&& clang /tmp/bf.o -o /tmp/bf -lm -lpthread \
&& /tmp/bf | tail -3 # should print "4 passed, 0 failed"
./self-hosted/bin/quartz examples/style_demo.qz 2>/dev/null \
| llc -filetype=obj -o /tmp/sd.o \
&& clang /tmp/sd.o -o /tmp/sd -lm -lpthread \
&& /tmp/sd | head -2 # should print the title line
# Save a session-wide backup (Rule 1 — BEFORE any compiler source changes)
cp self-hosted/bin/quartz self-hosted/bin/backups/quartz-pre-sprint1-golden
Success criteria for the sprint overall
- Both tasks complete with fixpoint verified
- /tmp/js (json_spec binary) exits 0 with all tests green
- quake qspec total failing-spec count decreased (at least -1 for json_spec)
- Either: cache re-enabled with peak RSS reduction OR filed with a deeper codegen bug reproducer if the fix is blocked on a separate issue
- Roadmap updated: move closed items from “Known Bugs” / “Open compiler holes” to “Completed (Recent)”
- Memory files updated if any non-obvious lessons surface
What to AVOID in this sprint
- Don’t touch docs, package manager, demos, VS Code extension, launch-prep work. Those are explicitly last-mile per the priority philosophy. Fixing the compiler is the only job this week.
- Don’t skip quake guard. The Bootstrap Island incident (Apr 9-11) lost 100+ commits because fixpoint was skipped.
- Don’t --no-verify the pre-commit hook. If the hook fails, the commit didn’t happen. Fix the underlying issue, re-stage, re-commit.
- Don’t overwrite quartz-pre-sprint1-golden until the sprint is done and committed. Rolling quartz-golden (the quake guard backup) is fine to overwrite; that’s what it’s for. The sprint-specific golden is your escape hatch.
Next sprint preview (Sprint 2)
After both items land: open up the tail of failing specs (separate_compilation_spec, file_helpers_spec, http2_frame_spec, route_groups_spec, semaphore_spec, tls_async_spec, actor_spec Linux rebuild) and close the IMPL-TRAIT-RPITIT bounded-generic dispatch hole. Goal: zero failing specs in default features.