Quartz Self-Hosted Compiler: Dogfooding Audit
Version: v5.12.37-alpha | Total Lines Audited: 18,904 | Potential Savings: ~5,300 lines
Executive Summary
The self-hosted compiler's four core files total 18,904 lines, of which roughly 5,300 (about 28%) are repetition that could be eliminated through:
- Data-driven tables replacing giant if/elsif chains
- Helper function extraction for repeated patterns
- Macro usage for boilerplate elimination
- Cross-file consolidation (especially intrinsics)
| File | Lines | Potential Savings | Top Issue |
|---|---|---|---|
| typecheck.qz | 5,585 | ~470 | Builtin registration wall (283 lines) |
| codegen.qz | 5,098 | ~1,200-1,500 | Monolithic 3,400-line intrinsic function |
| mir.qz | 4,484 | ~600 (local) + ~2,500 (cross-file) | Intrinsic duplication with codegen |
| parser.qz | 3,737 | ~435-600 | 8 nearly-identical binary op parsers |
| TOTAL | 18,904 | ~5,300 | |
Critical: Cross-File Intrinsic Duplication
The Problem
Intrinsic handling is duplicated across three locations:
- mir.qz:1621-1885 (264 lines): mir_is_intrinsic(), 189 string comparisons
- codegen.qz:420-3819 (3,400 lines): cg_emit_intrinsic(), 138 emission blocks
- typecheck.qz:1012-1295 (283 lines): tc_register_builtin(), 283 registration calls
Each intrinsic is defined in three separate places with no single source of truth.
The Solution: Centralized Intrinsic Registry
# In new file: self-hosted/core/intrinsics.qz
struct IntrinsicDef
name: String
arg_count: Int
return_type: Int
category: IntrinsicCategory
emit_pattern: EmitPattern
end
enum IntrinsicCategory
IO # puts, print, eputs, eprint
String # str_len, str_concat, str_eq, ...
Vec # vec_new, vec_len, vec_push, ...
HashMap # hashmap_new, hashmap_get, ...
Memory # malloc, free, arena_*, pool_*
Atomic # atomic_load, atomic_store, ...
Regex # regex_match, regex_replace, ...
# ...
end
enum EmitPattern
PtrToCall # inttoptr + call (puts, print)
TwoArgCall # Simple two-arg function call
VecAccess # Load header, bounds check, access
MapLookup # Linear search loop
# ~10 patterns cover 138 intrinsics
end
# Single source of truth
INTRINSICS: Vec<IntrinsicDef> = [
IntrinsicDef { name: "puts", arg_count: 1, return_type: TYPE_INT(),
category: IO, emit_pattern: PtrToCall },
IntrinsicDef { name: "print", arg_count: 1, return_type: TYPE_INT(),
category: IO, emit_pattern: PtrToCall },
# ... 187 more
]
# O(1) lookup via hash set
var intrinsic_set: Set<String> = nil
def is_intrinsic(name: String): Bool
intrinsic_set = Set.from_iter(INTRINSICS.map(|i| i.name)) if intrinsic_set.nil?
return intrinsic_set.contains(name)
end
def get_intrinsic(name: String): Option<IntrinsicDef>
# Hash lookup instead of 189 string comparisons
end
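The lookup half of the registry can be modeled in a few lines. The following is a minimal sketch in Python rather than Quartz, so that it runs standalone; the string-valued category and emit_pattern fields stand in for the enums above, and the two rows are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntrinsicDef:
    name: str
    arg_count: int
    category: str      # stands in for IntrinsicCategory
    emit_pattern: str  # stands in for EmitPattern


# Two illustrative rows; the real table would list every intrinsic.
INTRINSICS = [
    IntrinsicDef("puts", 1, "IO", "PtrToCall"),
    IntrinsicDef("print", 1, "IO", "PtrToCall"),
]

# Built once at load time: name -> definition.
_BY_NAME = {d.name: d for d in INTRINSICS}


def is_intrinsic(name: str) -> bool:
    """Hash lookup replacing the chain of string comparisons."""
    return name in _BY_NAME


def get_intrinsic(name: str) -> Optional[IntrinsicDef]:
    """Return the definition, or None for a non-intrinsic name."""
    return _BY_NAME.get(name)
```

Indexing the table by name gives both is_intrinsic() and get_intrinsic() from one structure, so a separate hash set may not be needed.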
Estimated Impact
| Before | After | Savings |
|---|---|---|
| 3,947 lines across 3 files | ~800 lines in 1 file + patterns | ~3,100 lines |
| O(n) string scan per lookup | O(1) hash lookup | Performance boost |
| Add intrinsic = edit 3 files | Add intrinsic = add 1 line | Maintainability |
File: typecheck.qz (5,585 lines)
Issue 1: Builtin Registration Wall (Lines 1012-1295)
283 consecutive lines of:
tc_register_builtin(tc, "puts", TYPE_VOID())
tc_register_builtin(tc, "eputs", TYPE_VOID())
tc_register_builtin(tc, "str_len", TYPE_INT())
# ... 280 more
Fix: Data-driven bulk registration
BUILTINS = [
("puts", TYPE_VOID()),
("str_len", TYPE_INT()),
# ...
]
for (name, type) in BUILTINS do
tc_register_builtin(tc, name, type)
end
Savings: ~140 lines
Issue 2: Type Mapping Cascades (Lines 1301-1378, 2135-2168)
~100 lines of if/elsif mapping type constants to values:
if kind == TYPE_INT()
return "Int"
elsif kind == TYPE_BOOL()
return "Bool"
# ... 30+ more
Appears in:
- tc_type_to_infer_type (32 lines)
- tc_infer_type_to_type (33 lines)
- tc_type_name (33 lines)
Fix: Table-driven lookup
TYPE_INFO = [
TypeInfo { kind: TYPE_INT(), name: "Int", infer_id: 2 },
TypeInfo { kind: TYPE_BOOL(), name: "Bool", infer_id: 3 },
# ...
]
# Assumes each entry's index in TYPE_INFO equals its kind value
def tc_type_name(kind: Int): String = TYPE_INFO[kind].name
Savings: ~80 lines
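One table can serve all three mapping functions if it is indexed by key instead of by position. This is a hedged Python sketch, not the Quartz implementation: the kind values 10 and 20 are hypothetical and deliberately non-contiguous, while the infer_id values are the ones shown above.

```python
from typing import NamedTuple


class TypeInfo(NamedTuple):
    kind: int
    name: str
    infer_id: int


# Hypothetical kind values, chosen to be non-contiguous on purpose.
TYPE_INT, TYPE_BOOL = 10, 20

TYPE_INFO = [
    TypeInfo(TYPE_INT, "Int", 2),
    TypeInfo(TYPE_BOOL, "Bool", 3),
]

# Keyed indexes built once from the single table.
BY_KIND = {t.kind: t for t in TYPE_INFO}
BY_INFER_ID = {t.infer_id: t for t in TYPE_INFO}


def tc_type_name(kind: int) -> str:
    info = BY_KIND.get(kind)
    return info.name if info else "<unknown>"


def tc_type_to_infer_type(kind: int) -> int:
    return BY_KIND[kind].infer_id


def tc_infer_type_to_type(infer_id: int) -> int:
    return BY_INFER_ID[infer_id].kind
```

Keyed lookup keeps the table correct even if kind constants are later renumbered or reordered.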
Issue 3: UFCS Method Rewrite Cascade (Lines 3838-3953)
~115 lines of near-identical blocks (only the type name differs) for String, Vec, HashMap, StringBuilder, Set:
elsif first_arg_type == TYPE_STRING()
type_name = "String"
var mangled = str_concat("String$", func_name)
# ... 15 lines of lookup and rewrite logic
elsif first_arg_type == TYPE_VEC()
type_name = "Vec"
var mangled = str_concat("Vec$", func_name)
# ... identical 15 lines
Fix: UFCS lookup table
UFCS_MAP = {
"String$find": "str_find",
"String$slice": "str_slice",
"Vec$push": "vec_push",
# ...
}
def resolve_ufcs(type_name: String, method: String): Option<String>
return UFCS_MAP.get(type_name + "$" + method)
end
Savings: ~90 lines
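With the map in place, the whole elsif cascade reduces to one call site. A minimal Python sketch of that call site follows; the numeric type kinds and the rewrite_ufcs_call name are assumptions for illustration, and unmatched calls are simply left unchanged.

```python
# Hypothetical type-kind constants mapped to their UFCS prefixes.
TYPE_STRING, TYPE_VEC = 1, 2
TYPE_NAMES = {TYPE_STRING: "String", TYPE_VEC: "Vec"}

UFCS_MAP = {
    "String$find": "str_find",
    "String$slice": "str_slice",
    "Vec$push": "vec_push",
}


def rewrite_ufcs_call(first_arg_type: int, func_name: str) -> str:
    """Return the intrinsic name for a method call, or func_name unchanged."""
    type_name = TYPE_NAMES.get(first_arg_type)
    if type_name is None:
        return func_name
    return UFCS_MAP.get(f"{type_name}${func_name}", func_name)
```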
Issue 4: Predicate Rewriting (Lines 4191-4289)
~100 lines for .some?, .none?, .ok?, .err?, .digit?, etc:
elsif field_name == "some?"
if object_type == TYPE_OPTION()
# 5 lines to rewrite to is_some() call
else
tc_error(tc, "Only Option has 'some?'", line, col)
end
elsif field_name == "none?"
# identical structure
Fix: Predicate table
PREDICATES = [
("some?", TYPE_OPTION(), "is_some"),
("none?", TYPE_OPTION(), "is_none"),
("ok?", TYPE_RESULT(), "is_ok"),
# ...
]
Savings: ~75 lines
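The table only pays off once a single routine consumes it. The sketch below, in Python with hypothetical type constants and a hypothetical rewrite_predicate name, shows one way the type check and the error message could both be driven from the table rows.

```python
# Hypothetical type-kind constants and their display names.
TYPE_OPTION, TYPE_RESULT = 1, 2
TYPE_NAMES = {TYPE_OPTION: "Option", TYPE_RESULT: "Result"}

# predicate -> (required receiver type, function to call instead)
PREDICATES = {
    "some?": (TYPE_OPTION, "is_some"),
    "none?": (TYPE_OPTION, "is_none"),
    "ok?": (TYPE_RESULT, "is_ok"),
}


def rewrite_predicate(field_name, object_type):
    """Return (callee, None) on success or (None, error_message)."""
    entry = PREDICATES.get(field_name)
    if entry is None:
        return None, f"Unknown predicate '{field_name}'"
    required_type, callee = entry
    if object_type != required_type:
        return None, f"Only {TYPE_NAMES[required_type]} has '{field_name}'"
    return callee, None
```

Because the error text is derived from the row, adding a predicate needs one table entry and no new branch.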
Issue 5: Operator to String Conversion (Lines 3682-3728)
~46 lines of:
if op == 0
op_str = "+"
elsif op == 1
op_str = "-"
# ... 20+ more
Fix: Lookup table
OP_STRINGS = ["+", "-", "*", "/", "%", "==", "!=", "<", ">", ...]
var op_str = OP_STRINGS[op]
Savings: ~35 lines
Total for typecheck.qz: ~470 lines
File: codegen.qz (5,098 lines)
Issue 1: Monolithic 3,400-Line Function (Lines 420-3819)
cg_emit_intrinsic is 3,400 lines with 138 intrinsic cases.
Fix: Split by category + use emission patterns
def cg_emit_intrinsic(state: CgState, name: String, args: Vec<Int>, dest: String)
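# get_intrinsic returns Option<IntrinsicDef>; this sketch assumes the caller already checked is_intrinsic(name)
match get_intrinsic(name).category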
IO -> cg_emit_io_intrinsic(state, name, args, dest)
String -> cg_emit_string_intrinsic(state, name, args, dest)
Vec -> cg_emit_vec_intrinsic(state, name, args, dest)
# ...
end
end
Savings: 150-200 lines (from organization alone)
Issue 2: Repeated Pointer Conversion (149 + 75 occurrences)
STATUS: ✅ COMPLETED (inttoptr helper)
inttoptr originally appeared 149 times. Extracted cg_emit_inttoptr() helper and replaced 71 occurrences. Remaining 78 use non-standard prefixes (%rmp, %cfn) or different patterns.
# Helper function added at codegen.qz:420-428
def cg_emit_inttoptr(out: Int, dest: String, src: String, lltype: String): Void
cg_emit_line(out, " %v" + dest + " = inttoptr i64 %v" + src + " to " + lltype)
end
ptrtoint still appears 75 times - candidate for next helper extraction.
Actual Result: Net +10 lines (the helper definition), but each of the 71 call sites now uses one named call instead of a hand-assembled IR string.
Before:
cg_emit_line(out, " %v" + d + ".ptr = inttoptr i64 %v" + int_to_str(arg) + " to i8*")
After:
cg_emit_inttoptr(out, d + ".ptr", int_to_str(arg), "i8*")
Next: Extract cg_emit_ptrtoint() for remaining 75 occurrences.
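The follow-up helper could mirror the existing one. Below is a minimal Python model of both, assuming the same %v naming scheme and that every ptrtoint site converts to i64; the list-based cg_emit_line is a stand-in for the real output buffer.

```python
def cg_emit_line(out, line):
    # Stand-in for the real emitter: collect lines in a list.
    out.append(line)


def cg_emit_inttoptr(out, dest, src, lltype):
    """Model of the existing helper: integer -> pointer of type lltype."""
    cg_emit_line(out, f"  %v{dest} = inttoptr i64 %v{src} to {lltype}")


def cg_emit_ptrtoint(out, dest, src, lltype):
    """Proposed mirror image: pointer of type lltype -> i64.

    In LLVM's ptrtoint the source pointer type comes first.
    """
    cg_emit_line(out, f"  %v{dest} = ptrtoint {lltype} %v{src} to i64")
```

Sites that use other prefixes (%rmp, %cfn) would still need a variant that takes the full register name.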
Issue 3: Identical puts/print, eputs/eprint (Lines 428-462)
Four intrinsics whose emission blocks are identical in pairs (puts/print and eputs/eprint):
if str_eq(name, "puts") == 1
# 5 lines
end
if str_eq(name, "print") == 1
# identical 5 lines
Fix: Combine conditions
if str_eq(name, "puts") == 1 or str_eq(name, "print") == 1
cg_emit_call_single_ptr_arg(out, d, args[0], "@puts", "i64")
return
end
Savings: ~15 lines
Issue 4: Duplicated Vec/HashMap Access (Lines 1134-1285, 3349-3470)
Vec header loading repeated 6 times (~80 lines):
cg_emit_line(out, " %v" + d + ".hdr = inttoptr i64 %v" + int_to_str(vec) + " to i64*")
cg_emit_line(out, " %v" + d + ".size.ptr = getelementptr i64, i64* %v" + d + ".hdr, i64 1")
cg_emit_line(out, " %v" + d + ".size = load i64, i64* %v" + d + ".size.ptr")
HashMap iteration duplicated 4 times (~120 lines).
Fix: Extract cg_emit_vec_load_header(), cg_emit_map_lookup_loop()
Savings: ~200 lines
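As a rough illustration of the first helper, the three header-loading lines shown above fold into one function. This Python sketch assumes the helper returns the name of the size register so callers can use it in a bounds check; that return value is an assumption, not something the current code does.

```python
def cg_emit_vec_load_header(out, d, vec):
    """Emit the Vec header load sequence and return the size register name."""
    out.append(f"  %v{d}.hdr = inttoptr i64 %v{vec} to i64*")
    out.append(f"  %v{d}.size.ptr = getelementptr i64, i64* %v{d}.hdr, i64 1")
    out.append(f"  %v{d}.size = load i64, i64* %v{d}.size.ptr")
    return f"%v{d}.size"
```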
Issue 5: Regex Setup Duplication (Lines 668-957)
7 regex intrinsics share common setup (~100 lines duplicated):
cg_emit_line(out, " %v" + d + ".regex = call i8* @malloc(i64 64)")
cg_emit_line(out, " %v" + d + ".pattern = inttoptr i64 %v" + int_to_str(pattern) + " to i8*")
cg_emit_line(out, " %v" + d + ".rc = call i32 @regcomp(i8* %v" + d + ".regex, ...)")
Fix: Extract cg_emit_regex_compile(out, d, pattern_arg)
Savings: ~100 lines
Issue 6: Runtime Declaration Wall (Lines 4424-4800)
~70 sequential declare lines:
cg_emit_line(out, "declare i64 @puts(i8*)")
cg_emit_line(out, "declare i32 @printf(i8*, ...)")
# ... 68 more
Fix: Data array
RUNTIME_DECLS = [
"declare i64 @puts(i8*)",
"declare i32 @printf(i8*, ...)",
# ...
]
for decl in RUNTIME_DECLS do
cg_emit_line(out, decl)
end
Savings: ~100 lines
Total for codegen.qz: ~1,200-1,500 lines
File: mir.qz (4,484 lines)
Issue 1: Giant Intrinsic Lookup (Lines 1621-1885)
264 lines of sequential string comparisons:
def mir_is_intrinsic(name: String): Int
return 1 if str_eq(name, "puts") == 1
return 1 if str_eq(name, "print") == 1
# ... 187 more
return 0
end
Fix: Hash set lookup (covered in centralized intrinsics section)
Savings: ~180 lines
Issue 2: Massive Expression Lowering (Lines 2219-3315)
~1,100 lines in a single function with 68 kind checks.
Fix: Split into category handlers
def mir_lower_expr(ctx: MirContext, s: Int, node: Int): Int
var kind = ast$ast_get_kind(s, node)
return mir_lower_literal(ctx, s, node, kind) if kind <= 4
return mir_lower_call(ctx, s, node) if kind == 9
return mir_lower_control_flow(ctx, s, node, kind) if kind >= 40
# ...
end
Savings: ~400 lines (organization + shared setup)
Issue 3: Loop Setup Boilerplate (6 occurrences)
~15 lines repeated 6 times:
var saved_break = mir_ctx_get_break_target(ctx)
var saved_continue = mir_ctx_get_continue_target(ctx)
# ... save, set, body, restore pattern
Fix: LoopScope struct + mir_enter_loop/mir_exit_loop helpers
Savings: ~80 lines
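The save, set, body, restore pattern is a scoped-resource idiom. In Python it maps onto a context manager, shown here as a sketch with a hypothetical MirContext holding only the two targets; in Quartz the same shape would presumably be a LoopScope value passed between mir_enter_loop and mir_exit_loop.

```python
from contextlib import contextmanager


class MirContext:
    def __init__(self):
        self.break_target = None
        self.continue_target = None


@contextmanager
def loop_scope(ctx, break_target, continue_target):
    """Install loop targets for the body, then restore the outer ones."""
    saved = (ctx.break_target, ctx.continue_target)
    ctx.break_target, ctx.continue_target = break_target, continue_target
    try:
        yield ctx
    finally:
        # Runs even if lowering the body raises, so nesting stays balanced.
        ctx.break_target, ctx.continue_target = saved
```

Nested loops work because each scope restores exactly what it saved.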
Issue 4: Block Termination Pattern (18 occurrences)
if mir_block_get_term_kind(then_cur) < 0
mir_block_set_terminator(then_cur, TERM_JUMP(), merge_block)
end
Fix: mir_ensure_terminated(block, target)
Savings: ~36 lines
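A small Python model of the proposed helper, assuming (as the snippet above suggests) that a negative terminator kind means the block has no terminator yet. The Block class and the TERM_JUMP value are placeholders.

```python
TERM_JUMP = 0  # hypothetical terminator kind


class Block:
    def __init__(self):
        self.term_kind = -1      # negative: no terminator yet
        self.term_target = None


def mir_ensure_terminated(block, target):
    """Add a jump to target unless the block already has a terminator.

    Returns True if a jump was added.
    """
    if block.term_kind < 0:
        block.term_kind = TERM_JUMP
        block.term_target = target
        return True
    return False
```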
Total for mir.qz: ~600 lines (local) + ~2,500 (via shared intrinsics)
File: parser.qz (3,737 lines)
Issue 1: Type Parameter Parsing (6 occurrences)
17 lines copy-pasted 6 times for trait, type alias, newtype, impl, extend, struct, enum:
var type_params = ""
if ps_check(ps, token_constants$TOK_LT()) == 1
ps_advance(ps)
type_params = "<"
var first = 1
while ps_check(ps, token_constants$TOK_GT()) == 0
# ... 10 more lines
end
end
Fix: Extract ps_parse_optional_type_params(ps): String
Savings: ~100 lines
Issue 2: Binary Expression Boilerplate (8 functions)
8 nearly-identical functions (~28 lines each) for precedence levels:
- ps_parse_factor (*, /, %)
- ps_parse_term (+, -)
- ps_parse_shift (<<, >>)
- ps_parse_bitand (&)
- ps_parse_bitxor (^)
- ps_parse_bitor (|)
- ps_parse_comparison (<, >, <=, >=)
- ps_parse_equality (==, !=, =~)
Fix: Table-driven precedence climbing
struct PrecLevel
ops: Vec<(Int, Int)> # (token, op_code)
next: |ParserState| -> Int
end
PREC_TABLE = [
PrecLevel { ops: [(TOK_STAR, OP_MUL), (TOK_SLASH, OP_DIV), (TOK_PERCENT, OP_MOD)],
next: ps_parse_unary },
# ...
]
def ps_parse_binary_at_level(ps: ParserState, level: Int): Int
# Generic binary parser using PREC_TABLE[level]
end
Savings: ~150 lines
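To show that one generic routine can replace all eight functions, here is a runnable Python sketch. It assumes the eight functions named above are listed from tightest to loosest binding and that every level is left-associative; tokens are plain strings and a bare token stands in for ps_parse_unary.

```python
# One entry per precedence level, loosest first.
PREC_TABLE = [
    {"==", "!=", "=~"},        # equality
    {"<", ">", "<=", ">="},    # comparison
    {"|"},                     # bitor
    {"^"},                     # bitxor
    {"&"},                     # bitand
    {"<<", ">>"},              # shift
    {"+", "-"},                # term
    {"*", "/", "%"},           # factor
]


def parse_binary_at_level(tokens, pos, level=0):
    """Parse left-associative binary operators at `level`.

    Returns (ast, new_pos); an AST node is (op, left, right) or a bare token.
    """
    if level == len(PREC_TABLE):
        # Past the tightest level: consume one operand.
        return tokens[pos], pos + 1
    left, pos = parse_binary_at_level(tokens, pos, level + 1)
    while pos < len(tokens) and tokens[pos] in PREC_TABLE[level]:
        op = tokens[pos]
        right, pos = parse_binary_at_level(tokens, pos + 1, level + 1)
        left = (op, left, right)
    return left, pos
```

Adding an operator then means adding one string to one set; adding a precedence level means inserting one row.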
Issue 3: Comma-Separated List Parsing (15 occurrences)
8-12 lines repeated 15 times:
while ps_check(ps, token_constants$TOK_COMMA()) == 1
ps_advance(ps)
ps_skip_newlines(ps)
if ps_check(ps, CLOSER) == 1
break
end
elem = ps_parse_XXX(ps)
items.push(elem)
end
Fix: Macro or combinator
# Macro approach
items = $parse_list!(ps, TOK_RPAREN, ps_parse_expr)
# Or combinator
items = ps_parse_comma_list(ps, TOK_RPAREN, ps_parse_expr)
Savings: ~80 lines
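The combinator variant can be sketched without macro support. This Python version is illustrative only: tokens are plain strings, parse_elem is any function returning (element, new_pos), and handling of the first element and of an empty list is an assumption added on top of the loop shown above.

```python
def ps_parse_comma_list(tokens, pos, closer, parse_elem):
    """Parse elem (',' elem)* with an optional trailing comma.

    Stops at (without consuming) `closer` and returns (items, new_pos).
    Newlines directly after a comma are skipped.
    """
    items = []
    if pos < len(tokens) and tokens[pos] == closer:
        return items, pos  # empty list
    elem, pos = parse_elem(tokens, pos)
    items.append(elem)
    while pos < len(tokens) and tokens[pos] == ",":
        pos += 1
        while pos < len(tokens) and tokens[pos] == "\n":
            pos += 1
        if pos < len(tokens) and tokens[pos] == closer:
            break  # trailing comma
        elem, pos = parse_elem(tokens, pos)
        items.append(elem)
    return items, pos


def parse_name(tokens, pos):
    # Trivial element parser used for demonstration.
    return tokens[pos], pos + 1
```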
Issue 4: if/elsif/unless Duplication (Lines 1853-1940)
ps_parse_if and ps_parse_elsif are 90% identical (~28 lines each).
Fix: Extract ps_parse_conditional_body()
Savings: ~35 lines
Issue 5: Literal Dispatch Cascade (Lines 424-496)
8 consecutive token checks:
if ps_current_type(ps) == token_constants$TOK_INT()
var lex = ps_current_lexeme(ps)
var ln = ps_current_line(ps)
var cl = ps_current_col(ps)
ps_advance(ps)
return ast$ast_int_lit(s, str_to_int(lex), ln, cl)
end
# ... 7 more identical patterns
Fix: Match expression or ps_capture_loc() helper
Savings: ~30 lines
Issue 6: Doc Comment Handling (8 occurrences)
var doc = ps_take_pending_doc(ps)
# ... parse ...
if doc.size > 0
ast$ast_set_doc(s, node, doc)
end
return node
Fix: Wrapper function ps_with_doc(ps, node): Int
Savings: ~20 lines
Total for parser.qz: ~435-600 lines
Macro Opportunities
Quartz's macro system supports error propagation, which makes it a good fit for four recurring patterns:
1. $expect!(ps, token, message): Parser Expected Pattern
# Before (everywhere in parser.qz)
if ps_current_type(ps) != token_constants$TOK_LPAREN()
ps_error(ps, "Expected '('")
return 0
end
ps_advance(ps)
# After
$expect!(ps, TOK_LPAREN, "Expected '('")
2. $emit_ir!(out, template): Codegen Line Emission
# Before (149 times in codegen.qz)
cg_emit_line(out, " %v" + d + ".ptr = inttoptr i64 %v" + int_to_str(arg) + " to i8*")
# After
$emit_ir!(out, " %v{d}.ptr = inttoptr i64 %v{arg} to i8*")
3. $match_kind!(kind, handlers...): AST Dispatch
# Before (68 times in mir.qz)
if kind == NODE_INT_LIT()
return mir_emit_const_int(ctx, ...)
elsif kind == NODE_BOOL_LIT()
return mir_emit_const_bool(ctx, ...)
# ...
# After
$match_kind!(kind,
NODE_INT_LIT => mir_emit_const_int(ctx, ...),
NODE_BOOL_LIT => mir_emit_const_bool(ctx, ...),
...
)
4. $register_builtins!(tc, list): Bulk Registration
# Before (283 lines in typecheck.qz)
tc_register_builtin(tc, "puts", TYPE_VOID())
tc_register_builtin(tc, "str_len", TYPE_INT())
# ... 281 more
# After
$register_builtins!(tc, [
("puts", TYPE_VOID()),
("str_len", TYPE_INT()),
# ...
])
Implementation Roadmap
Phase 1: Quick Wins (1 week)
- Extract cg_emit_inttoptr(): DONE (71 replacements)
- Extract cg_emit_ptrtoint(): ~75 occurrences remaining
- Extract ps_parse_optional_type_params(): ~100 lines
- Extract mir_ensure_terminated(): ~36 lines
- Combine identical intrinsics (puts/print, etc.): ~15 lines
Total: ~450 lines
Phase 2: Table-Driven Refactoring (2 weeks)
- Create INTRINSICS registry in new file: ~3,100 lines
- Type mapping tables in typecheck.qz: ~80 lines
- UFCS lookup table: ~90 lines
- Operator string table: ~35 lines
- Binary operator precedence table: ~150 lines
Total: ~3,455 lines
Phase 3: Macro Development (1 week)
- $expect! macro for parser
- $emit_ir! macro for codegen
- $match_kind! macro for dispatch
- $register_builtins! macro
Total: Variable (infrastructure for future savings)
Phase 4: Function Splitting (1 week)
- Split cg_emit_intrinsic by category: organization
- Split mir_lower_expr by category: ~400 lines
- Extract loop scope helpers: ~80 lines
Total: ~480 lines
Success Metrics
Quantitative
- Total lines reduced by 4,000+
- Largest function < 500 lines (currently 3,400)
- No function with > 50 if/elsif branches
- All intrinsics defined in single source of truth
Qualitative
- Adding new intrinsic requires editing 1 file (not 3)
- Adding new binary operator requires adding 1 table entry
- Parser patterns are obvious and consistent
- Macro usage documented in ref.md
Next Steps
- Review this audit for accuracy
- Prioritize which phase to tackle first
- Create tracking issues/todos for each item
- Begin Phase 1 (quick wins) while designing Phase 2 architecture