Quartz Self-Hosted Compiler: Dogfooding Audit

Version: v5.12.37-alpha | Total Lines Audited: 18,904 | Potential Savings: ~5,300 lines

Executive Summary

The self-hosted compiler has substantial refactoring opportunities. The four audited files total 18,904 lines, and an estimated ~5,300 of those lines are repetition that could be eliminated through:

Data-driven tables replacing giant if/elsif chains
Helper function extraction for repeated patterns
Macro usage for boilerplate elimination
Cross-file consolidation (especially intrinsics)

File	Lines	Potential Savings	Top Issue
`typecheck.qz`	5,585	~470	Builtin registration wall (283 lines)
`codegen.qz`	5,098	~1,200-1,500	Monolithic 3,400-line intrinsic function
`mir.qz`	4,484	~600 (local) + ~2,500 (cross-file)	Intrinsic duplication with codegen
`parser.qz`	3,737	~435-600	8 nearly-identical binary op parsers
TOTAL	18,904	~5,300

Critical: Cross-File Intrinsic Duplication

The Problem

Intrinsic handling is duplicated across three locations:

mir.qz:1621-1885 (264 lines): mir_is_intrinsic() - 189 string comparisons
codegen.qz:420-3819 (3,400 lines): cg_emit_intrinsic() - 138 emission blocks
typecheck.qz:1012-1295 (283 lines): tc_register_builtin() - 283 registration calls

Each intrinsic is defined in three separate places with no single source of truth.

The Solution: Centralized Intrinsic Registry

# In new file: self-hosted/core/intrinsics.qz

struct IntrinsicDef
  name: String
  arg_count: Int
  return_type: Int
  category: IntrinsicCategory
  emit_pattern: EmitPattern
end

enum IntrinsicCategory
  IO          # puts, print, eputs, eprint
  String      # str_len, str_concat, str_eq, ...
  Vec         # vec_new, vec_len, vec_push, ...
  HashMap     # hashmap_new, hashmap_get, ...
  Memory      # malloc, free, arena_*, pool_*
  Atomic      # atomic_load, atomic_store, ...
  Regex       # regex_match, regex_replace, ...
  # ...
end

enum EmitPattern
  PtrToCall        # inttoptr + call (puts, print)
  TwoArgCall       # Simple two-arg function call
  VecAccess        # Load header, bounds check, access
  MapLookup        # Linear search loop
  # ~10 patterns cover 138 intrinsics
end

# Single source of truth
INTRINSICS: Vec<IntrinsicDef> = [
  IntrinsicDef { name: "puts", arg_count: 1, return_type: TYPE_INT(), 
                 category: IO, emit_pattern: PtrToCall },
  IntrinsicDef { name: "print", arg_count: 1, return_type: TYPE_INT(),
                 category: IO, emit_pattern: PtrToCall },
  # ... 187 more
]

# O(1) lookup via hash set
var intrinsic_set: Set<String> = nil
def is_intrinsic(name: String): Bool
  intrinsic_set = Set.from_iter(INTRINSICS.map(|i| i.name)) if intrinsic_set.nil?
  return intrinsic_set.contains(name)
end

def get_intrinsic(name: String): Option<IntrinsicDef>
  # Hash lookup instead of 189 string comparisons
end

Estimated Impact

Before	After	Savings
3,947 lines across 3 files	~800 lines in 1 file + patterns	~3,100 lines
O(n) string scan per lookup	O(1) hash lookup	Performance boost
Add intrinsic = edit 3 files	Add intrinsic = add 1 line	Maintainability
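
To make the registry idea concrete, here is a minimal sketch written in Python rather than Quartz so that it runs as-is. The two entries come from the listing above. The integer stand-in for TYPE_INT(), the trimmed enums, and the `_BY_NAME` index are assumptions made for illustration, not part of the compiler.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Category(Enum):
    IO = auto()
    STRING = auto()


class EmitPattern(Enum):
    PTR_TO_CALL = auto()
    TWO_ARG_CALL = auto()


TYPE_INT = 1  # hypothetical stand-in for Quartz's TYPE_INT()


@dataclass(frozen=True)
class IntrinsicDef:
    name: str
    arg_count: int
    return_type: int
    category: Category
    emit_pattern: EmitPattern


# Single source of truth: one row per intrinsic.
INTRINSICS = [
    IntrinsicDef("puts", 1, TYPE_INT, Category.IO, EmitPattern.PTR_TO_CALL),
    IntrinsicDef("print", 1, TYPE_INT, Category.IO, EmitPattern.PTR_TO_CALL),
]

# Built once; each later lookup is a hash probe instead of a chain of
# string comparisons.
_BY_NAME = {d.name: d for d in INTRINSICS}


def is_intrinsic(name: str) -> bool:
    return name in _BY_NAME


def get_intrinsic(name: str) -> Optional[IntrinsicDef]:
    return _BY_NAME.get(name)
```

Because `is_intrinsic` and `get_intrinsic` share one index, the type checker, MIR, and codegen would all read the same rows, which is the point of the registry.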

File: typecheck.qz (5,585 lines)

Issue 1: Builtin Registration Wall (Lines 1012-1295)

283 consecutive lines of:

tc_register_builtin(tc, "puts", TYPE_VOID())
tc_register_builtin(tc, "eputs", TYPE_VOID())
tc_register_builtin(tc, "str_len", TYPE_INT())
# ... 280 more

Fix: Data-driven bulk registration

BUILTINS = [
  ("puts", TYPE_VOID()),
  ("str_len", TYPE_INT()),
  # ...
]
for (name, type) in BUILTINS do
  tc_register_builtin(tc, name, type)
end

Savings: ~140 lines

Issue 2: Type Mapping Cascades (Lines 1301-1378, 2135-2168)

~100 lines of if/elsif mapping type constants to values:

if kind == TYPE_INT()
  return "Int"
elsif kind == TYPE_BOOL()
  return "Bool"
# ... 30+ more

Appears in:

tc_type_to_infer_type (32 lines)
tc_infer_type_to_type (33 lines)
tc_type_name (33 lines)

Fix: Table-driven lookup

TYPE_INFO = [
  TypeInfo { kind: TYPE_INT(), name: "Int", infer_id: 2 },
  TypeInfo { kind: TYPE_BOOL(), name: "Bool", infer_id: 3 },
  # ...
]

# Assumes TYPE_INFO rows are stored in kind order with no gaps, so kind doubles as the index
def tc_type_name(kind: Int): String = TYPE_INFO[kind].name

Savings: ~80 lines
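
The one-line tc_type_name above indexes the table by kind, which only works if the rows are stored in kind order with no gaps. A sketch of an alternative that keys the table by kind, written in Python so it runs as-is, follows. The numeric kind values and the `_BY_KIND` / `_BY_INFER_ID` indexes are made up for illustration.

```python
from typing import NamedTuple


class TypeInfo(NamedTuple):
    kind: int
    name: str
    infer_id: int


# Hypothetical kind values; the real ones come from TYPE_INT(), TYPE_BOOL(), ...
TYPE_INT, TYPE_BOOL = 10, 11

TYPE_INFO = [
    TypeInfo(TYPE_INT, "Int", 2),
    TypeInfo(TYPE_BOOL, "Bool", 3),
]

# Two derived indexes, built once, serve all three mapping functions.
_BY_KIND = {t.kind: t for t in TYPE_INFO}
_BY_INFER_ID = {t.infer_id: t for t in TYPE_INFO}


def type_name(kind: int) -> str:
    return _BY_KIND[kind].name


def type_to_infer_type(kind: int) -> int:
    return _BY_KIND[kind].infer_id


def infer_type_to_type(infer_id: int) -> int:
    return _BY_INFER_ID[infer_id].kind
```

One table then replaces the three cascades, and adding a type means adding one row.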

Issue 3: UFCS Method Rewrite Cascade (Lines 3838-3953)

~115 lines of identical blocks for String, Vec, HashMap, StringBuilder, Set:

elsif first_arg_type == TYPE_STRING()
  type_name = "String"
  var mangled = str_concat("String$", func_name)
  # ... 15 lines of lookup and rewrite logic
elsif first_arg_type == TYPE_VEC()
  type_name = "Vec"
  var mangled = str_concat("Vec$", func_name)
  # ... identical 15 lines

Fix: UFCS lookup table

UFCS_MAP = {
  "String$find": "str_find",
  "String$slice": "str_slice",
  "Vec$push": "vec_push",
  # ...
}

def resolve_ufcs(type_name: String, method: String): Option<String>
  return UFCS_MAP.get(type_name + "$" + method)
end

Savings: ~90 lines

Issue 4: Predicate Rewriting (Lines 4191-4289)

~100 lines for .some?, .none?, .ok?, .err?, .digit?, etc:

elsif field_name == "some?"
  if object_type == TYPE_OPTION()
    # 5 lines to rewrite to is_some() call
  else
    tc_error(tc, "Only Option has 'some?'", line, col)
  end
elsif field_name == "none?"
  # identical structure

Fix: Predicate table

PREDICATES = [
  ("some?", TYPE_OPTION(), "is_some"),
  ("none?", TYPE_OPTION(), "is_none"),
  ("ok?", TYPE_RESULT(), "is_ok"),
  # ...
]

Savings: ~75 lines
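
The table alone does not show how the rewrite and the error path collapse into one routine. A possible shape, sketched in Python so it can be run, is below. The kind values, the extra type-name column used for the error text, and the function name `rewrite_predicate` are assumptions.

```python
TYPE_OPTION, TYPE_RESULT = 20, 21  # hypothetical kind values

# predicate -> (required receiver type, type name for errors, function to call)
PREDICATES = {
    "some?": (TYPE_OPTION, "Option", "is_some"),
    "none?": (TYPE_OPTION, "Option", "is_none"),
    "ok?": (TYPE_RESULT, "Result", "is_ok"),
}


def rewrite_predicate(field_name, object_type):
    """Return (callee, error).

    (None, None) means field_name is not a predicate at all, so the
    caller should continue with ordinary field access.
    """
    entry = PREDICATES.get(field_name)
    if entry is None:
        return None, None
    required, type_name, callee = entry
    if object_type != required:
        return None, f"Only {type_name} has '{field_name}'"
    return callee, None
```

Each elsif branch in the original becomes one table row, and the type check and error message are written once.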

Issue 5: Operator to String Conversion (Lines 3682-3728)

~46 lines of:

if op == 0
  op_str = "+"
elsif op == 1
  op_str = "-"
# ... 20+ more

Fix: Lookup table

OP_STRINGS = ["+", "-", "*", "/", "%", "==", "!=", "<", ">", ...]
var op_str = OP_STRINGS[op]

Savings: ~35 lines

Total for typecheck.qz: ~470 lines

File: codegen.qz (5,098 lines)

Issue 1: Monolithic 3,400-Line Function (Lines 420-3819)

cg_emit_intrinsic is 3,400 lines with 138 intrinsic cases.

Fix: Split by category + use emission patterns

def cg_emit_intrinsic(state: CgState, name: String, args: Vec<Int>, dest: String)
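  # get_intrinsic returns Option<IntrinsicDef>; the Option must be unwrapped before .category is read
  match get_intrinsic(name).category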
    IO -> cg_emit_io_intrinsic(state, name, args, dest)
    String -> cg_emit_string_intrinsic(state, name, args, dest)
    Vec -> cg_emit_vec_intrinsic(state, name, args, dest)
    # ...
  end
end

Savings: 150-200 lines (from organization alone)

Issue 2: Repeated Pointer Conversion (149 + 75 occurrences)

STATUS: ✅ COMPLETED (inttoptr helper)

inttoptr originally appeared 149 times. A cg_emit_inttoptr() helper was extracted and now replaces 71 of those occurrences. The remaining 78 use non-standard prefixes (%rmp, %cfn) or different patterns.

# Helper function added at codegen.qz:420-428
def cg_emit_inttoptr(out: Int, dest: String, src: String, lltype: String): Void
  cg_emit_line(out, "  %v" + dest + " = inttoptr i64 %v" + src + " to " + lltype)
end

ptrtoint still appears 75 times - candidate for next helper extraction.

Actual Result: Readability improved rather than line count. The helper definition adds a net +10 lines, but the 71 call sites are now shorter and easier to scan.

Before:

cg_emit_line(out, "  %v" + d + ".ptr = inttoptr i64 %v" + int_to_str(arg) + " to i8*")

After:

cg_emit_inttoptr(out, d + ".ptr", int_to_str(arg), "i8*")

Next: Extract cg_emit_ptrtoint() for remaining 75 occurrences.
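
A rough picture of what the ptrtoint counterpart could look like, modeled in Python so it runs as-is. It assumes the 75 occurrences follow the same `%v` naming as the inttoptr sites and always convert to i64; sites that deviate would stay hand-written, as with inttoptr. The list-based `emit_line` is a stand-in for cg_emit_line.

```python
def emit_line(out, text):
    """Stand-in for cg_emit_line: append one line of IR to a list."""
    out.append(text)


def emit_inttoptr(out, dest, src, lltype):
    # Mirrors the existing cg_emit_inttoptr helper.
    emit_line(out, f"  %v{dest} = inttoptr i64 %v{src} to {lltype}")


def emit_ptrtoint(out, dest, src, lltype):
    # Proposed counterpart: a pointer of type lltype back to an i64 value.
    emit_line(out, f"  %v{dest} = ptrtoint {lltype} %v{src} to i64")
```

Keeping the argument order identical to the inttoptr helper makes the two call sites read as a pair.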

Issue 3: Identical puts/print, eputs/eprint (Lines 428-462)

Four intrinsic branches form two identical pairs (puts/print and eputs/eprint):

if str_eq(name, "puts") == 1
  # 5 lines
end
if str_eq(name, "print") == 1
  # identical 5 lines
end

Fix: Combine conditions

if str_eq(name, "puts") == 1 or str_eq(name, "print") == 1
  cg_emit_call_single_ptr_arg(out, d, args[0], "@puts", "i64")
  return
end

Savings: ~15 lines

Issue 4: Duplicated Vec/HashMap Access (Lines 1134-1285, 3349-3470)

Vec header loading repeated 6 times (~80 lines):

cg_emit_line(out, "  %v" + d + ".hdr = inttoptr i64 %v" + int_to_str(vec) + " to i64*")
cg_emit_line(out, "  %v" + d + ".size.ptr = getelementptr i64, i64* %v" + d + ".hdr, i64 1")
cg_emit_line(out, "  %v" + d + ".size = load i64, i64* %v" + d + ".size.ptr")

HashMap iteration duplicated 4 times (~120 lines).

Fix: Extract cg_emit_vec_load_header(), cg_emit_map_lookup_loop()

Savings: ~200 lines
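
As an illustration of the Vec half of this fix, the three header-loading lines shown above can be folded into one helper. The sketch is in Python so it runs as-is; collecting output in a list and returning the result register name are assumptions, not the real cg_emit_vec_load_header() signature.

```python
def emit_vec_load_header(out, d, vec):
    """Append the three IR lines that load a Vec's size into %v<d>.size.

    out is a list standing in for the codegen output handle, d is the
    destination prefix, and vec is the register number holding the Vec.
    """
    out.append(f"  %v{d}.hdr = inttoptr i64 %v{vec} to i64*")
    out.append(f"  %v{d}.size.ptr = getelementptr i64, i64* %v{d}.hdr, i64 1")
    out.append(f"  %v{d}.size = load i64, i64* %v{d}.size.ptr")
    return f"%v{d}.size"
```

Each of the 6 call sites would then shrink from three emit lines to one call.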

Issue 5: Regex Setup Duplication (Lines 668-957)

7 regex intrinsics share common setup (~100 lines duplicated):

cg_emit_line(out, "  %v" + d + ".regex = call i8* @malloc(i64 64)")
cg_emit_line(out, "  %v" + d + ".pattern = inttoptr i64 %v" + int_to_str(pattern) + " to i8*")
cg_emit_line(out, "  %v" + d + ".rc = call i32 @regcomp(i8* %v" + d + ".regex, ...)")

Fix: Extract cg_emit_regex_compile(out, d, pattern_arg)

Savings: ~100 lines

Issue 6: Runtime Declaration Wall (Lines 4424-4800)

~70 sequential declare lines:

cg_emit_line(out, "declare i64 @puts(i8*)")
cg_emit_line(out, "declare i32 @printf(i8*, ...)")
# ... 68 more

Fix: Data array

RUNTIME_DECLS = [
  "declare i64 @puts(i8*)",
  "declare i32 @printf(i8*, ...)",
  # ...
]
for decl in RUNTIME_DECLS do
  cg_emit_line(out, decl)
end

Savings: ~100 lines

Total for codegen.qz: ~1,200-1,500 lines

File: mir.qz (4,484 lines)

Issue 1: Giant Intrinsic Lookup (Lines 1621-1885)

264 lines of sequential string comparisons:

def mir_is_intrinsic(name: String): Int
  return 1 if str_eq(name, "puts") == 1
  return 1 if str_eq(name, "print") == 1
  # ... 187 more
  return 0
end

Fix: Hash set lookup (covered in centralized intrinsics section)

Savings: ~180 lines

Issue 2: Massive Expression Lowering (Lines 2219-3315)

~1,100 lines in a single function with 68 kind checks.

Fix: Split into category handlers

def mir_lower_expr(ctx: MirContext, s: Int, node: Int): Int
  var kind = ast$ast_get_kind(s, node)
  
  return mir_lower_literal(ctx, s, node, kind) if kind <= 4
  return mir_lower_call(ctx, s, node) if kind == 9
  return mir_lower_control_flow(ctx, s, node, kind) if kind >= 40
  # ...
end

Savings: ~400 lines (organization + shared setup)

Issue 3: Loop Setup Boilerplate (6 occurrences)

~15 lines repeated 6 times:

var saved_break = mir_ctx_get_break_target(ctx)
var saved_continue = mir_ctx_get_continue_target(ctx)
# ... save, set, body, restore pattern

Fix: LoopScope struct + mir_enter_loop/mir_exit_loop helpers

Savings: ~80 lines
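
The save, set, restore pattern could be captured as follows. This is a Python model, not Quartz; the `MirContext` class here holds only the two loop targets and -1 as the "no target" value is an assumption.

```python
from dataclasses import dataclass


class MirContext:
    """Minimal stand-in holding only the two loop targets."""

    def __init__(self):
        self.break_target = -1
        self.continue_target = -1


@dataclass(frozen=True)
class LoopScope:
    saved_break: int
    saved_continue: int


def mir_enter_loop(ctx, break_target, continue_target):
    # Remember the enclosing loop's targets, then install the new ones.
    scope = LoopScope(ctx.break_target, ctx.continue_target)
    ctx.break_target = break_target
    ctx.continue_target = continue_target
    return scope


def mir_exit_loop(ctx, scope):
    # Restore whatever was active before the matching mir_enter_loop.
    ctx.break_target = scope.saved_break
    ctx.continue_target = scope.saved_continue
```

Because each scope carries its own saved values, nested loops restore correctly as long as exits mirror entries.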

Issue 4: Block Termination Pattern (18 occurrences)

if mir_block_get_term_kind(then_cur) < 0
  mir_block_set_terminator(then_cur, TERM_JUMP(), merge_block)
end

Fix: mir_ensure_terminated(block, target)

Savings: ~36 lines
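
A small Python model of the helper, for illustration only. The `Block` class and the TERM_JUMP value are stand-ins; the negative-kind convention for "no terminator yet" is taken from the check shown above.

```python
TERM_JUMP = 1  # hypothetical terminator kind


class Block:
    def __init__(self):
        self.term_kind = -1  # negative means no terminator has been set
        self.term_target = None


def mir_ensure_terminated(block, target):
    """Add a jump to target only if the block has no terminator yet."""
    if block.term_kind < 0:
        block.term_kind = TERM_JUMP
        block.term_target = target
```

A block that already ends in a return or branch is left untouched, which is what the 18 hand-written checks do today.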

Total for mir.qz: ~600 lines (local) + ~2,500 (via shared intrinsics)

File: parser.qz (3,737 lines)

Issue 1: Type Parameter Parsing (6 occurrences)

17 lines copy-pasted 6 times for trait, type alias, newtype, impl, extend, struct, enum:

var type_params = ""
if ps_check(ps, token_constants$TOK_LT()) == 1
  ps_advance(ps)
  type_params = "<"
  var first = 1
  while ps_check(ps, token_constants$TOK_GT()) == 0
    # ... 10 more lines
  end
end

Fix: Extract ps_parse_optional_type_params(ps): String

Savings: ~100 lines
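
The snippet above elides most of the loop body, so the following Python sketch only shows the general shape such a helper might take. It assumes type parameters are bare names separated by commas and that the result is rendered as "<T, U>"; bounds or defaults, if Quartz supports them, are not modeled.

```python
def parse_optional_type_params(tokens, pos):
    """Return (type_params_text, new_pos).

    The text is "" and pos is unchanged when no '<' follows.
    """
    if pos >= len(tokens) or tokens[pos] != "<":
        return "", pos
    pos += 1  # consume '<'
    names = []
    while pos < len(tokens) and tokens[pos] != ">":
        if tokens[pos] != ",":
            names.append(tokens[pos])
        pos += 1
    if pos >= len(tokens):
        raise SyntaxError("Expected '>'")
    return "<" + ", ".join(names) + ">", pos + 1  # consume '>'
```

Every declaration parser would then make one call and receive the same string form.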

Issue 2: Binary Expression Boilerplate (8 functions)

8 nearly-identical functions (~28 lines each) for precedence levels:

ps_parse_factor (*, /, %)
ps_parse_term (+, -)
ps_parse_shift (<<, >>)
ps_parse_bitand (&)
ps_parse_bitxor (^)
ps_parse_bitor (|)
ps_parse_comparison (<, >, <=, >=)
ps_parse_equality (==, !=, =~)

Fix: Table-driven precedence climbing

struct PrecLevel
  ops: Vec<(Int, Int)>  # (token, op_code)
  next: |ParserState| -> Int
end

PREC_TABLE = [
  PrecLevel { ops: [(TOK_STAR, OP_MUL), (TOK_SLASH, OP_DIV), (TOK_PERCENT, OP_MOD)], 
              next: ps_parse_unary },
  # ...
]

def ps_parse_binary_at_level(ps: ParserState, level: Int): Int
  # Generic binary parser using PREC_TABLE[level]
end

Savings: ~150 lines
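
To show that one generic routine can stand in for all 8 functions, here is a runnable Python sketch of a table-driven binary parser. The level order follows the function list above, loosest first. The tokenizer, the tuple AST, and the integer-only operands are simplifications made for the example.

```python
import re

# One row per precedence level, loosest binding first.
PREC_TABLE = [
    ["==", "!=", "=~"],        # equality
    ["<", ">", "<=", ">="],    # comparison
    ["|"],                     # bitor
    ["^"],                     # bitxor
    ["&"],                     # bitand
    ["<<", ">>"],              # shift
    ["+", "-"],                # term
    ["*", "/", "%"],           # factor
]

# Multi-character operators come first so "<<" is not split into "<", "<".
TOKEN_RE = re.compile(r"\s*(\d+|<<|>>|<=|>=|==|!=|=~|[-+*/%&|^<>()])")


def tokenize(src):
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"Unexpected character at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("Unexpected end of input")
        self.pos += 1
        return tok

    def parse_binary_at_level(self, level):
        # Past the last row, fall through to the operand parser.
        if level == len(PREC_TABLE):
            return self.parse_primary()
        left = self.parse_binary_at_level(level + 1)
        # Looping (not recursing) on the same level gives left associativity.
        while self.peek() in PREC_TABLE[level]:
            op = self.advance()
            right = self.parse_binary_at_level(level + 1)
            left = (op, left, right)
        return left

    def parse_primary(self):
        tok = self.advance()
        if tok == "(":
            inner = self.parse_binary_at_level(0)
            if self.advance() != ")":
                raise SyntaxError("Expected ')'")
            return inner
        if not tok.isdigit():
            raise SyntaxError(f"Expected operand, got {tok!r}")
        return int(tok)


def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.parse_binary_at_level(0)
    if parser.peek() is not None:
        raise SyntaxError(f"Unexpected token {parser.peek()!r}")
    return tree
```

Adding an operator or a whole precedence level is then a change to PREC_TABLE only, which is the property the qualitative metrics ask for.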

Issue 3: Comma-Separated List Parsing (15 occurrences)

8-12 lines repeated 15 times:

while ps_check(ps, token_constants$TOK_COMMA()) == 1
  ps_advance(ps)
  ps_skip_newlines(ps)
  if ps_check(ps, CLOSER) == 1
    break
  end
  elem = ps_parse_XXX(ps)
  items.push(elem)
end

Fix: Macro or combinator

# Macro approach
items = $parse_list!(ps, TOK_RPAREN, ps_parse_expr)

# Or combinator
items = ps_parse_comma_list(ps, TOK_RPAREN, ps_parse_expr)

Savings: ~80 lines
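
The combinator variant might look like the sketch below, written in Python so it runs as-is. Unlike the loop shown above, it also parses the first element and handles the empty list; that extension, the token-list interface, and the (value, new_pos) convention for `parse_item` are assumptions.

```python
def _peek(tokens, pos):
    return tokens[pos] if pos < len(tokens) else None


def parse_comma_list(tokens, pos, closer, parse_item):
    """Parse item (',' item)* with an optional trailing comma.

    Stops at closer without consuming it and returns (items, new_pos).
    """
    items = []
    if _peek(tokens, pos) == closer:
        return items, pos
    item, pos = parse_item(tokens, pos)
    items.append(item)
    while _peek(tokens, pos) == ",":
        pos += 1
        while _peek(tokens, pos) == "\n":  # newlines are allowed after a comma
            pos += 1
        if _peek(tokens, pos) == closer:   # trailing comma
            break
        item, pos = parse_item(tokens, pos)
        items.append(item)
    return items, pos


def parse_int(tokens, pos):
    # Example element parser: one integer token.
    return int(tokens[pos]), pos + 1
```

Passing the element parser as an argument is what lets the 15 call sites share one loop.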

Issue 4: if/elsif/unless Duplication (Lines 1853-1940)

ps_parse_if and ps_parse_elsif are 90% identical (~28 lines each).

Fix: Extract ps_parse_conditional_body()

Savings: ~35 lines

Issue 5: Literal Dispatch Cascade (Lines 424-496)

8 consecutive token checks:

if ps_current_type(ps) == token_constants$TOK_INT()
  var lex = ps_current_lexeme(ps)
  var ln = ps_current_line(ps)
  var cl = ps_current_col(ps)
  ps_advance(ps)
  return ast$ast_int_lit(s, str_to_int(lex), ln, cl)
end
# ... 7 more identical patterns

Fix: Match expression or ps_capture_loc() helper

Savings: ~30 lines

Issue 6: Doc Comment Handling (8 occurrences)

var doc = ps_take_pending_doc(ps)
# ... parse ...
if doc.size > 0
  ast$ast_set_doc(s, node, doc)
end
return node

Fix: Wrapper function ps_with_doc(ps, node): Int

Savings: ~20 lines

Total for parser.qz: ~435-600 lines

Macro Opportunities

Quartz's macro system supports proper error propagation, which makes it a good fit for the following boilerplate patterns:

1. `$expect!(token, message)` — Parser Expected Pattern

# Before (everywhere in parser.qz)
if ps_current_type(ps) != token_constants$TOK_LPAREN()
  ps_error(ps, "Expected '('")
  return 0
end
ps_advance(ps)

# After
$expect!(ps, TOK_LPAREN, "Expected '('")

2. `$emit_ir!(template, args...)` — Codegen Line Emission

# Before (149 times in codegen.qz)
cg_emit_line(out, "  %v" + d + ".ptr = inttoptr i64 %v" + int_to_str(arg) + " to i8*")

# After
$emit_ir!(out, "  %v{d}.ptr = inttoptr i64 %v{arg} to i8*")

3. `$match_kind!(node, handlers...)` — AST Dispatch

# Before (68 times in mir.qz)
if kind == NODE_INT_LIT()
  return mir_emit_const_int(ctx, ...)
elsif kind == NODE_BOOL_LIT()
  return mir_emit_const_bool(ctx, ...)
# ...

# After
$match_kind!(kind,
  NODE_INT_LIT => mir_emit_const_int(ctx, ...),
  NODE_BOOL_LIT => mir_emit_const_bool(ctx, ...),
  ...
)

4. `$register_builtins!(list)` — Bulk Registration

# Before (283 lines in typecheck.qz)
tc_register_builtin(tc, "puts", TYPE_VOID())
tc_register_builtin(tc, "str_len", TYPE_INT())
# ... 281 more

# After
$register_builtins!(tc, [
  ("puts", TYPE_VOID()),
  ("str_len", TYPE_INT()),
  # ...
])

Implementation Roadmap

Phase 1: Quick Wins (1 week)

Extract cg_emit_inttoptr() — DONE (71 replacements, 10x more readable)
Extract cg_emit_ptrtoint() — ~75 occurrences remaining
Extract ps_parse_optional_type_params() — ~100 lines
Extract mir_ensure_terminated() — ~36 lines
Combine identical intrinsics (puts/print, etc.) — ~15 lines

Total: ~450 lines

Phase 2: Table-Driven Refactoring (2 weeks)

Create INTRINSICS registry in new file — ~3,100 lines
Type mapping tables in typecheck.qz — ~80 lines
UFCS lookup table — ~90 lines
Operator string table — ~35 lines
Binary operator precedence table — ~150 lines

Total: ~3,455 lines

Phase 3: Macro Development (1 week)

$expect! macro for parser
$emit_ir! macro for codegen
$match_kind! macro for dispatch
$register_builtins! macro

Total: Variable (infrastructure for future savings)

Phase 4: Function Splitting (1 week)

Split cg_emit_intrinsic by category — organization
Split mir_lower_expr by category — ~400 lines
Extract loop scope helpers — ~80 lines

Total: ~480 lines

Success Metrics

Quantitative

Total lines reduced by 4,000+
Largest function < 500 lines (currently 3,400)
No function with > 50 if/elsif branches
All intrinsics defined in single source of truth

Qualitative

Adding new intrinsic requires editing 1 file (not 3)
Adding new binary operator requires adding 1 table entry
Parser patterns are obvious and consistent
Macro usage documented in ref.md

Next Steps

Review this audit for accuracy
Prioritize which phase to tackle first
Create tracking issues/todos for each item
Begin Phase 1 (quick wins) while designing Phase 2 architecture
