Benchmark Analysis — v5.25.0-alpha
Date: 2026-02-16 | Commit: b9d5066 | Tests: 3,274
Results Summary
| Benchmark | Quartz (self-hosted) | C (clang -O2) | Ratio | Status |
|---|---|---|---|---|
| fibonacci | 316.8ms | 317.8ms | 1.0x | ✅ Tied |
| sum | 2.3ms | 2.2ms | 1.1x | ✅ Tied |
| sieve | 63.7ms | 13.7ms | 4.7x | 🔴 Needs work |
| string_concat | 2.4ms | 2.6ms | 0.9x | ✅ Faster than C |
| matrix | 5.4ms | 5.3ms | 1.0x | ✅ Tied |
| binary_trees | — | 5.8ms | — | ⚠️ Self-hosted failed to compile |
Quartz (bootstrap) on binary_trees: 4.9ms (0.8x C — faster).
Cross-Language Comparison
| Benchmark | Quartz | C | Rust | Zig | Notes |
|---|---|---|---|---|---|
| fibonacci | 316.8ms | 317.8ms | 317.2ms | 317.9ms | All tied |
| sum | 2.3ms | 2.2ms | 2.4ms | 2.3ms | All tied |
| sieve | 63.7ms | 13.7ms | 12.9ms | 13.4ms | Quartz 4.7x slower |
| string_concat | 2.4ms | 2.6ms | 2.6ms | 2.2ms | Quartz beats C & Rust |
| matrix | 5.4ms | 5.3ms | 5.6ms | 5.7ms | Quartz beats Rust & Zig |
| binary_trees | 4.9ms* | 5.8ms | 6.0ms | 2.9ms | Zig 0.5x C |
*Bootstrap only; self-hosted failed to compile.
Analysis
Why Quartz Beats C at String Concat
Quartz uses sb_new() / sb_append() — a StringBuilder intrinsic with an optimized append-only buffer (built during Phase B Benchmark Optimization Sprint). The C version uses malloc + realloc with manual 2x growth, where every realloc potentially copies the entire buffer.
The ~10% win is an implementation quality difference: the Quartz StringBuilder was specifically tuned for this pattern.
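For contrast, the 2x-growth pattern described for the C version can be sketched like this (a minimal illustration with made-up names, not the benchmark's actual source):

```c
#include <stdlib.h>
#include <string.h>

/* Manual 2x-growth buffer, as in the C benchmark described above. */
typedef struct { char *buf; size_t len, cap; } Buf;

void buf_append(Buf *b, const char *s) {
    size_t n = strlen(s);
    while (b->len + n + 1 > b->cap) {
        b->cap = b->cap ? b->cap * 2 : 16;   /* double the capacity */
        b->buf = realloc(b->buf, b->cap);    /* may copy the whole buffer */
    }
    memcpy(b->buf + b->len, s, n + 1);       /* append, incl. NUL */
    b->len += n;
}
```

The copies hidden inside `realloc` are exactly where a tuned append-only builder can pull ahead.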
Why Sieve Is 4.7x Slower — The Real Issue
Root cause: 8x memory footprint.
- C: `char *sieve = malloc(n)` — 1 byte per element, flat array, direct `sieve[i]`
- Quartz: `var sieve = vec_new()` + `vec_push(1)` — 8 bytes per element (i64), heap Vec
For n=10M: C uses ~10MB (fits in L2/L3), Quartz uses ~80MB (blows past L3). The 4.7x ratio maps almost perfectly to cache hierarchy penalties.
The Quartz sieve already uses vec_get_unchecked / vec_set_unchecked, so bounds checking isn’t the issue — it’s pure data size.
Fixes (ranked by impact):
- `Vec<U8>` or `Bytes` type for sieve — use byte-width storage, matching C’s memory footprint
- BitVec type — 8 flags per byte (1.25MB for n=10M), would likely beat C
- SoA (Struct of Arrays) layout — pack boolean flags into cache-friendly layout
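The BitVec option can be sketched in C terms (illustrative names, not an existing Quartz API; 1 bit per flag means n=10M needs ~1.25MB):

```c
#include <stdlib.h>
#include <stdint.h>

/* 1 bit per flag: bits[i >> 3] holds flags for indices i..i+7. */
static inline void bit_set(uint8_t *bits, size_t i) { bits[i >> 3] |= (uint8_t)(1u << (i & 7)); }
static inline int  bit_get(const uint8_t *bits, size_t i) { return (bits[i >> 3] >> (i & 7)) & 1; }

/* Sieve of Eratosthenes over a bit-packed composite array:
   (n >> 3) + 1 bytes instead of n * 8 bytes of i64 flags. */
size_t sieve_count_primes(size_t n) {
    uint8_t *composite = calloc((n >> 3) + 1, 1);
    size_t count = 0;
    for (size_t i = 2; i < n; i++) {
        if (bit_get(composite, i)) continue;
        count++;                                   /* i is prime */
        for (size_t j = i * i; j < n; j += i)
            bit_set(composite, j);                 /* mark multiples */
    }
    free(composite);
    return count;
}
```

The shift/mask per access costs a few ALU ops, but at n=10M the entire flag array fits comfortably in L2, which is the trade the analysis above predicts would pay off.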
Why Zig Wins Binary Trees (0.5x C)
Zig uses an arena allocator:
var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
_ = arena.reset(.retain_capacity); // Frees ALL nodes in O(1)
All tree nodes are bump-allocated into a contiguous region. When done, the entire arena resets in O(1) instead of walking every node to free it.
- C: `malloc` per node + recursive `tree_free` (32,768 free calls per depth-15 tree)
- Zig: bump-allocate + O(1) arena reset
- Quartz: `alloc(24)` per node, never freed (leaks intentionally)
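The same arena pattern can be sketched in C (illustrative names, not Zig's or Quartz's actual API):

```c
#include <stdlib.h>
#include <stddef.h>

/* Bump allocator: carve allocations out of one contiguous block. */
typedef struct { char *base; size_t used, cap; } Arena;

Arena arena_new(size_t cap) { return (Arena){ malloc(cap), 0, cap }; }

void *arena_alloc(Arena *a, size_t size) {
    size = (size + 7) & ~(size_t)7;            /* keep 8-byte alignment */
    if (a->used + size > a->cap) return NULL;  /* fixed capacity in this sketch */
    void *p = a->base + a->used;
    a->used += size;
    return p;
}

/* "Frees" every node at once — O(1), no per-node walk. */
void arena_reset(Arena *a) { a->used = 0; }
void arena_free(Arena *a)  { free(a->base); a->base = NULL; a->used = a->cap = 0; }
```

Each allocation is a bounds check plus a pointer bump, and teardown is a single store, which is why the per-node `free` walk disappears from the profile.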
Binary Trees Self-Hosted Compilation Failure
Root cause: missing alloc intrinsic in self-hosted codegen.
The self-hosted compiler emits `; WARNING: Unknown intrinsic 'alloc'` and falls through
to `%v1 = add i64 0, 0` — storing null (0) as the pointer. When the program then tries
to write tree node fields at offsets 0, 8, and 16 from null, it hits a trace trap (SIGTRAP,
exit code 133 = 128 + 5).
The bootstrap compiler correctly translates alloc(24) → malloc(24 * 8):
; Bootstrap (correct):
%v0.bytes = mul i64 24, 8
%v0.ptr = call i8* @malloc(i64 %v0.bytes)
%v0 = ptrtoint i8* %v0.ptr to i64
; Self-hosted (broken):
; WARNING: Unknown intrinsic 'alloc'
%v1 = add i64 0, 0 ; ← null pointer!
store i64 %v1, i64* %ptr ; ptr = 0
Fix: Add alloc to the self-hosted compiler’s intrinsic dispatch table.
This is separate from the malloc FFI declaration — alloc(n) is a Quartz-level
intrinsic that allocates n * 8 bytes (word-sized slots).
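The intended semantics can be modeled in C (a hypothetical model of the lowering, not the compiler's own code):

```c
#include <stdint.h>
#include <stdlib.h>

/* Model of the Quartz `alloc(n)` intrinsic: n word-sized (8-byte) slots,
   lowered to malloc(n * 8), with the pointer carried as an i64. */
int64_t quartz_alloc(int64_t n_slots) {
    void *p = malloc((size_t)n_slots * 8);
    return (int64_t)(intptr_t)p;   /* ptrtoint, as in the bootstrap IR */
}
```

The self-hosted bug is precisely that this value comes back as 0 instead of a real address.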
Research: Closing the Remaining Gaps
Arena Allocators (Highest Impact)
Papers: Tofte & Talpin (1997) “Region-Based Memory Management”. MLKit proved compile-time region inference is feasible.
Approach for Quartz: Add arena allocator intrinsics:
- `arena_new()` / `arena_alloc(arena, size)` / `arena_reset(arena)` / `arena_free(arena)`
- Unlocks fast binary trees AND is exactly what web servers need (per-request arenas)
Bounds Check Elimination
LLVM can eliminate redundant bounds checks if given proper metadata:
- `!range` metadata on loads
- Induction variable analysis for loop-indexed access
- Quartz could emit `nuw`/`nsw` flags on index arithmetic
Narrow Collection Types
The sieve benchmark proves that i64-everywhere has a real cost for data-parallel workloads. Options:
- `Vec<U8>` / `Vec<I32>` with monomorphized element width
- `BitVec` for boolean flag arrays
- `Bytes` (already implemented) for byte buffers, but needs a sieve-friendly API
SoA/AoS Layout Control
For struct-heavy benchmarks, allow programmers to request Structure-of-Arrays layout:
@soa struct Particle { x: F32, y: F32, z: F32, mass: F32 }
This is a stretch goal but would make Quartz competitive in HPC/game engine scenarios.
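What `@soa` might lower to, modeled in C terms (illustrative only — the attribute and its lowering are proposals, not implemented):

```c
#include <stddef.h>

/* AoS: the four fields of one particle sit together, so a pass that
   reads only `x` still drags y/z/mass through the cache (16 B/particle). */
typedef struct { float x, y, z, mass; } ParticleAoS;

/* SoA: one contiguous array per field; the same pass touches
   4 B/particle and vectorizes cleanly. */
typedef struct { float *x, *y, *z, *mass; size_t len; } ParticlesSoA;

float sum_x(const ParticlesSoA *p) {
    float s = 0.0f;
    for (size_t i = 0; i < p->len; i++) s += p->x[i];
    return s;
}
```

The compiler would own the field-splitting transform, so user code keeps the `Particle`-at-a-time view while field-wise passes get the SoA memory layout.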