ascii-bench

A benchmark for evaluating ASCII art generation using geometric evaluation — no human raters, no LLM judges, no exact-match heuristics.

What is ascii-bench?

ascii-bench evaluates ASCII art generation capability using a single core technique: render the output as a PNG, embed it in Gemini Embedding 2's unified multimodal space, and measure where it lands. This turns a qualitative judgment — "does this look like the right thing?" — into a reproducible scalar measurement over a well-defined geometric space.

The approach exploits a structural property of Gemini Embedding 2: it places text, images, audio, and video into the same 3072-dimensional manifold, trained on cross-modal correspondence across the internet. The model has already learned what cats look like, what noise looks like, what Katakana characters look like. When you embed an ASCII art PNG, you get a vector that reflects the embedding model's interpretation of the image's visual content — not its character content.

This separation is precisely what makes the metric useful. A model that fills the grid with a single repeated character (aaaaaa) when asked for noise may pass a character-level check but will land far from the noise region in the embedding space. The geometry sees through the characters to the visual impression they create.

Why It Matters

Prior ASCII art evaluation has been stuck in one of three modes: human raters, LLM-as-judge scoring, or exact-match heuristics.

ascii-bench offers a fourth path: automatic, scalable, cross-modal, script-agnostic geometric evaluation. It is cheap enough to run at training time, transparent enough to debug (failed outputs can be visualized and measured), and general enough to extend to any writing system without re-labeling.

The evaluations in this benchmark are also conceptually novel. noise-01 measures something that no benchmark has tried to measure before: what does it mean for a model to correctly generate randomness? Is there a noise manifold, and can a model land inside it? The answer turns out to be empirically interesting, and the geometry reveals structure that a character-level metric would completely miss.

Seeding the Evaluation Space

Before building any evaluation, we needed to establish what the semantic space looks like. We started with two corpora: noise (18-character operator grids) and blank canvases (structural frames with empty interiors). Embedding both in Gemini Embedding 2 and projecting to 2D PCA reveals how they cluster:

[Figure: fig_noise_vs_blank.png] ascii-bench corpus seeding: PCA projection of noise and blank canvas corpora. Green circles = noise; orange squares = blank canvases. Centroid-to-centroid cosine similarity: 0.8105. These two corpora define the first axis of the evaluation space.

The centroid-to-centroid cosine similarity between noise and blank canvas is 0.8105 — meaningful separation in a space where same-corpus pairwise similarities range from 0.76–0.97. This established a baseline: we could measure semantic distance with a real signal.
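
The seeding measurement itself is small. A minimal sketch, assuming the two corpora have already been embedded into NumPy arrays of shape (n, 3072) — the variable and helper names here are illustrative, not from the repository:

```python
import numpy as np

def centroid(vectors: np.ndarray) -> np.ndarray:
    # Mean embedding of a corpus: (n, d) -> (d,)
    return vectors.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# noise_vecs and blank_vecs would come back from the embedding API:
#   separation = cosine(centroid(noise_vecs), centroid(blank_vecs))
```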

Why blank canvases matter: The original "bad" case in the proof-of-concept was a house outline:

 _____
|     |
|_____|

It was meant to represent "wrong" art, but it is actually more interesting than that: it is semantically empty. It looks like nothing in particular. It is a blank canvas, not a failure. Understanding this helped clarify what the noise cluster actually measures: not "random characters" but "characters with no visual identity."

The Evaluation Triangle

The benchmark space is anchored by three semantic regions measured empirically in Gemini Embedding 2's 3072-dimensional manifold: noise, blank canvas, and computer ASCII art.

Key finding: blank canvas and computer ASCII art are geometrically close (centroid cosine similarity: 0.9166) because both use the same box-drawing grammar. A blank frame is a computer bezel with nothing inside it. The embedding space sees structural kinship where human eyes see semantic difference. The largest gap is between noise and computer art (similarity 0.7927) — the two most semantically distinct regions.

[Figure: fig_triangle.png] ascii-bench evaluation triangle: PCA projection of all three anchor corpora. Noise (green circles), blank canvas (orange squares), computer ASCII art (blue triangles). Centroid cosine similarities shown on the triangle edges. The evaluation space is the geometry between these three anchors.

What the triangle means for scoring: A generated ASCII art piece is measured by where it lands relative to all three anchors. A good computer ASCII art output should be close to the computer centroid, moderately close to blank (structural similarity), and far from noise (semantic difference). The triangle makes this relationship explicit and measurable.
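
As a sketch, scoring against the triangle reduces to comparing one embedding with three anchor centroids. Assuming the centroids are already available as NumPy vectors (the function name and anchor labels here are illustrative):

```python
import numpy as np

def triangle_profile(vec, anchors):
    # Cosine similarity of one rendered-and-embedded output against each
    # anchor centroid, e.g. {"noise": ..., "blank": ..., "computer": ...}.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {name: cos(vec, c) for name, c in anchors.items()}

# A healthy computer-art output would profile roughly as:
#   high similarity to "computer", moderate to "blank", low to "noise".
```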

Benchmark Structure

| Evaluation | Status | What it measures | Constraint |
| --- | --- | --- | --- |
| noise-01 | ✓ COMPLETE | 3×6 noise grid generation and noise-region membership. Tests whether a model can produce visually random output that lands geometrically inside the noise manifold of the embedding space. | 3 rows × 6 chars; no semantic content; multi-script |
| raw-01 | 🔬 PLANNED | Unconstrained ASCII art quality vs. reference image. Image-to-image cosine between rendered output and a reference PNG. | Open height/width; reference image provided |
| self-01 | 🔬 PLANNED | Model self-portrait as ASCII art. Measured against the model's own origin in the embedding space — a form of geometric self-knowledge. | Open; no reference; centroid measurement |

The Scoring Framework

Each evaluation in ascii-bench produces a combined float score by composing two orthogonal signals:

1. Structural score (grid adherence)

A fast, API-free check that measures whether the output conforms to the structural constraint of the evaluation. For noise-01, this means: does the output have exactly 3 rows of exactly 6 visible characters? The structural score is always computed first — if it is zero, the geometric score is skipped entirely (saving API cost and avoiding penalizing structurally invalid outputs on semantic grounds).

def grid_adherence(text: str, rows: int = 3, cols: int = 6) -> float:
    # Returns float in [0, 1]:
    # 1.0 = perfect 3×6 structure; 0.0 = completely wrong shape
    ...
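
The stub above fixes only the signature. One plausible realization — a sketch, not necessarily the repository's actual scoring — penalizes row-count and per-row width deviations linearly and multiplies the two components:

```python
def grid_adherence(text: str, rows: int = 3, cols: int = 6) -> float:
    # Sketch implementation: multiply a row-count score by an average
    # per-row visible-width score, each falling off linearly to 0.
    lines = [ln.rstrip() for ln in text.strip("\n").splitlines()]
    if not lines:
        return 0.0
    row_score = max(0.0, 1.0 - abs(len(lines) - rows) / rows)
    col_score = sum(
        max(0.0, 1.0 - abs(len(ln) - cols) / cols) for ln in lines
    ) / len(lines)
    return row_score * col_score
```

A perfect 3×6 grid scores 1.0; an empty string scores 0.0; a one-line fragment lands near zero.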

2. Geometric score (noise-region membership)

The semantic signal. The output is rendered as a PNG, embedded via Gemini Embedding 2 using the interleaved strategy (text + image fused into one vector), and measured against the centroid of the reference corpus. The raw cosine similarity is normalized within the observed noise-to-noise range to produce a score in [0, 1].

def noise_similarity(text: str, centroid: np.ndarray) -> float:
    # render, embed_interleaved, cosine, noise_floor, and noise_spread are
    # module-level helpers/constants established during corpus seeding
    png = render(text)
    vec = embed_interleaved(text, png)  # text + image fused into one vector
    sim = cosine(vec, centroid)
    # Normalize within the observed noise-to-noise range
    return float(np.clip((sim - noise_floor) / noise_spread, 0.0, 1.0))
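
The normalization constants can be calibrated once from the reference corpus itself. A sketch of one plausible derivation, assuming the corpus embeddings are available as an (n, d) array (the helper name is hypothetical):

```python
import numpy as np
from itertools import combinations

def calibrate_noise_range(vectors: np.ndarray) -> tuple:
    # Derive (noise_floor, noise_spread) from pairwise cosine similarities
    # within the reference corpus: floor = min, spread = max - min.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cos(a, b) for a, b in combinations(vectors, 2)]
    floor = min(sims)
    return floor, max(sims) - floor
```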

Combined score

The two signals are combined as a weighted sum, with weight configurable per evaluation. The default for noise-01 is 40% structural, 60% geometric — structural conformance matters, but the geometric signal carries more information about what the output actually looks like.

score = adherence_weight * structural_score + (1 - adherence_weight) * geometric_score
# adherence_weight default: 0.4

This combined score is a float in [0, 1] suitable for direct use as a DSPy metric. During MIPROv2 bootstrapping, it is thresholded to a boolean (default threshold: 0.50) to filter candidate few-shot demonstrations.
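
In code, the gate-then-combine logic and the bootstrap threshold might look like this (helper names are illustrative; the repository's actual DSPy wiring may differ):

```python
def combined_score(structural: float, geometric: float,
                   adherence_weight: float = 0.4) -> float:
    # If the structural score is zero, the geometric embedding call is
    # never made, so the combined score short-circuits to 0.0.
    if structural == 0.0:
        return 0.0
    return adherence_weight * structural + (1 - adherence_weight) * geometric

def bootstrap_label(score: float, threshold: float = 0.5) -> bool:
    # Boolean view used when filtering candidate few-shot demonstrations
    # during MIPROv2 bootstrapping.
    return score >= threshold
```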

Stack

Read the noise-01 deep dive →