A benchmark for evaluating ASCII art generation using geometric evaluation — no human raters, no LLM judges, no exact-match heuristics.
ascii-bench evaluates ASCII art generation capability using a single core technique: render the output as a PNG, embed it in Gemini Embedding 2's unified multimodal space, and measure where it lands. This turns a qualitative judgment — "does this look like the right thing?" — into a reproducible scalar measurement over a well-defined geometric space.
The approach exploits a structural property of Gemini Embedding 2: it places text, images, audio, and video into the same 3072-dimensional manifold, trained on cross-modal correspondence across the internet. The model has already learned what cats look like, what noise looks like, what Katakana characters look like. When you embed an ASCII art PNG, you get a vector that reflects the embedding model's interpretation of the image's visual content — not its character content.
This separation is precisely what makes the metric useful. A model that generates `aaaaaaa` in a grid when asked for noise may pass a character-level check but will land far from the noise region in the embedding space. The geometry sees through the characters to the visual impression they create.
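To make this concrete, here is a hypothetical character-level check (not part of the benchmark, purely illustrative): a degenerate all-`a` grid satisfies shape and charset constraints while carrying no visual randomness at all.

```python
def passes_char_check(text: str, rows: int = 3, cols: int = 6) -> bool:
    # Hypothetical character-level check: right shape, printable non-space chars.
    lines = text.splitlines()
    return len(lines) == rows and all(
        len(ln) == cols and ln.isprintable() and " " not in ln for ln in lines
    )

degenerate = "aaaaaa\naaaaaa\naaaaaa"
print(passes_char_check(degenerate))  # True, yet visually anything but noise
```

The geometric score is what catches this case: the rendered grid of identical letters embeds far from the noise region.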
Prior ASCII art evaluation has been stuck in one of three modes: human raters, LLM judges, or exact-match heuristics.
ascii-bench offers a fourth path: automatic, scalable, cross-modal, script-agnostic geometric evaluation. It is cheap enough to run at training time, transparent enough to debug (failed outputs can be visualized and measured), and general enough to extend to any writing system without re-labeling.
The evaluations in this benchmark are also conceptually novel. noise-01 measures something that no benchmark has tried to measure before: what does it mean for a model to correctly generate randomness? Is there a noise manifold, and can a model land inside it? The answer turns out to be empirically interesting, and the geometry reveals structure that a character-level metric would completely miss.
Before building any evaluation, we needed to establish what the semantic space looks like. We started with two corpora: noise (18-character operator grids) and blank canvases (structural frames with empty interiors). Embedding both in Gemini Embedding 2 and projecting to 2D PCA reveals how they cluster:
*Figure (fig_noise_vs_blank.png): ascii-bench corpus seeding — PCA projection of noise and blank canvas corpora. Green circles = noise; orange squares = blank canvases. Centroid-to-centroid distance: 0.8105. These two corpora define the first axis of the evaluation space.*

The centroid-to-centroid distance between noise and blank canvas is 0.8105 — meaningful separation in a space where same-corpus pairs range from 0.76–0.97. This established a baseline: we could measure semantic distance with a real signal.
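The centroid measurement itself is a few lines of linear algebra. A minimal sketch, using random stand-in vectors since the real 3072-dim Gemini Embedding 2 corpus vectors are not reproduced here:

```python
import numpy as np

def centroid_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine distance (1 - cosine similarity) between the mean vectors
    # of two embedding corpora; a, b have shape (n_samples, dim).
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    return 1.0 - float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))

# Stand-in corpora; the real ones are Gemini Embedding 2 outputs.
rng = np.random.default_rng(0)
noise_corpus = rng.normal(0.0, 1.0, size=(18, 64))
blank_corpus = rng.normal(0.5, 1.0, size=(12, 64))
print(centroid_distance(noise_corpus, blank_corpus))
```

Whether "distance" is reported as cosine distance or raw cosine similarity is a reporting choice; the cluster separation in the figure is the same either way.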
One test case — a simple house outline — was meant to represent "wrong" art:

```
 _____
|     |
|_____|
```

But it's actually more interesting than that: it is semantically empty. It looks like nothing in particular. It is a blank canvas, not a failure. Understanding this helped clarify what the noise cluster actually measures: not "random characters" but "characters with no visual identity."
The benchmark space is anchored by three semantic regions measured empirically in Gemini Embedding 2's 3072-dimensional manifold:
| Evaluation | Status | What it measures | Constraint |
|---|---|---|---|
| noise-01 | ✓ COMPLETE | 3×6 noise grid generation and noise-region membership. Tests whether a model can produce visually random output that lands geometrically inside the noise manifold of the embedding space. | 3 rows × 6 chars; no semantic content; multi-script |
| raw-01 | 🔬 PLANNED | Unconstrained ASCII art quality vs. reference image. Image-to-image cosine between rendered output and a reference PNG. | Open height/width; reference image provided |
| self-01 | 🔬 PLANNED | Model self-portrait as ASCII art. Measured against the model's own origin in the embedding space — a form of geometric self-knowledge. | Open; no reference; centroid measurement |
Each evaluation in ascii-bench produces a combined float score by composing two orthogonal signals:
A fast, API-free check that measures whether the output conforms to the structural constraint of the evaluation. For noise-01, this means: does the output have exactly 3 rows of exactly 6 visible characters? The structural score is always computed first — if it is zero, the geometric score is skipped entirely (saving API cost and avoiding penalizing structurally invalid outputs on semantic grounds).
def grid_adherence(text: str, rows: int = 3, cols: int = 6) -> float:
# Returns float in [0, 1]
# 1.0 = perfect 3×6 structure; 0.0 = completely wrong shape
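One way to fill in that signature is a sketch along the following lines. The scoring curve here — linear partial credit for row and column deviations — is an assumption for illustration, not necessarily the benchmark's exact formula:

```python
def grid_adherence(text: str, rows: int = 3, cols: int = 6) -> float:
    # Sketch: linear partial credit for deviation from the target grid shape.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0  # no visible characters at all
    row_score = max(0.0, 1.0 - abs(len(lines) - rows) / rows)
    col_score = sum(
        max(0.0, 1.0 - abs(len(ln) - cols) / cols) for ln in lines
    ) / len(lines)
    return row_score * col_score
```

A perfect 3×6 grid scores 1.0; a single short line scores well below 0.5; an empty string scores 0.0, which triggers the skip of the geometric stage.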
The semantic signal. The output is rendered as a PNG, embedded via Gemini Embedding 2 using the interleaved strategy (text + image fused into one vector), and measured against the centroid of the reference corpus. The raw cosine similarity is normalized within the observed noise-to-noise range to produce a score in [0, 1].
def noise_similarity(text: str, centroid: np.ndarray) -> float:
png = render(text)
vec = embed_interleaved(text, png)
sim = cosine(vec, centroid)
    # Normalize within [noise_floor, noise_ceiling],
    # where noise_spread = noise_ceiling - noise_floor
    return float(np.clip((sim - noise_floor) / noise_spread, 0.0, 1.0))
The two signals are combined as a weighted sum, with weight configurable per evaluation. The default for noise-01 is 40% structural, 60% geometric — structural conformance matters, but the geometric signal carries more information about what the output actually looks like.
score = adherence_weight * structural_score + (1 - adherence_weight) * geometric_score
# adherence_weight default: 0.4
This combined score is a float in [0, 1] suitable for direct use as a DSPy metric. During MIPROv2 bootstrapping, it is thresholded to a boolean (default threshold: 0.50) to filter candidate few-shot demonstrations.
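The composition and thresholding can be sketched as follows (function names are illustrative, not the benchmark's actual API):

```python
def combined_score(structural: float, geometric: float,
                   adherence_weight: float = 0.4) -> float:
    # If the structural score is zero, the geometric stage is skipped;
    # model that here by short-circuiting to 0.0.
    if structural == 0.0:
        return 0.0
    return adherence_weight * structural + (1 - adherence_weight) * geometric

def bootstrap_filter(score: float, threshold: float = 0.50) -> bool:
    # During MIPROv2 bootstrapping, the float score is thresholded to a
    # boolean to filter candidate few-shot demonstrations.
    return score >= threshold
```

Note the asymmetry: a structurally invalid output scores 0.0 outright, while a structurally valid but semantically empty one still earns its 40% structural credit.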