Deep technical dive — 3×6 ASCII noise grid generation and geometric measurement. Part of ascii-bench.
noise-01 is the simplest possible evaluation in ascii-bench. It asks a model to generate a 3-row × 6-character grid of ASCII noise — 18 characters with no semantic content, no recognizable shapes, no letters or digits. It then measures two things: structural adherence (is the output actually a 3×6 grid of the right character class?) and geometric position (does the output's embedding land inside the noise region defined by the reference corpus?).
The constraint "exactly 18 characters in a 3×6 grid" is not arbitrary. It is small enough that a failure to adhere is clearly a structural problem, not an ambiguity. It is also large enough to have meaningful visual texture when rendered — 18 characters at 48px monospace produce an image with visible density patterns. The embedding model reads those patterns.
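The structural half of the check can be sketched as a simple checklist over rows, row length, and character class. This is an illustrative version only — the helper name `grid_adherence` appears later in the metric code, but the equal weighting of the three checks here is an assumption, not the benchmark's exact implementation:

```python
def grid_adherence(text: str) -> float:
    """Score structural adherence to the 3x6 spec in [0, 1] (illustrative sketch)."""
    lines = text.strip("\n").split("\n")
    checks = [
        len(lines) == 3,                            # exactly 3 rows
        all(len(line) == 6 for line in lines),      # exactly 6 characters per row
        all(not ch.isalnum() for line in lines for ch in line),  # no letters or digits
    ]
    return sum(checks) / len(checks)
```

Under this sketch, a grid with the right shape but alphanumeric content scores 2/3, while output that is not even grid-shaped scores 0.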
No one has measured this before, because before multimodal embeddings there was no obvious way to measure it. What does it mean for a language model to correctly generate randomness? The answer turns out to involve geometry, topology, and writing system identity in ways that character-level metrics completely miss.
The reference corpus defines the noise region. Version 0.1 contains five hand-crafted 3×6 grids, all using shifted keyboard characters (the symbols accessed by holding Shift on a standard QWERTY keyboard). No letters, no digits — only punctuation and operator characters.
The choice of shifted characters is deliberate. These are the characters a model reaches for when asked to produce something with no semantic content in its native Latin environment. They are the model's natural visual vocabulary for "noise" — the characters that live at the boundary between meaningful and meaningless in the training distribution.
The keyboard spiral observation: the first grid (x@#$%^&*()_+!?><{}) traces a path across the keyboard that approximates a golden ratio spiral when read left-to-right, top-to-bottom. This was not designed — it emerged from asking "what arrangement of shifted characters feels most random?" The geometry of the keyboard encodes its own notion of visual density and traversal order.
```python
# The five Latin reference grids (v0.1)

"x@#$%^"   # keyboard spiral — the golden ratio path
"&*()_+"
"!?><{}"

"~`|\;'"   # upper row sweep + punctuation descent
'",.:?!'
"[]{}()"

"!@#$%^"   # bracket / operator cluster
"&*-+=<"
">/?|~`"

"^%$#@!"   # diagonal shift sweep
"*&()_+"
"><{}[]"

"@#$&*("   # dense operator field
")!?><{"
"}|~`\;"
```
Before committing to a single embedding strategy for the metric, we compared three approaches on the same corpus and test cases:

- **Image.** The ASCII grid is rendered to a PNG and embedded as an image. Only visual information is available to the embedding model — it sees the density and spatial arrangement of characters, not their symbolic identity.
- **Text.** The raw ASCII string is embedded as text. The embedding model reads the character sequence symbolically — it knows these are parentheses and dollar signs, not just dense marks.
- **Interleaved.** Both the raw text and the rendered PNG are passed as parts of a single Content object. Gemini Embedding 2 fuses them into one 3072-dim vector, providing visual and symbolic information simultaneously.
| Strategy | Noise-to-noise range | Spread | Noise score | Non-noise score |
|---|---|---|---|---|
| Image | [0.9197, 0.9798] | 0.0601 | 1.000 | 0.000 |
| Text | [0.8703, 0.9609] | 0.0905 | 1.000 | 0.000 |
| Interleaved | [0.8960, 0.9680] | 0.0720 | 1.000 | 0.000 |
All three strategies achieve a noise score of 1.000 and non-noise score of 0.000 — perfect separation between noise grids and non-noise grids (alphabetic, numeric, semantic, wrong-size). The embedding space cleanly partitions the noise region regardless of embedding mode.
The strategies differ in their internal structure. Text has the largest spread (0.0905) — the noise grids are more differentiated from each other symbolically than visually. Interleaved acts as the strictest judge of wrong-size outputs: a grid with the right characters but wrong dimensions gets a lower score from the interleaved strategy than from the image strategy alone. We use interleaved as the default.
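The range and spread figures in the table come from pairwise similarities inside the reference corpus. A minimal sketch of that computation, assuming the corpus vectors are already unit-normalized (so a dot product is a cosine):

```python
from itertools import combinations

import numpy as np


def noise_to_noise(vecs: np.ndarray) -> tuple[float, float, float]:
    """Min, max, and spread of pairwise cosines across the reference corpus."""
    sims = [float(a @ b) for a, b in combinations(vecs, 2)]  # unit vectors: dot = cosine
    return min(sims), max(sims), max(sims) - min(sims)
```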
Corpus v0.2 adds seven grids across four non-Latin script families: Katakana, Cyrillic, Greek, and Arabic. This tests whether noise is a universal concept in the embedding manifold, or anchored to the Latin keyboard.
A 2D PCA projection of the 12 corpus vectors reveals five completely distinct clusters — one per script family. Latin keyboard symbols: top-right, tight cluster. Cyrillic: top-left, tight pair. Greek: near Cyrillic but distinct. Arabic: bottom-left-center. Katakana: bottom-left corner, most distant from Latin. Noise is not a single universal region. Each writing system occupies a distinct sub-region of the noise manifold.
The PCA projection explains 61.7% of variance in just two dimensions — a high fraction for 3072-dimensional vectors, indicating that script family identity is the dominant source of variation in the corpus. The remaining 38.3% reflects within-script variation between individual grid designs.
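The projection itself needs nothing beyond an SVD on the centered vectors. A minimal numpy-only sketch (the function name `pca_2d` is ours, not the benchmark's):

```python
import numpy as np


def pca_2d(vectors: np.ndarray) -> tuple[np.ndarray, float]:
    """Project row vectors onto their top two principal components.

    Returns the 2D coordinates and the fraction of variance they explain.
    """
    X = vectors - vectors.mean(axis=0)      # center the point cloud
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ Vt[:2].T                   # coordinates along the top-2 axes
    explained = float((S[:2] ** 2).sum() / (S ** 2).sum())
    return coords, explained
```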
This finding has a concrete calibration implication: a unified centroid is a Latin-dominated centroid. With 5 Latin grids out of 12, the centroid is pulled toward the Latin cluster. Scores for Latin grids are systematically higher, and scores for non-Latin grids systematically lower, regardless of geometric quality relative to their own script cluster. Per-script centroids would be more accurate.
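Per-script calibration is a small change: group the corpus vectors by script family before averaging, and score each output against the centroid of its own family. A sketch, where the script labels are an assumption about how the corpus is tagged:

```python
import numpy as np


def per_script_centroids(vecs: np.ndarray, scripts: list[str]) -> dict[str, np.ndarray]:
    """One unit-norm centroid per script family instead of a single Latin-dominated one."""
    centroids = {}
    for script in set(scripts):
        group = vecs[[i for i, s in enumerate(scripts) if s == script]]
        c = group.mean(axis=0)
        centroids[script] = c / np.linalg.norm(c)  # re-normalize the mean
    return centroids
```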
GPT-4o was asked to generate 3×6 noise grids under 10 different framings: six Latin-oriented and four multilingual. Each framing shapes what the model understands as "noise" — and those differences are geometrically measurable.
The most striking result: LLM outputs land near their respective script reference clusters. This is behavioral introspection without self-report — we learn about the model's relationship to writing systems by measuring where its outputs live geometrically, without asking the model anything about itself.
Notable: the "keyboard self" framing drifts far from the noise cluster on PCA. When asked to generate what it "sees" when looking at a keyboard, the model produces a conceptual map of the keyboard rather than noise — a mix of characters that is coherent but not random. The geometry distinguishes between "what keyboards contain" and "what noise looks like."
Scores against the unified corpus centroid:
| Source | Script | Score range | Note |
|---|---|---|---|
| Reference corpus | Latin | 0.87–0.93 | Centroid-dominant script, highest scores |
| Reference corpus | Cyrillic | 0.83–0.85 | Visually similar to Latin |
| Reference corpus | Greek | 0.79 | Alphabetic but visually distinct |
| Reference corpus | Arabic | 0.75 | Visually distinct, right-to-left |
| Reference corpus | Katakana | 0.73–0.75 | Most distant from Latin |
| LLM (keyboard native) | Latin | 0.93 | Native noise vocabulary, highest LLM score |
| LLM (plain ask) | Latin | 0.88 | Solid default |
| LLM (visual static) | Latin | 0.71 | Framing pulls toward peripheral noise |
| LLM (keyboard self) | Latin | <0.71 | Drifts far from noise region on PCA |
| LLM (cyrillic) | Cyrillic | 0.80 | Near reference cluster |
| LLM (greek) | Greek | 0.81 | Near reference cluster |
| LLM (katakana) | Katakana | 0.66 | Below reference; centroid bias applies |
| LLM (arabic) | Arabic | 0.65 | Below reference; centroid bias applies |
All non-Latin scores are depressed by the Latin-dominated unified centroid. This is not a failure of those outputs — it is a calibration artifact. Per-script centroids would give fairer absolute scores. The PCA projection is the better diagnostic for non-Latin outputs.
```python
import numpy as np

# Observed noise-to-noise cosine range (interleaved strategy, Latin corpus)
noise_floor = 0.896
noise_ceil = 0.968


def noise_eval(
    text: str,
    noise_centroid: np.ndarray,
    trace=None,
    threshold: float = 0.50,
    adherence_weight: float = 0.4,
) -> float | bool:
    """
    Combined noise metric for DSPy.

    Args:
        text: The ASCII grid output to evaluate.
        noise_centroid: Pre-computed centroid of the noise corpus embeddings.
            Normalized unit vector in 3072-dim space.
        trace: DSPy trace object.
            None     = evaluation mode → returns float in [0, 1]
            not None = bootstrapping   → returns bool (score >= threshold)
        threshold: Bootstrap pass threshold. Default 0.50.
        adherence_weight: Weight of structural score vs. geometric score.
            0.4 = 40% structural, 60% geometric. Default 0.4.
    """
    # Step 1: structural check (no API cost)
    adherence = grid_adherence(text)

    # Step 2: geometric check (only if structurally non-zero)
    if adherence == 0.0:
        score = 0.0
    else:
        rendered = render(text)                       # ASCII grid → PNG
        pred_vec = embed_interleaved(text, rendered)  # text + PNG → one vector
        similarity = cosine(pred_vec, noise_centroid)

        # Normalize raw cosine into [0, 1] using the observed noise-to-noise range
        similarity_norm = float(np.clip(
            (similarity - noise_floor) / (noise_ceil - noise_floor), 0.0, 1.0
        ))
        score = (adherence_weight * adherence
                 + (1 - adherence_weight) * similarity_norm)

    # Step 3: dual-mode return (matches DSPy metric contract)
    return score if trace is None else score >= threshold
```
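The centroid the metric consumes can be built once from the reference grids. A sketch, with `embed_fn` standing in for the interleaved embedding call (the real `build_noise_centroid` presumably loads the corpus itself rather than taking arguments):

```python
import numpy as np


def build_noise_centroid(grids: list[str], embed_fn) -> np.ndarray:
    """Unit-normalized mean of the corpus embeddings (one embedding call per grid)."""
    vecs = np.stack([embed_fn(g) for g in grids])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize each vector
    centroid = vecs.mean(axis=0)
    return centroid / np.linalg.norm(centroid)                 # re-normalize the mean
```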
The metric wraps into a DSPy-compatible function and plugs directly into MIPROv2. The centroid is pre-computed once at startup to avoid redundant API calls during the optimization loop:
```python
import dspy

# Pre-compute the centroid once at startup
noise_centroid = build_noise_centroid()

# Wrap into a DSPy-compatible metric
def noise_metric(example, prediction, trace=None):
    return noise_eval(
        text=prediction.grid,
        noise_centroid=noise_centroid,
        trace=trace,
        threshold=0.50,
        adherence_weight=0.4,
    )

# Plug into MIPROv2
optimizer = dspy.MIPROv2(
    metric=noise_metric,
    auto="light",
    num_threads=4,
)
optimized = optimizer.compile(
    NoiseGridModule(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=2,
)
```
Given this metric, MIPROv2 will search for instructions that produce outputs landing deeper inside the noise manifold. It will discover that shifted keyboard characters score higher than alphanumerics, that the 3×6 structure must be maintained, and that some framings are geometrically more coherent than others — without being told any of this explicitly.