Deep technical dive — 3×6 ASCII noise grid generation and geometric measurement. Part of ascii-bench.
noise-01 is the simplest possible evaluation in ascii-bench. It asks a model to generate a 3-row × 6-character grid of ASCII noise — 18 characters with no semantic content, no recognizable shapes, no letters or digits. It then measures two things: structural adherence (is the output actually a 3×6 grid of the right character class?) and geometric position (does the output's embedding land inside the noise region defined by the reference corpus?).
The constraint "exactly 18 characters in a 3×6 grid" is not arbitrary. It is small enough that a failure to adhere is clearly a structural problem, not an ambiguity. It is also large enough to have meaningful visual texture when rendered — 18 characters at 48px monospace produce an image with visible density patterns. The embedding model reads those patterns.
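The structural half of the check can be sketched as a simple checklist over rows, row length, and character class. This is an illustrative version only — the helper name `grid_adherence` appears later in the metric code, but the equal weighting of the three checks here is an assumption, not the benchmark's exact implementation:

```python
def grid_adherence(text: str) -> float:
    """Score structural adherence to the 3x6 spec in [0, 1] (illustrative sketch)."""
    lines = text.strip("\n").split("\n")
    checks = [
        len(lines) == 3,                            # exactly 3 rows
        all(len(line) == 6 for line in lines),      # exactly 6 characters per row
        all(not ch.isalnum() for line in lines for ch in line),  # no letters or digits
    ]
    return sum(checks) / len(checks)
```

Under this sketch, a grid with the right shape but alphanumeric content scores 2/3, while output that is not even grid-shaped scores 0.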
No one has measured this before, because before multimodal embeddings there was no obvious way to measure it. What does it mean for a language model to correctly generate randomness? The answer turns out to involve geometry, topology, and writing system identity in ways that character-level metrics completely miss.
The reference corpus defines the noise region. Version 0.1 contains five hand-crafted 3×6 grids, all using shifted keyboard characters (the symbols accessed by holding Shift on a standard QWERTY keyboard). No letters, no digits — only punctuation and operator characters.
The choice of shifted characters is deliberate. These are the characters a model reaches for when asked to produce something with no semantic content in its native Latin environment. They are the model's natural visual vocabulary for "noise" — the characters that live at the boundary between meaningful and meaningless in the training distribution.
The keyboard spiral observation: the first grid (x@#$%^&*()_+!?><{}) traces a path across the keyboard that approximates a golden ratio spiral when read left-to-right, top-to-bottom. This was not designed — it emerged from asking "what arrangement of shifted characters feels most random?" The geometry of the keyboard encodes its own notion of visual density and traversal order.
```python
# The five Latin reference grids (v0.1)

"x@#$%^"   # keyboard spiral — the golden ratio path
"&*()_+"
"!?><{}"

"~`|\;'"   # upper row sweep + punctuation descent
'",.:?!'
"[]{}()"

"!@#$%^"   # bracket / operator cluster
"&*-+=<"
">/?|~`"

"^%$#@!"   # diagonal shift sweep
"*&()_+"
"><{}[]"

"@#$&*("   # dense operator field
")!?><{"
"}|~`\;"
```
Before committing to a single embedding strategy for the metric, we compared three approaches on the same corpus and test cases:

- **Image.** The ASCII grid is rendered to a PNG and embedded as an image. Only visual information is available to the embedding model — it sees the density and spatial arrangement of characters, not their symbolic identity.
- **Text.** The raw ASCII string is embedded as text. The embedding model reads the character sequence symbolically — it knows these are parentheses and dollar signs, not just dense marks.
- **Interleaved.** Both the raw text and the rendered PNG are passed as parts of a single Content object. Gemini Embedding 2 fuses them into one 3072-dim vector, providing visual and symbolic information simultaneously.
| Strategy | Noise-to-noise range | Spread | Noise score | Non-noise score |
|---|---|---|---|---|
| Image | [0.9197, 0.9798] | 0.0601 | 1.000 | 0.000 |
| Text | [0.8703, 0.9609] | 0.0905 | 1.000 | 0.000 |
| Interleaved | [0.8960, 0.9680] | 0.0720 | 1.000 | 0.000 |
All three strategies achieve a noise score of 1.000 and non-noise score of 0.000 — perfect separation between noise grids and non-noise grids (alphabetic, numeric, semantic, wrong-size). The embedding space cleanly partitions the noise region regardless of embedding mode.
The strategies differ in their internal structure. Text has the largest spread (0.0905) — the noise grids are more differentiated from each other symbolically than visually. Interleaved acts as the strictest judge of wrong-size outputs: a grid with the right characters but wrong dimensions gets a lower score from the interleaved strategy than from the image strategy alone. We use interleaved as the default.
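The range and spread figures in the table come from pairwise similarities inside the reference corpus. A minimal sketch of that computation, assuming the corpus vectors are already unit-normalized (so a dot product is a cosine):

```python
from itertools import combinations

import numpy as np


def noise_to_noise(vecs: np.ndarray) -> tuple[float, float, float]:
    """Min, max, and spread of pairwise cosines across the reference corpus."""
    sims = [float(a @ b) for a, b in combinations(vecs, 2)]  # unit vectors: dot = cosine
    return min(sims), max(sims), max(sims) - min(sims)
```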
Corpus v0.2 adds seven grids across four non-Latin script families: Katakana, Cyrillic, Greek, and Arabic. This tests whether noise is a universal concept in the embedding manifold, or anchored to the Latin keyboard.
A 2D PCA projection of the 12 corpus vectors reveals five completely distinct clusters — one per script family. Latin keyboard symbols: top-right, tight cluster. Cyrillic: top-left, tight pair. Greek: near Cyrillic but distinct. Arabic: bottom-left-center. Katakana: bottom-left corner, most distant from Latin. Noise is not a single universal region. Each writing system occupies a distinct sub-region of the noise manifold.
The PCA projection explains 61.7% of variance in just two dimensions — a high fraction for 3072-dimensional vectors, indicating that script family identity is the dominant source of variation in the corpus. The remaining 38.3% reflects within-script variation between individual grid designs.
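The projection itself needs nothing beyond an SVD on the centered vectors. A minimal numpy-only sketch (the function name `pca_2d` is ours, not the benchmark's):

```python
import numpy as np


def pca_2d(vectors: np.ndarray) -> tuple[np.ndarray, float]:
    """Project row vectors onto their top two principal components.

    Returns the 2D coordinates and the fraction of variance they explain.
    """
    X = vectors - vectors.mean(axis=0)      # center the point cloud
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ Vt[:2].T                   # coordinates along the top-2 axes
    explained = float((S[:2] ** 2).sum() / (S ** 2).sum())
    return coords, explained
```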
This finding has a concrete calibration implication: a unified centroid is a Latin-dominated centroid. With 5 Latin grids out of 12, the centroid is pulled toward the Latin cluster. Scores for Latin grids are systematically higher, and scores for non-Latin grids systematically lower, regardless of geometric quality relative to their own script cluster. Per-script centroids would be more accurate.
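Per-script calibration is a small change: group the corpus vectors by script family before averaging, and score each output against the centroid of its own family. A sketch, where the script labels are an assumption about how the corpus is tagged:

```python
import numpy as np


def per_script_centroids(vecs: np.ndarray, scripts: list[str]) -> dict[str, np.ndarray]:
    """One unit-norm centroid per script family instead of a single Latin-dominated one."""
    centroids = {}
    for script in set(scripts):
        group = vecs[[i for i, s in enumerate(scripts) if s == script]]
        c = group.mean(axis=0)
        centroids[script] = c / np.linalg.norm(c)  # re-normalize the mean
    return centroids
```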
GPT-4o was asked to generate 3×6 noise grids under 10 different framings: six Latin-oriented and four multilingual. Each framing shapes what the model understands as "noise" — and those differences are geometrically measurable.
The most striking result: LLM outputs land near their respective script reference clusters. This is behavioral introspection without self-report — we learn about the model's relationship to writing systems by measuring where its outputs live geometrically, without asking the model anything about itself.
Notable: the "keyboard self" framing drifts far from the noise cluster on PCA. When asked to generate what it "sees" when looking at a keyboard, the model produces a conceptual map of the keyboard rather than noise — a mix of characters that is coherent but not random. The geometry distinguishes between "what keyboards contain" and "what noise looks like."
Scores against the unified corpus centroid:
| Source | Script | Score range | Note |
|---|---|---|---|
| Reference corpus | Latin | 0.87–0.93 | Centroid-dominant script, highest scores |
| Reference corpus | Cyrillic | 0.83–0.85 | Visually similar to Latin |
| Reference corpus | Greek | 0.79 | Alphabetic but visually distinct |
| Reference corpus | Arabic | 0.75 | Visually distinct, right-to-left |
| Reference corpus | Katakana | 0.73–0.75 | Most distant from Latin |
| LLM (keyboard native) | Latin | 0.93 | Native noise vocabulary, highest LLM score |
| LLM (plain ask) | Latin | 0.88 | Solid default |
| LLM (visual static) | Latin | 0.71 | Framing pulls toward peripheral noise |
| LLM (keyboard self) | Latin | <0.71 | Drifts far from noise region on PCA |
| LLM (cyrillic) | Cyrillic | 0.80 | Near reference cluster |
| LLM (greek) | Greek | 0.81 | Near reference cluster |
| LLM (katakana) | Katakana | 0.66 | Below reference; centroid bias applies |
| LLM (arabic) | Arabic | 0.65 | Below reference; centroid bias applies |
All non-Latin scores are depressed by the Latin-dominated unified centroid. This is not a failure of those outputs — it is a calibration artifact. Per-script centroids would give fairer absolute scores. The PCA projection is the better diagnostic for non-Latin outputs.
```python
import numpy as np

# Observed noise-to-noise cosine range (interleaved strategy, Latin corpus)
noise_floor = 0.896
noise_ceil = 0.968


def noise_eval(
    text: str,
    noise_centroid: np.ndarray,
    trace=None,
    threshold: float = 0.50,
    adherence_weight: float = 0.4,
) -> float | bool:
    """
    Combined noise metric for DSPy.

    Args:
        text: The ASCII grid output to evaluate.
        noise_centroid: Pre-computed centroid of the noise corpus embeddings.
            Normalized unit vector in 3072-dim space.
        trace: DSPy trace object.
            None     = evaluation mode → returns float in [0, 1]
            not None = bootstrapping   → returns bool (score >= threshold)
        threshold: Bootstrap pass threshold. Default 0.50.
        adherence_weight: Weight of structural score vs. geometric score.
            0.4 = 40% structural, 60% geometric. Default 0.4.
    """
    # Step 1: structural check (no API cost)
    adherence = grid_adherence(text)

    # Step 2: geometric check (only if structurally non-zero)
    if adherence == 0.0:
        score = 0.0
    else:
        rendered = render(text)                       # ASCII grid → PNG
        pred_vec = embed_interleaved(text, rendered)  # text + PNG → one vector
        similarity = cosine(pred_vec, noise_centroid)

        # Normalize raw cosine into [0, 1] using the observed noise-to-noise range
        similarity_norm = float(np.clip(
            (similarity - noise_floor) / (noise_ceil - noise_floor), 0.0, 1.0
        ))
        score = (adherence_weight * adherence
                 + (1 - adherence_weight) * similarity_norm)

    # Step 3: dual-mode return (matches DSPy metric contract)
    return score if trace is None else score >= threshold
```
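The centroid the metric consumes can be built once from the reference grids. A sketch, with `embed_fn` standing in for the interleaved embedding call (the real `build_noise_centroid` presumably loads the corpus itself rather than taking arguments):

```python
import numpy as np


def build_noise_centroid(grids: list[str], embed_fn) -> np.ndarray:
    """Unit-normalized mean of the corpus embeddings (one embedding call per grid)."""
    vecs = np.stack([embed_fn(g) for g in grids])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize each vector
    centroid = vecs.mean(axis=0)
    return centroid / np.linalg.norm(centroid)                 # re-normalize the mean
```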
The metric wraps into a DSPy-compatible function and plugs directly into MIPROv2. The centroid is pre-computed once at startup to avoid redundant API calls during the optimization loop:
```python
import dspy

# Pre-compute the centroid once at startup
noise_centroid = build_noise_centroid()

# Wrap into a DSPy-compatible metric
def noise_metric(example, prediction, trace=None):
    return noise_eval(
        text=prediction.grid,
        noise_centroid=noise_centroid,
        trace=trace,
        threshold=0.50,
        adherence_weight=0.4,
    )

# Plug into MIPROv2
optimizer = dspy.MIPROv2(
    metric=noise_metric,
    auto="light",
    num_threads=4,
)
optimized = optimizer.compile(
    NoiseGridModule(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=2,
)
```
Given this metric, MIPROv2 will search for instructions that produce outputs landing deeper inside the noise manifold. It will discover that shifted keyboard characters score higher than alphanumerics, that the 3×6 structure must be maintained, and that some framings are geometrically more coherent than others — without being told any of this explicitly.