Geometry as Judge

A new paradigm for automatic model evaluation using multimodal embeddings — no human labels, no rubrics, no LLM judges. Now includes ascii-bench, the first benchmark built on this paradigm.

Proof of concept: cosine similarity between 'a cat' and three ASCII art renderings. The good cat shape scores highest.

The Core Idea

For decades, evaluating model outputs required one of three things: expensive human raters, brittle rule-based heuristics, or — more recently — a second large model acting as judge. Each approach has a fundamental problem. Human raters are slow, costly, and inconsistent. Heuristics break the moment you leave their narrow domain. LLM judges are expensive, opaque, and introduce their own biases and failure modes.

Multimodal embedding models offer a different path. Google's Gemini Embedding 2 encodes text, images, audio, and video into a single geometric space — a high-dimensional manifold where semantic meaning determines position. Inputs that mean the same thing land near each other, regardless of which modality they came from. A sentence and a photograph of the same subject end up close together.

This gives us a new primitive: cosine similarity as semantic distance across modalities. Instead of asking a human or an LLM "is this ASCII art a good cat?", we can ask the geometry directly: "is the position of this rendered PNG close to the position of the text 'a cat'?" The embedding space does not need to be told what 'good' looks like. It already encodes it.
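The primitive itself is a few lines over two vectors. A minimal sketch in plain Python — the toy 4-dimensional vectors stand in for real Gemini embeddings, which have thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 means
    identical direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative stand-ins for real embeddings:
text_vec  = [0.9, 0.1, 0.0, 0.2]   # embedding of the text 'a cat'
image_vec = [0.8, 0.2, 0.1, 0.3]   # embedding of a rendered PNG

score = cosine_similarity(text_vec, image_vec)
```

Because both vectors live in the same shared space, the same function works whether the inputs started as text, a PNG, or an audio clip.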

What We Proved

We embedded the text description 'a cat' and three PNG renderings of ASCII art — one recognizable cat shape, one plausible ASCII art of a different subject (a house), and random noise — into Gemini Embedding 2's shared space. The cosine similarity between the text embedding and each image embedding followed the expected ordering: good > bad > noise. The metric works without any labels.
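The experiment reduces to a three-way ranking against a single text anchor. A minimal sketch of that check — the filenames and vector values below are illustrative stand-ins for the real Gemini Embedding 2 outputs:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical pre-computed embeddings standing in for real API calls.
reference = [0.9, 0.3, 0.1]              # text anchor: 'a cat'
candidates = {
    "good_cat.png": [0.85, 0.35, 0.15],  # recognizable cat shape
    "house.png":    [0.40, 0.80, 0.20],  # plausible art, wrong subject
    "noise.png":    [0.10, 0.10, 0.95],  # random characters
}

# Rank candidates by similarity to the text anchor, best first.
ranked = sorted(candidates,
                key=lambda k: cosine(reference, candidates[k]),
                reverse=True)
# Expected ordering in the real experiment: good > bad > noise
```

No labels or rubric appear anywhere in the check: the ordering falls out of the geometry alone.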

ascii-bench: Measuring What Models Know About Noise

ascii-bench is the first benchmark built directly on this paradigm. Rather than evaluating semantic fidelity to a prompt, it asks a more fundamental question: can a model produce output that lands in the correct geometric region of the embedding space? The first evaluation, noise-01, measures 3×6 ASCII noise grid generation across Latin and multilingual scripts.

ascii-bench noise-01: Full PCA projection of the noise manifold. Reference corpus (circles) and GPT-4o LLM outputs (triangles) plotted by script family. Five distinct clusters — noise has topology.
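A projection like the one above can be sketched without heavy dependencies. The pure-Python power-iteration PCA below is a stand-in for the real pipeline (which would more likely use something like `sklearn.decomposition.PCA` on the actual embedding matrix); the input points are toy vectors, not real embeddings:

```python
import random

random.seed(0)  # deterministic start vectors for the sketch

def pca_2d(points):
    """Project d-dimensional points onto their top two principal
    components, found by power iteration on the covariance operator."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    x = [[p[j] - means[j] for j in range(d)] for p in points]  # centered

    def cov_apply(v):
        # Apply (X^T X) v without materializing the d x d covariance matrix.
        proj = [sum(row[j] * v[j] for j in range(d)) for row in x]
        return [sum(proj[i] * x[i][j] for i in range(n)) for j in range(d)]

    def top_component(orthogonal_to=None):
        v = [random.random() + 0.1 for _ in range(d)]
        for _ in range(200):
            if orthogonal_to is not None:
                # Deflate: remove the already-found component each step.
                dot = sum(a * b for a, b in zip(v, orthogonal_to))
                v = [a - dot * b for a, b in zip(v, orthogonal_to)]
            w = cov_apply(v)
            norm = sum(c * c for c in w) ** 0.5 or 1.0
            v = [c / norm for c in w]
        return v

    pc1 = top_component()
    pc2 = top_component(orthogonal_to=pc1)
    return [(sum(row[j] * pc1[j] for j in range(d)),
             sum(row[j] * pc2[j] for j in range(d))) for row in x]
```

Clusters that are separated in the full embedding space — like the script families in noise-01 — stay separated along the first component of a projection like this.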

Explore the Documentation

Part 1

The Problem and the Space

Why existing evaluation approaches fail, and how a unified embedding manifold solves it.

Part 2

The Experiment

The exact pipeline, code, test cases, and results that validate the cross-modal metric.

Part 3

Wiring it into DSPy

How to use this metric as a DSPy training objective and let MIPROv2 optimize prompts automatically.

Benchmark

ascii-bench

The benchmark overview: evaluation structure, scoring framework, and planned evaluations.

Evaluation 1

noise-01 Deep Dive

Full technical analysis of the noise grid evaluation: corpus design, embedding strategies, PCA clustering, and LLM generation scores.

The Stack