A new paradigm for automatic model evaluation using multimodal embeddings — no human labels, no rubrics, no LLM judges. Now includes ascii-bench, the first benchmark built on this paradigm.
For decades, evaluating model outputs required one of three things: expensive human raters, brittle rule-based heuristics, or — more recently — a second large model acting as judge. Each approach has a fundamental problem. Human raters are slow, costly, and inconsistent. Heuristics break the moment you leave their narrow domain. LLM judges are expensive, opaque, and introduce their own biases and failure modes.
Multimodal embedding models offer a different path. Google's Gemini Embedding 2 encodes text, images, audio, and video into a single geometric space — a high-dimensional manifold where semantic meaning determines position. Things that mean the same thing land near each other, regardless of what modality they came from. A sentence and a photograph of the same subject end up close together.
This gives us a new primitive: cosine similarity as semantic distance across modalities. Instead of asking a human or an LLM "is this ASCII art a good cat?", we can ask the geometry directly: "is the position of this rendered PNG close to the position of the text 'a cat'?" The embedding space does not need to be told what "good" looks like. It already encodes it.
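The primitive itself is nothing exotic: the cosine of the angle between two embedding vectors. A minimal NumPy sketch — the vectors here are toy stand-ins, not real Gemini embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embeddings" standing in for real multimodal vectors.
text_vec  = np.array([0.9, 0.1, 0.0, 0.2])   # text: "a cat"
image_vec = np.array([0.8, 0.2, 0.1, 0.3])   # PNG render of cat ASCII art
noise_vec = np.array([0.1, 0.9, 0.8, 0.0])   # render of random noise

# The image that "means" the same thing scores much higher than noise.
assert cosine_similarity(text_vec, image_vec) > cosine_similarity(text_vec, noise_vec)
```

In a real pipeline the vectors come from the embedding model and typically have hundreds or thousands of dimensions; the similarity computation is unchanged.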
We embedded the text description 'a cat' and three PNG renderings of ASCII art — one recognizable cat shape, one plausible ASCII art of a different subject (a house), and random noise — into Gemini Embedding 2's shared space. The cosine similarity between the text embedding and each image embedding followed the expected ordering: good (cat) > bad (house) > noise. The metric works without any labels.
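The experiment above reduces to a ranking loop. In this sketch, `embed` is a stand-in stub for a real multimodal embedding call (in practice, an API request that returns a vector for a text string or a PNG), with toy vectors chosen only to illustrate the shape of the pipeline:

```python
import numpy as np

def embed(item: str) -> np.ndarray:
    """Stand-in for a real multimodal embedding call. In practice this would
    send text or a PNG to an embedding API; here it returns fixed toy vectors."""
    toy_space = {
        "text: a cat":       np.array([1.0, 0.0, 0.1]),
        "png: cat ascii":    np.array([0.9, 0.1, 0.2]),
        "png: house ascii":  np.array([0.4, 0.8, 0.1]),
        "png: random noise": np.array([0.0, 0.2, 1.0]),
    }
    return toy_space[item]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = embed("text: a cat")
candidates = ["png: cat ascii", "png: house ascii", "png: random noise"]

# Rank renders by semantic closeness to the text description.
ranked = sorted(candidates, key=lambda c: cosine(reference, embed(c)), reverse=True)
print(ranked)  # → ['png: cat ascii', 'png: house ascii', 'png: random noise']
```

Note that no label ever enters the loop: the ordering falls out of geometry alone.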
ascii-bench is the first benchmark built directly on this paradigm. Rather than evaluating semantic fidelity to a prompt, it asks a more fundamental question: can a model produce output that lands in the correct geometric region of the embedding space? The first evaluation, noise-01, measures 3×6 ASCII noise grid generation across Latin and multilingual scripts.
Why existing evaluation approaches fail, and how a unified embedding manifold solves it.
The exact pipeline, code, test cases, and results that validate the cross-modal metric.
How to use this metric as a DSPy training objective and let MIPROv2 optimize prompts automatically.
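Because the metric is just a scalar-valued function of an example and a prediction, it drops into DSPy's standard metric signature, `metric(example, pred, trace=None)`. A hedged sketch — the `embedding` attribute is an assumption about how vectors are carried on the objects, and a real pipeline would first render the predicted ASCII art to a PNG and embed it:

```python
from types import SimpleNamespace
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_metric(example, pred, trace=None) -> float:
    """Follows DSPy's metric signature (example, pred, trace) -> float.
    Assumes both objects carry a precomputed embedding vector; in practice
    pred's output would be rendered and embedded before scoring."""
    return cosine(np.asarray(example.embedding), np.asarray(pred.embedding))

# Minimal demo with stand-in objects instead of DSPy Example/Prediction.
example = SimpleNamespace(embedding=[0.9, 0.1, 0.2])
pred = SimpleNamespace(embedding=[1.0, 0.0, 0.1])
print(embedding_metric(example, pred))
```

A metric of this shape is exactly what optimizers like MIPROv2 maximize when searching over prompts.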
The benchmark overview: evaluation structure, scoring framework, and planned evaluations.
Full technical analysis of the noise grid evaluation: corpus design, embedding strategies, PCA clustering, and LLM generation scores.