A new paradigm for automatic model evaluation using multimodal embeddings — no human labels, no rubrics, no LLM judges. Now includes ascii-bench, the first benchmark built on this paradigm.
For decades, evaluating model outputs required one of three things: expensive human raters, brittle rule-based heuristics, or — more recently — a second large model acting as judge. Each approach has a fundamental problem. Human raters are slow, costly, and inconsistent. Heuristics break the moment you leave their narrow domain. LLM judges are expensive, opaque, and introduce their own biases and failure modes.
Multimodal embedding models offer a different path. Google's Gemini Embedding 2 encodes text, images, audio, and video into a single geometric space — a high-dimensional manifold where semantic meaning determines position. Things that mean the same thing land near each other, regardless of what modality they came from. A sentence and a photograph of the same subject end up close together.
This gives us a new primitive: cosine similarity as semantic distance across modalities. Instead of asking a human or an LLM "is this ASCII art a good cat?", we can ask the geometry directly: "is the position of this rendered PNG close to the position of the text 'a cat'?" The embedding space does not need to be told what "good" looks like. It already encodes it.
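The primitive itself is nothing exotic: the cosine of the angle between two embedding vectors. A minimal NumPy sketch — the vectors here are toy stand-ins, not real Gemini embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embeddings" standing in for real multimodal vectors.
text_vec  = np.array([0.9, 0.1, 0.0, 0.2])   # text: "a cat"
image_vec = np.array([0.8, 0.2, 0.1, 0.3])   # PNG render of cat ASCII art
noise_vec = np.array([0.1, 0.9, 0.8, 0.0])   # render of random noise

# The image that "means" the same thing scores much higher than noise.
assert cosine_similarity(text_vec, image_vec) > cosine_similarity(text_vec, noise_vec)
```

In a real pipeline the vectors come from the embedding model and typically have hundreds or thousands of dimensions; the similarity computation is unchanged.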
We embedded the text description 'a cat' and three PNG renderings of ASCII art — one recognizable cat shape, one plausible ASCII art of a different subject (a house), and random noise — into Gemini Embedding 2's shared space. The cosine similarity between the text embedding and each image embedding followed the expected ordering: good (cat) > bad (house) > noise. The metric works without any labels.
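The experiment above reduces to a ranking loop. In this sketch, `embed` is a stand-in stub for a real multimodal embedding call (in practice, an API request that returns a vector for a text string or a PNG), with toy vectors chosen only to illustrate the shape of the pipeline:

```python
import numpy as np

def embed(item: str) -> np.ndarray:
    """Stand-in for a real multimodal embedding call. In practice this would
    send text or a PNG to an embedding API; here it returns fixed toy vectors."""
    toy_space = {
        "text: a cat":       np.array([1.0, 0.0, 0.1]),
        "png: cat ascii":    np.array([0.9, 0.1, 0.2]),
        "png: house ascii":  np.array([0.4, 0.8, 0.1]),
        "png: random noise": np.array([0.0, 0.2, 1.0]),
    }
    return toy_space[item]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = embed("text: a cat")
candidates = ["png: cat ascii", "png: house ascii", "png: random noise"]

# Rank renders by semantic closeness to the text description.
ranked = sorted(candidates, key=lambda c: cosine(reference, embed(c)), reverse=True)
print(ranked)  # → ['png: cat ascii', 'png: house ascii', 'png: random noise']
```

Note that no label ever enters the loop: the ordering falls out of geometry alone.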
ascii-bench is the first benchmark built directly on this paradigm. Rather than evaluating semantic fidelity to a prompt, it asks a more fundamental question: can a model produce output that lands in the correct geometric region of the embedding space? The first evaluation, noise-01, measures 3×6 ASCII noise grid generation across Latin and multilingual scripts.
Why existing evaluation approaches fail, and how a unified embedding manifold solves it.
The exact pipeline, code, test cases, and results that validate the cross-modal metric.
How to use this metric as a DSPy training objective and let MIPROv2 optimize prompts automatically.
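Because the metric is just a scalar-valued function of an example and a prediction, it drops into DSPy's standard metric signature, `metric(example, pred, trace=None)`. A hedged sketch — the `embedding` attribute is an assumption about how vectors are carried on the objects, and a real pipeline would first render the predicted ASCII art to a PNG and embed it:

```python
from types import SimpleNamespace
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_metric(example, pred, trace=None) -> float:
    """Follows DSPy's metric signature (example, pred, trace) -> float.
    Assumes both objects carry a precomputed embedding vector; in practice
    pred's output would be rendered and embedded before scoring."""
    return cosine(np.asarray(example.embedding), np.asarray(pred.embedding))

# Minimal demo with stand-in objects instead of DSPy Example/Prediction.
example = SimpleNamespace(embedding=[0.9, 0.1, 0.2])
pred = SimpleNamespace(embedding=[1.0, 0.0, 0.1])
print(embedding_metric(example, pred))
```

A metric of this shape is exactly what optimizers like MIPROv2 maximize when searching over prompts.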
The benchmark overview: evaluation structure, scoring framework, and planned evaluations.
Full technical analysis of the noise grid evaluation: corpus design, embedding strategies, PCA clustering, and LLM generation scores.