Evaluation is one of the oldest unsolved problems in machine learning. Before you can improve a model, you need to measure it. But measurement at scale is genuinely difficult — and the three classical approaches each have deep failure modes.
Human raters are the gold standard, but gold is expensive. A meaningful human evaluation of a text-to-image model requires dozens of annotators, calibration sessions, inter-rater agreement measurement, and weeks of calendar time. You cannot run this in a training loop. Human evaluation answers questions about past snapshots; it cannot guide an optimizer in real time.
Exact-match and rule-based heuristics scale well but generalize poorly. A pixel-level diff metric catches regression on memorized examples and nothing else. An ASCII art heuristic that counts specific character frequencies will score a line of ^^^^ highly for "mountain range" and miss the point entirely. Every heuristic is a closed-world assumption that breaks outside its calibration domain.
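To make the failure mode concrete, here is a minimal sketch of such a heuristic. The scorer and its character set are invented for illustration, not taken from any real system: it rewards "peak-like" characters, so a meaningless run of carets scores perfectly.

```python
def mountain_heuristic(art: str) -> float:
    """Naive ASCII-art scorer: fraction of non-whitespace characters
    that are 'peak-like'. A hypothetical heuristic, invented here to
    illustrate the closed-world failure mode described above."""
    peak_chars = set("^/\\")
    chars = [c for c in art if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in peak_chars for c in chars) / len(chars)

# A flat line of carets gets a perfect score despite depicting nothing:
print(mountain_heuristic("^^^^"))  # 1.0
# An actual mountain silhouette scores *lower*, because its base
# characters fall outside the heuristic's hand-picked set:
print(mountain_heuristic("  /\\  \n /  \\ \n/____\\"))  # 0.6
```

The heuristic is not wrong within its calibration domain; it simply has no concept of what lies outside it.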
LLM-as-judge is the current fashionable solution, and it is genuinely better than heuristics for many tasks. But it introduces a second large model into your evaluation loop — with its own inference cost, its own latency, its own biases, and its own failure modes. It is also opaque: when the judge disagrees with you, you cannot look inside to understand why. And there is a circularity problem when the judge and the judged share training data.
Gemini Embedding 2 is trained to encode text, images, audio, and video into a single 3072-dimensional vector space. The training objective pushes semantically related content — regardless of modality — toward the same region of the space. This is not a loose metaphor. It is a measurable geometric fact: you can compute the cosine similarity between a text embedding and an image embedding and get a meaningful number.
This is the key structural property that makes cross-modal evaluation possible. The space was not designed with ASCII art evaluation in mind. It was trained on a massive corpus of multimodal data with a general contrastive objective. And yet that objective is sufficient to create a useful evaluator for any semantic task — because semantic meaning is what the space encodes.
Cosine similarity measures the angle between two vectors in a high-dimensional space, ignoring their magnitude. It returns a value between −1 and +1. Two vectors pointing in exactly the same direction score 1.0; two perpendicular vectors score 0.0; two vectors pointing in opposite directions score −1.0.
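The definition above is a one-liner in code. This sketch uses plain Python lists rather than a numerical library, purely to keep the arithmetic visible:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v.
    Magnitude-invariant: scaling either vector leaves the result unchanged."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [2, 0]))   # 1.0  — same direction, length ignored
print(cosine_similarity([1, 0], [0, 3]))   # 0.0  — perpendicular
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 — opposite directions
```

Note that `[1, 0]` and `[2, 0]` score 1.0 despite their different lengths: only direction matters.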
In practice, embeddings for semantically related content tend to point in similar directions. The cosine between "a cat" and a high-quality cat image is noticeably higher than the cosine between "a cat" and a house image — even though all three vectors have roughly unit norm. The signal is in the direction, not the length.
For cross-modal pairs (text vs. image), absolute values tend to be lower than within-modality comparisons. A score of 0.40 for a text-image pair can represent strong alignment; a score of 0.25 can represent near-noise. The absolute thresholds need calibration for each use case. But relative ordering — which of two outputs is better — is robust and interpretable without calibration.
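The calibration-free use of the metric is the relative ranking. A minimal sketch, assuming you have already obtained embedding vectors from some multimodal model (the vectors in the usage comment are toy stand-ins, not real embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def rank_candidates(prompt_vec, candidate_vecs):
    """Return candidate indices ordered best-first by cosine similarity
    to the prompt embedding. The absolute scores would need per-task
    calibration; this ordering does not."""
    scores = [cosine(prompt_vec, v) for v in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 2-D vectors standing in for text/image embeddings:
# candidate 1 points nearly the same way as the prompt, candidate 0 doesn't.
print(rank_candidates([1, 0], [[0, 1], [1, 0.1]]))  # [1, 0]
```

Whether candidate 1 beats candidate 0 is answered directly by the ordering, with no threshold chosen in advance.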
We don't need to tell the metric what "good" looks like. The embedding space already knows. We just need to measure the distance.