Evaluation is one of the oldest unsolved problems in machine learning. Before you can improve a model, you need to measure it. But measurement at scale is genuinely difficult — and the three classical approaches each have deep failure modes.
Human raters are the gold standard, but gold is expensive. A meaningful human evaluation of a text-to-image model requires dozens of annotators, calibration sessions, inter-rater agreement measurement, and weeks of calendar time. You cannot run this in a training loop. Human evaluation answers questions about past snapshots; it cannot guide an optimizer in real time.
Exact-match and rule-based heuristics scale well but generalize poorly. A pixel-level diff metric catches regression on memorized examples and nothing else. An ASCII art heuristic that counts specific character frequencies will score a line of ^^^^ highly for "mountain range" and miss the point entirely. Every heuristic is a closed-world assumption that breaks outside its calibration domain.
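To make the failure mode concrete, here is a minimal sketch of such a heuristic. The scorer and its character set are invented for illustration, not taken from any real system: it rewards "peak-like" characters, so a meaningless run of carets scores perfectly.

```python
def mountain_heuristic(art: str) -> float:
    """Naive ASCII-art scorer: fraction of non-whitespace characters
    that are 'peak-like'. A hypothetical heuristic, invented here to
    illustrate the closed-world failure mode described above."""
    peak_chars = set("^/\\")
    chars = [c for c in art if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in peak_chars for c in chars) / len(chars)

# A flat line of carets gets a perfect score despite depicting nothing:
print(mountain_heuristic("^^^^"))  # 1.0
# An actual mountain silhouette scores *lower*, because its base
# characters fall outside the heuristic's hand-picked set:
print(mountain_heuristic("  /\\  \n /  \\ \n/____\\"))  # 0.6
```

The heuristic is not wrong within its calibration domain; it simply has no concept of what lies outside it.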
LLM-as-judge is the current fashionable solution, and it is genuinely better than heuristics for many tasks. But it introduces a second large model into your evaluation loop — with its own inference cost, its own latency, its own biases, and its own failure modes. It is also opaque: when the judge disagrees with you, you cannot look inside to understand why. And there is a circularity problem when the judge and the judged share training data.
Gemini Embedding 2 is trained to encode text, images, audio, and video into a single 3072-dimensional vector space. The training objective pushes semantically related content — regardless of modality — toward the same region of the space. This is not a loose metaphor. It is a measurable geometric fact: you can compute the cosine similarity between a text embedding and an image embedding and get a meaningful number.
This is the key structural property that makes cross-modal evaluation possible. The space was not designed with ASCII art evaluation in mind. It was trained on a massive corpus of multimodal data with a general contrastive objective. And yet that objective is sufficient to create a useful evaluator for any semantic task — because semantic meaning is what the space encodes.
Cosine similarity measures the angle between two vectors in a high-dimensional space, ignoring their magnitude. It returns a value between −1 and +1. Two vectors pointing in exactly the same direction score 1.0; two perpendicular vectors score 0.0; two vectors pointing in opposite directions score −1.0.
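The definition above is a one-liner in code. This sketch uses plain Python lists rather than a numerical library, purely to keep the arithmetic visible:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v.
    Magnitude-invariant: scaling either vector leaves the result unchanged."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [2, 0]))   # 1.0  — same direction, length ignored
print(cosine_similarity([1, 0], [0, 3]))   # 0.0  — perpendicular
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 — opposite directions
```

Note that `[1, 0]` and `[2, 0]` score 1.0 despite their different lengths: only direction matters.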
In practice, embeddings for semantically related content tend to point in similar directions. The cosine between "a cat" and a high-quality cat image is noticeably higher than the cosine between "a cat" and a house image — even though all three vectors have roughly unit norm. The signal is in the direction, not the length.
For cross-modal pairs (text vs. image), absolute values tend to be lower than within-modality comparisons. A score of 0.40 for a text-image pair can represent strong alignment; a score of 0.25 can represent near-noise. The absolute thresholds need calibration for each use case. But relative ordering — which of two outputs is better — is robust and interpretable without calibration.
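The calibration-free use of the metric is the relative ranking. A minimal sketch, assuming you have already obtained embedding vectors from some multimodal model (the vectors in the usage comment are toy stand-ins, not real embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def rank_candidates(prompt_vec, candidate_vecs):
    """Return candidate indices ordered best-first by cosine similarity
    to the prompt embedding. The absolute scores would need per-task
    calibration; this ordering does not."""
    scores = [cosine(prompt_vec, v) for v in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 2-D vectors standing in for text/image embeddings:
# candidate 1 points nearly the same way as the prompt, candidate 0 doesn't.
print(rank_candidates([1, 0], [[0, 1], [1, 0.1]]))  # [1, 0]
```

Whether candidate 1 beats candidate 0 is answered directly by the ordering, with no threshold chosen in advance.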
We don't need to tell the metric what "good" looks like. The embedding space already knows. We just need to measure the distance.