DSPy is a Python framework for building and optimizing language model programs. The core insight behind DSPy is that prompts should be derived, not hand-written. Instead of spending hours crafting system prompts, few-shot examples, and output format instructions, you declare what you want — the input fields, the output fields, and the evaluation metric — and DSPy's optimizers find the best prompt automatically.
This matters because prompt engineering is brittle. A prompt that works well on GPT-4o may perform worse on Gemini. A prompt tuned for 10 examples may generalize poorly to 100. DSPy treats prompts as learned parameters, not fixed strings, and optimizes them against a metric over a training set. The result is programs that are more robust, more reproducible, and often surprisingly effective — because the optimizer explores a much larger space of instructions than any human would think to try.
| Primitive | What it is | In our context |
|---|---|---|
| `dspy.Example` | A labeled training example with named fields | A story excerpt paired with a reference ASCII art |
| `dspy.Prediction` | The output produced by a module for one input | The ASCII art string generated by the LM |
| `dspy.Signature` | A typed interface: input fields → output fields, with descriptions | `story_excerpt → ascii_art`, plus a task description |
| `dspy.Predict` / `dspy.Module` | A callable that maps inputs to outputs using an LM | The forward pass that calls the LM with the current prompt |
| Metric function | A callable `(example, prediction, trace)` returning a float or bool | Our embedding cosine similarity score |
A `dspy.Example` holds a dictionary of named fields. The `.with_inputs()` call marks which fields are inputs (fed to the model) and which are labels (used by the metric for comparison).
```python
import dspy

# One training example: a story excerpt that should produce an ASCII cat.
example = dspy.Example(
    story_excerpt="The old tabby curled into a perfect circle on the windowsill.",
    ascii_art=(
        " /\\_/\\\n"
        "( o.o )\n"
        " > ^ <\n"
    ),
).with_inputs("story_excerpt")

# .with_inputs("story_excerpt") means:
# - story_excerpt is an INPUT (passed to module.forward)
# - ascii_art is a LABEL (available to the metric as example.ascii_art)

trainset = [example]  # add more examples here
```
DSPy metric functions receive three arguments: the gold example, the prediction, and an optional trace (populated by DSPy during optimization). The function returns a float when evaluating a program and a bool when bootstrapping demonstrations, where it acts as a pass/fail filter.
```python
import io
import os

import numpy as np
from PIL import Image, ImageDraw, ImageFont
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ['GEMINI_API_KEY'])
MODEL = 'gemini-embedding-2-preview'
FONT = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf', 20)
def render(ascii_text: str) -> bytes:
"""Rasterize ASCII art to PNG bytes."""
    lines = ascii_text.splitlines() or [' ']  # guard: an empty output would break max() below
test_draw = ImageDraw.Draw(Image.new('RGB', (1, 1)))
lh = FONT.getbbox('A')[3] + 4
w = int(max(test_draw.textlength(l, font=FONT) for l in lines)) + 32
h = lh * len(lines) + 32
img = Image.new('RGB', (w, h), (255, 255, 255))
draw = ImageDraw.Draw(img)
y = 16
for line in lines:
draw.text((16, y), line, fill=(0, 0, 0), font=FONT)
y += lh
buf = io.BytesIO()
img.save(buf, format='PNG')
return buf.getvalue()
def embed_text(text: str) -> np.ndarray:
r = client.models.embed_content(
model=MODEL, contents=[text],
config=types.EmbedContentConfig(task_type='SEMANTIC_SIMILARITY'))
return np.array(r.embeddings[0].values, dtype=np.float32)
def embed_image(png_bytes: bytes) -> np.ndarray:
r = client.models.embed_content(
model=MODEL,
contents=[types.Part.from_bytes(data=png_bytes, mime_type='image/png')])
return np.array(r.embeddings[0].values, dtype=np.float32)
def cosine(a, b):
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# ── The DSPy metric ──────────────────────────────────────────────────
THRESHOLD = 0.35 # adjust after calibration
def embedding_metric(example, prediction, trace=None):
    """
    Cross-modal cosine similarity between the story description
    and the rendered PNG of the predicted ASCII art.

    - During evaluation (trace is None):
        Returns a float in [0, 1] representing alignment quality.
    - During bootstrapping (trace is not None):
        Returns True if the score clears THRESHOLD, else False.
        DSPy uses this to filter good few-shot demos.
    """
description = example.story_excerpt # the text prompt
ascii_art = prediction.ascii_art # the generated output
text_vec = embed_text(description)
image_vec = embed_image(render(ascii_art))
score = cosine(text_vec, image_vec)
# Normalize to [0, 1] from typical cross-modal range [0.1, 0.5]
score_norm = max(0.0, min(1.0, (score - 0.1) / 0.4))
if trace is not None:
return score_norm >= THRESHOLD # bool for bootstrapping
    return score_norm  # float for evaluation
```
The trace parameter controls dual-mode behavior. When DSPy is bootstrapping few-shot demonstrations, it passes a trace object and expects a bool — "is this demo good enough to include?" When it is evaluating the full program, trace is None and it expects a float for ranking. The same function handles both cases.
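Before handing the metric to an optimizer, it is worth exercising both paths directly. A minimal sketch, reusing the `example` built earlier; the `dspy.Prediction` here is constructed by hand rather than produced by an LM:

```python
# Hand-built prediction, purely to exercise the metric in both modes.
pred = dspy.Prediction(ascii_art=' /\\_/\\\n( o.o )\n > ^ <\n')

print(embedding_metric(example, pred))            # evaluation mode: float in [0, 1]
print(embedding_metric(example, pred, trace=[]))  # bootstrap mode: True/False
```

Scoring a handful of reference arts through the float path is also the quickest way to calibrate THRESHOLD before compiling.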
```python
import os

import dspy
# ── 1. Configure the LM ─────────────────────────────────────────────
lm = dspy.LM('gemini/gemini-3.1-pro-preview', api_key=os.environ['GEMINI_API_KEY'])
dspy.configure(lm=lm)
# ── 2. Define the Signature ──────────────────────────────────────────
class AsciiArtSignature(dspy.Signature):
"""Convert a story excerpt into expressive ASCII art that captures
the subject described in the text."""
story_excerpt: str = dspy.InputField(
desc='A short passage of prose describing a scene or character.'
)
ascii_art: str = dspy.OutputField(
desc='ASCII art (3-8 lines, plain text characters only) that visually '
'represents the main subject of the excerpt.'
)
# ── 3. Define the Module ─────────────────────────────────────────────
class AsciiArtModule(dspy.Module):
def __init__(self):
super().__init__()
self.generate = dspy.Predict(AsciiArtSignature)
def forward(self, story_excerpt: str) -> dspy.Prediction:
# dspy.Predict calls the LM with the current prompt + any
# few-shot demos that have been compiled in.
return self.generate(story_excerpt=story_excerpt)
# ── 4. Build trainset ────────────────────────────────────────────────
trainset = [
dspy.Example(
story_excerpt='The old tabby curled on the windowsill, eyes half-closed.',
        ascii_art=' /\\_/\\\n ( o.o )\n > ^ <\n',
).with_inputs('story_excerpt'),
dspy.Example(
story_excerpt='A hawk circled high above the valley, wings spread wide.',
ascii_art=' __\n / \\\n \\____/\n \\ /\n \\/\n',
).with_inputs('story_excerpt'),
# ... add more for better optimization
]
# ── 5. Set up MIPROv2 ────────────────────────────────────────────────
optimizer = dspy.MIPROv2(
metric=embedding_metric, # our cross-modal cosine metric
auto='light', # light = fewer trials, good for prototyping
num_threads=4,
)
# ── 6. Compile ───────────────────────────────────────────────────────
module = AsciiArtModule()
optimized = optimizer.compile(
module,
trainset=trainset,
max_bootstrapped_demos=3, # include up to 3 few-shot examples
max_labeled_demos=2, # seed with up to 2 labeled demos
)
# ── 7. Use the optimized module ──────────────────────────────────────
result = optimized(story_excerpt='A wolf howled at the full moon.')
print(result.ascii_art)
```
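Before looking at how MIPROv2 works, one practical note: compilation is the expensive step, so persist the result rather than re-optimizing every run. A minimal sketch using DSPy's module save/load; the filename is illustrative:

```python
# Persist the tuned instruction and bootstrapped demos to disk.
optimized.save('ascii_art_optimized.json')

# Later, in a fresh process: rebuild the module and restore its state.
restored = AsciiArtModule()
restored.load('ascii_art_optimized.json')
```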
MIPROv2 (Multiprompt Instruction Proposal Optimizer, version 2) operates in three stages:

1. Bootstrap demonstrations: run the unoptimized module over the trainset and keep the input/output traces the metric accepts; this is where the metric's bool path is exercised.
2. Propose instructions: prompt an LM to draft candidate instructions, grounded in the signature, the bootstrapped demos, and summaries of the training data.
3. Search: run Bayesian optimization over combinations of instructions and demos, scoring candidate programs with the metric on minibatches of the trainset.
MIPROv2 will discover prompt instructions that the embedding metric rewards. Because the metric measures how well the rendered ASCII art aligns with the text description in Gemini's embedding space, the optimizer will converge on instructions that produce outputs the embedding model recognizes as matching the text — purely from geometric feedback.
In practice, this likely means instructions like: "use dense characters for dark or solid regions", "use whitespace to preserve shape boundaries", "match the general outline of the described subject". The optimizer will not know this is what it found; it only knows that certain instructions produce higher cosine scores. But the geometric signal is rich enough to guide it there.
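You can check what the optimizer actually settled on by inspecting the compiled predictor. A minimal sketch; the attribute layout follows current DSPy conventions and may shift between versions:

```python
# The tuned instruction text lives on the predictor's signature.
print(optimized.generate.signature.instructions)

# The few-shot demos that survived the metric's bootstrap filter.
for demo in optimized.generate.demos:
    print(demo.story_excerpt)
    print(demo.ascii_art)
```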
This is the deeper value of the paradigm: the embedding space serves as an implicit specification of quality. You do not need to encode domain knowledge into your metric. The model's pretraining has already encoded it. You just need to measure against it.
As you expand the trainset to new subjects, the metric generalizes automatically, because the embedding space generalizes automatically. A metric that has only ever scored cats and hawks will evaluate ASCII art of wolves and castles just as well, without any additional labeling, because the embedding model already understands what wolves and castles look like.
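One way to test that claim is to score the compiled program on held-out subjects with `dspy.Evaluate`. A minimal sketch; the devset excerpts are made up for illustration, and held-out examples need only an excerpt, since the metric never reads an `ascii_art` label:

```python
# Held-out subjects the optimizer never saw (excerpts are illustrative).
devset = [
    dspy.Example(
        story_excerpt='A grey wolf stood on the ridge, muzzle raised to the moon.',
    ).with_inputs('story_excerpt'),
    dspy.Example(
        story_excerpt='The castle rose out of the mist, towers flanking its gate.',
    ).with_inputs('story_excerpt'),
]

evaluate = dspy.Evaluate(devset=devset, metric=embedding_metric,
                         num_threads=4, display_progress=True)
evaluate(optimized)  # reports the average normalized cosine score
```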