DSPy is a Python framework for building and optimizing language model programs. The core insight behind DSPy is that prompts should be derived, not hand-written. Instead of spending hours crafting system prompts, few-shot examples, and output format instructions, you declare what you want — the input fields, the output fields, and the evaluation metric — and DSPy's optimizers find the best prompt automatically.
This matters because prompt engineering is brittle. A prompt that works well on GPT-4o may perform worse on Gemini. A prompt tuned for 10 examples may generalize poorly to 100. DSPy treats prompts as learned parameters, not fixed strings, and optimizes them against a metric over a training set. The result is programs that are more robust, more reproducible, and often surprisingly effective — because the optimizer explores a much larger space of instructions than any human would think to try.
| Primitive | What it is | In our context |
|---|---|---|
| `dspy.Example` | A labeled training example with named fields | A story excerpt paired with a reference ASCII art |
| `dspy.Prediction` | The output produced by a module for one input | The ASCII art string generated by the LM |
| `dspy.Signature` | A typed interface: input fields → output fields, with descriptions | `story_excerpt → ascii_art`, plus a task description |
| `dspy.Predict` / `dspy.Module` | A callable that maps inputs to outputs using an LM | The forward pass that calls the LM with the current prompt |
| Metric function | A callable `(example, prediction, trace)` returning a float or bool | Our embedding cosine similarity score |
A `dspy.Example` holds a dictionary of named fields. The `.with_inputs()` call marks which fields are inputs (fed to the model) and which are labels (used by the metric for comparison).
```python
import dspy

# One training example: a story excerpt that should produce an ASCII cat.
example = dspy.Example(
    story_excerpt="The old tabby curled into a perfect circle on the windowsill.",
    ascii_art=(
        " /\\_/\\\n"
        "( o.o )\n"
        " > ^ <\n"
    ),
).with_inputs("story_excerpt")

# .with_inputs("story_excerpt") means:
# - story_excerpt is an INPUT (passed to module.forward)
# - ascii_art is a LABEL (available to the metric as example.ascii_art)

trainset = [example]  # add more examples here
```
DSPy metric functions receive three arguments: the gold example, the prediction, and an optional trace (populated by DSPy during optimization). The function returns a float when evaluating a program and a bool when bootstrapping demonstrations, where it acts as a pass/fail filter.
```python
import io
import os

import numpy as np
from PIL import Image, ImageDraw, ImageFont
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ['GEMINI_API_KEY'])
MODEL = 'gemini-embedding-2-preview'
FONT = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf', 20)
def render(ascii_text: str) -> bytes:
"""Rasterize ASCII art to PNG bytes."""
    lines = ascii_text.splitlines() or [' ']  # guard: an empty output would break max() below
test_draw = ImageDraw.Draw(Image.new('RGB', (1, 1)))
lh = FONT.getbbox('A')[3] + 4
w = int(max(test_draw.textlength(l, font=FONT) for l in lines)) + 32
h = lh * len(lines) + 32
img = Image.new('RGB', (w, h), (255, 255, 255))
draw = ImageDraw.Draw(img)
y = 16
for line in lines:
draw.text((16, y), line, fill=(0, 0, 0), font=FONT)
y += lh
buf = io.BytesIO()
img.save(buf, format='PNG')
return buf.getvalue()
def embed_text(text: str) -> np.ndarray:
r = client.models.embed_content(
model=MODEL, contents=[text],
config=types.EmbedContentConfig(task_type='SEMANTIC_SIMILARITY'))
return np.array(r.embeddings[0].values, dtype=np.float32)
def embed_image(png_bytes: bytes) -> np.ndarray:
r = client.models.embed_content(
model=MODEL,
contents=[types.Part.from_bytes(data=png_bytes, mime_type='image/png')])
return np.array(r.embeddings[0].values, dtype=np.float32)
def cosine(a, b):
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# ── The DSPy metric ──────────────────────────────────────────────────
THRESHOLD = 0.35 # adjust after calibration
def embedding_metric(example, prediction, trace=None):
    """
    Cross-modal cosine similarity between the story description
    and the rendered PNG of the predicted ASCII art.

    - During evaluation (trace is None):
        Returns a float in [0, 1] representing alignment quality.
    - During bootstrapping (trace is not None):
        Returns True if the score clears THRESHOLD, else False.
        DSPy uses this to filter good few-shot demos.
    """
description = example.story_excerpt # the text prompt
ascii_art = prediction.ascii_art # the generated output
text_vec = embed_text(description)
image_vec = embed_image(render(ascii_art))
score = cosine(text_vec, image_vec)
# Normalize to [0, 1] from typical cross-modal range [0.1, 0.5]
score_norm = max(0.0, min(1.0, (score - 0.1) / 0.4))
if trace is not None:
return score_norm >= THRESHOLD # bool for bootstrapping
    return score_norm  # float for evaluation
```
The trace parameter controls dual-mode behavior. When DSPy is bootstrapping few-shot demonstrations, it passes a trace object and expects a bool — "is this demo good enough to include?" When it is evaluating the full program, trace is None and it expects a float for ranking. The same function handles both cases.
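Before handing the metric to an optimizer, it is worth exercising both paths directly. A minimal sketch, reusing the `example` built earlier; the `dspy.Prediction` here is constructed by hand rather than produced by an LM:

```python
# Hand-built prediction, purely to exercise the metric in both modes.
pred = dspy.Prediction(ascii_art=' /\\_/\\\n( o.o )\n > ^ <\n')

print(embedding_metric(example, pred))            # evaluation mode: float in [0, 1]
print(embedding_metric(example, pred, trace=[]))  # bootstrap mode: True/False
```

Scoring a handful of reference arts through the float path is also the quickest way to calibrate THRESHOLD before compiling.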
```python
import os

import dspy
# ── 1. Configure the LM ─────────────────────────────────────────────
lm = dspy.LM('gemini/gemini-3.1-pro-preview', api_key=os.environ['GEMINI_API_KEY'])
dspy.configure(lm=lm)
# ── 2. Define the Signature ──────────────────────────────────────────
class AsciiArtSignature(dspy.Signature):
"""Convert a story excerpt into expressive ASCII art that captures
the subject described in the text."""
story_excerpt: str = dspy.InputField(
desc='A short passage of prose describing a scene or character.'
)
ascii_art: str = dspy.OutputField(
desc='ASCII art (3-8 lines, plain text characters only) that visually '
'represents the main subject of the excerpt.'
)
# ── 3. Define the Module ─────────────────────────────────────────────
class AsciiArtModule(dspy.Module):
def __init__(self):
super().__init__()
self.generate = dspy.Predict(AsciiArtSignature)
def forward(self, story_excerpt: str) -> dspy.Prediction:
# dspy.Predict calls the LM with the current prompt + any
# few-shot demos that have been compiled in.
return self.generate(story_excerpt=story_excerpt)
# ── 4. Build trainset ────────────────────────────────────────────────
trainset = [
dspy.Example(
story_excerpt='The old tabby curled on the windowsill, eyes half-closed.',
        ascii_art=' /\\_/\\\n ( o.o )\n > ^ <\n',
).with_inputs('story_excerpt'),
dspy.Example(
story_excerpt='A hawk circled high above the valley, wings spread wide.',
ascii_art=' __\n / \\\n \\____/\n \\ /\n \\/\n',
).with_inputs('story_excerpt'),
# ... add more for better optimization
]
# ── 5. Set up MIPROv2 ────────────────────────────────────────────────
optimizer = dspy.MIPROv2(
metric=embedding_metric, # our cross-modal cosine metric
auto='light', # light = fewer trials, good for prototyping
num_threads=4,
)
# ── 6. Compile ───────────────────────────────────────────────────────
module = AsciiArtModule()
optimized = optimizer.compile(
module,
trainset=trainset,
max_bootstrapped_demos=3, # include up to 3 few-shot examples
max_labeled_demos=2, # seed with up to 2 labeled demos
)
# ── 7. Use the optimized module ──────────────────────────────────────
result = optimized(story_excerpt='A wolf howled at the full moon.')
print(result.ascii_art)
```
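Before looking at how MIPROv2 works, one practical note: compilation is the expensive step, so persist the result rather than re-optimizing every run. A minimal sketch using DSPy's module save/load; the filename is illustrative:

```python
# Persist the tuned instruction and bootstrapped demos to disk.
optimized.save('ascii_art_optimized.json')

# Later, in a fresh process: rebuild the module and restore its state.
restored = AsciiArtModule()
restored.load('ascii_art_optimized.json')
```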
MIPROv2 (Multiprompt Instruction Proposal Optimizer, version 2) operates in three stages:

1. Bootstrap demonstrations: run the unoptimized module over the trainset and keep the input/output traces the metric accepts; this is where the metric's bool path is exercised.
2. Propose instructions: prompt an LM to draft candidate instructions, grounded in the signature, the bootstrapped demos, and summaries of the training data.
3. Search: run Bayesian optimization over combinations of instructions and demos, scoring candidate programs with the metric on minibatches of the trainset.
MIPROv2 will discover prompt instructions that the embedding metric rewards. Because the metric measures how well the rendered ASCII art aligns with the text description in Gemini's embedding space, the optimizer will converge on instructions that produce outputs the embedding model recognizes as matching the text — purely from geometric feedback.
In practice, this likely means instructions like: "use dense characters for dark or solid regions", "use whitespace to preserve shape boundaries", "match the general outline of the described subject". The optimizer will not know this is what it found; it only knows that certain instructions produce higher cosine scores. But the geometric signal is rich enough to guide it there.
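You can check what the optimizer actually settled on by inspecting the compiled predictor. A minimal sketch; the attribute layout follows current DSPy conventions and may shift between versions:

```python
# The tuned instruction text lives on the predictor's signature.
print(optimized.generate.signature.instructions)

# The few-shot demos that survived the metric's bootstrap filter.
for demo in optimized.generate.demos:
    print(demo.story_excerpt)
    print(demo.ascii_art)
```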
This is the deeper value of the paradigm: the embedding space serves as an implicit specification of quality. You do not need to encode domain knowledge into your metric. The model's pretraining has already encoded it. You just need to measure against it.
As you expand the trainset to new subjects, the metric generalizes automatically, because the embedding space generalizes automatically. A metric that has only ever scored cats and hawks will evaluate ASCII art of wolves and castles just as well, without any additional labeling, because the embedding model already understands what wolves and castles look like.
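One way to test that claim is to score the compiled program on held-out subjects with `dspy.Evaluate`. A minimal sketch; the devset excerpts are made up for illustration, and held-out examples need only an excerpt, since the metric never reads an `ascii_art` label:

```python
# Held-out subjects the optimizer never saw (excerpts are illustrative).
devset = [
    dspy.Example(
        story_excerpt='A grey wolf stood on the ridge, muzzle raised to the moon.',
    ).with_inputs('story_excerpt'),
    dspy.Example(
        story_excerpt='The castle rose out of the mist, towers flanking its gate.',
    ).with_inputs('story_excerpt'),
]

evaluate = dspy.Evaluate(devset=devset, metric=embedding_metric,
                         num_threads=4, display_progress=True)
evaluate(optimized)  # reports the average normalized cosine score
```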