The Eval Runner: Testing Models Against Known Answers

The eval runner answers a simple question: how good is this model at answering known questions? It is a test suite for LLMs, conceptually related to the older “eval harness” meaning in AI ¹.

dataset → model → scorer → pass/fail → summary

The full implementation lives in src/eval/.

Dataset

src/eval/dataset.rs defines a set of test cases designed to trigger common hallucinations. Each case has an input, an expected answer, and a “trap” — the wrong answer that weaker models confidently give.

#![allow(unused)]
fn main() {
pub struct TestCase {
    pub id: &'static str,
    pub input: &'static str,
    pub expected: &'static str,
    pub trap: Option<&'static str>,
}
}

Examples: “What is the capital of Australia?” (expected: Canberra, trap: Sydney), “How many hearts does an octopus have?” (expected: 3, trap: 1).

The trap field tracks whether the model fell for the decoy. This exposes weak models quickly.

Model Client

src/eval/model.rs wraps any Ollama model in a single function:

#![allow(unused)]
fn main() {
pub async fn call_model(model: &str, prompt: &str) -> Result<String, String>
}

It sends a system prompt instructing brief answers, makes one API call to Ollama’s /v1/chat/completions endpoint, and returns the response text. Swap the model string to test a different model. No tools, no state, no loop — one call per test case.

Scorers

src/eval/scorer.rs provides scoring functions:

score_contains — output must contain the expected text somewhere

The normalize function maps number words to digits so that “three” and “3” are treated equally. Models answer the same question differently — this shouldn’t count as wrong.

Runner

src/eval/runner.rs loops over every test case, calls the model, scores the result, and collects the numbers:

#![allow(unused)]
fn main() {
pub async fn run_eval(
    cases: &[TestCase],
    model: &str,
    scorer: ScorerFn,
) -> EvalRun
}

The model and scorer are passed in as arguments — swap either one without touching this file.

For each test case it records: the actual answer, whether it fell for the trap, the score, whether it passed, and latency in milliseconds.

Output and Comparison

src/bin/eval.rs runs one or more models against the same dataset and prints results:

#![allow(unused)]
fn main() {
const MODELS: &[&str] = &[
    "gemma4:e4b",
];
}

Each model gets the same cases, the same scorer, the same conditions. The output shows which models fell for traps, their average scores, and their latency.

The key value: the same test, run consistently, so you can compare models or catch regressions over time.

Run it:

cargo run --bin eval

lm-evaluation-harness — EleutherAI, 2021 ↩