The Eval Runner: Testing Models Against Known Answers
The eval runner answers a simple question: how good is this model at answering known questions? It is a test suite for LLMs, conceptually related to the older “eval harness” meaning in AI 1.
dataset → model → scorer → pass/fail → summary
The full implementation lives in src/eval/.
Dataset
src/eval/dataset.rs defines a set of test cases designed to trigger
common hallucinations. Each case has an input, an expected answer, and a “trap” —
the wrong answer that weaker models confidently give.
#![allow(unused)]
fn main() {
pub struct TestCase {
pub id: &'static str,
pub input: &'static str,
pub expected: &'static str,
pub trap: Option<&'static str>,
}
}
Examples: “What is the capital of Australia?” (expected: Canberra, trap:
Sydney), “How many hearts does an octopus have?” (expected: 3, trap: 1).
The trap field tracks whether the model fell for the decoy. This exposes weak models quickly.
Model Client
src/eval/model.rs wraps any Ollama model in a single function:
#![allow(unused)]
fn main() {
pub async fn call_model(model: &str, prompt: &str) -> Result<String, String>
}
It sends a system prompt instructing brief answers, makes one API call to
Ollama’s /v1/chat/completions endpoint, and returns the response text.
Swap the model string to test a different model. No tools, no state, no loop —
one call per test case.
Scorers
src/eval/scorer.rs provides scoring functions:
score_contains— output must contain the expected text somewhere
The normalize function maps number words to digits so that “three” and “3”
are treated equally. Models answer the same question differently — this
shouldn’t count as wrong.
Runner
src/eval/runner.rs loops over every test case, calls the model, scores
the result, and collects the numbers:
#![allow(unused)]
fn main() {
pub async fn run_eval(
cases: &[TestCase],
model: &str,
scorer: ScorerFn,
) -> EvalRun
}
The model and scorer are passed in as arguments — swap either one without touching this file.
For each test case it records: the actual answer, whether it fell for the trap, the score, whether it passed, and latency in milliseconds.
Output and Comparison
src/bin/eval.rs runs one or more models against the same dataset and
prints results:
#![allow(unused)]
fn main() {
const MODELS: &[&str] = &[
"gemma4:e4b",
];
}
Each model gets the same cases, the same scorer, the same conditions. The output shows which models fell for traps, their average scores, and their latency.
The key value: the same test, run consistently, so you can compare models or catch regressions over time.
Run it:
cargo run --bin eval
-
lm-evaluation-harness — EleutherAI, 2021 ↩