Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

The Two Meanings of “Harness”

The word “harness” has two distinct meanings in AI. As Tejas Kumar explains, conflating them causes real confusion [^talk]. This repo has an eval runner (test suite for models) alongside an agent harness (tools + environment).

Eval Runner (Older Meaning)

Originating in ML research around 2021 with EleutherAI’s LM Evaluation Harness 1, an eval runner measures model quality against known answers. (The EleutherAI project calls itself a “harness,” which is the source of the terminology collision this chapter addresses.)

dataset → model → scorer → pass/fail → summary

It is a test suite for models. You feed it a fixed set of inputs with expected outputs, run one or more models, and get scores. No tools, no loops, no guardrails.

Agent Harness (Newer Meaning)

Emerged in agentic engineering around 2026 2 3. An agent harness enables a model to act in the real world — not just answer one prompt, but do actual work in a loop.

task → [tools + context + guardrails + loop + verify] → result

It has tools, state, guardrails, and verification. The model iterates until it finishes the task or a guardrail fires.

Side-by-Side Comparison

Eval runnerAgent harness
OriginML research, 2021Agentic engineering, 2026
ExampleEleutherAI’s LM Evaluation HarnessClaude Agent SDK, this repo
PurposeMeasure model quality against known answersEnable a model to act in the real world
InputFixed datasetOpen-ended task
OutputScores and pass/failAnswer + tool call log
LoopOne call per test caseIterates until done or guardrail fires
ToolsNoneYes — the whole point
GuardrailsNot neededEssential
StateStatelessConversation history across turns

Both are valuable. The eval runner tells you how good a model is. The agent harness tells you how well a model can do work.


  1. lm-evaluation-harness — EleutherAI, 2021

  2. “My AI Adoption Journey” — Mitchell Hashimoto, February 2026

  3. “Harness engineering: leveraging Codex in an agent-first world” — OpenAI, February 2026