Basic Harness

This book is a step-by-step guide to AI harness engineering, built alongside basic-harness — a Rust implementation inspired by Tejas Kumar’s talk “Harnesses in AI: A Deep Dive” presented at AI Engineer World’s Fair ¹. The concepts, architecture, and terminology follow Tejas’s framework, while the implementation details reflect the Rust code in this repo.

You will learn:

What an AI harness is and why it matters
The difference between eval runners and agent harnesses
The six components every agent harness needs
How to build an eval runner that tests models against known answers
How to build an agent harness that gives models tools, guardrails, and verification
Why the harness — not the model — is the moat

Each chapter corresponds to code in this repo. Read it side by side with the source.

¹ “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026

What Is an AI Harness?

An AI harness is the infrastructure that wraps around an AI model to make it useful for real-world tasks. It manages input and output behind the scenes and ensures the model has the tools, context, and environment to do what’s asked.

As Tejas Kumar put it in his talk ¹:

An AI harness is everything except the model weights.

In practice that means: tool interfaces, context and memory handling, guardrails, verification steps, approval gates, logging, and recovery loops.

Why Do We Need Harnesses?

Large language models are non-deterministic black boxes. You send in a prompt and get back a string — but you cannot guarantee what that string will be. The same prompt can produce different answers on different calls. The model can hallucinate, lie about what it did, or silently fail.

A harness solves this by providing a deterministic skeleton around the non-deterministic model. It enforces structure, checks outcomes, and catches failures the model itself would never report.

The Mountain Climber Analogy

In his presentation at AI Engineer World’s Fair, Tejas Kumar uses a climbing analogy to explain the concept ¹:

A harness is what anchors a climber to the mountain. Without it, a single slip is fatal. With it, the climber can take risks, recover from stumbles, and do real work.

The model is the climber. The harness is the rope, the carabiners, and the belay system.

Harness vs. Orchestration

Different vendors call this same concept different things:

Vendor	Term
Anthropic	“General-purpose agent harness” / “Context engineering”
OpenAI	“Orchestration”
Mitchell Hashimoto	“Harness engineering” (coined Feb 2026)
Thoughtworks	“Agent scaffolding”

The name doesn’t matter. The pattern does.

The model is rented. The harness is the moat.

¹ “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026

The Two Meanings of “Harness”

The word “harness” has two distinct meanings in AI. As Tejas Kumar explains in his talk, conflating them causes real confusion. This repo has an eval runner (test suite for models) alongside an agent harness (tools + environment).

Eval Runner (Older Meaning)

Originating in ML research around 2021 with EleutherAI’s LM Evaluation Harness ¹, an eval runner measures model quality against known answers. (The EleutherAI project calls itself a “harness,” which is the source of the terminology collision this chapter addresses.)

dataset → model → scorer → pass/fail → summary

It is a test suite for models. You feed it a fixed set of inputs with expected outputs, run one or more models, and get scores. No tools, no loops, no guardrails.

Agent Harness (Newer Meaning)

The newer meaning emerged in agentic engineering around 2026 ² ³. An agent harness enables a model to act in the real world: not just to answer one prompt, but to do actual work in a loop.

task → [tools + context + guardrails + loop + verify] → result

It has tools, state, guardrails, and verification. The model iterates until it finishes the task or a guardrail fires.

Side-by-Side Comparison

	Eval runner	Agent harness
Origin	ML research, 2021	Agentic engineering, 2026
Example	EleutherAI’s LM Evaluation Harness	Claude Agent SDK, this repo
Purpose	Measure model quality against known answers	Enable a model to act in the real world
Input	Fixed dataset	Open-ended task
Output	Scores and pass/fail	Answer + tool call log
Loop	One call per test case	Iterates until done or guardrail fires
Tools	None	Yes — the whole point
Guardrails	Not needed	Essential
State	Stateless	Conversation history across turns

Both are valuable. The eval runner tells you how good a model is. The agent harness tells you how well a model can do work.

¹ lm-evaluation-harness, EleutherAI, 2021
² “My AI Adoption Journey”, Mitchell Hashimoto, February 2026
³ “Harness engineering: leveraging Codex in an agent-first world”, OpenAI, February 2026

Why Harness Engineering Matters

The Origin of the Term

In February 2026, Mitchell Hashimoto — co-founder of HashiCorp, creator of Terraform — published a blog post that gave the practice a name ¹:

Whenever an agent makes a mistake, you engineer the environment so it won’t make that mistake again.

Days later, OpenAI used the same phrase describing how they built an internal beta product ²: roughly one million lines of code, written entirely by agents, shipped in five months, with no manually written source code. Their key insight:

When something failed, the fix was almost never “try harder.” Human engineers always stepped in and asked: what capability is missing, and how do we make it both legible and enforceable for the agent?

The Shift in Engineering Work

Harness engineering changes the engineer’s job, as described by both Hashimoto and OpenAI ¹ ²:

Before	After
Write code	Design environments
Debug implementation	Specify intent
Fix bugs directly	Build feedback loops
Optimize logic	Structure constraints

The engineer stops writing application code and starts designing the scaffolding that makes agents productive.

Three Core Components

Drawing from Thoughtworks and OpenAI, three core components of a harness are ²:

Context engineering — deciding what information to include or exclude at each model call: isolation (keep subtasks separate), reduction (drop stale data to avoid context rot), retrieval (inject fresh docs or search results at the right time).
Architectural constraints — enforced not just by the model, but by deterministic linters, structural tests, and guardrails the model cannot bypass.
Verification and feedback loops — the harness checks outputs, runs eval steps, and if something is wrong, surfaces it so the agent or the engineer can fix it.

Why It Matters Now

As Tejas Kumar notes in his talk ³:

“The name of the game with harness is reliability. It’s making sure that the agents we build do what they do, period. Irrespective of the black box model.”

Users pay $20/month for Claude Pro. The model is a black box. Anthropic could serve Sonnet instead of Opus without notice. Too many variables escape control.

Harnesses solve that. They anchor an agent in a stable environment. The goal is reliability regardless of the rented model.

The company that builds the best harness will win, not the company with the most advanced model.

¹ “My AI Adoption Journey”, Mitchell Hashimoto, February 2026
² “Harness engineering: leveraging Codex in an agent-first world”, OpenAI, February 2026
³ “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026

Components of an Agent Harness

Every agent harness has six typical components. Tejas Kumar identifies these in his talk ¹, using coding agents like Claude Code and Cursor as familiar reference examples. These form the scaffolding that grounds a model in reality.

1. Tool Registry

A defined set of actions the model can take. Each tool has a name, description, parameter schema, and an execution function.

Examples in this repo: browser_navigate, browser_click, browser_fill, browser_get_text, browser_get_stories, browser_url, browser_has_class.

Tools are registered in one place (src/tools.rs) and injected into the loop as a parameter — never imported globally.
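
A minimal sketch of such a registry, assuming a simplified Tool that takes its arguments as a plain string instead of a JSON parameter schema. The names ToolRegistry, register, and call are hypothetical and are not taken from src/tools.rs.

```rust
use std::collections::HashMap;

/// Boxed execution function: raw arguments in, result text or error out.
type ToolExecute = Box<dyn Fn(&str) -> Result<String, String>>;

/// Simplified tool (assumption: the real definition also carries a parameter schema).
pub struct Tool {
    pub name: String,
    pub description: String,
    pub execute: ToolExecute,
}

/// Hypothetical registry: the single place where tools are looked up by name.
#[derive(Default)]
pub struct ToolRegistry {
    tools: HashMap<String, Tool>,
}

impl ToolRegistry {
    pub fn register(&mut self, tool: Tool) {
        self.tools.insert(tool.name.clone(), tool);
    }

    /// Run a tool by name. Unknown names become an error value that the loop
    /// can report back to the model instead of crashing.
    pub fn call(&self, name: &str, args: &str) -> Result<String, String> {
        match self.tools.get(name) {
            Some(tool) => (tool.execute)(args),
            None => Err(format!("unknown tool: {name}")),
        }
    }
}
```

Because the registry is an ordinary value, it can be passed to the loop as a parameter, which is what keeps tools out of global scope.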

2. Model

The underlying LLM. In this repo, any Ollama model is interchangeable by changing one string in src/harness.rs. The harness doesn’t care which model runs — it just sends messages and reads responses. The model client (src/model.rs) uses Ollama’s OpenAI-compatible endpoint at /v1/chat/completions.

3. Context Management

The harness builds initial context (src/context.rs) and manages message history. Without it, context windows fill up with stale tool results. Good context management compacts or trims messages so the model always sees relevant information.
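
One simple trimming policy, sketched under the assumption that the first message is the system prompt and should always survive. The Message shape and the trim_context name are illustrative; the repo's own strategy may differ.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Keep the first message (assumed to be the system prompt) plus the most
/// recent messages, so that at most `max` messages remain.
pub fn trim_context(messages: &mut Vec<Message>, max: usize) {
    // Nothing to do under the limit; a limit below 1 could not keep the system prompt.
    if max < 1 || messages.len() <= max {
        return;
    }
    // Drop the oldest messages that sit right after the system prompt.
    let excess = messages.len() - max;
    messages.drain(1..1 + excess);
}
```

A real harness would likely also keep each tool result together with the assistant message that requested it, since OpenAI-style chat APIs expect that pairing.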

4. Guardrails

Hard limits on agent behavior that run before every loop iteration:

Max iterations: “Do not make more than N tool calls.”
Max messages: “Stop if the conversation exceeds M messages.”

Guardrails are composable, deterministic checks. They catch structural failures before the model can waste tokens or go off the rails.

5. Agent Loop

The outer orchestration loop (src/agent_loop.rs):

while (true):
  call model → parse response
  if answer: return
  if tool_calls: execute each → append results → loop

This is the engine. But the loop alone is not the harness — the harness is everything around the loop.

6. Login Handler

Some tasks require authentication. When a browser agent hits a login page, the harness can detect this and auto-fill credentials. In this repo, src/login_handler.rs checks the current URL after every tool call batch. If it matches /login or /vote, it fills the HN login form and submits, then injects a synthetic tool event into the trace.

7. Shared State (Upvote Detection)

Tools and guardrails communicate through shared mutable state: Arc<Mutex<Option<UpvotedStory>>>. Tools write to it (detecting upvote clicks), guardrails read from it (stopping on success). This replaces the callback hook pattern from the TypeScript version with idiomatic Rust.
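
A small sketch of that pattern. The story_id field, the UpvoteState alias, and both helper functions are assumptions made for illustration; only the Arc<Mutex<Option<UpvotedStory>>> shape comes from the text above.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, PartialEq)]
pub struct UpvotedStory {
    pub story_id: String, // assumed field
}

/// Alias for the shared state described above.
pub type UpvoteState = Arc<Mutex<Option<UpvotedStory>>>;

/// Tool side: record the story that was just upvoted.
pub fn record_upvote(state: &UpvoteState, story_id: &str) {
    let mut guard = state.lock().expect("upvote state lock poisoned");
    *guard = Some(UpvotedStory { story_id: story_id.to_string() });
}

/// Guardrail side: has any tool recorded an upvote yet?
pub fn upvote_recorded(state: &UpvoteState) -> bool {
    state.lock().expect("upvote state lock poisoned").is_some()
}
```

Each side holds its own clone of the Arc, so neither the tools nor the guardrails need a reference to the other.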

8. Verify Step

A post-hoc check that the intended outcome actually occurred. In a coding agent this would be “run lint, run tests.” In a browser agent this uses the live browser session to check the page DOM — if HN removed the upvote arrow element after the click, the vote was registered.

Guardrails catch structural failures. Verify catches wrong answers. You need both.

Putting It Together

runHarness()
  ├── create shared state
  ├── create guardrails (bound to state)
  │
  ├── attempt 1..MAX_ATTEMPTS:
  │   ├── open environment
  │   ├── create tools (bound to environment + state)
  │   ├── create login handler (bound to environment)
  │   ├── build initial context
  │   ├── runLoop(tools, context, guardrails, login handler)
  │   │     ├── trim context
  │   │     ├── check guardrails
  │   │     ├── call model
  │   │     ├── execute tools or return answer
  │   │     ├── run login handler
  │   │     └── log trace iteration
  │   ├── verify result from trace
  │   ├── if verified: return success
  │   └── close environment (always)
  │
  └── return result (with verification)

This architecture is adapted from the Rust source files at the project root.

¹ “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026

The Eval Runner: Testing Models Against Known Answers

The eval runner answers a simple question: how good is this model at answering known questions? It is a test suite for LLMs, conceptually related to the older “eval harness” meaning in AI ¹.

dataset → model → scorer → pass/fail → summary

The full implementation lives in src/eval/.

Dataset

src/eval/dataset.rs defines a set of test cases designed to trigger common hallucinations. Each case has an input, an expected answer, and a “trap” — the wrong answer that weaker models confidently give.

pub struct TestCase {
    pub id: &'static str,
    pub input: &'static str,
    pub expected: &'static str,
    pub trap: Option<&'static str>,
}

Examples: “What is the capital of Australia?” (expected: Canberra, trap: Sydney), “How many hearts does an octopus have?” (expected: 3, trap: 1).

The trap field tracks whether the model fell for the decoy. This exposes weak models quickly.

Model Client

src/eval/model.rs wraps any Ollama model in a single function:

pub async fn call_model(model: &str, prompt: &str) -> Result<String, String>

It sends a system prompt instructing brief answers, makes one API call to Ollama’s /v1/chat/completions endpoint, and returns the response text. Swap the model string to test a different model. No tools, no state, no loop — one call per test case.

Scorers

src/eval/scorer.rs provides scoring functions:

score_contains — output must contain the expected text somewhere

The normalize function maps number words to digits so that “three” and “3” are treated equally. Models answer the same question differently — this shouldn’t count as wrong.
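
A sketch of how such a scorer could look. Only the names normalize and score_contains come from the text; the tokenization rule and the zero-to-nine word list are assumptions.

```rust
/// Map a single lowercase number word to its digit; other words pass through.
fn word_to_digit(word: &str) -> &str {
    match word {
        "zero" => "0",
        "one" => "1",
        "two" => "2",
        "three" => "3",
        "four" => "4",
        "five" => "5",
        "six" => "6",
        "seven" => "7",
        "eight" => "8",
        "nine" => "9",
        other => other,
    }
}

/// Lowercase, strip punctuation, and replace number words with digits.
pub fn normalize(text: &str) -> String {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty())
        .map(word_to_digit)
        .collect::<Vec<_>>()
        .join(" ")
}

/// Pass when the normalized output contains the normalized expected text.
pub fn score_contains(output: &str, expected: &str) -> bool {
    normalize(output).contains(&normalize(expected))
}
```

Substring matching is deliberately lenient: an answer of "13" would also satisfy an expected "3", so a stricter scorer may be worth adding for numeric cases.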

Runner

src/eval/runner.rs loops over every test case, calls the model, scores the result, and collects the numbers:

pub async fn run_eval(
    cases: &[TestCase],
    model: &str,
    scorer: ScorerFn,
) -> EvalRun

The model and scorer are passed in as arguments — swap either one without touching this file.

For each test case it records: the actual answer, whether it fell for the trap, the score, whether it passed, and latency in milliseconds.
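
The recording step can be sketched synchronously, with the model replaced by any function from prompt to answer. CaseResult, run_cases, and contains_ignore_case are hypothetical names; the real run_eval is async and returns an EvalRun.

```rust
use std::time::Instant;

pub struct TestCase {
    pub id: &'static str,
    pub input: &'static str,
    pub expected: &'static str,
    pub trap: Option<&'static str>,
}

#[derive(Debug)]
pub struct CaseResult {
    pub id: &'static str,
    pub actual: String,
    pub fell_for_trap: bool,
    pub passed: bool,
    pub latency_ms: u128,
}

/// Example scorer: case-insensitive substring match.
pub fn contains_ignore_case(output: &str, expected: &str) -> bool {
    output.to_lowercase().contains(&expected.to_lowercase())
}

/// Run every case once and record answer, trap hit, pass/fail, and latency.
pub fn run_cases(
    cases: &[TestCase],
    model: impl Fn(&str) -> String,
    scorer: fn(&str, &str) -> bool,
) -> Vec<CaseResult> {
    cases
        .iter()
        .map(|case| {
            let start = Instant::now();
            let actual = model(case.input);
            let latency_ms = start.elapsed().as_millis();
            let lowered = actual.to_lowercase();
            // The trap is the known wrong answer; finding it in the output is a hit.
            let fell_for_trap = case
                .trap
                .map_or(false, |trap| lowered.contains(&trap.to_lowercase()));
            let passed = scorer(&actual, case.expected);
            CaseResult { id: case.id, actual, fell_for_trap, passed, latency_ms }
        })
        .collect()
}
```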

Output and Comparison

src/bin/eval.rs runs one or more models against the same dataset and prints results:

const MODELS: &[&str] = &[
    "gemma4:e4b",
];

Each model gets the same cases, the same scorer, the same conditions. The output shows which models fell for traps, their average scores, and their latency.

The key value: the same test, run consistently, so you can compare models or catch regressions over time.

Run it:

cargo run --bin eval

¹ lm-evaluation-harness, EleutherAI, 2021

The Agent Harness: Giving Models an Environment

The agent harness enables a model to act in the real world. It provides tools, manages state, enforces limits, and verifies outcomes. This is the newer meaning of “harness” in AI — the focus of Tejas Kumar’s talk ¹.

task → [tools + context + guardrails + loop + verify] → result

The full implementation lives at the project root. All code excerpts in this chapter are from the Rust implementation in this repo.

The harness is orchestrated by src/harness.rs, which ties together the browser, tools, guardrails, login handler, and agent loop into a single retry-capable pipeline.

Tool Registry

src/tools.rs defines the tools available to the model. Each tool has:

A definition (name, description, parameter schema in OpenAI format)
An execute function (the actual implementation)

pub struct Tool {
    pub definition: ToolDefinition,
    pub execute: ToolExecute,
}

Tools in this repo:

browser_navigate(url) — go to a URL
browser_url() — get current URL (detect redirects to login pages)
browser_get_text() — get visible page text
browser_fill(selector, value) — fill an input field
browser_click(selector) — click an element, wait for navigation
browser_get_stories() — structured list of HN stories with IDs and voted status
browser_has_class(selector, className) — check CSS class (verify upvote state)

The critical architectural decision: tools are created with a create_tools(session, upvote_state) function that binds them to a specific environment session and shared state. Tools don’t manage the browser. They don’t know about the browser lifecycle. The harness injects the session and upvote tracking state into the tools at construction time.

Upvote Detection

The browser_click tool in src/tools.rs also detects HN upvote clicks. After each click, it parses the selector for up_<STORYID> patterns and checks whether the current URL is on news.ycombinator.com/news. If both conditions match, it records the story ID into the shared Arc<Mutex<Option<UpvotedStory>>> state, which the guardrails read to stop the loop on success.
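
The detection rule can be sketched as two pure functions. It assumes story IDs are numeric; the function names are illustrative and not taken from src/tools.rs.

```rust
/// Extract the story ID from a selector that contains `up_<digits>`.
pub fn parse_upvote_story_id(selector: &str) -> Option<String> {
    selector.match_indices("up_").find_map(|(idx, pat)| {
        let digits: String = selector[idx + pat.len()..]
            .chars()
            .take_while(|c| c.is_ascii_digit())
            .collect();
        if digits.is_empty() { None } else { Some(digits) }
    })
}

/// Record an upvote only when the click happened on the HN news page.
pub fn detect_upvote(selector: &str, current_url: &str) -> Option<String> {
    if current_url.contains("news.ycombinator.com/news") {
        parse_upvote_story_id(selector)
    } else {
        None
    }
}
```

Keeping the check free of browser calls means it can be unit-tested without launching Chrome.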

Model Client

src/model.rs provides a configurable ModelClient that talks to Ollama’s OpenAI-compatible endpoint:

pub struct ModelClient { /* ... */ }

impl ModelClient {
    pub fn new() -> Self { /* defaults to http://localhost:11434 */ }
    pub fn with_seed(mut self, seed: u64) -> Self;
    pub fn with_temperature(mut self, temperature: f32) -> Self;
    pub fn with_max_tokens(mut self, max_tokens: u32) -> Self;
}

Swap models by changing one string in src/harness.rs. No API keys needed — Ollama runs entirely locally. Override the endpoint with the OLLAMA_URL environment variable.
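
A possible shape for that builder, shown as a sketch. The field names, with_base_url, and chat_url are assumptions added so the example is self-contained; only new, with_seed, with_temperature, with_max_tokens, and the OLLAMA_URL override come from the text.

```rust
use std::env;

#[derive(Debug, Clone, PartialEq)]
pub struct ModelClient {
    pub base_url: String,
    pub seed: Option<u64>,
    pub temperature: Option<f32>,
    pub max_tokens: Option<u32>,
}

impl ModelClient {
    /// Use OLLAMA_URL when set, otherwise the local default.
    pub fn new() -> Self {
        Self::with_base_url(env::var("OLLAMA_URL").ok())
    }

    /// Hypothetical constructor that makes the override explicit and testable.
    pub fn with_base_url(override_url: Option<String>) -> Self {
        Self {
            base_url: override_url.unwrap_or_else(|| "http://localhost:11434".to_string()),
            seed: None,
            temperature: None,
            max_tokens: None,
        }
    }

    pub fn with_seed(mut self, seed: u64) -> Self {
        self.seed = Some(seed);
        self
    }

    pub fn with_temperature(mut self, temperature: f32) -> Self {
        self.temperature = Some(temperature);
        self
    }

    pub fn with_max_tokens(mut self, max_tokens: u32) -> Self {
        self.max_tokens = Some(max_tokens);
        self
    }

    /// Full URL of the OpenAI-compatible chat endpoint.
    pub fn chat_url(&self) -> String {
        format!("{}/v1/chat/completions", self.base_url.trim_end_matches('/'))
    }
}
```

Fixing the seed and setting the temperature to zero is a common way to make repeated runs more comparable, although it does not guarantee identical output.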

Context Management

src/context.rs builds the initial message array for a new task:

pub fn create_context(task: &str) -> Vec<Message> {
    vec![
        Message { role: "system".into(), content: Some(SYSTEM_PROMPT.into()), .. },
        Message { role: "user".into(), content: Some(task.into()), .. },
    ]
}

The loop appends tool call results and model responses to this array. In a more sophisticated harness, context management would compact or trim old messages to prevent context rot (note the MAX_CONTEXT_MESSAGES constant in src/agent_loop.rs).

Guardrails

src/guardrails.rs provides composable guardrail functions that run before every loop iteration:

max_iterations(limit): stop if the agent exceeds N loop iterations
max_messages(limit): stop if the conversation exceeds M messages
stop_after_upvote(state): stop once the shared upvote state is set
combine_guardrails(vec): run multiple guardrails, first Stop wins
default_guardrails(state): returns all three combined with sensible defaults

Guardrails are GuardrailFn values: closures wrapped in Arc<dyn Fn(&GuardrailInput) -> GuardrailResult>. They catch structural failures such as runaway agents and infinite loops, and stop_after_upvote uses the shared state to end the loop once an upvote has been detected.
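
A sketch of composable guardrails in that style. The fields of GuardrailInput, the variants of GuardrailResult, and the exact stop conditions are assumptions; only the function names and the Arc<dyn Fn> shape come from the text.

```rust
use std::sync::Arc;

pub struct GuardrailInput {
    pub iteration: usize,     // completed loop iterations (assumed meaning)
    pub message_count: usize, // messages currently in the conversation
}

#[derive(Debug, Clone, PartialEq)]
pub enum GuardrailResult {
    Continue,
    Stop(String),
}

pub type GuardrailFn = Arc<dyn Fn(&GuardrailInput) -> GuardrailResult>;

pub fn max_iterations(limit: usize) -> GuardrailFn {
    Arc::new(move |input: &GuardrailInput| {
        if input.iteration >= limit {
            GuardrailResult::Stop(format!("max iterations ({limit}) reached"))
        } else {
            GuardrailResult::Continue
        }
    })
}

pub fn max_messages(limit: usize) -> GuardrailFn {
    Arc::new(move |input: &GuardrailInput| {
        if input.message_count > limit {
            GuardrailResult::Stop(format!("max messages ({limit}) exceeded"))
        } else {
            GuardrailResult::Continue
        }
    })
}

/// Run guardrails in order; the first Stop wins and later ones are skipped.
pub fn combine_guardrails(guardrails: Vec<GuardrailFn>) -> GuardrailFn {
    Arc::new(move |input: &GuardrailInput| {
        guardrails
            .iter()
            .map(|guardrail| guardrail(input))
            .find(|result| *result != GuardrailResult::Continue)
            .unwrap_or(GuardrailResult::Continue)
    })
}
```

Since the combined guardrail has the same type as its parts, combinations can themselves be combined.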

Agent Loop

src/agent_loop.rs is the orchestration engine:

loop:
  1. trim context (if over MAX_CONTEXT_MESSAGES)
  2. check guardrails → if Stop, return immediately
  3. call model with current messages + tools
  4. if model says "stop": return answer
  5. if model calls tools:
       for each tool call:
         execute the tool
         capture ToolEvent in trace
         append result to messages
  6. run login handler (if page is /login or /vote, auto-fill credentials
     and inject a "harness_auto_login" event + user message)
  7. log iteration to trace with all tool events
  8. loop back to step 1

The loop tracks a full Vec<LoopIteration> trace, where each iteration contains Vec<ToolEvent> with the tool name, arguments, and result. This trace is used by harness.rs for verification and structured output.

The loop sends messages in native OpenAI format directly to Ollama’s /v1/chat/completions endpoint. No message conversion layer needed.
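
The core of the numbered steps above, reduced to a synchronous sketch. The model and the tool executor are passed in as closures, context trimming and the login handler are left out, and every type here is a simplified stand-in for the ones in src/agent_loop.rs.

```rust
/// What the model asked for on one turn.
pub enum ModelReply {
    Answer(String),
    ToolCalls(Vec<String>), // tool names only; arguments omitted for brevity
}

#[derive(Debug, PartialEq)]
pub enum LoopOutcome {
    Answer(String),
    Stopped(String),
}

/// Returns the outcome plus a trace with one entry per iteration.
pub fn run_loop(
    mut call_model: impl FnMut(&[String]) -> ModelReply,
    mut execute_tool: impl FnMut(&str) -> String,
    max_iterations: usize,
) -> (LoopOutcome, Vec<Vec<String>>) {
    let mut messages: Vec<String> = Vec::new();
    let mut trace: Vec<Vec<String>> = Vec::new();
    loop {
        // The guardrail check runs before the model is called.
        if trace.len() >= max_iterations {
            return (LoopOutcome::Stopped("max iterations reached".to_string()), trace);
        }
        match call_model(&messages) {
            ModelReply::Answer(text) => return (LoopOutcome::Answer(text), trace),
            ModelReply::ToolCalls(calls) => {
                let mut events = Vec::new();
                for name in calls {
                    // Execute the tool and feed its result back to the model.
                    let result = execute_tool(&name);
                    messages.push(format!("{name} -> {result}"));
                    events.push(name);
                }
                trace.push(events);
            }
        }
    }
}
```

Passing the model in as a closure makes the loop testable with a scripted sequence of replies, with no network call involved.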

Login Handler

src/login_handler.rs provides create_login_handler(session), which returns a closure that runs after every batch of tool calls. It checks the current URL:

If the URL contains /login or /vote (HN redirects unauthenticated upvote attempts to the login page), it auto-fills the credentials and submits the form via input[name='acct'], input[name='pw'], and input[type='submit'].

When triggered, it returns a ToolEvent { tool: "harness_auto_login", ... } that the loop injects into the trace. The loop also appends a user message telling the model it’s now authenticated and should navigate back to HN.
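
The decision logic can be sketched without a browser by passing the form-filling step in as a closure. The ToolEvent fields, needs_login, and login_if_needed are hypothetical; only the URL rule and the harness_auto_login event name come from the text.

```rust
#[derive(Debug, PartialEq)]
pub struct ToolEvent {
    pub tool: String,
    pub result: String,
}

/// The URL rule described above: /login or /vote means the page needs credentials.
pub fn needs_login(url: &str) -> bool {
    url.contains("/login") || url.contains("/vote")
}

/// Run `fill_and_submit` only on a login page, and report what happened as a
/// synthetic tool event. Returns None when no login was needed.
pub fn login_if_needed(
    current_url: &str,
    fill_and_submit: impl FnOnce() -> Result<(), String>,
) -> Option<ToolEvent> {
    if !needs_login(current_url) {
        return None;
    }
    let result = match fill_and_submit() {
        Ok(()) => "logged in".to_string(),
        Err(reason) => format!("failed: {reason}"),
    };
    Some(ToolEvent { tool: "harness_auto_login".to_string(), result })
}
```

Recording a failed login as an event, instead of hiding it, gives the verify step something deterministic to inspect.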

The Harness Lifecycle

The harness (src/harness.rs) owns the full lifecycle with retry logic. This is the architectural decision that makes it a real harness rather than just a loop with tools:

run_harness()
  ├── create shared UpvotedStory state    ← tools write, guardrails read
  ├── create default_guardrails(state)    ← includes stop_after_upvote
  │
  ├── attempt 1..MAX_ATTEMPTS:
  │   ├── BrowserSession::open()           ← harness opens the environment
  │   ├── create_tools(session, state)     ← tools bound to session + state
  │   ├── create_login_handler(session)    ← auto-login on redirect
  │   ├── create_context(TASK)             ← fresh context for this task
  │   ├── run_loop(guardrails, login)      ← loop runs inside the environment
  │   ├── verify_successful_upvote(result) ← check trace for up_ click
  │   ├── if verified: return success
  │   └── [Browser closed on Drop]         ← always, via RAII
  │
  └── return result with verification

The harness opens the browser, creates tools bound to that browser page and shared state, creates the login handler and guardrails, runs the loop with retry logic, verifies the outcome via trace inspection, and cleans up via Rust’s RAII drop semantics when BrowserSession is dropped.

Verification

verify_successful_upvote inspects the trace for an upvote click, then uses the live browser session to check the page DOM. HN removes the upvote arrow element entirely after a successful vote, so an element-not-found error (or the presence of the nosee class) confirms the vote was registered by HN’s servers. The harness retries up to 3 times if verification fails, reusing the same shared upvote state across attempts.
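
The retry-until-verified shape, sketched generically. HarnessResult and run_with_retries are illustrative names, and the attempt and verify steps are closures standing in for the browser run and the DOM check.

```rust
pub const MAX_ATTEMPTS: usize = 3;

#[derive(Debug, PartialEq)]
pub struct HarnessResult<T> {
    pub result: T,
    pub verified: bool,
    pub attempts: usize,
}

/// Run `attempt` until `verify` accepts its result or MAX_ATTEMPTS is reached.
/// The last result is returned either way, together with the verification flag.
pub fn run_with_retries<T>(
    mut attempt: impl FnMut(usize) -> T,
    verify: impl Fn(&T) -> bool,
) -> HarnessResult<T> {
    let mut n = 1;
    loop {
        let result = attempt(n);
        let verified = verify(&result);
        if verified || n == MAX_ATTEMPTS {
            return HarnessResult { result, verified, attempts: n };
        }
        n += 1;
    }
}
```

Returning the unverified result on the final attempt lets the caller report an honest failure instead of a silent one.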

Browser Environment

src/browser.rs provides the BrowserSession struct — a thin wrapper around headless_chrome:

impl BrowserSession {
    pub fn open() -> Result<Self>;
    pub fn navigate(&self, url: &str) -> Result<String>;
    pub fn get_url(&self) -> Result<String>;
    pub fn get_text(&self) -> Result<String>;
    pub fn fill(&self, selector: &str, value: &str) -> Result<String>;
    pub fn click(&self, selector: &str) -> Result<String>;
    pub fn get_stories(&self) -> Result<String>;          // HN-specific
    pub fn has_class(&self, selector: &str, class_name: &str) -> Result<String>;
}

Each harness run gets one isolated browser page via Chrome’s DevTools Protocol. When the run ends, whether it succeeded, failed verification, or returned an error, the browser closes via Rust’s RAII drop semantics.
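
How Drop gives that guarantee can be shown with a stand-in session. Session and run_attempt are hypothetical, and the AtomicBool flag exists only so the cleanup is observable from outside.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for a browser session that records when it is closed.
pub struct Session {
    closed: Arc<AtomicBool>,
}

impl Session {
    pub fn open(closed: Arc<AtomicBool>) -> Self {
        closed.store(false, Ordering::SeqCst);
        Session { closed }
    }
}

impl Drop for Session {
    // Runs whenever the session goes out of scope, including on early return.
    fn drop(&mut self) {
        self.closed.store(true, Ordering::SeqCst);
    }
}

/// One attempt that may bail out early with an error.
pub fn run_attempt(closed: Arc<AtomicBool>, fail: bool) -> Result<&'static str, String> {
    let _session = Session::open(closed);
    if fail {
        return Err("tool failed".to_string());
    }
    Ok("done")
}
```

No explicit close call appears on either path; ownership of the session by the attempt is what ties cleanup to the end of the run.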

¹ “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026

Incremental Demo: From Failure to Success

This chapter walks through the four versions of the agent that Tejas Kumar builds during his presentation at AI Engineer World’s Fair ¹. Each version adds one piece of harness infrastructure. The model and the task never change.

“I did not touch the prompt once. I did not change the system prompt. We just built a harness and the outcome radically changed.” — Tejas Kumar ¹

The task:

“Go to Hacker News and upvote the first post.”

The model: GPT-3.5 Turbo — deliberately chosen by Tejas as a weak, cheap model. The tool backend: raw Playwright (browser automation). (The Rust implementation in this repo uses headless_chrome instead of Playwright, and Ollama locally instead of an API-based model, but the architectural patterns are identical.)

Version 1: Raw Agent Loop

The agent loop runs with no guardrails, no verify step, no login handler.

[iter 1] browser_navigate → browser_get_text → browser_get_stories → browser_click(up_12345)
[iter 2] browser_url → browser_get_text
[iter 3] browser_url → browser_get_text
[iter 4] answer

The agent opens Hacker News, clicks the upvote button, hits a login redirect, panics, and then lies. It returns a message claiming “I upvoted” — but the trace shows it never actually logged in. The click opened a login page, and the model hallucinated success rather than admitting failure.

As Tejas Kumar said on stage ¹:

“It doesn’t verify. This is the job of a harness.”

Version 2: Guardrails and Context Limits

Two guardrails are added:

Max iterations: if the agent exceeds 6 tool calls, stop
Max messages: if the conversation exceeds 20 messages, compact context

The agent still fails (it still can’t log in), but it stops earlier. The guardrails prevent runaway token waste but don’t fix the semantic failure.

The key lesson: guardrails catch structural failures. They don’t catch wrong answers.

Version 3: Verify Step

The agent loop is refactored into runHarnessAttempt, wrapped by an outer runHarness that retries up to three times. A verifySuccessfulUpvote function inspects the trace and applies deterministic rules:

Was there a successful click on the upvote element?
Did a harness_auto_login tool run? If it ran and returned “failed,” fail.
Did the page redirect to a login URL without the login handler having run? Fail.
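
The three rules above can be written as a pure function over the trace. This is a sketch in Rust and not the demo's code: the ToolEvent fields, the convention that failed tool results start with "error", and the use of browser_url results to spot the login redirect are all assumptions.

```rust
#[derive(Debug, Clone)]
pub struct ToolEvent {
    pub tool: String,
    pub args: String,
    pub result: String,
}

impl ToolEvent {
    pub fn new(tool: &str, args: &str, result: &str) -> Self {
        Self { tool: tool.into(), args: args.into(), result: result.into() }
    }
}

pub fn verify_successful_upvote(trace: &[ToolEvent]) -> Result<(), String> {
    // Rule 1: there must be a click on an upvote element that did not error.
    let clicked = trace.iter().any(|e| {
        e.tool == "browser_click" && e.args.contains("up_") && !e.result.starts_with("error")
    });
    if !clicked {
        return Err("no successful upvote click in trace".to_string());
    }
    // Rule 2: if the login handler ran, it must not have failed.
    let login = trace.iter().find(|e| e.tool == "harness_auto_login");
    if login.map_or(false, |e| e.result.contains("failed")) {
        return Err("harness_auto_login ran but failed".to_string());
    }
    // Rule 3: a login redirect without the login handler means the vote did not count.
    let saw_login_page = trace
        .iter()
        .any(|e| e.tool == "browser_url" && e.result.contains("/login"));
    if saw_login_page && login.is_none() {
        return Err("redirected to login, but the login handler never ran".to_string());
    }
    Ok(())
}
```

None of the rules consult the model's final answer, which is why a confident but false "I upvoted" cannot pass.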

The harness detects the lie by reflecting on its own trace data. The agent now reports “failed to upvote” instead of falsely claiming success.

As Tejas Kumar put it ¹:

“Step one to solving a problem is admitting you have one.”

Version 4: Login Handler

A loginHandler function runs before every trace push in the agent loop. It checks the browser session’s current URL:

If the page is not a login page: return immediately (zero cost)
If the page is a login page: inject credentials into the form fields from environment variables, submit the form, push a synthetic message: “I’m the harness. I logged in. You’re good now.”

The agent now succeeds: opens Hacker News, hits the login redirect, the harness logs in programmatically, the agent resumes control, clicks the upvote, and the verify step confirms success.

All without changing the system prompt or the task description.

Summary of the Arc

Version	What changed	Outcome
1	Raw loop	Agent lies about success
2	Guardrails	Agent stops earlier, still lies
3	Verify step	Agent admits failure honestly
4	Login handler	Agent succeeds reliably

This progression is the core demonstration from Tejas Kumar’s talk ¹.

¹ “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026

The Harness Owns the Environment

The single most important architectural decision in this repo:

main()
  ├── BrowserSession::open()           ← harness opens the environment
  ├── create_tools(session.clone())    ← tools are bound to this session
  ├── create_context(TASK)             ← fresh context for this task
  ├── run_loop(model, &client, ...)    ← loop runs inside the environment
  └── [Browser closed on Drop]         ← always, via RAII

Tools don’t manage the browser. They don’t know about the browser lifecycle. The harness opens it, the harness closes it, and the process exits cleanly.

Why This Matters

If tools managed their own lifecycle, you’d get:

Leaked browser instances when a tool crashes
Shared global state between runs
Race conditions when tools compete for resources
No way to inject deterministic behavior (like login) mid-run

When the harness owns the environment:

Isolation: each run gets a fresh environment
Determinism: the harness can intercept and override at any point
Cleanup: Rust’s Drop (RAII) guarantees cleanup even on error
Injectability: the login handler runs inside the harness, not the agent

Tools Are Pure Functions of Their Session

Tools are created by passing a session to create_tools:

let tools = create_tools(session.clone());

The tools don’t import a browser or reach into global state. They reference the session that was handed to them. This makes them testable, swappable, and safe.

The Harness Can Intervene

Because the harness owns the loop, it can intercept before every iteration:

Run guardrails (max iterations, max messages)
Run the login handler (check URL, inject credentials)
Compact context (trim old messages)
Log and trace every event

The model never sees the harness code. It only sees the messages the harness chooses to inject.

What “Managing Input/Output Behind the Scenes” Looks Like

In the presentation demo, the login handler pushes this message ¹:

I'm the harness. I logged in. You're good now.

The model receives this as a tool result. It doesn’t know the harness injected credentials. It just sees that the login problem is solved and moves on to clicking the upvote button.

The harness is the deterministic skeleton. The model fills in the gaps.

¹ “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026

The Future: Dynamic On-The-Fly Harnesses

Tejas Kumar closes his talk with a vision of where harness engineering is headed ¹.

The Timeline

Year	Era
2025	Year of agents
2026	Year of harnesses
2027	Year of dynamic on-the-fly harnesses

Dynamic Harnesses

The next step: an agent, given a task like “buy me a flight ticket,” first generates its own harness. Before doing the work, the agent creates the scaffolding — self-aware, it knows where it might hallucinate, where it might need guardrails, and where a verify step would catch failures.

Tejas describes this as “plan mode on steroids” ¹:

Analyze the task
Identify likely failure modes
Generate guardrails (max steps, context limits)
Generate tool definitions
Generate verify steps
Execute the task within the generated harness
Return the result

“This is honestly the next logical step towards AGI.” — Tejas Kumar ¹

This aligns with OpenAI’s observation that building reliable agents is about “designing environments, specifying intent, and building feedback loops” ².

What This Means for Engineers

The trend is clear: the competitive advantage in AI shifts from who has the best model to who can build the best harness. The model is rented and interchangeable. The harness is owned and differentiated.

Engineers should invest in:

Tool design: what primitives does the agent need?
Context strategy: what information at what time?
Guardrail patterns: what are the hard limits?
Verify logic: how do we catch failures deterministically?
Environment management: how do we ensure isolation and cleanup?

The model is a commodity. The harness is the moat.

¹ “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026
² “Harness engineering: leveraging Codex in an agent-first world”, OpenAI, February 2026

Setup and Running

Prerequisites

Rust (latest stable)
Ollama running locally with a model loaded

Installation

git clone <your-repo-url>
cd basic-harness
cargo build

Running the Eval

cargo run --bin eval

Or using the justfile:

just eval

This runs one or more models against a fixed dataset and prints results per test case: pass/fail, trap detection, and latency.

Running the Agent Harness

cargo run --bin agent

Or using the justfile:

just agent

This opens a Chromium window (via Chrome DevTools Protocol), navigates to Hacker News, and attempts to upvote the top story using the local Ollama model.

Swapping Models

Edit src/harness.rs (for the agent) and change the MODEL constant, or edit the MODELS list in src/bin/eval.rs (for the eval):

const MODEL: &str = "gemma4:e4b";

Any model available in your local Ollama works.

Configuration

Copy .env.example to .env and set your Hacker News credentials:

HN_USER=your_username
HN_PASS=your_password

The login handler reads these at startup and uses them to auto-fill the HN login form when the agent is redirected to /login or /vote.
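
Reading the two variables might look like the sketch below. Credentials and both function names are hypothetical, and loading the .env file into the process environment is assumed to happen elsewhere.

```rust
use std::env;

pub struct Credentials {
    pub user: String,
    pub pass: String,
}

/// Build credentials from any key lookup, so tests need not touch the real environment.
pub fn load_credentials(
    lookup: impl Fn(&str) -> Option<String>,
) -> Result<Credentials, String> {
    // Treat empty values the same as missing ones.
    let get = |key: &str| {
        lookup(key)
            .filter(|value| !value.is_empty())
            .ok_or_else(|| format!("{key} is not set"))
    };
    Ok(Credentials { user: get("HN_USER")?, pass: get("HN_PASS")? })
}

/// Production entry point: read from the process environment.
pub fn load_credentials_from_env() -> Result<Credentials, String> {
    load_credentials(|key| env::var(key).ok())
}
```

Failing at startup with a named variable is usually easier to debug than a login form that silently submits empty fields.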

Optional environment variables:

OLLAMA_URL=http://localhost:11434   # default, change for remote Ollama

Sources and Further Reading

Primary Sources

Tejas Kumar, “Harnesses in AI: A Deep Dive” — AI Engineer World’s Fair, May 2026. YouTube
Mitchell Hashimoto, “My AI Adoption Journey” — February 2026. Coined “harness engineering” in its current agentic meaning. mitchellh.com
OpenAI, “Harness engineering: leveraging Codex in an agent-first world” — February 2026. openai.com
Anthropic, “Effective context engineering for AI agents” — Context engineering as a core harness component. anthropic.com
EleutherAI, “lm-evaluation-harness” — The older eval harness meaning (2021). github.com

This Repository

Source code: This repo — Rust implementation of an eval runner and agent harness
Presentation slides: Refer to Tejas Kumar’s original talk materials

Tool use / function calling — OpenAI’s API feature that allows models to request tool execution
headless_chrome — Rust crate for browser automation via Chrome DevTools Protocol, used as the tool backend
Ollama — local LLM runtime with OpenAI-compatible API endpoint
Agent loop — the “think-act-observe” cycle at the heart of agentic systems
