Basic Harness
This book is a step-by-step guide to AI harness engineering, built alongside basic-harness — a Rust implementation inspired by Tejas Kumar’s talk “Harnesses in AI: A Deep Dive” presented at AI Engineer World’s Fair 1. The concepts, architecture, and terminology follow Tejas’s framework, while the implementation details reflect the Rust code in this repo.
You will learn:
- What an AI harness is and why it matters
- The difference between eval runners and agent harnesses
- The six components every agent harness needs
- How to build an eval runner that tests models against known answers
- How to build an agent harness that gives models tools, guardrails, and verification
- Why the harness — not the model — is the moat
Each chapter corresponds to code in this repo. Read it side by side with the source.
1. “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026
What Is an AI Harness?
An AI harness is the infrastructure that wraps around an AI model to make it useful for real-world tasks. It manages input and output behind the scenes and ensures the model has the tools, context, and environment to do what’s asked.
As Tejas Kumar put it in his talk 1:
An AI harness is everything except the model weights.
In practice that means: tool interfaces, context and memory handling, guardrails, verification steps, approval gates, logging, and recovery loops.
Why Do We Need Harnesses?
Large language models are non-deterministic black boxes. You send in a prompt and get back a string — but you cannot guarantee what that string will be. The same prompt can produce different answers on different calls. The model can hallucinate, lie about what it did, or silently fail.
A harness solves this by providing a deterministic skeleton around the non-deterministic model. It enforces structure, checks outcomes, and catches failures the model itself would never report.
The Mountain Climber Analogy
In his presentation at AI Engineer World’s Fair, Tejas Kumar uses a climbing analogy to explain the concept 1:
A harness is what anchors a climber to the mountain. Without it, a single slip is fatal. With it, the climber can take risks, recover from stumbles, and do real work.
The model is the climber. The harness is the rope, the carabiners, and the belay system.
Harness vs. Orchestration
Different vendors call this same concept different things:
| Vendor | Term |
|---|---|
| Anthropic | “General-purpose agent harness” / “Context engineering” |
| OpenAI | “Orchestration” |
| Mitchell Hashimoto | “Harness engineering” (coined Feb 2026) |
| Thoughtworks | “Agent scaffolding” |
The name doesn’t matter. The pattern does.
The model is rented. The harness is the moat.
1. “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026
The Two Meanings of “Harness”
The word “harness” has two distinct meanings in AI. As Tejas Kumar explains in his talk, conflating them causes real confusion. This repo has an eval runner (test suite for models) alongside an agent harness (tools + environment).
Eval Runner (Older Meaning)
Originating in ML research around 2021 with EleutherAI’s LM Evaluation Harness 1, an eval runner measures model quality against known answers. (The EleutherAI project calls itself a “harness,” which is the source of the terminology collision this chapter addresses.)
dataset → model → scorer → pass/fail → summary
It is a test suite for models. You feed it a fixed set of inputs with expected outputs, run one or more models, and get scores. No tools, no loops, no guardrails.
Agent Harness (Newer Meaning)
The newer meaning emerged around 2026 in agentic engineering 2 3. An agent harness enables a model to act in the real world: not just answering one prompt, but doing actual work in a loop.
task → [tools + context + guardrails + loop + verify] → result
It has tools, state, guardrails, and verification. The model iterates until it finishes the task or a guardrail fires.
Side-by-Side Comparison
| | Eval runner | Agent harness |
|---|---|---|
| Origin | ML research, 2021 | Agentic engineering, 2026 |
| Example | EleutherAI’s LM Evaluation Harness | Claude Agent SDK, this repo |
| Purpose | Measure model quality against known answers | Enable a model to act in the real world |
| Input | Fixed dataset | Open-ended task |
| Output | Scores and pass/fail | Answer + tool call log |
| Loop | One call per test case | Iterates until done or guardrail fires |
| Tools | None | Yes — the whole point |
| Guardrails | Not needed | Essential |
| State | Stateless | Conversation history across turns |
Both are valuable. The eval runner tells you how good a model is. The agent harness tells you how well a model can do work.
1. lm-evaluation-harness, EleutherAI, 2021
2. “My AI Adoption Journey”, Mitchell Hashimoto, February 2026
3. “Harness engineering: leveraging Codex in an agent-first world”, OpenAI, February 2026
Why Harness Engineering Matters
The Origin of the Term
In February 2026, Mitchell Hashimoto — co-founder of HashiCorp, creator of Terraform — published a blog post that gave the practice a name 1:
Whenever an agent makes a mistake, you engineer the environment so it won’t make that mistake again.
Days later, OpenAI used the same phrase describing how they built an internal beta product 2: roughly one million lines of code, written entirely by agents, shipped in five months, with no manually written source code. Their key insight:
When something failed, the fix was almost never “try harder.” Human engineers always stepped in and asked: what capability is missing, and how do we make it both legible and enforceable for the agent?
The Shift in Engineering Work
Harness engineering changes the engineer’s job, as described by both Hashimoto and OpenAI 1 2:
| Before | After |
|---|---|
| Write code | Design environments |
| Debug implementation | Specify intent |
| Fix bugs directly | Build feedback loops |
| Optimize logic | Structure constraints |
The engineer stops writing application code and starts designing the scaffolding that makes agents productive.
Three Core Components
Drawing from Thoughtworks and OpenAI, three core components of a harness are 2:
- Context engineering: deciding what information to include or exclude at each model call. This covers isolation (keep subtasks separate), reduction (drop stale data to avoid context rot), and retrieval (inject fresh docs or search results at the right time).
- Architectural constraints: enforced not just by the model, but by deterministic linters, structural tests, and guardrails the model cannot bypass.
- Verification and feedback loops: the harness checks outputs, runs eval steps, and if something is wrong, surfaces it so the agent or the engineer can fix it.
Why It Matters Now
As Tejas Kumar notes in his talk 3:
“The name of the game with harness is reliability. It’s making sure that the agents we build do what they do, period. Irrespective of the black box model.”
Users pay $20/month for Claude Pro. The model is a black box. Anthropic could serve Sonnet instead of Opus without notice. Too many variables escape control.
Harnesses solve that. They anchor an agent in a stable environment. The goal is reliability regardless of the rented model.
The company that builds the best harness will win, not the company with the most advanced model.
1. “My AI Adoption Journey”, Mitchell Hashimoto, February 2026
2. “Harness engineering: leveraging Codex in an agent-first world”, OpenAI, February 2026
3. “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026
Components of an Agent Harness
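Tejas Kumar identifies six typical components of an agent harness in his talk 1, using coding agents like Claude Code and Cursor as familiar reference examples. The numbered list below has eight entries because it also covers pieces specific to the Hacker News upvote task, such as the login handler and the shared upvote state. Together they form the scaffolding that grounds a model in reality.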
1. Tool Registry
A defined set of actions the model can take. Each tool has a name, description, parameter schema, and an execution function.
Examples in this repo: browser_navigate, browser_click, browser_fill,
browser_get_text, browser_get_stories, browser_url, browser_has_class.
Tools are registered in one place (src/tools.rs) and injected into the
loop as a parameter — never imported globally.
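As a rough illustration of the registry idea, the sketch below stores tools in a lookup table and dispatches calls by name. The `Tool` fields, `ToolRegistry`, and the `echo_upper` tool are hypothetical names chosen for this example, and the parameter schema is left out for brevity; the real definitions live in src/tools.rs and may differ.

```rust
use std::collections::HashMap;

/// Hypothetical tool shape: a name, a description, and an execution function.
pub struct Tool {
    pub name: &'static str,
    pub description: &'static str,
    /// Takes the raw argument string and returns the tool output or an error.
    pub execute: Box<dyn Fn(&str) -> Result<String, String>>,
}

/// A registry is a lookup table from tool name to tool.
pub struct ToolRegistry {
    tools: HashMap<&'static str, Tool>,
}

impl ToolRegistry {
    pub fn new() -> Self {
        Self { tools: HashMap::new() }
    }

    pub fn register(&mut self, tool: Tool) {
        self.tools.insert(tool.name, tool);
    }

    /// Run a tool by name. Unknown names become errors the loop can report back.
    pub fn call(&self, name: &str, args: &str) -> Result<String, String> {
        match self.tools.get(name) {
            Some(tool) => (tool.execute)(args),
            None => Err(format!("unknown tool: {name}")),
        }
    }
}

/// Build a registry holding one toy tool (illustrative, not a browser tool).
pub fn example_registry() -> ToolRegistry {
    let mut registry = ToolRegistry::new();
    registry.register(Tool {
        name: "echo_upper",
        description: "Return the argument in upper case",
        execute: Box::new(|args: &str| -> Result<String, String> {
            Ok(args.to_uppercase())
        }),
    });
    registry
}
```

Because the registry is an ordinary value, it can be handed to the loop as a parameter, which matches the "never imported globally" rule above.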
2. Model
The underlying LLM. In this repo, any Ollama model is interchangeable by
changing one string in src/harness.rs. The harness doesn’t care which
model runs — it just sends messages and reads responses. The model client
(src/model.rs) uses Ollama’s OpenAI-compatible endpoint at
/v1/chat/completions.
3. Context Management
The harness builds initial context (src/context.rs) and manages message
history. Without it, context windows fill up with stale tool results. Good
context management compacts or trims messages so the model always sees
relevant information.
4. Guardrails
Hard limits on agent behavior that run before every loop iteration:
- Max iterations: “Do not make more than N tool calls.”
- Max messages: “Stop if the conversation exceeds M messages.”
Guardrails are composable, deterministic checks. They catch structural failures before the model can waste tokens or go off the rails.
5. Agent Loop
The outer orchestration loop (src/agent_loop.rs):
while (true):
    call model → parse response
    if answer: return
    if tool_calls: execute each → append results → loop
This is the engine. But the loop alone is not the harness — the harness is everything around the loop.
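The pseudocode can be made concrete with a small, self-contained sketch. Everything here is a simplification for illustration: the model is a plain closure instead of an HTTP client, messages are strings, and `ModelReply` and `run_loop` are hypothetical names that only loosely mirror src/agent_loop.rs.

```rust
/// What the model can return on each turn (simplified).
pub enum ModelReply {
    Answer(String),
    ToolCalls(Vec<String>),
}

/// Call the model until it answers or the iteration cap is reached.
pub fn run_loop(
    task: &str,
    mut model: impl FnMut(&[String]) -> ModelReply,
    execute_tool: impl Fn(&str) -> String,
    max_iterations: usize,
) -> Result<String, String> {
    let mut messages: Vec<String> = vec![format!("user: {task}")];
    for _ in 0..max_iterations {
        match model(&messages) {
            // A final answer ends the loop.
            ModelReply::Answer(text) => return Ok(text),
            // Tool calls are executed and their results appended to the history.
            ModelReply::ToolCalls(calls) => {
                for call in calls {
                    let result = execute_tool(&call);
                    messages.push(format!("tool {call}: {result}"));
                }
            }
        }
    }
    // The cap acts as a minimal guardrail against runaway loops.
    Err(format!("stopped after {max_iterations} iterations"))
}
```

A scripted closure is enough to exercise it: return `ToolCalls` while the history is short, then `Answer` once a tool result has been appended.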
6. Login Handler
Some tasks require authentication. When a browser agent hits a login page,
the harness can detect this and auto-fill credentials. In this repo,
src/login_handler.rs checks the current URL after every tool call batch.
If it matches /login or /vote, it fills the HN login form and submits,
then injects a synthetic tool event into the trace.
7. Shared State (Upvote Detection)
Tools and guardrails communicate through shared mutable state:
Arc<Mutex<Option<UpvotedStory>>>. Tools write to it (detecting upvote
clicks), guardrails read from it (stopping on success). This replaces
the callback hook pattern from the TypeScript version with idiomatic Rust.
8. Verify Step
A post-hoc check that the intended outcome actually occurred. In a coding agent this would be “run lint, run tests.” In a browser agent this uses the live browser session to check the page DOM — if HN removed the upvote arrow element after the click, the vote was registered.
Guardrails catch structural failures. Verify catches wrong answers. You need both.
Putting It Together
runHarness()
├── create shared state
├── create guardrails (bound to state)
│
├── attempt 1..MAX_ATTEMPTS:
│ ├── open environment
│ ├── create tools (bound to environment + state)
│ ├── create login handler (bound to environment)
│ ├── build initial context
│ ├── runLoop(tools, context, guardrails, login handler)
│ │ ├── trim context
│ │ ├── check guardrails
│ │ ├── call model
│ │ ├── execute tools or return answer
│ │ ├── run login handler
│ │ └── log trace iteration
│ ├── verify result from trace
│ ├── if verified: return success
│ └── close environment (always)
│
└── return result (with verification)
This architecture is adapted from the Rust source files at the project root.
1. “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026
The Eval Runner: Testing Models Against Known Answers
The eval runner answers a simple question: how good is this model at answering known questions? It is a test suite for LLMs, conceptually related to the older “eval harness” meaning in AI 1.
dataset → model → scorer → pass/fail → summary
The full implementation lives in src/eval/.
Dataset
src/eval/dataset.rs defines a set of test cases designed to trigger
common hallucinations. Each case has an input, an expected answer, and a “trap” —
the wrong answer that weaker models confidently give.
pub struct TestCase {
    pub id: &'static str,
    pub input: &'static str,
    pub expected: &'static str,
    pub trap: Option<&'static str>,
}
Examples: “What is the capital of Australia?” (expected: Canberra, trap:
Sydney), “How many hearts does an octopus have?” (expected: 3, trap: 1).
The trap field tracks whether the model fell for the decoy. This exposes weak models quickly.
Model Client
src/eval/model.rs wraps any Ollama model in a single function:
pub async fn call_model(model: &str, prompt: &str) -> Result<String, String>
It sends a system prompt instructing brief answers, makes one API call to
Ollama’s /v1/chat/completions endpoint, and returns the response text.
Swap the model string to test a different model. No tools, no state, no loop —
one call per test case.
Scorers
src/eval/scorer.rs provides scoring functions:
- score_contains: the output must contain the expected text somewhere
The normalize function maps number words to digits so that “three” and “3”
are treated equally. Models answer the same question differently — this
shouldn’t count as wrong.
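To make the normalization step concrete, here is a minimal sketch of a number-word normalizer and a contains-style scorer. The signatures and the zero-to-nine word table are assumptions for this example; the actual functions in src/eval/scorer.rs may be shaped differently.

```rust
/// Lowercase the text, drop punctuation, and map small number words to digits.
pub fn normalize(text: &str) -> String {
    const NUMBER_WORDS: [(&str, &str); 10] = [
        ("zero", "0"), ("one", "1"), ("two", "2"), ("three", "3"), ("four", "4"),
        ("five", "5"), ("six", "6"), ("seven", "7"), ("eight", "8"), ("nine", "9"),
    ];
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty())
        .map(|word| {
            NUMBER_WORDS
                .iter()
                .find(|(name, _)| *name == word)
                .map(|(_, digit)| *digit)
                .unwrap_or(word)
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Score 1.0 when the normalized output contains the normalized expected text.
/// This is a plain substring check, so an expected "3" also matches "13".
pub fn score_contains(output: &str, expected: &str) -> f64 {
    if normalize(output).contains(&normalize(expected)) {
        1.0
    } else {
        0.0
    }
}
```

With this sketch, "An octopus has three hearts." scores 1.0 against an expected answer of "3", while "Sydney" scores 0.0 against "Canberra".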
Runner
src/eval/runner.rs loops over every test case, calls the model, scores
the result, and collects the numbers:
pub async fn run_eval(
    cases: &[TestCase],
    model: &str,
    scorer: ScorerFn,
) -> EvalRun
The model and scorer are passed in as arguments — swap either one without touching this file.
For each test case it records: the actual answer, whether it fell for the trap, the score, whether it passed, and latency in milliseconds.
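The per-case bookkeeping can be sketched synchronously, with the model passed in as a closure so the example runs without Ollama. `CaseResult`, `contains_scorer`, and the closure-based signature are assumptions made for this sketch; the real `run_eval` is async and returns an `EvalRun`.

```rust
use std::time::Instant;

pub struct TestCase {
    pub id: &'static str,
    pub input: &'static str,
    pub expected: &'static str,
    pub trap: Option<&'static str>,
}

/// Hypothetical per-case record holding the fields listed above.
pub struct CaseResult {
    pub id: &'static str,
    pub actual: String,
    pub fell_for_trap: bool,
    pub score: f64,
    pub passed: bool,
    pub latency_ms: u128,
}

/// Stand-in scorer: case-insensitive substring match.
pub fn contains_scorer(output: &str, expected: &str) -> f64 {
    if output.to_lowercase().contains(&expected.to_lowercase()) { 1.0 } else { 0.0 }
}

/// Run every case through the model, score it, and time the call.
pub fn run_eval(
    cases: &[TestCase],
    model: impl Fn(&str) -> String,
    scorer: fn(&str, &str) -> f64,
) -> Vec<CaseResult> {
    cases
        .iter()
        .map(|case| {
            let started = Instant::now();
            let actual = model(case.input);
            let latency_ms = started.elapsed().as_millis();
            let score = scorer(&actual, case.expected);
            // A case "falls for the trap" when the decoy text appears in the answer.
            let fell_for_trap = case.trap.map_or(false, |trap| {
                actual.to_lowercase().contains(&trap.to_lowercase())
            });
            CaseResult {
                id: case.id,
                actual,
                fell_for_trap,
                score,
                passed: score >= 1.0,
                latency_ms,
            }
        })
        .collect()
}
```

Passing the model and scorer as arguments is what lets either one be swapped without touching the runner.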
Output and Comparison
src/bin/eval.rs runs one or more models against the same dataset and
prints results:
const MODELS: &[&str] = &[
    "gemma4:e4b",
];
Each model gets the same cases, the same scorer, the same conditions. The output shows which models fell for traps, their average scores, and their latency.
The key value: the same test, run consistently, so you can compare models or catch regressions over time.
Run it:
cargo run --bin eval
1. lm-evaluation-harness, EleutherAI, 2021
The Agent Harness: Giving Models an Environment
The agent harness enables a model to act in the real world. It provides tools, manages state, enforces limits, and verifies outcomes. This is the newer meaning of “harness” in AI — the focus of Tejas Kumar’s talk 1.
task → [tools + context + guardrails + loop + verify] → result
The full implementation lives at the project root. All code excerpts in this chapter are from the Rust implementation in this repo.
The harness is orchestrated by src/harness.rs, which ties together the
browser, tools, guardrails, login handler, and agent loop into a single
retry-capable pipeline.
Tool Registry
src/tools.rs defines the tools available to the model. Each tool has:
- A definition (name, description, parameter schema in OpenAI format)
- An execute function (the actual implementation)
pub struct Tool {
    pub definition: ToolDefinition,
    pub execute: ToolExecute,
}
Tools in this repo:
- browser_navigate(url): go to a URL
- browser_url(): get current URL (detect redirects to login pages)
- browser_get_text(): get visible page text
- browser_fill(selector, value): fill an input field
- browser_click(selector): click an element, wait for navigation
- browser_get_stories(): structured list of HN stories with IDs and voted status
- browser_has_class(selector, className): check CSS class (verify upvote state)
The critical architectural decision: tools are created with a
create_tools(session, upvote_state) function that binds them to a specific
environment session and shared state. Tools don’t manage the browser. They
don’t know about the browser lifecycle. The harness injects the session and
upvote tracking state into the tools at construction time.
Upvote Detection
The browser_click tool in src/tools.rs also detects HN upvote clicks.
After each click, it parses the selector for up_<STORYID> patterns and
checks whether the current URL is on news.ycombinator.com/news. If both
conditions match, it records the story ID into the shared
Arc<Mutex<Option<UpvotedStory>>> state, which the guardrails read to
stop the loop on success.
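The detection step might look roughly like the sketch below: parse the selector, check the URL, and write to the shared state. The `UpvotedStory` field, the helper names, and the host-only URL check are assumptions for illustration; the repo's browser_click tool may apply stricter rules.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, PartialEq)]
pub struct UpvotedStory {
    pub story_id: String, // hypothetical field
}

pub type UpvoteState = Arc<Mutex<Option<UpvotedStory>>>;

/// Extract the digits that follow "up_" in a selector such as "#up_12345".
pub fn parse_upvote_selector(selector: &str) -> Option<String> {
    let idx = selector.find("up_")?;
    // Reject matches inside longer identifiers such as "signup_1".
    let preceded_by_word = selector[..idx]
        .chars()
        .next_back()
        .map_or(false, |c| c.is_ascii_alphanumeric());
    if preceded_by_word {
        return None;
    }
    let id: String = selector[idx + 3..]
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    if id.is_empty() { None } else { Some(id) }
}

/// Record the story in shared state when the click looks like an HN upvote.
pub fn record_if_upvote(state: &UpvoteState, selector: &str, current_url: &str) -> bool {
    let on_hn = current_url.contains("news.ycombinator.com");
    match (on_hn, parse_upvote_selector(selector)) {
        (true, Some(story_id)) => {
            *state.lock().unwrap() = Some(UpvotedStory { story_id });
            true
        }
        _ => false,
    }
}
```

The tool only writes to the state. Deciding whether the loop should stop is left to the guardrail that reads it.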
Model Client
src/model.rs provides a configurable ModelClient that talks to
Ollama’s OpenAI-compatible endpoint:
// Signatures only; method bodies are omitted.
pub struct ModelClient { /* ... */ }
impl ModelClient {
    pub fn new() -> Self { /* defaults to http://localhost:11434 */ }
    pub fn with_seed(mut self, seed: u64) -> Self;
    pub fn with_temperature(mut self, temperature: f32) -> Self;
    pub fn with_max_tokens(mut self, max_tokens: u32) -> Self;
}
Swap models by changing one string in src/harness.rs. No API keys needed —
Ollama runs entirely locally. Override the endpoint with the OLLAMA_URL
environment variable.
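One plausible way to resolve the endpoint is sketched below. The function names are hypothetical and the actual lookup in src/model.rs may differ; only the default URL, the OLLAMA_URL variable, and the /v1/chat/completions path come from the text above.

```rust
/// Build the chat completions URL from an optional base URL override.
pub fn chat_completions_url(override_url: Option<&str>) -> String {
    let base = override_url.unwrap_or("http://localhost:11434");
    // Trim a trailing slash so the joined path never contains "//".
    format!("{}/v1/chat/completions", base.trim_end_matches('/'))
}

/// Read OLLAMA_URL from the environment, falling back to the local default.
pub fn chat_completions_url_from_env() -> String {
    let from_env = std::env::var("OLLAMA_URL").ok();
    chat_completions_url(from_env.as_deref())
}
```

Keeping the pure function separate from the environment read makes the fallback easy to test without touching process state.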
Context Management
src/context.rs builds the initial message array for a new task:
pub fn create_context(task: &str) -> Vec<Message> {
    vec![
        Message { role: "system".into(), content: Some(SYSTEM_PROMPT.into()), .. },
        Message { role: "user".into(), content: Some(task.into()), .. },
    ]
}
The loop appends tool call results and model responses to this array. In a more
sophisticated harness, context management would compact or trim old messages to
prevent context rot (note the MAX_CONTEXT_MESSAGES constant in
src/agent_loop.rs).
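A simple trimming strategy is sketched below: keep the opening messages (system prompt and task) and the most recent tail. The simplified `Message` type and the `trim_context` signature are assumptions for this example, not the repo's implementation. A real trimmer would also need to keep each tool result next to the assistant message that requested it.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Small constructor used to keep the example short.
pub fn message(role: &str, content: &str) -> Message {
    Message { role: role.to_string(), content: content.to_string() }
}

/// Keep the first `pinned` messages plus the newest ones, at most `max` in total.
pub fn trim_context(messages: &[Message], pinned: usize, max: usize) -> Vec<Message> {
    if messages.len() <= max {
        return messages.to_vec();
    }
    let pinned = pinned.min(max);
    let tail_len = max - pinned;
    let mut trimmed = messages[..pinned].to_vec();
    trimmed.extend_from_slice(&messages[messages.len() - tail_len..]);
    trimmed
}
```

Pinning the first two messages means the model never loses the system prompt or the task, even after many tool calls.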
Guardrails
src/guardrails.rs provides composable guardrail functions that run before
every loop iteration:
- max_iterations(limit): stop if the agent exceeds N loop iterations
- max_messages(limit): stop if the conversation exceeds M messages
- stop_after_upvote(state): stop once the shared upvote state is set
- combine_guardrails(vec): run multiple guardrails, first Stop wins
- default_guardrails(state): returns all three combined with sensible defaults
Each guardrail is a GuardrailFn, a closure stored as Arc<dyn Fn(&GuardrailInput) -> GuardrailResult>. Guardrails catch structural failures such as runaway agents and infinite loops, and they detect successful upvotes via shared state.
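A minimal sketch of composable guardrails follows. The `GuardrailInput` fields, the `Stop(String)` payload, and the `u64` story id are assumptions for illustration; only the function names and the closure type come from the description above.

```rust
use std::sync::{Arc, Mutex};

pub struct GuardrailInput {
    pub iteration: usize,
    pub message_count: usize,
}

#[derive(Debug, PartialEq)]
pub enum GuardrailResult {
    Continue,
    Stop(String),
}

pub type GuardrailFn = Arc<dyn Fn(&GuardrailInput) -> GuardrailResult>;

pub fn max_iterations(limit: usize) -> GuardrailFn {
    Arc::new(move |input: &GuardrailInput| {
        if input.iteration >= limit {
            GuardrailResult::Stop(format!("reached {limit} iterations"))
        } else {
            GuardrailResult::Continue
        }
    })
}

pub fn max_messages(limit: usize) -> GuardrailFn {
    Arc::new(move |input: &GuardrailInput| {
        if input.message_count > limit {
            GuardrailResult::Stop(format!("more than {limit} messages"))
        } else {
            GuardrailResult::Continue
        }
    })
}

/// Stop as soon as a tool has recorded an upvote in the shared state.
pub fn stop_after_upvote(state: Arc<Mutex<Option<u64>>>) -> GuardrailFn {
    Arc::new(move |_input: &GuardrailInput| {
        if state.lock().unwrap().is_some() {
            GuardrailResult::Stop("upvote recorded".to_string())
        } else {
            GuardrailResult::Continue
        }
    })
}

/// Run guardrails in order. The first Stop wins.
pub fn combine_guardrails(guardrails: Vec<GuardrailFn>) -> GuardrailFn {
    Arc::new(move |input: &GuardrailInput| {
        for guardrail in &guardrails {
            if let GuardrailResult::Stop(reason) = guardrail(input) {
                return GuardrailResult::Stop(reason);
            }
        }
        GuardrailResult::Continue
    })
}
```

Because every guardrail has the same type, the combined guardrail is itself a guardrail and can be nested or extended without changing the loop.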
Agent Loop
src/agent_loop.rs is the orchestration engine:
loop:
  1. trim context (if over MAX_CONTEXT_MESSAGES)
  2. check guardrails → if Stop, return immediately
  3. call model with current messages + tools
  4. if model says "stop": return answer
  5. if model calls tools:
       for each tool call:
         execute the tool
         capture ToolEvent in trace
         append result to messages
  6. run login handler (if page is /login or /vote, auto-fill credentials
     and inject a "harness_auto_login" event + user message)
  7. log iteration to trace with all tool events
  8. loop back to step 1
The loop tracks a full Vec<LoopIteration> trace, where each iteration
contains Vec<ToolEvent> with the tool name, arguments, and result. This
trace is used by harness.rs for verification and structured output.
The loop sends messages in native OpenAI format directly to Ollama’s
/v1/chat/completions endpoint. No message conversion layer needed.
Login Handler
src/login_handler.rs provides create_login_handler(session), which returns
a closure that runs after every batch of tool calls. It checks the current URL:
- If the URL contains /login or /vote (HN redirects unauthenticated upvote attempts to the login page), it auto-fills the credentials and submits the form via input[name='acct'], input[name='pw'], and input[type='submit'].
When triggered, it returns a ToolEvent { tool: "harness_auto_login", ... }. The loop injects that event into the trace and appends a user message telling the model it is now authenticated and should navigate back to HN.
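The handler's decision logic can be sketched against a small trait so it runs without a browser. The `Page` trait, `RecordingPage`, and the two-field `ToolEvent` are hypothetical stand-ins for this example; the real handler works on a `BrowserSession` and reads credentials from the environment.

```rust
pub struct ToolEvent {
    pub tool: String,
    pub result: String,
}

/// Minimal page abstraction so the logic can be exercised without Chrome.
pub trait Page {
    fn url(&self) -> String;
    fn fill(&mut self, selector: &str, value: &str);
    fn click(&mut self, selector: &str);
}

/// Fake page that records which selectors were touched (never the values).
pub struct RecordingPage {
    pub url: String,
    pub actions: Vec<String>,
}

impl Page for RecordingPage {
    fn url(&self) -> String {
        self.url.clone()
    }
    fn fill(&mut self, selector: &str, _value: &str) {
        self.actions.push(format!("fill {selector}"));
    }
    fn click(&mut self, selector: &str) {
        self.actions.push(format!("click {selector}"));
    }
}

/// Fill and submit the login form when the page looks like a login redirect.
pub fn login_if_needed(page: &mut dyn Page, user: &str, pass: &str) -> Option<ToolEvent> {
    let url = page.url();
    if !(url.contains("/login") || url.contains("/vote")) {
        return None; // not a login page, nothing to do
    }
    page.fill("input[name='acct']", user);
    page.fill("input[name='pw']", pass);
    page.click("input[type='submit']");
    Some(ToolEvent {
        tool: "harness_auto_login".to_string(),
        result: "submitted login form".to_string(),
    })
}
```

Returning `Option<ToolEvent>` keeps the common case cheap: on any other page the handler does a single URL check and returns `None`.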
The Harness Lifecycle
The harness (src/harness.rs) owns the full lifecycle with retry logic.
This is the architectural decision that makes it a real harness rather than
just a loop with tools:
run_harness()
├── create shared UpvotedStory state ← tools write, guardrails read
├── create default_guardrails(state) ← includes stop_after_upvote
│
├── attempt 1..MAX_ATTEMPTS:
│ ├── BrowserSession::open() ← harness opens the environment
│ ├── create_tools(session, state) ← tools bound to session + state
│ ├── create_login_handler(session) ← auto-login on redirect
│ ├── create_context(TASK) ← fresh context for this task
│ ├── run_loop(guardrails, login) ← loop runs inside the environment
│ ├── verify_successful_upvote(result) ← check trace for up_ click
│ ├── if verified: return success
│ └── [Browser closed on Drop] ← always, via RAII
│
└── return result with verification
The harness opens the browser, creates tools bound to that browser page and
shared state, creates the login handler and guardrails, runs the loop with
retry logic, verifies the outcome via trace inspection, and cleans up via
Rust’s RAII drop semantics when BrowserSession is dropped.
Verification
verify_successful_upvote inspects the trace for an upvote click, then uses
the live browser session to check the page DOM. HN removes the upvote arrow
element entirely after a successful vote, so an element-not-found error (or the
presence of the nosee class) confirms the vote was registered by HN’s servers.
The harness retries up to 3 times if verification fails, reusing the same
shared upvote state across attempts.
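The retry-and-cleanup shape can be illustrated with a stand-in session whose `Drop` implementation counts closures. `Session`, the counter, and this `run_harness` signature are assumptions for the sketch; the real harness opens a `BrowserSession` and verifies against the trace and the DOM.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Stand-in for a browser session. Dropping it "closes the browser".
pub struct Session {
    closed: Rc<Cell<u32>>,
}

impl Drop for Session {
    fn drop(&mut self) {
        self.closed.set(self.closed.get() + 1);
    }
}

/// Try up to `max_attempts` times. `attempt` returns true when verification passes.
pub fn run_harness(
    max_attempts: u32,
    closed: Rc<Cell<u32>>,
    mut attempt: impl FnMut(&Session, u32) -> bool,
) -> Result<u32, String> {
    for n in 1..=max_attempts {
        // A fresh environment for every attempt.
        let session = Session { closed: closed.clone() };
        if attempt(&session, n) {
            return Ok(n); // `session` is dropped here, on the success path
        }
        // `session` is dropped here as well, before the next attempt starts.
    }
    Err(format!("not verified after {max_attempts} attempts"))
}
```

No explicit close call appears anywhere: whichever path leaves the loop body, the session goes out of scope and `Drop` runs.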
Browser Environment
src/browser.rs provides the BrowserSession struct — a thin wrapper
around headless_chrome:
// Signatures only; method bodies are omitted.
impl BrowserSession {
    pub fn open() -> Result<Self>;
    pub fn navigate(&self, url: &str) -> Result<String>;
    pub fn get_url(&self) -> Result<String>;
    pub fn get_text(&self) -> Result<String>;
    pub fn fill(&self, selector: &str, value: &str) -> Result<String>;
    pub fn click(&self, selector: &str) -> Result<String>;
    pub fn get_stories(&self) -> Result<String>; // HN-specific
    pub fn has_class(&self, selector: &str, class_name: &str) -> Result<String>;
}
Each harness run gets one isolated browser page via Chrome’s DevTools Protocol. When the run ends, whether it succeeded, failed verification, or hit an error, the browser closes via Rust’s RAII drop semantics.
1. “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026
Incremental Demo: From Failure to Success
This chapter walks through the four versions of the agent that Tejas Kumar builds during his presentation at AI Engineer World’s Fair 1. Each version adds one piece of harness infrastructure. The model and the task never change.
“I did not touch the prompt once. I did not change the system prompt. We just built a harness and the outcome radically changed.” — Tejas Kumar 1
The task:
“Go to Hacker News and upvote the first post.”
The model: GPT-3.5 Turbo — deliberately chosen by Tejas as a weak, cheap
model. The tool backend: raw Playwright (browser automation). (The Rust
implementation in this repo uses headless_chrome instead of Playwright, and
Ollama locally instead of an API-based model, but the architectural patterns
are identical.)
Version 1: Raw Agent Loop
The agent loop runs with no guardrails, no verify step, no login handler.
[iter 1] browser_navigate → browser_get_text → browser_get_stories → browser_click(up_12345)
[iter 2] browser_url → browser_get_text
[iter 3] browser_url → browser_get_text
[iter 4] answer
The agent opens Hacker News, clicks the upvote button, hits a login redirect, panics, and then lies. It returns a message claiming “I upvoted” — but the trace shows it never actually logged in. The click opened a login page, and the model hallucinated success rather than admitting failure.
As Tejas Kumar said on stage 1:
“It doesn’t verify. This is the job of a harness.”
Version 2: Guardrails and Context Limits
Two guardrails are added:
- Max iterations: if the agent exceeds 6 tool calls, stop
- Max messages: if the conversation exceeds 20 messages, compact context
The agent still fails (it still can’t log in), but it stops earlier. The guardrails prevent runaway token waste but don’t fix the semantic failure.
The key lesson: guardrails catch structural failures. They don’t catch wrong answers.
Version 3: Verify Step
The agent loop is refactored into runHarnessAttempt, wrapped by an outer
runHarness that retries up to three times. A verifySuccessfulUpvote function
inspects the trace and applies deterministic rules:
- Was there a successful click on the upvote element?
- Did a harness_auto_login tool run? If it ran and returned “failed,” fail.
- Did the page redirect to a login URL without the login handler having run? Fail.
The harness detects the lie by reflecting on its own trace data. The agent now reports “failed to upvote” instead of falsely claiming success.
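The three rules can be expressed as a small function over the trace. This is a sketch under stated assumptions: the `ToolEvent` fields, the `Verdict` type, and the result strings ("failed", "error") are hypothetical conventions for this example, and it only counts clicks made after the last auto-login.

```rust
pub struct ToolEvent {
    pub tool: String,
    pub args: String,
    pub result: String,
}

/// Small constructor used to keep traces readable.
pub fn event(tool: &str, args: &str, result: &str) -> ToolEvent {
    ToolEvent { tool: tool.to_string(), args: args.to_string(), result: result.to_string() }
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Verified,
    Failed(&'static str),
}

/// Apply deterministic rules to the trace instead of trusting the model's claim.
pub fn verify_upvote(trace: &[ToolEvent]) -> Verdict {
    let last_login = trace.iter().rposition(|e| e.tool == "harness_auto_login");
    // Rule: an auto-login that ran and failed means the attempt failed.
    if let Some(i) = last_login {
        if trace[i].result.contains("failed") {
            return Verdict::Failed("auto-login ran but failed");
        }
    }
    // Rule: a login redirect with no login handler run means the vote never landed.
    let saw_login_page = trace
        .iter()
        .any(|e| e.tool == "browser_url" && e.result.contains("/login"));
    if saw_login_page && last_login.is_none() {
        return Verdict::Failed("redirected to login and the handler never ran");
    }
    // Rule: there must be a successful upvote click (after login, if one happened).
    let relevant = match last_login {
        Some(i) => &trace[i + 1..],
        None => trace,
    };
    let clicked = relevant.iter().any(|e| {
        e.tool == "browser_click" && e.args.contains("up_") && !e.result.starts_with("error")
    });
    if clicked { Verdict::Verified } else { Verdict::Failed("no successful upvote click") }
}
```

Fed a trace shaped like Version 1 (an upvote click followed by a login URL and no auto-login), this returns `Failed`, which is how the harness catches the false success claim.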
As Tejas Kumar put it 1:
“Step one to solving a problem is admitting you have one.”
Version 4: Login Handler
A loginHandler function runs before every trace push in the agent loop. It
checks the browser session’s current URL:
- If the page is not a login page: return immediately (zero cost)
- If the page is a login page: inject credentials into the form fields from environment variables, submit the form, push a synthetic message: “I’m the harness. I logged in. You’re good now.”
The agent now succeeds: opens Hacker News, hits the login redirect, the harness logs in programmatically, the agent resumes control, clicks the upvote, and the verify step confirms success.
All without changing the system prompt or the task description.
Summary of the Arc
| Version | What changed | Outcome |
|---|---|---|
| 1 | Raw loop | Agent lies about success |
| 2 | Guardrails | Agent stops earlier, still lies |
| 3 | Verify step | Agent admits failure honestly |
| 4 | Login handler | Agent succeeds reliably |
This progression is the core demonstration from Tejas Kumar’s talk 1.
The Harness Owns the Environment
The single most important architectural decision in this repo:
main()
├── BrowserSession::open() ← harness opens the environment
├── create_tools(session.clone()) ← tools are bound to this session
├── create_context(TASK) ← fresh context for this task
├── run_loop(model, &client, ...) ← loop runs inside the environment
└── [Browser closed on Drop] ← always, via RAII
Tools don’t manage the browser. They don’t know about the browser lifecycle. The harness opens it, the harness closes it, and the process exits cleanly.
Why This Matters
If tools managed their own lifecycle, you’d get:
- Leaked browser instances when a tool crashes
- Shared global state between runs
- Race conditions when tools compete for resources
- No way to inject deterministic behavior (like login) mid-run
When the harness owns the environment:
- Isolation: each run gets a fresh environment
- Determinism: the harness can intercept and override at any point
- Cleanup: RAII drop semantics (Rust’s counterpart to finally blocks) guarantee cleanup even on error
- Injectability: the login handler runs inside the harness, not the agent
Tools Are Pure Functions of Their Session
Tools are created by passing a session to create_tools (the full harness also passes the shared upvote state):
let tools = create_tools(session.clone());
The tools don’t import a browser or reach into global state. They reference the session that was handed to them. This makes them testable, swappable, and safe.
The Harness Can Intervene
Because the harness owns the loop, it can intercept before every iteration:
- Run guardrails (max iterations, max messages)
- Run the login handler (check URL, inject credentials)
- Compact context (trim old messages)
- Log and trace every event
The model never sees the harness code. It only sees the messages the harness chooses to inject.
What “Managing Input/Output Behind the Scenes” Looks Like
In the presentation demo, the login handler pushes this message 1:
I'm the harness. I logged in. You're good now.
The model receives this as a tool result. It doesn’t know the harness injected credentials. It just sees that the login problem is solved and moves on to clicking the upvote button.
The harness is the deterministic skeleton. The model fills in the gaps.
1. “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026
The Future: Dynamic On-The-Fly Harnesses
Tejas Kumar closes his talk with a vision of where harness engineering is headed 1.
The Timeline
| Year | Era |
|---|---|
| 2025 | Year of agents |
| 2026 | Year of harnesses |
| 2027 | Year of dynamic on-the-fly harnesses |
Dynamic Harnesses
The next step: an agent, given a task like “buy me a flight ticket,” first generates its own harness. Before doing the work, the agent creates the scaffolding — self-aware, it knows where it might hallucinate, where it might need guardrails, and where a verify step would catch failures.
Tejas describes this as “plan mode on steroids” 1:
- Analyze the task
- Identify likely failure modes
- Generate guardrails (max steps, context limits)
- Generate tool definitions
- Generate verify steps
- Execute the task within the generated harness
- Return the result
“This is honestly the next logical step towards AGI.” — Tejas Kumar 1
This aligns with OpenAI’s observation that building reliable agents is about “designing environments, specifying intent, and building feedback loops” 2.
What This Means for Engineers
The trend is clear: the competitive advantage in AI shifts from who has the best model to who can build the best harness. The model is rented and interchangeable. The harness is owned and differentiated.
Engineers should invest in:
- Tool design: what primitives does the agent need?
- Context strategy: what information at what time?
- Guardrail patterns: what are the hard limits?
- Verify logic: how do we catch failures deterministically?
- Environment management: how do we ensure isolation and cleanup?
The model is a commodity. The harness is the moat.
1. “Harnesses in AI: A Deep Dive”, Tejas Kumar, AI Engineer World’s Fair, May 2026
2. “Harness engineering: leveraging Codex in an agent-first world”, OpenAI, February 2026
Setup and Running
Prerequisites
The steps below assume:
- A Rust toolchain with cargo
- Ollama running locally, with the model you plan to use already pulled
- Chrome or Chromium, for the headless_chrome browser session
- just (optional, for the justfile shortcuts)
- A Hacker News account (agent harness only)
Installation
git clone <your-repo-url>
cd basic-harness
cargo build
Running the Eval
cargo run --bin eval
Or using the justfile:
just eval
This runs one or more models against a fixed dataset and prints results per test case: pass/fail, trap detection, and latency.
Running the Agent Harness
cargo run --bin agent
Or using the justfile:
just agent
This opens a Chromium window (via Chrome DevTools Protocol), navigates to Hacker News, and attempts to upvote the top story using the local Ollama model.
Swapping Models
Edit src/harness.rs (for the agent) or src/bin/eval.rs (for the eval) and change the MODEL constant (src/bin/eval.rs keeps its models in the MODELS array shown earlier):
const MODEL: &str = "gemma4:e4b";
Any model available in your local Ollama works.
Configuration
Copy .env.example to .env and set your Hacker News credentials:
HN_USER=your_username
HN_PASS=your_password
The login handler reads these at startup and uses them to auto-fill the HN
login form when the agent is redirected to /login or /vote.
Optional environment variables:
OLLAMA_URL=http://localhost:11434 # default, change for remote Ollama
Sources and Further Reading
Primary Sources
- Tejas Kumar, “Harnesses in AI: A Deep Dive” — AI Engineer World’s Fair, May 2026. YouTube
- Mitchell Hashimoto, “My AI Adoption Journey” — February 2026. Coined “harness engineering” in its current agentic meaning. mitchellh.com
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” — February 2026. openai.com
- Anthropic, “Effective context engineering for AI agents” — Context engineering as a core harness component. anthropic.com
- EleutherAI, “lm-evaluation-harness” — The older eval harness meaning (2021). github.com
This Repository
- Source code: This repo — Rust implementation of an eval runner and agent harness
- Presentation slides: Refer to Tejas Kumar’s original talk materials
Related Concepts
- Tool use / function calling — OpenAI’s API feature that allows models to request tool execution
- headless_chrome — Rust crate for browser automation via Chrome DevTools Protocol, used as the tool backend
- Ollama — local LLM runtime with OpenAI-compatible API endpoint
- Agent loop — the “think-act-observe” cycle at the heart of agentic systems