Tools: How Agents Actually Do Things


Part 3 of The Agent Platform Handbook. From Loop to Platform. Previous: Your Agent Wants Root. Next: Context Is the Product.
In post one we built the agent harness: a loop, a one-tool registry, a system prompt, a dispatcher, an iteration budget. In post two we slid a sandboxed runtime under the shell tool without touching the loop. The harness stands today at tag post-02 of the-agent-platform-handbook: four files, one tool, fenced. The loop works. The runtime is fenced. The agent is still mostly useless, because one tool means the model can either run a shell command or do nothing. Every real agent has a toolbox.
This is the toolbox post. We will extend the same harness with three more tools (fs_read, http_get, git), promote the one-tool registry into a real one, handle parallel tool calls, and look at the failure modes that show up the first time the model has more than one thing to pick from. The diff lands as tag post-03 in the same repo. The agent loop’s overall shape, the system prompt structure, and the iteration budget do not change.
The interesting work in this post is not the tools themselves. It is the registry, the schemas, and the contract between what the model sees and what your code runs. Get those right and adding a fifth tool is a one-file change. Get them wrong and you will spend the next quarter retraining users on a registry the model cannot navigate.
For about a year, the way you let a language model call a function was to ask it nicely.
The ReAct paper in October 2022 sketched the loop in pseudocode. The first implementations, including the early LangChain releases that month, made it real by parsing the model’s prose output. You instructed the model to write Action: search on one line, Action Input: "what is X" on the next, then stopped generation on a token like Observation: and used the rest of the lines verbatim. It worked. It also broke whenever the model felt creative, whenever the user’s question contained the stop token, whenever the prompt accidentally taught the model a slightly different format.
Then on June 13, 2023, OpenAI shipped function calling. You declared your tools with a JSON Schema. The model returned a structured object with name and arguments. No more parsing prose. No more stop tokens. The reliability gap between “this works in the demo” and “this works on Tuesday morning” narrowed sharply in a single release. Anthropic shipped tool_use in the same shape in the spring of 2024, and OpenAI's Structured Outputs (constrained decoding that guarantees the model emits valid JSON for a given schema) followed in August 2024.
The Model Context Protocol, also from Anthropic, arrived in November 2024 and added a transport layer for tools so they could live in a separate process or a separate machine. The calling convention did not change. MCP just gave the registry a network. We will spend post eight on MCP specifically.
The lesson from this lineage is one sentence. The model talks JSON now. The work that remains is the work you control: the registry, the schemas, the contract for what happens after the call. That is what this post is about.
A tool layer has three pieces. The model picks. The registry resolves. The handler runs.
+-------------------+
|       Model       |
| "I want to call   |
|  fs_read with     |
|  path=/etc/hosts" |
+---------+---------+
          |
          | tool_use block
          v
+-------------------+        schema list
|   Tool registry   | <----- the model sees
|  name -> handler  |
+---------+---------+
          |
          | dispatch
          v
+-------------------+
|      Handler      |
| side effect runs  |
| (in the sandbox)  |
+---------+---------+
          |
          | { ok, value | error }
          v
+-------------------+
|    tool_result    | ---> back into context
+-------------------+
What the model sees and what the handler runs are decoupled. The model sees a name, a description, and a JSON schema for inputs. The handler sees parsed arguments and returns a result string. The registry is the seam. Every tool engineering decision in this post is about that seam.
Post two left the harness with a single Tool type in types.ts and one implementation in tools.ts. Three additions turn that into a real registry: a tagged-union ToolResult so errors flow as data and not exceptions, a max_output_bytes field so a 50 MB log file does not blow the context window, and a new registry.ts so the loop does not care how many tools exist. We also move the tools into their own subdirectory now that there is more than one of them. The post-03 tree looks like this.
the-agent-platform-handbook/
├── agent.ts          # loop and dispatch (rewritten)
├── registry.ts       # new
├── types.ts          # +ToolResult, +max_output_bytes
└── tools/
    ├── shell.ts      # moved from ./tools.ts, returns ToolResult
    ├── fs.ts         # new
    ├── http.ts       # new
    └── git.ts        # new
Two changes in types.ts versus post-02.
// types.ts
export type ToolResult =
  | { ok: true; value: string }
  | { ok: false; error: string };

export type Tool = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
  max_output_bytes?: number;
  run: (input: Record<string, unknown>) => Promise<ToolResult>;
};
ToolResult is a tagged union. Tools never throw to the loop. They return { ok: false, error } when the side effect fails or the input is wrong. This matters because the model needs to read the failure and decide what to do next. An exception kills the loop. A returned error gives the model a chance to retry, switch tools, or report back to the user.
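A minimal sketch of what the tagged union buys the caller, assuming a hypothetical parsePort handler that is not part of the companion repo. Checking ok narrows the type, so the compiler only allows value on the success arm and error on the failure arm.

```typescript
// Same shape as ToolResult in types.ts, repeated so the sketch is self-contained.
type ToolResult =
  | { ok: true; value: string }
  | { ok: false; error: string };

// Hypothetical handler for illustration only: it validates its input and
// reports problems as data instead of throwing.
function parsePort(input: Record<string, unknown>): ToolResult {
  const raw = input.port;
  if (typeof raw !== "number" || !Number.isInteger(raw)) {
    return { ok: false, error: `port must be an integer, got ${JSON.stringify(raw)}` };
  }
  if (raw < 1 || raw > 65535) {
    return { ok: false, error: `port out of range: ${raw}` };
  }
  return { ok: true, value: String(raw) };
}

// The caller branches on the tag; TypeScript narrows each arm.
function render(result: ToolResult): string {
  return result.ok ? `value: ${result.value}` : `error: ${result.error}`;
}
```

The failure string is exactly what the model would read in the tool_result, which is why it names the offending input rather than just saying "invalid".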
max_output_bytes is the per-tool truncation cap. When a tool does not set it, the registry falls back to 8192 bytes (8 KB), which is deliberately small. A shell tool that runs cat /var/log/syslog should not return three megabytes of text into a context window that costs you per token.
The registry itself is tiny.
// registry.ts
import type { Tool, ToolResult } from "./types";

export class Registry {
  private readonly tools = new Map<string, Tool>();

  register(tool: Tool): this {
    if (this.tools.has(tool.name)) {
      throw new Error(`duplicate tool: ${tool.name}`);
    }
    this.tools.set(tool.name, tool);
    return this;
  }

  // The view the model sees: name, description, and input schema only.
  // run and max_output_bytes are harness-internal and stay out of the request.
  schemas() {
    return Array.from(this.tools.values()).map(({ name, description, input_schema }) => ({
      name,
      description,
      input_schema,
    }));
  }

  async dispatch(name: string, input: Record<string, unknown>): Promise<ToolResult> {
    const tool = this.tools.get(name);
    if (!tool) return { ok: false, error: `unknown tool: ${name}` };
    try {
      const result = await tool.run(input);
      return cap(result, tool.max_output_bytes ?? 8192);
    } catch (err) {
      // A handler that throws anyway is converted into a returned error.
      return { ok: false, error: `tool threw: ${String(err)}` };
    }
  }
}

// Truncate successful output to at most `max` UTF-8 bytes.
function cap(result: ToolResult, max: number): ToolResult {
  if (!result.ok) return result;
  const buf = Buffer.from(result.value, "utf8");
  if (buf.byteLength <= max) return result;
  // Slice bytes, not UTF-16 code units, so the cap and the reported count agree.
  // A character split at the cut decodes as U+FFFD.
  const head = buf.subarray(0, max).toString("utf8");
  return { ok: true, value: `${head}\n\n[truncated: ${buf.byteLength - max} more bytes]` };
}
The registry holds tools, hands the model the schema view (without the handler), dispatches by name, caps output, and turns any thrown exception into a returned error. Forty lines. Done.
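One detail worth sketching is where the truncation cut lands. The cap is expressed in bytes, and a cut at an arbitrary byte offset can fall in the middle of a multi-byte UTF-8 character. The helper below is a hypothetical variant (truncateUtf8 is not in the companion repo) that backs the cut up to the nearest character boundary, assuming only the standard TextEncoder and TextDecoder.

```typescript
// Truncate a string to at most `max` UTF-8 bytes without cutting a
// multi-byte character in half. Hypothetical helper, not the repo's cap().
function truncateUtf8(text: string, max: number): { head: string; dropped: number } {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= max) return { head: text, dropped: 0 };

  // UTF-8 continuation bytes look like 0b10xxxxxx. If the first excluded
  // byte is one, the cut is inside a character, so step back to its start.
  let end = Math.max(0, max);
  while (end > 0 && (bytes[end] & 0b1100_0000) === 0b1000_0000) end--;

  const head = new TextDecoder().decode(bytes.subarray(0, end));
  return { head, dropped: bytes.length - end };
}
```

For "héllo" (six bytes, because é takes two) a cap of 2 keeps only "h" and reports five dropped bytes, rather than emitting half of the é.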
Now the actual toolbox. The shapes are deliberate, and so are the descriptions.
// tools/fs.ts
import type { Tool } from "../types";

export const fs_read: Tool = {
  name: "fs_read",
  description:
    "Read a UTF-8 text file from the local filesystem and return its contents. " +
    "Fails if the path does not exist, is not a regular file, is not valid UTF-8, " +
    "or exceeds 1 MB. Use this for source files, configs, and logs.",
  input_schema: {
    type: "object",
    properties: {
      path: { type: "string", description: "Absolute or relative path to the file." },
    },
    required: ["path"],
  },
  max_output_bytes: 1024 * 1024,
  run: async ({ path }) => {
    try {
      const file = Bun.file(String(path));
      const exists = await file.exists();
      if (!exists) return { ok: false, error: `no such file: ${path}` };
      if (file.size > 1024 * 1024) return { ok: false, error: `file too large: ${file.size} bytes` };
      const text = await file.text();
      return { ok: true, value: text };
    } catch (err) {
      return { ok: false, error: String(err) };
    }
  },
};
Notice the description. It tells the model what the tool does, when it fails, and when to choose it (“Use this for source files, configs, and logs”). The model’s tool-selection is a function of these strings. Vague descriptions produce vague selection. Boring, specific descriptions produce reliable selection.
// tools/http.ts
import type { Tool } from "../types";

export const http_get: Tool = {
  name: "http_get",
  description:
    "Perform an HTTP GET request and return the response body as text. " +
    "Times out after 10 seconds. Returns the status code in the result. " +
    "Use this to fetch public documentation, API responses, or web pages. " +
    "Do not use it to interact with internal services.",
  input_schema: {
    type: "object",
    properties: {
      url: { type: "string", description: "Absolute https:// URL." },
    },
    required: ["url"],
  },
  max_output_bytes: 64 * 1024,
  run: async ({ url }) => {
    const u = String(url);
    // Enforce the scheme here; the schema description alone is only a hint.
    if (!u.startsWith("https://")) return { ok: false, error: "only https:// is allowed" };
    try {
      // Abort the request if it has not completed within 10 seconds.
      const signal = AbortSignal.timeout(10_000);
      const res = await fetch(u, { signal });
      const body = await res.text();
      return { ok: true, value: `status: ${res.status}\n\n${body}` };
    } catch (err) {
      return { ok: false, error: String(err) };
    }
  },
};
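Two design choices to call out. The tool enforces https:// in the handler even though the url field's description already asks for it, because the model will sometimes call it with http:// anyway. The status code is folded into the value, not into a separate field, because the model reads strings. Note the asymmetry: “Do not use it to interact with internal services” is enforced nowhere in the handler, so in this version it is advice to the model, not a control.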
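If the internal-services rule should be more than advice, the handler needs its own check. The sketch below is one possible pre-flight function, under loud assumptions: rejectReason is a hypothetical name, it inspects only the literal hostname in the URL, and it does not resolve DNS or follow redirects, so a public hostname that resolves to a private address still gets through. Treat it as a first filter, not a complete defense.

```typescript
// Hypothetical pre-flight check for http_get. Returns a reason string when
// the URL should be refused, or null when it passes this (partial) filter.
function rejectReason(rawUrl: string): string | null {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return "not a valid URL";
  }
  if (url.protocol !== "https:") return "only https:// is allowed";

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) {
    return `internal hostname: ${host}`;
  }

  // Dotted-quad IPv4 literals in loopback, private, and link-local ranges.
  const m = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (m) {
    const a = Number(m[1]);
    const b = Number(m[2]);
    const isPrivate =
      a === 0 || a === 10 || a === 127 ||
      (a === 172 && b >= 16 && b <= 31) ||
      (a === 192 && b === 168) ||
      (a === 169 && b === 254);
    if (isPrivate) return `private address: ${host}`;
  }

  // IPv6 literals keep their brackets in URL.hostname.
  if (host === "[::1]") return "loopback address: [::1]";
  return null;
}
```

A handler would call this before fetch and return { ok: false, error: reason } when the result is not null, which keeps the refusal readable by the model.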
// tools/git.ts
import type { Tool } from "../types";

export const git: Tool = {
  name: "git",
  description:
    "Run a read-only git command in the current repository and return its output. " +
    "Allowed subcommands: log, diff, show, status, branch, ls-files. " +
    "Any other subcommand is rejected. Use this to inspect history, " +
    "see uncommitted changes, or list tracked files.",
  input_schema: {
    type: "object",
    properties: {
      args: {
        type: "array",
        items: { type: "string" },
        description: "Arguments after `git`, e.g. ['log', '--oneline', '-5'].",
      },
    },
    required: ["args"],
  },
  max_output_bytes: 32 * 1024,
  run: async ({ args }) => {
    // The schema asks for an array of strings; verify it instead of casting.
    if (!Array.isArray(args) || !args.every((x) => typeof x === "string")) {
      return { ok: false, error: "args must be an array of strings" };
    }
    const a = args as string[];
    const allowed = new Set(["log", "diff", "show", "status", "branch", "ls-files"]);
    if (a.length === 0 || !allowed.has(a[0])) {
      return { ok: false, error: `subcommand not allowed: ${a[0] ?? "(none)"}` };
    }
    // Arguments are passed as an array, so no shell is involved.
    const proc = Bun.spawn(["git", ...a], { stdout: "pipe", stderr: "pipe" });
    const [stdout, stderr] = await Promise.all([
      new Response(proc.stdout).text(),
      new Response(proc.stderr).text(),
    ]);
    const code = await proc.exited;
    if (code !== 0) return { ok: false, error: stderr || `git exited ${code}` };
    return { ok: true, value: stdout };
  },
};
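The git tool is interesting because it shows the allow-list pattern. The model can ask for git push --force if it wants to. The handler refuses, returns a clear error, and the model goes back to the drawing board. The allow-list lives in the handler, not in the description, because trusting the model to obey natural-language constraints is exactly the trap post two was about. One caveat: a subcommand allow-list is coarse. git branch -D deletes a branch and git diff --output=<file> writes a file, and both pass this check, so a handler that must be strictly read-only also has to vet flags.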
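A sketch of what vetting arguments beyond the subcommand could look like. Everything here is an assumption for illustration: validateGitArgs is a hypothetical helper, the branch flag list is a conservative choice rather than a complete one, and the --output check is a deny rule, which is weaker than a per-subcommand flag allow-list. Git accepts unique abbreviations of long options, so the check matches on a prefix instead of the full flag name.

```typescript
const READ_ONLY = new Set(["log", "diff", "show", "status", "branch", "ls-files"]);

// Flags that keep `git branch` in listing mode. Anything else, including
// positional branch names, is rejected.
const BRANCH_FLAGS = new Set([
  "--list", "-a", "--all", "-r", "--remotes", "-v", "-vv", "--show-current",
]);

type Checked = { ok: true; args: string[] } | { ok: false; error: string };

// Hypothetical validator: subcommand allow-list plus two flag-level rules.
function validateGitArgs(args: unknown): Checked {
  if (!Array.isArray(args) || !args.every((a) => typeof a === "string")) {
    return { ok: false, error: "args must be an array of strings" };
  }
  const list: string[] = args;
  const sub: string | undefined = list[0];
  const rest = list.slice(1);

  if (sub === undefined || !READ_ONLY.has(sub)) {
    return { ok: false, error: `subcommand not allowed: ${sub ?? "(none)"}` };
  }
  if (sub === "branch") {
    const bad = rest.find((a) => !BRANCH_FLAGS.has(a));
    if (bad !== undefined) {
      return { ok: false, error: `git branch argument not allowed: ${bad}` };
    }
  }
  // --output makes diff-family commands write to a file. Match the prefix
  // because long options may be abbreviated.
  const writes = rest.find((a) => a.startsWith("--ou"));
  if (writes !== undefined) {
    return { ok: false, error: `flag not allowed: ${writes}` };
  }
  return { ok: true, args: list };
}
```

The handler would call this first and spawn git only on the ok arm. The error strings name the rejected argument so the model can adjust its next call.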
The shell tool from post-02 moves to tools/shell.ts and gets the same envelope refactor as the new tools: its run now returns ToolResult instead of a raw string. The sandbox flags and the executed command stay exactly as they were. Four tools, registered:
// agent.ts
import { Registry } from "./registry";
import { fs_read } from "./tools/fs";
import { http_get } from "./tools/http";
import { git } from "./tools/git";
import { shell } from "./tools/shell";
const tools = new Registry()
  .register(shell)
  .register(fs_read)
  .register(http_get)
  .register(git);
Modern frontier models can ask for several tools in a single turn. Treat them as parallel calls and you save round trips. Treat them as sequential and the model will figure it out, but slowly.
// agent.ts (continued)
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  system: SYSTEM_PROMPT,
  tools: tools.schemas(),
  messages,
});
messages.push({ role: "assistant", content: response.content });

if (response.stop_reason === "end_turn") {
  // ... print final answer, return
}

const calls = response.content.filter((b) => b.type === "tool_use");
const results = await Promise.all(
  calls.map(async (block) => {
    const result = await tools.dispatch(block.name, block.input as Record<string, unknown>);
    console.error(`> ${block.name} ${JSON.stringify(block.input)} -> ${result.ok ? "ok" : "err"}`);
    return {
      type: "tool_result" as const,
      tool_use_id: block.id,
      content: result.ok ? result.value : result.error,
      is_error: !result.ok,
    };
  }),
);
messages.push({ role: "user", content: results });
Two changes versus the post-02 dispatch. The loop runs all tool calls from a single turn in parallel with Promise.all, which is safe here because dispatch converts every failure into a returned ToolResult and never rejects. The is_error flag now gets set when the result was an error, because the model uses it to decide whether to retry or change strategy. The rest of the work (unknown-tool handling, exception wrapping, output capping) moved into the registry, so the agent.ts dispatch shrinks from roughly twenty-five lines to ten.
A short transcript shows the difference.
$ bun agent.ts "summarize the largest TypeScript file under src and tell me what changed in the last commit"
> fs_read {"path":"src/agent.ts"} -> ok
> git {"args":["log","-1","--stat"]} -> ok
src/agent.ts (185 lines) implements the agent loop against the Anthropic
Messages API. It builds a registry of four tools (shell, fs_read, http_get,
git), dispatches tool_use blocks in parallel, and stops when the model
returns an end_turn response or hits the 10-iteration budget. The last
commit added the parallel-dispatch path and a small output truncation
helper; net 38 lines added across agent.ts and registry.ts.
One turn. Two tool calls. They ran simultaneously. The model fused the results into a single answer.
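The parallel path can be exercised without a model. The sketch below assumes a hypothetical fakeDispatch that stands in for the registry and simply resolves after a delay. It shows two properties of Promise.all that the loop relies on: the calls overlap in time, and the results come back in call order even when a later call finishes first.

```typescript
type ToolResult = { ok: true; value: string } | { ok: false; error: string };

// Stand-in for registry.dispatch (hypothetical): resolves after `ms` milliseconds.
const fakeDispatch = (name: string, ms: number): Promise<ToolResult> =>
  new Promise((resolve) => setTimeout(() => resolve({ ok: true, value: name }), ms));

type Call = { id: string; name: string; ms: number };

// Start every call immediately, then wait for all of them together.
async function runParallel(calls: Call[]) {
  const started = Date.now();
  const results = await Promise.all(
    calls.map(async (c) => ({
      tool_use_id: c.id,
      result: await fakeDispatch(c.name, c.ms),
    })),
  );
  return { results, elapsedMs: Date.now() - started };
}

// Usage: the slow call is listed first and still comes back first in the array.
runParallel([
  { id: "a", name: "slow", ms: 60 },
  { id: "b", name: "fast", ms: 10 },
]).then(({ results, elapsedMs }) => {
  console.log(results.map((r) => r.tool_use_id), `${elapsedMs}ms`);
});
```

Elapsed time tracks the slowest call rather than the sum, which is where the latency saving comes from.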
After you have built the registry, the failure mode that bites you is not the code. It is the design. Three patterns that hold up.
One tool per concept. An fs tool with a union input schema (mode: "read" | "list" | "stat") reads cleaner to a human and worse to a model. The model has to pick the right mode and the right arguments simultaneously. Split it: fs_read, fs_list, fs_stat. Three tools, three clear pictures. The model picks better and the schemas are simpler.
Descriptions are written for the model. The description is the only place the model learns when to use a tool. Be specific about inputs, outputs, error cases, and use cases. “Read a file” picks worse than the fs_read description above. The cost of the extra eighty tokens per turn is rounding error against the cost of the model picking the wrong tool and looping.
Errors are data. Every tool returns a ToolResult, { ok: false, error } on failure, rather than throwing. The model can read the error, reason about it, and choose: retry with different inputs, switch tools, or surface the failure to the user. An exception removes all of that.
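To make the first pattern concrete, here is a sketch of the two siblings the text names next to fs_read. The post does not show fs_list or fs_stat, so these bodies are assumptions: they use node:fs/promises (which Bun also provides) and repeat the Tool type so the block stands alone. Each tool takes one argument and has one failure story, which is what keeps the schema and the description short.

```typescript
import { readdir, stat } from "node:fs/promises";

type ToolResult = { ok: true; value: string } | { ok: false; error: string };
type Tool = {
  name: string;
  description: string;
  input_schema: { type: "object"; properties: Record<string, unknown>; required?: string[] };
  run: (input: Record<string, unknown>) => Promise<ToolResult>;
};

// Both tools take the same single argument, so the schema is shared.
const pathOnly = {
  type: "object" as const,
  properties: { path: { type: "string", description: "Absolute or relative path." } },
  required: ["path"],
};

// Hypothetical fs_list: one job, no mode switch.
const fs_list: Tool = {
  name: "fs_list",
  description:
    "List the entries of a directory, one per line, with a trailing / on subdirectories. " +
    "Fails if the path does not exist or is not a directory.",
  input_schema: pathOnly,
  run: async ({ path }) => {
    try {
      const entries = await readdir(String(path), { withFileTypes: true });
      const lines = entries.map((e) => (e.isDirectory() ? `${e.name}/` : e.name)).sort();
      return { ok: true, value: lines.join("\n") };
    } catch (err) {
      return { ok: false, error: String(err) };
    }
  },
};

// Hypothetical fs_stat: type and size only.
const fs_stat: Tool = {
  name: "fs_stat",
  description:
    "Return the type (file, directory, other) and size in bytes of a path. " +
    "Fails if the path does not exist.",
  input_schema: pathOnly,
  run: async ({ path }) => {
    try {
      const s = await stat(String(path));
      const kind = s.isFile() ? "file" : s.isDirectory() ? "directory" : "other";
      return { ok: true, value: `type: ${kind}\nsize: ${s.size}` };
    } catch (err) {
      return { ok: false, error: String(err) };
    }
  },
};
```

Compare this with a single fs tool: its schema would need a mode enum plus arguments that only apply to some modes, and its description would have to explain three behaviors at once.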
The honest version of the tradeoff is in the table.
| Decision | Cheap option | Right option | Why |
|---|---|---|---|
| Tool granularity | one tool with a mode | one tool per concept | Better selection, simpler schemas, simpler errors. |
| Description | one sentence | inputs, errors, when-to-use, in prose | The model picks from the string. |
| Error model | throw exceptions | tagged-union ToolResult always | The model can recover. Exceptions kill the loop. |
| Output size | return whatever the OS gives | cap per tool, truncate with a marker | Context windows are not log files. |
| Side effects | run, hope, retry on error | idempotency keys or confirm argument | Retries are real. Re-deletes are real. |
| Parallel calls | serial loop | Promise.all over tool_use blocks | Modern models batch. Tool wall-clock time per turn drops from the sum of the calls to the slowest call. |
| Sensitive ops | “do not delete files” in prompt | allow-list in the handler | The model will eventually try anyway. |
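The side-effects row is the one with the least obvious implementation, so here is a minimal sketch. It assumes a hypothetical guarded wrapper and two hypothetical input fields, confirm and idempotency_key, neither of which exists in the companion repo. The store is an in-memory Map, so it only protects retries within one process; a real deployment would need something durable.

```typescript
type ToolResult = { ok: true; value: string } | { ok: false; error: string };
type Handler = (input: Record<string, unknown>) => Promise<ToolResult>;

// Hypothetical wrapper for side-effecting tools. It refuses to run without
// confirm: true, and it replays the stored outcome when a key repeats.
function guarded(run: Handler): Handler {
  // Keyed by idempotency_key. Storing the promise (not the result) means two
  // concurrent calls with the same key share a single execution.
  const inflight = new Map<string, Promise<ToolResult>>();

  return async (input) => {
    if (input.confirm !== true) {
      return { ok: false, error: "refused: pass confirm: true to run this tool" };
    }
    const key = input.idempotency_key;
    if (typeof key !== "string" || key.length === 0) {
      return { ok: false, error: "refused: idempotency_key must be a non-empty string" };
    }

    const prior = inflight.get(key);
    if (prior) return prior;

    const attempt = run(input)
      .catch((err): ToolResult => ({ ok: false, error: `tool threw: ${String(err)}` }))
      .then((result) => {
        // Forget failed attempts so the same key can be retried.
        if (!result.ok) inflight.delete(key);
        return result;
      });
    inflight.set(key, attempt);
    return attempt;
  };
}
```

Wrapping a handler this way leaves its ToolResult contract unchanged, so the registry does not need to know which tools are guarded.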
The first time you hit any of these you will think your code is broken. It is not. These are tool-layer problems specifically.
The model passes path: "the file the user mentioned" instead of an actual path because it lost track of the conversation. Strict schemas help. Server-side validation of plausible inputs (file exists, URL parses) helps more.
The model calls a tool that does not exist, and the registry returns unknown tool: X as an error. The model reads it and either retries with a real tool or apologizes. Both are correct behaviors. The bug would be silently dispatching to a default handler.
A cat large_file or a curl massive_api blows the context window. The per-tool cap above handles it. Without a cap, you discover the problem at a token-cost billing alert.
Retrying a failed call is harmless when the tool only reads. For git push, email_send, or database_write, it is not. Idempotency keys or explicit confirm: true arguments are the only durable fixes.
Two fs_write calls to the same path in one turn is the classic example of parallel calls colliding. The agent will not notice. You will, in production. The fix is per-tool serialization or explicit no-parallel marking in the dispatcher.
This is the tool layer. It is not the context layer, the memory layer, or the identity layer. Things you might expect this post to cover that get a dedicated post later.
When the http_get tool calls an internal service, the service needs to know who is asking. Post thirteen.
Total damage going from post-02 to post-03: one new file (registry.ts), one new directory (tools/) with four files, two extensions to types.ts, a rewritten agent.ts dispatch, and a one-line tsconfig.json update so the build picks up the new subdirectory. git diff post-02 post-03 against the companion repo is the entire delta this post describes. The system prompt grows by a sentence. The iteration budget, the message-history shape, and the overall loop are unchanged.
Post one was the loop. Post two was the runtime around the loop. This post is what the loop reaches through to do anything useful. In the reference architecture from post twenty-two, the registry is the seam between the agent process and everything else: MCP servers, internal APIs, file systems, side effects, the world.
The rule from earlier posts still holds. The harness only ever grows; it does not get rewritten. Each post adds one layer to the same artifact and explains why the layer below was not enough.
The layer below this one was an empty toolbox. The layer above is what the model knows when it picks. A model with a brilliant toolbox and no context will pick wrong every time. Next we make the model less blind. That post will ship as post-04 in the same repo.
Part 4: Context Is the Product. Models are commodities. Context is not. Sources of context, why retrieval alone is not enough, and a minimal .AGENTS/ convention that loads context from disk into the same agent we have been building.