Loop engineering: a complete guide

For the last couple of years, working with a coding agent meant sitting in front of it: write a prompt, read the output, write the next prompt, accept or reject the change, repeat. You were the loop. Every cycle of “look at the state, decide the next move, do it, check the result” ran through your hands and your attention.

Loop engineering is what happens when you take yourself out of that cycle and build a system to run it instead. Practitioners have started describing their work this way: Anthropic’s Boris Cherny has said he no longer prompts Claude directly — he runs loops that prompt Claude and decide what to do next. Peter Steinberger puts the same idea more bluntly: you shouldn’t be prompting coding agents anymore, you should be designing the loops that prompt them.

This post is the long version of what that actually means — the what, when, where, why, and how — with examples you can reason about and the guardrails that keep a loop from quietly burning a thousand dollars overnight.

What it is

A loop is a small autonomous control system that wraps one or more agents and keeps running on its own. Instead of you prompting once and reading the answer, the loop does this on repeat:

Look at the current state — what’s failing, what’s open, what changed.
Decide the next action — given that state, what’s the most useful thing to do.
Do it — write code, run a command, open a PR, call an API.
Check the result — did it pass, fail, or get stuck.
Decide what’s next — continue, retry, roll back, or stop and hand off to you.

That last step is the whole point. This is what separates a loop from a script.

A cron job runs a fixed script: step A, then B, then C, every morning, identically. A loop runs an agent that looks at the current state, chooses the next action, does it, checks the result, and decides what to do next. The path isn’t hard-coded — it’s chosen each iteration based on what actually happened.
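To make the difference from a fixed script concrete, here is a minimal Python sketch of that cycle. Every name in it (`run_loop`, `Outcome`, the four callables) is hypothetical and exists only for illustration; a real loop would wrap an actual agent and real tools, and this is one possible shape, not a reference implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    SUCCESS = "success"      # the goal was verified
    HAND_OFF = "hand_off"    # cap reached, a human needs to look


@dataclass
class LoopResult:
    outcome: Outcome
    actions_taken: int


def run_loop(
    observe: Callable[[], dict],
    choose_action: Callable[[dict], str],
    act: Callable[[str], None],
    goal_met: Callable[[dict], bool],
    max_iterations: int = 10,
) -> LoopResult:
    """Observe, decide, act, and re-check until the goal is met or the cap is hit."""
    for iteration in range(1, max_iterations + 1):
        state = observe()                  # look at the current state
        if goal_met(state):                # check: is there anything left to do?
            return LoopResult(Outcome.SUCCESS, iteration - 1)
        action = choose_action(state)      # the path is chosen from the state, not hard-coded
        act(action)                        # do it
    # Cap reached: one last check, then hand off instead of spinning.
    if goal_met(observe()):
        return LoopResult(Outcome.SUCCESS, max_iterations)
    return LoopResult(Outcome.HAND_OFF, max_iterations)


# Toy usage: the "agent" fixes one failing test per iteration.
failing = ["test_login", "test_logout"]
result = run_loop(
    observe=lambda: {"failing": list(failing)},
    choose_action=lambda state: state["failing"][0],
    act=lambda name: failing.remove(name),
    goal_met=lambda state: not state["failing"],
)
```

The cron-style script would be the same code with `choose_action` replaced by a fixed list of steps. What makes it a loop is that the next action depends on what `observe` just returned.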

If you’ve heard of the ReAct pattern (Reason + Act), that’s the engine inside a single turn of the loop: the model reasons about the goal, takes an action, observes the feedback, reasons about the failure, and revises. Loop engineering takes that micro-cycle and wraps a macro-cycle around it — scheduling, memory, verification, and handoff — so it can run unattended across many turns and even across days.

Where it sits in the lineage

Loop engineering didn’t appear from nowhere. It’s the latest rung on a ladder the industry has been climbing:

Prompt engineering (2022–2024) — optimizing a single interaction. How do you word the request so the model does the right thing once?
Context engineering (2025) — optimizing what the model can see. Which files, examples, history, and tools belong in the window for this task?
Harness engineering — optimizing a single agent run. The scaffolding, tools, and feedback that equip one agent to complete one job well.
Loop engineering (2026) — optimizing the system around many runs. Scheduling, verification, persistence, and the decision of when to stop.

The key mental shift: the prompt is no longer the unit of work — the loop is. Your carefully crafted prompt becomes just one component inside a larger machine. Loop engineering doesn’t replace prompt or context engineering; it relocates the leverage point. You still need good prompts and good context — they’re now parts inside the loop, not the thing you operate by hand.

When to use it (and when not to)

This is the most important section, because the wrong answer here is what makes loops expensive and frustrating.

Build a loop when the work has two properties:

It repeats. You do this shape of task again and again — triaging CI failures, fixing flaky tests, upgrading dependencies, processing a queue of tickets.
It has a clear pass/fail signal. There’s something objective that tells you “this is done and correct” — tests pass, types check, lint is clean, a schema validates, a reviewer model confirms the UI matches the spec.

If both are true, you’re currently doing the dumb, repeatable parts by hand — exactly the work a loop should own.

Just prompt the model (no loop) when:

The task is one-off. Building a loop has real setup cost, and a single run can’t amortize it.
The end state is fuzzy or subjective — “make this feel more elegant,” “explore some ideas for the landing page.” There’s no signal for the loop to optimize toward, so it can’t know when to stop.
The work is exploratory or creative and you want to be in the driver’s seat for the judgment calls.

The rule of thumb:

If the workflow is one-off, just prompt the model. If the work repeats and has a clear pass-or-fail signal, build a loop.

A useful test: can you write down, in one sentence, the condition under which the loop is allowed to stop? If you can — “all tests in test/auth pass and lint is clean” — you have a loop candidate. If the best you can do is “until it’s good,” you don’t have a goal, you have a money fire.
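One way to hold yourself to that test is to write the stop condition as a function that returns true or false. The sketch below assumes the goal can be expressed as shell commands that exit with status 0 on success. The helper name `goal_met` and the specific tools (pytest, ruff) are assumptions for illustration; swap in whatever your project actually runs.

```python
import subprocess
import sys
from typing import Sequence


def goal_met(checks: Sequence[Sequence[str]], cwd: str = ".") -> bool:
    """Return True only if every check command exits with status 0."""
    for command in checks:
        try:
            completed = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
        except OSError:
            # A missing or unrunnable tool counts as a failed check, never as a pass.
            return False
        if completed.returncode != 0:
            return False
    return True


# Hypothetical checks for "all tests in test/auth pass and lint is clean".
# pytest and ruff are assumed tools, not something the goal itself dictates.
AUTH_GOAL = [
    [sys.executable, "-m", "pytest", "test/auth"],
    ["ruff", "check", "."],
]
```

If you can’t fill in a list like `AUTH_GOAL` for your task, that is usually the sign that the goal is still “until it’s good.”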

Where it fits — the anatomy of a loop

A real loop needs two non-negotiable preconditions and then a handful of building blocks. Let’s start with the preconditions.

The two preconditions

1. A trigger. Something has to start the loop. This is what makes it a loop and not a session you babysit:

A schedule (“every morning at 7am”)
A git event (a new PR, a push to main)
A CI signal (a failed build)
An inbound message (a Slack mention, a new Linear ticket)
A manual kick (/goal "...")

2. A verifiable goal. Something has to end the loop. This is the condition that, when met, lets the loop stop and declare success:

Deterministic — tests pass, typecheck is clean, the build is green, the output validates against a schema.
Softer but still checkable — a reviewer model compares the result to a written specification, or a diff-to-spec comparison confirms the change matches what was asked.

Without a trigger you have a regular agent session. Without a verifiable goal you have an agent that runs forever, optimizing toward whatever vague sentence you gave it.

The building blocks

Addy Osmani’s framework lays out the pieces a mature loop is assembled from. Think of these as the parts; not every loop needs all of them, but the good ones reach for most:

Automations — the scheduling/trigger layer that wakes the loop without you. This is the heartbeat. Without it you have isolated sessions; with it you have continuous discovery and triage on a cadence.
Worktrees — isolated git working directories so multiple agents can work in parallel without overwriting each other. Each agent gets its own checkout while sharing the repository’s history. This is what makes “spawn three agents on three issues at once” safe.
Skills — durable, dense markdown files (a SKILL.md, a CLAUDE.md, or whatever your tool reads) that encode your conventions, build commands, test commands, review standards, and hard-won lessons. Without skills, every run re-learns your project from scratch and burns tokens doing it. This is the most underused piece — and the one where loops compound the most value, because the knowledge persists across every future run.
Plugins / connectors — tool integrations (usually over MCP) that let the loop act in the real world: open a PR, update a ticket, post to Slack, query a database. Without these the loop can only suggest; with them it can do.
Sub-agents — specialized agents for different roles. The classic split is maker/checker: one agent writes the code, a different agent (often a stronger model, with different instructions) reviews it against the spec and tests. This removes the self-grading bias where an agent declares its own work correct.
Memory — persistent external state, because the model forgets between runs but the repository doesn’t. A state file that records what’s being worked on, what was already tried, and what’s waiting for human review is often the loop’s single most important artifact.
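Two of these pieces are easy to make concrete: worktrees and memory. The sketch below builds a `git worktree add` command for one task and keeps a small JSON state file. The directory naming, the `loop/` branch prefix, and the state file's keys are all assumptions chosen for illustration, not a convention from any particular tool.

```python
import json
from pathlib import Path


def worktree_command(repo: Path, task_id: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` command that gives one task its own checkout."""
    path = repo.parent / f"{repo.name}-{task_id}"   # sibling directory per task (assumed layout)
    branch = f"loop/{task_id}"                       # hypothetical branch naming scheme
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def load_state(path: Path) -> dict:
    """Read the state file, or start with an empty state on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"in_progress": [], "tried": {}, "needs_review": []}


def record_attempt(state: dict, task_id: str, note: str) -> None:
    """Remember what was already tried so the next run does not repeat it."""
    state["tried"].setdefault(task_id, []).append(note)


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)
```

Passing the command to `subprocess.run(..., check=True)` would create the checkout. The state file is deliberately plain JSON so that you, and not only the loop, can read what it tried.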

Why it matters

Three reasons this is more than a rebranding of “agents.”

1. Your leverage moves from operating to designing. You architect the system once and let loops execute it, instead of steering every turn by hand. You stop being the operator who presses “go” each cycle and become the engineer who designs a self-correcting machine, and your attention goes to the design, the goals, and the review, not the keystrokes.

2. You can scale past your own attention. A human-in-the-loop is a bottleneck on inherently iterative work. Software is iterative by nature — even excellent engineers don’t write correct code on the first try; they run it, watch it fail, and fix it. A loop removes your checkpoint from the repetitive cycles while keeping you in control of the design. Parallel work streams and autonomous verification become viable without you watching each one.

3. It matches how the work actually behaves. Debugging, refactoring, test fixing, dependency upgrades — these are all “try, observe, adjust, repeat” tasks. A single-shot prompt fights that grain. A loop runs with it.

How to build one — a worked example

Here’s a concrete morning-triage loop, the kind that’s become the canonical example. Read it as pseudocode for the control flow, not a specific API:

TRIGGER: every weekday at 7:00am

1. DISCOVER
   - Read yesterday's CI failures, open issues, and recent commits.
   - Write a state file listing the tasks worth doing today.

2. For each task in the state file:

   a. ISOLATE
      - Open a fresh git worktree so this task can't collide with others.

   b. MAKE
      - Dispatch a coding sub-agent to draft a fix.
      - Give it the relevant skills (conventions, test commands, examples).

   c. CHECK
      - Dispatch a *separate* review sub-agent.
      - Run the tests and typecheck. Validate the diff against the spec.

   d. DECIDE
      - If checks pass  -> open a PR, update the ticket, mark task done.
      - If checks fail  -> feed the errors back to the maker, retry (once or twice).
      - If still stuck  -> stop, write what it tried to the state file, alert me.

3. STOP
   - When the task list is empty, or the iteration/budget cap is hit.

Notice what’s happening across those seven steps: zero human re-prompting. You wrote the design once. The loop discovers its own work, isolates it, makes a change, independently verifies it, and either ships it or escalates to you with a clear note about what it tried. You wake up to a stack of green PRs and a short list of “these need your eyes.”
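The MAKE, CHECK, and DECIDE steps for a single task can be sketched in Python as follows. This is a simplification under stated assumptions: `make` and `check` are hypothetical callables standing in for the coding sub-agent and the independent reviewer, and a candidate change is represented as a plain string rather than a real diff.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TaskResult:
    task_id: str
    status: str                      # "done" or "escalated"
    attempts: int
    notes: list[str] = field(default_factory=list)


def run_task(
    task_id: str,
    make: Callable[[str, Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> TaskResult:
    """MAKE -> CHECK -> DECIDE for one task, with a bounded number of attempts.

    make(task_id, feedback) returns a candidate change.
    check(candidate) returns None on success, or an error message on failure.
    """
    feedback: Optional[str] = None
    notes: list[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = make(task_id, feedback)     # maker drafts or revises a fix
        error = check(candidate)                # a separate checker verifies it
        if error is None:
            return TaskResult(task_id, "done", attempt, notes)
        notes.append(f"attempt {attempt}: {error}")
        feedback = error                        # feed the errors back to the maker
    # Still stuck: stop and hand the notes to a human instead of retrying forever.
    return TaskResult(task_id, "escalated", max_attempts, notes)
```

With the default of three attempts, the maker gets one initial try plus two retries, which matches the “once or twice” in the DECIDE step. The `notes` list is what would be written to the state file on escalation.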

A simpler, single-task version — a single-goal loop you kick off with one verifiable stop condition — looks like this:

GOAL: "all tests in test/auth pass and lint is clean"

loop (at most N iterations):
  run the tests + lint            # observe state
  if all green:
    a separate checker (not the maker) confirms the goal is actually met
    stop, success                 # verifiable goal met
  pick the most informative failure
  reason about the cause
  edit the code                   # act

The maker doesn’t get to declare victory. A separate check does. That single discipline — the thing that did the work is not the thing that confirms the work — is most of what makes a loop trustworthy.

Common loop shapes

Not every loop is morning triage. A few patterns worth knowing:

Retry loop — attempt an atomic task, check pass/fail, retry on failure. Best for small tasks with a crisp signal.
Plan–execute–verify — generate a plan, execute it step by step, verify each step. Best for multi-step, ordered work like a feature build.
Explore–narrow — try several solution paths, keep the most promising based on results. Best for debugging an unknown error or learning an unfamiliar API.
Human-in-the-loop — the agent pauses and asks you when an assumption is risky, then continues. Best when a wrong guess is expensive.

How to keep it from running away

Autonomous loops make autonomous mistakes, and they make them while you sleep. There are two failure modes that bite hardest, and each has a required brake.

Failure mode 1: goal ambiguity. With a vague goal, nothing ever conclusively says “stop,” so the loop keeps rewriting things indefinitely. The fix is upfront work: spend real effort making the goal objectively verifiable before you ever start the loop. This is where most of your thinking should go.

Failure mode 2: token-cost explosion. A loop that spawns helpers, retries continuously, and self-reviews can burn through millions of tokens fast — and it compounds when it runs unmonitored. The fixes are hard brakes you build in from the start:

Maximum iteration limit — the loop may run at most N times, full stop.
No-progress detection — if two consecutive iterations don’t move the metric, stop and escalate. Spinning is not working.
Daily token / dollar budget — a hard ceiling on spend per day, per loop.
Strong verification beyond self-assessment — tests, typechecks, a reviewer agent, diff-to-spec. In production, a claim is not done until something checks it.
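The first three brakes fit in one small object that the loop consults after every iteration. Treat the sketch below as one possible shape: the class name `Brakes`, its fields, and the lower-is-better metric (for example, the number of failing tests) are assumptions. It also tracks spend for a single run only; a true daily budget would need the running total persisted between runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Brakes:
    max_iterations: int
    max_stalled: int                  # consecutive iterations allowed without improvement
    budget_usd: float
    iterations: int = 0
    stalled: int = 0
    spent_usd: float = 0.0
    best_metric: Optional[float] = None

    def record(self, metric: float, cost_usd: float) -> Optional[str]:
        """Record one iteration; return a stop reason, or None to keep going.

        `metric` is lower-is-better, e.g. the count of failing tests.
        """
        self.iterations += 1
        self.spent_usd += cost_usd
        if self.best_metric is None or metric < self.best_metric:
            self.best_metric = metric
            self.stalled = 0          # progress resets the stall counter
        else:
            self.stalled += 1         # spinning is not working

        if self.spent_usd >= self.budget_usd:
            return "budget exhausted"
        if self.stalled >= self.max_stalled:
            return "no progress"
        if self.iterations >= self.max_iterations:
            return "iteration cap reached"
        return None


# Usage: stop after two consecutive iterations that fail to improve the metric.
brakes = Brakes(max_iterations=10, max_stalled=2, budget_usd=5.0)
reasons = [brakes.record(metric=m, cost_usd=0.5) for m in (5, 3, 3, 4)]
```

Any non-`None` return value should end the loop and escalate, with the reason written wherever you will see it in the morning.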

Then there is the human caveat, which practitioners who have actually run these loops keep repeating. Addy Osmani, whose write-up popularized the framework, is openly skeptical and insists on care:

Stay the engineer. Read what it shipped. Own the quality. Write the skills, or at least control them.

A good loop accelerates work you understand. A loop you stop reading becomes a way to avoid understanding your own system — speed that quietly hides a growing comprehension debt. The auditability also shifts: you’re no longer reading a clean sequence of your own commits, you’re reading a run trajectory. That demands new logging habits — log every action, summarize periodically, and make the trajectory something a human can actually review.
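As a starting point for those logging habits, here is a minimal sketch that appends every action to a JSON Lines file and produces a short summary a reviewer can scan. The field names (`run_id`, `step`, `action`, `outcome`) and both function names are assumptions for illustration, not a standard format.

```python
import json
import time
from pathlib import Path


def log_action(log_path: Path, run_id: str, step: int, action: str, outcome: str) -> None:
    """Append one action to the trajectory log, one JSON object per line."""
    entry = {
        "ts": time.time(),
        "run_id": run_id,
        "step": step,
        "action": action,
        "outcome": outcome,
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def summarize(log_path: Path) -> dict[str, int]:
    """Count outcomes so a reviewer can see the shape of a run at a glance."""
    counts: dict[str, int] = {}
    for line in log_path.read_text(encoding="utf-8").splitlines():
        outcome = json.loads(line)["outcome"]
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts
```

An append-only file keeps the full trajectory intact even if a run crashes partway, and the summary gives you a reason to open the log instead of skipping it.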

The takeaway

Loop engineering is the move from operating an agent to designing the system that operates it. You stop prompting turn-by-turn and instead build a small machine with a trigger to start it, a verifiable goal to stop it, skills so it doesn’t re-learn your project every run, sub-agents that separate making from checking, memory so it remembers across runs, and hard brakes so it can’t run away.

Use it when the work repeats and has a clear pass/fail signal. Skip it when the work is one-off, fuzzy, or creative. Build the verification before you build the automation. And whatever you automate, stay the engineer who reads what shipped and owns the quality — the loop multiplies your judgment, it doesn’t replace it.