Testing AI Apps¶

AgentTape ships a first-class pytest plugin. Bind a test to a cassette with one marker, and it runs offline, free, and deterministic by default.

What you're actually testing¶

Testing an agent isn't asserting 2 + 2 == 4. You're asserting that, given recorded model responses, your agent:

produces a sensible final output,
calls the right tools, in the right order,
and doesn't silently change behavior when you refactor.

AgentTape gives you assertions for all three.

The plugin is automatic¶

Installing AgentTape registers the pytest plugin — no setup. Tests marked with @pytest.mark.agenttape run in mode="none" (offline replay) by default, so CI is always fast and free.

flowchart LR
    M["@pytest.mark.agenttape('name')"] --> S[Open cassette session<br/>around the test]
    S --> R[Replay recorded calls]
    R --> A[Your assertions run]

Your first test¶

test_weather.py

import pytest
from my_app import weather_agent

@pytest.mark.agenttape("weather_sunny_london")
def test_weather_agent(agenttape_cassette):
    result = weather_agent.run("What is the weather in London?")

    # Assert on the final output
    assert "sunny" in result.lower()

    # Assert on behavior: which tools ran, in what order
    agenttape_cassette.assert_tool_calls(["get_location", "get_weather"])

What happened?

The marker binds the test to cassettes/weather_sunny_london.yaml.
AgentTape opens a session around the test and replays the recorded LLM + tool calls.
assert_tool_calls verifies the agent invoked exactly those tools, in that order.

Recording the cassette¶

The first run fails — the cassette doesn't exist yet. Record it by adding a flag (this hits real services and writes the file):

pytest --agenttape-record

Then run normally — instant, offline replay:

pytest

Command	Mode	Network
`pytest`	`none`	Off (replay)
`pytest --agenttape-record`	`all`	On (re-records marked tests)
`pytest --agenttape-mode once`	`once`	First run only

The `agenttape_cassette` fixture¶

Request the agenttape_cassette fixture to get a CassetteHandle with assertions and introspection.

Assertions¶

@pytest.mark.agenttape("checkout")
def test_checkout(agenttape_cassette):
    run_agent()

    agenttape_cassette.assert_tool_calls(["lookup_price", "charge_card"])
    agenttape_cassette.assert_final_output("Order confirmed")
    agenttape_cassette.assert_snapshot()   # fails on ANY drift from the recording

Method	Fails when…
`assert_tool_calls(list)`	The tool/retrieval boundary names (in order) differ
`assert_final_output(value)`	The agent's final output differs
`assert_snapshot()`	The full interaction sequence drifts from the cassette

Introspection¶

Property	Returns
`tool_calls`	List of tool/retrieval boundary names exercised
`final_output`	The run's final output
`interactions`	The full timeline of interactions
`path` / `mode`	The cassette file path / active mode

Snapshot testing: catch silent regressions¶

assert_snapshot() compares the entire sequence of interactions in this run against the cassette. If the agent decides to call get_weather twice instead of once — or reorders its calls — the snapshot fails with a readable diff.

@pytest.mark.agenttape("weather_sunny_london")
def test_no_regression(agenttape_cassette):
    weather_agent.run("What is the weather in London?")
    agenttape_cassette.assert_snapshot()

This is the strongest guard against your agent silently changing behavior when you tweak a prompt or bump the model.

Marker options¶

The marker accepts the same knobs as use_cassette:

# Auto-name the cassette from the test node (no explicit name)
@pytest.mark.agenttape
def test_auto(agenttape_cassette): ...

# Partial replay inside a test
@pytest.mark.agenttape("rag", live={"llm"})
def test_new_prompt(agenttape_cassette): ...

# Custom matchers / freeze
@pytest.mark.agenttape("x", matchers=["ordered"], freeze=["uuid"])
def test_ordered(agenttape_cassette): ...

Supported marker kwargs: mode, live, frozen, matchers, freeze, format. If you omit the cassette name, AgentTape derives it from the test's node name.

A realistic CI setup¶

.github/workflows/ci.yml

- name: Test (offline, deterministic)
  run: pytest          # mode=none — no API keys, no network needed

Cassettes are committed to the repo, so CI needs no secrets and runs offline. Re-record locally when behavior changes intentionally, and the cassette diff lands in the PR for review.

FAQ¶

Do I need the fixture, or just the marker?

The marker alone opens a session around the test (via an autouse fixture), so replay works without requesting agenttape_cassette. Request the fixture only when you want the assertions or introspection.

How do I update one cassette without re-recording all of them?

Run a single test with the record flag: pytest path::test_name --agenttape-record.

My snapshot test fails after a prompt change — is that a bug?

No, that's the point. Decide: intentional change → re-record. Unintended → fix the code. See Debugging.

Summary¶

@pytest.mark.agenttape("name") binds a test to a cassette; tests replay offline by default.
pytest --agenttape-record (re)records against real services.
The agenttape_cassette fixture offers assert_tool_calls, assert_final_output, assert_snapshot.
Snapshot tests catch silent behavioral regressions; CI needs no API keys.

Next: Recording APIs →