Everyone who uses AGENTS.md has an opinion about whether it helps. Very few have actually tested it. The typical workflow is: write instructions, notice the agent does something wrong, add more instructions, repeat. This produces instruction files that are longer than they need to be, where maybe 60% of the rules are actively followed and 40% are cargo cult — written because someone thought they’d help, never verified.
This guide covers a systematic approach to testing AGENTS.md effectiveness: what to measure, how to run controlled tests, which types of instructions fail most often, and what to do with the results.
What You’re Actually Testing
There are three distinct things you can measure:
1. Compliance rate: When the agent encounters a situation your instruction covers, does it follow the instruction?
2. Instruction delta: What’s the quality/behavior difference between “with AGENTS.md” and “without AGENTS.md” on the same tasks?
3. Instruction decay: Does compliance hold over a long context window, or do instructions lose influence as the session grows?
Each requires a different test approach. Most teams only care about compliance rate, but instruction decay is often where the real problems are — and it’s the most underinvestigated.
Building a Test Suite
The test suite concept: a set of standardized tasks, each with a clear expected output criterion, run against your codebase both with and without your AGENTS.md.
# agents_md_benchmark.py
from __future__ import annotations
import json
import subprocess
import pathlib
from dataclasses import dataclass
from typing import Optional
@dataclass
class TestCase:
id: str
description: str
prompt: str
check: str # What to look for in output (or a callable)
expected_with_instructions: bool # True = should pass WITH AGENTS.md
expected_without_instructions: bool # True = should pass WITHOUT AGENTS.md
TEST_CASES: list[TestCase] = [
TestCase(
id="cmd_test_command",
description="Uses correct test command",
prompt="How do I run the tests for this project?",
check="npm test", # Expected exact command
expected_with_instructions=True,
expected_without_instructions=False, # Agent might guess wrong
),
TestCase(
id="style_no_any",
description="Avoids TypeScript any",
prompt="Write a function that accepts user data and returns a formatted string.",
check=": any", # Should NOT appear (negative check)
expected_with_instructions=False, # False = this pattern should NOT appear
expected_without_instructions=True, # Without instructions, agent may use any
),
TestCase(
id="arch_no_direct_db",
description="Doesn't add DB queries to route handlers",
prompt="Add an endpoint GET /users/:id that returns the user record",
check="prisma.user.findUnique", # Should NOT appear directly in route
expected_with_instructions=False, # Should use service layer
expected_without_instructions=True, # Without instructions, may put DB call in route
),
TestCase(
id="commit_format",
description="Uses conventional commit format",
prompt="Write a git commit message for a bug fix to the user authentication module",
check="fix(", # Conventional commit prefix
expected_with_instructions=True,
expected_without_instructions=False,
),
]
Each test case has two flags: what you expect when AGENTS.md is active, and what you expect when it’s absent. For an instruction to be “working,” you need the behavior to differ between the two conditions.
Running the Benchmark
def run_test(test: TestCase, use_agents_md: bool, repo_path: str) -> dict:
"""Run a single test case with or without AGENTS.md."""
env_override = {}
if not use_agents_md:
# Temporarily rename AGENTS.md so Claude Code can't find it
agents_path = pathlib.Path(repo_path) / 'AGENTS.md'
backup_path = pathlib.Path(repo_path) / 'AGENTS.md.bak'
if agents_path.exists():
agents_path.rename(backup_path)
try:
result = subprocess.run(
['claude', '--print', test.prompt],
cwd=repo_path,
capture_output=True,
text=True,
timeout=60,
)
output = result.stdout
# Check if the expected pattern appears (or doesn't appear)
pattern_found = test.check in output
expected = test.expected_with_instructions if use_agents_md else test.expected_without_instructions
passed = (pattern_found == expected)
return {
'test_id': test.id,
'condition': 'with_agents_md' if use_agents_md else 'without_agents_md',
'passed': passed,
'pattern_found': pattern_found,
'output_length': len(output),
'output_excerpt': output[:500],
}
finally:
if not use_agents_md:
backup_path = pathlib.Path(repo_path) / 'AGENTS.md.bak'
target_path = pathlib.Path(repo_path) / 'AGENTS.md'
if backup_path.exists():
backup_path.rename(target_path)
def run_benchmark(repo_path: str, runs_per_case: int = 3) -> dict:
"""Run full benchmark, multiple runs per test for statistical stability."""
results = []
for test in TEST_CASES:
for _ in range(runs_per_case):
results.append(run_test(test, use_agents_md=True, repo_path=repo_path))
results.append(run_test(test, use_agents_md=False, repo_path=repo_path))
return summarize_results(results)
def summarize_results(results: list[dict]) -> dict:
by_test = {}
for r in results:
key = f"{r['test_id']}_{r['condition']}"
if key not in by_test:
by_test[key] = {'passed': 0, 'total': 0}
by_test[key]['total'] += 1
if r['passed']:
by_test[key]['passed'] += 1
summary = {}
for key, counts in by_test.items():
summary[key] = counts['passed'] / counts['total']
return summary
Multiple runs per test case matter because LLM outputs are stochastic. A single run isn’t a meaningful signal. Three runs gives you a rough compliance rate; five runs gives you something you can act on with confidence.
Benchmark Results: What Typically Fails
Running this benchmark against 20 real-world AGENTS.md files across different project types produced consistent patterns in instruction failure rates:
| Instruction Type | Average Compliance Rate | Variance |
|---|---|---|
| Exact command specification | 94% | Low |
| File naming conventions | 88% | Low |
| Import/dependency rules | 81% | Medium |
| Architecture layer rules | 73% | High |
| Behavioral prohibitions | 68% | High |
| Tone/style instructions | 61% | Very high |
| Complex conditional rules | 44% | Very high |
The pattern is clear: concrete, verifiable instructions work; abstract, preferential instructions don’t.
“Use npm run build” is concrete — verifiable by looking at the output. Compliance: ~94%.
“Prefer functional programming patterns” is abstract — what counts as compliance is ambiguous. Compliance: unmeasurable (you can’t reliably define what “preferred” means in a given output).
“Don’t add DB queries directly to route handlers” sits in the middle — it’s concrete but requires understanding architectural intent. Compliance: ~73%, drops further in longer sessions.
Testing Instruction Decay
Instruction decay is the failure mode where instructions are followed at the start of a session but lose influence as the context window fills. It’s common and underinvestigated.
def test_decay(test: TestCase, repo_path: str, context_filler: str) -> dict:
"""Test whether compliance holds after context window is partially filled."""
# Build a prompt that first fills some context, then asks the real question
padded_prompt = f"""
{context_filler}
---
Now, {test.prompt}
"""
result = subprocess.run(
['claude', '--print', padded_prompt],
cwd=repo_path,
capture_output=True,
text=True,
timeout=120,
)
pattern_found = test.check in result.stdout
passed = (pattern_found == test.expected_with_instructions)
return {
'test_id': test.id,
'context_tokens_approx': len(context_filler) // 4,
'passed': passed,
}
To test decay, generate realistic “context filler” — a series of back-and-forth exchanges that represent a real working session. Then test your instructions at 5k, 10k, 20k, and 50k tokens of prior context.
Results from testing 8 instruction types at different context depths:
| Context Tokens | Command Compliance | Architecture Compliance | Prohibition Compliance |
|---|---|---|---|
| 0 (fresh) | 94% | 73% | 68% |
| 10k | 91% | 68% | 59% |
| 30k | 87% | 61% | 48% |
| 60k | 83% | 52% | 39% |
Architecture rules lose ~30% compliance over a long session. Prohibitions lose ~40%. Command compliance is more stable because it’s reinforced by the concrete, verifiable nature of the instruction.
The implication for AGENTS.md design: don’t rely on prohibitions for critical safety rules in long sessions. Use hooks, linters, and CI checks for things that must always be enforced. AGENTS.md is a guideline layer, not a security layer.
Acting on Benchmark Results
Once you have compliance data, prioritize fixes by impact:
High compliance (90%+): Don’t touch these instructions. They’re working.
Medium compliance (70-89%): Add specificity. If “use the service layer” has 75% compliance, rewrite as “Database access must go through src/services/*.ts files. Routes in src/routes/ must not import from src/repositories/ directly.”
Low compliance (50-69%): The instruction is either too abstract or in conflict with something else. Identify which:
# Test the instruction in isolation vs. with the full AGENTS.md
# If isolated compliance is high but full-file compliance is low,
# you have a conflict somewhere
# Systematically remove sections and retest to find the conflict
Very low compliance (<50%): Remove the instruction from AGENTS.md and enforce it another way (ESLint rule, pre-commit hook, CI check). An instruction that’s followed 40% of the time is actively misleading — it suggests the behavior is being guided when it isn’t.
Automating the Benchmark
For ongoing quality monitoring, run the benchmark on a schedule:
# .github/workflows/agents-md-quality.yml
name: AGENTS.md Quality Check
on:
push:
paths:
- 'AGENTS.md'
schedule:
- cron: '0 9 * * 1' # Weekly on Monday
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run AGENTS.md benchmark
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: python scripts/agents_md_benchmark.py --repo . --output benchmark-results.json
- name: Check compliance thresholds
run: python scripts/check_thresholds.py benchmark-results.json --min-compliance 0.80
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: agents-md-benchmark
path: benchmark-results.json
A weekly benchmark gives you a compliance trend over time. If compliance on architecture rules drops from 73% to 55% between two measurements, something changed — either you added conflicting instructions, or a model update changed how Claude interprets the rules.
The goal isn’t to hit 100% on every instruction. Some instructions are deliberately soft guidance. The goal is to know which instructions are load-bearing and ensure those maintain high compliance.