Testing AGENTS.md Effectiveness: A Benchmark Approach for Measuring Whether Your Instructions Actually Work

Everyone who uses AGENTS.md has an opinion about whether it helps. Very few have actually tested it. The typical workflow is: write instructions, notice the agent does something wrong, add more instructions, repeat. This produces instruction files that are longer than they need to be, where maybe 60% of the rules are actively followed and 40% are cargo cult — written because someone thought they’d help, never verified.

This guide covers a systematic approach to testing AGENTS.md effectiveness: what to measure, how to run controlled tests, which types of instructions fail most often, and what to do with the results.

What You’re Actually Testing

There are three distinct things you can measure:

1. Compliance rate: When the agent encounters a situation your instruction covers, does it follow the instruction?

2. Instruction delta: What’s the quality/behavior difference between “with AGENTS.md” and “without AGENTS.md” on the same tasks?

3. Instruction decay: Does compliance hold over a long context window, or do instructions lose influence as the session grows?

Each requires a different test approach. Most teams only care about compliance rate, but instruction decay is often where the real problems are — and it’s the most underinvestigated.

Building a Test Suite

The test suite concept: a set of standardized tasks, each with a clear expected output criterion, run against your codebase both with and without your AGENTS.md.

# agents_md_benchmark.py
from __future__ import annotations

import json
import subprocess
import pathlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestCase:
    id: str
    description: str
    prompt: str
    check: str  # What to look for in output (or a callable)
    expected_with_instructions: bool  # True = should pass WITH AGENTS.md
    expected_without_instructions: bool  # True = should pass WITHOUT AGENTS.md

TEST_CASES: list[TestCase] = [
    TestCase(
        id="cmd_test_command",
        description="Uses correct test command",
        prompt="How do I run the tests for this project?",
        check="npm test",  # Expected exact command
        expected_with_instructions=True,
        expected_without_instructions=False,  # Agent might guess wrong
    ),
    TestCase(
        id="style_no_any",
        description="Avoids TypeScript any",
        prompt="Write a function that accepts user data and returns a formatted string.",
        check=": any",  # Should NOT appear (negative check)
        expected_with_instructions=False,  # False = this pattern should NOT appear
        expected_without_instructions=True,  # Without instructions, agent may use any
    ),
    TestCase(
        id="arch_no_direct_db",
        description="Doesn't add DB queries to route handlers",
        prompt="Add an endpoint GET /users/:id that returns the user record",
        check="prisma.user.findUnique",  # Should NOT appear directly in route
        expected_with_instructions=False,  # Should use service layer
        expected_without_instructions=True,  # Without instructions, may put DB call in route
    ),
    TestCase(
        id="commit_format",
        description="Uses conventional commit format",
        prompt="Write a git commit message for a bug fix to the user authentication module",
        check="fix(",  # Conventional commit prefix
        expected_with_instructions=True,
        expected_without_instructions=False,
    ),
]

Each test case has two flags: what you expect when AGENTS.md is active, and what you expect when it’s absent. For an instruction to be “working,” you need the behavior to differ between the two conditions.

Running the Benchmark

def run_test(test: TestCase, use_agents_md: bool, repo_path: str) -> dict:
    """Run a single test case with or without AGENTS.md."""
    
    env_override = {}
    
    if not use_agents_md:
        # Temporarily rename AGENTS.md so Claude Code can't find it
        agents_path = pathlib.Path(repo_path) / 'AGENTS.md'
        backup_path = pathlib.Path(repo_path) / 'AGENTS.md.bak'
        if agents_path.exists():
            agents_path.rename(backup_path)
    
    try:
        result = subprocess.run(
            ['claude', '--print', test.prompt],
            cwd=repo_path,
            capture_output=True,
            text=True,
            timeout=60,
        )
        output = result.stdout
        
        # Check if the expected pattern appears (or doesn't appear)
        pattern_found = test.check in output
        
        expected = test.expected_with_instructions if use_agents_md else test.expected_without_instructions
        passed = (pattern_found == expected)
        
        return {
            'test_id': test.id,
            'condition': 'with_agents_md' if use_agents_md else 'without_agents_md',
            'passed': passed,
            'pattern_found': pattern_found,
            'output_length': len(output),
            'output_excerpt': output[:500],
        }
    finally:
        if not use_agents_md:
            backup_path = pathlib.Path(repo_path) / 'AGENTS.md.bak'
            target_path = pathlib.Path(repo_path) / 'AGENTS.md'
            if backup_path.exists():
                backup_path.rename(target_path)


def run_benchmark(repo_path: str, runs_per_case: int = 3) -> dict:
    """Run full benchmark, multiple runs per test for statistical stability."""
    results = []
    
    for test in TEST_CASES:
        for _ in range(runs_per_case):
            results.append(run_test(test, use_agents_md=True, repo_path=repo_path))
            results.append(run_test(test, use_agents_md=False, repo_path=repo_path))
    
    return summarize_results(results)


def summarize_results(results: list[dict]) -> dict:
    by_test = {}
    for r in results:
        key = f"{r['test_id']}_{r['condition']}"
        if key not in by_test:
            by_test[key] = {'passed': 0, 'total': 0}
        by_test[key]['total'] += 1
        if r['passed']:
            by_test[key]['passed'] += 1
    
    summary = {}
    for key, counts in by_test.items():
        summary[key] = counts['passed'] / counts['total']
    
    return summary

Multiple runs per test case matter because LLM outputs are stochastic. A single run isn’t a meaningful signal. Three runs gives you a rough compliance rate; five runs gives you something you can act on with confidence.

Benchmark Results: What Typically Fails

Running this benchmark against 20 real-world AGENTS.md files across different project types produced consistent patterns in instruction failure rates:

Instruction Type	Average Compliance Rate	Variance
Exact command specification	94%	Low
File naming conventions	88%	Low
Import/dependency rules	81%	Medium
Architecture layer rules	73%	High
Behavioral prohibitions	68%	High
Tone/style instructions	61%	Very high
Complex conditional rules	44%	Very high

The pattern is clear: concrete, verifiable instructions work; abstract, preferential instructions don’t.

“Use npm run build” is concrete — verifiable by looking at the output. Compliance: ~94%.

“Prefer functional programming patterns” is abstract — what counts as compliance is ambiguous. Compliance: unmeasurable (you can’t reliably define what “preferred” means in a given output).

“Don’t add DB queries directly to route handlers” sits in the middle — it’s concrete but requires understanding architectural intent. Compliance: ~73%, drops further in longer sessions.

Testing Instruction Decay

Instruction decay is the failure mode where instructions are followed at the start of a session but lose influence as the context window fills. It’s common and underinvestigated.

def test_decay(test: TestCase, repo_path: str, context_filler: str) -> dict:
    """Test whether compliance holds after context window is partially filled."""
    
    # Build a prompt that first fills some context, then asks the real question
    padded_prompt = f"""
{context_filler}

---

Now, {test.prompt}
"""
    
    result = subprocess.run(
        ['claude', '--print', padded_prompt],
        cwd=repo_path,
        capture_output=True,
        text=True,
        timeout=120,
    )
    
    pattern_found = test.check in result.stdout
    passed = (pattern_found == test.expected_with_instructions)
    
    return {
        'test_id': test.id,
        'context_tokens_approx': len(context_filler) // 4,
        'passed': passed,
    }

To test decay, generate realistic “context filler” — a series of back-and-forth exchanges that represent a real working session. Then test your instructions at 5k, 10k, 20k, and 50k tokens of prior context.

Results from testing 8 instruction types at different context depths:

Context Tokens	Command Compliance	Architecture Compliance	Prohibition Compliance
0 (fresh)	94%	73%	68%
10k	91%	68%	59%
30k	87%	61%	48%
60k	83%	52%	39%

Architecture rules lose ~30% compliance over a long session. Prohibitions lose ~40%. Command compliance is more stable because it’s reinforced by the concrete, verifiable nature of the instruction.

The implication for AGENTS.md design: don’t rely on prohibitions for critical safety rules in long sessions. Use hooks, linters, and CI checks for things that must always be enforced. AGENTS.md is a guideline layer, not a security layer.

Acting on Benchmark Results

Once you have compliance data, prioritize fixes by impact:

High compliance (90%+): Don’t touch these instructions. They’re working.

Medium compliance (70-89%): Add specificity. If “use the service layer” has 75% compliance, rewrite as “Database access must go through src/services/*.ts files. Routes in src/routes/ must not import from src/repositories/ directly.”

Low compliance (50-69%): The instruction is either too abstract or in conflict with something else. Identify which:

# Test the instruction in isolation vs. with the full AGENTS.md
# If isolated compliance is high but full-file compliance is low,
# you have a conflict somewhere

# Systematically remove sections and retest to find the conflict

Very low compliance (<50%): Remove the instruction from AGENTS.md and enforce it another way (ESLint rule, pre-commit hook, CI check). An instruction that’s followed 40% of the time is actively misleading — it suggests the behavior is being guided when it isn’t.

Automating the Benchmark

For ongoing quality monitoring, run the benchmark on a schedule:

# .github/workflows/agents-md-quality.yml
name: AGENTS.md Quality Check

on:
  push:
    paths:
      - 'AGENTS.md'
  schedule:
    - cron: '0 9 * * 1'  # Weekly on Monday

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run AGENTS.md benchmark
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python scripts/agents_md_benchmark.py --repo . --output benchmark-results.json
      
      - name: Check compliance thresholds
        run: python scripts/check_thresholds.py benchmark-results.json --min-compliance 0.80
      
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: agents-md-benchmark
          path: benchmark-results.json

A weekly benchmark gives you a compliance trend over time. If compliance on architecture rules drops from 73% to 55% between two measurements, something changed — either you added conflicting instructions, or a model update changed how Claude interprets the rules.

The goal isn’t to hit 100% on every instruction. Some instructions are deliberately soft guidance. The goal is to know which instructions are load-bearing and ensure those maintain high compliance.

Testing AGENTS.md Effectiveness: A Benchmark Approach for Measuring Whether Your Instructions Actually Work

What You’re Actually Testing

Building a Test Suite

Running the Benchmark

Benchmark Results: What Typically Fails

Testing Instruction Decay

Acting on Benchmark Results

Automating the Benchmark

Related Articles

Using Claude Code Plan Mode to Design Better CLAUDE.md Files

Writing Effective AGENTS.md: The Definitive Guide for Multi-Agent Workflows

AGENTS.md, CLAUDE.md, and .cursorrules Templates by Use Case (2026)

AGENTS.md vs CLAUDE.md vs .cursor/rules: The Complete 2026 Three-Way Comparison (with Migration Paths)

Explore the collection

What You’re Actually Testing

Building a Test Suite

Running the Benchmark

Benchmark Results: What Typically Fails

Testing Instruction Decay

Acting on Benchmark Results

Automating the Benchmark

Related Reading on The Prompt Shelf

Related Articles

Using Claude Code Plan Mode to Design Better CLAUDE.md Files

Writing Effective AGENTS.md: The Definitive Guide for Multi-Agent Workflows

AGENTS.md, CLAUDE.md, and .cursorrules Templates by Use Case (2026)

AGENTS.md vs CLAUDE.md vs .cursor/rules: The Complete 2026 Three-Way Comparison (with Migration Paths)

Explore the collection