
OpenAI Codex CLI Workflow Patterns 2026: From Setup to Production-Grade Automation


Getting Codex CLI running is straightforward. Getting it to do useful work reliably, at scale, without constant hand-holding — that takes deliberate workflow design.

This article picks up where the AGENTS.md setup guide left off. If you have not read that one, go there first: it covers how Codex discovers and merges instruction files, the 32 KiB budget constraint, and what to put in your AGENTS.md. Everything in this article assumes that foundation is in place.

What this article covers: seven workflow patterns that experienced Codex users have converged on, how to tune your AGENTS.md specifically for each pattern, an honest comparison of Codex CLI against Claude Code and Cursor on the axes that actually matter, practical cost optimization, and the failure modes that catch developers off guard when they try to scale up.


What Changed in 2026

Codex CLI shipped as a research preview in early 2025. By mid-year it had stabilized enough for production workflows, and the ecosystem around it — AGENTS.md tooling, CI integration, monorepo patterns — has matured considerably.

The most significant changes that affect how you design workflows:

Sandboxed execution is now the default. Codex runs in a network-disabled, write-controlled sandbox when you use --sandbox=full. In 2025, this was opt-in and noticeably slower. In 2026, the sandbox overhead has dropped to under 300ms on typical tasks, making it practical to leave on for all sessions. This changes the threat model for multi-step automation significantly — you can now let Codex run longer task chains without worrying about unintended file modifications spreading across the repo.

The approval model gained granularity. Instead of choosing between “approve everything” (--ask-for-approval never) and “approve every shell command” (always), you can now define approval policies in ~/.codex/config.toml that allow specific command classes automatically. Teams running Codex in CI use this extensively.
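What a "command class" policy looks like is easiest to show with a sketch. The snippet below is illustrative only: the [approvals] table and its key names are assumptions, not documented configuration.

cat > ~/.codex/config.toml <<'EOF'
# Sketch only: these key names are illustrative, not a documented schema
[approvals]
# Command classes that never require interactive approval
auto_allow = ["git status", "git diff", "npm run lint", "npm test"]
# Everything else falls back to a per-command prompt
default = "ask"
EOF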

Sub-agent handoffs are stable. Codex can now invoke sub-agents defined in your AGENTS.md, wait for their output, and continue. In 2025, this required workarounds. In 2026, the ## Agents section in AGENTS.md is a first-class feature with documented behavior.

Context persistence across checkpoints. Sessions can checkpoint state between steps, allowing multi-hour autonomous runs that would have timed out or drifted in 2025.

These changes made the patterns below practical. Most of them were theoretically possible before — they just were not reliable enough to use in real work.


Workflow Pattern Catalog

Pattern 1: Inline Review

What it is: You write code, Codex reviews it — not for style, but for correctness, edge cases, and logical issues. The output is a structured list of specific problems with exact file locations, not generic advice.

When to use it: After completing a feature or a significant refactor, before opening a PR. Works best on code you wrote yourself (where you have blind spots) rather than unfamiliar legacy code.

Setup:

Create a review.md prompt template:

Review the diff at [DIFF_PATH] for the following, in order of priority:

1. Logic errors — conditions that evaluate incorrectly, off-by-one errors,
   incorrect null checks
2. Unhandled edge cases — inputs not covered by the existing test suite
3. Boundary violations — anything that crosses the architectural lines defined
   in AGENTS.md (database access outside repositories, business logic in routes,
   etc.)
4. Security issues — injection vectors, credential exposure, improper auth checks

For each issue:
- File and line number
- One-sentence description of the problem
- Minimal code change that fixes it (no rewrites unless necessary)

Do not flag style issues. Do not suggest "consider using X instead of Y" unless
the current approach causes an actual bug.

Run it:

git diff main > /tmp/current.diff
codex --ask-for-approval never \
  --model o4-mini \
  "$(sed 's|\[DIFF_PATH\]|/tmp/current.diff|' review.md)"

AGENTS.md tuning for this pattern:

Add an explicit review scope section:

## Review Boundaries

When asked to review code:
- Reference the architectural layers defined in [Project Structure] above
- Flag violations by layer name (e.g., "business logic in route handler violates
  services/ boundary")
- Test coverage gaps: list specific scenarios not covered, not a general "add
  more tests"
- Security: check for [list your most common security concerns]

Do not flag:
- Formatting (handled by eslint/prettier)
- Variable naming unless it causes ambiguity
- "Could be simplified" observations unless the current version has a bug

The key insight: most developers use Codex for generation and miss its value as a reviewer. A structured review prompt that focuses on correctness rather than style produces more actionable, location-specific output than a typical human review pass, in a fraction of the time, when the goal is catching logic errors.


Pattern 2: Spec-Driven Generation

What it is: You write a specification — in structured prose, not pseudocode — and Codex generates an implementation that satisfies it. The spec becomes the source of truth. Generated code is never committed without running it against the spec as a test.

When to use it: New features where you know what the behavior should be before you know how to implement it. Particularly effective for API endpoints, data transformations, and stateful logic where the behavior is complex but can still be written down precisely.

Setup:

Specs use a consistent format:

# Spec: Rate Limiter

## Behavior

- Accepts: userId (string), action (string), windowMs (number), limit (number)
- Returns: { allowed: boolean, remaining: number, resetAt: Date }
- Tracks attempts per (userId, action) pair within the given window
- Window slides with each new request (not fixed calendar windows)
- Concurrent requests from the same userId are handled safely (no race conditions)

## Error cases

- userId empty string: throw ValidationError("userId required")
- limit < 1: throw ValidationError("limit must be positive")
- windowMs < 100: throw ValidationError("window too narrow")

## Constraints

- No external dependencies (Redis, etc.) — must work with in-memory storage
- Thread-safe without locks (use atomic Map operations)
- Memory: evict entries older than 2x windowMs to prevent leaks

## Test cases (must pass)

- fresh limiter: returns { allowed: true, remaining: limit - 1 }
- at limit: returns { allowed: false, remaining: 0 }
- after window expires: counter resets, returns { allowed: true, remaining: limit - 1 }
- concurrent requests at limit: exactly limit requests allowed, none extra

Generate from spec:

codex --ask-for-approval files-only \
  --model o4-mini \
  "Implement the spec at specs/rate-limiter.md. 
   Place the implementation in src/utils/rateLimiter.ts.
   Place tests in tests/rateLimiter.test.ts.
   Run the tests after generation and fix any failures before finishing."

AGENTS.md tuning:

## Spec-Driven Generation Rules

When implementing from a spec file:
1. Read the spec completely before writing any code
2. Implement only what the spec describes — do not add convenience methods
   or extend behavior beyond the stated requirements
3. Generate tests that cover all listed test cases plus boundary conditions
   not explicitly listed
4. Run tests before reporting completion
5. If a spec requirement is ambiguous or contradictory, stop and report the
   specific conflict before guessing

Spec files are in specs/. Implementations go in src/ (matching existing 
structure). Tests mirror src/ structure.

Why this beats “write me a rate limiter”: The spec approach forces you to articulate the exact behavior before generation begins. This prevents the most common failure mode of AI-generated code — technically working code that does not do what you actually needed.


Pattern 3: Test-First Refactor

What it is: Before touching existing code, Codex generates a comprehensive test suite against the current behavior. Refactoring then proceeds against this suite as a regression harness. No refactor is complete until all tests pass and coverage has not decreased.

When to use it: Refactoring legacy code where existing test coverage is sparse. Extracting a service from a monolith. Changing data models while keeping the external API stable.

The two-phase execution:

Phase 1 — Generate characterization tests:

codex --ask-for-approval never \
  --model o4-mini \
  "Generate characterization tests for src/legacy/orderProcessor.ts.
   
   Coverage requirements:
   - Every public method
   - Every error path (look for try/catch blocks and conditional throws)
   - At least 3 edge cases per method based on the parameter types
   
   Do not modify the source file. Write tests to tests/legacy/orderProcessor.test.ts.
   Run the test suite and confirm all tests pass against the current implementation.
   Report: number of tests generated, coverage percentage achieved."

Phase 2 — Refactor against the suite:

codex --ask-for-approval files-only \
  --model o4-mini \
  "Refactor src/legacy/orderProcessor.ts to:
   1. Extract database queries to src/repositories/orderRepository.ts
   2. Remove all console.log statements (use the logger at src/utils/logger.ts)
   3. Replace callback-style async with async/await throughout
   
   Constraints:
   - tests/legacy/orderProcessor.test.ts must continue to pass without modification
   - Do not change any public method signatures
   - Do not change the file's external exports
   
   Run the test suite after each significant change. Do not proceed to the
   next refactor step if tests are failing."

AGENTS.md tuning:

## Refactor Rules

Before any refactor:
- Confirm existing test coverage with: npm run test:coverage -- [file]
- If coverage < 70%, generate characterization tests first

During refactor:
- Run tests after each logical step (not just at the end)
- If a test fails, fix the implementation — do not modify the test unless 
  the test is clearly wrong about what the function should do
- Report coverage before and after

Refactoring means behavior-preserving change. New behavior is a feature, 
not a refactor, and requires a spec.

Why test-first beats test-after: Generated tests written after a refactor tend to test the refactored implementation, not the original behavior. Characterization tests written against the original code capture actual behavior — including the undocumented behaviors that callers might depend on.


Pattern 4: Multi-Step Planning

What it is: For tasks that touch multiple files, subsystems, or require sequential decisions, Codex produces an explicit plan before executing. Each plan step is discrete and verifiable. Execution does not begin until the plan is approved.

When to use it: Adding a new entity type to a full-stack application (schema, migration, model, service, API, tests, documentation). Migrating from one library to another. Cross-cutting architectural changes.

The plan-execute protocol:

Step 1 — Generate plan only:

codex --ask-for-approval never \
  --model o3 \
  "Plan the implementation of [FEATURE_DESCRIPTION].
   
   Output format:
   - Numbered steps in execution order
   - Each step: file(s) affected, type of change (create/modify/delete),
     dependencies on previous steps, verification method
   - Flag any steps that require external information (credentials, API schemas)
   - Estimated step count and complexity
   
   Do not write any code. Do not modify any files. Plan only."

Step 2 — Review and optionally edit the plan in your editor.

Step 3 — Execute with step-by-step approval:

codex --ask-for-approval files-only \
  --model o4-mini \
  "Execute the implementation plan at /tmp/feature-plan.md.
   
   After each numbered step:
   - Run relevant tests
   - Confirm the step's verification method passes
   - Briefly state what was done before moving to the next step
   
   If a step fails its verification, stop and report the failure.
   Do not attempt to compensate for a failed step by modifying other steps."

AGENTS.md tuning:

## Planning Protocol

When asked to plan (not implement):
- Output a numbered step list only
- Each step specifies: files, change type, dependencies, verification
- Mark steps that require human decisions with [DECISION REQUIRED]
- Do not write code or modify files during planning

When executing a plan:
- Treat the plan as immutable unless a step explicitly fails
- Run verification after each step
- On failure, report the step number, failure details, and stop — do not 
  try to recover autonomously

The planning model choice matters: Use o3 for planning (broader reasoning, better at identifying dependencies and edge cases) and o4-mini for execution (faster, cheaper, sufficient for discrete well-specified tasks). Switching models between phases is a meaningful optimization.
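A thin wrapper script makes the swap mechanical. This is a sketch that chains the two commands above; the plan path and the editor pause are conventions, not features:

#!/bin/bash
# plan-then-execute.sh: o3 plans, a human reviews, o4-mini executes
set -euo pipefail
FEATURE="$1"

# Phase 1: plan only, with the stronger reasoning model
codex --ask-for-approval never --model o3 \
  "Plan the implementation of $FEATURE. Output numbered steps with files,
   change type, dependencies, and verification method. Do not write any code." \
  > /tmp/feature-plan.md

# Phase 2: review (and optionally edit) the plan before anything executes
${EDITOR:-vi} /tmp/feature-plan.md

# Phase 3: execute with the cheaper model
codex --ask-for-approval files-only --model o4-mini \
  "Execute the implementation plan at /tmp/feature-plan.md.
   Run verification after each step. Stop and report on any failure."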


Pattern 5: Continuous Integration Gate

What it is: Codex runs as a CI check that reviews PRs automatically — not for style (your linter handles that) but for logic, security, and architectural boundary violations. Failed checks block merge.

When to use it: Teams where multiple developers use AI tools to generate code and want a second AI pass before human review.

CI configuration (GitHub Actions):

name: Codex Review Gate

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  codex-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Generate diff
        run: git diff origin/${{ github.base_ref }}...HEAD > /tmp/pr.diff

      - name: Run Codex review
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npm install -g @openai/codex
          codex \
            --ask-for-approval never \
            --sandbox=full \
            --model o4-mini \
            "$(cat .codex/ci-review-prompt.md)" < /tmp/pr.diff

      - name: Check review output
        # Parse Codex output for severity markers and fail on CRITICAL findings
        run: |
          if grep -q "\[CRITICAL\]" /tmp/codex-review.md; then
            echo "Critical issues found. Review output:"
            cat /tmp/codex-review.md
            exit 1
          fi

The ci-review-prompt.md defines what counts as a CRITICAL finding in your codebase. Architectural violations, SQL injection vectors, missing auth checks — whatever your team cares about most. Everything else is advisory.

AGENTS.md for CI context:

## CI Review Context

You are running as a CI gate reviewer, not an interactive assistant.

Output format (strict):
[CRITICAL] <description> — <file>:<line>
[WARNING] <description> — <file>:<line>
[INFO] <description> — <file>:<line>

CRITICAL = blocks merge. Use for:
- Security vulnerabilities with clear exploit path
- Architectural boundary violations (defined in [Project Structure])
- Logic errors that will cause data corruption or incorrect results

WARNING = advisory. Use for:
- Missing test coverage for new logic
- Potential performance issues at scale
- Pattern inconsistencies

INFO = non-blocking observations.

Do not output anything except the classified finding list and a one-line summary.

Pattern 6: Incremental Documentation

What it is: Codex generates documentation for code as it changes — not as a batch retrospective, but triggered by specific file changes. The documentation stays in sync because it is generated immediately after each substantive change.

When to use it: Teams where documentation drift is a problem. Projects where the API surface changes frequently. Anywhere that “we’ll document it later” has a track record of not happening.

Implementation:

A simple git hook (post-commit) triggers documentation generation for changed files:

#!/bin/bash
# .git/hooks/post-commit

# The amend below re-triggers post-commit; this guard stops the recursion
[ -n "$CODEX_DOC_PASS" ] && exit 0
export CODEX_DOC_PASS=1

CHANGED=$(git diff HEAD~1 --name-only | grep -E '\.(ts|tsx|py|go|rs)$')

if [ -n "$CHANGED" ]; then
  codex \
    --ask-for-approval never \
    --model o4-mini \
    "Update JSDoc/docstring documentation for changed functions in:
     $CHANGED
     
     Rules:
     - Only update docs for functions whose implementation changed
     - Preserve existing doc blocks that are still accurate
     - For new functions: generate full docblock (params, returns, throws, example)
     - For modified functions: update the relevant sections only
     
     Run: git add -u && git commit --amend --no-edit"
fi

AGENTS.md tuning:

## Documentation Standards

When generating or updating documentation:
- JSDoc for TypeScript/JavaScript
- Google-style docstrings for Python
- Doc comments (///) for Go and Rust

Required sections for public functions:
- One-sentence description (what it does, not how)
- @param for each parameter with type and meaning
- @returns with type and meaning
- @throws for each error condition
- @example with the most common usage

Do not document private/internal functions unless they are complex enough
to warrant it (>30 lines, non-obvious algorithm).

Do not document getter/setter accessors.

Pattern 7: Scaffolding-Then-Implement

What it is: For large new features, Codex first generates the full file/directory structure and interface definitions — with empty implementations — and then implements each component in isolation. This prevents the common failure mode where an AI agent makes increasingly incoherent decisions as it fills in a large implementation all at once.

When to use it: New microservices. Significant new modules. Any feature touching more than 5-7 files.

Phase 1 — Scaffold:

codex --ask-for-approval files-only \
  --model o3 \
  "Scaffold the [FEATURE_NAME] module.
   
   Create:
   - Directory structure following the conventions in AGENTS.md
   - TypeScript interface files for all data types
   - Function signatures with return types (empty bodies with // TODO: implement)
   - Test file stubs with describe blocks and it() stubs (empty)
   - Export statements so modules are properly connected
   
   Do not implement any real logic. Use throw new Error("TODO: implement") as
   every function body so the scaffold type-checks with tsc --noEmit even
   though nothing is implemented."

Phase 2 — Implement in sequence:

# Implement from dependencies inward — lowest-dependency modules first
codex --ask-for-approval files-only \
  --model o4-mini \
  "Implement src/[feature]/repository.ts.
   
   The interfaces are defined in src/[feature]/types.ts.
   The tests stubs are in tests/[feature]/repository.test.ts — fill in the
   tests as you implement.
   Run tests after implementation and fix any failures."

Repeat for each module in dependency order.
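When the dependency order is known, the repetition can be scripted. A minimal sketch, with hypothetical module names for a typical feature (data layer first, API last); each iteration is its own session, which is what keeps the implementations focused:

for MODULE in repository service api; do
  codex --ask-for-approval files-only --model o4-mini \
    "Implement src/[feature]/$MODULE.ts.
     The interfaces are defined in src/[feature]/types.ts.
     Fill in the test stubs in tests/[feature]/$MODULE.test.ts as you implement.
     Run tests after implementation and fix any failures." || break
done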

AGENTS.md tuning:

## Scaffolding Protocol

When scaffolding (not implementing):
- Create directory structure first
- Create interface/type files before implementation files
- Use throw new Error("TODO: implement") as the body of all functions
- All files must pass tsc --noEmit (types must be correct even though bodies are stubs)
- Report the scaffold plan (files created, dependency order for implementation)
  before creating any files

Implementation follows scaffolding in separate sessions.

AGENTS.md Tuning for Codex CLI

Your AGENTS.md already covers the basics: commands, project structure, conventions. The patterns above add another layer — instructions that shape how Codex behaves in specific workflow contexts. Here is how to structure the file so it serves both purposes without becoming unmanageable.

Separate universal from workflow-specific

Universal instructions belong at the top-level of AGENTS.md. Workflow-specific instructions should be in a dedicated section that Codex can reference contextually:

# AGENTS.md

## Commands
[universal]

## Project Structure
[universal]

## Conventions
[universal]

---

## Workflow Instructions

These sections are referenced by specific prompts. If a prompt does not
reference a section, apply only the universal instructions above.

### Review Mode
[see Pattern 1 above]

### Spec-Driven Mode
[see Pattern 2 above]

### Refactor Mode
[see Pattern 3 above]

### Planning Mode
[see Pattern 4 above]

Prompts that use a specific pattern reference the relevant section explicitly: “Review this diff following the guidelines in the ‘Review Mode’ section of AGENTS.md.” This prevents Codex from mixing instructions across contexts.
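In a real invocation, the reference is simply part of the prompt:

codex --ask-for-approval never --model o4-mini \
  "Review /tmp/current.diff following the guidelines in the 'Review Mode'
   section of AGENTS.md. Apply no other workflow sections."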

The instruction budget problem

Every character in AGENTS.md is consumed before your first prompt. For a project with full workflow instructions, a well-structured AGENTS.md might run 8-12 KB. That leaves plenty of room within the 32 KiB default budget — but only if your global ~/.codex/AGENTS.md is lean.

Global AGENTS.md rule: under 3 KB. Personal preferences and defaults only. No project-specific content.

If you have a monorepo where different packages need different workflow instructions, use package-level AGENTS.md files:

repo-root/AGENTS.md              # Universal: commands, structure, conventions
repo-root/packages/api/AGENTS.md # API-specific workflow instructions
repo-root/packages/ui/AGENTS.md  # UI-specific workflow instructions

When Codex is in packages/api/, it merges both files — root first, package-specific second. The package-specific instructions take precedence by virtue of appearing later in the context.

Measuring instruction effectiveness

The most common question is “how do I know if my AGENTS.md is working?” The direct answer is to ask:

codex --ask-for-approval never \
  "Summarize the instructions you have loaded for this session, organized by 
   section. Then tell me: what are the top three things I should do differently 
   in this project based on these instructions?"

If Codex cannot summarize the key constraints accurately, the instructions are not landing. Common reasons: file discovery failure (wrong path), size limit truncation (global file too large), or instructions too vague to parse as constraints.


Codex CLI vs Claude Code vs Cursor

This comparison uses a functional axis rather than a spec-sheet axis. The question is not “which tool has the most features” but “which tool handles each workflow pattern better.”

| Capability | Codex CLI | Claude Code | Cursor |
|---|---|---|---|
| Agentic autonomy | High — designed for long autonomous runs with sandboxing | High — native terminal + tool use, multi-session | Medium — editor-integrated, not designed for long autonomous tasks |
| Context window | ~128K (model-dependent) | Up to 1M (claude-sonnet-4-5) | ~200K (model-dependent) |
| AGENTS.md support | Native (first-class) | Via CLAUDE.md (superset format) | Via .cursorrules (limited hierarchy) |
| Sandbox execution | Yes, production-grade | Yes, via Docker or native process control | No native sandbox |
| CI/CD integration | Strong — designed for non-interactive use | Possible via headless mode | Not designed for CI |
| Inline review | Excellent — structured output, no IDE dependency | Excellent — can use slash commands | Good — in-editor context, limited output structuring |
| Spec-driven generation | Excellent — works well with file-based spec input | Excellent — handles multi-file generation well | Good — better for single-file generation |
| Test-first refactor | Good — requires explicit instructions | Excellent — native test running awareness | Good — in-editor test runner integration |
| Multi-step planning | Excellent — o3 planning + o4-mini execution model swap | Good — Claude handles planning within session | Limited — session length constraints |
| Approval granularity | High — config.toml approval policies | Medium — tool-level permissions | Low — largely auto-accept |
| Cost model | Pay per API call (OpenAI pricing) | Pro/Max subscription or API | Subscription + API costs |
| IDE required | No | No | Yes |

What Codex CLI does best: Long autonomous tasks, CI integration, multi-step workflows where you want structured plan-then-execute behavior, and anything where you need fine-grained control over what gets approved automatically. The model-swap pattern (o3 for planning, o4-mini for execution) is a genuine cost and quality optimization that neither Claude Code nor Cursor supports as cleanly.

What Claude Code does best: Tasks that require broad repository context (the 1M context window is meaningful for large codebases), CLAUDE.md configurations with Claude-specific hooks and memory systems, and teams already using Anthropic’s stack. Claude Code’s sub-agent system is more flexible than Codex’s for complex delegation patterns.

What Cursor does best: Interactive editing where you want to see changes in context, teams where IDE integration matters more than automation, and junior developers who benefit from in-editor hints rather than terminal output.

The practical answer for most teams: Codex CLI and Claude Code are complementary rather than competitive. Codex CLI handles autonomous automation and CI gates. Claude Code handles exploratory, session-based work where a developer is in the loop. Using both is a reasonable choice — AGENTS.md and CLAUDE.md serve overlapping roles and can share content.


Cost Optimization

Codex CLI costs scale with API calls, not subscription tiers. That makes cost optimization more important — and more tractable — than with subscription-based tools.

Model selection per pattern

The single biggest cost lever is matching model to task:

| Pattern | Recommended model | Why |
|---|---|---|
| Inline review | o4-mini | Review is structured output; reasoning quality doesn't require o3 |
| Spec-driven generation | o4-mini | Spec provides sufficient context; the model follows it rather than reasoning about it |
| Test-first refactor (Phase 1: characterization) | o4-mini | Systematic enumeration, not creative reasoning |
| Test-first refactor (Phase 2: refactor) | o4-mini | Follows established tests; minimal reasoning overhead |
| Multi-step planning | o3 | The pattern where reasoning quality directly affects output quality |
| CI gate | o4-mini | Pattern matching against defined criteria, not open-ended reasoning |
| Scaffold (interfaces) | o3 | Interface design requires broader reasoning about system shape |
| Scaffold (implementation) | o4-mini | Execution of well-defined interfaces |

Using o3 only where it adds genuine value and o4-mini for everything else typically cuts costs by 60-70% compared to using o3 uniformly.

Prompt caching

Codex does not expose OpenAI’s prompt caching directly, but you can benefit from it by structuring your AGENTS.md and prompt templates to be consistent across calls:

  • AGENTS.md content should be stable across sessions. Do not include timestamps or session-specific information.
  • Prompt templates (like the review.md above) should have a static preamble that is the same every time, with only the variable portion (the diff, the spec path) changing. Cached prefixes count toward cache hits.
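The second point is easiest to see in the shape of the final prompt. A sketch, using a variant of the Pattern 1 review that inlines the diff instead of substituting a path: the preamble bytes are identical on every run, and only the tail varies.

# Static preamble first (identical bytes every call), variable diff last
PROMPT="$(cat review.md)

Diff to review:
$(cat /tmp/current.diff)"

codex --ask-for-approval never --model o4-mini "$PROMPT"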

The sandbox overhead tradeoff

--sandbox=full adds latency but can save money on failed runs. A misconfigured Codex session that overwrites the wrong files and then has to be rolled back can waste more tokens (and developer time) than the sandbox overhead. For automated workflows, sandbox costs are justified.

For interactive sessions where you are present to catch issues, --sandbox=none is faster. The key is being consistent so you do not accidentally run with sandbox off in a context where you assumed it was on.

Scoping context aggressively

Codex reads the files you point it at. More files means more tokens. For the patterns above:

  • Inline review: feed only the diff, not the full files
  • Spec-driven generation: feed the spec + the interfaces it depends on, not the whole codebase
  • Test-first refactor: feed only the target file and its direct imports
  • Multi-step planning: feed the file list for the affected area, not all files

Use .codexignore to exclude files that should never be in context: build artifacts, generated files, node_modules (if not gitignored), large data files.
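A starting point, assuming gitignore-style syntax:

cat > .codexignore <<'EOF'
# Build artifacts and generated output
dist/
build/
coverage/
*.min.js
# Dependencies, if not already excluded
node_modules/
# Large data files: tokens without signal
*.csv
*.parquet
EOF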


Pitfalls and FAQ

Q: Codex is not following the instructions in my AGENTS.md.

Most common causes in order of frequency:

  1. File not discovered: Run the summarization check (codex --ask-for-approval never "Summarize your loaded instructions"). If AGENTS.md content is not there, the file is not being found — check the path, working directory, and Codex home.

  2. Size limit truncation: Your global AGENTS.md is large enough to push project-level instructions past the 32 KiB cutoff. Check with wc -c ~/.codex/AGENTS.md and keep it under 3 KB.

  3. Instructions too vague: “Follow best practices” is not an instruction. “Run npm run lint before reporting completion” is an instruction. Rewrite vague rules as specific, checkable actions.

Q: Multi-step tasks drift and produce inconsistent results over a long run.

This is the most common failure mode for Pattern 4 (multi-step planning) and Pattern 7 (scaffolding-then-implement). Two fixes:

First, use checkpointing. End each phase with a Codex call that writes a brief state file: "Write the current state of the implementation to .codex/checkpoint.md — what is done, what remains, what assumptions were made." Start the next phase by reading this checkpoint.
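Concretely, the phase boundary is two commands. The checkpoint path is a convention, not a built-in:

# End of phase N: persist state to a human-readable checkpoint
codex --ask-for-approval never --model o4-mini \
  "Write the current state of the implementation to .codex/checkpoint.md:
   what is done, what remains, what assumptions were made."

# Start of phase N+1: rehydrate before continuing
codex --ask-for-approval files-only --model o4-mini \
  "Read .codex/checkpoint.md, then continue the plan at /tmp/feature-plan.md
   from the first unfinished step."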

Second, front-load constraints. If a constraint matters throughout the entire run, it should be in AGENTS.md, not just in the first prompt. Constraints only in prompts can be “forgotten” as the context fills with intermediate output.

Q: Codex sometimes modifies files I did not ask it to touch.

Three approaches in order of invasiveness:

  1. --sandbox=files-only restricts writes to files in the current directory. For most workflows this is sufficient.
  2. Add explicit boundaries to AGENTS.md: "Do not modify files outside src/ and tests/ unless explicitly instructed." This works well when combined with the sandbox.
  3. Use a review phase: Pattern 1 (inline review) run against Codex’s own output before committing. This catches unintended changes before they are in git.

Q: How do I handle secrets and credentials in automated Codex workflows?

Never put credentials in AGENTS.md or prompts. The patterns above use the --sandbox=full flag specifically because it disables network access, which prevents Codex from exfiltrating data even if a prompt injection attack occurs.

For CI workflows that need service credentials (database connection, API keys), pass them as environment variables and reference them in AGENTS.md by variable name: "Database connection: $DATABASE_URL (set in environment, never hardcode)." Codex can use the variable name to inform its generated code without the actual value being in the instruction context.
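A minimal CI-style sketch (the script name is hypothetical): the value lives in the environment, and only the variable name reaches the prompt.

# DATABASE_URL is injected by the CI secret store; its value never enters the prompt
codex --ask-for-approval never --sandbox=full --model o4-mini \
  "Generate scripts/archiveOrders.ts.
   Read the connection string from process.env.DATABASE_URL at runtime.
   Never hardcode credentials."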

Q: The test-first refactor pattern is slow. Is there a faster path?

The characterization test generation step typically takes 2-5 minutes for a mid-sized module (200-500 lines). If that is too slow, an alternative: use an existing coverage tool to find what is already tested (npm run test:coverage), then ask Codex to add tests only for uncovered lines. This is faster but produces a less complete behavioral baseline.
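A sketch of that faster path, assuming the npm run test:coverage command from the Refactor Rules writes its report under coverage/:

# 1. Produce a coverage report scoped to the target file
npm run test:coverage -- src/legacy/orderProcessor.ts

# 2. Ask Codex to fill only the gaps the report identifies
codex --ask-for-approval never --model o4-mini \
  "Read the coverage report in coverage/ for src/legacy/orderProcessor.ts.
   Add tests to tests/legacy/orderProcessor.test.ts that exercise only the
   uncovered lines and branches. Do not modify the source file."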

The longer answer is that the slowness is the point. Thorough characterization tests are the only thing preventing a refactor from changing behavior in ways neither you nor Codex noticed. If you need faster turnaround, reduce the scope of the refactor, not the tests.

Q: Should I use one AGENTS.md or multiple?

For projects up to ~50K lines: one AGENTS.md at the root is sufficient. Add a workflow instructions section as described above.

For monorepos or larger projects: root AGENTS.md for universal rules, package-level AGENTS.md for package-specific workflow instructions. Keep the root file under 5 KB so it leaves room for package files within the 32 KiB budget.

For teams with different AGENTS.md in each developer’s ~/.codex/: this is fine and expected. Personal AGENTS.md handles individual preferences; project AGENTS.md handles project specifics. Both are correct in their respective domains.


The patterns above are not exhaustive — they are the ones that have proven reliable enough to build real workflows around. As Codex CLI continues to evolve, the execution patterns will shift, but the underlying principle will not: the more precisely you define the task, the constraints, and the verification method, the more consistently Codex delivers useful output.

For more AGENTS.md examples from real repositories, browse the rules gallery. For the configuration fundamentals that underpin everything in this guide, see the AGENTS.md setup guide.

Browse Rules