Required Testing Behavior

Mandatory behaviors for AI agents when testing frontend code.

The Five Core Rules

Every AI agent working on frontend code must follow these five rules when testing:

Rule 1: Declare Your Test Plan First

Before writing any test or running any verification, state:

What you are testing
Which level you are using
Why that level is appropriate

Test plan:
- Testing: notification banner visibility after CSS fix
- Level: 5 (verify-ui + screenshot)
- Reason: This is a visual bug — the element exists in DOM but
  user reports it's not visible. Need to check computed styles.

Note

Declaring the test plan prevents the common mistake of defaulting to Level 1 out of habit. It forces conscious level selection.

Rule 2: Match Test Level to Goal

The test level must match what you are actually verifying:

If you are verifying...	Use at least...
A function returns the right value	Level 1
A component renders correct elements	Level 2
Build output contains expected content	Level 3
A user flow works in the browser	Level 4
Something is visually correct	Level 5
A canvas / photo-editor surface where L4 is intractable AND L5 cannot reach	Level 6 (final resort, not for CI)

Do not use Level 1 for a Level 5 problem. The test will pass, but the bug will remain. Do not jump to Level 6 just because L4 is hard or L5 is awkward — L6 requires that both lower tiers are genuinely intractable.

Rule 3: Escalate When Lower Levels Pass But Problem Persists

When a test passes but the user says the problem is not fixed:

Do not re-run the same test
Do not suggest the user clear their cache
Do escalate to the next testing level
If already at Level 4, escalate to Level 5
If at Level 5, investigate deeper (check ancestor elements, stacking context, etc.)
Only escalate to Level 6 (AI-based) when both L4 is intractable to write and L5 cannot reach the assertion. L6 is not the next step after L5 by default — it is reserved for canvas / photo-editor surfaces where no deterministic check is feasible.

Warning

Never suggest "clear browser cache" or "hard refresh" as a solution. If the user says it is still broken, the code is still broken. Investigate the actual cause.

Rule 4: Default to Level 5 for UI/CSS

Any task involving:

CSS changes
Layout modifications
Visibility issues
Visual appearance
Responsive design
Spacing, colors, fonts

should default to Level 5 verification. Lower levels are structurally unable to verify CSS correctness.

Rule 5: Report What Was NOT Tested

After testing, explicitly state the blind spots:

Verification complete:
- Tested: computed styles confirm banner has display:block,
  opacity:1, and parent has overflow:visible
- Screenshot: banner is visible at top of page
- NOT tested: responsive behavior at mobile breakpoints
- NOT tested: animation transition timing

This transparency helps the user decide if additional testing is needed.

The Four Execution Rules

Agents operating inside a tiered test suite must also follow these four rules. For background on verification artifacts, tier tags, and the graduation lifecycle, see Execution Tiers.

Rule 6 — Verification Tests Do Not Graduate Themselves

Agent-authored verification specs are tagged @verification and excluded from all gates. Promotion to a tier is an explicit, deliberate step:

Remove the @verification tag
Assign the appropriate tier tag (e.g., @heavy, @gpu)
Confirm the spec fits within the target suite's time budget

An agent never promotes its own verification test. It proposes promotion in the PR description and leaves the decision to the reviewer.

# PR description excerpt
## Verification spec — graduation proposal
- spec: tests/banner-render.spec.ts
- currently: @verification (excluded from gates)
- proposed tier: T1 (untagged CI-safe e2e)
- rationale: stable for 5 runs, sub-2s, no platform dependency
- action needed: reviewer removes @verification tag and assigns tier

Note

The concept of verification artifacts versus regression gates, and the full graduation lifecycle, lives on the Execution Tiers page. This rule carries only the enforceable agent behavior: tag it, propose promotion, never self-promote.

Rule 7 — Red Checks Block the Author

Any red check on a PR you authored is blocking for you, regardless of whether the failing check is marked required in the repository settings. The only exception: the failing test carries @flaky with a linked open issue.

# Blocked — do not merge
PR #42 authored by agent-session-A
  ✗  unit: auth-service.test.ts — 1 failed
     (not a required check, but authored by this session)
  → investigate and fix before declaring done

# Not blocked — exception applies
  ✗  e2e: flaky-animation.spec.ts @flaky — see issue #38
     (quarantined; allowed-to-fail in scheduled tier)
  → merge may proceed; issue #38 tracks the fix

Warning

"The check is not required" is not a valid reason to ignore a red result on your own PR. Required-check status is a CI configuration detail — your authorship obligation is independent of it.

Rule 8 — Never Game the Gate

Never introduce any of the following without a linked open issue reference:

test.skip on an existing test
@flaky tag on a test that was previously passing
Loosened screenshot tolerance (e.g., raised pixel threshold)
Deleted assertions

Making a gate pass by editing existing assertions requires a fresh-context review of the test diff — do not apply such edits in the same session that authored the original change.

# Wrong — no issue reference
test.skip('banner renders correctly', () => { ... })

# Wrong — no issue reference
expect(page).toHaveScreenshot({ maxDiffPixels: 500 }) // was 50

# Right
test.skip('banner renders correctly', () => { ... })
// @flaky: issue #55 — animation timing non-deterministic on CI

Rule 9 — Scoped Heavy Verification

When a change touches code that is covered only by heavy-lane tests (tagged @heavy, @gpu, @interactive, or @macos-only), run the related heavy specs on a capable host before declaring the work done. Alternatively, dispatch the exam workflow on the branch.

See Heavy Test Decision Rule for how to identify which tests count as heavy-lane and which host is required.

# Before declaring done — change touches GPU rendering path
$ gh workflow run exam.yml --ref my-feature-branch
# or, if running locally on a capable host:
$ pnpm exec playwright test --grep "@gpu" tests/gpu-render.spec.ts

Warning

Declaring work done while heavy-lane coverage for the changed code is unrun is equivalent to skipping the gate. The scheduled re-exam will catch it, but the delay creates a regression window.

Summary Checklist

Before declaring any fix complete:

[ ] Test plan was declared before testing
[ ] Test level matches the nature of the change
[ ] If test passes but user reports failure, escalated to next level
[ ] For CSS/visual changes, used Level 5
[ ] Reported what was and was not tested
[ ] New verification specs are tagged @verification and not self-promoted
[ ] All red checks on own PR are resolved or carry @flaky with a linked issue
[ ] No test.skip, @flaky, or loosened tolerances introduced without a linked issue
[ ] If change touches heavy-lane-only code, related heavy specs were run on a capable host

Required Testing Behavior

The Five Core Rules

Rule 1: Declare Your Test Plan First

Rule 2: Match Test Level to Goal

Rule 3: Escalate When Lower Levels Pass But Problem Persists

Rule 4: Default to Level 5 for UI/CSS

Rule 5: Report What Was NOT Tested

The Four Execution Rules

Rule 6 — Verification Tests Do Not Graduate Themselves

Rule 7 — Red Checks Block the Author

Rule 8 — Never Game the Gate

Rule 9 — Scoped Heavy Verification

Summary Checklist

Revision History