Required Testing Behavior
Mandatory behaviors for AI agents when testing frontend code.
The Five Core Rules
Every AI agent working on frontend code must follow these five rules when testing:
Rule 1: Declare Your Test Plan First
Before writing any test or running any verification, state:
What you are testing
Which level you are using
Why that level is appropriate
Test plan:
- Testing: notification banner visibility after CSS fix
- Level: 5 (verify-ui + screenshot)
- Reason: This is a visual bug — the element exists in DOM but
user reports it's not visible. Need to check computed styles.Note
Declaring the test plan prevents the common mistake of defaulting to Level 1 out of habit. It forces conscious level selection.
Rule 2: Match Test Level to Goal
The test level must match what you are actually verifying:
| If you are verifying... | Use at least... |
|---|---|
| A function returns the right value | Level 1 |
| A component renders correct elements | Level 2 |
| Build output contains expected content | Level 3 |
| A user flow works in the browser | Level 4 |
| Something is visually correct | Level 5 |
| A canvas / photo-editor surface where L4 is intractable AND L5 cannot reach | Level 6 (final resort, not for CI) |
Do not use Level 1 for a Level 5 problem. The test will pass, but the bug will remain. Do not jump to Level 6 just because L4 is hard or L5 is awkward — L6 requires that both lower tiers are genuinely intractable.
Rule 3: Escalate When Lower Levels Pass But Problem Persists
When a test passes but the user says the problem is not fixed:
Do not re-run the same test
Do not suggest the user clear their cache
Do escalate to the next testing level
If already at Level 4, escalate to Level 5
If at Level 5, investigate deeper (check ancestor elements, stacking context, etc.)
Only escalate to Level 6 (AI-based) when both L4 is intractable to write and L5 cannot reach the assertion. L6 is not the next step after L5 by default — it is reserved for canvas / photo-editor surfaces where no deterministic check is feasible.
Warning
Never suggest "clear browser cache" or "hard refresh" as a solution. If the user says it is still broken, the code is still broken. Investigate the actual cause.
Rule 4: Default to Level 5 for UI/CSS
Any task involving:
CSS changes
Layout modifications
Visibility issues
Visual appearance
Responsive design
Spacing, colors, fonts
should default to Level 5 verification. Lower levels are structurally unable to verify CSS correctness.
Rule 5: Report What Was NOT Tested
After testing, explicitly state the blind spots:
Verification complete:
- Tested: computed styles confirm banner has display:block,
opacity:1, and parent has overflow:visible
- Screenshot: banner is visible at top of page
- NOT tested: responsive behavior at mobile breakpoints
- NOT tested: animation transition timingThis transparency helps the user decide if additional testing is needed.
The Four Execution Rules
Agents operating inside a tiered test suite must also follow these four rules. For background on verification artifacts, tier tags, and the graduation lifecycle, see Execution Tiers.
Rule 6 — Verification Tests Do Not Graduate Themselves
Agent-authored verification specs are tagged @verification and excluded from all gates. Promotion to a tier is an explicit, deliberate step:
Remove the
@verificationtagAssign the appropriate tier tag (e.g.,
@heavy,@gpu)Confirm the spec fits within the target suite's time budget
An agent never promotes its own verification test. It proposes promotion in the PR description and leaves the decision to the reviewer.
# PR description excerpt
# # Verification spec — graduation proposal
- spec: tests/ banner- render. spec. ts
- currently: @verification (excluded from gates)
- proposed tier: T1 (untagged CI- safe e2e)
- rationale: stable for 5 runs, sub- 2s, no platform dependency
- action needed: reviewer removes @verification tag and assigns tierNote
The concept of verification artifacts versus regression gates, and the full graduation lifecycle, lives on the Execution Tiers page. This rule carries only the enforceable agent behavior: tag it, propose promotion, never self-promote.
Rule 7 — Red Checks Block the Author
Any red check on a PR you authored is blocking for you, regardless of whether the failing check is marked required in the repository settings. The only exception: the failing test carries @flaky with a linked open issue.
# Blocked — do not merge
PR #42 authored by agent-session-A
✗ unit: auth-service.test.ts — 1 failed
(not a required check, but authored by this session)
→ investigate and fix before declaring done
# Not blocked — exception applies
✗ e2e: flaky-animation.spec.ts @flaky — see issue #38
(quarantined; allowed-to-fail in scheduled tier)
→ merge may proceed; issue #38 tracks the fixWarning
"The check is not required" is not a valid reason to ignore a red result on your own PR. Required-check status is a CI configuration detail — your authorship obligation is independent of it.
Rule 8 — Never Game the Gate
Never introduce any of the following without a linked open issue reference:
test.skipon an existing test@flakytag on a test that was previously passingLoosened screenshot tolerance (e.g., raised pixel threshold)
Deleted assertions
Making a gate pass by editing existing assertions requires a fresh-context review of the test diff — do not apply such edits in the same session that authored the original change.
# Wrong — no issue reference
test. skip('banner renders correctly', () = > { . . . })
# Wrong — no issue reference
expect(page). toHaveScreenshot({ maxDiffPixels: 500 }) / / was 50
# Right
test. skip('banner renders correctly', () = > { . . . })
/ / @flaky: issue # 55 — animation timing non- deterministic on CIRule 9 — Scoped Heavy Verification
When a change touches code that is covered only by heavy-lane tests (tagged @heavy, @gpu, @interactive, or @macos-only), run the related heavy specs on a capable host before declaring the work done. Alternatively, dispatch the exam workflow on the branch.
See Heavy Test Decision Rule for how to identify which tests count as heavy-lane and which host is required.
# Before declaring done — change touches GPU rendering path
$ gh workflow run exam. yml - - ref my- feature- branch
# or, if running locally on a capable host:
$ pnpm exec playwright test - - grep "@gpu" tests/ gpu- render. spec. tsWarning
Declaring work done while heavy-lane coverage for the changed code is unrun is equivalent to skipping the gate. The scheduled re-exam will catch it, but the delay creates a regression window.
Summary Checklist
Before declaring any fix complete:
[ ] Test plan was declared before testing
[ ] Test level matches the nature of the change
[ ] If test passes but user reports failure, escalated to next level
[ ] For CSS/visual changes, used Level 5
[ ] Reported what was and was not tested
[ ] New verification specs are tagged
@verificationand not self-promoted[ ] All red checks on own PR are resolved or carry
@flakywith a linked issue[ ] No
test.skip,@flaky, or loosened tolerances introduced without a linked issue[ ] If change touches heavy-lane-only code, related heavy specs were run on a capable host