Heavy Test Decision Rule
What to do when an E2E test is too heavy for PR CI — fix, demote, or delete, then classify why it is heavy and assign the right tier.
The Question This Page Answers
Sooner or later, every project with E2E tests hits the same moment: a test looks valuable, but it is too slow, too hardware-dependent, or too unreliable for the PR gate. The instinctive question is "where else can this test run?" — but that is the second question. Where and when a test runs is a tier question, answered by Execution Tiers. The first question is whether the test deserves a tier at all.
Every heavy test leaves this page through one of three exits:
Fix — keep the test, classify why it is heavy, and assign it the matching tier (if the test is flaky, see Flake Root-Cause Catalog & Deflaking Recipe)
Demote — rewrite the assertion at a lower testing level
Delete — remove the test and watch what happens
Warning
A strategy that only finds homes for heavy tests will faithfully preserve tests that should die. If the decision rule has no demote and delete branches, every slow, redundant, or worthless test gets carried into a scheduled tier and runs forever at real cost. Branch 0 exists to filter those tests out first.
Branch 0: Before Assigning Any Tier
Before classifying a heavy test by its heaviness, ask two questions in order.
(a) Is the assertion expressible at a lower level?
If what the test actually asserts can be expressed by a component test or a unit test, demote it. A 90-second browser flow that ends in "the computed total equals 42" is a unit test wearing an E2E costume. Demotion makes the assertion faster, more deterministic, and cheaper — with no loss of meaning.
(b) Has it never caught a real regression, and is it not a critical user journey?
If both are true, delete it — then watch the production bug rate for roughly three months. If nothing surfaces that the test would have caught, it was dead weight. If something does, you have learned exactly which assertion matters, and you can write a better, cheaper test for it.
Only survivors of Branch 0 get a tier.
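The two Branch 0 questions can be read as a small ordered check. The sketch below is one possible encoding, not a prescribed implementation; the `HeavyTestFacts` shape and the `branchZero` name are hypothetical and exist only for illustration.

```typescript
// Hypothetical input shape: the field names are illustrative, not from any tool.
interface HeavyTestFacts {
  expressibleAtLowerLevel: boolean; // (a) could a unit or component test assert the same thing?
  hasCaughtRealRegression: boolean; // (b), first half
  isCriticalUserJourney: boolean;   // (b), second half
}

type Branch0Exit = "demote" | "delete" | "assign-tier";

function branchZero(facts: HeavyTestFacts): Branch0Exit {
  // Question (a) is asked first, so a possible lower-level rewrite wins.
  if (facts.expressibleAtLowerLevel) return "demote";
  // Question (b): delete only when BOTH conditions hold.
  if (!facts.hasCaughtRealRegression && !facts.isCriticalUserJourney) return "delete";
  // Everything else survives Branch 0 and gets classified for a tier.
  return "assign-tier";
}
```

Keeping the checks in this order mirrors the text: a test that could be demoted never reaches the delete question.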
The Decision Flow
Classify by Why It Is Heavy
"Too heavy" is not one condition. There are four distinct reasons a test can feel too heavy for PR CI, and each demands a different treatment. Conflating them is how projects end up with tests that are local-only (bypassable), deleted out of frustration, or scheduled when they could have stayed on the PR.
1. Slow but CI-capable
The test runs fine on CI runners — it just takes long. This category stays on CI:
Tune test-runner workers first; parallelism is the cheapest win
Shard across runners only past roughly 100 tests and 30 minutes of runtime
Use --only-changed as a lossy PR prefilter, backstopped by a scheduled full run
Tag: @heavy. Never make this category local-only — slowness alone is not a reason to leave the enforced gate.
2. Environment-incapable
The test needs hardware the CI runner does not have: a real GPU, a hardware video encoder. Software rendering on a standard CI runner produces different pixels, so pixel-level assertions fail for environmental reasons, not product reasons. Case B — a canvas/GPU-heavy pattern-generation web app whose pixel-level specs fail on software-rendering CI runners — is the canonical example.
The treatment is the scheduled tier (T3) on capable hardware, plus the local heavy lane.
Note
Demotion does not help here. A component test runs in the same GPU-less environment as the E2E test — moving the assertion down a level changes nothing about the hardware it executes on. This is the one category where the lower-level rewrite, normally the cheapest exit, is structurally unavailable.
3. Platform-incapable
The test is only trustworthy on a specific platform: real OS keyboard delivery, a native webview, macOS-only behavior. Case C — a Tauri text-editor app whose keyboard-shortcut e2e specs are only trustworthy on real WebKit/macOS — is the canonical example.
The treatment is a layered split, not one monolithic heavy suite:
Mocked-IPC frontend tests: the webview UI logic against a mocked native bridge; CI-safe, stays in the PR gate
Native-side mock-runtime tests: the native layer's logic against a mock runtime; also CI-safe
A thin native-integration layer: the few specs that genuinely need the real platform; tagged @interactive/@macos-only, run as a T3 scheduled macOS job with on-demand workflow_dispatch for pre-merge escalation
Most of the suite's protection moves into the CI-safe layers; only the thin remainder is scheduled. See Scheduled Re-exam and Night Exam for how that scheduled job is built.
4. Flaky — Not a Heaviness Category at All
A test that fails randomly is not heavy — it is broken. Flakiness routinely gets misfiled as heaviness ("it needs retries, it slows the pipeline down") and shipped off to a scheduled tier, where it keeps flaking with even less visibility. Flaky tests go to the quarantine pipeline below — with a deadline, not a new home.
Tag Taxonomy
The classification result is recorded as a tag on the test itself, so the tier mapping stays mechanical and grep-able:
| Tag | Meaning | Tier |
|---|---|---|
| (untagged) | CI-safe; the default for every new e2e test | T1/T2 |
| @smoke | critical-journey subset (optional) | T1 |
| @heavy | slow but CI-capable | T2/T3 |
| @gpu | needs hardware GPU / video encoder | T3 + local heavy |
| @interactive | needs real keyboard / shortcut engine | T3 (macOS) + local heavy (macOS) |
| @macos-only | trustworthy only on real macOS | T3 (macOS) |
| @flaky | quarantined; inline issue URL required | T3 allowed-to-fail |
| @verification | one-time proof artifact | no gate |
@verification marks one-time proof artifacts — specs written to prove a change worked once, not to guard it forever. They belong to no gate; see Required Testing Behavior for how promotion from verification to regression must happen.
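Because the tags live in the test title, the tier mapping can be derived with a plain lookup. The following is a minimal sketch under that assumption; the tier strings are copied from the table above, while `TIER_BY_TAG`, `tagsInTitle`, and `tiersForTitle` are hypothetical helper names.

```typescript
// Tier strings copied from the taxonomy table; helper names are illustrative.
const TIER_BY_TAG: Record<string, string> = {
  "@smoke": "T1",
  "@heavy": "T2/T3",
  "@gpu": "T3 + local heavy",
  "@interactive": "T3 (macOS) + local heavy (macOS)",
  "@macos-only": "T3 (macOS)",
  "@flaky": "T3 allowed-to-fail",
  "@verification": "no gate",
};

const UNTAGGED_TIER = "T1/T2";

// Extract known tags from a title such as "drag preview follows cursor @flaky".
// Unknown @words are ignored rather than treated as tags.
function tagsInTitle(title: string): string[] {
  const found = title.match(/@[a-z][a-z-]*/g) ?? [];
  return found.filter((tag) => tag in TIER_BY_TAG);
}

// An untagged title falls back to the CI-safe default.
function tiersForTitle(title: string): string[] {
  const tags = tagsInTitle(title);
  return tags.length === 0 ? [UNTAGGED_TIER] : tags.map((tag) => TIER_BY_TAG[tag]);
}
```

For example, `tiersForTitle("save shortcut @interactive @macos-only")` yields both macOS tiers, while a title with no tag resolves to the default.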
Classifying a New Test: Mechanical Rules
For a new test, classification must not require judgment. Apply the first rule that matches:
Reads pixels or depends on GPU timing → @gpu
Needs the OS to deliver real keyboard events → @interactive
Runtime over ~60 seconds, or a multi-minute flow → @heavy
Otherwise → untagged
The default matters most: a new e2e test is untagged and CI-safe unless one of these rules forces a tag. An agent (or a developer) never starts from "which special tier does my test deserve?" — it starts from "my test runs in the PR gate" and tags only when a rule fires.
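The first-match rules above translate directly into an ordered chain of checks. This is a sketch only: the `NewTestFacts` fields and the `tagForNewTest` name are hypothetical, and the 60-second constant is the approximate threshold from the rules, not a measured value.

```typescript
// Hypothetical description of a new test; field names are illustrative.
interface NewTestFacts {
  readsPixelsOrGpuTiming: boolean;
  needsRealKeyboardEvents: boolean;
  runtimeSeconds: number;
}

// The "~60 seconds" threshold from the rules; a multi-minute flow exceeds it too.
const HEAVY_RUNTIME_SECONDS = 60;

type NewTestTag = "@gpu" | "@interactive" | "@heavy" | null;

// Apply the first rule that matches; null means "untagged", the CI-safe default.
function tagForNewTest(facts: NewTestFacts): NewTestTag {
  if (facts.readsPixelsOrGpuTiming) return "@gpu";
  if (facts.needsRealKeyboardEvents) return "@interactive";
  if (facts.runtimeSeconds > HEAVY_RUNTIME_SECONDS) return "@heavy";
  return null;
}
```

Rule order matters: a slow test that also reads pixels is tagged @gpu, because the hardware constraint decides where it can run at all.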
The Quarantine Pipeline for @flaky
Quarantine is a pipeline with an exit, not a parking lot:
Step 0 of quarantine: verify that the test has genuinely passed at least once (not pass-by-skip) on some host. If it cannot pass, it is broken, not flaky: fix or delete it immediately, with no quarantine.
Before entering the pipeline, confirm there is at least one verifiable green run of the exact assertion on any CI host. Pass-by-skip does not count. Playwright reports skips as - rather than ✓, so check the distinction carefully. A test that has never produced a genuine green run is not flaky; it is broken. Sending it to quarantine would suspend product coverage without any chance of recovery. Note that this step is distinct from Branch 0, which asks whether the test deserves a tier at all; Step 0 asks whether the test can pass anywhere. See Playwright Patterns for notes on test.skip preconditions: a precondition that "should always hold" must be a hard assertion, not a conditional skip that can silently degrade into a permanent vacuous green.
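Step 0 amounts to scanning a test's run history for at least one genuine pass. The sketch below assumes the history is already available as a list of statuses; the status names are a simplified, assumed subset, and `stepZero` is a hypothetical helper.

```typescript
// Simplified, assumed status set; a skip is deliberately distinct from a pass.
type RunStatus = "passed" | "failed" | "timedOut" | "skipped";

type StepZeroResult = "broken" | "eligible-for-quarantine";

// A test with no genuine green run anywhere is broken, not flaky.
// Skipped runs never count as green, however many there are.
function stepZero(history: RunStatus[]): StepZeroResult {
  const hasGenuineGreen = history.some((status) => status === "passed");
  return hasGenuineGreen ? "eligible-for-quarantine" : "broken";
}
```

A history of nothing but skips and failures returns "broken", which routes the test to fix or delete instead of quarantine.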
Tagging requires a paper trail. @flaky is only valid with an inline issue URL right next to it:

```ts
import { test } from "@playwright/test";

// quarantined: https://github.com/your-org/your-app/issues/123
test("drag preview follows cursor @flaky", async ({ page }) => {
  // ...
});
```

Excluded from strict gates. A quarantined test must not block a PR; otherwise quarantine means nothing.
It still runs, allowed-to-fail, in the scheduled tier. A @flaky tag that runs nowhere is a graveyard: results stop being collected and the test silently rots. Running allowed-to-fail in T3 keeps fresh failure data flowing into the tracking issue.
Fix, demote, or delete, with a deadline. The exits are the same three as Branch 0. If nobody has fixed it by the deadline, it was not worth fixing: demote it or delete it. For the Fix path, see Flake Root-Cause Catalog & Deflaking Recipe, which covers the five causes of E2E flakiness and the mechanical steps to eliminate each one.
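The paper-trail requirement is mechanical enough to check automatically. The following sketch scans a spec file's text and reports @flaky tests that lack the comment directly above them. It assumes the comment format `// quarantined: <issue URL>` described in this pipeline and assumes a tagged title sits on the same line as its `test(` call; `flakyWithoutIssue` is a hypothetical helper, not part of Playwright.

```typescript
// Returns the 1-based line numbers of @flaky tests with no issue comment on the line above.
function flakyWithoutIssue(source: string): number[] {
  const lines = source.split("\n");
  const issueComment = /^\s*\/\/\s*quarantined:\s*https?:\/\/\S+/;
  const offenders: number[] = [];

  lines.forEach((line, index) => {
    // Matches test(...) as well as variants such as test.only(...).
    const isFlakyTest = /\btest(\.\w+)?\(/.test(line) && line.includes("@flaky");
    if (!isFlakyTest) return;
    const previous = index > 0 ? lines[index - 1] : "";
    if (!issueComment.test(previous)) offenders.push(index + 1);
  });

  return offenders;
}
```

A check like this could run in the PR gate so that an undocumented quarantine is rejected before it merges.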
Tip
A test that only passes on retry belongs in this pipeline too — pass-on-retry is a triage signal, not a success.
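Treating pass-on-retry as a triage signal means looking at every attempt, not just the final one. A minimal sketch, assuming each test's attempts are available in order; the `triageAttempts` name and the result labels are hypothetical.

```typescript
type AttemptStatus = "passed" | "failed";
type Triage = "clean-pass" | "pass-on-retry" | "failed";

// Classify one test from its ordered attempts (first run, then any retries).
function triageAttempts(attempts: AttemptStatus[]): Triage {
  const last = attempts[attempts.length - 1];
  // No attempts, or a failing final attempt, counts as a failure.
  if (last !== "passed") return "failed";
  // A final pass preceded by any failure is a quarantine candidate, not a success.
  const failedEarlier = attempts.slice(0, -1).some((status) => status === "failed");
  return failedEarlier ? "pass-on-retry" : "clean-pass";
}
```

Only "clean-pass" should be read as green; "pass-on-retry" feeds the quarantine pipeline described above.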
Note
Quarantining a Rust/cargo flake — the same pipeline, no test-title grep. The pipeline above is described with Playwright's @flaky-title-substring convention, but the three mechanics are language-agnostic. On a Rust/cargo project, #[ignore] provides all three directly:
```rust
// quarantined: https://github.com/your-org/your-app/issues/123
#[test]
#[ignore = "flaky: https://github.com/your-org/your-app/issues/123"]
fn drag_preview_follows_cursor() {
    // ...
}
```

Paper trail. The #[ignore = "..."] reason string keeps the mandatory inline issue URL right next to the test; it is the cargo analogue of the @flaky-title-substring convention.
Excluded from the strict gate. cargo test skips #[ignore] tests by default, so a quarantined flake no longer blocks the gate. This is the analogue of --grep-invert "@flaky".
It still runs somewhere, allowed-to-fail. cargo test -- --ignored runs the ignored set and cargo test -- --include-ignored runs everything; schedule one as a T3 allowed-to-fail job so fresh failure data keeps flowing into the tracking issue. Caveat: --ignored matches every #[ignore] test, so if your suite also uses #[ignore] for slow or manual tests, narrow the run with a name filter (e.g. cargo test flaky_ -- --ignored over a naming convention) so unrelated ignored tests do not pollute the flake signal. #[ignore] is a broader marker than the flake-specific @flaky title substring.
Warning
Quarantine suspends product coverage, not just test coverage. While a test is quarantined, the behavior it guards is unguarded — regressions in that behavior will not be caught until the test is fixed. Treat every quarantine fix as two verifications: first confirm the test mechanics are sound, then confirm the product behavior itself still holds. A fixed flake that immediately fails for a different reason is the pipeline working correctly, not a fix gone wrong.
Worked case: a drag-and-drop spec was quarantined because Playwright-WebKit never reliably commits native HTML5 drops — a mechanical flake. While it sat in quarantine, the asserted behavior had silently broken: the feature's backend was deleted in a refactor, and the regression shipped unnoticed for days.
Where to Go Next
Execution Tiers: the definitions of T0–T4 and when a project should adopt each one
Scheduled Re-exam and Night Exam: how the scheduled tier (T3) is actually built (cron, macOS runners, deduped issue filing, on-demand dispatch)
Required Testing Behavior: the agent-facing rules (never tag @flaky without a linked open issue, promote verification specs explicitly, and more)