Scheduled Re-exam & Night Exam

Operating heavy, platform-bound test lanes on a schedule -- the CI re-exam workflow, deduped failure issues, on-demand dispatch, and the local night exam.

Why a Local-Only Heavy Lane Fails as the Safety Net

Some test lanes can never run on PR CI: pixel assertions that need a hardware GPU, keyboard-shortcut delivery that is only trustworthy on a real OS. The Heavy Test Decision Rule covers how a test gets classified that way; this page covers what to do operationally once it has. The tempting answer is "run those lanes on the developer's machine before pushing" -- the local heavy lane, tier T4. As a convenience, that is fine. As the only safety net, it fails in three ways:

It is bypassable. A local gate is a script someone may or may not run. Under deadline pressure, on a borrowed machine, or behind a --no-verify, it silently does not happen.
It is machine-dependent. Consider a Tauri text-editor app whose keyboard-shortcut e2e specs are only trustworthy on real WebKit/macOS: on a Linux/WSL2 host the same suite false-reds dozens of specs, while a real macOS machine is the gold standard. Whether the local gate even means anything depends on whose machine ran it.
It leaves no paper trail. Nobody can answer "when did this lane last pass, and on what hardware?" A green that only ever existed in one terminal's scrollback is not a regression gate.

Note

AI agents are "the other contributors". Every argument above used to be about teammates; now it is also about coding agents. An agent working in a sandbox, in CI, or on a Linux host is exactly the contributor who cannot run your macOS-only lane -- and it will bypass the local gate without malice, every time. A safety net that requires the right person on the right machine to remember a step is not a net.

The fix is not to abandon the local lane -- it is to stop asking it to be the enforcement layer. Keep it for speed and convenience, and add a scheduled CI re-exam: the heavy lanes re-run on capable hardware on a schedule, with a paper trail and automatic issue filing.

Two Commands, Two Budgets: b4push and exam

Before building the scheduled tier, split the local commands. The pre-push convenience pass and the whole-regression heavy run are different jobs and need different names and different time budgets:

Command	Contents	Budget	Who runs it
`b4push`	lint gates, typecheck, affected unit tests, build, CI-safe smoke	bounded, 5--10 min	everyone, before every push, on any machine
`exam`	the whole-regression heavy run: GPU, WebKit/macOS, long flows	open-ended	opt-in and platform-gated: scheduled CI, or the capable machine at night

The split prevents a specific failure mode: one command slowly accreting both jobs. The moment the pre-push pass exceeds its budget, people -- and agents -- start skipping it, and then nothing runs before push. exam is allowed to be slow precisely because nobody sits waiting for it; b4push stays trusted precisely because it is fast.

The name is deliberate. The project periodically re-takes its whole exam, instead of pretending every push can afford to.

When b4push exceeds its budget

If b4push creeps past its budget, trim it in this order -- and require per-step timing output so the gate stays observable, not aspirational:

Full e2e → @smoke subset. Keep only the critical journeys plus the suites that cover areas the current diff touches. Criteria: if a full e2e pass takes more than a minute or two, scope it down; the scheduled exam is the backstop for everything else.
Full unit run → affected-only. Use turborepo/nx affected or per-package filters to run only the packages the diff reaches. A broad unit pass that reruns every package on every push is the most common accretion pattern.
Docs/site builds → CI-only. Drop the local docs build from b4push; let PR CI own it. A docs build that nobody waits on locally still contributes to budget creep.

Require --reporter=verbose (or equivalent) at each step so timing is visible in the log. If the trimmed b4push still exceeds its budget, resist the temptation to quietly raise the time limit. If the gate genuinely takes 25 minutes, name it honestly and measure it: an enforced 25-minute gate beats an aspirational 10-minute one that gets skipped. The real failure mode is a budget that exists only in the README.
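One way to keep the budget observable is a small wrapper that times each step and compares the running total against the limit. This is a minimal sketch in POSIX sh, not a prescribed implementation: timed_step, check_budget, and B4PUSH_BUDGET_SECONDS are hypothetical names, and the pnpm script names in the usage comment are placeholders for whatever the project defines.

```shell
# b4push timing helpers -- illustrative sketch; every name here is hypothetical.
B4PUSH_BUDGET_SECONDS="${B4PUSH_BUDGET_SECONDS:-600}"   # 10 min, the top of the 5--10 min budget
B4PUSH_TOTAL_SECONDS=0

# timed_step <label> <command...>
# Runs one gate step, prints its wall-clock duration, and returns the step's exit status.
timed_step() {
  step_label="$1"; shift
  step_start="$(date +%s)"
  step_status=0
  "$@" || step_status=$?
  step_elapsed=$(( $(date +%s) - step_start ))
  B4PUSH_TOTAL_SECONDS=$(( B4PUSH_TOTAL_SECONDS + step_elapsed ))
  echo "[b4push] ${step_label}: ${step_elapsed}s (exit ${step_status})"
  return "$step_status"
}

# check_budget
# Prints the accumulated time and succeeds only if it is within the budget.
check_budget() {
  echo "[b4push] total: ${B4PUSH_TOTAL_SECONDS}s of ${B4PUSH_BUDGET_SECONDS}s"
  [ "$B4PUSH_TOTAL_SECONDS" -le "$B4PUSH_BUDGET_SECONDS" ]
}

# Usage inside the b4push script (script names are placeholders):
#   timed_step "lint" pnpm lint || exit 1
#   timed_step "typecheck" pnpm typecheck || exit 1
#   check_budget || echo "[b4push] over budget -- trim a step"
```

Call timed_step directly rather than inside a command substitution, so the running total survives between steps.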

Heavy-compile / native (cargo, Rust, ...) projects

The cut-order above subsets tests -- it assumes the cost scales with how many specs run. For a native project where the dominant pre-push cost is compilation, that axis is the wrong one. A Rust workspace embedding V8 takes 15--30 min for the first cold cargo build; any compiling step (cargo clippy, cargo test) blows the budget on a cold tree, and there is no turborepo/nx "affected" for cargo to subset along. Trimming test count does not help when the budget is bounded by compile time, not spec count.

The native analogue of the cut-order moves the cut along the compilation axis instead:

The bounded budget presumes a warm incremental tree. It only holds when the prior build's artifacts are still on disk; on a cold tree no test-subsetting recovers it.
Keep the full compiling suite in CI, not b4push. CI is the authoritative T1 gate (see Execution Tiers) and runs the full cargo clippy / cargo test on a warm cache. b4push is not the place to pay first-cold-compile cost.
b4push runs only the non-compiling fast checks -- formatting checks (cargo fmt and the JS formatter), typecheck, JS tests -- plus warm-tree lint (a cargo clippy that reuses the incremental tree, which is cheap when warm and the budget-buster when cold).
Gate the full local compile/test behind an opt-in env flag, e.g. B4PUSH_FULL=1, so the contributor who wants the full local pass can ask for it while the default stays bounded and CI remains the enforcement layer.
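The opt-in gate can be as small as one environment check. The sketch below prints the step plan rather than executing it, so the default and the full variants are easy to compare. b4push_steps is a hypothetical helper, and the pnpm script names are placeholders; only B4PUSH_FULL comes from the text above.

```shell
# Opt-in gate for the full compiling suite -- illustrative sketch.
# b4push_steps prints the commands b4push would run, one per line.
b4push_steps() {
  # Default: non-compiling fast checks plus warm-tree lint.
  echo "cargo fmt --check"
  echo "pnpm typecheck"    # placeholder script name
  echo "pnpm test"         # placeholder script name
  echo "cargo clippy"      # cheap only on a warm incremental tree
  # Opt-in: the full compiling suite, only when explicitly requested.
  if [ "${B4PUSH_FULL:-0}" = "1" ]; then
    echo "cargo test --workspace"
  fi
}
```

Running `B4PUSH_FULL=1 b4push` then adds the full test pass, while a plain `b4push` stays bounded and CI remains the enforcement layer.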

Tip

Same principle as the JS cut-order, different axis: there the cut is by test count, here it is by compilation. The default b4push stays fast and trusted; the cold-compile cost lives in CI, which is the gate that actually blocks the merge.

The Scheduled CI Re-exam Workflow

Note

Scheduled rich CI is a T3 / post-cutover concern: build it at the cutover point when the project's test suite is mature enough to justify a dedicated nightly runner. Until then, the local exam lane described here is the interim. See Execution Tiers for where this fits in the full maturity arc.

A complete skeleton. It runs on a schedule and accepts manual dispatch, executes only the tagged heavy lanes (@gpu, @interactive, @macos-only -- see the tag taxonomy on Execution Tiers), and files a deduped issue on failure:

# .github/workflows/exam.yml
name: exam

on:
  schedule:
    # Nightly at 03:00 UTC -- pick a quiet hour for your timezone
    - cron: "0 3 * * *"
  # On-demand runs for pre-merge escalation (see below)
  workflow_dispatch:

permissions:
  contents: read
  issues: write

jobs:
  exam:
    # GitHub-hosted macOS runners are Apple silicon from macos-14 onward
    runs-on: macos-14
    timeout-minutes: 90
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm exec playwright install --with-deps webkit

      # Run only the tagged heavy lanes -- everything else already ran on PR CI
      # The json reporter feeds file-exam-issue.sh below; list keeps the live log readable
      - name: Run heavy lanes
        env:
          PLAYWRIGHT_JSON_OUTPUT_NAME: playwright-report/report.json
        run: pnpm test:e2e --grep "@gpu|@interactive|@macos-only" --reporter=list,json

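      # Only scheduled runs touch the tracking issue: a manual dispatch against
      # a feature branch must not open, comment on, or close the issue for main
      - name: File or update the failure tracking issue
        if: failure() && github.event_name == 'schedule'
        env:
          GH_TOKEN: ${{ github.token }}
        run: bash scripts/file-exam-issue.sh

      - name: Close the tracking issue on green
        if: success() && github.event_name == 'schedule'
        env:
          GH_TOKEN: ${{ github.token }}
        run: bash scripts/file-exam-issue.sh --green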

Runner Notes

The standard GitHub-hosted macOS runner labels (macos-14 and later) run on Apple silicon: real WebKit, Metal-backed rendering. For a canvas/GPU-heavy web app whose pixel-level specs fail on software-rendering CI runners, this alone can be the difference between false-red and trustworthy.
Third-party hosted macOS providers exist for when the GitHub-hosted pool is too slow or too expensive for the suite.
A self-hosted runner on your own hardware is an escalation, not a default. If you do escalate: schedule-only on main, never PR-triggered on a public repo, and pair it with offline detection -- e.g. a companion job on a hosted runner that alerts when the self-hosted job has not reported within N hours -- so a sleeping machine is visible rather than silently green-by-absence.
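The offline-detection check reduces to comparing the age of the last reported run against a threshold. This is a minimal sketch, assuming the companion job can obtain the last run's timestamp as epoch seconds. is_stale is a hypothetical helper, and the 26-hour threshold in the usage comment is an illustrative value for N, not a recommendation from this page.

```shell
# Staleness check for a self-hosted exam runner -- illustrative sketch.
# is_stale <last_run_epoch> <now_epoch> <max_hours>
# Succeeds (exit 0) when the last run is older than max_hours, i.e. the alert should fire.
is_stale() {
  last_run_epoch="$1"
  now_epoch="$2"
  max_hours="$3"
  age_seconds=$(( now_epoch - last_run_epoch ))
  [ "$age_seconds" -gt $(( max_hours * 3600 )) ]
}

# Usage on a hosted Linux runner (GNU date assumed for the -d conversion):
#   last="$(gh run list --workflow=exam.yml --limit 1 --json createdAt --jq '.[0].createdAt')"
#   is_stale "$(date -d "$last" +%s)" "$(date +%s)" 26 && echo "exam runner has gone quiet"
```

Taking both timestamps as arguments keeps the comparison testable without a live runner or a real clock.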

Warning

Never let PRs trigger a self-hosted runner on a public repository. A PR-triggered self-hosted runner executes code from anyone who opens a pull request, on your machine. Keep self-hosted exam jobs schedule-only on main.

One Tracking Issue, Never One Per Run

A nightly job that stays red for a week must not file seven issues. Per-run filing buries the signal under duplicates and trains everyone to ignore the label. The rule: one open tracking issue per workflow -- comment on it while the failure persists, close it when the exam goes green, and let the next failure open a fresh one.

Both issue steps in the skeleton above call this script: the if: failure() step with no arguments, and the if: success() step with --green.

#!/usr/bin/env bash
# scripts/file-exam-issue.sh -- one tracking issue per workflow, never one per run
set -euo pipefail

LABEL="exam-failure"
RUN_URL="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}"

# DRY_RUN=1: echo the gh command instead of running it -- safe to test against a fixture report.json
run_gh() {
  if [ "${DRY_RUN:-}" = "1" ]; then
    echo "+ gh $*"
  else
    gh "$@"
  fi
}

# --green path: close the tracking issue if one is open, then exit.
# Must come BEFORE reading report.json -- green runs produce no failure report.
if [ "${1:-}" = "--green" ]; then
  EXISTING="$(gh issue list --label "$LABEL" --state open \
    --json number --jq '.[0].number // empty')"
  if [ -n "$EXISTING" ]; then
    run_gh issue comment "$EXISTING" --body "Exam green: ${RUN_URL}"
    run_gh issue close "$EXISTING"
  fi
  # No open issue -- nothing to do
  exit 0
fi

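# Failure path: collect failing spec names from the reporter's output
# (example: Playwright's JSON reporter written to playwright-report/report.json)
REPORT="playwright-report/report.json"
if [ -f "$REPORT" ]; then
  FAILED_SPECS="$(jq -r '.. | objects | select(.ok == false) | .file? // empty' \
    "$REPORT" | sort -u)"
else
  # An earlier step (install, browser download) failed before any spec ran
  FAILED_SPECS="(no report.json -- the run failed before the test step)"
fi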

BODY="Scheduled exam failed.

Run: ${RUN_URL}

Failing specs:

\`\`\`
${FAILED_SPECS}
\`\`\`"

# Is there already an open tracking issue?
EXISTING="$(gh issue list --label "$LABEL" --state open \
  --json number --jq '.[0].number // empty')"

if [ -n "$EXISTING" ]; then
  # Yes: append this run to it -- do NOT open a duplicate
  run_gh issue comment "$EXISTING" --body "$BODY"
else
  # No: create the single tracking issue with the fixed label
  run_gh issue create \
    --title "exam: scheduled heavy run is failing" \
    --label "$LABEL" \
    --body "$BODY"
fi

Every report carries the three things a fix session needs: the fixed label (so the dedup query finds it), the failing spec names, and the run URL.

Note

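Create the exam-failure label once before the first run (gh label create exam-failure); gh issue create fails when asked to apply a label that does not exist. The workflow invokes the script through bash, so the executable bit is optional there; commit it if you also want to run the script directly: git update-index --chmod=+x scripts/file-exam-issue.sh. To test the script locally without touching real issues, run it with DRY_RUN=1 -- the run_gh() wrapper echoes the intended gh command instead of executing it, so you can verify both paths against a fixture report.json.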

Pin Reporter-Parsing Fixtures to a Real Captured Report

The script above parses Playwright's JSON reporter output. If you write a unit test for that parser (or for any script that consumes a tool's machine-generated output), the fixture used in the test must be a real artifact captured from the tool itself -- not a hand-authored shape that matches your assumption of the schema.

The rule: when you unit-test a script that parses a tool's machine output -- a test runner's JSON reporter, a coverage JSON, a bundler stats file -- commit a trimmed sample of the tool's actual output as the fixture. The concrete recipe:

Run the tool once under real conditions.
Save its JSON output, trimmed to a representative subset, as __fixtures__/real-report.json.
Write the parser's unit test against that file.

When the tool changes its schema in a future version, the test fails loudly -- instead of agreeing forever with a fabricated structure that no longer matches reality.

The tell that this trap is in play: the parser's unit tests are green, but the script produces empty or garbage output when run against real data from a fresh tool invocation. This is the classic signature of a fixture that matches the assumption rather than reality.
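The recipe can be wired into a small check that runs the same jq filter against the committed fixture and fails when nothing is extracted. This is a minimal sketch, assuming jq is installed and that the fixture was captured from a run with at least one failing spec. extract_failed_specs and check_parser_against_fixture are hypothetical helper names.

```shell
# Parser check against a captured report -- illustrative sketch.
# extract_failed_specs <report.json>
# Mirrors the jq filter used in scripts/file-exam-issue.sh.
extract_failed_specs() {
  jq -r '.. | objects | select(.ok == false) | .file? // empty' "$1" | sort -u
}

# check_parser_against_fixture <fixture.json>
# Fails if the fixture is missing or the parser extracts nothing from it;
# on success, prints the failing spec files it found.
check_parser_against_fixture() {
  fixture="$1"
  [ -f "$fixture" ] || { echo "missing fixture: $fixture" >&2; return 1; }
  failed="$(extract_failed_specs "$fixture")"
  [ -n "$failed" ] || { echo "parser found no failing specs in $fixture" >&2; return 1; }
  echo "$failed"
}

# Usage:
#   check_parser_against_fixture __fixtures__/real-report.json
```

An empty result from a fixture known to contain failures is the schema-drift signal: the filter no longer matches what the tool emits.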

Warning

Synthetic fixtures are safe only when you own the contract. For a third-party tool's output you cannot author truth -- you can only capture it. Hand-authored fixtures are the right choice when the reader owns the schema being tested: for example, mdast tree factories in remark/rehype plugin tests (you own the mdast contract) or synthetic inner-bundle objects in level-3 build-output tests (you own the bundle shape). For a tool like Playwright, Jest, or Vite whose output schema is theirs to change, capturing a real artifact is the only way to stay honest.

On-Demand Dispatch: The Pre-Merge Escalation

The standing objection to scheduled testing is feedback latency: a regression merged this morning is invisible until tomorrow. workflow_dispatch bounds that objection. When a change touches code that only the scheduled tier covers, do not merge on hope -- dispatch the exam against the branch and wait for the verdict:

# The change touches code covered only by scheduled-tier tests?
# Run the exam on the branch BEFORE merging -- do not wait for tonight's cron.
gh workflow run exam.yml --ref my-feature-branch

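# Follow the run to its verdict; --branch keeps the lookup from picking up
# an unrelated run (the new run can take a few seconds to appear in the list)
gh run watch --exit-status "$(gh run list --workflow=exam.yml \
  --branch my-feature-branch --limit 1 \
  --json databaseId --jq '.[0].databaseId')"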

This turns the scheduled tier's main weakness into a bounded cost: the default feedback loop is nightly, and the changes that genuinely cannot wait get a manual escape hatch.

The Night Exam: A Project-Scope Agent Skill

The scheduled CI job is deliberately thin: run, report, file. The richer version of the same idea runs where the tests are most trustworthy -- on the gold-standard machine itself, overnight, with an agent doing the parts plain CI cannot.

Define it as a project-scope agent skill: a slash-command-style entry point checked into the repository's agent configuration, so the procedure is versioned, reviewable, and identical every night. Invoked manually before sleep:

# Before sleep, on the gold-standard machine
/exam          # run heavy lanes, triage failures, file deduped issues
/exam --fix    # ...and additionally pick up to 3 issues and fix them in-session

The skill's pipeline:

Preflight -- refuse to start unless the tree is clean, the branch is main, and it is up to date with the remote
Keep-awake wrapper -- run under caffeinate -i so the machine does not sleep mid-suite
Run the platform-gated heavy lanes -- the same tags the CI exam runs
Agent triage -- cluster the failures, then separate known environment-noise signatures from real regressions
Deduped issues per failure cluster -- one issue per cluster, not one per spec and not one per run
--fix mode -- pick up to N of the filed issues (3 in the example above) and fix them in-session, fixes ready for morning review
Morning summary -- one message: what ran, what failed, what was noise, what was filed, what was fixed

The first two steps are plain shell:

# Preflight -- refuse to run on a dirty or stale tree
git status --porcelain | grep -q . && { echo "dirty tree"; exit 1; }
[ "$(git branch --show-current)" = "main" ] || { echo "not on main"; exit 1; }
git pull --ff-only || { echo "cannot fast-forward to the remote"; exit 1; }

# Keep the machine awake for the whole run (macOS)
caffeinate -i pnpm test:e2e --grep "@gpu|@interactive|@macos-only"

The triage step is the reason this is an agent skill and not a cron script. On the gold-standard machine a red is probably real, but every long-lived heavy suite accumulates known noise signatures: a first-run font-cache warning, a timing-sensitive first frame after cold boot. The agent clusters the failures, matches them against the noise signatures recorded in the project's agent instructions, and files issues only for the remainder. That judgment call -- "these three reds are one regression, that fourth red is Tuesday's known noise" -- is exactly what a plain CI job cannot do.
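The mechanical half of that triage, matching failures against recorded noise signatures, can be sketched as a pre-filter that the agent's clustering then builds on. This assumes the signatures are kept as one extended regular expression per line in a plain file. split_noise and that file format are hypothetical, not a convention this page defines.

```shell
# Noise-signature pre-filter -- illustrative sketch; the agent does the real clustering.
# split_noise <signatures-file>
# Reads failure lines on stdin and prefixes each with "noise" or "real", depending on
# whether it matches any extended regex listed (one per line) in the signatures file.
split_noise() {
  signatures_file="$1"
  while IFS= read -r failure_line || [ -n "$failure_line" ]; do
    if printf '%s\n' "$failure_line" | grep -Eq -f "$signatures_file"; then
      printf 'noise\t%s\n' "$failure_line"
    else
      printf 'real\t%s\n' "$failure_line"
    fi
  done
}
```

Keep the signatures file free of blank lines: an empty pattern matches every failure and would classify everything as noise.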

Note

Keep the scheduled CI job anyway. The night exam depends on a human remembering to run it and a machine staying awake -- exactly the failure modes that disqualify local-only lanes. The pairing is the point: the night exam is the rich lane with triage and fixes; the scheduled CI re-exam is the thin backstop that runs even when nobody remembered.

Scoped Heavy Runs at Implementation Time

Both the cron and the night exam are after-the-fact: they catch the regression hours after the change landed. While implementing a change that touches code covered only by heavy-lane tests, do not wait for tonight -- run the related heavy specs now, scoped to the change, on the capable host:

# The change touched the shortcut engine -- run just its heavy specs,
# on the capable host, before declaring the work done
pnpm test:e2e --grep "@interactive" e2e/shortcuts-*.spec.ts

The hard part is not running the specs -- it is knowing which specs a change implicates. That requires a change-to-spec mapping, kept mechanical with two conventions:

Issue-numbered spec filenames -- a name like e2e/issue-123-shortcut-paste.spec.ts ties the spec to the change that motivated it and makes it greppable
A module-to-spec table in the project's agent instructions -- so an agent (or a person) can look up which heavy specs a change implicates:

<!-- In the project's agent instructions: change-to-spec mapping -->

| When a change touches... | Run these heavy specs first            |
| ------------------------ | -------------------------------------- |
| src/shortcuts/**         | e2e/shortcuts-*.spec.ts (@interactive) |
| src/render/gpu/**        | e2e/render-*.spec.ts (@gpu)            |
| src/export/video/**      | e2e/export-video.spec.ts (@gpu)        |
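The same table can also be made executable, so the lookup does not depend on anyone reading it. This is a minimal sketch that hard-codes the three example rows above. specs_for_changes is a hypothetical helper, and a real project would keep the patterns in sync with its own table.

```shell
# Change-to-spec lookup -- illustrative sketch mirroring the example table.
# specs_for_changes
# Reads changed file paths on stdin and prints each implicated heavy-spec pattern once.
specs_for_changes() {
  while IFS= read -r changed_path; do
    case "$changed_path" in
      src/shortcuts/*)    echo 'e2e/shortcuts-*.spec.ts' ;;
      src/render/gpu/*)   echo 'e2e/render-*.spec.ts' ;;
      src/export/video/*) echo 'e2e/export-video.spec.ts' ;;
    esac
  done | sort -u
}

# Usage:
#   git diff --name-only origin/main...HEAD | specs_for_changes
```

Paths that match no row produce no output, which means the change has no mapped heavy specs. It does not mean the change is safe.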

Tip

Make the scoped run a stated requirement in the agent instructions, not folklore: "when a change touches code covered only by heavy-lane tests, run the mapped specs on a capable host (or dispatch the exam workflow on the branch) before declaring the work done."

The Layered Result

Surface	Runs	When	On failure
`b4push` (local)	fast bounded pass	before every push	fix before pushing
PR CI	CI-safe gates	every PR	merge blocked
Scheduled exam (CI)	tagged heavy lanes on a macOS runner	nightly cron + manual dispatch	deduped tracking issue
Night exam (local skill)	heavy lanes + agent triage	manually, before sleep	issues per cluster, optional `--fix`
Scoped heavy run (local)	only the specs related to the change	during implementation	fix before declaring done

No single surface is the safety net; the layering is. Which tests belong in the heavy lanes at all is decided by the Heavy Test Decision Rule, and the tier vocabulary (T0--T4) used throughout is defined on Execution Tiers.
