May 26, 2026
Playwright Flaky Test Debugging Checklist: 15 Things to Check Before You Blame the App
A practical Playwright flaky test debugging checklist for selectors, waits, locators, network mocking, CI variance, and test data isolation before you blame the app.
When a Playwright test fails once, passes on rerun, and then fails again in CI, the easiest instinct is to blame the app. Sometimes that is correct. But in practice, flaky automation is often caused by a brittle locator, a timing assumption, a slow environment, or test data that leaked in from a previous run.
If you work in automation long enough, you learn that flakiness is a symptom, not a diagnosis. This checklist is meant to help you isolate where the problem actually lives. Use it as a practical Playwright flaky test debugging checklist before you open a production bug, rewrite the app, or rerun the suite for the fifth time.
A flaky test is usually telling you that the test and the system are disagreeing about what “ready” looks like, or even what element should be acted on.
1. Confirm the failure is truly flaky
Before changing anything, make sure the failure is not deterministic.
Run the test several times with the same commit, the same browser, and the same environment. If it fails every time, it is not flaky, it is broken. If it fails only under a specific condition, that condition is probably the real clue.
What to check:
- Does the test fail on local only, CI only, or both?
- Does it fail only in one browser, such as WebKit or Firefox?
- Does it fail only when the full suite runs, but pass in isolation?
- Does it fail after a retry, or only on a cold start?
A true flaky test often has an environmental trigger, not a random one.
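One way to make that distinction concrete is to tally the outcomes of repeated runs on the same commit. The sketch below is a hypothetical helper, not part of Playwright; the `RunOutcome` type and the `classifyRuns` name are assumptions for illustration. The outcomes themselves could come from repeating the test, for example with Playwright's `--repeat-each` option.

```typescript
// Hypothetical helper: classify repeated runs of one test on one commit.
type RunOutcome = 'passed' | 'failed';
type Verdict = 'stable' | 'flaky' | 'broken' | 'no-data';

function classifyRuns(outcomes: RunOutcome[]): Verdict {
  if (outcomes.length === 0) return 'no-data';
  const failures = outcomes.filter((outcome) => outcome === 'failed').length;
  if (failures === 0) return 'stable';                // never failed in this sample
  if (failures === outcomes.length) return 'broken';  // deterministic failure
  return 'flaky';                                     // mixed results
}

console.log(classifyRuns(['passed', 'failed', 'passed'])); // flaky
```

A small sample can still mislabel a rare flake as stable, so treat the verdict as a starting point rather than proof.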
2. Read the first failure, not the last one
Many test runners surface a chain of errors, but the first meaningful failure is usually the one that matters. A later timeout can hide an earlier locator miss, network failure, or assertion mismatch.
In Playwright, pay close attention to the first place the test diverges from expected state. If the element never appeared, do not spend time debugging the click that came later. If a request never returned the expected payload, do not blame the visibility assertion that depended on it.
A good habit is to separate failures into three buckets:
- selection failures, the test cannot find the element
- synchronization failures, the element exists but is not ready
- state failures, the app is ready but the data or UI is wrong
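The three buckets above can be sketched as a small triage helper that sorts error messages by pattern. This is only an illustration: `bucketFailure` is a hypothetical name, and the regular expressions are rough guesses at typical wording, not an exhaustive or authoritative list of Playwright error formats.

```typescript
type FailureBucket = 'selection' | 'synchronization' | 'state' | 'unknown';

// Illustrative patterns only; adjust them to the messages your suite actually produces.
// Order matters: the first matching pattern wins.
const bucketPatterns: Array<[FailureBucket, RegExp]> = [
  ['selection', /strict mode violation|resolved to 0 elements/i],
  ['synchronization', /not (visible|enabled|stable)|intercepts pointer events|detached/i],
  ['state', /toHaveText|toHaveValue|toHaveCount|toEqual/i],
];

function bucketFailure(message: string): FailureBucket {
  for (const [bucket, pattern] of bucketPatterns) {
    if (pattern.test(message)) return bucket;
  }
  return 'unknown';
}

console.log(bucketFailure('element is not enabled')); // synchronization
```

Even a crude classifier like this can show whether a suite's failures cluster in one bucket, which tells you where to look first.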
3. Inspect the locator strategy first
Brittle locators are a classic source of Playwright flakiness. If your test uses a selector that depends on volatile classes, generated IDs, or DOM nesting, it may work today and fail after a minor UI change.
Prefer locators that reflect user intent:
typescript
await page.getByRole('button', { name: 'Save' }).click();
await expect(page.getByText('Profile updated')).toBeVisible();
Questions to ask:
- Is the locator tied to a stable role, label, or accessible name?
- Is it dependent on a CSS class that can change with styling work?
- Is it selecting the first matching element when there may be multiple matches?
- Is it using a text string that appears in several places on the page?
If the selector is fragile, the app may be fine and the test may still fail.
4. Check for locator ambiguity
A locator that matches more than one element can be just as bad as one that matches none. The test may click the wrong control, especially if layout shifts cause the order to change.
Common signs:
- the same button text appears in a table row, a modal, and a page header
- a generic text selector matches hidden content, not the visible target
- a test uses .nth(0) or .first() as a convenience and never revisits it
When in doubt, scope the locator to the smallest stable container possible.
typescript
const row = page.getByRole('row', { name: /invoice-1042/i });
await row.getByRole('button', { name: 'View' }).click();
The goal is not just to make the test pass, it is to make the intent obvious to the next person reading it.
5. Verify the element is actually ready for interaction
Playwright has auto-waiting, but that does not mean every action is safe at every moment. An element can exist in the DOM and still be hidden, detached, disabled, covered by an overlay, or transitioning.
Check for:
- animations or CSS transitions still in progress
- loading spinners overlaying the target
- disabled buttons waiting on validation
- dialogs or drawers that are in the middle of opening
If you need explicit waits, wait for state, not time. Time-based sleeps are usually a smell.
typescript
await expect(page.getByRole('button', { name: 'Submit' })).toBeEnabled();
await page.getByRole('button', { name: 'Submit' }).click();
If a hard wait is the only thing making a test pass, that is a clue that the test is guessing rather than observing.
6. Look for hidden waits inside the app
Sometimes the test is blamed for being too fast, but the app is actually doing a lot of work after the UI looks ready. That can include deferred network calls, background rendering, hydration, analytics, or feature flag initialization.
Useful things to inspect:
- whether the page is server-rendered and then hydrated later
- whether a control becomes visible before its data is populated
- whether a form button becomes clickable before validation completes
- whether an API request happens after a microtask or animation frame delay
If the app is relying on subtle timing, you may need a stronger readiness signal in the UI or the test.
7. Reproduce in headed mode, slow motion, and trace viewer
If the failure only happens in headless CI, you need visibility. Playwright’s trace viewer, screenshots, and videos are some of the fastest ways to see whether the test missed a selector, clicked too early, or ended up on the wrong page.
Run the failing test with trace collection and inspect the sequence around the failure.
typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Recorded only when a failed test is retried, so retries must be enabled.
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});
Then ask:
- did the test click the correct element?
- did the UI look the way the test expected?
- was there a toast, modal, or navigation that changed the page context?
This step often turns a vague flaky issue into a concrete UI mismatch.
8. Check network mocking and API timing
A test can look flaky when the real issue is inconsistent backend state or inconsistent mocks. If your suite uses route interception or fixture data, verify that the mock responses are stable and complete.
Things that frequently go wrong:
- the mock returns a different data shape than production
- a request is not intercepted in one test path
- one test mutates shared backend data used by another test
- the app makes multiple requests and the test only stubs the first one
If you are mocking network calls, inspect the actual request and response cycle, not just the test code.
typescript
await page.route('**/api/orders/**', async route => {
  await route.fulfill({ json: { status: 'ok', items: [] } });
});
If a request leaks through to a live environment sometimes and not others, that is a classic source of CI variance.
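To spot requests that slip past your mocks, one option is to compare the URLs the page actually requested against the patterns you intended to stub. The helper below is a hypothetical sketch; `findUnmockedRequests` and the example URLs are assumptions for illustration. In a real test, the URL list might be collected in a `page.on('request')` listener.

```typescript
// Hypothetical helper: list requested URLs that no mock pattern covers.
// Avoid the `g` flag on the patterns, because it makes RegExp.test() stateful.
function findUnmockedRequests(requestedUrls: string[], mockedPatterns: RegExp[]): string[] {
  return requestedUrls.filter(
    (url) => !mockedPatterns.some((pattern) => pattern.test(url)),
  );
}

// Example data only; the host and paths are made up.
const requested = [
  'https://app.example.test/api/orders/42',
  'https://app.example.test/api/orders/42/items',
  'https://app.example.test/api/profile',
];
const mocked = [/\/api\/orders\//];

// Logs only the /api/profile URL, the one request no mock covers.
console.log(findUnmockedRequests(requested, mocked));
```

Failing the test when this list is non-empty turns a silent leak into an explicit, repeatable error.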
9. Compare local, CI, and container behavior
A test that passes on a developer laptop but fails in CI is often suffering from environment drift. That drift can be browser version, CPU throttling, memory limits, display size, locale, timezone, or parallelization.
Compare:
- browser version and channel
- viewport size and device scale factor
- locale and timezone settings
- container image or OS package versions
- concurrency level and worker count
Playwright tests can behave differently if the app renders responsive layouts based on viewport. A button might be visible in a wide local window but tucked behind a menu in a narrower CI viewport.
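One way to reduce this kind of drift is to pin the relevant settings in the Playwright config so local and CI runs share them. The values below are assumptions chosen for illustration, not recommendations; use whatever matches your users and your CI image.

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    viewport: { width: 1280, height: 720 }, // same responsive breakpoint everywhere
    deviceScaleFactor: 1,
    locale: 'en-US',   // assumed locale; affects date and number formatting
    timezoneId: 'UTC', // assumed timezone; avoids off-by-one-day surprises
  },
});
```

Pinning these values does not remove CPU or memory differences, but it takes layout, locale, and timezone off the list of suspects.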
10. Inspect test data isolation
If tests share accounts, records, or stateful fixtures, they can interfere with each other. One run creates a record, another deletes it, a third run expects it to still exist. That is not app flakiness, that is data contention.
Check whether each test gets its own:
- user account
- seeded dataset
- unique identifiers or prefixes
- isolated storage state
- cleanup routine
In end-to-end suites, the safest pattern is usually one test, one data setup, one teardown path. If that is too expensive, at least make test data names unique enough to avoid collisions.
Shared test data is one of the fastest ways to turn a stable suite into a guessing game.
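A minimal sketch of unique naming, assuming a Node.js runtime: the helper below combines a prefix, a worker index, and a random suffix. `uniqueName` is a hypothetical helper, and the worker index is passed in by hand here; in a Playwright test it could come from `testInfo.workerIndex`.

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical helper: build a collision-resistant name for test records.
function uniqueName(prefix: string, workerIndex: number): string {
  const suffix = randomUUID().slice(0, 8); // first 8 hex characters of a random UUID
  return `${prefix}-w${workerIndex}-${suffix}`;
}

const invoiceName = uniqueName('invoice', 0);
console.log(invoiceName); // e.g. invoice-w0-1a2b3c4d (the suffix varies per call)
```

Keeping the prefix and worker index in the name also makes leftover records easy to trace back to the run that created them.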
11. Validate waits around navigation and asynchronous page changes
Navigation errors can look random because the action completes before the page has fully changed. This happens with SPA routing, incremental rendering, and redirects.
Use assertions that confirm the new page state, not just that a click happened.
typescript
await page.getByRole('link', { name: 'Settings' }).click();
await expect(page).toHaveURL(/\/settings/);
await expect(page.getByRole('heading', { name: 'Settings' })).toBeVisible();
This is especially useful when the app updates the URL before the content is actually ready. A URL change alone is not proof of readiness.
12. Look for race conditions caused by parallel tests
Playwright can run tests in parallel, which is great for speed and terrible for suites that assume shared mutable state. If a test passes alone but fails in parallel, suspect a race.
Examples:
- two tests log in with the same account and invalidate each other’s sessions
- one test seeds a record that another test deletes
- a shared email inbox receives messages from multiple runs
- a backend job processes data in a different order depending on timing
Try running the file serially, then compare. If serial execution fixes the issue, the suite likely needs better isolation, not more retries.
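For that comparison, a temporary diagnostic config along these lines can force serial execution. It is a sketch for debugging, not a setting to leave in place; passing `--workers=1` on the command line should have a similar effect.

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  fullyParallel: false, // tests within a file run in order
  workers: 1,           // a single worker, so files do not run concurrently
});
```

If the failure disappears under this config and returns with the normal worker count, shared state between tests is the likely culprit.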
13. Revisit assertions that are too strict or too weak
Sometimes the test is flaky because the assertion encodes exact UI details that are allowed to vary. Other times it is too weak and lets the wrong thing pass, which makes the real failure show up later.
Examples of overly strict assertions:
- exact text matching when copy changes slightly
- snapshot tests for volatile UI areas
- strict ordering checks on content that can legitimately reorder
Examples of overly weak assertions:
- checking only that a button exists, not that it works
- asserting on network status without verifying the rendered outcome
- verifying a toast appears, but not that the underlying state changed
The best assertion matches the business outcome you care about, not the implementation detail you happened to observe first.
14. Check whether browser state is leaking between tests
Storage state, cookies, service workers, and cached assets can make one run look very different from another. If the test depends on a clean browser profile, make sure it actually gets one.
Look for leakage from:
- reused authenticated sessions
- stale localStorage or sessionStorage values
- service worker cache behavior
- persisted theme or feature flag settings
A test that fails only after visiting another page or only after a prior login test may be inheriting state it should never see.
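To rule out inherited state while debugging, one option is to start every test from empty storage and block service workers. Playwright already gives each test a fresh browser context by default, so this sketch mainly matters when a suite reuses a saved storage state or an app depends on a service worker; treat it as a diagnostic assumption rather than a required setup.

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    storageState: { cookies: [], origins: [] }, // start without cookies or localStorage
    serviceWorkers: 'block',                    // keep service workers from registering
  },
});
```

If the test passes with clean state and fails with the suite's usual storage state, the saved session or cache is carrying something the test should not depend on.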
15. Decide whether the problem is really test maintainability
At some point, the question is not “why is the app flaky?” but “why is this test so hard to trust?” If the suite constantly breaks because locators shift, one-off waits are needed, and maintenance consumes more time than test design, the issue may be the automation layer itself.
This is where teams often start looking for tools that reduce upkeep. For example, Endtest uses self-healing behavior to recover from broken locators when the UI changes, which can help when brittleness in the test code is the main source of failures. That does not replace good test design, but it can reduce the amount of babysitting required when selectors are the weak point.
If you want a deeper look at that maintenance tradeoff, Endtest also documents its self-healing tests and compares its approach with Playwright in Endtest vs Playwright. The useful takeaway is not that one tool is universally better, it is that brittle automation code creates its own failure mode, and sometimes a more maintainable workflow is the real fix.
A practical debugging order you can reuse
If you want a repeatable sequence, try this order:
- Reproduce the failure several times.
- Check whether it is local, CI, or browser specific.
- Inspect the first failure and the trace.
- Validate the locator and scoping.
- Verify readiness, waits, and navigation.
- Review network mocks and test data.
- Compare environment differences.
- Test serial execution versus parallel execution.
- Audit browser state leakage.
- Reassess whether the automation itself needs simplification.
That order works because it starts with the most likely test-side causes before blaming the application.
When to blame the app
Yes, sometimes the app really is the problem. If the UI is genuinely not ready, if the backend returns inconsistent data, if a race condition is visible to real users, or if the app changes state before the test can observe it, then the issue is product code.
Still, it is worth proving that with evidence. A good bug report should include:
- exact steps to reproduce
- browser and environment details
- screenshots or trace evidence
- request and response data if relevant
- whether the issue appears in manual testing too
If the issue cannot be reproduced outside the automated run, keep digging on the test side.
Final take
A reliable Playwright suite is not built by adding more retries until the red turns green. It is built by understanding why the test and the app disagree. Most flakiness comes from one of four places: selectors, timing, environment drift, or test data.
Use this Playwright flaky test debugging checklist as a triage tool. If you walk through the 15 checks methodically, you will usually find the real problem faster than by rerunning the suite and hoping the failure disappears.
For Playwright teams, that usually means better locators, explicit readiness checks, stable fixtures, and cleaner isolation. For teams that spend too much time repairing brittle automation, it may also be a sign to look at more maintainable tooling patterns, including agentic AI-based platforms like Endtest when the root issue is locator maintenance rather than product behavior.
The goal is not zero failures, because that is unrealistic. The goal is to make each failure tell you something useful.