Golden Tests Are Faster Than Compiling Your App (And You Should Let AI Write Them)

A few weeks ago I was working on a feature and needed to verify how a screen looked in different states. The usual flow: compile the app → wait → navigate to the screen → set up the right conditions → check it visually → repeat for the next state. If just reading that list felt slow, you can imagine (and you probably already know) how long and repetitive the real thing gets.

At some point I thought: what if I just ask the AI to generate a golden test for this screen? A golden test renders the widget, takes a screenshot, and saves it as a PNG. It turned out to be so good that I never stopped doing it.

The AI generated the test file in seconds, I ran make update-goldens, and I had PNG screenshots of my screen in two device sizes without ever opening the app. That was when I realized I had been doing things backwards.

The idea

Golden tests in Flutter are screenshot tests. You render a widget in a test environment, capture a pixel-perfect image, and save it as a reference file. The next time the test runs, it compares the current render against the saved image. If something changed, the test fails.
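
In its simplest form, with no helpers at all, that cycle looks roughly like the sketch below. The widget and the goldens/greeting.png path are placeholders chosen for illustration; testWidgets, expectLater and matchesGoldenFile come from flutter_test.

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  testWidgets('greeting matches its golden image', (tester) async {
    // Render the widget in the headless test environment.
    await tester.pumpWidget(
      const MaterialApp(
        home: Scaffold(body: Center(child: Text('Hello'))),
      ),
    );

    // Compare the rendered pixels against the stored reference PNG.
    // The path is resolved relative to this test file.
    await expectLater(
      find.byType(MaterialApp),
      matchesGoldenFile('goldens/greeting.png'),
    );
  });
}
```

Running flutter test --update-goldens writes the PNG; a plain flutter test run afterwards compares against it and fails on a mismatch.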

What makes them interesting in the context of AI-assisted development is that they kill several birds with one stone:

You get visual verification of your UI without compiling the app.
You get regression detection for free (if someone breaks the layout later, the test will catch it).
You get living documentation of how every screen looks in every state.

The golden PNGs sitting in your repo are basically a visual catalog of your app. That catalog proved especially useful in PRs.

The problem with golden tests has always been that writing them is tedious. You need to set up the test environment, mock all the dependencies, configure device sizes, load fonts so text renders properly instead of showing black rectangles, handle overflow errors, and write a lot of boilerplate. That’s one of the reasons why most teams skip them. That and not even knowing that they exist.

But here’s the thing: that boilerplate is exactly what AI is good at. Known patterns, repetitive structure, predictable setup. If you give the AI the right context about how your golden tests work, it can generate them reliably for any screen in your app.

This is not a traditional tutorial

I’m not going to walk you through how to write golden tests step by step. Instead, I’m going to give you the context files that I use with Claude Code and Cursor so the AI can generate them for you. The idea is that you set up the infrastructure once (the helpers, the device sizes, the font loading) and then you let the AI do the repetitive work of creating test files for each screen and scenario.

This is the approach I described in my last post about engineers becoming orchestrators. You understand the architecture, you define the patterns, and you build context that makes the AI produce quality code. The golden test system is a perfect example of that.

The infrastructure

Before the AI can generate golden tests, you need a few things in place. This is the part you do once.

dart_test.yaml at your project root, defining a tag so you can run or exclude goldens selectively:

tags:
  golden:
    description: "Golden (screenshot) tests — run with --tags golden."

A target in your Makefile to regenerate all golden images:

update-goldens:
	flutter test --update-goldens --tags golden

Two test steps in your CI workflow (e.g. .github/workflows/ci.yml for GitHub Actions). One runs only goldens, the other runs everything except goldens. The reason to split them is the same reason you would not mix integration tests with widget tests: they are a different kind of test and you want to treat them as such. Once they are split you can also apply the flags that make sense for each. Unit and widget tests run with --coverage so you can measure how much code they exercise. Goldens don’t, because the coverage data from a visual test is meaningless and would only pollute the report. And as a bonus, when a golden fails you see it on its own step instead of having it get lost among a thousand passing widget tests:

- name: Run golden tests
  run: flutter test --tags golden

- name: Run unit and widget tests
  run: flutter test --coverage --exclude-tags golden

One thing you cannot get wrong: the CI runner has to use the same OS as your team. If everyone works on macOS, run CI on macos-latest. Flutter's rendering differs enough between operating systems that no reasonable tolerance can absorb the difference.

A helpers file at test/helpers/golden_test_helpers.dart. This is the file that does most of the work: the device size map, the function that sets the test surface size, the font loader that stops text from rendering as black rectangles, and the goldenTest wrapper that handles overflow errors and animation cleanup. It is the file you will touch the most and the one the AI needs to know best.

Start with the device sizes. A small map that every scenario loops through, so you get one PNG per device:

const kGoldenDeviceSizes = <String, Size>{
  'iPhone_SE': Size(375, 667),
  'iPhone_17_Pro': Size(402, 874),
};
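
The function that applies these sizes is not shown in this post. A minimal sketch of what it could look like, assuming the setGoldenSurfaceSize name used later in the context files and assuming that resizing the test surface through the binding is all the helper needs to do (the real helper may do more, such as adjusting the device pixel ratio):

```dart
import 'package:flutter/widgets.dart';
import 'package:flutter_test/flutter_test.dart';

/// Sketch only: resizes the test surface to [size] (logical pixels) and
/// resets it when the current test finishes.
Future<void> setGoldenSurfaceSize(WidgetTester tester, Size size) async {
  await tester.binding.setSurfaceSize(size);
  // Passing null restores the binding's default surface size.
  addTearDown(() => tester.binding.setSurfaceSize(null));
}
```

Calling it once per entry of kGoldenDeviceSizes, before pumping the screen, gives each device its own layout pass.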

The two pieces that took me the longest to get right are the font loader and the goldenTest wrapper. I will show both in full.

The font loader

Without real fonts, golden tests render text as black rectangles (Flutter’s Ahem fallback). Loading a font in a test would be easy if it were just one TTF, but Cupertino widgets ask for CupertinoSystemText and CupertinoSystemDisplay, and Material icons ask for MaterialIcons. If you only register Roboto, Cupertino widgets and icons render blank. So the loader registers the same Roboto TTF under several family names:

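import 'dart:io';
import 'dart:typed_data';

import 'package:flutter/services.dart';

/// Registers real fonts so golden text renders as glyphs, not Ahem boxes.
Future<void> loadGoldenTestFonts() async {
  // _materialFontsDir() is the lookup described right after this block.
  final fontsDir = _materialFontsDir();
  if (fontsDir == null) return; // No SDK fonts found: keep the Ahem fallback.

  // One entry per family name that widgets may request.
  final fontConfigs = <(String family, String fileName)>[
    ('Roboto', 'Roboto-Regular.ttf'),
    ('Roboto', 'Roboto-Bold.ttf'),
    ('CupertinoSystemText', 'Roboto-Regular.ttf'),
    ('CupertinoSystemDisplay', 'Roboto-Regular.ttf'),
    ('MaterialIcons', 'MaterialIcons-Regular.otf'),
  ];

  // Read each font file once, even when it backs several families.
  final cache = <String, ByteData>{};

  for (final (family, fileName) in fontConfigs) {
    final fontFile = File('$fontsDir/$fileName');
    if (!fontFile.existsSync()) continue; // Skip fonts this SDK does not ship.
    final byteData = cache[fileName] ??= ByteData.view(
      fontFile.readAsBytesSync().buffer,
    );
    final loader = FontLoader(family)..addFont(Future.value(byteData));
    await loader.load();
  }
}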

The other detail is finding fontsDir in the first place. Material fonts ship with the Flutter SDK, but the path depends on where Flutter is installed and which binary launched the test (Dart VM vs flutter_tester). Hardcoding it works on your machine and breaks on everyone else’s. Instead, walk up from the running executable until you find the material_fonts directory:

String? _materialFontsDir() {
  try {
    var dir = File(Platform.resolvedExecutable).parent;
    for (var i = 0; i < 8; i++) {
      final candidate = Directory('${dir.path}/material_fonts');
      if (candidate.existsSync()) return candidate.path;
      final nested = Directory('${dir.path}/artifacts/material_fonts');
      if (nested.existsSync()) return nested.path;
      dir = dir.parent;
    }
    return null;
  } catch (_) {
    return null;
  }
}

The goldenTest wrapper

This is the core piece. It replaces testWidgets and takes care of three things at once:

Enables shadows. Flutter disables them in tests by default, which makes goldens look flat compared to the real app.
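Ignores overflow errors, which on reduced surfaces usually come from the Ahem test font and the test setup rather than from your layout. The filter matches any message containing 'overflowed', so a genuine overflow is silenced as well; it still shows up in the PNG as the striped overflow marker.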
Pumps an empty widget at the end to clear pending timers, so the next test starts clean.

void goldenTest(
  String description, {
  required Future<void> Function(WidgetTester tester) test,
  List<String> extraIgnoredErrors = const [],
}) {
  testWidgets(description, (tester) async {
    final originalHandler = FlutterError.onError;
    final nonOverflowErrors = <FlutterErrorDetails>[];

    final previousShadowFlag = debugDisableShadows;
    debugDisableShadows = false;

    FlutterError.onError = (details) {
      final msg = details.exception.toString();
      if (msg.contains('overflowed')) return;
      if (extraIgnoredErrors.any(msg.contains)) return;
      nonOverflowErrors.add(details);
    };

    try {
      await test(tester);
    } finally {
      await tester.pumpWidget(const SizedBox());
      await tester.pump(const Duration(seconds: 1));

      debugDisableShadows = previousShadowFlag;
      FlutterError.onError = originalHandler;
    }

    if (nonOverflowErrors.isNotEmpty) {
      fail(
        'Golden test threw non-overflow errors:\n'
        '${nonOverflowErrors.map((e) => e.exception).join('\n')}',
      );
    }
  });
}

If your app has animations that never end on their own (shimmers, looping highlights, tap hint overlays), add the same save → set false → restore pattern for each of them inside the wrapper. Otherwise pumpAndSettle() never reaches a frame with nothing left to animate: it keeps pumping until its timeout (10 minutes by default) expires, then throws, and you never get a frame to capture.
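
As an illustration of that save → set false → restore pattern, here is a small sketch. It assumes the app exposes a global switch that its looping animations check before starting; loopingAnimationsEnabled and withLoopingAnimationsDisabled are hypothetical names, not Flutter APIs.

```dart
/// Hypothetical app-level switch. Shimmers and other looping animations are
/// assumed to check it before starting. This is not a Flutter API.
bool loopingAnimationsEnabled = true;

/// Runs [body] with looping animations switched off, then restores the
/// previous value even if [body] throws.
Future<void> withLoopingAnimationsDisabled(
  Future<void> Function() body,
) async {
  final previous = loopingAnimationsEnabled; // save
  loopingAnimationsEnabled = false; // set false
  try {
    await body();
  } finally {
    loopingAnimationsEnabled = previous; // restore
  }
}

Future<void> main() async {
  await withLoopingAnimationsDisabled(() async {
    print(loopingAnimationsEnabled); // false
  });
  print(loopingAnimationsEnabled); // true
}
```

Inside the wrapper the same three steps would sit next to the debugDisableShadows handling, with the restore in the finally block.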

The extraIgnoredErrors parameter is an escape hatch for tests that stub native plugins and produce platform channel exceptions that have nothing to do with what you are rendering. Easy to add now and you do not have to rewrite the wrapper later.

No external packages. Everything is built on top of matchesGoldenFile from flutter_test.

Making goldens survive multiple machines

The first time I shared this setup with a teammate, his tests failed straight away after pulling. Nothing was broken. The way Flutter draws things changes a tiny bit between macOS versions and Flutter SDK builds, just a few pixels. But matchesGoldenFile is exact by default. One pixel off and the test fails.

The fix is a custom golden comparator with a tolerance threshold:

const _kGoldenTolerance = 0.02; // 2%

class _TolerantGoldenFileComparator extends LocalFileComparator {
  _TolerantGoldenFileComparator(super.testFile);

  @override
  Future<bool> compare(Uint8List imageBytes, Uri golden) async {
    final goldenBytes = await getGoldenBytes(golden);
    final result = await GoldenFileComparator.compareLists(
      imageBytes,
      goldenBytes,
    );

    if (result.passed || result.diffPercent <= _kGoldenTolerance) {
      result.dispose();
      return true;
    }

    final error = await generateFailureOutput(result, golden, basedir);
    result.dispose();
    throw FlutterError(error);
  }
}

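Two percent is enough to hide the small differences between machines, but it is not free. The comparator measures the fraction of pixels that differ anywhere in the image, so a change that shifts a large part of the layout goes well past 2% and still fails, while a change confined to a small area (a swapped icon, a recolored badge) can stay under the threshold and pass. If that worries you, lower _kGoldenTolerance to the smallest value at which your machines still agree.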
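
To put rough numbers on that threshold, the sketch below computes how many pixels may differ before a golden fails. It assumes the PNG is captured at one image pixel per logical pixel, so an iPhone_SE golden is 375 × 667 pixels; pixelBudget and percentOfImage are illustrative helpers, not part of flutter_test.

```dart
/// Largest number of differing pixels that still passes for an image of
/// [width] x [height] pixels at the given [tolerance] (0.02 means 2%).
int pixelBudget(int width, int height, double tolerance) =>
    (width * height * tolerance).floor();

/// Share of a [width] x [height] image covered by a region, in percent.
double percentOfImage(
  int regionWidth,
  int regionHeight,
  int width,
  int height,
) =>
    100 * (regionWidth * regionHeight) / (width * height);

void main() {
  print(pixelBudget(375, 667, 0.02)); // 5002
  print(pixelBudget(402, 874, 0.02)); // 7026
  // A 24 x 24 icon that changes completely on the smaller device:
  print(percentOfImage(24, 24, 375, 667).toStringAsFixed(2)); // 0.23
}
```

Under these assumptions a fully replaced 24 × 24 icon touches about 0.23% of the smaller image, which is why small, localized changes are the ones a 2% tolerance can let through.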

There is also a second benefit. When you run make update-goldens, you don’t want every PNG to be rewritten just because the bytes are a bit different from the previous run. Otherwise every PR ends up with a bunch of PNG changes that nobody can review. So the comparator also overrides update() to skip rewriting when the new image is within tolerance of the existing one:

@override
Future<void> update(Uri golden, Uint8List imageBytes) async {
  final goldenFile = File.fromUri(basedir.resolveUri(golden));
  if (goldenFile.existsSync()) {
    final existingBytes = Uint8List.fromList(goldenFile.readAsBytesSync());
    final result = await GoldenFileComparator.compareLists(
      imageBytes,
      existingBytes,
    );
    final skip = result.passed || result.diffPercent <= _kGoldenTolerance;
    result.dispose();
    if (skip) return;
  }
  await super.update(golden, imageBytes);
}

The class on its own does nothing; you still have to install it. The install goes inside the goldenTest wrapper from earlier (not in each individual test you write). That way every call to goldenTest uses the tolerant comparator automatically.

Two additions. The updated wrapper looks like this, with the new parts marked:

void goldenTest(
  String description, {
  required Future<void> Function(WidgetTester tester) test,
  List<String> extraIgnoredErrors = const [],
}) {
  testWidgets(description, (tester) async {
    final originalHandler = FlutterError.onError;
    final nonOverflowErrors = <FlutterErrorDetails>[];

    final previousShadowFlag = debugDisableShadows;
    debugDisableShadows = false;

    // NEW: install the tolerant comparator.
    final originalComparator = goldenFileComparator;
    if (originalComparator is LocalFileComparator) {
      goldenFileComparator = _TolerantGoldenFileComparator(
        originalComparator.basedir.resolve('_test.dart'),
      );
    }

    FlutterError.onError = (details) {
      final msg = details.exception.toString();
      if (msg.contains('overflowed')) return;
      if (extraIgnoredErrors.any(msg.contains)) return;
      nonOverflowErrors.add(details);
    };

    try {
      await test(tester);
    } finally {
      await tester.pumpWidget(const SizedBox());
      await tester.pump(const Duration(seconds: 1));

      debugDisableShadows = previousShadowFlag;
      // NEW: restore the original comparator so the tolerance does not
      // leak into other tests.
      goldenFileComparator = originalComparator;
      FlutterError.onError = originalHandler;
    }

    if (nonOverflowErrors.isNotEmpty) {
      fail(
        'Golden test threw non-overflow errors:\n'
        '${nonOverflowErrors.map((e) => e.exception).join('\n')}',
      );
    }
  });
}

With this in place, goldens stop being flaky between machines, PRs don’t fill up with random PNG changes, and people actually keep using them.

The context files

Instead of teaching you how to write golden tests, I am giving you the files that teach the AI how to write them.

For Claude Code

Claude Code uses a skill system. Skills are folders inside .claude/skills/ that contain a SKILL.md with instructions and, optionally, reference files. What is nice is that skills are loaded on demand: they don’t consume context until Claude sees they are relevant. So having a golden tests skill doesn’t slow down your other conversations.

The structure:

.claude/
  skills/
    golden-tests/
      SKILL.md                              # Instructions for the AI
      references/
        golden_test_helpers.dart             # Your actual helper file

The SKILL.md contains the full pattern: scope, file structure, naming conventions, which helpers to call, how to mock dependencies, and how to name the output PNGs. Here is the one I use; adapt it to your own helpers:

Full SKILL.md

---
name: golden-tests
description: Generate Flutter golden (screenshot) tests for full screens using the project's `golden_test_helpers.dart`. Use whenever the user asks for a golden test, a screenshot test, or visual regression coverage for a Flutter screen.
---

# Golden tests

Generate a golden test for a Flutter **screen** using the helpers in `test/helpers/golden_test_helpers.dart`. Before writing anything, open `references/golden_test_helpers.dart` and read the exact function signatures — never invent helpers that do not exist there.

## Scope

Golden tests cover **full screens**, never individual widgets. If the user asks for a golden test of a button, a list tile, a card or any sub-component, do not write one — propose covering the screen that renders it instead. The value of a golden is verifying the composed result a user sees, not isolated pieces.

## File location and naming

- Place the test next to the screen it covers, mirroring the `lib/` path under `test/`.
- The filename ends in `_golden_test.dart` (e.g., `foo_screen_golden_test.dart`).
- Generated PNGs live in a sibling `goldens/` folder. The first `--update-goldens` run creates it.

## Skeleton

Every test follows the same shape:

1. Group tests by the screen under test.
2. Call `loadGoldenTestFonts()` in `setUpAll` so text renders instead of black rectangles.
3. Use the `goldenTest(...)` wrapper, never raw `testWidgets`, so shadows, animations and overflow noise are handled.
4. Inside each scenario, loop over `kGoldenDeviceSizes` and emit one PNG per device:
   - `await setGoldenSurfaceSize(tester, size);`
   - Pump the screen wrapped in the dependencies it needs (`MaterialApp`, theme, providers, mocks).
   - `await tester.pumpAndSettle();`
   - `await expectLater(find.byType(MyScreen), matchesGoldenFile('goldens/<scenario>_<deviceKey>.png'));`

## PNG filenames

Format: `<scenario>_<deviceKey>.png`.

- `<scenario>` is snake_case and describes the state (`empty`, `loaded`, `error`, `loading`).
- `<deviceKey>` is the key from `kGoldenDeviceSizes` (`iPhone_SE`, `iPhone_17_Pro`).

Example: `loaded_iPhone_SE.png`, `loaded_iPhone_17_Pro.png`.

## Mocking dependencies

- Use `mocktail`.
- Stub only what the screen reads during build — do not mock methods the test never triggers.
- Inject mocks through the same DI mechanism the app uses (`BlocProvider`, `RepositoryProvider`, `Provider`, etc.).
- For Blocs, stub `state`, `stream` (return `const Stream.empty()` when no state changes are needed), and `close`.

## Suppressing platform-channel noise

If a screen triggers native plugins during render (haptics, share, file pickers...), pass the noisy error substrings via `extraIgnoredErrors` so they do not fail the test:

```dart
goldenTest(
  'with haptics stubbed',
  extraIgnoredErrors: ['MissingPluginException'],
  test: (tester) async { ... },
);
```

Only suppress what is unrelated to rendering. Real rendering errors must keep failing.

## Running

- Regenerate: `make update-goldens` (or `flutter test --update-goldens --tags golden`).
- Verify: `flutter test --tags golden`.
- Run everything except goldens (the separate unit and widget CI step): `flutter test --exclude-tags golden`.

## Output expectation

Return only the test file. No explanations, no extra commentary unless the user explicitly asks.

The reference file in references/ is your actual golden_test_helpers.dart, copied so Claude can read the exact API without scanning the whole repo.

When you ask Claude Code to create a golden test for any screen, the skill activates automatically and the AI generates a test file following your exact conventions.
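
For orientation, a generated file following those conventions might look like the sketch below. Several parts are assumptions rather than anything shown in this post: GreetingScreen is a hypothetical screen defined inline to keep the example self-contained, the relative import assumes the test sits one folder below test/, loadGoldenTestFonts is assumed to take no arguments, and the file-level @Tags annotation is one possible way of opting the file into the golden tag.

```dart
@Tags(['golden'])
library;

import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

// Assumed location of the helpers relative to this test file.
import '../helpers/golden_test_helpers.dart';

/// Hypothetical screen, defined here only to keep the sketch self-contained.
class GreetingScreen extends StatelessWidget {
  const GreetingScreen({super.key, required this.name});

  final String name;

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('Greeting')),
      body: Center(child: Text('Hello, $name')),
    );
  }
}

void main() {
  group('GreetingScreen', () {
    // Load real fonts once so text is not drawn as black rectangles.
    setUpAll(() async {
      await loadGoldenTestFonts();
    });

    goldenTest(
      'loaded',
      test: (tester) async {
        // One PNG per device size: loaded_iPhone_SE.png, and so on.
        for (final entry in kGoldenDeviceSizes.entries) {
          await setGoldenSurfaceSize(tester, entry.value);
          await tester.pumpWidget(
            const MaterialApp(home: GreetingScreen(name: 'Ada')),
          );
          await tester.pumpAndSettle();
          await expectLater(
            find.byType(GreetingScreen),
            matchesGoldenFile('goldens/loaded_${entry.key}.png'),
          );
        }
      },
    );
  });
}
```

A real screen would replace the inline widget and wrap the pump in whatever providers and mocks it reads during build.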

For Cursor

Cursor uses .mdc rule files in .cursor/rules/. The key difference is that you can scope rules to specific file patterns using globs. For golden tests, you scope it to **/*_golden_test.dart so the rule only activates when you’re working on golden test files.

.cursor/
  rules/
    golden-tests.mdc

The .mdc file contains the same conventions in a more compact format. The frontmatter tells Cursor when to apply it, and the body is the same pattern stripped to essentials:

Full golden-tests.mdc

---
description: Golden test generation for Flutter screens.
globs: ["**/*_golden_test.dart"]
alwaysApply: false
---

# Golden tests

Generate Flutter golden tests using the helpers in `test/helpers/golden_test_helpers.dart`. Read that file before writing anything — call only functions that exist there.

## Scope

Golden tests cover **full screens**, never sub-components. If asked for a golden of a button, card or list tile, refuse and propose the parent screen instead.

## File layout

- Tests live next to the screen under test with the suffix `_golden_test.dart`.
- PNGs go in a sibling `goldens/` folder.

## Skeleton

1. Call `loadGoldenTestFonts()` in `setUpAll`.
2. Use `goldenTest('scenario', test: (tester) async { ... })` instead of `testWidgets`.
3. Loop over `kGoldenDeviceSizes` inside each scenario:
   - `await setGoldenSurfaceSize(tester, size);`
   - Pump the screen with its dependencies (MaterialApp, providers, mocks).
   - `await tester.pumpAndSettle();`
   - `await expectLater(find.byType(MyScreen), matchesGoldenFile('goldens/<scenario>_<deviceKey>.png'));`

## PNG filenames

`<scenario>_<deviceKey>.png` — `<scenario>` is snake_case (`loaded`, `error`), `<deviceKey>` comes from `kGoldenDeviceSizes` (`iPhone_SE`, `iPhone_17_Pro`).

## Mocks

- Use `mocktail`.
- Stub only what the screen reads during build.
- Inject through the same DI mechanism the app uses.

## Platform-channel noise

For screens that trigger native plugins, pass `extraIgnoredErrors: ['MissingPluginException']` to `goldenTest`. Never suppress rendering errors.

## Output

Return only the test file.

Same idea, different format.

How I actually use this

My workflow now looks like this: I’m working on a feature, I finish the implementation, and instead of compiling the app to check how it looks, I ask the AI to generate a golden test for that screen. The AI creates the test file with all the scenarios I need (empty state, loaded state, error state, etc.), I run make update-goldens, and I get PNGs for every scenario in two device sizes.

I open the PNGs, verify the UI looks correct, and move on. If I need to tweak something, I update the implementation, run the goldens again, and compare. The whole feedback loop is faster than compile-navigate-verify, and at the end I have tests that will catch visual regressions in the future.

The moment it really clicked for me was when I realized that the golden PNGs are also very useful for code reviews. The reviewers can just look at the golden images in the PR. That alone saves a lot of time.

Getting started

If you want to try this approach:

Set up the infrastructure (helpers, dart_test.yaml, Makefile target). You do this once.
Drop the context files into your project (.claude/skills/golden-tests/ for Claude Code, .cursor/rules/golden-tests.mdc for Cursor, or both).
Make sure the reference helper file in the skill matches your actual golden_test_helpers.dart.
Ask your AI to generate a golden test for any screen and see what happens.

The whole point is that you invest time once in defining how things should work, and then the AI reproduces that pattern consistently across your entire app.