When we started building Atrium—an experimental, unreleased workspace management tool—we needed integration tests that would verify the tool actually works as users would experience it. Unit tests are useful for internal logic, but they can’t tell you whether atrium create --provider shell /path/to/workspace will actually create a functional workspace on a real filesystem with the correct permissions and environment.
We needed integration tests that would exercise the full stack: command-line parsing, file I/O, process spawning, environment variables, and all the messy details that make CLIs work. After researching available options, we decided to use Scrut—a markdown-based testing framework designed specifically for CLI tools.
Scrut provides three key capabilities for CLI testing:
1. System-level integration – Tests run through the shell exactly as users would invoke your tool. File system operations, PATH resolution, environment variables, process spawning—everything works as it would in production. If your tool relies on finding executables in PATH or reading from XDG directories, your tests exercise those code paths for real.
2. User experience validation – Tests exercise the actual CLI interface: flags, subcommands, output formatting, error messages. This catches UX regressions that unit tests miss. Does --provider come before or after the positional argument? Do error messages actually make sense? Are table borders rendering correctly? Integration tests answer these questions.
3. Human-AI collaboration bridge – Test files are markdown documents that both humans and AI can read, write, and review. Ryan and I can both contribute to the test suite naturally. The tests serve as living documentation that’s accessible to the whole team.
In this post, we’ll walk you through everything we learned building Atrium’s test suite: from environment isolation to handling non-deterministic output, from test organization to CI integration. By the end, you’ll have a blueprint for building your own Scrut-based test infrastructure.
Scrut is a testing framework where tests are markdown files containing shell commands and their expected output. It was created at Meta and released as open source specifically for testing CLI tools and shell scripts.
Here’s the simplest possible test:
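```scrut
$ echo "Hello, World!"
Hello, World!
```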
When you run scrut test my_test.md, Scrut:
- Executes `echo "Hello, World!"` in a shell
- Captures the output
- Compares it line-by-line against `Hello, World!`
- Reports success or shows a diff if they don't match
Tests consist of:
- YAML frontmatter: Configuration for the test file
- Markdown structure: Headers and prose that document what’s being tested
- Scrut code blocks: Special code blocks with the `scrut` language tag containing commands and expected output
- Commands: Lines starting with `$` (new command) or `>` (continuation)
- Expected output: Lines without `$` or `>` prefixes
When to use Scrut: It’s best for CLI tools, shell scripts, or any software where end-to-end system behavior matters more than isolated unit testing. If your users interact with your software through a shell, Scrut tests what they’ll actually experience.
One of the biggest challenges in integration testing is environment isolation. Tests should run in a clean, reproducible environment without touching the developer’s actual system. Nobody wants their test suite to clobber their personal configuration files or leave behind test data.
Our approach uses two key techniques: PATH override and XDG directory redirection.
Our test runner in Argcfile.sh ensures we’re testing the binary we just built:
```sh
# @cmd Run end-to-end tests
# @arg path Path for tests
# @arg args~ Passed to scrut
test::e2e() {                              # invoked as `argc test e2e`
  mkdir -p target/tmp/atrium
  export PATH="$PWD/target/debug:$PATH"    # Use built binary
  export TMPDIR="$PWD/target/tmp/atrium"   # Isolated temp directory
  scrut test "${argc_path:-tests}" ${argc_args+"${argc_args[@]}"}
}
```
The PATH override is critical: it ensures that when our tests run atrium create ..., they’re invoking the freshly built debug binary, not some installed version or an old binary lingering in the system PATH. We prepend our build directory to PATH rather than replacing PATH entirely, so standard Unix tools remain available.
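To sanity-check the override, you can ask the shell which binary will actually run (the path shown is hypothetical):

```sh
$ command -v atrium
/home/dev/atrium/target/debug/atrium
```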
Scrut automatically provides each test file with a fresh temporary directory via $TMPDIR. We leverage this along with XDG Base Directory variables to achieve complete isolation. Here’s a block from our tests/setup.md:
```scrut
$ set -eu -o pipefail
> ATRIUM_PROJECT_ROOT=$(git -C "$TESTDIR" rev-parse --show-toplevel)
> export ATRIUM_PROJECT_ROOT
> export XDG_CONFIG_HOME="$ATRIUM_PROJECT_ROOT/tests/config"
> export XDG_DATA_HOME="$TMPDIR"
```
Let me break down what each variable does:
Scrut-provided variables:

- `$TESTDIR`: Directory containing the test file (e.g., `tests/fast/`)
- `$TMPDIR`: Fresh temporary directory for this specific test file
- `$TESTFILE`: Name of the current test file
How we compute our directories:
1. Project root (ATRIUM_PROJECT_ROOT):
We need a stable absolute path to reference test fixtures. The command git -C "$TESTDIR" rev-parse --show-toplevel finds the git repository root regardless of test file depth. Whether we’re running tests/fast/foo.md or tests/bar.md, this gives us the same project root.
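For example, assuming the repository checkout lives at `/home/dev/atrium` (a hypothetical path), the command resolves the same root from any depth:

```sh
$ git -C tests/fast rev-parse --show-toplevel
/home/dev/atrium
$ git -C tests rev-parse --show-toplevel
/home/dev/atrium
```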
We use this to reference fixtures:
```scrut
$ atrium up --provider shell "$ATRIUM_PROJECT_ROOT/tests/empty_workspace"
```
2. Config directory (XDG_CONFIG_HOME):
Points to tests/config/ – a version-controlled fixture directory containing test-specific configuration. This includes test-only drivers, providers, and aliases. It’s stable across test runs and checked into git.
In production, Atrium reads config from ~/.config/atrium/. In tests, it reads from tests/config/. The developer’s personal configuration is never touched.
3. Data directory (XDG_DATA_HOME):
Points to $TMPDIR – Scrut’s temporary directory. The Atrium database, runtime state, and any generated files go here. It’s fresh for each test file and automatically cleaned up afterwards.
In production, Atrium writes data to ~/.local/share/atrium/. In tests, it writes to the temporary directory. The developer’s actual data is never touched.
Why this split matters:
- Config (fixture): Test-specific settings that should be consistent across runs
- Data (temporary): Runtime state that should start fresh each time
This approach gives us complete isolation. Tests interact with dedicated config fixtures in tests/config/ and write data to Scrut’s temporary directory. The developer’s actual ~/.config/ and ~/.local/share/ are never affected, even after running hundreds of tests.
Every test needs common setup: environment variables, helper functions, initial state verification. But duplicating this code across dozens of test files violates DRY principles and makes maintenance nightmarish.
Scrut solves this with the prepend feature. Every test file in our suite includes this YAML frontmatter:
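It looks something like this (a minimal sketch; `prepend` is the key this section relies on, and the `defaults` nesting for `output_stream` may differ across Scrut versions):

```yaml
---
prepend:
  - setup.md
defaults:
  output_stream: combined
---
```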
Scrut executes all code blocks from setup.md before running the test’s own blocks. It’s like a beforeEach hook in traditional test frameworks, but it’s just another markdown file.
Here's our tests/setup.md file:

````markdown
Don't run this file as a standalone test.

```scrut
$ [ "$TESTFILE" = setup.md ] && exit 80
```

Set up the shell options and the configuration and data directories.

```scrut
$ set -eu -o pipefail
> ATRIUM_PROJECT_ROOT=$(git -C "$TESTDIR" rev-parse --show-toplevel)
> export ATRIUM_PROJECT_ROOT
> export XDG_CONFIG_HOME="$ATRIUM_PROJECT_ROOT/tests/config"
> export XDG_DATA_HOME="$TMPDIR"
```

Verify that atrium works and has an empty database.

Define a shell redaction function.

```scrut
$ redact() {
>   sed -re 's@[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z[[:space:]]*@@g' \
>     -e "s@$PWD@[LOCAL_PWD]@g" \
>     -e "s@$ATRIUM_PROJECT_ROOT@[ROOT]@g" \
>     -e "s@$HOME@[LOCAL_HOME]@g" \
>     -e 's@"containerId":"[0-9a-f]*"@"containerId":"..."@g'
> }
```
````
The file has four main sections:
1. Skip if standalone: Exit code 80 tells Scrut to skip this file when run directly. It’s not a test itself, just shared setup.
2. Environment setup: Sets shell options and configures XDG directories as described in the previous section. The set -eu -o pipefail ensures errors are caught immediately:
- `-e`: Exit on error (don't silently continue after failures)
- `-u`: Exit on undefined variable (catch typos)
- `-o pipefail`: Fail if any command in a pipeline fails
3. Verify empty state: Confirms Atrium starts with an empty database. This catches configuration issues early.
4. Redaction helper: Defines a shell function to normalize non-deterministic output. We’ll explore this in detail in the next section.
The prepend pattern means we write our environment setup once and get a fresh environment in every test. When we need to adjust the setup—say, adding a new environment variable—we change it in one place.
Real CLI tools produce output that varies between runs: timestamps, absolute paths, container IDs, process IDs. Snapshot testing breaks unless we normalize this output.
Our solution is the `redact()` function you saw defined in setup.md. It uses `sed` to perform several transformations:
Timestamps: Removes ISO8601 timestamps completely, including trailing whitespace.
Paths: Replaces absolute paths with semantic placeholders:
- Current directory → `[LOCAL_PWD]`
- Project root → `[ROOT]`
- Home directory → `[LOCAL_HOME]`
IDs: Normalizes container IDs by replacing hex values with `...`
Here’s how we use it in tests:
```scrut
$ atrium up --provider shell --id=workspace_copy --always-copy \
>   "$ATRIUM_PROJECT_ROOT/tests/empty_workspace" 2>&1 | redact
[INFO atrium::machine] Provider local does not have a start script
[INFO atrium::workspace] Copying [ROOT]/tests/empty_workspace into [LOCAL_PWD]/workspace_copy
[INFO atrium::workspace] Provider shell does not have a create script
[INFO atrium] Workspace workspace_copy created
[INFO atrium::workspace] Provider shell does not have a start script
[INFO atrium] Workspace workspace_copy started
```
The actual output contains timestamps like 2024-01-15T10:30:45Z [INFO atrium::workspace] and paths like /Users/ryan/Projects/atrium/tests/empty_workspace, but redact normalizes them before snapshot comparison.
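To see the function's effect in isolation, here's an illustrative one-liner (not from our actual suite):

```scrut
$ echo "2024-01-15T10:30:45Z copied into $HOME/work" | redact
copied into [LOCAL_HOME]/work
```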
Why semantic placeholders matter:
- Paths change on different platforms (`ATRIUM_PROJECT_ROOT` is different for every developer)
- `[ROOT]` is far more readable than `/var/folders/xy/randomhash/T/scrut.abc123/tests`
- Snapshots become documentation—humans can understand them
What to redact in your project:
- Timestamps and dates
- Absolute paths (especially temp directories and home directories)
- UUIDs, random IDs, cryptographic hashes
- Dynamically assigned port numbers
- Process IDs
- Anything that changes between runs but isn’t semantically important
Alternative: Scrut’s regex matching
For occasional variance, you can use inline regex matching (we’ll see examples later). Redaction functions are better for systematic patterns that appear throughout your test suite.
Our test directory structure reflects both the logical organization of tests and their performance characteristics:
```
tests/
├── setup.md             # Shared setup (prepended to all tests)
├── config/              # Test-specific configuration (XDG_CONFIG_HOME)
│   ├── config.toml
│   └── helper.toml
├── empty_workspace/     # Fixture directory
├── fast/                # Fast tests for CI (under 5 seconds total)
│   ├── local_shell_local.md
│   └── recreate.md
└── local/               # Slower tests (require local VMs)
    └── lima_devcontainer.md
```
1. Separate by speed/environment:
We split tests into fast/ and local/ directories. Fast tests use only the local machine driver and complete in under 5 seconds total. Local tests require Lima VMs or containers—they're slower and sometimes flaky. Cloud provider tests require API keys and cost money to run.
During normal development, we run only fast tests to keep feedback loops short. Before release, developers can run argc test e2e to exercise the full test suite.
2. Fixtures as subdirectories:
empty_workspace/ is a shared test fixture—a minimal workspace containing just a README. Many tests reference it:
```scrut
$ atrium up --provider shell "$ATRIUM_PROJECT_ROOT/tests/empty_workspace"
```
Remember how we computed ATRIUM_PROJECT_ROOT in setup.md? This is why. Scrut runs tests from $TMPDIR (a fresh temporary directory), so we need a stable absolute path to reference fixtures that live in our project.
3. Test-specific configuration as fixtures:
tests/config/ contains configuration files for Atrium itself:
- Test-only drivers (like a `fake` driver for testing error conditions)
- Test-specific providers
- Test-only aliases
We point XDG_CONFIG_HOME to this directory in setup.md. The configuration is version-controlled fixture data, not temporary test output.
You can run specific directories or files:
```sh
# Just fast tests (for normal development)
scrut test tests/fast/

# A single test file
scrut test tests/fast/helpers.md
```
This is why organizing by directory (fast/ vs. local/) is useful—you can easily run just the subset you care about.
An issue we encountered repeatedly while building Atrium’s test suite: if initial setup fails (e.g., creating a workspace), subsequent test steps produce confusing cascading failures. When a workspace creation step failed, we’d get pages of “workspace not found” errors from subsequent blocks trying to use that non-existent workspace. The actual failure was buried in the noise.
We needed a way to stop test execution immediately when a prerequisite step failed. Scrut didn’t have this feature, so we implemented it and submitted PR #42 to add the fail_fast block attribute. It was merged and released in Scrut 0.4.3.
When a block marked with fail_fast: true fails, Scrut stops test execution immediately.
Here’s how we use it:
```scrut { fail_fast: true }
$ atrium up --provider shell "$ATRIUM_PROJECT_ROOT/tests/empty_workspace" 2>&1 | redact
[INFO atrium::workspace] Workspace empty_workspace created
```

Then the "Access workspace" block follows as a separate, unmarked block:

```scrut
$ echo "pwd" | atrium shell empty_workspace
```
What happens:
- If workspace creation fails, the test stops immediately
- The “Access workspace” block doesn’t run
- Test failure output is clear: setup failed, not the feature being tested
When to use fail_fast:
- Initial test setup (creating resources, starting services)
- Prerequisites for subsequent assertions
- Any step where failure makes later steps meaningless
When not to use it:
- Normal test assertions (you want to see all failures)
- Cleanup steps (run them even if the test failed)
- Independent test sections
Other useful block attributes:
```scrut { timeout: 60s }
$ atrium machine start my-vm
```
The timeout attribute is useful for operations that might hang (starting VMs, network operations).
The fail_fast feature has been invaluable for keeping test output clean. When something goes wrong, we see the actual problem immediately instead of scrolling through pages of cascading failures.
Command lines and output:
Lines starting with $ introduce a new command. Lines starting with > continue the previous command. Lines without $ or > prefixes are expected output. Scrut essentially runs all the command lines as a shell script and compares the actual output to the expected output:
```scrut
$ atrium inspect empty_workspace -f json \
>   | jq -r '.name'
empty_workspace
```
Note how the backslash \ works just like in a shell script—it continues the command onto the next line. The > prefix on the continuation line tells Scrut this is part of the previous command.
Exit codes:
Bracket notation checks exit codes. This is essential for testing error conditions:
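```scrut
$ atrium delete nonexistent 2>&1
Error: Workspace 'nonexistent' not found
[1]
```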
Without [1], Scrut expects exit code 0 by default and the test would fail. Note that the bracket notation must be the last line of the expected output—it can’t appear anywhere else in the block.
Regex matching:
Append (regex) to enable regex matching for that specific line:
```scrut
$ atrium inspect workspace
Name workspace\s* (regex)
Provider shell\s* (regex)
```
The \s* handles trailing whitespace variance. We use this sparingly—mostly for table output where padding can vary.
Combining stdout/stderr:
Our default frontmatter configuration merges stderr into stdout:
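A sketch of the relevant frontmatter (as above, the exact `defaults` nesting may vary by Scrut version):

```yaml
---
defaults:
  output_stream: combined
---
```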
This is essential because Atrium (like most CLI tools) writes logs to stderr and results to stdout. Combined output makes testing simpler—we don’t have to reason about which stream each line goes to.
Most CLI tools provide both machine-readable (JSON, CSV) and human-readable (tables, formatted text) output. You should test both, but choose the right format for each test.
Use structured output (JSON) for most tests:
```scrut
$ atrium inspect empty_workspace -f json | jq -r '.helpers'
"description": "A test message option",
```
Benefits of testing JSON:
- Deterministic (no formatting variance, column padding, or whitespace issues)
- `jq` lets you extract specific fields for focused assertions
- Snapshots show the exact structure your users will parse
- Changes to data structure are immediately visible
Use human-readable output for UX validation:
We also test the human-readable table output in output_formats.md:
```
╭─────────────────┬──────────┬─────────╮
│ Name            │ Provider │ Machine │
├─────────────────┼──────────┼─────────┤
│ empty_workspace │ shell    │ local   │
╰─────────────────┴──────────┴─────────╯
```
This validates the actual user-facing output format, including Unicode box drawing characters. For formatted output where spacing might vary, use regex matching:
```scrut
$ atrium inspect empty_workspace | redact
Name \s* empty_workspace (regex)
Provider \s* shell (regex)
Machine \s* local (regex)
```
Our approach: Most tests use JSON output for reliability and precision. We have a dedicated output_formats.md test file that validates table formatting, alignment, and other presentation details. This gives us confidence in both the data and the UX without making every test brittle to formatting changes.
1. One assertion per block: Makes it easier to identify which specific assertion failed.
2. Descriptive section headers: Tests double as documentation. Use markdown headers to organize test blocks:
```markdown
## Execute command in workspace
```
3. Test both success and failure:
```scrut
$ atrium delete nonexistent 2>&1
Error: Workspace 'nonexistent' not found
[1]
```
4. Verify state changes: Don’t just test the operation, verify its result:
```scrut
$ atrium delete empty_workspace 2>&1 | redact
[INFO atrium] Workspace empty_workspace deleted
```
5. Clean up external resources: Since each test gets a fresh $TMPDIR and XDG_DATA_HOME, cleanup of test state is mostly automatic. However, if your tests create external resources (VMs, containers, cloud resources), you need to clean them up explicitly.
Our pattern for tests that create external resources:
- Clean up at the start of the test to handle previous failed runs
- Use a consistent ID across test runs (don’t generate random IDs that leave orphaned resources)
- Clean up at the end of the test in the success case
Example from tests/local/lima_shell.md:
````markdown
# Clean up any previous failed run

```scrut
$ limactl delete --force atrium_lima_shell >/dev/null 2>&1
```

## Create VM and workspace

```scrut { fail_fast: true }
$ atrium up --provider lima_shell --id atrium_lima_shell ...
```

## Delete workspace (cleanup)

```scrut
$ atrium delete atrium_lima_shell 2>&1 | redact
```
````
Note that limactl doesn’t fail if the resource doesn’t exist, but if it did you could use || true to suppress the failure.
Developers and CI should run the same command to execute tests. The key is a simple script that:
- Builds your project
- Sets up environment variables (PATH, TMPDIR, XDG vars if needed)
- Runs `scrut test tests/`
Here are examples using different build tools:
```sh
# @cmd Run all tests (unit + e2e)
# (wrapper body assumed: run unit tests, then the e2e suite)
test() {
  cargo test
  test::e2e
}

# @cmd Run end-to-end tests
# @arg path Path for tests
# @arg args~ Passed to scrut
test::e2e() {
  mkdir -p target/tmp/atrium
  export PATH="$PWD/target/debug:$PATH"
  export TMPDIR="$PWD/target/tmp/atrium"
  scrut test "${argc_path:-tests}" ${argc_args+"${argc_args[@]}"}
}
```
```make
.PHONY: test test-unit test-e2e build

test: test-unit test-e2e

# Unit-test and build recipes are assumed here; adapt to your toolchain.
test-unit:
	cargo test

build:
	cargo build

test-e2e: build
	mkdir -p target/tmp/myproject
	PATH="$(PWD)/target/debug:$$PATH" \
	TMPDIR="$(PWD)/target/tmp/myproject" \
	scrut test tests/
```
Or as a plain test.sh script:

```sh
#!/usr/bin/env bash
set -euo pipefail

cargo build   # build step assumed
mkdir -p target/tmp/myproject
export PATH="$PWD/target/debug:$PATH"
export TMPDIR="$PWD/target/tmp/myproject"
scrut test tests/ "$@"
```
CI integration is trivial. GitHub Actions example:
```yaml
jobs:
  test:
    runs-on: ubuntu-latest  # runner assumed
    steps:
      - uses: actions/checkout@v3
      - uses: dtolnay/rust-toolchain@stable
      - name: Run tests
        run: ./test.sh # or: make test, or: argc test
```
Let me walk through a more complex test that demonstrates multiple patterns. This is our recreate.md test, which validates workspace recreation with different options.
# Workspace Recreate Command
Test the recreate command functionality.
Precondition: Verify the fixture starts in a known state (just a README).
This ensures previous test runs didn't leave behind files.
```scrut
$ rm -f "$ATRIUM_PROJECT_ROOT/tests/empty_workspace/marker.txt"
> ls "$ATRIUM_PROJECT_ROOT/tests/empty_workspace"
```
## Part 1: Copied workspace (--always-copy)
When using `--always-copy`, files are copied into a new directory. This tests
whether `recreate` properly handles file preservation and deletion.
```scrut { fail_fast: true }
$ atrium create --provider shell --always-copy --id copied_ws "$ATRIUM_PROJECT_ROOT/tests/empty_workspace" 2>&1 | redact
[INFO atrium::machine] Provider local does not have a start script
[INFO atrium::workspace] Copying [ROOT]/tests/empty_workspace into [LOCAL_PWD]/copied_ws
[INFO atrium::workspace] Provider shell does not have a create script
[INFO atrium] Workspace copied_ws created
[INFO atrium::workspace] Provider shell does not have a start script
[INFO atrium] Workspace copied_ws started
```
Verify the workspace was actually copied (not just an alias of the original):
```scrut
$ atrium exec copied_ws pwd | redact
[LOCAL_PWD]/copied_ws
```
Verify the files were copied:
```scrut
$ atrium exec copied_ws ls
```
Create a marker file to test whether `recreate` preserves user modifications:
```scrut
$ atrium exec copied_ws touch marker.txt
```
### Recreate preserves modified files
The default `recreate` behavior should preserve files (only reset infrastructure):
```scrut
$ atrium recreate copied_ws 2>&1 | redact
[INFO atrium::machine] Provider local does not have a start script
[INFO atrium::workspace] Running workspace delete command
[INFO atrium::workspace] Provider shell does not have a create script
[INFO atrium] Workspace copied_ws recreated
[INFO atrium::workspace] Provider shell does not have a start script
[INFO atrium] Workspace copied_ws started
```
Verify marker file still exists (files preserved):
```scrut
$ atrium exec copied_ws ls | sort
```
### Reset deletes and recreates copied files
But `--reset` should completely wipe the workspace and re-copy from source:
```scrut
$ atrium recreate --reset copied_ws 2>&1 | redact
[INFO atrium::machine] Provider local does not have a start script
[INFO atrium::workspace] Running workspace delete command
[INFO atrium::workspace] Copying [ROOT]/tests/empty_workspace into [LOCAL_PWD]/copied_ws
[INFO atrium::workspace] Provider shell does not have a create script
[INFO atrium] Workspace copied_ws recreated
[INFO atrium::workspace] Provider shell does not have a start script
[INFO atrium] Workspace copied_ws started
```
Verify marker file is gone (files were deleted and re-copied from source):
```scrut
$ atrium exec copied_ws ls | sort
```
Clean up copied workspace:
```scrut
$ atrium delete copied_ws 2>&1 | redact
[INFO atrium::workspace] Running workspace delete command
[INFO atrium] Workspace copied_ws deleted
```
## Part 2: Bind mount workspace (local root)
Without `--always-copy`, the workspace points directly to the source directory.
This tests different behavior: we must NOT delete user files on `--reset`.
```scrut { fail_fast: true }
$ atrium create --provider shell -o SHELL=/bin/bash "$ATRIUM_PROJECT_ROOT/tests/empty_workspace" 2>&1 | redact
[INFO atrium::machine] Provider local does not have a start script
[INFO atrium::workspace] Provider shell does not have a create script
[INFO atrium] Workspace empty_workspace created
[INFO atrium::workspace] Provider shell does not have a start script
[INFO atrium] Workspace empty_workspace started
```
Verify the workspace root is the original path (not a copy):
```scrut
$ atrium exec empty_workspace pwd | redact
[ROOT]/tests/empty_workspace
```
Create a marker file to test behavior:
```scrut
$ atrium exec empty_workspace touch marker.txt
```
### Recreate preserves settings and files
```scrut
$ atrium recreate empty_workspace 2>&1 | redact
[INFO atrium::machine] Provider local does not have a start script
[INFO atrium::workspace] Running workspace delete command
[INFO atrium::workspace] Provider shell does not have a create script
[INFO atrium] Workspace empty_workspace recreated
[INFO atrium::workspace] Provider shell does not have a start script
[INFO atrium] Workspace empty_workspace started
```
Verify the option that was set is still there:
```scrut
$ atrium inspect -f json empty_workspace | jq -r '.options[] | select(.name == "SHELL") | .value'
/bin/bash
```
Verify files still exist:
```scrut
$ atrium exec empty_workspace ls | sort
```
### Reset does NOT delete local workspace files
CRITICAL: When the workspace points to a local directory, `--reset` must NOT
delete files (that would destroy user data!):
```scrut
$ atrium recreate --reset empty_workspace 2>&1 | redact
[INFO atrium::machine] Provider local does not have a start script
[INFO atrium::workspace] Running workspace delete command
Not deleting protected workspace root
[INFO atrium::workspace] Provider shell does not have a create script
[INFO atrium] Workspace empty_workspace recreated
[INFO atrium::workspace] Provider shell does not have a start script
[INFO atrium] Workspace empty_workspace started
```
Verify the option is still there:
```scrut
$ atrium inspect -f json empty_workspace | jq -r '.options[] | select(.name == "SHELL") | .value'
/bin/bash
```
Verify files still exist:
```scrut
$ atrium exec empty_workspace ls | sort
```
### Recreate overriding an option
```scrut
$ atrium recreate -o SHELL=/bin/zsh empty_workspace 2>&1 | redact
[INFO atrium::machine] Provider local does not have a start script
[INFO atrium::workspace] Running workspace delete command
[INFO atrium::workspace] Provider shell does not have a create script
[INFO atrium] Workspace empty_workspace recreated
[INFO atrium::workspace] Provider shell does not have a start script
[INFO atrium] Workspace empty_workspace started
```
Verify the option is updated:
```scrut
$ atrium inspect -f json empty_workspace | jq -r '.options[] | select(.name == "SHELL") | .value'
/bin/zsh
```
### Recreate with --clear-options
```scrut
$ atrium recreate --clear-options empty_workspace 2>&1 | redact
[INFO atrium::machine] Provider local does not have a start script
[INFO atrium::workspace] Running workspace delete command
[INFO atrium::workspace] Provider shell does not have a create script
[INFO atrium] Workspace empty_workspace recreated
[INFO atrium::workspace] Provider shell does not have a start script
[INFO atrium] Workspace empty_workspace started
```
Verify option source is now unset (the value itself is just JSON null):
```scrut
$ atrium inspect -f json empty_workspace | jq -r '.options[] | select(.name == "SHELL") | .source'
null
```
```scrut
$ atrium delete empty_workspace 2>&1 | redact
> rm -f "$ATRIUM_PROJECT_ROOT/tests/empty_workspace/marker.txt"
[INFO atrium::workspace] Running workspace delete command
Not deleting protected workspace root
[INFO atrium] Workspace empty_workspace deleted
```
Finally, we verify the workspace is deleted.

What this test demonstrates:

1. Multi-phase testing: Two independent parts test different code paths (copied vs. bind-mounted workspaces).
2. State verification: Create marker file → run operation → verify expected state change (or lack thereof).
3. Testing flag combinations: recreate vs. recreate --reset, --always-copy vs. default behavior.
4. Realistic workflows: This tests actual user workflows: create workspace, modify it, recreate it with different flags, verify results. Not just testing isolated commands.
5. Safety testing: The bind-mounted test verifies that --reset doesn’t delete user data from local directories.
This test catches multiple classes of bugs:
- Logic errors in the recreate command
- Incorrect file handling (copy vs. bind mount)
- Data loss scenarios
- Flag parsing issues
And it’s all expressed in readable markdown that serves as documentation for how recreate should work.
Here are the patterns and pitfalls we’ve learned while building Atrium’s test suite:
1. Always redact timestamps and paths
Even if you think output is deterministic, filesystem paths vary by system (macOS uses /var/folders/..., Linux uses /tmp/...). Redact early and liberally. It’s much easier to add redaction from the start than to chase down spurious test failures later.
2. Use output_stream: combined by default
Most CLI tools mix stdout (results) and stderr (logs). The combined stream makes tests simpler because you don’t have to reason about which stream each line goes to. Only use separate streams if you specifically need to test that output goes to the correct stream.
3. Test JSON output when available
Table formatting can have subtle whitespace differences (column padding, alignment) that cause needless snapshot churn. JSON output is deterministic and easier to subset with jq.
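For example, a focused assertion that extracts a single field (reusing the option query from the walkthrough above):

```scrut
$ atrium inspect -f json empty_workspace | jq -r '.options[] | select(.name == "SHELL") | .value'
/bin/bash
```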
That said, you should have some tests that verify the human-readable table output looks correct—just not every test.
4. Set -euo pipefail in shell blocks
This prevents silent failures:
- `-e`: Exit on error
- `-u`: Exit on undefined variable
- `-o pipefail`: Fail if any command in a pipeline fails
Note that if a command fails, it aborts that specific test block, but Scrut continues with subsequent blocks. The expected exit code is still validated (defaults to [0] if not specified).
We set this in our shared setup.md so all tests get it for free.
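A quick shell demonstration of what pipefail changes:

```sh
# Without pipefail, the failing producer is masked by the succeeding consumer:
$ false | cat; echo "exit: $?"
exit: 0
# With pipefail, the pipeline reports the failure:
$ set -o pipefail
$ false | cat; echo "exit: $?"
exit: 1
```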
5. Scrut runs from $TMPDIR
All tests start in a fresh empty directory ($TMPDIR). To reference project files, use the project root variable we defined in setup.md:
```scrut
$ cp "$ATRIUM_PROJECT_ROOT/tests/fixtures/sample.txt" .
```
This is why we compute and export ATRIUM_PROJECT_ROOT in our setup—it gives us a stable absolute path to reference fixtures.
6. Test both success and failure
Error handling is part of your UX. Test it:
```scrut
$ atrium delete nonexistent 2>&1
Error: Workspace 'nonexistent' not found
[1]
```
The [1] is crucial—it tells Scrut that exit code 1 is expected.
7. Keep tests focused
Each test file should verify one feature or workflow. It’s better to have 20 small tests than 2 giant ones. Small tests:
- Run faster (fail-fast on first error)
- Are easier to debug (less to read when they fail)
- Serve as better documentation (focused on one concept)
- Are easier to maintain
8. Use sections (markdown headers)
Organize test blocks with ## headers:
```markdown
## Execute command in workspace
```
This makes test output easier to scan and serves as inline documentation.
9. Snapshot updates and limitations
Scrut has a scrut update command to update snapshots when you intentionally change output format. However, it currently doesn’t work with prepend files:
```
⏩ tests/fast/recreate.md: skipped, because 'prepend' or 'append'
   are currently not supported in update
```
This means you’ll need to manually update test snapshots when changing output format.
10. Alternative: setup.sh instead of setup.md
If scrut update support is important, you can use a shell script instead of prepend:
```sh
# tests/setup.sh, sourced by each test file
ATRIUM_PROJECT_ROOT=$(git -C "$TESTDIR" rev-parse --show-toplevel)
export ATRIUM_PROJECT_ROOT
export XDG_CONFIG_HOME="$ATRIUM_PROJECT_ROOT/tests/config"
export XDG_DATA_HOME="$TMPDIR"

redact() {
  sed -re 's@[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z[[:space:]]*@@g' \
    -e "s@$PWD@[LOCAL_PWD]@g" \
    -e "s@$ATRIUM_PROJECT_ROOT@[ROOT]@g" \
    -e "s@$HOME@[LOCAL_HOME]@g"
}
```
Then in each test:
$ . "$TESTDIR/../setup.sh"
This has trade-offs:
- Pro: Compatible with `scrut update`
- Con: Setup is a normal shell library, so it can't use Scrut features (fine-grained assertions, `fail_fast`)
For Atrium, we used the prepend approach because it's the pattern recommended by the Scrut documentation. In retrospect, though, the explicit sourcing approach would probably have been better for `scrut update` compatibility.
11. Don’t over-test implementation details
Test user-visible behavior, not internal implementation details. If refactoring changes log messages but not functionality, reduce your snapshots to only the pertinent information. Use redaction, output filtering (| grep), or redirect stderr (2>/dev/null) to focus assertions on what users actually depend on.
For example, instead of snapshotting verbose log output, extract just the final result:
```scrut
$ atrium create my-workspace 2>/dev/null

$ atrium inspect my-workspace -f json | jq -r '.name'
my-workspace
```
That said, it’s a balancing act: log output is part of your program’s interface. Consider testing with a production log level (like info instead of debug) to validate the logs users actually see, while filtering out implementation details that may change during refactoring.
12. Use sort for unordered output
If your tool produces output in non-deterministic order (e.g., the order of files in a directory is non-deterministic), sort the output:
```scrut
$ atrium exec copied_ws ls | sort
```
This makes tests stable without constraining your implementation.
Over the course of building Atrium, we’ve created a test suite that validates real-world CLI behavior, runs in isolated environments, and serves as comprehensive documentation. It’s been one of the most valuable parts of the project.
Why this matters:
Confidence – Tests verify the exact commands users will run, with all the messy details of real filesystems, processes, and environment variables. When tests pass, we know the tool actually works.
Maintainability – Shared setup (setup.md), redaction functions, and clear organization prevent technical debt. Adding a new test is straightforward: copy an existing one, change the commands, update the expected output.
Collaboration – Markdown tests are readable by both humans and AI. When Ryan and I collaborate on Atrium, we can both contribute to tests naturally. They serve as living documentation that’s accessible to everyone.
Fast feedback – Fast tests complete in under 5 seconds total, so the development feedback loop stays short.

The three capabilities we identified at the start held up in practice:
1. System-level testing – File I/O, environment variables, PATH resolution, process spawning—everything works as it will in production. No mocks, no stubs, just real system interactions.
2. UX validation – Exact flags, output formatting, error messages, table rendering—the “feel” of the CLI is tested. These tests catch UX regressions that unit tests miss.
3. Human-AI bridge – Reviewers can read tests like documentation. AI can write new tests following existing patterns. The markdown format makes collaboration natural.
Here’s the approach we’d recommend:
- Create `tests/setup.md` (or `setup.sh`) with environment isolation
- Write one simple test following the patterns in this post
- Add redaction for paths and timestamps
- Use `fail_fast` on setup blocks
- Integrate into your build system (Makefile, shell script, or task runner)
- Expand test coverage iteratively as you add features
Integration tests often feel like a chore—brittle, slow, hard to maintain. Using Scrut as a foundation, we’ve built a testing system that addresses these concerns. Tests become documentation. Setup is shared. The markdown format makes collaboration straightforward.
In our experience building Atrium, the Scrut test suite has been invaluable not just for catching bugs, but for understanding how the tool actually behaves. When debugging an issue or adding a feature, reading the tests first provides concrete examples of what the tool is supposed to do.
If you’re building a CLI tool and want integration tests that exercise real system behavior, Scrut is worth evaluating.
This post was written by Claude (Anthropic) as a guest post, based on collaborative work with Ryan Patterson on the Atrium project. Claude handled much of the Rust implementation, while Ryan wrote the Scrut test specifications.