module · canonical release 0.2

commonsformat-eval-runner

version 0.1.2 · targets format 0.2

A runner for Commons Format eval suites. Takes a parsed eval suite and an implementation under test, executes each case against the implementation, and reports the results.

depends on github.com/commonsformat/commonsformat-format ^0.2.0

github.com/commonsformat/commonsformat-parser ^0.1.1

license MPL-2.0

verifies ./evals.toml

Premise

The eval runner is what executes the contract. A spec ships an eval suite; an implementation claims conformance; the runner is what verifies the claim.

The runner takes a parsed eval suite (produced by a Commons Format parser) and an implementation under test, executes each case against the implementation, and reports which cases passed, which failed, and why.

The runner does not generate implementations, does not resolve dependencies, and does not interpret the meaning of cases beyond what their structure declares. It is a verification harness, not an intelligent agent.

Interface

A runner is a function or component that takes:

A parsed eval suite (a structure as produced by a conformant parser reading an evals.toml file)
An implementation under test (a callable, object, or other artifact the runner can invoke per-case)
Optionally, a configuration controlling which cases to run

And returns a results structure containing, for each case attempted:

The case name
The case category
Whether the case passed, failed, errored, or was skipped
For failures, a description of why
For all cases, any output the runner captured during execution

The runner's contract is to faithfully execute each case as specified and to report results truthfully. The runner does not pass-or-fail at its own discretion; it reports what happened.
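As a rough illustration, the interface above might be sketched in Python as follows. The names `CaseResult`, `RunnerConfig`, and `run_suite`, the dict-shaped cases, and the plain `==` comparison are all assumptions made for this sketch; the spec does not prescribe any of them. Cases excluded by configuration are reported as skipped here so that every case still yields a result entry.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class CaseResult:
    name: str
    category: str
    status: str                    # e.g. "pass", "fail", "skipped"
    reason: Optional[str] = None   # filled in for failures and skips
    captured_output: str = ""      # whatever the runner captured


@dataclass(frozen=True)
class RunnerConfig:
    # Hypothetical filter: names of the cases to run; None means "all".
    only: Optional[frozenset] = None


def run_suite(suite: list, implementation: Callable[[Any], Any],
              config: Optional[RunnerConfig] = None) -> list:
    """Return one CaseResult per case, in suite order."""
    results = []
    for case in suite:
        name, category = case["name"], case["category"]
        if config and config.only is not None and name not in config.only:
            results.append(
                CaseResult(name, category, "skipped", "excluded by configuration"))
            continue
        actual = implementation(case["input"]["value"])
        expected = case["expect"]["value"]
        if actual == expected:  # simplified comparison, for illustration only
            results.append(CaseResult(name, category, "pass", None, repr(actual)))
        else:
            results.append(CaseResult(
                name, category, "fail",
                f"expected {expected!r}, got {actual!r}", repr(actual)))
    return results


suite = [
    {"name": "doubles-two", "category": "example",
     "input": {"value": 2}, "expect": {"value": 4}},
    {"name": "doubles-three", "category": "example",
     "input": {"value": 3}, "expect": {"value": 7}},
]
results = run_suite(suite, lambda x: x * 2)
for r in results:
    print(r.name, r.status, r.reason)
```

The second case fails on purpose, which shows the failure reason travelling with the result rather than being decided at the runner's discretion.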

Schema

This module ships a schema.sql declaring the shape of its data. Per §8 this is a shape commitment, not a storage commitment — the generated implementation chooses a runtime representation appropriate to its intent.

-- This DDL describes data shape, not storage. Runtime representation
-- is implementation-defined; choose what is appropriate to the target
-- language and the module's intent. The eval runner is a pure
-- harness over a parsed eval suite and an implementation under test;
-- the tables below are the shape of its results, not a persistent
-- record of runs.

CREATE TABLE eval_runs (
    run_id               TEXT    NOT NULL PRIMARY KEY,
    target_module        TEXT    NOT NULL,
    implementation_id    TEXT    NOT NULL,
    started_at_ms        INTEGER NOT NULL,
    completed_at_ms      INTEGER NOT NULL
);

CREATE TABLE eval_case_results (
    run_id               TEXT    NOT NULL,
    case_name            TEXT    NOT NULL,
    case_class           TEXT    NOT NULL CHECK (case_class IN ('functional', 'adversarial', 'generator_adversary')),
    category             TEXT    NOT NULL,
    status               TEXT    NOT NULL CHECK (status IN ('pass', 'fail', 'skipped', 'errored')),
    reason               TEXT,
    captured_output      BLOB,
    elapsed_ms           INTEGER,
    PRIMARY KEY (run_id, case_name),
    FOREIGN KEY (run_id) REFERENCES eval_runs (run_id)
);

CREATE TABLE eval_run_summary (
    run_id               TEXT    NOT NULL PRIMARY KEY,
    cases_total          INTEGER NOT NULL CHECK (cases_total >= 0),
    cases_passed         INTEGER NOT NULL CHECK (cases_passed >= 0),
    cases_failed         INTEGER NOT NULL CHECK (cases_failed >= 0),
    cases_skipped        INTEGER NOT NULL CHECK (cases_skipped >= 0),
    cases_errored        INTEGER NOT NULL CHECK (cases_errored >= 0),
    FOREIGN KEY (run_id) REFERENCES eval_runs (run_id)
);
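One way to sanity-check the shape above is to load it into an in-memory SQLite database and derive the summary row from the case results. Using SQLite is purely an assumption of this sketch (the DDL commits to shape, not storage), and the run and case values are invented sample data.

```python
import sqlite3

# Table shapes copied from the schema above.
SCHEMA = """
CREATE TABLE eval_runs (
    run_id            TEXT    NOT NULL PRIMARY KEY,
    target_module     TEXT    NOT NULL,
    implementation_id TEXT    NOT NULL,
    started_at_ms     INTEGER NOT NULL,
    completed_at_ms   INTEGER NOT NULL
);
CREATE TABLE eval_case_results (
    run_id          TEXT NOT NULL,
    case_name       TEXT NOT NULL,
    case_class      TEXT NOT NULL CHECK (case_class IN
                        ('functional', 'adversarial', 'generator_adversary')),
    category        TEXT NOT NULL,
    status          TEXT NOT NULL CHECK (status IN
                        ('pass', 'fail', 'skipped', 'errored')),
    reason          TEXT,
    captured_output BLOB,
    elapsed_ms      INTEGER,
    PRIMARY KEY (run_id, case_name),
    FOREIGN KEY (run_id) REFERENCES eval_runs (run_id)
);
CREATE TABLE eval_run_summary (
    run_id        TEXT    NOT NULL PRIMARY KEY,
    cases_total   INTEGER NOT NULL CHECK (cases_total >= 0),
    cases_passed  INTEGER NOT NULL CHECK (cases_passed >= 0),
    cases_failed  INTEGER NOT NULL CHECK (cases_failed >= 0),
    cases_skipped INTEGER NOT NULL CHECK (cases_skipped >= 0),
    cases_errored INTEGER NOT NULL CHECK (cases_errored >= 0),
    FOREIGN KEY (run_id) REFERENCES eval_runs (run_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented sample data for one run.
conn.execute("INSERT INTO eval_runs VALUES (?, ?, ?, ?, ?)",
             ("run-1", "example-module", "impl-a", 1000, 1250))
conn.executemany(
    "INSERT INTO eval_case_results VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("run-1", "case-a", "functional", "example", "pass", None, b"ok", 3),
        ("run-1", "case-b", "functional", "example", "fail",
         "expected 4, got 5", b"5", 2),
        ("run-1", "case-c", "adversarial", "example", "skipped",
         "category not supported", None, None),
    ],
)

# Derive the summary row from the per-case rows instead of counting by hand.
conn.execute(
    """
    INSERT INTO eval_run_summary
    SELECT run_id, COUNT(*),
           SUM(status = 'pass'), SUM(status = 'fail'),
           SUM(status = 'skipped'), SUM(status = 'errored')
    FROM eval_case_results
    WHERE run_id = ?
    GROUP BY run_id
    """,
    ("run-1",),
)
summary = conn.execute("SELECT * FROM eval_run_summary").fetchone()
print(summary)
```

Deriving the summary from the case rows keeps the two tables from drifting apart; an implementation that holds results in memory could compute the same counts with a simple tally.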

Case execution

Each case in the eval suite has an input table and an expect table. The runner's job, per case, is:

Construct the inputs described by the case's input table
Invoke the implementation under test with those inputs
Compare the implementation's behavior or output against the case's expect table
Report pass or fail based on whether the comparison succeeds

The translation from input table values to actual inputs to the implementation, and from implementation behavior to comparison against expect, is determined by the case's category. Each category has a conventional execution model.
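The four steps and the per-category translation could be sketched as a small dispatch table. The category name `example-sum` is invented for this illustration; real categories and their semantics come from the format spec, and `ExecutionModel` and `execute_case` are hypothetical names.

```python
from typing import Any, Callable, NamedTuple


class ExecutionModel(NamedTuple):
    build_inputs: Callable[[dict], tuple]   # input table -> positional arguments
    compare: Callable[[Any, dict], bool]    # (actual, expect table) -> verdict


# Hypothetical category used only for this sketch.
MODELS = {
    "example-sum": ExecutionModel(
        build_inputs=lambda table: (table["a"], table["b"]),
        compare=lambda actual, expect: actual == expect["result"],
    ),
}


def execute_case(case: dict, implementation: Callable[..., Any]) -> dict:
    model = MODELS[case["category"]]
    args = model.build_inputs(case["input"])        # 1. construct the inputs
    actual = implementation(*args)                  # 2. invoke the implementation
    ok = model.compare(actual, case["expect"])      # 3. compare against expect
    return {"name": case["name"],                   # 4. report pass or fail
            "status": "pass" if ok else "fail"}


case = {
    "name": "adds-small-integers",
    "category": "example-sum",
    "input": {"a": 2, "b": 3},
    "expect": {"result": 5},
}
print(execute_case(case, lambda a, b: a + b))
print(execute_case(case, lambda a, b: a - b))
```

Keeping input construction and comparison together per category means the runner core never has to interpret a case beyond what its structure declares.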

Categories and their execution models

Category execution semantics are defined by format-spec §9.3. The runner implements those semantics. The format is canonical; the runner conforms to it.

A runner that encounters a category it does not support reports the affected cases as skipped with a "category not supported" indication. Such a case is neither passed nor failed, and it counts toward neither total. This applies both to categories the format spec does not define and to categories the format defines but this runner has not implemented.
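A minimal sketch of that skip rule, assuming dict-shaped cases and a hypothetical `SUPPORTED_CATEGORIES` set (the category names are invented):

```python
# Hypothetical set of categories this runner implements.
SUPPORTED_CATEGORIES = frozenset({"example-sum"})


def run_with_category_check(suite, run_supported_case):
    """Skip unsupported categories visibly; delegate everything else."""
    results = []
    for case in suite:
        if case["category"] not in SUPPORTED_CATEGORIES:
            results.append({
                "name": case["name"],
                "category": case["category"],
                "status": "skipped",
                "reason": "category not supported",
            })
            continue
        results.append(run_supported_case(case))
    return results


def verdict_counts(results):
    """Skipped cases are tallied on their own, never as passes or failures."""
    return {
        "passed": sum(1 for r in results if r["status"] == "pass"),
        "failed": sum(1 for r in results if r["status"] == "fail"),
        "skipped": sum(1 for r in results if r["status"] == "skipped"),
    }


suite = [
    {"name": "known", "category": "example-sum"},
    {"name": "unknown", "category": "example-not-implemented"},
]
results = run_with_category_check(
    suite,
    # Stand-in for real execution of a supported case.
    lambda case: {"name": case["name"], "category": case["category"],
                  "status": "pass", "reason": None},
)
counts = verdict_counts(results)
print(counts)
```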

What the runner must do

executes-each-case-independently — cases do not affect each other; one failing case does not prevent subsequent cases from running
reports-each-case — every case in the input suite produces a result entry, even if skipped
truthful-reporting — a case is reported as passing only if the runner verified the expectation; ambiguity is reported as failure or as "could not verify"
deterministic-ordering — running the same suite against the same implementation produces results in the same order (the order declared in the eval file)
isolates-implementation — the implementation under test does not see runner internals or other cases' state
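Several of these requirements can be illustrated together. The sketch below assumes the runner is given a factory (`make_implementation`, a hypothetical name) so each case gets a fresh instance, hands the implementation a deep copy of the inputs, and records exceptions per case using the schema's `errored` status. The plain `==` comparison is a simplification.

```python
import copy


def run_isolated(suite, make_implementation):
    results = []
    for case in suite:                           # declared order is preserved
        implementation = make_implementation()   # fresh instance for every case
        inputs = copy.deepcopy(case["input"])    # suite data is never handed over
        try:
            actual = implementation(inputs)
        except Exception as exc:
            # Record the error and keep going; later cases still run.
            results.append({"name": case["name"], "status": "errored",
                            "reason": f"{type(exc).__name__}: {exc}"})
            continue
        expected = case["expect"]["value"]
        if actual == expected:
            results.append({"name": case["name"], "status": "pass", "reason": None})
        else:
            results.append({"name": case["name"], "status": "fail",
                            "reason": f"expected {expected!r}, got {actual!r}"})
    return results


def make_counter():
    """A deliberately stateful implementation: returns how often it was called."""
    calls = 0

    def implementation(inputs):
        nonlocal calls
        calls += 1
        if inputs.get("explode"):
            raise ValueError("boom")
        inputs["seen"] = True    # mutation attempt lands on the copy only
        return calls

    return implementation


suite = [
    {"name": "first", "input": {}, "expect": {"value": 1}},
    {"name": "raises", "input": {"explode": True}, "expect": {"value": 1}},
    {"name": "third", "input": {}, "expect": {"value": 1}},
]
results = run_isolated(suite, make_counter)
for r in results:
    print(r)
```

If the same instance were reused across cases, the third case would see a call count of 3 and fail, so the example doubles as a check that no state leaks between cases.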

What the runner must not do

Modifying cases. The runner runs what's in the suite. It does not generate cases, infer cases, or skip cases on its own initiative outside category support.
Persisting state across cases. Each case is independent.
Making external network calls beyond what a case's input explicitly requires.
Reporting passes for cases that did not actually pass. False positives undermine the entire conformance system and are the worst kind of bug a runner can have.
Hiding failures. A case that fails is reported as failed, with enough detail that the consumer can understand what went wrong.

Threat model

The runner is a trust-critical component. Consumers rely on its results to decide whether to deploy implementations. A compromised or buggy runner that reports false passes is worse than no runner at all — it produces unjustified confidence.

Specific concerns:

A runner that skips cases silently lets non-conformant implementations claim conformance. Skipping must be explicit and visible.
A runner that catches all exceptions and reports "pass" hides failures. Exceptions during case execution must propagate to the result, not be swallowed.
A runner that's unduly forgiving in comparisons (loose equality, substring matching where exact matching is required) accepts implementations that should be rejected. Comparison rules are per-category and the runner must honor them.
A runner whose own correctness is not itself verified against the eval suite cannot be trusted. The runner is verified by running this module's eval suite against itself, and by being run against known-good and known-bad reference cases.
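To make the concern about forgiving comparisons concrete, here is one possible exact-match rule in Python. It is only an illustration: the actual comparison rules are per-category and come from the format spec, and `strict_equal` is a hypothetical helper. The point is that Python's built-in `==` treats `1`, `1.0`, and `True` as equal, which a conformance check may not want.

```python
def strict_equal(actual, expected) -> bool:
    """Equality that also requires identical types, applied recursively."""
    if type(actual) is not type(expected):
        return False
    if isinstance(expected, dict):
        return (actual.keys() == expected.keys()
                and all(strict_equal(actual[k], expected[k]) for k in expected))
    if isinstance(expected, (list, tuple)):
        return (len(actual) == len(expected)
                and all(strict_equal(a, e) for a, e in zip(actual, expected)))
    return actual == expected


# Loose checks that would accept these pairs:
print(1 == True, 1 == 1.0, "abc" in "abcdef")

# The strict rule rejects all three:
print(strict_equal(1, True), strict_equal(1, 1.0), strict_equal("abc", "abcdef"))

# Structurally identical values still match:
print(strict_equal({"a": [1, 2]}, {"a": [1, 2]}))
```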

Consumers using a runner should be able to inspect its results critically. Output should be sufficiently detailed that a human or another tool can audit "this case passed because X."

Verification

This module's eval suite verifies runner conformance at the interface level: given a known input suite and a known implementation, the runner produces the expected results structure.
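Interface-level verification of this kind might look like the sketch below: a tiny known suite is run against a known-good and a known-bad implementation, and the runner is judged on whether it reports each correctly. All names here (`check_runner`, `honest_runner`, `always_pass_runner`) and the case shape are assumptions for illustration, not part of this module's eval suite.

```python
def check_runner(runner) -> list:
    """Return the problems found; an empty list means none in this check."""
    suite = [{"name": "adds", "input": {"a": 1, "b": 2}, "expect": {"result": 3}}]
    known_good = lambda a, b: a + b
    known_bad = lambda a, b: a - b

    problems = []
    if [r["status"] for r in runner(suite, known_good)] != ["pass"]:
        problems.append("known-good implementation was not reported as passing")
    if [r["status"] for r in runner(suite, known_bad)] != ["fail"]:
        problems.append("known-bad implementation was not reported as failing")
    return problems


def honest_runner(suite, implementation):
    return [
        {"name": c["name"],
         "status": "pass" if implementation(**c["input"]) == c["expect"]["result"]
                   else "fail"}
        for c in suite
    ]


def always_pass_runner(suite, implementation):
    # A buggy runner that reports a pass without checking anything.
    return [{"name": c["name"], "status": "pass"} for c in suite]


print(check_runner(honest_runner))
print(check_runner(always_pass_runner))
```

The known-bad implementation matters as much as the known-good one: a runner that passes everything looks fine until it is given something that should fail.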

The format-spec module's eval suite is what runners actually execute against parser implementations to verify them. The eval runner spec inherits the format spec's evals via dependency, but those evals test parsers, not runners. This module's evals test runners.

A consumer generating a runner implementation iterates against this module's evals until the implementation conforms. The consumer then has a working tool for verifying their parser, their resolver, and any other Commons Format module they consume.

describes commonsformat-eval-runner 0.1.2 · targets commons format 0.2 · generated from release 0.2 · 2026-05-28