commonsformat-eval-runner
A runner for Commons Format eval suites. Takes a parsed eval suite and an implementation under test, executes each case against the implementation, and reports the results.
Premise
The eval runner is what executes the contract. A spec ships an eval suite; an implementation claims conformance; the runner is what verifies the claim.
The runner takes a parsed eval suite (produced by a Commons Format parser) and an implementation under test, executes each case against the implementation, and reports which cases passed, which failed, and why.
The runner does not generate implementations, does not resolve dependencies, and does not interpret the meaning of cases beyond what their structure declares. It is a verification harness, not an intelligent agent.
Interface
A runner is a function or component that takes:
- A parsed eval suite (a structure as produced by a conformant parser reading an evals.toml file)
- An implementation under test (a callable, object, or other artifact the runner can invoke per-case)
- Optionally, a configuration controlling which cases to run
And returns a results structure containing, for each case attempted:
- The case name
- The case category
- Whether the case passed, failed, or was skipped
- For failures, a description of why
- For all cases, any output the runner captured during execution
The runner's contract is to faithfully execute each case as specified and to report results truthfully. The runner does not pass-or-fail at its own discretion; it reports what happened.
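One way to picture this interface is the Python sketch below. Every name in it is an assumption made for illustration: `CaseResult`, `RunnerConfig`, `run_suite`, the `cases` key, the choice to call the implementation with the input table as a dict, and the plain equality check against `expect`. The format spec defines the real shapes and comparison rules; this only shows the inputs and the per-case result entries described above.

```python
import contextlib
import io
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Set


@dataclass(frozen=True)
class CaseResult:
    name: str
    category: str
    status: str                   # "pass", "fail", or "skipped"
    reason: Optional[str] = None  # filled in for failures and skips
    captured_output: str = ""     # anything printed while the case ran


@dataclass(frozen=True)
class RunnerConfig:
    # Hypothetical option: restrict the run to these case names.
    only: Optional[Set[str]] = None


def run_suite(
    suite: Dict[str, Any],
    implementation: Callable[[Dict[str, Any]], Any],
    config: Optional[RunnerConfig] = None,
) -> List[CaseResult]:
    results: List[CaseResult] = []
    for case in suite["cases"]:  # declared order is preserved
        name, category = case["name"], case["category"]
        if config and config.only is not None and name not in config.only:
            results.append(CaseResult(name, category, "skipped", "deselected by configuration"))
            continue
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                actual = implementation(dict(case["input"]))
        except Exception as exc:
            # Recorded as "fail" here; a runner following the schema's
            # status values might record "errored" instead.
            reason = f"raised {type(exc).__name__}: {exc}"
            results.append(CaseResult(name, category, "fail", reason, buffer.getvalue()))
            continue
        if actual == case["expect"]:
            results.append(CaseResult(name, category, "pass", None, buffer.getvalue()))
        else:
            reason = f"expected {case['expect']!r}, got {actual!r}"
            results.append(CaseResult(name, category, "fail", reason, buffer.getvalue()))
    return results


suite = {"cases": [
    {"name": "adds", "category": "functional", "input": {"a": 1, "b": 2}, "expect": {"sum": 3}},
    {"name": "breaks", "category": "functional", "input": {"a": 1}, "expect": {"sum": 1}},
]}


def implementation(inputs):
    print("called with", sorted(inputs))
    return {"sum": inputs["a"] + inputs["b"]}


results = run_suite(suite, implementation)
for r in results:
    print(r.name, r.status, r.reason)
```

The second case raises inside the implementation, and the runner still produces an entry for it with a reason and the output captured up to that point.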
Schema
This module ships a schema.sql declaring the
shape of its data. Per §8
this is a shape commitment, not a storage commitment — the
generated implementation chooses a runtime representation
appropriate to its intent.
-- This DDL describes data shape, not storage. Runtime representation
-- is implementation-defined; choose what is appropriate to the target
-- language and the module's intent. The eval runner is a pure
-- harness over a parsed eval suite and an implementation under test;
-- the tables below are the shape of its results, not a persistent
-- record of runs.
CREATE TABLE eval_runs (
    run_id TEXT NOT NULL PRIMARY KEY,
    target_module TEXT NOT NULL,
    implementation_id TEXT NOT NULL,
    started_at_ms INTEGER NOT NULL,
    completed_at_ms INTEGER NOT NULL
);
CREATE TABLE eval_case_results (
    run_id TEXT NOT NULL,
    case_name TEXT NOT NULL,
    case_class TEXT NOT NULL CHECK (case_class IN ('functional', 'adversarial', 'generator_adversary')),
    category TEXT NOT NULL,
    status TEXT NOT NULL CHECK (status IN ('pass', 'fail', 'skipped', 'errored')),
    reason TEXT,
    captured_output BLOB,
    elapsed_ms INTEGER,
    PRIMARY KEY (run_id, case_name),
    FOREIGN KEY (run_id) REFERENCES eval_runs (run_id)
);
CREATE TABLE eval_run_summary (
    run_id TEXT NOT NULL PRIMARY KEY,
    cases_total INTEGER NOT NULL CHECK (cases_total >= 0),
    cases_passed INTEGER NOT NULL CHECK (cases_passed >= 0),
    cases_failed INTEGER NOT NULL CHECK (cases_failed >= 0),
    cases_skipped INTEGER NOT NULL CHECK (cases_skipped >= 0),
    cases_errored INTEGER NOT NULL CHECK (cases_errored >= 0),
    FOREIGN KEY (run_id) REFERENCES eval_runs (run_id)
);
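Because the DDL commits only to shape, an implementation could hold the summary as an in-memory value derived from the per-case statuses. The sketch below is one such representation in Python. `RunSummary` and `summarize` are hypothetical names, and the code assumes that cases_total is simply the number of result rows, which the schema implies but does not state.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable

# The four values allowed by the CHECK constraint on eval_case_results.status.
STATUSES = ("pass", "fail", "skipped", "errored")


@dataclass(frozen=True)
class RunSummary:
    run_id: str
    cases_total: int
    cases_passed: int
    cases_failed: int
    cases_skipped: int
    cases_errored: int


def summarize(run_id: str, statuses: Iterable[str]) -> RunSummary:
    """Count per-case statuses into the shape of eval_run_summary."""
    counts = Counter(statuses)
    unknown = set(counts) - set(STATUSES)
    if unknown:
        # Mirror the CHECK constraint rather than silently miscounting.
        raise ValueError(f"unknown status values: {sorted(unknown)}")
    return RunSummary(
        run_id=run_id,
        cases_total=sum(counts.values()),  # assumption: total = number of rows
        cases_passed=counts["pass"],
        cases_failed=counts["fail"],
        cases_skipped=counts["skipped"],
        cases_errored=counts["errored"],
    )


summary = summarize("run-1", ["pass", "pass", "fail", "skipped"])
print(summary)
```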
Case execution
Each case in the eval suite has an input
table and an expect table. The runner's job,
per case, is:
- Construct the inputs described by the case's input table
- Invoke the implementation under test with those inputs
- Compare the implementation's behavior or output against the case's expect table
- Report pass or fail based on whether the comparison succeeds
The translation from input table values to
actual inputs to the implementation, and from
implementation behavior to comparison against
expect, is determined by the case's
category. Each category has a conventional
execution model.
Categories and their execution models
Category execution semantics are defined by format-spec §9.3. The runner implements those semantics. The format is canonical; the runner conforms to it.
A runner that encounters a category not defined by the format spec reports affected cases as skipped with a "category not supported" indication. The case is neither passed nor failed; it is uncounted. This applies both to genuinely unknown categories and to categories the format defines but this runner has not implemented.
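A category lookup with an explicit skip path might look like the following Python sketch. The category name "pure-function", its execution model (keyword-argument call, comparison against a `returns` key), and the `EXECUTORS` registry are all invented for illustration; the real categories and their semantics come from format-spec §9.3.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Case = Dict[str, Any]
Executor = Callable[[Case, Callable[..., Any]], Tuple[bool, Optional[str]]]


def run_pure_function(case: Case, implementation: Callable[..., Any]) -> Tuple[bool, Optional[str]]:
    """Hypothetical execution model: call with keyword inputs, compare the return value."""
    actual = implementation(**case["input"])
    expected = case["expect"]["returns"]
    if actual == expected:
        return True, None
    return False, f"expected {expected!r}, got {actual!r}"


# Categories this runner implements. "pure-function" is an invented name.
EXECUTORS: Dict[str, Executor] = {"pure-function": run_pure_function}


def run_case(case: Case, implementation: Callable[..., Any]) -> Dict[str, Any]:
    result: Dict[str, Any] = {"name": case["name"], "category": case["category"]}
    executor = EXECUTORS.get(case["category"])
    if executor is None:
        # Unknown or unimplemented category: a visible skip, never a silent pass.
        result.update(status="skipped", reason="category not supported")
        return result
    passed, reason = executor(case, implementation)
    result.update(status="pass" if passed else "fail", reason=reason)
    return result


def shout(text):
    return text.upper()


supported = {"name": "upper", "category": "pure-function",
             "input": {"text": "abc"}, "expect": {"returns": "ABC"}}
unsupported = {"name": "timing", "category": "not-a-real-category",
               "input": {}, "expect": {}}

print(run_case(supported, shout))
print(run_case(unsupported, shout))
```

Keeping the skip decision in one place makes it hard for a case to vanish from the results: every path through `run_case` returns an entry with a status.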
What the runner must do
- executes-each-case-independently: cases do not affect each other; one failing case does not prevent subsequent cases from running
- reports-each-case: every case in the input suite produces a result entry, even if skipped
- truthful-reporting: a case is reported as passing only if the runner verified the expectation; ambiguity is reported as failure or as "could not verify"
- deterministic-ordering: running the same suite against the same implementation produces results in the same order (the order declared in the eval file)
- isolates-implementation: the implementation under test does not see runner internals or other cases' state
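As a small illustration of the isolation requirement, the Python sketch below hands the implementation a private deep copy of each case's input, so an implementation that mutates its arguments cannot change what a later case (or a rerun) sees. `call_isolated` is a hypothetical helper, and copying only guards against in-memory mutation; a runner that needs stronger isolation might run each case in a separate process, which is not shown here.

```python
import copy
from typing import Any, Callable, Dict


def call_isolated(implementation: Callable[[Dict[str, Any]], Any],
                  case_input: Dict[str, Any]) -> Any:
    """Invoke the implementation on a private deep copy of the case input."""
    return implementation(copy.deepcopy(case_input))


def mutating_implementation(inputs):
    # A badly behaved implementation that modifies what it was given.
    inputs["items"].append(99)
    return len(inputs["items"])


suite_input = {"items": [1, 2]}
first = call_isolated(mutating_implementation, suite_input)
second = call_isolated(mutating_implementation, suite_input)

print(first, second, suite_input)
```

Both calls see the original two-element list, and the suite's own data is left untouched after the run.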
What the runner must not do
- Modifying cases. The runner runs what's in the suite. It does not generate cases, infer cases, or skip cases on its own initiative outside category support.
- Persisting state across cases. Each case is independent.
- Making external network calls beyond what a case's input explicitly requires.
- Reporting passes for cases that did not actually pass. False positives undermine the entire conformance system and are the worst kind of bug a runner can have.
- Hiding failures. A case that fails is reported as failed, with enough detail that the consumer can understand what went wrong.
Threat model
The runner is a trust-critical component. Consumers rely on its results to decide whether to deploy implementations. A compromised or buggy runner that reports false passes is worse than no runner at all — it produces unjustified confidence.
Specific concerns:
- A runner that skips cases silently lets non-conformant implementations claim conformance. Skipping must be explicit and visible.
- A runner that catches all exceptions and reports "pass" hides failures. Exceptions during case execution must propagate to the result, not be swallowed.
- A runner that's unduly forgiving in comparisons (loose equality, substring matching where exact matching is required) accepts implementations that should be rejected. Comparison rules are per-category and the runner must honor them.
- A runner whose own correctness is not itself verified against the eval suite cannot be trusted. The runner is verified by running this module's eval suite against itself, and by being run against known-good and known-bad reference cases.
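The comparison concern is easy to hit by accident. In Python, for example, `1 == True` and `1 == 1.0` are both true, so a runner built on plain `==` would accept a boolean or a float where an integer was expected. The sketch below shows one stricter comparison, under the assumption that a category calls for exact, type-sensitive matching; `strict_equal` is a hypothetical helper, and which strictness each category actually requires is defined by the format spec, not by this example.

```python
from typing import Any


def strict_equal(expected: Any, actual: Any) -> bool:
    """Equality that also requires matching types, applied recursively."""
    if type(expected) is not type(actual):
        return False
    if isinstance(expected, dict):
        return (expected.keys() == actual.keys()
                and all(strict_equal(expected[k], actual[k]) for k in expected))
    if isinstance(expected, (list, tuple)):
        return (len(expected) == len(actual)
                and all(strict_equal(e, a) for e, a in zip(expected, actual)))
    return expected == actual


# Loose equality accepts these; the strict comparison rejects them.
print(1 == True, strict_equal(1, True))
print(1 == 1.0, strict_equal(1, 1.0))

# Substring matching would accept this; exact matching does not.
print("conform" in "conformant", strict_equal("conform", "conformant"))
```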
Consumers using a runner should be able to inspect its results critically. Output should be sufficiently detailed that a human or another tool can audit "this case passed because X."
Verification
This module's eval suite verifies runner conformance at the interface level: given a known input suite and a known implementation, the runner produces the expected results structure.
The format-spec module's eval suite is what runners actually execute against parser implementations to verify them. The eval runner spec inherits the format spec's evals via dependency, but those evals test parsers, not runners. This module's evals test runners.
A consumer generating a runner implementation iterates against this module's evals until conformance. They then have a working tool they can use to verify their parser, their resolver, and any other Commons Format module they consume.
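The idea of checking a runner against known-good and known-bad references can be sketched in a few lines of Python. Everything here is hypothetical: the tiny suite, the two reference implementations, the `runner_conforms` check, and the simplified runner signature. This module's real eval suite is what defines conformance; the sketch only shows why a runner that always reports "pass" is caught by a known-bad reference.

```python
from typing import Any, Callable, Dict, List

Runner = Callable[[Dict[str, Any], Callable[..., Any]], List[Dict[str, Any]]]

KNOWN_SUITE = {"cases": [
    {"name": "doubles-two", "input": {"n": 2}, "expect": 4},
    {"name": "doubles-five", "input": {"n": 5}, "expect": 10},
]}


def known_good(n):
    return n * 2


def known_bad(n):
    return n + 2  # right for n=2, wrong for n=5


def statuses(runner: Runner, implementation: Callable[..., Any]) -> List[str]:
    return [r["status"] for r in runner(KNOWN_SUITE, implementation)]


def runner_conforms(runner: Runner) -> bool:
    """The runner must pass the good reference and catch the bad one."""
    return (statuses(runner, known_good) == ["pass", "pass"]
            and statuses(runner, known_bad) == ["pass", "fail"])


def honest_runner(suite, implementation):
    results = []
    for case in suite["cases"]:
        try:
            ok = implementation(**case["input"]) == case["expect"]
        except Exception:
            ok = False
        results.append({"name": case["name"], "status": "pass" if ok else "fail"})
    return results


def always_pass_runner(suite, implementation):
    # A broken runner that reports success without checking anything.
    return [{"name": c["name"], "status": "pass"} for c in suite["cases"]]


print(runner_conforms(honest_runner))
print(runner_conforms(always_pass_runner))
```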