commonsformat-parser
A parser for Commons Format modules. Reads a module directory from disk and produces a structured representation of its contents. Verifies that the module conforms to the format's structural and encoding requirements.
Premise
The parser reads a module directory from disk and produces a structured representation of the module's contents. The structured representation contains the parsed metadata, the parsed prose with extracted tagged sections, and (if present) the parsed eval suite.
The parser does not resolve dependencies, does not fetch from Git, does not verify lockfiles, and does not run evals. Those are separate concerns handled by separate tools, each specified by its own module.
This module describes what a conformant parser must do. It does not prescribe how a parser is implemented, what language it is written in, what its calling conventions are, or how it reports errors. Those are decisions consumers make when generating their parser.
Interface
A parser is a function or component that takes a path to a module directory on disk and returns a Module structure containing:
- The parsed metadata from commonsformat.toml
- The parsed prose from commonsformat.md, with tagged sections extracted
- The parsed eval suite from evals.toml, if present
- The parsed metadata from any auxiliary files referenced from commonsformat.toml
If the module directory does not contain a valid Commons Format module per the format specification, the parser reports the violation and does not return a Module structure. The format of the violation report is the consumer's choice; the violations themselves are determined by the format.
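As a rough illustration of the contract above, a parser could be exposed as one function that either returns a Module or raises with the violations it found. Every name here (`parse_module`, `Module`, `Violation`, `ParseError`, the `*_source` fields, the `"missing-required-file"` label) is hypothetical, since the format leaves calling conventions and error reporting to the consumer. The sketch stores raw text where a real parser would store parsed TOML and Markdown, and it omits subset validation, encoding checks, and auxiliary files.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

REQUIRED_FILES = ("commonsformat.toml", "commonsformat.md", "LICENSE")


@dataclass(frozen=True)
class Violation:
    kind: str                       # hypothetical label, e.g. "missing-required-file"
    detail: str
    file_path: Optional[str] = None


class ParseError(Exception):
    """Carries the violations when no Module can be returned."""

    def __init__(self, violations: List[Violation]):
        super().__init__(f"{len(violations)} violation(s)")
        self.violations = violations


@dataclass(frozen=True)
class Module:
    path: str
    metadata_source: str            # stand-in for parsed commonsformat.toml
    prose_source: str               # stand-in for parsed commonsformat.md
    evals_source: Optional[str]     # None when evals.toml is absent


def parse_module(module_dir: str) -> Module:
    root = Path(module_dir)
    missing = [n for n in REQUIRED_FILES if not (root / n).is_file()]
    if missing:
        # Report every violation found; never return a partial Module.
        raise ParseError(
            [Violation("missing-required-file", f"{n} not found", n) for n in missing]
        )
    evals = root / "evals.toml"
    return Module(
        path=str(root),
        metadata_source=(root / "commonsformat.toml").read_text(encoding="utf-8"),
        prose_source=(root / "commonsformat.md").read_text(encoding="utf-8"),
        evals_source=evals.read_text(encoding="utf-8") if evals.is_file() else None,
    )
```

Raising an exception is only one way to "report the violation and not return a Module"; a result type or an error callback would satisfy the same requirement.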
Schema
This module ships a schema.sql declaring the
shape of its data. Per §8,
this is a shape commitment, not a storage commitment: the
generated implementation chooses a runtime representation
appropriate to its intent.
-- This DDL describes data shape, not storage. Runtime representation
-- is implementation-defined; choose what is appropriate to the target
-- language and the module's intent. The parser is a pure function
-- from a module directory to a Module structure; the tables below are
-- the shape of that structure, not a database the parser maintains.
CREATE TABLE modules (
    path TEXT NOT NULL PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    description TEXT NOT NULL,
    license TEXT NOT NULL,
    commonsformat TEXT NOT NULL,
    has_eval_suite BOOLEAN NOT NULL,
    has_schema BOOLEAN NOT NULL
);
CREATE TABLE module_authors (
    module_path TEXT NOT NULL,
    ordinal INTEGER NOT NULL,
    author_name TEXT NOT NULL,
    email TEXT,
    url TEXT,
    PRIMARY KEY (module_path, ordinal),
    FOREIGN KEY (module_path) REFERENCES modules (path)
);
CREATE TABLE module_dependencies (
    module_path TEXT NOT NULL,
    ordinal INTEGER NOT NULL,
    spec_url TEXT NOT NULL,
    version_constraint TEXT,
    git_ref TEXT,
    commit_sha TEXT,
    edge_kind TEXT NOT NULL CHECK (edge_kind IN ('depends_on', 'extends')),
    PRIMARY KEY (module_path, ordinal),
    FOREIGN KEY (module_path) REFERENCES modules (path)
);
CREATE TABLE tagged_sections (
    module_path TEXT NOT NULL,
    tag_name TEXT NOT NULL,
    ordinal INTEGER NOT NULL,
    content TEXT NOT NULL,
    PRIMARY KEY (module_path, tag_name, ordinal),
    FOREIGN KEY (module_path) REFERENCES modules (path)
);
CREATE TABLE named_constraints (
    module_path TEXT NOT NULL,
    constraint_name TEXT NOT NULL,
    description TEXT NOT NULL,
    PRIMARY KEY (module_path, constraint_name),
    FOREIGN KEY (module_path) REFERENCES modules (path)
);
CREATE TABLE module_examples (
    module_path TEXT NOT NULL,
    example_name TEXT NOT NULL,
    content TEXT NOT NULL,
    PRIMARY KEY (module_path, example_name),
    FOREIGN KEY (module_path) REFERENCES modules (path)
);
CREATE TABLE eval_cases (
    module_path TEXT NOT NULL,
    case_name TEXT NOT NULL,
    case_class TEXT NOT NULL CHECK (case_class IN ('functional', 'adversarial', 'generator_adversary')),
    category TEXT NOT NULL,
    description TEXT NOT NULL,
    input_blob BLOB NOT NULL,
    expect_blob BLOB NOT NULL,
    severity TEXT NOT NULL DEFAULT 'error' CHECK (severity IN ('info', 'warn', 'error', 'critical')),
    PRIMARY KEY (module_path, case_name),
    FOREIGN KEY (module_path) REFERENCES modules (path)
);
CREATE TABLE eval_case_verifies (
    module_path TEXT NOT NULL,
    case_name TEXT NOT NULL,
    constraint_name TEXT NOT NULL,
    PRIMARY KEY (module_path, case_name, constraint_name)
);
CREATE TABLE eval_properties (
    module_path TEXT NOT NULL,
    property_name TEXT NOT NULL,
    property_value TEXT NOT NULL,
    PRIMARY KEY (module_path, property_name),
    FOREIGN KEY (module_path) REFERENCES modules (path)
);
CREATE TABLE schema_tables (
    module_path TEXT NOT NULL,
    table_name TEXT NOT NULL,
    PRIMARY KEY (module_path, table_name),
    FOREIGN KEY (module_path) REFERENCES modules (path)
);
CREATE TABLE schema_columns (
    module_path TEXT NOT NULL,
    table_name TEXT NOT NULL,
    column_name TEXT NOT NULL,
    column_type TEXT NOT NULL CHECK (column_type IN ('INTEGER', 'REAL', 'TEXT', 'BLOB', 'TIMESTAMP', 'BOOLEAN')),
    ordinal INTEGER NOT NULL,
    is_not_null BOOLEAN NOT NULL DEFAULT FALSE,
    is_in_primary_key BOOLEAN NOT NULL DEFAULT FALSE,
    default_literal TEXT,
    PRIMARY KEY (module_path, table_name, column_name),
    FOREIGN KEY (module_path, table_name) REFERENCES schema_tables (module_path, table_name)
);
CREATE TABLE parse_violations (
    module_path TEXT NOT NULL,
    ordinal INTEGER NOT NULL,
    kind TEXT NOT NULL,
    detail TEXT NOT NULL,
    file_path TEXT,
    line_number INTEGER,
    PRIMARY KEY (module_path, ordinal)
);
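Because the DDL above commits to shape rather than storage, one plausible in-memory rendering of the module_dependencies table is a nested value type. The class and field layout below is an assumption made for illustration, not something the format prescribes: the ordinal column becomes position in a tuple, nullable columns become optional fields, and the CHECK on edge_kind becomes a constructor check. All example values are made up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

EDGE_KINDS = ("depends_on", "extends")   # mirrors the CHECK on edge_kind


@dataclass(frozen=True)
class Dependency:
    spec_url: str
    edge_kind: str
    version_constraint: Optional[str] = None   # nullable column
    git_ref: Optional[str] = None              # nullable column
    commit_sha: Optional[str] = None           # nullable column

    def __post_init__(self) -> None:
        if self.edge_kind not in EDGE_KINDS:
            raise ValueError(
                f"edge_kind must be one of {EDGE_KINDS}, got {self.edge_kind!r}"
            )


@dataclass(frozen=True)
class ModuleMetadata:
    path: str
    name: str
    version: str
    # The ordinal column becomes position in the tuple; no key is stored.
    dependencies: Tuple[Dependency, ...] = ()


meta = ModuleMetadata(
    path="./example-module",          # hypothetical values throughout
    name="example-module",
    version="0.1.0",
    dependencies=(Dependency("https://example.com/spec", "depends_on"),),
)
print(meta.dependencies[0].edge_kind)
```

A parser targeting another language might equally use records, structs, or plain maps; the only commitment is that the same logical fields are reachable.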
What the parser must accept
- accepts-conformant-modules: any module directory that conforms to the format specification produces a Module structure containing all fields the format defines
- preserves-tagged-content: tagged section content is extracted verbatim, preserving whitespace and Markdown formatting inside the tags
- preserves-prose: prose outside tagged sections is preserved as parsed Markdown structure (or as raw text; both are conformant)
- handles-utf8: input files in UTF-8 are accepted; non-UTF-8 input is rejected with an encoding violation
- handles-line-endings: LF and CRLF line endings are both accepted; internal representation normalizes to LF
- deterministic: parsing the same module twice produces equivalent Module structures
- offline: parsing does not require network access
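The line-ending requirement can be sketched in a few lines. The helper name `normalize_newlines` is hypothetical, and the sketch assumes the input has already been decoded as UTF-8. A lone CR is left untouched here because the requirement names only LF and CRLF; how a conformant parser treats a bare CR is for the format specification to say.

```python
def normalize_newlines(text: str) -> str:
    """Rewrite CRLF as LF; text that already uses LF is returned unchanged."""
    return text.replace("\r\n", "\n")


lf = "# Title\n\nBody line\n"
crlf = "# Title\r\n\r\nBody line\r\n"

# The same content with either line ending normalizes to the same text,
# which is also what makes the parsed result equivalent across platforms.
assert normalize_newlines(lf) == normalize_newlines(crlf) == lf
```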
What the parser must reject
- rejects-missing-required-files: a directory without commonsformat.toml, commonsformat.md, or LICENSE is rejected
- rejects-malformed-toml: TOML files violating the format's TOML subset are rejected
- rejects-malformed-markdown: Markdown files violating the format's Markdown subset are rejected (where the subset is restrictive; the parser need not reject CommonMark constructs the subset permits to be ignored)
- rejects-malformed-frontmatter: missing required fields or invalid field values in commonsformat.toml are rejected
- rejects-malformed-tags: tagged sections that are unclosed, improperly nested, or violate uniqueness rules are rejected
- rejects-non-utf8: files containing non-UTF-8 bytes are rejected
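To make one of these rejections concrete, here is a minimal sketch of the non-UTF-8 check that produces a row loosely shaped like the parse_violations table (kind, detail, file_path, line_number). The function name, the `"non-utf8"` label, and the wording of the detail string are all assumptions; the format decides what counts as a violation, not how it is reported.

```python
from typing import Optional, TypedDict


class ViolationRow(TypedDict):
    kind: str
    detail: str
    file_path: str
    line_number: int


def utf8_violation(file_path: str, raw: bytes) -> Optional[ViolationRow]:
    """Return a violation row for the first invalid byte, or None if raw is valid UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return None
    except UnicodeDecodeError as exc:
        # Lines are 1-indexed: count the LF bytes before the offending offset.
        line = raw.count(b"\n", 0, exc.start) + 1
        return {
            "kind": "non-utf8",            # hypothetical kind label
            "detail": f"invalid byte at offset {exc.start}: {exc.reason}",
            "file_path": file_path,
            "line_number": line,
        }
```

Strict decoding is used on purpose: replacing or dropping invalid bytes would amount to auto-correcting malformed input.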
Anti-patterns
- Performing dependency resolution. The parser handles a single module in isolation.
- Fetching from the network. The parser operates on the local filesystem only.
- Executing eval cases. The parser parses the eval suite into a structured representation; running cases is the eval runner's job.
- Auto-correcting malformed input. If the input violates the format, the parser rejects it. It does not silently fix things.
- Tolerating non-subset TOML or Markdown features. The parser enforces the subsets defined in the format spec.
Module structure produced
The Module structure produced by the parser contains the
logical contents specified in
format-spec §10
(Logical Module Structure): metadata, prose with extracted
tagged sections, constraints, interface, avoid, threat_model,
examples, evals (when verifies is declared), and
path.
Concrete representation is at the implementation's discretion. Two conformant parsers may use different in-memory data structures or field names; what matters is that the logical contents specified by the format are accessible.
Threat model
The parser operates on potentially adversarial input. A malicious module on disk should not cause the parser to:
- Crash unexpectedly (controlled rejection is fine; uncontrolled failure is not)
- Consume unbounded memory or time (the parser must enforce resource bounds)
- Execute arbitrary code (the parser does not interpret content as code)
- Access the network, or any part of the filesystem outside the module directory
- Leak information through timing channels in security-relevant comparisons (e.g., when checking checksums or signatures, though those are not the parser's primary responsibility)
The parser is the first line of defense against malformed or hostile modules. Subsequent tools (resolver, eval runner) trust that the parser has validated the input's structural correctness.
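Two of the threats above, unbounded reads and access outside the module directory, can be guarded against with one small helper. This is a sketch under stated assumptions: `read_bounded`, `RejectedInput`, and the 1,000,000-byte limit are hypothetical, and the format does not specify a particular bound. The check is not race-free (a path could change between resolution and opening), so a hardened parser may need more than this.

```python
from pathlib import Path

MAX_FILE_BYTES = 1_000_000   # hypothetical limit; the format does not fix a number


class RejectedInput(Exception):
    """Controlled rejection, as opposed to an uncontrolled crash."""


def read_bounded(module_dir: str, relative: str, max_bytes: int = MAX_FILE_BYTES) -> bytes:
    root = Path(module_dir).resolve()
    target = (root / relative).resolve()      # follows symlinks and collapses ".."
    if target != root and root not in target.parents:
        raise RejectedInput(f"{relative} resolves outside the module directory")
    try:
        with open(target, "rb") as fh:
            data = fh.read(max_bytes + 1)     # read at most one byte past the limit
    except OSError as exc:
        # Missing files, directories, and permission errors become rejections.
        raise RejectedInput(f"cannot read {relative}: {exc}") from exc
    if len(data) > max_bytes:
        raise RejectedInput(f"{relative} exceeds {max_bytes} bytes")
    return data
```

Reading one byte past the limit lets the helper detect an oversized file without trusting reported file sizes and without loading the whole file into memory.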
Verification
This module's eval suite (evals.toml) defines what
conformance means for a parser implementation. A parser passes
conformance by passing all cases in the merged eval suite —
this module's evals plus the eval cases inherited from the
format-spec module via the dependency.
The format-spec module already contains extensive parsing-related eval cases (TOML subset acceptance and rejection, Markdown subset parsing, tagged section extraction, module loading, encoding handling). This module's evals add interface-level cases that test the parser as a callable component, separately from the format content it processes.
Consumers generating a parser implementation iterate against the merged eval suite until conformance is achieved. The first generation rarely passes everything; the iteration is normal and expected.