Fleet recording

A single report documents one run. To build an organisation-wide view, record each run to a central store with cf.record_run(...). Recording is explicit — nothing is persisted unless you call it.

import conformare as cf

cf.configure_store(writer="spark_table", spark=spark,
                   catalog="gov", schema="conformare")

cf.trackSpark()
# ... your pipeline ...

cf.record_run(pipeline="customer_scoring", env="prod", git_sha="abc123",
              tags={"team": "risk"}, report_uri="/Volumes/gov/reports/run.html")

Identity and attributes

pipeline is the identity across runs — it is required (omitting it raises). All “over time” questions group by it.
env / git_sha / tags ride along as attributes on the run, for filtering.
run_id / ts default to a fresh UUID and the current UTC time; pass them for reproducibility.

Immutable, append-only

A run is a snapshot and is never mutated — a correction is simply a new run. For job retries, if_run_exists="skip" does a read-only check so the same run_id isn’t appended twice:

cf.record_run("customer_scoring", run_id=job_run_id, if_run_exists="skip")

The RunRecord schema

cf.to_run_record(cf.store, pipeline=...) builds the record (pure — no I/O; the writers call it for you). It is metadata only — no row-level data, so the central store is not a PII liability. It is a set of normalized grains, each keyed by run_id + pipeline + ts:

Grain	One row per	Key fields
`runs`	run	status, counts (risks, GE), env, git_sha, tags, report_uri
`contexts`	run × `describe()` context	`impl_hash`, `risk_hash`, has_risks, n_steps, risk_ids
`risks`	run × risk	severity, mitigated, owned, owners, nodes
`sources`	run × source	location, format, reader, columns
`sinks`	run × sink	location, format, writer, columns, n_sensitive
`expectations`	run × GE result	node, column, success, hard, severity, observed
`owners`	run × owner	owner, scope (risk/context), ref, role

status is derived once so every consumer agrees: failed = a hard Great Expectations failure; degraded = an unmitigated risk or an advisory GE failure; healthy = neither.

Context fingerprints

Each context carries two hashes that drive drift detection:

impl_hash — the distinct (operation, expression) steps in the context, i.e. what the code does. It ignores dataframe names, row counts and repetition, so it is stable across data changes (a loop over 4 vs 40 groups hashes the same) and trips only on a real implementation change.
risk_hash — the attached risks (id, severity, mitigation, owner).

Where the record actually lands is the writer’s job — see Connections.