Fleet recording
A single report documents one run. To build an organisation-wide view, record each run to a central store with cf.record_run(...). Recording is explicit — nothing is persisted unless you call it.
import conformare as cf

cf.configure_store(writer="spark_table", spark=spark,
                   catalog="gov", schema="conformare")
cf.trackSpark()

# ... your pipeline ...

cf.record_run(pipeline="customer_scoring", env="prod", git_sha="abc123",
              tags={"team": "risk"}, report_uri="/Volumes/gov/reports/run.html")
Identity and attributes
- pipeline is the identity across runs. It is required (omitting it raises), and all “over time” questions group by it.
- env / git_sha / tags ride along as attributes on the run, for filtering.
- run_id / ts default to a fresh UUID and the current UTC time; pass them explicitly for reproducibility.
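
The defaulting rules above can be sketched in plain Python. This is not conformare's implementation: resolve_run_identity is a hypothetical helper, and raising ValueError for a missing pipeline is an assumption (the text only says that omitting it raises).

```python
import uuid
from datetime import datetime, timezone


def resolve_run_identity(pipeline, run_id=None, ts=None):
    """Hypothetical helper mirroring the documented defaults for run_id and ts."""
    if not pipeline:
        # Assumption: the exact exception type is not stated in the docs.
        raise ValueError("pipeline is required")
    return {
        "pipeline": pipeline,
        # Fresh UUID unless the caller pins one.
        "run_id": run_id if run_id is not None else str(uuid.uuid4()),
        # Current UTC time unless the caller pins one.
        "ts": ts if ts is not None else datetime.now(timezone.utc),
    }


# Pinning both values makes the identity reproducible across invocations.
fixed_ts = datetime(2024, 1, 1, tzinfo=timezone.utc)
a = resolve_run_identity("customer_scoring", run_id="job-42", ts=fixed_ts)
b = resolve_run_identity("customer_scoring", run_id="job-42", ts=fixed_ts)
```

With both values pinned, a and b are equal; leaving them out yields a different run_id on every call, which is why retried jobs should pass their own.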
Immutable, append-only
A run is a snapshot and is never mutated — a correction is simply a new run. For job retries, if_run_exists="skip" does a read-only check so the same run_id isn’t appended twice:
cf.record_run("customer_scoring", run_id=job_run_id, if_run_exists="skip")
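
To illustrate why the skip check makes retries safe, here is a minimal in-memory sketch of an append-only log. InMemoryRunLog is a hypothetical stand-in, not conformare's writer, and the None default (always append) is an assumption; only the "skip" value comes from the text above.

```python
class InMemoryRunLog:
    """Hypothetical append-only store used only to illustrate the skip check."""

    def __init__(self):
        self._rows = []

    def exists(self, run_id):
        # Read-only check: scans existing rows without modifying them.
        return any(row["run_id"] == run_id for row in self._rows)

    def append(self, row, if_run_exists=None):
        # With "skip", a run_id that is already present is left untouched.
        if if_run_exists == "skip" and self.exists(row["run_id"]):
            return False
        # Otherwise append a copy; existing rows are never updated in place.
        self._rows.append(dict(row))
        return True

    def __len__(self):
        return len(self._rows)


log = InMemoryRunLog()
row = {"run_id": "job-42", "pipeline": "customer_scoring"}
first = log.append(row, if_run_exists="skip")   # appended
retry = log.append(row, if_run_exists="skip")   # skipped, nothing written
```

After the retry the log still holds a single row for job-42, so a scheduler can re-run the job without double counting.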
The RunRecord schema
cf.to_run_record(cf.store, pipeline=...) builds the record (pure — no I/O; the writers call it for you). It is metadata only — no row-level data, so the central store is not a PII liability. It is a set of normalized grains, each keyed by run_id + pipeline + ts:
| Grain | One row per | Key fields |
|---|---|---|
| runs | run | status, counts (risks, GE), env, git_sha, tags, report_uri |
| contexts | run × describe() context | impl_hash, risk_hash, has_risks, n_steps, risk_ids |
| risks | run × risk | severity, mitigated, owned, owners, nodes |
| sources | run × source | location, format, reader, columns |
| sinks | run × sink | location, format, writer, columns, n_sensitive |
| expectations | run × GE result | node, column, success, hard, severity, observed |
| owners | run × owner | owner, scope (risk/context), ref, role |
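
Because every grain shares the run_id + pipeline + ts key, grains can be joined back to the run they belong to. The sketch below uses plain dictionaries with made-up sample rows; the real record layout may differ, and key_of is a hypothetical helper.

```python
from collections import Counter

# The shared key described above.
KEY = ("run_id", "pipeline", "ts")

# Illustrative rows only; values are invented for the example.
runs = [
    {"run_id": "r1", "pipeline": "customer_scoring",
     "ts": "2024-01-01T00:00:00Z", "status": "degraded"},
]
risks = [
    {"run_id": "r1", "pipeline": "customer_scoring",
     "ts": "2024-01-01T00:00:00Z", "severity": "high", "mitigated": False},
    {"run_id": "r1", "pipeline": "customer_scoring",
     "ts": "2024-01-01T00:00:00Z", "severity": "low", "mitigated": True},
]


def key_of(row):
    """Extract the shared key from a row of any grain."""
    return tuple(row[k] for k in KEY)


# Count unmitigated risks per run, then attach the count to each run row.
unmitigated = Counter(key_of(r) for r in risks if not r["mitigated"])
report = [{**run, "n_unmitigated": unmitigated[key_of(run)]} for run in runs]
```

The same pattern applies to any other grain in the table, since each one carries the same three key fields.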
status is derived once so every consumer agrees: failed = a hard Great Expectations failure; degraded = an unmitigated risk or an advisory GE failure; healthy = neither.
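
The status rule can be written as a small function. This is a sketch of the rule as stated, not the library's code: derive_status is a hypothetical name, and the rows are assumed to be dictionaries carrying the success, hard and mitigated fields listed in the table.

```python
def derive_status(expectations, risks):
    """Hypothetical sketch of the documented failed / degraded / healthy rule."""
    # failed: at least one hard expectation did not succeed.
    if any(not e["success"] and e["hard"] for e in expectations):
        return "failed"
    # degraded: an advisory (non-hard) failure or an unmitigated risk.
    advisory_failure = any(not e["success"] and not e["hard"] for e in expectations)
    unmitigated_risk = any(not r["mitigated"] for r in risks)
    if advisory_failure or unmitigated_risk:
        return "degraded"
    # healthy: neither of the above.
    return "healthy"


passing = {"success": True, "hard": True}
hard_fail = {"success": False, "hard": True}
soft_fail = {"success": False, "hard": False}

status_failed = derive_status([passing, hard_fail], [{"mitigated": False}])
status_degraded = derive_status([passing, soft_fail], [])
status_healthy = derive_status([passing], [{"mitigated": True}])
```

Checking the hard failure first means a run with both a hard failure and an unmitigated risk is reported as failed, not degraded.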
Context fingerprints
Each context carries two hashes that drive drift detection:
- impl_hash: the distinct (operation, expression) steps in the context, i.e. what the code does. It ignores dataframe names, row counts and repetition, so it is stable across data changes (a loop over 4 vs 40 groups hashes the same) and trips only on a real implementation change.
- risk_hash: the attached risks (id, severity, mitigation, owner).
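
One way to get a fingerprint with these properties is to hash the set of distinct (operation, expression) pairs. The sketch below is an assumption about how such a hash could be built, not conformare's actual algorithm: impl_fingerprint is a hypothetical name, SHA-256 is an arbitrary choice, and sorting the pairs (which also ignores step order) goes beyond what the text states.

```python
import hashlib
import json


def impl_fingerprint(steps):
    """Hash the distinct (operation, expression) pairs of a context.

    Each step is a (dataframe_name, operation, expression) tuple; the
    dataframe name is deliberately dropped before hashing.
    """
    # A set removes repetition; sorting makes the serialisation deterministic.
    distinct = sorted({(op, expr) for _name, op, expr in steps})
    payload = json.dumps(distinct, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same filter applied in a loop over 4 groups and over 40 groups.
four_groups = [(f"df_{i}", "filter", "score > 0.5") for i in range(4)]
forty_groups = [(f"df_{i}", "filter", "score > 0.5") for i in range(40)]

# A real implementation change: a new expression appears.
changed = four_groups + [("df_0", "filter", "score > 0.7")]

same = impl_fingerprint(four_groups) == impl_fingerprint(forty_groups)
drifted = impl_fingerprint(four_groups) != impl_fingerprint(changed)
```

Here same and drifted are both true: repetition and dataframe names do not affect the hash, while a changed expression does.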
Where the record actually lands is the writer’s job — see Connections.