Fleet recording

A single report documents one run. To build an organisation-wide view, record each run to a central store with cf.record_run(...). Recording is explicit — nothing is persisted unless you call it.

import conformare as cf

cf.configure_store(writer="spark_table", spark=spark,
                   catalog="gov", schema="conformare")

cf.trackSpark()
# ... your pipeline ...

cf.record_run(pipeline="customer_scoring", env="prod", git_sha="abc123",
              tags={"team": "risk"}, report_uri="/Volumes/gov/reports/run.html")

Identity and attributes

  • pipeline is the identity across runs — it is required (omitting it raises). All “over time” questions group by it.
  • env / git_sha / tags ride along as attributes on the run, for filtering.
  • run_id / ts default to a fresh UUID and the current UTC time; pass them for reproducibility.

Immutable, append-only

A run is a snapshot and is never mutated — a correction is simply a new run. For job retries, if_run_exists="skip" does a read-only check so the same run_id isn’t appended twice:

cf.record_run("customer_scoring", run_id=job_run_id, if_run_exists="skip")

The RunRecord schema

cf.to_run_record(cf.store, pipeline=...) builds the record (pure — no I/O; the writers call it for you). It is metadata only — no row-level data, so the central store is not a PII liability. It is a set of normalized grains, each keyed by run_id + pipeline + ts:

Grain One row per Key fields
runs run status, counts (risks, GE), env, git_sha, tags, report_uri
contexts run × describe() context impl_hash, risk_hash, has_risks, n_steps, risk_ids
risks run × risk severity, mitigated, owned, owners, nodes
sources run × source location, format, reader, columns
sinks run × sink location, format, writer, columns, n_sensitive
expectations run × GE result node, column, success, hard, severity, observed
owners run × owner owner, scope (risk/context), ref, role

status is derived once so every consumer agrees: failed = a hard Great Expectations failure; degraded = an unmitigated risk or an advisory GE failure; healthy = neither.

Context fingerprints

Each context carries two hashes that drive drift detection:

  • impl_hash — the distinct (operation, expression) steps in the context, i.e. what the code does. It ignores dataframe names, row counts and repetition, so it is stable across data changes (a loop over 4 vs 40 groups hashes the same) and trips only on a real implementation change.
  • risk_hash — the attached risks (id, severity, mitigation, owner).

Where the record actually lands is the writer’s job — see Connections.


This site uses Just the Docs, a documentation theme for Jekyll.