Changelog

All notable changes to Conformare are documented here. The format is based on Keep a Changelog, and Conformare follows Semantic Versioning.

While Conformare is in 0.x, breaking changes may land in a minor release (0.1 -> 0.2); the move to 1.0.0 signals a commitment to backward compatibility. Each minor series tracks the focus of the work: 0.1.x is dataframe tracking, profiling and the report; 0.2.x is the fleet (cross-run recording, upstream risk and source governance).

[Unreleased]

[0.2.7] - 2026-06-25

Added

  • Static governance scan: cf.scan_governance(paths) AST-parses .py source and extracts every declared cf.describe context and cf.risk (with-blocks, decorators, and risks= arguments), with file/line, without executing the code – so you can build a governance inventory for a whole repo, a branch that didn’t run, or a module nobody imported. cf.scan_governance_report(paths, out) renders it as a self-contained HTML report. It only matches the governance forms, so df.describe() is never picked up; only literal arguments are read. Fits a CI gate (no pipeline run needed). New docs page Static governance scan.
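To make the "parse, don't execute" idea concrete, here is a minimal sketch of an AST scan built only on the standard-library ast module. It is not Conformare's implementation: the helper scan_source, the GOVERNANCE_ATTRS set and the sample source are hypothetical, and it handles only plain call forms with literal positional arguments.

```python
import ast

# Sample source to scan. It is parsed, never run, so the undefined `df` is harmless.
SOURCE = '''
import conformare as cf

with cf.describe("Score customers", owner="risk-team"):
    scored = df.describe()

cf.risk("model_drift")
'''

GOVERNANCE_ATTRS = {"describe", "risk"}  # assumed: the forms the scan looks for


def scan_source(source, alias="cf"):
    """Return (name, line, literal_args) for each alias.describe / alias.risk call."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match only `<alias>.<attr>(...)`, so df.describe() is ignored.
        if (isinstance(func, ast.Attribute)
                and func.attr in GOVERNANCE_ATTRS
                and isinstance(func.value, ast.Name)
                and func.value.id == alias):
            # Keep literal positional arguments only; skip anything computed.
            literals = [a.value for a in node.args if isinstance(a, ast.Constant)]
            found.append((func.attr, node.lineno, literals))
    return sorted(found, key=lambda item: item[1])


hits = scan_source(SOURCE)
```

Because the source is only parsed, the same approach works on a module that cannot be imported in the CI environment.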

Added

  • Experimental: OpenMetadata bridge (cf.OpenMetadataClient, stdlib-only HTTP).
    • Import: cf.read_openmetadata_governance / cf.record_source_risk_from_openmetadata / cf.check_openmetadata_risks map an OpenMetadata table entity (description, owner, a conformare custom property, risk-classification tags, column metadata) to the same governance doc as the comment / dbt readers, so the risks flow through check_upstream_risks / warn_on_source / the fleet dashboard.
    • Export: cf.export_to_openmetadata pushes a tracked run’s table-level lineage (each upstream source table -> each output table) and a per-output-table @conformare risk governance block, with a resolve hook to map locations to OpenMetadata FQNs. The block it writes round-trips back through the import reader.
    • New docs page OpenMetadata bridge. The HTTP layer is isolated for easy version adaptation; an OpenLineage exporter is noted as the future vendor-neutral path.

[0.2.5] - 2026-06-23

Added

  • dbt governance can now be read from the compiled target/manifest.json, not just a schema.yml, and resolved by table name: cf.read_dbt_governance("target/manifest.json", table="db.schema.table") (and cf.record_source_risk_from_dbt(..., table=...)). The manifest maps each model’s database relation to its meta/description, so you can look up governance by the table a pipeline reads without knowing the YAML file. Table matching is flexible (model name, schema.table, db.schema.table, or the quoted relation); reading the manifest needs no extra dependency.
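The flexible table matching can be pictured with a small sketch over a manifest-shaped dictionary. This is an illustration under assumptions, not the package's resolver: find_model is a hypothetical helper, the sample manifest is invented, and the keys inside meta.conformare are placeholders. The node fields used (resource_type, name, alias, database, schema, relation_name, meta, description) are the ones dbt writes to manifest.json; a real file would be loaded with json.load first.

```python
def find_model(manifest, table):
    """Return the model node whose relation matches `table`, or None."""
    # Strip quoting so '"db"."schema"."table"' compares like db.schema.table.
    wanted = table.replace('"', "").replace("`", "").lower()
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        db, schema = node.get("database", ""), node.get("schema", "")
        name = node.get("alias") or node.get("name", "")
        # Accept the bare name, schema.table, or db.schema.table.
        candidates = {name, f"{schema}.{name}", f"{db}.{schema}.{name}"}
        if wanted in {c.lower() for c in candidates}:
            return node
    return None


# Invented manifest fragment for illustration only.
manifest = {
    "nodes": {
        "model.shop.orders": {
            "resource_type": "model",
            "name": "orders",
            "database": "analytics",
            "schema": "core",
            "relation_name": '"analytics"."core"."orders"',
            "description": "One row per order.",
            "meta": {"conformare": {"owner": "sales-data"}},
        }
    }
}

by_name = find_model(manifest, "orders")
by_quoted = find_model(manifest, '"analytics"."core"."orders"')
```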

[0.2.4] - 2026-06-23

Added

  • Experimental: governance from dbt models. Read purpose / owner / context / risks from a dbt schema YAML’s meta fields (meta.conformare at model level, and column meta for column-scoped risks): cf.read_dbt_governance(path, model=...) parses it, and cf.record_source_risk_from_dbt(path, model=...) ingests the risks into the fleet as static source risks, so they surface through warn_on_source / check_upstream_risks / the dashboard like any other upstream risk. Needs pyyaml (new dbt extra). New docs page Governance in dbt models. The comment / dbt / static paths now share one governance-doc -> source-risk ingest (cf.fleet-bound, via comments.ingest_source_governance).

Added

  • Experimental: governance from table comments. A team that doesn’t use this package can express purpose / owner / business context / risks in a table’s system comment (e.g. a Spark table comment), in a format your business agrees on, and conformare reads it:
    • cf.read_source_governance(table, spark=...) parses a Spark table’s comment (and column comments) into a governance dict; pass comment=... for non-Spark systems.
    • cf.set_comment_parser(parser) makes the format pluggable – the default is a free-text description plus an @conformare block of key: value lines; cf.parse_comment(text) applies it.
    • cf.check_source_comments([...], spark=...) reads + warns (CommentGovernanceWarning); cf.record_source_risk_from_comment(table, spark=...) ingests the comment’s risks into the fleet as static source risks, so they then surface through warn_on_source / check_upstream_risks / the dashboard while you track a pipeline that reads the table.
    • New docs page Extending table descriptions and example example_comment_governance.py.
  • Example example_fleet_source_risks.py: declare static risks for an externally-produced table, then a tracked pipeline that sources it inherits them – warned on load and surfaced in the fleet dashboard’s inherited-risk section. Added to the docs Examples gallery.
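As a rough picture of the default comment format described above (free text plus an @conformare block of key: value lines), the sketch below parses one such comment. The exact syntax is an assumption made for illustration, and parse_comment_sketch is a hypothetical stand-in rather than the package's cf.parse_comment.

```python
def parse_comment_sketch(text):
    """Split a table comment into a description and @conformare key/value pairs."""
    description_lines, fields = [], {}
    in_block = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "@conformare":
            in_block = True  # everything after the marker is key: value
        elif in_block and ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
        elif not in_block and line:
            description_lines.append(line)
    return {"description": " ".join(description_lines), **fields}


# Invented comment text; the key names are placeholders.
comment = """Daily customer balances, one row per account.
@conformare
owner: finance-data
risks: pii, late_arrival"""

doc = parse_comment_sketch(comment)
```

A team with its own agreed format would swap in a different function of the same shape, which is the role the changelog gives to cf.set_comment_parser.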

[0.2.2] - 2026-06-23

Added

  • Automatic on-load upstream-risk warnings: cf.configure_store(..., warn_on_source=True) makes conformare emit an UpstreamRiskWarning whenever a pipeline loads a table that carries risk recorded in the fleet (a sink another run wrote, or a static declaration) – no explicit check_upstream_risks call needed. The lookup is read once and cached, each table warns at most once per session, and reading the fleet store for the check is suppressed so it never pollutes the pipeline’s own lineage. Off by default.
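The "read once, warn at most once per table" behaviour can be sketched with the standard warnings module. Everything here is illustrative: SourceRiskChecker and fake_store are hypothetical, and the UpstreamRiskWarning class is a local stand-in for the one the package exposes.

```python
import warnings


class UpstreamRiskWarning(UserWarning):
    """Local stand-in for the warning class named in the changelog."""


class SourceRiskChecker:
    def __init__(self, load_risks):
        self._load_risks = load_risks  # callable that reads the store
        self._risks = None             # cached lookup, filled on first use
        self._warned = set()           # tables already warned about

    def on_load(self, table):
        if self._risks is None:
            self._risks = self._load_risks()  # read once, then reuse
        risks = self._risks.get(table)
        if risks and table not in self._warned:
            self._warned.add(table)
            message = f"{table} carries upstream risk: {', '.join(risks)}"
            warnings.warn(message, UpstreamRiskWarning, stacklevel=2)


reads = []


def fake_store():
    """Pretend store read; counts how often it is called."""
    reads.append(1)
    return {"sales.orders": ["stale_data"]}


checker = SourceRiskChecker(fake_store)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for table in ["sales.orders", "sales.orders", "ref.regions"]:
        checker.on_load(table)
```

Loading the risky table twice yields a single warning, and the store is read only on the first load.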

[0.2.1] - 2026-06-23

Added

  • Upstream-risk inheritance: when you source a table, check whether it carries risk recorded elsewhere in the fleet. cf.check_upstream_risks(locations) reads the store, matches each location against sinks written by other runs, and returns the inherited risks classified as direct (on the sink), indirect (an upstream step/context, with a process distance), or process (a process-wide risk) – raising an UpstreamRiskWarning for any table that carries risk. Backed by a new sink_risks grain computed per sink at record time (so the check is a fast lookup), and cf.fleet.upstream_risks(tables, locations) for the pure query.
  • Static source risks: cf.record_source_risk(location, *risks, column=, owner=, ...) declares a one-off, manual risk for an externally-produced table (one not built with this package) into a new source_risks grain. Considered by check_upstream_risks and labelled static. Has the same owner / severity / mitigation / note fields as any risk.
  • cf.fleet.to_html(..., upstream=...) adds an “Inherited risk on sourced tables” section to the dashboard. The streaming-CLV example now demonstrates sourcing its own CLV output (indirect inheritance) and an external table with a static risk.
  • Fleet record SCHEMA_VERSION bumped to 2 (the two new grains; existing grains unchanged).
  • cf.fleet.to_html(tables, path): render the fleet grain tables to a self-contained HTML dashboard (runs & health, risks across every pipeline, the review report, sinks carrying sensitive columns, and inherited upstream risk). Both fleet examples emit one, so they appear in the docs Examples gallery.
  • Example example_fleet_streaming_clv.py: two PySpark pipelines (a customer-lifetime-value model with an opaque watch forecast + NPV, and an audience-report pipeline with two sinks) recorded to a Spark-table fleet store, then risks surfaced across both pipelines.
  • Docs: a Reporting & connections section (subpages: The HTML report, Fleet recording, Connections, Cross-run governance) describing the per-run report and exports, the fleet RunRecord, the writers/stores, and the cross-run governance queries.

[0.2.0] - 2026-06-23

Added

  • Fleet recording (conformare.fleet): record each run’s outcome to a central, cross-run store for a business-wide governance view. Recording is explicit – nothing is persisted unless you call cf.record_run(pipeline=...).
    • cf.configure_store(...) sets a default destination; cf.record_run(pipeline, dest=...) can supply or override it per call. pipeline is the identity across runs (required; omitting it raises); env / git_sha / tags ride along as attributes.
    • Runs are immutable (append-only; a correction is a new run). if_run_exists="skip" does a read-only check to make retries idempotent.
    • Spark-table writer is the default (writer="spark_table") – it writes through the Spark table interface (createDataFrame(...).write.saveAsTable(...)) with explicit, type-stable schemas, so no blob/object-store access is needed. Portable json / parquet (one immutable file per run) and callable writers are also provided.
    • cf.to_run_record(store, pipeline, ...) builds the versioned, metadata-only record (pure, no I/O) as normalized grains: runs, contexts, risks, sources, sinks, expectations, owners – each keyed by run_id + pipeline + ts.
  • Context instruction hashing: each describe() context carries an impl_hash (the distinct (op, logic) steps – what the code does, stable across data and loop-count changes) and a risk_hash (attached risk id/severity/mitigation/owner). Comparing these across runs powers cf.fleet.review_report(...), which flags contexts where the implementation changed but the risk assessment did not – a stale-governance smell.
  • Fleet read side (conformare.fleet): read_store(dest) / merge_records(records) to roll runs up into grain tables, then review_report, last_healthy_run, owner_pipelines (succession planning), and sinks_as_of (active outputs “as of” a time, derived from the run timestamp – no separate active-sink state).
  • Example example_fleet_recording.py: records two runs where the implementation changes but the risk does not, then surfaces it via the review report.
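The context hashing idea above can be illustrated with hashlib. This is a sketch under assumptions, not the package's hashing scheme: the helper names, the choice of SHA-256, the 12-character truncation and the tuple layouts are all hypothetical. It shows only why deduplicating (op, logic) steps makes the hash stable across loop counts, and how the two hashes combine into a stale-governance check.

```python
import hashlib
import json


def _digest(items):
    """Short, stable hash of a JSON-serialisable list."""
    payload = json.dumps(items, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def impl_hash(steps):
    """Hash the distinct (op, logic) steps, keeping first-seen order."""
    # dict.fromkeys drops repeats, so a loop run 3 or 5 times hashes the same.
    return _digest(list(dict.fromkeys(steps)))


def risk_hash(risks):
    """Hash the attached risks as (id, severity, mitigation, owner) tuples."""
    return _digest(sorted(risks))


def stale_governance(previous, current):
    """True when the implementation changed but the risk assessment did not."""
    return (previous["impl_hash"] != current["impl_hash"]
            and previous["risk_hash"] == current["risk_hash"])


body = [("filter", "age >= 18"), ("withColumn", "score = spend / visits")]
risks = [("model_drift", "medium", "monthly review", "risk-team")]

run_1 = {"impl_hash": impl_hash(body * 3), "risk_hash": risk_hash(risks)}
run_2 = {"impl_hash": impl_hash(body * 5), "risk_hash": risk_hash(risks)}
changed = [("filter", "age >= 21"), ("withColumn", "score = spend / visits")]
run_3 = {"impl_hash": impl_hash(changed), "risk_hash": risk_hash(risks)}
```

Here run_1 and run_2 differ only in loop count and compare equal, while run_3 changes the filter logic without touching the risks and would be flagged.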

[0.1.10] - 2026-06-23

Added

  • Profiler sampling modes: cf.set_sample_mode("head" | "random", seed=0) (and cf.sample_mode()). "head" is the cheap first-n default; "random" takes a seeded random subset (reproducible) on every profiler that samples – histograms, outliers, null fractions, Great Expectations. Backends that can’t sample randomly (lazy frames) fall back to head.
  • Report: the side navigation collapses (☰ in the top bar) to reclaim width.
  • Report: long node lists (Column index “Appears in”, Data sensitivity / Context register Nodes, Risk register Where) show the first 3 then “… and N more”, expandable on click, to cut visual clutter.
  • Report: a “Hide non-profiled” toggle on Node profiles (on by default) hides nodes with no profile data, so the section focuses on what was actually measured.
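The difference between the two sampling modes can be shown on a plain list. This is a minimal sketch assuming an in-memory sequence; sample_rows is a hypothetical helper, not a Conformare function, and real backends sample through their own APIs.

```python
import random


def sample_rows(rows, n, mode="head", seed=0):
    """Return up to n rows: the first n ("head") or a seeded random subset."""
    if mode == "head":
        return rows[:n]
    if mode == "random":
        # A private Random instance keeps the draw reproducible for a given
        # seed without touching the global random state.
        return random.Random(seed).sample(rows, min(n, len(rows)))
    raise ValueError(f"unknown sample mode: {mode!r}")


rows = list(range(100))
head = sample_rows(rows, 5)
first_draw = sample_rows(rows, 5, mode="random", seed=0)
second_draw = sample_rows(rows, 5, mode="random", seed=0)
```

The same seed gives the same subset on every call, which is what makes profiles taken with "random" comparable between runs.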

Changed

  • Report: source nodes in the diagram widen to fit the table name (centred on their slot so edges still meet) and put the column count on its own line below the name/location, instead of overflowing the node.
  • Report: the Created column catalog diagram now shows every contributing column – any existing column that feeds a created column, matched by whole-word reference in the expression so attribute access (df.age) and bare names are caught, not just col("x")/df["x"]. Pandas assign is now recognised as a column-creating op. The table still lists created columns only.
  • Report: the process diagram defaults to the Sequential layout.
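The whole-word matching used for the Created column catalog can be approximated with a regular expression. The sketch below is an assumption about the general technique, not the report's code; contributing_columns is a hypothetical helper. One known limitation of this simple form is that it would also match a column name that appears inside an unrelated string literal.

```python
import re


def contributing_columns(expression, existing_columns):
    """Return the existing columns referenced as whole words in an expression."""
    return [
        column
        for column in existing_columns
        # \b anchors make "age" match df.age but not "average" or "page".
        if re.search(rf"\b{re.escape(column)}\b", expression)
    ]


expression = 'df.age * 2 + col("income") - df["tax"]'
columns = ["age", "income", "tax", "ag", "come"]
matched = contributing_columns(expression, columns)
```

Attribute access, col("x") and df["x"] are all caught, while partial names such as "ag" or "come" are not.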

Fixed

  • Great Expectations: an expectation against a column not present at the profiled node used to come back as a blank failure (the column name didn’t match the node’s columns – often because the profile ran at the wrong stage, e.g. before the withColumns fix in 0.1.9). It is now surfaced clearly as “missing column: <name>” rather than an empty failed check. Covers single/pair/list column references.

[0.1.9] - 2026-06-23

Fixed

  • Spark: withColumnsRenamed (the plural, 3.4+ multi-rename) was missing from the tracked transforms, so a df.withColumnsRenamed({...}) produced an untracked frame. When that frame was the last step before a write, the sink attached to a fresh, disconnected node (<dfN>) and the lineage broke. The plural/native methods need hooking in their own right – PySpark implements them straight against the JVM, so hooking the singular (withColumn/withColumnRenamed) never covers the plural. Added it, along with other frame-returning transforms that had the same gap: withMetadata, toDF, to, alias (self-joins), fillna, dropna, replace, drop_duplicates, intersectAll, unpivot/melt, sortWithinPartitions, offset, sampleBy, repartitionByRange, hint, and rollup/cube (which return GroupedData like groupBy). df.agg(...) was already covered – it routes through self.groupBy().agg(...). Each method is patched only if the installed PySpark provides it, so older versions are unaffected.
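The "patch each method only if the installed version provides it" rule can be sketched against a toy class. Nothing below touches PySpark: Frame, hook_if_present and tracked_edges are hypothetical names used to show the guard and the need to hook each method by name.

```python
import functools


class Frame:
    """Tiny stand-in for a dataframe class; not PySpark."""

    def __init__(self, name):
        self.name = name

    def fillna(self, value):
        return Frame(self.name + ".fillna")


tracked_edges = []  # (parent, operation, child) records


def hook_if_present(cls, method_names):
    """Wrap each named method that exists on cls; skip the ones that don't."""
    hooked = []
    for name in method_names:
        original = getattr(cls, name, None)
        if original is None:
            continue  # this version lacks the method, so leave it alone

        def make_wrapper(orig, op):
            @functools.wraps(orig)
            def wrapper(self, *args, **kwargs):
                result = orig(self, *args, **kwargs)
                tracked_edges.append((self.name, op, result.name))
                return result
            return wrapper

        setattr(cls, name, make_wrapper(original, name))
        hooked.append(name)
    return hooked


hooked = hook_if_present(Frame, ["fillna", "unpivot"])
out = Frame("df").fillna(0)
```

The toy class has no unpivot, so only fillna is wrapped, and the call records one edge from the input frame to the result.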

[0.1.8] - 2026-06-22

Changed

  • Report: the Node profiles section is now dynamically collapsible, mirroring the process diagram. A “Collapse profiles by” selector (None / Context / Function / Chained commands) folds a run of steps into one card per dataframe, showing the net column set, the columns created across the run (consolidated, not repeated per step), and the latest profile for each metric – annotated with where in the segment it was measured (e.g. “rows from step 3 of 5”), so a profile taken before the final step is still clear. Defaults to the most useful collapse available (context, else chain). A context with several dataframes shows one card each.
  • Report: a collapsed chain node now lists every rolled-up operation’s details when “Expand operation details” is on (one op: detail line per folded step, in execution order), instead of only the head’s own detail.
  • Report: with “Expand operation details” on, a contracted context (a describe() group collapsed via “Compress context-linked nodes”) now lists all of its member operations the same way, instead of showing nothing.
  • Report: with both “Show function calls” and “Expand operation details” on, each node now lists the functions it uses (ƒ a(), b()) after its operation details – the union across members for a contracted context or collapsed chain. Previously the two toggles were mutually exclusive.

[0.1.7] - 2026-06-22

Added

  • Two experimental pandas opt-ins for the tracking gaps noted in 0.1.6, each off by default and each warning on enable with a link to the new Pandas best practices page:
    • cf.trackPandasSeries() records pandas.Series as lineage nodes – named single-column extracts (s = df["c"]), group-wise Series aggregations (df.groupby(c)[c2].nunique()), and Series transforms such as reset_index / to_frame that let a Series rejoin a DataFrame, so the lineage thread is no longer dropped when a value passes through a Series. Inline column reads (the RHS of an assignment, a boolean-index predicate, an arithmetic operand) are transient sub-expressions and are not recorded.
    • cf.trackPandas__setitem__() tracks in-place column assignment (df["x"] = ...) by versioning the frame: the same object is re-stamped with a new node id, recording an old -> new edge so successive writes form a chain. The operation detail is captured too – the assigned column and the authored right-hand-side expression (df["x"] = df["a"] / df["b"]) – so the write renders like a column creation and feeds the column-level lineage graph. df["x"] += ... (augmented assignment) is covered as well. A run of consecutive in-place writes on the same frame forms a single chain, so the diagram’s “Compress chained operations” rolls them into one node.
    • cf.ConformareExperimentalWarning is exposed so the warnings can be filtered.
  • Tracked DataFrame methods now also link frame/Series parents passed as keyword arguments (e.g. df.merge(right=other), df.assign(x=tracked_series)).
  • pd.concat([...]) is now tracked (a module-level function, so it needs its own hook): every tracked input frame becomes a parent of the result. Keeps the branches connected when a per-category loop builds a list of frames and stitches them back together.
  • Loop-aware names: a frame assigned inside a for loop is now named with the current iterator value (sub[England], sub[Scotland]), so each iteration is a distinct, meaningful node instead of repeated sub. Nested loops include both values (sub[England,18]); while loops and unreadable targets fall back to a 1-based index (sub.1, sub.2). Plain (non-loop) assignments are unchanged.
  • Example example_cross_engine_pipeline.py: a comprehensive PySpark -> pandas -> PySpark pipeline – date/age filter, in-place writes, two group-bys merged back, a per-region loop (subset -> write -> concat), then back to Spark to join reference data, with row-count and distribution-histogram profiling at the milestones. Shows how a category loop appears in the lineage. Added to the docs Examples gallery.
  • New docs page Pandas best practices (keep data in DataFrames; prefer assign over in-place assignment) – the destination of the experimental warnings.
  • Example example_pandas_experimental.py: tracks a group-wise Series aggregation rejoining a frame and in-place df["x"] = ... writes, shown next to the recommended assign form. Added to the docs Examples gallery.
  • Example example_opaque_function.py: how to contain a multi-step helper as one lineage node – @cf.opaque, cf.opaque(fn)(df) inline, and cf.opaque_module(...) – with a side-by-side of the same feature block tracked normally vs. made opaque. Added to the docs Examples gallery.

[0.1.6] - 2026-06-22

Added

  • Cross-backend conversion tracking now covers the pandas-on-Spark boundaries too: DataFrame.pandas_api() (Spark -> pandas-on-Spark), and ps.DataFrame.to_pandas() / ps.DataFrame.to_spark() (pandas-on-Spark -> pandas / Spark), alongside the existing toPandas / createDataFrame edges. A pipeline that hops engines any of these ways stays one connected graph.
  • Example example_spark_pandas_roundtrip.py: start in PySpark, drop to pandas for feature engineering, write back through Spark – tracked end to end as one lineage. Added to the docs Examples gallery.

Known gaps

  • The pandas adapter tracks DataFrame-returning operations only. A pandas.Series result (e.g. df.groupby(c)[c2].nunique()) and in-place column assignment (df[new] = ..., i.e. __setitem__) are not yet captured as lineage nodes; prefer DataFrame-returning equivalents (groupby(...).agg(...), assign(...), merge(...)) to keep the graph connected.

[0.1.5] - 2026-06-22

Added

  • Cross-backend conversion tracking: DataFrame.toPandas() (Spark -> pandas) and SparkSession.createDataFrame(pandas_df) (pandas -> Spark) are recorded as lineage edges, so a PySpark -> pandas -> PySpark pipeline stays one connected graph. Enable the pandas-stage tracking with cf.trackPandas() (note that trackAll keeps pandas off by default to avoid double-hooking pd.read_* with Narwhals).

[0.1.4] - 2026-06-22

Added

  • The line-based operation-logic fallback (notebook cells where executing can’t map the frame) is now chain-aware: in a chained one-liner like df.groupBy(...).agg(...), each tracked op is matched to its own call by method name, so the captured expression is correct for each link of the chain rather than always the outermost call.

[0.1.3] - 2026-06-22

Fixed

  • Databricks: pyspark / py4j / Databricks-runtime internals are no longer treated as user code (on Databricks pyspark lives under /databricks/spark/python, not site-packages). With track_functions(True) the function hook was bracketing pyspark’s own methods – fn:filter, fn:create_from_pandas_with_arrow, *exprs for group-by – and polluting the step stack, so docstring risks and operation details were lost. The hook now brackets only your functions, restoring expression details, docstring-based risks, and clean operation names while still following lineage into your functions.
  • diagnose_environment() now reports pyspark_treated_as_user_code (must be False).

[0.1.2] - 2026-06-22

Fixed

  • Databricks notebook support: recognise Databricks 14.x cell frames, which use real-looking but non-existent paths (e.g. /root/.ipykernel/<n>/command-<n>) that no filename prefix matched. A frame whose file isn’t on disk, in a notebook environment, is now treated as a cell – so variable names (via bytecode) and operation logic (via a line-based source fallback used when executing can’t map the wrapped cell) are recovered. Docstring-based risks need cf.track_functions(True) as before.

Added

  • diagnose_environment() – reports how conformare sees the current cell (filename, notebook / user-code detection, environment, IPython shell) for troubleshooting.

[0.1.1] - 2026-06-22

Added

  • Environment detection: environment() (databricks / jupyter / ipython / python), in_notebook(), is_databricks().
  • Notebook / Databricks support: function-boundary tracking and docstring Conformare: risks now work in notebook cells; variable names are recovered from bytecode and operation logic from the IPython input history when a cell has no source on disk.
  • mark_user_packages() to track your own pip-installed (wheel) pipeline code that would otherwise be treated as third-party library code.
  • Examples: ML scoring by region on pandas + scikit-learn and PySpark + spark.ml, and a Databricks notebook-cell simulation; new optional sklearn extra.

Fixed

  • A function documented via a Conformare: docstring block now names its output frame from the call-site assignment target.

[0.1.0] - 2026-06-21

Initial public release.

Added

  • Lineage capture of the authored dataframe pipeline across three backends: trackNarwhals(), trackSpark() (PySpark, zero code change) and trackPandas().
  • Per-step profilers: rowCount, columnCount, dataSize, histogram, nullFraction, iqrOutliers, optional greatExpectations and whylogs.
  • Governance context: describe() / risk() / describe_process(), a built-in risk catalog with register_risk(), mitigation/owner tracking and a governance ranking.
  • Data-sensitivity detection (name-based heuristics + manual mark_sensitive) with an exfiltration check for columns that reach a written output.
  • Outputs: self-contained interactive HTML report (to_html), to_mermaid, to_json, and a formal, sign-off-ready Markdown risk checklist (to_risk_checklist).
  • Non-intrusive modes: docstring tagging and bootstrap() for unmodified scripts.
  • conformare.__version__ exposed from package metadata.
