Pandas best practices

Conformare tracks the DataFrame-returning subset of pandas by default: boolean and projection indexing, query, merge, assign, groupby(...).agg(...), and the read_* / to_* boundaries. That covers idiomatic, pipeline-style pandas without any code change.

Two common patterns fall outside that subset. In both cases, the rewrite that keeps lineage connected also makes the code clearer.

Keep data in DataFrames

When a value drops to a pandas.Series partway through a pipeline, it leaves the DataFrame world: it has no columns, no schema, and (by default) no lineage node. The classic example is a group-wise aggregate that comes back as a Series and is then joined on:

# Series on the side -- harder to read, and the lineage thread is dropped
per_region = df.groupby("region")["spend"].nunique()   # -> Series
df = df.merge(per_region.reset_index(), on="region")

The DataFrame-first version is both clearer and fully tracked out of the box:

# stays a DataFrame the whole way -- one connected lineage, no opt-in needed
region_n = df.groupby("region", as_index=False).agg(region_n=("spend", "nunique"))
df = df.merge(region_n, on="region", how="left")

Keeping data in a DataFrame gives you one data structure with named columns, a schema you can profile, and a lineage edge at every step.
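The two routes above can be compared on a toy frame. This is a minimal sketch using plain pandas only (no Conformare calls); the sample data and the variable names `series_route` and `frame_route` are made up for illustration. It shows one practical readability cost of the Series route: the aggregate keeps the name `spend`, so the merge has to disambiguate with suffixes.

```python
import pandas as pd

# Toy data, invented for this sketch.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "spend": [10, 10, 20, 30, 30],
})

# Series route: the aggregate is a Series named "spend", so after
# reset_index() it collides with the existing "spend" column on merge.
per_region = df.groupby("region")["spend"].nunique()
series_route = df.merge(per_region.reset_index(), on="region")
print(list(series_route.columns))

# DataFrame-first route: named aggregation gives the result its own
# column name, and the value never leaves the DataFrame world.
region_n = df.groupby("region", as_index=False).agg(region_n=("spend", "nunique"))
frame_route = df.merge(region_n, on="region", how="left")
print(list(frame_route.columns))
```

In the Series route the merged frame ends up with `spend_x` and `spend_y`, which the reader then has to decode; the named aggregation avoids that without any extra renaming step.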

Prefer assign over in-place assignment

In-place column assignment mutates the frame and, being a statement, produces no new frame, so it is invisible to lineage (and anyone reading the code has to track the mutation by eye):

df["spend_per_year"] = df["spend"] / df["tenure"]   # in-place: mutates df, yields no new frame

assign returns a new frame, reads as an expression, and is tracked as a step:

df = df.assign(spend_per_year=df["spend"] / df["tenure"])

This also avoids the subtle bugs that come from mutating a frame that another part of the code still holds a reference to.
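The shared-reference problem can be seen in a few lines of plain pandas. This is an illustrative sketch; the frames and the names `shared`, `holder`, and `derived` are hypothetical and not part of any Conformare API.

```python
import pandas as pd

# In-place write: every holder of the same object sees the new column.
original = pd.DataFrame({"spend": [100.0, 240.0], "tenure": [2, 4]})
shared = original  # a second name bound to the same DataFrame object
shared["spend_per_year"] = shared["spend"] / shared["tenure"]
print("spend_per_year" in original.columns)  # the write leaked into `original`

# assign: the input frame is left alone and a new frame is returned.
fresh = pd.DataFrame({"spend": [100.0, 240.0], "tenure": [2, 4]})
holder = fresh  # some other part of the code still holds this reference
derived = fresh.assign(spend_per_year=fresh["spend"] / fresh["tenure"])
print("spend_per_year" in holder.columns)    # `holder` is unchanged
print(derived["spend_per_year"].tolist())
```

Rebinding the result (`df = df.assign(...)`) changes only what the local name points to, so code elsewhere that kept the earlier frame continues to see it as it was.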

Cross-engine pipelines

If you hop between PySpark and pandas (toPandas, createDataFrame, pandas_api, to_pandas, to_spark), enable both adapters so the pandas-stage operations are tracked alongside the Spark ones and the conversion boundaries connect:

cf.trackSpark()
cf.trackPandas()

See the cross-engine example.

Experimental opt-ins

If you genuinely cannot avoid Series or in-place assignment (for example, instrumenting an existing pipeline you are not free to rewrite), two opt-in switches extend tracking to cover them. Both are experimental, both warn on enable, and both link back here.

cf.trackPandasSeries()        # record Series as lineage nodes (verbose)
cf.trackPandas__setitem__()   # track df["x"] = ... by versioning the frame

See the experimental tracking example, which tracks a group-wise Series aggregation rejoining a frame and in-place column writes side by side with the recommended assign form.

  • trackPandasSeries() records pandas.Series results: named single-column extracts (s = df["col"]), group-wise Series aggregations (df.groupby(c)[c2].nunique()), and Series transforms such as reset_index / to_frame that let a Series rejoin a DataFrame. Inline reads – df["a"] / df["b"], df[df["c"] == 1], or the right-hand side of df["x"] = ... – are transient sub-expressions and are not recorded, so the graph stays focused on the Series you actually keep. It is still more verbose than the default; reach for it for debugging or completeness, not as a matter of course.
  • trackPandas__setitem__() tracks in-place column assignment by re-versioning the frame on each write, recording an old -> new edge so successive writes form a chain.

Filter the experimental warnings if you have opted in deliberately:

import warnings
warnings.filterwarnings("ignore", category=cf.ConformareExperimentalWarning)
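If a process-wide filter is broader than you want, the standard library's `warnings.catch_warnings()` can scope the suppression to one block. The sketch below uses only the standard library; `ExperimentalWarning` and `enable_experimental()` are hypothetical stand-ins for `cf.ConformareExperimentalWarning` and an opt-in call, and it assumes the real warning class is an ordinary `Warning` subclass.

```python
import warnings


class ExperimentalWarning(UserWarning):
    """Hypothetical stand-in for cf.ConformareExperimentalWarning."""


def enable_experimental():
    """Hypothetical stand-in for an opt-in switch that warns on enable."""
    warnings.warn("experimental feature enabled", ExperimentalWarning)


# record=True collects the warnings that get through, so the effect of
# the filter can be inspected; filters set here are undone on exit.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=ExperimentalWarning)
    enable_experimental()                                # suppressed
    warnings.warn("unrelated deprecation", UserWarning)  # still recorded

print([str(w.message) for w in caught])
```

Filtering by category rather than by message keeps unrelated warnings visible, and leaving the `with` block restores the previous filter state.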

The recommendation stands regardless: a DataFrame-first pipeline is simpler to read, simpler to profile, and tracked without any opt-in.

