Data sensitivity

Sensitive data rarely stays put. An email column read at the source flows through joins, gets renamed, lands in an aggregate, and sometimes leaves the pipeline in a written output. Conformare flags columns that look like protected data, shows where they are used across the pipeline and, crucially, reports whether they reach a point where they could leave it.

Why track it

  • Know the blast radius. If a column is sensitive, you want to see every step it touches, not just where it was introduced. That is the difference between “we hold email” and “email reaches the marketing export”.
  • Spot exfiltration risk. The pressing question for a reviewer is: does this sensitive column reach a written output? Conformare marks each sensitive column with whether it flows into a sink (a write/export). A sensitive column that reaches a sink is data that actually leaves your control, which is exactly what a DPO (data protection officer) or auditor needs to see.
  • Drive the right risk. Sensitivity findings are the natural prompt to declare a governance risk (for example privacy.pii_exposure) with a mitigation and an owner. See Risks & governance.
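
The "blast radius" and "reaches a sink" questions above are both reachability questions over column lineage. The following is a minimal sketch of that idea, not Conformare's implementation: the lineage edges, node names, and the helpers downstream and reaches_sink are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical lineage: each key feeds the columns or outputs listed as its value.
EDGES = {
    "source.email": ["joined.email"],
    "joined.email": ["renamed.contact"],
    "renamed.contact": ["marketing_export.csv"],
    "source.signup_date": ["agg.signups_per_day"],
}
# Hypothetical set of written outputs (sinks).
SINKS = {"marketing_export.csv"}


def downstream(start: str) -> list[str]:
    """Breadth-first walk listing every node the start column flows into."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def reaches_sink(start: str) -> bool:
    """True if any downstream node is a written output."""
    return any(node in SINKS for node in downstream(start))


print(downstream("source.email"))          # the blast radius
print(reaches_sink("source.email"))        # does it leave the pipeline?
print(reaches_sink("source.signup_date"))
```

In this toy lineage, the email column reaches the marketing export while the signup date stops at an aggregate, which is the distinction between "we hold email" and "email reaches the marketing export".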

How auto-detection works

Detection is name-based. Conformare normalises each column name (splits camelCase, lowercases, and collapses separators), then matches whole words against a catalog of patterns for common protected information:

Category            Examples of matched names
------------------  ------------------------------------------
Contact             email, phone, mobile, telephone
Identity            name, surname, dob, date_of_birth
Government ID       ssn, national_insurance, passport, licence
Financial           iban, card_number, sort_code, salary
Online identifier   ip, device_id, cookie, uuid
Location            address, postcode, latitude, longitude
Special category    health, diagnosis, race, religion, gender
Credentials         password, secret, token, api_key

Each match carries a category, a severity (low to critical) and a confidence. Because it works from names, detection flags candidates, not confirmed PII. A column called customer_name is flagged; a column called col_7 that happens to hold names is not. Treat the heuristics as a fast first pass, then confirm with manual marks.

You can test the classifier on a single name:

import conformare as cf

cf.classify_column("date_of_birth")
# [{'tag': 'dob', 'category': 'Identity', 'severity': 'high', ...}]

cf.classify_column("col_7")     # []  -> not detected by name
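
To make the normalise-then-match description concrete, here is a rough standalone sketch of name-based detection. It is an illustration under assumptions, not Conformare's code: the abbreviated CATALOG, its severities, and the functions normalise and classify are hypothetical, and the real catalog and confidence scoring are not reproduced.

```python
import re

# Hypothetical, abbreviated catalog: word sequence -> (tag, category, severity).
# Severities are illustrative placeholders, not Conformare's actual values.
CATALOG = {
    ("email",): ("email", "Contact", "high"),
    ("dob",): ("dob", "Identity", "high"),
    ("date", "of", "birth"): ("dob", "Identity", "high"),
    ("ip",): ("ip", "Online identifier", "medium"),
    ("api", "key"): ("api_key", "Credentials", "critical"),
}


def normalise(name: str) -> list[str]:
    """Split camelCase, lowercase, and collapse separators into a word list."""
    # Break between a lowercase letter or digit and a following capital.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Any run of non-alphanumeric characters counts as one separator.
    return [w for w in re.split(r"[^a-z0-9]+", spaced.lower()) if w]


def classify(name: str) -> list[dict]:
    """Return catalog entries whose words appear as a contiguous whole-word run."""
    words = normalise(name)
    hits = []
    for pattern, (tag, category, severity) in CATALOG.items():
        n = len(pattern)
        if any(tuple(words[i:i + n]) == pattern for i in range(len(words) - n + 1)):
            hits.append({"tag": tag, "category": category, "severity": severity})
    return hits


print(classify("date_of_birth"))
print(classify("clientIP"))
print(classify("shipment"))  # contains "ip" as a substring, but not as a whole word
```

Matching whole words rather than substrings is what keeps a name like shipment from being flagged as an IP address, while clientIP and client_ip normalise to the same words and are both caught.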

What name-based detection cannot do

Name-based detection does not read values, so it misses sensitive data hidden behind opaque names and can over-flag innocuous columns whose names merely look sensitive. That is what manual definitions are for.

Adding a sensitive column definition

When you know a column is sensitive (or want to override a heuristic), assert it with mark_sensitive(...). Manual marks always win over the heuristics and appear in the report with their source shown as manual.

import conformare as cf

# A column the name heuristics would not catch on their own.
cf.mark_sensitive("region",
                  tag="location", category="Location", severity="medium")

# Mark several at once.
cf.mark_sensitive("acct_ref", "legacy_id", category="Financial", severity="high")

# Remove a mark (e.g. a false positive from the heuristics you want to suppress).
cf.unmark_sensitive("nickname")

Arguments:

  • one or more column names;
  • tag, a short machine label (default manual);
  • category, the human grouping shown in the report (default Manual);
  • severity, one of low, medium, high, or critical (default high).

Declare marks before the pipeline runs so every step that carries the column is classified consistently.
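
The precedence rule (manual marks win over heuristics) and the argument defaults listed above can be pictured with a small sketch. This is an assumption-laden illustration, not the library's internals: the registry _manual, the functions mark and resolve, and the "heuristic" source label are all hypothetical.

```python
SEVERITIES = ("low", "medium", "high", "critical")

# Hypothetical in-memory registry of manual marks, keyed by column name.
_manual: dict[str, dict] = {}


def mark(*columns: str, tag: str = "manual", category: str = "Manual",
         severity: str = "high") -> None:
    """Record a manual mark for each column, using the documented defaults."""
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}, got {severity!r}")
    for column in columns:
        _manual[column] = {"tag": tag, "category": category,
                           "severity": severity, "source": "manual"}


def resolve(column: str, heuristic_hits: list[dict]) -> list[dict]:
    """A manual mark replaces heuristic hits; otherwise the heuristics stand."""
    if column in _manual:
        return [_manual[column]]
    # "heuristic" is an assumed label for name-based matches.
    return [dict(hit, source="heuristic") for hit in heuristic_hits]


email_hit = {"tag": "email", "category": "Contact", "severity": "high"}
mark("acct_ref", "legacy_id", category="Financial", severity="high")

print(resolve("acct_ref", []))        # manual mark on an opaque name
print(resolve("email", [email_hit]))  # no manual mark, heuristic result kept
```

Because the registry is consulted when a column is resolved, a mark added after some steps have already been classified would not apply to them in this sketch, which mirrors the advice to declare marks before the pipeline runs.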

From sensitivity to governance

Sensitivity tells you where protected data is and whether it leaves; the risk register tells you what you are doing about it. They pair naturally:

import conformare as cf

# Declare the mark before the pipeline runs.
cf.mark_sensitive("region", tag="location", category="Location", severity="medium")

# `report` is assumed to be a DataFrame built earlier in the pipeline.
with cf.describe("Export regional report",
                 risks=cf.risk("privacy.pii_exposure",
                               note="region + spend written to shared export",
                               mitigation="Aggregate to >=50 customers per region",
                               owner="data-governance")):
    report.write_csv("regional_spend.csv")     # sink -> exfiltration check applies

In the report you then see the sensitive column, every step it touches, that it reaches a written output, and the owned mitigation that addresses it: the full picture in one place.

