Data sensitivity
Sensitive data rarely stays put. An email column read at the source flows through joins, gets renamed, lands in an aggregate, and sometimes leaves the pipeline in a written output. Conformare flags columns that look like protected data, shows every step where they are used across the pipeline and, crucially, reports whether they reach a point where they could leave it.
Why track it
- Know the blast radius. If a column is sensitive, you want to see every step it touches, not just where it was introduced. That is the difference between “we hold email” and “email reaches the marketing export”.
- Spot exfiltration risk. The pressing question for a reviewer is: does this sensitive column reach a written output? Conformare marks each sensitive column with whether it flows into a sink (a write/export). A sensitive column that reaches a sink is the data that actually leaves your control, exactly what a DPO or auditor needs to see.
- Drive the right risk. Sensitivity findings are the natural prompt to declare a governance risk (for example privacy.pii_exposure) with a mitigation and an owner. See Risks & governance.
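The two questions above (which steps does a column touch, and does it reach a sink?) amount to a reachability walk over the pipeline's lineage. The sketch below is a conceptual illustration only, not Conformare's implementation: the step names, the `EDGES` and `SINKS` structures, and the `blast_radius` helper are all hypothetical, and it assumes for simplicity that the column is carried through every downstream step.

```python
from collections import deque

# Hypothetical lineage: each step lists the steps that consume its output.
# Step names are made up for illustration.
EDGES = {
    "read_customers": ["join_orders"],
    "join_orders": ["aggregate_spend", "write_marketing_export"],
    "aggregate_spend": [],
    "write_marketing_export": [],
}
SINKS = {"write_marketing_export"}  # steps that write or export data


def blast_radius(start: str) -> list[str]:
    """Breadth-first walk: every step reachable from where the column enters."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        step = queue.popleft()
        order.append(step)
        for nxt in EDGES.get(step, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


steps = blast_radius("read_customers")
reaches_sink = any(step in SINKS for step in steps)
print(steps)         # every step the column touches
print(reaches_sink)  # True when at least one of those steps is a sink
```

A real lineage would track individual columns through renames and drops rather than whole steps, which is why a column-aware report is more precise than this step-level walk.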
How auto-detection works
Detection is name-based. Conformare normalises each column name (splits camelCase, lowercases, and collapses separators), then matches whole words against a catalog of patterns for common protected information:
| Category | Examples of matched names |
|---|---|
| Contact | email, phone, mobile, telephone |
| Identity | name, surname, dob, date_of_birth |
| Government ID | ssn, national_insurance, passport, licence |
| Financial | iban, card_number, sort_code, salary |
| Online identifier | ip, device_id, cookie, uuid |
| Location | address, postcode, latitude, longitude |
| Special category | health, diagnosis, race, religion, gender |
| Credentials | password, secret, token, api_key |
Each match carries a category, a severity (low to critical) and a confidence. Because it works from names, detection flags candidates, not confirmed PII. A column called customer_name is flagged; a column called col_7 that happens to hold names is not. Treat the heuristics as a fast first pass, then confirm with manual marks.
You can test the classifier on a single name:
import conformare as cf
cf.classify_column("date_of_birth")
# [{'tag': 'dob', 'category': 'Identity', 'severity': 'high', ...}]
cf.classify_column("col_7") # [] -> not detected by name
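To make the normalisation and whole-word matching described above more concrete, here is a standalone sketch that does not use Conformare at all. The `CATALOG` entries, their severities, and the `normalise` and `match_patterns` helpers are assumptions chosen for illustration; the real catalog and matching rules may differ.

```python
import re

# Hypothetical mini-catalog: normalised pattern -> (category, severity).
# Entries and severities are illustrative, not Conformare's actual catalog.
CATALOG = {
    "email": ("Contact", "high"),
    "date of birth": ("Identity", "high"),
    "dob": ("Identity", "high"),
    "ip": ("Online identifier", "medium"),
    "api key": ("Credentials", "critical"),
}


def normalise(name: str) -> str:
    """Split camelCase, lowercase, and collapse separators to single spaces."""
    # Insert a space at each lower-to-upper boundary: "dateOfBirth" -> "date Of Birth".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Replace any run of non-alphanumeric characters with one space.
    return re.sub(r"[^a-z0-9]+", " ", spaced.lower()).strip()


def match_patterns(name: str) -> list[dict]:
    """Return catalog entries whose words appear as whole words in the name."""
    padded = f" {normalise(name)} "
    return [
        {"pattern": pattern, "category": category, "severity": severity}
        for pattern, (category, severity) in CATALOG.items()
        if f" {pattern} " in padded
    ]


print(normalise("customerEmail_Address"))  # customer email address
print(match_patterns("dateOfBirth"))       # one hit, for "date of birth"
print(match_patterns("shipping_date"))     # [] because "ip" is not a whole word here
```

Matching on whole words rather than substrings is what keeps a name like shipping_date from being flagged just because it contains the letters "ip".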
What name-based detection cannot do
Name-based detection does not read values, so it misses sensitive data hidden behind opaque names and can over-flag innocuous columns whose names merely look sensitive. That is what manual definitions are for.
Adding a sensitive column definition
When you know a column is sensitive (or want to override a heuristic), assert it with mark_sensitive(...). Manual marks always take precedence over the heuristics and appear in the report with manual as their source.
import conformare as cf
# A column the name heuristics would not catch on their own.
cf.mark_sensitive("region",
                  tag="location", category="Location", severity="medium")
# Mark several at once.
cf.mark_sensitive("acct_ref", "legacy_id", category="Financial", severity="high")
# Remove a mark (e.g. a false positive from the heuristics you want to suppress).
cf.unmark_sensitive("nickname")
Arguments:
- one or more column names;
- tag, a short machine label (default: manual);
- category, the human grouping shown in the report (default: Manual);
- severity, one of low, medium, high, critical (default: high).
Declare marks before the pipeline runs so every step that carries the column is classified consistently.
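The precedence rule (manual marks win over heuristics) and the argument defaults listed above can be pictured with a small standalone model. This is a sketch under stated assumptions, not Conformare's code: `manual_marks`, `mark`, and `resolve` are hypothetical names, and the heuristic result is passed in as a plain dictionary.

```python
from typing import Optional

SEVERITIES = ("low", "medium", "high", "critical")

# Hypothetical registry of manual marks; structure is illustrative only.
manual_marks: dict[str, dict] = {}


def mark(*columns: str, tag: str = "manual", category: str = "Manual",
         severity: str = "high") -> None:
    """Record a manual classification for one or more columns."""
    if not columns:
        raise ValueError("at least one column name is required")
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}")
    for column in columns:
        manual_marks[column] = {
            "tag": tag, "category": category,
            "severity": severity, "source": "manual",
        }


def resolve(column: str, heuristic: Optional[dict]) -> Optional[dict]:
    """Return the manual mark if present, else the heuristic result (may be None)."""
    return manual_marks.get(column, heuristic)


mark("acct_ref", "legacy_id", category="Financial", severity="high")
guess = {"tag": "id", "category": "Identity", "severity": "low", "source": "heuristic"}
print(resolve("legacy_id", guess)["source"])  # manual
print(resolve("order_ref", guess)["source"])  # heuristic
```

Because the lookup consults the manual registry first, a mark declared before the pipeline runs applies to every step that carries the column, which is the consistency the advice above is after.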
From sensitivity to governance
Sensitivity tells you where protected data is and whether it leaves; the risk register tells you what you are doing about it. They pair naturally:
import conformare as cf

cf.mark_sensitive("region", tag="location", category="Location", severity="medium")

# `report` is assumed to be a DataFrame built earlier in the pipeline.
with cf.describe("Export regional report",
                 risks=cf.risk("privacy.pii_exposure",
                               note="region + spend written to shared export",
                               mitigation="Aggregate to >=50 customers per region",
                               owner="data-governance")):
    report.write_csv("regional_spend.csv")  # sink -> exfiltration check applies
In the report you then see the sensitive column, every step it touches, that it reaches a written output, and the owned mitigation that addresses it: the full picture in one place.