Data Governance + Lineage (Year 3 Capstone)

Final phase of Year 3. Add governance (who can access what, with what audit trail) + lineage (where did this data come from?). Synthesize Y3 into a coherent data platform. basecamp goes public this phase. ~8 weeks, ~100 hrs.

Prerequisites

Phase 18 complete — Trino + Superset + personal-api operational

You accept: without governance, the data platform becomes a compliance nightmare. Lineage answers “where did this come from?” — the question that prevents most data-quality incidents.

Why this phase exists

Year 3’s exit ramp is Senior DevOps / Data Platform Engineer. That role owns not just the running pipelines but the discipline that prevents data incidents:

Who’s allowed to query PII? Audit log proves it.
This dashboard says X; where does X actually come from? Lineage shows it.
This model trained on data including EU users; was that allowed? Governance proves it.

This phase ships those controls. It’s also where basecamp goes public — a real OSS launch with sanitized secrets, README, blog post, and the first time strangers can clone your platform and run it themselves. That’s the moat described in the Master Plan, and Year 3 is when it becomes real.

The capstone shape is intentional. Y3 hasn’t introduced governance as a separate Y2-style discipline — it’s introduced data tooling phase by phase, and only now does the discipline of governing it land. That’s the right order: governance without something to govern is theatre; governance applied to a real working platform is what separates a hobby project from a credible OSS release.

1. PROBLEM

The data platform has many sources, many transformations, many consumers. You need:

Schema management + evolution rules
Access control per-column (PII protection)
Lineage: source → transformation → consumer
Audit log: who queried what when
Data quality tests in the pipeline (gating production)
A catalog where humans browse “what data exists, who owns it, how to use it”

→ See: storage-and-data, observability-and-ops

2. PRINCIPLES

2.1 Data catalog

A catalog = service registry for data. Humans + agents browse “what tables exist, who owns them, what shape they have, and what they’re for.”

Investigate:

Install DataHub (or OpenMetadata) on basecamp.
Auto-discover Iceberg tables via ingestion connector.
Add manual descriptions + ownership; surface in Backstage too — the data catalog and the service catalog should not be two separate silos.
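Before trusting the catalog, it helps to know which auto-discovered tables still lack the human-supplied parts (owner, description). The sketch below is not DataHub's or OpenMetadata's API. `CatalogEntry`, `missing_metadata`, and the `abukix.commits_daily` table are hypothetical names used only to show the shape of the check.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One table as a catalog might describe it (hypothetical shape)."""
    name: str
    owner: str = ""
    description: str = ""
    columns: dict[str, str] = field(default_factory=dict)  # column -> type


def missing_metadata(entries: list[CatalogEntry]) -> dict[str, list[str]]:
    """Return, per table, which human-supplied fields are still empty."""
    gaps: dict[str, list[str]] = {}
    for entry in entries:
        missing = [
            label
            for label, value in (("owner", entry.owner),
                                 ("description", entry.description))
            if not value.strip()
        ]
        if missing:
            gaps[entry.name] = missing
    return gaps


entries = [
    # Fully described table.
    CatalogEntry("abukix.commits", owner="platform",
                 description="One row per git commit",
                 columns={"sha": "string", "author_email": "string"}),
    # Auto-discovered, but nobody has claimed or described it yet.
    CatalogEntry("abukix.commits_daily",
                 columns={"day": "date", "commit_count": "bigint"}),
]

print(missing_metadata(entries))
# {'abukix.commits_daily': ['owner', 'description']}
```

In practice the entries would come from the catalog's own API or export; the point is that "ownership + description present" is a checkable property, not a hope.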

2.2 Lineage

The DAG of data flows: source → transformation → output. The same shape Backstage’s component graph has for services, applied to data.

Investigate:

Configure dbt to emit lineage to DataHub.
Configure Airflow + OpenLineage; see DAG-level lineage.
Find one production query; trace it back to source tables, then back to producers (commits → Flink → Iceberg → dbt model → personal-api endpoint). The full chain from P16 through P18 becomes a single graph in DataHub.
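The trace-back exercise above is a walk over a directed graph. A minimal sketch, assuming the lineage edges have already been exported from the catalog into a plain dict; the node names are illustrative stand-ins for the commits → Flink → Iceberg → dbt → personal-api chain, not real DataHub identifiers.

```python
from collections import deque

# Edges point downstream: producer -> consumers. Names are illustrative.
DOWNSTREAM = {
    "git.commits": ["flink.commits_stream"],
    "flink.commits_stream": ["iceberg.abukix.commits"],
    "iceberg.abukix.commits": ["dbt.commits_daily"],
    "dbt.commits_daily": ["personal-api./commits"],
}


def invert(edges):
    """Turn producer -> consumers into consumer -> producers."""
    upstream = {}
    for src, dsts in edges.items():
        for dst in dsts:
            upstream.setdefault(dst, []).append(src)
    return upstream


def trace_upstream(node, edges):
    """Breadth-first walk from a consumer to every ancestor, nearest first."""
    upstream = invert(edges)
    seen, order, queue = {node}, [], deque([node])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


print(trace_upstream("personal-api./commits", DOWNSTREAM))
# ['dbt.commits_daily', 'iceberg.abukix.commits', 'flink.commits_stream', 'git.commits']
```

If the walk stops early on the real graph, that gap is the missing lineage emitter to go fix.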

2.3 Access control

Column-level masking and row filtering on Iceberg tables (Trino access-control rules or Apache Ranger)
Audit logs from Trino (who ran what)
OIDC + group-based access via Dex (already in place from Year 2)
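One way these three pieces can meet is Trino's file-based access control, where table rules can carry a per-column `mask` expression and are matched per group. The sketch below generates such a rules file from Python. Treat the details as assumptions to verify against the Trino docs for your version: the catalog name `iceberg`, the column `author_email`, and the exact rule layout are illustrative, and the group names are borrowed from the Dex mapping example later in this phase.

```python
import json


def mask_email(email: str) -> str:
    """Python reference for the mask: first letter + ****@domain."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "****"  # not a well-formed address; reveal nothing
    return f"{local[0]}****@{domain}"


# SQL expression intended to mirror mask_email inside Trino.
# The column name author_email is an assumption.
EMAIL_MASK_SQL = (
    "substr(author_email, 1, 1) || '****@' || split_part(author_email, '@', 2)"
)

# Rules are evaluated in order; the first matching rule applies.
rules = {
    "tables": [
        {   # platform admins see raw values
            "group": "platform-admin",
            "catalog": "iceberg", "schema": "abukix", "table": "commits",
            "privileges": ["SELECT", "OWNERSHIP"],
        },
        {   # analysts can read, but the email column is masked
            "group": "analyst-read-only",
            "catalog": "iceberg", "schema": "abukix", "table": "commits",
            "privileges": ["SELECT"],
            "columns": [{"name": "author_email", "mask": EMAIL_MASK_SQL}],
        },
    ]
}

print(mask_email("alice@example.com"))  # a****@example.com
print(json.dumps(rules, indent=2))
```

Keeping a Python reference for the mask next to the SQL expression gives you something to test the masked output against once the rule is live.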

Investigate:

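Configure a row-level access policy on an Iceberg table (Iceberg itself does not enforce access; the policy lives in the query engine or in Ranger).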
Set up column masking (e.g., email → first letter + ****@domain.com).
Stream Trino query log to Loki for audit (the P14 telemetry stack pays off here).
Build a Superset dashboard: “queries-per-user-per-day” from the audit log.
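The "queries-per-user-per-day" dashboard is a group-by over the audit log. A minimal sketch of that aggregation, assuming each log line has been flattened to JSON with `user`, `ts`, and `query` keys; that shape is hypothetical, and the real Trino event payload is richer and nested differently.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical flattened audit-log lines; the real payload will differ.
LOG_LINES = [
    '{"user": "alice", "ts": "2025-10-01T09:15:00+00:00", "query": "SELECT 1"}',
    '{"user": "alice", "ts": "2025-10-01T17:40:00+00:00", "query": "SELECT 2"}',
    '{"user": "bob", "ts": "2025-10-02T08:00:00+00:00", "query": "SELECT 3"}',
]


def queries_per_user_per_day(lines):
    """Count queries keyed by (user, ISO date)."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        day = datetime.fromisoformat(event["ts"]).date().isoformat()
        counts[(event["user"], day)] += 1
    return counts


for (user, day), n in sorted(queries_per_user_per_day(LOG_LINES).items()):
    print(user, day, n)
# alice 2025-10-01 2
# bob 2025-10-02 1
```

On the platform the same aggregation would run as a query behind the Superset chart; the Python version is only there to pin down what the chart should show.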

2.4 Data quality tests

→ Pattern: blameless-postmortem (DEEP — quality tests prevent the postmortem)

dbt tests, Great Expectations, soda-core. The discipline is making test failures block production, not generate Slack noise nobody reads.

Investigate:

Add dbt tests on critical models (not_null, unique, accepted_values) wherever P17 left gaps; the baseline tests already exist from P17.
Install Great Expectations; build a suite for abukix.commits.
Wire failed tests to PagerDuty / Slack via Alertmanager.
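"Blocking production" comes down to a non-zero exit code that the pipeline respects. The sketch below re-implements the three dbt-style checks in plain Python to show the gating logic. It is not the dbt or Great Expectations API, and the sample rows and the `branch` column are made up.

```python
def not_null(rows, column):
    """Indexes of rows where the column is missing or None."""
    return [i for i, r in enumerate(rows) if r.get(column) is None]


def unique(rows, column):
    """Indexes of rows whose value was already seen earlier."""
    seen, dupes = set(), []
    for i, r in enumerate(rows):
        value = r.get(column)
        if value in seen:
            dupes.append(i)
        seen.add(value)
    return dupes


def accepted_values(rows, column, allowed):
    """Indexes of rows whose value is outside the allowed set."""
    return [i for i, r in enumerate(rows) if r.get(column) not in allowed]


def run_gate(rows):
    """Run all checks; return only the failing ones with offending row indexes."""
    results = {
        "not_null(sha)": not_null(rows, "sha"),
        "unique(sha)": unique(rows, "sha"),
        "accepted_values(branch)": accepted_values(rows, "branch", {"main", "dev"}),
    }
    return {name: idx for name, idx in results.items() if idx}


def main(rows) -> int:
    """Exit code for the pipeline step: 0 = promote, 1 = block."""
    failed = run_gate(rows)
    for name, idx in failed.items():
        print(f"FAIL {name}: rows {idx}")
    return 1 if failed else 0


rows = [
    {"sha": "a1", "branch": "main"},
    {"sha": "a1", "branch": "dev"},          # duplicate sha
    {"sha": None, "branch": "feature-x"},    # null sha, unexpected branch
]
exit_code = main(rows)
print("exit code:", exit_code)  # exit code: 1
# In CI the step would end with: raise SystemExit(exit_code)
```

The alert to Slack is a side effect; the exit code is what actually stops bad data from being promoted.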

2.5 Schema evolution rules

Which changes are safe? Which ones break consumers?

Investigate:

Iceberg schema evolution rules (which changes are forward-compatible?).
Schema registry for Kafka (Redpanda’s built-in).
A “deprecation policy” for breaking changes — write it down as an ADR so future-you (or future contributors to public basecamp) know the rules.
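A deprecation policy is easier to enforce when the rules are executable. The sketch below classifies a schema diff as safe or breaking under a deliberately small rule set: adding an optional column and widening int → long or float → double count as safe, everything else as breaking. Those two promotions are ones Iceberg permits; treating a dropped column as breaking is a consumer-facing policy choice rather than an Iceberg restriction. The column names are hypothetical.

```python
# Type widenings treated as safe in this sketch (decimal precision
# widening is omitted for brevity).
SAFE_PROMOTIONS = {("int", "long"), ("float", "double")}


def classify_changes(old, new):
    """Compare schemas given as {column: (type, required)}.

    Returns a list of (column, change, verdict) tuples.
    """
    changes = []
    for name, (_, required) in new.items():
        if name not in old:
            # A new required column breaks writers that do not supply it.
            changes.append((name, "added", "breaking" if required else "safe"))
    for name, (old_type, old_required) in old.items():
        if name not in new:
            # Policy choice: consumers may still select this column.
            changes.append((name, "dropped", "breaking"))
            continue
        new_type, new_required = new[name]
        if new_type != old_type:
            verdict = "safe" if (old_type, new_type) in SAFE_PROMOTIONS else "breaking"
            changes.append((name, f"type {old_type}->{new_type}", verdict))
        if new_required and not old_required:
            changes.append((name, "optional->required", "breaking"))
    return changes


old = {
    "sha": ("string", True),
    "additions": ("int", False),
    "legacy_flag": ("boolean", False),
}
new = {
    "sha": ("string", True),
    "additions": ("long", False),
    "repo": ("string", False),
}

for change in classify_changes(old, new):
    print(change)
# ('repo', 'added', 'safe')
# ('additions', 'type int->long', 'safe')
# ('legacy_flag', 'dropped', 'breaking')
```

Whatever rules the ADR settles on, encoding them this way lets a CI check reject a breaking change that skipped the deprecation window.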

2.6 Sanitization for public release

basecamp going public means: no plaintext secrets, no internal hostnames, no PII, no employer-internal references. The sanitization checklist below is non-negotiable — one leaked credential makes a year of work into a security postmortem.

Investigate:

Audit basecamp for plaintext secrets; replace all with SealedSecrets (already done in Y2 P12).
Find any internal hostnames / IPs hardcoded; replace with examples.
Run gitleaks on basecamp git history; remediate any findings. If a real secret turns up, rotate it and rewrite history; amending the latest commit leaves the secret in earlier commits.
Add a CONTRIBUTING.md, CODE_OF_CONDUCT.md, LICENSE (MIT or Apache-2.0).
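gitleaks targets secrets, so the hostname / IP item above benefits from its own pass. A minimal sketch that flags private-range IPv4 addresses and hostnames with internal-looking suffixes. The suffix list (`.internal`, `.corp`, `.lan`) and the sample text are assumptions, so adjust them to whatever naming your homelab actually uses. This complements gitleaks; it does not replace it.

```python
import ipaddress
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Suffixes are an assumption about what "internal" looks like on your network.
INTERNAL_HOST = re.compile(r"\b[\w-]+(?:\.[\w-]+)*\.(?:internal|corp|lan)\b")


def find_internal_refs(text):
    """Return (line number, match) for private IPs and internal-looking hosts."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for candidate in IPV4.findall(line):
            try:
                if ipaddress.ip_address(candidate).is_private:
                    hits.append((lineno, candidate))
            except ValueError:
                pass  # looked like an IP but is not one (e.g. 999.1.1.1)
        hits.extend((lineno, host) for host in INTERNAL_HOST.findall(line))
    return hits


sample = """server: https://10.0.4.17:6443
registry: harbor.homelab.lan
docs: https://example.com
dns: 8.8.8.8"""

print(find_internal_refs(sample))
# [(1, '10.0.4.17'), (2, 'harbor.homelab.lan')]
```

Run it over the working tree first, then over historical blobs if anything turns up, since a hostname removed in the latest commit is still visible in history.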

3. TRADE-OFFS

Decision	Option A	Option B	Other option / when to choose
Catalog	DataHub (LinkedIn)	OpenMetadata	Apache Atlas (legacy)
Lineage	dbt-emitted	OpenLineage (cross-tool standard)	OpenLineage works across Airflow + Spark + Flink + dbt
Access control	Iceberg row filters	Apache Ranger	lakeFS
Quality	dbt tests	Great Expectations	soda-core

4. TOOLS (as of 2025-10)

DataHub OR OpenMetadata
OpenLineage (cross-tool standard)
Great Expectations OR soda-core
dbt (already from P17)
gitleaks for secret-scanning before public release

5. MASTERY

5.1 Reading list

Required	Why
DataHub / OpenMetadata docs	The implementation
OpenLineage spec	The standard

Recommended	Why
“Data Mesh” (Zhamak Dehghani)	The architectural shift Year 4+ may follow
GDPR / SOC2 overviews — even at homelab scale	Real-world frame

5.2 Operational depth checklist

[ ] Install DataHub (or OpenMetadata) on K3s in basecamp
[ ] Ingest metadata from Iceberg via Nessie + Trino
[ ] Configure dbt to emit lineage to DataHub
[ ] Configure Airflow with OpenLineage; observe DAG lineage in catalog
[ ] Configure Flink with OpenLineage; observe stream lineage
[ ] Add dbt tests on critical models; alert on failures via Alertmanager
[ ] Install Great Expectations; build one expectation suite for abukix.commits
[ ] Configure column-level masking on a PII column in an Iceberg table via Trino access-control rules
[ ] Stream Trino query log to Loki; build audit dashboard in Superset
[ ] Set up Dex group → Trino role mapping (e.g., analyst-read-only vs platform-admin)
[ ] Document the data platform: what tables exist, who owns them, what they're for

5.3 The big launch: basecamp goes public

This is the year’s biggest moment. A real OSS launch.

basecamp public-release checklist:

[ ] gitleaks scan; remediate any findings (rewrite history if needed)
[ ] All secrets via Sealed Secrets — no plaintext anywhere
[ ] No internal hostnames; sanitize to example.com / .local equivalents
[ ] Add LICENSE (MIT or Apache-2.0); CONTRIBUTING.md; CODE_OF_CONDUCT.md
[ ] Add a comprehensive README:
    - What basecamp is (the 9-tier architecture)
    - Quick start ("clone + bootstrap on K3s in 4 hours")
    - Architecture diagram
    - Per-tier docs
    - Roadmap
[ ] Tag v1.0.0 (or v0.19.0 reflecting the phase number)
[ ] Repository goes public on GitHub
[ ] Blog post on abukix.dev/blog: "basecamp at end of Year 3 — what I learned"
[ ] LinkedIn announcement: thoughtful, with the architecture diagram
[ ] Submit to GitOps community newsletter, /r/kubernetes
[ ] Add to ROOT/program/projects/basecamp/PLAN.md status: "PUBLIC since YYYY-MM-DD"

This is the first major launch since terralabs in Y2 P9. It’s the moment your platform becomes visible: the point where most Master Plan readers land, clone it, and run their own instance. Everything before this phase has been infrastructure for the launch. Everything after is operating the platform with a real (small, but real) audience watching.

→ See: projects/basecamp/plan

6. COMPARE: DataHub vs OpenMetadata

Install both for a week. Compare ingestion ergonomics, UX, ecosystem, lineage rendering, governance feature depth. Pick one and write the ADR explaining why — future contributors to public basecamp will read it.

400 words.

7. OPERATE

4+ runbooks (datahub-ingestion-failure, lineage-broken-trace, column-mask-bypass-investigation, quality-test-flapping)
2+ postmortems (Y3 capstone — these are the year’s biggest learnings; one of them should be from the public-release week)
1+ ADR (why-datahub-or-openmetadata)
Weekly log

8. CONTRIBUTE

Year 3 PR deadline. DataHub, OpenMetadata, OpenLineage, dbt, Great Expectations all have welcoming communities. An OpenLineage producer for a tool that doesn’t have one yet (or a docs PR for a real-world basecamp configuration) lands well here.

Validation criteria (= Year 3 Final Exam readiness)

[ ] All 11 operational depth checks
[ ] Data catalog live; ownership + lineage visible
[ ] Access control: column masking + audit log
[ ] dbt + GE tests gating production
[ ] DataHub vs OpenMetadata writeup
[ ] basecamp PUBLIC on GitHub with LinkedIn + blog launch
[ ] 4+ runbooks; 2+ postmortems; 1+ ADR; 8+ weekly log entries
[ ] All Year 3 patterns DEEP:
    - three-pillars + cardinality + runbook-as-code + blameless-postmortem (P14)
    - oltp-vs-olap + schema-on-read + append-only-log + snapshot-plus-delta (P15)
    - lsm-vs-btree + write-ahead-logging (deepened from Y1 OUTLINE)
    - stream-processing + lambda-and-kappa + batch-processing + materialized-views (P16-P17)
    - caching (P18)
[ ] Year 3 Final Exam passed

Anti-patterns

Anti-pattern	Why
“We’ll add governance later”	“Later” never arrives; do it from day one
Lineage as docs (manual)	Auto-discover; manual gets stale
PII everywhere because “it’s hard to filter”	Compliance disaster; column masking is a few lines of config
Public-launching basecamp without `gitleaks`	One leaked secret can ruin a year
Loud LinkedIn launch with mediocre README	First impressions matter; polish before announce

Patterns deepened this phase

All Year 3 patterns reach DEEP. By Phase 19 end, basecamp is the concrete embodiment of every pattern from Y1-Y3.

Year 3 graduation

You can:
- Design + operate observability for a multi-cluster platform (incl. eBPF)
- Architect + run a lakehouse (MinIO + Iceberg) with notebooks-as-a-service
- Build streaming pipelines (Redpanda + Flink) with exactly-once-ish semantics
- Build batch pipelines (Spark + Airflow + dbt) with backfill discipline
- Federate queries across data sources (Trino) + serve to analysts (Superset)
- Govern data (catalog, lineage, access control, quality tests)
- Operate a public OSS platform — basecamp now public

Exit ramp: Senior DevOps / Data Platform Engineer / Site Reliability Engineer
Confidence: ~35 patterns DEEP, multi-cloud platform operational + public,
            personal-api running on your own platform

→ Year 3 Final Exam, then Year 4 — ML & AI Infrastructure builds Tiers 5/6/7 on top of the data tier you just stood up.