Kubeflow Operations
Fourth phase of Year 4. Kubeflow Pipelines for ML workflows. Katib for HPO. Training Operators for distributed training. The “Train → register → deploy” composition recipe lands end-to-end. ~8 weeks, ~100 hrs.
Where this phase sits
Phases 20-22 built the ML primitives one by one — registry, serving, features. P23 is the phase where they stop being four parallel deployments and start being one flow. The artifact landing here is structural: a Kubeflow pipeline that wires JupyterHub → Ray → MLflow → KServe → mlship into a single reproducible chain, runnable via platform-ctl pipeline run train-deploy. That chain is the first Studio composition recipe to actually run end-to-end on basecamp — the kind of demo you can put in a video and have someone re-run from a clone of the basecamp plan.
Kubeflow has a reputation as a heavy bundle. P23 takes the position that the bundle is the wrong unit of installation: KFP, Katib, and the Training Operators are each useful on their own, with their own depth, and they install standalone without dragging the rest of the Kubeflow ecosystem along. You’ll operate at the component level. Tier 6 of basecamp lights up this phase — the ML platform tier above Tier 5’s ML services.
This is also the phase where model-lifecycle reaches DEEP. P20 introduced it as a frame; P21-22 reinforced it through serving and feature contracts; P23 closes it because you now have a real pipeline that exercises every hand-off (extract → train → eval → gate → register → deploy) under cache, lineage, and approval gates. Drift detection and the retraining trigger arrive in P25, but the lifecycle as a pattern is fully operational by the end of this phase.
Prerequisites
- Phase 22 complete — Feast features defined, train-serve skew prevention working
- You accept: Kubeflow is a heavy bundle, but the components (Pipelines, Katib, Training Operators) are useful individually. Operate at the component level, not the bundle level.
Why this phase exists
P20-22 set up MLflow + KServe + Ray + Feast. P23 orchestrates them: Kubeflow Pipelines runs the whole notebook → Ray train → MLflow register → KServe deploy → mlship flow as a reproducible pipeline.
This is the phase where the train → register → deploy composition recipe lands end-to-end. By phase end, you can re-run “train + deploy next-week-commits” with one command, reproducibly, across all 4 services.
1. PROBLEM
You have a training script. It works in a notebook. Now you want it to:
- Run on a schedule (every Sunday with new commits)
- Run on a trigger (when feature freshness drops or model drift fires)
- Use distributed training (Ray) when the data grows
- Tune hyperparameters (Katib)
- Promote the winner to MLflow Registry
- Auto-deploy if eval > threshold
That’s a pipeline. Kubeflow Pipelines (KFP) is the K8s-native pipeline engine. Argo Workflows is the lower-level alternative.
2. PRINCIPLES
2.1 Pipelines as DAGs of containers
KFP compiles a Python pipeline definition into a pipeline spec (IR YAML in KFP v2), which the KFP backend executes as an Argo Workflow. Each step is a container. Artifacts flow between steps via S3-compatible object storage.
Investigate:
- Read KFP docs — KFP v2 pipelines + components
- Build a 3-step pipeline: extract from Iceberg → train sklearn → register in MLflow
- What’s a Container Component (@dsl.container_component) vs a Lightweight Python Component (@dsl.component)?
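
To make the "DAG of containers" idea concrete before touching KFP, the following is a minimal standard-library sketch of the dependency graph for this phase's flow. It is not KFP code: the `PIPELINE` dict and both helper functions are illustrative assumptions, and a real KFP v2 pipeline would express the same edges by passing one task's outputs into the next inside a `@dsl.pipeline` function.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
# Step names follow this phase's flow; the dict itself is illustrative.
PIPELINE = {
    "extract": set(),
    "train": {"extract"},
    "eval": {"train"},
    "gate": {"eval"},
    "register": {"gate"},
    "deploy": {"register"},
}


def execution_order(dag):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(dag).static_order())


def ready_steps(dag, finished):
    """Steps that have not run yet and whose dependencies have all finished."""
    done = set(finished)
    return sorted(
        step for step, deps in dag.items()
        if step not in done and deps <= done
    )
```

A scheduler such as Argo does roughly what `ready_steps` does on every state change: it launches whichever containers have all upstream steps complete, so independent branches can run in parallel.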
2.2 Hyperparameter optimization (Katib)
Katib runs N parallel training jobs with different hyperparameters; tracks metrics; reports the winner.
Investigate:
- Define a Katib Experiment: 20 trials, Bayesian search over learning_rate + batch_size
- How does Katib pick the next trial? (TPE, random, grid, Bayesian.)
- Compare with Optuna + Ray Tune — same shape, different runtime
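
As a starting point for the Experiment above, here is a hedged sketch that builds a Katib `Experiment` manifest as a plain Python dict and serializes it to JSON. The field names follow the Katib `v1beta1` API as commonly documented; the experiment name, namespace, image, metric name `accuracy`, search ranges, and the `train.py` flags are all assumptions made for illustration, so check them against the Katib version you install.

```python
import json


def katib_experiment(name, namespace, image, max_trials=20, parallel=2):
    """Build a Katib Experiment manifest (v1beta1) as a plain dict."""
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Metric name is an assumption; the trial must print it.
            "objective": {"type": "maximize", "objectiveMetricName": "accuracy"},
            "algorithm": {"algorithmName": "bayesianoptimization"},
            "maxTrialCount": max_trials,        # hard cap on the sweep
            "parallelTrialCount": parallel,
            "maxFailedTrialCount": 3,
            "parameters": [
                {"name": "learning_rate", "parameterType": "double",
                 "feasibleSpace": {"min": "0.0001", "max": "0.1"}},
                {"name": "batch_size", "parameterType": "int",
                 "feasibleSpace": {"min": "16", "max": "128"}},
            ],
            "trialTemplate": {
                "primaryContainerName": "training",
                "trialParameters": [
                    {"name": "learningRate", "reference": "learning_rate"},
                    {"name": "batchSize", "reference": "batch_size"},
                ],
                "trialSpec": {
                    "apiVersion": "batch/v1",
                    "kind": "Job",
                    "spec": {"template": {"spec": {
                        "restartPolicy": "Never",
                        "containers": [{
                            "name": "training",
                            "image": image,  # hypothetical training image
                            "command": [
                                "python", "train.py",  # hypothetical script
                                "--lr=${trialParameters.learningRate}",
                                "--batch-size=${trialParameters.batchSize}",
                            ],
                        }],
                    }}},
                },
            },
        },
    }


def to_json(manifest):
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

Swapping `algorithmName` to `random`, `grid`, or `tpe` is the cheapest way to compare how each strategy picks the next trial on the same search space.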
2.3 Distributed training (Training Operators)
Kubernetes training operators run distributed training as custom resources: PyTorchJob, TFJob, and MPIJob come from the Kubeflow Training Operator, while RayJob comes from KubeRay.
Investigate:
- Submit a RayJob via KubeRay (already from P21); train a slightly bigger model
- Compare with PyTorchJob: when each wins
- Read about gradient accumulation + DDP — preview only; you won’t have GPUs to test seriously
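
Since GPUs are out of reach here, the arithmetic behind gradient accumulation can still be checked on a CPU. The sketch below uses a toy one-parameter model (an assumption chosen so the gradient fits in one line) to show that averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient. Averaging gradients across workers in data-parallel training rests on the same identity.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch):
    """Average per-micro-batch gradients, as gradient accumulation does."""
    if len(xs) % micro_batch != 0:
        raise ValueError("this sketch assumes equal-sized micro-batches")
    steps = len(xs) // micro_batch
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        # Scale each micro-batch gradient by 1/steps before adding it in.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total
```

The equality only holds exactly when micro-batches are the same size and the loss is a mean over examples; with ragged final batches the per-batch scaling has to be weighted by batch size instead.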
2.4 Pipeline observability
Pipelines emit OpenLineage events automatically (with the right config). Failed steps log to Loki. Metrics to Prometheus.
Investigate:
- Wire KFP to OpenLineage; verify lineage in DataHub (Y3 P19)
- Build a Grafana dashboard: pipeline success rate, step duration, cost-per-run
- Integrate with Backstage: surface pipelines in the catalog
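
When wiring lineage, it helps to know what one event looks like on the wire. This is a minimal sketch of an OpenLineage-style run event built as a dict; the core field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`) follow the OpenLineage spec, while the namespace, job name, output dataset, and producer URI are hypothetical. A real event also carries a `schemaURL` for the spec version in use, which is left out here rather than guessed.

```python
import json
import uuid
from datetime import datetime, timezone

ALLOWED_EVENT_TYPES = {"START", "RUNNING", "COMPLETE", "ABORT", "FAIL", "OTHER"}


def lineage_event(event_type, run_id, job_name, inputs, outputs,
                  namespace="basecamp-kfp"):
    """Build an OpenLineage-style run event as a plain dict.

    inputs/outputs are lists of (dataset_namespace, dataset_name) pairs.
    The default job namespace is a hypothetical value.
    """
    if event_type not in ALLOWED_EVENT_TYPES:
        raise ValueError(f"unknown eventType: {event_type}")
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Hypothetical producer URI identifying whatever emits the event.
        "producer": "https://example.invalid/basecamp/kfp-lineage",
    }


def new_run_id():
    """Run ids are UUIDs; reuse one id for the START and COMPLETE of a run."""
    return str(uuid.uuid4())
```

Emitting a START and a COMPLETE (or FAIL) with the same `runId` for each pipeline step is what lets a lineage backend stitch the extract, train, and register steps into one traceable chain.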
2.5 Promotion gates
Don’t auto-promote. Auto-evaluate; gate promotion on eval > threshold; require human approve for prod.
Investigate:
- KFP conditional logic: dsl.If in recent KFP v2 SDKs (dsl.Condition in older ones)
- Eval suite: golden test set + metric thresholds
- Where does the human gate live? (Slack approval bot; Argo manual approval node.)
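
The gate rule above (auto-evaluate, promote to Staging only above threshold, require a human for Production) is small enough to pin down as a pure function before deciding where it runs. The sketch below is an assumption about how that decision could be encoded; `GateResult`, `promotion_gate`, and the metric names are hypothetical, and nothing here calls MLflow or KFP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    stage: str   # "rejected", "Staging", or "Production"
    reason: str


def promotion_gate(metrics, thresholds, human_approved=False):
    """Decide the furthest stage a model may reach.

    A metric passes only if it strictly exceeds its threshold;
    a missing metric counts as a failure.
    """
    failures = sorted(
        name for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) <= minimum
    )
    if failures:
        return GateResult("rejected", "at or below threshold: " + ", ".join(failures))
    if human_approved:
        return GateResult("Production", "thresholds met and approved by a human")
    return GateResult("Staging", "thresholds met; awaiting human approval")
```

Keeping the decision pure makes it easy to unit-test against the golden set and to call from either a pipeline step or an approval bot without duplicating the rule.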
2.6 Idempotent pipelines
Re-running a pipeline shouldn’t double-create artifacts.
→ Pattern: idempotency (revisited)
Investigate:
- KFP cache: same input → cached output, skip the step
- When to disable cache (e.g., when external data changes)
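
The cache behaviour above can be modelled in a few lines, which also shows why external data is the awkward case. This sketch is an assumption about the general technique (hash the step definition plus its inputs), not KFP's actual key derivation; `cache_key` and `StepCache` are hypothetical names.

```python
import hashlib
import json


def cache_key(component_name, image, inputs):
    """Fingerprint a step from its definition and its inputs."""
    payload = json.dumps(
        {"component": component_name, "image": image, "inputs": inputs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StepCache:
    """In-memory stand-in for a pipeline step cache."""

    def __init__(self):
        self._store = {}
        self.executions = 0  # how many times a step really ran

    def run(self, component_name, image, inputs, fn):
        key = cache_key(component_name, image, inputs)
        if key in self._store:
            return self._store[key]      # cache hit: skip the step
        self.executions += 1
        result = fn(**inputs)
        self._store[key] = result
        return result
```

If an extract step takes only a table name, new rows in that table never change the key and the stale output is reused. Passing something that changes with the data (for example a snapshot id) as an explicit input makes the cache invalidate itself, which is often preferable to disabling caching for the whole step.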
3. TRADE-OFFS
| Decision | Option A | Option B | Option C / When |
|---|---|---|---|
| Pipelines | Kubeflow Pipelines | Argo Workflows | Dagster |
| HPO | Katib | Optuna + Ray Tune | Hyperopt |
| Distributed train | Ray | PyTorchJob | MPIJob |
| Pipeline auth | OIDC via Dex | k8s ServiceAccount | OIDC for human users; SA for service-to-service |
| Caching | KFP cache (default) | DVC | manual |
4. TOOLS (as of 2025-10)
- Kubeflow Pipelines 2.x (KFP — install standalone, not the full Kubeflow bundle)
- Katib (HPO)
- KubeRay (already from P21)
- PyTorch Operator (CRD; preview only, since the homelab has no GPUs to exercise it seriously)
- Argo Workflows (KFP backend; understand it)
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| KFP docs — Pipelines + Components | The implementation |
| Argo Workflows docs (KFP runs on Argo) | The substrate |
| “Designing ML Systems” Ch. 8-9 (Data Distribution Shifts + Continual Learning) | Why the loop matters |
5.2 Operational depth checklist
- [ ] Deploy KFP standalone (not full Kubeflow bundle) on basecamp Tier 6
- [ ] Build the Train → Register → Deploy pipeline:
  1. Extract: pull abukix.commits from Iceberg with PIT-correct features (Feast)
  2. Train: Ray-distributed sklearn or PyTorch on the features
  3. Eval: against a held-out test set; emit metrics
  4. Gate: if eval > threshold, register in MLflow as Staging
  5. Deploy: mlship deploy + KServe (canary)
- [ ] Configure KFP cache; verify idempotency on rerun
- [ ] Define a Katib Experiment for HPO; run 20 trials
- [ ] Schedule the pipeline weekly via KFP recurring runs
- [ ] Add manual-approval node for Staging → Production promotion
- [ ] Wire to OpenLineage + DataHub
- [ ] Build pipeline observability dashboard
- [ ] Surface pipelines in Backstage catalog
- [ ] Document the train-deploy flow as basecamp/examples/recipe-train-deploy/
5.3 The composition recipe lands
By phase end, the “Train → register → deploy” Studio composition recipe is real:
- JupyterHub (P15)
- → Ray cluster (P21): distributed training
- → MLflow (P20): model registry with provenance
- → KServe (P21): online serving with canary
- → mlship (P21 v0): one-command deploy

Runnable end-to-end via platform-ctl pipeline run train-deploy (which under the hood triggers KFP).
This is the first runnable composition recipe you can demo to anyone visiting the Studio. The full composition catalog (5 recipes by Y5 end) is documented in the basecamp plan; P23 ships recipe #1, P24 ships #2, and Year 5 ships the rest.
6. COMPARE: KFP vs Argo Workflows (raw)
KFP runs on Argo. You could write Argo Workflow YAML directly. When does KFP earn its abstraction tax?
400 words.
7. OPERATE
- 4+ runbooks (kfp-pipeline-failed-step-debug, katib-trials-stuck, pipeline-cache-poisoned, mlflow-registry-promotion-blocked)
- 1+ postmortem
- Weekly log
8. CONTRIBUTE
KFP, Katib, KubeRay, MLflow, Argo Workflows: all Apache-2.0 licensed, CNCF-hosted or CNCF-adjacent, and all welcoming to contributors.
Validation criteria
- [ ] All 10 operational depth checks
- [ ] KFP + Katib + Training Operators in basecamp Tier 6
- [ ] Train → register → deploy recipe runnable end-to-end via platform-ctl
- [ ] Recipe documented in basecamp/examples/
- [ ] KFP cache verified idempotent
- [ ] Manual-approval gate live for Staging → Production
- [ ] KFP vs Argo comparison written up
- [ ] 4+ runbooks; 1+ postmortem; 8+ weekly log entries
- [ ] Pattern entries deepened:
  - idempotency → reinforced (KFP cache)
  - model-lifecycle → DEEP (P25 closes the retrain side; for now, end-to-end orchestration earns DEEP here)
- [ ] Exit Test passed
Exit Test
Time: 3 hours.
- Build (90 min): given the recipe in basecamp/examples/recipe-train-deploy/, modify it to use a new feature from Feast; rerun; verify cache hits where expected; verify the model is promoted to Staging.
- Diagnose (60 min): scenario: pipeline succeeded but the model never appeared in the registry. Trace via OpenLineage + KFP logs.
- Articulate (30 min): 600 words: “Walk the train→deploy pipeline. What patterns fire at each step? Where could it go wrong?”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Full Kubeflow bundle install | Heavy; install components individually |
| Pipelines without caching | Slow re-runs; wasted compute |
| Auto-promote to Production | One bad deploy + you’re paged |
| HPO sweep without budget | Cost runaway; cap trial count |
| Pipelines without lineage | When a wrong number lands, you can’t trace it |
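
The "HPO sweep without budget" row is easy to guard against in code. The sketch below is a plain random search (deliberately not Katib's algorithm) that stops at whichever comes first: the trial cap or the time budget. The function name, the fixed per-trial cost, and the log-uniform learning-rate range are illustrative assumptions.

```python
import random


def budgeted_search(objective, max_trials, max_seconds, trial_seconds, seed=0):
    """Random search over learning rate with a trial cap and a time budget.

    trial_seconds is an assumed fixed cost per trial; a real sweep
    would measure it instead.
    """
    rng = random.Random(seed)
    best = None
    spent = 0.0
    history = []
    for _ in range(max_trials):
        if spent + trial_seconds > max_seconds:
            break  # the next trial would exceed the budget
        lr = 10 ** rng.uniform(-4, -1)   # log-uniform sample
        score = objective(lr)
        spent += trial_seconds
        history.append((lr, score))
        if best is None or score > best[1]:
            best = (lr, score)
    return best, history
```

In Katib the equivalent caps are `maxTrialCount` and `maxFailedTrialCount` on the Experiment; a wall-clock or cost budget usually has to be enforced outside it.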
Patterns deepened this phase
- model-lifecycle → DEEP
- idempotency → reinforced