Kubeflow Operations

Fourth phase of Year 4. Kubeflow Pipelines for ML workflows. Katib for HPO. Training Operators for distributed training. The “Train → register → deploy” composition recipe lands end-to-end. ~8 weeks, ~100 hrs.

Where this phase sits

Phases 20-22 built the ML primitives one by one: registry (MLflow), serving (KServe), distributed compute (Ray), and features (Feast). P23 is the phase where they stop being four parallel deployments and start being one flow. The artifact landing here is structural: a Kubeflow pipeline that wires JupyterHub → Ray → MLflow → KServe → mlship into a single reproducible chain, runnable via platform-ctl pipeline run train-deploy. That chain is the first Studio composition recipe to actually run end-to-end on basecamp, the kind of demo you can put in a video and have someone re-run from a clone of the basecamp plan.

Kubeflow has a reputation as a heavy bundle. P23 takes the position that the bundle is the wrong unit of installation: KFP, Katib, and the Training Operators are each useful on their own, with their own depth, and they install standalone without dragging the rest of the Kubeflow ecosystem along. You’ll operate at the component level. Tier 6 of basecamp lights up this phase — the ML platform tier above Tier 5’s ML services.

This is also the phase where model-lifecycle reaches DEEP. P20 introduced it as a frame; P21-22 reinforced it through serving and feature contracts; P23 closes it because you now have a real pipeline that exercises every hand-off (extract → train → eval → gate → register → deploy) under cache, lineage, and approval gates. Drift detection and the retraining trigger arrive in P25, but the lifecycle as a pattern is fully operational by the end of this phase.

Prerequisites

Phase 22 complete — Feast features defined, train-serve skew prevention working

You accept: Kubeflow is a heavy bundle, but the components (Pipelines, Katib, Training Operators) are useful individually. Operate at the component level, not the bundle level.

Why this phase exists

P20-22 set up MLflow + KServe + Ray + Feast. P23 orchestrates them: Kubeflow Pipelines runs the whole notebook → Ray train → MLflow register → KServe deploy → mlship flow as a reproducible pipeline.

This is the phase where the train → register → deploy composition recipe lands end-to-end. By phase end, you can re-run “train + deploy next-week-commits” with one command, reproducibly, across all 4 services.

1. PROBLEM

You have a training script. It works in a notebook. Now you want it to:

Run on a schedule (every Sunday with new commits)
Run on a trigger (when feature freshness drops or model drift fires)
Use distributed training (Ray) when the data grows
Tune hyperparameters (Katib)
Promote the winner to MLflow Registry
Auto-deploy if eval > threshold

That’s a pipeline. Kubeflow Pipelines (KFP) is the K8s-native pipeline engine. Argo Workflows is the lower-level alternative.

2. PRINCIPLES

2.1 Pipelines as DAGs of containers

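The KFP v2 SDK compiles a Python pipeline definition to a YAML pipeline spec (an intermediate representation); the KFP backend then executes that spec as an Argo Workflow. Each step is a container. Artifacts flow between steps via S3-compatible object storage; small parameters are passed by value.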

Investigate:

Read KFP docs — KFP v2 pipelines + components
Build a 3-step pipeline: extract from Iceberg → train sklearn → register in MLflow
What’s a “ContainerComponent” vs a “PythonComponent”?
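
To make the "DAG of containers" idea concrete without a cluster, here is a minimal sketch in plain Python. It is not the KFP API: the STEPS mapping, the three step functions, and the artifact URIs are hypothetical stand-ins for containers and object-store paths. It shows only the two mechanics described above: steps run in dependency order, and each step receives the artifact URIs its upstream steps produced.

```python
from graphlib import TopologicalSorter

# Hypothetical steps. Each takes {upstream_step_name: artifact_uri} and
# returns the URI of the artifact it "wrote". In KFP each would be a container.
def extract(upstream):
    return "s3://artifacts/extract/commits.parquet"

def train(upstream):
    # A real step would download upstream["extract"] before fitting.
    assert "extract" in upstream
    return "s3://artifacts/train/model.pkl"

def register(upstream):
    assert "train" in upstream
    # "commit-model" is an illustrative registry name, not one from the plan.
    return "models:/commit-model/1"

# step name -> (callable, names of upstream steps)
STEPS = {
    "extract": (extract, []),
    "train": (train, ["extract"]),
    "register": (register, ["train"]),
}

def run_pipeline(steps):
    """Run steps in dependency order; return (order, outputs)."""
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    order = list(TopologicalSorter(graph).static_order())
    outputs = {}
    for name in order:
        func, deps = steps[name]
        # Hand each step only the outputs of its declared dependencies.
        outputs[name] = func({dep: outputs[dep] for dep in deps})
    return order, outputs

order, outputs = run_pipeline(STEPS)
print(order)
print(outputs["register"])
```

The real engine adds what this toy leaves out: container isolation, retries, artifact upload and download, and metadata tracking. Those are the things to look for when reading the KFP docs.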

2.2 Hyperparameter optimization (Katib)

Katib runs N parallel training jobs with different hyperparameters; tracks metrics; reports the winner.

Investigate:

Define a Katib Experiment: 20 trials, Bayesian search over learning_rate + batch_size
How does Katib pick the next trial? (TPE, random, grid, Bayesian.)
Compare with Optuna + Ray Tune — same shape, different runtime
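
As a starting point for the Experiment above, the sketch below builds a Katib manifest as a plain Python dict and prints it as JSON, which kubectl apply -f accepts. The field names follow the v1beta1 Experiment spec as I understand it; verify them against the CRD installed on your cluster. The experiment name, namespace, image, metric name, search ranges, and train.py flags are all assumptions made for illustration.

```python
import json

def katib_experiment(name, image, max_trials=20, parallel=2):
    """Build a Katib v1beta1 Experiment manifest as a plain dict."""
    container = {
        "name": "training",
        "image": image,  # hypothetical training image
        "command": [
            "python", "train.py",  # hypothetical entry point and flags
            "--lr=${trialParameters.learningRate}",
            "--batch-size=${trialParameters.batchSize}",
        ],
    }
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": "kubeflow"},
        "spec": {
            "objective": {"type": "maximize", "objectiveMetricName": "accuracy"},
            "algorithm": {"algorithmName": "bayesianoptimization"},
            "maxTrialCount": max_trials,      # the budget: caps total trials
            "parallelTrialCount": parallel,   # how many trials run at once
            "maxFailedTrialCount": 3,
            "parameters": [
                {"name": "learning_rate", "parameterType": "double",
                 "feasibleSpace": {"min": "0.0001", "max": "0.1"}},
                {"name": "batch_size", "parameterType": "int",
                 "feasibleSpace": {"min": "16", "max": "128"}},
            ],
            "trialTemplate": {
                "primaryContainerName": "training",
                "trialParameters": [
                    {"name": "learningRate", "reference": "learning_rate"},
                    {"name": "batchSize", "reference": "batch_size"},
                ],
                "trialSpec": {
                    "apiVersion": "batch/v1",
                    "kind": "Job",
                    "spec": {"template": {"spec": {
                        "restartPolicy": "Never",
                        "containers": [container],
                    }}},
                },
            },
        },
    }

experiment = katib_experiment("commit-model-hpo", "registry.example/train:dev")
print(json.dumps(experiment, indent=2))
```

Swapping algorithmName for random, grid, or tpe while keeping everything else fixed is a cheap way to compare how the search strategies spend the same 20-trial budget.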

2.3 Distributed training (Training Operators)

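The Kubeflow Training Operator exposes distributed training as Kubernetes custom resources (PyTorchJob, TFJob, MPIJob). RayJob is the equivalent custom resource from KubeRay, not from the Training Operator.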

Investigate:

Submit a RayJob via KubeRay (already from P21); train a slightly bigger model
Compare with PyTorchJob: when each wins
Read about gradient accumulation + DDP — preview only; you won’t have GPUs to test seriously
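
Since gradient accumulation is preview-only here, a CPU-only worked example may help. This sketch uses a one-parameter linear model with a mean-squared-error loss (my choice of toy model, not anything from the curriculum) to show that averaging gradients over equal-size micro-batches reproduces the full-batch gradient. It also computes the effective batch size that accumulation and data parallelism give together.

```python
import math

def mse_grad(w, xs, ys):
    """Gradient d/dw of mean((w*x - y)^2) over the given samples."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch):
    """Average per-micro-batch gradients, as gradient accumulation does."""
    assert len(xs) % micro_batch == 0, "sketch assumes equal-size micro-batches"
    steps = len(xs) // micro_batch
    total = 0.0
    for k in range(steps):
        lo, hi = k * micro_batch, (k + 1) * micro_batch
        # Scale each micro-batch gradient by 1/steps before summing.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total

def effective_batch_size(per_device, accumulation_steps, world_size):
    """Samples contributing to one optimizer step across all workers."""
    return per_device * accumulation_steps * world_size

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

full = mse_grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch=2)
print(math.isclose(full, accum, rel_tol=1e-9))
print(effective_batch_size(per_device=32, accumulation_steps=4, world_size=2))
```

DDP applies the same averaging across workers instead of across time, which is why the two techniques compose into a single effective batch size.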

2.4 Pipeline observability

KFP records run and artifact lineage in its own metadata store; getting OpenLineage events out of a pipeline takes extra configuration or instrumentation in the steps. Failed steps log to Loki. Metrics go to Prometheus.

Investigate:

Wire KFP to OpenLineage; verify lineage in DataHub (Y3 P19)
Build a Grafana dashboard: pipeline success rate, step duration, cost-per-run
Integrate with Backstage: surface pipelines in the catalog
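
Before building the Grafana dashboard, it helps to pin down exactly what each panel computes. The sketch below derives success rate, median run duration, and cost-per-run from a list of run records. The record shape, the sample values, and the flat cost-per-core-hour rate are assumptions for illustration; in practice the same numbers would come from Prometheus queries or the KFP API.

```python
from statistics import median

# Hypothetical run records; field names are illustrative, not a KFP schema.
RUNS = [
    {"id": "r1", "status": "SUCCEEDED", "duration_s": 1200, "cpu_core_hours": 2.0},
    {"id": "r2", "status": "FAILED", "duration_s": 300, "cpu_core_hours": 0.5},
    {"id": "r3", "status": "SUCCEEDED", "duration_s": 1500, "cpu_core_hours": 2.5},
    {"id": "r4", "status": "SUCCEEDED", "duration_s": 900, "cpu_core_hours": 1.0},
]

def summarize(runs, cost_per_core_hour):
    """Compute the three dashboard numbers from a list of run records."""
    if not runs:
        raise ValueError("no runs to summarize")
    total = len(runs)
    succeeded = sum(1 for run in runs if run["status"] == "SUCCEEDED")
    core_hours = sum(run["cpu_core_hours"] for run in runs)
    return {
        "success_rate": succeeded / total,
        "median_duration_s": median(run["duration_s"] for run in runs),
        # Failed runs still cost compute, so they stay in the average.
        "cost_per_run": core_hours * cost_per_core_hour / total,
    }

summary = summarize(RUNS, cost_per_core_hour=0.04)
print(summary)
```

Keeping failed runs in the cost average is a deliberate choice: a pipeline that fails cheaply and often can still be the most expensive thing on the cluster.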

2.5 Promotion gates

Don’t auto-promote. Auto-evaluate; gate promotion on eval > threshold; require human approve for prod.

Investigate:

KFP dsl.If (dsl.Condition in older SDK releases) for conditional logic; it is a control-flow construct, not a component
Eval suite: golden test set + metric thresholds
Where does the human gate live? (Slack approval bot; Argo manual approval node.)
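
The gate logic itself is small enough to write down before choosing where it runs. The following is a minimal sketch, assuming metrics where higher is better and a hypothetical human_approved flag supplied by whatever approval mechanism you pick; the metric names and thresholds are illustrative.

```python
def promotion_decision(metrics, thresholds, human_approved=False):
    """Return 'reject', 'staging', or 'production' for a candidate model.

    A metric passes only if it is strictly greater than its threshold.
    A metric missing from the eval output counts as a failure.
    """
    failing = [
        name for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) <= minimum
    ]
    if failing:
        return "reject"
    # Passing eval only earns Staging; Production needs a human sign-off.
    return "production" if human_approved else "staging"

thresholds = {"accuracy": 0.90, "f1": 0.85}
candidate = {"accuracy": 0.91, "f1": 0.88}

print(promotion_decision(candidate, thresholds))
print(promotion_decision(candidate, thresholds, human_approved=True))
print(promotion_decision({"accuracy": 0.95}, thresholds))
```

Treating a missing metric as a failure matters in practice: an eval step that silently stops emitting f1 should block promotion, not wave the model through.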

2.6 Idempotent pipelines

Re-running a pipeline shouldn’t double-create artifacts.

→ Pattern: idempotency (revisited)

Investigate:

KFP cache: same input → cached output, skip the step
When to disable cache (e.g., when external data changes)
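
The cache behaviour above can be modelled in a few lines. This sketch is not KFP's actual cache-key algorithm; it is a simplified stand-in that hashes the component name, image, and inputs. The StepCache class, the image name, and the snapshot_id input are hypothetical. It illustrates both bullets: identical inputs skip execution, and external data changes are invisible to the cache unless they appear in the inputs.

```python
import hashlib
import json

def cache_key(component, image, inputs):
    """Hash everything that is supposed to determine a step's output."""
    payload = json.dumps(
        {"component": component, "image": image, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class StepCache:
    """In-memory stand-in for a pipeline engine's step cache."""

    def __init__(self):
        self._store = {}
        self.executions = 0

    def run(self, component, image, inputs, func):
        key = cache_key(component, image, inputs)
        if key not in self._store:
            # Cache miss: execute the step and remember its output.
            self.executions += 1
            self._store[key] = func(**inputs)
        return self._store[key]

def extract(table, snapshot_id):
    return f"s3://artifacts/{table}/{snapshot_id}.parquet"

cache = StepCache()
image = "registry.example/extract:1.0"  # hypothetical image

first = cache.run("extract", image,
                  {"table": "abukix.commits", "snapshot_id": "snap-1"}, extract)
again = cache.run("extract", image,
                  {"table": "abukix.commits", "snapshot_id": "snap-1"}, extract)
fresh = cache.run("extract", image,
                  {"table": "abukix.commits", "snapshot_id": "snap-2"}, extract)
print(cache.executions)
```

Passing a snapshot identifier as an explicit input is usually a better fix than turning caching off: the cache stays useful for reruns against the same data and is bypassed only when the data actually changed. For steps with no such handle, the KFP SDK lets you disable caching per task (set_caching_options(False) on the task object, as I understand the v2 SDK).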

3. TRADE-OFFS

Decision	Option A	Option B	Option C	When
Pipelines	Kubeflow Pipelines	Argo Workflows	Dagster	-
HPO	Katib	Optuna + Ray Tune	Hyperopt	-
Distributed train	Ray	PyTorchJob	MPIJob	-
Pipeline auth	OIDC via Dex	k8s ServiceAccount	-	OIDC for human users; SA for service-to-service
Caching	KFP cache (default)	DVC	manual	-

4. TOOLS (as of 2025-10)

Kubeflow Pipelines 2.x (KFP — install standalone, not the full Kubeflow bundle)
Katib (HPO)
KubeRay (already from P21)
PyTorch Operator (PyTorchJob CRD; preview only, since the homelab has no GPUs to exercise it seriously)
Argo Workflows (KFP backend; understand it)

5. MASTERY

5.1 Reading list

Required	Why
KFP docs — Pipelines + Components	The implementation
Argo Workflows docs (KFP runs on Argo)	The substrate
“Designing ML Systems” Ch. 8-9 (Data Distribution Shifts + Continual Learning)	Why the loop matters

5.2 Operational depth checklist

[ ] Deploy KFP standalone (not full Kubeflow bundle) on basecamp Tier 6
[ ] Build the Train → Register → Deploy pipeline:
    1. Extract: pull abukix.commits from Iceberg with PIT-correct features (Feast)
    2. Train: Ray-distributed sklearn or PyTorch on the features
    3. Eval: against a held-out test set; emit metrics
    4. Gate: if eval > threshold, register in MLflow as Staging
    5. Deploy: mlship deploy + KServe (canary)
[ ] Configure KFP cache; verify idempotency on rerun
[ ] Define a Katib Experiment for HPO; run 20 trials
[ ] Schedule the pipeline weekly via KFP recurring runs
[ ] Add manual-approval node for Staging → Production promotion
[ ] Wire to OpenLineage + DataHub
[ ] Build pipeline observability dashboard
[ ] Surface pipelines in Backstage catalog
[ ] Document the train-deploy flow as basecamp/examples/recipe-train-deploy/

5.3 The composition recipe lands

By phase end, the “Train → register → deploy” Studio composition recipe is real:

JupyterHub (P15)
  → Ray cluster (P21) — distributed training
  → MLflow (P20) — model registry with provenance
  → KServe (P21) — online serving with canary
  → mlship (P21 v0) — one-command deploy

Runnable end-to-end via platform-ctl pipeline run train-deploy (which under the hood triggers KFP).

This is the first runnable composition recipe you can demo to anyone visiting the Studio. The full composition catalog (5 recipes by Y5 end) is documented in the basecamp plan; P23 ships recipe #1, P24 ships #2, and Year 5 ships the rest.

6. COMPARE: KFP vs Argo Workflows (raw)

KFP runs on Argo. You could write Argo Workflow YAML directly. When does KFP earn its abstraction tax?

400 words.

7. OPERATE

4+ runbooks (kfp-pipeline-failed-step-debug, katib-trials-stuck, pipeline-cache-poisoned, mlflow-registry-promotion-blocked)
1+ postmortem
Weekly log

8. CONTRIBUTE

KFP, Katib, KubeRay, MLflow, Argo Workflows: all CNCF-adjacent, all Apache-2.0-licensed, all welcoming.

Validation criteria

[ ] All 10 operational depth checks
[ ] KFP + Katib + Training Operators in basecamp Tier 6
[ ] Train → register → deploy recipe runnable end-to-end via platform-ctl
[ ] Recipe documented in basecamp/examples/
[ ] KFP cache verified idempotent
[ ] Manual-approval gate live for Staging → Production
[ ] KFP vs Argo comparison written up
[ ] 4+ runbooks; 1+ postmortem; 8+ weekly log entries
[ ] Pattern entries deepened:
    - idempotency → reinforced (KFP cache)
    - model-lifecycle → DEEP (P25 closes the retrain side; for now, end-to-end orchestration earns DEEP here)
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min) — given the recipe in basecamp/examples/recipe-train-deploy/, modify it to use a new feature from Feast; rerun; verify cache hits where expected; verify model promoted to Staging.
Diagnose (60 min) — scenario: pipeline succeeded but the model never appeared in the registry. Trace via OpenLineage + KFP logs.
Articulate (30 min) — 600 words: “Walk the train→deploy pipeline. What patterns fire at each step? Where could it go wrong?”

Anti-patterns

Anti-pattern	Why
Full Kubeflow bundle install	Heavy; install components individually
Pipelines without caching	Slow re-runs; wasted compute
Auto-promote to Production	One bad deploy + you’re paged
HPO sweep without budget	Cost runaway; cap trial count
Pipelines without lineage	When a wrong number lands, you can’t trace it back to its source

Patterns deepened this phase

model-lifecycle → DEEP
idempotency → reinforced

→ Next: Phase 24: LLM Infrastructure + RAG + llm-gateway v1