Model Serving Infrastructure (KServe)

Phase 38 of /root Year 4: KServe as the K8s-native model serving layer. InferenceService CRDs, scale-to-zero via KEDA, canary deploys via Flagger. Tier 6 of basecamp completes. 6-8 weeks, ~70-90 hours.

Eighth and final phase of Year 4. Model serving as a control loop.

Phase 37 trained models. This phase serves them. KServe is the K8s-native model serving layer — InferenceService is the CRD, KServe controllers reconcile it into the underlying Deployment + Service + Ingress + autoscaler + observability. By phase end basecamp serves models via the same operator-pattern that runs everything else: declare a CRD, the operator handles the rest.

This closes Year 4. Tiers 1-6 of basecamp are alive. The ML platform — data, training, serving — runs end-to-end. Y5 builds the LLM + agent layer (Tier 7) on top.


Prerequisites

  • Phase 37 complete; KubeRay operational
  • At least one model trained + registered in MLflow
  • 12 hrs/week budget reserved

Why this phase exists

Serving is where ML meets reality. The model that worked in your notebook now has to respond in milliseconds, scale with traffic, version cleanly, deploy safely (canary), and expose its own metrics. KServe handles all of this through the standard K8s-native pattern. A roll-your-own FastAPI service works until traffic grows; then you end up reinventing KServe poorly.


The pattern-first frame

Same eight steps.


1. PROBLEM

You have a registered model in MLflow. You want it deployed: behind an HTTP endpoint, scaling with traffic, version-aware (current + canary), observable (latency, error rate, throughput), failure-tolerant (retries, fallback). And you want to do this declaratively, K8s-native.

That’s the model serving problem. KServe is the K8s-native answer; BentoML, Triton, and Seldon Core are alternatives.


2. PRINCIPLES

2.1 Model serving as a control loop

A KServe InferenceService CRD declares what model + which framework + scaling profile + canary split. The KServe controller reconciles it into the deployed state.

→ Pattern: model-serving — OUTLINE this phase (deepens through Y5 LLM serving)

Investigate:

  • Walk an InferenceService CRD: declare → KServe creates the predictor pods (plus transformer and explainer pods, if you declared them) → Service + Ingress → autoscaler.
  • Why is “serving as a control loop” structurally similar to “GitOps for deployments”?
  • What’s Predictor vs Transformer vs Explainer in KServe’s architecture?
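
The declare step can be sketched as a manifest built in Python and dumped to JSON. This is a minimal sketch, not a production spec: the model name, namespace, and storage path are hypothetical, and it assumes the MLflow model format is served over the v2 protocol. Only the predictor is declared; transformer and explainer are left out.

```python
import json


def inference_service(name, storage_uri, model_format="mlflow",
                      min_replicas=1, max_replicas=3, canary_percent=None):
    """Build an InferenceService manifest as a plain dict (illustrative sketch)."""
    predictor = {
        "minReplicas": min_replicas,
        "maxReplicas": max_replicas,
        "model": {
            "modelFormat": {"name": model_format},
            "protocolVersion": "v2",
            "storageUri": storage_uri,
        },
    }
    if canary_percent is not None:
        # Percentage of traffic routed to the newest revision during a canary.
        predictor["canaryTrafficPercent"] = canary_percent
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": "models"},  # namespace is hypothetical
        "spec": {"predictor": predictor},
    }


# Hypothetical model name and artifact location, for illustration only.
manifest = inference_service(
    "churn-classifier",
    "s3://mlflow-artifacts/churn-classifier/model",
    canary_percent=10,
)
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed output can be piped into `kubectl apply -f -`. Everything after that (pods, Service, routing, autoscaler) is the controller's job, which is the point of the pattern.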

2.2 Inference shapes — sync, batch, streaming

Different serving patterns: synchronous (low latency, per-request), batch (offline, throughput-optimized), streaming (continuous, async).

→ Pattern: inference-shapes

Investigate:

  • Walk sync inference: HTTP POST → model predict → response.
  • When does batch inference (offline) fit instead of online?
  • What’s streaming inference, and when is it the right shape (real-time feature transformation, video)?
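
The three shapes differ in how the caller and the model meet, not in the model itself. A toy sketch, assuming a stand-in linear `predict` function (hypothetical, not a real model) so the three call patterns can be compared side by side:

```python
from typing import Iterable, Iterator, List


def predict(features: List[float]) -> float:
    """Toy stand-in for a model: a fixed linear score."""
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))


def sync_inference(request: List[float]) -> float:
    """Sync: one request in, one prediction out, the caller waits."""
    return predict(request)


def batch_inference(rows: List[List[float]], chunk_size: int = 2) -> List[float]:
    """Batch: score a stored dataset chunk by chunk; throughput matters, not per-row latency."""
    results: List[float] = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        results.extend(predict(row) for row in chunk)
    return results


def streaming_inference(events: Iterable[List[float]]) -> Iterator[float]:
    """Streaming: consume a possibly unbounded source and emit predictions as events arrive."""
    for event in events:
        yield predict(event)


rows = [[1.0, 2.0, 3.0], [0.0, 0.0, 1.0], [2.0, 0.0, 0.0]]
print(sync_inference(rows[0]))
print(batch_inference(rows))
print(list(streaming_inference(iter(rows))))
```

The generator in `streaming_inference` never needs the full input in memory, which is what separates it from the batch shape even when both produce the same numbers.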

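2.3 Scale-to-zero + KEDA integration

KServe + KEDA: when traffic drops to zero, replicas scale to zero; when a request arrives, KEDA scales the predictor back up. This is cost-efficient for sporadic workloads. Note that KEDA only adjusts replica counts: something in the request path (for example the KEDA HTTP add-on, or Knative's activator in KServe's Serverless mode) has to hold the first request while a replica starts.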

→ Pattern: serverless-inference
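
A minimal sketch of the KEDA side, built as a Python dict that can be dumped to JSON. Assumptions, all to be checked against your cluster: the predictor Deployment is named `<name>-predictor`, Prometheus is reachable at the given URL, and the query uses a hypothetical metric `gateway_requests_total` measured at the ingress. The metric has to come from outside the pods, because at zero replicas the pods emit nothing.

```python
import json


def scaled_object(isvc_name, prometheus_url, query, threshold="1",
                  max_replicas=3, cooldown_seconds=300):
    """Build a KEDA ScaledObject manifest for a predictor Deployment (illustrative sketch)."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{isvc_name}-scaler"},
        "spec": {
            # Assumed Deployment name; verify with `kubectl get deploy`.
            "scaleTargetRef": {"name": f"{isvc_name}-predictor"},
            "minReplicaCount": 0,           # allow scale-to-zero
            "maxReplicaCount": max_replicas,
            "cooldownPeriod": cooldown_seconds,  # idle seconds before dropping to zero
            "triggers": [{
                "type": "prometheus",
                "metadata": {
                    "serverAddress": prometheus_url,
                    "query": query,
                    "threshold": threshold,
                },
            }],
        },
    }


so = scaled_object(
    "churn-classifier",                      # hypothetical InferenceService name
    "http://prometheus.monitoring:9090",     # hypothetical Prometheus address
    'sum(rate(gateway_requests_total{service="churn-classifier"}[1m]))',  # hypothetical metric
)
print(json.dumps(so, indent=2))
```

If KServe also creates its own autoscaler for the same Deployment, the two will fight; check the KServe docs for how to hand autoscaling over to an external controller before applying a hand-written ScaledObject.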

Investigate:

  • Walk scale-from-zero: request arrives → a proxy in the request path holds it → KEDA spins up a replica → request served.
  • What’s the cold-start cost, and when is it acceptable?
  • When does scale-to-zero NOT fit? (Hint: latency-critical, always-warm requirements.)
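
The cold-start question comes down to arithmetic. A rough sketch under stated assumptions: cold start is the sum of scheduling, image pull, and model load time; every cold request misses the SLO if the cold start exceeds it; warm requests always meet it. The numbers in the usage lines are made up.

```python
def cold_start_seconds(schedule_s, image_pull_s, model_load_s, warmup_s=0.0):
    """Total time from 'request arrives at zero replicas' to 'replica can answer'."""
    return schedule_s + image_pull_s + model_load_s + warmup_s


def scale_to_zero_fits(cold_start_s, latency_slo_s, cold_request_fraction, slo_target=0.99):
    """Return True if the latency SLO target still holds with cold starts included.

    cold_request_fraction: share of requests that land on zero replicas (0.0 to 1.0).
    slo_target: required share of requests inside the latency SLO.
    """
    if cold_start_s <= latency_slo_s:
        return True  # even a cold request is fast enough
    # Cold requests all miss; warm requests are assumed to all pass.
    return (1.0 - cold_request_fraction) >= slo_target


cold = cold_start_seconds(schedule_s=2, image_pull_s=15, model_load_s=20)
print(cold)
print(scale_to_zero_fits(cold, latency_slo_s=0.2, cold_request_fraction=0.001))
print(scale_to_zero_fits(cold, latency_slo_s=0.2, cold_request_fraction=0.05))
```

With these made-up numbers, one cold request in a thousand stays inside a 99% target, while one in twenty does not. That is the shape of the answer to "when does scale-to-zero NOT fit".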

2.4 Canary deploys for models

Models change frequently. Canary deploys (5% traffic to v2, 95% to v1) let you validate new versions in production before full rollout. Flagger (K8s-native) automates this.

→ Pattern: canary-deploys (general infrastructure pattern, applied to models)

Investigate:

  • Walk a Flagger canary: deploy v2 → 5% traffic → measure SLI → success → increase → full rollout.
  • What’s an automatic rollback trigger?
  • Why does a model canary differ from a regular service canary? (Hint: model output distributions matter, not just error rates.)
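
The walk above can be simulated in a few lines. This is a toy loop in the spirit of a progressive rollout, not Flagger's implementation: the function and parameter names are invented for illustration, and the "output shift" gate is one simple assumption about what a model-aware check could look like.

```python
def run_canary(check, step=5, max_weight=50, max_failed_checks=2):
    """Raise canary traffic step by step; roll back after too many failed checks.

    check: function taking the candidate weight (percent) and returning True if healthy.
    Returns (outcome, list of weights that were reached).
    """
    weight, failed, history = 0, 0, []
    while weight < max_weight:
        candidate = min(weight + step, max_weight)
        if check(candidate):
            weight = candidate
            history.append(weight)
        else:
            failed += 1  # hold the current weight and count the failure
            if failed >= max_failed_checks:
                return "rolled_back", history
    return "promoted", history


def make_check(success_rate, primary_mean, canary_mean,
               min_success=0.99, max_mean_shift=0.1):
    """Combine a classic SLI gate with a model-specific output-distribution gate."""
    def check(_weight):
        healthy = success_rate >= min_success
        similar_outputs = abs(canary_mean - primary_mean) <= max_mean_shift
        return healthy and similar_outputs
    return check


print(run_canary(make_check(0.999, primary_mean=0.30, canary_mean=0.32)))
print(run_canary(make_check(0.999, primary_mean=0.30, canary_mean=0.55)))
```

The second run has a perfect success rate and still rolls back, because the canary's mean prediction has moved too far from the primary's. A service canary would have promoted it.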

2.5 Observability for ML services

Beyond RED + USE: prediction distribution, feature drift, model staleness, prediction-latency percentiles.

Investigate:

  • What’s prediction drift, and how do you detect it without ground truth?
  • How does KServe integrate with the OTel stack (Phase 28)?
  • When do you alert on prediction-distribution shifts?
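
One way to detect drift without ground truth is to compare the distribution of recent predictions against a baseline window. A minimal sketch using the Population Stability Index, assuming scores fall in [0, 1], both samples are non-empty, and the bin edges are fixed up front. The 0.25 alert level often quoted for PSI is a rule of thumb, not a standard.

```python
import math


def histogram(values, edges):
    """Fraction of values per bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = len(values)
    return [c / total for c in counts]


def psi(baseline, current, edges, eps=1e-6):
    """Population Stability Index: sum over bins of (q - p) * ln(q / p)."""
    p = histogram(baseline, edges)
    q = histogram(current, edges)
    # eps keeps the logarithm finite when a bin is empty in one sample.
    return sum((qi - pi) * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
baseline_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]       # made-up baseline window
current_scores = [0.8, 0.85, 0.9, 0.95, 0.9, 0.7, 0.6, 0.3]      # made-up recent window

print(psi(baseline_scores, baseline_scores, edges))
print(psi(baseline_scores, current_scores, edges) > 0.25)
```

A scheduled job can run this over the last hour of logged predictions and push the value to Prometheus as a gauge, which turns "prediction-distribution shift" into an ordinary alert rule.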

2.6 Framework support

KServe supports PyTorch (TorchServe), TensorFlow, sklearn, XGBoost, LightGBM, HuggingFace, ONNX, MLflow models, custom. The runtime is pluggable via ServingRuntime CRDs.

Investigate:

  • What’s the difference between “model server” (Triton, TorchServe) and “serving framework” (KServe)?
  • When do you use KServe’s Triton runtime for performance?
  • How does KServe support custom Python serving code?

3. TRADE-OFFS

Decision | Options | Cost
Serving framework | KServe; BentoML; Seldon Core; raw FastAPI; Triton Inference Server | KServe: K8s-native CRDs (recommended). BentoML: model-packaging-first. Seldon: similar to KServe. FastAPI: simplest, no autoscale. Triton: NVIDIA-optimized.
Autoscaler | KEDA (scale-to-zero); HPA (Horizontal Pod Autoscaler); Karpenter (node-level) | KEDA: event-driven (recommended for sporadic inference). HPA: metric-driven. Karpenter: complementary, node level.
Canary | Flagger; Argo Rollouts; manual | Flagger: K8s-native, automated. Argo Rollouts: more controls. Manual: error-prone.
Model package format | MLflow model; ONNX; SavedModel (TF); TorchScript | MLflow: portable. ONNX: cross-framework. SavedModel/TorchScript: framework-specific.

4. TOOLS (as of 2026-06)

K8s-native stack

  • KServe — InferenceService CRD (formerly KFServing)
  • KEDA — scale-to-zero for InferenceServices (Phase 20 deployed)
  • Flagger — canary deploys (part of Flux, a CNCF graduated project)
  • Triton Inference Server — as a KServe runtime for perf-critical workloads
  • MLflow — model source (Phase 37 deployed)

Reading

  • KServe docs
  • Triton Inference Server docs
  • “Designing Machine Learning Systems” (Chip Huyen) — Ch. 7-8
  • Flagger docs

5. MASTERY: Model serving alive on basecamp

[ ] KServe installed via Flux + Helm
[ ] Deploy an InferenceService for an MLflow-registered model from Phase 37
[ ] Test inference via curl; verify response shape
[ ] Configure a KEDA ScaledObject for scale-to-zero on the InferenceService
[ ] Verify cold-start behavior: idle → request → spin-up → response
[ ] Install Flagger; configure a canary policy for the InferenceService
[ ] Deploy a v2 of the model; observe canary rollout
[ ] Add Prometheus metrics: per-model latency, throughput, error rate
[ ] Add a drift-detection job (Spark or simple script) checking prediction distribution
[ ] Integrate with Cilium NetworkPolicy: only specific callers can hit the InferenceService
[ ] Add Triton as a custom runtime for one performance-critical model
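
For the "test inference, verify response shape" check, it helps to know what the two KServe request formats look like before reaching for curl. A sketch of the request bodies and a shape check, assuming the V1 protocol (`/v1/models/<name>:predict`) and the V2 / Open Inference Protocol (`/v2/models/<name>/infer`). The input name and feature rows are hypothetical; MLflow models served over v2 may expect different input names.

```python
import json


def v1_request(rows):
    """V1 body for POST /v1/models/<name>:predict"""
    return {"instances": rows}


def v2_request(input_name, rows, datatype="FP32"):
    """V2 body for POST /v2/models/<name>/infer, with row-major flattened data."""
    flat = [value for row in rows for value in row]
    return {"inputs": [{
        "name": input_name,
        "shape": [len(rows), len(rows[0])],
        "datatype": datatype,
        "data": flat,
    }]}


def v1_response_ok(body, expected_rows):
    """A V1 response should carry one prediction per instance."""
    predictions = body.get("predictions")
    return isinstance(predictions, list) and len(predictions) == expected_rows


def v2_response_ok(body, expected_rows):
    """Each V2 output tensor should have the batch size as its first dimension."""
    outputs = body.get("outputs")
    if not isinstance(outputs, list) or not outputs:
        return False
    return all((out.get("shape") or [None])[0] == expected_rows for out in outputs)


rows = [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]]   # made-up feature rows
print(json.dumps(v1_request(rows)))
print(json.dumps(v2_request("input-0", rows)))          # "input-0" is a hypothetical name
```

Save either printed body to a file and send it with `curl -d @file.json` against the matching path; then run the parsed response through the matching `*_response_ok` check.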

6. COMPARE: BentoML or Triton standalone

Pick one:

  • BentoML — package one model with BentoML; compare against KServe deployment
  • Triton standalone — deploy a model with raw Triton Inference Server (no KServe); compare

400-word reflection.


7. OPERATE

  • 4-5 runbooks: InferenceService not ready, scale-to-zero failing, canary rollback triggered, prediction drift alert, framework runtime not loading model
  • 2-3 ADRs (KServe over Seldon; MLflow models over native PyTorch packaging; Flagger for canary)
  • Weekly log

8. CONTRIBUTE

  • KServe (CNCF) — docs, runtimes
  • Flagger (CNCF) — canary policies
  • MLflow — KServe integration

What ships from this phase

  • Tier 6 of basecamp complete: KServe + KEDA + Flagger + MLflow alive
  • At least one model serving production-shaped traffic with scale-to-zero + canary
  • Serving runbooks
  • Year 4 portfolio complete — ready for Year 4 Final Exam

Validation criteria

[ ] KServe operational with at least one InferenceService running
[ ] KEDA scale-to-zero working
[ ] Flagger canary tested
[ ] Drift detection in place
[ ] All 11 operational depth checks from section 5 (MASTERY) complete
[ ] Compare reflection (400 words)
[ ] 4-5 serving runbooks
[ ] 2-3 ADRs
[ ] Pattern entries:
    - model-serving → OUTLINE
    - inference-shapes → OUTLINE
    - serverless-inference → OUTLINE
    - canary-deploys → OUTLINE (general infra)
    - operator-pattern reinforced
[ ] Exit Test passed
[ ] Year 4 Final Exam prep can begin

Exit Test

Time: 2.5 hours.

Part 1: Build (90 min)

Deploy a new InferenceService for a different model from MLflow registry. Configure scale-to-zero. Verify cold-start path works.

Part 2: Diagnose (45 min)

A serving scenario (e.g., “InferenceService stuck Ready=False after deploy”). Possible causes: model artifact missing; framework runtime mismatch; resource limits too tight; storage backend not accessible.
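
A first triage step for this kind of scenario is reading the status conditions rather than guessing. A small sketch, assuming the object comes from `kubectl get inferenceservice <name> -o json` and follows the usual Kubernetes conditions layout (type, status, reason, message). The sample reasons and messages are illustrative, and `NoStatus` is a marker invented here, not a KServe reason.

```python
def failing_conditions(isvc):
    """Return (type, reason, message) for every condition that is not True."""
    conditions = isvc.get("status", {}).get("conditions", [])
    if not conditions:
        # Invented marker: the controller has not written any status yet.
        return [("Ready", "NoStatus", "controller has not reported any conditions yet")]
    return [
        (c.get("type", ""), c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("status") != "True"
    ]


# Illustrative object; reasons and messages are made up for the example.
sample = {
    "status": {
        "conditions": [
            {"type": "IngressReady", "status": "True"},
            {"type": "PredictorReady", "status": "False",
             "reason": "RevisionFailed", "message": "container failed to start"},
            {"type": "Ready", "status": "False", "reason": "PredictorNotReady"},
        ]
    }
}

for cond_type, reason, message in failing_conditions(sample):
    print(f"{cond_type}: {reason} {message}".rstrip())
```

The failing condition narrows the search: a predictor that is not ready points at the pod (artifact, runtime, resources, storage), so `kubectl describe pod` and the storage initializer logs are the next stop.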

Part 3: Articulate (15 min)

~400 words: “Defend KServe over a hand-rolled FastAPI + Kubernetes Deployment for basecamp’s model serving. Cite the K8s-native ecosystem composition and the control-loop pattern.”


Anti-patterns

Anti-pattern | Why
Raw FastAPI for production model serving | You reinvent KServe poorly; missing autoscale, canary, observability
Scale-to-zero on latency-critical workloads | Cold-start is unacceptable for some SLOs
Canary without drift detection | The wrong model gets promoted because the request error rate looks fine
Custom Python serving code instead of standard runtimes | You miss Triton/TorchServe perf optimizations
No NetworkPolicy on serving | Anyone in the cluster can hit your endpoint
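
The last anti-pattern has a short fix. A sketch of a CiliumNetworkPolicy that admits only labelled callers, built as a Python dict. Assumptions to verify on your cluster: predictor pods carry the label `serving.kserve.io/inferenceservice=<name>`, the model server listens on port 8080, and callers reach the pods directly. If traffic arrives through an ingress gateway or a scale-from-zero proxy, that component is the caller to allow. The caller label is hypothetical.

```python
import json


def serving_network_policy(isvc_name, allowed_caller_labels, port="8080"):
    """Allow ingress to one InferenceService's pods only from pods with the given labels."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": f"{isvc_name}-allow-callers"},
        "spec": {
            # Assumed pod label; confirm with `kubectl get pods --show-labels`.
            "endpointSelector": {
                "matchLabels": {"serving.kserve.io/inferenceservice": isvc_name}
            },
            "ingress": [{
                "fromEndpoints": [{"matchLabels": labels} for labels in allowed_caller_labels],
                "toPorts": [{"ports": [{"port": port, "protocol": "TCP"}]}],
            }],
        },
    }


policy = serving_network_policy(
    "churn-classifier",                 # hypothetical InferenceService name
    [{"app": "recommendation-api"}],    # hypothetical caller label
)
print(json.dumps(policy, indent=2))
```

Once a policy selects the pods, ingress that no rule allows is dropped, so test from both an allowed and a disallowed pod before calling the check done.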

Patterns touched this phase

  • model-serving — OUTLINE
  • inference-shapes — OUTLINE
  • serverless-inference — OUTLINE
  • canary-deploys — OUTLINE
  • operator-pattern reinforced

→ Next: Year 4 Final Exam