Skip to content
5-YEAR PROGRAM · YEAR 4 · PHASE 21
UPCOMING

ML Serving + mlship v0

Second phase of Year 4. KServe for online inference, Ray for batch + training, vLLM as preview. The first version of mlship ships — a minimal CLI that deploys sklearn models. The first version of services/llm-gateway/ ships — a minimal routing endpoint. ~8 weeks, ~100 hrs.


Where this phase sits

Phase 20 settled the lifecycle frame: a model is now a registered, versioned, traceable artifact in MLflow. P21 is where that artifact stops being a file in MinIO and starts being a service. KServe gives you online inference, Ray gives you batch + distributed training, vLLM gives you a preview of the LLM-shaped serving you’ll deepen in Phase 24. The same registry feeds all three.

Two artifacts are born this phase that will accumulate across the rest of the year. mlship v0 — a small Go CLI that takes model.pkl and produces a URL — ships private; its eventual public OSS launch is the mlship project plan’s Year 5 capstone, not a Y4 outcome. services/llm-gateway/ v0 ships as a scaffold inside basecamp: one Go service, one endpoint, one backend, no streaming, no RAG. Both are deliberately minimal. P21’s job is to give the bigger phases (P24, P25) a place to put their code.

The pattern depth shift this phase is inference-shapes reaching OUTLINE. Online vs batch vs streaming aren’t variations on a theme — they’re different problems with different infrastructure, and conflating them is how teams ship one runtime and discover the second use case can’t fit.


Prerequisites

  • Phase 20 complete — MLflow operational; one model registered
  • You accept: inference shapes (online vs batch vs streaming) are different problems with different infra. KServe for online; Ray for batch + train; vLLM specialized for LLMs (deepens P24).

Why this phase exists

Year 4’s payoff is models in production. P20 built the registry; P21 builds the serving layer. By phase end:

  • KServe runs online inference for the next-week-commits model (sklearn) from P20
  • Ray cluster handles batch inference + distributed training (warm-up for P23 Kubeflow)
  • vLLM serves a tiny LLM (Phi-3-mini quantized) — preview; depth is P24
  • mlship v0 — a CLI that takes model.pkl and deploys it to Docker (or KServe)
  • services/llm-gateway/ v0 — minimal Go service routing to vLLM

Each is a “v0” intentionally. P24 + P25 polish.


1. PROBLEM

You have a model in MLflow. You want it serving real (or test) traffic with: stable URL, autoscaling, canary rollouts, observability, a way to roll back.

Different inference shapes have different infra:

  • Online single-prediction (HTTP request → response in <100ms) — KServe
  • Online streaming (chat, transcription) — KServe + custom protocol or vLLM
  • Batch (score 1M rows nightly) — Ray + Spark
  • Edge (mobile, embedded) — TensorRT, ONNX runtime — out of scope for ROOT

→ Pattern: inference-shapes


2. PRINCIPLES

2.1 Inference shapes

Different latency/throughput/cost trade-offs per shape.

→ Pattern: inference-shapes

Investigate:

  • Map 4 hypothetical workloads to shapes: real-time fraud detection (online), nightly recommendations (batch), chat (streaming), search ranking (online with caching)
  • For each: what’s the latency target? Throughput target? Cost shape?

2.2 KServe: Kubernetes-native serving

KServe abstracts away the serving runtime: scikit-learn, PyTorch, TensorFlow, ONNX, custom — all become InferenceService CRDs.

Investigate:

  • Deploy KServe via basecamp Helm chart
  • Deploy the next-week-commits model from MLflow Registry as a KServe InferenceService
  • Configure auto-scaling: scale-to-zero, max replicas, target concurrency
  • Canary: split 90/10 between two model versions; advance based on success rate

2.3 Ray: distributed Python

Ray is for distributed batch + training. Ray Cluster on K8s gives you “submit a function, run on N nodes.”

Investigate:

  • Deploy Ray cluster via Ray Operator on basecamp
  • Run a Ray job that scores all abukix.commits in parallel (batch inference)
  • Compare with sequential PySpark — when does each win?

2.4 vLLM: preview

vLLM is the OSS LLM-serving runtime: paged attention, continuous batching, OpenAI-compat API.

Investigate:

  • Deploy vLLM via KServe with Phi-3-mini quantized (fits in CPU)
  • Hit /v1/chat/completions; observe streaming SSE
  • Read the vLLM paper (“Efficient Memory Management for Large Language Model Serving” — Kwon et al.)
  • Defer depth to P24

2.5 Cold start + warm-up

Models load slowly. First request after scale-from-zero is slow. Mitigate with warm-up requests, image pre-pull, model pre-loaded volumes.

Investigate:

  • Time first-request latency for cold model start
  • Configure KServe initialDelaySeconds + readiness probe to gate traffic until model loaded
  • Pre-pull model artifacts to PVC; mount instead of downloading

2.6 The serving observability stack

Latency, throughput, error rate, cost-per-request, model-version-in-use, input-shape distribution.

Investigate:

  • Configure KServe to emit Prometheus metrics
  • Add OTel traces from request → model → response
  • Build a Grafana dashboard with the 6 above metrics

3. TRADE-OFFS

DecisionOption AOption BWhen
Online runtimeKServeSeldon CoreBentoML
Batch runtimeRaySparkDask
LLM runtimevLLMTGI (HuggingFace)TensorRT-LLM
Auto-scale signalrequest countconcurrencylatency
Cold startscale-to-zero (cheap)min-replicas=1 (warm)Trade cost vs first-request latency

4. TOOLS (as of 2025-10)

  • KServe 0.13+
  • Ray 2.35+ + KubeRay operator
  • vLLM 0.6+ (preview; depth P24)
  • Knative Serving (KServe dependency for scale-to-zero)
  • MLflow (Y4 P20 — registry source)

5. MASTERY

5.1 Reading list

RequiredWhy
KServe docs — InferenceService, Auto-scaling, CanaryThe implementation
Ray docs — Ray Core + Ray ServeThe distributed primitive
Chip Huyen “Designing ML Systems” Ch. 7 (Model Deployment)The discipline

5.2 Operational depth checklist

[ ] Deploy KServe + Knative on basecamp Tier 5
[ ] Deploy the next-week-commits model from MLflow as KServe InferenceService
[ ] Configure scale-to-zero with sub-30s cold start
[ ] Canary deploy a v2 of the model; advance based on success rate
[ ] Deploy Ray cluster via KubeRay; submit a parallel batch-inference job
[ ] Deploy vLLM with Phi-3-mini quantized; hit /v1/chat/completions
[ ] Build serving observability dashboard (latency p50/p95/p99, throughput, errors, version)
[ ] Force cold start; measure first-request latency; tune to <30s
[ ] Document KServe patterns in ops-handbook
[ ] Wire model-lifecycle to platform-ctl: `platform-ctl model deploy <name>` calls mlship under the hood

5.3 Project: mlship v0

mlship v0 scope (PRIVATE — full polish + launch is Y5 P30 capstone):
github.com/abukix/mlship (private)
Go + cobra (Y1 P5 stack)
Commands:
mlship deploy ./model.pkl
→ auto-detect sklearn
→ containerize (multi-stage Dockerfile)
→ deploy to local Docker
→ print URL
mlship deploy ./model.pkl --to k8s --cluster basecamp-k3s
→ also containerize
→ deploy to KServe via basecamp
→ print URL
mlship list / logs / rollback / destroy
v0 NOT included (Y5 capstone):
- PyTorch / TensorFlow / ONNX / HuggingFace auto-detection
- vLLM routing for LLMs
- AWS Fargate / GCP Cloud Run targets
- Polished docs site / demo video / Show HN launch

The point of v0: the train→deploy composition recipe needs something by Y4 end. v0 is “barely usable but real.” Y5 turns it into the launch artifact.

The full version arc — what v0 here grows into, why sklearn-first, the launch plan — lives in the mlship project plan. Read it once before writing the v0 code so you understand which corners you’re allowed to cut and which you aren’t.

5.4 Project: services/llm-gateway/ v0

services/llm-gateway/ v0 scope (PUBLIC via basecamp's repo):
Go service inside basecamp/charts/llm-gateway/
Endpoints:
POST /v1/chat/completions — minimal OpenAI-compat
Routes ALL traffic to ONE backend (vLLM stub or env-configured OpenAI)
Non-streaming response (P24 adds streaming)
No RAG (P24 adds RAG)
No multi-model routing (P24 adds)
Per-request slog + Prometheus counter
mTLS via mesh (Y2 P12)
This is the SCAFFOLD. Real RAG + streaming + routing is P24.

By Y4 end, llm-gateway is much more than v0. But v0 ships now so future code has a place to live. The full P21 → P24P25 growth arc is described in the Year 4 overview.


6. COMPARE: KServe vs BentoML vs custom FastAPI

For the same model, deploy 3 ways. Compare:

  • Setup complexity
  • Auto-scaling story
  • Multi-framework support
  • Cold-start behavior
  • When each wins

400 words.


7. OPERATE

  • 3+ runbooks (kserve-cold-start-investigation, ray-job-stuck, model-canary-rollback)
  • 1+ postmortem
  • Weekly log

8. CONTRIBUTE

KServe, Knative Serving, KubeRay, vLLM — all welcoming.


Validation criteria

[ ] All 10 operational depth checks
[ ] KServe + Ray + vLLM-stub running in basecamp Tier 5 + Tier 7
[ ] mlship v0 working (sklearn → Docker + KServe deploy)
[ ] llm-gateway v0 scaffold in basecamp; routes to vLLM
[ ] platform-ctl model deploy wires to mlship
[ ] KServe vs BentoML vs FastAPI comparison
[ ] 3+ runbooks; 1+ postmortem; 8+ weekly log entries
[ ] Pattern entries deepened:
- inference-shapes → OUTLINE (DEEP after P25)
- model-lifecycle → reinforced via real serving
[ ] Exit Test passed

Exit Test

Time: 3 hours.

  1. Build (90 min) — train a small classifier (or reuse P20 model); use mlship to deploy to local Docker; verify; redeploy to KServe; canary 90/10 with a v2; advance based on success.
  2. Diagnose (60 min) — scenario: KServe pod cold-start > 60s; trace via OTel + Loki + Prometheus; tune to <30s.
  3. Articulate (30 min) — 600 words: “When do I reach for KServe vs Ray vs vLLM? Map 4 inference workloads.”

Anti-patterns

Anti-patternWhy
Cold-start hidden by min-replicas=1Real cost; trade carefully
Canary without success-rate gateManual canary = no canary
Skipping the registry, deploying from notebookNo provenance; can’t roll back
vLLM “for everything”LLMs only; sklearn doesn’t need it
llm-gateway with too much scope at v0P24/P25 is when it grows up

Patterns deepened this phase


→ Next: Phase 22: Feature Store: Feast