ML Serving + mlship v0
Second phase of Year 4. KServe for online inference, Ray for batch + training, vLLM as preview. The first version of
mlshipships — a minimal CLI that deploys sklearn models. The first version ofservices/llm-gateway/ships — a minimal routing endpoint. ~8 weeks, ~100 hrs.
Where this phase sits
Phase 20 settled the lifecycle frame: a model is now a registered, versioned, traceable artifact in MLflow. P21 is where that artifact stops being a file in MinIO and starts being a service. KServe gives you online inference, Ray gives you batch + distributed training, vLLM gives you a preview of the LLM-shaped serving you’ll deepen in Phase 24. The same registry feeds all three.
Two artifacts are born this phase that will accumulate across the rest of the year. mlship v0 — a small Go CLI that takes model.pkl and produces a URL — ships private; its eventual public OSS launch is the mlship project plan’s Year 5 capstone, not a Y4 outcome. services/llm-gateway/ v0 ships as a scaffold inside basecamp: one Go service, one endpoint, one backend, no streaming, no RAG. Both are deliberately minimal. P21’s job is to give the bigger phases (P24, P25) a place to put their code.
The pattern depth shift this phase is inference-shapes reaching OUTLINE. Online vs batch vs streaming aren’t variations on a theme — they’re different problems with different infrastructure, and conflating them is how teams ship one runtime and discover the second use case can’t fit.
Prerequisites
- Phase 20 complete — MLflow operational; one model registered
- You accept: inference shapes (online vs batch vs streaming) are different problems with different infra. KServe for online; Ray for batch + train; vLLM specialized for LLMs (deepens P24).
Why this phase exists
Year 4’s payoff is models in production. P20 built the registry; P21 builds the serving layer. By phase end:
- KServe runs online inference for the next-week-commits model (sklearn) from P20
- Ray cluster handles batch inference + distributed training (warm-up for P23 Kubeflow)
- vLLM serves a tiny LLM (Phi-3-mini quantized) — preview; depth is P24
mlshipv0 — a CLI that takesmodel.pkland deploys it to Docker (or KServe)services/llm-gateway/v0 — minimal Go service routing to vLLM
Each is a “v0” intentionally. P24 + P25 polish.
1. PROBLEM
You have a model in MLflow. You want it serving real (or test) traffic with: stable URL, autoscaling, canary rollouts, observability, a way to roll back.
Different inference shapes have different infra:
- Online single-prediction (HTTP request → response in <100ms) — KServe
- Online streaming (chat, transcription) — KServe + custom protocol or vLLM
- Batch (score 1M rows nightly) — Ray + Spark
- Edge (mobile, embedded) — TensorRT, ONNX runtime — out of scope for ROOT
→ Pattern: inference-shapes
2. PRINCIPLES
2.1 Inference shapes
Different latency/throughput/cost trade-offs per shape.
→ Pattern: inference-shapes
Investigate:
- Map 4 hypothetical workloads to shapes: real-time fraud detection (online), nightly recommendations (batch), chat (streaming), search ranking (online with caching)
- For each: what’s the latency target? Throughput target? Cost shape?
2.2 KServe: Kubernetes-native serving
KServe abstracts away the serving runtime: scikit-learn, PyTorch, TensorFlow, ONNX, custom — all become InferenceService CRDs.
Investigate:
- Deploy KServe via basecamp Helm chart
- Deploy the next-week-commits model from MLflow Registry as a KServe InferenceService
- Configure auto-scaling: scale-to-zero, max replicas, target concurrency
- Canary: split 90/10 between two model versions; advance based on success rate
2.3 Ray: distributed Python
Ray is for distributed batch + training. Ray Cluster on K8s gives you “submit a function, run on N nodes.”
Investigate:
- Deploy Ray cluster via Ray Operator on basecamp
- Run a Ray job that scores all
abukix.commitsin parallel (batch inference) - Compare with sequential PySpark — when does each win?
2.4 vLLM: preview
vLLM is the OSS LLM-serving runtime: paged attention, continuous batching, OpenAI-compat API.
Investigate:
- Deploy vLLM via KServe with Phi-3-mini quantized (fits in CPU)
- Hit
/v1/chat/completions; observe streaming SSE - Read the vLLM paper (“Efficient Memory Management for Large Language Model Serving” — Kwon et al.)
- Defer depth to P24
2.5 Cold start + warm-up
Models load slowly. First request after scale-from-zero is slow. Mitigate with warm-up requests, image pre-pull, model pre-loaded volumes.
Investigate:
- Time first-request latency for cold model start
- Configure KServe
initialDelaySeconds+ readiness probe to gate traffic until model loaded - Pre-pull model artifacts to PVC; mount instead of downloading
2.6 The serving observability stack
Latency, throughput, error rate, cost-per-request, model-version-in-use, input-shape distribution.
Investigate:
- Configure KServe to emit Prometheus metrics
- Add OTel traces from request → model → response
- Build a Grafana dashboard with the 6 above metrics
3. TRADE-OFFS
| Decision | Option A | Option B | When |
|---|---|---|---|
| Online runtime | KServe | Seldon Core | BentoML |
| Batch runtime | Ray | Spark | Dask |
| LLM runtime | vLLM | TGI (HuggingFace) | TensorRT-LLM |
| Auto-scale signal | request count | concurrency | latency |
| Cold start | scale-to-zero (cheap) | min-replicas=1 (warm) | Trade cost vs first-request latency |
4. TOOLS (as of 2025-10)
- KServe 0.13+
- Ray 2.35+ + KubeRay operator
- vLLM 0.6+ (preview; depth P24)
- Knative Serving (KServe dependency for scale-to-zero)
- MLflow (Y4 P20 — registry source)
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| KServe docs — InferenceService, Auto-scaling, Canary | The implementation |
| Ray docs — Ray Core + Ray Serve | The distributed primitive |
| Chip Huyen “Designing ML Systems” Ch. 7 (Model Deployment) | The discipline |
5.2 Operational depth checklist
[ ] Deploy KServe + Knative on basecamp Tier 5[ ] Deploy the next-week-commits model from MLflow as KServe InferenceService[ ] Configure scale-to-zero with sub-30s cold start[ ] Canary deploy a v2 of the model; advance based on success rate[ ] Deploy Ray cluster via KubeRay; submit a parallel batch-inference job[ ] Deploy vLLM with Phi-3-mini quantized; hit /v1/chat/completions[ ] Build serving observability dashboard (latency p50/p95/p99, throughput, errors, version)[ ] Force cold start; measure first-request latency; tune to <30s[ ] Document KServe patterns in ops-handbook[ ] Wire model-lifecycle to platform-ctl: `platform-ctl model deploy <name>` calls mlship under the hood5.3 Project: mlship v0
mlship v0 scope (PRIVATE — full polish + launch is Y5 P30 capstone): github.com/abukix/mlship (private) Go + cobra (Y1 P5 stack)
Commands: mlship deploy ./model.pkl → auto-detect sklearn → containerize (multi-stage Dockerfile) → deploy to local Docker → print URL
mlship deploy ./model.pkl --to k8s --cluster basecamp-k3s → also containerize → deploy to KServe via basecamp → print URL
mlship list / logs / rollback / destroy
v0 NOT included (Y5 capstone): - PyTorch / TensorFlow / ONNX / HuggingFace auto-detection - vLLM routing for LLMs - AWS Fargate / GCP Cloud Run targets - Polished docs site / demo video / Show HN launchThe point of v0: the train→deploy composition recipe needs something by Y4 end. v0 is “barely usable but real.” Y5 turns it into the launch artifact.
The full version arc — what v0 here grows into, why sklearn-first, the launch plan — lives in the mlship project plan. Read it once before writing the v0 code so you understand which corners you’re allowed to cut and which you aren’t.
5.4 Project: services/llm-gateway/ v0
services/llm-gateway/ v0 scope (PUBLIC via basecamp's repo): Go service inside basecamp/charts/llm-gateway/
Endpoints: POST /v1/chat/completions — minimal OpenAI-compat Routes ALL traffic to ONE backend (vLLM stub or env-configured OpenAI) Non-streaming response (P24 adds streaming) No RAG (P24 adds RAG) No multi-model routing (P24 adds) Per-request slog + Prometheus counter mTLS via mesh (Y2 P12)
This is the SCAFFOLD. Real RAG + streaming + routing is P24.By Y4 end, llm-gateway is much more than v0. But v0 ships now so future code has a place to live. The full P21 → P24 → P25 growth arc is described in the Year 4 overview.
6. COMPARE: KServe vs BentoML vs custom FastAPI
For the same model, deploy 3 ways. Compare:
- Setup complexity
- Auto-scaling story
- Multi-framework support
- Cold-start behavior
- When each wins
400 words.
7. OPERATE
- 3+ runbooks (
kserve-cold-start-investigation,ray-job-stuck,model-canary-rollback) - 1+ postmortem
- Weekly log
8. CONTRIBUTE
KServe, Knative Serving, KubeRay, vLLM — all welcoming.
Validation criteria
[ ] All 10 operational depth checks[ ] KServe + Ray + vLLM-stub running in basecamp Tier 5 + Tier 7[ ] mlship v0 working (sklearn → Docker + KServe deploy)[ ] llm-gateway v0 scaffold in basecamp; routes to vLLM[ ] platform-ctl model deploy wires to mlship[ ] KServe vs BentoML vs FastAPI comparison[ ] 3+ runbooks; 1+ postmortem; 8+ weekly log entries[ ] Pattern entries deepened: - inference-shapes → OUTLINE (DEEP after P25) - model-lifecycle → reinforced via real serving[ ] Exit Test passedExit Test
Time: 3 hours.
- Build (90 min) — train a small classifier (or reuse P20 model); use mlship to deploy to local Docker; verify; redeploy to KServe; canary 90/10 with a v2; advance based on success.
- Diagnose (60 min) — scenario: KServe pod cold-start > 60s; trace via OTel + Loki + Prometheus; tune to <30s.
- Articulate (30 min) — 600 words: “When do I reach for KServe vs Ray vs vLLM? Map 4 inference workloads.”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Cold-start hidden by min-replicas=1 | Real cost; trade carefully |
| Canary without success-rate gate | Manual canary = no canary |
| Skipping the registry, deploying from notebook | No provenance; can’t roll back |
| vLLM “for everything” | LLMs only; sklearn doesn’t need it |
| llm-gateway with too much scope at v0 | P24/P25 is when it grows up |
Patterns deepened this phase
- inference-shapes → OUTLINE
- model-lifecycle → reinforced
→ Next: Phase 22: Feature Store: Feast