ML Serving + mlship v0

Second phase of Year 4. KServe for online inference, Ray for batch + training, vLLM as preview. The first version of mlship ships — a minimal CLI that deploys sklearn models. The first version of services/llm-gateway/ ships — a minimal routing endpoint. ~8 weeks, ~100 hrs.

Where this phase sits

Phase 20 settled the lifecycle frame: a model is now a registered, versioned, traceable artifact in MLflow. P21 is where that artifact stops being a file in MinIO and starts being a service. KServe gives you online inference, Ray gives you batch + distributed training, vLLM gives you a preview of the LLM-shaped serving you’ll deepen in Phase 24. The same registry feeds all three.

Two artifacts are born this phase that will accumulate across the rest of the year. mlship v0 — a small Go CLI that takes model.pkl and produces a URL — ships private; its eventual public OSS launch is the mlship project plan’s Year 5 capstone, not a Y4 outcome. services/llm-gateway/ v0 ships as a scaffold inside basecamp: one Go service, one endpoint, one backend, no streaming, no RAG. Both are deliberately minimal. P21’s job is to give the bigger phases (P24, P25) a place to put their code.

The pattern depth shift this phase is inference-shapes reaching OUTLINE. Online vs batch vs streaming aren’t variations on a theme — they’re different problems with different infrastructure, and conflating them is how teams ship one runtime and discover the second use case can’t fit.

Prerequisites

Phase 20 complete — MLflow operational; one model registered

You accept: inference shapes (online vs batch vs streaming) are different problems with different infra. KServe for online; Ray for batch + train; vLLM specialized for LLMs (deepens P24).

Why this phase exists

Year 4’s payoff is models in production. P20 built the registry; P21 builds the serving layer. By phase end:

KServe runs online inference for the next-week-commits model (sklearn) from P20
Ray cluster handles batch inference + distributed training (warm-up for P23 Kubeflow)
vLLM serves a tiny LLM (Phi-3-mini quantized) — preview; depth is P24
mlship v0 — a CLI that takes model.pkl and deploys it to Docker (or KServe)
services/llm-gateway/ v0 — minimal Go service routing to vLLM

Each is a “v0” intentionally. P24 + P25 polish.

1. PROBLEM

You have a model in MLflow. You want it serving real (or test) traffic with: stable URL, autoscaling, canary rollouts, observability, a way to roll back.

Different inference shapes have different infra:

Online single-prediction (HTTP request → response in <100ms) — KServe
Online streaming (chat, transcription) — KServe + custom protocol or vLLM
Batch (score 1M rows nightly) — Ray + Spark
Edge (mobile, embedded) — TensorRT, ONNX runtime — out of scope for ROOT

→ Pattern: inference-shapes

2. PRINCIPLES

2.1 Inference shapes

Different latency/throughput/cost trade-offs per shape.

→ Pattern: inference-shapes

Investigate:

Map 4 hypothetical workloads to shapes: real-time fraud detection (online), nightly recommendations (batch), chat (streaming), search ranking (online with caching)
For each: what’s the latency target? Throughput target? Cost shape?

2.2 KServe: Kubernetes-native serving

KServe abstracts away the serving runtime: scikit-learn, PyTorch, TensorFlow, ONNX, custom — all become InferenceService CRDs.

Investigate:

Deploy KServe via basecamp Helm chart
Deploy the next-week-commits model from MLflow Registry as a KServe InferenceService
Configure auto-scaling: scale-to-zero, max replicas, target concurrency
Canary: split 90/10 between two model versions; advance based on success rate

2.3 Ray: distributed Python

Ray is for distributed batch + training. Ray Cluster on K8s gives you “submit a function, run on N nodes.”

Investigate:

Deploy Ray cluster via Ray Operator on basecamp
Run a Ray job that scores all abukix.commits in parallel (batch inference)
Compare with sequential PySpark — when does each win?

2.4 vLLM: preview

vLLM is the OSS LLM-serving runtime: paged attention, continuous batching, OpenAI-compat API.

Investigate:

Deploy vLLM via KServe with Phi-3-mini quantized (fits in CPU)
Hit /v1/chat/completions; observe streaming SSE
Read the vLLM paper (“Efficient Memory Management for Large Language Model Serving” — Kwon et al.)
Defer depth to P24

2.5 Cold start + warm-up

Models load slowly. First request after scale-from-zero is slow. Mitigate with warm-up requests, image pre-pull, model pre-loaded volumes.

Investigate:

Time first-request latency for cold model start
Configure KServe initialDelaySeconds + readiness probe to gate traffic until model loaded
Pre-pull model artifacts to PVC; mount instead of downloading

2.6 The serving observability stack

Latency, throughput, error rate, cost-per-request, model-version-in-use, input-shape distribution.

Investigate:

Configure KServe to emit Prometheus metrics
Add OTel traces from request → model → response
Build a Grafana dashboard with the 6 above metrics

3. TRADE-OFFS

Decision	Option A	Option B	When
Online runtime	KServe	Seldon Core	BentoML
Batch runtime	Ray	Spark	Dask
LLM runtime	vLLM	TGI (HuggingFace)	TensorRT-LLM
Auto-scale signal	request count	concurrency	latency
Cold start	scale-to-zero (cheap)	min-replicas=1 (warm)	Trade cost vs first-request latency

4. TOOLS (as of 2025-10)

KServe 0.13+
Ray 2.35+ + KubeRay operator
vLLM 0.6+ (preview; depth P24)
Knative Serving (KServe dependency for scale-to-zero)
MLflow (Y4 P20 — registry source)

5. MASTERY

5.1 Reading list

Required	Why
KServe docs — InferenceService, Auto-scaling, Canary	The implementation
Ray docs — Ray Core + Ray Serve	The distributed primitive
Chip Huyen “Designing ML Systems” Ch. 7 (Model Deployment)	The discipline

5.2 Operational depth checklist

[ ] Deploy KServe + Knative on basecamp Tier 5
[ ] Deploy the next-week-commits model from MLflow as KServe InferenceService
[ ] Configure scale-to-zero with sub-30s cold start
[ ] Canary deploy a v2 of the model; advance based on success rate
[ ] Deploy Ray cluster via KubeRay; submit a parallel batch-inference job
[ ] Deploy vLLM with Phi-3-mini quantized; hit /v1/chat/completions
[ ] Build serving observability dashboard (latency p50/p95/p99, throughput, errors, version)
[ ] Force cold start; measure first-request latency; tune to <30s
[ ] Document KServe patterns in ops-handbook
[ ] Wire model-lifecycle to platform-ctl: `platform-ctl model deploy <name>` calls mlship under the hood

5.3 Project: `mlship` v0

mlship v0 scope (PRIVATE — full polish + launch is Y5 P30 capstone):
  github.com/abukix/mlship (private)
  Go + cobra (Y1 P5 stack)

  Commands:
    mlship deploy ./model.pkl
      → auto-detect sklearn
      → containerize (multi-stage Dockerfile)
      → deploy to local Docker
      → print URL

    mlship deploy ./model.pkl --to k8s --cluster basecamp-k3s
      → also containerize
      → deploy to KServe via basecamp
      → print URL

    mlship list / logs / rollback / destroy

  v0 NOT included (Y5 capstone):
    - PyTorch / TensorFlow / ONNX / HuggingFace auto-detection
    - vLLM routing for LLMs
    - AWS Fargate / GCP Cloud Run targets
    - Polished docs site / demo video / Show HN launch

The point of v0: the train→deploy composition recipe needs something by Y4 end. v0 is “barely usable but real.” Y5 turns it into the launch artifact.

The full version arc — what v0 here grows into, why sklearn-first, the launch plan — lives in the mlship project plan. Read it once before writing the v0 code so you understand which corners you’re allowed to cut and which you aren’t.

5.4 Project: `services/llm-gateway/` v0

services/llm-gateway/ v0 scope (PUBLIC via basecamp's repo):
  Go service inside basecamp/charts/llm-gateway/

  Endpoints:
    POST /v1/chat/completions — minimal OpenAI-compat
      Routes ALL traffic to ONE backend (vLLM stub or env-configured OpenAI)
      Non-streaming response (P24 adds streaming)
      No RAG (P24 adds RAG)
      No multi-model routing (P24 adds)
      Per-request slog + Prometheus counter
      mTLS via mesh (Y2 P12)

  This is the SCAFFOLD. Real RAG + streaming + routing is P24.

By Y4 end, llm-gateway is much more than v0. But v0 ships now so future code has a place to live. The full P21 → P24 → P25 growth arc is described in the Year 4 overview.

6. COMPARE: KServe vs BentoML vs custom FastAPI

For the same model, deploy 3 ways. Compare:

Setup complexity
Auto-scaling story
Multi-framework support
Cold-start behavior
When each wins

400 words.

7. OPERATE

3+ runbooks (kserve-cold-start-investigation, ray-job-stuck, model-canary-rollback)
1+ postmortem
Weekly log

8. CONTRIBUTE

KServe, Knative Serving, KubeRay, vLLM — all welcoming.

Validation criteria

[ ] All 10 operational depth checks
[ ] KServe + Ray + vLLM-stub running in basecamp Tier 5 + Tier 7
[ ] mlship v0 working (sklearn → Docker + KServe deploy)
[ ] llm-gateway v0 scaffold in basecamp; routes to vLLM
[ ] platform-ctl model deploy wires to mlship
[ ] KServe vs BentoML vs FastAPI comparison
[ ] 3+ runbooks; 1+ postmortem; 8+ weekly log entries
[ ] Pattern entries deepened:
    - inference-shapes → OUTLINE (DEEP after P25)
    - model-lifecycle → reinforced via real serving
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min) — train a small classifier (or reuse P20 model); use mlship to deploy to local Docker; verify; redeploy to KServe; canary 90/10 with a v2; advance based on success.
Diagnose (60 min) — scenario: KServe pod cold-start > 60s; trace via OTel + Loki + Prometheus; tune to <30s.
Articulate (30 min) — 600 words: “When do I reach for KServe vs Ray vs vLLM? Map 4 inference workloads.”

Anti-patterns

Anti-pattern	Why
Cold-start hidden by min-replicas=1	Real cost; trade carefully
Canary without success-rate gate	Manual canary = no canary
Skipping the registry, deploying from notebook	No provenance; can’t roll back
vLLM “for everything”	LLMs only; sklearn doesn’t need it
llm-gateway with too much scope at v0	P24/P25 is when it grows up

Patterns deepened this phase

inference-shapes → OUTLINE
model-lifecycle → reinforced

→ Next: Phase 22: Feature Store: Feast

ML Serving + mlship v0

Where this phase sits

Prerequisites

Why this phase exists

1. PROBLEM

2. PRINCIPLES

2.1 Inference shapes

2.2 KServe: Kubernetes-native serving

2.3 Ray: distributed Python

2.4 vLLM: preview

2.5 Cold start + warm-up

2.6 The serving observability stack

3. TRADE-OFFS

4. TOOLS (as of 2025-10)

5. MASTERY

5.1 Reading list

5.2 Operational depth checklist

5.3 Project: mlship v0

5.4 Project: services/llm-gateway/ v0

6. COMPARE: KServe vs BentoML vs custom FastAPI

7. OPERATE

8. CONTRIBUTE

Validation criteria

Exit Test

Anti-patterns

Patterns deepened this phase

5.3 Project: `mlship` v0

5.4 Project: `services/llm-gateway/` v0