Inference Shapes
The pattern: ML inference comes in four distinct shapes with different latency, throughput, and cost profiles.
- Online single-prediction: one HTTP request, response in under 100 ms, served by KServe.
- Batch: score 1M rows nightly with Ray or Spark.
- Streaming: low-latency tokens or events; vLLM with continuous batching for tokens, Flink for event-driven workloads.
- Edge: mobile or embedded, using TensorRT or ONNX Runtime.
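Continuous batching, named under the streaming shape above, is easier to see in a toy model. The sketch below is not vLLM's scheduler: it ignores prefill, KV-cache memory, and preemption, and assumes every request needs at least one token. It only counts decode steps for a list of hypothetical per-request output lengths, comparing a static batch (wait for the longest request) with a batch whose freed slots are refilled on the next step.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        steps += max(batch)  # short requests idle until the longest is done
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled on the very next step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One step emits one token per in-flight request; a request with one
        # token left finishes here and gives up its slot.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


# Hypothetical output lengths (tokens) for six requests, two slots.
lengths = [2, 8, 2, 8, 2, 2]
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

With mixed short and long outputs the continuous variant reaches the lower bound of 24 tokens over 2 slots, which is the throughput gain that lets streaming sit between online and batch on cost.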
The trade-off: latency vs. throughput vs. cost. Online is real-time but expensive per request. Batch is cheap per row but high-latency. Streaming sits between the two by using continuous batching. Edge minimizes latency at the cost of model size and portability. The right shape comes from the workload, not the modeler’s preference.
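One way to make "the shape comes from the workload" concrete is to write the decision down as a function of workload traits. This is a minimal sketch: the `Workload` fields, the rule order, and the one-second cutoff are all hypothetical choices for illustration, not thresholds taken from this note.

```python
from dataclasses import dataclass
from enum import Enum


class Shape(Enum):
    ONLINE = "online"
    BATCH = "batch"
    STREAMING = "streaming"
    EDGE = "edge"


@dataclass
class Workload:
    latency_budget_ms: float          # how long a caller can wait for a result
    runs_on_device: bool = False      # must run on mobile or embedded hardware
    incremental_output: bool = False  # tokens or events consumed as they arrive


def choose_shape(w: Workload) -> Shape:
    """Pick a shape from workload traits (illustrative rules only)."""
    if w.runs_on_device:
        return Shape.EDGE
    if w.incremental_output:
        return Shape.STREAMING
    if w.latency_budget_ms <= 1_000:  # hypothetical cutoff for "real-time"
        return Shape.ONLINE
    return Shape.BATCH


# A fraud check inside a request path vs. a nightly scoring job.
print(choose_shape(Workload(latency_budget_ms=100)))              # Shape.ONLINE
print(choose_shape(Workload(latency_budget_ms=8 * 3600 * 1000)))  # Shape.BATCH
```

The point of the sketch is the input type: nothing in `Workload` describes the model, only what the caller needs.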
Deepens in Year 4 Phase 21: ML Serving + mlship v0 (online + batch) and reaches DEEP in Phase 24: LLM Infra + RAG when the streaming variant lands via vLLM SSE inside llm-gateway.
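For the streaming variant, Server-Sent Events is a plain-text framing over one long-lived HTTP response. The sketch below shows only that framing. The `sse_frames` helper and the `{"token": ...}` payload are hypothetical, and the `[DONE]` sentinel follows the OpenAI-style convention; whether llm-gateway uses the same payload is an assumption.

```python
import json
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in one Server-Sent Events frame."""
    for token in tokens:
        # JSON-encoding keeps a newline inside a token from ending the frame.
        payload = json.dumps({"token": token})
        # An SSE event is a "data:" line followed by a blank line.
        yield f"data: {payload}\n\n"
    # End-of-stream sentinel (assumed convention, see lead-in).
    yield "data: [DONE]\n\n"


for frame in sse_frames(["Hel", "lo"]):
    print(repr(frame))
```

A client can render each token as soon as its frame arrives, which is what makes time-to-first-token the latency that matters in this shape rather than total response time.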
Related patterns
- model-lifecycle — every shape has its own promotion + canary story.
- train-serve-skew — online and batch must compute features the same way.
- feature-store — feeds online and batch from one definition.
- rag-as-pattern — RAG generation is the streaming shape in practice.
- tool-use-as-capability — every tool call is itself a typed inference boundary.
- oltp-vs-olap — online vs. batch maps onto the row vs. column workload split.
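The train-serve-skew and feature-store entries above come down to one rule: the online and batch paths should call the same feature definition. A minimal sketch, with hypothetical field names (`amount`, `day_of_week`) and hypothetical features; a real feature store adds storage and point-in-time lookups on top of this idea.

```python
import math
from typing import Dict, Iterable, List


def compute_features(raw: Dict[str, float]) -> Dict[str, float]:
    """Single feature definition shared by every serving path."""
    return {
        "log_amount": math.log1p(raw["amount"]),
        # day_of_week is assumed to be 0 (Monday) through 6 (Sunday).
        "is_weekend": 1.0 if raw["day_of_week"] >= 5 else 0.0,
    }


def online_features(request: Dict[str, float]) -> Dict[str, float]:
    """Online shape: one request in, one feature vector out."""
    return compute_features(request)


def batch_features(rows: Iterable[Dict[str, float]]) -> List[Dict[str, float]]:
    """Batch shape: many rows through the same definition."""
    return [compute_features(row) for row in rows]


row = {"amount": 120.0, "day_of_week": 6}
# Both paths agree because neither re-implements the transformation.
assert online_features(row) == batch_features([row])[0]
```

Skew usually appears when the batch job re-implements a transformation in SQL or Spark while the online service does it in application code; routing both through one function removes that second copy.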
mlship — mlship deploy defaults to the online shape; batch and streaming are explicit flags.
basecamp — KServe (online) and Ray (batch) live in Tier 5/6.