STUB

Inference Shapes

The pattern: ML inference comes in four distinct shapes with different latency, throughput, and cost profiles. Online single-prediction: an HTTP request answered in under 100 ms, served by KServe. Batch: score 1M rows nightly with Ray or Spark. Streaming: low-latency tokens or events, with vLLM and continuous batching for tokens and Flink for event-driven workloads. Edge: mobile or embedded, with TensorRT or ONNX Runtime.
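To make the continuous-batching idea concrete, here is a toy sketch that counts decode steps for a fixed set of requests under two policies. It is a simplified model of iteration-level scheduling, not vLLM's actual scheduler; the function names, the "one token per step" cost model, and the example lengths are all assumptions made for illustration.

```python
from collections import deque


def static_batch_steps(lengths, batch_size):
    """Steps needed when each batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch occupies the accelerator for max(length) steps.
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batch_steps(lengths, batch_size):
    """Steps needed when a freed slot is refilled before the next step."""
    waiting = deque(lengths)   # output lengths (tokens) of queued requests
    active = []                # remaining tokens for each running sequence
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every running sequence emits one token; finished ones leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps
```

With hypothetical output lengths `[2, 8, 3, 1]` and two slots, `static_batch_steps` returns 11 and `continuous_batch_steps` returns 8: short requests no longer wait behind the longest one in their batch.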

The trade-off: latency vs. throughput vs. cost. Online is expensive per request but real-time. Batch is cheap per row but high latency. Streaming sits between the two: continuous batching shares one accelerator across in-flight requests while keeping per-token latency low. Edge minimizes latency by removing the network round trip, at the cost of model size and portability. The right shape comes from the workload, not the modeler’s preference.
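One way to read "the shape comes from the workload" is as a small decision rule over workload facts. The sketch below is illustrative only: the `Workload` fields, the rule order, and the 1,000 ms cutoff are assumptions rather than anything defined by mlship or KServe.

```python
from dataclasses import dataclass
from enum import Enum


class Shape(Enum):
    ONLINE = "online"
    BATCH = "batch"
    STREAMING = "streaming"
    EDGE = "edge"


@dataclass(frozen=True)
class Workload:
    # Hypothetical descriptors of a workload, not part of any real API.
    runs_on_device: bool       # must the model run on a phone or embedded board?
    incremental_output: bool   # does the caller consume tokens or events as they arrive?
    latency_budget_ms: float   # how long the caller can wait for a result


def choose_shape(w: Workload) -> Shape:
    """Pick a shape from workload facts; the threshold is an illustrative assumption."""
    if w.runs_on_device:
        return Shape.EDGE
    if w.incremental_output:
        return Shape.STREAMING
    if w.latency_budget_ms <= 1_000:
        return Shape.ONLINE
    return Shape.BATCH
```

For example, a fraud check with an 80 ms budget maps to `Shape.ONLINE`, while a nightly scoring job that can wait hours maps to `Shape.BATCH`. The point of the ordering is that hard constraints (device, incremental output) are decided before the latency budget.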

Deepens in Year 4 Phase 21: ML Serving + mlship v0 (online + batch) and reaches DEEP in Phase 24: LLM Infra + RAG when the streaming variant lands via vLLM SSE inside llm-gateway.

  • model-lifecycle — every shape has its own promotion + canary story.
  • train-serve-skew — online and batch must compute features the same way.
  • feature-store — feeds online and batch from one definition.
  • rag-as-pattern — RAG generation is the streaming shape in practice.
  • tool-use-as-capability — every tool call is itself a typed inference boundary.
  • oltp-vs-olap — online vs. batch maps onto the row vs. column workload split.
  • mlship — mlship deploy defaults to the online shape; batch and streaming are explicit flags.
  • basecamp — KServe (online) and Ray (batch) live in Tier 5/6.
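
The train-serve-skew and feature-store links above both come down to one rule: the online and batch paths should call the same feature definition. A minimal sketch of that rule follows; the feature (`spend_ratio`), the stand-in `score` model, and the row fields are all hypothetical.

```python
from typing import Iterable


def spend_ratio(row: dict) -> float:
    """Single feature definition shared by both shapes (hypothetical feature)."""
    visits = row["visits"]
    return row["spend"] / visits if visits else 0.0


def score(feature: float) -> float:
    """Stand-in for a real model: a clipped linear score."""
    return min(1.0, 0.1 * feature)


def predict_online(request: dict) -> float:
    # Online shape: one request in, one prediction out.
    return score(spend_ratio(request))


def predict_batch(rows: Iterable[dict]) -> list[float]:
    # Batch shape: many rows, same feature code path as online.
    return [score(spend_ratio(row)) for row in rows]
```

Because both entry points go through `spend_ratio`, a row scored nightly and the same row scored over HTTP produce the same prediction; skew appears when each path reimplements the feature on its own.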