Embedding Store

Centralized service for computing, storing, versioning, and serving embeddings. The infrastructure that lets multiple consumers reuse the same embeddings without re-computing.

Embeddings are expensive to compute, versioned by the model that produced them, and reused across consumers. The store treats embeddings as a first-class asset, not a one-off output. Status: STUB; to be promoted to OUTLINE in Y5 Phase 42.

What this pattern is

An embedding store is the centralized service responsible for computing, storing, versioning, and serving embeddings. It separates embedding production (expensive: needs GPUs and embedding-model lifecycle management) from embedding consumption (cheap: needs only fast retrieval). It tracks which embedding model produced which embeddings, because re-embedding the entire corpus when the model changes is a significant operational event. It serves embeddings via a uniform interface so retrieval systems, similarity-search systems, and recommendation systems all consume the same canonical vectors.

The pattern parallels the feature-store: same shape (centralized service, multi-consumer, versioned), different content. Frontier AI labs typically run dedicated embedding platforms because the operational concerns (model lifecycle, re-embedding sweeps, vector-store integration) are non-trivial. For /root, the embedding-store is a logical role of the pgvector + embedding pipeline + Argo CronWorkflow stack.

The store’s central concern is embedding-model versioning. When the embedding model changes (upgrade to a better model, migrate away from a deprecated API, fine-tune for domain), the entire corpus needs to be re-embedded because vectors from the old model aren’t comparable to vectors from the new model. Re-embedding petabyte-scale corpora is expensive and slow — days to weeks of compute. The store’s job is to make this operationally tractable: track which vectors were embedded by which model, coordinate the sweep, maintain both versions during migration, cut over atomically.
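The migration steps above (track the producing model, keep both versions during the sweep, cut over atomically) can be sketched as follows. This is a minimal in-memory illustration, not the basecamp implementation: the class name VersionedEmbeddingStore, its methods, and the version labels are all hypothetical, and a real store would hold the same (model_version, doc_id) key in a database table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vector = List[float]


@dataclass
class VersionedEmbeddingStore:
    """Hypothetical stand-in for a vector table keyed by (model_version, doc_id)."""

    active_version: str
    vectors: Dict[Tuple[str, str], Vector] = field(default_factory=dict)

    def put(self, model_version: str, doc_id: str, vector: Vector) -> None:
        # Every vector is stored together with the model that produced it.
        self.vectors[(model_version, doc_id)] = vector

    def get(self, doc_id: str) -> Optional[Vector]:
        # Consumers read only the active version, so vectors from two
        # models are never mixed within one query.
        return self.vectors.get((self.active_version, doc_id))

    def coverage(self, model_version: str, doc_ids: List[str]) -> float:
        # Fraction of the corpus already embedded by the given model.
        if not doc_ids:
            return 1.0
        done = sum((model_version, d) in self.vectors for d in doc_ids)
        return done / len(doc_ids)

    def cut_over(self, new_version: str, doc_ids: List[str]) -> None:
        # Refuse to switch until the sweep has covered the whole corpus.
        if self.coverage(new_version, doc_ids) < 1.0:
            raise RuntimeError("re-embedding sweep incomplete")
        # A single assignment is the cutover; old vectors stay for rollback.
        self.active_version = new_version
```

The old version's vectors are deliberately left in place after cut_over, so a rollback is just another assignment to active_version. Deleting them is a separate cleanup step.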

The pattern also shapes cost management. Embedding APIs (OpenAI, Cohere, VoyageAI) charge per token. Self-hosted embedding models (BGE, E5, Nomic) require GPU capacity. Deduplication (don’t re-embed the same document twice), caching (embed once, serve many times), and batching (embed 128 documents in one API call) each save significant cost. Without a store, each consumer implements these ad-hoc; with a store, they’re solved once.
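The three savings named above (deduplication, caching, batching) fit in one small wrapper. The sketch below is illustrative only: CachingEmbedder and its embed_batch parameter are hypothetical names, and embed_batch stands in for whatever managed API or self-hosted model actually produces the vectors.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class CachingEmbedder:
    """Hypothetical wrapper: dedupe by content hash, cache, and batch calls."""

    def __init__(
        self,
        embed_batch: Callable[[List[str]], List[Vector]],
        batch_size: int = 128,
    ) -> None:
        self.embed_batch = embed_batch  # assumed: one call embeds a list of texts
        self.batch_size = batch_size
        self.cache: Dict[str, Vector] = {}
        self.calls = 0  # number of backend calls, for cost accounting

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[Vector]:
        # Deduplicate: keep one pending text per content hash not yet cached.
        pending: Dict[str, str] = {}
        for text in texts:
            key = self._key(text)
            if key not in self.cache and key not in pending:
                pending[key] = text

        # Batch: send the unseen texts in chunks of at most batch_size.
        keys = list(pending)
        for start in range(0, len(keys), self.batch_size):
            chunk = keys[start:start + self.batch_size]
            vectors = self.embed_batch([pending[k] for k in chunk])
            self.calls += 1
            for key, vector in zip(chunk, vectors):
                self.cache[key] = vector

        # Serve every input, duplicates included, from the cache.
        return [self.cache[self._key(text)] for text in texts]
```

In a store that also versions embeddings, the cache key would need to include the model version as well as the content hash, so that a model change cannot serve stale vectors.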

Concrete instances in the wild

  • Feast integrated embeddings (2024+). Feast added support for embedding features. Same feature-store interface, embeddings as feature values.
  • basecamp embedding pipeline (Y5 Phase 42). pgvector storage + embedding pipeline as Argo CronWorkflow + versioning metadata table.
  • LangChain embedding cache. Framework-level embedding cache with pluggable backends.
  • LlamaIndex embedding cache. Similar framework-level cache.
  • OpenAI Embeddings API. Managed embedding service. Convenient but bills per token.
  • Cohere Embed API. Managed alternative, often used for multilingual work.
  • VoyageAI Embeddings. Managed embedding service focused on retrieval quality.
  • Self-hosted BGE / E5 / Nomic models. OSS embedding models. Run on GPU; embed at scale for cost savings vs API.
  • Hugging Face Text Embeddings Inference (TEI). Optimized runtime for serving embedding models. K8s-deployable.
  • Uber’s internal embedding platform. Public engineering blog posts describe scale + design.

Why this pattern matters

Embeddings are increasingly the substrate for search, retrieval, and personalization across many products. Without an embedding store, each team computes embeddings independently — duplicating cost, using different embedding models (making cross-team retrieval impossible), and reinventing versioning discipline. With a store, embeddings become a shared platform asset that multiple teams consume.

The pattern also solves a specific operational problem that hits every team eventually: model migration. The embedding model you started with three years ago has been deprecated and the provider's API now supports only its successor, or you want to upgrade to a better model for retrieval quality. Either way, you need to re-embed the corpus, cut over consumers, and maintain both versions during migration. The store is what makes this an operational procedure rather than a heroic engineering project.

For LLM applications specifically, embeddings are load-bearing. RAG depends on them. Semantic caching depends on them. Personalization depends on them. Vector-based recommendations depend on them. Every one of these use cases needs embeddings to be fresh, versioned, and accessible. An embedding store solves these operational concerns; without one, each use case reinvents them.

The pattern also enables cost optimization at scale. Deduplication before embedding (same document → same embedding → don’t call the API twice). Caching results (embed the query once, serve it to multiple consumers). Batching (embed 128 items in one API call instead of 128 separate calls). Each is a well-known optimization; centralizing them in a store means every consumer benefits from them without implementing them.

The failure modes to know: forgetting to version embeddings (mysterious retrieval-quality regressions), running two embedding models simultaneously without tracking which produced which vectors (retrieval returns nonsensical results), and letting the corpus grow faster than re-embedding capacity (only partial coverage by the latest model). Each has known patterns for prevention, but operating an embedding store means owning them.
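All three failure modes can be detected from versioning metadata alone. The sketch below assumes a hypothetical metadata export of (doc_id, model_version) pairs for a single serving index, with None marking a vector whose model was never recorded. The function name audit_versions and the report fields are illustrative, not an existing tool.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple, Union


def audit_versions(
    rows: Iterable[Tuple[str, Optional[str]]],
    expected_model: str,
) -> Dict[str, Union[int, bool, float]]:
    """Summarize version hygiene for one serving index.

    rows: (doc_id, model_version) records; model_version is None when
    the producing model was not recorded.
    """
    counts = Counter(model for _, model in rows)
    total = sum(counts.values())
    known_models = [m for m in counts if m is not None]
    return {
        # Failure mode 1: vectors with no recorded model version.
        "unversioned": counts.get(None, 0),
        # Failure mode 2: more than one model feeding the same index.
        "mixed_models": len(known_models) > 1,
        # Failure mode 3: share of the index covered by the expected model.
        "coverage": counts.get(expected_model, 0) / total if total else 1.0,
    }
```

Run periodically, a check like this turns the failure modes into alerts rather than retrieval-quality mysteries. During a planned migration, two models in the table are expected, so the mixed_models flag only matters for the index that consumers actually query.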

For basecamp, the embedding-store role is implicit rather than a separately-named component. pgvector holds the vectors. The embedding pipeline is an Argo CronWorkflow. Model versioning lives in a metadata table. The interface is Postgres queries. This is enough at basecamp scale (thousands to millions of documents); at petabyte scale, dedicated embedding infrastructure becomes worth the investment.

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 42 stands up the embedding pipeline on basecamp.
DEEP     Out of scope for /root unless a Y5 capstone direction calls for it.
         Default: OUTLINE target.

Preview: what OUTLINE will answer

When Y5 Phase 42 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you compute, store, version, and serve embeddings at scale across multiple consumers without duplicating work?
  • PRINCIPLES. Embeddings are versioned by model. Multi-consumer via uniform interface. Deduplication + caching + batching for cost. Migration path when embedding model changes. Store parallels feature store in shape and concerns.
  • TRADE-OFFS. Managed embedding API (OpenAI, Cohere — easy, per-token cost) vs self-hosted (BGE, E5 — GPU cost, cheaper at scale). Dedicated embedding store vs feature-store integration vs vector-DB-as-store. Batch re-embedding (throughput-optimized) vs incremental (responsiveness-optimized).
  • TOOLS (time-stamped as of 2026-06): pgvector + embedding pipeline (basecamp default), Feast with embeddings, LangChain/LlamaIndex caches, OpenAI/Cohere/VoyageAI APIs, self-hosted BGE/E5/Nomic on TEI, Uber’s internal embedding platform.

The DEEP promotion is out of scope for basecamp; if pursued, it would add MASTERY (operating an embedding platform at scale), COMPARE (managed vs self-hosted embedding), OPERATE (a specific embedding-model migration event), and CONTRIBUTE (a Feast embedding or TEI documentation improvement).

Canonical references

  • Feast documentation on embedding features. Free at feast.dev.
  • Hugging Face TEI documentation. Free at huggingface.co/docs/text-embeddings-inference.
  • Uber engineering blog on their embedding platform. Free at eng.uber.com.
  • Nils Reimers’s blog on embedding model selection. Free.
  • MTEB (Massive Text Embedding Benchmark) — comparative embedding evaluations. Free.

Cross-references