Prompt Engineering

Prompts as infrastructure: versioned, tested, deployed, observed. The discipline that treats prompts like code, not chat-window improvisation.

Prompts have versions. They live in Git or CRDs. They have tests. They have observability. They are infrastructure, not improvisation.

Status: STUB. Promotion to OUTLINE is planned for Y5 Phase 47.

What this pattern is

Prompt engineering as a pattern (rather than ad-hoc trial-and-error) means treating prompts the same way you treat code:

  • Versioned. Git or CRD-stored, with diff-able history.
  • Tested. Eval suites verify the prompt produces expected outputs across golden inputs.
  • Deployed. The LLM gateway serves prompts by name + version; updates roll out controllably.
  • Observed. Per-prompt latency, cost, quality metrics; A/B tests between versions.
  • Secured. Prompt injection defenses applied uniformly.

Specific techniques (system prompt structure, few-shot examples, chain-of-thought, function-calling schemas) all live within this discipline.
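As a rough illustration of "served by name + version", here is a minimal in-memory sketch. The names PromptVersion, PromptRegistry, register, get, and render are hypothetical and do not correspond to any particular gateway's API; a real registry would be backed by Git or CRDs rather than a dict.

```python
from dataclasses import dataclass
from string import Template
from typing import Dict, Mapping, Optional, Tuple


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    template: str  # uses the $placeholder syntax of string.Template


class PromptRegistry:
    """Hypothetical in-memory registry (illustration only)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, int], PromptVersion] = {}

    def register(self, name: str, template: str) -> PromptVersion:
        # Versions are append-only: each registration gets the next integer.
        latest = max((v for (n, v) in self._store if n == name), default=0)
        prompt = PromptVersion(name, latest + 1, template)
        self._store[(name, prompt.version)] = prompt
        return prompt

    def get(self, name: str, version: Optional[int] = None) -> PromptVersion:
        # With no version given, serve the latest registered version.
        if version is None:
            versions = [v for (n, v) in self._store if n == name]
            if not versions:
                raise KeyError(name)
            version = max(versions)
        return self._store[(name, version)]

    def render(self, name: str, params: Mapping[str, str],
               version: Optional[int] = None) -> str:
        # substitute() raises KeyError if a placeholder has no value.
        return Template(self.get(name, version).template).substitute(params)
```

Callers that pin an explicit version are unaffected by a new registration until they opt in, which is what makes a controlled rollout possible.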

The pattern is operationally important because prompts have outsized leverage. A small wording change can shift quality by something like 20%, or raise cost severalfold (on the order of 5x) if it triggers longer outputs. Without versioning and testing, the change reaches production silently. With versioning and testing, it’s a deliberate, measured rollout.
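One way to make a rollout "measured" is to expose a new prompt version to a fixed share of users, chosen deterministically. The sketch below is an assumption about how that could be done, not a description of any specific gateway; assigned_version and its parameters are hypothetical names.

```python
import hashlib


def assigned_version(user_id: str, prompt_name: str, candidate_share: float) -> str:
    """Return 'candidate' for roughly candidate_share of users, else 'baseline'."""
    if not 0.0 <= candidate_share <= 1.0:
        raise ValueError("candidate_share must be between 0 and 1")
    # Hash prompt name and user id together so each prompt gets an
    # independent split, and the same user always lands in the same bucket.
    digest = hashlib.sha256(f"{prompt_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")       # integer in [0, 2**64)
    threshold = int(candidate_share * 2**64)         # integer cut-off
    return "candidate" if bucket < threshold else "baseline"
```

Because the assignment depends only on the inputs, raising candidate_share from 0.05 to 0.5 keeps every user who already saw the candidate on the candidate.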

The pattern’s insight is that prompts are code. They shape system behavior. They have bugs. They have performance characteristics. They regress. Every discipline that applies to code (version control, review, testing, deployment gates, monitoring) applies to prompts. Treating prompts as one-off strings that live in application code is analogous to treating configuration as hardcoded constants — it works until it doesn’t, and by then untangling the mess is expensive.

The techniques within prompt engineering (few-shot examples, chain-of-thought, ReAct, self-consistency, prompt chaining) are important but secondary to the discipline. A team that has excellent prompt techniques but no versioning still loses to a team that has average techniques with rigorous versioning. The discipline compounds; the techniques are per-prompt improvements. Both matter, but the discipline matters more at scale.

Concrete instances in the wild

  • Anthropic’s prompt engineering guides. Free, comprehensive resource. Reference for chain-of-thought, XML-tag structure, few-shot patterns.
  • OpenAI’s prompt engineering cookbook. Similar reference from OpenAI.
  • LangChain PromptTemplate. Framework-level abstraction for parameterized prompts.
  • LlamaIndex prompt templates. Similar approach with retrieval-focused defaults.
  • LangSmith Prompt Hub. Managed prompt registry with versioning and evaluation.
  • Weights & Biases Prompts. Commercial prompt tracking with A/B testing.
  • Promptfoo eval-driven prompt development. Test-driven prompt iteration.
  • basecamp prompt-as-resource (Y5 Phase 47). Prompts stored as K8s CRDs, served by llm-gateway, versioned in Git.
  • Cursor’s system prompts. Public reverse-engineering has revealed how deliberately they’re constructed.
  • Anthropic Claude Code system prompt. The prompt that defines the Claude Code CLI’s behavior; hundreds of lines, deeply engineered.

Why this pattern matters

Prompts without discipline become the most fragile part of an LLM application. Everyone modifies them. Nobody tracks the changes. Regressions slip in. Quality drifts. When something goes wrong, nobody can point to what the prompt looked like when it was working. This is the exact failure mode that “code should be in version control” solved for software; the same solution applies to prompts.

The pattern also enables safe iteration. Version-controlled prompts can be A/B tested against production baseline. Eval-gated prompt changes can’t ship regressions past a threshold. Observed prompts reveal which are underperforming so improvements can be prioritized. Without these mechanisms, prompt improvement is guesswork; with them, it’s engineering.
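A minimal sketch of an eval gate, assuming exact-match scoring against a golden set. The callables passed as baseline and candidate stand in for "run this prompt version against the model"; pass_rate, eval_gate, and max_regression are hypothetical names, and real eval suites typically use richer scoring than string equality.

```python
from typing import Callable, Sequence, Tuple

GoldenCase = Tuple[str, str]  # (input text, expected output)


def pass_rate(run: Callable[[str], str], golden: Sequence[GoldenCase]) -> float:
    """Fraction of golden cases whose output matches the expectation exactly."""
    if not golden:
        raise ValueError("golden set must not be empty")
    passed = sum(1 for text, expected in golden if run(text).strip() == expected)
    return passed / len(golden)


def eval_gate(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    golden: Sequence[GoldenCase],
    max_regression: float = 0.0,
) -> bool:
    """Allow rollout only if the candidate stays within max_regression of baseline."""
    return pass_rate(candidate, golden) >= pass_rate(baseline, golden) - max_regression
```

With max_regression left at 0.0 the candidate must match or beat the baseline; a small positive value tolerates noise in the golden set at the price of letting minor regressions through.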

For LLM applications at scale, prompt engineering also affects cost significantly. A prompt that produces 2000-token responses costs roughly twice as much in output tokens as a prompt that produces 1000-token responses for the same task (input tokens are billed separately, so the total is somewhat less than 2x). A prompt that unnecessarily triggers chain-of-thought costs more than one that doesn’t. Cost-aware prompt engineering can reduce per-query cost substantially, by 3-5x in favorable cases, without quality regression. Cost-blind prompt engineering leaves that money on the table.
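The arithmetic can be sketched as follows, assuming the common pricing model in which input and output tokens are billed at separate per-million-token rates. The prices and token counts in the usage lines are illustrative assumptions, not any vendor's actual rates.

```python
def query_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost of one query, with prices given per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Illustrative numbers only: a 500-token prompt at assumed prices of
# 3.00 (input) and 15.00 (output) per million tokens.
short_answer = query_cost(500, 1000, 3.0, 15.0)
long_answer = query_cost(500, 2000, 3.0, 15.0)
ratio = long_answer / short_answer  # a little under 2, because input cost is fixed
```

Doubling output length approaches but does not quite double the per-query cost when the prompt itself is also billed; the longer the prompt relative to the response, the smaller the ratio.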

For basecamp specifically, prompt engineering discipline is what turns the ops-handbook chatbot and AIOps agent from experiments into production infrastructure. Versioned prompts. Eval-gated updates. Observed quality. Cost tracking. Without discipline, these are demos; with discipline, they’re systems that improve over months without regressing.

The pattern is also increasingly important for LLM-based tools embedded in developer workflows. Cursor, Claude Code, Copilot — their prompts are extensively engineered and constantly updated. Public reverse-engineering shows how sophisticated production prompts are. The gap between casual prompting and production prompting is significant; adopting the discipline is what closes it.

The failure modes to know: prompts that work great in development but fail in production (missing eval coverage of edge cases); prompts that quietly degrade as base models change (need re-evaluation on model updates); prompt sprawl (dozens of similar prompts nobody remembers the differences between); prompts that leak PII in their few-shot examples (need review discipline). Each has known mitigations, but adopting the pattern means owning them.
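As one example of a mitigation, the review step for few-shot examples can be partly automated. The sketch below is a deliberately crude check, assuming regular expressions for email addresses and US-style phone numbers; PII_PATTERNS and find_pii are hypothetical names, and a pattern scan like this supplements human review rather than replacing it.

```python
import re
from typing import List, Tuple

# Rough patterns for illustration; real PII detection needs far more coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(examples: List[str]) -> List[Tuple[int, str, str]]:
    """Return (example index, kind, matched text) for every suspected PII hit."""
    findings = []
    for index, text in enumerate(examples):
        for kind, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append((index, kind, match.group()))
    return findings
```

Running a check like this in the same pipeline that tests the prompt means a flagged example blocks the change before it ships, instead of relying on a reviewer noticing it.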

Modern tooling makes prompt engineering more tractable. LangSmith and Weights & Biases provide managed prompt registries with A/B testing. Promptfoo enables test-driven prompt development. Framework prompt templates provide parameterization primitives. What used to require internal tooling investment is increasingly available as open-source or managed tooling.

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 47 implements prompt-as-resource serving in
         llm-gateway, with versioning + at least one eval-gated rollout.
DEEP     Promoted after Y5 Phase 48 — prompts in production for agent runtime
         + AIOps with observed quality metrics over time.

Preview: what OUTLINE will answer

When Y5 Phase 47 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you manage prompts operationally so that changes are safe, tracked, and measurable?
  • PRINCIPLES. Prompts are code. Version them. Test them. Deploy them via gateway. Observe their behavior. A/B test major changes. Treat cost as part of each prompt’s budget, alongside quality and latency.
  • TRADE-OFFS. In-code prompts (simple, opaque) vs external prompts (flexible, needs infrastructure). Template-based (parameterized, structured) vs freeform (flexible, error-prone). Vendor-specific formatting (XML tags for Claude, markdown for GPT) vs neutral. Eval-gated rollout (safe, slow) vs quick iteration (fast, risky).
  • TOOLS (time-stamped as of 2026-06): LangSmith Prompt Hub, Weights & Biases Prompts, LangChain PromptTemplate, LlamaIndex templates, Promptfoo, basecamp prompt-as-resource CRDs.

The DEEP promotion, after Y5 Phase 48 with prompts in agent + AIOps production, will add MASTERY (operating prompts as infrastructure), COMPARE (LangSmith vs W&B vs custom prompt registries), OPERATE (a specific prompt-rollback or A/B test), and CONTRIBUTE (a public prompt-engineering case study or LangSmith documentation improvement).

Canonical references

  • Anthropic’s prompt engineering guides. Free at docs.anthropic.com.
  • OpenAI’s prompt engineering guide. Free at platform.openai.com.
  • LangChain prompt engineering documentation. Free.
  • Simon Willison’s blog on prompt engineering patterns. Free at simonwillison.net.
  • Learn Prompting (OSS course). Free at learnprompting.org.

Cross-references