Most teams building LLM-powered automation hit the same wall around week three of their pilot: the demo worked beautifully, but in production the model occasionally produces output in the wrong format, misclassifies edge cases, or confabulates details that weren't in the input. The difference between a demo and a reliable production system is almost always in how the prompt is structured — not in which model you're using.
This guide distills the prompt engineering patterns we've applied across automation projects involving invoice extraction, customer intent classification, contract summarization, and support ticket routing. These aren't theoretical techniques — they're patterns we've validated against real business data where a 2% error rate has real consequences.
The Foundational Mindset: You're Writing a Spec, Not a Request
Most prompt failures stem from treating the model like a smart assistant you're talking to casually. In production automation, you're not having a conversation — you're writing a specification that will be executed thousands of times against inputs you haven't seen yet. Think of your prompt as a function signature: it should be unambiguous about inputs, processing logic, and output format. Every degree of freedom you leave to the model is a potential failure mode at scale.
The practical implication: write your prompt to handle the worst-case input, not the clean happy-path example from your demo. If you're extracting invoice amounts, your prompt needs to handle amounts with and without currency symbols, European decimal notation (1.234,56), amounts with taxes listed separately, and invoices with no amount at all. If you only test on clean inputs, you'll only discover these edge cases in production.
Pattern 1: Structured Output with a Schema Contract
The single highest-leverage change you can make to any automation prompt is to demand a specific output structure and validate against it programmatically. Prose outputs are fine for summarization tasks consumed by humans, but for any automated downstream processing, you need machine-readable output.
SYSTEM:
You are a document extraction assistant. You must respond with valid JSON only.
Do not include any text before or after the JSON object.
Do not include markdown code fences.
If a field cannot be determined from the document, use null.
USER:
Extract the following fields from this invoice. Respond with JSON
matching this exact schema:
{
  "invoice_number": string | null,
  "invoice_date": "YYYY-MM-DD" | null,
  "vendor_name": string | null,
  "total_amount": number | null,
  "currency": "USD" | "EUR" | "GBP" | "CAD" | null,
  "line_items": [
    {
      "description": string,
      "quantity": number | null,
      "unit_price": number | null,
      "total": number | null
    }
  ]
}
Document:
{{invoice_text}}
Notice several things about this prompt: the schema is stated explicitly with types, null is the explicit fallback for unknown fields, the date format is specified precisely, and currency is constrained to a known set. This specificity dramatically reduces the variance in outputs across different invoice formats.
In Python, wrap your LLM call with validation using Pydantic:
from pydantic import BaseModel, field_validator
from typing import Optional
import json
import re


class ExtractionError(Exception):
    """Raised when LLM output fails JSON parsing or schema validation."""


class LineItem(BaseModel):
    description: str
    quantity: Optional[float] = None
    unit_price: Optional[float] = None
    total: Optional[float] = None


class InvoiceExtraction(BaseModel):
    invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None
    vendor_name: Optional[str] = None
    total_amount: Optional[float] = None
    currency: Optional[str] = None
    line_items: list[LineItem] = []

    @field_validator('invoice_date')
    @classmethod
    def validate_date_format(cls, v):
        if v is None:
            return v
        if not re.match(r'^\d{4}-\d{2}-\d{2}$', v):
            raise ValueError(f'Date must be YYYY-MM-DD, got: {v}')
        return v


def extract_invoice(llm_response: str) -> InvoiceExtraction:
    try:
        data = json.loads(llm_response)
        return InvoiceExtraction(**data)
    except (json.JSONDecodeError, ValueError) as e:
        # Log, retry with a correction prompt, or route to human review
        raise ExtractionError(f'Invalid LLM output: {e}') from e
Model APIs and structured output: Both the OpenAI API (JSON mode and structured outputs) and Anthropic's Claude API (via tool use) offer features that constrain output toward valid JSON at the API level. Use these features in addition to your schema prompt: they largely eliminate JSON syntax errors at generation time, while your schema prompt and validation code still handle the semantic correctness of field names, types, and values.
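One way to keep the schema block in the prompt from silently drifting away from the validation code is to generate it from the Pydantic model itself. This is a sketch, assuming Pydantic v2 and a trimmed-down version of the models above; the helper name is illustrative:

```python
import json
from typing import Optional
from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: Optional[float] = None

class InvoiceExtraction(BaseModel):
    invoice_number: Optional[str] = None
    total_amount: Optional[float] = None
    line_items: list[LineItem] = []

def schema_prompt_block(model: type[BaseModel]) -> str:
    """Render the validator's JSON Schema for embedding in the prompt,
    so the prompt and the validation code share one source of truth."""
    return json.dumps(model.model_json_schema(), indent=2)

prompt = (
    "Extract the fields from this invoice. Respond with JSON "
    "matching this exact JSON Schema:\n"
    + schema_prompt_block(InvoiceExtraction)
)
```

The trade-off is that raw JSON Schema is more verbose than the hand-written schema shown earlier; some teams prefer the compact hand-written form for token economy and the generated form for correctness-critical pipelines.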
Pattern 2: Chain-of-Thought for Classification Tasks
For classification tasks — routing support tickets, categorizing expenses, assessing contract risk — asking the model to produce a label directly often gives you a brittle system that fails silently on edge cases. A better approach is to ask the model to reason through the classification criteria before committing to a label. This improves accuracy and, crucially, gives you an auditable reasoning trace.
SYSTEM:
You are a support ticket routing agent for a B2B SaaS company.
Classify each ticket into exactly one of these categories:
- BILLING: payment issues, invoice questions, subscription changes
- BUG: software not working as documented
- FEATURE_REQUEST: asking for new functionality
- ACCOUNT: login, password, user management
- OTHER: anything that doesn't clearly fit the above
Always respond in this exact format:
REASONING: [2-3 sentences explaining which category fits and why]
CONFIDENCE: [HIGH | MEDIUM | LOW]
CATEGORY: [one of the five categories above]
A LOW confidence rating means a human should review this ticket.
USER:
Ticket: "We've been charged twice for our November invoice and
also the export feature stopped working this morning."
The model's response will look something like:
REASONING: This ticket contains two distinct issues — a double-charge billing
problem and a software malfunction with the export feature. Billing takes
priority for routing since it involves financial impact and is time-sensitive.
The bug should be captured as a secondary issue after billing resolution.
CONFIDENCE: MEDIUM
CATEGORY: BILLING
Parse this response structurally in your code rather than treating it as free text. The CONFIDENCE field is particularly valuable — routing LOW confidence tickets to human review prevents your automation from making wrong decisions on genuinely ambiguous inputs.
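A minimal parser for that three-field format could look like the following sketch (the field names follow the prompt above; the regex details and the review-routing rule are assumptions):

```python
import re

VALID_CATEGORIES = {'BILLING', 'BUG', 'FEATURE_REQUEST', 'ACCOUNT', 'OTHER'}
VALID_CONFIDENCE = {'HIGH', 'MEDIUM', 'LOW'}

def parse_routing_response(text: str) -> dict:
    """Parse REASONING / CONFIDENCE / CATEGORY lines from the model output.

    Anything malformed or unrecognized is flagged for human review
    rather than guessed at.
    """
    fields = {}
    for key in ('REASONING', 'CONFIDENCE', 'CATEGORY'):
        # Capture the field value up to the next ALL-CAPS field or end of text,
        # so multi-line reasoning is preserved
        m = re.search(rf'{key}:\s*(.*?)(?=\n[A-Z_]+:|\Z)', text, re.DOTALL)
        fields[key.lower()] = m.group(1).strip() if m else None
    well_formed = (
        fields['category'] in VALID_CATEGORIES
        and fields['confidence'] in VALID_CONFIDENCE
    )
    # LOW confidence and unparseable output both go to a human
    fields['needs_review'] = (not well_formed) or fields['confidence'] == 'LOW'
    return fields
```

Treating an unparseable response the same as a LOW-confidence one keeps the failure path unified: everything the automation cannot act on safely lands in the same review queue.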
Pattern 3: Few-Shot Examples for Domain-Specific Tasks
General-purpose LLMs have broad knowledge but don't know your company's specific terminology, product names, or classification schemes. Few-shot examples in the prompt are the most efficient way to teach the model your domain conventions without fine-tuning.
The key discipline is selecting representative examples rather than easy ones. Your few-shot examples should cover the most common cases and the most confusing edge cases. If you have a classification that's frequently confused with another, include an example that clearly illustrates the difference.
Classify the following customer feedback as: POSITIVE, NEUTRAL, or NEGATIVE.
For our company, complaints about response time are NEUTRAL (not NEGATIVE)
unless the customer explicitly expresses frustration.
Examples:
---
Feedback: "The new dashboard is much faster than before."
Category: POSITIVE
Feedback: "It took 3 days to hear back on my support ticket."
Category: NEUTRAL
Feedback: "I've been waiting a week for a response and this is unacceptable.
I'm considering switching providers."
Category: NEGATIVE
Feedback: "Your pricing page is confusing."
Category: NEGATIVE
---
Now classify this feedback:
"Setup took longer than expected but the onboarding team was very helpful."
Notice the fourth example (confusing pricing page as NEGATIVE) clarifies a domain-specific rule: confusion about your product is a negative signal even without explicit frustration language. Without this example, many models would classify this as NEUTRAL.
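As the example set grows, assembling the few-shot block programmatically from a curated list keeps it maintainable and versionable. A sketch (the function name and dict shape are illustrative):

```python
def build_few_shot_prompt(task: str, examples: list[dict], query: str) -> str:
    """Assemble a few-shot classification prompt from curated examples,
    so the example set can live in version control alongside the code."""
    blocks = [task, '', 'Examples:', '---']
    for ex in examples:
        blocks.append(f'Feedback: "{ex["feedback"]}"')
        blocks.append(f'Category: {ex["category"]}')
        blocks.append('')
    blocks += ['---', 'Now classify this feedback:', f'"{query}"']
    return '\n'.join(blocks)
```

Storing examples as data rather than inline prompt text also makes it easy to run the evaluation discipline described later: swap in a candidate example set, re-run the eval, compare.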
Pattern 4: Explicit Negative Instructions
Telling a model what not to do is at least as important as telling it what to do. Models have strong priors from their training data that sometimes override your instructions. Explicit negative instructions counteract the most common failure modes.
IMPORTANT — Do NOT do any of the following:
- Do not infer or guess missing information. If a field is absent, use null.
- Do not normalize or correct spelling in extracted text. Return verbatim.
- Do not add explanatory text outside the JSON structure.
- Do not combine multiple vendors into a single vendor_name field.
- Do not calculate totals from line items; extract only what is stated.
If the document appears to be something other than an invoice (e.g., a
purchase order, a quote, or a receipt), return this exact JSON:
{"error": "NOT_INVOICE", "document_type": "[your best guess at type]"}
The last instruction — handling documents that aren't what you expect — is often overlooked. In production, your automation will inevitably receive inputs that don't match your assumptions. Defining explicit fallback behavior prevents silent failures where the model extracts garbage data from an unexpected document type.
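Handling that fallback in code is straightforward. A sketch (the error payload shape follows the prompt above; the action names are illustrative):

```python
import json

def route_extraction_output(raw: str) -> dict:
    """Decide what to do with the model's raw output: a declared
    non-invoice, a parse failure, or a candidate for schema validation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {'action': 'retry_or_review', 'payload': raw}
    # The prompt asks the model to declare unexpected document types
    if isinstance(data, dict) and data.get('error') == 'NOT_INVOICE':
        return {'action': 'skip_non_invoice',
                'document_type': data.get('document_type')}
    return {'action': 'validate', 'payload': data}
```

Checking for the declared error payload before schema validation matters: otherwise the `{"error": ...}` object would simply fail validation and look identical to a malformed extraction.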
Pattern 5: Context Windows and Chunking Strategy
Documents that exceed the model's context window, or are long enough that extraction quality degrades, require chunking before they can be processed. The chunking strategy matters more than most teams realize: naive fixed-size chunking often splits semantic units (a contract clause, a financial table, a paragraph of reasoning) across chunks, degrading extraction quality significantly.
For document automation, prefer semantic chunking over fixed-size chunking. For structured documents like invoices and contracts, identify natural section boundaries and chunk at those boundaries. For narrative text, paragraph or section boundaries preserve semantic coherence better than character-count splits.
import re

def chunk_contract_by_section(text: str) -> list[dict]:
    """
    Split a contract into chunks at section headings.
    Each chunk contains the section number, title, and content.
    """
    # Match headings like "1.", "1.1", "Section 1.", "ARTICLE I"
    section_pattern = re.compile(
        r'^(\d+\.[\d.]*\s+[A-Z][^\n]+|ARTICLE\s+[IVXLCDM]+[^\n]*|'
        r'Section\s+\d+[^\n]*)',
        re.MULTILINE
    )
    matches = list(section_pattern.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        start = match.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        section_text = text[start:end].strip()
        # If a section is still too long, split on paragraphs
        if len(section_text) > 3000:
            paragraphs = section_text.split('\n\n')
            for j, para_group in enumerate(
                [paragraphs[k:k+3] for k in range(0, len(paragraphs), 3)]
            ):
                chunks.append({
                    'section': match.group(0).strip(),
                    'chunk_index': j,
                    'content': '\n\n'.join(para_group)
                })
        else:
            chunks.append({
                'section': match.group(0).strip(),
                'chunk_index': 0,
                'content': section_text
            })
    return chunks
Making Automation Reliable: Retry Logic and Fallback Routing
Even with well-engineered prompts, you'll occasionally get malformed outputs — especially when processing unusual document formats or running at high throughput with temperature > 0. Production-grade automation needs retry logic with correction prompts and a human-review fallback for cases that can't be resolved automatically.
async def extract_with_retry(
    document: str,
    max_retries: int = 2
) -> InvoiceExtraction:
    last_error = None
    last_output = None
    for attempt in range(max_retries + 1):
        if attempt == 0:
            prompt = build_extraction_prompt(document)
        else:
            # On retry, include the previous bad output and the error
            prompt = build_correction_prompt(
                document, last_output, str(last_error)
            )
        response = await llm_client.complete(prompt, temperature=0.0)
        try:
            return extract_invoice(response.text)
        except ExtractionError as e:
            last_error = e
            last_output = response.text
    # All retries exhausted — route to human review queue
    await human_review_queue.enqueue({
        'document': document,
        'last_llm_output': last_output,
        'error': str(last_error),
        'reason': 'extraction_failed_after_retry'
    })
    raise ExtractionFailedError('Routed to human review')
Set temperature=0.0 for extraction and classification tasks. Temperature controls the randomness of the model's output distribution. For business automation where you want consistent, repeatable outputs, use temperature=0.0 (even at zero temperature most hosted APIs are not bit-for-bit deterministic, but output variance drops substantially). Reserve non-zero temperatures for creative tasks or cases where you want diverse outputs for sampling.
Measuring and Improving Over Time
The most underrated practice in production prompt engineering is systematic evaluation. Without a labeled test set and a repeatable evaluation process, you're flying blind — you can't tell whether a prompt change actually improved things or just changed the failure mode.
Build your evaluation dataset from real production inputs, especially ones that caused errors. Aim for at least 100–200 examples covering the main categories and known edge cases. Run your new prompt against this dataset before deploying. Track precision, recall, and error rate by category — aggregate accuracy hides category-specific regressions that your users will notice immediately.
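A minimal per-category breakdown over a labeled evaluation set can be computed from (predicted, expected) pairs. This sketch (helper name and output shape are illustrative) reports the precision and recall that aggregate accuracy hides:

```python
from collections import Counter

def per_category_metrics(pairs: list[tuple[str, str]]) -> dict:
    """pairs: (predicted, expected) label pairs from an eval run.
    Returns per-category precision and recall, which surface
    regressions that a single aggregate accuracy number hides."""
    tp, pred_counts, true_counts = Counter(), Counter(), Counter()
    for predicted, expected in pairs:
        pred_counts[predicted] += 1
        true_counts[expected] += 1
        if predicted == expected:
            tp[expected] += 1
    metrics = {}
    for cat in pred_counts.keys() | true_counts.keys():
        metrics[cat] = {
            'precision': tp[cat] / pred_counts[cat] if pred_counts[cat] else 0.0,
            'recall': tp[cat] / true_counts[cat] if true_counts[cat] else 0.0,
        }
    return metrics
```

Run this on the same held-out set before and after every prompt change; a drop in any single category's recall is a regression even if overall accuracy went up.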
The teams that succeed with LLM automation treat prompts the same way they treat code: they version control them, they test changes against a held-out evaluation set, and they have a clear rollback path when a new prompt version regresses on important cases. The teams that struggle treat prompt engineering as a one-time setup activity and wonder why their automation accuracy slowly drifts downward as the model is updated or input distributions shift.
Good prompt engineering is engineering. It responds to the same disciplines — testing, versioning, monitoring, iteration — that make any production system reliable. Apply those disciplines from the start and you'll have automation that actually earns trust from the business teams that depend on it.