Every business that handles contracts, invoices, or compliance documents has the same bottleneck: someone has to read them. Not skim — actually read, extract the relevant data points, flag anomalies, and route the document to the right person. A single procurement team might review 200+ vendor contracts a quarter, each one a 15–40 page PDF full of clauses that need to be checked against company policy.
LLMs have changed the calculus on document review systems. What used to require months of NLP pipeline engineering — tokenizers, NER models, custom classifiers, rule engines — can now be built in a few hundred lines of Python with an API call. But "call the API and hope for the best" isn't a production system. Here's how to build one that actually works.
Architecture Overview
A document review system has four stages: ingestion (getting text out of the document), extraction (pulling structured data from unstructured text), validation (checking extracted data against business rules), and routing (deciding what happens next). The LLM handles extraction; you handle everything else with deterministic code.
The stack we'll use: Python 3.11+, pymupdf for PDF text extraction, the Anthropic Python SDK for Claude API calls, pydantic for structured output validation, and a simple SQLite database for tracking review state. In production, you'd swap SQLite for PostgreSQL and add a task queue like Celery, but the core logic stays the same.
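Assuming the current PyPI package names, the third-party dependencies install in one line (sqlite3 ships with Python's standard library):

```shell
pip install pymupdf anthropic pydantic
```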
Stage 1: Document Ingestion
The quality of your text extraction determines everything downstream. Most PDF extraction libraries lose table structure, merge columns, or strip headers. pymupdf (also known as fitz) handles this better than alternatives for most business documents.
```python
import pymupdf


def extract_text(pdf_path: str) -> list[dict]:
    """Extract text from PDF, preserving page structure."""
    doc = pymupdf.open(pdf_path)
    pages = []
    for i, page in enumerate(doc):
        blocks = page.get_text("dict")["blocks"]
        text_content = page.get_text("text")
        pages.append({
            "page_number": i + 1,
            "text": text_content,
            "char_count": len(text_content),
            # Crude heuristic: multi-line text blocks often indicate tables
            "has_tables": any(
                b.get("lines") and len(b["lines"]) > 3
                for b in blocks if b["type"] == 0
            ),
        })
    doc.close()
    return pages
```
Two things to note here. First, we track has_tables per page because pages with tabular data (pricing schedules, payment terms, SLA metrics) need different extraction prompts than narrative text. Second, we preserve page numbers — when the LLM extracts a clause, we want to cite the exact page so a human reviewer can verify it quickly.
For scanned PDFs (no embedded text), you'll need OCR. The simplest production-ready option is to use the LLM's vision capabilities directly — send page images instead of text. This avoids the Tesseract/EasyOCR pipeline entirely and handles mixed-format documents (part scanned, part digital) gracefully.
Stage 2: Structured Extraction with Pydantic
The key to reliable extraction is defining your output schema before writing any prompts. Pydantic models serve double duty: they document what you expect from the LLM and validate the response at runtime.
```python
from datetime import date
from enum import Enum

from pydantic import BaseModel, Field


class RiskLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


class ContractClause(BaseModel):
    clause_type: str = Field(
        description="Category: liability, termination, payment, "
                    "confidentiality, indemnification, ip_ownership, sla"
    )
    summary: str = Field(
        description="1-2 sentence plain English summary of the clause"
    )
    page_number: int
    risk_level: RiskLevel
    risk_reason: str = Field(
        description="Why this risk level was assigned"
    )
    verbatim_excerpt: str = Field(
        description="Exact quote from the document (max 200 chars)"
    )


class ContractReview(BaseModel):
    vendor_name: str
    contract_type: str
    effective_date: date | None
    expiration_date: date | None
    total_value: str | None = Field(
        description="Total contract value with currency, e.g. '$150,000'"
    )
    auto_renewal: bool
    governing_law: str | None
    key_clauses: list[ContractClause]
    overall_risk: RiskLevel
    recommended_action: str = Field(
        description="approve, negotiate, escalate, or reject"
    )
```
This schema captures what a procurement team actually needs to make a decision. The verbatim_excerpt field is critical — it gives the human reviewer a direct reference into the source document, which builds trust in the system and makes spot-checking fast.
Key insight: The most common failure mode in LLM extraction isn't wrong answers — it's confidently plausible answers. An LLM might extract a contract value of "$150,000" when the document actually says "$150,000 per year for 3 years" (total: $450,000). The verbatim_excerpt field forces the model to ground its extraction in actual document text, and gives reviewers a fast way to verify.
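To see the runtime-validation half of this in action, here is a trimmed, standalone version of the schema (not the full models above) rejecting a response whose risk level falls outside the enum:

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class RiskLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


class ClauseStub(BaseModel):
    clause_type: str
    page_number: int
    risk_level: RiskLevel


# A well-formed extraction parses cleanly
good = ClauseStub.model_validate_json(
    '{"clause_type": "liability", "page_number": 4, "risk_level": "high"}'
)

try:
    # "severe" is not a RiskLevel member, so pydantic fails loudly
    ClauseStub.model_validate_json(
        '{"clause_type": "liability", "page_number": 4, "risk_level": "severe"}'
    )
except ValidationError as e:
    print(f"rejected: {e.error_count()} error(s)")
```

Every schema violation surfaces as a `ValidationError` at the boundary, which is what makes the retry logic later in this article possible.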
Stage 3: The Extraction Prompt
The prompt is where most teams either over-engineer or under-specify. You don't need a 2,000-word system prompt with examples for every edge case. You need clear instructions, the output schema, and one or two examples of ambiguous situations.
```python
import json

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a contract review specialist. Extract structured
data from the provided contract text. Be precise and conservative:

- If a field is ambiguous or not clearly stated, use null
- Risk levels: low (standard terms), medium (unusual but manageable),
  high (unfavorable terms requiring negotiation), critical (deal-breakers)
- For verbatim_excerpt, quote the EXACT text — do not paraphrase
- Auto-renewal: true only if the contract explicitly states automatic renewal
- If payment terms exceed Net-60, flag as medium risk minimum"""


def review_contract(pages: list[dict]) -> ContractReview:
    full_text = "\n\n".join(
        f"--- PAGE {p['page_number']} ---\n{p['text']}"
        for p in pages
    )

    # Truncate if needed (Claude supports 200K tokens,
    # but shorter context = better extraction)
    if len(full_text) > 100_000:
        full_text = full_text[:100_000] + "\n[TRUNCATED]"

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"""Review this contract and extract structured data.

Return valid JSON matching this exact schema:

{json.dumps(ContractReview.model_json_schema(), indent=2)}

CONTRACT TEXT:
{full_text}"""
        }]
    )

    # Parse the response text as JSON
    response_text = response.content[0].text

    # Handle cases where the model wraps JSON in markdown
    if "```json" in response_text:
        response_text = response_text.split("```json")[1].split("```")[0]
    elif "```" in response_text:
        response_text = response_text.split("```")[1].split("```")[0]

    return ContractReview.model_validate_json(response_text.strip())
```
We're using claude-sonnet-4-6 here because it hits the sweet spot of extraction accuracy and cost for document review. Claude Opus would give marginally better results on highly complex legal language, but at 5x the cost per token — rarely justified for a system processing hundreds of documents.
Handling Long Documents
For documents over 50 pages, a single-pass extraction often misses clauses buried in the middle. The more reliable approach is a two-pass strategy: first pass identifies and locates all relevant clauses, second pass extracts details from each one.
```python
# Helpers used below: page formatting mirrors review_contract, and
# extract_json applies the same markdown-fence stripping.
def format_pages(pages: list[dict]) -> str:
    """Join page texts with explicit page markers."""
    return "\n\n".join(
        f"--- PAGE {p['page_number']} ---\n{p['text']}"
        for p in pages
    )


def extract_json(text: str) -> str:
    """Strip a markdown code fence if the model wrapped its JSON in one."""
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    elif "```" in text:
        text = text.split("```")[1].split("```")[0]
    return text.strip()


def review_long_contract(pages: list[dict]) -> ContractReview:
    # Pass 1: Identify clause locations
    toc_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Scan this contract and list every significant
clause with its page number and type. Return JSON array of objects
with fields: page_number, clause_type, brief_description.

{format_pages(pages)}"""
        }]
    )
    clause_locations = json.loads(
        extract_json(toc_response.content[0].text)
    )

    # Pass 2: Extract details from relevant pages only
    relevant_pages = set()
    for loc in clause_locations:
        pg = loc["page_number"]
        # Include surrounding pages for context
        relevant_pages.update([pg - 1, pg, pg + 1])

    filtered_pages = [
        p for p in pages if p["page_number"] in relevant_pages
    ]
    return review_contract(filtered_pages)
```
Stage 4: Validation and Business Rules
The LLM gives you structured data. Now you need to apply business rules that the LLM doesn't know about — your company's specific risk thresholds, approval limits, and compliance requirements. This is deterministic code, not AI.
```python
import re
from dataclasses import dataclass


def parse_currency(value: str) -> float | None:
    """Pull the first numeric amount out of a string like '$150,000'."""
    match = re.search(r"[\d,]+(?:\.\d+)?", value)
    return float(match.group().replace(",", "")) if match else None


@dataclass
class ReviewDecision:
    approved: bool
    flags: list[str]
    escalate_to: str | None
    requires_legal: bool


def apply_business_rules(review: ContractReview) -> ReviewDecision:
    flags = []
    requires_legal = False
    escalate_to = None

    # Rule 1: Contracts over $100K need VP approval
    if review.total_value:
        amount = parse_currency(review.total_value)
        if amount and amount > 100_000:
            escalate_to = "vp_procurement"
            flags.append(f"High-value contract: {review.total_value}")

    # Rule 2: Auto-renewal contracts need explicit approval
    if review.auto_renewal:
        flags.append("Auto-renewal clause detected")

    # Rule 3: Any critical-risk clause requires legal review
    critical_clauses = [
        c for c in review.key_clauses
        if c.risk_level == RiskLevel.CRITICAL
    ]
    if critical_clauses:
        requires_legal = True
        for c in critical_clauses:
            flags.append(
                f"Critical: {c.clause_type} (p.{c.page_number})"
            )

    # Rule 4: Indemnification without cap is always critical
    for clause in review.key_clauses:
        if (clause.clause_type == "indemnification" and
                "unlimited" in clause.summary.lower()):
            requires_legal = True
            flags.append("Unlimited indemnification detected")

    # Rule 5: Governing law outside home jurisdiction
    if (review.governing_law and
            "texas" not in review.governing_law.lower() and
            "delaware" not in review.governing_law.lower()):
        flags.append(
            f"Non-standard jurisdiction: {review.governing_law}"
        )

    approved = (
        not requires_legal and
        escalate_to is None and
        review.overall_risk in [RiskLevel.LOW, RiskLevel.MEDIUM]
    )

    return ReviewDecision(
        approved=approved,
        flags=flags,
        escalate_to=escalate_to,
        requires_legal=requires_legal,
    )
```
This is where the real value lives. The LLM extracts the data; your business rules make the decisions. This separation means you can update policies (raise the VP approval threshold to $200K, add a new jurisdiction to the whitelist) without rewriting prompts or retraining anything.
Error Handling and Retry Logic
LLM APIs fail. Responses sometimes aren't valid JSON. Pydantic validation catches fields the model forgot. You need all three handled gracefully.
```python
import json
import time

import anthropic
from pydantic import ValidationError


def review_with_retry(
    pages: list[dict],
    max_retries: int = 3,
) -> ContractReview | None:
    last_error = None
    for attempt in range(max_retries):
        try:
            return review_contract(pages)
        except json.JSONDecodeError as e:
            # Raised by the two-pass path, which parses with json.loads
            last_error = f"Invalid JSON: {e}"
        except ValidationError as e:
            # Pydantic v2 also raises this for malformed JSON input
            last_error = f"Schema validation: {e.error_count()} errors"
        except anthropic.RateLimitError:
            last_error = "Rate limited"
            time.sleep(2 ** attempt)  # exponential backoff
            continue  # already waited; skip the fixed sleep below
        except anthropic.APIError as e:
            last_error = f"API error: {e}"
        if attempt < max_retries - 1:
            time.sleep(1)

    # Log failure for manual review
    print(f"Extraction failed after {max_retries} attempts: {last_error}")
    return None
```
In production, we log every failed extraction to a manual review queue. The failure rate for well-structured business documents is typically 2–5% — mostly scanned documents with poor image quality or heavily formatted documents where text extraction produces garbage. Those documents get routed to human reviewers automatically.
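The SQLite state tracking mentioned in the stack can be a single table recording each document's outcome, which makes the manual review queue a one-line query. A minimal sketch (table and column names are assumptions):

```python
import sqlite3


def init_db(path: str = "reviews.db") -> sqlite3.Connection:
    """Create the review-state table if it doesn't exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS reviews (
            doc_path    TEXT PRIMARY KEY,
            status      TEXT NOT NULL,  -- 'completed', 'failed', 'needs_human'
            error       TEXT,
            reviewed_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    return conn


def record_failure(conn: sqlite3.Connection, doc_path: str, error: str) -> None:
    """Route a failed extraction into the manual review queue."""
    conn.execute(
        "INSERT OR REPLACE INTO reviews (doc_path, status, error) "
        "VALUES (?, 'failed', ?)",
        (doc_path, error),
    )
    conn.commit()


def manual_review_queue(conn: sqlite3.Connection) -> list[str]:
    """Documents awaiting a human reviewer."""
    rows = conn.execute("SELECT doc_path FROM reviews WHERE status = 'failed'")
    return [r[0] for r in rows]
```

Swapping this for PostgreSQL later means changing the connection and placeholder style, not the design.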
Cost and Latency in Production
Real numbers from a production deployment processing vendor contracts:
- Average document: 22 pages, ~35,000 tokens input, ~1,200 tokens output
- Cost per document: ~$0.12 with Claude Sonnet (input: $0.105, output: $0.018)
- Latency: 8–15 seconds per document (single pass), 15–25 seconds (two pass)
- Accuracy: 94% field-level accuracy on structured fields (dates, values, names), 87% on risk classification (validated against legal team decisions over 500 documents)
At $0.12 per document and 200 contracts per quarter, that's $24 per quarter in API costs. Compare that to the 15–30 minutes a procurement analyst spends per contract at fully-loaded cost. The ROI is measured in orders of magnitude, not percentages.
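The per-document math is easy to sanity-check. The prices below are the Sonnet list prices assumed by the numbers above ($3 per million input tokens, $15 per million output tokens) and should be re-checked against current pricing:

```python
INPUT_PRICE_PER_M = 3.00    # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens (assumed)


def cost_per_doc(input_tokens: int, output_tokens: int) -> float:
    """Estimated API cost in USD for one document."""
    return (input_tokens / 1_000_000 * INPUT_PRICE_PER_M
            + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M)


# Average document from the numbers above: 35K input, 1.2K output
print(round(cost_per_doc(35_000, 1_200), 3))  # prints 0.123
```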
What This System Doesn't Replace
This system accelerates review — it doesn't eliminate it. The workflow we recommend: AI processes every document and generates a structured review. Low-risk, standard-terms contracts (typically 60–70% of volume) get auto-approved with the review attached for audit trails. Medium and high-risk contracts get routed to the appropriate reviewer with the AI's analysis pre-populated, cutting their review time from 30 minutes to 5–10 minutes. Critical-risk contracts go directly to legal with specific clauses highlighted.
The human is always in the loop for anything that matters. The AI just makes sure they're spending their time on the documents that actually need attention, not rubber-stamping routine renewals.
Getting Started
The entire system described here is roughly 400 lines of Python. Start with a narrow scope: pick one document type (vendor contracts, invoices, compliance certifications), define the extraction schema for that type, and run it against 20–30 real documents from your archives. Compare the AI's extraction to what a human reviewer would produce. Iterate on the schema and prompt until you hit 90%+ accuracy on your specific document type, then expand to the next type.
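One way to run that comparison is plain field-by-field matching against a human-labeled gold set (the dict-of-fields shape here is an assumption; use whatever your schema dumps to):

```python
def field_accuracy(predicted: dict, gold: dict) -> float:
    """Fraction of gold-labeled fields the extraction matched exactly."""
    if not gold:
        return 0.0
    matches = sum(1 for k, v in gold.items() if predicted.get(k) == v)
    return matches / len(gold)


gold = {"vendor_name": "Acme Corp", "auto_renewal": False, "governing_law": "Delaware"}
pred = {"vendor_name": "Acme Corp", "auto_renewal": True, "governing_law": "Delaware"}
print(round(field_accuracy(pred, gold), 2))  # prints 0.67
```

Exact matching is deliberately strict; for free-text fields like summaries you would relax it, but for dates, amounts, and booleans, anything short of exact is a miss worth reviewing.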
The biggest mistake teams make is trying to build a generic "document understanding" system that handles every document type. Don't. Build a contract review system, then an invoice processing system, then a compliance checker. Each one has different schemas, different business rules, and different accuracy requirements. Specialization is what turns a demo into a production tool.