AI Engineering · Open Source · Production

Multi-Model LLM Document Intelligence

Production document extraction pipeline — 450+ records/minute, 98%+ accuracy, and three switchable LLM backends (OpenAI, Anthropic, Ollama local). Built for cost-optimized batch processing at scale.


450+ records / minute

98%+ extraction accuracy

3 LLM backends

$0–$0.50 per 1K records

Overview

Modern enterprises process thousands of documents daily. LLM APIs offer powerful natural-language understanding, but production deployment introduces problems that don't exist in prototypes: rate limits hit within seconds, costs spiral unpredictably, and malformed responses silently corrupt output datasets. This system addresses each one with rate limiting, switchable backends, and layered output validation.

The Engineering Challenges

Rate limiting at scale

Cloud APIs throttle at 50–500 requests/minute. Naive concurrent implementations hit limits within seconds and fail silently.

Cost management

Processing 10K documents costs about $5 (GPT-4o-mini) to $30 (Claude Sonnet) at the per-1K rates in the backend table, and $0 on local Ollama. Without routing logic, cloud bills spiral unpredictably.

Reliability under failure

Network timeouts, API 429s, and malformed responses are guaranteed in production. Systems that don't handle them corrupt output silently.

Vendor lock-in

Single-provider dependency creates risk. API deprecations and price changes require re-engineering instead of a configuration change.

Multi-Backend via Strategy Pattern

class DocumentAnalyzer:
    """Backend-agnostic document processing interface."""
    def analyze_document(self, row_data: dict) -> dict:
        raise NotImplementedError

# Caller code — identical for all three backends
analyzer = get_analyzer(backend=config.BACKEND)  # "openai" | "anthropic" | "ollama"
result = analyzer.analyze_document(row)
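
The snippet above calls get_analyzer without showing it. One plausible way to write such a factory is a small registry keyed by the config string. This is a minimal sketch, not the repository's implementation: EchoAnalyzer and the _BACKENDS table are hypothetical stand-ins, and a real backend class would call its LLM API inside analyze_document.

```python
from typing import Callable, Dict


class DocumentAnalyzer:
    """Backend-agnostic document processing interface."""

    def analyze_document(self, row_data: dict) -> dict:
        raise NotImplementedError


class EchoAnalyzer(DocumentAnalyzer):
    """Hypothetical stand-in backend; a real one would call an LLM API."""

    def __init__(self, name: str):
        self.name = name

    def analyze_document(self, row_data: dict) -> dict:
        return {"backend": self.name, "fields": dict(row_data)}


# Registry mapping a config string to a constructor (names are assumptions).
_BACKENDS: Dict[str, Callable[[], DocumentAnalyzer]] = {
    "openai": lambda: EchoAnalyzer("openai"),
    "anthropic": lambda: EchoAnalyzer("anthropic"),
    "ollama": lambda: EchoAnalyzer("ollama"),
}


def get_analyzer(backend: str) -> DocumentAnalyzer:
    """Return the analyzer registered under `backend`, or raise ValueError."""
    try:
        return _BACKENDS[backend.lower()]()
    except KeyError:
        raise ValueError(
            f"Unknown backend {backend!r}; expected one of {sorted(_BACKENDS)}"
        ) from None


result = get_analyzer("ollama").analyze_document({"id": 1})
```

With this shape, switching providers means changing one string in configuration; the caller never imports a backend class directly.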

Backend	Throughput	Cost per 1K	Best for
OpenAI GPT-4o-mini	450 req/min	$0.50	High-volume batch
Anthropic Claude Sonnet	45 req/min	$3.00	Research-intensive
Ollama (Qwen 2.5-32B)	Hardware-bound	$0	Privacy-sensitive

At scale (10K docs/day), switching from OpenAI to Ollama saves ~$1,825/year with no code changes.
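
The savings figure follows directly from the per-1K prices in the table. A small worked calculation, assuming 365 processing days per year and counting only the per-record API cost (hardware and power for local inference are not in the table and are not modeled here):

```python
def annual_cost(docs_per_day: int, cost_per_1k: float, days: int = 365) -> float:
    """Yearly API spend given a daily volume and a price per 1,000 records."""
    return docs_per_day / 1000 * cost_per_1k * days


openai_cost = annual_cost(10_000, 0.50)  # 10 thousand-record units * $0.50 * 365 days
ollama_cost = annual_cost(10_000, 0.00)  # local backend, $0 per record
savings = openai_cost - ollama_cost

print(f"OpenAI: ${openai_cost:,.0f}/year, Ollama: ${ollama_cost:,.0f}/year, "
      f"savings: ${savings:,.0f}/year")
```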

Thread-Safe Sliding-Window Rate Limiter

A sliding-window limiter in the spirit of a token bucket: a threading.Lock provides mutual exclusion, and a deque of request timestamps gives an accurate count for the last 60 seconds. It enforces both the request limit and the estimated token budget, because requests per minute is often not the binding constraint.

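import time
from collections import deque
from threading import Lock


class ThreadSafeRateLimiter:
    def __init__(self, max_per_minute: int, tokens_per_request: int,
                 max_tokens_per_minute: int):
        self.lock = Lock()
        self.timestamps = deque()  # monotonic times of requests in the last 60 s
        # Enforce the binding constraint: whichever limit hits first.
        # The token-derived rate keeps a 15% safety margin; max(1, ...) guards
        # against a zero rate, which would block forever.
        self.safe_rate = max(1, min(
            max_per_minute,
            int((max_tokens_per_minute / tokens_per_request) * 0.85)
        ))

    def acquire(self):
        """Block until a request slot is free in the 60-second window."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop timestamps that have left the window
                while self.timestamps and now - self.timestamps[0] >= 60:
                    self.timestamps.popleft()
                if len(self.timestamps) < self.safe_rate:
                    self.timestamps.append(now)
                    return
                sleep_for = 60 - (now - self.timestamps[0])
            # Sleep outside the lock so other threads are not blocked on it
            time.sleep(sleep_for)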

Zero rate limit errors in production, validated across 5,000+ requests, while maintaining 85% of theoretical maximum throughput.
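
To see which limit binds, it helps to compute the safe rate on its own. The sketch below repeats the arithmetic from the constructor; the limits passed in are hypothetical illustration values, not any provider's published quotas.

```python
def safe_requests_per_minute(max_per_minute: int, tokens_per_request: int,
                             max_tokens_per_minute: int, margin: float = 0.85) -> int:
    """Requests/minute allowed once both the request cap and token cap apply."""
    token_bound = int((max_tokens_per_minute / tokens_per_request) * margin)
    return min(max_per_minute, token_bound)


# Hypothetical limits: 500 requests/min and 200,000 tokens/min.
heavy = safe_requests_per_minute(500, 800, 200_000)  # large prompts: token cap binds
light = safe_requests_per_minute(500, 100, 200_000)  # small prompts: request cap binds

print(heavy, light)
```

With 800 tokens per request the token budget allows only 212 requests per minute, well under the nominal 500, which is the situation where a request-only limiter would still trigger 429s.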

Four-Layer Output Validation

LLMs occasionally deviate from required output format. Production systems cannot silently accept corrupted extractions.

Format validation: Check for exactly 9 pipe-separated fields. A miscount triggers an immediate retry with a reinforced prompt.

Type validation: Integers must be integers, enums must match the schema. Coerce where safe ("85" → 85), reject where not.

Range validation: Scores must fall within 0–100. Out-of-range values are clamped with a warning, not rejected.

Fuzzy matching: "ontario", "ON", and "Ontario" all resolve to the same enum value. Case-insensitive, alias-aware matching prevents unnecessary failures.
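
The four layers can be sketched as one small function each. This is an illustration under assumptions, not the project's validator: the PROVINCES alias table and all function names are hypothetical, and the real schema's nine fields are not specified here.

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)

EXPECTED_FIELDS = 9  # from the format rule above
# Hypothetical alias table; keys are lowercased names and abbreviations.
PROVINCES = {"ontario": "Ontario", "on": "Ontario", "quebec": "Quebec", "qc": "Quebec"}


def split_fields(raw: str) -> Optional[List[str]]:
    """Layer 1: exactly nine pipe-separated fields, else None (caller retries)."""
    parts = [p.strip() for p in raw.strip().split("|")]
    return parts if len(parts) == EXPECTED_FIELDS else None


def coerce_int(value: str) -> Optional[int]:
    """Layer 2: coerce "85" to 85; return None for anything that is not an integer."""
    try:
        return int(value)
    except ValueError:
        return None


def clamp_score(score: int, low: int = 0, high: int = 100) -> int:
    """Layer 3: clamp into [low, high] and log a warning instead of rejecting."""
    clamped = max(low, min(high, score))
    if clamped != score:
        logger.warning("score %d out of range, clamped to %d", score, clamped)
    return clamped


def match_province(value: str) -> Optional[str]:
    """Layer 4: case-insensitive lookup against known names and aliases."""
    return PROVINCES.get(value.strip().lower())
```

Returning None from the first two layers keeps the retry decision with the caller, while clamping and alias matching repair a value in place so a usable record is not discarded.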

Backend	First-pass accuracy
Ollama (Qwen 2.5-32B)	98.1%
OpenAI (GPT-4o-mini)	99.2%
Anthropic (Claude Sonnet)	99.5%

Results

Metric	Manual	With System	Improvement
Processing time (1K docs)	~100 hours	~3 minutes	2,000× faster
Cost per 1K docs	$2,000 (labour)	$0.50 (OpenAI) / $0 (Ollama)	99.98% reduction
Extraction accuracy	~95% (human error)	98%+ validated	+3 percentage points
Throughput	10/hour	450/minute	2,700× increase

Engineering Learnings

More workers is not always faster

20 workers underperform 10 for I/O-bound LLM calls due to thread-switch overhead. The optimal worker count was empirically about 2× CPU cores in this pipeline; always measure, never assume linearity.
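
Measuring is cheap. The harness below times the same batch at several pool sizes with ThreadPoolExecutor. It is a sketch: fake_llm_call is a hypothetical stand-in that only sleeps, so it will not reproduce the slowdown described above; replace it with a real, rate-limited API call to find the knee for a given machine and backend.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(i: int) -> int:
    """Stand-in for a network-bound request; sleeps instead of calling an API."""
    time.sleep(0.01)
    return i


def time_batch(workers: int, n_tasks: int = 40) -> float:
    """Run n_tasks through a pool of `workers` threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_llm_call, range(n_tasks)))
    assert results == list(range(n_tasks))  # pool.map preserves input order
    return time.perf_counter() - start


timings = {w: time_batch(w) for w in (2, 5, 10, 20)}
for workers, seconds in timings.items():
    print(f"{workers:>2} workers: {seconds:.3f}s")
```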

Token budgets bind harder than request counts

Rate limit errors on well-tuned request counts usually trace to the token-per-minute cap. Dual enforcement on both constraints is necessary.

LLMs require defensive parsing

Without strict schema enforcement and fuzzy matching, ~2% of responses fail to parse on first attempt. Design for this — assume models will deviate.
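
Designing for deviation usually means a bounded retry loop around the parser. A minimal sketch, assuming the model is reachable through a plain callable: call_model, REINFORCEMENT, and the retry count are hypothetical, and the fake model below fails once on purpose to exercise the retry path.

```python
from typing import Callable, List, Optional

EXPECTED_FIELDS = 9
# Hypothetical wording appended to the prompt after a failed parse.
REINFORCEMENT = "\nReturn exactly 9 pipe-separated fields and nothing else."


def parse(raw: str) -> Optional[List[str]]:
    """Return the fields if the count is right, otherwise None."""
    parts = [p.strip() for p in raw.strip().split("|")]
    return parts if len(parts) == EXPECTED_FIELDS else None


def extract_with_retry(call_model: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> List[str]:
    """Call the model until the response parses or attempts run out."""
    current = prompt
    for _ in range(max_attempts):
        fields = parse(call_model(current))
        if fields is not None:
            return fields
        current = prompt + REINFORCEMENT  # reinforce the format on every retry
    raise ValueError(f"no valid response after {max_attempts} attempts")


# Fake model: first response is malformed, second is well-formed.
responses = iter(["only|three|fields", "a|b|c|d|e|f|g|h|i"])
seen_prompts: List[str] = []


def fake_model(prompt: str) -> str:
    seen_prompts.append(prompt)
    return next(responses)


fields = extract_with_retry(fake_model, "Extract the record.")
```

Raising after the final attempt, rather than returning a partial record, keeps malformed rows out of the output dataset.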

Prompt examples improve compliance 20%+

Adding a single well-formed output example to the system prompt reduced first-pass parse failures by over 20% across all three backends.
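
One way to keep that example in sync with the parser is to build the system prompt from the same field count. The sketch below is illustrative only: the instruction wording is hypothetical, and the placeholder example line should be replaced with a realistic record from the actual schema.

```python
FIELD_COUNT = 9
# Placeholder example; a real prompt would show realistic values for each field.
EXAMPLE_LINE = "|".join(f"<field_{i}>" for i in range(1, FIELD_COUNT + 1))


def build_system_prompt(instructions: str, example: str = EXAMPLE_LINE) -> str:
    """Append the output contract and one well-formed example to the instructions."""
    return (
        f"{instructions.strip()}\n\n"
        f"Respond with exactly {FIELD_COUNT} pipe-separated fields on one line.\n"
        f"Example of a well-formed response:\n{example}"
    )


system_prompt = build_system_prompt("Extract the record fields from the document.")
print(system_prompt)
```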

Stack & Quick Start

Python 3.9+ · OpenAI SDK · Anthropic SDK · Ollama · pandas · tiktoken · ThreadPoolExecutor · Pydantic

git clone https://github.com/morteza-mogharrab/llm-document-intelligence.git
cd llm-document-intelligence && pip install -r requirements.txt
cp env.example .env  # Add your API key

python src/document_analyzer_openai.py    # GPT-4o-mini
python src/document_analyzer_anthropic.py # Claude Sonnet
python src/document_analyzer_ollama.py    # Local (free)