Multi-Model LLM Document Intelligence
Production document extraction pipeline — 450+ records/minute, 98%+ accuracy, and three switchable LLM backends (OpenAI, Anthropic, Ollama local). Built for cost-optimized batch processing at scale.
Overview
Modern enterprises process thousands of documents daily. LLM APIs offer powerful NLU, but production deployment introduces problems that don't exist in prototypes: rate limits hit within seconds, costs spiral unpredictably, and malformed responses corrupt output datasets silently. This system solves each of those problems systematically.
The Engineering Challenges
Rate limiting at scale
Cloud APIs throttle at 50–500 requests/minute. Naive concurrent implementations hit limits within seconds and fail silently.
Cost management
Processing 10K documents costs roughly $5 to $30 in API fees depending on the cloud backend (at the per-1K prices listed below). Without routing logic, cloud bills spiral unpredictably.
Reliability under failure
Network timeouts, API 429s, and malformed responses are guaranteed in production. Systems that don't handle them corrupt output silently.
Vendor lock-in
Single-provider dependency creates risk. API deprecations and price changes require re-engineering instead of a configuration change.
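The "Reliability under failure" problem is commonly handled with bounded retries and exponential backoff. The following is a minimal sketch of that idea, not the project's actual code: `TransientAPIError`, `call_with_retries`, and the delay parameters are hypothetical names chosen for illustration.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a timeout or HTTP 429 from any backend."""


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a transient error, back off exponentially and retry.

    The last error is re-raised once max_attempts is exhausted, so failures
    surface to the caller instead of being dropped silently.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientAPIError:
            if attempt == max_attempts - 1:
                raise
            # Delay doubles on each attempt; up to 10% jitter spreads out
            # retries coming from parallel workers.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A real backend would map its own timeout and rate-limit exceptions onto the retryable case.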
Multi-Backend via Strategy Pattern
    class DocumentAnalyzer:
        """Backend-agnostic document processing interface."""

        def analyze_document(self, row_data: dict) -> dict:
            raise NotImplementedError

    # Caller code: identical for all three backends
    analyzer = get_analyzer(backend=config.BACKEND)  # "openai" | "anthropic" | "ollama"
    result = analyzer.analyze_document(row)

| Backend | Throughput | Cost per 1K | Best for |
|---|---|---|---|
| OpenAI GPT-4o-mini | 450 req/min | $0.50 | High-volume batch |
| Anthropic Claude Sonnet | 45 req/min | $3.00 | Research-intensive |
| Ollama (Qwen 2.5-32B) | Hardware-bound | $0 | Privacy-sensitive |
At scale (10K docs/day), switching from OpenAI to Ollama saves ~$1,825/year with no code changes.
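One way `get_analyzer` could select a backend is a small registry keyed by name. This is a sketch under assumptions: the source does not show `get_analyzer`, and the three subclasses below are hypothetical placeholders that return canned dicts instead of calling a real API.

```python
class DocumentAnalyzer:
    """Backend-agnostic document processing interface."""

    def analyze_document(self, row_data: dict) -> dict:
        raise NotImplementedError


# Hypothetical backends: in a real system each would wrap its provider's client.
class OpenAIAnalyzer(DocumentAnalyzer):
    def analyze_document(self, row_data: dict) -> dict:
        return {"backend": "openai", "input": row_data}


class AnthropicAnalyzer(DocumentAnalyzer):
    def analyze_document(self, row_data: dict) -> dict:
        return {"backend": "anthropic", "input": row_data}


class OllamaAnalyzer(DocumentAnalyzer):
    def analyze_document(self, row_data: dict) -> dict:
        return {"backend": "ollama", "input": row_data}


_BACKENDS = {
    "openai": OpenAIAnalyzer,
    "anthropic": AnthropicAnalyzer,
    "ollama": OllamaAnalyzer,
}


def get_analyzer(backend: str) -> DocumentAnalyzer:
    """Return an analyzer for the configured backend name; fail loudly on typos."""
    try:
        return _BACKENDS[backend]()
    except KeyError:
        raise ValueError(
            f"Unknown backend {backend!r}; expected one of {sorted(_BACKENDS)}"
        ) from None


if __name__ == "__main__":
    analyzer = get_analyzer("ollama")
    print(analyzer.analyze_document({"id": 1}))
```

With a registry like this, switching providers is a change to one configuration value, which is the property the strategy pattern is meant to give.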
Thread-Safe Token Bucket Rate Limiter
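The limiter guards its state with a threading.Lock and keeps request timestamps in a deque that acts as a sliding 60-second window, so the per-minute rate is counted exactly. (Strictly, this is a sliding-window log rather than a classic token bucket, which would refill tokens at a fixed rate.) It enforces both the request count and the estimated token consumption, because requests per minute is often not the binding constraint.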
    import time
    from collections import deque
    from threading import Lock

    class ThreadSafeRateLimiter:
        def __init__(self, max_per_minute: int, tokens_per_request: int,
                     max_tokens_per_minute: int):
            self.lock = Lock()
            self.timestamps = deque()  # monotonic times of requests in the last 60 s
            # Enforce the binding constraint: whichever limit hits first
            safe_rate = min(
                max_per_minute,
                int((max_tokens_per_minute / tokens_per_request) * 0.85)
            )
            self.safe_rate = safe_rate

        def acquire(self):
            # The lock is held while sleeping, so waiting threads queue up behind it
            with self.lock:
                now = time.monotonic()
                # Drop timestamps that have left the 60-second window
                while self.timestamps and now - self.timestamps[0] >= 60:
                    self.timestamps.popleft()
                if len(self.timestamps) >= self.safe_rate:
                    # Window is full: wait until the oldest request expires
                    sleep_for = 60 - (now - self.timestamps[0])
                    time.sleep(sleep_for)
                self.timestamps.append(time.monotonic())

No rate-limit errors in production across 5,000+ requests, while maintaining 85% of the theoretical maximum throughput.
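To show how one limiter instance can gate a pool of worker threads, here is a hedged, self-contained sketch. It uses a compact variant of the limiter with a configurable window so the demo finishes in a fraction of a second. `SlidingWindowLimiter`, `process_all`, and the placeholder "API call" are hypothetical and not taken from the project.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class SlidingWindowLimiter:
    """Allow at most max_calls acquisitions per window_seconds across all threads."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window = window_seconds
        self.lock = Lock()
        self.timestamps = deque()

    def acquire(self) -> None:
        with self.lock:
            while True:
                now = time.monotonic()
                # Forget calls that have left the window.
                while self.timestamps and now - self.timestamps[0] >= self.window:
                    self.timestamps.popleft()
                if len(self.timestamps) < self.max_calls:
                    self.timestamps.append(now)
                    return
                # Window is full: wait until the oldest call expires, then re-check.
                time.sleep(self.window - (now - self.timestamps[0]))


def process_all(rows, limiter, workers=4):
    """Run a placeholder 'API call' for every row, gated by the shared limiter."""

    def worker(row):
        limiter.acquire()  # blocks until a slot is free
        return {"id": row["id"], "status": "done"}  # stand-in for analyze_document(row)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, rows))


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_calls=3, window_seconds=0.2)
    start = time.monotonic()
    results = process_all([{"id": i} for i in range(6)], limiter)
    print(len(results), "rows in", round(time.monotonic() - start, 2), "s")
```

Because every worker calls `acquire()` on the same object, adding threads cannot push the request rate above the configured limit. Extra workers simply wait on the lock.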
Four-Layer Output Validation
LLMs occasionally deviate from required output format. Production systems cannot silently accept corrupted extractions.
| Backend | First-pass accuracy |
|---|---|
| Ollama (Qwen 2.5-32B) | 98.1% |
| OpenAI (GPT-4o-mini) | 99.2% |
| Anthropic (Claude Sonnet) | 99.5% |
Results
| Metric | Manual | With System | Improvement |
|---|---|---|---|
| Processing time (1K docs) | ~100 hours | ~3 minutes | 2,000× faster |
| Cost per 1K docs | $2,000 (labour) | $0.50 (OpenAI) / $0 (Ollama) | 99.98% reduction |
| Extraction accuracy | ~95% (human error) | 98%+ validated | +3 percentage points |
| Throughput | 10/hour | 450/minute | 2,700× increase |
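
The improvement column can be reproduced from the other figures in this document. The snippet below is only that arithmetic. The variable names are illustrative, and the inputs are the numbers stated in the tables plus the 10K docs/day scenario from the backend comparison.

```python
# Processing time for 1K documents: ~100 hours manually vs ~3 minutes.
manual_minutes = 100 * 60
system_minutes = 3
speedup = manual_minutes / system_minutes  # 2000.0

# Cost per 1K documents: $2,000 of labour vs $0.50 on OpenAI.
manual_cost, openai_cost = 2000.00, 0.50
cost_reduction_pct = (1 - openai_cost / manual_cost) * 100  # about 99.975

# Throughput: 10 documents/hour vs 450 documents/minute.
throughput_gain = (450 * 60) / 10  # 2700.0

# Annual OpenAI spend at 10K docs/day, i.e. what moving to Ollama would save.
annual_openai_cost = (10_000 / 1_000) * openai_cost * 365  # 1825.0

print(speedup, round(cost_reduction_pct, 3), throughput_gain, annual_openai_cost)
```

The Ollama saving counts API fees only. Hardware and electricity for local inference are not part of the figures above.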
Engineering Learnings
More workers is not always faster
20 workers underperform 10 for I/O-bound LLM calls because of thread-switch overhead. The optimal worker count was empirically 2× CPU cores. Always measure, and never assume throughput scales linearly with workers.
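A small harness makes this kind of measurement repeatable. The sketch below is an assumption-laden illustration: `simulated_llm_call` only sleeps, and `benchmark_workers` is a hypothetical helper, so the numbers it prints say nothing about the real pipeline until the placeholder is replaced by an actual backend call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_llm_call(_row, latency=0.01):
    """Placeholder for an I/O-bound API call: it only sleeps."""
    time.sleep(latency)
    return True


def benchmark_workers(worker_counts, n_rows=40):
    """Return {worker_count: rows_per_second} for the same workload."""
    rows = list(range(n_rows))
    results = {}
    for workers in worker_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(simulated_llm_call, rows))
        results[workers] = n_rows / (time.perf_counter() - start)
    return results


if __name__ == "__main__":
    for workers, rate in benchmark_workers([1, 5, 10, 20]).items():
        print(f"{workers:>2} workers: {rate:7.1f} rows/s")
```

With a pure `sleep` the curve will look close to linear. The plateau described above only shows up once real parsing, rate limiting, and network behaviour are in the loop, which is the reason to benchmark against the real call.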
Token budgets bind harder than request counts
Rate limit errors on well-tuned request counts usually trace to the token-per-minute cap. Dual enforcement on both constraints is necessary.
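The limiter's `safe_rate` formula makes it easy to see which limit binds. The helper below reuses that formula, including the 0.85 safety factor. `effective_rate` is a hypothetical name, and the limits in the example (500 requests/minute, 200,000 tokens/minute) are illustrative values, not any provider's published quota.

```python
def effective_rate(max_per_minute, tokens_per_request, max_tokens_per_minute,
                   safety=0.85):
    """Return (requests_per_minute, binding_constraint) for the given limits."""
    token_bound = int((max_tokens_per_minute / tokens_per_request) * safety)
    if token_bound < max_per_minute:
        return token_bound, "tokens"
    return max_per_minute, "requests"


if __name__ == "__main__":
    # Long prompts: the token budget allows far fewer calls than the request cap.
    print(effective_rate(500, 1500, 200_000))  # (113, 'tokens')
    # Short prompts: the request cap is reached first.
    print(effective_rate(500, 200, 200_000))   # (500, 'requests')
```

In the first case a limiter that only counted requests would allow 500 calls per minute and exceed the token budget more than four times over.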
LLMs require defensive parsing
Without strict schema enforcement and fuzzy matching, ~2% of responses fail to parse on first attempt. Design for this — assume models will deviate.
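A minimal sketch of defensive parsing, assuming the model is asked for a JSON object. The schema (`REQUIRED_FIELDS`, `ALLOWED_CATEGORIES`) and the function name `parse_response` are hypothetical. The project's actual fields and validation layers are not shown in this document.

```python
import json
import re
from difflib import get_close_matches

# Hypothetical schema: required fields and the allowed labels for one of them.
REQUIRED_FIELDS = {"title", "category"}
ALLOWED_CATEGORIES = ["invoice", "contract", "report"]


def parse_response(raw: str) -> dict:
    """Extract a JSON object from raw model text and validate it.

    Raises ValueError rather than returning a partial record.
    """
    # Models sometimes wrap JSON in prose; take the outermost {...} span.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found")
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc

    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

    # Fuzzy-match near-miss labels such as "Invoices" onto the allowed set.
    label = str(data["category"]).strip().lower()
    close = get_close_matches(label, ALLOWED_CATEGORIES, n=1, cutoff=0.8)
    if not close:
        raise ValueError(f"unrecognised category: {data['category']!r}")
    data["category"] = close[0]
    return data
```

Raising on every failure path lets the caller decide whether to retry or log the row, so a malformed response never reaches the output dataset unnoticed.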
Prompt examples improve compliance 20%+
Adding a single well-formed output example to the system prompt reduced first-pass parse failures by over 20% across all three backends.
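One way to embed such an example is to serialize a known-good record straight into the system prompt. This is an illustrative sketch: the prompt wording, `build_system_prompt`, and the fields in `EXAMPLE_OUTPUT` are assumptions, not the prompt used in the project.

```python
import json

# Hypothetical field names and values, used only for illustration.
EXAMPLE_OUTPUT = {"title": "Q3 supplier invoice", "category": "invoice"}


def build_system_prompt(fields, example=EXAMPLE_OUTPUT):
    """Return a system prompt that states the schema and shows one valid output."""
    return (
        "Extract the following fields from the document: "
        + ", ".join(fields)
        + ".\nRespond with a single JSON object and nothing else.\n"
        + "Example of a valid response:\n"
        + json.dumps(example, indent=2)
    )


if __name__ == "__main__":
    print(build_system_prompt(["title", "category"]))
```

Generating the example with `json.dumps` guarantees that the sample the model sees is itself valid JSON.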
Stack & Quick Start
    git clone https://github.com/morteza-mogharrab/llm-document-intelligence.git
    cd llm-document-intelligence && pip install -r requirements.txt
    cp env.example .env                         # Add your API key
    python src/document_analyzer_openai.py      # GPT-4o-mini
    python src/document_analyzer_anthropic.py   # Claude Sonnet
    python src/document_analyzer_ollama.py      # Local (free)