Featured Project · AI Engineering · Open Source

Multi-Model LLM Document Intelligence

Production-grade document processing pipeline — 450+ records/minute, 98%+ extraction accuracy, and three LLM backends (OpenAI, Anthropic, Ollama) switchable with zero code changes.

450+records / minute
98%+extraction accuracy
3LLM backends
$0–$0.50per 1K records
Project typeOpen-Source Production System
RoleSolo Developer & Architect
TimelineDecember 2024 (Active)
Core stackPython · OpenAI · Anthropic · Ollama

Project Overview

Throughput450+ records/minute (cloud) · 60/minute (local Ollama)
Accuracy98%+ structured extraction reliability validated across all three backends
Backends3 — OpenAI, Anthropic, Ollama — switchable with zero code changes via Strategy pattern
Concurrency10× worker parallelisation with thread-safe token bucket rate limiting
Cost range$0/1K records (local Ollama) to $0.50/1K (OpenAI GPT-4o-mini)
Codebase1,000+ lines of production Python · modular architecture · <200 lines per module

The core problem

LLM APIs are powerful but naive implementations fail at production scale — hitting rate limits within seconds, incurring unpredictable costs, and producing unparseable output when models deviate from the expected schema. This system solves each of those problems systematically.

The Engineering Challenge

Modern enterprises process thousands of documents daily — HR resume screening, contract analysis, financial document classification. Manual review is expensive, slow, and inconsistent. LLM APIs offer powerful NLU, but production deployment introduces a set of engineering problems that don't exist in prototypes.

Rate limiting at scale

Cloud APIs throttle at 50–500 requests/minute. Naive concurrent implementations hit limits within seconds and fail silently — or loudly, burning retries.

Cost management

Processing 10K documents costs $5–$50 depending on backend and model choice. Without cost visibility and routing logic, cloud bills spiral unpredictably.

Reliability under failure

Network timeouts, API 429s, and malformed responses are guaranteed in production. Systems that don't handle them gracefully corrupt output datasets silently.

Vendor lock-in

Single-provider dependency creates risk — API deprecations, price changes, and capability gaps all require re-engineering instead of a configuration change.

Thread safety in concurrent processing

Shared rate limiter state, output buffers, and progress trackers across 10 concurrent threads require careful synchronisation or produce race conditions and data loss.

Technical Requirements

Functional

  • Process structured datasets (CSV/Excel) from 1K to 100K records
  • Extract 9+ fields per document with strict schema validation
  • Support multiple LLM backends for cost and performance optimisation
  • Achieve 95%+ extraction accuracy with automatic retry on parse failure
  • Handle 100+ concurrent requests without rate limit failures

Non-Functional

  • Performance: sub-second latency per document, 400+ records/minute
  • Scalability: linear scaling from 100 to 100K records
  • Security: zero hardcoded credentials, environment-based config
  • Maintainability: modular architecture, <200 lines per module
  • Observability: logging, error tracking, per-backend performance metrics

Core Technical Contributions

01

Multi-Backend LLM Integration via Strategy Pattern

Switch providers with zero code changes — OpenAI, Anthropic, Ollama behind a unified interface

Challenge: Three providers with fundamentally different APIs, rate limits, response formats, and capabilities — unified in a way that lets the caller stay completely ignorant of which backend is running.

Solution: Strategy pattern with backend adapters. Each provider implements the same DocumentAnalyzer interface. The caller selects a backend via config; the rest of the system stays unchanged.

Unified interface
class DocumentAnalyzer: """Backend-agnostic document processing interface.""" def analyze_document(self, row_data: dict) -> dict: """ Returns structured extraction result regardless of backend. Concrete implementations: OpenAIAnalyzer, AnthropicAnalyzer, OllamaAnalyzer """ raise NotImplementedError # Caller code — identical for all three backends analyzer = get_analyzer(backend=config.BACKEND) # "openai" | "anthropic" | "ollama" result = analyzer.analyze_document(row)
Backend comparison
Cloud · Speed

OpenAI GPT-4o-mini

  • 450 requests / minute
  • $0.50 per 1K records
  • Best for: high-volume batch jobs
Cloud · Research

Anthropic Claude Sonnet 4

  • 45 requests / minute (with web search)
  • $3.00 per 1K records
  • Best for: research-intensive analysis
Local · Private

Ollama (Qwen 2.5-32B)

  • No API limits — hardware bound
  • $0 cost (compute only)
  • Best for: privacy-sensitive data
Impact: 99% backend compatibility — swap providers with a single config change. At scale (10K docs/day), switching from OpenAI to Ollama saves ~$1,825/year with no code changes.
02

High-Performance Concurrent Processing Engine

7.6× throughput improvement — 10 workers, 84% efficiency vs. theoretical maximum

Challenge: Maximise throughput within strict API rate limits while ensuring all shared state (rate limiter, output buffer, progress counter) remains thread-safe across concurrent workers.

Solution: ThreadPoolExecutor with carefully tuned worker count. I/O-bound workloads plateau at 2× CPU cores — 20 workers underperforms 10 due to thread-switch overhead, a counter-intuitive result that required empirical measurement to discover.

Concurrent job dispatch
from concurrent.futures import ThreadPoolExecutor, as_completed def process_batch(rows: list[dict], analyzer: DocumentAnalyzer) -> list[dict]: results = [] with ThreadPoolExecutor(max_workers=10) as executor: # Submit all jobs; rate limiter inside analyzer enforces API limits futures = {executor.submit(analyzer.analyze_document, row): row for row in rows} for future in as_completed(futures): try: results.append(future.result()) except Exception as e: log_error(futures[future], e) # Log and continue — don't crash the batch return results
7.6×throughput improvement vs. sequential
84%efficiency vs. theoretical maximum
99.2%reliability across 5K job test
Impact: 5,000-document batch processes in 13 minutes instead of 100+ minutes. Worker count is empirically tuned — more threads actually hurt performance past 10 due to I/O-bound context switching overhead.
03

Thread-Safe Token Bucket Rate Limiter

Zero rate limit errors across 5,000+ requests — dual enforcement on requests AND token budget

Challenge: Prevent API throttling across 10 concurrent threads without over-constraining throughput. Requests per minute is only half the picture — token budgets per minute are often the binding constraint on cloud APIs.

Solution: Token bucket algorithm with threading.Lock for mutual exclusion and a sliding deque window for accurate per-minute rate calculation. Dual enforcement on both request count and estimated token consumption, with an 85% safety margin to absorb burst variance.

Rate limiter implementation
from threading import Lock from collections import deque class ThreadSafeRateLimiter: def __init__(self, max_per_minute: int, tokens_per_request: int, max_tokens_per_minute: int): self.lock = Lock() self.timestamps = deque() # Sliding window of request times # Enforce the binding constraint — whichever limit hits first safe_rate = min( max_per_minute, int((max_tokens_per_minute / tokens_per_request) * 0.85) # 85% safety margin ) self.safe_rate = safe_rate def acquire(self): with self.lock: now = time.monotonic() # Evict requests older than 60 seconds from the sliding window while self.timestamps and now - self.timestamps[0] >= 60: self.timestamps.popleft() if len(self.timestamps) >= self.safe_rate: sleep_for = 60 - (now - self.timestamps[0]) time.sleep(sleep_for) self.timestamps.append(time.monotonic())
  • Dual-limit enforcement — tracks both request count and estimated token consumption per minute.
  • Thread synchronisation via threading.Lock — no race conditions on the shared sliding window.
  • 85% safety margin — absorbs burst variance and prevents throttling on API quota boundaries.
  • Sliding window accuracy — deque-based rolling 60-second window, not a fixed-reset bucket.
Impact: Zero rate limit errors in production, validated across 5,000+ requests. Maintaining 85% of the theoretical maximum throughput — not just safe, but maximally efficient.
04

Robust Structured Output Parsing & Validation

Multi-layer defence against LLM format deviation — 98–99.5% parse success across all backends

Challenge: LLMs occasionally deviate from required output format — wrong delimiters, extra whitespace, enum variants with different capitalisation, values outside expected ranges. Production systems cannot silently accept corrupted extractions.

Solution: Four-layer validation pipeline that catches deviations progressively, from structural format down to value semantics. Each layer has its own correction strategy before falling back to retry.

Four-layer validation pipeline
01Format validation — check for exactly 9 pipe-separated fields. Miscount triggers immediate retry with a reinforced prompt.
02Type validation — integers must be integers, enums must match the schema. Coerce where safe (e.g. "85" → 85), reject where not.
03Range validation — clamp scores to 0–100, probabilities to 0–100. Out-of-range values are clamped with a warning, not rejected.
04Fuzzy matching — "ontario", "ON", and "Ontario" all resolve correctly. Case-insensitive enum matching prevents unnecessary failures on trivial deviations.
Parse accuracy by backend
98.1% Ollama (local Qwen 2.5-32B)
99.2% OpenAI (GPT-4o-mini)
99.5% Anthropic (Claude Sonnet 4)
Impact: 98–99.5% first-pass parse success across all three backends. Remaining failures trigger automatic retry with a corrective prompt — the system self-heals rather than silently corrupting the output dataset.

Results & Impact

MetricManual baselineWith systemImprovement
Processing time (1K docs)~100 hours~3 minutes2,000× faster
Cost per 1K docs$2,000 (labour)$0.50 (OpenAI) / $0 (Ollama)99.98% reduction
Extraction accuracy95% (human error rate)98%+ (validated)3% improvement
Throughput10 / hour450 / minute2,700× increase

Real-World Use Cases

HR tech — resume screening

  • 5,000 job applications processed in 13 minutes
  • 9 strategic fields extracted per application
  • Cost: $2.17 (OpenAI) vs. $10,000+ manual screening

Legal tech — contract analysis

  • 1,000 contracts analysed in under 3 minutes
  • Key clauses extracted, risks flagged automatically
  • Cost: $0 (local Ollama) for data privacy compliance

Finance — invoice processing

  • 10,000 invoices classified in 30 minutes
  • Vendor, amount, date, category extracted per record
  • Cost: $5 (OpenAI) vs. $2,000+ manual data entry

Technology Stack

LLM SDKs & models

OpenAI SDK · GPT-4o-miniAnthropic SDK · Claude Sonnet 4Ollama · Qwen 2.5-32B

Core Python

Python 3.9+pandastiktokenpython-dotenvregex (re)

Concurrency & data structures

ThreadPoolExecutorthreading.Lockcollections.dequeconcurrent.futures

DevOps & tooling

Git / GitHubpip / venvVS CodeMIT License

Engineering Learnings

More workers is not always faster

20 workers underperforms 10 for I/O-bound LLM calls due to thread-switch overhead and lock contention on the rate limiter. Optimal worker count is empirically 2× CPU cores — always measure, never assume linearity.

Token budgets bind harder than request counts

Rate limit errors on well-tuned request counts usually trace to the token-per-minute cap, not the request-per-minute cap. Dual enforcement on both constraints is necessary — one without the other gives a false sense of safety.

LLMs require defensive parsing

Without strict schema enforcement and fuzzy matching, ~2% of responses fail to parse on first attempt. Multi-layer validation with graceful retry catches nearly all of them — assume models will deviate, design accordingly.

Multi-backend abstraction cost is negligible

The Strategy pattern abstraction took under 50 lines. The flexibility benefit — switching between a $0/record local model and a $0.50/record cloud model on demand — pays for itself on any non-trivial workload.

Prompt examples improve compliance 20%+

Adding a single well-formed output example to the system prompt reduced first-pass parse failures by over 20% across all three backends. Schema description alone is insufficient — show the model exactly what success looks like.

Documentation drives adoption

Comprehensive README and architecture docs reduced setup friction to zero and enabled contributors to understand design decisions without asking questions. Benchmarks help users pick the right backend for their workload.