Featured Project · AI Engineering · Open Source
Production-grade document processing pipeline — 450+ records/minute, 98%+ extraction accuracy, and three LLM backends (OpenAI, Anthropic, Ollama) switchable with zero code changes.
LLM APIs are powerful but naive implementations fail at production scale — hitting rate limits within seconds, incurring unpredictable costs, and producing unparseable output when models deviate from the expected schema. This system solves each of those problems systematically.
Modern enterprises process thousands of documents daily — HR resume screening, contract analysis, financial document classification. Manual review is expensive, slow, and inconsistent. LLM APIs offer powerful NLU, but production deployment introduces a set of engineering problems that don't exist in prototypes.
Cloud APIs throttle at 50–500 requests/minute. Naive concurrent implementations hit limits within seconds and fail silently — or loudly, burning retries.
Processing 10K documents costs $5–$50 depending on backend and model choice. Without cost visibility and routing logic, cloud bills spiral unpredictably.
Network timeouts, API 429s, and malformed responses are guaranteed in production. Systems that don't handle them gracefully corrupt output datasets silently.
Single-provider dependency creates risk — API deprecations, price changes, and capability gaps all require re-engineering instead of a configuration change.
Shared rate limiter state, output buffers, and progress trackers across 10 concurrent threads require careful synchronisation or produce race conditions and data loss.
Multi-Backend LLM Integration via Strategy Pattern
Switch providers with zero code changes — OpenAI, Anthropic, Ollama behind a unified interface
Challenge: Three providers with fundamentally different APIs, rate limits, response formats, and capabilities — unified in a way that lets the caller stay completely ignorant of which backend is running.
Solution: Strategy pattern with backend adapters. Each provider implements the same DocumentAnalyzer interface. The caller selects a backend via config; the rest of the system stays unchanged.
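The config-driven selection can be sketched as a small registry-based factory. This is a sketch, not the project's actual implementation: the registry dict is an assumption, and the analyzer subclasses are stubs standing in for the real SDK adapters.

```python
class DocumentAnalyzer:
    """Backend-agnostic interface (mirrors the project's base class)."""
    def analyze_document(self, row_data: dict) -> dict:
        raise NotImplementedError

# Illustrative stubs; the real adapters wrap each provider's SDK.
class OpenAIAnalyzer(DocumentAnalyzer): ...
class AnthropicAnalyzer(DocumentAnalyzer): ...
class OllamaAnalyzer(DocumentAnalyzer): ...

_REGISTRY = {
    "openai": OpenAIAnalyzer,
    "anthropic": AnthropicAnalyzer,
    "ollama": OllamaAnalyzer,
}

def get_analyzer(backend: str) -> DocumentAnalyzer:
    """Resolve a config string to a concrete analyzer instance."""
    try:
        return _REGISTRY[backend]()
    except KeyError:
        raise ValueError(f"Unknown backend: {backend!r}") from None
```

Registering classes rather than instances keeps construction lazy, so an unused backend's SDK client is never initialised.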
class DocumentAnalyzer:
    """Backend-agnostic document processing interface."""

    def analyze_document(self, row_data: dict) -> dict:
        """
        Returns structured extraction result regardless of backend.
        Concrete implementations: OpenAIAnalyzer, AnthropicAnalyzer, OllamaAnalyzer
        """
        raise NotImplementedError

# Caller code — identical for all three backends
analyzer = get_analyzer(backend=config.BACKEND)  # "openai" | "anthropic" | "ollama"
result = analyzer.analyze_document(row)

High-Performance Concurrent Processing Engine
7.6× throughput improvement — 10 workers, 84% efficiency vs. theoretical maximum
Challenge: Maximise throughput within strict API rate limits while ensuring all shared state (rate limiter, output buffer, progress counter) remains thread-safe across concurrent workers.
Solution: ThreadPoolExecutor with a carefully tuned worker count. I/O-bound throughput plateaus at roughly 2× CPU cores — a 20-worker pool underperforms a 10-worker one due to thread-switch overhead, a counter-intuitive result that only emerged from empirical measurement.
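The concurrency gain behind that tuning is easy to reproduce with a toy simulation. The `fake_llm_call` helper and the 0.02 s latency are invented for illustration; note a pure sleep-based model shows the speed-up from concurrency but not the 20-worker regression, which only appears against a real rate-limited API with lock contention.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_llm_call(_row: dict) -> dict:
    """Stand-in for a network-bound API call (latency is illustrative)."""
    time.sleep(0.02)
    return {"ok": True}

def measure(workers: int, n_tasks: int = 100) -> float:
    """Wall-clock seconds to drain n_tasks with a given pool size."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(fake_llm_call, [{}] * n_tasks))
    return time.monotonic() - start

# Sweep pool sizes and compare wall-clock time:
# for w in (1, 5, 10, 20):
#     print(w, round(measure(w), 2))
```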
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_batch(rows: list[dict], analyzer: DocumentAnalyzer) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Submit all jobs; rate limiter inside analyzer enforces API limits
        futures = {executor.submit(analyzer.analyze_document, row): row for row in rows}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as e:
                log_error(futures[future], e)  # Log and continue — don't crash the batch
    return results

Thread-Safe Token Bucket Rate Limiter
Zero rate limit errors across 5,000+ requests — dual enforcement on requests AND token budget
Challenge: Prevent API throttling across 10 concurrent threads without over-constraining throughput. Requests per minute is only half the picture — token budgets per minute are often the binding constraint on cloud APIs.
Solution: Token bucket algorithm with threading.Lock for mutual exclusion and a sliding deque window for accurate per-minute rate calculation. Dual enforcement on both request count and estimated token consumption, with an 85% safety margin to absorb burst variance.
import time
from threading import Lock
from collections import deque

class ThreadSafeRateLimiter:
    def __init__(self, max_per_minute: int, tokens_per_request: int, max_tokens_per_minute: int):
        self.lock = Lock()
        self.timestamps = deque()  # Sliding window of request times
        # Enforce the binding constraint — whichever limit hits first
        self.safe_rate = min(
            max_per_minute,
            int((max_tokens_per_minute / tokens_per_request) * 0.85)  # 85% safety margin
        )

    def acquire(self):
        with self.lock:
            now = time.monotonic()
            # Evict requests older than 60 seconds from the sliding window
            while self.timestamps and now - self.timestamps[0] >= 60:
                self.timestamps.popleft()
            if len(self.timestamps) >= self.safe_rate:
                sleep_for = 60 - (now - self.timestamps[0])
                time.sleep(sleep_for)
            self.timestamps.append(time.monotonic())

Every mutation of the window happens under threading.Lock — no race conditions on the shared sliding window.

Robust Structured Output Parsing & Validation
Multi-layer defence against LLM format deviation — 98–99.5% parse success across all backends
Challenge: LLMs occasionally deviate from required output format — wrong delimiters, extra whitespace, enum variants with different capitalisation, values outside expected ranges. Production systems cannot silently accept corrupted extractions.
Solution: Four-layer validation pipeline that catches deviations progressively, from structural format down to value semantics. Each layer has its own correction strategy before falling back to retry.
Four-layer validation pipeline

LLM SDKs & models
Core Python
Concurrency & data structures
DevOps & tooling
A 20-worker pool underperforms a 10-worker one for I/O-bound LLM calls due to thread-switch overhead and lock contention on the rate limiter. The optimal worker count is empirically ~2× CPU cores — always measure, never assume linearity.
Rate limit errors on well-tuned request counts usually trace to the token-per-minute cap, not the request-per-minute cap. Dual enforcement on both constraints is necessary — one without the other gives a false sense of safety.
Without strict schema enforcement and fuzzy matching, ~2% of responses fail to parse on first attempt. Multi-layer validation with graceful retry catches nearly all of them — assume models will deviate, design accordingly.
The Strategy pattern abstraction took under 50 lines. The flexibility benefit — switching between a $0/record local model and a $0.50/record cloud model on demand — pays for itself on any non-trivial workload.
Adding a single well-formed output example to the system prompt reduced first-pass parse failures by over 20% across all three backends. Schema description alone is insufficient — show the model exactly what success looks like.
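Applied concretely, that lesson looks roughly like this. The schema text and example values are invented for illustration, not the project's actual prompt:

```python
SCHEMA_DESCRIPTION = (
    "Respond with exactly one line in the form: category|confidence\n"
    "category is one of: invoice, contract, resume. "
    "confidence is a number between 0.0 and 1.0."
)

# One well-formed example shows the model exactly what success looks like,
# which a schema description alone does not.
ONE_SHOT_EXAMPLE = "Example of a correct response:\ninvoice|0.94"

def build_system_prompt() -> str:
    """Combine schema description with a single well-formed output example."""
    return f"{SCHEMA_DESCRIPTION}\n\n{ONE_SHOT_EXAMPLE}"
```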
Comprehensive README and architecture docs reduced setup friction to zero and enabled contributors to understand design decisions without asking questions. Benchmarks help users pick the right backend for their workload.