Project Overview
The Challenge
Modern enterprises process thousands of documents daily—from HR resume screening to contract analysis and financial document classification. Manual review is expensive, slow, and error-prone. While LLM APIs offer powerful natural language understanding, production deployment requires solving critical engineering challenges:
Rate Limiting
Cloud APIs throttle requests (50-500/min)—naive implementations fail at scale
Cost Management
Processing 10K documents can cost $5-50 depending on backend choice
Reliability
Network failures, API timeouts, and parsing errors affect production systems
Vendor Lock-in
Single API dependency creates risk and limits optimization opportunities
Thread Safety
Concurrent processing requires careful synchronization to avoid race conditions
Technical Requirements
Functional
- Process structured datasets (CSV/Excel) with 1K-100K records
- Extract 9+ fields per document with strict schema validation
- Support multiple LLM backends for cost/performance optimization
- Achieve >95% extraction accuracy with automatic retry logic
- Handle 100+ concurrent requests without rate limit failures
Non-Functional
- Performance: Sub-second latency per document, 400+ records/minute throughput
- Scalability: Linear scaling from 100 to 100K records
- Security: Zero hardcoded credentials, environment-based config (see the sketch after this list)
- Maintainability: Modular architecture with <200 lines per module
- Observability: Comprehensive logging, error tracking, performance metrics
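For the security requirement, a minimal sketch of environment-based configuration, assuming the conventional OPENAI_API_KEY variable name rather than the project's actual setting:

import os

# Fail fast if the key is missing; credentials never live in source code
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]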
Core Technical Contributions
Multi-Backend LLM Integration Architecture
Challenge: Integrate three fundamentally different LLM providers with distinct APIs, response formats, and capabilities.
Solution: Implemented Strategy Pattern with backend adapters:
# Unified interface across backends
class DocumentAnalyzer:
    def analyze_document(self, row_data):
        # Backend-agnostic processing
        pass
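To make the pattern concrete, here is a minimal sketch of how backend adapters could plug into that interface; the class and method names below are illustrative assumptions, not the project's actual API:

from abc import ABC, abstractmethod

class LLMBackend(ABC):
    """Adapter interface each provider implements (illustrative)."""

    @abstractmethod
    def complete(self, prompt: str) -> str:
        ...

class OpenAIBackend(LLMBackend):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the OpenAI SDK here")

class OllamaBackend(LLMBackend):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the local Ollama server here")

class DocumentAnalyzer:
    def __init__(self, backend: LLMBackend):
        self.backend = backend  # swap providers without touching call sites

    def analyze_document(self, row_data: dict) -> str:
        prompt = f"Extract the required fields from: {row_data}"
        return self.backend.complete(prompt)

Switching providers then reduces to constructing a different adapter, which keeps the per-backend tradeoffs below out of the processing code.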
OpenAI GPT-4o-mini
- 450 requests/minute rate limit
- $0.50 per 1K records
- Best for: High-volume batch processing
Anthropic Claude Sonnet 4
- 45 requests/minute (multi-turn + web search)
- $3.00 per 1K records
- Best for: Research-intensive analysis
Ollama (Local Qwen 2.5-32B)
- No API limits (hardware-bound)
- $0 cost (compute only)
- Best for: Privacy-sensitive data
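Taken together, these rates anchor the $5-50 range cited earlier: a 10,000-record batch costs roughly 10 × $0.50 = $5 on GPT-4o-mini versus 10 × $3.00 = $30 on Claude Sonnet 4, while Ollama costs nothing beyond local compute.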
High-Performance Concurrent Processing Engine
Challenge: Maximize throughput within strict API rate limits while ensuring thread safety.
Solution: Implemented ThreadPoolExecutor with intelligent worker management:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(analyzer.analyze_document, r) for r in rows]
    results = [f.result() for f in futures]  # surfaces any per-row exception
Thread-Safe Token Bucket Rate Limiter
Challenge: Prevent API throttling across concurrent threads while maximizing utilization.
Solution: Implemented token bucket algorithm with thread safety:
from collections import deque
from threading import Lock

class ThreadSafeRateLimiter:
    def __init__(self, max_per_minute, tokens_per_request, max_tokens_per_minute):
        self.lock = Lock()         # Thread-safe access to shared state
        self.timestamps = deque()  # Sliding window of recent request times
        # Dual limits: requests AND tokens, with an 85% safety margin
        self.safe_rate = min(
            max_per_minute,
            (max_tokens_per_minute / tokens_per_request) * 0.85,
        )
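The snippet above only shows initialization. A minimal sketch of how the sliding window could gate callers, assuming a 60-second window over the safe_rate computed above (the acquire method is illustrative, not the project's actual code):

import time

# Illustrative method of ThreadSafeRateLimiter above
def acquire(self):
    """Block until a request slot frees up in the sliding window."""
    while True:
        with self.lock:
            now = time.monotonic()
            # Evict timestamps older than the 60-second window
            while self.timestamps and now - self.timestamps[0] > 60:
                self.timestamps.popleft()
            if len(self.timestamps) < self.safe_rate:
                self.timestamps.append(now)
                return
        time.sleep(0.1)  # Back off briefly without holding the lock

Each worker thread would call acquire() just before its API request, so the pool stays fully subscribed while the limiter meters access to the rate budget.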
Robust Structured Output Parsing & Validation
Challenge: LLMs occasionally deviate from required output format—production systems need 100% parse success.
Solution: Multi-layer validation with fuzzy matching:
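The parsing code itself is not shown here; the following is a minimal sketch under assumed field names, using json for strict parsing and difflib for fuzzy matching of near-miss labels:

import difflib
import json

ALLOWED_CATEGORIES = ["invoice", "contract", "resume"]  # illustrative schema values

def parse_response(raw: str) -> dict:
    # Layer 1: strict JSON parse
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Layer 2: salvage a JSON object embedded in surrounding prose
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model output")
        data = json.loads(raw[start:end + 1])

    # Layer 3: snap near-miss enum values onto the closest allowed label
    category = str(data.get("category", ""))
    match = difflib.get_close_matches(category, ALLOWED_CATEGORIES, n=1, cutoff=0.6)
    data["category"] = match[0] if match else "unknown"
    return data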
Results & Impact
Real-World Use Cases
HR Tech - Resume Screening
- Process 5,000 job applications in 13 minutes
- Extract 9 strategic fields per application
- Cost: $2.17 (OpenAI) vs. $10,000 manual screening
Legal Tech - Contract Analysis
- Analyze 1,000 contracts in 3 minutes
- Extract key clauses, identify risks
- Cost: $0 (local Ollama) for privacy compliance
Financial Services - Invoice Processing
- Classify 10,000 invoices in 30 minutes
- Extract vendor, amount, date, category
- Cost: $5 (OpenAI) vs. $2,000 manual data entry
Technology Stack
Core Technologies
- Python 3, with CSV/Excel dataset processing
LLM SDKs
- OpenAI (GPT-4o-mini), Anthropic (Claude Sonnet 4), Ollama (local Qwen 2.5-32B)
Concurrency & Data Structures
- concurrent.futures.ThreadPoolExecutor, threading.Lock, collections.deque
DevOps & Tools
- pip/requirements.txt, environment-based configuration via .env
Key Engineering Learnings
Concurrency Isn't Free
Learned that 20 workers perform worse than 10 due to synchronization and scheduling overhead. For I/O-bound tasks, 2x CPU cores is a sound starting point for worker count. Always measure; don't assume more workers = better.
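Encoded as a starting point to benchmark rather than a fixed rule, that heuristic might look like this (a sketch):

import os
from concurrent.futures import ThreadPoolExecutor

# Heuristic starting point for I/O-bound work; profile before trusting it
max_workers = 2 * (os.cpu_count() or 4)
executor = ThreadPoolExecutor(max_workers=max_workers)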
Rate Limiting is Critical
Naive implementations hit API limits within seconds. Token budget matters as much as request count. 85% safety margin prevents burst-induced failures.
LLMs Need Constraints
Without strict schema enforcement, models hallucinate formats. Examples in prompts improve compliance by 20%+. Parsing must be defensive—assume models will deviate.
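A sketch of the kind of schema-constrained, example-bearing prompt this implies; the field names and labels are illustrative:

SYSTEM_PROMPT = (
    "Extract the following fields and reply ONLY with a JSON object.\n"
    "Fields: vendor (string), amount (number), date (YYYY-MM-DD),\n"
    "category (one of: invoice, contract, resume).\n"
    'Example: {"vendor": "Acme Corp", "amount": 1250.0, '
    '"date": "2024-03-15", "category": "invoice"}'
)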
Multi-Backend Architecture Pays Off
Switching from OpenAI to Ollama = $5,000/year savings at scale. Anthropic's web search enables use cases OpenAI can't handle. The abstraction costs under 50 lines; the flexibility it buys dwarfs that.
Documentation Drives Adoption
Comprehensive README reduced setup questions to zero. Architecture docs enable contributors to understand design decisions. Performance benchmarks help users choose the right backend.
Explore the Project
Open-source and production-ready. Contributions welcome!
Try It Yourself
git clone https://github.com/morteza-mogharrab/llm-document-intelligence.git
cd llm-document-intelligence
pip install -r requirements.txt
cp env.example .env # Add your API key
python src/document_analyzer_openai.py