Architecture Decision Records¶
This document records the key architectural decisions made in ullm's design.
Overview¶
ullm was created to provide a lightweight alternative to litellm specifically for DSPy users who need minimal overhead while maintaining API compatibility.
The Problem¶
litellm is excellent but has significant overhead:
- ~200MB memory footprint
- ~1.2s import time
- 50+ dependencies
- Loads all 100+ provider implementations upfront
The Solution¶
ullm (μLLM) provides:
- Only 4 core providers (OpenAI, Anthropic, Groq, AWS Bedrock)
- API compatibility with litellm
- Modern, efficient tooling
- Minimal dependencies (~2MB footprint, 3 dependencies)
ADR-001: Use httpx for HTTP¶
Decision: Use httpx as the single HTTP library for both sync and async operations.
Rationale:
- Single library reduces dependencies
- Modern, async-first design
- Supports both sync and async with same API
- Smaller footprint than requests + aiohttp
- Excellent streaming support
Status: ✅ Implemented
ADR-002: Use tenacity for retry logic¶
Decision: Use tenacity library for exponential backoff retry logic.
Rationale:
- Well-tested, battle-hardened
- Configurable retry strategies
- Only ~200KB overhead
- Handles edge cases (jitter, max delay, etc.)
Status: ✅ Implemented
ADR-003: No built-in caching¶
Decision: Do not implement caching in ullm itself.
Rationale:
- DSPy already has its own caching
- Caching adds complexity and state
- Would increase memory footprint
- Users can add their own caching layer
- Accept the cache parameter for API compatibility but ignore it
Status: ✅ Implemented (passthrough only)
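One way to picture the passthrough is the sketch below. It assumes a hypothetical `_send` helper in place of the real provider call and a simplified `completion` signature; it is not ullm's actual function.

```python
from typing import Any, Dict, List, Optional


def _send(model: str, messages: List[Dict[str, str]], **params: Any) -> Dict[str, Any]:
    # Hypothetical stand-in for the real provider request.
    return {"model": model, "n_messages": len(messages), "params": params}


def completion(
    model: str,
    messages: List[Dict[str, str]],
    cache: Optional[Any] = None,
    **params: Any,
) -> Dict[str, Any]:
    # `cache` is accepted so litellm-style call sites keep working,
    # but it is never read and never forwarded to the provider.
    return _send(model, messages, **params)
```

Because `cache` is a named parameter, it is captured before `**params` and cannot leak into the request body.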
ADR-004: Pydantic v2 for validation¶
Decision: Use Pydantic v2 for response types and validation.
Rationale:
- Modern, fast (Rust core, 5-50x faster than v1)
- Excellent JSON schema generation
- Type-safe models
- Supports structured output validation
- ~1MB dependency
Status: ✅ Implemented
ADR-005: Provider-specific client classes¶
Decision: Separate client class for each provider.
Rationale:
- Clean separation of concerns
- Each provider has unique API quirks
- Easy to add/remove providers
- Groq can inherit from OpenAI (API-compatible)
- Testable in isolation
Structure:
BaseClient (ABC)
├── OpenAIClient
├── AnthropicClient
├── GroqClient (inherits OpenAIClient)
└── BedrockClient
Status: ✅ Implemented
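The hierarchy above might look roughly like the following sketch. The class names come from the structure diagram; the `build_request` method, the `base_url` attribute, and the placeholder `max_tokens` value are assumptions made for illustration, and BedrockClient is omitted for brevity.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Message = Dict[str, str]


class BaseClient(ABC):
    base_url: str = ""

    @abstractmethod
    def build_request(self, model: str, messages: List[Message]) -> Dict[str, Any]:
        """Translate a chat request into provider-specific URL and JSON body."""


class OpenAIClient(BaseClient):
    base_url = "https://api.openai.com/v1"

    def build_request(self, model: str, messages: List[Message]) -> Dict[str, Any]:
        return {
            "url": f"{self.base_url}/chat/completions",
            "json": {"model": model, "messages": messages},
        }


class GroqClient(OpenAIClient):
    # Groq exposes an OpenAI-compatible API, so only the base URL changes.
    base_url = "https://api.groq.com/openai/v1"


class AnthropicClient(BaseClient):
    base_url = "https://api.anthropic.com/v1"

    def build_request(self, model: str, messages: List[Message]) -> Dict[str, Any]:
        # Anthropic takes the system prompt as a top-level field and requires max_tokens.
        system = [m["content"] for m in messages if m["role"] == "system"]
        chat = [m for m in messages if m["role"] != "system"]
        body: Dict[str, Any] = {"model": model, "messages": chat, "max_tokens": 1024}
        if system:
            body["system"] = "\n".join(system)
        return {"url": f"{self.base_url}/messages", "json": body}
```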
ADR-006: Client Registry Pattern¶
Decision: Use decorator-based registry for client registration instead of if/elif chains.
Rationale:
- Cleaner, more maintainable code
- Each client self-registers with @register_client(provider)
- No need to modify main.py when adding providers
- Better separation of concerns
- No circular dependencies
Implementation:
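A minimal sketch of a decorator-based registry, under assumptions: only the `@register_client(provider)` decorator name comes from this document, while `_CLIENT_REGISTRY`, `get_client`, and the empty client classes are hypothetical.

```python
from typing import Callable, Dict

# Maps provider name -> client class. Populated as a side effect of class definition.
_CLIENT_REGISTRY: Dict[str, type] = {}


def register_client(provider: str) -> Callable[[type], type]:
    def decorator(cls: type) -> type:
        if provider in _CLIENT_REGISTRY:
            raise ValueError(f"Provider {provider!r} is already registered")
        _CLIENT_REGISTRY[provider] = cls
        return cls  # return the class unchanged so it can still be subclassed

    return decorator


def get_client(provider: str) -> type:
    # Hypothetical lookup helper replacing an if/elif chain.
    try:
        return _CLIENT_REGISTRY[provider]
    except KeyError:
        known = ", ".join(sorted(_CLIENT_REGISTRY))
        raise ValueError(f"Unknown provider {provider!r}; registered: {known}") from None


@register_client("openai")
class OpenAIClient:
    pass


@register_client("groq")
class GroqClient(OpenAIClient):
    pass
```

Adding a provider then means defining and decorating one new class. The dispatch code that calls `get_client` does not change.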
Status: ✅ Implemented
ADR-007: Minimal Python 3.8+ support¶
Decision: Target Python 3.8+ (same as litellm).
Rationale:
- Good ecosystem support
- DSPy targets similar range
- Modern syntax (type hints, async/await)
- Still widely used despite 3.8 EOL
Status: ✅ Implemented
ADR-008: Streaming returns iterators¶
Decision: Streaming returns bare Iterator/AsyncIterator, not custom wrapper classes.
Rationale:
- Simpler API
- More Pythonic
- litellm's streaming wrappers add complexity
- Easy iteration: for chunk in response
Status: ✅ Implemented
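To make "bare iterators" concrete, the sketch below uses plain generator functions. The function names and the simplified server-sent-events parsing (a `data: ` prefix and a `[DONE]` sentinel, as in OpenAI-style streams) are illustrative assumptions, not ullm's parser.

```python
import asyncio
from typing import AsyncIterator, Dict, Iterable, Iterator, List


def stream_chunks(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    # A generator function already is an Iterator: no wrapper class needed.
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield {"content": payload}


async def astream_chunks(lines: Iterable[str]) -> AsyncIterator[Dict[str, str]]:
    # The async variant is an async generator with the same shape.
    for chunk in stream_chunks(lines):
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield chunk


async def collect(lines: Iterable[str]) -> List[Dict[str, str]]:
    return [chunk async for chunk in astream_chunks(lines)]
```

Callers consume these with `for chunk in stream_chunks(...)` or `async for chunk in astream_chunks(...)`.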
ADR-009: OpenAI format for tool calling¶
Decision: Always accept and return tools in OpenAI format, convert internally per provider.
Rationale:
- OpenAI format is most common
- Easier for users (one format to learn)
- Internal conversion keeps API clean
- All providers have similar concepts
Status: ✅ Implemented
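As an illustration of the internal conversion, the sketch below maps an OpenAI-format tool definition to the shape Anthropic's Messages API expects (`name`, `description`, `input_schema`). The function name is hypothetical, and ullm's real converter may handle more cases.

```python
from typing import Any, Dict


def openai_tool_to_anthropic(tool: Dict[str, Any]) -> Dict[str, Any]:
    # OpenAI nests the definition under "function"; Anthropic uses a flat object
    # and calls the JSON schema "input_schema" instead of "parameters".
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }
```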
ADR-010: Structured output via Pydantic¶
Decision: Accept response_format as either {"type": "json_object"} or a Pydantic model class.
Rationale:
- Pydantic models provide schema + validation
- Automatic JSON schema generation
- Type-safe parsing
- Better than raw JSON strings
Example:
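from pydantic import BaseModel

import ullm

# Schema that the model's JSON output is validated against
class Person(BaseModel):
    name: str
    age: int

response = ullm.completion(
    model="gpt-4o-mini",
    messages=[...],
    response_format=Person,
)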
Status: ✅ Implemented
ADR-011: Custom exception hierarchy¶
Decision: Define exception hierarchy matching litellm's exceptions.
Rationale:
- Predictable error handling for users
- Easy to catch specific error types
- Compatible with existing DSPy error handling
- Include model and provider info
Hierarchy:
UllmException (base)
├── AuthenticationError (401)
├── BadRequestError (400)
├── RateLimitError (429)
├── Timeout (504)
└── APIError (500+)
Status: ✅ Implemented
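The hierarchy and its status-code mapping could be sketched as follows. The class names and codes follow the tree above; the constructor arguments, attribute names, and the `exception_for_status` helper are assumptions for illustration.

```python
from typing import Dict, Optional, Type


class UllmException(Exception):
    def __init__(
        self,
        message: str,
        model: Optional[str] = None,
        provider: Optional[str] = None,
        status_code: Optional[int] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.model = model  # model and provider travel with the error
        self.provider = provider
        self.status_code = status_code


class AuthenticationError(UllmException):
    """HTTP 401."""


class BadRequestError(UllmException):
    """HTTP 400."""


class RateLimitError(UllmException):
    """HTTP 429."""


class Timeout(UllmException):
    """HTTP 504."""


class APIError(UllmException):
    """HTTP 500 and above."""


_STATUS_TO_EXCEPTION: Dict[int, Type[UllmException]] = {
    401: AuthenticationError,
    400: BadRequestError,
    429: RateLimitError,
    504: Timeout,
}


def exception_for_status(
    status_code: int,
    message: str,
    model: Optional[str] = None,
    provider: Optional[str] = None,
) -> UllmException:
    # Exact matches win; remaining 5xx codes become APIError; anything else is the base type.
    exc_type = _STATUS_TO_EXCEPTION.get(status_code)
    if exc_type is None:
        exc_type = APIError if status_code >= 500 else UllmException
    return exc_type(message, model=model, provider=provider, status_code=status_code)
```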
ADR-012: Use uv for development¶
Decision: Use uv for package management, not pip or poetry.
Rationale:
- Fastest Python package installer (10-100x faster than pip)
- Modern tool by Astral (makers of ruff)
- Simple, no lock files for library
- Growing ecosystem support
Status: ✅ Implemented
ADR-013: Use ruff for linting/formatting¶
Decision: Use ruff as a single tool for linting and formatting.
Rationale:
- 10-100x faster than black + flake8
- Single tool, one config
- Rust-based, well-maintained
- Includes isort, pyupgrade, etc.
Status: ✅ Implemented
ADR-014: Relaxed mypy configuration¶
Decision: Use relaxed mypy settings, not strict mode.
Rationale:
- Some type errors from dynamic provider dispatch
- boto3 has poor type stubs
- Async return types complex with Union[ModelResponse, Iterator]
- Pragmatic approach for MVP
Status: ✅ Implemented
ADR-015: boto3 as optional dependency¶
Decision: Make boto3 an optional dependency via extras: ullm[aws].
Rationale:
- boto3 is large (~20MB)
- Not all users need AWS Bedrock
- Keeps base install small
- Fails gracefully with clear error message
Installation: pip install "ullm[aws]"
Status: ✅ Implemented
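"Fails gracefully with clear error message" can be sketched as a guarded import. The `load_optional` helper and the wording of the message are hypothetical. The real Bedrock client would call something like this for boto3 when it is first used.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, extra: str) -> ModuleType:
    # Import lazily so the base install never needs the optional package.
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this provider but is not installed. "
            f"Install the extra, e.g. 'ullm[{extra}]'."
        ) from exc
```

Calling `load_optional("boto3", "aws")` inside the Bedrock code path keeps `import ullm` cheap for users who never touch AWS.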
Design Principles¶
- Lightweight First: Every line of code and dependency must justify its existence
- Compatibility Second: Stay compatible with litellm where practical, but not at the cost of bloat
- Modern Tooling: Use the best tools of 2025 (uv, ruff, httpx)
- Pragmatic Over Perfect: Ship working code, iterate based on feedback
- Clear Over Clever: Readable code beats clever abstractions
Known Trade-offs¶
1. Limited Provider Support¶
Trade-off: Only 4 providers vs 100+ in litellm
Rationale: Dramatically smaller footprint, easier maintenance. Most DSPy users only need 1-2.
2. No Built-in Caching¶
Trade-off: No response caching built-in
Rationale: DSPy has its own caching. Keeps ullm simpler and smaller.
3. Bedrock Async Uses Thread Pool¶
Trade-off: boto3 is synchronous, so async uses thread pool
Rationale: Functional but not truly async. Could use aioboto3 in future.
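The thread-pool approach can be sketched with the standard library alone. Here `blocking_invoke` is a hypothetical stand-in for a synchronous boto3 call, and `run_in_executor` is used rather than `asyncio.to_thread` because the latter requires Python 3.9+.

```python
import asyncio
import functools
import time
from typing import List


def blocking_invoke(prompt: str) -> str:
    # Hypothetical stand-in for a synchronous boto3 Bedrock call.
    time.sleep(0.05)
    return prompt.upper()


async def ainvoke(prompt: str) -> str:
    # Run the blocking call on the default thread pool so the event loop stays free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, functools.partial(blocking_invoke, prompt))


async def main() -> List[str]:
    # Both calls overlap in separate worker threads.
    return await asyncio.gather(ainvoke("a"), ainvoke("b"))
```

Each in-flight request occupies a worker thread for its full duration, which is why this is functional but not truly async.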
4. No Legacy Text Completion API¶
Trade-off: No text_completion() function
Rationale: 95% of use cases use chat completion. Can add if needed.
Future Considerations¶
Potential additions (priority order):
- Embeddings API (Medium) - DSPy uses embeddings for retrieval
- Image Input Support (Medium) - Vision is becoming more common
- Additional Providers (Low) - Only if users request them
- Cost Tracking (Low) - Useful for monitoring
- Lightweight Caching (Low) - Optional in-memory LRU cache
Things to NOT Add¶
- ❌ Proxy server mode
- ❌ Spend analytics dashboard
- ❌ Fine-tuning API
- ❌ Prompt templates (DSPy handles this)
- ❌ Evaluation frameworks (DSPy handles this)