LLM Evaluation Frameworks: Moving Beyond Vibes
You’ve built an LLM feature. It works great in testing. You ship it. Then users report it’s generating nonsense, and you have no systematic way to know what changed or how to prevent it from happening again.
This is the evaluation problem, and it’s why most LLM features languish in “beta” forever.
Why Traditional Testing Fails
Software engineering has solved deterministic testing: same input → same output → pass/fail.
LLMs are probabilistic: same input → different output each time. This breaks everything:
- Unit tests: Can’t assert on exact strings
- Integration tests: Can’t predict API responses
- Regression tests: Can’t detect subtle quality degradation
Most teams respond by not testing at all, or testing manually. Both approaches fail at scale.
What Actually Needs Testing
LLM systems have distinct failure modes that each need evaluation:
1. Correctness (Does it produce accurate information?)
Bad approach: Manual spot-checking
Better approach:
- Curated test sets with known-correct answers
- Semantic similarity scoring
- Automated fact-checking against ground truth
Example:
def test_summarization_accuracy():
for doc, expected_summary in test_cases:
generated = model.summarize(doc)
score = semantic_similarity(generated, expected_summary)
assert score > 0.85 # Threshold tuned to your tolerance
2. Safety (Does it avoid harmful outputs?)
Critical for: Customer-facing features, content generation, decision support
Test for:
- Prompt injection resistance
- Toxicity and bias
- PII leakage
- Inappropriate content generation
Tools:
- LLM Guard
- NeMo Guardrails
- Custom moderation pipelines
3. Consistency (Does it behave predictably?)
The problem: Temperature > 0 means variance. But too much variance breaks UX.
Measure:
- Run same prompt N times (N=10-50 depending on criticality)
- Calculate output diversity metrics
- Flag if variance exceeds thresholds
Example: Classification task should return same category 95%+ of time, even with temperature=0.3
4. Latency and Cost (Does it meet SLAs?)
Track per evaluation run:
- P50, P95, P99 latency
- Token usage (prompt + completion)
- Cost per request
- Rate limit hit frequency
Set budgets: “This feature must stay under $0.10 per request and 3s latency”
5. Robustness (Does it handle edge cases?)
Test:
- Empty inputs
- Very long inputs (near context limits)
- Malformed inputs
- Adversarial inputs
- Non-English text
- Special characters and encoding issues
Building an Evaluation Pipeline
Here’s the architecture that works:
Layer 1: Unit Evals (Fast, Cheap, High Signal)
Run on: Every commit, pre-merge
Dataset: 50-200 curated examples covering core functionality
Assertions:
- Semantic similarity to expected outputs
- Format validation (valid JSON, correct schema)
- No safety violations
- Latency under threshold
Goal: Catch obvious regressions before code review
Layer 2: Integration Evals (Slower, More Comprehensive)
Run on: Before production deploy, nightly on main branch
Dataset: 500-2000 examples including edge cases
Measures:
- All Layer 1 metrics
- Cross-example consistency
- Multi-turn conversation quality
- Tool calling accuracy
- Context retention over long exchanges
Goal: Catch subtle quality degradation and interaction effects
Layer 3: Production Monitoring (Continuous, Real User Data)
Track:
- User satisfaction signals (explicit ratings, implicit engagement)
- Output flagging frequency
- Retry rates
- Downstream task success (e.g., if LLM generates SQL, does query succeed?)
Goal: Detect issues that synthetic evals missed
Practical Implementation
Step 1: Build Your Golden Dataset
This is 80% of the work. You need:
Input/Output Pairs: Real user queries + high-quality expected responses
How to get them:
- Start with 20-50 manually crafted examples
- Mine production logs for interesting cases (anonymize first)
- Expand iteratively when you find bugs
Maintain it: When you fix a bug, add it to the eval set
Step 2: Define Your Metrics
Don’t try to boil everything down to one score. Track:
Quality Metrics:
- Semantic similarity (embeddings cosine similarity)
- Format correctness (JSON schema validation, etc.)
- Factual accuracy (against ground truth)
Safety Metrics:
- Toxicity scores
- PII detection
- Prompt injection attempts
Operational Metrics:
- Latency (p50, p95, p99)
- Cost per request
- Success rate
Step 3: Set Thresholds and Alerts
Example gates for production deploy:
required_metrics:
semantic_similarity_p50: > 0.85
format_correctness: > 0.98
safety_violations: = 0
p95_latency_ms: < 3000
cost_per_request: < 0.10
Fail the build if thresholds aren’t met.
Step 4: Make It Fast
Evals that take 20 minutes don’t get run. Optimize:
- Parallelize: Run examples concurrently
- Cache: Embed test inputs once, reuse across runs
- Sample: Run full suite nightly, subset (50 examples) per-commit
- Use faster models: GPT-4 for production, GPT-3.5 for evals if acceptable
Tools Worth Knowing
Evaluation Frameworks:
- Langsmith - LangChain’s eval platform
- Braintrust - Eval-first development platform
- Weights & Biases - Experiment tracking with LLM support
- PromptLayer - Prompt management + eval
DIY Stack:
- Pytest for test harness
- OpenAI/Anthropic embeddings for semantic similarity
- PostgreSQL for storing results
- Grafana for dashboards
What Actually Matters
After implementing eval systems at multiple companies:
Coverage > Perfection: 100 imperfect test cases beats 10 perfect ones
Speed > Comprehensiveness: Evals you actually run beat thorough evals you skip
Trends > Absolute Scores: Tracking score deltas catches more bugs than threshold checks
Production Data > Synthetic: Real user failures are your best test cases
The Workflow That Works
- Build feature with evals from day one (not “we’ll add tests later”)
- Run fast eval suite pre-commit (< 2 min)
- Run full eval suite pre-deploy (< 15 min)
- Monitor production metrics continuously
- When bugs appear, add to eval set immediately
- Review eval results weekly (are thresholds still relevant?)
Common Pitfalls
Overfitting to evals: Optimizing for your test set ≠ improving real performance
Eval set drift: Your test cases become stale as product evolves
Ignoring flaky tests: LLM variance means some tests will fail randomly; need statistical thresholds
No baseline: Track model version and prompt changes; you need comparison points
Where to Start
If you have nothing today:
- Create 20 input/output examples covering your core use cases
- Write a script that runs them and checks semantic similarity
- Run it manually before each deploy
- Expand from there
Don’t wait for the perfect eval framework. Start measuring something today.
We Can Help
Building rigorous evaluation pipelines for production LLM systems is what we do. If you’re shipping AI features and need systematic quality assurance:
Contact us to discuss evaluation strategy for your specific application.