LLM Evaluation Frameworks: Moving Beyond Vibes

llmevaluationtestingproduction

You’ve built an LLM feature. It works great in testing. You ship it. Then users report it’s generating nonsense, and you have no systematic way to know what changed or how to prevent it from happening again.

This is the evaluation problem, and it’s why most LLM features languish in “beta” forever.

Why Traditional Testing Fails

Software engineering has solved deterministic testing: same input → same output → pass/fail.

LLMs are probabilistic: same input → different output each time. This breaks everything:

  • Unit tests: Can’t assert on exact strings
  • Integration tests: Can’t predict API responses
  • Regression tests: Can’t detect subtle quality degradation

Most teams respond by not testing at all, or testing manually. Both approaches fail at scale.

What Actually Needs Testing

LLM systems have distinct failure modes that each need evaluation:

1. Correctness (Does it produce accurate information?)

Bad approach: Manual spot-checking

Better approach:

  • Curated test sets with known-correct answers
  • Semantic similarity scoring
  • Automated fact-checking against ground truth

Example:

def test_summarization_accuracy():
    for doc, expected_summary in test_cases:
        generated = model.summarize(doc)
        score = semantic_similarity(generated, expected_summary)
        assert score > 0.85  # Threshold tuned to your tolerance

2. Safety (Does it avoid harmful outputs?)

Critical for: Customer-facing features, content generation, decision support

Test for:

  • Prompt injection resistance
  • Toxicity and bias
  • PII leakage
  • Inappropriate content generation

Tools:

3. Consistency (Does it behave predictably?)

The problem: Temperature > 0 means variance. But too much variance breaks UX.

Measure:

  • Run same prompt N times (N=10-50 depending on criticality)
  • Calculate output diversity metrics
  • Flag if variance exceeds thresholds

Example: Classification task should return same category 95%+ of time, even with temperature=0.3

4. Latency and Cost (Does it meet SLAs?)

Track per evaluation run:

  • P50, P95, P99 latency
  • Token usage (prompt + completion)
  • Cost per request
  • Rate limit hit frequency

Set budgets: “This feature must stay under $0.10 per request and 3s latency”

5. Robustness (Does it handle edge cases?)

Test:

  • Empty inputs
  • Very long inputs (near context limits)
  • Malformed inputs
  • Adversarial inputs
  • Non-English text
  • Special characters and encoding issues

Building an Evaluation Pipeline

Here’s the architecture that works:

Layer 1: Unit Evals (Fast, Cheap, High Signal)

Run on: Every commit, pre-merge

Dataset: 50-200 curated examples covering core functionality

Assertions:

  • Semantic similarity to expected outputs
  • Format validation (valid JSON, correct schema)
  • No safety violations
  • Latency under threshold

Goal: Catch obvious regressions before code review

Layer 2: Integration Evals (Slower, More Comprehensive)

Run on: Before production deploy, nightly on main branch

Dataset: 500-2000 examples including edge cases

Measures:

  • All Layer 1 metrics
  • Cross-example consistency
  • Multi-turn conversation quality
  • Tool calling accuracy
  • Context retention over long exchanges

Goal: Catch subtle quality degradation and interaction effects

Layer 3: Production Monitoring (Continuous, Real User Data)

Track:

  • User satisfaction signals (explicit ratings, implicit engagement)
  • Output flagging frequency
  • Retry rates
  • Downstream task success (e.g., if LLM generates SQL, does query succeed?)

Goal: Detect issues that synthetic evals missed

Practical Implementation

Step 1: Build Your Golden Dataset

This is 80% of the work. You need:

Input/Output Pairs: Real user queries + high-quality expected responses

How to get them:

  • Start with 20-50 manually crafted examples
  • Mine production logs for interesting cases (anonymize first)
  • Expand iteratively when you find bugs

Maintain it: When you fix a bug, add it to the eval set

Step 2: Define Your Metrics

Don’t try to boil everything down to one score. Track:

Quality Metrics:

  • Semantic similarity (embeddings cosine similarity)
  • Format correctness (JSON schema validation, etc.)
  • Factual accuracy (against ground truth)

Safety Metrics:

  • Toxicity scores
  • PII detection
  • Prompt injection attempts

Operational Metrics:

  • Latency (p50, p95, p99)
  • Cost per request
  • Success rate

Step 3: Set Thresholds and Alerts

Example gates for production deploy:

required_metrics:
  semantic_similarity_p50: > 0.85
  format_correctness: > 0.98
  safety_violations: = 0
  p95_latency_ms: < 3000
  cost_per_request: < 0.10

Fail the build if thresholds aren’t met.

Step 4: Make It Fast

Evals that take 20 minutes don’t get run. Optimize:

  • Parallelize: Run examples concurrently
  • Cache: Embed test inputs once, reuse across runs
  • Sample: Run full suite nightly, subset (50 examples) per-commit
  • Use faster models: GPT-4 for production, GPT-3.5 for evals if acceptable

Tools Worth Knowing

Evaluation Frameworks:

DIY Stack:

  • Pytest for test harness
  • OpenAI/Anthropic embeddings for semantic similarity
  • PostgreSQL for storing results
  • Grafana for dashboards

What Actually Matters

After implementing eval systems at multiple companies:

Coverage > Perfection: 100 imperfect test cases beats 10 perfect ones

Speed > Comprehensiveness: Evals you actually run beat thorough evals you skip

Trends > Absolute Scores: Tracking score deltas catches more bugs than threshold checks

Production Data > Synthetic: Real user failures are your best test cases

The Workflow That Works

  1. Build feature with evals from day one (not “we’ll add tests later”)
  2. Run fast eval suite pre-commit (< 2 min)
  3. Run full eval suite pre-deploy (< 15 min)
  4. Monitor production metrics continuously
  5. When bugs appear, add to eval set immediately
  6. Review eval results weekly (are thresholds still relevant?)

Common Pitfalls

Overfitting to evals: Optimizing for your test set ≠ improving real performance

Eval set drift: Your test cases become stale as product evolves

Ignoring flaky tests: LLM variance means some tests will fail randomly; need statistical thresholds

No baseline: Track model version and prompt changes; you need comparison points

Where to Start

If you have nothing today:

  1. Create 20 input/output examples covering your core use cases
  2. Write a script that runs them and checks semantic similarity
  3. Run it manually before each deploy
  4. Expand from there

Don’t wait for the perfect eval framework. Start measuring something today.

We Can Help

Building rigorous evaluation pipelines for production LLM systems is what we do. If you’re shipping AI features and need systematic quality assurance:

Contact us to discuss evaluation strategy for your specific application.