Evaluation & Testing Framework
The Evaluation module helps you test your agents and measure output quality automatically.
Overview
- Test Generator - Auto-generate pytest test suites
- Agent Evaluator - Measure output quality with 7 metrics
- Benchmark - Performance testing and comparison
- CLI support - Evaluate agent outputs from the command line
CLI Usage
Basic Evaluation
Evaluate agent output quality directly from the command line:
multi-agent-generator --evaluate \
--query "What is machine learning?" \
--response "Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed."
Output:
📊 Evaluating agent output...
Evaluation Results: ✅ PASSED
==================================================
Query: What is machine learning?
Response: Machine learning is a subset of artificial intelligence that enables computers to learn from data...
Metrics:
• Relevance: 1.00
• Completeness: 0.50
• Coherence: 0.80
• Accuracy: 0.70
• Task Completion: 0.70
• Response Time: 0.00ms
• Token Count: 22
Overall Score: 0.740 (threshold: 0.7)
With Expected Output
Provide an expected output for accuracy comparison:
multi-agent-generator --evaluate \
--query "Explain neural networks" \
--response "Neural networks are computing systems inspired by biological neurons" \
--expected "Neural networks are machine learning models inspired by the human brain"
Custom Threshold
Set a custom passing threshold (default is 0.7):
multi-agent-generator --evaluate \
--query "What is AI?" \
--response "AI stands for Artificial Intelligence" \
--threshold 0.8
Output (when below threshold):
📊 Evaluating agent output...
Evaluation Results: ❌ FAILED
==================================================
Query: What is AI?
Response: AI stands for Artificial Intelligence
Metrics:
• Relevance: 0.70
• Completeness: 0.45
• Coherence: 0.90
• Accuracy: 0.80
• Task Completion: 0.50
Overall Score: 0.670 (threshold: 0.8)
Feedback:
• Response lacks detail and explanation
• Consider providing more context about AI capabilities
Save Results to File
Save evaluation results as JSON:
multi-agent-generator --evaluate \
--query "Summarize machine learning" \
--response "ML is a type of AI that learns from data" \
--output evaluation_results.json
Generated JSON:
{
  "query": "Summarize machine learning",
  "response": "ML is a type of AI that learns from data",
  "metrics": {
    "relevance_score": 0.85,
    "completeness_score": 0.70,
    "coherence_score": 0.90,
    "accuracy_score": 0.80,
    "response_time_ms": 0.0,
    "token_count": 10,
    "task_completion_rate": 0.75,
    "overall_score": 0.800
  },
  "passed": true,
  "feedback": ["Response is relevant but could be more comprehensive"],
  "errors": []
}
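The saved file is easy to consume from other tooling. A minimal sketch (assuming the JSON layout shown above) that fails a CI step when the evaluation did not pass:
import json
import sys

# Load the results written by --output evaluation_results.json
with open("evaluation_results.json") as f:
    result = json.load(f)

# Gate on the "passed" flag and surface any feedback
if not result["passed"]:
    print(f"Evaluation failed (score {result['metrics']['overall_score']}):")
    for note in result.get("feedback", []):
        print(f"  - {note}")
    sys.exit(1)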
Evaluation Metrics Explained
| Metric | Description | CLI Display |
|---|---|---|
| Relevance | How relevant the output is to the query | Relevance: 0.XX |
| Completeness | Whether it covers all required aspects | Completeness: 0.XX |
| Coherence | Whether the output is logically structured | Coherence: 0.XX |
| Accuracy | Factual correctness | Accuracy: 0.XX |
| Task Completion | Whether it fulfilled the request | Task Completion: 0.XX |
| Response Time | Processing time in milliseconds | Response Time: X.XXms |
| Token Count | Number of tokens in response | Token Count: XX |
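In the sample runs above, the overall score appears to be the unweighted mean of the five quality metrics (relevance, completeness, coherence, accuracy, task completion), with response time and token count reported for information only. A quick check against the first example:
# Quality metric values from the first CLI example
scores = [1.00, 0.50, 0.80, 0.70, 0.70]

# (1.00 + 0.50 + 0.80 + 0.70 + 0.70) / 5 = 0.74
overall = sum(scores) / len(scores)
print(round(overall, 2))  # 0.74, matching "Overall Score: 0.740"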
Test Generator
Quick Start
from multi_agent_generator.evaluation import TestGenerator
test_gen = TestGenerator()
# Generate a complete test suite
test_suite = test_gen.generate_test_suite(
    agent_config=your_config,
    test_types=["unit", "integration", "e2e"]
)
# Save to files
test_suite.save("tests/")
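The generated files are plain pytest modules, so they can be run directly with pytest (for example, pytest tests/ -v as in the CI workflow shown later in this document).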
Test Types
| Type | Description | What It Tests |
|---|---|---|
| Unit | Individual component testing | Single agent functions |
| Integration | Multi-agent interaction | Agent communication |
| E2E (End-to-End) | Full workflow validation | Complete pipelines |
| Performance | Response time & throughput | Speed and efficiency |
| Reliability | Error handling & recovery | Edge cases and failures |
| Quality | Output quality metrics | Content accuracy |
Generating Specific Tests
from multi_agent_generator.evaluation import TestGenerator, TestType
test_gen = TestGenerator()
# Generate only unit tests
unit_tests = test_gen.generate_tests(
    agent_config=config,
    test_type=TestType.UNIT
)

# Generate performance tests
perf_tests = test_gen.generate_tests(
    agent_config=config,
    test_type=TestType.PERFORMANCE,
    options={
        "iterations": 100,
        "timeout": 30
    }
)
Generated Test Example
# Generated test file: test_research_agent.py
import pytest
from your_module import ResearchAgent


class TestResearchAgent:
    """Unit tests for ResearchAgent."""

    @pytest.fixture
    def agent(self):
        return ResearchAgent()

    def test_agent_initialization(self, agent):
        """Test agent initializes correctly."""
        assert agent is not None
        assert agent.role == "Researcher"

    def test_agent_responds_to_query(self, agent):
        """Test agent responds to basic query."""
        response = agent.run("What is AI?")
        assert response is not None
        assert len(response) > 0

    def test_agent_handles_empty_input(self, agent):
        """Test agent handles empty input gracefully."""
        with pytest.raises(ValueError):
            agent.run("")
Agent Evaluator
Quick Start
from multi_agent_generator.evaluation import AgentEvaluator
evaluator = AgentEvaluator()
result = evaluator.evaluate(
    agent_output="The market analysis shows growth of 15%...",
    expected_output="Market trends indicate positive growth...",
    task_description="Analyze Q4 sales data"
)
print(f"Overall Score: {result.overall_score}") # 0.0 - 1.0
print(f"Metrics: {result.metrics}")
Evaluation Metrics
| Metric | Description | Range |
|---|---|---|
| Relevance | How relevant the output is to the task | 0.0 - 1.0 |
| Completeness | Whether it covers all required aspects | 0.0 - 1.0 |
| Coherence | Whether the output is logically structured | 0.0 - 1.0 |
| Accuracy | Factual correctness (when verifiable) | 0.0 - 1.0 |
| Conciseness | Appropriate length without redundancy | 0.0 - 1.0 |
| Format | Follows expected format/structure | 0.0 - 1.0 |
| Tone | Appropriate tone for the context | 0.0 - 1.0 |
Detailed Evaluation
from multi_agent_generator.evaluation import AgentEvaluator, EvaluationConfig
evaluator = AgentEvaluator()
# Configure which metrics to use
config = EvaluationConfig(
    metrics=["relevance", "completeness", "accuracy"],
    weights={
        "relevance": 0.4,
        "completeness": 0.3,
        "accuracy": 0.3
    }
)

result = evaluator.evaluate(
    agent_output=output,
    expected_output=expected,
    task_description=task,
    config=config
)
# Access individual metrics
print(result.metrics["relevance"])
print(result.metrics["completeness"])
print(result.metrics["accuracy"])
Batch Evaluation
from multi_agent_generator.evaluation import AgentEvaluator
evaluator = AgentEvaluator()
# Evaluate multiple outputs
test_cases = [
    {"output": "...", "expected": "...", "task": "..."},
    {"output": "...", "expected": "...", "task": "..."},
]
results = evaluator.evaluate_batch(test_cases)
# Get aggregate statistics
avg_score = sum(r.overall_score for r in results) / len(results)
print(f"Average Score: {avg_score}")
Benchmark
Running Benchmarks
from multi_agent_generator.evaluation import Benchmark
benchmark = Benchmark()
# Benchmark a single agent
results = benchmark.run(
    agent=your_agent,
    test_cases=test_cases,
    iterations=10
)
print(f"Avg Response Time: {results.avg_response_time}ms")
print(f"Throughput: {results.throughput} req/s")
print(f"Success Rate: {results.success_rate}%")
Comparing Agents
from multi_agent_generator.evaluation import Benchmark
benchmark = Benchmark()
# Compare multiple agents
comparison = benchmark.compare(
    agents={
        "agent_v1": agent_v1,
        "agent_v2": agent_v2,
    },
    test_cases=test_cases
)
# View comparison report
print(comparison.summary())
Benchmark Metrics
| Metric | Description |
|---|---|
| avg_response_time | Average time to respond (ms) |
| p95_response_time | 95th percentile response time |
| throughput | Requests per second |
| success_rate | Percentage of successful responses |
| error_rate | Percentage of errors |
| avg_quality_score | Average output quality |
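For context, the p95 figure is the response time that 95% of requests stay under. A standalone sketch of how such a percentile can be computed from raw timings (illustrative only, not part of the Benchmark API):
import statistics

# Illustrative raw response times in milliseconds
timings_ms = [120, 135, 150, 160, 180, 210, 240, 300, 320, 900]

# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
p95 = statistics.quantiles(timings_ms, n=20)[18]
print(f"p95 response time: {p95:.0f} ms")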
Integration with CI/CD
Running Tests in CI
# .github/workflows/test.yml
name: Agent Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install -e ".[dev]"
      - run: pytest tests/ -v
Quality Gates
from multi_agent_generator.evaluation import AgentEvaluator
evaluator = AgentEvaluator()
result = evaluator.evaluate(output, expected, task)
# Fail CI if quality is below threshold
assert result.overall_score >= 0.8, f"Quality score {result.overall_score} below threshold"
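One way to wire this gate into the CI workflow above is to express it as a pytest test so it runs with the same pytest tests/ -v step. A minimal sketch, where run_agent is a hypothetical helper that produces the output to check:
from multi_agent_generator.evaluation import AgentEvaluator

def test_agent_quality_gate():
    """Fail CI if agent output quality drops below the threshold."""
    task = "Summarize machine learning"
    expected = "ML is a type of AI that learns from data"
    output = run_agent(task)  # hypothetical helper wrapping your agent

    evaluator = AgentEvaluator()
    result = evaluator.evaluate(output, expected, task)
    assert result.overall_score >= 0.8, f"Quality score {result.overall_score} below threshold"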
API Reference
TestGenerator
| Method | Description |
|---|---|
| generate_test_suite(config, test_types) | Generate complete test suite |
| generate_tests(config, test_type) | Generate specific test type |
AgentEvaluator
| Method | Description |
|---|---|
| evaluate(output, expected, task) | Evaluate single output |
| evaluate_batch(test_cases) | Evaluate multiple outputs |
Benchmark
| Method | Description |
|---|---|
| run(agent, test_cases, iterations) | Run benchmark on agent |
| compare(agents, test_cases) | Compare multiple agents |
EvaluationResult
| Property | Type | Description |
|---|---|---|
| overall_score | float | Overall quality score (0.0-1.0) |
| metrics | dict | Individual metric scores |
| feedback | list[str] | List of feedback strings |
| passed | bool | Whether the overall score met the threshold |