Evaluation & Testing Framework

The Evaluation module helps you test your agents and measure output quality automatically.

Overview

  • Test Generator - Auto-generate pytest test suites
  • Agent Evaluator - Measure output quality with 7 metrics
  • Benchmark - Performance testing and comparison
  • CLI support - Evaluate agent outputs from the command line

CLI Usage

Basic Evaluation

Evaluate agent output quality directly from the command line:

multi-agent-generator --evaluate \
  --query "What is machine learning?" \
  --response "Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed."

Output:

📊 Evaluating agent output...

Evaluation Results: ✅ PASSED
==================================================
Query: What is machine learning?
Response: Machine learning is a subset of artificial intelligence that enables computers to learn from data...

Metrics:
  • Relevance:        1.00
  • Completeness:     0.50
  • Coherence:        0.80
  • Accuracy:         0.70
  • Task Completion:  0.70
  • Response Time:    0.00ms
  • Token Count:      22

Overall Score: 0.740 (threshold: 0.7)

With Expected Output

Provide an expected output for accuracy comparison:

multi-agent-generator --evaluate \
  --query "Explain neural networks" \
  --response "Neural networks are computing systems inspired by biological neurons" \
  --expected "Neural networks are machine learning models inspired by the human brain"
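When `--expected` is supplied, the response is also scored against the reference text. As an illustration only, a very simple comparison of this kind can be done with token overlap; the `token_overlap` function below is a hypothetical sketch, not the library's actual accuracy metric, which is not documented here and may differ.

```python
# Illustrative only: a naive token-overlap score between a response
# and an expected reference. NOT the metric multi-agent-generator uses.

def token_overlap(response, expected):
    """Fraction of expected tokens that also appear in the response."""
    resp_tokens = set(response.lower().split())
    exp_tokens = set(expected.lower().split())
    if not exp_tokens:
        return 0.0
    return len(resp_tokens & exp_tokens) / len(exp_tokens)

score = token_overlap(
    "Neural networks are computing systems inspired by biological neurons",
    "Neural networks are machine learning models inspired by the human brain",
)
print(round(score, 2))  # 0.45
```

Real evaluators typically use semantic similarity rather than raw token overlap, but the idea is the same: the closer the response is to the expected output, the higher the accuracy score.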

Custom Threshold

Set a custom passing threshold (default is 0.7):

multi-agent-generator --evaluate \
  --query "What is AI?" \
  --response "AI stands for Artificial Intelligence" \
  --threshold 0.8

Output (when below threshold):

📊 Evaluating agent output...

Evaluation Results: ❌ FAILED
==================================================
Query: What is AI?
Response: AI stands for Artificial Intelligence

Metrics:
  • Relevance:        0.70
  • Completeness:     0.45
  • Coherence:        0.90
  • Accuracy:         0.80
  • Task Completion:  0.50

Overall Score: 0.670 (threshold: 0.8)

Feedback:
  • Response lacks detail and explanation
  • Consider providing more context about AI capabilities

Save Results to File

Save evaluation results as JSON:

multi-agent-generator --evaluate \
  --query "Summarize machine learning" \
  --response "ML is a type of AI that learns from data" \
  --output evaluation_results.json

Generated JSON:

{
  "query": "Summarize machine learning",
  "response": "ML is a type of AI that learns from data",
  "metrics": {
    "relevance_score": 0.85,
    "completeness_score": 0.70,
    "coherence_score": 0.90,
    "accuracy_score": 0.80,
    "response_time_ms": 0.0,
    "token_count": 10,
    "task_completion_rate": 0.75,
    "overall_score": 0.800
  },
  "passed": true,
  "feedback": ["Response is relevant but could be more comprehensive"],
  "errors": []
}

Evaluation Metrics Explained

| Metric | Description | CLI Display |
| --- | --- | --- |
| Relevance | How relevant the output is to the query | Relevance: 0.XX |
| Completeness | Whether it covers all required aspects | Completeness: 0.XX |
| Coherence | Whether the output is logically structured | Coherence: 0.XX |
| Accuracy | Factual correctness | Accuracy: 0.XX |
| Task Completion | Whether it fulfills the request | Task Completion: 0.XX |
| Response Time | Processing time in milliseconds | Response Time: X.XXms |
| Token Count | Number of tokens in the response | Token Count: XX |
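The overall score in the examples above is consistent with an average of the five quality metrics (response time and token count are informational, not scored). The sketch below shows how such a weighted average could be computed; the exact formula and weights used internally by multi-agent-generator are assumptions here.

```python
# Illustrative only: combining per-metric scores (0.0-1.0) into one
# overall score. The library's actual weighting may differ.

def overall_score(metrics, weights=None):
    """Weighted average of metric scores, rounded to 3 decimals."""
    if weights is None:
        weights = {name: 1.0 for name in metrics}  # equal weighting
    total = sum(weights.values())
    return round(sum(metrics[m] * weights.get(m, 0.0) for m in metrics) / total, 3)

scores = {
    "relevance": 1.00,
    "completeness": 0.50,
    "coherence": 0.80,
    "accuracy": 0.70,
    "task_completion": 0.70,
}
score = overall_score(scores)
print(score)           # 0.74 -- matches the first CLI example
print(score >= 0.7)    # True: passes the default threshold
```

Equal weighting reproduces the 0.740 shown in the first example; passing a `weights` dict lets you emphasize the metrics that matter for your use case.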

Test Generator

Quick Start

from multi_agent_generator.evaluation import TestGenerator

test_gen = TestGenerator()

# Generate a complete test suite
test_suite = test_gen.generate_test_suite(
    agent_config=your_config,
    test_types=["unit", "integration", "e2e"]
)

# Save to files
test_suite.save("tests/")

Test Types

| Type | Description | What It Tests |
| --- | --- | --- |
| Unit | Individual component testing | Single agent functions |
| Integration | Multi-agent interaction | Agent communication |
| E2E (End-to-End) | Full workflow validation | Complete pipelines |
| Performance | Response time & throughput | Speed and efficiency |
| Reliability | Error handling & recovery | Edge cases and failures |
| Quality | Output quality metrics | Content accuracy |

Generating Specific Tests

from multi_agent_generator.evaluation import TestGenerator, TestType

test_gen = TestGenerator()

# Generate only unit tests
unit_tests = test_gen.generate_tests(
    agent_config=config,
    test_type=TestType.UNIT
)

# Generate performance tests
perf_tests = test_gen.generate_tests(
    agent_config=config,
    test_type=TestType.PERFORMANCE,
    options={
        "iterations": 100,
        "timeout": 30
    }
)

Generated Test Example

# Generated test file: test_research_agent.py
import pytest
from your_module import ResearchAgent

class TestResearchAgent:
    """Unit tests for ResearchAgent."""

    @pytest.fixture
    def agent(self):
        return ResearchAgent()

    def test_agent_initialization(self, agent):
        """Test agent initializes correctly."""
        assert agent is not None
        assert agent.role == "Researcher"

    def test_agent_responds_to_query(self, agent):
        """Test agent responds to basic query."""
        response = agent.run("What is AI?")
        assert response is not None
        assert len(response) > 0

    def test_agent_handles_empty_input(self, agent):
        """Test agent handles empty input gracefully."""
        with pytest.raises(ValueError):
            agent.run("")

Agent Evaluator

Quick Start

from multi_agent_generator.evaluation import AgentEvaluator

evaluator = AgentEvaluator()

result = evaluator.evaluate(
    agent_output="The market analysis shows growth of 15%...",
    expected_output="Market trends indicate positive growth...",
    task_description="Analyze Q4 sales data"
)

print(f"Overall Score: {result.overall_score}")  # 0.0 - 1.0
print(f"Metrics: {result.metrics}")

Evaluation Metrics

| Metric | Description | Range |
| --- | --- | --- |
| Relevance | How relevant the output is to the task | 0.0 - 1.0 |
| Completeness | Whether it covers all required aspects | 0.0 - 1.0 |
| Coherence | Whether the output is logically structured | 0.0 - 1.0 |
| Accuracy | Factual correctness (when verifiable) | 0.0 - 1.0 |
| Conciseness | Appropriate length without redundancy | 0.0 - 1.0 |
| Format | Follows the expected format/structure | 0.0 - 1.0 |
| Tone | Appropriate tone for the context | 0.0 - 1.0 |

Detailed Evaluation

from multi_agent_generator.evaluation import AgentEvaluator, EvaluationConfig

evaluator = AgentEvaluator()

# Configure which metrics to use
config = EvaluationConfig(
    metrics=["relevance", "completeness", "accuracy"],
    weights={
        "relevance": 0.4,
        "completeness": 0.3,
        "accuracy": 0.3
    }
)

result = evaluator.evaluate(
    agent_output=output,
    expected_output=expected,
    task_description=task,
    config=config
)

# Access individual metrics
print(result.metrics["relevance"])
print(result.metrics["completeness"])
print(result.metrics["accuracy"])

Batch Evaluation

from multi_agent_generator.evaluation import AgentEvaluator

evaluator = AgentEvaluator()

# Evaluate multiple outputs
test_cases = [
    {"output": "...", "expected": "...", "task": "..."},
    {"output": "...", "expected": "...", "task": "..."},
]

results = evaluator.evaluate_batch(test_cases)

# Get aggregate statistics
avg_score = sum(r.overall_score for r in results) / len(results)
print(f"Average Score: {avg_score}")

Benchmark

Running Benchmarks

from multi_agent_generator.evaluation import Benchmark

benchmark = Benchmark()

# Benchmark a single agent
results = benchmark.run(
    agent=your_agent,
    test_cases=test_cases,
    iterations=10
)

print(f"Avg Response Time: {results.avg_response_time}ms")
print(f"Throughput: {results.throughput} req/s")
print(f"Success Rate: {results.success_rate}%")

Comparing Agents

from multi_agent_generator.evaluation import Benchmark

benchmark = Benchmark()

# Compare multiple agents
comparison = benchmark.compare(
    agents={
        "agent_v1": agent_v1,
        "agent_v2": agent_v2,
    },
    test_cases=test_cases
)

# View comparison report
print(comparison.summary())

Benchmark Metrics

| Metric | Description |
| --- | --- |
| avg_response_time | Average time to respond (ms) |
| p95_response_time | 95th percentile response time (ms) |
| throughput | Requests per second |
| success_rate | Percentage of successful responses |
| error_rate | Percentage of errors |
| avg_quality_score | Average output quality score |
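These statistics can all be derived from raw per-request latencies and an error count. The sketch below shows one plausible way to compute them (nearest-rank p95, sequential throughput); the library's internal implementation is an assumption and may differ.

```python
# Illustrative only: deriving benchmark statistics from raw latencies.
import math

def summarize(latencies_ms, errors=0):
    """Compute summary stats from a list of per-request latencies (ms)."""
    n = len(latencies_ms)
    ordered = sorted(latencies_ms)
    # Nearest-rank method for the 95th percentile.
    p95 = ordered[max(0, math.ceil(0.95 * n) - 1)]
    total_s = sum(latencies_ms) / 1000.0
    return {
        "avg_response_time": sum(latencies_ms) / n,
        "p95_response_time": p95,
        "throughput": n / total_s if total_s else 0.0,  # req/s, sequential
        "success_rate": 100.0 * (n - errors) / n,
        "error_rate": 100.0 * errors / n,
    }

stats = summarize([90, 100, 110, 120, 500], errors=1)
print(stats["p95_response_time"])  # 500
print(stats["success_rate"])       # 80.0
```

Note how the p95 latency (500ms) exposes a slow outlier that the average (184ms) hides, which is why benchmarks report both.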

Integration with CI/CD

Running Tests in CI

# .github/workflows/test.yml
name: Agent Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install -e ".[dev]"
      - run: pytest tests/ -v

Quality Gates

from multi_agent_generator.evaluation import AgentEvaluator

evaluator = AgentEvaluator()
result = evaluator.evaluate(output, expected, task)

# Fail CI if quality is below threshold
assert result.overall_score >= 0.8, f"Quality score {result.overall_score} below threshold"

API Reference

TestGenerator

| Method | Description |
| --- | --- |
| generate_test_suite(config, test_types) | Generate a complete test suite |
| generate_tests(config, test_type) | Generate tests of a specific type |

AgentEvaluator

| Method | Description |
| --- | --- |
| evaluate(output, expected, task) | Evaluate a single output |
| evaluate_batch(test_cases) | Evaluate multiple outputs |

Benchmark

| Method | Description |
| --- | --- |
| run(agent, test_cases, iterations) | Run a benchmark on one agent |
| compare(agents, test_cases) | Compare multiple agents |

EvaluationResult

| Property | Type | Description |
| --- | --- | --- |
| overall_score | float | Overall quality score (0.0-1.0) |
| metrics | dict | Individual metric scores |
| feedback | list[str] | Feedback messages |
| passed | bool | Whether the score met the threshold |
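For reference, the table above maps onto a result object shaped roughly like the dataclass below. This is an illustrative sketch inferred from the JSON output earlier in this page, not the actual class definition in multi_agent_generator, which may carry additional fields.

```python
# Illustrative shape of an evaluation result; the real EvaluationResult
# class in multi_agent_generator may differ.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EvaluationResult:
    overall_score: float                            # 0.0 - 1.0
    metrics: Dict[str, float] = field(default_factory=dict)
    feedback: List[str] = field(default_factory=list)
    passed: bool = False

result = EvaluationResult(
    overall_score=0.8,
    metrics={"relevance": 0.85, "accuracy": 0.80},
    feedback=["Response is relevant but could be more comprehensive"],
    passed=True,
)
print(result.passed and result.overall_score >= 0.7)  # True
```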