Skip to main content
← Back to Blog

Testing AI Agents: What QA Engineers Must Know in 2025

AI testing testing AI agents LLM testing

When AI Agents Go Wrong: What Testers Must Know About Validating AI-Generated Content

A customer asks an AI travel agent to book a flight to Chicago. The agent confidently books a flight to Charlotte, charges the wrong credit card, and sends a confirmation email with fabricated flight details. The customer doesn’t discover the mistake until they’re standing at the wrong gate.

This isn’t hypothetical. Stories like this are flooding Hacker News, Reddit, and tech Twitter. AI agents are shipping to production faster than teams can figure out how to test them, and the failures range from embarrassing to genuinely harmful. A Chevrolet dealership’s chatbot agreed to sell a car for one dollar. An air travel chatbot fabricated a refund policy, and the airline was held legally responsible. These aren’t edge cases anymore — they’re Tuesday.

If you’re a QA engineer or developer, testing AI agents is about to become the most critical skill in your toolkit. This guide covers practical strategies for validating AI-generated content, building guardrails, and catching failures before your users do.

Why Traditional Testing Falls Short for AI Agents

Traditional software testing rests on a simple premise: given the same input, you expect the same output. AI agents obliterate that assumption. An LLM-powered agent might respond to the same prompt differently every time, and “differently” can mean anything from a slight rewording to a completely fabricated answer.

This creates three fundamental testing challenges:

  • Non-determinism: Identical inputs produce variable outputs, so exact-match assertions break immediately.
  • Emergent failures: The agent might work perfectly on 999 prompts and hallucinate dangerously on the 1,000th.
  • Composability risk: AI agents often chain multiple LLM calls together, meaning errors compound across steps.

You can’t abandon testing because the system is non-deterministic. You shift your strategy. Instead of testing for exact correctness, you test for boundaries, constraints, and safety properties that must hold regardless of the specific output.

Output Validation Testing: Your First Line of Defense

Output validation testing means verifying that every response an AI agent produces meets structural, factual, and safety requirements — even when you can’t predict the exact wording.

The core idea: define invariants that must always be true, then assert against those.

Schema Validation

Start with structure. If your AI agent returns JSON, validate it against a schema every single time. This catches a surprising number of failures where the LLM produces malformed output, invents new fields, or drops required ones.

# test_agent_output_schema.py
import json
import pytest
from jsonschema import validate, ValidationError

# Define the contract your agent's output must satisfy
BOOKING_SCHEMA = {
    "type": "object",
    "required": ["destination", "departure_date", "price", "currency", "confirmation_id"],
    "properties": {
        "destination": {"type": "string", "minLength": 1},
        "departure_date": {
            "type": "string",
            # ISO 8601 date format — no made-up dates
            "pattern": r"^\d{4}-\d{2}-\d{2}$"
        },
        "price": {"type": "number", "minimum": 0, "maximum": 50000},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "confirmation_id": {
            "type": "string",
            # Must match your real confirmation ID format
            "pattern": r"^[A-Z]{2}\d{6}$"
        },
    },
    "additionalProperties": False  # No hallucinated extra fields
}


class TestAgentOutputSchema:
    """Validate that the AI agent's booking output always conforms to schema."""

    def test_valid_booking_passes_schema(self):
        agent_output = {
            "destination": "Chicago O'Hare (ORD)",
            "departure_date": "2025-03-15",
            "price": 342.50,
            "currency": "USD",
            "confirmation_id": "AB123456"
        }
        # Should not raise — this output is well-formed
        validate(instance=agent_output, schema=BOOKING_SCHEMA)

    def test_missing_required_field_fails(self):
        agent_output = {
            "destination": "Chicago O'Hare (ORD)",
            "departure_date": "2025-03-15",
            # Agent "forgot" to include a price — this must fail
        }
        with pytest.raises(ValidationError):
            validate(instance=agent_output, schema=BOOKING_SCHEMA)

    def test_hallucinated_field_rejected(self):
        agent_output = {
            "destination": "Chicago O'Hare (ORD)",
            "departure_date": "2025-03-15",
            "price": 342.50,
            "currency": "USD",
            "confirmation_id": "AB123456",
            # Agent invented a field that doesn't exist in our system
            "loyalty_points_earned": 500
        }
        with pytest.raises(ValidationError):
            validate(instance=agent_output, schema=BOOKING_SCHEMA)

    def test_impossible_price_rejected(self):
        agent_output = {
            "destination": "Chicago O'Hare (ORD)",
            "departure_date": "2025-03-15",
            "price": -50.00,  # Negative price — the Chevy dealership problem
            "currency": "USD",
            "confirmation_id": "AB123456"
        }
        with pytest.raises(ValidationError):
            validate(instance=agent_output, schema=BOOKING_SCHEMA)

Semantic Validation

Schema checks catch structural problems, but they won’t catch an agent that returns a perfectly formatted response about the wrong city. For that, you need semantic validators — lightweight checks that verify the output makes sense in context.

# test_semantic_validation.py
import pytest
from datetime import date


def validate_destination_matches_request(user_request: str, agent_destination: str) -> bool:
    """Check that the agent's destination is actually what the user asked for.
    
    In production, this could use embedding similarity or a lightweight
    classifier. Here we demonstrate the pattern with keyword matching.
    """
    # Normalize both strings for comparison
    request_lower = user_request.lower()
    destination_lower = agent_destination.lower()

    # Extract city names from common airport formats like "Chicago O'Hare (ORD)"
    destination_city = destination_lower.split("(")[0].strip()

    return destination_city in request_lower or request_lower in destination_city


def validate_date_is_future(departure_date: str) -> bool:
    """Agent should never book flights in the past."""
    parsed = date.fromisoformat(departure_date)
    return parsed > date.today()


def validate_price_is_plausible(price: float, route: str) -> bool:
    """Catch obviously wrong prices using known route baselines.
    
    You'd populate these ranges from historical booking data.
    """
    # Baseline price ranges per route (min, max)
    route_baselines = {
        "NYC-CHI": (80.0, 800.0),
        "NYC-LAX": (120.0, 1200.0),
        "NYC-LHR": (250.0, 3000.0),
    }
    if route in route_baselines:
        min_price, max_price = route_baselines[route]
        return min_price <= price <= max_price
    # Unknown route — can't validate, flag for human review
    return True


class TestSemanticValidation:

    def test_destination_matches_user_request(self):
        user_request = "Book me a flight to Chicago"
        agent_destination = "Chicago O'Hare (ORD)"
        assert validate_destination_matches_request(user_request, agent_destination)

    def test_wrong_destination_caught(self):
        user_request = "Book me a flight to Chicago"
        # Agent confused Chicago with Charlotte — a real failure mode
        agent_destination = "Charlotte Douglas (CLT)"
        assert not validate_destination_matches_request(user_request, agent_destination)

    def test_past_date_rejected(self):
        assert not validate_date_is_future("2020-01-15")

    def test_future_date_accepted(self):
        assert validate_date_is_future("2026-12-25")

    def test_plausible_price_accepted(self):
        assert validate_price_is_plausible(342.50, "NYC-CHI")

    def test_absurd_price_rejected(self):
        # $1 flight — someone is about to have a bad day
        assert not validate_price_is_plausible(1.00, "NYC-CHI")

Guardrail Testing: Preventing Harmful Outputs

Guardrail testing validates that your AI agent respects boundaries it must never cross. These aren’t nice-to-haves — they’re the tests that prevent your company from trending on social media for the wrong reasons.

Building a Guardrail Test Suite

Think of guardrails in three categories: content safety, scope boundaries, and authority limits.

# test_guardrails.py
import pytest
import re
from dataclasses import dataclass


@dataclass
class GuardrailResult:
    passed: bool
    violation: str = ""


def check_content_safety(agent_response: str) -> GuardrailResult:
    """Screen agent output for content that should never appear."""
    # Block patterns that indicate the agent is leaking system internals
    dangerous_patterns = [
        (r"sk-[a-zA-Z0-9]{32,}", "API key leak detected"),
        (r"password\s*[:=]\s*\S+", "Password exposure detected"),
        (r"(BEGIN|END)\s+(RSA|DSA|EC)\s+PRIVATE\s+KEY", "Private key leak detected"),
        (r"\b\d{3}-\d{2}-\d{4}\b", "Possible SSN detected"),
    ]
    for pattern, description in dangerous_patterns:
        if re.search(pattern, agent_response, re.IGNORECASE):
            return GuardrailResult(passed=False, violation=description)

    return GuardrailResult(passed=True)


def check_scope_boundary(agent_response: str, allowed_actions: list[str]) -> GuardrailResult:
    """Ensure the agent doesn't promise or perform actions outside its scope.
    
    A booking agent should never offer medical advice, legal counsel,
    or claim capabilities it doesn't have.
    """
    out_of_scope_indicators = {
        "medical": ["diagnosis", "prescribe", "medication", "symptoms indicate"],
        "legal": ["legal advice", "you should sue", "liability", "not liable"],
        "financial": ["investment advice", "guaranteed returns", "buy this stock"],
    }
    response_lower = agent_response.lower()
    for category, indicators in out_of_scope_indicators.items():
        if category in allowed_actions:
            continue
        for indicator in indicators:
            if indicator in response_lower:
                return GuardrailResult(
                    passed=False,
                    violation=f"Out-of-scope {category} content: '{indicator}'"
                )
    return GuardrailResult(passed=True)


def check_authority_limits(agent_response: str, max_refund: float = 500.0) -> GuardrailResult:
    """Verify the agent doesn't exceed its authority — e.g., issuing refunds
    above its approved threshold.
    """
    # Look for refund amounts in the response
    refund_pattern = r"refund\s+(?:of\s+)?\$?([\d,]+\.?\d*)"
    matches = re.findall(refund_pattern, agent_response, re.IGNORECASE)
    for match in matches:
        amount = float(match.replace(",", ""))
        if amount > max_refund:
            return GuardrailResult(
                passed=False,
                violation=f"Refund of ${amount} exceeds authority limit of ${max_refund}"
            )
    return GuardrailResult(passed=True)


class TestGuardrails:

    def test_no_api_key_leakage(self):
        # Simulate an agent that accidentally includes an API key
        response = "Here's your booking info. Debug: sk-abc123def456ghi789jkl012mno345pqr678"
        result = check_content_safety(response)
        assert not result.passed
        assert "API key" in result.violation

    def test_clean_response_passes_safety(self):
        response = "Your flight to Chicago is confirmed! Confirmation: AB123456"
        result = check_content_safety(response)
        assert result.passed

    def test_agent_stays_in_scope(self):
        response = "I can help you book a flight! Departures are available at 9am and 2pm."
        result = check_scope_boundary(response, allowed_actions=["booking"])
        assert result.passed

    def test_agent_giving_medical_advice_caught(self):
        # Prompt injection or model drift could cause this
        response = "Your symptoms indicate you might have the flu. I'd prescribe rest."
        result = check_scope_boundary(response, allowed_actions=["booking"])
        assert not result.passed
        assert "medical" in result.violation

    def test_refund_within_authority(self):
        response = "I've processed a refund of $150.00 to your account."
        result = check_authority_limits(response, max_refund=500.0)
        assert result.passed

    def test_refund_exceeding_authority_caught(self):
        response = "Sure! I've issued a refund of $5,000.00 for your inconvenience."
        result = check_authority_limits(response, max_refund=500.0)
        assert not result.passed
        assert "exceeds authority" in result.violation

Building an AI Agent Test Framework

Individual checks are useful. A composable framework that runs all of them against every agent response is what you actually need in production. Here’s how to wire it together.

# ai_agent_test_framework.py
from dataclasses import dataclass, field
from typing import Callable
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # Block response immediately
    HIGH = "high"          # Log + alert, consider blocking
    MEDIUM = "medium"      # Log + alert
    LOW = "low"            # Log for analysis


@dataclass
class ValidationRule:
    name: str
    check: Callable[[dict], bool]  # Takes context dict, returns pass/fail
    severity: Severity
    description: str


@dataclass
class ValidationResult:
    rule_name: str
    passed: bool
    severity: Severity
    detail: str = ""


@dataclass
class AgentResponseValidator:
    """Composable validator that runs all registered checks against
    an AI agent's response before it reaches the user.
    """
    rules: list[ValidationRule] = field(default_factory=list)

    def add_rule(self, rule: ValidationRule):
        self.rules.append(rule)

    def validate(self, context: dict) -> list[ValidationResult]:
        """Run all rules against the agent's response context.
        
        Args:
            context: Dict containing at minimum:
                - user_input: The original user prompt
                - agent_output: The agent's response text
                - agent_data: Any structured data the agent produced
        """
        results = []
        for rule in self.rules:
            try:
                passed = rule.check(context)
                results.append(ValidationResult(
                    rule_name=rule.name,
                    passed=passed,
                    severity=rule.severity,
                ))
            except Exception as e:
                # A validator that crashes should not silently pass
                results.append(ValidationResult(
                    rule_name=rule.name,
                    passed=False,
                    severity=rule.severity,
                    detail=f"Validator raised exception: {e}",
                ))
        return results

    def should_block(self, results: list[ValidationResult]) -> bool:
        """Returns True if any CRITICAL rule failed."""
        return any(
            not r.passed and r.severity == Severity.CRITICAL
            for r in results
        )

    def get_failures(self, results: list[ValidationResult]) -> list[ValidationResult]:
        return [r for r in results if not r.passed]


# --- Setting up the validator with real rules ---

def build_production_validator() -> AgentResponseValidator:
    """Factory that assembles the full validation pipeline."""
    validator = AgentResponseValidator()

    # Rule: Response must not be empty
    validator.add_rule(ValidationRule(
        name="non_empty_response",
        check=lambda ctx: len(ctx.get("agent_output", "").strip()) > 0,
        severity=Severity.CRITICAL,
        description="Agent must always produce a non-empty response",
    ))

    # Rule: Response length must be reasonable (not a novel, not a single char)
    validator.add_rule(ValidationRule(
        name="response_length_bounds",
        check=lambda ctx: 10 < len(ctx.get("agent_output", "")) < 5000,
        severity=Severity.HIGH,
        description="Response should be between 10 and 5000 characters",
    ))

    # Rule: Structured data must include required fields
    validator.add_rule(ValidationRule(
        name="required_fields_present",
        check=lambda ctx: all(
            key in ctx.get("agent_data", {})
            for key in ["destination", "price", "confirmation_id"]
        ),
        severity=Severity.CRITICAL,
        description="Structured output must have all required booking fields",
    ))

    # Rule: Agent must not contradict the user's request
    validator.add_rule(ValidationRule(
        name="destination_consistency",
        check=lambda ctx: ctx.get("agent_data", {}).get("destination", "").lower()
        in ctx.get("user_input", "").lower()
        or ctx.get("user_input", "").lower()
        in ctx.get("agent_data", {}).get("destination", "").lower(),
        severity=Severity.CRITICAL,
        description="Booked destination must match what the user requested",
    ))

    return validator


# --- Usage example ---

if __name__ == "__main__":
    validator = build_production_validator()

    # Simulate a bad agent response
    context = {
        "user_input": "Book me a flight to Chicago",
        "agent_output": "Your flight to Charlotte is confirmed!",
        "agent_data": {
            "destination": "Charlotte",
            "price": 342.50,
            "confirmation_id": "AB123456",
        },
    }

    results = validator.validate(context)
    failures = validator.get_failures(results)

    if validator.should_block(results):
        print("BLOCKED: Response failed critical validation")
        for f in failures:
            print(f"  [{f.severity.value}] {f.rule_name}: {f.detail}")
    else:
        print("Response cleared for delivery")

    # Output:
    # BLOCKED: Response failed critical validation
    #   [critical] destination_consistency:

This framework gives you a clean pattern: define rules with severity levels, compose them into a validator, and gate every agent response through it before delivery. In a real deployment, you’d wire this into your agent’s response pipeline as middleware.

Strategies for Ongoing AI Quality Assurance

Shipping a validation framework is step one. Keeping it effective is the long game. Here’s what a sustainable practice around quality assurance best practices looks like:

Run adversarial test suites on every model update. LLM behavior changes between versions. Build a corpus of 200+ adversarial prompts — prompt injections, boundary probes, ambiguous requests — and run them as regression tests whenever you update a model or system prompt.

Log everything, sample-review weekly. You can’t validate every response manually, but you can log all inputs and outputs, then review a random sample each week. Flag patterns: are certain prompt types producing more failures? Are guardrails catching issues you didn’t anticipate?

Track validation failure rates as metrics. Treat guardrail violations like bugs. If your “destination consistency” check starts failing 3% of the time instead of 0.5%, something changed — a model update, a prompt regression, or a new class of user input you haven’t seen before.

Test the guardrails themselves. Your validators are code. They have bugs too. Write tests for your tests. If you add a new content safety pattern, write a test case that proves it catches the bad output and doesn’t false-positive on good output.

Conclusion

AI agents are shipping faster than most teams can build quality gates around them. The gap between “it works in the demo” and “it works in production with real users trying real things” is where careers — and lawsuits — are made.

The good news: you don’t need exotic new tooling. The patterns here — schema validation, semantic checks, guardrail boundaries, composable validation frameworks — build on skills QA engineers already have. The shift is in what you’re asserting against: not exact outputs, but safety properties, boundary constraints, and behavioral invariants.

Key Takeaways:

  • Shift from exact-match to invariant-based testing. AI outputs are non-deterministic, so test for properties that must always hold rather than specific expected values.
  • Validate at two levels: structural and semantic. Schema validation catches malformed output; semantic validation catches output that’s well-formed but wrong.
  • Build guardrails with teeth. Content safety, scope boundaries, and authority limits should block dangerous responses automatically, not just log warnings.
  • Compose validators into a pipeline. A single validate() call should run every check and make a clear block/pass decision before any response reaches a user.
  • Treat your validation framework as a living system. Log failures, review samples, track metrics, and evolve your test corpus as new failure modes emerge.