This example demonstrates how to use Agent as Judge evaluation to assess the main agent’s output as a background task. Unlike blocking validation, background evaluation:
- Does NOT block the response to the user
- Logs evaluation results for monitoring and analytics
- Can trigger alerts or store metrics without affecting latency
Use cases:
- Quality monitoring in production
- Compliance auditing
- Detecting hallucinations or other inappropriate content
Create a Python file named `background_output_evaluation.py`:
```python
from agno.agent import Agent
from agno.db.sqlite import AsyncSqliteDb
from agno.eval.agent_as_judge import AgentAsJudgeEval
from agno.models.openai import OpenAIResponses
from agno.os import AgentOS

# Setup database for agent and evaluation storage
db = AsyncSqliteDb(db_file="tmp/evaluation.db")

# Create the evaluator using Agent as Judge
evaluator = AgentAsJudgeEval(
    db=db,
    name="Response Quality Check",
    model=OpenAIResponses(id="gpt-5.2"),
    criteria="Response should be helpful, accurate, and well-structured",
    additional_guidelines=[
        "Evaluate if the response addresses the user's question directly",
        "Check if the information provided is correct and reliable",
        "Assess if the response is well-organized and easy to understand",
    ],
    threshold=7,
    run_in_background=True,  # Runs evaluation without blocking the response
)

# Create the main agent with Agent as Judge evaluation
main_agent = Agent(
    id="support-agent",
    name="CustomerSupportAgent",
    model=OpenAIResponses(id="gpt-5.2"),
    instructions=[
        "You are a helpful customer support agent.",
        "Provide clear, accurate, and friendly responses.",
        "If you don't know something, say so honestly.",
    ],
    db=db,
    post_hooks=[evaluator],  # Automatically evaluates each response
    markdown=True,
)

# Create AgentOS
agent_os = AgentOS(agents=[main_agent])
app = agent_os.get_app()

if __name__ == "__main__":
    agent_os.serve(app="background_output_evaluation:app", port=7777, reload=True)
```
Set up your virtual environment
```shell
uv venv --python 3.12
source .venv/bin/activate
```
Install dependencies
```shell
uv pip install -U agno openai uvicorn
```
Export your OpenAI API key
```shell
export OPENAI_API_KEY="your_openai_api_key_here"
```
Run the server
```shell
python background_output_evaluation.py
```
Test the endpoint
```shell
curl -X POST http://localhost:7777/agents/support-agent/runs \
  -F "message=How do I reset my password?" \
  -F "stream=false"
```
The response will be returned immediately. The evaluation runs in the background and results are stored in the database.
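To confirm the evaluation ran, you can inspect the SQLite file once the background task finishes. The snippet below is a minimal sketch using only the standard library: it lists the tables in `tmp/evaluation.db` and prints a few rows from any table whose name mentions evaluations. The exact table and column names are managed by Agno and may differ between versions, so this is for quick inspection only.
```python
import sqlite3

# Open the same SQLite file the agent and evaluator write to
conn = sqlite3.connect("tmp/evaluation.db")
cursor = conn.cursor()

# List all tables so you can see where the evaluation runs were stored
tables = [
    row[0]
    for row in cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
]
print("Tables:", tables)

# Dump a few rows from any table that looks evaluation-related
# (table/column names are an assumption and may vary by Agno version)
for table in tables:
    if "eval" in table.lower():
        for row in cursor.execute(f"SELECT * FROM {table} LIMIT 5"):
            print(table, row)

conn.close()
```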
What Happens
- The user sends a request to the agent
- The agent processes the request and generates a response
- The response is sent to the user immediately
- The background evaluation runs:
  - `AgentAsJudgeEval` automatically evaluates the response against the criteria
  - Scores the response on a scale of 1-10
  - Stores the results in the database
Production Extensions
In production, you could extend this pattern to:
| Extension | Description |
|---|---|
| Database Storage | Store evaluations for analytics dashboards |
| Alerting | Use an `on_fail` callback to send alerts when evaluations fail (see the sketch after this table) |
| Observability | Log to platforms like Datadog or OpenTelemetry |
| A/B Testing | Compare response quality across model versions |
| Training Data | Build datasets for fine-tuning |
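As one example of the alerting row above, the sketch below assumes `AgentAsJudgeEval` accepts an `on_fail` callback as the table describes; the callback signature and payload shown here are illustrative assumptions, not the exact Agno API, so check the evaluator's reference before using it.
```python
import logging

from agno.eval.agent_as_judge import AgentAsJudgeEval
from agno.models.openai import OpenAIResponses

logger = logging.getLogger("quality-alerts")


def alert_on_failed_evaluation(result) -> None:
    # Hypothetical callback: the payload Agno passes on failure may differ,
    # so treat `result` as opaque and log it for follow-up.
    logger.warning("Background evaluation failed: %s", result)
    # In production, forward this to Slack, PagerDuty, or a metrics pipeline.


# Same evaluator as in the example above, extended with an alert hook
# (the on_fail parameter is referenced in the table; its signature is an assumption)
evaluator = AgentAsJudgeEval(
    name="Response Quality Check",
    model=OpenAIResponses(id="gpt-5.2"),
    criteria="Response should be helpful, accurate, and well-structured",
    threshold=7,
    run_in_background=True,
    on_fail=alert_on_failed_evaluation,
)
```
Because the callback fires from the background task, any alerting work it does stays off the user-facing request path.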
Background evaluation is ideal for quality monitoring without impacting user experience. For scenarios where you need to block bad responses, use synchronous hooks instead.
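If you do need the evaluation to gate the response, a minimal variation of the evaluator above is to set `run_in_background=False` so the post-hook completes before the response is returned. This sketch reuses the options from the main example; how a failing synchronous evaluation affects the response depends on Agno's post-hook failure handling, so treat it as a starting point rather than a hard guarantee.
```python
from agno.eval.agent_as_judge import AgentAsJudgeEval
from agno.models.openai import OpenAIResponses

# A blocking variant of the evaluator above: with run_in_background=False,
# the post-hook finishes before the agent's response is returned, at the
# cost of added latency on every run.
blocking_evaluator = AgentAsJudgeEval(
    name="Response Quality Gate",
    model=OpenAIResponses(id="gpt-5.2"),
    criteria="Response should be helpful, accurate, and well-structured",
    threshold=7,
    run_in_background=False,  # Evaluate synchronously instead of in the background
)
```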