POST /eval-runs

Execute Evaluation
Example request:

curl --request POST \
  --url https://api.example.com/eval-runs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "eval_type": "accuracy",
  "input": "<string>",
  "agent_id": "<string>",
  "team_id": "<string>",
  "model_id": "<string>",
  "model_provider": "<string>",
  "additional_guidelines": "<string>",
  "additional_context": "<string>",
  "num_iterations": 1,
  "name": "<string>",
  "expected_output": "<string>",
  "criteria": "<string>",
  "scoring_strategy": "binary",
  "threshold": 7,
  "warmup_runs": 0,
  "expected_tool_calls": [
    "<string>"
  ]
}
'

Example response:

{
  "id": "f2b2d72f-e9e2-4f0e-8810-0a7e1ff58614",
  "agent_id": "basic-agent",
  "model_id": "gpt-4o",
  "model_provider": "OpenAI",
  "eval_type": "reliability",
  "eval_data": {
    "eval_status": "PASSED",
    "failed_tool_calls": [],
    "passed_tool_calls": [
      "multiply"
    ]
  },
  "created_at": "2025-08-27T15:41:59Z",
  "updated_at": "2025-08-27T15:41:59Z"
}

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Query Parameters

db_id
string | null

Database ID to use for evaluation

table
string | null

Table to use for evaluation
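
To scope an evaluation to specific stored data, both query parameters can be appended to the URL. A minimal sketch (the db_id and table values are placeholders):

curl --request POST \
  --url 'https://api.example.com/eval-runs?db_id=<db_id>&table=<table>' \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{ "eval_type": "accuracy", "input": "<string>" }'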

Body

application/json
eval_type
enum<string>
required

Type of evaluation to run (accuracy, agent_as_judge, performance, or reliability)

Available options: accuracy, agent_as_judge, performance, reliability
input
string
required

Input text/query for the evaluation

Minimum string length: 1
agent_id
string | null

Agent ID to evaluate

team_id
string | null

Team ID to evaluate

model_id
string | null

Model ID to use for evaluation

model_provider
string | null

Model provider name

additional_guidelines
string | null

Additional guidelines for the evaluation

additional_context
string | null

Additional context for the evaluation

num_iterations
integer
default:1

Number of times to run the evaluation

Required range: 1 <= x <= 100
name
string | null

Name for this evaluation run

expected_output
string | null

Expected output for accuracy evaluation
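
For example, a minimal accuracy run pairs input with expected_output; this sketch uses illustrative values only:

curl --request POST \
  --url https://api.example.com/eval-runs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "eval_type": "accuracy",
  "input": "What is 6 multiplied by 7?",
  "expected_output": "42",
  "agent_id": "<string>"
}
'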

criteria
string | null

Evaluation criteria for agent-as-judge evaluation

scoring_strategy
enum<string> | null
default:binary

Scoring strategy: 'numeric' (1-10 with threshold) or 'binary' (PASS/FAIL)

Available options: numeric, binary
threshold
integer | null
default:7

Score threshold for pass/fail (1-10), only used with numeric scoring

Required range: 1 <= x <= 10
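
As a sketch, an agent-as-judge run with numeric scoring combines criteria, scoring_strategy, and threshold in one request (all values below are placeholders):

curl --request POST \
  --url https://api.example.com/eval-runs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "eval_type": "agent_as_judge",
  "input": "<string>",
  "criteria": "<string>",
  "scoring_strategy": "numeric",
  "threshold": 7
}
'
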
warmup_runs
integer
default:0

Number of warmup runs before measuring performance

Required range: 0 <= x <= 10
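
A performance run typically sets num_iterations and warmup_runs together so that warmup runs are excluded from the measured iterations; for example (values are illustrative):

curl --request POST \
  --url https://api.example.com/eval-runs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "eval_type": "performance",
  "input": "<string>",
  "agent_id": "<string>",
  "num_iterations": 10,
  "warmup_runs": 2
}
'
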
expected_tool_calls
string[] | null

Expected tool calls for reliability evaluation
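
The example response above comes from a reliability run; a request producing that kind of result might look like the following sketch (the agent ID and tool name are taken from that example):

curl --request POST \
  --url https://api.example.com/eval-runs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "eval_type": "reliability",
  "input": "<string>",
  "agent_id": "basic-agent",
  "expected_tool_calls": [
    "multiply"
  ]
}
'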

Response

Evaluation executed successfully

id
string
required

Unique identifier for the evaluation run

eval_type
enum<string>
required

Type of evaluation (accuracy, agent_as_judge, performance, or reliability)

Available options: accuracy, agent_as_judge, performance, reliability
eval_data
Eval Data · object
required

Evaluation results and metrics

agent_id
string | null

Agent ID that was evaluated

model_id
string | null

Model ID used in evaluation

model_provider
string | null

Model provider name

team_id
string | null

Team ID that was evaluated

workflow_id
string | null

Workflow ID that was evaluated

name
string | null

Name of the evaluation run

evaluated_component_name
string | null

Name of the evaluated component

eval_input
Eval Input · object

Input parameters used for the evaluation

created_at
string<date-time> | null

Timestamp when evaluation was created

updated_at
string<date-time> | null

Timestamp when evaluation was last updated