Code Examples
Benchmark your agent
A standalone script that runs the same query twice — once direct to your provider, once via Orqen — and compares input token counts. One dependency (httpx), no mocking, numbers come straight from the provider response.
Expected output
A 31-tool agent, weather query, claude-haiku-4-5. Your numbers will vary by model and prompt — the ratio is what matters.
Tools available: 31
Model: claude-haiku-4-5-20251001
Prompt: What is the current weather in Paris, France? Should I bring an umbrella today?
Running direct (api.anthropic.com)... done 1,842ms
Running via Orqen (api.orqen.app)... done 967ms
──────────────────────────────────────────────────
                       DIRECT VIA ORQEN     DELTA
──────────────────────────────────────────────────
Input tokens            8,142     1,207 −6,935 (−85%)
Output tokens              24        24        +0
Tools forwarded            31         1
Orqen overhead              —     <20ms
──────────────────────────────────────────────────
Answer match: ✓
Response kind: direct=tool_call orqen=tool_call
At claude-haiku-4-5 pricing ($0.80/M input):
Direct: $0.00651 per call
Via Orqen: $0.00097 per call
Savings: $0.00555 per call (~$5.55 per 1,000 calls)
Answer match: ✓ means both calls selected the same tool (the script compares tool names, not arguments). Output tokens are unchanged because Orqen only trims input. If the script prints a recall miss warning, check that tool's description and re-run.
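The cost lines in the sample output are plain arithmetic on the input token counts. A minimal sketch of that calculation, assuming the $0.80 per million input tokens rate shown above; `cost_per_call` is a hypothetical helper name used for illustration, not a function in benchmark.py:

```python
def cost_per_call(input_tokens: int, usd_per_million: float) -> float:
    """Input-side cost in USD for one call at a per-million-token price."""
    return input_tokens * usd_per_million / 1_000_000


direct = cost_per_call(8_142, 0.80)  # "DIRECT" input tokens from the sample
orqen = cost_per_call(1_207, 0.80)   # "VIA ORQEN" input tokens from the sample
saving = direct - orqen              # computed before any rounding

print(f"Direct:    ${direct:.5f} per call")
print(f"Via Orqen: ${orqen:.5f} per call")
print(f"Savings:   ${saving:.5f} per call (~${saving * 1000:.2f} per 1,000 calls)")
```

The saving is taken from the unrounded costs, which is why the sample shows $0.00555 rather than the difference of the two rounded figures ($0.00554).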
Quick start
pip install httpx
export ORQEN_API_KEY=sk-orq-... # orqen.app → Settings
export ANTHROPIC_API_KEY=sk-ant-... # your Anthropic key (direct baseline)
python benchmark.py
# Try a different prompt or model:
python benchmark.py --prompt "Book a flight from London to Tokyo"
python benchmark.py --model claude-sonnet-4-6
What it measures
Input tokens — the only dimension Orqen optimises. Both calls use the same model; the delta comes entirely from Orqen routing to a smaller, relevant tool subset rather than forwarding all 31 schemas.
Output tokens — should be equal. If the response kind flips from tool_call to text (usually alongside a noticeably larger output count), the script flags a recall miss: the needed tool was pruned and the model answered in prose instead. That hurts correctness, not just cost, so fix the tool description and re-run.
Answer match — both calls should select the same tool. A mismatch is surfaced explicitly so you don't mistake a correctness difference for a token savings win.
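The checks above reduce to a comparison of the two response bodies. A minimal sketch, assuming Anthropic Messages API response dicts (a `content` list in which tool calls are blocks with `type` set to `tool_use`); `selected_tool`, `classify_outcome`, and the three return labels are hypothetical names for illustration, and the script below carries its own version of this logic:

```python
from typing import Optional


def selected_tool(response: dict) -> Optional[str]:
    """Name of the first tool_use block in the response, or None for prose."""
    for block in response.get("content", []):
        if isinstance(block, dict) and block.get("type") == "tool_use":
            return block.get("name")
    return None


def classify_outcome(direct: dict, orqen: dict) -> str:
    """Compare tool selection between the direct and routed responses."""
    d_tool, o_tool = selected_tool(direct), selected_tool(orqen)
    if d_tool == o_tool:
        return "match"
    if d_tool is not None and o_tool is None:
        # Direct call used a tool, routed call answered in text.
        return "recall_miss"
    return "tool_mismatch"


# Hand-written example responses (not real API output).
with_tool = {"content": [{"type": "tool_use", "name": "get_current_weather",
                          "input": {"city": "Paris"}}]}
prose_only = {"content": [{"type": "text", "text": "I cannot check live weather."}]}

print(classify_outcome(with_tool, with_tool))
print(classify_outcome(with_tool, prose_only))
```

Only a "match" result makes the token delta a like-for-like comparison; for the other two outcomes, check answer correctness first.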
The script
#!/usr/bin/env python3
"""Orqen benchmark — reproducible token savings measurement.

Runs the same query against a 31-tool agent twice:
  1. Direct: straight to your LLM provider with all 31 tool schemas
  2. Via Orqen: Orqen routes the request, forwards only the relevant tool

Both calls go to the same model. Input token counts come from the provider response.

Prerequisites
-------------
    pip install httpx
    export ORQEN_API_KEY=sk-orq-...      # orqen.app -> Settings
    export ANTHROPIC_API_KEY=sk-ant-...  # for the direct baseline

Usage
-----
    python benchmark.py
    python benchmark.py --model claude-sonnet-4-6
    python benchmark.py --prompt "Book a flight from London to Tokyo"
"""
from __future__ import annotations

import argparse, os, time
from typing import Any

try:
    import httpx
except ImportError:
    raise SystemExit("pip install httpx")

# ── Tool list (31 tools — only 1 is relevant for a weather query) ─────────────
def _fn(name, desc, props, req=None):
    # Build one tool definition in the Anthropic tool-schema shape.
    return {"name": name, "description": desc,
            "input_schema": {"type": "object", "properties": props, "required": req or []}}

TOOLS = [
    _fn("get_current_weather", "Get current live weather for a city.",
        {"city": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}}, ["city"]),
    _fn("get_weather_forecast", "Get a weather forecast for the next few days.",
        {"city": {"type": "string"}, "days": {"type": "integer"}}, ["city"]),
    _fn("geocode_city", "Find lat/lon, country, and timezone for a city.",
        {"city": {"type": "string"}}, ["city"]),
    _fn("calculate_compound_interest", "Calculate compound interest.",
        {"principal": {"type": "number"}, "annual_rate_percent": {"type": "number"}, "years": {"type": "number"}},
        ["principal", "annual_rate_percent", "years"]),
    _fn("calculate_loan_payment", "Calculate a fixed monthly loan payment.",
        {"principal": {"type": "number"}, "annual_rate_percent": {"type": "number"}, "years": {"type": "number"}},
        ["principal", "annual_rate_percent", "years"]),
    _fn("convert_temperature", "Convert a temperature between Celsius and Fahrenheit.",
        {"value": {"type": "number"}, "from_unit": {"type": "string"}, "to_unit": {"type": "string"}},
        ["value", "from_unit", "to_unit"]),
    _fn("lookup_customer", "Look up a customer profile by customer id.",
        {"customer_id": {"type": "string"}}, ["customer_id"]),
    _fn("get_order_status", "Look up an e-commerce order status.",
        {"order_id": {"type": "string"}}, ["order_id"]),
    _fn("estimate_shipping_rate", "Estimate shipping cost from weight and destination.",
        {"weight_kg": {"type": "number"}, "destination_country": {"type": "string"}},
        ["weight_kg", "destination_country"]),
    _fn("search_knowledge_base", "Search a support knowledge base.", {"query": {"type": "string"}}, ["query"]),
    _fn("summarize_incident", "Summarize an incident from severity and notes.",
        {"severity": {"type": "string"}, "service": {"type": "string"}, "notes": {"type": "string"}},
        ["severity", "service", "notes"]),
    _fn("calculate_sla_deadline", "Calculate an SLA deadline.",
        {"created_at": {"type": "string"}, "sla_hours": {"type": "number"}}, ["created_at", "sla_hours"]),
    _fn("validate_email_address", "Validate the format of an email address.", {"email": {"type": "string"}}, ["email"]),
    _fn("normalize_phone_number", "Normalize a phone number to E.164.", {"phone": {"type": "string"}}, ["phone"]),
    _fn("lookup_airport", "Look up an airport by IATA code.", {"iata_code": {"type": "string"}}, ["iata_code"]),
    _fn("calculate_distance_estimate", "Estimate great-circle distance between two cities.",
        {"from_city": {"type": "string"}, "to_city": {"type": "string"}}, ["from_city", "to_city"]),
    _fn("create_support_ticket", "Create a support ticket.",
        {"subject": {"type": "string"}, "description": {"type": "string"}, "priority": {"type": "string"}},
        ["subject", "description"]),
    _fn("get_subscription_status", "Look up subscription status.", {"account_id": {"type": "string"}}, ["account_id"]),
    _fn("calculate_invoice_total", "Calculate invoice total.",
        {"amounts": {"type": "array", "items": {"type": "number"}}, "tax_rate_percent": {"type": "number"}}, ["amounts"]),
    _fn("check_inventory", "Check inventory level for a SKU.", {"sku": {"type": "string"}}, ["sku"]),
    _fn("reserve_inventory", "Reserve inventory.", {"sku": {"type": "string"}, "quantity": {"type": "integer"}},
        ["sku", "quantity"]),
    _fn("recommend_plan", "Recommend a SaaS plan.",
        {"seats": {"type": "integer"}, "features": {"type": "array", "items": {"type": "string"}}}, ["seats"]),
    _fn("parse_iso_datetime", "Parse an ISO-8601 datetime.", {"value": {"type": "string"}}, ["value"]),
    _fn("get_business_hours", "Return business hours for a region.",
        {"region": {"type": "string", "enum": ["us", "eu", "apac"]}}, ["region"]),
    _fn("calculate_tax_estimate", "Estimate tax.",
        {"amount": {"type": "number"}, "region": {"type": "string"}}, ["amount", "region"]),
    _fn("search_product_catalog", "Search a product catalog.", {"query": {"type": "string"}}, ["query"]),
    _fn("get_refund_policy", "Return a refund policy.", {"category": {"type": "string"}}, ["category"]),
    _fn("classify_sentiment", "Classify text sentiment.", {"text": {"type": "string"}}, ["text"]),
    _fn("detect_language", "Detect the language of a text snippet.", {"text": {"type": "string"}}, ["text"]),
    _fn("redact_pii", "Redact emails and phone numbers from text.", {"text": {"type": "string"}}, ["text"]),
    _fn("schedule_follow_up", "Schedule a follow-up.",
        {"days_from_now": {"type": "integer"}, "topic": {"type": "string"}}, ["days_from_now", "topic"]),
]

# USD per million input tokens, keyed by model name.
PRICING = {
    "claude-haiku-4-5-20251001": 0.80,
    "claude-haiku-4-5": 0.80,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-7": 5.00,
    "claude-3-5-haiku-20241022": 0.80,
    "gpt-4o": 2.50,
    "gpt-4o-mini": 0.15,
}

# ── API call ──────────────────────────────────────────────────────────────────
def _anthropic_call(base_url, api_key, model, prompt, tools, label, timeout):
    # POST one Messages request and return (json body, lowercased headers, elapsed ms).
    print(f"Running {label} ({base_url.replace('https://', '')})...", end=" ", flush=True)
    t0 = time.perf_counter()
    with httpx.Client(timeout=timeout) as client:
        resp = client.post(
            f"{base_url.rstrip('/')}/v1/messages",
            headers={"x-api-key": api_key, "anthropic-version": "2023-06-01",
                     "content-type": "application/json"},
            json={"model": model, "max_tokens": 512, "tools": tools,
                  "messages": [{"role": "user", "content": prompt}]},
        )
    elapsed_ms = (time.perf_counter() - t0) * 1000
    if resp.status_code >= 400:
        raise SystemExit(f"HTTP {resp.status_code}: {resp.text[:400]}")
    print(f"done {elapsed_ms:,.0f}ms")
    return resp.json(), {k.lower(): v for k, v in resp.headers.items()}, elapsed_ms

def _extract_tool(data):
    # Name of the first tool_use block, or None if the model answered in text.
    for b in data.get("content", []):
        if isinstance(b, dict) and b.get("type") == "tool_use":
            return b.get("name")
    return None

def _input_tokens(data): return int(data.get("usage", {}).get("input_tokens") or 0)
def _output_tokens(data): return int(data.get("usage", {}).get("output_tokens") or 0)
def _kind(data): return "tool_call" if _extract_tool(data) else "text"

# ── Report ────────────────────────────────────────────────────────────────────
def _print_report(prompt, model, d_data, o_data, o_hdrs):
    d_tok, o_tok = _input_tokens(d_data), _input_tokens(o_data)
    d_out, o_out = _output_tokens(d_data), _output_tokens(o_data)
    d_kind, o_kind = _kind(d_data), _kind(o_data)
    d_tool, o_tool = _extract_tool(d_data) or "none", _extract_tool(o_data) or "none"
    tools_out = o_hdrs.get("x-orqen-tools-output", "?")
    routing = o_hdrs.get("x-orqen-routing", "")
    overhead = o_hdrs.get("x-orqen-pipeline-ms", "<20")
    match_sym = "\u2713" if d_tool == o_tool else "\u2717"
    # Signed deltas, so an increase via Orqen prints as "+N" instead of "--N".
    in_delta = o_tok - d_tok
    pct = f"{in_delta / d_tok * 100:+.0f}%" if d_tok else "n/a"
    out_delta = o_out - d_out

    print()
    print(f"Tools available: {len(TOOLS)}")
    print(f"Model: {model}")
    print(f"Prompt: {prompt[:80]}{'...' if len(prompt) > 80 else ''}")
    print()
    W = 50
    sep = "-" * W
    print(sep)
    print(f"{'':20}{'DIRECT':>9} {'VIA ORQEN':>9} {'DELTA':>9}")
    print(sep)
    print(f"{'Input tokens':20}{d_tok:>9,} {o_tok:>9,} {in_delta:+,} ({pct})")
    print(f"{'Output tokens':20}{d_out:>9,} {o_out:>9,} {out_delta:>+9,}")
    print(f"{'Tools forwarded':20}{len(TOOLS):>9} {tools_out:>9}")
    if routing:
        print(f"{'Routing method':20}{'--':>9} {routing:>9}")
    print(f"{'Orqen overhead':20}{'--':>9} {overhead + 'ms':>9}")
    print(sep)
    print(f"Answer match: {match_sym}")
    print(f"Response kind: direct={d_kind} orqen={o_kind}")

    # Flag correctness differences before anyone reads the token delta as a win.
    if d_tool != o_tool or (d_kind == "tool_call" and o_kind == "text"):
        print()
        if d_kind == "tool_call" and o_kind == "text":
            print("WARNING: recall miss — needed tool was pruned, model answered in prose.")
            print(f" Fix: improve the description of '{d_tool}' so it ranks into the forwarded set.")
        else:
            print(f"WARNING: tool mismatch — direct={d_tool}, orqen={o_tool}.")
            print(" Compare answer correctness before reading the token delta.")

    # Substring match, so dated model ids still resolve to a PRICING entry.
    price = next((p for k, p in PRICING.items() if k in model), None)
    if price and d_tok:
        d_cost = d_tok * price / 1_000_000
        o_cost = o_tok * price / 1_000_000
        saving = d_cost - o_cost
        print()
        print(f"At {model} pricing (${price:.2f}/M input):")
        print(f" Direct: ${d_cost:.5f} per call")
        print(f" Via Orqen: ${o_cost:.5f} per call")
        print(f" Savings: ${saving:.5f} per call (~${saving * 1000:.2f} per 1,000 calls)")
        print()

# ── Main ──────────────────────────────────────────────────────────────────────
def main():
    p = argparse.ArgumentParser()
    p.add_argument("--prompt", default="What is the current weather in Paris? Should I bring an umbrella?")
    p.add_argument("--model", default=os.environ.get("BENCHMARK_MODEL", "claude-haiku-4-5-20251001"))
    p.add_argument("--orqen-url", default="https://api.orqen.app")
    p.add_argument("--direct-url", default="https://api.anthropic.com")
    p.add_argument("--timeout", type=float, default=120.0)
    args = p.parse_args()

    orqen_key = os.environ.get("ORQEN_API_KEY") or raise_("Set ORQEN_API_KEY — free account at orqen.app")
    direct_key = os.environ.get("ANTHROPIC_API_KEY") or raise_("Set ANTHROPIC_API_KEY for the direct baseline")

    # Same model, prompt, and full tool list on both calls; only the endpoint differs.
    d_data, _, d_ms = _anthropic_call(args.direct_url, direct_key, args.model, args.prompt, TOOLS, "direct", args.timeout)
    o_data, o_hdrs, o_ms = _anthropic_call(args.orqen_url, orqen_key, args.model, args.prompt, TOOLS, "via Orqen", args.timeout)
    _print_report(args.prompt, args.model, d_data, o_data, o_hdrs)

def raise_(msg): raise SystemExit(msg)

if __name__ == "__main__":
    main()