The auto-reviewer agent is a small standalone agent whose only job is to evaluate another agent’s recent executions and write reviews. It replaces the manual flow of opening the Observability tab and clicking “Evaluate” on every conversation: humans now only see the executions the judge flagged as bad or in need of improvement. This pattern uses three SDK primitives together: Executions, Reviews, and the LLM module.
What you’ll build
A reviewer agent
Deployed like any other agent. Has its own ID and code, but never receives user input directly.
A CRON trigger
Fires the reviewer once a day (e.g., 06:00). Could be hourly for high-traffic agents.
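The two schedules mentioned above can be written as standard five-field cron expressions. This is a small illustration only: whether the portal accepts this exact syntax, and which timezone it applies, are assumptions to check against the Schedule docs. The name reviewerSchedules is hypothetical.

```typescript
// Standard five-field cron: minute hour day-of-month month day-of-week.
// Assumption: the trigger accepts this syntax; the timezone is not specified here.
const reviewerSchedules = {
  daily: "0 6 * * *", // every day at 06:00
  hourly: "0 * * * *", // at minute 0 of every hour, for high-traffic agents
} as const;
```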
Setup
You need two things:
- A target agent already in production (the one you want to evaluate). Note its slug or UUID.
- An OpenAI API key (or a key for any other LLM provider), which the judge uses.
Create the reviewer agent with rf create, then drop in the code below.
The code
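As a rough sketch of the reviewer's core loop: list the target agent's recent executions, skip the ones that already have a review, ask the judge for a verdict, and write a review. The names checkHasReview, reviews.create, and runCrossAgentReview appear on this page, but every type and signature below is an assumption made for illustration, and the SDK calls are injected as plain functions so the sketch does not depend on the real SDK surface.

```typescript
// Sketch only: these types are assumptions, not the real SDK types.
type Rating = "good" | "needs_improvement" | "bad";

interface Execution {
  id: string;
  input: string;
  output: string;
}

interface Verdict {
  rating: Rating;
  reason: string;
}

interface ReviewDeps {
  // Assumed: returns the target agent's executions since a given time.
  listExecutions(agentId: string, since: Date): Promise<Execution[]>;
  // Named in the text; the signature here is a guess.
  checkHasReview(executionId: string): Promise<boolean>;
  // One LLM call per execution (the judge).
  judge(execution: Execution): Promise<Verdict>;
  // Stands in for reviews.create().
  createReview(review: { executionId: string } & Verdict): Promise<void>;
}

async function runCrossAgentReview(
  targetAgentId: string,
  since: Date,
  deps: ReviewDeps,
): Promise<{ reviewed: number; skipped: number }> {
  const executions = await deps.listExecutions(targetAgentId, since);
  let reviewed = 0;
  let skipped = 0;

  for (const execution of executions) {
    // Idempotency guard: never review the same execution twice.
    if (await deps.checkHasReview(execution.id)) {
      skipped++;
      continue;
    }
    const verdict = await deps.judge(execution);
    await deps.createReview({ executionId: execution.id, ...verdict });
    reviewed++;
  }

  return { reviewed, skipped };
}
```

Injecting the dependencies keeps the loop testable without network access; in a real agent they would be thin wrappers over the Executions, Reviews, and LLM modules.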
Wiring the cron trigger
In the portal, on the reviewer agent’s Triggers tab:
Auto-dismiss the “good” verdicts (optional)
Out of the box, every verdict creates a pending_review row. To keep the human inbox focused only on bad / needs_improvement, auto-dismiss the good ones:
Make this change in the reviews.create() call inside the for loop.
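One way to express the auto-dismiss rule is a small mapping from the judge's rating to the status the review is created with. The pending_review status comes from the text above; the dismissed status name, the status field, and the helper statusForVerdict are assumptions for illustration.

```typescript
type Rating = "good" | "needs_improvement" | "bad";

// "pending_review" comes from the text; "dismissed" is an assumed status name.
type ReviewStatus = "pending_review" | "dismissed";

// Good verdicts skip the human inbox; everything else stays pending.
function statusForVerdict(rating: Rating): ReviewStatus {
  return rating === "good" ? "dismissed" : "pending_review";
}

// Inside the for loop, the create call might then look something like:
// await reviews.create({ executionId, ...verdict, status: statusForVerdict(verdict.rating) });
```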
Why this works
Tenant-scoped
The reviewer agent’s API key only sees its own tenant. Cross-tenant references return 404 — no risk of evaluating someone else’s data.
Idempotent
checkHasReview() prevents double-reviewing. Re-run the cron as often as you want; already-reviewed executions are skipped silently.
Auditable
Reviews stamped by the SDK show up in the UI as reviewedBy: apikey:<name>. Easy to filter from human reviewers.
Decoupled
Lives in its own agent. The target agent doesn’t know it’s being reviewed — zero coupling, change one without touching the other.
Cost model
- Per execution review: 1 LLM call to the judge (~$0.0001 with gpt-4o-mini)
- Per cron run: N LLM calls for N un-reviewed executions in the window
- For 1000 executions/day, the daily cost is about $0.10 (1000 × ~$0.0001)
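The arithmetic behind these figures can be sketched as a one-line estimator. The default of $0.0001 per call is the approximate figure quoted above, not a measured price, and estimateDailyJudgeCost is a hypothetical helper name.

```typescript
// Daily judge cost = executions per day x cost of one judge call (USD).
// The default per-call cost is the approximate figure quoted in the text.
function estimateDailyJudgeCost(executionsPerDay: number, costPerCallUsd = 0.0001): number {
  if (executionsPerDay < 0 || costPerCallUsd < 0) {
    throw new RangeError("inputs must be non-negative");
  }
  return executionsPerDay * costPerCallUsd;
}
```

Because already-reviewed executions are skipped, the total depends on the number of executions, not on how often the cron fires: an hourly schedule spreads the same calls across more runs.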
Variants
Flag bad reviews to Slack
Replace reviews.create() with a wrapper that also POSTs to a Slack webhook when verdict.rating === 'bad'. You get an auto-curated inbox and real-time alerts.
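A minimal sketch of such a wrapper follows. The Review shape, the CreateReview type, and the injected postJson function are assumptions for illustration. The one external detail relied on is that Slack incoming webhooks accept a JSON body with a text field.

```typescript
type Rating = "good" | "needs_improvement" | "bad";

// Assumed review shape; the real reviews.create() payload may differ.
interface Review {
  executionId: string;
  rating: Rating;
  reason: string;
}

type CreateReview = (review: Review) => Promise<void>;
type PostJson = (url: string, body: unknown) => Promise<{ ok: boolean }>;

// Wraps a create function so that "bad" verdicts also post a Slack message.
function withSlackAlert(
  createReview: CreateReview,
  webhookUrl: string,
  postJson: PostJson,
  onAlertError: (err: unknown) => void = () => {},
): CreateReview {
  return async (review) => {
    // Persist the review first so a Slack outage never loses it.
    await createReview(review);
    if (review.rating !== "bad") return;
    try {
      const res = await postJson(webhookUrl, {
        text: `Bad execution ${review.executionId}: ${review.reason}`,
      });
      if (!res.ok) onAlertError(new Error("Slack webhook returned a non-OK response"));
    } catch (err) {
      onAlertError(err);
    }
  };
}

// A postJson backed by the global fetch (Node 18+) could look like:
// const postJson: PostJson = (url, body) =>
//   fetch(url, {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(body),
//   });
```

Alert failures are reported through onAlertError rather than thrown, so a failed notification does not abort the rest of the review run.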
Promote bad reviews to training data
Filter resolved reviews with correctedOutput and export them via reviews.exportForTraining({ agentId, status: 'resolved' }) in OpenAI fine-tuning format. Lets you close the loop: reviewer flags → human corrects → model gets retrained on the corrections.
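To make the target format concrete, the sketch below turns resolved reviews into JSONL in the chat fine-tuning shape OpenAI documents (one JSON object per line with a messages array). It illustrates the format only: the ResolvedReview shape and the helper toFineTuningJsonl are hypothetical, and the actual output of reviews.exportForTraining is not shown on this page.

```typescript
// Assumed shape of a resolved review; the real SDK type may differ.
interface ResolvedReview {
  input: string;
  correctedOutput?: string;
}

interface ChatExample {
  messages: { role: "user" | "assistant"; content: string }[];
}

// Keeps only reviews a human actually corrected and emits one JSON line each.
function toFineTuningJsonl(reviews: ResolvedReview[]): string {
  return reviews
    .filter(
      (r): r is ResolvedReview & { correctedOutput: string } =>
        typeof r.correctedOutput === "string" && r.correctedOutput.length > 0,
    )
    .map((r) => {
      const example: ChatExample = {
        messages: [
          { role: "user", content: r.input },
          { role: "assistant", content: r.correctedOutput },
        ],
      };
      return JSON.stringify(example);
    })
    .join("\n");
}
```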
Multi-agent fleet
Loop runCrossAgentReview over a list of target agents. Each cron firing evaluates the entire fleet. Use agents.list() to discover targets dynamically.
Next steps
Cross-Agent SDK
Full reference of the primitives this use case uses.
Schedule
More on cron triggers and scheduling patterns.
Standalone Modules
Reviews, Executions, and LLM reference.
Observability
What goes into the executions and traces this reviewer reads.