The auto-reviewer agent is a small standalone agent whose only job is to evaluate another agent’s recent executions and write reviews. It replaces the manual flow of opening the Observability tab and clicking “Evaluate” on every conversation — humans now only see the executions the judge flagged as bad or in need of improvement. This pattern uses three SDK primitives together: Executions, Reviews, and the LLM module.

What you’ll build

  1. A reviewer agent. Deployed like any other agent. It has its own ID and code, but never receives user input directly.
  2. A CRON trigger. Fires the reviewer once a day (e.g., 06:00). Could be hourly for high-traffic agents.
  3. The chain. On each trigger: list recent executions of the target → check if already reviewed (idempotency) → run LLM judge → persist the verdict via Reviews.create().

Setup

You need two things:
  1. A target agent already in production (the one you want to evaluate). Note its slug or UUID.
  2. An OpenAI API key (or any LLM provider) — the judge uses it.
The reviewer agent itself is a fresh agent. Create it via the portal or rf create, then drop in the code below.

The code

// main.ts of the reviewer agent
import { Executions } from '@runflow-ai/sdk/executions';
import { Reviews } from '@runflow-ai/sdk/reviews';
import { LLM } from '@runflow-ai/sdk/llm';

const JUDGE_PROMPT = `You are a strict QA reviewer for a customer-facing AI
agent. Read the user's question and the agent's answer, then output ONLY a
JSON object on a single line with exactly these fields:

  rating  — "good" | "bad" | "needs_improvement"
  reason  — short Portuguese explanation, at most 240 chars

Rubric:
  - "bad" if the answer is wrong, unsafe, contradicts the user, or hallucinates.
  - "needs_improvement" if correct but unclear, too short, missing context,
    or fails to ask a follow-up that the situation clearly required.
  - "good" if correct, clear, and well-targeted.
`;

export async function main() {
  return runCrossAgentReview({
    targetAgentId: process.env.TARGET_AGENT_ID ?? 'customer-support',
    windowHours:   24,
    limit:         200,
  });
}

async function runCrossAgentReview(opts: {
  targetAgentId: string;
  windowHours:   number;
  limit:         number;
}) {
  const executions = new Executions();
  const reviews    = new Reviews();
  const judge      = LLM.openai('gpt-4o-mini');

  // 1. Pull recent executions of the target agent
  const { data: list } = await executions.list({
    agentId: opts.targetAgentId,
    limit:   opts.limit,
  });

  const cutoff = Date.now() - opts.windowHours * 60 * 60 * 1000;
  let created = 0, skipped = 0, errors = 0;

  for (const exec of list) {
    try {
      // Skip outside the window
      const startedAtMs = exec.startedAt ? new Date(exec.startedAt).getTime() : Date.now();
      if (startedAtMs < cutoff) { skipped++; continue; }

      // Idempotency — never double-review
      const { exists } = await reviews.checkHasReview(exec.id);
      if (exists) { skipped++; continue; }

      // Drill into the execution to get the actual messages
      const detail  = await executions.get(exec.id);
      const userMsg = stringify(detail?.input?.message ?? detail?.input);
      const agentMsg = stringify(detail?.output?.message ?? detail?.output);
      if (!userMsg || !agentMsg) { skipped++; continue; }

      // Ask the LLM judge
      const verdictRaw = await judge.chat(
        `${JUDGE_PROMPT}\n\nUSER:\n${userMsg}\n\nAGENT:\n${agentMsg}`,
      );
      const verdict = parseVerdict(verdictRaw);
      if (!verdict) { errors++; continue; }

      // Persist the review
      await reviews.create({
        executionId: exec.id,
        agentId:     opts.targetAgentId,
        rating:      verdict.rating,
        comment:     verdict.reason,
        priority:    verdict.rating === 'bad' ? 'high' : 'medium',
        tags:        ['auto-judge', 'model:gpt-4o-mini'],
      });
      created++;
    } catch (err: any) {
      console.error(`error on exec=${exec.id}: ${err?.message}`);
      errors++;
    }
  }

  return { scanned: list.length, created, skipped, errors };
}

function stringify(v: unknown): string {
  if (v == null) return '';
  if (typeof v === 'string') return v;
  try { return JSON.stringify(v); } catch { return String(v); }
}

function parseVerdict(raw: string) {
  const match = raw.match(/\{[^}]*\}/);
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[0]);
    const rating = parsed?.rating;
    const reason = String(parsed?.reason ?? '').slice(0, 240);
    if (!['good', 'bad', 'needs_improvement'].includes(rating)) return null;
    if (reason.length < 10) return null;
    return { rating, reason };
  } catch {
    return null;
  }
}
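
To see which judge outputs survive parsing, the sketch below restates parseVerdict from the listing above with explicit types and runs it over a few made-up responses. The sample strings are invented for illustration and are not real judge output. One edge case is worth knowing: because the regex stops at the first closing brace, a reason that itself contains `}` fails to parse and ends up in the errors counter.

```typescript
type Rating = 'good' | 'bad' | 'needs_improvement';
interface Verdict { rating: Rating; reason: string }

const RATINGS: readonly string[] = ['good', 'bad', 'needs_improvement'];

// Same logic as parseVerdict in the reviewer code, with types added.
function parseVerdict(raw: string): Verdict | null {
  const match = raw.match(/\{[^}]*\}/);
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[0]);
    const rating = parsed?.rating;
    const reason = String(parsed?.reason ?? '').slice(0, 240);
    if (typeof rating !== 'string' || !RATINGS.includes(rating)) return null;
    if (reason.length < 10) return null;
    return { rating: rating as Rating, reason };
  } catch {
    return null;
  }
}

// Invented sample outputs, one per interesting case.
const samples: string[] = [
  // Valid JSON wrapped in extra prose: accepted.
  'Sure! {"rating":"bad","reason":"Resposta contradiz a política de reembolso."}',
  // Rating outside the allowed set: rejected.
  '{"rating":"great","reason":"Resposta correta e clara."}',
  // Reason shorter than 10 characters: rejected.
  '{"rating":"good","reason":"ok"}',
  // No JSON object at all: rejected.
  'I cannot evaluate this.',
  // Reason contains "}", so the regex cuts the match short: rejected.
  '{"rating":"bad","reason":"Usa {nome} sem substituir o valor."}',
];

const results = samples.map(parseVerdict);
console.log(results.map((r) => (r ? r.rating : null)));
// → ['bad', null, null, null, null]
```

Rejected outputs are not persisted, so checkHasReview() still reports no review for them and the next cron run retries those executions.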

Wiring the cron trigger

In the portal, on the reviewer agent’s Triggers tab:
  1. Add a CRON trigger. Click + Trigger → choose CRON.
  2. Schedule. 0 6 * * * runs daily at 06:00. For a high-traffic agent, try 0 * * * * (hourly).
  3. Save. The trigger fires main() automatically, with no input payload needed.
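
Changing the schedule usually means revisiting windowHours as well. As a rough sketch, the cutoff check from the review loop can be restated as a pure function to see which executions a given window picks up. `isInsideWindow` is a hypothetical helper, not part of the SDK, and the timestamps are made up for illustration.

```typescript
// Hypothetical helper mirroring the cutoff check in runCrossAgentReview.
function isInsideWindow(
  startedAt: string | undefined,
  windowHours: number,
  nowMs: number,
): boolean {
  const cutoff = nowMs - windowHours * 60 * 60 * 1000;
  // Like the main loop, a missing startedAt is treated as "now".
  const startedAtMs = startedAt ? new Date(startedAt).getTime() : nowMs;
  return startedAtMs >= cutoff;
}

const now = Date.parse('2024-01-01T06:00:00Z');

// Hourly cron with a 2-hour window: each execution is seen by two runs.
const recent = isInsideWindow('2024-01-01T04:30:00Z', 2, now);   // 1.5 h old
const stale = isInsideWindow('2024-01-01T03:00:00Z', 2, now);    // 3 h old
// Daily cron with the default 24-hour window.
const yesterday = isInsideWindow('2023-12-31T07:00:00Z', 24, now); // 23 h old

console.log({ recent, stale, yesterday });
// → { recent: true, stale: false, yesterday: true }
```

A window somewhat longer than the cron interval gives each execution a second chance if one run fails; the overlap costs no extra judge calls because already-reviewed executions are skipped by checkHasReview().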

Auto-dismiss the “good” verdicts (optional)

Out of the box every verdict creates a pending_review row. To keep the human inbox focused only on bad / needs_improvement, auto-dismiss the good ones:
if (verdict.rating === 'good') {
  const { reviews: justCreated } = await reviews.list({
    agentId: opts.targetAgentId,
    status:  'pending_review',
    limit:   5,
  });
  const fresh = justCreated.find((r) => r.executionId === exec.id);
  if (fresh) {
    await reviews.dismiss(fresh.id, {
      resolutionNotes: 'auto-judge: good — no action needed.',
    });
  }
}
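Drop this block right after the reviews.create() call inside the for loop. The lookup only fetches five pending reviews, so on an agent with a large pending backlog the freshly created review may not be among them; raise limit if good verdicts are being left undismissed.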

Why this works

Tenant-scoped

The reviewer agent’s API key only sees its own tenant. Cross-tenant references return 404 — no risk of evaluating someone else’s data.

Idempotent

checkHasReview() prevents double-reviewing. Re-run the cron as often as you want — already-reviewed executions are skipped silently.

Auditable

Reviews created through the SDK show up in the UI as reviewedBy: apikey:<name>, so they are easy to tell apart from reviews written by humans.

Decoupled

Lives in its own agent. The target agent doesn’t know it’s being reviewed — zero coupling, change one without touching the other.

Cost model

  • Per execution review: 1 LLM call to the judge (~$0.0001 with gpt-4o-mini)
  • Per cron run: N LLM calls for N un-reviewed executions in the window
  • For 1000 executions/day, the daily cost is about $0.10 (1000 × ~$0.0001)
The dominant cost is the LLM judge, not the SDK round-trips.
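
The arithmetic behind these figures can be written out as a small estimator. This is a back-of-the-envelope sketch: the per-call price is the approximate figure quoted above, not a measured one, and `estimateDailyCostUsd` is a hypothetical helper name.

```typescript
// Approximate figure from the cost model above (gpt-4o-mini judge call).
const COST_PER_JUDGE_CALL_USD = 0.0001;

// One judge call per un-reviewed execution, so cost scales linearly.
function estimateDailyCostUsd(
  unreviewedPerDay: number,
  costPerCallUsd: number = COST_PER_JUDGE_CALL_USD,
): number {
  return unreviewedPerDay * costPerCallUsd;
}

const daily = estimateDailyCostUsd(1000);
console.log(daily.toFixed(2));        // prints "0.10"
console.log((daily * 30).toFixed(2)); // prints "3.00" for a 30-day month
```

Long conversations use more tokens per judge call, so treat the per-call price as a parameter to adjust rather than a constant.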

Variants

  • Replace reviews.create() with a wrapper that also POSTs to a Slack webhook when verdict.rating === 'bad'. You get an auto-curated inbox and real-time alerts.
  • Filter resolved reviews with correctedOutput and export them via reviews.exportForTraining({ agentId, status: 'resolved' }) in OpenAI fine-tuning format. This closes the loop: reviewer flags → human corrects → model gets retrained on the corrections.
  • Loop runCrossAgentReview over a list of target agents. Each cron firing evaluates the entire fleet. Use agents.list() to discover targets dynamically.
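
As one possible shape for the Slack variant, the wrapper below saves the review first and then posts an alert only for bad verdicts. It is a sketch under stated assumptions: `createAndAlert`, `ReviewInput`, and `PostFn` are hypothetical names, the review is persisted through an injected function so the sketch does not depend on the SDK, and the only Slack behavior relied on is that incoming webhooks accept a JSON body with a `text` field.

```typescript
type Rating = 'good' | 'bad' | 'needs_improvement';

// Mirrors the fields passed to reviews.create() in the reviewer code.
interface ReviewInput {
  executionId: string;
  agentId: string;
  rating: Rating;
  comment: string;
  priority?: string;
  tags?: string[];
}

// Minimal fetch-like signature; the global fetch in Node 18+ fits it.
type PostFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<unknown>;

// Hypothetical wrapper: persists the review first, then alerts on "bad".
// Returns true when the webhook request completed without throwing.
async function createAndAlert(
  input: ReviewInput,
  createReview: (input: ReviewInput) => Promise<unknown>,
  slackWebhookUrl: string,
  post: PostFn,
): Promise<boolean> {
  await createReview(input);
  if (input.rating !== 'bad') return false;

  try {
    // Slack incoming webhooks accept a JSON body with a "text" field.
    await post(slackWebhookUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        text: `Auto-judge flagged execution ${input.executionId} on ${input.agentId} as bad: ${input.comment}`,
      }),
    });
    return true;
  } catch (err) {
    // A failed alert should not fail the run; the review is already saved.
    console.error(`slack alert failed for exec=${input.executionId}`, err);
    return false;
  }
}
```

Inside the review loop, the call would look something like `await createAndAlert(reviewInput, (i) => reviews.create(i), process.env.SLACK_WEBHOOK_URL!, fetch)`, where SLACK_WEBHOOK_URL is an assumed environment variable name. Saving before alerting means a Slack outage never causes an execution to be judged twice.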

Next steps

Cross-Agent SDK

Full reference of the primitives this use case uses.

Schedule

More on cron triggers and scheduling patterns.

Standalone Modules

Reviews, Executions, and LLM reference.

Observability

What goes into the executions and traces this reviewer reads.