Runflow agents can process audio and images automatically. This is essential for WhatsApp integrations where users send voice messages and photos, and for HTTP clients that upload files directly to the agent via multipart/form-data. There are two entry points for media:
  • input.file — singular. Used by webhook handlers (Twilio/WhatsApp, Meta/Messenger) that resolve a single media file per inbound message. Auto-processed when media.transcribeAudio or media.processImages is enabled.
  • input.attachments[] — plural. Populated when the agent is invoked via multipart/form-data (one or more files in the same request). Auto-bridged to a multimodal chat message when media.processAttachments is enabled.
Both paths produce the same downstream effect: the LLM receives a multimodal user message with image and/or file parts. Choose the entry point that matches how media reaches your agent.
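To make the distinction concrete, here is a minimal sketch of how a handler might tell the two entry points apart. The types and the `detectMediaEntryPoint` function are illustrative names, not SDK exports; the field names follow the input shapes shown on this page.

```typescript
// Illustrative types: field names mirror the inputs shown on this page,
// but these declarations are assumptions, not SDK exports.
interface InputFile {
  url: string;
  contentType: string;
  caption?: string;
}

interface InputAttachment {
  field: string;
  name: string;
  content_type: string;
  size: number;
  url: string;
}

interface AgentInput {
  message?: string;
  file?: InputFile;
  attachments?: InputAttachment[];
}

type MediaEntryPoint = 'file' | 'attachments' | 'none';

// Report which media entry point (if any) an inbound input uses.
function detectMediaEntryPoint(input: AgentInput): MediaEntryPoint {
  if (input.attachments && input.attachments.length > 0) return 'attachments';
  if (input.file) return 'file';
  return 'none';
}
```

Note that, in the examples on this page, input.file carries a camelCase contentType while each entry of input.attachments[] carries a snake_case content_type.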

Audio Transcription

Transcribe audio files to text using multiple providers:
import { transcribe, Media } from '@runflow-ai/sdk';

// Standalone function (default: OpenAI Whisper)
const result = await transcribe({
  audioUrl: 'https://example.com/audio.ogg',
  language: 'pt',
});

console.log(result.text); // "Olá, como vai?"

// Using specific provider
const result2 = await transcribe({
  audioUrl: 'https://example.com/audio.ogg',
  provider: 'deepgram',
  language: 'pt',
});

// Or via Media class
const result3 = await Media.transcribe({
  audioUrl: 'https://example.com/audio.ogg',
  provider: 'openai',
});

Supported Providers

| Provider | Status | Description |
| --- | --- | --- |
| openai | Available | OpenAI Whisper (default) |
| deepgram | Available | Deepgram |
| assemblyai | Available | AssemblyAI |
| google | Available | Google Speech-to-Text |
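Because several providers are available, one option is to fall back to a second provider when the first one fails. The sketch below is an assumption-laden illustration: `transcribeWithFallback` is a hypothetical helper, the transcription function is injected rather than imported, and it assumes a failed transcription rejects its promise, which this page does not confirm.

```typescript
// Shapes taken from the transcribe() calls above; declared locally so the
// sketch does not depend on the SDK's exported types.
type Provider = 'openai' | 'deepgram' | 'assemblyai' | 'google';

interface TranscribeOptions {
  audioUrl: string;
  provider?: Provider;
  language?: string;
}

type TranscribeFn = (options: TranscribeOptions) => Promise<{ text: string }>;

// Try each provider in order and return the first successful transcription.
// `transcribeFn` is injected; in an agent you would pass the SDK's transcribe.
async function transcribeWithFallback(
  transcribeFn: TranscribeFn,
  audioUrl: string,
  providers: Provider[],
  language?: string,
): Promise<{ text: string; provider: Provider }> {
  let lastError: unknown = new Error('No providers given');
  for (const provider of providers) {
    try {
      const options: TranscribeOptions = { audioUrl, provider, ...(language ? { language } : {}) };
      const { text } = await transcribeFn(options);
      return { text, provider };
    } catch (error) {
      lastError = error; // remember the failure and try the next provider
    }
  }
  throw lastError;
}
```

Injecting the function keeps the helper testable with a fake and avoids tying the sketch to details of the SDK signature beyond the options shown above.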

Agent with Auto Media Processing

Configure agents to automatically handle audio and image files. When a user sends a voice message, it’s transcribed before processing. When they send an image, it’s analyzed with vision capabilities.
import { Agent, openai } from '@runflow-ai/sdk';

const agent = new Agent({
  name: 'WhatsApp Assistant',
  instructions: 'You are a helpful assistant.',
  model: openai('gpt-4o'),

  media: {
    transcribeAudio: true,
    processImages: true,
    audioProvider: 'openai',
    audioLanguage: 'pt',
  },
});

// Audio files are automatically transcribed before processing
const result = await agent.process({
  message: '',
  file: {
    url: 'https://zenvia.com/storage/audio.ogg',
    contentType: 'audio/ogg',
    caption: 'Voice message',
  },
});

// Images are automatically processed as multimodal
const result2 = await agent.process({
  message: 'What is in this image?',
  file: {
    url: 'https://example.com/image.jpg',
    contentType: 'image/jpeg',
  },
});

Media Config Options

| Option | Type | Description |
| --- | --- | --- |
| transcribeAudio | boolean | Auto-transcribe input.file when it’s audio (default: false) |
| processImages | boolean | Auto-process input.file when it’s an image (default: false) |
| processAttachments | boolean | Auto-bridge input.attachments[] (from multipart uploads) into a multimodal message (default: false) |
| audioLanguage | string | Language code (pt, en, es, etc.) |
| audioProvider | string | openai \| deepgram \| assemblyai \| google |
| audioModel | string | Provider-specific model |
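If the media config is loaded from untyped data (a JSON file, environment variables), it can help to check it against the options above before constructing the agent. This is a sketch only: the `MediaConfig` interface and `validateMediaConfig` function are hypothetical, written from the table above rather than taken from the SDK.

```typescript
const AUDIO_PROVIDERS = ['openai', 'deepgram', 'assemblyai', 'google'] as const;
type AudioProvider = (typeof AUDIO_PROVIDERS)[number];

// Local mirror of the options table; an assumption, not an SDK export.
interface MediaConfig {
  transcribeAudio?: boolean;
  processImages?: boolean;
  processAttachments?: boolean;
  audioLanguage?: string;
  audioProvider?: AudioProvider;
  audioModel?: string;
}

// Return a list of problems found in a media config loaded from untyped data.
function validateMediaConfig(raw: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const key of ['transcribeAudio', 'processImages', 'processAttachments']) {
    if (key in raw && typeof raw[key] !== 'boolean') problems.push(`${key} must be a boolean`);
  }
  for (const key of ['audioLanguage', 'audioModel']) {
    if (key in raw && typeof raw[key] !== 'string') problems.push(`${key} must be a string`);
  }
  const provider = raw['audioProvider'];
  if (provider !== undefined && !AUDIO_PROVIDERS.includes(provider as AudioProvider)) {
    problems.push(`audioProvider must be one of: ${AUDIO_PROVIDERS.join(', ')}`);
  }
  return problems;
}

// A config that matches the table and passes the check.
const exampleConfig: MediaConfig = { transcribeAudio: true, audioProvider: 'openai', audioLanguage: 'pt' };
```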

Multipart Uploads (Files & Images via HTTP)

When you call the agent directly over HTTP and need to send files or images, post multipart/form-data to the agent endpoint. Runflow stores each upload, generates a short-lived URL, and delivers an input.attachments[] array to your agent.
curl -X POST "https://executor.runflow.ai/agent/<agentId>?token=<agentToken>" \
  -F "message=quanto custa esse produto?" \
  -F "photo=@./produto.jpg"
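The same request can be sent from TypeScript using the standard fetch and FormData globals (Node 18+ or a browser). This is a minimal sketch: `buildUploadForm` and `callAgent` are hypothetical helper names, and parsing the response as JSON is an assumption about the endpoint.

```typescript
interface UploadFile {
  field: string; // form field name, e.g. "photo"
  name: string;  // file name sent to the server
  blob: Blob;    // file contents with its MIME type
}

// Build the same multipart body as the curl call above.
function buildUploadForm(message: string, files: UploadFile[]): FormData {
  const form = new FormData();
  form.append('message', message);
  for (const file of files) {
    form.append(file.field, file.blob, file.name);
  }
  return form;
}

// POST the form to the agent endpoint. Do not set Content-Type yourself:
// fetch adds the multipart boundary automatically.
async function callAgent(agentId: string, agentToken: string, form: FormData): Promise<unknown> {
  const url =
    `https://executor.runflow.ai/agent/${encodeURIComponent(agentId)}` +
    `?token=${encodeURIComponent(agentToken)}`;
  const response = await fetch(url, { method: 'POST', body: form });
  if (!response.ok) throw new Error(`Agent call failed with HTTP ${response.status}`);
  return response.json(); // assumes the endpoint answers with JSON
}
```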
Enable media.processAttachments to have the SDK build the multimodal message for you. No glue code, no transform — the agent receives the attachments and forwards them to the LLM automatically.
agent.ts
import { Agent, openai } from '@runflow-ai/sdk';

export const agent = new Agent({
  name: 'product-helper',
  model: openai('gpt-4o'),                  // any vision-capable model
  instructions: 'Help the user evaluate products.',
  media: {
    processAttachments: true,                // ← opt-in
  },
});
What happens:
  1. Runflow stores the upload and delivers your agent an input that looks like this:
    {
      "message": "quanto custa esse produto?",
      "attachments": [{
        "field": "photo",
        "name": "produto.jpg",
        "content_type": "image/jpeg",
        "size": 12345,
        "url": "https://.../produto.jpg",
        "object_key": "uploads/<tenant>/<execId>/<uuid>_produto.jpg"
      }]
    }
    
  2. The SDK builds a multimodal user message — text plus image — and sends it to the model.
  3. The image reaches the LLM in whatever format that provider expects (a URL, base64, or a data: URI, depending on the provider; see Provider support for attachments). The SDK handles the conversion transparently.
Routing rule (kept simple by design):
  • content_type starting with image/ → { type: 'image_url', image_url: { url } }
  • anything else → { type: 'file_url', file_url: { url }, name }
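The routing rule above can be sketched as a single typed function. `routeAttachment` and the two types are illustrative names, not the SDK's internal implementation; only the part shapes come from the rule itself.

```typescript
// Only the fields the rule needs; attachments carry more (field, size, object_key).
interface Attachment {
  name: string;
  content_type: string;
  url: string;
}

type ContentPart =
  | { type: 'image_url'; image_url: { url: string } }
  | { type: 'file_url'; file_url: { url: string }; name: string };

// Images become image_url parts; everything else becomes a file_url part.
function routeAttachment(att: Attachment): ContentPart {
  if (att.content_type.startsWith('image/')) {
    return { type: 'image_url', image_url: { url: att.url } };
  }
  return { type: 'file_url', file_url: { url: att.url }, name: att.name };
}
```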

Manual transform (advanced)

If you need custom routing — download a CSV to parse locally, OCR a PDF before sending, fan out images to different conversations — leave processAttachments off and transform the attachments yourself:
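import { buildAttachmentsContent } from '@runflow-ai/sdk/core';
import { agent } from './agent'; // your agent, defined with processAttachments left off
import { extractPdfText } from './extract-pdf-text'; // hypothetical helper you supply; not part of the SDK

export async function main(input: any) {
  // Use the same helper the SDK uses internally:
  const content = buildAttachmentsContent(input.attachments, input.message);

  // Or build it by hand, attachment by attachment:
  const parts: any[] = [{ type: 'text', text: input.message }];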
  for (const att of input.attachments ?? []) {
    if (att.content_type.startsWith('image/')) {
      parts.push({ type: 'image_url', image_url: { url: att.url } });
    } else if (att.content_type === 'application/pdf') {
      // Maybe OCR locally, then send the extracted text.
      const text = await extractPdfText(att.url);
      parts.push({ type: 'text', text });
    } else {
      parts.push({ type: 'file_url', file_url: { url: att.url }, name: att.name });
    }
  }

  return await agent.process({
    ...input,
    messages: [{ role: 'user', content: parts }],
  });
}

Limits

| Limit | Default |
| --- | --- |
| Max per file | 25 MiB |
| Max total per request | 100 MiB |
| Max per text form field | 1 MiB |
| Presigned URL TTL | 15 minutes |
Oversize uploads return HTTP 413 AttachmentTooLarge. Malformed multipart returns 400 InvalidMultipart. The URLs in input.attachments[].url are short-lived — your agent and the LLM provider should consume them within minutes of the request. For documents that need to live longer, persist them yourself (e.g., copy to your own storage).
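A client can check the size limits before uploading and avoid a round trip that ends in a 413. The sketch below hard-codes the default limits from the table; `checkUploadLimits` is a hypothetical helper, and it assumes the per-request total counts file bytes only, which this page does not specify.

```typescript
const MIB = 1024 * 1024;
const MAX_FILE_BYTES = 25 * MIB;   // default per-file limit
const MAX_TOTAL_BYTES = 100 * MIB; // default per-request limit

interface PendingUpload {
  name: string;
  size: number; // size in bytes
}

// Return a human-readable problem for every limit the upload would exceed.
function checkUploadLimits(files: PendingUpload[]): string[] {
  const problems: string[] = [];
  let total = 0;
  for (const file of files) {
    total += file.size;
    if (file.size > MAX_FILE_BYTES) {
      problems.push(`${file.name} exceeds the 25 MiB per-file limit`);
    }
  }
  if (total > MAX_TOTAL_BYTES) {
    problems.push('Request exceeds the 100 MiB total limit');
  }
  return problems;
}
```

An empty result means the upload fits within the defaults; the server remains the authority, so still handle 413 and 400 responses.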

Provider support for attachments

| Provider | Images | Files (PDF, CSV, …) |
| --- | --- | --- |
| OpenAI | URL or base64 | file_id (uploaded via /runtime/v1/files) |
| Azure OpenAI | URL or base64 | text label only |
| Anthropic / Bedrock | URL or base64 | text label only |
| Gemini | base64 inline (or data: URI) | text label only |
| Groq / xAI | URL or base64 (vision models) | text label only |
For non-image files on providers that don’t support arbitrary documents, the SDK emits a [File: <name>] text placeholder so the model sees something coherent. If you need the model to actually read a PDF/CSV, parse it locally first and send the extracted text.
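As an illustration of parsing locally first, the sketch below turns CSV text into a plain text part that any provider can read. `csvToTextPart` is a hypothetical helper; it assumes you have already downloaded the file contents as a string, and it caps the number of lines to keep the prompt small.

```typescript
// Turn CSV text into a text part, keeping at most `maxLines` non-empty lines.
function csvToTextPart(
  name: string,
  csv: string,
  maxLines = 50,
): { type: 'text'; text: string } {
  const lines = csv.split(/\r?\n/).filter((line) => line.trim() !== '');
  const shown = lines.slice(0, maxLines);
  const omitted = lines.length - shown.length;
  const footer = omitted > 0 ? `\n(${omitted} more line(s) omitted)` : '';
  return { type: 'text', text: `File: ${name}\n${shown.join('\n')}${footer}` };
}
```

The returned object has the same shape as the text parts used in the manual transform, so it can be pushed into the same parts array.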

Real-World Example: WhatsApp Support Agent

A complete WhatsApp agent that handles text, voice messages, and photos. Users can send a voice message to explain their issue or a photo of a damaged product.

Project Structure

whatsapp-support/
├── main.ts
├── agent.ts
├── tools/
│   ├── index.ts
│   └── create-ticket.ts
├── .runflow/
│   └── rf.json
├── package.json
└── tsconfig.json

Agent with Media

agent.ts
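import { Agent, openai } from '@runflow-ai/sdk';
import { createTicketTool } from './tools'; // assumes tools/index.ts re-exports the tool from create-ticket.ts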

export const whatsappAgent = new Agent({
  name: 'WhatsApp Support',
  instructions: `You are a customer support agent for WhatsApp.

## Behavior
- Respond in the customer's language
- Be concise — WhatsApp messages should be short
- When the customer sends a voice message, you'll receive the transcription — respond naturally
- When the customer sends a photo, analyze it and respond accordingly

## Tools
- Use create-ticket when the issue needs human follow-up
- If a customer sends a photo of a damaged product, create a ticket with priority 'high'`,
  model: openai('gpt-4o'),
  memory: { maxTurns: 30 },
  media: {
    transcribeAudio: true,
    processImages: true,
    audioProvider: 'openai',
    audioLanguage: 'pt',
  },
  tools: {
    createTicket: createTicketTool,
  },
  observability: 'full',
});

Main Entry Point

main.ts
import { identify, track } from '@runflow-ai/sdk/observability';
import { whatsappAgent } from './agent';

function parseWhatsAppInput(input: any) {
  // Zenvia webhook format
  if (input.message?.from) {
    const content = input.message.contents?.[0];
    return {
      phone: input.message.from,
      message: content?.text || content?.caption || '',
      file: content?.fileUrl ? {
        url: content.fileUrl,
        contentType: content.fileMimeType,
        caption: content.caption,
      } : undefined,
      channel: 'zenvia',
    };
  }

  // Direct API
  return {
    phone: input.phone,
    message: input.message || '',
    file: input.file,
    channel: input.channel || 'api',
  };
}

export async function main(input: any) {
  const { phone, message, file, channel } = parseWhatsAppInput(input);

  if (!phone) {
    return { error: 'phone is required' };
  }

  if (!message && !file) {
    return { error: 'message or file is required' };
  }

  identify(phone);

  // Track media type for analytics
  const mediaType = file?.contentType?.startsWith('audio') ? 'audio'
    : file?.contentType?.startsWith('image') ? 'image'
    : file ? 'file'
    : 'text';

  track('whatsapp_message_received', { channel, mediaType });

  try {
    const result = await whatsappAgent.process({
      message,
      sessionId: `whatsapp_${phone}`,
      file,
    });

    return {
      message: result.message,
      phone,
    };
  } catch (error) {
    console.error('[whatsapp-support] Error:', error);
    return { error: 'An error occurred processing your message' };
  }
}

Real-World Example: Transcription Tool

When you need more control over transcription (e.g., saving the transcription, analyzing it), use transcribe() inside a tool:
tools/process-voice.ts
import { createTool, transcribe } from '@runflow-ai/sdk';
import { track } from '@runflow-ai/sdk/observability';
import { z } from 'zod';

export const processVoiceTool = createTool({
  id: 'process-voice',
  description: 'Transcribe and analyze a voice message',
  inputSchema: z.object({
    audioUrl: z.string().describe('URL of the audio file'),
    language: z.string().optional().describe('Language code (default: pt)'),
  }),
  execute: async (params) => {
    try {
      const result = await transcribe({
        audioUrl: params.audioUrl,
        language: params.language || 'pt',
        provider: 'openai',
      });

      track('voice_transcribed', {
        language: params.language || 'pt',
        textLength: result.text.length,
      });

      return {
        success: true,
        text: result.text,
        language: params.language || 'pt',
      };
    } catch (error) {
      return {
        success: false,
        error: error instanceof Error ? error.message : 'Transcription failed',
      };
    }
  },
});

Tips

For WhatsApp agents, always enable both transcribeAudio and processImages. Users frequently send voice messages instead of typing, especially on mobile.
Audio transcription adds latency to your agent’s response (typically 1-3 seconds depending on audio length). Consider tracking transcription time with track() to monitor performance.
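One way to track transcription time is a small wrapper that measures any async call and hands the duration to a callback. `timed` is a hypothetical helper, not part of the SDK; the callback is where you would call track().

```typescript
// Run `work`, then report how long it took in milliseconds,
// whether it succeeded or threw.
async function timed<T>(
  work: () => Promise<T>,
  report: (durationMs: number) => void,
): Promise<T> {
  const start = Date.now();
  try {
    return await work();
  } finally {
    report(Date.now() - start);
  }
}
```

In an agent this might look like `await timed(() => transcribe({ audioUrl }), (ms) => track('transcription_time', { durationMs: ms }))`, where the event name and property are illustrative.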

Next Steps

  • Agents: Configure agents with media
  • Tools: Build custom media tools
  • Context Management: Identify users in WhatsApp
  • Best Practices: Tips for effective agents