Process audio, images, and other media types in your agents
Runflow agents can process audio and images automatically. This is essential for WhatsApp integrations where users send voice messages and photos, and for HTTP clients that upload files directly to the agent via multipart/form-data.
There are two entry points for media:
input.file — singular. Used by webhook handlers (Twilio/WhatsApp, Meta/Messenger) that resolve a single media file per inbound message. Auto-processed when media.transcribeAudio or media.processImages is enabled.
input.attachments[] — plural. Populated when the agent is invoked via multipart/form-data (one or more files in the same request). Auto-bridged to a multimodal chat message when media.processAttachments is enabled.
Both paths produce the same downstream effect: the LLM receives a multimodal user message with image and/or file parts. Choose the entry point that matches how media reaches your agent.
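As a rough mental model of the two entry points, the sketch below decides which path would apply to a given input. The type names, the `resolveMediaPath` function, the content-type prefix checks, and the choice to check attachments first are all assumptions made for illustration; they are not part of the Runflow SDK.

```typescript
// Hypothetical shapes, inferred from the fields this page mentions.
interface MediaFile { url: string; contentType: string; caption?: string }
interface Attachment { url: string; content_type: string; name: string }
interface MediaConfig {
  transcribeAudio?: boolean;
  processImages?: boolean;
  processAttachments?: boolean;
}
interface AgentInput { message: string; file?: MediaFile; attachments?: Attachment[] }

type MediaPath = 'attachments' | 'file-audio' | 'file-image' | 'text-only';

// Decide which media path would apply to an input under a given config.
// Checking attachments before input.file is an assumption of this sketch.
function resolveMediaPath(input: AgentInput, media: MediaConfig): MediaPath {
  if (media.processAttachments && (input.attachments?.length ?? 0) > 0) {
    return 'attachments';
  }
  const type = input.file?.contentType ?? '';
  if (media.transcribeAudio && type.startsWith('audio/')) return 'file-audio';
  if (media.processImages && type.startsWith('image/')) return 'file-image';
  return 'text-only';
}
```

A webhook input carrying `file.contentType: 'audio/ogg'` resolves to `'file-audio'` only when `transcribeAudio` is on; with every flag off, any input falls through to `'text-only'`.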
Transcribe audio files to text using multiple providers:
    import { transcribe, Media } from '@runflow-ai/sdk';

    // Standalone function (default: OpenAI Whisper)
    const result = await transcribe({
      audioUrl: 'https://example.com/audio.ogg',
      language: 'pt',
    });
    console.log(result.text); // "Olá, como vai?"

    // Using a specific provider
    const result2 = await transcribe({
      audioUrl: 'https://example.com/audio.ogg',
      provider: 'deepgram',
      language: 'pt',
    });

    // Or via the Media class
    const result3 = await Media.transcribe({
      audioUrl: 'https://example.com/audio.ogg',
      provider: 'openai',
    });
Configure agents to automatically handle audio and image files. When a user sends a voice message, it’s transcribed before processing. When they send an image, it’s analyzed with vision capabilities.
    import { Agent, openai } from '@runflow-ai/sdk';

    const agent = new Agent({
      name: 'WhatsApp Assistant',
      instructions: 'You are a helpful assistant.',
      model: openai('gpt-4o'),
      media: {
        transcribeAudio: true,
        processImages: true,
        audioProvider: 'openai',
        audioLanguage: 'pt',
      },
    });

    // Audio files are automatically transcribed before processing
    const result = await agent.process({
      message: '',
      file: {
        url: 'https://zenvia.com/storage/audio.ogg',
        contentType: 'audio/ogg',
        caption: 'Voice message',
      },
    });

    // Images are automatically processed as multimodal
    const result2 = await agent.process({
      message: 'What is in this image?',
      file: {
        url: 'https://example.com/image.jpg',
        contentType: 'image/jpeg',
      },
    });
When you call the agent directly over HTTP and need to send files or images, post multipart/form-data to the agent endpoint. Runflow stores each upload, generates a short-lived URL, and delivers an input.attachments[] array to your agent.
    curl -X POST "https://executor.runflow.ai/agent/<agentId>?token=<agentToken>" \
      -F "message=quanto custa esse produto?" \
      -F "photo=@./produto.jpg"
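The same request can be assembled from TypeScript with the standard `URL` and `FormData` globals (available in browsers and in Node 18+). This is a minimal sketch: `buildAgentRequest` is a hypothetical helper, and the `message` and `photo` field names simply mirror the curl call above.

```typescript
// Build the URL and multipart body for a direct agent call.
// Hypothetical helper; field names mirror the curl example on this page.
function buildAgentRequest(
  agentId: string,
  agentToken: string,
  message: string,
  photo: Blob,
  filename: string,
): { url: string; body: FormData } {
  const url = new URL(`https://executor.runflow.ai/agent/${encodeURIComponent(agentId)}`);
  url.searchParams.set('token', agentToken);

  const body = new FormData();
  body.append('message', message);
  body.append('photo', photo, filename);
  return { url: url.toString(), body };
}

// Usage (performs a real network call, so it is left commented out):
// const { url, body } = buildAgentRequest(id, token, 'quanto custa esse produto?', blob, 'produto.jpg');
// const response = await fetch(url, { method: 'POST', body });
```

When a `FormData` body is passed to `fetch`, leave the `Content-Type` header unset so the runtime can add the multipart boundary itself.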
Enable media.processAttachments to have the SDK build the multimodal message for you. No glue code, no transform — the agent receives the attachments and forwards them to the LLM automatically.
agent.ts
    import { Agent, openai } from '@runflow-ai/sdk';

    export const agent = new Agent({
      name: 'product-helper',
      model: openai('gpt-4o'), // any vision-capable model
      instructions: 'Help the user evaluate products.',
      media: {
        processAttachments: true, // ← opt-in
      },
    });
What happens:
Runflow stores the upload and delivers an input to your agent that lists each stored file under input.attachments[].
The SDK builds a multimodal user message — text plus image — and sends it to the model.
The image reaches the LLM in whatever format that provider expects (OpenAI/Gemini get the URL directly; Anthropic, Groq, xAI and Azure all receive their respective shapes — handled transparently by the SDK).
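To make the steps above concrete, here is a sketch of what such an input might look like. The shape is inferred from the fields this page uses elsewhere (`name`, `content_type`, `url`, `message`); it is not the authoritative payload, the URL is a placeholder, and `partitionAttachments` is a hypothetical helper.

```typescript
// Hypothetical attachment shape, inferred from fields used elsewhere on this page.
interface Attachment {
  name: string;
  content_type: string;
  url: string; // short-lived URL generated by Runflow
}

interface AttachmentInput {
  message: string;
  attachments?: Attachment[];
}

// Illustrative input for the curl call shown earlier; the URL is a placeholder.
const exampleInput: AttachmentInput = {
  message: 'quanto custa esse produto?',
  attachments: [
    {
      name: 'produto.jpg',
      content_type: 'image/jpeg',
      url: 'https://example.com/short-lived/produto.jpg',
    },
  ],
};

// Split attachments into images (sent as image parts) and all other files.
function partitionAttachments(input: AttachmentInput): { images: Attachment[]; files: Attachment[] } {
  const all = input.attachments ?? [];
  return {
    images: all.filter((a) => a.content_type.startsWith('image/')),
    files: all.filter((a) => !a.content_type.startsWith('image/')),
  };
}
```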
If you need custom routing — download a CSV to parse locally, OCR a PDF before sending, fan out images to different conversations — leave processAttachments off and transform the attachments yourself:
    import { buildAttachmentsContent } from '@runflow-ai/sdk/core';
    import { agent } from './agent'; // assumes the agent exported from agent.ts

    export async function main(input: any) {
      // Use the same helper the SDK uses internally:
      const content = buildAttachmentsContent(input.attachments, input.message);

      // Or build it by hand, attachment by attachment:
      const parts: any[] = [{ type: 'text', text: input.message }];
      for (const att of input.attachments ?? []) {
        if (att.content_type.startsWith('image/')) {
          parts.push({ type: 'image_url', image_url: { url: att.url } });
        } else if (att.content_type === 'application/pdf') {
          // Maybe OCR locally, then send the extracted text.
          // extractPdfText is a placeholder: supply your own implementation.
          const text = await extractPdfText(att.url);
          parts.push({ type: 'text', text });
        } else {
          parts.push({ type: 'file_url', file_url: { url: att.url }, name: att.name });
        }
      }

      return await agent.process({
        ...input,
        messages: [{ role: 'user', content: parts }],
      });
    }
Oversize uploads return HTTP 413 AttachmentTooLarge. Malformed multipart returns 400 InvalidMultipart.
The URLs in input.attachments[].url are short-lived: your agent and the LLM provider should consume them within minutes of the request. For documents that need to live longer, persist them yourself (e.g., copy to your own storage).
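On the client side, the two documented status codes can be turned into actionable outcomes. The sketch below is one possible approach: `classifyUploadResponse` and the `UploadOutcome` type are hypothetical, and treating 5xx responses as retryable is a general HTTP convention assumed here, not something this page specifies.

```typescript
type UploadOutcome =
  | { ok: true }
  | { ok: false; retryable: boolean; reason: string };

// Map an HTTP status from the agent endpoint to a client-side outcome.
function classifyUploadResponse(status: number): UploadOutcome {
  if (status >= 200 && status < 300) return { ok: true };
  if (status === 413) {
    return { ok: false, retryable: false, reason: 'AttachmentTooLarge: shrink or split the upload' };
  }
  if (status === 400) {
    return { ok: false, retryable: false, reason: 'InvalidMultipart: check the form encoding' };
  }
  // Assumption: server-side errors may succeed on retry; other codes will not.
  return { ok: false, retryable: status >= 500, reason: `Unexpected status ${status}` };
}
```

Neither documented error is worth retrying unchanged: a 413 needs a smaller file and a 400 needs a corrected request body.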
For non-image files on providers that don’t support arbitrary documents, the SDK emits a [File: <name>] text placeholder so the model sees something coherent. If you need the model to actually read a PDF/CSV, parse it locally first and send the extracted text.
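The placeholder behavior can be pictured with a small sketch. The `toPart` function and the `providerSupportsFiles` flag are hypothetical names used for illustration, and the part shapes follow the hand-built example earlier on this page; the SDK's internal logic may differ.

```typescript
interface Attachment { url: string; content_type: string; name: string }

type Part =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } }
  | { type: 'file_url'; file_url: { url: string }; name: string };

// Convert one attachment into a message part, falling back to a text
// placeholder when the provider cannot accept arbitrary documents.
function toPart(att: Attachment, providerSupportsFiles: boolean): Part {
  if (att.content_type.startsWith('image/')) {
    return { type: 'image_url', image_url: { url: att.url } };
  }
  if (providerSupportsFiles) {
    return { type: 'file_url', file_url: { url: att.url }, name: att.name };
  }
  return { type: 'text', text: `[File: ${att.name}]` };
}
```

With `providerSupportsFiles` set to `false`, a CSV named `report.csv` becomes the text part `[File: report.csv]`, so the model knows a file was sent but cannot read its contents.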
A complete WhatsApp agent that handles text, voice messages, and photos. Users can send a voice message to explain their issue or a photo of a damaged product.
    import { Agent, openai } from '@runflow-ai/sdk';

    // createTicketTool is assumed to be defined elsewhere in your project.
    export const whatsappAgent = new Agent({
      name: 'WhatsApp Support',
      instructions: `You are a customer support agent for WhatsApp.

    ## Behavior
    - Respond in the customer's language
    - Be concise: WhatsApp messages should be short
    - When the customer sends a voice message, you'll receive the transcription; respond naturally
    - When the customer sends a photo, analyze it and respond accordingly

    ## Tools
    - Use create-ticket when the issue needs human follow-up
    - If a customer sends a photo of a damaged product, create a ticket with priority 'high'`,
      model: openai('gpt-4o'),
      memory: { maxTurns: 30 },
      media: {
        transcribeAudio: true,
        processImages: true,
        audioProvider: 'openai',
        audioLanguage: 'pt',
      },
      tools: {
        createTicket: createTicketTool,
      },
      observability: 'full',
    });
For WhatsApp agents, always enable both transcribeAudio and processImages. Users frequently send voice messages instead of typing, especially on mobile.
Audio transcription adds latency to your agent’s response (typically 1-3 seconds depending on audio length). Consider tracking transcription time with track() to monitor performance.
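One way to measure that latency is to wrap the transcription call in a generic timing helper. The sketch below deliberately does not call `track()`, since its signature is not shown on this page; `timed` and its `report` callback are hypothetical, and you would forward the measured duration to `track()` or any other metrics sink yourself.

```typescript
// Run an async task and report how long it took, even if it throws.
async function timed<T>(
  label: string,
  work: () => Promise<T>,
  report: (label: string, elapsedMs: number) => void,
): Promise<T> {
  const start = Date.now();
  try {
    return await work();
  } finally {
    report(label, Date.now() - start);
  }
}

// Usage sketch (assumes the transcribe() call shown earlier on this page):
// const result = await timed(
//   'transcription',
//   () => transcribe({ audioUrl, language: 'pt' }),
//   (label, ms) => console.log(`${label} took ${ms} ms`),
// );
```

Reporting from the `finally` block means failed transcriptions are timed too, which helps when slow calls and errors are related.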