Runflow accepts multimodal content — text + image, text + file, or text + multiple attachments — through three entry points. Pick the one that matches how the media reaches your agent.

1. Direct multimodal call

Build the messages array yourself and pass it to agent.process. Best when you already have URLs or base64 strings on hand.
import { Agent, openai } from '@runflow-ai/sdk';

const agent = new Agent({
  name: 'vision-agent',
  instructions: 'You can analyze images.',
  model: openai('gpt-4o'),
});

await agent.process({
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What is in this image?' },
        { type: 'image_url', image_url: { url: 'https://example.com/image.jpg' } },
      ],
    },
  ],
});
The same content array works across providers — the SDK translates the parts to the native format each model expects (OpenAI, Anthropic, Bedrock, Gemini, Groq, xAI, Azure OpenAI).
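When the image is already in memory as a base64 string, one way to build the same content array is sketched below. It assumes base64 images are passed as data: URIs in image_url.url, following the OpenAI-style part format used above. The helper names imagePartFromBase64 and buildUserMessage are hypothetical and not part of the SDK.

```typescript
// Local types so the sketch is self-contained; they mirror the part shapes
// used in the example above.
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image_url'; image_url: { url: string } };
type ContentPart = TextPart | ImagePart;

// Hypothetical helper: wraps a base64 payload in a data: URI.
// Assumption: the SDK accepts data: URIs wherever it accepts image URLs.
function imagePartFromBase64(base64: string, mimeType: string): ImagePart {
  return {
    type: 'image_url',
    image_url: { url: `data:${mimeType};base64,${base64}` },
  };
}

// Hypothetical helper: one user message with the text first, then the images.
function buildUserMessage(
  prompt: string,
  images: ImagePart[],
): { role: 'user'; content: ContentPart[] } {
  return { role: 'user', content: [{ type: 'text', text: prompt }, ...images] };
}
```

The resulting message can be placed in the messages array passed to agent.process, in the same position as the URL-based example.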

2. multipart/form-data uploads

When a client sends a file or image directly to the agent endpoint over HTTP, post multipart/form-data. Runflow stores the upload and exposes it to your agent as input.attachments[].
curl -X POST "https://executor.runflow.ai/agent/<agentId>?token=<agentToken>" \
  -F "message=how much does this product cost?" \
  -F "photo=@./produto.jpg"
Enable media.processAttachments to let the SDK build the multimodal message for you:
const agent = new Agent({
  name: 'product-helper',
  model: openai('gpt-4o'),
  instructions: 'Help the user evaluate products.',
  media: { processAttachments: true },
});
Routing rule:
  • content_type starting with image/ → image_url
  • anything else → file_url
For custom routing (OCR a PDF, parse a CSV locally, etc.), leave the flag off and transform input.attachments[] yourself — see Media Processing for the manual recipe.
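The routing rule above can be sketched as a plain function, which may serve as a starting point when you leave the flag off and route attachments yourself. The Attachment shape (name, url) and the exact file_url part shape are assumptions made for illustration. Only the content_type check comes from the rule as stated.

```typescript
// Assumed attachment shape; only content_type is taken from the routing rule.
type Attachment = { name: string; url: string; content_type: string };

type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } }
  // The file_url part shape is an assumption, not a documented type.
  | { type: 'file_url'; file_url: { url: string } };

// Images become image_url parts; everything else becomes a file_url part.
function routeAttachment(att: Attachment): ContentPart {
  if (att.content_type.startsWith('image/')) {
    return { type: 'image_url', image_url: { url: att.url } };
  }
  return { type: 'file_url', file_url: { url: att.url } };
}

// Puts the user's text first, followed by one part per attachment.
function buildContent(message: string, attachments: Attachment[]): ContentPart[] {
  return [{ type: 'text', text: message }, ...attachments.map(routeAttachment)];
}
```

A custom router would replace the second branch of routeAttachment, for example by running OCR on a PDF and returning a text part instead.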

3. Webhook handlers with input.file

Twilio/WhatsApp and Meta/Messenger handlers deliver a single media file per message as input.file. The SDK processes it automatically when media.transcribeAudio or media.processImages is enabled.
const agent = new Agent({
  name: 'whatsapp-bot',
  model: openai('gpt-4o'),
  instructions: 'Reply in the customer\'s language.',
  media: { transcribeAudio: true, processImages: true },
});
See Media Processing for the full WhatsApp example.

Provider support at a glance

Provider | Images | Files (PDF, CSV, …)
OpenAI | URL or base64 | file_id (uploaded via Runtime API)
Azure OpenAI | URL or base64 | text label only
Anthropic / Bedrock | URL or base64 | text label only
Gemini | base64 inline / data: URI | text label only
Groq / xAI | URL or base64 (vision models) | text label only
For non-image files on providers that don’t support arbitrary documents, the SDK emits a [File: <name>] text placeholder so the model sees something coherent. If you need the model to actually read a PDF/CSV, parse it locally first and send the extracted text as a text part.
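As a rough illustration of the two options above, the sketch below builds either the placeholder text or a text part carrying locally extracted content. Both helper names and the maxChars cutoff are hypothetical. The code shows the idea only and does not reproduce the SDK's internal behavior.

```typescript
type TextPart = { type: 'text'; text: string };

// Mirrors the [File: <name>] placeholder format described above.
function filePlaceholder(name: string): TextPart {
  return { type: 'text', text: `[File: ${name}]` };
}

// Hypothetical helper: wraps text you extracted locally (from a PDF, CSV, ...)
// in a text part, truncating long input so the prompt stays bounded.
function extractedTextPart(name: string, extracted: string, maxChars = 8000): TextPart {
  const body =
    extracted.length > maxChars
      ? `${extracted.slice(0, maxChars)}\n[truncated]`
      : extracted;
  return { type: 'text', text: `Contents of ${name}:\n${body}` };
}
```

The extraction step itself (PDF parsing, CSV reading) is left to whichever library you already use. Only the resulting string is passed in.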

Next Steps

Media Processing

Full media handling guide with WhatsApp example

Streaming

Stream multimodal responses