1. Direct multimodal call
Build the messages array yourself and pass it to agent.process. Best when you already have URLs or base64 strings on hand.
The same content array works across providers: the SDK translates the parts to the native format each model expects (OpenAI, Anthropic, Bedrock, Gemini, Groq, xAI, Azure OpenAI).
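As a minimal sketch of what such a messages array might look like, the snippet below builds one user message with a text part and an image part. The part shapes (`{ type: "text" }`, `{ type: "image_url", image_url: { url } }`), the `Message` type, and the helper names are assumptions made for illustration; check the SDK's own types before relying on them. The `agent.process` call is shown only as a comment because its exact signature is not given on this page.

```typescript
// Hypothetical part shapes; the real SDK types may differ.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type ContentPart = TextPart | ImagePart;
type Message = { role: "user" | "assistant" | "system"; content: ContentPart[] };

// Wrap raw base64 in a data: URI so it can travel in the same field as a URL.
function toDataUri(base64: string, mimeType: string): string {
  return `data:${mimeType};base64,${base64}`;
}

// Build a single user message that pairs a prompt with one image.
function buildImageMessage(prompt: string, imageUrl: string): Message[] {
  return [
    {
      role: "user",
      content: [
        { type: "text", text: prompt },
        { type: "image_url", image_url: { url: imageUrl } },
      ],
    },
  ];
}

const messages = buildImageMessage(
  "What is in this picture?",
  "https://example.com/cat.png", // placeholder URL
);

// Assumed call shape, not confirmed by this page:
// const result = await agent.process({ messages });
```

For a base64 image, pass `toDataUri(base64, "image/png")` as the second argument instead of a URL.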
2. multipart/form-data uploads
When a client sends a file or image directly to the agent endpoint over HTTP, post multipart/form-data. Runflow stores the upload and exposes it to your agent as input.attachments[].
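On the client side, a multipart/form-data body can be assembled with the standard `FormData` and `Blob` globals (available in browsers and Node 18+). The sketch below is only illustrative: the field names `message` and `file`, the helper name, and the endpoint URL are all hypothetical, since this page does not specify them.

```typescript
// Hypothetical field names ("message", "file"); check your endpoint's contract.
function buildUploadForm(
  prompt: string,
  contents: string, // text contents here; binary data would go in as a Blob too
  filename: string,
  contentType: string,
): FormData {
  const form = new FormData();
  form.append("message", prompt);
  form.append("file", new Blob([contents], { type: contentType }), filename);
  return form;
}

const form = buildUploadForm("Summarize this file", "a,b\n1,2\n", "data.csv", "text/csv");

// Sending is left commented out because the URL is a placeholder.
// When the body is a FormData, fetch sets the multipart Content-Type
// header (including the boundary) itself, so do not set it manually.
// await fetch("https://example.com/your-agent-endpoint", { method: "POST", body: form });
```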
Enable media.processAttachments to let the SDK build the multimodal message for you:
- content_type starting with image/ → image_url
- anything else → file_url
To build the parts from input.attachments[] yourself, see Media Processing for the manual recipe.
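The content_type rule above can be sketched as a small mapping function. This is not the SDK's implementation or the manual recipe from Media Processing: the `url` and `filename` fields on an attachment and the shape of the resulting parts are assumptions for illustration; only `content_type`, `image_url`, and `file_url` come from this page.

```typescript
// Assumed attachment shape: only content_type is documented on this page.
type Attachment = { content_type: string; url: string; filename: string };

// Assumed part shapes for the two documented part types.
type AttachmentPart =
  | { type: "image_url"; image_url: { url: string } }
  | { type: "file_url"; file_url: { url: string; filename: string } };

// image/* becomes an image_url part; everything else becomes a file_url part.
function attachmentToPart(a: Attachment): AttachmentPart {
  if (a.content_type.startsWith("image/")) {
    return { type: "image_url", image_url: { url: a.url } };
  }
  return { type: "file_url", file_url: { url: a.url, filename: a.filename } };
}

const attachmentParts = [
  { content_type: "image/png", url: "https://example.com/a.png", filename: "a.png" },
  { content_type: "application/pdf", url: "https://example.com/b.pdf", filename: "b.pdf" },
].map(attachmentToPart);
```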
3. Webhook handlers with input.file
Twilio/WhatsApp and Meta/Messenger handlers deliver a single media file per message as input.file. The file is auto-processed when media.transcribeAudio or media.processImages is enabled.
Provider support at a glance
| Provider | Images | Files (PDF, CSV, …) |
|---|---|---|
| OpenAI | URL or base64 | file_id (uploaded via Runtime API) |
| Azure OpenAI | URL or base64 | text label only |
| Anthropic / Bedrock | URL or base64 | text label only |
| Gemini | base64 inline / data: URI | text label only |
| Groq / xAI | URL or base64 (vision models) | text label only |
Where the table says "text label only", the model receives a [File: <name>] text placeholder so it sees something coherent. If you need the model to actually read a PDF/CSV, parse it locally first and send the extracted text as a text part.
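One way to follow that advice is to wrap the locally extracted text in a text part, labeled with the file name so the model knows where it came from. The sketch below assumes a `{ type: "text", text }` part shape and uses a hypothetical helper name; the extraction itself (PDF or CSV parsing) is left to whatever library you already use.

```typescript
// Assumed shape of a text part; the real SDK type may differ.
type ExtractedTextPart = { type: "text"; text: string };

// Prefix the extracted text with the same [File: <name>] label the
// placeholder uses, so the model sees the file name and its contents.
function fileAsTextPart(name: string, extractedText: string): ExtractedTextPart {
  return { type: "text", text: `[File: ${name}]\n${extractedText}` };
}

const csvAsText = fileAsTextPart("report.csv", "region,total\nEU,42\nUS,58");
```

The resulting part can sit in the same content array as any other text or image part.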
Next Steps
- Media Processing: Full media handling guide with WhatsApp example
- Streaming: Stream multimodal responses