Runflow agents can process audio and images automatically. This is essential for WhatsApp integrations where users send voice messages and photos.

Audio Transcription

Transcribe audio files to text using multiple providers:
import { transcribe, Media } from '@runflow-ai/sdk';

// Standalone function (default: OpenAI Whisper)
const result = await transcribe({
  audioUrl: 'https://example.com/audio.ogg',
  language: 'pt',
});

console.log(result.text); // "Olá, como vai?"

// Using specific provider
const result2 = await transcribe({
  audioUrl: 'https://example.com/audio.ogg',
  provider: 'deepgram',
  language: 'pt',
});

// Or via Media class
const result3 = await Media.transcribe({
  audioUrl: 'https://example.com/audio.ogg',
  provider: 'openai',
});

Supported Providers

| Provider | Status | Description |
| --- | --- | --- |
| `openai` | Available | OpenAI Whisper (default) |
| `deepgram` | Coming soon | Deepgram |
| `assemblyai` | Coming soon | AssemblyAI |
| `google` | Coming soon | Google Speech-to-Text |
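Since only `openai` is available today, code that accepts a caller-supplied provider may want a fallback. A minimal sketch, assuming a helper of our own (`pickAudioProvider` and `AVAILABLE_PROVIDERS` are not part of the SDK):

```typescript
type AudioProvider = 'openai' | 'deepgram' | 'assemblyai' | 'google';

// Providers that can transcribe today (per the table above).
const AVAILABLE_PROVIDERS = new Set<AudioProvider>(['openai']);

// Return the requested provider if it is available; otherwise fall back
// to the default (OpenAI Whisper) so transcription still succeeds.
function pickAudioProvider(requested?: AudioProvider): AudioProvider {
  if (requested && AVAILABLE_PROVIDERS.has(requested)) return requested;
  return 'openai';
}
```

Passing the result to `transcribe({ provider: pickAudioProvider(requested), ... })` keeps the call working even when a not-yet-available provider is requested.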

Agent with Auto Media Processing

Configure agents to automatically handle audio and image files. When a user sends a voice message, it’s transcribed before processing. When they send an image, it’s analyzed with vision capabilities.
import { Agent, openai } from '@runflow-ai/sdk';

const agent = new Agent({
  name: 'WhatsApp Assistant',
  instructions: 'You are a helpful assistant.',
  model: openai('gpt-4o'),

  media: {
    transcribeAudio: true,
    processImages: true,
    audioProvider: 'openai',
    audioLanguage: 'pt',
  },
});

// Audio files are automatically transcribed before processing
const result = await agent.process({
  message: '',
  file: {
    url: 'https://zenvia.com/storage/audio.ogg',
    contentType: 'audio/ogg',
    caption: 'Voice message',
  },
});

// Images are automatically processed as multimodal
const result2 = await agent.process({
  message: 'What is in this image?',
  file: {
    url: 'https://example.com/image.jpg',
    contentType: 'image/jpeg',
  },
});

Media Config Options

| Option | Type | Description |
| --- | --- | --- |
| `transcribeAudio` | boolean | Auto-transcribe audio (default: `false`) |
| `processImages` | boolean | Auto-process images (default: `false`) |
| `audioLanguage` | string | Language code (`pt`, `en`, `es`, etc.) |
| `audioProvider` | string | `openai` \| `deepgram` \| `assemblyai` \| `google` |
| `audioModel` | string | Provider-specific model |
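For reference, the options above can be captured as a TypeScript shape (a sketch mirroring the documented fields; the `MediaConfig` interface name is ours, not an SDK export):

```typescript
// Shape of the agent `media` config, mirroring the table above.
interface MediaConfig {
  transcribeAudio?: boolean; // default: false
  processImages?: boolean;   // default: false
  audioLanguage?: string;    // e.g. 'pt', 'en', 'es'
  audioProvider?: 'openai' | 'deepgram' | 'assemblyai' | 'google';
  audioModel?: string;       // provider-specific model
}

// A typical WhatsApp setup: both media types enabled, Portuguese audio.
const media: MediaConfig = {
  transcribeAudio: true,
  processImages: true,
  audioProvider: 'openai',
  audioLanguage: 'pt',
};
```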

Real-World Example: WhatsApp Support Agent

A complete WhatsApp agent that handles text, voice messages, and photos. Users can send a voice message to explain their issue or a photo of a damaged product.

Project Structure

whatsapp-support/
├── main.ts
├── agent.ts
├── tools/
│   ├── index.ts
│   └── create-ticket.ts
├── .runflow/
│   └── rf.json
├── package.json
└── tsconfig.json

Agent with Media

agent.ts
import { Agent, openai } from '@runflow-ai/sdk';
import { createTicketTool } from './tools';

export const whatsappAgent = new Agent({
  name: 'WhatsApp Support',
  instructions: `You are a customer support agent for WhatsApp.

## Behavior
- Respond in the customer's language
- Be concise — WhatsApp messages should be short
- When the customer sends a voice message, you'll receive the transcription — respond naturally
- When the customer sends a photo, analyze it and respond accordingly

## Tools
- Use create-ticket when the issue needs human follow-up
- If a customer sends a photo of a damaged product, create a ticket with priority 'high'`,
  model: openai('gpt-4o'),
  memory: { maxTurns: 30 },
  media: {
    transcribeAudio: true,
    processImages: true,
    audioProvider: 'openai',
    audioLanguage: 'pt',
  },
  tools: {
    createTicket: createTicketTool,
  },
  observability: 'full',
});
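The instructions above reference the `create-ticket` tool. Its core logic might look like the sketch below (the ticket shape and the `priorityFor` helper are our assumptions, not part of the SDK; in `tools/create-ticket.ts` this would be wrapped with `createTool` and a zod schema, as in the transcription tool example later in this page):

```typescript
type Priority = 'low' | 'normal' | 'high';

interface TicketInput {
  summary: string;
  hasDamagePhoto?: boolean; // customer sent a photo of a damaged product
}

// Per the agent instructions, damaged-product photos escalate to 'high'.
function priorityFor(input: TicketInput): Priority {
  return input.hasDamagePhoto ? 'high' : 'normal';
}

// Build the ticket payload the tool would hand to your ticketing backend.
function buildTicket(input: TicketInput) {
  return {
    summary: input.summary,
    priority: priorityFor(input),
    createdAt: new Date().toISOString(),
  };
}
```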

Main Entry Point

main.ts
import { identify, track } from '@runflow-ai/sdk/observability';
import { whatsappAgent } from './agent';

function parseWhatsAppInput(input: any) {
  // Zenvia webhook format
  if (input.message?.from) {
    const content = input.message.contents?.[0];
    return {
      phone: input.message.from,
      message: content?.text || content?.caption || '',
      file: content?.fileUrl ? {
        url: content.fileUrl,
        contentType: content.fileMimeType,
        caption: content.caption,
      } : undefined,
      channel: 'zenvia',
    };
  }

  // Direct API
  return {
    phone: input.phone,
    message: input.message || '',
    file: input.file,
    channel: input.channel || 'api',
  };
}

export async function main(input: any) {
  const { phone, message, file, channel } = parseWhatsAppInput(input);

  if (!phone) {
    return { error: 'phone is required' };
  }

  if (!message && !file) {
    return { error: 'message or file is required' };
  }

  identify(phone);

  // Track media type for analytics
  const mediaType = file?.contentType?.startsWith('audio') ? 'audio'
    : file?.contentType?.startsWith('image') ? 'image'
    : 'text';

  track('whatsapp_message_received', { channel, mediaType });

  try {
    const result = await whatsappAgent.process({
      message,
      sessionId: `whatsapp_${phone}`,
      file,
    });

    return {
      message: result.message,
      phone,
    };
  } catch (error) {
    console.error('[whatsapp-support] Error:', error);
    return { error: 'An error occurred processing your message' };
  }
}
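The media-type classification inlined in `main.ts` can be pulled into a small helper so it is easy to unit-test (the `classifyMedia` name is ours; the logic mirrors the ternary above):

```typescript
type MediaType = 'audio' | 'image' | 'text';

// Classify an incoming file by MIME type, mirroring the ternary in main.ts.
function classifyMedia(file?: { contentType?: string }): MediaType {
  if (file?.contentType?.startsWith('audio')) return 'audio';
  if (file?.contentType?.startsWith('image')) return 'image';
  return 'text';
}
```

`main.ts` would then call `track('whatsapp_message_received', { channel, mediaType: classifyMedia(file) })`.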

Real-World Example: Transcription Tool

When you need more control over transcription (e.g., saving the transcription, analyzing it), use transcribe() inside a tool:
tools/process-voice.ts
import { createTool, transcribe } from '@runflow-ai/sdk';
import { track } from '@runflow-ai/sdk/observability';
import { z } from 'zod';

export const processVoiceTool = createTool({
  id: 'process-voice',
  description: 'Transcribe and analyze a voice message',
  inputSchema: z.object({
    audioUrl: z.string().describe('URL of the audio file'),
    language: z.string().optional().describe('Language code (default: pt)'),
  }),
  execute: async ({ context }) => {
    try {
      const result = await transcribe({
        audioUrl: context.audioUrl,
        language: context.language || 'pt',
        provider: 'openai',
      });

      track('voice_transcribed', {
        language: context.language || 'pt',
        textLength: result.text.length,
      });

      return {
        success: true,
        text: result.text,
        language: context.language || 'pt',
      };
    } catch (error) {
      return {
        success: false,
        error: error instanceof Error ? error.message : 'Transcription failed',
      };
    }
  },
});

Tips

For WhatsApp agents, always enable both transcribeAudio and processImages. Users frequently send voice messages instead of typing, especially on mobile.
Audio transcription adds latency to your agent’s response (typically 1-3 seconds depending on audio length). Consider tracking transcription time with track() to monitor performance.
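One way to monitor that latency is a small timing wrapper around the transcription call (a sketch; `withTiming` is our helper, not an SDK API):

```typescript
// Measure how long an async operation takes, so the duration can be
// reported to analytics alongside the transcription result.
async function withTiming<T>(
  fn: () => Promise<T>
): Promise<{ result: T; ms: number }> {
  const start = Date.now();
  const result = await fn();
  return { result, ms: Date.now() - start };
}

// Usage with the SDK (assumes `transcribe` and `track` are imported):
// const { result, ms } = await withTiming(() =>
//   transcribe({ audioUrl, language: 'pt' })
// );
// track('voice_transcribed', { ms, textLength: result.text.length });
```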

Next Steps

Agents: Configure agents with media

Tools: Build custom media tools

Context Management: Identify users in WhatsApp

Best Practices: Tips for effective agents