Multimodal Capabilities#
MassGen supports comprehensive multimodal AI workflows, enabling agents to both understand and generate images, audio, video, and file content. This includes analyzing existing content and creating new multimodal outputs.
Note
Multimodal Tools (v0.1.3+):
MassGen provides custom tools for both understanding and generating multimodal content:
Understanding Tools:
✅ understand_audio: Transcribe audio files to text (uses OpenAI’s gpt-4o-transcribe by default)
✅ understand_file: Analyze documents (PDF, DOCX, XLSX, PPTX) and text files
✅ understand_image: Describe and analyze images; routes to the agent’s native backend when supported
✅ understand_video: Extract and analyze key frames from videos; routes to the best available backend
Native Backend Routing (v0.1.55+):
Image and video understanding now route API calls to the agent’s own backend when it supports the capability
Supported image backends: OpenAI, Claude, Gemini, Grok, Claude Code (SDK), Codex (CLI)
If the agent’s backend does not support image understanding, the tool falls back to OpenAI gpt-5.4
This preserves model diversity and per-agent consistency: a Claude agent analyzes images via Claude, not GPT
Backend Requirements:
For native routing, the agent’s backend API key must be available (e.g., ANTHROPIC_API_KEY for Claude)
Fallback to OpenAI requires the OPENAI_API_KEY environment variable, set in the .env file
Claude Code requires the claude CLI installed and authenticated
Codex requires the codex CLI installed and authenticated
Generation Tools:
✅ text_to_image_generation: Generate images from text prompts (GPT-4.1)
✅ image_to_image_generation: Create image variations from existing images
✅ text_to_video_generation: Generate videos from text descriptions (Sora-2)
✅ text_to_speech_continue_generation: Generate expressive speech with emotional tone
✅ text_to_speech_transcription_generation: Convert text to speech (TTS)
✅ text_to_file_generation: Generate formatted documents (TXT, MD, PDF, PPTX)
File Access:
Files must be accessible via context_paths configuration or created within agent workspaces
Supports both pre-existing files and agent-generated content
Provides secure, sandboxed file access to agents
Overview#
Multimodal capabilities extend MassGen’s multi-agent collaboration across different content types:
Image Capabilities:
Understanding: Analyze and describe image content (Vision models)
Generation: Create images from text prompts, generate variations from existing images
Audio Capabilities:
Understanding: Transcription, audio analysis
Generation: Text-to-speech with emotional expression, direct TTS conversion
Video Capabilities:
Understanding: Analyze video content through key frame extraction
Generation: Create videos from text descriptions
File Operations:
Understanding: Analyze documents and files (PDF, DOCX, XLSX, PPTX, text files)
Generation: Generate formatted documents from text prompts
Custom Tools: Comprehensive multimodal file handling
Image Understanding#
Image understanding enables agents to analyze visual content, extract information, and answer questions about images using the understand_image custom tool.
Note
Native backend routing (v0.1.55+): The understand_image tool now routes to the agent’s own backend when it supports image_understanding. For example, a Claude agent will use Claude’s vision API, a Gemini agent will use Gemini’s multimodal API, etc. If the agent’s backend doesn’t support image understanding, it falls back to OpenAI gpt-5.4.
Supported backends: OpenAI, Claude, Gemini, Grok, Claude Code (SDK), Codex (CLI).
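The routing rule in the note above can be sketched in a few lines of Python. This is only an illustration of the decision described here, not MassGen's actual implementation: the `Route` type, the function name, and the lowercase backend identifiers (other than `"openai"`) are assumptions made for the example.

```python
from dataclasses import dataclass

# Backends this section lists as supporting native image understanding.
# The identifier spellings are assumptions; only "openai" appears as a
# backend type in the configs on this page.
NATIVE_IMAGE_BACKENDS = {"openai", "claude", "gemini", "grok", "claude_code", "codex"}

# Fallback named in the note above.
FALLBACK_BACKEND = "openai"
FALLBACK_MODEL = "gpt-5.4"


@dataclass(frozen=True)
class Route:
    backend: str
    model: str
    is_fallback: bool


def route_image_understanding(backend_type: str, model: str) -> Route:
    """Pick the backend that should serve an understand_image call."""
    if backend_type in NATIVE_IMAGE_BACKENDS:
        # Native routing: the agent analyzes the image with its own model.
        return Route(backend_type, model, is_fallback=False)
    # Otherwise fall back to OpenAI, which needs OPENAI_API_KEY.
    return Route(FALLBACK_BACKEND, FALLBACK_MODEL, is_fallback=True)
```

Under these assumptions, a Claude agent keeps its own model, while an agent on a backend outside the supported list is routed to the OpenAI fallback.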
Basic Configuration#
Configure agents with the understand_image tool:
agents:
  - id: "vision_agent"
    backend:
      type: "openai"
      model: "gpt-5-nano"
      cwd: "workspace1"
      custom_tools:
        - name: ["understand_image"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_image.py"
          function: ["understand_image"]
    system_message: "You are a helpful assistant"
orchestrator:
  context_paths:
    - path: "@examples/resources/v0.0.27-example/multimodality.jpg"
      permission: "read"
Example Command:
massgen \
--config @examples/basic/single/single_gpt5nano_image_understanding.yaml \
"Please summarize the content in this image."
Multi-Agent Image Analysis#
Multiple agents can provide diverse perspectives on image content:
agents:
  - id: "response_agent1"
    backend:
      type: "openai"
      model: "gpt-5-nano"
      cwd: "workspace1"
      custom_tools:
        - name: ["understand_image"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_image.py"
          function: ["understand_image"]
    system_message: "You are a helpful assistant"
  - id: "response_agent2"
    backend:
      type: "openai"
      model: "gpt-5-nano"
      cwd: "workspace2"
      custom_tools:
        - name: ["understand_image"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_image.py"
          function: ["understand_image"]
    system_message: "You are a helpful assistant"
orchestrator:
  context_paths:
    - path: "@examples/resources/v0.0.27-example/multimodality.jpg"
      permission: "read"
Example Command:
massgen \
--config @examples/basic/multi/gpt5nano_image_understanding.yaml \
"Analyze this image and identify key elements, mood, and composition."
Use Cases:
Document analysis and OCR
Visual content description for accessibility
Image classification and categorization
Design feedback and critique
Scene understanding for robotics
Image Generation#
Generate images from text descriptions using AI models. MassGen provides two generation approaches:
Text-to-Image Generation#
Create new images from text prompts using GPT-4.1:
agents:
  - id: "image_generator"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace1"
      enable_image_generation: true
      custom_tools:
        - name: ["text_to_image_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_image_generation.py"
          function: ["text_to_image_generation"]
    system_message: "You are an AI assistant with access to text-to-image generation capabilities."
Example Command:
massgen \
--config massgen/configs/tools/custom_tools/multimodal_tools/text_to_image_generation_single.yaml \
"Please generate an image of a cat in space."
Key Features:
Powered by OpenAI’s GPT-4.1 model
Generates high-quality images from text descriptions
Automatically saves images to agent workspace
Image-to-Image Generation#
Create variations or modifications of existing images:
agents:
  - id: "image_editor"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace1"
      enable_image_generation: true
      custom_tools:
        - name: ["image_to_image_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/image_to_image_generation.py"
          function: ["image_to_image_generation"]
        - name: ["understand_image"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_image.py"
          function: ["understand_image"]
orchestrator:
  context_paths:
    - path: "path/to/source_image.jpg"
      permission: "read"
Use Cases:
Create artistic variations of existing images
Style transfer and image transformation
Generate similar images with different characteristics
Image editing and enhancement workflows
Multi-Agent Image Generation#
Combine understanding and generation capabilities with multiple agents:
agents:
  - id: "text_to_image_generation_tool1"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace1"
      enable_image_generation: true
      custom_tools:
        - name: ["text_to_image_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_image_generation.py"
          function: ["text_to_image_generation"]
        - name: ["understand_image"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_image.py"
          function: ["understand_image"]
        - name: ["image_to_image_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/image_to_image_generation.py"
          function: ["image_to_image_generation"]
    system_message: |
      You are an AI assistant with access to text-to-image generation capabilities.
  - id: "text_to_image_generation_tool2"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace2"
      enable_image_generation: true
      custom_tools:
        - name: ["text_to_image_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_image_generation.py"
          function: ["text_to_image_generation"]
        - name: ["understand_image"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_image.py"
          function: ["understand_image"]
    system_message: |
      You are an AI assistant with access to text-to-image generation capabilities.
orchestrator:
  snapshot_storage: "snapshots"
  agent_temporary_workspace: "temp_workspaces"
Example Command:
massgen \
--config massgen/configs/tools/custom_tools/multimodal_tools/text_to_image_generation_multi.yaml \
"Please generate an image of a cat in space."
Audio Understanding#
Transcribe and analyze audio files using the understand_audio custom tool.
Note
The understand_audio tool uses OpenAI’s Transcription API with the gpt-4o-transcribe model by default. This requires an OpenAI API key regardless of which backend your agent uses.
agents:
  - id: "transcriber"
    backend:
      type: "openai"
      model: "gpt-5-nano"
      cwd: "workspace1"
      custom_tools:
        - name: ["understand_audio"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_audio.py"
          function: ["understand_audio"]
orchestrator:
  context_paths:
    - path: "path/to/audio.mp3"
      permission: "read"
Supported Formats:
WAV, MP3, M4A, MP4, OGG, FLAC, AAC, WMA, OPUS
Example Use Cases:
Meeting transcription
Podcast analysis
Voice memo processing
Interview transcription
Audio content summarization
Audio/Speech Generation#
Generate speech and audio content from text using OpenAI’s audio generation capabilities. MassGen provides two text-to-speech approaches:
Expressive Speech Generation#
Generate natural-sounding speech with emotional expression using GPT-4o Audio:
agents:
  - id: "speech_generator"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace1"
      enable_audio_generation: true
      custom_tools:
        - name: ["text_to_speech_continue_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_speech_continue_generation.py"
          function: ["text_to_speech_continue_generation"]
    system_message: "You are an AI assistant with access to text-to-speech generation capabilities."
Example Command:
massgen \
--config massgen/configs/tools/custom_tools/multimodal_tools/text_to_speech_generation_single.yaml \
"I want you to tell me a very short introduction about Sherlock Holmes in one sentence, and I want you to use an emotional voice to read it out loud."
Key Features:
Powered by GPT-4o Audio Preview model
Supports emotional and expressive speech
Multiple voice options (alloy, echo, fable, onyx, nova, shimmer)
Output formats: WAV, MP3
Natural conversation flow with context awareness
Direct Text-to-Speech (TTS)#
Convert text directly to speech using OpenAI’s TTS API:
agents:
  - id: "tts_agent"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace1"
      enable_audio_generation: true
      custom_tools:
        - name: ["text_to_speech_transcription_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_speech_transcription_generation.py"
          function: ["text_to_speech_transcription_generation"]
Key Features:
Uses GPT-4o-mini-TTS for fast, cost-effective generation
Direct text-to-speech conversion
Supports multiple voices and output formats
Optional instructions for voice style customization
Streaming response for efficient processing
Supported Voices:
alloy - Neutral, balanced voice
echo - Clear, professional voice
fable - Warm, storytelling voice
onyx - Deep, authoritative voice
nova - Energetic, friendly voice
shimmer - Soft, gentle voice
Supported Formats:
MP3 (default)
WAV
OPUS
AAC
FLAC
Multi-Agent Audio/Speech Generation#
Combine understanding and generation capabilities with multiple agents:
agents:
  - id: "text_to_speech_continue_generation_tool1"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace1"
      enable_audio_generation: true
      custom_tools:
        - name: ["text_to_speech_transcription_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_speech_transcription_generation.py"
          function: ["text_to_speech_transcription_generation"]
        - name: ["understand_audio"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_audio.py"
          function: ["understand_audio"]
        - name: ["text_to_speech_continue_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_speech_continue_generation.py"
          function: ["text_to_speech_continue_generation"]
    system_message: |
      You are an AI assistant with access to text-to-speech generation capabilities.
  - id: "text_to_speech_continue_generation_tool2"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace2"
      enable_audio_generation: true
      custom_tools:
        - name: ["text_to_speech_transcription_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_speech_transcription_generation.py"
          function: ["text_to_speech_transcription_generation"]
        - name: ["understand_audio"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_audio.py"
          function: ["understand_audio"]
        - name: ["text_to_speech_continue_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_speech_continue_generation.py"
          function: ["text_to_speech_continue_generation"]
    system_message: |
      You are an AI assistant with access to text-to-speech generation capabilities.
orchestrator:
  snapshot_storage: "snapshots"
  agent_temporary_workspace: "temp_workspaces"
Example Command:
massgen \
--config massgen/configs/tools/custom_tools/multimodal_tools/text_to_speech_generation_multi.yaml \
"I want you to tell me a very short introduction about Sherlock Holmes in one sentence, and I want you to use an emotional voice to read it out loud."
Video Understanding#
Analyze and extract information from video files using the understand_video custom tool.
Note
The understand_video tool now routes to the agent’s native backend when it supports video_understanding. If the agent’s backend doesn’t support video understanding, it falls back to OpenAI gpt-5.4. The OpenAI fallback requires an OPENAI_API_KEY.
agents:
  - id: "video_analyzer"
    backend:
      type: "openai"
      model: "gpt-5-nano"
      cwd: "workspace1"
      custom_tools:
        - name: ["understand_video"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_video.py"
          function: ["understand_video"]
orchestrator:
  context_paths:
    - path: "path/to/video.mp4"
      permission: "read"
Supported Formats:
MP4, AVI, MOV, MKV, FLV, WMV, WEBM, M4V, MPG, MPEG
Example Use Cases:
Video content analysis
Scene detection and description
Action recognition
Video summarization
Quality assessment
Requirements:
Requires opencv-python (pip install opencv-python)
Optional: pip install massgen[video] for scene-based frame extraction
Configurable Frame Extraction (v0.1.56+):
By default, video understanding uses scene-based frame extraction (PySceneDetect) to select the most informative frames. You can configure the extraction strategy via multimodal_config:
agents:
  - id: "video_analyzer"
    backend:
      type: "openai"
      model: "gpt-5.4"
      enable_multimodal_tools: true
      multimodal_config:
        video:
          extraction_mode: "scene"  # "scene" (default) | "uniform"
          max_frames: 30            # Hard cap (default: 30, absolute max: 60)
          fps: 1.0                  # Uniform mode: frames per second
          threshold: 0.3            # Scene mode: detection sensitivity
          frames_per_scene: 3       # Scene mode: frames per detected scene
Extraction modes:
scene (default): Detects scene boundaries using PySceneDetect’s ContentDetector, then samples frames_per_scene frames within each scene. Falls back to uniform when PySceneDetect is not installed or no scenes are detected.
uniform: Evenly spaced frames based on fps (default 1.0 frame/sec) or num_frames (fixed count, overrides fps). Always capped at max_frames.
Cost guardrails: The max_frames setting (default 30) prevents runaway token costs on long videos. The absolute maximum is 60 frames regardless of configuration.
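To make the uniform mode and the frame caps concrete, here is a minimal Python sketch of how the frame count could be derived. It is not MassGen's code: the function name and the choice to sample the midpoint of each equal-length segment are assumptions for illustration. Only the parameters (fps, num_frames, max_frames) and the limits (default 30, absolute maximum 60) come from the text above.

```python
ABSOLUTE_MAX_FRAMES = 60  # hard ceiling stated above


def uniform_frame_times(duration_s, fps=1.0, num_frames=None, max_frames=30):
    """Return evenly spaced timestamps (in seconds) for uniform extraction."""
    # max_frames can never exceed the absolute maximum.
    cap = max(1, min(max_frames, ABSOLUTE_MAX_FRAMES))
    # A fixed num_frames overrides the fps-based count.
    wanted = num_frames if num_frames is not None else int(duration_s * fps)
    count = max(1, min(wanted, cap))
    step = duration_s / count
    # Assumption: take the midpoint of each equal-length segment.
    return [round((i + 0.5) * step, 3) for i in range(count)]
```

With the defaults, a 10-second clip yields 10 timestamps, while a 10-minute video is cut down to 30, so the per-call cost stays bounded regardless of video length.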
Video Generation#
Generate videos from text descriptions using OpenAI’s Sora-2 API:
agents:
  - id: "video_generator"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace1"
      enable_video_generation: true
      custom_tools:
        - name: ["text_to_video_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_video_generation.py"
          function: ["text_to_video_generation"]
    system_message: "You are an AI assistant with access to text-to-video generation capabilities."
Example Command:
massgen \
--config massgen/configs/tools/custom_tools/multimodal_tools/text_to_video_generation_single.yaml \
"Generate a 4-second video of a neon-lit alley at night, light rain, slow push-in, cinematic."
Key Features:
Powered by OpenAI’s Sora-2 model
Generate high-quality videos from text descriptions
Customizable video duration (4-20 seconds)
Automatic video download and storage
Supports detailed scene descriptions and camera movements
Use Cases:
Marketing and advertising content creation
Concept visualization and storyboarding
Educational and training videos
Social media content generation
Creative storytelling and animation
Product demonstration videos
Best Practices for Video Generation:
Provide detailed scene descriptions including:
Setting and environment
Lighting conditions
Camera movements (push-in, pull-out, pan, etc.)
Atmosphere and mood
Objects and characters
Use cinematic terminology for better results
Specify duration based on content complexity
Combine with the understand_video tool for quality verification
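The checklist above can be turned into a small prompt-building helper. The sketch below is purely illustrative: `build_video_prompt` and its parameters are hypothetical names, not part of MassGen or the Sora-2 API. It only assembles a prompt string and enforces the 4-20 second duration range mentioned in this section.

```python
def build_video_prompt(setting, lighting, camera, mood, subjects="", duration_s=4):
    """Assemble a text-to-video prompt from the elements listed above."""
    if not 4 <= duration_s <= 20:
        raise ValueError("duration must be between 4 and 20 seconds")
    # Order: setting, optional subjects, lighting, camera movement, mood.
    parts = [setting, subjects, lighting, camera, mood]
    details = ", ".join(p.strip() for p in parts if p.strip())
    return f"Generate a {duration_s}-second video: {details}."


prompt = build_video_prompt(
    setting="alley at night",
    lighting="neon lighting, light rain",
    camera="slow push-in",
    mood="cinematic",
)
```

The resulting string can be passed as the quoted prompt argument of the massgen command shown earlier.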
Multi-Agent Video Generation#
Combine video generation with analysis for iterative improvement:
agents:
  - id: "text_to_video_generation_tool1"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace1"
      enable_video_generation: true
      custom_tools:
        - name: ["understand_video"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_video.py"
          function: ["understand_video"]
        - name: ["text_to_video_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_video_generation.py"
          function: ["text_to_video_generation"]
    system_message: |
      You are an AI assistant with access to text-to-video generation capabilities.
  - id: "text_to_video_generation_tool2"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace2"
      enable_video_generation: true
      custom_tools:
        - name: ["understand_video"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_video.py"
          function: ["understand_video"]
        - name: ["text_to_video_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_video_generation.py"
          function: ["text_to_video_generation"]
    system_message: |
      You are an AI assistant with access to text-to-video generation capabilities.
orchestrator:
  snapshot_storage: "snapshots"
  agent_temporary_workspace: "temp_workspaces"
Example Command:
massgen \
--config massgen/configs/tools/custom_tools/multimodal_tools/text_to_video_generation_multi.yaml \
"Generate a 4-second video of a neon-lit alley at night, light rain, slow push-in, cinematic."
File Understanding#
File understanding capabilities enable agents to analyze documents and perform Q&A using the understand_file custom tool.
Configure agents to analyze files:
agents:
  - id: "document_agent"
    backend:
      type: "openai"
      model: "gpt-5-nano"
      cwd: "workspace1"
      custom_tools:
        - name: ["understand_file"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_file.py"
          function: ["understand_file"]
orchestrator:
  context_paths:
    - path: "path/to/document.pdf"
      permission: "read"
    - path: "path/to/report.docx"
      permission: "read"
Supported File Types:
Text Files: .py, .js, .java, .md, .txt, .log, .csv, .json, .yaml, etc.
PDF: Requires PyPDF2 (pip install PyPDF2)
Word: .docx - Requires python-docx (pip install python-docx)
Excel: .xlsx - Requires openpyxl (pip install openpyxl)
PowerPoint: .pptx - Requires python-pptx (pip install python-pptx)
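Because each document type depends on a different optional package, it can help to check for the right one before handing a file to an agent. The following Python sketch maps the extensions listed above to their packages and reports which one is missing. The helper name `missing_dependency` is hypothetical; the import names (`PyPDF2`, `docx`, `openpyxl`, `pptx`) are the modules those packages install.

```python
from importlib.util import find_spec
from pathlib import Path

# File extension -> (pip package name, importable module name).
OPTIONAL_DEPS = {
    ".pdf": ("PyPDF2", "PyPDF2"),
    ".docx": ("python-docx", "docx"),
    ".xlsx": ("openpyxl", "openpyxl"),
    ".pptx": ("python-pptx", "pptx"),
}


def missing_dependency(file_path):
    """Return the pip package to install for this file, or None if none is missing."""
    dep = OPTIONAL_DEPS.get(Path(file_path).suffix.lower())
    if dep is None:
        return None  # plain-text formats need no extra package
    package, module = dep
    # find_spec returns None when the top-level module is not installed.
    return None if find_spec(module) is not None else package
```

For example, `missing_dependency("path/to/report.docx")` returns `"python-docx"` on a machine where that package is not installed, and `None` otherwise.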
Example Use Case:
# Document Q&A
massgen \
--config @examples/basic/single/single_gpt5nano_file_search.yaml \
"What are the main conclusions from the research paper?"
File Generation#
Generate formatted documents from text using AI. The text_to_file_generation tool can create professional documents in various formats:
agents:
  - id: "document_generator"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace1"
      enable_file_generation: true
      custom_tools:
        - name: ["text_to_file_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_file_generation.py"
          function: ["text_to_file_generation"]
    system_message: "You are an AI assistant with access to text-to-file generation capabilities."
Example Command:
massgen \
--config massgen/configs/tools/custom_tools/multimodal_tools/text_to_file_generation_single.yaml \
"Please generate a comprehensive technical report about the latest developments in Large Language Models (LLMs) and Generative AI. The report should include: 1) Executive Summary, 2) Introduction to LLMs, 3) Recent breakthroughs, 4) Applications in industry, 5) Ethical considerations, 6) Future directions. Save it as a PDF file."
Supported Output Formats:
TXT - Plain text files
MD - Markdown formatted documents
PDF - Professional PDF documents with formatting
PPTX - PowerPoint presentations with slide structure
Multi-Agent Document Workflow#
Combine generation with review and refinement:
agents:
  - id: "text_to_file_generation_tool1"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace1"
      enable_file_generation: true
      custom_tools:
        - name: ["text_to_file_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_file_generation.py"
          function: ["text_to_file_generation"]
        - name: ["understand_file"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_file.py"
          function: ["understand_file"]
    system_message: |
      You are an AI assistant with access to text-to-file generation capabilities.
  - id: "text_to_file_generation_tool2"
    backend:
      type: "openai"
      model: "gpt-4o"
      cwd: "workspace2"
      enable_file_generation: true
      custom_tools:
        - name: ["text_to_file_generation"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/text_to_file_generation.py"
          function: ["text_to_file_generation"]
        - name: ["understand_file"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_file.py"
          function: ["understand_file"]
    system_message: |
      You are an AI assistant with access to text-to-file generation capabilities.
orchestrator:
  snapshot_storage: "snapshots"
  agent_temporary_workspace: "temp_workspaces"
Example Command:
massgen \
--config massgen/configs/tools/custom_tools/multimodal_tools/text_to_file_generation_multi.yaml \
"Please generate a comprehensive technical report about the latest developments in Large Language Models (LLMs) and Generative AI. The report should include: 1) Executive Summary, 2) Introduction to LLMs, 3) Recent breakthroughs, 4) Applications in industry, 5) Ethical considerations, 6) Future directions. Save it as a PDF file."
Requirements:
PDF generation requires reportlab (pip install reportlab)
PPTX generation requires python-pptx (pip install python-pptx)
Supported Backends#
Supported Backends: OpenAI, Claude, Claude Code, Gemini, Grok, Chat Completions (generic API), LM Studio, Inference (vLLM/SGLang)
Not Supported: Azure OpenAI, AG2 (these backends don’t support custom tools)
How It Works: Understanding tools route to the agent’s native backend when supported (v0.1.55+). Image understanding supports OpenAI, Claude, Gemini, Grok, Claude Code, and Codex natively. Unsupported backends fall back to OpenAI.
Requirements:
Your agent backend must support custom tools
The agent’s own API key should be available for native routing (e.g., ANTHROPIC_API_KEY for Claude agents)
OPENAI_API_KEY is needed as a fallback for backends without native image understanding
Claude Code requires the claude CLI; Codex requires the codex CLI
See Custom Tools for complete details on custom tool support by backend, and Backend Configuration for all backend capabilities including web search, code execution, and MCP support.
Configuration Examples#
Complete configuration files are available in the MassGen repository:
Custom Multimodal Understanding Tools (v0.1.3+):
massgen/configs/tools/custom_tools/multimodal_tools/understand_audio.yaml - Audio transcription tool
massgen/configs/tools/custom_tools/multimodal_tools/understand_file.yaml - File understanding tool (PDF, DOCX, etc.)
massgen/configs/tools/custom_tools/multimodal_tools/understand_image.yaml - Image understanding tool
massgen/configs/tools/custom_tools/multimodal_tools/understand_video.yaml - Video understanding tool
Custom Multimodal Generation Tools (Latest):
massgen/configs/tools/custom_tools/multimodal_tools/text_to_image_generation_single.yaml - Single-agent image generation
massgen/configs/tools/custom_tools/multimodal_tools/text_to_image_generation_multi.yaml - Multi-agent image generation
massgen/configs/tools/custom_tools/multimodal_tools/text_to_video_generation_single.yaml - Single-agent video generation
massgen/configs/tools/custom_tools/multimodal_tools/text_to_video_generation_multi.yaml - Multi-agent video generation
massgen/configs/tools/custom_tools/multimodal_tools/text_to_speech_generation_single.yaml - Single-agent speech generation
massgen/configs/tools/custom_tools/multimodal_tools/text_to_speech_generation_multi.yaml - Multi-agent speech generation
massgen/configs/tools/custom_tools/multimodal_tools/text_to_file_generation_single.yaml - Single-agent document generation
massgen/configs/tools/custom_tools/multimodal_tools/text_to_file_generation_multi.yaml - Multi-agent document generation
Image Understanding:
@examples/basic/single/single_gpt5nano_image_understanding.yaml - Image understanding
@examples/basic/multi/gpt5nano_image_understanding.yaml - Multi-agent image analysis
Audio Understanding:
@examples/basic/single/single_openrouter_audio_understanding.yaml - Audio transcription
Video Understanding:
@examples/basic/single/single_qwen_video_understanding.yaml - Video analysis with Qwen
File Operations:
@examples/basic/single/single_gpt5nano_file_search.yaml - Document Q&A with file search
Browse all examples in the Configuration README.
File Size Limits and Optimization#
MassGen automatically handles file size limits to prevent memory issues and API errors.
Default Size Limits#
Each multimodal tool has configurable size limits:
Images: 10MB (automatically resized if exceeded)
Videos: 50MB
Audio: 25MB
Automatic Image Resizing#
When an image exceeds the size limit, MassGen automatically:
Detects the oversized file
Compresses and resizes the image
Saves the optimized version to a temporary location
Processes the optimized image
Supported formats for auto-resizing: PNG, JPEG, JPG, WebP
Example log output:
Image size (12.5 MB) exceeds limit (10 MB). Attempting to resize...
Successfully resized image from 12.5 MB to 8.3 MB
Customizing Size Limits#
You can override size limits per tool call using the MAX_FILE_SIZE_MB parameter:
custom_tools:
  - name: ["understand_image"]
    category: "multimodal"
    path: "massgen/tool/_multimodal_tools/understand_image.py"
    function: ["understand_image"]
    preset_args:
      MAX_FILE_SIZE_MB: 15  # Increase limit to 15MB
Note: Increasing limits may cause:
Higher memory usage
API errors for very large files
Increased processing time
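The size-limit behavior described in this section can be summarized as a small decision function. This Python sketch is an illustration rather than MassGen's implementation: the function name and the `"reject"` outcome for oversized files that cannot be resized are assumptions (the section only states that images are resized), as is treating 1 MB as 1024 × 1024 bytes.

```python
from pathlib import Path

# Default limits and resizable formats as listed in this section.
DEFAULT_LIMITS_MB = {"image": 10, "video": 50, "audio": 25}
RESIZABLE_IMAGE_SUFFIXES = {".png", ".jpeg", ".jpg", ".webp"}


def plan_for_file(path, kind, size_bytes, limit_mb=None):
    """Return "process", "resize", or "reject" for a file of the given kind."""
    # limit_mb plays the role of a MAX_FILE_SIZE_MB override.
    limit = DEFAULT_LIMITS_MB[kind] if limit_mb is None else limit_mb
    size_mb = size_bytes / (1024 * 1024)
    if size_mb <= limit:
        return "process"
    if kind == "image" and Path(path).suffix.lower() in RESIZABLE_IMAGE_SUFFIXES:
        return "resize"
    # Assumption: anything else over the limit is refused.
    return "reject"
```

Under these assumptions, a 12.5 MB PNG is resized at the default 10 MB limit and processed as-is once the limit is raised to 15 MB.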
Best Practices#
API Keys and Backend Configuration
Native routing (v0.1.55+): Image and video understanding tools now route to the agent’s own backend when it supports the capability
Ensure your agent’s API key is set (e.g., ANTHROPIC_API_KEY for Claude, GEMINI_API_KEY for Gemini, XAI_API_KEY for Grok)
Set OPENAI_API_KEY as a fallback for backends without native image understanding
Claude Code requires the claude CLI installed and authenticated; Codex requires the codex CLI
Audio understanding still uses OpenAI’s gpt-4o-transcribe by default
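A quick preflight check for the keys listed above might look like the Python sketch below. It is a hedged example, not a MassGen utility: `missing_api_keys` is a hypothetical helper, and the lowercase backend identifiers other than `"openai"` are assumptions. CLI-based backends (Claude Code, Codex) authenticate through their own CLIs and are not covered here.

```python
import os

# Environment variable per backend, as named in this section. The backend
# identifiers other than "openai" are assumptions for illustration.
NATIVE_KEY_BY_BACKEND = {
    "openai": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "grok": "XAI_API_KEY",
}
FALLBACK_KEY = "OPENAI_API_KEY"


def missing_api_keys(backend_type, uses_audio=False, env=None):
    """List the environment variables that are required but unset."""
    env = os.environ if env is None else env
    # Backends without a native key entry rely on the OpenAI fallback.
    required = [NATIVE_KEY_BY_BACKEND.get(backend_type, FALLBACK_KEY)]
    # Audio transcription uses OpenAI regardless of the agent's backend.
    if uses_audio and FALLBACK_KEY not in required:
        required.append(FALLBACK_KEY)
    return [name for name in required if not env.get(name)]
```

Calling `missing_api_keys("claude", uses_audio=True)` before a run reports whether both the Anthropic key and the OpenAI key used for transcription are present in the environment.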
File Access and Configuration
Use context_paths to provide secure file access to agents for understanding tasks
Ensure files are accessible before running - use absolute paths or paths relative to the execution directory
Install required dependencies before use:
Audio Understanding: No additional dependencies (uses OpenAI API)
Video Understanding: pip install opencv-python
File Understanding (PDF): pip install PyPDF2
File Understanding (Word): pip install python-docx
File Understanding (Excel): pip install openpyxl
File Understanding (PowerPoint): pip install python-pptx
File Generation (PDF): pip install reportlab
File Generation (PPTX): pip install python-pptx
Generation Tool Configuration
Enable generation capabilities with backend flags:
enable_image_generation: true for image generation
enable_video_generation: true for video generation
enable_audio_generation: true for speech generation
enable_file_generation: true for document generation
Set an appropriate cwd for organized output storage
Use the storage_path parameter to customize output locations
Verify generated content with the corresponding understanding tools
Performance and Cost Optimization
Understanding Tools:
Set appropriate max_chars limits for large documents to control API costs
Adjust num_frames for videos (default: 8) based on content length and detail needed
Monitor OpenAI API usage when processing large files or many files
Generation Tools:
Image generation (GPT-4.1) is more expensive than standard API calls
Video generation (Sora-2) can be costly - use appropriate duration (4-20 seconds)
Speech generation costs vary by model (gpt-4o-audio-preview vs gpt-4o-mini-tts)
Use multi-agent to refine prompts before generation
Quality and Accuracy
Understanding:
Use high-quality source files (clear images, high-quality audio, well-lit videos)
Ask specific, detailed questions to get better responses
Use multi-agent collaboration for diverse perspectives on complex content
Generation:
Provide detailed, specific prompts for better generation results
For images: Include style, composition, lighting, and mood details
For videos: Specify scene, camera movements, duration, and atmosphere
For speech: Choose appropriate voice and specify emotional tone
For documents: Outline structure, sections, and formatting requirements
Combine understanding and generation agents for iterative refinement
Workspace Management
Configure cwd for organized file storage (both input and output)
Use snapshot_storage for agent collaboration and sharing generated content
Review generated content in agent workspaces before distribution
Include .massgen/ in .gitignore
Clean up old workspaces periodically to manage storage
Use descriptive filenames for generated content (automatic timestamp-based naming available)
Troubleshooting#
Image Issues:
Image file not found: Ensure the image path is added to context_paths and the file exists
orchestrator:
  context_paths:
    - path: "path/to/image.jpg"
      permission: "read"
Audio Issues:
Audio file not found: Ensure the audio path is in context_paths and the file exists
Unsupported audio format: Use supported formats: WAV, MP3, M4A, MP4, OGG, FLAC, AAC, WMA, OPUS
API transcription error: Verify the OpenAI API key is set in the .env file
Video Issues:
opencv-python not installed: Install with pip install opencv-python
Video file not found: Ensure the video path is in context_paths and the file exists
orchestrator:
  context_paths:
    - path: "path/to/video.mp4"
      permission: "read"
Unsupported video format: Use supported formats: MP4, AVI, MOV, MKV, FLV, WMV, WEBM, M4V, MPG, MPEG
High API costs: Reduce the num_frames parameter (default: 8) to extract fewer frames
General File Issues:
File not found: Ensure the file path is added to context_paths in the orchestrator configuration
orchestrator:
  context_paths:
    - path: "path/to/your/file"
      permission: "read"
Permission errors: Verify that files are readable and paths are accessible
Missing dependencies: Install required Python packages for specific file types
pip install PyPDF2 python-docx openpyxl python-pptx opencv-python reportlab
API and Dependency Issues:
Missing OpenAI API key: Set OPENAI_API_KEY in the .env file or as an environment variable
Import errors: Install required dependencies for your file types (see Best Practices section)
API costs: Monitor usage carefully - multimodal understanding can be expensive with large files or many frames
Use Cases#
Content Understanding:
Document Processing:
Analyze PDFs, Word docs, Excel sheets, PowerPoint presentations
Extract data from forms, tables, and structured documents
Summarize research papers, technical documentation, and reports
Media Analysis:
Transcribe meeting recordings, interviews, and podcasts
Analyze video content through key frame extraction
Extract information from screenshots, charts, and diagrams
Code and Visual Analysis:
Code analysis with AI-powered explanations
Visual content description for accessibility
Scene detection and description in videos
Content Generation:
Creative Content Creation:
Generate marketing visuals and product images from descriptions
Create social media content (images, videos, audio)
Produce concept art and design mockups
Generate voice-overs and narration for videos
Document and Report Generation:
Automatically generate technical reports and white papers
Create formatted business documentation (PDF, MD, TXT)
Produce meeting summaries and documentation
Generate educational materials and training guides
Video Production:
Create promotional and marketing videos from text descriptions
Generate concept visualization and storyboards
Produce educational content and tutorials
Create social media video content
Audio Content:
Generate audiobooks and narrated content
Create podcast intros and outros
Produce accessibility audio for visually impaired users
Generate multilingual voice content
Next Steps#
Backend Configuration - Backend-specific multimodal capabilities
File Operations & Workspace Management - Workspace and file management
Tools and Capabilities - Custom tools configuration and usage
Advanced Patterns - Advanced multimodal patterns
YAML Configuration Reference - Complete configuration reference