Terminal Evaluation#
MassGen can evaluate its own terminal display and frontend user experience by recording terminal sessions as videos and analyzing them using AI vision models. This is useful for:
Frontend development: Evaluate UI/UX changes to the terminal display
Quality assurance: Verify that status indicators, coordination displays, and agent outputs are clear
Case study creation: Record demos and automatically generate video content
User testing: Analyze how well the terminal communicates agent progress and results
Note
Quick Setup Summary:
Install VHS terminal recorder:
brew install vhs(macOS) orgo install github.com/charmbracelet/vhs@latestEnsure OpenAI API key is configured in
.envUse the
run_massgen_with_recordingtool in your configAgent records, analyzes, and provides UX feedback automatically
Quick Start: Try It Now#
MassGen includes a working example you can try immediately:
# Evaluate the terminal display for a simple task
massgen \
--config massgen/configs/tools/custom_tools/terminal_evaluation.yaml \
"Record and evaluate the terminal display for the todo example config"
The agent will:
Record a MassGen session running the todo example
Save the recording as an MP4 video in the workspace
Extract key frames and analyze them with GPT-4.1
Provide detailed feedback on terminal display quality
How It Works#
The run_massgen_with_recording tool follows this workflow:
Create VHS Tape: Generates a VHS script to record the terminal session
Run MassGen: Executes MassGen WITHOUT
--automationflag (to capture rich terminal display)Record Video: VHS records the terminal session as MP4/GIF/WebM
Extract Frames: Extracts key frames from the video (default: 12 frames)
Analyze Display: Uses the
understand_videotool to evaluate UX qualityReturn Feedback: Provides structured evaluation with recommendations
The tool automatically saves videos to the agent workspace for reuse in case studies.
Prerequisites#
1. Install VHS Terminal Recorder
VHS (by Charm) is required to record terminal sessions:
# macOS
brew install vhs
# Linux/Windows (requires Go)
go install github.com/charmbracelet/vhs@latest
Verify installation:
vhs --version
2. OpenAI API Key
The tool uses GPT-4.1 for video analysis. Ensure your .env file contains:
OPENAI_API_KEY=sk-...
3. Dependencies
The understand_video tool requires opencv-python:
pip install opencv-python
Basic Usage#
Example 1: Evaluate a Simple Config
# terminal_eval_basic.yaml
agents:
- id: "evaluator"
backend:
type: "openai"
model: "gpt-5-mini"
cwd: "workspace"
enable_mcp_command_line: true # Required for VHS
custom_tools:
- name: ["run_massgen_with_recording"]
category: "terminal_evaluation"
path: "massgen/tool/_multimodal_tools/run_massgen_with_recording.py"
function: ["run_massgen_with_recording"]
system_message: |
You can record and evaluate MassGen terminal displays.
Use run_massgen_with_recording to test configs and provide UX feedback.
orchestrator:
context_paths:
- path: "massgen/configs/simple_two_agents.yaml"
permission: "read"
ui:
display_type: "rich_terminal"
logging_enabled: true
Run with:
massgen --config terminal_eval_basic.yaml "Evaluate the simple two agents config"
Example 2: Custom Evaluation Criteria
You can customize the evaluation prompt to focus on specific aspects:
# In the agent's prompt or directly in tool call
run_massgen_with_recording(
config_path="my_config.yaml",
question="Create a todo list app",
evaluation_prompt="""
Focus on the coordination display. Evaluate:
1. How clearly does it show agent collaboration?
2. Are status transitions (streaming → answered → voted) clear?
3. Is the winner selection process visible?
4. What improvements would enhance multi-agent visualization?
"""
)
Tool Parameters#
async def run_massgen_with_recording(
config_path: str,
question: str,
evaluation_prompt: str = "Evaluate the terminal display quality...",
output_format: str = "mp4",
num_frames: int = 12,
timeout_seconds: int = 300,
width: int = 1200,
height: int = 800,
allowed_paths: Optional[List[str]] = None,
agent_cwd: Optional[str] = None,
) -> ExecutionResult
Parameters:
config_path(str, required): Path to MassGen config file (YAML)Relative paths resolved relative to agent workspace
Absolute paths must be within allowed directories
question(str, required): Question to pass to MassGenevaluation_prompt(str): Prompt for evaluating terminal displayDefault: Comprehensive UX evaluation (clarity, information density, status indicators, user experience)
Customize to focus on specific aspects (coordination, readability, etc.)
output_format(str): Video format -"mp4"(default),"gif", or"webm"MP4: Best quality, suitable for case studies
GIF: Smaller file size, easier to embed in docs
WebM: Modern web format with good compression
num_frames(int): Number of frames to extract for analysis (default: 12)Higher values (16+) provide more detail but increase API costs
Lower values (4-8) faster and cheaper but may miss details
Recommended: 8-16 frames for most evaluations
timeout_seconds(int): Maximum time to wait for MassGen completion (default: 300)Adjust based on task complexity
Longer tasks need higher timeouts
VHS will wait this long before stopping recording
width(int): Terminal width in pixels (default: 1200)height(int): Terminal height in pixels (default: 800)Adjust for your preferred terminal dimensions
Larger dimensions capture more detail but increase file size
Returns:
{
"success": true,
"operation": "run_massgen_with_recording",
"config_path": "/path/to/config.yaml",
"question": "Create a todo list",
"video_path": "/path/to/workspace/massgen_terminal.mp4",
"video_format": "mp4",
"video_size_bytes": 2458624,
"recording_duration_seconds": 45.3,
"massgen_timeout_seconds": 300,
"terminal_dimensions": {"width": 1200, "height": 800},
"evaluation": {
"success": true,
"num_frames_extracted": 12,
"prompt": "Evaluate the terminal display quality...",
"response": "The terminal display demonstrates excellent clarity..."
}
}
Advanced Usage#
Recording as GIF for Documentation#
GIFs are ideal for embedding in documentation and case studies:
massgen --config terminal_eval.yaml \
"Record the todo example as a GIF with focus on agent coordination"
In your agent’s system message, guide it to use GIF format:
When recording for documentation, use output_format="gif" and num_frames=8
for faster processing and smaller file sizes.
Batch Evaluation of Multiple Configs#
You can create an agent that systematically evaluates multiple configs:
orchestrator:
context_paths:
- path: "massgen/configs/tools/"
permission: "read"
agents:
- id: "batch_evaluator"
system_message: |
Evaluate all configs in massgen/configs/tools/ directory.
For each config:
1. Record a simple test question
2. Analyze the terminal display
3. Compile a comparative report
Integration with Case Studies#
The tool automatically saves videos to the workspace for case study reuse:
# Videos are saved as: workspace/massgen_terminal.{format}
# Reference them in case studies:
## Demo Video
Here's a recording of MassGen solving the task:

**Evaluation:** The terminal display effectively shows agent collaboration
with clear status indicators and smooth coordination visualization.
Evaluation Criteria#
The default evaluation prompt assesses:
Visual Clarity and Readability
Font rendering and contrast
Color scheme effectiveness
ANSI escape code handling
Text layout and spacing
Information Density and Organization
Multi-column layout for parallel agents
Content aggregation and streaming display
Log message formatting
Scroll handling for long outputs
Status Indicator Effectiveness
Agent states (streaming, answered, voted, completed)
Progress tracking visibility
Coordination phase transitions
Winner selection clarity
Overall User Experience
Real-time feedback quality
Mental model alignment (does display match user expectations?)
Error visibility and handling
Cognitive load and information hierarchy
Troubleshooting#
VHS Not Found Error
{
"success": false,
"error": "VHS is not installed. Please install it from https://github.com/charmbracelet/vhs"
}
Solution: Install VHS:
brew install vhs # macOS
go install github.com/charmbracelet/vhs@latest # Linux/Windows
Video File Not Created
If VHS completes but no video file is created:
Check VHS stderr output in the error response
Verify terminal dimensions are reasonable (width: 800-1920, height: 600-1080)
Ensure sufficient disk space for video recording
Try a shorter timeout (simpler task)
Recording Timeout
{
"success": false,
"error": "VHS recording timed out after 330 seconds"
}
Solution: Increase timeout for complex tasks:
run_massgen_with_recording(
config_path="complex_config.yaml",
question="Complex question",
timeout_seconds=600 # 10 minutes
)
OpenCV Import Error
pip install opencv-python
Best Practices#
Use Appropriate Timeouts
Simple tasks: 60-120 seconds
Medium tasks: 120-300 seconds
Complex tasks: 300-600 seconds
Optimize Frame Count
Quick evaluation: 4-8 frames
Standard evaluation: 8-12 frames
Detailed analysis: 12-16 frames
Choose Right Format
Case studies: MP4 (best quality)
Documentation: GIF (easy embedding)
Web publishing: WebM (modern, efficient)
Customize Evaluation Prompts
Focus on specific aspects you’re testing:
“Evaluate the multi-agent coordination display”
“Assess readability for color-blind users”
“Analyze information hierarchy and visual flow”
Save Videos for Reference
Videos are automatically saved to workspace - commit them to git for:
Regression testing (compare old vs new displays)
Documentation and tutorials
Case study demonstrations
User research artifacts
Example Workflow: UI Iteration#
Step 1: Baseline Evaluation
massgen --config terminal_eval.yaml \
"Record baseline terminal display for simple_two_agents config"
Step 2: Make Display Changes
Edit massgen/frontend/displays/terminal_display.py to improve UX.
Step 3: Re-evaluate
massgen --config terminal_eval.yaml \
"Record updated terminal display for simple_two_agents config"
Step 4: Compare
The agent can compare both evaluations and highlight improvements/regressions.
See Also#
Multimodal Capabilities - Other multimodal tools (understand_video, understand_image, understand_audio)
Custom Tools - Creating custom tools for your domain
LLM Agent Automation Guide - Running MassGen in automation mode for backend testing
Writing Configuration Files - Best practices for config development
Case Study Template:
docs/case_studies/case-study-template.md