Terminal Evaluation

MassGen can evaluate its own terminal display and frontend user experience by recording terminal sessions as videos and analyzing them using AI vision models. This is useful for:

Frontend development: Evaluate UI/UX changes to the terminal display
Quality assurance: Verify that status indicators, coordination displays, and agent outputs are clear
Case study creation: Record demos and automatically generate video content
User testing: Analyze how well the terminal communicates agent progress and results

Note

Quick Setup Summary:

Install VHS terminal recorder: brew install vhs (macOS) or go install github.com/charmbracelet/vhs@latest
Ensure an OpenAI API key is configured in .env
Use the run_massgen_with_recording tool in your config
The agent records, analyzes, and provides UX feedback automatically

Quick Start: Try It Now

MassGen includes a working example you can try immediately:

# Evaluate the terminal display for a simple task
massgen \
  --config massgen/configs/tools/custom_tools/terminal_evaluation.yaml \
  "Record and evaluate the terminal display for the todo example config"

The agent will:

Record a MassGen session running the todo example
Save the recording as an MP4 video in the workspace
Extract key frames and analyze them with GPT-4.1
Provide detailed feedback on terminal display quality

How It Works

The run_massgen_with_recording tool follows this workflow:

Create VHS Tape: Generates a VHS script to record the terminal session
Run MassGen: Executes MassGen without the --automation flag (so the rich terminal display is captured)
Record Video: VHS records the terminal session as MP4/GIF/WebM
Extract Frames: Extracts key frames from the video (default: 12 frames)
Analyze Display: Uses the understand_video tool to evaluate UX quality
Return Feedback: Provides structured evaluation with recommendations

The tool automatically saves videos to the agent workspace for reuse in case studies.
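The first and third steps of the workflow can be pictured with a small sketch that writes a VHS tape by hand. This is an illustration under assumptions, not the tool's actual implementation: build_tape is a hypothetical helper, and the fixed Sleep is the simplest way to keep recording (the real tool may stop as soon as MassGen finishes). Output, Set Width, Set Height, Type, Enter, and Sleep are standard VHS tape commands.

```python
import shlex


def build_tape(config_path: str, question: str, output: str = "massgen_terminal.mp4",
               width: int = 1200, height: int = 800, wait_seconds: int = 300) -> str:
    """Return the text of a VHS tape that runs MassGen and records the session.

    Hypothetical helper for illustration; the real tool's tape may differ.
    """
    # Quote arguments so the question reaches MassGen as a single argument.
    command = shlex.join(["massgen", "--config", config_path, question])
    # VHS string literals can be delimited by ", ' or `; pick one the command lacks.
    delimiter = next((d for d in ('"', "'", "`") if d not in command), None)
    if delimiter is None:
        raise ValueError("command contains every VHS string delimiter")
    lines = [
        f"Output {output}",
        f"Set Width {width}",
        f"Set Height {height}",
        f"Type {delimiter}{command}{delimiter}",
        "Enter",
        f"Sleep {wait_seconds}s",  # assumption: record for a fixed duration
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_tape("todo.yaml", "Create a todo list app"))
```

Saving the returned text to a file such as demo.tape and running vhs demo.tape would produce the recording named on the Output line.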

Prerequisites

1. Install VHS Terminal Recorder

VHS (by Charm) is required to record terminal sessions:

# macOS
brew install vhs

# Linux/Windows (requires Go)
go install github.com/charmbracelet/vhs@latest

Verify installation:

vhs --version

2. OpenAI API Key

The tool uses GPT-4.1 for video analysis. Ensure your .env file contains:

OPENAI_API_KEY=sk-...

3. Dependencies

The understand_video tool requires opencv-python:

pip install opencv-python
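A quick way to confirm all three prerequisites at once is a short preflight script. The sketch below is a hypothetical helper, not part of MassGen. It only inspects the current process environment, so a key that lives solely in .env will be reported as missing unless it has also been exported.

```python
import importlib.util
import os
import shutil


def check_prerequisites(env=None) -> dict:
    """Report which of the three prerequisites appear to be satisfied."""
    env = os.environ if env is None else env
    return {
        # VHS binary must be discoverable on PATH.
        "vhs": shutil.which("vhs") is not None,
        # Only checks that the variable is set and non-empty, not that it is valid.
        "openai_api_key": bool(env.get("OPENAI_API_KEY")),
        # opencv-python is imported under the module name cv2.
        "opencv": importlib.util.find_spec("cv2") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```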

Basic Usage

Example 1: Evaluate a Simple Config

# terminal_eval_basic.yaml
agents:
  - id: "evaluator"
    backend:
      type: "openai"
      model: "gpt-5-mini"
      cwd: "workspace"
      enable_mcp_command_line: true  # Required for VHS
      custom_tools:
        - name: ["run_massgen_with_recording"]
          category: "terminal_evaluation"
          path: "massgen/tool/_multimodal_tools/run_massgen_with_recording.py"
          function: ["run_massgen_with_recording"]
    system_message: |
      You can record and evaluate MassGen terminal displays.
      Use run_massgen_with_recording to test configs and provide UX feedback.

orchestrator:
  context_paths:
    - path: "massgen/configs/simple_two_agents.yaml"
      permission: "read"

ui:
  display_type: "rich_terminal"
  logging_enabled: true

Run with:

massgen --config terminal_eval_basic.yaml "Evaluate the simple two agents config"

Example 2: Custom Evaluation Criteria

You can customize the evaluation prompt to focus on specific aspects:

# In the agent's prompt or directly in tool call
run_massgen_with_recording(
    config_path="my_config.yaml",
    question="Create a todo list app",
    evaluation_prompt="""
    Focus on the coordination display. Evaluate:
    1. How clearly does it show agent collaboration?
    2. Are status transitions (streaming → answered → voted) clear?
    3. Is the winner selection process visible?
    4. What improvements would enhance multi-agent visualization?
    """
)

Tool Parameters

async def run_massgen_with_recording(
    config_path: str,
    question: str,
    evaluation_prompt: str = "Evaluate the terminal display quality...",
    output_format: str = "mp4",
    num_frames: int = 12,
    timeout_seconds: int = 300,
    width: int = 1200,
    height: int = 800,
    allowed_paths: Optional[List[str]] = None,
    agent_cwd: Optional[str] = None,
) -> ExecutionResult

Parameters:

config_path (str, required): Path to MassGen config file (YAML)
- Relative paths are resolved against the agent workspace
- Absolute paths must be within allowed directories
question (str, required): Question to pass to MassGen
evaluation_prompt (str): Prompt for evaluating terminal display
- Default: Comprehensive UX evaluation (clarity, information density, status indicators, user experience)
- Customize to focus on specific aspects (coordination, readability, etc.)
output_format (str): Video format - "mp4" (default), "gif", or "webm"
- MP4: Best quality, suitable for case studies
- GIF: Smaller file size, easier to embed in docs
- WebM: Modern web format with good compression
num_frames (int): Number of frames to extract for analysis (default: 12)
- Higher values (16+) provide more detail but increase API costs
- Lower values (4-8) are faster and cheaper but may miss details
- Recommended: 8-16 frames for most evaluations
timeout_seconds (int): Maximum time to wait for MassGen completion (default: 300)
- Adjust based on task complexity
- Longer tasks need higher timeouts
- VHS will wait this long before stopping recording
width (int): Terminal width in pixels (default: 1200)
height (int): Terminal height in pixels (default: 800)
- Adjust for your preferred terminal dimensions
- Larger dimensions capture more detail but increase file size
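To make the num_frames trade-off concrete, here is one plausible way to choose which frames to analyze: spread the requested number evenly across the recording. This is a sketch under assumptions. The selection strategy actually used by understand_video is not described here, evenly_spaced_frames is a hypothetical name, and reading the frames with OpenCV is left out so the snippet runs without a video file.

```python
def evenly_spaced_frames(total_frames: int, num_frames: int = 12) -> list[int]:
    """Pick up to num_frames frame indices spread across the whole video."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Short video: every frame is selected.
        return list(range(total_frames))
    # Take the midpoint of each of num_frames equal-length segments.
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]


if __name__ == "__main__":
    # A 1200-frame recording sampled with the default of 12 frames.
    print(evenly_spaced_frames(1200))
```

With this scheme, doubling num_frames halves the gap between sampled moments, which is why higher values catch more short-lived status transitions at a higher API cost.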

Returns:

{
  "success": true,
  "operation": "run_massgen_with_recording",
  "config_path": "/path/to/config.yaml",
  "question": "Create a todo list",
  "video_path": "/path/to/workspace/massgen_terminal.mp4",
  "video_format": "mp4",
  "video_size_bytes": 2458624,
  "recording_duration_seconds": 45.3,
  "massgen_timeout_seconds": 300,
  "terminal_dimensions": {"width": 1200, "height": 800},
  "evaluation": {
    "success": true,
    "num_frames_extracted": 12,
    "prompt": "Evaluate the terminal display quality...",
    "response": "The terminal display demonstrates excellent clarity..."
  }
}
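If the result is consumed by a script rather than read by the agent, the fields above can be condensed into a short report. The helper below is a hypothetical sketch that relies only on the keys shown in the example payload, plus the error key that appears in the failure responses under Troubleshooting.

```python
def summarize_result(result: dict) -> str:
    """Build a short human-readable summary from the tool's result payload."""
    if not result.get("success"):
        return f"Recording failed: {result.get('error', 'unknown error')}"
    size_mb = result["video_size_bytes"] / (1024 * 1024)
    lines = [
        f"Video: {result['video_path']} ({size_mb:.1f} MB, "
        f"{result['recording_duration_seconds']:.1f}s)",
    ]
    evaluation = result.get("evaluation", {})
    if evaluation.get("success"):
        lines.append(f"Frames analyzed: {evaluation['num_frames_extracted']}")
        lines.append(f"Feedback: {evaluation['response']}")
    else:
        # The recording can succeed even when the analysis step does not.
        lines.append("Evaluation missing or failed; review the video manually.")
    return "\n".join(lines)
```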

Advanced Usage

Recording as GIF for Documentation

GIFs are ideal for embedding in documentation and case studies:

massgen --config terminal_eval.yaml \
  "Record the todo example as a GIF with focus on agent coordination"

In your agent’s system message, guide it to use GIF format:

When recording for documentation, use output_format="gif" and num_frames=8
for faster processing and smaller file sizes.

Batch Evaluation of Multiple Configs

You can create an agent that systematically evaluates multiple configs:

orchestrator:
  context_paths:
    - path: "massgen/configs/tools/"
      permission: "read"

agents:
  - id: "batch_evaluator"
    system_message: |
      Evaluate all configs in massgen/configs/tools/ directory.
      For each config:
      1. Record a simple test question
      2. Analyze the terminal display
      3. Compile a comparative report
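If you would rather hand the agent an explicit list of configs than let it explore the directory, a small script can collect the files and compose the task prompt. Both functions below are hypothetical helpers, and the prompt wording is only an example.

```python
from pathlib import Path


def find_configs(directory: str) -> list[Path]:
    """Return all YAML config files under directory, sorted for a stable order."""
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in {".yaml", ".yml"}
    )


def build_batch_prompt(configs: list[Path], question: str) -> str:
    """Compose one task prompt asking the agent to record every config in turn."""
    listing = "\n".join(f"- {path.as_posix()}" for path in configs)
    return (
        "For each config below, record the terminal display while it answers "
        f"'{question}', then compile a comparative report:\n{listing}"
    )


if __name__ == "__main__":
    configs = find_configs("massgen/configs/tools")
    print(build_batch_prompt(configs, "What is 2+2?"))
```

The printed prompt can then be passed as the question argument to massgen alongside the batch evaluator config.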

Integration with Case Studies

The tool automatically saves videos to the workspace for case study reuse:

# Videos are saved as: workspace/massgen_terminal.{format}
# Reference them in case studies:

## Demo Video

Here's a recording of MassGen solving the task:

![Terminal Demo](workspace/massgen_terminal.gif)

**Evaluation:** The terminal display effectively shows agent collaboration
with clear status indicators and smooth coordination visualization.

Evaluation Criteria

The default evaluation prompt assesses:

Visual Clarity and Readability
- Font rendering and contrast
- Color scheme effectiveness
- ANSI escape code handling
- Text layout and spacing
Information Density and Organization
- Multi-column layout for parallel agents
- Content aggregation and streaming display
- Log message formatting
- Scroll handling for long outputs
Status Indicator Effectiveness
- Agent states (streaming, answered, voted, completed)
- Progress tracking visibility
- Coordination phase transitions
- Winner selection clarity
Overall User Experience
- Real-time feedback quality
- Mental model alignment (does display match user expectations?)
- Error visibility and handling
- Cognitive load and information hierarchy

Troubleshooting

VHS Not Found Error

{
  "success": false,
  "error": "VHS is not installed. Please install it from https://github.com/charmbracelet/vhs"
}

Solution: Install VHS:

brew install vhs  # macOS
go install github.com/charmbracelet/vhs@latest  # Linux/Windows

Video File Not Created

If VHS completes but no video file is created:

Check VHS stderr output in the error response
Verify terminal dimensions are reasonable (width: 800-1920, height: 600-1080)
Ensure sufficient disk space for video recording
Try a simpler task with a shorter timeout
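The dimension and disk-space checks in this list can be scripted before starting a recording. The sketch below is a hypothetical helper. The width and height ranges come from the list above, while the 100 MB free-space threshold is an arbitrary assumption.

```python
import shutil


def check_recording_settings(width: int, height: int, workspace: str = ".",
                             min_free_bytes: int = 100 * 1024 * 1024) -> list[str]:
    """Return a list of warnings; an empty list means nothing looks wrong."""
    warnings = []
    if not 800 <= width <= 1920:
        warnings.append(f"width {width} is outside the suggested 800-1920 range")
    if not 600 <= height <= 1080:
        warnings.append(f"height {height} is outside the suggested 600-1080 range")
    # Free space on the filesystem that holds the workspace directory.
    free = shutil.disk_usage(workspace).free
    if free < min_free_bytes:
        warnings.append(f"only {free // (1024 * 1024)} MB free in {workspace!r}")
    return warnings


if __name__ == "__main__":
    for warning in check_recording_settings(1200, 800):
        print("warning:", warning)
```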

Recording Timeout

{
  "success": false,
  "error": "VHS recording timed out after 330 seconds"
}

Solution: Increase timeout for complex tasks:

run_massgen_with_recording(
    config_path="complex_config.yaml",
    question="Complex question",
    timeout_seconds=600  # 10 minutes
)

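OpenCV Import Error

If the understand_video tool fails because the cv2 module cannot be imported, install opencv-python:

pip install opencv-python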

Best Practices

Use Appropriate Timeouts
- Simple tasks: 60-120 seconds
- Medium tasks: 120-300 seconds
- Complex tasks: 300-600 seconds
Optimize Frame Count
- Quick evaluation: 4-8 frames
- Standard evaluation: 8-12 frames
- Detailed analysis: 12-16 frames
Choose the Right Format
- Case studies: MP4 (best quality)
- Documentation: GIF (easy embedding)
- Web publishing: WebM (modern, efficient)
Customize Evaluation Prompts

Focus on specific aspects you’re testing:
- “Evaluate the multi-agent coordination display”
- “Assess readability for color-blind users”
- “Analyze information hierarchy and visual flow”
Save Videos for Reference

Videos are saved to the workspace automatically. Commit them to git for:
- Regression testing (compare old vs new displays)
- Documentation and tutorials
- Case study demonstrations
- User research artifacts

Example Workflow: UI Iteration

Step 1: Baseline Evaluation

massgen --config terminal_eval.yaml \
  "Record baseline terminal display for simple_two_agents config"

Step 2: Make Display Changes

Edit massgen/frontend/displays/terminal_display.py to improve UX.

Step 3: Re-evaluate

massgen --config terminal_eval.yaml \
  "Record updated terminal display for simple_two_agents config"

Step 4: Compare

The agent can compare both evaluations and highlight improvements/regressions.
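For a quick manual check outside the agent, the two evaluation responses can also be diffed as plain text. This is a rough sketch, assuming you have saved the response field from each run. diff_evaluations is a hypothetical helper, and because model output varies in wording between runs, a textual diff will be noisy and is no substitute for the agent's own comparison.

```python
import difflib


def diff_evaluations(baseline: str, updated: str) -> str:
    """Return a unified diff between two evaluation responses, line by line."""
    diff = difflib.unified_diff(
        baseline.splitlines(),
        updated.splitlines(),
        fromfile="baseline",
        tofile="updated",
        lineterm="",
    )
    return "\n".join(diff)


if __name__ == "__main__":
    before = "Status indicators are unclear.\nColors are fine."
    after = "Status indicators are clear.\nColors are fine."
    print(diff_evaluations(before, after))
```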