Computer Use Tools

Computer Use Tools#

MassGen provides powerful computer use tools that allow AI agents to autonomously control browsers and desktop environments. These tools enable agents to browse websites, interact with applications, execute commands, and complete complex multi-step workflows.

Note

Currently Available Tools:

gemini_computer_use - Google Gemini Computer Use (requires gemini-2.5-computer-use-preview-10-2025 model)
claude_computer_use - Anthropic Claude Computer Use (requires claude-sonnet-4-5 or newer)
browser_automation - Simple browser automation (works with ANY model: gpt-4.1, gpt-4o, etc.)
computer_use - OpenAI Computer Use (requires computer-use-preview model from OpenAI)
- WARNING: OpenAI Computer Use model has not gone through sophisticated testing due to access restrictions on computer-use-preview model. Performance is not guaranteed. Be cautious while using.
ui_tars_computer_use - UI-TARS Computer Use from ByteDance (open-sourced)

Environments:

We try to accommodate as many systems as we can, but practically, we observe that computer use models tend to work best when they start on a browser or linux docker. Hence, we have two recommended environments:

browser - Launch computer use agents in a browser, suitable for web tasks.
linux docker - Launch computer use agents in a Docker container, suitable for all web and desktop tasks.

Automatic Docker Setup: MassGen will automatically create and configure the Docker container on first run when using a Docker-based computer use config. No manual setup required! The container includes Ubuntu 22.04 with Xfce desktop, X11 virtual display, xdotool, Firefox, Chromium, and scrot.

See here for quick set-up guides for those two environments, and here for visualization guides.

Naming:

We name our configs in this convention: ${TOOL_NAME}_computer_use_${ENVIRONMENT}_example.yaml.

For example, if you would like to use Claude in linux docker environment, you should use the config massgen/configs/tools/custom_tools/claude_computer_use_docker_example.yaml.

If ${ENVIRONMENT} is not specified, we use browser as default value.

We welcome proposals of new tool and environment combinations!

Overview#

Computer use tools transform AI agents from text processors into active automation systems capable of:

Browser Automation - Navigate websites, fill forms, extract data, search for information
Desktop Control - Interact with applications, manage files, execute system commands
Visual Understanding - Take screenshots and use visual feedback to guide actions
Multi-Step Workflows - Chain together complex sequences of actions autonomously

Tool Comparison#

Feature	`computer_use`	`gemini_computer_use`	`claude_computer_use`	`browser_automation`
Model Support	`computer-use-preview` only	`gemini-2.5-computer-use-preview` only	`claude-sonnet-4-5` or newer	Any model
Provider	OpenAI	Google	Anthropic	Any
Environments	Browser, Linux/Docker, Mac, Windows	Browser, Linux/Docker	Browser, Linux/Docker	Browser only
Action Planning	Autonomous multi-step	Autonomous multi-step	Autonomous multi-step	User-directed
Complexity	High (full agentic)	High (full agentic)	High (full agentic)	Low (simple)
Safety Checks	Built-in	Built-in + confirmations	Built-in	Manual
Performance	Fast (~1-2 sec/action)	Fast (~1-2 sec/action)	Thorough (~2-5 sec/action)	Very Fast (~1 sec)
Best Use Case	Complex workflows (OpenAI)	Complex workflows (Google)	Precision tasks (Anthropic)	Simple automation

Quick Start#

1. Simple Browser Automation (Works with Any Model)

# Install dependencies
pip install playwright
playwright install

# Run with gpt-4.1 or any other model
uv run massgen \
  --config massgen/configs/tools/custom_tools/simple_browser_automation_example.yaml \
  "Go to Wikipedia and search for Jimmy Carter"

2. Gemini Computer Use

Browser automation:

# Set API key
export GEMINI_API_KEY="your-api-key"

# Run Gemini browser automation
uv run massgen \
  --config massgen/configs/tools/custom_tools/gemini_computer_use_example.yaml \
  "Go to cnn.com and get the top headline"

Docker/Linux desktop automation:

# Set API key
export GEMINI_API_KEY="your-api-key"

# Run Gemini desktop automation
# Docker container is automatically created on first run!
uv run massgen \
  --config massgen/configs/tools/custom_tools/gemini_computer_use_docker_example.yaml \
  "Open Firefox and search for Python documentation"

3. Claude Computer Use (Docker/Linux)

# Set API key
export ANTHROPIC_API_KEY="your-api-key"

# Run Claude desktop automation
# Docker container is automatically created on first run!
uv run massgen \
  --config massgen/configs/tools/custom_tools/claude_computer_use_docker_example.yaml \
  "Navigate to Wikipedia and search for Artificial Intelligence"

Detailed Tool Guides#

1. Gemini Computer Use#

Description: Full implementation of Google’s Gemini 2.5 Computer Use API with native computer control capabilities and built-in safety checks.

Model Requirement:

MUST use gemini-2.5-computer-use-preview-10-2025 model
Will NOT work with other Gemini models

Example Configuration (Browser)

agents:
  - id: "gemini_automation_agent"
    backend:
      type: "google"
      model: "gemini-2.5-computer-use-preview-10-2025"  # Required!
      custom_tools:
        - name: ["gemini_computer_use"]
          path: "massgen/tool/_gemini_computer_use/gemini_computer_use_tool.py"
          function: ["gemini_computer_use"]
          preset_args:
            environment: "browser"
            display_width: 1440  # Recommended by Gemini
            display_height: 900  # Recommended by Gemini
            environment_config:
              headless: false  # Set to true for headless
              browser_type: "chromium"

ui:
  display_type: "rich_terminal"

Supported Actions:

open_web_browser - Open browser
click_at - Click at coordinates (normalized 0-1000)
hover_at - Hover at coordinates
type_text_at - Type text at coordinates
key_combination - Press key combinations
scroll_document - Scroll entire page
scroll_at - Scroll specific area
navigate - Go to URL
go_back / go_forward - Browser navigation
search - Go to search engine
wait_5_seconds - Wait for content
drag_and_drop - Drag elements

Safety Features:

Built-in safety system checks all actions
require_confirmation - User must approve risky actions
Automatically handles safety acknowledgements
All actions logged for auditing

Use Cases:

Complex multi-step browser workflows
Research and information gathering
E-commerce product research
Form filling with validation
Web scraping with navigation
Automated testing

Supported Environments:

Browser - Playwright-based web automation (Chromium recommended)
Linux/Docker - Desktop automation in Docker container (xdotool)

Example Docker Configuration:

agents:
  - id: "gemini_desktop_agent"
    backend:
      type: "openai"  # Orchestration backend
      model: "gpt-4.1"
      custom_tools:
        - name: ["gemini_computer_use"]
          path: "massgen/tool/_gemini_computer_use/gemini_computer_use_tool.py"
          function: ["gemini_computer_use"]
          preset_args:
            environment: "linux"  # Use Docker
            display_width: 1024
            display_height: 768
            max_iterations: 30
            environment_config:
              container_name: "cua-container"
              display: ":99"

Prerequisites:

GEMINI_API_KEY environment variable
For browser: pip install playwright && playwright install
For Docker: Docker installed and running (container auto-created on first run)
pip install google-genai docker (included in requirements.txt)

2. Claude Computer Use#

Description: Full implementation of Anthropic’s Claude Computer Use API with enhanced actions and thorough execution capabilities.

Model Requirement:

Recommended: claude-sonnet-4-5 (latest with computer use)
Compatible with Claude models supporting computer use
Will NOT work with older Claude models

Example Configuration (Docker/Linux)

agents:
  - id: "claude_automation_agent"
    backend:
      type: "anthropic"
      model: "claude-sonnet-4-5"  # Recommended!
      custom_tools:
        - name: ["claude_computer_use"]
          path: "massgen/tool/_claude_computer_use/claude_computer_use_tool.py"
          function: ["claude_computer_use"]
          preset_args:
            environment: "linux"
            display_width: 1024
            display_height: 768
            max_iterations: 25
            environment_config:
              container_name: "cua-container"
              display: ":99"

Example Configuration (Browser)

agents:
  - id: "claude_browser_agent"
    backend:
      type: "anthropic"
      model: "claude-sonnet-4-5"
      custom_tools:
        - name: ["claude_computer_use"]
          path: "massgen/tool/_claude_computer_use/claude_computer_use_tool.py"
          function: ["claude_computer_use"]
          preset_args:
            environment: "browser"
            display_width: 1024
            display_height: 768
            max_iterations: 25
            headless: false  # Set to true for headless
            browser_type: "chromium"

Supported Actions:

Standard Actions:

screenshot - Capture current screen
mouse_move - Move mouse to coordinates
left_click / right_click / middle_click / double_click - Mouse control
left_click_drag - Click and drag
type - Type text
key - Press single key
scroll - Scroll up/down

Enhanced Actions (Claude-specific):

triple_click - Triple-click to select lines
left_mouse_down / left_mouse_up - Precise drag control
hold_key - Hold key while performing action
wait - Wait for specified duration

Text Editor Actions:

str_replace_based_edit_tool - File editing with find/replace
bash - Execute bash commands (if enabled)

Supported Environments:

Browser - Playwright-based web automation (Chromium)
Linux - Docker container with desktop (xdotool, similar to OpenAI implementation)

Performance Characteristics:

Thorough but slower: ~2-5 seconds per action (vs 1-2 sec for other tools)
High iteration count: Typically 25-40 iterations for simple web tasks
Recommended for: Complex tasks where thoroughness matters more than speed
Not recommended for: Simple tasks requiring quick execution

Example Performance:

Task: "Go to cnn.com and get the top headline"
- Claude Computer Use: 25-40 iterations, ~60-100 seconds
- Browser Automation: 2-3 actions, ~5-10 seconds

Choose based on task complexity vs speed requirements.

Headless Mode:

Automatically enforced on Linux servers without DISPLAY environment variable
Can be overridden for systems with X server
Check logs: “Forcing headless mode on Linux without X server”

Use Cases:

✅ Complex research requiring deep navigation
✅ Multi-step workflows with verification
✅ Tasks requiring precision and thoroughness
✅ When using Anthropic’s ecosystem
❌ Simple/quick automation tasks (use browser_automation instead)

Prerequisites:

ANTHROPIC_API_KEY environment variable
pip install playwright && playwright install
pip install anthropic (included in requirements.txt)
Python 3.8+

3. Browser Automation#

Description: Simple, direct browser automation tool using Playwright. User explicitly controls each action. Works with any LLM model.

Model Support:

✅ gpt-4.1
✅ gpt-4o
✅ Gemini
✅ Claude (with appropriate backend)
✅ Any other model

Example Configuration:

agents:
  - id: "browser_agent"
    backend:
      type: "openai"
      model: "gpt-4.1"  # Can be any model!
      custom_tools:
        - name: ["browser_automation"]
          path: "massgen/tool/_browser_automation/browser_automation_tool.py"
          function: ["browser_automation"]

ui:
  display_type: "rich_terminal"

Supported Actions:

navigate - Go to URL
click - Click element by CSS selector
type - Type text into element
extract - Extract text from elements
screenshot - Capture page image

Example Usage:

# Navigate to a page
await browser_automation(
    task="Open Wikipedia",
    url="https://en.wikipedia.org",
    action="navigate"
)

# Type in search box
await browser_automation(
    task="Search for Jimmy Carter",
    action="type",
    selector="input[name='search']",
    text="Jimmy Carter"
)

# Click search button
await browser_automation(
    task="Click search",
    action="click",
    selector="button[type='submit']"
)

# Extract results
await browser_automation(
    task="Get first paragraph",
    action="extract",
    selector="p.first-paragraph"
)

Use Cases:

Simple page navigation
Data extraction
Testing specific actions
Screenshot capture
Form interactions
When you need precise control
When specialized computer use models are not available

Decision Guide#

When to Use Each Tool#

Use computer_use when:

✅ You have access to computer-use-preview model (OpenAI)
✅ Task requires multiple autonomous steps
✅ Task is complex (e.g., “research topic and create report”)
✅ You want the model to plan its own actions
✅ You need Linux/Docker/OS-level automation
✅ You need fast execution (1-2 sec/action)

Use gemini_computer_use when:

✅ You have access to Gemini 2.5 Computer Use model (Google)
✅ You prefer Google’s AI models
✅ Task requires autonomous browser control
✅ You want built-in safety confirmations
✅ Task is complex and browser-based
✅ You need fast execution (1-2 sec/action)

Use claude_computer_use when:

✅ You have access to Claude Sonnet 4.5 or newer (Anthropic)
✅ You prefer Anthropic’s AI models
✅ Task requires thorough, careful execution
✅ Task is complex and multi-step
✅ Quality and precision matter more than speed
✅ You need enhanced actions (triple_click, mouse_down/up, hold_key)
⚠️ Accept ~2-5 sec/action and 25-40+ iterations

Use browser_automation when:

✅ You don’t have specialized computer use model access
✅ Using gpt-4.1, gpt-4o, or other standard models
✅ Task is simple and direct
✅ You want explicit control over each action
✅ You’re testing specific workflows
✅ You only need browser automation
✅ You need very fast execution (~1 sec/action)

Performance Quick Reference#

Tool	Speed/Action	Iterations (Simple Task)	Best For
`browser_automation`	~1 sec	2-5	Simple tasks, explicit control
`computer_use`	~1-2 sec	10-20	Complex OpenAI workflows
`gemini_computer_use`	~1-2 sec	10-20	Complex Google workflows
`claude_computer_use`	~2-5 sec	25-40	Thorough Anthropic workflows

Visualization and Monitoring#

Visualizing computer use agents helps you understand what they’re doing in real-time and debug issues.

VNC Viewer (Docker/Linux)#

For Claude Computer Use in Docker, you can watch the desktop in real-time using VNC.

Quick Setup:

# 1. Enable VNC on the Docker container
./scripts/enable_vnc_viewer.sh

# 2. Install a VNC viewer (one-time setup)
# Ubuntu/Debian:
sudo apt-get install tigervnc-viewer
# Or:
sudo snap install remmina

# 3. Connect to the container
# Get container IP:
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' cua-container
# Connect with: <container-ip>:5900

What You’ll See:

Real-time desktop with Xfce window manager
Mouse movements and clicks as the agent executes actions
Terminal windows opening for bash commands
Applications launching (Firefox, text editors, etc.)
File browser operations
All desktop interactions in real-time

Non-Headless Browser Mode#

For Gemini and Claude browser automation, watch the browser by disabling headless mode.

Update Configuration:

Use preset_args (not default_params):

# For Gemini Computer Use
custom_tools:
  - name: ["gemini_computer_use"]
    path: "massgen/tool/_gemini_computer_use/gemini_computer_use_tool.py"
    function: ["gemini_computer_use"]
    preset_args:
      environment: "browser"
      display_width: 1440
      display_height: 900
      environment_config:
        headless: false  # Set to false for visible browser
        browser_type: "chromium"

# For Claude Computer Use (browser mode)
custom_tools:
  - name: ["claude_computer_use"]
    path: "massgen/tool/_claude_computer_use/claude_computer_use_tool.py"
    function: ["claude_computer_use"]
    preset_args:
      environment: "browser"
      headless: false  # Set to false for visible browser

Running with Visible Browser:

Important

You must set the DISPLAY environment variable when running:

# Check your available displays
ls /tmp/.X11-unix/
# Shows: X0, X20, etc.

# Run MassGen with DISPLAY variable (example using :20)
DISPLAY=:20 uv run massgen --config gemini_computer_use_example.yaml

# For Claude browser
DISPLAY=:20 uv run massgen --config claude_computer_use_browser_example.yaml

What You’ll See:

Actual browser window opens on your desktop
For Claude: Browser opens with Google homepage loaded
For Gemini: Browser opens at specified URL or blank page
Pages loading and navigation
Form filling and clicking in real-time
Scrolling and text entry
Mouse movements and interactions

Requirements:

X11 display server running (check with echo $DISPLAY)
Desktop environment (GUI) or X server available
DISPLAY environment variable set (e.g., :0, :20)
Cannot run on headless servers without X forwarding or Xvfb

Using Xvfb (Virtual Display on Headless Servers):

# Install Xvfb
sudo apt-get install xvfb

# Start virtual display
Xvfb :20 -screen 0 1440x900x24 &

# Run with visible browser on virtual display
DISPLAY=:20 uv run massgen --config config.yaml

# To see it, use VNC or x11vnc
x11vnc -display :20 -forever -shared -rfbport 5900 -nopw &
vncviewer localhost:5900

Terminal Output Monitoring#

Real-time Logs:

# Watch MassGen logs in real-time
tail -f massgen_logs/log_*/agent_chat.log

# Watch tool execution
tail -f massgen_logs/log_*/tool_calls.log

Verbose Mode:

# Enable debug logging
export MASSGEN_LOG_LEVEL=DEBUG
uv run massgen --config config.yaml

Multi-Agent Computer Use#

You can combine multiple computer use tools in a single configuration for complex workflows.

Example: Claude (Desktop) + Gemini (Browser)

agents:
  # Agent 1: Claude Computer Use with Docker
  - id: "claude_desktop_agent"
    backend:
      type: "claude"
      model: "claude-sonnet-4-5"
      betas: ["computer-use-2025-01-24"]
      custom_tools:
        - name: ["claude_computer_use"]
          path: "massgen/tool/_claude_computer_use/claude_computer_use_tool.py"
          function: ["claude_computer_use"]
          preset_args:
            environment: "linux"
            display_width: 1024
            display_height: 768
            max_iterations: 30

    system_message: |
      You are a Linux desktop automation specialist.
      Your specialty: File operations, bash scripts, system-level tasks.

  # Agent 2: Gemini Computer Use with Browser
  - id: "gemini_browser_agent"
    backend:
      type: "openai"
      model: "gpt-4.1"
      custom_tools:
        - name: ["gemini_computer_use"]
          path: "massgen/tool/_gemini_computer_use/gemini_computer_use_tool.py"
          function: ["gemini_computer_use"]
          preset_args:
            environment: "browser"
            display_width: 1440
            display_height: 900
            environment_config:
              headless: false
              browser_type: "chromium"

    system_message: |
      You are a web research and browser automation specialist.
      Your specialty: Web browsing, data extraction, online research.

Example Use Cases:

“Search for the latest Python releases on the web, then create a summary document”
“Download a file from the web and process it with a bash script”
“Research information online and save it to a file on the desktop”

Troubleshooting#

Common Configuration Mistake#

Issue: Browser always runs in headless mode even with headless: false

Solution: MassGen’s custom tools use preset_args, NOT default_params:

# ❌ WRONG - Will not work
custom_tools:
  - name: ["gemini_computer_use"]
    default_params:
      environment_config:
        headless: false

# ✅ CORRECT - Use preset_args
custom_tools:
  - name: ["gemini_computer_use"]
    preset_args:
      environment: "browser"
      display_width: 1440
      display_height: 900
      environment_config:
        headless: false
        browser_type: "chromium"

VNC Issues#

# Check if VNC is running
docker exec cua-container ps aux | grep x11vnc

# Restart VNC
docker exec cua-container pkill x11vnc
./scripts/enable_vnc_viewer.sh

# Check firewall
sudo ufw allow 5900/tcp

Browser Not Showing#

# 1. Check DISPLAY variable is set
echo $DISPLAY
# Should show something like: :0 or :20

# 2. List available displays
ls /tmp/.X11-unix/
# Shows: X0, X20, etc.

# 3. Test with simple X app
DISPLAY=:20 xeyes  # Should open a window

# 4. If no DISPLAY, create virtual display
Xvfb :20 -screen 0 1440x900x24 &
export DISPLAY=:20

# 5. Verify config uses preset_args (not default_params)
grep -A5 "preset_args" your_config.yaml

# 6. Ensure headless: false in environment_config
grep "headless" your_config.yaml

Best Practices#

Development: Use VNC + non-headless browser for debugging
Testing: Use terminal logs with occasional screenshots
Production: Use headless mode with comprehensive logging
Demos: Record sessions with VNC/browser recording
Remote Work: Use X11 forwarding or VNC over SSH tunnel
Iteration Limits: Set appropriate max_iterations based on task complexity
Safety: Test actions in isolated environments before production use
Error Handling: Monitor logs for errors and adjust configurations

File Structure#

massgen/
├── tool/
│   ├── _computer_use/              # OpenAI CUA implementation
│   │   ├── __init__.py
│   │   ├── computer_use_tool.py    # Requires computer-use-preview
│   │   ├── README.md
│   │   └── QUICKSTART.md
│   │
│   ├── _gemini_computer_use/       # Google Gemini implementation
│   │   ├── __init__.py
│   │   └── gemini_computer_use_tool.py
│   │
│   ├── _claude_computer_use/       # Anthropic Claude implementation
│   │   ├── __init__.py
│   │   └── claude_computer_use_tool.py
│   │
│   └── _browser_automation/        # Simple browser tool
│       ├── __init__.py
│       └── browser_automation_tool.py
│
└── configs/tools/custom_tools/
    ├── gemini_computer_use_example.yaml
    ├── gemini_computer_use_docker_example.yaml
    ├── claude_computer_use_docker_example.yaml
    ├── claude_computer_use_browser_example.yaml
    ├── simple_browser_automation_example.yaml
    └── multi_agent_computer_use_example.yaml

Next Steps#

Related Guides:
- Custom Tools - Learn about creating custom tools
- Multimodal Capabilities - Multimodal capabilities
- MCP Integration - External tools via MCP
- YAML Configuration Reference - Complete YAML reference
Configuration Examples:
- Computer Use Examples
- massgen/backend/docs/COMPUTER_USE_TOOLS_GUIDE.md - Comprehensive implementation guide
- massgen/backend/docs/COMPUTER_USE_VISUALIZATION.md - Visualization guide
Setup Guides:
- scripts/computer_use_setup.md - Docker installation guide
- ./scripts/setup_docker_cua.sh - Manual Docker setup script (optional - auto-created on first run)
- ./scripts/enable_vnc_viewer.sh - VNC visualization setup

Computer Use Tools

Contents

Computer Use Tools#

Overview#

Tool Comparison#

Quick Start#

Detailed Tool Guides#

1. Gemini Computer Use#

2. Claude Computer Use#

3. Browser Automation#

Decision Guide#

When to Use Each Tool#

Performance Quick Reference#

Visualization and Monitoring#

VNC Viewer (Docker/Linux)#

Non-Headless Browser Mode#

Terminal Output Monitoring#

Multi-Agent Computer Use#

Troubleshooting#

Common Configuration Mistake#

VNC Issues#

Browser Not Showing#

Best Practices#

File Structure#

Next Steps#