Computer Use Tools#
MassGen provides powerful computer use tools that allow AI agents to autonomously control browsers and desktop environments. These tools enable agents to browse websites, interact with applications, execute commands, and complete complex multi-step workflows.
Note
Currently Available Tools:
gemini_computer_use - Google Gemini Computer Use (requires the gemini-2.5-computer-use-preview-10-2025 model)
claude_computer_use - Anthropic Claude Computer Use (requires claude-sonnet-4-5 or newer)
browser_automation - Simple browser automation (works with ANY model: gpt-4.1, gpt-4o, etc.)
computer_use - OpenAI Computer Use (requires the computer-use-preview model from OpenAI)
WARNING: The OpenAI Computer Use tool has not been thoroughly tested because access to the computer-use-preview model is restricted. Performance is not guaranteed; use it with caution.
ui_tars_computer_use - UI-TARS Computer Use from ByteDance (open-sourced)
Environments:
We try to accommodate as many systems as we can, but in practice computer use models tend to work best when they start in a browser or a Linux Docker container. Hence, we recommend two environments:
browser - Launch computer use agents in a browser, suitable for web tasks.
linux docker - Launch computer use agents in a Docker container, suitable for all web and desktop tasks.
Automatic Docker Setup: MassGen will automatically create and configure the Docker container on first run when using a Docker-based computer use config. No manual setup required! The container includes Ubuntu 22.04 with Xfce desktop, X11 virtual display, xdotool, Firefox, Chromium, and scrot.
See here for quick set-up guides for those two environments, and here for visualization guides.
Naming:
We name our configs in this convention: ${TOOL_NAME}_computer_use_${ENVIRONMENT}_example.yaml.
For example, to use Claude in the Linux Docker environment, use the config massgen/configs/tools/custom_tools/claude_computer_use_docker_example.yaml.
If ${ENVIRONMENT} is not specified, we use browser as default value.
We welcome proposals of new tool and environment combinations!
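The naming convention above can be sketched as a small helper. This is only an illustration: `config_path` and `CONFIG_DIR` are hypothetical names, not part of MassGen, and the helper simply drops the environment segment when none is given, which corresponds to the browser default.

```python
from typing import Optional

# Directory taken from the example path above.
CONFIG_DIR = "massgen/configs/tools/custom_tools"


def config_path(tool_name: str, environment: Optional[str] = None) -> str:
    """Build a config path following ${TOOL_NAME}_computer_use_${ENVIRONMENT}_example.yaml.

    When no environment is given, the segment is omitted (browser default).
    """
    parts = [tool_name, "computer_use"]
    if environment:
        parts.append(environment)
    parts.append("example.yaml")
    return f"{CONFIG_DIR}/{'_'.join(parts)}"


print(config_path("claude", "docker"))
print(config_path("gemini"))
```

Not every shipped config follows this pattern (simple_browser_automation_example.yaml, for instance), so treat the helper as a convenience for the computer use configs only.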
Overview#
Computer use tools transform AI agents from text processors into active automation systems capable of:
Browser Automation - Navigate websites, fill forms, extract data, search for information
Desktop Control - Interact with applications, manage files, execute system commands
Visual Understanding - Take screenshots and use visual feedback to guide actions
Multi-Step Workflows - Chain together complex sequences of actions autonomously
Tool Comparison#
| Feature | computer_use | gemini_computer_use | claude_computer_use | browser_automation |
|---|---|---|---|---|
| Model Support | computer-use-preview | gemini-2.5-computer-use-preview-10-2025 | claude-sonnet-4-5 or newer | Any model |
| Provider | OpenAI | Google | Anthropic | Any |
| Environments | Browser, Linux/Docker, Mac, Windows | Browser, Linux/Docker | Browser, Linux/Docker | Browser only |
| Action Planning | Autonomous multi-step | Autonomous multi-step | Autonomous multi-step | User-directed |
| Complexity | High (full agentic) | High (full agentic) | High (full agentic) | Low (simple) |
| Safety Checks | Built-in | Built-in + confirmations | Built-in | Manual |
| Performance | Fast (~1-2 sec/action) | Fast (~1-2 sec/action) | Thorough (~2-5 sec/action) | Very Fast (~1 sec) |
| Best Use Case | Complex workflows (OpenAI) | Complex workflows (Google) | Precision tasks (Anthropic) | Simple automation |
Quick Start#
1. Simple Browser Automation (Works with Any Model)
# Install dependencies
pip install playwright
playwright install
# Run with gpt-4.1 or any other model
uv run massgen \
--config massgen/configs/tools/custom_tools/simple_browser_automation_example.yaml \
"Go to Wikipedia and search for Jimmy Carter"
2. Gemini Computer Use
Browser automation:
# Set API key
export GEMINI_API_KEY="your-api-key"
# Run Gemini browser automation
uv run massgen \
--config massgen/configs/tools/custom_tools/gemini_computer_use_example.yaml \
"Go to cnn.com and get the top headline"
Docker/Linux desktop automation:
# Set API key
export GEMINI_API_KEY="your-api-key"
# Run Gemini desktop automation
# Docker container is automatically created on first run!
uv run massgen \
--config massgen/configs/tools/custom_tools/gemini_computer_use_docker_example.yaml \
"Open Firefox and search for Python documentation"
3. Claude Computer Use (Docker/Linux)
# Set API key
export ANTHROPIC_API_KEY="your-api-key"
# Run Claude desktop automation
# Docker container is automatically created on first run!
uv run massgen \
--config massgen/configs/tools/custom_tools/claude_computer_use_docker_example.yaml \
"Navigate to Wikipedia and search for Artificial Intelligence"
Detailed Tool Guides#
1. Gemini Computer Use#
Description: Full implementation of Google’s Gemini 2.5 Computer Use API with native computer control capabilities and built-in safety checks.
Model Requirement:
MUST use the gemini-2.5-computer-use-preview-10-2025 model
Will NOT work with other Gemini models
Example Configuration (Browser)
agents:
  - id: "gemini_automation_agent"
    backend:
      type: "google"
      model: "gemini-2.5-computer-use-preview-10-2025"  # Required!
      custom_tools:
        - name: ["gemini_computer_use"]
          path: "massgen/tool/_gemini_computer_use/gemini_computer_use_tool.py"
          function: ["gemini_computer_use"]
          preset_args:
            environment: "browser"
            display_width: 1440  # Recommended by Gemini
            display_height: 900  # Recommended by Gemini
            environment_config:
              headless: false  # Set to true for headless
              browser_type: "chromium"
ui:
  display_type: "rich_terminal"
Supported Actions:
open_web_browser - Open browser
click_at - Click at coordinates (normalized 0-1000)
hover_at - Hover at coordinates
type_text_at - Type text at coordinates
key_combination - Press key combinations
scroll_document - Scroll entire page
scroll_at - Scroll specific area
navigate - Go to URL
go_back / go_forward - Browser navigation
search - Go to search engine
wait_5_seconds - Wait for content
drag_and_drop - Drag elements
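Because coordinate actions such as click_at use a normalized 0-1000 grid, they have to be mapped to pixels before they can be executed on a display of a given size. The following is a minimal sketch of that mapping; the `denormalize` helper, its integer rounding, and the clamping to the last pixel are assumptions for illustration, not the tool's confirmed behavior.

```python
from typing import Tuple


def denormalize(x: int, y: int, width: int, height: int) -> Tuple[int, int]:
    """Map a point on the 0-1000 grid to pixel coordinates on a width x height display."""
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        raise ValueError("normalized coordinates must be within 0-1000")
    # Integer arithmetic avoids float rounding; clamp so 1000 maps to the last pixel.
    px = min(x * width // 1000, width - 1)
    py = min(y * height // 1000, height - 1)
    return px, py


# Centre of the recommended 1440x900 display.
print(denormalize(500, 500, 1440, 900))
```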
Safety Features:
Built-in safety system checks all actions
require_confirmation - User must approve risky actions
Automatically handles safety acknowledgements
All actions logged for auditing
Use Cases:
Complex multi-step browser workflows
Research and information gathering
E-commerce product research
Form filling with validation
Web scraping with navigation
Automated testing
Supported Environments:
Browser - Playwright-based web automation (Chromium recommended)
Linux/Docker - Desktop automation in Docker container (xdotool)
Example Docker Configuration:
agents:
  - id: "gemini_desktop_agent"
    backend:
      type: "openai"  # Orchestration backend
      model: "gpt-4.1"
      custom_tools:
        - name: ["gemini_computer_use"]
          path: "massgen/tool/_gemini_computer_use/gemini_computer_use_tool.py"
          function: ["gemini_computer_use"]
          preset_args:
            environment: "linux"  # Use Docker
            display_width: 1024
            display_height: 768
            max_iterations: 30
            environment_config:
              container_name: "cua-container"
              display: ":99"
Prerequisites:
GEMINI_API_KEY environment variable
For browser: pip install playwright && playwright install
For Docker: Docker installed and running (container auto-created on first run)
pip install google-genai docker (included in requirements.txt)
2. Claude Computer Use#
Description: Full implementation of Anthropic’s Claude Computer Use API with enhanced actions and thorough execution capabilities.
Model Requirement:
Recommended: claude-sonnet-4-5 (latest with computer use)
Compatible with Claude models supporting computer use
Will NOT work with older Claude models
Example Configuration (Docker/Linux)
agents:
  - id: "claude_automation_agent"
    backend:
      type: "anthropic"
      model: "claude-sonnet-4-5"  # Recommended!
      custom_tools:
        - name: ["claude_computer_use"]
          path: "massgen/tool/_claude_computer_use/claude_computer_use_tool.py"
          function: ["claude_computer_use"]
          preset_args:
            environment: "linux"
            display_width: 1024
            display_height: 768
            max_iterations: 25
            environment_config:
              container_name: "cua-container"
              display: ":99"
Example Configuration (Browser)
agents:
  - id: "claude_browser_agent"
    backend:
      type: "anthropic"
      model: "claude-sonnet-4-5"
      custom_tools:
        - name: ["claude_computer_use"]
          path: "massgen/tool/_claude_computer_use/claude_computer_use_tool.py"
          function: ["claude_computer_use"]
          preset_args:
            environment: "browser"
            display_width: 1024
            display_height: 768
            max_iterations: 25
            headless: false  # Set to true for headless
            browser_type: "chromium"
Supported Actions:
Standard Actions:
screenshot - Capture current screen
mouse_move - Move mouse to coordinates
left_click / right_click / middle_click / double_click - Mouse control
left_click_drag - Click and drag
type - Type text
key - Press single key
scroll - Scroll up/down
Enhanced Actions (Claude-specific):
triple_click - Triple-click to select lines
left_mouse_down / left_mouse_up - Precise drag control
hold_key - Hold key while performing action
wait - Wait for specified duration
Text Editor Actions:
str_replace_based_edit_tool - File editing with find/replace
bash - Execute bash commands (if enabled)
Supported Environments:
Browser - Playwright-based web automation (Chromium)
Linux - Docker container with desktop (xdotool, similar to OpenAI implementation)
Performance Characteristics:
Thorough but slower: ~2-5 seconds per action (vs 1-2 sec for other tools)
High iteration count: Typically 25-40 iterations for simple web tasks
Recommended for: Complex tasks where thoroughness matters more than speed
Not recommended for: Simple tasks requiring quick execution
Example Performance:
Task: "Go to cnn.com and get the top headline"
- Claude Computer Use: 25-40 iterations, ~60-100 seconds
- Browser Automation: 2-3 actions, ~5-10 seconds
Choose based on task complexity vs speed requirements.
Headless Mode:
Automatically enforced on Linux servers without DISPLAY environment variable
Can be overridden for systems with X server
Check logs: “Forcing headless mode on Linux without X server”
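The headless rule described above can be sketched as follows. This is an illustration of the stated behavior (force headless on Linux when DISPLAY is unset, otherwise honor the configured value), not the tool's actual code; `resolve_headless` and its parameters are hypothetical names.

```python
import os
import sys
from typing import Mapping, Optional


def resolve_headless(
    requested: bool,
    platform: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return the effective headless setting for a browser launch."""
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    # Linux without an X display cannot show a window, so headless is forced.
    if platform.startswith("linux") and not env.get("DISPLAY"):
        return True
    # Otherwise the configured value is used as-is.
    return requested


print(resolve_headless(False, platform="linux", env={}))
print(resolve_headless(False, platform="linux", env={"DISPLAY": ":20"}))
```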
Use Cases:
✅ Complex research requiring deep navigation
✅ Multi-step workflows with verification
✅ Tasks requiring precision and thoroughness
✅ When using Anthropic’s ecosystem
❌ Simple/quick automation tasks (use browser_automation instead)
Prerequisites:
ANTHROPIC_API_KEY environment variable
pip install playwright && playwright install
pip install anthropic (included in requirements.txt)
Python 3.8+
3. Browser Automation#
Description: Simple, direct browser automation tool using Playwright. User explicitly controls each action. Works with any LLM model.
Model Support:
✅ gpt-4.1
✅ gpt-4o
✅ Gemini
✅ Claude (with appropriate backend)
✅ Any other model
Example Configuration:
agents:
  - id: "browser_agent"
    backend:
      type: "openai"
      model: "gpt-4.1"  # Can be any model!
      custom_tools:
        - name: ["browser_automation"]
          path: "massgen/tool/_browser_automation/browser_automation_tool.py"
          function: ["browser_automation"]
ui:
  display_type: "rich_terminal"
Supported Actions:
navigate - Go to URL
click - Click element by CSS selector
type - Type text into element
extract - Extract text from elements
screenshot - Capture page image
Example Usage:
# Navigate to a page
await browser_automation(
    task="Open Wikipedia",
    url="https://en.wikipedia.org",
    action="navigate",
)
# Type in search box
await browser_automation(
    task="Search for Jimmy Carter",
    action="type",
    selector="input[name='search']",
    text="Jimmy Carter",
)
# Click search button
await browser_automation(
    task="Click search",
    action="click",
    selector="button[type='submit']",
)
# Extract results
await browser_automation(
    task="Get first paragraph",
    action="extract",
    selector="p.first-paragraph",
)
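Each action in the calls above takes a different set of arguments, so it can help to check a call before sending it. The sketch below infers the required arguments from the usage example only; the `REQUIRED_ARGS` table and `missing_args` helper are hypothetical, and the assumption that screenshot needs no extra arguments is not confirmed by the source.

```python
from typing import Dict, List, Set

# Required arguments per action, inferred from the usage example (an assumption).
REQUIRED_ARGS: Dict[str, Set[str]] = {
    "navigate": {"url"},
    "click": {"selector"},
    "type": {"selector", "text"},
    "extract": {"selector"},
    "screenshot": set(),
}


def missing_args(action: str, **kwargs: object) -> List[str]:
    """Return the sorted names of required arguments that were not supplied."""
    if action not in REQUIRED_ARGS:
        raise ValueError(f"unknown action: {action!r}")
    supplied = {name for name, value in kwargs.items() if value is not None}
    return sorted(REQUIRED_ARGS[action] - supplied)


print(missing_args("type", selector="input[name='search']"))
print(missing_args("navigate", url="https://en.wikipedia.org"))
```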
Use Cases:
Simple page navigation
Data extraction
Testing specific actions
Screenshot capture
Form interactions
When you need precise control
When specialized computer use models are not available
Decision Guide#
When to Use Each Tool#
Use computer_use when:
✅ You have access to the computer-use-preview model (OpenAI)
✅ Task requires multiple autonomous steps
✅ Task is complex (e.g., “research topic and create report”)
✅ You want the model to plan its own actions
✅ You need Linux/Docker/OS-level automation
✅ You need fast execution (1-2 sec/action)
Use gemini_computer_use when:
✅ You have access to Gemini 2.5 Computer Use model (Google)
✅ You prefer Google’s AI models
✅ Task requires autonomous browser control
✅ You want built-in safety confirmations
✅ Task is complex and browser-based
✅ You need fast execution (1-2 sec/action)
Use claude_computer_use when:
✅ You have access to Claude Sonnet 4.5 or newer (Anthropic)
✅ You prefer Anthropic’s AI models
✅ Task requires thorough, careful execution
✅ Task is complex and multi-step
✅ Quality and precision matter more than speed
✅ You need enhanced actions (triple_click, mouse_down/up, hold_key)
⚠️ Accept ~2-5 sec/action and 25-40+ iterations
Use browser_automation when:
✅ You don’t have specialized computer use model access
✅ Using gpt-4.1, gpt-4o, or other standard models
✅ Task is simple and direct
✅ You want explicit control over each action
✅ You’re testing specific workflows
✅ You only need browser automation
✅ You need very fast execution (~1 sec/action)
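The model-access part of this decision guide can be condensed into a lookup. This is a rough sketch rather than a MassGen feature: `choose_tool` is a hypothetical helper, it looks only at the model name, and it recognizes claude-sonnet-4-5 by prefix because the source does not enumerate the newer Claude models.

```python
def choose_tool(model: str) -> str:
    """Pick a computer use tool name from the model the agent has access to."""
    if model == "computer-use-preview":
        return "computer_use"
    if model == "gemini-2.5-computer-use-preview-10-2025":
        return "gemini_computer_use"
    # Newer Claude models are not enumerated in the guide, so only this prefix is matched.
    if model.startswith("claude-sonnet-4-5"):
        return "claude_computer_use"
    # Standard models (gpt-4.1, gpt-4o, ...) fall back to simple browser automation.
    return "browser_automation"


for name in ("computer-use-preview", "claude-sonnet-4-5", "gpt-4.1"):
    print(name, "->", choose_tool(name))
```

Task complexity, speed, and environment needs still have to be weighed by hand using the lists above.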
Performance Quick Reference#
| Tool | Speed/Action | Iterations (Simple Task) | Best For |
|---|---|---|---|
| browser_automation | ~1 sec | 2-5 | Simple tasks, explicit control |
| computer_use | ~1-2 sec | 10-20 | Complex OpenAI workflows |
| gemini_computer_use | ~1-2 sec | 10-20 | Complex Google workflows |
| claude_computer_use | ~2-5 sec | 25-40 | Thorough Anthropic workflows |
Visualization and Monitoring#
Visualizing computer use agents helps you understand what they’re doing in real-time and debug issues.
VNC Viewer (Docker/Linux)#
For Claude Computer Use in Docker, you can watch the desktop in real-time using VNC.
Quick Setup:
# 1. Enable VNC on the Docker container
./scripts/enable_vnc_viewer.sh
# 2. Install a VNC viewer (one-time setup)
# Ubuntu/Debian:
sudo apt-get install tigervnc-viewer
# Or:
sudo snap install remmina
# 3. Connect to the container
# Get container IP:
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' cua-container
# Connect with: <container-ip>:5900
What You’ll See:
Real-time desktop with Xfce window manager
Mouse movements and clicks as the agent executes actions
Terminal windows opening for bash commands
Applications launching (Firefox, text editors, etc.)
File browser operations
All desktop interactions in real-time
Non-Headless Browser Mode#
For Gemini and Claude browser automation, watch the browser by disabling headless mode.
Update Configuration:
Use preset_args (not default_params):
# For Gemini Computer Use
custom_tools:
  - name: ["gemini_computer_use"]
    path: "massgen/tool/_gemini_computer_use/gemini_computer_use_tool.py"
    function: ["gemini_computer_use"]
    preset_args:
      environment: "browser"
      display_width: 1440
      display_height: 900
      environment_config:
        headless: false  # Set to false for visible browser
        browser_type: "chromium"
# For Claude Computer Use (browser mode)
custom_tools:
  - name: ["claude_computer_use"]
    path: "massgen/tool/_claude_computer_use/claude_computer_use_tool.py"
    function: ["claude_computer_use"]
    preset_args:
      environment: "browser"
      headless: false  # Set to false for visible browser
Running with Visible Browser:
Important
You must set the DISPLAY environment variable when running:
# Check your available displays
ls /tmp/.X11-unix/
# Shows: X0, X20, etc.
# Run MassGen with DISPLAY variable (example using :20)
DISPLAY=:20 uv run massgen --config gemini_computer_use_example.yaml
# For Claude browser
DISPLAY=:20 uv run massgen --config claude_computer_use_browser_example.yaml
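The display check above (listing /tmp/.X11-unix/ and reading X0, X20 as :0, :20) can also be done programmatically. The helpers below are a hypothetical sketch, assuming the usual X11 convention that a socket named X<n> corresponds to display :<n>.

```python
import os
import re
from typing import Iterable, List


def displays_from_sockets(names: Iterable[str]) -> List[str]:
    """Convert X11 socket names such as 'X0' or 'X20' into DISPLAY values like ':0'."""
    numbers = []
    for name in names:
        match = re.fullmatch(r"X(\d+)", name)
        if match:
            numbers.append(int(match.group(1)))
    return [f":{n}" for n in sorted(numbers)]


def list_displays(socket_dir: str = "/tmp/.X11-unix") -> List[str]:
    """List candidate DISPLAY values, or an empty list if the directory is unavailable."""
    try:
        return displays_from_sockets(os.listdir(socket_dir))
    except OSError:
        return []


print(displays_from_sockets(["X0", "X20"]))
```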
What You’ll See:
Actual browser window opens on your desktop
For Claude: Browser opens with Google homepage loaded
For Gemini: Browser opens at specified URL or blank page
Pages loading and navigation
Form filling and clicking in real-time
Scrolling and text entry
Mouse movements and interactions
Requirements:
X11 display server running (check with echo $DISPLAY)
Desktop environment (GUI) or X server available
DISPLAY environment variable set (e.g., :0, :20)
Cannot run on headless servers without X forwarding or Xvfb
Using Xvfb (Virtual Display on Headless Servers):
# Install Xvfb
sudo apt-get install xvfb
# Start virtual display
Xvfb :20 -screen 0 1440x900x24 &
# Run with visible browser on virtual display
DISPLAY=:20 uv run massgen --config config.yaml
# To see it, use VNC or x11vnc
x11vnc -display :20 -forever -shared -rfbport 5900 -nopw &
vncviewer localhost:5900
Terminal Output Monitoring#
Real-time Logs:
# Watch MassGen logs in real-time
tail -f massgen_logs/log_*/agent_chat.log
# Watch tool execution
tail -f massgen_logs/log_*/tool_calls.log
Verbose Mode:
# Enable debug logging
export MASSGEN_LOG_LEVEL=DEBUG
uv run massgen --config config.yaml
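When several runs have accumulated, the log_* glob used with tail above matches more than one directory. A small helper can pick the most recent file instead; this sketch assumes the massgen_logs/log_*/ layout shown above and uses modification time as the notion of "latest". The `latest_log` name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_log(base: str = "massgen_logs", filename: str = "agent_chat.log") -> Optional[Path]:
    """Return the most recently modified <base>/log_*/<filename>, or None if none exist."""
    candidates = list(Path(base).glob(f"log_*/{filename}"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    print(latest_log())
```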
Multi-Agent Computer Use#
You can combine multiple computer use tools in a single configuration for complex workflows.
Example: Claude (Desktop) + Gemini (Browser)
agents:
  # Agent 1: Claude Computer Use with Docker
  - id: "claude_desktop_agent"
    backend:
      type: "claude"
      model: "claude-sonnet-4-5"
      betas: ["computer-use-2025-01-24"]
      custom_tools:
        - name: ["claude_computer_use"]
          path: "massgen/tool/_claude_computer_use/claude_computer_use_tool.py"
          function: ["claude_computer_use"]
          preset_args:
            environment: "linux"
            display_width: 1024
            display_height: 768
            max_iterations: 30
    system_message: |
      You are a Linux desktop automation specialist.
      Your specialty: File operations, bash scripts, system-level tasks.
  # Agent 2: Gemini Computer Use with Browser
  - id: "gemini_browser_agent"
    backend:
      type: "openai"
      model: "gpt-4.1"
      custom_tools:
        - name: ["gemini_computer_use"]
          path: "massgen/tool/_gemini_computer_use/gemini_computer_use_tool.py"
          function: ["gemini_computer_use"]
          preset_args:
            environment: "browser"
            display_width: 1440
            display_height: 900
            environment_config:
              headless: false
              browser_type: "chromium"
    system_message: |
      You are a web research and browser automation specialist.
      Your specialty: Web browsing, data extraction, online research.
Example Use Cases:
“Search for the latest Python releases on the web, then create a summary document”
“Download a file from the web and process it with a bash script”
“Research information online and save it to a file on the desktop”
Troubleshooting#
Common Configuration Mistake#
Issue: Browser always runs in headless mode even with headless: false
Solution: MassGen’s custom tools use preset_args, NOT default_params:
# ❌ WRONG - Will not work
custom_tools:
  - name: ["gemini_computer_use"]
    default_params:
      environment_config:
        headless: false
# ✅ CORRECT - Use preset_args
custom_tools:
  - name: ["gemini_computer_use"]
    preset_args:
      environment: "browser"
      display_width: 1440
      display_height: 900
      environment_config:
        headless: false
        browser_type: "chromium"
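This mistake is easy to detect mechanically once the YAML has been loaded into Python data. The sketch below walks an already-parsed config (plain dicts and lists) and reports any custom tool entry that uses default_params; `tools_using_default_params` is a hypothetical helper, and it searches recursively so it does not depend on where custom_tools sits in the file.

```python
from typing import Any, List


def tools_using_default_params(node: Any) -> List[str]:
    """Return names of custom_tools entries that contain a default_params key."""
    found: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "custom_tools" and isinstance(value, list):
                for tool in value:
                    if isinstance(tool, dict) and "default_params" in tool:
                        names = tool.get("name", [])
                        found.extend(names if isinstance(names, list) else [str(names)])
            else:
                found.extend(tools_using_default_params(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(tools_using_default_params(item))
    return found


wrong = {
    "custom_tools": [
        {
            "name": ["gemini_computer_use"],
            "default_params": {"environment_config": {"headless": False}},
        }
    ]
}
print(tools_using_default_params(wrong))
```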
VNC Issues#
# Check if VNC is running
docker exec cua-container ps aux | grep x11vnc
# Restart VNC
docker exec cua-container pkill x11vnc
./scripts/enable_vnc_viewer.sh
# Check firewall
sudo ufw allow 5900/tcp
Browser Not Showing#
# 1. Check DISPLAY variable is set
echo $DISPLAY
# Should show something like: :0 or :20
# 2. List available displays
ls /tmp/.X11-unix/
# Shows: X0, X20, etc.
# 3. Test with simple X app
DISPLAY=:20 xeyes # Should open a window
# 4. If no DISPLAY, create virtual display
Xvfb :20 -screen 0 1440x900x24 &
export DISPLAY=:20
# 5. Verify config uses preset_args (not default_params)
grep -A5 "preset_args" your_config.yaml
# 6. Ensure headless: false in environment_config
grep "headless" your_config.yaml
Best Practices#
Development: Use VNC + non-headless browser for debugging
Testing: Use terminal logs with occasional screenshots
Production: Use headless mode with comprehensive logging
Demos: Record sessions with VNC/browser recording
Remote Work: Use X11 forwarding or VNC over SSH tunnel
Iteration Limits: Set appropriate max_iterations based on task complexity
Safety: Test actions in isolated environments before production use
Error Handling: Monitor logs for errors and adjust configurations
File Structure#
massgen/
├── tool/
│   ├── _computer_use/              # OpenAI CUA implementation
│   │   ├── __init__.py
│   │   ├── computer_use_tool.py    # Requires computer-use-preview
│   │   ├── README.md
│   │   └── QUICKSTART.md
│   │
│   ├── _gemini_computer_use/       # Google Gemini implementation
│   │   ├── __init__.py
│   │   └── gemini_computer_use_tool.py
│   │
│   ├── _claude_computer_use/       # Anthropic Claude implementation
│   │   ├── __init__.py
│   │   └── claude_computer_use_tool.py
│   │
│   └── _browser_automation/        # Simple browser tool
│       ├── __init__.py
│       └── browser_automation_tool.py
│
└── configs/tools/custom_tools/
    ├── gemini_computer_use_example.yaml
    ├── gemini_computer_use_docker_example.yaml
    ├── claude_computer_use_docker_example.yaml
    ├── claude_computer_use_browser_example.yaml
    ├── simple_browser_automation_example.yaml
    └── multi_agent_computer_use_example.yaml
Next Steps#
Related Guides:
Custom Tools - Learn about creating custom tools
Multimodal Capabilities - Multimodal capabilities
MCP Integration - External tools via MCP
YAML Configuration Reference - Complete YAML reference
Configuration Examples:
massgen/backend/docs/COMPUTER_USE_TOOLS_GUIDE.md - Comprehensive implementation guide
massgen/backend/docs/COMPUTER_USE_VISUALIZATION.md - Visualization guide
Setup Guides:
scripts/computer_use_setup.md - Docker installation guide
./scripts/setup_docker_cua.sh - Manual Docker setup script (optional; the container is auto-created on first run)
./scripts/enable_vnc_viewer.sh - VNC visualization setup