Memory and Context Management#

MassGen’s memory system enables agents to maintain knowledge across conversations, handle long context windows gracefully, and share insights across multi-turn sessions. The system automatically manages context compression, semantic memory retrieval, and cross-agent knowledge sharing.

Overview#

The memory system consists of two complementary components:

ConversationMemory (Short-term)

Fast in-memory storage for recent messages. Maintains verbatim conversation history for the current context window.

PersistentMemory (Long-term)

Vector database storage (via mem0) with semantic search. Extracts and stores key facts that persist across sessions and can be retrieved when relevant.

Key Features#

  • Automatic Context Compression: When a conversation outgrows the context window, old messages are compressed while their key facts remain accessible via semantic search

  • Semantic Retrieval: Retrieve relevant facts from past conversations based on current context

  • Cross-Agent Memory Sharing: Agents access previous winning agents’ knowledge from past turns

  • Session Management: Memories isolated by session for clean separation of different tasks

  • Turn-Aware Filtering: Prevents temporal leakage by filtering memories by turn number

Quick Start#

Prerequisites#

For multi-agent setups, start the Qdrant vector database server:

# Start Qdrant (required for persistent memory)
docker-compose -f docker-compose.qdrant.yml up -d

# Verify it's running
curl http://localhost:6333/healthz

# (Optional) View Qdrant dashboard
open http://localhost:6333/dashboard

Basic Configuration#

Add memory configuration to your YAML config:

memory:
  enabled: true

  conversation_memory:
    enabled: true  # Short-term tracking

  persistent_memory:
    enabled: true  # Long-term storage

    # LLM for fact extraction (uses mem0's native providers)
    llm:
      provider: "openai"
      model: "gpt-4.1-nano-2025-04-14"

    # Embeddings for vector search
    embedding:
      provider: "openai"
      model: "text-embedding-3-small"

    # Qdrant configuration
    qdrant:
      mode: "server"  # Use "local" for single-agent only
      host: "localhost"
      port: 6333

  # Context compression settings
  compression:
    trigger_threshold: 0.75  # Compress at 75% usage
    target_ratio: 0.40       # Keep 40% after compression

  # Retrieval settings
  retrieval:
    limit: 5              # Facts to retrieve
    exclude_recent: true  # Only retrieve after compression

  # Recording settings (v0.1.9+)
  recording:
    record_all_tool_calls: false  # Set true to capture ALL MCP tools
    record_reasoning: false       # Set true to capture thinking separately

Run with Memory#

# Interactive mode with memory
massgen --config @examples/memory/gpt5mini_gemini_context_window_management.yaml

# Single question with memory
massgen \
  --config @examples/memory/gpt5mini_gemini_context_window_management.yaml \
  "Analyze the MassGen codebase and create an architecture document"

How It Works#

Custom Fact Extraction#

MassGen uses custom prompts designed to extract high-quality, domain-focused memories. The prompts filter extracted facts so that they are:

Self-Contained and Specific:

Facts should be understandable 6 months later without the original conversation

Focused on Domain Knowledge:
  • ✅ Concrete data points with context (“OpenAI revenue reached $12B annualized”)

  • ✅ Insights with explanations (“Narrative depth valued in creative writing because…”)

  • ✅ Capabilities with use cases (“MassGen v0.1.1 supports Python tools via YAML”)

  • ✅ Domain expertise with details (“Binet’s formula uses golden ratio phi=(1+√5)/2”)

  • ✅ Specific recommendations with WHAT, WHEN, WHY

Tool Usage Patterns (v0.1.9+):
  • ✅ Tool sequences that work (“For code analysis, directory_tree → read_file → grep provides systematic understanding”)

  • ✅ Problem-solving approaches (“Breaking large tasks into focused searches yields better results than broad queries”)

  • ✅ What worked/failed with reasoning (“Sequential exploration prevents getting lost in implementation details”)

Excluded for Quality:
  • ❌ Agent comparisons (“Agent 1’s response is better”)

  • ❌ Voting details (“The reason for voting…”)

  • ❌ Meta-instructions (“Response should start with…”)

  • ❌ Generic advice without specifics (“Providing templates improves docs”)

  • ❌ Usage statistics without insight (“Used grep 5 times”)

Implementation: massgen/memory/_fact_extraction_prompts.py::MASSGEN_UNIVERSAL_FACT_EXTRACTION_PROMPT

Memory Flow#

Every Turn:

  1. User message added to conversation_memory (verbatim)

  2. Agent responds with reasoning and answer

  3. Response recorded to:

    • ConversationMemory: Full message for immediate context

    • PersistentMemory: mem0’s LLM extracts key facts and stores in vector DB

  4. Context window checked:

    • Below threshold: Continue normally

    • Above threshold: Compress old messages, enable retrieval

What Gets Recorded (Default):

✅ User messages
✅ Final answer text (accumulated from content chunks)
✅ Workflow tools (new_answer, vote) with full arguments

❌ System messages (orchestrator prompts - filtered out)
❌ MCP tool calls (unless record_all_tool_calls: true)
❌ Reasoning chunks (unless record_reasoning: true)
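
The default recording rules above can be sketched as a small predicate. This is an illustration only, not MassGen's implementation: the `should_record` function, the `kind`/`name` keys, and the `WORKFLOW_TOOLS` set are hypothetical names chosen for this example.

```python
from typing import Dict

# Workflow tools named in this section; the constant itself is hypothetical.
WORKFLOW_TOOLS = {"new_answer", "vote"}


def should_record(item: Dict[str, str],
                  record_all_tool_calls: bool = False,
                  record_reasoning: bool = False) -> bool:
    """Return True if an item would be recorded under the rules listed above.

    `item` is an assumed shape: {"kind": "user" | "answer" | "system" |
    "tool_call" | "reasoning", "name": <tool name, for tool calls only>}.
    """
    kind = item["kind"]
    if kind in ("user", "answer"):
        return True   # user messages and final answer text are always kept
    if kind == "system":
        return False  # orchestrator prompts are filtered out
    if kind == "tool_call":
        # Workflow tools are always kept; other (MCP) tools only when enabled.
        return item["name"] in WORKFLOW_TOOLS or record_all_tool_calls
    if kind == "reasoning":
        return record_reasoning
    return False      # unknown kinds are skipped in this sketch
```

With both flags left at their defaults, a `read_file` tool call is skipped while a `vote` call is kept.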

Configurable Recording (v0.1.9+):

You can now control what gets recorded to memory via YAML configuration:

memory:
  recording:
    record_all_tool_calls: false  # Set to true to capture ALL MCP tools
    record_reasoning: false       # Set to true to capture thinking separately

See Recording Settings (v0.1.9+) below for details.

Context Compression#

MassGen uses reactive compression for context window management. This is due to a fundamental limitation of most LLM APIs.

Why Reactive?

Most LLM providers (OpenAI, Anthropic, Google) report authoritative token usage only after a request completes. Mid-stream token counts are generally unavailable, and pre-flight token counting is not offered uniformly across providers. As a result, MassGen does not try to prevent context overflow proactively; it reacts when the provider returns a context length error.

How It Works

  1. MassGen sends the conversation to the LLM

  2. If the context is too long, the provider returns an error

  3. MassGen catches the error and generates a summary of the work done so far

  4. The summarized conversation is retried automatically (single retry to prevent loops)
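
The four steps above amount to a catch-summarize-retry-once pattern. The sketch below shows that control flow under stated assumptions: `ContextLengthError`, `send`, and `summarize` are hypothetical stand-ins, not MassGen or provider APIs.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ContextLengthError(Exception):
    """Hypothetical stand-in for a provider's context-length error."""


def call_with_reactive_compression(
    send: Callable[[List[Message]], str],
    summarize: Callable[[List[Message]], List[Message]],
    messages: List[Message],
) -> str:
    """Send messages; on a context-length error, summarize and retry exactly once."""
    try:
        return send(messages)
    except ContextLengthError:
        compressed = summarize(messages)
        # Single retry: if this call also overflows, the error propagates
        # to the caller instead of looping.
        return send(compressed)
```

Limiting the retry to one attempt means a summary that is still too long surfaces as an error rather than triggering repeated summarization.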

After compression, the message structure looks like:

Before Error:
[system] → [user 1] → [assistant 1] → ... → [user 20] → [assistant 20] ← ERROR

After Compression:
[system] → [user request] → [summary as assistant message]
↑           ↑                ↑
System      User's original  Summary of ALL work done so far
preserved   request          (most recent context - model continues from here)

Key Design: User → Summary Ordering

The summary is placed after the user message as an assistant message. This ordering is critical for preventing redundant work:

  • The model sees its own summary as the most recent context

  • It naturally continues from the summary rather than starting fresh

  • File reads, analysis, and other completed work are preserved in the summary
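
A minimal sketch of the user → summary ordering, assuming messages are dicts with `role` and `content` keys and that the "original request" is the first user message. The function name `build_compressed_messages` is hypothetical.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_compressed_messages(messages: List[Message], summary: str) -> List[Message]:
    """Return [system] -> [first user request] -> [summary as assistant message].

    Assumes at least one user message exists; raises StopIteration otherwise.
    """
    system = [m for m in messages if m["role"] == "system"][:1]
    first_user = next(m for m in messages if m["role"] == "user")
    # The summary goes last, as an assistant turn, so the model treats it
    # as its own most recent output and continues from there.
    return system + [first_user, {"role": "assistant", "content": summary}]
```

Placing the summary before the user message instead would make the request the latest context, which invites the model to start over.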

What Gets Summarized

The compression system captures everything in the streaming buffer, including:

  • Tool calls and their results (file reads, directory listings, etc.)

  • Reasoning and analysis performed

  • Partial answers and work in progress

  • Any content that was streaming when the context limit was hit

This ensures the model doesn’t re-read files or redo analysis after compression.

Configuration

coordination:
  compression_target_ratio: 0.20  # Preserve 20% of messages, summarize 80%

The compression_target_ratio controls how aggressively to compress when the context limit is exceeded:

  • 0.20 (default): Preserve ~20% of messages verbatim, summarize the rest

  • 0.30: More conservative, preserve ~30% of messages

  • 0.10: More aggressive, preserve only ~10% of messages
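
To make the ratio concrete, the sketch below splits a message list into a part to summarize and a part to preserve verbatim. It assumes the preserved part is the most recent messages and that the count is rounded to the nearest whole message; both are assumptions for illustration, and `split_by_target_ratio` is a hypothetical name.

```python
from typing import List, Tuple


def split_by_target_ratio(messages: List[str],
                          target_ratio: float) -> Tuple[List[str], List[str]]:
    """Split messages into (to_summarize, to_preserve), keeping the newest ones."""
    if not 0.0 < target_ratio <= 1.0:
        raise ValueError("target_ratio must be in (0, 1]")
    if not messages:
        return [], []
    # Keep at least one message so the most recent turn survives verbatim.
    keep = max(1, round(len(messages) * target_ratio))
    cut = len(messages) - keep
    return messages[:cut], messages[cut:]


history = [f"msg {i}" for i in range(1, 21)]
to_summarize, to_preserve = split_by_target_ratio(history, 0.20)
print(len(to_summarize), len(to_preserve))  # 16 4
```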

Note

Compression is reactive - it only triggers when the provider returns a context length error. MassGen cannot predict when context will exceed the limit because token counts are only available after each LLM call completes.

Best Practices

  • For very long tasks, consider breaking into multiple sessions

  • Use clear_history=True when starting unrelated topics

  • Critical information should be in recent messages or system prompt

  • Lower compression_target_ratio for more aggressive compression (preserves less)

Future Improvements

Some providers may add better token tracking in the future:

  • Pre-flight token counting APIs

  • Streaming token usage updates

  • Local models with tiktoken-based estimation

Memory Retrieval#

When retrieval runs:

  • After compression: Retrieve facts from compressed messages

  • On restart/reset: Restore recent context

  • Before compression: Skip (all context already in conversation_memory)
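
The timing rules above reduce to a short decision function. This is a sketch under the assumption that the `exclude_recent` setting is what gates retrieval before compression; the function name and the `compressed`/`restarted` flags are hypothetical.

```python
def should_retrieve(exclude_recent: bool, compressed: bool, restarted: bool) -> bool:
    """Decide whether to query persistent memory before the next LLM call."""
    if restarted:
        return True   # restart/reset: always restore context
    if not exclude_recent:
        return True   # exclude_recent: false retrieves on every call
    # exclude_recent: true skips retrieval while everything is still in
    # conversation memory, and retrieves once compression has removed messages.
    return compressed
```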

Retrieval process:

  1. Search own agent’s memories (all turns, current session)

  2. Search previous winners’ memories (filtered by turn - see below)

  3. Format and inject as system message before processing

Retrieved memories are injected as:

┌─────────────────────────────────────┐
│ Relevant memories:                   │
│ • User asked about backend system    │
│ • Agent analyzed 5 backend files     │
│ • [From agent_b Turn 1] Explained    │
│   stateful vs stateless backends     │
└─────────────────────────────────────┘
↓
[user msg 15] → [agent response 15] → ...
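
A rough sketch of how such a block could be assembled into a system message. The function name and the `agent_id`/`turn`/`memory` keys are assumptions for illustration; only the "Relevant memories:" header and the "[From agent_b Turn 1]" attribution format come from the diagram above.

```python
from typing import Dict, List, Optional


def format_memories(own: List[str], winners: List[Dict]) -> Optional[Dict[str, str]]:
    """Build a system message listing own facts, then attributed winner facts."""
    lines = [f"• {fact}" for fact in own]
    for w in winners:
        # Winner facts carry their source agent and turn for attribution.
        lines.append(f"• [From {w['agent_id']} Turn {w['turn']}] {w['memory']}")
    if not lines:
        return None  # nothing relevant: inject no message at all
    return {"role": "system", "content": "Relevant memories:\n" + "\n".join(lines)}
```

Returning `None` when nothing was found avoids injecting an empty "Relevant memories:" header into the context.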

Use Cases#

Scenario 1: Long Analysis Tasks#

Use case: Analyzing a large codebase that requires reading 50+ files

Without memory:

Context fills up after ~15 files, and the agent loses track of earlier analysis

With memory:
  • Agent reads files 1-15, context compresses

  • Files 16-30: Agent retrieves relevant facts from 1-15

  • Maintains complete understanding throughout analysis

Configuration:

memory:
  enabled: true
  compression:
    trigger_threshold: 0.75  # Compress when 75% full
    target_ratio: 0.40        # Keep 40% of recent context

Example:

massgen --config @examples/memory/gpt5mini_gemini_context_window_management.yaml \
  "Analyze the entire MassGen codebase and create comprehensive documentation"

Scenario 2: Multi-Turn Sessions#

Use case: Interactive development across multiple turns

Without memory:

Each turn starts fresh, and agents forget insights from previous turns

With memory:
  • Turn 1: Agent A wins, explains backend architecture

  • Turn 2: Agent B retrieves Agent A’s Turn 1 insights

  • Turn 3: Agent A sees both own past work + Agent B’s Turn 2 insights

How winner memory sharing works:

Turn 1: agent_a wins → Memories tagged {"agent_id": "agent_a", "turn": 1}
Turn 2:
  agent_b retrieves:
    ✅ Own memories (all turns)
    ✅ agent_a's Turn 1 memories (previous winner)
    ❌ agent_a's Turn 2 memories (not yet complete)

Turn 3:
  agent_a retrieves:
    ✅ Own memories (Turns 1, 2)
    ✅ agent_b's Turn 2 memories (previous winner)
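
The visibility rules in this example can be expressed as a filter over tagged memories. The sketch assumes each memory carries `agent_id` and `turn` metadata (as in the tag shown for Turn 1) and that previous winners are passed as a list of `{"agent_id", "turn"}` dicts; the function name is hypothetical.

```python
from typing import Dict, List


def visible_memories(memories: List[Dict], agent_id: str, current_turn: int,
                     previous_winners: List[Dict]) -> List[Dict]:
    """Return own memories (all turns) plus winners' memories from their winning turns."""
    # Only winners from strictly earlier turns count; this is what prevents
    # leakage from a turn that is still in progress.
    allowed = {(w["agent_id"], w["turn"]) for w in previous_winners
               if w["turn"] < current_turn}
    return [m for m in memories
            if m["agent_id"] == agent_id or (m["agent_id"], m["turn"]) in allowed]
```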

Configuration:

A session ID is generated automatically in interactive mode, for example session_20251028_143000.

Memories are isolated per session unless you specify a custom session name.

Scenario 3: Orchestrator Restarts#

Use case: Agent needs to restart due to errors or new answers from other agents

Without memory:

Partial work is lost, and the agent starts from scratch

With memory:
  • Before restart: Current conversation recorded to persistent_memory

  • On restart: Relevant facts retrieved to restore context

  • Agent continues seamlessly with knowledge of prior attempts

Example flow:

Agent A working on task...
📝 Read 5 files, analyzed architecture
🔄 Other agent submits better answer → Restart triggered
💾 Recording 10 messages before reset
🔄 Retrieving memories after reset...
💭 Retrieved: "Analyzed backend/base.py", "Found adapter pattern", ...
✅ Agent continues with restored context

Configuration Reference#

Complete Configuration#

memory:
  # Global enable/disable
  enabled: true

  # Short-term conversation tracking
  conversation_memory:
    enabled: true

  # Long-term knowledge storage
  persistent_memory:
    enabled: true
    on_disk: true  # Persist across restarts

    # Session isolation (optional)
    # session_name: "my_project_analysis"  # Specific session
    # session_name: null                   # Cross-session memory

    # LLM for fact extraction
    llm:
      provider: "openai"
      model: "gpt-4.1-nano-2025-04-14"  # Fast, cheap for memory ops
      # api_key: "sk-..."  # Optional - reads from OPENAI_API_KEY env var

    # Embeddings for vector search
    embedding:
      provider: "openai"
      model: "text-embedding-3-small"
      # api_key: "sk-..."  # Optional - reads from OPENAI_API_KEY env var

    # Vector store (Qdrant)
    qdrant:
      mode: "server"      # "server" or "local"
      host: "localhost"   # Server mode only
      port: 6333          # Server mode only
      # path: ".massgen/qdrant"  # Local mode only

  # Context window compression
  compression:
    trigger_threshold: 0.75  # Compress at 75% context usage
    target_ratio: 0.40       # Target 40% after compression

  # Memory retrieval
  retrieval:
    limit: 5              # Max facts per agent
    exclude_recent: true  # Skip retrieval before compression

  # Memory recording (v0.1.9+)
  recording:
    record_all_tool_calls: false  # Record ALL MCP tools (not just workflow)
    record_reasoning: false       # Record reasoning chunks separately

Configuration Options#

Memory Toggle#

memory:
  enabled: false  # Disable entire memory system

Conversation Memory#

conversation_memory:
  enabled: true  # Almost always true - needed for context management

Persistent Memory#

LLM Configuration (for fact extraction):

Provider    Configuration
---------   ----------------------------------------------------------------------
OpenAI      provider: "openai", model: "gpt-4.1-nano-2025-04-14" or "gpt-4o-mini"
Anthropic   provider: "anthropic", model: "claude-haiku-4-5-20251001"
Groq        provider: "groq", model: "llama-3.1-8b-instant"

Embedding Configuration (for vector search):

Provider       Configuration
------------   -------------------------------------------------------------------------
OpenAI         provider: "openai", model: "text-embedding-3-small" (1536 dims)
Together       provider: "together", model: "togethercomputer/m2-bert-80M-8k-retrieval"
Azure OpenAI   provider: "azure_openai", model: "text-embedding-ada-002"

Qdrant Configuration:

# Server mode (RECOMMENDED for multi-agent)
qdrant:
  mode: "server"
  host: "localhost"
  port: 6333

# Local mode (single agent only)
qdrant:
  mode: "local"
  path: ".massgen/qdrant"

Warning

Local file-based Qdrant does NOT support concurrent access. For multi-agent setups, always use server mode.
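
One way to catch this misconfiguration early is a validation step before agents start. The check below is a hypothetical sketch and not part of MassGen; `validate_qdrant_mode` and its arguments are illustrative names.

```python
def validate_qdrant_mode(mode: str, num_agents: int) -> None:
    """Raise ValueError for a Qdrant mode that cannot serve the given agent count."""
    if mode not in ("server", "local"):
        raise ValueError(f"unknown qdrant mode: {mode!r}")
    if mode == "local" and num_agents > 1:
        # Local, file-based storage is locked by the first process that opens it.
        raise ValueError("local Qdrant mode supports a single agent; use mode: server")
```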

Session Management#

Automatic sessions:

All sessions are automatically created and tracked in the registry:

  • Interactive mode: session_20251028_143000 (shared across all turns in that session)

  • Single question: session_20251028_143001 (each run gets its own tracked session)

Custom sessions:

persistent_memory:
  session_name: "my_project_analysis"  # Continue specific session

Cross-session memory (search across all sessions):

persistent_memory:
  session_name: null  # or omit the field

Loading Previous Sessions#

MassGen automatically tracks all memory sessions in a registry (~/.massgen/sessions.json). You can list and load previous sessions to continue conversations with their memory context intact.

List available sessions:

massgen --list-sessions

Example output:

Available Memory Sessions:
============================================================

Session ID: session_20251028_143000
  Status:  completed
  Started: 2025-10-28 14:30:00
  Model:   gpt-4o-mini
  Config:  memory_config.yaml

Session ID: session_20251027_091500
  Status:  completed
  Started: 2025-10-27 09:15:00
  Model:   gpt-4o
  Description: Codebase analysis project
  Config:  research_config.yaml

============================================================
To load a session, use: massgen --session-id <SESSION_ID> "Your question"

Load session via CLI:

# Continue previous session
massgen --session-id session_20251028_143000 "What did we discuss about the backend?"

# Interactive mode with previous session
massgen --session-id session_20251028_143000 --config my_config.yaml

Load session via YAML config:

# Add to your config file
session_id: "session_20251028_143000"

memory:
  enabled: true
  persistent_memory:
    enabled: true
    # ... rest of memory config

Priority order: CLI argument (--session-id) > YAML config (session_id:) > Auto-generated
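
The priority order can be sketched as a small resolver. The function name is hypothetical, and the auto-generated format is inferred from the example IDs in this section (session_YYYYMMDD_HHMMSS), so treat it as an assumption.

```python
from datetime import datetime
from typing import Optional


def resolve_session_id(cli_session_id: Optional[str],
                       yaml_session_id: Optional[str],
                       now: Optional[datetime] = None) -> str:
    """Pick the session ID: CLI argument, then YAML config, then auto-generated."""
    if cli_session_id:
        return cli_session_id
    if yaml_session_id:
        return yaml_session_id
    # Fall back to a timestamp-based ID shaped like the examples above.
    return (now or datetime.now()).strftime("session_%Y%m%d_%H%M%S")
```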

Benefits:

  • Continue conversations across multiple CLI runs

  • Access memory from previous analysis sessions

  • Build on previous agents’ knowledge without re-analysis

  • Maintain context for long-running research projects

Note: All sessions (both interactive and single-question modes) are tracked in the registry and can be continued later

Compression Settings#

compression:
  trigger_threshold: 0.75  # Not reliably enforceable - see note below
  target_ratio: 0.20        # Preserve 20% of messages after compression

Note

Reactive Compression Limitation: The trigger_threshold cannot be proactively enforced because token counts are only available after each LLM call completes. MassGen uses reactive compression—catching context length errors from the provider and summarizing automatically. Only target_ratio is reliably enforced.

Example configurations:

  • Aggressive compression: target_ratio: 0.10 (preserve only 10%)

  • Moderate (default): target_ratio: 0.20 (preserve 20%)

  • Conservative: target_ratio: 0.40 (preserve 40%)

Retrieval Settings#

retrieval:
  limit: 5              # Max facts per agent (default: 5)
  exclude_recent: true  # Smart retrieval (default: true)

  • More context: Increase limit to 10-20 (uses more tokens)

  • Always retrieve: Set exclude_recent: false (may duplicate recent context)

Recording Settings (v0.1.9+)#

New in v0.1.9: Control what gets recorded to memory for better observability and learning.

memory:
  recording:
    record_all_tool_calls: false  # Record ALL MCP tools (not just workflow)
    record_reasoning: false       # Record reasoning chunks separately

record_all_tool_calls (default: false):

  • false: Only workflow tools (new_answer, vote) are recorded

  • true: ALL MCP tools are captured (list_directory, read_file, write_file, etc.)

When to enable:

  • Learning tool usage patterns across sessions

  • Debugging which tools agents use most

  • Understanding tool sequences (e.g., “directory_tree → read_file → grep”)

  • Maximum observability during development

Example with ALL tools enabled:

[Tool Usage]
[Tool Call: mcp__filesystem__directory_tree]
Arguments: {"path": "/Users/.../massgen"}
Result: [directory structure with 50+ files...]

[Tool Call: mcp__filesystem__read_text_file]
Arguments: {"path": ".../orchestrator.py"}
Result: [full file contents...]

[Tool Call: new_answer]
Arguments: {"content": "Architecture analysis complete..."}
Result: Answer submitted

record_reasoning (default: false):

  • false: Reasoning mixed with final answer in main response

  • true: Reasoning chunks saved separately with [Reasoning] prefix

When to enable:

  • Debugging agent decision-making

  • Learning problem-solving approaches

  • Capturing strategic thinking separate from final output

Example with reasoning enabled:

[Reasoning]
I should analyze the file structure first before diving into specific implementations.
This will help me build a mental model of the codebase organization.

[Reasoning Summary]
Decided to use directory_tree followed by selective file reads for systematic analysis.

Final answer: The codebase follows a modular architecture...

Performance Impact:

  • With both disabled (default): ~1-2 KB per recording, concise memory

  • With both enabled: ~10-50 KB per recording, maximum detail

  • mem0 extraction cost: Same LLM calls regardless (extracts from whatever is sent)

Recommendation:

  • Development: Enable both for debugging

  • Production: Keep disabled for concise, focused memory

Monitoring and Debugging#

Context Window Logs#

MassGen uses buffer-based context tracking to accurately monitor token usage. The conversation buffer captures ALL content including tool calls, tool results, injections, and reasoning—not just turn-level messages.

Token Tracking Priority:

  1. Official API counts (at stream end): Most accurate for cost/pricing

  2. Buffer estimation (fallback): Captures all content provided by API

Monitor context usage in real-time:

# Using official API token counts (most accurate)
📊 Context Window (Turn 5): 45,000 / 128,000 tokens (35%) [API actual]

# Using buffer estimation (fallback, assuming API provides all content)
📊 Context Buffer (Turn 5): 45,000 / 128,000 tokens (35%) [buffer]

When compression triggers:

⚠️  Context Buffer (Turn 11): 96,000 / 128,000 tokens (75%) [buffer] - Approaching limit!
🔄 Attempting compression (96,000 → 51,200 tokens)
📦 Context compressed: Removed 15 messages (44,800 tokens).
   Kept 8 recent messages (51,200 tokens).
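
The numbers in this log follow from simple arithmetic on the window size. The sketch below reproduces them, assuming a 0.75 trigger and a 0.40 target expressed as fractions of the full window; `compression_plan` is a hypothetical helper, not a MassGen function.

```python
from typing import Dict, Union


def compression_plan(used_tokens: int, window_tokens: int,
                     trigger_threshold: float = 0.75,
                     target_ratio: float = 0.40) -> Dict[str, Union[float, bool, int]]:
    """Compute usage, whether the trigger is reached, and the post-compression target."""
    usage = used_tokens / window_tokens
    target_tokens = round(window_tokens * target_ratio)
    return {
        "usage": usage,
        "should_compress": usage >= trigger_threshold,
        "target_tokens": target_tokens,
        "tokens_to_remove": max(0, used_tokens - target_tokens),
    }


plan = compression_plan(96_000, 128_000)
print(plan["target_tokens"], plan["tokens_to_remove"])  # 51200 44800
```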

Why Buffer-Based Tracking?

The conversation buffer is the true source of context sent to agents. Unlike message-based tracking, it includes:

  • Tool calls and their arguments

  • Tool results (can be very large)

  • Injections from other agents

  • Pending content not yet flushed

  • Reasoning/thinking content (may not be available, depending on the API)

This provides accurate context usage even mid-stream, before official API counts are available.

Memory Operations#

Recording:

🔍 [_mem0_add] Recording to mem0 (agent=agent_a, session=session_123, turn=1)
   messages: 2 message(s)
   assistant: [Reasoning] I analyzed the backend files...
   assistant: The backend system consists of...
✅ mem0 extracted 5 fact(s), 2 relation(s)

Retrieval:

🔄 Retrieving memories after reset for agent_a (restoring recent context + 1 winner(s))...
🔍 [retrieve] Searching memories (agent=agent_a, limit=5, winners=1)
   Previous winners: [{'agent_id': 'agent_b', 'turn': 1}]
   🔎 Searching own memories (agent_a)...
      → Found 3 memory/memories
   🔎 Searching 1 previous winner(s)...
      → Searching agent_b (turn 1)...
         Found 2 memory/memories
✅ Total: 5 memories retrieved
   [1] User asked about MassGen architecture
   [2] [From agent_b Turn 1] Explained the adapter pattern

Debug Files (v0.1.9+)#

New in v0.1.9: Memory debug mode saves complete message→fact mappings when using the --debug flag.

Enable debug mode:

massgen --debug --config your_config.yaml "Your question"

Debug files saved to:

.massgen/massgen_logs/log_{timestamp}/attempt_{N}/memory_debug/
└── {agent_id}/
    ├── turn_1_20251029_200335.json
    ├── turn_2_20251029_200438.json
    └── turn_3_20251029_200557.json

File structure:

{
  "timestamp": "2025-10-29T20:03:35.123456",
  "agent_id": "test_agent",
  "session_id": "temp_20251029_200122",
  "turn": 1,
  "metadata": {
    "tools_used": ["mcp__filesystem__directory_tree", "read_text_file"],
    "has_tools": true,
    "message_count": 1
  },
  "messages_sent": [
    {
      "role": "assistant",
      "content": "[Tool Usage]\n[Tool Call: directory_tree]\nArguments: {...}\nResult: ..."
    }
  ],
  "facts_extracted": [
    {
      "id": "abc123",
      "memory": "For analyzing Python codebases, directory_tree → read_file sequence...",
      "event": "ADD"
    }
  ],
  "extraction_count": 10
}

Use cases:

  • Verify tool capture: Check if MCP tools appear in messages_sent

  • Tune prompts: Compare input vs. extracted facts to improve extraction quality

  • Debug 0 facts: See what content was sent when extraction fails

  • Monitor quality: Review if facts are actionable or generic
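
For the use cases above, a short script can summarize a debug file. This sketch relies only on the keys shown in the example file structure (`turn`, `metadata.tools_used`, `messages_sent`, `facts_extracted`); the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_debug_record(path: str) -> Dict[str, Any]:
    """Read one memory_debug JSON file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_debug_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize what was sent to mem0 and how many facts came back."""
    facts = record.get("facts_extracted", [])
    sent_chars = sum(len(m.get("content") or "") for m in record.get("messages_sent", []))
    return {
        "turn": record.get("turn"),
        "tools_used": record.get("metadata", {}).get("tools_used", []),
        "sent_chars": sent_chars,        # very small values suggest too little content
        "fact_count": len(facts),
        "zero_facts": len(facts) == 0,   # flag records worth inspecting by hand
    }
```

Running `summarize_debug_record(load_debug_record(path))` over each turn file gives a quick view of which turns produced no facts.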

Testing Memory Setup#

Verify your memory configuration:

# Run test script
uv run python scripts/test_memory_setup.py

Expected output:

🧪 MEMORY SYSTEM TEST SUITE

============================================================
TEST 1: Environment Variables
============================================================
✅ OPENAI_API_KEY found (starts with: sk-proj...)

============================================================
TEST 2: OpenAI Embedding API
============================================================
✅ Embedding successful!
   Vector dimensions: 1536

============================================================
TEST 3: mem0 LLM API (gpt-4.1-nano)
============================================================
✅ LLM call successful!

============================================================
TEST 4: Qdrant Connection
============================================================
✅ Qdrant server connected!

============================================================
TEST 5: Full Memory Integration
============================================================
✅ PersistentMemory created!
✅ Messages recorded!

Advanced Usage#

Per-Agent Memory Configuration#

Override memory settings for specific agents:

memory:
  # Global defaults
  retrieval:
    limit: 5

agents:
  - id: "researcher"
    memory:
      retrieval:
        limit: 20  # This agent gets more context

  - id: "writer"
    memory:
      retrieval:
        limit: 3   # This agent gets less
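
Conceptually, a per-agent override behaves like a recursive merge of the agent's memory block onto the global defaults. How MassGen merges these internally is not specified here, so the following is only a sketch of that idea with a hypothetical function name.

```python
import copy
from typing import Any, Dict


def merge_memory_config(defaults: Dict[str, Any],
                        override: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults with override applied recursively; inputs are not mutated."""
    merged = copy.deepcopy(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections (e.g. retrieval) merge key by key.
            merged[key] = merge_memory_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


global_memory = {"retrieval": {"limit": 5, "exclude_recent": True}}
researcher = merge_memory_config(global_memory, {"retrieval": {"limit": 20}})
print(researcher)  # {'retrieval': {'limit': 20, 'exclude_recent': True}}
```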

Different Embedding Providers#

Using Together AI (cost-effective):

persistent_memory:
  embedding:
    provider: "together"
    model: "togethercomputer/m2-bert-80M-8k-retrieval"
    # Reads TOGETHER_API_KEY from environment

Using Azure OpenAI:

persistent_memory:
  llm:
    provider: "azure_openai"
    model: "gpt-4o-mini"
    api_key: "${AZURE_OPENAI_API_KEY}"
  embedding:
    provider: "azure_openai"
    model: "text-embedding-ada-002"

Session Continuation#

Continue a previous session:

persistent_memory:
  session_name: "codebase_analysis_oct2025"

All agents will access memories from this session across multiple CLI runs.

Cross-session knowledge:

persistent_memory:
  session_name: null  # Search across ALL sessions

Useful for:

  • Building knowledge base across projects

  • Learning from past conversations

  • Avoiding repeating analysis

Troubleshooting#

Common Issues#

Qdrant Connection Error

⚠️  Failed to create shared Qdrant client: Storage folder .massgen/qdrant
is already accessed by another instance

Solution:

  1. Check if Qdrant server is running:

    docker-compose -f docker-compose.qdrant.yml ps
    
  2. Remove stale lock files:

    ./scripts/cleanup_qdrant_lock.sh
    # Or manually:
    rm .massgen/qdrant/.lock
    
  3. Use server mode for multi-agent:

    qdrant:
      mode: "server"
    

API Key Not Found

⚠️  OPENAI_API_KEY not found in environment - embedding will fail!

Solution:

Create .env file in project root:

OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...  # If using Anthropic

No Memories Retrieved

🔄 Retrieving memories after reset...
ℹ️  No relevant memories found

This is normal if:

  • First turn (no memories yet)

  • Query doesn’t match stored memories semantically

  • mem0 hasn’t processed messages yet (async extraction)

Check:

  1. Verify recording succeeded: Look for mem0 extracted X fact(s) in logs

  2. Browse Qdrant collections: http://localhost:6333/dashboard

  3. Check debug files: .massgen/.../memory_debug/*.json

0 Facts Extracted

✅ mem0 extracted 0 fact(s), 0 relation(s)
⚠️  mem0 extracted 0 facts (check fact extraction prompt or content quality)

Common causes:

  1. Content too short: Less than 10 chars or empty messages

  2. Weak extraction model: gpt-4o-mini may fail on complex content

  3. Generic content: No extractable facts (e.g., voting messages)

  4. JSON parsing error: Model hit token limit mid-response

Solutions:

  1. Use stronger model: Change llm.model to "gpt-4o"

  2. Enable debug mode: --debug to inspect messages_sent

  3. Check content length in logs: Combined content length: X chars

  4. Enable record_all_tool_calls: true to provide more context

PointStruct Validation Errors

Error: 6 validation errors for PointStruct
vector.list[float] Input should be a valid list [type=list_type, input_value=None]

Cause: Embedding API returned None instead of valid vector

Common reasons:

  1. Empty content: Message with no text sent to embedding API

  2. API failure: Rate limit, timeout, or invalid API key

  3. Malformed input: Special characters or encoding issues

Solution: This is now automatically prevented by content validation (messages < 10 chars filtered out). If still occurring, check API key and embedding provider status.
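
The content validation mentioned above can be pictured as a pre-filter applied before anything reaches the embedding API. This is a sketch, not MassGen's code: `filter_recordable` is a hypothetical name, and stripping whitespace before measuring length is an assumption.

```python
from typing import Any, Dict, List

MIN_CONTENT_CHARS = 10  # threshold stated in this section


def filter_recordable(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop messages whose content is missing, non-text, or shorter than the threshold."""
    return [
        m for m in messages
        if isinstance(m.get("content"), str)
        and len(m["content"].strip()) >= MIN_CONTENT_CHARS
    ]
```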

JSON Parsing Errors from mem0

Invalid JSON response: Unterminated string starting at: line 108 column 7

Cause: mem0’s extraction LLM hit its token limit mid-response and did not close the JSON string

Solution: Use stronger extraction model (gpt-4o) or reduce content length

Cleaning Up#

Stop Qdrant:

docker-compose -f docker-compose.qdrant.yml down

Clear all memories:

# Remove Qdrant storage (WARNING: deletes all memories!)
rm -rf .massgen/qdrant_storage

Clear session data:

# Remove specific session
rm -rf .massgen/memory_test_sessions/session_20251028_143000

# Or all sessions
rm -rf .massgen/memory_test_sessions

Design Decisions#

Why These Architecture Choices?

Why mem0’s Native LLMs/Embedders?#

Decision: Use mem0’s built-in providers (OpenAI, Anthropic, etc.) instead of wrapping MassGen backends

Rationale:

  • Simpler: No adapter layer, direct integration

  • No async issues: mem0’s adapters are sync, wrapping async MassGen backends caused event loop conflicts

  • Optimized: mem0’s default (gpt-4.1-nano) is optimized for memory operations

  • Flexible: Support for many providers without custom code

Trade-off: Requires separate API keys (can’t reuse agent’s backend). But memory operations are cheap (~1-2 cents/session).

Why MCP Tools Are Optional in Memory (v0.1.9+)#

Default: MCP tool calls (read_file, list_directory, etc.) are not recorded

Rationale:

  1. Implementation details: HOW the work was done, not WHAT was learned

  2. Redundant: The final answer usually captures insights from reading those files

  3. Noise: 50+ file reads can overwhelm mem0’s extraction, making it harder to extract semantic facts

  4. Focus on outcomes: Agent’s conclusions more valuable than execution trace

  5. Token efficiency: Keeps memory concise and focused

Example (default mode):

Recorded to memory:
✅ Final answer: "The backend uses an adapter pattern in base.py that enables provider abstraction"

Not recorded:
❌ [Tool: read_file] path=/foo/base.py
❌ [Tool: read_file] path=/foo/openai.py
❌ [Tool: read_file] path=/foo/claude.py

When to Enable (record_all_tool_calls: true):

  • Learning tool patterns: Understand which tool sequences work best

  • Debugging: See exactly what agent explored

  • Pattern analysis: Extract insights like “directory_tree before read_file is more effective”

  • Development: Maximum observability during testing

Example (all tools mode):

Recorded to memory:
✅ [Tool Call: mcp__filesystem__directory_tree]
   Arguments: {"path": "/massgen"}
   Result: [50+ files and directories...]
✅ [Tool Call: mcp__filesystem__read_text_file]
   Arguments: {"path": "/massgen/base.py"}
   Result: [full file contents...]
✅ Final answer: "The backend uses an adapter pattern..."

mem0’s LLM can then extract: “For analyzing codebases, using directory_tree first followed by reading key files provides systematic understanding”

If you just need execution history (not learning patterns): Check orchestrator logs or agent workspace snapshots instead.

Why Record Reasoning?#

Decision: Include full reasoning chains and summaries in memory

Rationale:

  • Context for decisions: A final answer is harder to interpret without the reasoning behind it

  • Better fact extraction: mem0’s LLM can extract richer facts from reasoning

  • Debugging: Understand WHY agent made certain choices

  • Learning: Future turns benefit from understanding past reasoning

Example memory facts extracted:

  • Without reasoning: “Agent said backend uses adapters”

  • With reasoning: “Agent analyzed base.py first, then compared 5 implementations, concluded adapters enable provider abstraction”

Why Filter System Messages?#

Decision: Exclude role: "system" messages from memory

Rationale:

  • Orchestrator noise: System messages contain coordination prompts like “You are evaluating answers from multiple agents…”

  • Not conversation content: System prompts are framework instructions, not user/agent dialogue

  • Bloat: Can be 5-10KB per message, mostly boilerplate

  • Focus on semantics: User questions and agent answers are what matter for memory
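A minimal sketch of this filter, assuming messages are plain dictionaries with `role` and `content` keys. The `drop_system_messages` helper is a hypothetical name for illustration and is not part of the MassGen API.

```python
from typing import Dict, List


def drop_system_messages(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep user/assistant dialogue and exclude framework system prompts."""
    return [m for m in messages if m.get("role") != "system"]


messages = [
    {"role": "system", "content": "You are evaluating answers from multiple agents..."},
    {"role": "user", "content": "How does the backend abstract providers?"},
    {"role": "assistant", "content": "The backend uses an adapter pattern in base.py."},
]

recordable = drop_system_messages(messages)
```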

Why Smart Retrieval (exclude_recent)?#

Decision: Default to exclude_recent: true, so retrieval runs only after compression (or on restart)

Rationale:

  • Before compression: All context already in conversation_memory sent to LLM

  • Retrieval would duplicate: Waste tokens on information already present

  • After compression: Old messages removed, retrieval fills the gap

  • On restart: Always retrieve to restore context

Token efficiency:

  • Without exclude_recent: ~500 extra tokens per turn (duplicated context)

  • With exclude_recent: ~100 tokens only when needed (after compression)
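The rationale above amounts to a small decision rule. The sketch below is an assumption about how that rule could be expressed, not the actual MassGen code: `should_retrieve` and its `compressed` and `restarted` flags are hypothetical names.

```python
def should_retrieve(exclude_recent: bool, compressed: bool, restarted: bool) -> bool:
    """Decide whether to query persistent memory on this turn (illustrative)."""
    if not exclude_recent:
        # Retrieval runs every turn, even if it duplicates live context.
        return True
    # With exclude_recent, retrieve only when context has actually been lost:
    # after compression removed old messages, or after a restart.
    return compressed or restarted
```

With `exclude_recent=True`, a turn with no compression and no restart skips retrieval entirely, which is where the token savings come from.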

Context Compression Thresholds#

Decision: Default 75% trigger, 40% target

Rationale:

  • 75% trigger: Provides buffer before hitting limit (avoid truncation)

  • 40% target: Balances context retention vs. token budget

  • Room for retrieval: Retrieved facts + recent context fit comfortably

  • Headroom for response: LLM has space to generate long responses

Alternative configurations:

  • Long analysis tasks: Lower the trigger threshold (e.g. 50%) so compression starts earlier

  • Short conversations: Raise the trigger threshold (e.g. 90%) so compression rarely runs

Why Qdrant Server for Multi-Agent?#

Decision: Require Qdrant server mode (Docker) for multi-agent setups

Rationale:

  • Concurrent access: File-based Qdrant locks its storage on first access, so other agents cannot open it concurrently

  • Performance: Server mode handles parallel searches better

  • Robustness: No stale lock files from crashed processes

  • Scalability: Can scale to many agents

Trade-off: Requires Docker, but setup is one command: docker-compose -f docker-compose.qdrant.yml up -d

Why Separate Memories Per Agent?#

Decision: Each agent has isolated memories, filtered by agent_id

Rationale:

  • Specialization: Different agents can build different knowledge bases

  • Controlled sharing: Only share via turn-aware winner mechanism

  • Scalability: Single Qdrant database, filtered by metadata

  • Privacy: Agent-specific knowledge stays private until winning

Alternative considered: Shared memory pool for all agents. Rejected because:

  • Information overload: Agent sees irrelevant memories from other agents

  • Loss of specialization: Can’t maintain agent-specific expertise

  • Temporal issues: Agent sees work-in-progress from concurrent agents

Why Turn-Aware Memory Filtering?#

Decision: Filter previous winners’ memories by turn metadata, e.g. {"turn": 1}

Rationale:

Prevents temporal leakage:

Turn 2 (concurrent):
- agent_a working... (incomplete)
- agent_b working... (incomplete)

Without filtering:
- agent_a could see agent_b's Turn 2 work-in-progress ❌
- Leads to confusion, inconsistent state

With filtering:
- agent_a only sees agent_b's Turn 1 (complete, winner) ✅
- Clean separation of concurrent work

Implementation: Memories tagged with {"turn": N} on recording, filtered on retrieval.
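The tagging and filtering can be sketched as follows. This is a simplified in-memory illustration under assumptions, not the mem0 or Qdrant query MassGen issues: the record layout, the `previous_winners` mapping (turn number to winning agent), and `visible_memories` are hypothetical.

```python
from typing import Any, Dict, List


def visible_memories(
    memories: List[Dict[str, Any]],
    agent_id: str,
    current_turn: int,
    previous_winners: Dict[int, str],
) -> List[Dict[str, Any]]:
    """Return an agent's own memories plus winners' memories from earlier turns."""
    visible = []
    for memory in memories:
        meta = memory["metadata"]
        is_own = meta["agent_id"] == agent_id
        # Other agents' memories are visible only if they come from a
        # completed earlier turn that the other agent won.
        is_past_winner = (
            meta["turn"] < current_turn
            and previous_winners.get(meta["turn"]) == meta["agent_id"]
        )
        if is_own or is_past_winner:
            visible.append(memory)
    return visible


memories = [
    {"text": "a: turn 1 notes", "metadata": {"agent_id": "agent_a", "turn": 1}},
    {"text": "b: turn 1 answer", "metadata": {"agent_id": "agent_b", "turn": 1}},
    {"text": "b: turn 2 draft", "metadata": {"agent_id": "agent_b", "turn": 2}},
    {"text": "a: turn 2 draft", "metadata": {"agent_id": "agent_a", "turn": 2}},
]

# agent_b won turn 1; both agents are now working concurrently on turn 2.
seen_by_a = visible_memories(memories, "agent_a", current_turn=2, previous_winners={1: "agent_b"})
seen_texts = [m["text"] for m in seen_by_a]
```

Here agent_a sees its own memories and agent_b's winning Turn 1 answer, but not agent_b's in-progress Turn 2 draft.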

API Reference#

For programmatic usage, see the memory module docstrings:

  • massgen.memory.PersistentMemory - Persistent memory API

  • massgen.memory.ConversationMemory - Conversation memory API

  • massgen.memory._context_monitor - Context monitoring utilities

    • log_context_usage_from_buffer(buffer, turn_number) - Buffer-based tracking (recommended)

    • log_context_usage_from_tokens(tokens, turn_number) - Official API token counts

    • log_context_usage(messages, turn_number) - Legacy message-based tracking

  • massgen.conversation_buffer.AgentConversationBuffer - Conversation buffer

    • estimate_tokens(calculator) - Get total token count including pending content

    • get_token_stats(calculator) - Get breakdown by entry type (user, assistant, tool_call, etc.)

Examples#

See complete examples in:

  • massgen/configs/memory/gpt5mini_gemini_context_window_management.yaml

  • massgen/configs/memory/gpt5mini_high_reasoning_gemini.yaml

Future Improvements#

Note

The memory system is production-ready but has several planned enhancements.

Planned Features#

1. Proactive Streaming Interruption (Partially Implemented)

Implemented: Buffer-based token tracking captures ALL content during streaming:

[Agent streaming response...]
→ [Buffer tracks: tool calls, results, reasoning, content]
→ [Pre-processing: 📊 Context Buffer: 45K / 128K tokens [buffer]]
→ [Post-processing: 📊 Context Window: 45K / 128K tokens [API actual]]
→ [Compress if needed]

Remaining: Proactive interruption when approaching budget

Planned: Inject warning to agent mid-stream when approaching limit

[Agent streaming...]
→ [Buffer counter: 95K / 128K budget]
→ [Agent sees: "⚠️ Approaching token limit, wrap up"]
→ [Agent concludes early]

2. Memory Analytics Dashboard

Planned: Visualize memory quality and tool usage patterns

Memory Analytics Dashboard
===========================

Facts Extracted: 245 (last 7 days)
Tool Patterns Learned: 12

Top Tool Sequences:
1. directory_tree → read_file → grep (85% success)
2. list_directory → read_file (92% success)

Fact Quality:
- Actionable: 78%
- Generic: 15%
- Redundant: 7%

3. Smart Tool Result Summarization

Planned: Automatically summarize large MCP tool results before recording

memory:
  recording:
    record_all_tool_calls: true
    summarize_large_results: true  # Auto-summarize results > 5KB
    summary_model: "gpt-4o-mini"   # Model for summarization

Benefit: Capture tool usage patterns without overwhelming mem0’s extraction LLM with 50KB directory trees

4. Memory Summarization on Compression (Implemented)

Compression now generates a comprehensive summary of all work done:

Compression Flow:
1. Context limit error detected
2. Generate summary of buffer content (tool calls, results, analysis)
3. Rebuild context: [system] → [user request] → [summary]
4. Summary placed LAST so model continues from it (not restart)

The user→summary ordering prevents models from re-reading files or redoing analysis that was already completed before compression.
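The rebuild step can be sketched as below. This is a minimal illustration, assuming chat-style message dictionaries. The `rebuild_context` helper is hypothetical, and giving the summary the assistant role is an assumption made for the example, since the flow above only fixes the ordering.

```python
from typing import Dict, List


def rebuild_context(system_prompt: str, user_request: str, summary: str) -> List[Dict[str, str]]:
    """Rebuild a compressed context as [system] -> [user request] -> [summary]."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
        # The summary goes last so the model continues from it
        # instead of restarting from the user request.
        {"role": "assistant", "content": summary},
    ]


context = rebuild_context(
    system_prompt="You are a coding agent.",
    user_request="Explain how the backend abstracts providers.",
    summary="Summary so far: read base.py and compared implementations; adapters enable provider abstraction.",
)
```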

Known Limitations#

Token Counting During Streaming (Improved in v0.1.25+)

Buffer-based tracking now provides context estimates during streaming:

  • Before processing: Buffer estimation shows current context size

  • After response: Official API counts used when available

  • Accurate tracking: Includes tool calls, results, injections, reasoning

  • ❌ Can’t stop mid-response if too large (proactive interruption planned)

  • ❌ No real-time budget warnings to agent yet

  • ❌ Some APIs do not return reasoning content, so the buffer estimate can be inaccurate

Workaround: Set conservative compression thresholds (50-60%) to leave headroom.

Extraction Quality Depends on Model

The quality of extracted facts varies significantly by model:

  • gpt-4.1-nano / gpt-4o-mini: Fast, cheap, but may produce generic facts or JSON parsing errors on complex content

  • gpt-4o / gpt-4-turbo: Slower, more expensive, but extracts specific, actionable insights

Recommendation: Use gpt-4o-mini for development, gpt-4o for production if fact quality matters.

MCP Tools Recording is Opt-In

By default, MCP tool calls (e.g. read_file, list_directory) are excluded from memory to keep it concise.

To enable: Set memory.recording.record_all_tool_calls: true

Trade-off: More data for pattern learning vs. potential information overload for mem0’s extraction LLM.

Session-Level Memory Isolation

Memories are isolated per session. To access knowledge from previous sessions, either:

  • Set session_name: null (search all sessions)

  • Explicitly continue a session with session_name: "my_session"

Local Qdrant Single-Agent Only

File-based Qdrant (mode: "local") does NOT support concurrent access.

For multi-agent: Always use mode: "server" with Docker.

Next Steps#