Memory and Context Management#
MassGen’s memory system enables agents to maintain knowledge across conversations, handle long context windows gracefully, and share insights across multi-turn sessions. The system automatically manages context compression, semantic memory retrieval, and cross-agent knowledge sharing.
Overview#
The memory system consists of two complementary components:
- ConversationMemory (Short-term)
Fast in-memory storage for recent messages. Maintains verbatim conversation history for the current context window.
- PersistentMemory (Long-term)
Vector database storage (via mem0) with semantic search. Extracts and stores key facts that persist across sessions and can be retrieved when relevant.
Key Features#
Automatic Context Compression: When approaching token limits, old messages are removed while remaining accessible via semantic search
Semantic Retrieval: Retrieve relevant facts from past conversations based on current context
Cross-Agent Memory Sharing: Agents access previous winning agents’ knowledge from past turns
Session Management: Memories isolated by session for clean separation of different tasks
Turn-Aware Filtering: Prevents temporal leakage by filtering memories by turn number
Quick Start#
Prerequisites#
For multi-agent setups, start the Qdrant vector database server:
# Start Qdrant (required for persistent memory)
docker-compose -f docker-compose.qdrant.yml up -d
# Verify it's running
curl http://localhost:6333/health
# (Optional) View Qdrant dashboard
open http://localhost:6333/dashboard
Basic Configuration#
Add memory configuration to your YAML config:
memory:
enabled: true
conversation_memory:
enabled: true # Short-term tracking
persistent_memory:
enabled: true # Long-term storage
# LLM for fact extraction (uses mem0's native providers)
llm:
provider: "openai"
model: "gpt-4.1-nano-2025-04-14"
# Embeddings for vector search
embedding:
provider: "openai"
model: "text-embedding-3-small"
# Qdrant configuration
qdrant:
mode: "server" # Use "local" for single-agent only
host: "localhost"
port: 6333
# Context compression settings
compression:
trigger_threshold: 0.75 # Compress at 75% usage
target_ratio: 0.40 # Keep 40% after compression
# Retrieval settings
retrieval:
limit: 5 # Facts to retrieve
exclude_recent: true # Only retrieve after compression
# Recording settings (v0.1.9+)
recording:
record_all_tool_calls: false # Set true to capture ALL MCP tools
record_reasoning: false # Set true to capture thinking separately
Run with Memory#
# Interactive mode with memory
massgen --config @examples/memory/gpt5mini_gemini_context_window_management.yaml
# Single question with memory
massgen \
--config @examples/memory/gpt5mini_gemini_context_window_management.yaml \
"Analyze the MassGen codebase and create an architecture document"
How It Works#
Custom Fact Extraction#
MassGen uses custom prompts designed to extract high-quality, domain-focused memories. The goal is to filter facts to be:
- Self-Contained and Specific:
Facts should be understandable 6 months later without the original conversation
- Focused on Domain Knowledge:
✅ Concrete data points with context (“OpenAI revenue reached $12B annualized”)
✅ Insights with explanations (“Narrative depth valued in creative writing because…”)
✅ Capabilities with use cases (“MassGen v0.1.1 supports Python tools via YAML”)
✅ Domain expertise with details (“Binet’s formula uses golden ratio phi=(1+√5)/2”)
✅ Specific recommendations with WHAT, WHEN, WHY
- Tool Usage Patterns (v0.1.9+):
✅ Tool sequences that work (“For code analysis, directory_tree → read_file → grep provides systematic understanding”)
✅ Problem-solving approaches (“Breaking large tasks into focused searches yields better results than broad queries”)
✅ What worked/failed with reasoning (“Sequential exploration prevents getting lost in implementation details”)
- Excluded for Quality:
❌ Agent comparisons (“Agent 1’s response is better”)
❌ Voting details (“The reason for voting…”)
❌ Meta-instructions (“Response should start with…”)
❌ Generic advice without specifics (“Providing templates improves docs”)
❌ Usage statistics without insight (“Used grep 5 times”)
Implementation: massgen/memory/_fact_extraction_prompts.py::MASSGEN_UNIVERSAL_FACT_EXTRACTION_PROMPT
Memory Flow#
Every Turn:
User message added to conversation_memory (verbatim)
Agent responds with reasoning and answer
Response recorded to:
ConversationMemory: Full message for immediate context
PersistentMemory: mem0’s LLM extracts key facts and stores in vector DB
Context window checked:
Below threshold: Continue normally
Above threshold: Compress old messages, enable retrieval
What Gets Recorded (Default):
✅ User messages
✅ Final answer text (accumulated from content chunks)
✅ Workflow tools (new_answer, vote) with full arguments
❌ System messages (orchestrator prompts - filtered out)
❌ MCP tool calls (unless record_all_tool_calls: true)
❌ Reasoning chunks (unless record_reasoning: true)
Configurable Recording (v0.1.9+):
You can now control what gets recorded to memory via YAML configuration:
memory:
recording:
record_all_tool_calls: false # Set to true to capture ALL MCP tools
record_reasoning: false # Set to true to capture thinking separately
See Recording Settings (v0.1.9+) below for details.
Context Compression#
MassGen uses reactive compression for context window management. This is due to a fundamental limitation of most LLM APIs.
Why Reactive?
Most LLM providers (OpenAI, Anthropic, Google) only report token usage after a request completes. There is no mid-stream token counting or pre-flight validation API. This means MassGen cannot proactively prevent context overflow—it can only react when the provider returns a context length error.
How It Works
MassGen sends the conversation to the LLM
If the context is too long, the provider returns an error
MassGen catches the error and generates a summary of the work done so far
The summarized conversation is retried automatically (single retry to prevent loops)
After compression, the message structure looks like:
Before Error:
[system] → [user 1] → [assistant 1] → ... → [user 20] → [assistant 20] ← ERROR
After Compression:
[system] → [user request] → [summary as assistant message]
↑ ↑ ↑
System User's original Summary of ALL work done so far
preserved request (most recent context - model continues from here)
Key Design: User → Summary Ordering
The summary is placed after the user message as an assistant message. This ordering is critical for preventing redundant work:
The model sees its own summary as the most recent context
It naturally continues from the summary rather than starting fresh
File reads, analysis, and other completed work are preserved in the summary
What Gets Summarized
The compression system captures everything in the streaming buffer, including:
Tool calls and their results (file reads, directory listings, etc.)
Reasoning and analysis performed
Partial answers and work in progress
Any content that was streaming when the context limit was hit
This ensures the model doesn’t re-read files or redo analysis after compression.
Configuration
coordination:
compression_target_ratio: 0.20 # Preserve 20% of messages, summarize 80%
The compression_target_ratio controls how aggressively to compress when the
context limit is exceeded:
0.20 (default): Preserve ~20% of messages verbatim, summarize the rest
0.30: More conservative, preserve ~30% of messages
0.10: More aggressive, preserve only ~10% of messages
Note
Compression is reactive - it only triggers when the provider returns a context length error. MassGen cannot predict when context will exceed the limit because token counts are only available after each LLM call completes.
Best Practices
For very long tasks, consider breaking into multiple sessions
Use
clear_history=Truewhen starting unrelated topicsCritical information should be in recent messages or system prompt
Lower
compression_target_ratiofor more aggressive compression (preserves less)
Future Improvements
Some providers may add better token tracking in the future:
Pre-flight token counting APIs
Streaming token usage updates
Local models with tiktoken-based estimation
Memory Retrieval#
Retrieval happens when:
✅ After compression: Retrieve facts from compressed messages
✅ On restart/reset: Restore recent context
❌ Before compression: Skip (all context already in conversation_memory)
Retrieval process:
Search own agent’s memories (all turns, current session)
Search previous winners’ memories (filtered by turn - see below)
Format and inject as system message before processing
Retrieved memories injected as:
┌─────────────────────────────────────┐
│ Relevant memories: │
│ • User asked about backend system │
│ • Agent analyzed 5 backend files │
│ • [From agent_b Turn 1] Explained │
│ stateful vs stateless backends │
└─────────────────────────────────────┘
↓
[user msg 15] → [agent response 15] → ...
Use Cases#
Scenario 1: Long Analysis Tasks#
Use case: Analyzing a large codebase that requires reading 50+ files
- Without memory:
Context fills up after ~15 files, agent loses track of earlier analysis
- With memory:
Agent reads files 1-15, context compresses
Files 16-30: Agent retrieves relevant facts from 1-15
Maintains complete understanding throughout analysis
Configuration:
memory:
enabled: true
compression:
trigger_threshold: 0.75 # Compress when 75% full
target_ratio: 0.40 # Keep 40% of recent context
Example:
massgen --config @examples/memory/gpt5mini_gemini_context_window_management.yaml \
"Analyze the entire MassGen codebase and create comprehensive documentation"
Scenario 2: Multi-Turn Sessions#
Use case: Interactive development across multiple sessions
- Without memory:
Each turn starts fresh, agents forget previous turns’ insights
- With memory:
Turn 1: Agent A wins, explains backend architecture
Turn 2: Agent B retrieves Agent A’s Turn 1 insights
Turn 3: Agent A sees both own past work + Agent B’s Turn 2 insights
How winner memory sharing works:
Turn 1: agent_a wins → Memories tagged {"agent_id": "agent_a", "turn": 1}
Turn 2:
agent_b retrieves:
✅ Own memories (all turns)
✅ agent_a's Turn 1 memories (previous winner)
❌ agent_a's Turn 2 memories (not yet complete)
Turn 3:
agent_a retrieves:
✅ Own memories (Turns 1, 2)
✅ agent_b's Turn 2 memories (previous winner)
Configuration:
Session ID automatically generated for interactive mode: session_20251028_143000
Memories are isolated per session unless you specify a custom session name.
Scenario 3: Orchestrator Restarts#
Use case: Agent needs to restart due to errors or new answers from other agents
- Without memory:
Partial work lost, agent starts from scratch
- With memory:
Before restart: Current conversation recorded to persistent_memory
On restart: Relevant facts retrieved to restore context
Agent continues seamlessly with knowledge of prior attempts
Example flow:
Agent A working on task...
📝 Read 5 files, analyzed architecture
🔄 Other agent submits better answer → Restart triggered
💾 Recording 10 messages before reset
🔄 Retrieving memories after reset...
💭 Retrieved: "Analyzed backend/base.py", "Found adapter pattern", ...
✅ Agent continues with restored context
Configuration Reference#
Complete Configuration#
memory:
# Global enable/disable
enabled: true
# Short-term conversation tracking
conversation_memory:
enabled: true
# Long-term knowledge storage
persistent_memory:
enabled: true
on_disk: true # Persist across restarts
# Session isolation (optional)
# session_name: "my_project_analysis" # Specific session
# session_name: null # Cross-session memory
# LLM for fact extraction
llm:
provider: "openai"
model: "gpt-4.1-nano-2025-04-14" # Fast, cheap for memory ops
# api_key: "sk-..." # Optional - reads from OPENAI_API_KEY env var
# Embeddings for vector search
embedding:
provider: "openai"
model: "text-embedding-3-small"
# api_key: "sk-..." # Optional - reads from OPENAI_API_KEY env var
# Vector store (Qdrant)
qdrant:
mode: "server" # "server" or "local"
host: "localhost" # Server mode only
port: 6333 # Server mode only
# path: ".massgen/qdrant" # Local mode only
# Context window compression
compression:
trigger_threshold: 0.75 # Compress at 75% context usage
target_ratio: 0.40 # Target 40% after compression
# Memory retrieval
retrieval:
limit: 5 # Max facts per agent
exclude_recent: true # Skip retrieval before compression
# Memory recording (v0.1.9+)
recording:
record_all_tool_calls: false # Record ALL MCP tools (not just workflow)
record_reasoning: false # Record reasoning chunks separately
Configuration Options#
Memory Toggle#
memory:
enabled: false # Disable entire memory system
Conversation Memory#
conversation_memory:
enabled: true # Almost always true - needed for context management
Persistent Memory#
LLM Configuration (for fact extraction):
Provider |
Configuration |
|---|---|
OpenAI |
|
Anthropic |
|
Groq |
|
Embedding Configuration (for vector search):
Provider |
Configuration |
|---|---|
OpenAI |
|
Together |
|
Azure OpenAI |
|
Qdrant Configuration:
# Server mode (RECOMMENDED for multi-agent)
qdrant:
mode: "server"
host: "localhost"
port: 6333
# Local mode (single agent only)
qdrant:
mode: "local"
path: ".massgen/qdrant"
Warning
Local file-based Qdrant does NOT support concurrent access. For multi-agent setups, always use server mode.
Session Management#
Automatic sessions:
All sessions are automatically created and tracked in the registry:
Interactive mode:
session_20251028_143000(shared across all turns in that session)Single question:
session_20251028_143001(each run gets its own tracked session)
Custom sessions:
persistent_memory:
session_name: "my_project_analysis" # Continue specific session
Cross-session memory (search across all sessions):
persistent_memory:
session_name: null # or omit the field
Loading Previous Sessions#
MassGen automatically tracks all memory sessions in a registry (~/.massgen/sessions.json). You can list and load previous sessions to continue conversations with their memory context intact.
List available sessions:
massgen --list-sessions
Example output:
Available Memory Sessions:
============================================================
Session ID: session_20251028_143000
Status: completed
Started: 2025-10-28 14:30:00
Model: gpt-4o-mini
Config: memory_config.yaml
Session ID: session_20251027_091500
Status: completed
Started: 2025-10-27 09:15:00
Model: gpt-4o
Description: Codebase analysis project
Config: research_config.yaml
============================================================
To load a session, use: massgen --session-id <SESSION_ID> "Your question"
Load session via CLI:
# Continue previous session
massgen --session-id session_20251028_143000 "What did we discuss about the backend?"
# Interactive mode with previous session
massgen --session-id session_20251028_143000 --config my_config.yaml
Load session via YAML config:
# Add to your config file
session_id: "session_20251028_143000"
memory:
enabled: true
persistent_memory:
enabled: true
# ... rest of memory config
Priority order: CLI argument (--session-id) > YAML config (session_id:) > Auto-generated
Benefits:
Continue conversations across multiple CLI runs
Access memory from previous analysis sessions
Build on previous agents’ knowledge without re-analysis
Maintain context for long-running research projects
Note: All sessions (both interactive and single-question modes) are tracked in the registry and can be continued later
Compression Settings#
compression:
trigger_threshold: 0.75 # Not reliably enforceable - see note below
target_ratio: 0.20 # Preserve 20% of messages after compression
Note
Reactive Compression Limitation: The trigger_threshold cannot be proactively
enforced because token counts are only available after each LLM call completes. MassGen
uses reactive compression—catching context length errors from the provider and
summarizing automatically. Only target_ratio is reliably enforced.
Example configurations:
Aggressive compression:
target_ratio: 0.10(preserve only 10%)Moderate (default):
target_ratio: 0.20(preserve 20%)Conservative:
target_ratio: 0.40(preserve 40%)
Retrieval Settings#
retrieval:
limit: 5 # Max facts per agent (default: 5)
exclude_recent: true # Smart retrieval (default: true)
More context: Increase
limitto 10-20 (uses more tokens)Always retrieve: Set
exclude_recent: false(may duplicate recent context)
Recording Settings (v0.1.9+)#
New in v0.1.9: Control what gets recorded to memory for better observability and learning.
memory:
recording:
record_all_tool_calls: false # Record ALL MCP tools (not just workflow)
record_reasoning: false # Record reasoning chunks separately
record_all_tool_calls (default: false):
false:Only workflow tools (
new_answer,vote) are recordedtrue:ALL MCP tools are captured (
list_directory,read_file,write_file, etc.)
When to enable: - Learning tool usage patterns across sessions - Debugging which tools agents use most - Understanding tool sequences (e.g., “directory_tree → read_file → grep”) - Maximum observability during development
Example with ALL tools enabled:
[Tool Usage]
[Tool Call: mcp__filesystem__directory_tree]
Arguments: {"path": "/Users/.../massgen"}
Result: [directory structure with 50+ files...]
[Tool Call: mcp__filesystem__read_text_file]
Arguments: {"path": ".../orchestrator.py"}
Result: [full file contents...]
[Tool Call: new_answer]
Arguments: {"content": "Architecture analysis complete..."}
Result: Answer submitted
record_reasoning (default: false):
false:Reasoning mixed with final answer in main response
true:Reasoning chunks saved separately with
[Reasoning]prefix
When to enable: - Debugging agent decision-making - Learning problem-solving approaches - Capturing strategic thinking separate from final output
Example with reasoning enabled:
[Reasoning]
I should analyze the file structure first before diving into specific implementations.
This will help me build a mental model of the codebase organization.
[Reasoning Summary]
Decided to use directory_tree followed by selective file reads for systematic analysis.
Final answer: The codebase follows a modular architecture...
Performance Impact:
With both disabled (default): ~1-2 KB per recording, concise memory
With both enabled: ~10-50 KB per recording, maximum detail
mem0 extraction cost: Same LLM calls regardless (extracts from whatever is sent)
Recommendation: - Development: Enable both for debugging - Production: Keep disabled for concise, focused memory
Monitoring and Debugging#
Context Window Logs#
MassGen uses buffer-based context tracking to accurately monitor token usage. The conversation buffer captures ALL content including tool calls, tool results, injections, and reasoning—not just turn-level messages.
Token Tracking Priority:
Official API counts (at stream end): Most accurate for cost/pricing
Buffer estimation (fallback): Captures all content provided by API
Monitor context usage in real-time:
# Using official API token counts (most accurate)
📊 Context Window (Turn 5): 45,000 / 128,000 tokens (35%) [API actual]
# Using buffer estimation (fallback, assuming API provides all content)
📊 Context Buffer (Turn 5): 45,000 / 128,000 tokens (35%) [buffer]
When compression triggers:
⚠️ Context Buffer (Turn 11): 96,000 / 128,000 tokens (75%) [buffer] - Approaching limit!
🔄 Attempting compression (96,000 → 51,200 tokens)
📦 Context compressed: Removed 15 messages (44,800 tokens).
Kept 8 recent messages (51,200 tokens).
Why Buffer-Based Tracking?
The conversation buffer is the true source of context sent to agents. Unlike message-based tracking, it includes:
Tool calls and their arguments
Tool results (can be very large)
Injections from other agents
Pending content not yet flushed
Reasoning/thinking content (may not be available, depending on the API)
This provides accurate context usage even mid-stream, before official API counts are available.
Memory Operations#
Recording:
🔍 [_mem0_add] Recording to mem0 (agent=agent_a, session=session_123, turn=1)
messages: 2 message(s)
assistant: [Reasoning] I analyzed the backend files...
assistant: The backend system consists of...
✅ mem0 extracted 5 fact(s), 2 relation(s)
Retrieval:
🔄 Retrieving memories after reset for agent_a (restoring recent context + 1 winner(s))...
🔍 [retrieve] Searching memories (agent=agent_a, limit=5, winners=1)
Previous winners: [{'agent_id': 'agent_b', 'turn': 1}]
🔎 Searching own memories (agent_a)...
→ Found 3 memory/memories
🔎 Searching 1 previous winner(s)...
→ Searching agent_b (turn 1)...
Found 2 memory/memories
✅ Total: 5 memories retrieved
[1] User asked about MassGen architecture
[2] [From agent_b Turn 1] Explained the adapter pattern
Debug Files (v0.1.9+)#
New in v0.1.9: Memory debug mode saves complete message→fact mappings when using the --debug flag.
Enable debug mode:
massgen --debug --config your_config.yaml "Your question"
Debug files saved to:
.massgen/massgen_logs/log_{timestamp}/attempt_{N}/memory_debug/
└── {agent_id}/
├── turn_1_20251029_200335.json
├── turn_2_20251029_200438.json
└── turn_3_20251029_200557.json
File structure:
{
"timestamp": "2025-10-29T20:03:35.123456",
"agent_id": "test_agent",
"session_id": "temp_20251029_200122",
"turn": 1,
"metadata": {
"tools_used": ["mcp__filesystem__directory_tree", "read_text_file"],
"has_tools": true,
"message_count": 1
},
"messages_sent": [
{
"role": "assistant",
"content": "[Tool Usage]\n[Tool Call: directory_tree]\nArguments: {...}\nResult: ..."
}
],
"facts_extracted": [
{
"id": "abc123",
"memory": "For analyzing Python codebases, directory_tree → read_file sequence...",
"event": "ADD"
}
],
"extraction_count": 10
}
Use cases:
Verify tool capture: Check if MCP tools appear in
messages_sentTune prompts: Compare input vs. extracted facts to improve extraction quality
Debug 0 facts: See what content was sent when extraction fails
Monitor quality: Review if facts are actionable or generic
Testing Memory Setup#
Verify your memory configuration:
# Run test script
uv run python scripts/test_memory_setup.py
Expected output:
🧪 MEMORY SYSTEM TEST SUITE
============================================================
TEST 1: Environment Variables
============================================================
✅ OPENAI_API_KEY found (starts with: sk-proj...)
============================================================
TEST 2: OpenAI Embedding API
============================================================
✅ Embedding successful!
Vector dimensions: 1536
============================================================
TEST 3: mem0 LLM API (gpt-4.1-nano)
============================================================
✅ LLM call successful!
============================================================
TEST 4: Qdrant Connection
============================================================
✅ Qdrant server connected!
============================================================
TEST 5: Full Memory Integration
============================================================
✅ PersistentMemory created!
✅ Messages recorded!
Advanced Usage#
Per-Agent Memory Configuration#
Override memory settings for specific agents:
memory:
# Global defaults
retrieval:
limit: 5
agents:
- id: "researcher"
memory:
retrieval:
limit: 20 # This agent gets more context
- id: "writer"
memory:
retrieval:
limit: 3 # This agent gets less
Different Embedding Providers#
Using Together AI (cost-effective):
persistent_memory:
embedding:
provider: "together"
model: "togethercomputer/m2-bert-80M-8k-retrieval"
# Reads TOGETHER_API_KEY from environment
Using Azure OpenAI:
persistent_memory:
llm:
provider: "azure_openai"
model: "gpt-4o-mini"
api_key: "${AZURE_OPENAI_API_KEY}"
embedding:
provider: "azure_openai"
model: "text-embedding-ada-002"
Session Continuation#
Continue a previous session:
persistent_memory:
session_name: "codebase_analysis_oct2025"
All agents will access memories from this session across multiple CLI runs.
Cross-session knowledge:
persistent_memory:
session_name: null # Search across ALL sessions
Useful for: - Building knowledge base across projects - Learning from past conversations - Avoiding repeating analysis
Troubleshooting#
Common Issues#
Qdrant Connection Error
⚠️ Failed to create shared Qdrant client: Storage folder .massgen/qdrant
is already accessed by another instance
Solution:
Check if Qdrant server is running:
docker-compose -f docker-compose.qdrant.yml ps
Remove stale lock files:
./scripts/cleanup_qdrant_lock.sh # Or manually: rm .massgen/qdrant/.lock
Use server mode for multi-agent:
qdrant: mode: "server"
API Key Not Found
⚠️ OPENAI_API_KEY not found in environment - embedding will fail!
Solution:
Create .env file in project root:
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-... # If using Anthropic
No Memories Retrieved
🔄 Retrieving memories after reset...
ℹ️ No relevant memories found
This is normal if: - First turn (no memories yet) - Query doesn’t match stored memories semantically - mem0 hasn’t processed messages yet (async extraction)
Check:
1. Verify recording succeeded: Look for ✅ mem0 extracted X fact(s) in logs
2. Browse Qdrant collections: http://localhost:6333/dashboard
3. Check debug files: .massgen/.../memory_debug/*.json
0 Facts Extracted
✅ mem0 extracted 0 fact(s), 0 relation(s)
⚠️ mem0 extracted 0 facts (check fact extraction prompt or content quality)
Common causes: 1. Content too short: Less than 10 chars or empty messages 2. Weak extraction model: gpt-4o-mini may fail on complex content 3. Generic content: No extractable facts (e.g., voting messages) 4. JSON parsing error: Model hit token limit mid-response
Solutions:
1. Use stronger model: Change llm.model to "gpt-4o"
2. Enable debug mode: --debug to inspect messages_sent
3. Check content length in logs: Combined content length: X chars
4. Enable record_all_tool_calls: true to provide more context
PointStruct Validation Errors
Error: 6 validation errors for PointStruct
vector.list[float] Input should be a valid list [type=list_type, input_value=None]
Cause: Embedding API returned None instead of valid vector
Common reasons: 1. Empty content: Message with no text sent to embedding API 2. API failure: Rate limit, timeout, or invalid API key 3. Malformed input: Special characters or encoding issues
Solution: This is now automatically prevented by content validation (messages < 10 chars filtered out). If still occurring, check API key and embedding provider status.
JSON Parsing Errors from mem0
Invalid JSON response: Unterminated string starting at: line 108 column 7
Cause: mem0’s extraction LLM hit token limit mid-response, didn’t close JSON string
Solution: Use stronger extraction model (gpt-4o) or reduce content length
Cleaning Up#
Stop Qdrant:
docker-compose -f docker-compose.qdrant.yml down
Clear all memories:
# Remove Qdrant storage (WARNING: deletes all memories!)
rm -rf .massgen/qdrant_storage
Clear session data:
# Remove specific session
rm -rf .massgen/memory_test_sessions/session_20251028_143000
# Or all sessions
rm -rf .massgen/memory_test_sessions
Design Decisions#
Why These Architecture Choices? (Click to expand)
Why mem0’s Native LLMs/Embedders?#
Decision: Use mem0’s built-in providers (OpenAI, Anthropic, etc.) instead of wrapping MassGen backends
Rationale:
Simpler: No adapter layer, direct integration
No async issues: mem0’s adapters are sync, wrapping async MassGen backends caused event loop conflicts
Optimized: mem0’s default (gpt-4.1-nano) is optimized for memory operations
Flexible: Support for many providers without custom code
Trade-off: Requires separate API keys (can’t reuse agent’s backend). But memory operations are cheap (~1-2 cents/session).
Why MCP Tools Are Optional in Memory (v0.1.9+)#
Default: MCP tool calls (read_file, list_directory, etc.) are not recorded
Rationale:
Implementation details: HOW the work was done, not WHAT was learned
Redundant: The final answer usually captures insights from reading those files
Noise: 50+ file reads can overwhelm mem0’s extraction, making it harder to extract semantic facts
Focus on outcomes: Agent’s conclusions more valuable than execution trace
Token efficiency: Keeps memory concise and focused
Example (default mode):
Recorded to memory:
✅ Final answer: "The backend uses an adapter pattern in base.py that enables provider abstraction"
Not recorded:
❌ [Tool: read_file] path=/foo/base.py
❌ [Tool: read_file] path=/foo/openai.py
❌ [Tool: read_file] path=/foo/claude.py
When to Enable (record_all_tool_calls: true):
Learning tool patterns: Understand which tool sequences work best
Debugging: See exactly what agent explored
Pattern analysis: Extract insights like “directory_tree before read_file is more effective”
Development: Maximum observability during testing
Example (all tools mode):
Recorded to memory:
✅ [Tool Call: mcp__filesystem__directory_tree]
Arguments: {"path": "/massgen"}
Result: [50+ files and directories...]
✅ [Tool Call: mcp__filesystem__read_text_file]
Arguments: {"path": "/massgen/base.py"}
Result: [full file contents...]
✅ Final answer: "The backend uses an adapter pattern..."
mem0’s LLM can then extract: “For analyzing codebases, using directory_tree first followed by reading key files provides systematic understanding”
If you just need execution history (not learning patterns): Check orchestrator logs or agent workspace snapshots instead.
Why Record Reasoning?#
Decision: Include full reasoning chains and summaries in memory
Rationale:
Context for decisions: Final answer is meaningless without the reasoning
Better fact extraction: mem0’s LLM can extract richer facts from reasoning
Debugging: Understand WHY agent made certain choices
Learning: Future turns benefit from understanding past reasoning
Example memory facts extracted:
Without reasoning: “Agent said backend uses adapters”
With reasoning: “Agent analyzed base.py first, then compared 5 implementations, concluded adapters enable provider abstraction”
Why Filter System Messages?#
Decision: Exclude role: "system" messages from memory
Rationale:
Orchestrator noise: System messages contain coordination prompts like “You are evaluating answers from multiple agents…”
Not conversation content: System prompts are framework instructions, not user/agent dialogue
Bloat: Can be 5-10KB per message, mostly boilerplate
Focus on semantics: User questions and agent answers are what matter for memory
Why Smart Retrieval (exclude_recent)?#
Decision: Default exclude_recent: true - only retrieve after compression
Rationale:
Before compression: All context already in conversation_memory sent to LLM
Retrieval would duplicate: Waste tokens on information already present
After compression: Old messages removed, retrieval fills the gap
On restart: Always retrieve to restore context
Token efficiency:
Without exclude_recent: ~500 extra tokens per turn (duplicated context)
With exclude_recent: ~100 tokens only when needed (after compression)
Context Compression Thresholds#
Decision: Default 75% trigger, 40% target
Rationale:
75% trigger: Provides buffer before hitting limit (avoid truncation)
40% target: Balances context retention vs. token budget
Room for retrieval: Retrieved facts + recent context fit comfortably
Headroom for response: LLM has space to generate long responses
Alternative configurations:
Long analysis tasks: Lower threshold (50%) to compress more aggressively
Short conversations: Higher threshold (90%) to compress rarely
Why Qdrant Server for Multi-Agent?#
Decision: Require Qdrant server mode (Docker) for multi-agent setups
Rationale:
Concurrent access: File-based Qdrant locks on first access
Performance: Server mode handles parallel searches better
Robustness: No stale lock files from crashed processes
Scalability: Can scale to many agents
Trade-off: Requires Docker. But setup is one command: docker-compose up -d
Why Separate Memories Per Agent?#
Decision: Each agent has isolated memories, filtered by agent_id
Rationale:
Specialization: Different agents can build different knowledge bases
Controlled sharing: Only share via turn-aware winner mechanism
Scalability: Single Qdrant database, filtered by metadata
Privacy: Agent-specific knowledge stays private until winning
Alternative considered: Shared memory pool for all agents. Rejected because: - Information overload: Agent sees irrelevant memories from other agents - Loss of specialization: Can’t maintain agent-specific expertise - Temporal issues: Agent sees work-in-progress from concurrent agents
Why Turn-Aware Memory Filtering?#
Decision: Filter previous winners’ memories by {"turn": 1} metadata
Rationale:
Prevents temporal leakage:
Turn 2 (concurrent):
- agent_a working... (incomplete)
- agent_b working... (incomplete)
Without filtering:
- agent_a could see agent_b's Turn 2 work-in-progress ❌
- Leads to confusion, inconsistent state
With filtering:
- agent_a only sees agent_b's Turn 1 (complete, winner) ✅
- Clean separation of concurrent work
Implementation: Memories tagged with {"turn": N} on recording, filtered on retrieval.
API Reference#
For programmatic usage, see the memory module docstrings:
massgen.memory.PersistentMemory- Persistent memory APImassgen.memory.ConversationMemory- Conversation memory APImassgen.memory._context_monitor- Context monitoring utilitieslog_context_usage_from_buffer(buffer, turn_number)- Buffer-based tracking (recommended)log_context_usage_from_tokens(tokens, turn_number)- Official API token countslog_context_usage(messages, turn_number)- Legacy message-based tracking
massgen.conversation_buffer.AgentConversationBuffer- Conversation bufferestimate_tokens(calculator)- Get total token count including pending contentget_token_stats(calculator)- Get breakdown by entry type (user, assistant, tool_call, etc.)
Examples#
See complete examples in:
massgen/configs/memory/gpt5mini_gemini_context_window_management.yamlmassgen/configs/memory/gpt5mini_high_reasoning_gemini.yaml
Future Improvements#
Note
The memory system is production-ready but has several planned enhancements.
Planned Features#
1. Proactive Streaming Interruption (Partially Implemented)
Implemented: Buffer-based token tracking captures ALL content during streaming:
[Agent streaming response...]
→ [Buffer tracks: tool calls, results, reasoning, content]
→ [Pre-processing: 📊 Context Buffer: 45K / 128K tokens [buffer]]
→ [Post-processing: 📊 Context Window: 45K / 128K tokens [API actual]]
→ [Compress if needed]
Remaining: Proactive interruption when approaching budget
Planned: Inject warning to agent mid-stream when approaching limit
[Agent streaming...]
→ [Buffer counter: 95K / 128K budget]
→ [Agent sees: "⚠️ Approaching token limit, wrap up"]
→ [Agent concludes early]
2. Memory Analytics Dashboard
Planned: Visualize memory quality and tool usage patterns
Memory Analytics Dashboard
===========================
Facts Extracted: 245 (last 7 days)
Tool Patterns Learned: 12
Top Tool Sequences:
1. directory_tree → read_file → grep (85% success)
2. list_directory → read_file (92% success)
Fact Quality:
- Actionable: 78%
- Generic: 15%
- Redundant: 7%
3. Smart Tool Result Summarization
Planned: Automatically summarize large MCP tool results before recording
memory:
recording:
record_all_tool_calls: true
summarize_large_results: true # Auto-summarize results > 5KB
summary_model: "gpt-4o-mini" # Model for summarization
Benefit: Capture tool usage patterns without overwhelming mem0’s extraction LLM with 50KB directory trees
4. Memory Summarization on Compression (Implemented)
Compression now generates a comprehensive summary of all work done:
Compression Flow:
1. Context limit error detected
2. Generate summary of buffer content (tool calls, results, analysis)
3. Rebuild context: [system] → [user request] → [summary]
4. Summary placed LAST so model continues from it (not restart)
The user→summary ordering prevents models from re-reading files or redoing analysis that was already completed before compression.
Known Limitations#
Token Counting During Streaming (Improved in v0.1.25+)
Buffer-based tracking now provides context estimates during streaming:
✅ Before processing: Buffer estimation shows current context size
✅ After response: Official API counts used when available
✅ Accurate tracking: Includes tool calls, results, injections, reasoning
❌ Can’t stop mid-response if too large (proactive interruption planned)
❌ No real-time budget warnings to agent yet
❌ Reasoning not provided by APIs so buffer can be inaccurate
Workaround: Set conservative compression thresholds (50-60%) to leave headroom.
Extraction Quality Depends on Model
The quality of extracted facts varies significantly by model:
gpt-4.1-nano / gpt-4o-mini: Fast, cheap, but may produce generic facts or JSON parsing errors on complex content
gpt-4o / gpt-4-turbo: Slower, more expensive, but extracts specific, actionable insights
Recommendation: Use gpt-4o-mini for development, gpt-4o for production if fact quality matters.
MCP Tools Recording is Opt-In
By default, MCP tool calls (read_file, list_directory) are excluded to keep memory concise.
To enable: Set memory.recording.record_all_tool_calls: true
Trade-off: More data for pattern learning vs. potential information overload for mem0’s extraction LLM.
Session-Level Memory Isolation
Memories are isolated per session. To access knowledge from previous sessions, either:
- Set session_name: null (search all sessions)
- Explicitly continue a session with session_name: "my_session"
Local Qdrant Single-Agent Only
File-based Qdrant (mode: "local") does NOT support concurrent access.
For multi-agent: Always use mode: "server" with Docker.
Next Steps#
Interactive Multi-Turn Mode - Interactive multi-turn conversations
Orchestration Restart - Graceful restart handling
Logging & Debugging - Understanding MassGen’s logging system