MassGen v0.1.5: Persistent Memory with Semantic Retrieval#
MassGen is focused on case-driven development. This case study demonstrates the introduction of persistent memory with semantic retrieval, enabling agents to build cumulative knowledge across multi-turn sessions and achieve true self-evolution through research-to-implementation workflows.
Table of Contents#
đź“‹ PLANNING PHASE
📝 Evaluation Design
Prompt#
Two-turn research-to-implementation workflow:
Turn 1 (Research):
Use crawl4ai to research the latest multi-agent AI papers and techniques from 2025.
Focus on: coordination mechanisms, voting strategies, tool-use patterns, and architectural innovations.
Turn 2 (Implementation):
Based on the multi-agent research from earlier, which techniques should we implement in MassGen
to make it more state-of-the-art? Consider MassGen's current architecture and what would be most impactful.
This prompt tests whether agents can:
Research external sources and store findings
Retrieve relevant research in a follow-up turn
Apply research to self-improvement recommendations
Baseline Config#
Multi-turn conversations were already supported in MassGen, but without persistent memory. Turn 2 only had access to:
The full conversation history from Turn 1 (in context window)
Any workspace files created during Turn 1
Limitation: No semantic search, no fact extraction, no persistent knowledge base across sessions.
đź”§ Evaluation Analysis
Results & Failure Modes#
Before Persistent Memory (multi-turn only):
âś… What worked:
Agents could have multi-turn conversations
Turn 2 could reference Turn 1’s full output in context
Workspace files persisted across turns
❌ What was missing:
No semantic augmentation: Turn 2 had Turn 1’s answer but no additional extracted facts
No structured knowledge: Research stored only as raw conversation text
Generic recommendations: Without structured facts, recommendations lacked specificity
Baseline Turn 2 Result (session log_20251029_064846, 132 lines):
Generic architectural proposals: “add massgen/voting.py”
Theoretical recommendations: “pluggable voting”, “layered memory”
Less grounded in current codebase structure
More abstract: “CoordinationStrategy interface and implementations”
Success Criteria#
Persistent memory should enable:
Automatic Fact Extraction: System extracts structured facts from Turn 1 research
Semantic Augmentation: Turn 2 gets Turn 1’s answer PLUS relevant extracted facts automatically
Persistent Storage: Facts stored in vector database
More Specific Recommendations: Turn 2 provides concrete file paths, implementation steps, grounded in both research and current architecture
🎯 Desired Features
To enable self-evolution through research-to-implementation:
Fact Extraction: Automatically extract important facts from conversations
Vector Storage: Store facts with embeddings in persistent vector database
Semantic Retrieval: Automatically retrieve relevant facts based on context
Cross-Turn Continuity: Facts from Turn 1 available in Turn 2
Quality Extraction: Custom prompts to ensure useful, self-contained facts
Multi-Agent Support: Concurrent fact storage from multiple agents
🚀 TESTING PHASE
📦 Implementation Details
Version#
v0.1.5 - Introduction of persistent memory system
✨ New Features
PersistentMemory Integration (
massgen/memory/_persistent.py)Wraps mem0’s AsyncMemory with MassGen-specific logic
Automatic fact extraction on turn completion
Semantic retrieval via vector search
Metadata tracking (session_id, agent_id, turn number)
Custom Fact Extraction Prompts (
massgen/memory/_fact_extraction_prompts.py)MASSGEN_UNIVERSAL_FACT_EXTRACTION_PROMPT designed for quality facts
Intended to filter out: agent comparisons, voting details, file paths, system internals
Focuses on: domain knowledge, insights, capabilities, recommendations
Enforces self-contained facts (understandable without original context)
Qdrant Vector Store Integration
Server mode support for multi-agent concurrency
Vector similarity search with metadata filtering
Persistent storage across sessions
Memory Configuration YAML
memory.persistent_memorysection for mem0 configurationLLM and embedding model settings
Qdrant connection parameters
Retrieval and compression settings
Automatic Recording & Retrieval (in
ChatAgent)Records facts after each turn completion
Retrieves relevant facts when context window approaches limit
Injects as system message: “Relevant memories: …”
New Config#
massgen/configs/memory/gpt5mini_gemini_research_to_implementation.yaml
Key Memory Settings:
memory:
enabled: true
persistent_memory:
enabled: true # 🆕 NEW: Persistent memory
session_name: "research_to_implementation" # Cross-turn continuity
vector_store: "qdrant"
llm:
provider: "openai"
model: "gpt-4.1-nano-2025-04-14" # Fact extraction
embedding:
provider: "openai"
model: "text-embedding-3-small" # Vector embeddings
qdrant:
mode: "server"
host: "localhost"
port: 6333
retrieval:
limit: 10 # Number of facts to retrieve
Command#
Prerequisites:
# Start Qdrant server
docker run -d -p 6333:6333 -p 6334:6334 \
-v $(pwd)/.massgen/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
# Start crawl4ai (for web scraping)
docker run -d -p 11235:11235 --name crawl4ai \
--shm-size=1g unclecode/crawl4ai:latest
Run Session:
uv run massgen --config @examples/memory/gpt5mini_gemini_research_to_implementation.yaml
Turn 1 Prompt:
Use crawl4ai to research the latest multi-agent AI papers and techniques from 2025.
Focus on: coordination mechanisms, voting strategies, tool-use patterns, and architectural innovations.
Turn 2 Prompt (in same session):
Based on the multi-agent research from earlier, which techniques should we implement in MassGen
to make it more state-of-the-art? Consider MassGen's current architecture and what would be most impactful.
🤖 Agents
Agent A: gpt-5-mini (with crawl4ai tools)
Agent B: gemini-2.5-flash (with crawl4ai tools)
Session: session_20251029_072105
Duration: 11 minutes across 2 turns
Memory Stats:
Facts stored (Turn 1): 54
Facts retrieved (Turn 2): 10
🎥 Demo
Watch the recorded demo:
📊 EVALUATION & ANALYSIS
Results#
Persistent memory dramatically improved Turn 2’s ability to provide specific, actionable recommendations by retrieving relevant research findings from Turn 1.
The Collaborative Process#
Turn 1 - Research Phase (5 minutes):
Agents used crawl4ai to scrape arXiv
Retrieved 20+ papers on multi-agent systems from late 2025
Analyzed coordination mechanisms, voting strategies, tool patterns, architectures
Generated comprehensive research summary (~133 lines)
🆕 Memory recorded 54 facts automatically
Example facts stored:
“Multi-layer memory folding that includes short-term windows, episodic timelines, and semantic summaries allows agents to manage large contexts efficiently, reducing token usage while maintaining factual recall, which is crucial for long-horizon tasks and fine-tuning.”
“In 2025, multi-agent and agentic-AI systems evolved from ad-hoc multi-LLM setups to using structured workflows including hierarchical planning, task graphs, and planner-executor separations, which improve coherence, scalability, and fault tolerance.”
Turn 2 - Implementation Phase (6 minutes):
Agents have Turn 1’s full answer in context (standard multi-turn)
🆕 System automatically retrieves 10 relevant facts from Turn 1 via semantic search
🆕 Facts injected as system message: “Relevant memories: …”
Read MassGen codebase (
massgen/anddocs/directories)Cross-referenced Turn 1 answer + retrieved facts + current architecture
Generated prioritized implementation plan (~110 lines)
Example automatic memory retrieval:
When Turn 2 starts, system automatically searches memories and injects relevant facts:
Retrieved fact:
“Using argumentation frameworks with evidence scoring, proficiency or reputation-weighted voting, multi-stage consensus, and human-in-the-loop arbitration are advanced voting strategies in 2025…”
This fact (from Turn 1 research) was automatically added to Turn 2’s context, enabling:
“1) Evidence‑aware, proficiency‑weighted voting + Judge (High impact, low→medium effort)
Replace naive majority with weighted aggregation using per‑agent proficiency scores plus evidence strength…”
The Voting Pattern#
No changes to voting in this release - standard MassGen voting applied. The improvement came from what agents could reference during answer generation, not how they voted.
The Final Answer#
Turn 2 Quality Comparison (Both sessions have Turn 1 answer in context):
Without Persistent Memory (log_20251029_064846, 132 lines):
Generic architectural proposals: “add massgen/voting.py (or massgen/voting/ package)”
Theoretical interfaces: “CoordinationStrategy”, “Aggregator base”
Broad phases: “Phase 0 (1-2 sprints)”, “Phase 1 (2-6 weeks)”
Less grounded: Treats implementation like greenfield architecture
Sample from baseline:
1) Pluggable Voting & Aggregation + Adaptive Early Stopping
- Where to change: add massgen/voting.py (or massgen/voting/ package)
- Suggested API / design sketch:
- Aggregator (base)
- add_vote(agent_id, result, confidence, trajectory, metadata)
With Persistent Memory (log_20251029_072105, 110 lines):
âś… Specific existing file paths:
workflow_toolkits/vote.py,coordination_tracker.py,orchestrator.pyâś… Concrete implementation steps: Numbered steps for each feature
âś… Test metrics: Specific KPIs for measuring success
âś… Sprint planning: Concrete deliverables per sprint
âś… Grounded in current architecture: References actual MassGen files
Sample from memory-enabled:
Top recommendations (what + where to change in repo)
1) Evidence‑aware, proficiency‑weighted voting + Judge (High impact, low→medium effort)
Where to implement (explicit paths):
- workflow_toolkits/vote.py — extend to accept evidence payloads and compute weighted scores
- message_templates.py — add evidence schema to agent message format
- coordination_tracker.py — track per‑agent proficiency/calibration
- orchestrator.py — surface evidence into coordinator logs and call Judge
Concrete implementation steps:
1. Extend message template with evidence: {claims:[...], tool_outputs:[...], confidence:float}
2. Implement per‑agent scoreboard (moving average success) in coordination_tracker.py
3. Update vote.py: compute final_score = α*proficiency + β*evidence_score + γ*vote_strength
4. Create Judge agent that can (a) fetch supporting sources, (b) re-run tool calls...
Key Difference:
The memory-enabled version provides:
Actual file paths that exist (
workflow_toolkits/vote.pyvs. “add massgen/voting.py”)Numbered implementation steps
Specific integration points
Grounded in both research AND current codebase
This specificity comes from:
Turn 1 research stored as structured facts (automatic)
Turn 2 has Turn 1 answer PLUS 10 relevant facts (automatic semantic retrieval)
Facts provide additional semantic context beyond raw conversation history
Agents combine: Turn 1 answer + extracted facts + codebase analysis = concrete actionable plan
Memory System Performance#
Memory example:
“Coordination mechanisms that improve long-term coherence include hierarchical recursive planning, task decomposition with DAG structures, and planner-executor systems that maintain shared memory and intermediate artifacts.”
Retrieval Performance:
Turn 2 context triggers automatic semantic search
System found 10 most relevant facts from 54 stored
Latency: < 100ms
Facts augment Turn 1 answer already in context
Cost Analysis:
Fact extraction: gpt-4.1-nano @ $0.15/M tokens
Embeddings: text-embedding-3-small @ $0.020/M tokens
54 facts extracted + embedded: < $0.001
Storage: ~108 KB (54 Ă— 2KB per fact)
🎯 Conclusion
Why Persistent Memory Improves Self-Evolution#
Before (multi-turn with conversation history):
Turn 2 had Turn 1’s full answer in context ✓
But: No additional semantic augmentation
Result: Generic architectural proposals (132 lines, mostly abstract interfaces)
After (persistent memory + conversation history):
Turn 2 has Turn 1’s answer PLUS 10 automatically retrieved facts
Facts provide additional semantic context extracted from Turn 1
System automatically searches and injects relevant memories
Result: Specific file paths, numbered steps, grounded in actual architecture (110 lines, more concrete)
The Compound Effect#
Within this session, memory enabled:
Turn 1: Research 20+ papers → Extract and store 54 structured facts
Turn 2: Automatically retrieve 10 relevant facts → More specific recommendations
The architecture supports future cross-session retrieval, though not demonstrated in this case study.
Broader Implications#
Persistent memory enables:
Self-Evolution: Agents can learn about themselves through research-to-implementation
Research-to-Implementation: Bridge external research to internal development
Semantic Augmentation: Additional structured facts supplement conversation history
Knowledge Storage: Facts persist in vector database for future retrieval
Improved Specificity: Extracted facts lead to more concrete, actionable recommendations
Future Improvements#
Memory Quality (Current: 72% good, 28% system internals):
The custom fact extraction prompts significantly improve memory quality, but ~28% of stored facts are still system internals (voting details, agent comparisons, meta-instructions). Planned improvements:
Stricter pattern matching for procedural language
Two-pass extraction (extract, then validate against exclusion rules)
Domain-specific prompts for research vs. implementation tasks
Active learning from user feedback on memory relevance
Cross-Session Loading:
Load facts from previous sessions by session_name
Session management UI
Memory pruning and maintenance
Retrieval Intelligence:
Multi-query retrieval (expand queries to multiple search vectors)
Temporal weighting (recent facts ranked higher)
Cross-session memory fusion (merge related facts)
📌 Status Tracker
Planning phase completed
Features implemented
Testing completed
Demo recorded
Results analyzed
Case study reviewed
