
Giving agents long-term memory is hard: context windows are limited, storing every raw message is expensive, and retrieval has to be both fast and relevant. We solved this with a time-stratified design: organize memory by temporal grain, summarize progressively, and pair it with a graph-based knowledge base for structured facts. Here’s how it works under the hood.
In short: six time grains (working → yearly), progressive summarization, plus a graph for who/what facts.
What is Helpmaton?
Helpmaton is a workspace-based platform for creating, configuring, and deploying AI agents at scale (see Building Helpmaton: An Infrastructure Layer for AI Agents for the broader story). Teams and developers use it to build customer support bots, internal knowledge assistants, scheduled automations, and API-driven workflows, without standing up their own infra for webhooks, knowledge bases, or memory. I built it because I wanted a place where deploying an agent with real context (documents, past conversations, structured facts) was straightforward: you create a workspace, add agents and docs, and get webhook endpoints and tools like `search_memory` out of the box. The memory system described below is one of the things that makes those agents useful over time.
The problem
Agents need to remember facts, people, and events across many conversations. Raw logs don’t scale and don’t support semantic search (I’ve written before about building semantic search for AI agents). We needed a design that keeps recent detail queryable while compressing the past and staying cost-bounded.
Core idea: Organize memory by temporal grain (working → daily → weekly → monthly → quarterly → yearly), summarize progressively, and keep a separate graph knowledge base of (subject, predicate, object) facts. In this post you’ll see the architecture, data flow, consistency and scaling choices, how agents use memory (search + optional injection), and how the graph complements vector memory.
Why stratification?
Recent interactions need fine-grained, searchable facts; older ones can live as summaries. We use six grains:
| Grain | Time format | Role |
|---|---|---|
| working | (none) | Raw conversation facts |
| daily | YYYY-MM-DD | Daily summaries of working |
| weekly | YYYY-W{week} | Weekly summaries of daily |
| monthly | YYYY-MM | Monthly summaries of weekly |
| quarterly | YYYY-Q{quarter} | Quarterly summaries of monthly |
| yearly | YYYY | Yearly summaries of quarterly |
Benefits: Fast semantic search on recent data, bounded storage via summarization, and information preserved across time scales.
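As a concrete sketch, mapping a timestamp to each grain's time key could look like the following. This is a hypothetical helper (the name and the simplified week numbering are mine, not the production code); the formats come from the table above.

```ts
// Hypothetical helper: derive the time key used for each grain (formats from
// the table above). The week calculation is simplified, not strict ISO-8601.
type Grain = "working" | "daily" | "weekly" | "monthly" | "quarterly" | "yearly";

function grainKey(grain: Grain, date: Date): string {
  const yyyy = date.getUTCFullYear();
  const mm = String(date.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(date.getUTCDate()).padStart(2, "0");
  switch (grain) {
    case "working":
      return ""; // working memory has no time key: one DB per agent
    case "daily":
      return `${yyyy}-${mm}-${dd}`; // e.g. 2024-01-15
    case "weekly": {
      const dayOfYear = Math.floor((date.getTime() - Date.UTC(yyyy, 0, 1)) / 86_400_000) + 1;
      const week = String(Math.ceil(dayOfYear / 7)).padStart(2, "0");
      return `${yyyy}-W${week}`; // e.g. 2024-W03
    }
    case "monthly":
      return `${yyyy}-${mm}`; // e.g. 2024-01
    case "quarterly":
      return `${yyyy}-Q${Math.floor(date.getUTCMonth() / 3) + 1}`; // e.g. 2024-Q1
    case "yearly":
      return String(yyyy); // e.g. 2024
  }
}
```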
Architecture at a glance
- Storage: One vector DB (LanceDB) per agent per grain. Working memory is one DB per agent; other grains are one DB per time period, e.g. `vectordb/{agentId}/daily/2024-01-15/` in Amazon S3 (see the path sketch after this list).
- Data flow: Conversations → working memory (facts + embeddings) → daily summarization → weekly → … → yearly.
- Scheduled jobs: AWS Lambda functions for daily / weekly / monthly / quarterly / yearly summarization, plus a daily retention-cleanup job.
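For illustration, the storage layout could be built roughly like this; the function name and the working-memory path are assumptions, while the per-period pattern matches the example path above.

```ts
// Hypothetical path builder for the per-agent, per-grain LanceDB location in S3.
// Working memory is a single DB per agent (path assumed); other grains get one
// DB per time period, keyed as in the grain table above.
function vectorDbPath(agentId: string, grain: string, timeKey?: string): string {
  if (grain === "working" || !timeKey) {
    return `vectordb/${agentId}/working/`;
  }
  return `vectordb/${agentId}/${grain}/${timeKey}/`;
}

// e.g. vectorDbPath("agent-123", "daily", "2024-01-15")
//   -> "vectordb/agent-123/daily/2024-01-15/"
```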
Two pillars of agent knowledge:
- Stratified vector memory (above): temporal grains, progressive summarization, semantic search.
- Graph knowledge base: Structured (subject, predicate, object) facts per agent, used for entity-based lookup at inject time (see below).
```mermaid
flowchart LR
subgraph write [Write path]
Conv[Conversations]
WM[Working memory]
Graph[Graph facts]
Conv --> WM
Conv --> Graph
end
subgraph summarize [Summarization]
WM --> D[Daily]
D --> W[Weekly]
W --> M[Monthly]
M --> Q[Quarterly]
Q --> Y[Yearly]
end
subgraph use [Use]
Search[search_memory tool]
Inject[Inject Knowledge]
D --> Search
WM --> Inject
Graph --> Inject
end
```
Knowledge base: graph database
The graph stores structured facts (subject–predicate–object triples) per agent and answers questions like “What do we know about Alice?” via entity-based lookup. It complements vector memory, which is best for “what was said when” and semantic similarity.
- Stack: DuckDB in-process with the DuckPGQ extension for property-graph queries; data persisted as Apache Parquet in S3.
- Storage: One `facts` table per agent at `s3://bucket/graphs/{workspaceId}/{agentId}/facts.parquet`. Schema: `id`, `source_id`, `target_id`, `label`, `properties` (e.g. confidence, conversationId, updatedAt).
- How facts get in: When conversations are created or updated, an LLM memory-extraction step extracts facts and emits memory operations (INSERT / UPDATE / DELETE). `applyMemoryOperationsToGraph()` in memoryExtraction.ts applies these to the graph via `graphDb` (insertFacts, updateFacts, deleteFacts, save). The same pipeline can also write working-memory vectors.
- How it's used: At conversation start, Inject Knowledge (if enabled) extracts entities from the user prompt using an optional entity-extractor model. `searchGraphByEntities()` loads the graph, queries facts where `source_id` or `target_id` is in the entity list, and returns (subject, predicate, object) snippets (see the sketch below). Those are merged with working-memory snippets and document snippets and can be re-ranked before injection.
- API: `createGraphDb(workspaceId, agentId)` returns `insertFacts`, `updateFacts`, `deleteFacts`, `queryGraph(sql)`, `save()`, `close()`. Graph search lives in graphSearch.ts. See the graph database doc and graphDb.ts for details.
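To make the lookup concrete, here is a minimal sketch of an entity-based query against the `facts` table, assuming `createGraphDb` and `queryGraph(sql)` behave roughly as described above. The exact signatures, async behavior, and SQL are assumptions; this is not the real searchGraphByEntities() implementation.

```ts
// Simplified types assumed for the createGraphDb() API described above.
type Fact = { source_id: string; label: string; target_id: string };
type GraphDb = {
  queryGraph(sql: string): Promise<Fact[]>;
  close(): Promise<void>;
};
declare function createGraphDb(workspaceId: string, agentId: string): Promise<GraphDb>;

// Sketch: fetch facts whose subject or object matches one of the extracted entities.
async function factsForEntities(
  workspaceId: string,
  agentId: string,
  entities: string[],
): Promise<Fact[]> {
  if (entities.length === 0) return [];
  const graph = await createGraphDb(workspaceId, agentId);
  try {
    // Escape single quotes and build an IN (...) list for the SQL filter.
    const list = entities.map((e) => `'${e.replace(/'/g, "''")}'`).join(", ");
    return await graph.queryGraph(
      `SELECT source_id, label, target_id
         FROM facts
        WHERE source_id IN (${list}) OR target_id IN (${list})`,
    );
  } finally {
    await graph.close();
  }
}
```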
Writing memory: consistency and scale
- When: On conversation create/update we extract facts, embed them via OpenRouter using the `thenlper/gte-base` embedding model, and write to working memory.
- How: Writes go through Amazon SQS FIFO (first-in, first-out) queues with message group `{agentId}:{grain}`, so only one writer per DB runs at a time and there are no races on the same LanceDB instance (see the sketch below).
- Flow: API → extract facts → embed → enqueue → SQS handler → LanceDB. For more on how we use message groups, see FIFO queue message grouping. That keeps Lambda and DB usage decoupled and serialized.
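For illustration, enqueuing a write with the AWS SDK v3 SQS client might look like the sketch below. The queue URL, message shape, and helper name are assumptions; the MessageGroupId pattern is the one described above.

```ts
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Hypothetical message shape; the real payload may differ.
interface MemoryWrite {
  agentId: string;
  grain: string;
  fact: string;
  embedding: number[];
}

// Enqueue a memory write. The MessageGroupId serializes all writes for the
// same {agentId}:{grain}, so only one consumer touches that LanceDB instance
// at a time.
async function enqueueMemoryWrite(queueUrl: string, write: MemoryWrite): Promise<void> {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: queueUrl,
      MessageBody: JSON.stringify(write),
      MessageGroupId: `${write.agentId}:${write.grain}`,
      // FIFO queues also require deduplication (or content-based dedup on the queue).
      MessageDeduplicationId: `${write.agentId}:${write.grain}:${Date.now()}`,
    }),
  );
}
```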
Summarization pipeline
We run LLM summarization from finer grain to coarser (e.g. working → daily, daily → weekly). Prompts are grain-specific: daily focuses on events, people, and facts; weekly on narrative; monthly and above on themes and milestones.
- Schedule: Cron-like Lambda invocations, one rate per grain (daily, every 7 days, every 30 days, every 90 days, and every 365 days). The scheduled summarization tasks and their rates are defined in app.arc (Arc / Architect).
- Process: Query the source grain by time window → concatenate content → call LLM → embed summary → write to the next grain. Same pattern at every level; only the time window and prompt change.
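One level of the pipeline might look roughly like this sketch; every helper here (querySourceGrain, callLlm, embed, writeSummary) is a hypothetical stand-in for the real query, LLM, embedding, and write code.

```ts
// Hypothetical helpers, declared only so the sketch type-checks.
declare function querySourceGrain(agentId: string, grain: string, start: Date, end: Date): Promise<Array<{ content: string }>>;
declare function callLlm(prompt: string, input: string): Promise<string>;
declare function embed(text: string): Promise<number[]>;
declare function writeSummary(agentId: string, grain: string, rec: { content: string; embedding: number[]; windowEnd: Date }): Promise<void>;

// Sketch of one summarization step (e.g. working -> daily). The same shape
// repeats at every level; only the time window and prompt change.
async function summarizeGrain(opts: {
  agentId: string;
  sourceGrain: string;
  targetGrain: string;
  windowStart: Date;
  windowEnd: Date;
  prompt: string; // grain-specific prompt (events/people vs. narrative vs. themes)
}): Promise<void> {
  // 1. Query the source grain for the time window.
  const records = await querySourceGrain(opts.agentId, opts.sourceGrain, opts.windowStart, opts.windowEnd);
  if (records.length === 0) return;

  // 2. Concatenate content and ask the LLM for a summary.
  const summary = await callLlm(opts.prompt, records.map((r) => r.content).join("\n"));

  // 3. Embed the summary and write it to the next grain.
  const embedding = await embed(summary);
  await writeSummary(opts.agentId, opts.targetGrain, { content: summary, embedding, windowEnd: opts.windowEnd });
}
```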
Search and how agents use memory
Agents get a search_memory tool with grain, minimumDaysAgo / maximumDaysAgo, maxResults, and optional queryText for semantic search. Behavior: vector search within the chosen grain with optional date filter; results can be date-prefixed for the model.
Inject Knowledge (optional) combines two memory sources at conversation start:
- Working-memory vector search — semantic similarity to the prompt.
- Graph facts — entity-based: extract entities from the prompt, then fetch facts where subject or object matches.
Results are merged with document snippets and optionally re-ranked (e.g. via a cross-encoder or similar) before the first user message; a sketch of this assembly follows the search example below.
Example:
```ts
const results = await searchMemory({
agentId: "agent-123",
workspaceId: "workspace-456",
grain: "daily",
minimumDaysAgo: 0,
maximumDaysAgo: 30,
maxResults: 10,
queryText: "React project discussion",
});
```
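On the Inject Knowledge side, the assembly described above might look roughly like this. All helper functions and signatures here are hypothetical (including this simplified searchGraphByEntities signature); the sketch only illustrates the merge-then-rerank flow.

```ts
// Hypothetical helpers for the two memory sources, documents, and re-ranking.
declare function searchWorkingMemory(agentId: string, prompt: string): Promise<string[]>;
declare function extractEntities(prompt: string): Promise<string[]>;
declare function searchGraphByEntities(agentId: string, entities: string[]): Promise<string[]>;
declare function searchDocuments(workspaceId: string, prompt: string): Promise<string[]>;
declare function rerank(query: string, snippets: string[]): Promise<string[]>;

// Sketch: gather snippets from working memory, graph facts, and documents,
// then re-rank against the prompt before injecting ahead of the first message.
async function buildInjectedContext(workspaceId: string, agentId: string, prompt: string): Promise<string> {
  const [memory, entities, docs] = await Promise.all([
    searchWorkingMemory(agentId, prompt),
    extractEntities(prompt),
    searchDocuments(workspaceId, prompt),
  ]);
  const graphFacts = await searchGraphByEntities(agentId, entities);

  const merged = [...memory, ...graphFacts, ...docs];
  const ranked = await rerank(prompt, merged);
  return ranked.join("\n");
}
```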
Retention and cost control
Unlimited growth would be expensive and noisy. We enforce per-plan retention per grain and run a daily cleanup job.
| Plan | Working | Daily | Weekly | Monthly | Quarterly | Yearly |
|---|---|---|---|---|---|---|
| Free | 48 hours | 30 days | 6 weeks | 6 months | 4 quarters | 2 years |
| Starter | 120 hours | 60 days | 12 weeks | 12 months | 8 quarters | 4 years |
| Pro | 240 hours | 120 days | 24 weeks | 24 months | 16 quarters | 8 years |
Older data is pruned per grain so storage stays predictable.
- Cleanup: One daily job; for each workspace/agent/grain we compute the cutoff from the plan, delete older records, and process from the most granular grain (working) upward.
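As a sketch, the cutoff computation could look like this, with the plan limits from the table above converted to days (months, quarters, and years approximated). The config shape and helper name are hypothetical.

```ts
// Hypothetical retention config mirroring the plan table above, in days.
// Months, quarters, and years are approximated (30 / 91 / 365 days).
const RETENTION_DAYS: Record<string, Record<string, number>> = {
  free:    { working: 2,  daily: 30,  weekly: 42,  monthly: 180, quarterly: 365,  yearly: 730 },
  starter: { working: 5,  daily: 60,  weekly: 84,  monthly: 365, quarterly: 730,  yearly: 1460 },
  pro:     { working: 10, daily: 120, weekly: 168, monthly: 730, quarterly: 1460, yearly: 2920 },
};

// Records older than this cutoff are deleted by the daily cleanup job,
// processing from the most granular grain (working) upward.
function retentionCutoff(plan: keyof typeof RETENTION_DAYS, grain: string, now = new Date()): Date {
  const days = RETENTION_DAYS[plan][grain];
  return new Date(now.getTime() - days * 86_400_000);
}
```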
Implementation details engineers care about
Performance and operational notes
- Serialization: SQS message groups prevent concurrent writes to the same DB.
- Batching: Summarization processes in batches where applicable.
- Cleanup: Daily retention keeps storage predictable.
- Vector search: LanceDB per grain (approximate nearest neighbor search over embeddings); we don’t dive into LanceDB internals here—just “vector index per grain.”
Future directions
- Configurable summarization prompts per agent.
- Custom retention per workspace.
- Richer search (by person, event type) and memory analytics.
- Cross-agent memory in team workspaces.
Wrapping up
Agents get long-term memory from two subsystems:
- Stratified vector memory — grains, progressive summarization, FIFO writes, and retention give queryable, cost-bounded semantic memory.
- Graph knowledge base — (subject, predicate, object) facts populated by memory extraction and retrieved by entity at inject time.
Together they support both “what was said when” and “what we know about X.”
Go deeper: See the agent memory system doc, graph database doc, vector database doc, and agent configuration doc (Inject Knowledge). You can also dig into the Helpmaton repo on GitHub and the file paths above.