How We Give AI Agents Long-Term Memory Without Blowing the Budget



Giving agents long-term memory is hard: context windows are limited, storing every raw message is expensive, and retrieval has to be both fast and relevant. We solved this with a time-stratified design: organize memory by temporal grain, summarize progressively, and pair it with a graph-based knowledge base for structured facts. Here’s how it works under the hood.

In short: six time grains (working → yearly), progressive summarization, plus a graph for who/what facts.

What is Helpmaton? Helpmaton is a workspace-based platform for creating, configuring, and deploying AI agents at scale (see Building Helpmaton: An Infrastructure Layer for AI Agents for the broader story). Teams and developers use it to build customer support bots, internal knowledge assistants, scheduled automations, and API-driven workflows—without standing up their own infra for webhooks, knowledge bases, or memory. I built it because I wanted a place where deploying an agent with real context (documents, past conversations, structured facts) was straightforward: you create a workspace, add agents and docs, and get webhook endpoints and tools like search_memory out of the box. The memory system described below is one of the things that makes those agents useful over time.


The problem

Agents need to remember facts, people, and events across many conversations. Raw logs don’t scale and don’t support semantic search (I’ve written before about building semantic search for AI agents). We needed a design that keeps recent detail queryable while compressing the past and staying cost-bounded.

Core idea: Organize memory by temporal grain (working → daily → weekly → monthly → quarterly → yearly), summarize progressively, and keep a separate graph knowledge base of (subject, predicate, object) facts. In this post you’ll see the architecture, data flow, consistency and scaling choices, how agents use memory (search + optional injection), and how the graph complements vector memory.


Why stratification?

Recent interactions need fine-grained, searchable facts; older ones can live as summaries. We use six grains:

Grain       Time format       Role
working     (none)            Raw conversation facts
daily       YYYY-MM-DD        Daily summaries of working
weekly      YYYY-W{week}      Weekly summaries of daily
monthly     YYYY-MM           Monthly summaries of weekly
quarterly   YYYY-Q{quarter}   Quarterly summaries of monthly
yearly      YYYY              Yearly summaries of quarterly

Benefits: Fast semantic search on recent data, bounded storage via summarization, and information preserved across time scales.
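
For illustration, here's a hedged sketch of how a grain's time-period key could be derived from a date; the function name and the simplified week arithmetic are assumptions, not the production code:

type Grain = "working" | "daily" | "weekly" | "monthly" | "quarterly" | "yearly";

// Hypothetical helper: map a grain + date to the period key used for that grain's DB.
// Week numbering is approximate here; a real implementation would use ISO weeks.
function periodKey(grain: Grain, date: Date): string {
  const year = date.getUTCFullYear();
  switch (grain) {
    case "working":
      return ""; // working memory has no time key (one DB per agent)
    case "daily":
      return date.toISOString().slice(0, 10); // YYYY-MM-DD
    case "weekly": {
      const dayOfYear = (date.getTime() - Date.UTC(year, 0, 1)) / 86_400_000 + 1;
      return `${year}-W${Math.ceil(dayOfYear / 7)}`; // YYYY-W{week}
    }
    case "monthly":
      return `${year}-${String(date.getUTCMonth() + 1).padStart(2, "0")}`; // YYYY-MM
    case "quarterly":
      return `${year}-Q${Math.floor(date.getUTCMonth() / 3) + 1}`; // YYYY-Q{quarter}
    case "yearly":
      return String(year); // YYYY
  }
}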


Architecture at a glance

  • Storage: LanceDB vector databases in Amazon S3, organized per agent and per grain. Working memory is one DB per agent; every other grain is one DB per time period, e.g. vectordb/{agentId}/daily/2024-01-15/.
  • Data flow: Conversations → working memory (facts + embeddings) → daily summarization → weekly → … → yearly.
  • Scheduled jobs: AWS Lambda functions for daily, weekly, monthly, quarterly, and yearly summarization, plus a daily retention-cleanup job.

Two pillars of agent knowledge:

  1. Stratified vector memory (above): temporal grains, progressive summarization, semantic search.
  2. Graph knowledge base: Structured (subject, predicate, object) facts per agent, used for entity-based lookup at inject time (see below).

Here's how the write path, summarization, and usage fit together (Mermaid diagram):

flowchart LR
  subgraph write [Write path]
    Conv[Conversations]
    WM[Working memory]
    Graph[Graph facts]
    Conv --> WM
    Conv --> Graph
  end
  subgraph summarize [Summarization]
    WM --> D[Daily]
    D --> W[Weekly]
    W --> M[Monthly]
    M --> Q[Quarterly]
    Q --> Y[Yearly]
  end
  subgraph use [Use]
    Search[search_memory tool]
    Inject[Inject Knowledge]
    D --> Search
    WM --> Inject
    Graph --> Inject
  end

Knowledge base: graph database

The graph stores structured facts (subject–predicate–object triples) per agent and answers questions like “What do we know about Alice?” via entity-based lookup. It complements vector memory, which is best for “what was said when” and semantic similarity.

  • Stack: DuckDB in-process with the DuckPGQ extension for property-graph queries; data persisted as Apache Parquet in S3.

  • Storage: One facts table per agent at s3://bucket/graphs/{workspaceId}/{agentId}/facts.parquet. Schema: id, source_id, target_id, label, properties (e.g. confidence, conversationId, updatedAt).

  • How facts get in: When conversations are created or updated, an LLM-based memory-extraction step pulls facts out of the conversation and emits memory operations (INSERT / UPDATE / DELETE). applyMemoryOperationsToGraph() in memoryExtraction.ts applies these to the graph via graphDb (insertFacts, updateFacts, deleteFacts, save). The same pipeline can also write working-memory vectors.

  • How it’s used: At conversation start, Inject Knowledge (if enabled) extracts entities from the user prompt using an optional entity-extractor model. searchGraphByEntities() loads the graph, queries facts where source_id or target_id is in the entity list, and returns (subject, predicate, object) snippets. Those are merged with working-memory snippets and document snippets and can be re-ranked before injection (a sketch of this lookup follows after this list).

  • API: createGraphDb(workspaceId, agentId) returns insertFacts, updateFacts, deleteFacts, queryGraph(sql), save(), close(). Graph search lives in graphSearch.ts. See the graph database doc and graphDb.ts for details.
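
To make the entity lookup concrete, here's a minimal sketch of what a searchGraphByEntities()-style query could look like on top of the createGraphDb API above; the import path, the exact SQL, and the row shape are assumptions, not the actual graphSearch.ts implementation:

// Hedged sketch of entity-based graph lookup; not the actual graphSearch.ts code.
// The import path and the async-ness of createGraphDb are assumptions.
import { createGraphDb } from "./graphDb";

async function lookupFactsByEntities(
  workspaceId: string,
  agentId: string,
  entities: string[]
): Promise<string[]> {
  if (entities.length === 0) return [];

  const graph = await createGraphDb(workspaceId, agentId);
  try {
    // Facts table columns: id, source_id, target_id, label, properties
    const list = entities.map((e) => `'${e.replace(/'/g, "''")}'`).join(", ");
    const rows = await graph.queryGraph(
      `SELECT source_id, label, target_id FROM facts
       WHERE source_id IN (${list}) OR target_id IN (${list})`
    );
    // Return (subject, predicate, object) snippets ready for injection
    return rows.map((r: any) => `${r.source_id} ${r.label} ${r.target_id}`);
  } finally {
    await graph.close();
  }
}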


Writing memory: consistency and scale

  • When: On conversation create/update we extract facts, embed them via OpenRouter using the thenlper/gte-base embedding model, and write to working memory.
  • How: Writes go through Amazon SQS FIFO (FIFO = first-in, first-out) queues with message group {agentId}:{grain} so only one writer per DB runs at a time—no races on the same LanceDB instance.
  • Flow: API → extract facts → embed → enqueue → SQS handler → LanceDB. For more on how we use message groups, see FIFO queue message grouping. That keeps Lambda and DB usage decoupled and serialized; a producer-side sketch follows below.
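
A minimal producer-side sketch, assuming the AWS SDK v3 SQS client; the queue-URL environment variable, deduplication strategy, and message-body shape are illustrative, not the production schema:

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Hypothetical enqueue helper: one message group per {agentId}:{grain}, so SQS FIFO
// delivers (and the consumer writes to LanceDB) strictly in order for that DB.
async function enqueueMemoryWrite(agentId: string, grain: string, facts: unknown[]) {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.MEMORY_WRITE_QUEUE_URL, // assumed env var
      MessageGroupId: `${agentId}:${grain}`,        // serializes writers per DB
      MessageDeduplicationId: `${agentId}:${grain}:${Date.now()}`, // or content-based dedup
      MessageBody: JSON.stringify({ agentId, grain, facts }),
    })
  );
}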

Summarization pipeline

We run LLM summarization from finer grain to coarser (e.g. working → daily, daily → weekly). Prompts are grain-specific: daily focuses on events, people, and facts; weekly on narrative; monthly and above on themes and milestones.

  • Schedule: Cron-like Lambda invocations, one rate per grain (1 day, 7 days, 30 days, 90 days, 365 days). The scheduled summarization tasks and their rates are defined in app.arc (Arc / Architect).
  • Process: Query the source grain by time window → concatenate content → call LLM → embed summary → write to the next grain. Same pattern at every level; only the time window and prompt change.
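
Sketched in code, one summarization step might look like this; the helpers are passed in as typed stand-ins because the real module names aren't part of this post:

// Hypothetical sketch of one summarization step (finer grain -> coarser grain).
type SummarizeDeps = {
  querySourceGrain: (agentId: string, grain: string, from: Date, to: Date) => Promise<{ text: string }[]>;
  callSummarizerLLM: (targetGrain: string, content: string) => Promise<string>; // grain-specific prompt
  embedText: (text: string) => Promise<number[]>;                               // e.g. thenlper/gte-base
  writeToGrain: (agentId: string, grain: string, rec: { text: string; vector: number[] }) => Promise<void>;
};

async function summarizeGrain(
  deps: SummarizeDeps,
  agentId: string,
  source: string,
  target: string,
  from: Date,
  to: Date
): Promise<void> {
  // 1. Query the source grain by time window
  const records = await deps.querySourceGrain(agentId, source, from, to);
  if (records.length === 0) return;

  // 2. Concatenate content and summarize with the target grain's prompt
  const summary = await deps.callSummarizerLLM(target, records.map((r) => r.text).join("\n"));

  // 3. Embed the summary and write it to the coarser grain
  const vector = await deps.embedText(summary);
  await deps.writeToGrain(agentId, target, { text: summary, vector });
}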

Search and how agents use memory

Agents get a search_memory tool with grain, minimumDaysAgo / maximumDaysAgo, maxResults, and optional queryText for semantic search. Behavior: vector search within the chosen grain with optional date filter; results can be date-prefixed for the model.

Inject Knowledge (optional) combines two memory sources at conversation start:

  1. Working-memory vector search — semantic similarity to the prompt.
  2. Graph facts — entity-based: extract entities from the prompt, then fetch facts where subject or object matches.

Results are merged with document snippets and optionally re-ranked (e.g. via a cross-encoder or similar) before being injected ahead of the first user message.
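
A hedged sketch of that assembly step; every helper in the deps object is a hypothetical stand-in for the real lookup functions, and the merge order is illustrative:

// Hypothetical sketch of Inject Knowledge at conversation start (not the actual code).
type InjectDeps = {
  extractEntities: (prompt: string) => Promise<string[]>; // optional entity-extractor model
  searchWorkingMemory: (agentId: string, query: string) => Promise<string[]>;
  searchGraphByEntities: (workspaceId: string, agentId: string, entities: string[]) => Promise<string[]>;
  searchDocuments: (workspaceId: string, query: string) => Promise<string[]>;
  rerank?: (query: string, items: string[]) => Promise<string[]>; // e.g. cross-encoder
};

async function buildInjectedContext(
  deps: InjectDeps,
  workspaceId: string,
  agentId: string,
  prompt: string
): Promise<string> {
  const entities = await deps.extractEntities(prompt);

  const [memory, facts, docs] = await Promise.all([
    deps.searchWorkingMemory(agentId, prompt),                  // semantic similarity
    deps.searchGraphByEntities(workspaceId, agentId, entities), // subject/object match
    deps.searchDocuments(workspaceId, prompt),
  ]);

  const merged = [...memory, ...facts, ...docs];
  const ranked = deps.rerank ? await deps.rerank(prompt, merged) : merged;
  return ranked.join("\n"); // injected ahead of the first user message
}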

Example search_memory call:

const results = await searchMemory({
  agentId: "agent-123",
  workspaceId: "workspace-456",
  grain: "daily",
  minimumDaysAgo: 0,
  maximumDaysAgo: 30,
  maxResults: 10,
  queryText: "React project discussion",
});

Retention and cost control

Unlimited growth would be expensive and noisy. We enforce per-plan retention per grain and run a daily cleanup job.

Plan      Working     Daily       Weekly     Monthly     Quarterly      Yearly
Free      48 hours    30 days     6 weeks    6 months    4 quarters     2 years
Starter   120 hours   60 days     12 weeks   12 months   8 quarters     4 years
Pro       240 hours   120 days    24 weeks   24 months   16 quarters    8 years

Older data is pruned per grain so storage stays predictable.

  • Cleanup: One daily job; for each workspace/agent/grain we compute the cutoff from the plan, delete older records, and process from the most granular grain (working) upward.
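
The cutoff arithmetic is simple; here's a hedged sketch with the retention table above converted to days (months and quarters approximated, helper names assumed):

// Hypothetical per-plan retention in days per grain (mirrors the table above;
// hours converted to days, months ≈ 30 days, quarters ≈ 91 days).
const RETENTION_DAYS: Record<string, Record<string, number>> = {
  free:    { working: 2,  daily: 30,  weekly: 42,  monthly: 182, quarterly: 365,  yearly: 730 },
  starter: { working: 5,  daily: 60,  weekly: 84,  monthly: 365, quarterly: 730,  yearly: 1460 },
  pro:     { working: 10, daily: 120, weekly: 168, monthly: 730, quarterly: 1460, yearly: 2920 },
};

// Cutoff instant: records older than this are deleted by the daily cleanup job.
function retentionCutoff(plan: string, grain: string, now = new Date()): Date {
  const days = RETENTION_DAYS[plan][grain];
  return new Date(now.getTime() - days * 86_400_000);
}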

Implementation details engineers care about


Performance and operational notes

  • Serialization: SQS message groups prevent concurrent writes to the same DB.
  • Batching: Summarization processes in batches where applicable.
  • Cleanup: Daily retention keeps storage predictable.
  • Vector search: LanceDB per grain (approximate nearest neighbor search over embeddings); we don’t dive into LanceDB internals here—just “vector index per grain.”

Future directions

  • Configurable summarization prompts per agent.
  • Custom retention per workspace.
  • Richer search (by person, event type) and memory analytics.
  • Cross-agent memory in team workspaces.

Wrapping up

Agents get long-term memory from two subsystems:

  1. Stratified vector memory — grains, progressive summarization, FIFO writes, and retention give queryable, cost-bounded semantic memory.
  2. Graph knowledge base — (subject, predicate, object) facts populated by memory extraction and retrieved by entity at inject time.

Together they support both “what was said when” and “what we know about X.”

Go deeper: See the agent memory system doc, graph database doc, vector database doc, and agent configuration doc (Inject Knowledge). You can also dig into the Helpmaton repo on GitHub and the file paths above.


