A Practical Deep Dive Into Memory Optimization for Agentic Systems (Part C)

AI Agents Crash Course—Part 17 (with implementation).


Recap

In the previous part (Part 16) of this AI agents crash course, we focused on memory as a first-class design concern for agentic systems.

We began by understanding sequential memory, where the entire conversation history is appended and sent to the LLM on every turn.

Next, we introduced sliding window memory, which places a hard bound on context size by retaining only the most recent messages. This significantly stabilizes token usage and latency, but at the cost of forgetting older information once it falls outside the window.

To address this limitation, we moved on to summarization-based memory, where older conversation segments are compressed into a running summary while recent turns remain in full detail.

We then extended this idea further with compression and consolidation, introducing importance-aware memory management.

Throughout the chapter, we grounded every concept in a customer support chatbot example, analyzing how each memory strategy affects behavior, latency, and token consumption in realistic workflows.

If you haven’t gone through the previous chapter yet, we strongly recommend reading it first, as it lays the foundation for the next step in our memory optimization journey.

A Practical Deep Dive Into Memory Optimization for Agentic Systems (Part B)
AI Agents Crash Course—Part 16 (with implementation).

In this chapter, we will continue our discussion on memory optimization, learning about:

  • Retrieval-based memory
  • Hierarchical memory
  • OS-like memory

Let’s continue.


Retrieval-based memory

All the techniques we have discussed so far have been short-term:

  • Sequential memory: send the entire conversation so far.
  • Sliding window: keep only the last few turns.
  • Summarization: compress older parts into a running summary.

All of these live inside a single thread. Once the thread ends, that memory is gone.

For real, production-grade agentic systems, that’s not enough. You want your agent to:

  • remember a user’s preferences across conversations
  • recall past issues for the same user/account
  • reuse knowledge learned last week in a new session

Instead of keeping everything inside the thread, we need to:

  • store durable memory items in a long-term store
  • retrieve only the relevant ones for the current query
  • stitch them into the context alongside the current conversation

In Part 15, we briefly saw how long-term memory works in LangGraph using the store abstraction.

A Practical Deep Dive Into Memory Optimization for Agentic Systems (Part A)
AI Agents Crash Course—Part 15 (with implementation).

Here we’ll walk through it step by step, and then wire it into a retrieval-based memory setup for our customer support agent.

Memory store in LangGraph

Checkpointers give us continuity inside a thread. As long as we use the same thread ID, LangGraph will restore the previous state and continue the conversation from where we left off.

But checkpointers alone cannot:

  • share information between different threads
  • carry knowledge across sessions or tickets
  • build a persistent profile for a user

Imagine a user who opens three support tickets over a month:

  • Ticket 1 – billing issue
  • Ticket 2 – access issue
  • Ticket 3 – workspace performance issue

With only checkpointers, each ticket is its own island. The agent has no way to reuse what it learned in Ticket 1 when answering Ticket 2 or 3.

This is exactly why we need stores: an external database that holds important information from conversations so that it can be retrieved on demand.

LangGraph stores long-term memories as JSON documents in a store.

Each memory is organized under:

  • A namespace, which acts like a folder and is represented as a tuple such as (user_id, "memories").
  • A key, which acts like a file name within that folder.

Moreover, you can write to and read from the store from any thread. In LangGraph, we use the InMemoryStore to implement this.

The good thing about LangGraph is that it takes care of all the backend infrastructure required for the store implementation. We don't need to worry about managing it but it is still useful to understand the basics.

Here’s an implementation of the InMemoryStore:
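A minimal sketch of what this could look like (the user ID value is just an illustrative placeholder):

```python
from langgraph.store.memory import InMemoryStore

# Create the store: essentially a dictionary held in process memory
store = InMemoryStore()

# Namespaces are tuples of strings; here we group memories per user
user_id = "user_123"  # hypothetical user identifier
namespace = (user_id, "memories")
```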

Here:

  • InMemoryStore creates a store which is essentially just a dictionary in memory.
  • Memories are grouped by a namespace, which is a tuple of strings.

We use (user_id, "memories") here, but you could just as well use (project_id, "docs") or (team_id, "preferences"), depending on what suits your use case.

To save a memory, we use the put method:
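A sketch of how this could look, continuing with the namespace from above (the memory content is just an example):

```python
import uuid

# Unique key for this memory within the namespace
memory_id = str(uuid.uuid4())

# Any JSON-serializable document works as the value
memory = {"food_preference": "I like pizza"}

store.put(namespace, memory_id, memory)
```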

In this snippet:

  • memory_id is a unique key inside this namespace.
  • memory is any JSON-serializable document such as a simple dictionary.
  • put stores this (key, value) pair under the given namespace.

To read memories back, we use the search method:
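Roughly like this, using the same namespace:

```python
# Retrieve the memories stored under this namespace
memories = store.search(namespace)

for item in memories:
    print(item.key, item.value, item.created_at)
```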

Each memory item returned by search is an Item class with:

  • value: the memory or user preference saved
  • key: the unique key used (memory_id)
  • namespace: the namespace for this memory
  • created_at / updated_at: timestamps

That’s the core concept:

  • put(namespace, key, value) to save a memory
  • search(namespace, ...) to retrieve a memory

So far, search returns all the items stored under a namespace, with at best simple keyword-style matching. This doesn’t scale well once a user has hundreds or thousands of memories.

In real systems, we want semantic retrieval: text is first converted into vector embeddings, and these embeddings are stored in an index. At query time, we embed the query and find the closest memories using a similarity metric.

Let us first configure our embedding model before moving on to the implementation. We will again be using OpenRouter for this.
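A sketch of that configuration (the model name, base URL, and environment variable are assumptions; any OpenAI-compatible embeddings endpoint can be wired in the same way):

```python
import os
from langchain_openai import OpenAIEmbeddings

# Assumption: the provider exposes an OpenAI-compatible /embeddings endpoint
embed = OpenAIEmbeddings(
    model="text-embedding-3-small",            # assumed embedding model (1536 dims)
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed environment variable
    base_url="https://openrouter.ai/api/v1",   # OpenRouter-style base URL
)
```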

InMemoryStore can be configured with an embedding index like this:
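A sketch using the embed function defined above:

```python
from langgraph.store.memory import InMemoryStore

store = InMemoryStore(
    index={
        "embed": embed,                      # embedding function defined above
        "dims": 1536,                        # dimensionality of the embedding vectors
        "fields": ["food_preference", "$"],  # which fields of the value to embed
    }
)
```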

Here:

  • embed is an embedding function (OpenAI embeddings).
  • dims is the dimensionality of the embedded vectors.
  • fields tells the store which fields of our values to embed.
    • "food_preference" will embed the value of this key in the dict (i.e. I like pizza).
    • "$" is a catch-all for the entire object.

Now, when we put memories into the store, vectors are computed and stored behind the scenes. We can then issue semantic queries:
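For example (a sketch reusing the memory saved earlier):

```python
# Saving a memory now also computes and stores its embedding
store.put(namespace, memory_id, {"food_preference": "I like pizza"})

# The query is embedded and compared against the stored memories
memories = store.search(
    namespace,
    query="What does the user like to eat?",
    limit=3,
)

for item in memories:
    print(item.value)
```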

Here:

  • query is a natural language search string.
  • limit controls how many top matches you want.
  • The store embeds the query, computes similarity with stored embeddings, and returns the best matches.

We can also control which parts of the memories get embedded by configuring the fields parameter or by specifying the index parameter when storing memories:
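A sketch of the per-item index parameter (the field names and keys are illustrative):

```python
# Embed only the "food_preference" field of this memory
store.put(
    namespace,
    "pref-1",
    {"food_preference": "I like pizza", "note": "mentioned during onboarding"},
    index=["food_preference"],
)

# Or store an item without embedding it at all (it won't surface in semantic search)
store.put(namespace, "raw-log-1", {"event": "ticket opened"}, index=False)
```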

This is the core of retrieval-based memory systems:

  • define what you save
  • define how it is embedded
  • ask semantic questions and get back relevant chunks of memory

Before we implement this strategy for our support agent, it is important to note that the InMemoryStore is perfect for local experiments, unit tests, and small prototypes, but not enough for a production-grade app.

For production workloads, we almost always want a robust vector database backend that can handle millions of memory items, is scalable with efficient read/write operations and supports low-latency search and retrieval.
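For instance, LangGraph also ships database-backed stores that expose the same put/search interface. A rough sketch with the Postgres-backed store (the connection string is a placeholder, and the index configuration mirrors the one above):

```python
from langgraph.store.postgres import PostgresStore  # requires langgraph-checkpoint-postgres

DB_URI = "postgresql://user:password@localhost:5432/support_agent"  # placeholder

with PostgresStore.from_conn_string(DB_URI, index={"embed": embed, "dims": 1536}) as store:
    store.setup()  # create the required tables on first use
    store.put(("user_123", "memories"), "pref-1", {"food_preference": "I like pizza"})
```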

Implementation

Now that the store and semantic search pieces are clear, we can design a retrieval-based memory layer for our customer support agent.

Here's the workflow:

  • We add important long-term facts about a user into the store (plan, workspace names, previous issues, etc.).
  • When the user opens a new ticket and asks a question, the agent:
    • looks up these long-term memories via semantic search
    • injects the relevant ones into the prompt
    • answers using both the current conversation and the retrieved context

We’ll keep the design simple and focused on retrieval. Let's define our state:
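A minimal sketch of what that state could look like (the exact field names are assumptions):

```python
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph.message import add_messages

class SupportState(TypedDict):
    # Conversation for the current ticket; new messages are appended
    messages: Annotated[list, add_messages]
    # Long-term memories retrieved from the store for the current query
    retrieved_memories: list[str]
```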
