The memory layer for AI agents.

Your app sends us conversation turns. We extract structured knowledge, store it across a typed graph and a hybrid vector index, and return ranked, grounded context on every turn — in under 300 ms, isolated per end-user, with ~95% fewer context tokens than a rolling history.

The recommended memory layer for
OpenAI
Anthropic
ChatGPT apps
LangChain
Vercel AI SDK
[Live pipeline view · session user.session.018f3a · p50 142ms · tokens −94%]
Conversation turns ("I live in Berlin", "prefer weekend runs", "noted — Berlin it is") → async queue (RabbitMQ) → extraction model (LLM extract → 12-cat taxonomy: fact "lives in Berlin", pref "weekend runs") → entity resolver ("Burlin" ~ Berlin, "Alex" ~ Alexander) → hybrid vector index (dense 384d · sparse BM25 · 17-field payload index) + typed knowledge graph (LOCATED_AT edges) + cache (embeds · entities · api-keys) → grounded context in <300ms (decompose 4ms · search 86ms · graph 21ms · rank 12ms)
<300ms · p50 retrieval, fast-path
−95% · context tokens vs. rolling-history
12 categories · typed knowledge taxonomy
2 endpoints · your entire integration surface
How it works

An async write path and a fast read path. That's all.

Ingestion returns 202 in milliseconds; a worker extracts typed knowledge, resolves entities, and writes to a hybrid vector index and a typed graph in parallel. Retrieval runs LLM-free on the fast path — fuzzy entity match, parallel hybrid search, bounded graph expansion — and assembles grounded context per request.

Write path: POST /v1/memory/ingest (returns 202 · <20ms) → async queue (idempotent · UUIDv5) → extraction model (12-cat taxonomy · tone · confidence · no-training API · Claude-class) → entity resolver (canonicalize · alias · typed edges) → hybrid vector index (dense · sparse · 17 filters) + typed graph store (entities · typed edges)
Read path: POST /v1/memory/get (fuzzy · parallel search · graph expand) → grounded context + per-stage meta · p50 < 300ms · sync
1. Ingest the turn

One endpoint. Returns 202 in milliseconds. Idempotent — re-sending the same conversation is safe.

POST /v1/memory/ingest
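
For the raw REST shape, a minimal sketch; the host, auth header, and JSON field names here are assumptions, mirroring the SDK call shown later on this page:

# pip install requests
# Illustrative ingest call; host and field names are assumptions.
import requests

resp = requests.post(
    "https://api.getmem.example/v1/memory/ingest",  # placeholder host
    headers={"Authorization": "Bearer gm_live_..."},
    json={
        "user_id": "uid",
        "session_id": "018f3a",
        "messages": [{"role": "user", "content": "I live in Berlin"}],
    },
    timeout=5,
)
assert resp.status_code == 202  # accepted; extraction continues async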
2. Extract structured knowledge

A worker reads each turn and a short recent-history window, then emits typed facts across a 12-category taxonomy with tone and confidence.

observations → knowledge items
3. Resolve entities

People, places, orgs, and topics are canonicalized with aliases and stable IDs. Relationships are typed — not bag-of-words.

LOCATED_AT · PREFERS · WORKS_AT …
4. Dual-write in parallel

Knowledge lands in a hybrid vector index (dense + sparse, 17-field payload) and a typed graph store simultaneously.

vector · graph · cache
5. Retrieve in under 300 ms

LLM-free fast path: heuristic decompose, fuzzy entity match, parallel hybrid search, bounded graph expansion, rank, assemble. Every response ships per-stage timings.

POST /v1/memory/get · sync
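
A sketch of the response shape, assembled from the fields this page names (grounded context plus per-stage meta); the exact key names are an assumption:

# Assumed response shape for /v1/memory/get; key names are illustrative.
response = {
    "context": ("[user profile] Alexander lives in Berlin\n"
                "[preference] prefers weekend runs"),
    "meta": {
        "decompose_ms": 4, "search_ms": 86, "graph_ms": 21,
        "rank_ms": 12, "total_ms": 142,
        "entities": ["Berlin"], "context_tokens": 250,
    },
}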
Extraction

Conversations become structured knowledge — not chunks.

Every turn is parsed into categorized, typed facts with tone, confidence, and provenance. Chunks of raw text can't power grounded responses; typed knowledge can.

input · raw conversation · session 018f3a
output · extracted knowledge · 12-category taxonomy
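
A minimal sketch of that transformation; the category labels and field names are illustrative, not the published taxonomy:

# input: raw turns from session 018f3a
turns = [
    {"role": "user", "content": "I live in Berlin"},
    {"role": "user", "content": "prefer weekend runs"},
]

# output: typed knowledge items (illustrative categories and fields)
knowledge = [
    {"category": "profile.location", "fact": "lives in Berlin",
     "entities": ["Berlin"], "tone": "neutral", "confidence": 0.97,
     "provenance": {"session": "018f3a", "turn": 0}},
    {"category": "preference.activity", "fact": "prefers weekend runs",
     "entities": [], "tone": "positive", "confidence": 0.92,
     "provenance": {"session": "018f3a", "turn": 1}},
]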
Retrieval

Five stages. All LLM-free. All under 300 ms.

Most memory systems re-run an LLM at retrieve time. That's why their latency budgets shred your agent's response time. We don't.

Heuristic decompose

Key terms, likely entities, and tags pulled from the query and recent turns without a model call. Falls back to LLM-mode only for multi-hop questions.

query: "when was I last in Berlin"
terms: ["last","Berlin"] entities: [Berlin] tags: [location,time]

Fuzzy entity match

Typos, abbreviations, morphological variants, and diacritic drift all normalize to the same canonical entity ID. Partial-ratio matching with per-token fallback.

"Burlin" Berlin (0.92)
"Alex" Alexander (0.88)
"в Берлине" Берлин (0.95)

Parallel hybrid search

Four to five indexes queried concurrently: dense semantic, sparse lexical, entity-sparse, tag-chain, recent-session. Every query pre-filtered by tenant.

dense: 42ms · sparse: 38ms
entity: 21ms · tag: 19ms
fan-out · fan-in · merge
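
A minimal fan-out/fan-in sketch of the pattern; the index names come from the list above, and the per-index query function is a stand-in:

import asyncio

INDEXES = ["dense", "sparse", "entity-sparse", "tag-chain", "recent-session"]

async def search_index(index, query, tenant):
    # Stand-in for one index query; every real call carries the tenant filter.
    await asyncio.sleep(0.02)  # simulated network latency
    return [{"index": index, "text": f"hit for {query!r}", "score": 0.9}]

async def hybrid_search(query, tenant):
    # Fan out to every index concurrently, fan in, merge the result lists.
    batches = await asyncio.gather(
        *(search_index(i, query, tenant) for i in INDEXES)
    )
    return [hit for batch in batches for hit in batch]

hits = asyncio.run(hybrid_search("when was I last in Berlin", "dev:proj:user"))
print(len(hits), "hits merged from", len(INDEXES), "indexes")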

Graph expansion

Matched entities seed a bounded graph traversal that pulls in directly related facts. User asks about Berlin → we surface "Alexander lives in Berlin" even when it wasn't top-ranked.

Berlin ←LOCATED_AT— u
u —PREFERS→ weekend runs
depth: 2 · max-edges: 12
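
In sketch form, a bounded breadth-first walk with the limits shown above (depth 2, max 12 edges) over a toy adjacency list:

from collections import deque

# Toy graph; edges stored from both endpoints so incoming relations surface.
EDGES = {
    "Berlin": [("LOCATED_AT", "u")],
    "u": [("PREFERS", "weekend runs")],
}

def expand(seeds, max_depth=2, max_edges=12):
    # Bounded traversal seeded by the matched entities.
    facts, seen = [], set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier and len(facts) < max_edges:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for rel, dst in EDGES.get(node, []):
            facts.append((node, rel, dst))
            if dst not in seen:
                seen.add(dst)
                frontier.append((dst, depth + 1))
    return facts[:max_edges]

print(expand(["Berlin"]))
# [('Berlin', 'LOCATED_AT', 'u'), ('u', 'PREFERS', 'weekend runs')]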

Rank & assemble

Weighted blend of similarity, recency, confidence, and tone. Results are deduplicated and formatted into a structured prompt block your model can consume directly.

[user profile] Alexander lives in Berlin
[preference] weekend runs
[related] Berlin ← LOCATED_AT ← u
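
The published weights aren't listed here, so this sketch blends the four signals with made-up weights:

# Weighted blend of similarity, recency, confidence, tone; weights illustrative.
def score(hit):
    w_sim, w_rec, w_conf, w_tone = 0.5, 0.2, 0.2, 0.1
    return (w_sim * hit["similarity"] + w_rec * hit["recency"]
            + w_conf * hit["confidence"] + w_tone * hit["tone"])

hits = [
    {"text": "[user profile] Alexander lives in Berlin",
     "similarity": 0.91, "recency": 0.80, "confidence": 0.95, "tone": 0.5},
    {"text": "[preference] weekend runs",
     "similarity": 0.74, "recency": 0.95, "confidence": 0.90, "tone": 0.5},
]

# Sort (dedupe omitted) and format the prompt block the model consumes.
print("\n".join(h["text"] for h in sorted(hits, key=score, reverse=True)))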

Every stage, observable

Every response carries a meta block with per-stage latency, matched entities, and token count. If memory is the bottleneck, you'll see it.

decompose_ms: 4
search_ms: 86 graph_ms: 21
rank_ms: 12 total_ms: 142
Built for

Agents that need deep, accurate personalization.

Anywhere the quality of a response depends on what the system already knows about the user — getmem is the memory layer.

Healthcare

Patient agents that remember.

Medications, allergies, prior symptoms, care preferences, provider instructions — surfaced on every turn without asking the patient to repeat themselves.

"I've got that same headache again"
Your agent already knows: patient has hypertension, is on lisinopril 10mg, reported similar migraines 6 weeks ago, and prefers non-pharma options first.
HIPAA-ready isolation · per-patient scoping · audit log
Personal AI

Companions that actually know you.

Goals, preferences, relationships, recurring contexts, long-running projects — carried across sessions, across devices, across weeks and months.

"How's the marathon training going?"
Agent remembers: Berlin marathon in Sep, target 3:45, prefers weekend long runs, last week's mileage was 48km, and you mentioned a knee twinge on Tuesday.
cross-session · long-horizon · per-user
Integrations

Drop in next to your existing stack. No router, no chat framework, no new model.

Two API calls, or a one-line adapter for the runtimes you already ship. getmem composes with OpenAI, Anthropic, and ChatGPT-style apps without changing your deployment topology.

Memory, alongside your model.

Call /get before you prompt, inject the grounded context block, call /ingest fire-and-forget after the turn completes. That's it.

Works with any model. OpenAI, Anthropic, open weights, on-prem — we touch the context block, not the model.
SDKs for Python, TypeScript, Go. Plus a REST API for everything else. Adapters for LangChain, LlamaIndex, Vercel AI SDK.
No vendor lock-in on storage. Export your memories as JSON at any time. Full-data export is a single API call.
Per-stage meta on every call. Latency budget blown? Your logs will say exactly where.
# pip install getmem-ai
import getmem_ai as getmem

mem = getmem.init("gm_live_...")
msg = "I live in Berlin"  # incoming user message

# Get grounded context before the LLM call
ctx = mem.get("uid", query=msg)["context"]
reply = "noted — Berlin it is"  # stand-in for your model's reply, generated with ctx

# Save both roles after each turn, fire-and-forget
mem.ingest("uid", messages=[
    {"role": "user", "content": msg},
    {"role": "assistant", "content": reply}
])
Compared

What rolling your own actually costs.

Three paths to giving agents memory. You can take any of them — we'd just rather you ship this week.

 
 
getmem · managed memory layer
Setup (time to first call): 2 API calls
Knowledge extraction (turn → typed facts): 12-category taxonomy, included
Entity resolution (fuzzy · multilingual): built in
Typed graph (relations · traversal): typed edges, bounded traversal
Latency · p50 (retrieval budget): < 300ms
Context tokens (per turn sent to model): ~5% of history
Per-user isolation (tenant safety): filter-level, provable
Auditability (observability per call): per-stage meta on every call

Vector DB + glue · pinecone · weaviate · qdrant
Setup: 4–6 weeks of glue
Knowledge extraction: you write & tune prompts
Entity resolution: you build it
Typed graph: extra service
Latency · p50: 300–800ms
Context tokens: 20–40%
Per-user isolation: app-layer only
Auditability: build your own

Full history in context · do nothing, send it all
Setup: trivial & bad
Knowledge extraction: none
Entity resolution: none
Typed graph: none
Latency · p50: token-bound
Context tokens: 100%
Per-user isolation: inherent
Auditability: none
Pricing

Pay for what you use.
$20 free on signup, then $10 in free credit every month.

Top up your balance, get locked-in lower per-call prices for as long as the balance holds. No seats, no minimums, no commitments. $20 free on signup, then $10 added every month automatically.

1. Pick a deposit

The bigger the deposit, the cheaper every call.
[Interactive calculator · inputs: active users / mo (e.g. 2,000), turns / user / day (e.g. 6) · chart: LLM tokens per turn as a conversation grows, rolling history vs. with getmem]
Rolling-history agents pay for every prior turn, forever. getmem keeps the prompt small — because typed knowledge beats raw transcripts.
[Calculator outputs: ingest calls / month · get calls / month · ingest cost · get cost · expected monthly spend]

A deposit (e.g. $250) acts as a prepaid balance. When it drops to zero, top up again — the same tier prices lock in at the new deposit level.

[Savings breakdown: LLM tokens / turn, rolling history vs. with getmem context · LLM bill without memory vs. with memory + getmem calls · net saved / month]
Assumes GPT-4.1-class pricing ($2.50 / 1M in · $10 / 1M out), a rolling history that grows to ~4,000 input tokens / turn by turn 10, and getmem's grounded context block averaging ~250 input tokens / turn. Output tokens held constant at 350.
Deposit range · Ingest per call · Get per call · Discount
$0 – $50 (Starter): $0.002000 · $0.000500 · base
$50 – $250 (Indie): $0.001600 · $0.000400 · 20%
$250 – $1,000 (Growth): $0.001200 · $0.000300 · 40%
$1,000+ (Scale): $0.000800 · $0.000200 · 60%

All plans include: 12-category taxonomy · entity resolution · typed graph · per-user isolation · full export · $20 free credit on signup + $10 / month added automatically.

Trust

Tenant-isolated by default. No training on your data. Every write audited.

Filter-level isolation

Every vector search, every graph query, and every cache key is scoped by (developer, project, user). Cross-tenant reads are architecturally impossible, not just policed.
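
In sketch form, every query body gets the scope merged in before it touches an index; the filter schema below is an assumption, not getmem's actual payload:

# Illustrative tenant scope; the real filter schema is internal to getmem.
scope = {"developer": "dev_1", "project": "proj_9", "user": "uid"}

def scoped(query_filter):
    # Merge the tenant scope into every vector, graph, and cache query;
    # a query without it never reaches an index.
    return {**query_filter, **scope}

print(scoped({"category": "preference"}))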

Embeddings & extraction — no-train

Customer conversations never train shared models. Upstream vendor calls run under no-training API terms; our own models never see cross-tenant data.

Every stage, timed

Per-stage latency, matched entities, and token count on every response. Structured logs carry a request ID through every hop. If memory is slow, you'll see exactly where.

Deterministic writes

Observation IDs are deterministic UUIDv5s derived from (dev, project, user, session, index). Re-ingesting the same conversation is safe — identical observations, no dupes.
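
The standard library reproduces the idea directly; the namespace UUID below is illustrative, not getmem's:

import uuid

# Illustrative namespace; getmem's actual namespace UUID is internal.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "getmem.example")

def observation_id(dev, project, user, session, index):
    # Same five inputs always yield the same UUID, so re-ingesting a
    # conversation produces identical observations instead of duplicates.
    return uuid.uuid5(NAMESPACE, f"{dev}:{project}:{user}:{session}:{index}")

assert observation_id("dev_1", "proj_9", "uid", "018f3a", 0) == \
       observation_id("dev_1", "proj_9", "uid", "018f3a", 0)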

Quotas, never silent drops

Ingest failures propagate as structured errors with stable codes. Exhausted quota returns 402 — never a 200 with an empty body.

Export, always

A single API call returns your entire memory corpus as JSON. Migrate away, archive, or replay into a new tenant. Your data is never held hostage.

FAQ

Questions developers actually ask.

How is this different from a vector DB?
A vector DB stores embeddings and indexes. getmem is one layer up: raw conversation in, typed structured knowledge out, across both a hybrid vector index and a typed graph store. If you want to manage chunks and similarity, use a vector DB. If you want your agent to remember things, use getmem.
How is this different from document RAG?
RAG is for documents. getmem is for conversations. They compose — we're the memory of what the user said and meant across sessions; RAG is the knowledge your product already has on a shelf.
Do you train on our data?
No. Embeddings and extraction run under no-training vendor terms. Our own models, when used, never see cross-tenant data. Isolation is enforced at the filter level on every query.
What happens to superseded memories?
Nothing silent. When a fact changes (say, the user switched from VS Code to Neovim) we detect the contradiction at ingest time, mark the old item status=superseded, and write the new one — with a pointer between them. Your audit log is default-on.
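
In sketch form, with illustrative field names:

# Assumed shape of a superseded pair; field names are illustrative.
old = {"id": "obs_a", "fact": "prefers VS Code",
       "status": "superseded", "superseded_by": "obs_b"}
new = {"id": "obs_b", "fact": "prefers Neovim",
       "status": "active", "supersedes": "obs_a"}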
What's the real latency story?
Retrieval is sync and designed to stay under 300 ms on the fast path. Ingest is async and returns 202 in milliseconds. Every response carries meta.total_ms plus per-stage breakdowns. No hidden budget.
Can I self-host?
Single-tenant deploys are available on the Scale tier. The Managed plan is what most developers run in production — same binary, same contract, same latency envelope.
How does the deposit pricing actually work?
Top up a balance; your deposit size locks a per-call price for as long as the balance lasts. Larger deposits unlock bigger discounts (20% / 40% / 60%). When the balance runs down, top up again at whatever level fits your next month. No subscriptions, no minimums.
Start building

Give your agent a memory.

Claim your API key and get $20 free credit instantly — no card needed. Then $10 added every month, automatically.

No credit card required. $20 on signup, then $10 free every month.