Memory should outlive models.
It was around three in the morning when I saw the announcement. Anthropic had just launched Claude Managed Agents, a fully hosted platform for building and running AI agents, with sandboxed execution, checkpointing, credential management, and scoped permissions all handled for you. Weeks of infrastructure work, gone. Ten times faster to production, the blog post claimed.
My first thought was simple: my product just died.
I'm an indie founder. I've been building a memory infrastructure company called Nunchi for the better part of a year: a reference implementation called Nexus, an enterprise backend called MaaS, a personal variant called Norfolk, and most recently a model-neutral coding harness I've been calling 3122. The whole thing rests on the idea that AI agents need a memory layer that isn't bolted to any single model provider. And here was the largest, most thoughtful lab in the space announcing exactly the kind of infrastructure that could, in principle, absorb the entire problem I'd been working on.
Big company decisions rattling small company plans is the ordinary weather of indie building. You get used to it. But three in the morning is a bad time to be rational about anything.
I made coffee and went back to read the engineering blog more carefully. Three hours later, I'd arrived at the opposite conclusion.
Rewind: 83.2%
A few months ago, I started running Nexus against LongMemEval_S. The first run was not good. Neither was the second. By the sixth iteration I'd learned three things that, in retrospect, felt less like engineering discoveries and more like rules I should have known from the start:
- Never delete atoms.
- Never reorder atoms.
- Never inject non-similarity signals into retrieval.
Each rule came from a different failure mode, each one painful in its own way. The final number was 83.2%, the same as GPT-4o on the same benchmark, and well above GPT-4o-mini at 73.8% or a large open-weight model served through Groq at 63.8%.
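If you want the three rules as code rather than prose, this is roughly the shape they imply. The names here (Atom, AtomStore, a single retrieve method) are illustrative, not Nexus's actual interface, but the constraints are the real ones: append-only storage, untouched ordering, similarity as the only ranking signal.

```python
# Illustrative sketch of the three rules, not Nexus's actual API.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Atom:
    text: str
    vector: np.ndarray  # embedding of `text`


@dataclass
class AtomStore:
    atoms: list[Atom] = field(default_factory=list)

    def append(self, atom: Atom) -> None:
        # Rules 1 and 2: atoms are only ever appended; insertion order
        # is never touched after the fact.
        self.atoms.append(atom)

    def retrieve(self, query_vector: np.ndarray, k: int = 8) -> list[Atom]:
        # Rule 3: cosine similarity is the only ranking signal.
        # No recency boosts, no importance weights, no decay terms.
        def score(atom: Atom) -> float:
            return float(
                np.dot(atom.vector, query_vector)
                / (np.linalg.norm(atom.vector) * np.linalg.norm(query_vector))
            )

        return sorted(self.atoms, key=score, reverse=True)[:k]
```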
I wasn't proud of the number so much as what the number told me. The ceiling of memory quality wasn't being set by the model. It was being set by how the memory layer was designed. A smaller model with a better memory layer was outperforming a larger model with a worse one. That was the signal.
Which raised a question I didn't have a good answer for: if the memory layer matters more than the model, why would anyone want their memory tied to a specific model company?
Rewind: 3122
That question is why I built 3122.
It's a local coding harness, but the design brief wasn't "another coding tool." The brief was this: build something where the model is a replaceable part, and the things that actually matter (the tool surface, the safety policy, the continuity across sessions) belong to the harness. You can run it against Claude, GPT, Gemini, DeepSeek, Qwen, or a local Ollama model, and the experience is the same. Same tools, same permission model, same trajectory memory.
Three decisions mattered more than the rest. First, a common text protocol for tool calls that kicks in when a model doesn't support native tool calling, so the small open-weight models aren't second-class citizens. Second, a compact prompt shape and a separate context budget for those smaller models, because "supporting" a model and making it actually work are different things. Third, a handoff snapshot that fires when you switch models mid-task, so the next model picks up where the last one left off.
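To make the third decision concrete, here's roughly what a handoff snapshot can look like. The field names are hypothetical, not 3122's actual format; the point is that everything the next model needs is rendered as plain text, so it works whether or not that model has native tool calling.

```python
# A hypothetical handoff snapshot, not 3122's actual format: enough
# state that the next model can resume mid-task without replaying the
# previous model's full context.
from dataclasses import dataclass, field


@dataclass
class HandoffSnapshot:
    task: str                    # what the user asked for, restated plainly
    plan: list[str]              # remaining steps as the outgoing model saw them
    decisions: list[str]         # choices already made, so they aren't relitigated
    open_questions: list[str]    # things the outgoing model was unsure about
    touched_files: list[str] = field(default_factory=list)
    last_tool_results: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        # Rendered as ordinary text so any model can pick it up as context,
        # native tool calling or not.
        sections = [
            ("Task", [self.task]),
            ("Remaining plan", self.plan),
            ("Decisions so far", self.decisions),
            ("Open questions", self.open_questions),
            ("Files touched", self.touched_files),
            ("Recent tool results", self.last_tool_results),
        ]
        lines: list[str] = []
        for title, items in sections:
            if items:
                lines.append(f"## {title}")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)
```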
None of this is exotic. It's the kind of infrastructure you build once you've decided the model is a part and the harness is the machine.
Back to 3 A.M.: What Managed Agents Actually Did
Here's the thing about reading Anthropic's engineering blog at three in the morning with a cold cup of coffee: you stop reacting and start recognizing.
The architecture they described, what they call decoupling the brain from the hands, is almost painfully clean. The brain is Claude and its harness, responsible for reasoning. The hands are sandboxed execution environments, completely replaceable, talking to the brain through a deliberately minimal interface: give it a name and an input, get back a string. The session is a durable append-only event log that lives outside the model's context window, so any stateless brain can pick up any session and continue from where the last one stopped.
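In code, my reading of that description looks something like the sketch below. None of these names are Anthropic's API; they're just the shape the blog post implies: the hands are anything that maps a tool name and an input string to an output string, and the session is an append-only log that any stateless brain can replay and continue.

```python
# A sketch of the decoupling as I read it, not Anthropic's actual API.
from dataclasses import dataclass, field
from typing import Callable, Optional

ToolCall = tuple[str, str]                           # (tool name, tool input)
Hands = Callable[[str, str], str]                    # the deliberately minimal interface
Brain = Callable[[list[dict]], Optional[ToolCall]]   # reads the log, picks the next call


@dataclass
class Session:
    events: list[dict] = field(default_factory=list)  # durable, append-only, outside the model

    def step(self, brain: Brain, hands: Hands) -> bool:
        # The brain sees only the replayed log, never hidden state,
        # so a different brain can take over at any step.
        call = brain(list(self.events))
        if call is None:
            return False  # the brain considers the task finished
        name, tool_input = call
        result = hands(name, tool_input)
        self.events.append({"call": name, "input": tool_input, "result": result})
        return True
```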
It's a good design. I mean that as an engineer reading another engineer's work, not as a competitor. They arrived at architectural conclusions I'd been circling for months, described them more clearly than I had, and shipped them as running infrastructure. Reading it was the closest thing I've had in a while to the feeling of a colleague showing you a proof you'd been fumbling toward.
But then I noticed what wasn't there.
The session in Managed Agents is an event log. Raw. Chronological. It's what happened, not what any of it means. There's no semantic layer on top, no notion of which events contradict each other, which ones should decay, which ones belong to the same entity, which ones support or refute which claims. Anthropic built the storage and the retrieval primitives. They left the meaning layer empty.
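To be concrete about what I mean by a meaning layer, here's one hypothetical shape it could take, in my framing rather than anything Managed Agents ships: annotations that relate events to each other, and decay weights that age them, without ever editing the log itself.

```python
# A hypothetical meaning layer over an append-only event log.
# My framing, not anything Managed Agents ships.
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    SUPPORTS = "supports"
    REFUTES = "refutes"
    CONTRADICTS = "contradicts"
    SAME_ENTITY = "same_entity"


@dataclass
class Annotation:
    source_event: int    # index into the append-only event log
    target_event: int
    relation: Relation


@dataclass
class MeaningLayer:
    annotations: list[Annotation] = field(default_factory=list)
    decay: dict[int, float] = field(default_factory=dict)  # event index -> staleness weight

    def link(self, source: int, target: int, relation: Relation) -> None:
        # The log itself is never edited; meaning accrues alongside it.
        self.annotations.append(Annotation(source, target, relation))
```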
I can think of three reasons for that. Maybe they haven't gotten to it yet and it's on next quarter's roadmap. Maybe they're betting the meaning layer belongs to partners and protocols rather than to them. Maybe it's a deliberate architectural choice about separation of concerns. I genuinely don't know which it is, and I'm suspicious of anyone, including myself, who claims to.
The Position That Got Clearer, Not Weaker
What I do know is what the architecture implies, regardless of which of those three stories turns out to be true.
If you accept that the brain is a replaceable part, and Managed Agents is built on exactly that premise, then the memory that sits above those parts has to be neutral with respect to which part is plugged in today. No single model company can credibly own the cross-vendor memory layer, because owning it means being neutral about which model wins, and no model company can be neutral about that. The neutrality has to come from somewhere else.
That "somewhere else" is a smaller, more boring place than it sounds. It's a protocol other people can implement. It's a reference backend that proves the protocol works. It's an enterprise variant for companies that don't want to host it themselves. It's a personal variant for people who want their own notes to outlive whichever model they happen to be using this year. It's a harness that treats memory as something you connect to rather than something you build into yourself. That's the shape of what I've been making, and the shape I've started to think this moment actually calls for.
I'm not going to claim the announcement proved me right. That would be too clean, and the honest version is that I'm still working it out. But I went into the night worried a larger company had absorbed my problem, and I came out of it with a clearer sense of the part of that problem they structurally can't absorb. That's more than I had at midnight.
3122 goes public with this post. The rest of the stack (AMCP, Nexus, MaaS, Norfolk) has been in motion for a while and will keep moving. If you're working on agent memory from a different angle, I'd like to hear from you.
It's morning now. The coffee is worse than it was at three.