Most LLM systems don't get better with each query.
The base model is frozen at training time, and everything around it — prompts, retrieval, memory — is hand-tuned. When something works in production, the signal usually has to be noticed by a human, interpreted, and pushed back manually. Most of it is lost.
Our view: most of the meaningful gains in agent and LLM-system reliability over the next few years will come from the system around the model, not from the model itself.
We're building a layer that closes the loop. The system observes interactions, evaluates outcomes against whatever signal is available in a given deployment, and updates a structured representation that influences future calls.
The base model stays frozen. The behavior around it adapts — continuously, from deployment.
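Concretely, the loop is a small amount of machinery wrapped around each model call. The sketch below is illustrative only; the names (AdaptiveState, step) are made up, and the real structured representation is richer than a list of notes.

```python
# Illustrative sketch of the loop: observe -> evaluate -> update -> shape the next call.
# All names here (AdaptiveState, step, ...) are made up; this is not a real API.

class AdaptiveState:
    """Structured state that sits around a frozen model and adapts from outcomes."""

    def __init__(self):
        self.notes = []  # stand-in for memory entries / retrieval weights / prompt stats

    def shape(self, request: str) -> str:
        """Build the next call from the request plus whatever the state has learned."""
        context = "\n".join(self.notes[-5:])
        return f"{context}\n\nUser: {request}"

    def update(self, trace: str, reward: float) -> None:
        """Fold an observed outcome back into the state (here: keep traces that scored well)."""
        if reward > 0.5:
            self.notes.append(trace)


def step(request: str, state: AdaptiveState, frozen_model, evaluate) -> str:
    prompt = state.shape(request)  # the state influences the call
    trace = frozen_model(prompt)   # the base model stays frozen
    reward = evaluate(trace)      # whatever signal the deployment provides
    state.update(trace, reward)   # behavior adapts, continuously, from deployment
    return trace
```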
three surfaces, one loop
01 / memory
Memory
Structured state that updates from outcomes, not from human edits.
policy: π(write | trace, outcome)
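One way to read the π(write | trace, outcome) line: a small parametric policy that decides whether an interaction earns a slot in memory, trained from downstream reward rather than from hand edits. The sketch below is a hypothetical logistic write policy with placeholder features and a REINFORCE-style update, not an actual implementation.

```python
import math
import random

# Hypothetical write policy pi(write | trace, outcome): decide whether an interaction
# earns a slot in structured memory. Features and update rule are placeholders.

def features(trace: str, outcome: float) -> list[float]:
    return [1.0, outcome, min(len(trace) / 1000.0, 1.0)]

class WritePolicy:
    def __init__(self, dim: int = 3, lr: float = 0.05):
        self.w = [0.0] * dim
        self.lr = lr

    def prob_write(self, x: list[float]) -> float:
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def act(self, trace: str, outcome: float) -> bool:
        return random.random() < self.prob_write(features(trace, outcome))

    def update(self, trace: str, outcome: float, wrote: bool, downstream_reward: float) -> None:
        # REINFORCE-style nudge: writes that preceded good downstream outcomes become more likely.
        x = features(trace, outcome)
        p = self.prob_write(x)
        grad = (1.0 - p) if wrote else -p
        self.w = [wi + self.lr * downstream_reward * grad * xi for wi, xi in zip(self.w, x)]
```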
02 / retrieval
Retrieval
Bandit-style selection over what context the model sees, with non-stationary relevance and partial-feedback rewards.
contextual bandit · LinUCB / Thompson
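The LinUCB variant is the easiest to sketch: treat each candidate chunk as an arm, score it from query-chunk features, and add an exploration bonus. The code below is standard disjoint LinUCB with illustrative names; non-stationary relevance would typically be handled by discounting A and b over time rather than accumulating them forever.

```python
import numpy as np

# Disjoint LinUCB over candidate context chunks. Each arm keeps A = I + sum(x x^T)
# and b = sum(r x); pick the arm maximizing theta^T x + alpha * sqrt(x^T A^-1 x).

class LinUCB:
    def __init__(self, dim: int, alpha: float = 1.0):
        self.dim, self.alpha = dim, alpha
        self.A = {}  # arm id -> d x d design matrix
        self.b = {}  # arm id -> d reward-weighted feature sum

    def select(self, candidates: dict):
        """candidates maps an arm id (e.g. a chunk id) to its feature vector."""
        best, best_score = None, float("-inf")
        for arm, x in candidates.items():
            if arm not in self.A:
                self.A[arm] = np.eye(self.dim)
                self.b[arm] = np.zeros(self.dim)
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]
            score = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if score > best_score:
                best, best_score = arm, score
        return best

    def update(self, arm, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```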
03 / prompt
Prompt construction
Reward modeling and policy learning over templates, demonstrations, and call structure.
reward model · off-policy eval
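The off-policy piece can be as small as an inverse-propensity estimate: score a candidate template-selection policy on logs gathered under the current one, without deploying it. The log schema and function below are made up for illustration.

```python
# Inverse propensity scoring over logged prompt-template choices. Each log entry
# records which template was used, the probability the logging policy assigned
# to it, and the observed reward. The schema is illustrative.

def ips_value(logs, new_policy):
    """Estimate the value of new_policy from logs gathered under an older policy.

    logs: iterable of dicts with keys "context", "template", "logging_prob", "reward"
    new_policy(context) -> dict mapping template id -> probability
    """
    total, n = 0.0, 0
    for entry in logs:
        target_prob = new_policy(entry["context"]).get(entry["template"], 0.0)
        weight = target_prob / max(entry["logging_prob"], 1e-6)
        total += weight * entry["reward"]
        n += 1
    return total / max(n, 1)
```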
Credit assignment breaks down when the relevant action happened thousands of tokens ago. Reward signals are noisy or missing in most production settings. Systems that influence their own training signal find ways to game it. And changing your own behavior changes the data you see, which makes everything less stable. We don't have clean answers yet, only working hypotheses about which of these problems are tractable.
We're talking to researchers and to engineers running LLM systems in production.