Most LLM systems don't get better with each query.
The base model is frozen at training time, and everything around it — prompts, retrieval, memory — is hand-tuned. When something works in production, the signal usually has to be noticed by a human, interpreted, and pushed back manually. Most of it is lost.
Our view: most of the meaningful gains in agent and LLM-system reliability over the next few years will come from the system around the model, not from the model itself.
We're building a layer that closes the loop. The system observes interactions, evaluates outcomes against whatever signal is available in a given deployment, and updates a structured representation that influences future calls.
The base model stays frozen. The behavior around it adapts — continuously, from deployment.
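Concretely, the loop is a small amount of machinery wrapped around each model call. The sketch below is illustrative only; the names (AdaptiveState, step) are made up, and the real structured representation is richer than a list of notes.

```python
# Illustrative sketch of the loop: observe -> evaluate -> update -> shape the next call.
# All names here (AdaptiveState, step, ...) are made up; this is not a real API.

class AdaptiveState:
    """Structured state that sits around a frozen model and adapts from outcomes."""

    def __init__(self):
        self.notes = []  # stand-in for memory entries / retrieval weights / prompt stats

    def shape(self, request: str) -> str:
        """Build the next call from the request plus whatever the state has learned."""
        context = "\n".join(self.notes[-5:])
        return f"{context}\n\nUser: {request}"

    def update(self, trace: str, reward: float) -> None:
        """Fold an observed outcome back into the state (here: keep traces that scored well)."""
        if reward > 0.5:
            self.notes.append(trace)


def step(request: str, state: AdaptiveState, frozen_model, evaluate) -> str:
    prompt = state.shape(request)  # the state influences the call
    trace = frozen_model(prompt)   # the base model stays frozen
    reward = evaluate(trace)      # whatever signal the deployment provides
    state.update(trace, reward)   # behavior adapts, continuously, from deployment
    return trace
```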
three surfaces, one loop
01 / memory
Memory
Structured state that updates from outcomes, not from human edits.
policy: π(write | trace, outcome)
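One way to read the π(write | trace, outcome) line: a small parametric policy that decides whether an interaction earns a slot in memory, trained from downstream reward rather than from hand edits. The sketch below is a hypothetical logistic write policy with placeholder features and a REINFORCE-style update, not an actual implementation.

```python
import math
import random

# Hypothetical write policy pi(write | trace, outcome): decide whether an interaction
# earns a slot in structured memory. Features and update rule are placeholders.

def features(trace: str, outcome: float) -> list[float]:
    return [1.0, outcome, min(len(trace) / 1000.0, 1.0)]

class WritePolicy:
    def __init__(self, dim: int = 3, lr: float = 0.05):
        self.w = [0.0] * dim
        self.lr = lr

    def prob_write(self, x: list[float]) -> float:
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def act(self, trace: str, outcome: float) -> bool:
        return random.random() < self.prob_write(features(trace, outcome))

    def update(self, trace: str, outcome: float, wrote: bool, downstream_reward: float) -> None:
        # REINFORCE-style nudge: writes that preceded good downstream outcomes become more likely.
        x = features(trace, outcome)
        p = self.prob_write(x)
        grad = (1.0 - p) if wrote else -p
        self.w = [wi + self.lr * downstream_reward * grad * xi for wi, xi in zip(self.w, x)]
```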
02 / retrieval
Retrieval
Bandit-style selection over what context the model sees, with non-stationary relevance and partial-feedback rewards.
contextual bandit · LinUCB / Thompson
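The LinUCB variant is the easiest to sketch: treat each candidate chunk as an arm, score it from query-chunk features, and add an exploration bonus. The code below is standard disjoint LinUCB with illustrative names; non-stationary relevance would typically be handled by discounting A and b over time rather than accumulating them forever.

```python
import numpy as np

# Disjoint LinUCB over candidate context chunks. Each arm keeps A = I + sum(x x^T)
# and b = sum(r x); pick the arm maximizing theta^T x + alpha * sqrt(x^T A^-1 x).

class LinUCB:
    def __init__(self, dim: int, alpha: float = 1.0):
        self.dim, self.alpha = dim, alpha
        self.A = {}  # arm id -> d x d design matrix
        self.b = {}  # arm id -> d reward-weighted feature sum

    def select(self, candidates: dict):
        """candidates maps an arm id (e.g. a chunk id) to its feature vector."""
        best, best_score = None, float("-inf")
        for arm, x in candidates.items():
            if arm not in self.A:
                self.A[arm] = np.eye(self.dim)
                self.b[arm] = np.zeros(self.dim)
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]
            score = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if score > best_score:
                best, best_score = arm, score
        return best

    def update(self, arm, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```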
03 / prompt
Prompt construction
Reward modeling and policy learning over templates, demonstrations, and call structure.
reward model · off-policy eval
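The off-policy piece can be as small as an inverse-propensity estimate: score a candidate template-selection policy on logs gathered under the current one, without deploying it. The log schema and function below are made up for illustration.

```python
# Inverse propensity scoring over logged prompt-template choices. Each log entry
# records which template was used, the probability the logging policy assigned
# to it, and the observed reward. The schema is illustrative.

def ips_value(logs, new_policy):
    """Estimate the value of new_policy from logs gathered under an older policy.

    logs: iterable of dicts with keys "context", "template", "logging_prob", "reward"
    new_policy(context) -> dict mapping template id -> probability
    """
    total, n = 0.0, 0
    for entry in logs:
        target_prob = new_policy(entry["context"]).get(entry["template"], 0.0)
        weight = target_prob / max(entry["logging_prob"], 1e-6)
        total += weight * entry["reward"]
        n += 1
    return total / max(n, 1)
```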
Credit assignment breaks down when the relevant action happened thousands of tokens ago. Reward signals are noisy or missing in most production settings. Systems that influence their own training signal find ways to game it. And changing your own behavior changes the data you see, which makes everything less stable. We don't have clean answers yet, only working hypotheses about which of these problems are tractable.
We're talking to researchers and to engineers running LLM systems in production.