Agent Post-Training as Environment Design

Distribution reshaping over trajectories needs truthful state, localized feedback, and maintained objectives.

The model is not the whole system. The durable layers are the environment, objective, verifier, state, trace, and feedback.

Post-training is useful when those layers turn behavior into correction signals instead of vague applause at the end of a rollout.

Post-training is best understood as distribution reshaping over trajectories.

Distributional view

A model is a distribution over trajectories. SFT, RL, on-policy distillation, rejection sampling, and verifier-feedback methods differ in how they define the target distribution and where credit assignment enters.
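One common way to write this down, offered as a sketch and not as this note's own formalism: let τ be a trajectory, π_θ the model, π_ref the starting policy, R a scalar reward, V a binary verifier, π_teacher a teacher policy, s_t the state at step t, and β a KL weight. All of these symbols are assumptions introduced here for illustration.

```latex
\begin{align*}
\text{SFT:} \quad
  & \max_\theta \; \mathbb{E}_{\tau \sim p_{\text{data}}}
    \big[ \log \pi_\theta(\tau) \big] \\
\text{KL-regularized RL:} \quad
  & \max_\theta \; \mathbb{E}_{\tau \sim \pi_\theta} \big[ R(\tau) \big]
    - \beta \, \mathrm{KL}\big( \pi_\theta \,\|\, \pi_{\text{ref}} \big),
    \qquad
    \pi^*(\tau) \propto \pi_{\text{ref}}(\tau) \exp\big( R(\tau) / \beta \big) \\
\text{Rejection sampling:} \quad
  & \pi^*(\tau) \propto \pi_{\text{ref}}(\tau) \, \mathbf{1}\big[ V(\tau) = 1 \big] \\
\text{On-policy distillation:} \quad
  & \min_\theta \; \mathbb{E}_{\tau \sim \pi_\theta}
    \Big[ \sum_t \mathrm{KL}\big( \pi_\theta(\cdot \mid s_t) \,\|\,
    \pi_{\text{teacher}}(\cdot \mid s_t) \big) \Big]
\end{align*}
```

In this reading, SFT takes its target from a fixed dataset, while the other three sample from a policy and let credit enter through R, V, or the teacher's per-step distribution.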

For agents, the question is not whether a reward exists. It is whether the environment makes the right counterfactual visible.

Which tool call failed? Which state boundary was violated? Which evidence was missing? Which verifier rejected the artifact? Which human review changed the objective?

Feedback topology

Source: human preference, verifier, runtime error, static checker, simulation, benchmark, product telemetry
Timing: terminal reward, step-level critique, tool-call failure, state diff, review gate, regression test
Trust: gold label, synthetic task, noisy preference, adversarial trace, production correction, operator override
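The three axes above can be carried on every feedback event so that source, timing, and trust are never implicit. The following is a minimal sketch under that assumption; `FeedbackSignal`, its field names, and the example values are hypothetical and not taken from any harness described here.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from the three axes listed above.
SOURCES = {
    "human preference", "verifier", "runtime error", "static checker",
    "simulation", "benchmark", "product telemetry",
}
TIMINGS = {
    "terminal reward", "step-level critique", "tool-call failure",
    "state diff", "review gate", "regression test",
}
TRUST_LEVELS = {
    "gold label", "synthetic task", "noisy preference",
    "adversarial trace", "production correction", "operator override",
}


@dataclass(frozen=True)
class FeedbackSignal:
    """One feedback event, tagged on all three axes (hypothetical schema)."""
    source: str
    timing: str
    trust: str
    step: Optional[int]  # index of the action it refers to; None if end-of-rollout
    detail: str

    def __post_init__(self) -> None:
        # Reject untagged or mistyped feedback instead of silently accepting it.
        checks = (
            ("source", self.source, SOURCES),
            ("timing", self.timing, TIMINGS),
            ("trust", self.trust, TRUST_LEVELS),
        )
        for axis, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"unknown {axis}: {value!r}")


# Illustrative event: a checker rejects the artifact produced at step 3.
signal = FeedbackSignal(
    source="static checker",
    timing="tool-call failure",
    trust="synthetic task",
    step=3,
    detail="patch does not compile",
)
```

Keeping `step` on the record is what lets later analysis ask which action a signal belongs to, instead of only whether the rollout passed.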

Good environments move feedback closer to the action that caused it.

That is the difference between learning a behavior and learning a superstition about the benchmark.
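A toy contrast makes the point concrete. The sketch below compares spreading one end-of-rollout reward evenly over all steps with attaching a penalty to the step that caused it. Both functions are hypothetical illustrations; uniform spreading is only a simple stand-in for terminal-only feedback, not a description of any particular RL algorithm.

```python
from typing import List, Tuple


def terminal_credit(num_steps: int, reward: float) -> List[float]:
    """Spread one terminal reward evenly: every step receives the same signal."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return [reward / num_steps] * num_steps


def localized_credit(
    num_steps: int, step_feedback: List[Tuple[int, float]]
) -> List[float]:
    """Attach each (step, value) feedback item to the step it refers to."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    credit = [0.0] * num_steps
    for step, value in step_feedback:
        credit[step] += value
    return credit


# A four-step rollout in which only the tool call at step 2 failed.
smeared = terminal_credit(4, -1.0)            # [-0.25, -0.25, -0.25, -0.25]
pinned = localized_credit(4, [(2, -1.0)])     # [0.0, 0.0, -1.0, 0.0]
```

With the terminal-only signal, the three correct steps are penalized as much as the faulty one; with localized feedback, only step 2 carries the correction.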

Objective maintenance

Agent objectives decay when tools, memory, traces, reward channels, and eval loops are treated as glue code.

They are part of the learning surface. The environment has to preserve provenance, keep state explicit, expose uncertainty, and resist silent objective drift.
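One way to resist silent drift, sketched here under the assumption that the objective can be described as a small JSON-serializable spec, is to fingerprint that spec and store the fingerprint with every trace. The function names and spec keys below are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def objective_fingerprint(spec: Dict[str, Any]) -> str:
    """Hash a canonical JSON form of the spec so key order does not matter."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_drifted(recorded_fingerprint: str, current_spec: Dict[str, Any]) -> bool:
    """True when the current objective no longer matches the one a trace was scored under."""
    return recorded_fingerprint != objective_fingerprint(current_spec)


# Illustrative spec: the pieces that jointly define what "good" means.
spec_v1 = {"verifier": "checker-1.0", "reward": "tests_pass", "tools": ["build", "run"]}
recorded = objective_fingerprint(spec_v1)

# A verifier upgrade changes the objective even if no one calls it a change.
spec_v2 = {"verifier": "checker-1.1", "reward": "tests_pass", "tools": ["build", "run"]}
drifted = has_drifted(recorded, spec_v2)
```

Traces scored under different fingerprints should not be pooled without review, since their rewards answer different questions.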

Always look at the data, especially when the benchmark is the product being sold.

Proof anchors

  • AIxCC / DL-Patcher: code-model harnesses, checker feedback, patch verification, and patent-filed repair loops.
  • T-UEBA: constrained graph ML with active learning, synthetic-data triggers, calibration, and analyst-facing evidence.
  • DXF-to-CAD: structured artifact environment, HATCH-IoU, deterministic validators, conformal review bands, and learned residual repair.
  • Agent harnesses: thin control planes around state, traces, tool use, evals, and operator review.

Practical frame

Problem discovery and problem solving have to happen together when the representation is not yet trustworthy.

Build the environment that tells the truth, then ask the model to improve inside it.
