Tool-Use Agents · Self-Evolution · Reinforcement Learning

SEAL: Synergistic Co-Evolution of Agents and Learning Environments

Yihao Hu*,1,2, Zhihao Wen*,1, Xiujin Liu3, Pan Wang1,4, Xin Zhang1, Wei Wu1 * Equal Contribution   Corresponding Author (z.wen@antgroup.com) 1 Ant Group   2 Westlake University 3 University of Michigan–Ann Arbor   4 University of Science and Technology of China

SEAL closes the loop between policy learning and training-time environment adaptation, using verifier-grounded failure diagnosis to make low-resource tool-use agent learning more targeted and robust.

Conceptual comparison of self-evolution paradigms for LLM agents

Paper Abstract

Co-Evolving Agents with Their Learning Interfaces

Large Language Model agents are increasingly improved through interaction rather than static supervision. Yet most self-evolution methods adapt either the policy or the learning environment in isolation. As the agent's capability frontier shifts during training, the environment that provides supervision often remains static or only weakly coupled to the agent's revealed failures. We call this mismatch Agent-Environment Misalignment.

SEAL is a closed-loop co-evolution framework for interactive tool-use agents. It collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level labels, and uses these diagnoses as a shared signal for both learning-interface evolution and model-side policy optimization. With only 400 training samples, SEAL yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer.

Core Ideas

Verifier-Grounded Co-Evolution

Failure Diagnosis

Executable traces are converted into turn-level labels such as invalid tool calls, argument mismatches, missing tool calls, recovery failures, and response mismatches.

Interface Evolution

The training-time interface exposes clearer schema cues, tool affordances, constraints, and recovery-oriented feedback while keeping benchmark tools and verifier fixed.

Advantage Reweighting

Diagnosis profiles estimate how actionable each trajectory is, then reweight GRPO advantages without changing the verifier reward or success criterion.

Closed Loop

The agent reveals capability gaps, the learning interface adapts around those gaps, and the model internalizes the resulting feedback through policy optimization.

Architecture

Training-Time Co-Evolution Loop

SEAL training-time co-evolution loop
The policy interacts with an executable tool environment, failed rollouts are diagnosed into verifier-grounded failure types, and the same diagnoses drive learning-interface evolution and diagnosis-guided policy optimization.

Reported Experiments

Low-Resource Tool-Use Learning

+26.25 average-point gain on Qwen2.5-7B-Instruct
400 training examples in the BFCL V3 low-resource setting
3 backbones improved under the same verifier reward
OOD positive transfer to BFCL V4 and tau2-bench domains

BFCL V3 Multi-Turn Results

Average score and category-level success rates.

Model Average Base Missing Functions Missing Parameters Long Context
Qwen2.5-3B-Instruct 5.75 11.00 6.00 3.00 3.00
+ Vanilla RL 9.25 16.00 9.00 6.00 6.00
+ SEAL 14.00 19.00 15.00 12.00 10.00
Qwen2.5-7B-Instruct 14.00 22.00 14.00 10.00 10.00
+ Vanilla RL 30.75 46.00 27.00 27.00 23.00
+ SEAL 40.25 58.00 36.00 34.00 33.00
ToolACE-2-Llama-3.1-8B 32.00 45.00 26.00 35.00 22.00
+ Vanilla RL 38.50 52.00 30.00 40.00 32.00
+ SEAL 46.75 58.00 46.00 44.00 39.00
SEAL diagnosis-guided reward ablation
Diagnosis-guided training allocates more optimization pressure to attributable and recoverable failures, improving the usefulness of sparse verifier feedback.

Citation

BibTeX

@article{hu2026seal,
  title={SEAL: Synergistic Co-Evolution of Agents and Learning Environments},
  author={Hu, Yihao and Wen, Zhihao and Liu, Xiujin and Wang, Pan and Zhang, Xin and Wu, Wei},
  journal={Preprint},
  year={2026},
  url={https://yihaohu0118.github.io/SEAL/}
}