1. Introduction
Creating autonomous agents that can act and learn without human intervention has long been a central goal of artificial intelligence. Thanks to large language models (LLMs), this vision is becoming a reality. Such agents are already being used to navigate websites, operate tools, and assist in scientific research.
A promising training method is reinforcement learning (RL), where an agent optimizes its actions to maximize a cumulative reward. This approach enabled systems like AlphaGo to achieve superhuman performance in domains with well-defined rules and rewards. In real-world settings such as websites, however, clear reward signals are often absent: an agent might fill out a form but receive no feedback on whether it was done correctly.
Therefore, the current standard is Supervised Fine-Tuning (SFT) based on expert-curated data. This is efficient but has major drawbacks. Collecting high-quality human data is expensive. More importantly, the agent never interacts with the environment during training, so it doesn't learn from its own mistakes and performs poorly in new situations.
To solve these problems, the early experience paradigm is proposed—a practical middle ground between imitation learning and reinforcement learning. The fundamental principle is that the agent proposes its own actions, observes the consequences (future states), and uses this information for training without needing external rewards.
Within this paradigm, two methods are studied:
Implicit World Modeling: The agent learns to predict what will happen in the environment after its action.
Self-Reflection: The agent compares its exploratory actions to an expert's and analyzes why the expert's choice was better.
Experiments have shown that this approach improves average success rates by +9.6% and generalization by +9.4% compared to standard imitation learning.
2. The Basics: How Agents Learn
An agent's decision-making problem can be described as follows: in every situation (state), the agent must choose an action. The goal of training is to teach it to choose the best actions.
The main challenge is the lack of reliable reward signals. This is why imitation learning is common, where the agent learns by copying "state-action" pairs from a dataset provided by experts.
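In standard notation (generic symbols, not taken from the article), imitation learning treats the expert data as a set of state-action pairs and trains the policy by maximizing the likelihood of the expert's actions, i.e. minimizing the negative log-likelihood:

```latex
% Imitation learning / behavior cloning objective (generic notation):
% D_expert is the expert dataset of state-action pairs, pi_theta is the agent's policy.
\mathcal{L}_{\mathrm{IL}}(\theta)
  \;=\; -\,\mathbb{E}_{(s,\,a^{*}) \sim \mathcal{D}_{\mathrm{expert}}}
        \big[\, \log \pi_{\theta}(a^{*} \mid s) \,\big]
```

Nothing in this objective involves interacting with the environment, which is the root of the weaknesses listed next.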
However, this approach has two primary weaknesses:
Distribution Shift: During real-world use, the agent inevitably deviates from the expert's behavior and encounters unfamiliar situations where errors can accumulate.
Lack of Consequence Awareness: The agent only ever sees the expert's correct actions. It never learns what happens after incorrect or alternative actions, which limits its ability to recover from mistakes.
The early experience paradigm is designed to address these limitations directly.
3. The Early Experience Paradigm
The early experience paradigm enables language agents to learn through direct interaction with their environment, using the resulting future states as reward-free supervision.
Imagine an agent learning to book flights. In imitation learning, it only sees successful demonstrations. With early experience, the agent also explores what happens if it clicks the wrong button or enters an invalid date. The resulting error messages and page changes become direct learning signals.
How does it work?
For each state in the expert dataset, the agent is allowed to try several alternative actions. Executing each of these actions leads to a new state, which captures the immediate consequence. These interactions are collected into a rollout dataset, which serves as a rich source of reward-free supervision.
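To make this concrete, here is a minimal sketch of how such a rollout dataset could be collected. The environment interface (`env.reset_to`, `env.step`) and the `policy.sample_actions` call are illustrative assumptions, not the paper's actual code:

```python
def collect_early_experience(env, policy, expert_dataset, k_alternatives=3):
    """For each expert state, sample alternative actions, execute them, and
    record the resulting next states.

    Assumed interfaces (placeholders for illustration):
      - env.reset_to(state): restore the environment to a given state
      - env.step(action): execute an action and return the next state
      - policy.sample_actions(state, n): propose n candidate actions
    """
    rollout_dataset = []
    for state, expert_action in expert_dataset:
        # Propose alternative actions for this state.
        alternatives = policy.sample_actions(state, n=k_alternatives)
        for action in alternatives:
            env.reset_to(state)            # return to the expert's state
            next_state = env.step(action)  # observe the consequence -- no reward needed
            rollout_dataset.append({
                "state": state,
                "action": action,
                "next_state": next_state,
                "expert_action": expert_action,
            })
    return rollout_dataset
```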
Implicit World Modeling (IWM)
The agent is trained to predict the next state that will result from a given state-action pair. Since states are represented as text, this becomes a standard next-token prediction task.
The training process is two-staged: first, the agent is trained on the world modeling task to internalize the environment's dynamics, and then it is fine-tuned on the expert data. This helps the agent better understand its context and makes it more robust to unexpected situations.
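A minimal sketch of how the world-modeling data could be framed as next-token prediction; the prompt wording, field names, and the two-stage recipe in the comments are assumptions based on the description above:

```python
def build_world_modeling_examples(rollout_dataset):
    """Turn (state, action, next_state) records into plain-text sequences
    for next-token prediction."""
    examples = []
    for item in rollout_dataset:
        prompt = (
            f"Current state:\n{item['state']}\n\n"
            f"Action taken:\n{item['action']}\n\n"
            "Predict the resulting state:"
        )
        examples.append({"prompt": prompt, "target": item["next_state"]})
    return examples

# Two-stage training, as described above:
#   Stage 1: fine-tune the base LLM on these world-modeling examples
#            (language-modeling loss, typically applied to the target portion).
#   Stage 2: continue fine-tuning the resulting checkpoint on the expert
#            state -> action data, as in standard imitation learning.
```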
Self-Reflection (SR)
Self-Reflection is a mechanism for agents to learn from their own exploratory outcomes by comparing expert actions with alternatives. For each state, an LLM is prompted to generate a rationale (reflection) explaining why the expert action was better than an alternative, based on their observed outcomes.
The agent is then trained to jointly predict both the rationale and the expert action. This encourages the model to move beyond rote imitation and develop more generalizable decision-making principles. For example, a generated reflection might teach the model to prioritize budget constraints—a lesson that transfers across different tasks.
Self-Reflection Prompt Template:
You will be presented with a situation where you need to choose between multiple possible actions. Your task is to analyze the situation and provide reasoning about why we decide to take the expert action.
Situation Description:
{Situation Description}
Expert Action: {Expert Action}
Expected Outcome: {Future State of Expert Action}
Alternative Actions:
Action 1: {Alt Action 1}, resulting state: {State 1}
Action 2: {Alt Action 2}, resulting state: {State 2}
...
Provide a detailed self-reflection as an internal monologue that demonstrates your reasoning process. Your monologue should:
1. Analyze the situation and the goal.
2. Compare the possible actions, explaining why each may be less optimal.
3. Justify why the expert action is most suitable, grounded in the expected outcome.
4. Highlight any relevant clues, constraints, or consequences.
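Putting the pieces together, the following sketch shows how self-reflection training data might be assembled from the rollout dataset and the template above. The `llm.generate` call, the abridged template string, and the data layout are assumptions for illustration:

```python
from collections import defaultdict

# Compact stand-in for the prompt template shown above (wording abridged).
REFLECTION_TEMPLATE = (
    "Situation Description:\n{situation}\n\n"
    "Expert Action: {expert_action}\n"
    "Expected Outcome: {expert_outcome}\n\n"
    "Alternative Actions:\n{alternatives}\n\n"
    "Explain, as an internal monologue, why the expert action is preferable."
)

def build_self_reflection_examples(rollout_dataset, expert_transitions, llm):
    """Prompt an LLM for a rationale comparing the expert action with the
    explored alternatives, then build (prompt, rationale + expert action)
    training pairs.

    Assumptions: `expert_transitions` maps state -> (expert_action,
    expert_next_state); `llm.generate(text)` returns the model's completion.
    """
    # Group the explored alternatives and their outcomes by state.
    by_state = defaultdict(list)
    for item in rollout_dataset:
        by_state[item["state"]].append((item["action"], item["next_state"]))

    examples = []
    for state, alternatives in by_state.items():
        expert_action, expert_next_state = expert_transitions[state]
        alt_lines = "\n".join(
            f"Action {i + 1}: {a}, resulting state: {s}"
            for i, (a, s) in enumerate(alternatives)
        )
        reflection_prompt = REFLECTION_TEMPLATE.format(
            situation=state,
            expert_action=expert_action,
            expert_outcome=expert_next_state,
            alternatives=alt_lines,
        )
        rationale = llm.generate(reflection_prompt)
        # The agent is trained to produce the rationale first, then the expert action.
        examples.append({
            "prompt": state,
            "target": f"{rationale}\nAction: {expert_action}",
        })
    return examples
```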
4. Experiments
Experiments were conducted across eight diverse environments to assess the effectiveness of the early experience paradigm, its out-of-domain generalization, and its compatibility with downstream reinforcement learning.
Effectiveness Results
The table below compares imitation learning with the two proposed methods, IWM and SR. The numbers are task success rates (%), with absolute gains over imitation learning shown in parentheses.
| Benchmark | Model | Imitation Learning | IWM | SR |
| --- | --- | --- | --- | --- |
| ALFWorld | Llama-3.1-8B | 80.5 | 85.9 (+5.4) | 85.2 (+4.7) |
| ScienceWorld | Llama-3.1-8B | 54.7 | 57.0 (+2.3) | 68.0 (+13.3) |
| TravelPlanner | Llama-3.1-8B | 17.2 | 25.0 (+7.8) | 32.2 (+15.0) |
| BFCLv3 | Qwen-2.5-7B | 26.7 | 29.3 (+2.6) | 32.0 (+5.3) |
| Tau-Bench | Llama-3.1-8B | 35.9 | 40.8 (+4.9) | 41.7 (+5.8) |
| WebShop | Llama-3.1-8B | 47.3 | 58.6 (+11.3) | 58.2 (+10.9) |
| WebArena | Llama-3.1-8B | 4.9 | 8.5 (+3.6) | 8.5 (+3.6) |
IWM yielded particularly strong gains in structured environments like WebShop (+11.3%), while SR delivered its largest improvements in tasks requiring complex reasoning, such as TravelPlanner (+15.0%).
Conclusion: Early experience reliably converts an agent’s own actions into scalable supervision, strengthening policies across diverse environments.
Out-Of-Domain Generalization
The trained policies were then tested on out-of-domain (OOD) tasks they had not seen during training. The results show that early experience consistently recovers a substantial portion of the performance drop. For example, on ALFWorld, the Llama-3.1-8B model with IWM achieved a +14.8% gain over imitation learning.
Conclusion: Early experience improves robustness under diverse OOD regimes.
Reinforcement Learning Following Early Experience
Early experience was also evaluated as a preparatory stage for full reinforcement learning (RL). Initializing RL from a checkpoint trained with IWM or SR leads to substantially higher final performance ceilings than initializing from imitation learning alone.
Conclusion: Early experience functions as a critical 'mid-training bridge,' resolving the cold-start problem for RL by producing superior initial policies.
5. Discussion
The proposed methods were also compared against alternative baselines, which turned out to be less effective. For example, approaches that generate rationales without actual environment interaction can even degrade performance. In contrast, early experience provides grounded supervision from observed outcomes.
It was also found that:
Early experience is data-efficient. On WebShop, using early experience with just 1/8th of the expert data surpassed the performance of imitation learning trained on the full dataset.
IWM performance improves as more alternative actions are explored, while SR works best with a modest number of alternatives (e.g., 2–4).
The benefits of early experience persist as model size increases (from 3B to 70B parameters), showing that they complement, rather than merely duplicate, gains from scale.
6. Conclusion
Early experience is a scalable, reward-free paradigm that bridges the gap between imitation learning and reinforcement learning. By enabling agents to convert their own actions and resulting states into direct supervision, this approach addresses the core limitations of existing training methods.
The proposed strategies, implicit world modeling and self-reflection, deliver consistent performance gains across eight diverse environments. Furthermore, early experience serves as a powerful warm-start for subsequent reinforcement learning. These findings establish early experience as a practical and generalizable foundation for building more capable and autonomous language agents.