Sanity Check #1: SEAL

Self-Adapting Language Models

Problem Definition

Large Language Models (LLMs) are powerful but fundamentally static: once the initial training is complete, the weights stay fixed at the values learned from the training corpus. They therefore lack any built-in mechanism to adapt their own weights in response to new tasks, knowledge, or examples that emerge after training.

State of the art

Well, it is not like nothing has been tried before. Existing approaches to adapting models typically rely on in-context learning or standard finetuning, where the model consumes data "as-is". In-context learning means the model performs a task by being shown examples in the prompt, without any permanent change to its weights. Finetuning, on the other hand, further updates the model weights by training on specialised datasets to make the model competent at particular tasks. Neither approach, however, lets the model work out its own best strategy for how to transform and learn from the training data it receives.
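To make the contrast concrete, here is a toy sketch. The `ToyModel` class and its methods are invented purely for illustration; nothing here is real model code.

```python
# Toy illustration of the two classic adaptation routes.

class ToyModel:
    def __init__(self):
        self.weights = []                      # stands in for the network parameters

    def generate(self, prompt):
        return f"answer given weights={self.weights} and prompt={prompt!r}"

    def finetune(self, examples):
        self.weights.extend(examples)          # permanent weight update

model = ToyModel()

# In-context learning: the new examples live only inside the prompt; weights are untouched.
print(model.generate("Examples: 2+2=4, 3+3=6. Now solve: 4+4=?"))

# Finetuning: the same examples become training data and change the weights for good.
model.finetune(["2+2=4", "3+3=6"])
print(model.generate("4+4=?"))
```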

Research Hypothesis

The paper explores the central hypothesis: "Can an LLM self-adapt by transforming or generating its own training data and learning procedure?"

Methodology

The authors introduce Self-Adapting Language Models (SEAL), a framework that enables an LLM to self-adapt by generating, from the user input, its own finetuning data and update strategies, called "self-edits". This is achieved through a two-loop process: an inner loop where the model is updated via supervised fine-tuning (SFT) on a self-edit, and an outer reinforcement learning (RL) loop where the downstream performance of the updated model is used as a reward signal. The signal is simply whether the updated model was able to give the correct answer about the input, and it is used to train the model to produce more effective self-edits over time. The RL training is implemented using a simple and effective "rejection sampling + SFT" method called ReSTEM.
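In pseudocode, the two loops fit together roughly like this. This is a minimal sketch with toy stand-in helpers (the real objects are an LLM, LoRA updates, and a grader, discussed below), not the authors' implementation.

```python
import random

# Toy stand-ins: the "model" is mocked as a list of self-edits it has absorbed.
def generate_self_edit(model, context):
    return f"key facts distilled from: {context}"    # the model writes its own training data

def inner_sft(model, self_edit):
    return model + [self_edit]                       # inner loop: SFT on the self-edit

def downstream_reward(adapted_model, eval_questions):
    return random.choice([0, 1])                     # 1 if the updated model answers correctly

def rl_update(model, self_edit, reward):
    # Outer loop: nudge the self-edit generation policy toward rewarded edits
    # (SEAL does this with ReSTEM, explained further below).
    return model + [self_edit] if reward else model

def seal_step(model, context, eval_questions):
    self_edit = generate_self_edit(model, context)
    adapted = inner_sft(model, self_edit)
    reward = downstream_reward(adapted, eval_questions)
    return rl_update(model, self_edit, reward)

model = seal_step(model=[], context="passage about Mount Fuji",
                  eval_questions=["How tall is it?"])
```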

More details…

Once the model is prompted to generate a "self-edit", it produces a list of knowledge bites, key facts derived from the prompt. The model is then fine-tuned with LoRA on these self-generated knowledge bites. LoRA (Low-Rank Adaptation) is a lightweight adaptation method: instead of updating all parameters, it freezes the original pre-trained weights and injects small "adapter" layers, low-rank matrices placed next to the original layers, which act as adjustable knobs and are the only parameters updated during fine-tuning.
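As a rough picture of what that means in code, here is a minimal LoRA-style layer in PyTorch. It is a sketch for intuition, not the paper's implementation; in practice a library such as Hugging Face PEFT wires adapters like this into an existing model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a frozen base weight and a trainable low-rank update."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze the pre-trained weights
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # low-rank "knobs"
        self.B = nn.Parameter(torch.zeros(out_features, r))        # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # Output = frozen base layer + scaled low-rank correction B @ A applied to x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")   # only A and B are updated
```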

After fine-tuning, the model answers questions about the prompt it was fine-tuned on, and GPT-4.1 plays the role of a grader whose verdict (correct or incorrect) serves as the reward signal for the RL loop that learns to produce better self-edits.
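The grading step can be pictured roughly like this. The sketch assumes the OpenAI Python client; the prompt wording and the correct/incorrect parsing are my own illustration, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()

def binary_reward(question: str, gold_answer: str, model_answer: str) -> int:
    """Ask a grader model whether the answer is correct; return 1 or 0."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly one word: correct or incorrect."
    )
    reply = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = reply.choices[0].message.content.strip().lower()
    return 1 if verdict.startswith("correct") else 0
```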

Yes, the ReSTEM method was not explained, so let's explain it. It works as an Expectation-Maximisation (EM) procedure for the action-value function we described in the previous lesson. ReSTEM starts with Sample (the "E-step"): for a given task context, the model generates not just one but multiple candidate self-edits. Next comes Filter (the "rejection sampling"): each sampled self-edit is used to temporarily update the model, which is then evaluated. The self-edits that lead to a successful outcome (e.g., the model answers the test question correctly) are kept, and all the unsuccessful ones are thrown away. This is also described as reinforcing only those samples that receive a positive reward. Finally, Train (the "M-step", i.e., supervised fine-tuning): the model is trained via standard SFT only on the successful self-edits kept in the previous step. By repeating this loop, the model gradually learns the characteristics of a "good" self-edit and adjusts its own generation policy to produce them more often.

For those for whom the term self-edit is still a bit blurry: besides the knowledge bites described above, in the few-shot reasoning experiments it describes the chosen strategy for data augmentation (e.g., resizing, rotation, etc.) and the optimisation parameters (e.g., learning rate, number of epochs).
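Put together, one ReSTEM round looks roughly like this. Again, these are toy, self-contained stand-ins; the real steps are LoRA fine-tuning and GPT-4.1 grading as described above.

```python
import random

# Toy stand-ins: a "model" is mocked as the list of self-edits it has been trained on.
def sample_self_edits(model, context, n):
    return [f"{context} :: candidate edit {i}" for i in range(n)]   # E-step samples

def apply_self_edit(model, edit):
    return model + [edit]                       # temporary inner-loop update

def reward(adapted_model, eval_task):
    return random.choice([0, 1])                # placeholder for the real grader

def sft(model, edits):
    return model + list(edits)                  # M-step: train on the kept edits

def restem_round(model, tasks, num_candidates=5):
    kept = []
    for context, eval_task in tasks:
        # E-step: sample several candidate self-edits for the same context.
        for edit in sample_self_edits(model, context, num_candidates):
            adapted = apply_self_edit(model, edit)
            # Rejection sampling: keep only the edits that earn a positive reward.
            if reward(adapted, eval_task) == 1:
                kept.append(edit)
    # M-step: supervised fine-tuning on the successful self-edits only.
    return sft(model, kept)

model = restem_round(model=[], tasks=[("passage A", "quiz A"), ("passage B", "quiz B")])
```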

Results

Knowledge Incorporation: SEAL enabled a 7B parameter (llama3.1-instruct) model to achieve an accuracy of 47.0%, which was notably higher than the 46.3% achieved when using synthetic data generated by the much more powerful GPT-4.1 model.

Few-Shot Learning: On abstract reasoning tasks, SEAL achieved a success rate of 72.5%, a substantial improvement over standard in-context learning (0%) and over using self-edits without RL training (20%).

Limitations

One classic problem that keeps resurfacing in neural networks is catastrophic forgetting: when new knowledge is incorporated, previously learned knowledge gets interfered with or destroyed. SEAL's repeated self-updates are directly exposed to this risk.

Computational overhead is a big concern with the suggested approach. For every single self-edit, the model has to be fine-tuned, evaluated, graded, and assigned a reward, and that time and computing power adds up quickly.

The framework requires that every piece of new information (the context) comes with a pre-packaged evaluation task, i.e., a set of questions and ground-truth answers to evaluate against. This prevents the model from exercising its self-editing skills "in the wild" (on the open internet, or on specific books used to prepare for an exam, for example).

The sanity-check implementation is available in the backpropagate repository.

Sincerely, MO