You can’t fine-tune GPT-5.5. You can’t fine-tune Claude. You can’t fine-tune most of the models you actually deploy in production. Yet somehow, we expect these frozen models to handle spreadsheet automation, mathematical olympiads, and multi-step search tasks - all from a hand-written system prompt. The paper “SkillOpt: Executive Strategy for Self-Evolving Agent Skills” (arXiv 2605.23904, May 2026) asks: what if the system prompt itself were the trainable parameter? What if we applied the full discipline of deep learning - learning rates, validation splits, negative feedback - to a natural-language document instead of model weights? The result: SkillOpt wins or ties on all 52 evaluated (model, benchmark, harness) cells, achieving gains of up to +39 absolute points on procedural benchmarks and producing compact skill files of just 300-2,000 tokens that transfer across models, harnesses, and benchmarks.


The Problem: Agent Skills Without a Training Loop

What is an agent skill?

An agent skill is a natural-language document - think of it as a procedure manual - that tells an LLM how to approach a specific domain. It might contain tool-use policies (“always validate spreadsheet formulas before submitting”), output format rules (“return answers as JSON with fields X, Y, Z”), or reasoning strategies (“decompose multi-step math into subproblems”). The skill gets prepended to the system prompt or dropped as a SKILL.md file into the agent’s workspace.
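To make the deployment side concrete, here is a minimal sketch of the "prepend to the system prompt" option. The function name `build_system_prompt` and the blank-line separator are illustrative assumptions, not something the paper specifies.

```python
def build_system_prompt(skill_text: str, base_prompt: str) -> str:
    """Prepend a skill document to a base system prompt.

    Hypothetical helper: the blank-line separator is an assumption.
    """
    skill_text = skill_text.strip()
    if not skill_text:
        # No skill available: fall back to the base prompt unchanged.
        return base_prompt
    return f"{skill_text}\n\n{base_prompt}"


prompt = build_system_prompt(
    "Always validate spreadsheet formulas before submitting.",
    "You are a helpful assistant.",
)
```

The frozen model never changes here; only the string passed as the system prompt does.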

Skills are powerful because they’re portable: the same document works across GPT-5.5, GPT-5.4-nano, Codex CLI, and Claude Code. No model retraining. No adapter layers. Just text.

Why hand-crafted and one-shot skills fall short

The problem is how we create these skills today:

  • Hand-crafted skills require a human expert to anticipate every failure mode. They’re expensive to write, brittle to edge cases, and never improve from deployment feedback.
  • One-shot LLM-generated skills (ask a strong model to write a procedure manual) produce a reasonable starting point but are frozen at generation time. The skill never learns from the agent’s actual successes and failures.
  • Loosely controlled self-revision (TextGrad, GEPA, EvoSkill) lets the agent rewrite its own skill after observing failures. But without guardrails, these methods can erase useful rules, overfit to individual examples, or oscillate between conflicting edits.

The core insight of SkillOpt: a skill document is the agent’s external state - directly analogous to model weights in deep learning. If we apply the same engineering discipline to text-space editing (bounded step sizes, held-out validation, negative feedback, cross-epoch consolidation), we get reproducible improvements without ever modifying the underlying model. And because a candidate is kept only when it beats the current best on held-out data, the selected skill’s score on the selection split never decreases.


SkillOpt: Bringing Deep Learning Discipline to Text Space

The analogy that makes it click

Imagine you hired a consultant to write a procedure manual for a customer-service team. Instead of handing them a finished manual on day one, you let the team run tasks, observe what goes wrong, and iteratively revise a few paragraphs at a time - but only keeping a revision if an independent reviewer confirms it actually helps. The manual never bloats uncontrollably because each revision is limited in how much it can change, and every failed revision is remembered so the consultant doesn’t try the same fix twice.

That’s SkillOpt. The “manual” is the skill document. The “team” is a frozen LLM. The “consultant” is a separate optimizer model. And the “independent reviewer” is a held-out validation split.

The optimization loop at a glance

The full pipeline has four repeating phases per step, wrapped in epochs:

  1. Forward pass - run the frozen model on a batch of tasks using the current skill; collect trajectories and scores
  2. Backward pass - partition results into successes and failures; generate structured add/delete/replace edits via minibatch reflection
  3. Bounded text update - apply at most $L_t$ edits (the “textual learning rate”)
  4. Validation gate - accept the candidate only if it strictly improves on a held-out selection split; otherwise, reject and remember the failure

At the end of each epoch, a slow/meta update compares the current skill against the previous epoch’s skill on the same tasks, extracting durable cross-epoch lessons.
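The four phases can be sketched as one epoch of a generic propose-and-test loop. This is a simplified illustration under stated assumptions, not the paper's implementation: `propose_edits`, `apply_edits`, and `selection_score` are hypothetical callables supplied by the caller, edits are plain strings, and the end-of-epoch slow/meta update is left out.

```python
from typing import Callable, List, Sequence, Tuple


def run_epoch(
    skill: str,
    batches: Sequence[Sequence[str]],
    propose_edits: Callable[[str, Sequence[str], List[str]], List[str]],
    apply_edits: Callable[[str, List[str]], str],
    selection_score: Callable[[str], float],
    edit_budget: int = 4,
) -> Tuple[str, float, List[str]]:
    """Run one epoch of bounded, validated skill updates."""
    best_skill = skill
    best_score = selection_score(best_skill)
    rejected: List[str] = []  # epoch-local memory of failed edits

    for batch in batches:
        # Phases 1-2: rollouts and reflection are folded into propose_edits.
        edits = propose_edits(best_skill, batch, rejected)
        # Phase 3: bounded update, apply at most edit_budget edits.
        chosen = edits[:edit_budget]
        if not chosen:
            continue
        candidate = apply_edits(best_skill, chosen)
        # Phase 4: strict validation gate (ties are rejected).
        score = selection_score(candidate)
        if score > best_score:
            best_skill, best_score = candidate, score
        else:
            rejected.extend(chosen)

    return best_skill, best_score, rejected
```

Passing the `rejected` list back into `propose_edits` is what lets the optimizer avoid re-proposing an edit that already failed the gate.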


The Forward Pass: Collecting Rollout Evidence

Formally, the execution model is:

$$(\tau(s), r(s)) = h(M, x, s), \quad r(s) \in [0, 1] \tag{1}$$

where:

  • $h$ - the execution harness (direct chat, Codex CLI, or Claude Code CLI)
  • $M$ - the frozen target language model
  • $x$ - a task input from the training split
  • $s$ - the current skill document (natural-language string)
  • $\tau(s)$ - the full trajectory: messages, tool calls, observations, final answer
  • $r(s)$ - a scalar reward/score for the trajectory, normalised to $[0,1]$

Interpretation: This is the forward pass. The harness runs the frozen model on a task conditioned on the skill, producing an observable trajectory and a numeric score. The skill $s$ is the only variable we can change - everything else (model, harness, scorer) is fixed. This is why the analogy to weight-space optimization works: we’re optimizing a single “parameter” (the skill text) to maximise a scalar objective.

The default configuration runs $B = 40$ tasks per batch, collecting enough signal to distinguish systemic skill weaknesses from random per-task noise.
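As a rough sketch of Equation 1 applied over a batch, the snippet below treats the harness as an opaque callable. The `Rollout` dataclass, the `(task, skill) -> (trajectory, reward)` signature, and the `forward_pass` name are assumptions made for illustration; only the batch size of 40 and the $[0,1]$ reward range come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical harness signature: (task, skill) -> (trajectory, reward)
Harness = Callable[[str, str], Tuple[List[str], float]]


@dataclass
class Rollout:
    task: str
    trajectory: List[str]  # messages, tool calls, observations, final answer
    reward: float          # normalised score in [0, 1]


def forward_pass(
    harness: Harness, tasks: Sequence[str], skill: str, batch_size: int = 40
) -> List[Rollout]:
    """Run the frozen model on up to batch_size tasks under the current skill."""
    rollouts: List[Rollout] = []
    for task in tasks[:batch_size]:
        trajectory, reward = harness(task, skill)
        if not 0.0 <= reward <= 1.0:
            raise ValueError(f"reward {reward} is outside [0, 1]")
        rollouts.append(Rollout(task, trajectory, reward))
    return rollouts


def mean_reward(rollouts: Sequence[Rollout]) -> float:
    """Average reward over a batch; 0.0 for an empty batch."""
    if not rollouts:
        return 0.0
    return sum(r.reward for r in rollouts) / len(rollouts)
```

Note that `skill` is the only argument that changes between optimization steps; the harness and the model behind it stay fixed.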


The Backward Pass: Minibatch Reflection

After the forward pass, trajectories are split into successes and failures. Each group is divided into reflection minibatches of size $B_m = 8$ (default). A separate optimizer model - typically a frontier LLM like GPT-5.5 - analyzes each minibatch and proposes structured edits.

Why separate the optimizer from the target model? For the same reason a coach doesn’t have to play the game themselves. A stronger optimizer can diagnose failure patterns more reliably than the target model, and it enables a powerful workflow: use GPT-5.5 as the optimizer to improve skills for GPT-5.4-nano, without changing the student’s weights. Table 5 in the paper shows that even a target-matched optimizer (same model for both roles) recovers 56-74% of the frontier-optimizer gain.

Failure-driven minibatches propose corrective and missing rules. Success-driven minibatches protect behaviors that already work. Both produce structured add/delete/replace edit proposals that feed into the merge step.
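The partition-then-chunk step can be sketched as follows. The success threshold is an assumption (the text does not say how a trajectory is classified as a success), and both function names are hypothetical; the minibatch size of 8 is the default quoted above.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def partition_by_outcome(
    scored: Sequence[Tuple[T, float]], threshold: float = 1.0
) -> Tuple[List[T], List[T]]:
    """Split (item, reward) pairs into successes and failures.

    Assumption: a rollout counts as a success when reward >= threshold.
    """
    successes = [item for item, reward in scored if reward >= threshold]
    failures = [item for item, reward in scored if reward < threshold]
    return successes, failures


def minibatches(items: Sequence[T], size: int = 8) -> List[List[T]]:
    """Chunk items into reflection minibatches of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]
```

Each resulting minibatch would then be handed to the optimizer model, with failure minibatches prompting for corrective rules and success minibatches prompting for rules worth protecting.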


Bounded Text Updates: The Textual Learning Rate

Here’s where SkillOpt diverges most sharply from prior prompt-optimization work. Instead of letting the optimizer rewrite the skill freely, every update is bounded:

$$s_{t+1} = \text{Modifier}(s_t,\; \text{TopK}_{L_t}(\mathcal{E}_t)), \quad L_t \in \{\text{constant},\; \text{cosine},\; \text{linear},\; \text{autonomous}\} \tag{2}$$

where:

  • $s_t$ - skill at step $t$
  • $\mathcal{E}_t$ - pool of merged add/delete/replace edits proposed by the optimizer
  • $L_t$ - the edit budget (textual learning rate) at step $t$: maximum number of edits applied
  • $\text{TopK}_{L_t}$ - operator that keeps only the top-$L_t$ edits ranked by expected utility
  • $\text{Modifier}$ - applies the selected edits to produce the candidate skill

Interpretation: By limiting each update to at most $L_t$ atomic edits, SkillOpt ensures that consecutive skill versions stay close enough that the optimizer can learn from accepted and rejected history. This is identical in purpose to using a small learning rate to keep consecutive weight vectors close in gradient descent.

The default schedule is cosine decay from $L_t = 4$ (floor = 2). Early epochs make larger structural edits - establishing core procedures, output formats, and tool-use policies. Later epochs make fine-grained consolidations - tightening edge-case handling and removing redundancy. The ablations show that all moderate budgets (1-16) perform comparably; the key finding is that removing the budget entirely drops scores by up to 4 points.
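For intuition, here is one plausible way to turn "cosine decay from 4 with a floor of 2" into an integer edit budget, plus a TopK selector. The exact decay formula, the rounding, and the use of a numeric utility score per edit are assumptions; the text only gives the start value, the floor, and the idea of ranking edits by expected utility.

```python
import math
from typing import List, Sequence, Tuple


def cosine_edit_budget(
    step: int, total_steps: int, initial: int = 4, floor: int = 2
) -> int:
    """Edit budget L_t decaying from `initial` to `floor` along a half cosine."""
    if total_steps <= 1:
        return initial
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / (total_steps - 1)
    value = floor + 0.5 * (initial - floor) * (1.0 + math.cos(math.pi * progress))
    return max(floor, round(value))


def top_k_edits(scored_edits: Sequence[Tuple[str, float]], budget: int) -> List[str]:
    """Keep the `budget` edits with the highest (assumed) utility score."""
    ranked = sorted(scored_edits, key=lambda pair: pair[1], reverse=True)
    return [edit for edit, _ in ranked[:budget]]


schedule = [cosine_edit_budget(t, 5) for t in range(5)]  # [4, 4, 3, 2, 2]
```

With five steps the budget starts at 4 and settles at the floor of 2, so later updates touch less of the skill than earlier ones.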

Patch mode vs. rewrite mode

The default is patch mode: localized add/insert/replace/delete operations that preserve continuity. The alternative rewrite mode generates a full skill rewrite conditioned on the edit suggestions. Patch mode has lower variance and preserves stable content; rewrite mode offers higher expressiveness at the risk of erasing working rules. Patch mode wins on most benchmarks.
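A toy version of patch mode, assuming the skill is stored as a list of rule strings and each edit names an exact existing rule as its target. The `Edit` dataclass and `apply_patch` are hypothetical names; real skills are free-form Markdown, so a production modifier would need fuzzier matching than this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    op: str                       # "add", "delete", or "replace"
    text: str = ""                # new rule text (add / replace)
    target: Optional[str] = None  # existing rule to delete or replace


def apply_patch(rules: List[str], edits: List[Edit]) -> List[str]:
    """Apply localized edits to a copy of the rule list; untouched rules survive."""
    out = list(rules)
    for edit in edits:
        if edit.op == "add":
            if edit.text not in out:  # avoid duplicating an existing rule
                out.append(edit.text)
        elif edit.op in ("delete", "replace"):
            if edit.target not in out:
                continue  # stale edit: the target rule no longer exists
            index = out.index(edit.target)
            if edit.op == "delete":
                out.pop(index)
            else:
                out[index] = edit.text
        else:
            raise ValueError(f"unknown edit op: {edit.op!r}")
    return out
```

Because each operation touches a single rule, everything the edits do not mention is carried over verbatim, which is the continuity property that rewrite mode gives up.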


The Validation Gate: Propose-and-Test, Not Unconditional Editing

This is perhaps the single most important design choice. After every bounded update, the candidate skill is evaluated on a held-out selection (validation) split:

$$s^\star_{\text{sel}} = \arg\max_{s \in \mathcal{C}(D_{\text{tr}})} \frac{1}{|D_{\text{sel}}|} \sum_{x \in D_{\text{sel}}} r(s) \tag{3}$$

where:

  • $\mathcal{C}(D_{\text{tr}})$ - the set of candidate skills generated by the optimizer using the training split
  • $D_{\text{sel}}$ - the held-out selection split, used only for accept/reject decisions
  • $s^\star_{\text{sel}}$ - the best skill chosen by the gate across all candidates
  • $r(s)$ - per-task score as in Equation 1

Interpretation: A candidate skill is accepted into best_skill.md only when its average score on the selection split strictly exceeds the current best. Ties are rejected. This converts reflection into a propose-and-test optimization: plausible-sounding textual diagnoses that don’t actually help the target model get filtered out empirically, not by judgment calls.

Why strict gating? Because LLMs are excellent at generating convincing explanations for failures and plausible-sounding fixes - but those fixes often don’t help (or actively hurt) the target model’s performance. Without the validation gate, SkillOpt degenerates into the same unconditional self-editing that plagues TextGrad.
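The gate itself is small enough to write out. In this sketch, `selection_score` is the mean in Equation 3 and `passes_gate` is the strict comparison; both names are illustrative, and per-task rewards on the selection split are assumed to be already computed.

```python
from statistics import mean
from typing import Sequence, Tuple


def selection_score(rewards: Sequence[float]) -> float:
    """Mean per-task reward on the selection split (0.0 if the split is empty)."""
    return mean(rewards) if rewards else 0.0


def passes_gate(
    best_score: float, candidate_rewards: Sequence[float]
) -> Tuple[bool, float]:
    """Accept only if the candidate strictly beats the current best score."""
    candidate_score = selection_score(candidate_rewards)
    return candidate_score > best_score, candidate_score


tie = passes_gate(0.5, [1.0, 0.0, 1.0, 0.0])     # (False, 0.5): ties are rejected
better = passes_gate(0.5, [1.0, 1.0, 0.0, 1.0])  # (True, 0.75)
```

Using `>` rather than `>=` is the whole point: an edit that merely sounds right but leaves the score unchanged does not make it into best_skill.md.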


The Rejected-Edit Buffer: Learning From Failures

When a candidate fails the validation gate, it doesn’t just vanish. The rejected edit and the associated score drop are appended to an epoch-local rejected-edit buffer. This buffer is passed to future optimizer calls within the same epoch.

Think of it as a “don’t try this again” list. Without it, the optimizer might independently propose the same attractive-but-harmful edit three times in the same epoch. The rejected-edit buffer eliminates this waste. Removing it drops SearchQA by -1.6, SpreadsheetBench by -4.6, and LiveMath by -2.4 points.

Crucially, the buffer is epoch-local - it resets at epoch boundaries. This prevents stale negative feedback from blocking edits that might work in a different skill context.
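One simple way to model the buffer is a list of (edit description, score change) pairs that is rendered into the optimizer prompt and cleared at each epoch boundary. The class name, the prompt format, and the string representation of edits are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RejectedEditBuffer:
    """Epoch-local memory of edits that failed the validation gate."""

    entries: List[Tuple[str, float]] = field(default_factory=list)

    def record(self, edit_description: str, score_delta: float) -> None:
        """Remember a rejected edit and its change in selection score."""
        self.entries.append((edit_description, score_delta))

    def as_prompt_block(self) -> str:
        """Render the buffer as text to include in later optimizer calls."""
        if not self.entries:
            return ""
        lines = ["Previously rejected edits (do not propose again):"]
        for description, delta in self.entries:
            lines.append(f"- {description} (selection score change: {delta:+.1f})")
        return "\n".join(lines)

    def reset(self) -> None:
        """Clear the buffer at an epoch boundary."""
        self.entries.clear()
```

Calling `reset()` between epochs mirrors the epoch-local behaviour described above: an edit that failed against an older skill gets another chance once the skill has changed.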


Epoch-Wise Slow/Meta Update: Capturing Long-Horizon Lessons

At the end of each epoch, SkillOpt performs a slow update - the text-space analog of a momentum update or moving average in weight-space optimization.

The procedure:

  1. Re-run the previous epoch’s skill and the current epoch’s skill on the same sample of 20 tasks
  2. Group results into four categories: improvements, regressions, persistent failures, stable successes
  3. Write a concise longitudinal guidance block into a protected slow-update field - a section of the skill that step-level edits cannot overwrite
  4. Update an optimizer-side meta skill (teacher-only, never shipped with the final artifact) summarizing which edit patterns helped and which were rejected across epochs

Why protect the slow-update field? Fast local updates respond to batch-level evidence - “this spreadsheet formula failed, add a validation step.” But durable domain lessons - “always check for merged cells before parsing” - should not be overwritten by a single batch’s contradictory signal. The protected field ensures that long-horizon wisdom survives short-horizon noise.
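Step 2 of the procedure, the four-way grouping, can be sketched as a comparison of per-task scores from the two skills. The pass threshold and the dictionary-based interface are assumptions; writing the guidance block itself is left to the optimizer model and is not shown.

```python
from typing import Dict, List


def categorize_tasks(
    previous: Dict[str, float],
    current: Dict[str, float],
    pass_threshold: float = 1.0,
) -> Dict[str, List[str]]:
    """Group shared tasks by how their outcome changed between two skills.

    Assumption: a task counts as passed when its score >= pass_threshold.
    """
    groups: Dict[str, List[str]] = {
        "improvements": [],
        "regressions": [],
        "persistent_failures": [],
        "stable_successes": [],
    }
    for task in sorted(previous.keys() & current.keys()):
        passed_before = previous[task] >= pass_threshold
        passed_after = current[task] >= pass_threshold
        if passed_after and not passed_before:
            groups["improvements"].append(task)
        elif passed_before and not passed_after:
            groups["regressions"].append(task)
        elif passed_before:
            groups["stable_successes"].append(task)
        else:
            groups["persistent_failures"].append(task)
    return groups
```

Regressions and persistent failures are the groups most likely to yield durable lessons, since they show what the fast edits broke or never fixed.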

This is the single most impactful component. Removing both the meta skill and slow update drops SpreadsheetBench by a devastating -22.5 points (77.5 to 55.0). No other ablation comes close.


The Final Artifact: Just a Text File

After 4 epochs (default), SkillOpt exports best_skill.md - a compact file of 300-2,000 tokens. That’s it. No model checkpoints, no adapter weights, no ONNX graphs. Just a Markdown file that gets prepended to the system prompt.

The final test evaluation follows standard protocol:

$$\text{Test}(s^\star_{\text{sel}}) = \frac{1}{|D_{\text{test}}|} \sum_{x \in D_{\text{test}}} r(s^\star_{\text{sel}}) \tag{4}$$

where:

  • $D_{\text{test}}$ - the held-out test split, locked until final reporting
  • $s^\star_{\text{sel}}$ - the skill selected by the validation gate, never retrained on $D_{\text{test}}$

Interpretation: Standard train/val/test protocol applied to text-space optimization. The test split is disjoint from both training and selection data, so reported numbers measure true generalization.

The data split is 4:1:5 (train:selection:test) by default. Yes - half the data is reserved for testing. This is unusually conservative and makes the reported gains more credible.
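A 4:1:5 split is easy to reproduce. The sketch below shuffles with a fixed seed and slices by ratio; the shuffling, the seed, and the rule that rounding leftovers go to the test split are assumptions, since the text only states the ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_tasks(
    tasks: Sequence[T], ratios: Tuple[int, int, int] = (4, 1, 5), seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Split tasks into disjoint (train, selection, test) lists by ratio."""
    items = list(tasks)
    random.Random(seed).shuffle(items)  # deterministic for a given seed
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_selection = len(items) * ratios[1] // total
    train = items[:n_train]
    selection = items[n_train:n_train + n_selection]
    test = items[n_train + n_selection:]  # any rounding remainder lands here
    return train, selection, test


train, selection, test = split_tasks(range(100))
```

For 100 tasks this yields 40 training, 10 selection, and 50 test tasks, with no task appearing in more than one split.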


Experiments and Results: 52/52 Cells Won

Direct chat: up to +39 points on procedural benchmarks

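| Benchmark | No Skill | GEPA (best baseline) | SkillOpt | Gain vs No Skill |
|---|---|---|---|---|
| SearchQA | 77.7 | 84.8 | 87.3 | +9.6 |
| SpreadsheetBench | 41.8 | 73.6 | 80.7 | +38.9 |
| OfficeQA | 33.1 | 65.7 | 72.1 | +39.0 |
| DocVQA | 78.8 | 90.6 | 91.2 | +12.4 |
| LiveMathBench | 37.6 | 52.0 | 66.9 | +29.3 |
| ALFWorld | 83.6 | 93.3 | 95.5 | +11.9 |
| Average | 58.8 | 76.9 | 82.3 | +23.5 |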

The gains on SpreadsheetBench (+38.9) and OfficeQA (+39.0) are staggering. These are procedural benchmarks - tasks that require consistent tool-use policies, output formatting, and multi-step reasoning strategies. Exactly the kind of knowledge that a well-optimized skill document encodes.

Agentic harnesses: Codex and Claude Code

SkillOpt isn’t limited to direct chat. The same skill files work inside agentic execution environments:

| Harness | No Skill | EvoSkill | SkillOpt | Gain vs No Skill |
|---|---|---|---|---|
| Codex (SpreadsheetBench) | 27.5 | 67.5 | 85.0 | +57.5 |
| Claude Code (SpreadsheetBench) | 22.1 | 75.0 | 80.4 | +58.3 |

The gains are even larger in agentic harnesses. Why? Because execution environments amplify both good and bad habits. A frozen model without skill guidance makes inconsistent tool choices that compound across multi-step workflows. A well-optimized skill enforces consistent procedures from the first step.

Small models benefit the most

The paper evaluates seven target models across the GPT-5 family. The pattern is clear: smaller models see larger absolute gains from optimized skills. A compact skill document effectively compensates for the smaller model’s weaker in-context reasoning, acting as externalized procedural memory.


Do Optimized Skills Transfer?

One of the most surprising findings is that best_skill.md transfers - to different models, different execution harnesses, and even different benchmarks.

Cross-harness transfer

A skill optimized for Codex CLI, when dropped into Claude Code with zero modification, scores 81.8 on SpreadsheetBench - actually higher than the in-domain Claude Code SkillOpt result (80.4). That’s a +59.7 point gain over the Claude Code no-skill baseline. The skill document captures domain knowledge that is harness-agnostic.

Cross-model transfer

Skills optimized on GPT-5.5 transfer positively to smaller GPTs without re-optimization. The gains decay with model scale distance but remain positive across all tested combinations.

Cross-benchmark transfer

OlympiadBench-trained math skills yield positive gains (+1.3 to +3.7 points) on Omni-MATH across three model scales. The transfer is modest - source and target share only the broad “math competition” family - but consistently positive, suggesting that optimized skills capture reusable reasoning patterns, not just benchmark-specific hacks.


Ablations: What Actually Drives the Gains?

The edit budget matters; the exact schedule does not

All moderate edit budgets ($L_t \in [1, 16]$) perform competitively. Constant, cosine, and linear schedules stay within 3 points of each other. But removing the budget entirely causes measurable drops: -4.0 on LiveMath, -1.8 on SpreadsheetBench. The textual learning rate acts as a regularizer - it prevents catastrophic skill rewrites even when individual edits are harmful.

Validation gating and the rejected-edit buffer

The validation gate is binary: either it accepts (strictly better) or rejects. Removing the rejected-edit buffer costs -1.6 to -4.6 points across benchmarks. The buffer’s value is practical, not theoretical - it simply prevents the optimizer from wasting steps on edits that already failed.

The slow/meta update is the single most important component

| Ablation | SearchQA | SpreadsheetBench | LiveMath |
|---|---|---|---|
| Full SkillOpt | 87.3 | 80.7 | 66.9 |
| Without meta skill | 85.3 | 78.9 | 63.7 |
| Without meta + slow update | 85.1 | 55.0 | 62.5 |

The -22.5 point collapse on SpreadsheetBench when removing both components confirms that procedural benchmarks require long-horizon consolidation. Fast edits alone cannot capture the durable domain rules that drive performance on complex tool-use tasks.

Training data scaling

SpreadsheetBench climbs from 47.5 (1 example) to 78.0 (full training set). SearchQA saturates near 84-86 after just 20% of the data. Procedural benchmarks are data-hungry because they require diverse failure modes to build comprehensive skill rules.


Cost and Interpretability

The optimization loop is not cheap. Training costs range from 0.6M to 46.4M tokens per benchmark (Table 6), depending on task complexity and trajectory length. But this is a one-time offline cost - the exported best_skill.md adds zero inference overhead beyond its 300-2,000 token context.

The interpretability story is compelling. Every edit to the skill is a natural-language patch: “add rule: validate cell references before formula evaluation.” Every rejected edit is logged with its score impact. Every slow-update consolidation is a readable paragraph explaining what the optimizer learned across epochs. You can audit the entire optimization history in plain text - try doing that with a fine-tuned LoRA adapter.


Limitations

  • Optimization cost. Training requires many rollout batches (default $B = 40$ per step, 4 epochs), consuming substantial API budget. This is practical for high-value production deployments but expensive for casual experimentation.
  • Single skill per domain. SkillOpt studies one compact skill per (domain, model, harness) triple. Multi-skill libraries with selective routing remain unexplored.
  • Cross-benchmark transfer is limited. Gains of +1.3 to +3.7 points suggest that skill transfer works best within narrow task families, not across broad domain shifts.
  • Optimizer dependence. The strongest results use GPT-5.5 as the optimizer. Users without frontier API access can use target-matched optimization (56-74% of the gain), but the gap is real.
  • Slow/meta update overhead. Rerunning 20 tasks per epoch for the slow update adds cost proportional to task execution time.

Takeaways

1. Skill documents are trainable parameters. SkillOpt demonstrates that natural-language skill documents can be optimized with the same rigor as neural network weights - bounded updates, validation gating, negative feedback, and momentum-like consolidation - producing compact, reusable artifacts that improve frozen models by up to +39 points.

2. The validation gate is non-negotiable. Plausible-sounding LLM self-edits frequently hurt performance. Strict held-out validation converts reflection from wishful thinking into empirical optimization. Every accepted edit is verified; every rejected edit is remembered.

3. Long-horizon consolidation drives the biggest gains. The slow/meta update - comparing skills across epochs and writing protected domain lessons - is worth up to 22.5 points on procedural benchmarks. Fast local edits alone are insufficient.

4. The artifact is just text. The final output is a 300-2,000 token Markdown file that transfers across model scales, execution harnesses, and related benchmarks. No model surgery, no adapter weights, no retraining. This makes SkillOpt uniquely practical for deployment with closed frontier models.

5. 52/52 is hard to argue with. SkillOpt wins or ties on every single evaluated (model, benchmark, harness) combination. The +23.5 average gain on GPT-5.5 direct chat, +24.8 on Codex, and +19.1 on Claude Code suggest that principled text-space optimization is a genuinely general capability.


Sources and materials:

📄 SkillOpt: Executive Strategy for Self-Evolving Agent Skills - arXiv 2605.23904