Imagine 8 people in a company using the same AI assistant. Each of them hits the same problems — wrong API port, missing file, malformed argument — and each time independently discovers a workaround. The next day, someone else falls into the exact same hole. The system doesn’t learn from its users’ experience. What if a nightly “editorial shift” automatically analyzed all the day’s interactions, drew conclusions, and served improved procedures to everyone the next morning?
That’s exactly what SkillClaw does — a framework that turns isolated multi-user interactions into continuous, collective evolution of LLM agent skills.
Motivation: Why Static Skills Are a Problem
Modern agent systems like OpenClaw rely on skills — structured procedures that encode how an agent should use tools and solve tasks. Users install skills from a central hub, and these skills become the fundamental building blocks of agent behavior.
The problem? These skills are static. They don’t evolve after deployment.
In practice, this means:
- The same workflows, tool usage patterns, and failure modes are repeatedly rediscovered independently by different users
- Solutions found during interaction don’t survive beyond the session — they vanish when the window closes
- The system doesn’t accumulate knowledge at the system level — every user starts from scratch
Existing approaches don’t fully solve this:
| Approach | What it does | What’s missing |
|---|---|---|
| Memory-based (Reflexion, ExpEL) | Stores trajectories for retrieval | Hard to generalize into behavioral improvements |
| Skill-based (Voyager, SkillWeaver) | Compresses experience into instructions | Treats the library as a static resource |
| Local adaptation | Improves a single agent | Improvements don’t propagate to other users |
What’s missing is a mechanism that turns everyday interactions into continuous skill evolution — and does so collectively, across users.
The Key Idea: Collective Evolution via an Agentic Evolver
Before we get to formalisms, let’s build intuition.
Think of a company’s standard operating procedures (SOP) manual. Eight people use it, each in a slightly different context. Someone discovers that procedure X fails when the file is in PDF format instead of TXT. Someone else discovers that the API port changed from 9100 to 9110. Yet another person invents a better ordering of steps.
In a typical system, these discoveries stay in private notebooks. In SkillClaw, it works differently:
- Evidence collection: Every interaction is recorded with its full causal chain — what the agent did, what response it got, what went wrong
- Grouping: Sessions are grouped by the skills they used — all interactions with the same skill land in one “evidence pool”
- Autonomous editor: An LLM agent (the evolver) analyzes successes and failures and decides: fix the existing skill? Create a new one? Leave it unchanged?
- Nightly validation: Candidate changes are tested under real conditions. Only confirmed improvements make it to deployment
- Synchronization: In the morning, all users get the improved version
The key insight behind this approach: different users invoking the same skill in different contexts create a natural ablation. When 5 users invoke a “Slack summarization” skill and 3 succeed while 2 fail — the comparison reveals exactly where the skill breaks. A single user generates too little signal to distinguish a generalizable improvement from a one-off workaround. Aggregation across users provides a stable evidence base.
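To make the “natural ablation” concrete, here is a minimal sketch of how aggregating outcomes across users surfaces a skill’s failure rate. All names and data are invented for illustration, not taken from the paper:

```python
from collections import defaultdict

# Hypothetical session records: (user, skill, succeeded) tuples,
# mirroring the 3-of-5 example from the text.
sessions = [
    ("u1", "slack-summarize", True),
    ("u2", "slack-summarize", True),
    ("u3", "slack-summarize", False),
    ("u4", "slack-summarize", False),
    ("u5", "slack-summarize", True),
]

def success_rate_by_skill(sessions):
    """Aggregate outcomes across users so a skill's failure rate becomes
    visible -- any single user sees too few samples to estimate it."""
    tally = defaultdict(lambda: [0, 0])  # skill -> [successes, total]
    for _user, skill, ok in sessions:
        tally[skill][1] += 1
        if ok:
            tally[skill][0] += 1
    return {s: wins / total for s, (wins, total) in tally.items()}

print(success_rate_by_skill(sessions))  # {'slack-summarize': 0.6}
```

A single user contributes one or two of these rows; only the pooled tally makes the 60% success rate, and the contexts behind the failures, visible.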
Formalization: How SkillClaw Works
Formally, let $S = \{s_1, \ldots, s_M\}$ denote a shared skill set, where each skill is a reusable procedural artifact. Each user interaction produces a session trajectory $\tau$ that records the complete loop: prompt, agent actions, environment feedback, and final response.
The system’s goal can be expressed in a single equation:
$$ S' = \Phi(S, T) \tag{1} $$
where:
- $S$ — current skill set
- $T = \{\tau_i\}$ — set of trajectories collected from multiple users
- $\Phi$ — evolution operator
- $S'$ — updated skill set
Interpretation: Take the current skill set and collected user interactions, produce an improved version. The crucial point: $T$ contains trajectories from multiple users — a perspective unavailable to any single agent.
From Isolated Sessions to Shared Evidence
Evolution happens in two stages: structuring and aggregation.
Structuring. Each raw session is converted into a representation that preserves the causal chain:
$$ \text{prompt} \rightarrow \text{action} \rightarrow \text{feedback} \rightarrow \cdots \rightarrow \text{agent response} \tag{2} $$
Why does this matter? Most skill-level failures are procedural — wrong argument format, missing validation step, incorrect tool call ordering. These problems are invisible in the agent’s final response. They can only be diagnosed from the full action-feedback chain.
Lightweight metadata is also extracted from each session: (i) which skills were referenced, (ii) whether tool errors occurred, (iii) a coarse quality estimate.
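A structured session might look like the following sketch. The field names and the error heuristic are illustrative assumptions, not the paper’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str    # tool call issued by the agent
    feedback: str  # environment response to that call

@dataclass
class StructuredSession:
    """Preserves the prompt -> action -> feedback -> ... -> response chain
    plus lightweight metadata. Illustrative schema, not the paper's."""
    prompt: str
    steps: list      # ordered list[Step], the causal chain
    response: str
    skills_used: set = field(default_factory=set)

    @property
    def had_tool_error(self) -> bool:
        # Coarse heuristic: any feedback mentioning an error counts.
        return any("error" in s.feedback.lower() for s in self.steps)

session = StructuredSession(
    prompt="Summarize #eng-updates",
    steps=[Step("slack.fetch(port=9100)", "ConnectionError: port closed"),
           Step("slack.fetch(port=9110)", "ok: 42 messages")],
    response="Summary of 42 messages...",
    skills_used={"slack-summarize"},
)
print(session.had_tool_error)  # True
```

Note that the final `response` alone looks fine; only the step chain reveals the wrong-port retry, which is exactly the procedural signal the evolver needs.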
Aggregation and grouping. Sessions are grouped by the skills they invoked:
$$ G(s) = \{\tau_i \mid s \in K_i\} \tag{3} $$
where:
- $G(s)$ — group of sessions invoking skill $s$
- $K_i$ — set of skills used in session $\tau_i$
- $G(\emptyset)$ — sessions that didn’t use any skill
Interpretation: This grouping goes beyond data organization. When multiple sessions invoke the same skill but produce different outcomes across different users, tasks, and environments — the skill becomes a controlled factor, and the comparison directly reveals where the skill works and where it fails. This natural ablation enables two operations that would be unreliable from single-user data:
- Evaluating how an existing skill actually performs under diverse real-world usage
- Identifying recurring procedures that no existing skill covers (from patterns in $G(\emptyset)$)
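Equation (3) translates directly into a grouping pass. A minimal sketch, with invented session IDs and skill names:

```python
from collections import defaultdict

# Minimal stand-in sessions: (session_id, set of skills K_i it referenced).
sessions = [
    ("t1", {"slack-summarize"}),
    ("t2", {"slack-summarize", "save-report"}),
    ("t3", set()),  # used no skill -> lands in the G(∅) pool
]

def group_by_skill(sessions):
    """Implements Eq. (3): G(s) = {tau_i | s in K_i}, plus G(∅) keyed
    by None for sessions that invoked no skill at all."""
    groups = defaultdict(list)
    for sid, skills in sessions:
        if not skills:
            groups[None].append(sid)  # G(∅)
        for s in skills:
            groups[s].append(sid)     # a session can appear in several groups
    return dict(groups)

groups = group_by_skill(sessions)
print(groups["slack-summarize"])  # ['t1', 't2']
print(groups[None])               # ['t3']
```

The `G(∅)` pool is where the evolver later looks for recurring procedures that no existing skill covers.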
Agentic Skill Evolution — Refine, Create, Skip
The heart of the system is an agentic evolver — an LLM agent equipped with a structured harness that supplies grouped session evidence, skill definitions, and permitted actions. The harness provides structure but does not constrain the evolver’s reasoning.
For a given skill $s$ and its session group $G(s)$, the evolver examines both successes and failures, then selects one of three actions:
- Refine — update the skill to correct identified errors or improve robustness
- Create — introduce a new skill when $G(s)$ reveals recurring sub-procedures not captured by existing skills
- Skip — leave the skill unchanged when evidence is insufficient
Why an agent instead of rules? Skill failures are heterogeneous — different formats, context lengths, error types. Predefined rules (e.g., pattern matching on error messages) would be too brittle for real-world scenarios. An LLM agent can reason end-to-end over sessions of arbitrary length and format.
The critical principle: joint analysis of successes and failures. Successes define invariants — parts of the skill that work and must not be altered. Failures define targets — specific behaviors requiring correction. This joint perspective prevents naive fixing where patching one bug breaks something that previously worked. Each update corrects deficiencies while preserving what successes have validated — evolution is cumulative.
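The decision step can be sketched as follows. The LLM call is replaced by a stand-in function, and the evidence threshold is an invented parameter; the paper’s harness is open-ended rather than this rigid:

```python
MIN_EVIDENCE = 3  # invented threshold, not from the paper

def evolve_skill(skill, group, propose):
    """Sketch of the evolver's decision over a session group G(s).
    `propose` stands in for the LLM's open-ended reasoning. Successes
    define invariants to preserve; failures define targets to fix."""
    if len(group) < MIN_EVIDENCE:
        return ("skip", skill)            # evidence insufficient
    failures = [t for t in group if not t["ok"]]
    if not failures:
        return ("skip", skill)            # nothing to correct
    successes = [t for t in group if t["ok"]]
    # Joint analysis: the proposal sees both what worked and what broke.
    return ("refine", propose(skill, keep=successes, fix=failures))

# Deterministic stand-in for the LLM: append a fix note to the skill text.
fake_llm = lambda skill, keep, fix: skill + " [use port 9110]"

group = [{"ok": True}, {"ok": True}, {"ok": False}]
print(evolve_skill("slack-summarize", group, fake_llm))
# ('refine', 'slack-summarize [use port 9110]')
```

The `create` action (triggered by patterns in $G(\emptyset)$) is omitted here for brevity; it would follow the same evidence-gated shape.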
The Synchronization Loop and Nightly Validation
After evolution, candidate updates undergo validation before entering the shared repository. Validation runs at night in available idle user environments:
- For skill $s$ and its candidate update $s'$, the system selects relevant tasks from the day’s interactions
- Both versions are executed under identical conditions — full toolchain, multi-step interactions, intermediate feedback
- The model compares outcomes: if $s'$ performs better → Accept; otherwise → Reject
- Accepted updates are merged into the shared repository and synchronized to all agents the next day
This strategy introduces monotonic deployment: since only improvements are accepted, the deployed skill pool never degrades over time. Users always interact with the best validated pool from the previous night.
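The accept-only-if-better rule can be sketched as a simple A/B comparison. The `run` and `score` hooks are hypothetical stand-ins for the real execution harness:

```python
def validate_update(skill, candidate, tasks, run, score):
    """Nightly A/B check sketch: run both versions on the same tasks and
    accept only strict improvement, so the deployed pool never degrades.
    `run` and `score` are invented harness hooks, not the paper's API."""
    old = sum(score(run(skill, t)) for t in tasks)
    new = sum(score(run(candidate, t)) for t in tasks)
    return candidate if new > old else skill  # monotonic: ties are rejected

# Toy harness: a "skill" is just a dict mapping task -> achieved quality.
run = lambda skill, task: skill.get(task, 0.0)
score = lambda outcome: outcome

v1 = {"summarize": 0.4, "save": 0.2}
v2 = {"summarize": 0.7, "save": 0.2}  # candidate improves one task
print(validate_update(v1, v2, ["summarize", "save"], run, score) is v2)  # True
```

Rejecting ties as well as regressions is one reasonable reading of the “accept only if better” rule; it keeps the deployed pool from churning on noise.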
The complete loop:
$$ \text{Interaction} \rightarrow \text{Evidence} \rightarrow \text{Evolution} \rightarrow \text{Validation} \rightarrow \text{Deployment} \tag{4} $$
From the user’s perspective — none of this is visible. Users interact with their agents as usual while skill evolution happens in the background.
The Algorithm Step by Step
The complete collective skill evolution algorithm:
```
Algorithm: Agentic Collective Skill Evolution

Input:  Skill repository S, user sessions T
Output: Updated repository S'

1. Convert T into structured evidence E
2. Group E by referenced skills → {G(s)} and G(∅)
3. S' ← S
4. For each group G(s):
   a. Agentic evolver analyzes recurring success/failure patterns
   b. Select action from {refine, create, skip}
   c. Generate candidate skill update if evidence supports modification
   d. Apply conservative editing and validation
   e. Merge approved updates into S'
5. Analyze G(∅) for missing but reusable procedures
6. Add validated new skills into S'
7. Synchronize S' back to all agents
8. Return S'
```
Three key properties follow from this design:
- Collective evolution — sessions aggregated across users; knowledge from one interaction propagates to the shared ecosystem
- Full automation — the entire pipeline (recording → evolution → synchronization) runs without manual curation or user intervention
- Agentic adaptability — updates produced through open-ended reasoning, not predefined rules
Experiments on WildClawBench
Experimental Setup
The authors evaluate SkillClaw on WildClawBench — a benchmark of 60 complex tasks across 6 domains (productivity, code, social, retrieval, creative, safety). Key benchmark properties:
- Full execution environment (Linux container with tools)
- Multimodal inputs (text, code, image, video)
- 15-50 steps per task
- Critical errors → zero score
Setup: 6 days (6 rounds), 8 concurrent users, day/night cycle, Qwen3-Max as backbone model.
Main Results
| Category | Day 1 (baseline) | Day 6 (post-evolution) | Abs. gain | Rel. gain |
|---|---|---|---|---|
| Social Interaction | 54.01% | 60.34% | +6.33 | +11.72% |
| Search & Retrieval | 22.73% | 34.55% | +11.82 | +52.00% |
| Creative Synthesis | 11.57% | 21.80% | +10.23 | +88.41% |
| Safety & Alignment | 24.00% | 32.00% | +8.00 | +33.33% |
Creative Synthesis shows the largest relative jump (+88.41%) — but note: the main bottleneck wasn’t content generation but environment setup (file validation, directory configuration, multimedia pipelines).
Controlled validation on 3 isolated tasks confirms the mechanism:
| Task | Baseline | Post-evolution | Gain |
|---|---|---|---|
| basic extraction | 21.7% | 69.6% | +47.8% |
| deadline parsing | 41.1% | 48.0% | +6.9% |
| save report | 28.3% | 100.0% | +71.7% |
Evolution is most effective for procedural gaps (save report: missing save procedure → full correction). Weaker for tasks requiring deeper reasoning (deadline parsing: +6.9%).
Per-Category Evolution Analysis
Each category exhibits a distinct evolution pattern:
Social Interaction — sharp jump on Day 2, then stabilization. Main improvement: the “cross-dept Slack summarization” skill rewritten from a descriptive instruction into an explicit procedural workflow. One skill update was enough.
Search & Retrieval — gradual improvement (Day 2: +7.27, Day 4: +4.55). Layered evolution: first file and path validation (foundation layer), then constraint-aware retrieval planning (upper layer). Higher-level reasoning becomes effective only after lower-level reliability is ensured.
Creative Synthesis — large jump on Day 2, then plateau. The accepted skill: validate-tmp-workspace-inputs — validating the /tmp_workspace directory before creative tasks. Later, more complex multimedia pipelines failed validation within the 6-day window.
Safety & Alignment — improvement only from Day 5 onward. Skills focused on reliability under real-world conditions: Git push fallback without credentials, correct directory cloning procedures, safe execution in non-interactive environments.
Case Studies: How Evolution Improves Execution
Slack message analysis. Original agent: retrieves all messages, processes them uniformly, trial-and-error on tool errors (e.g., wrong API port 9100 instead of 9110). After evolution: skill encodes the correct port, introduces preview scanning → selective full content retrieval → task extraction. Three improvements: task decomposition, proactive error correction, selective retrieval.
ICCV 2025 paper affiliation analysis. Original agent: heuristic university name matching — counts appearances without distinguishing first affiliation from subsequent ones. After evolution: strict “first affiliation” definition from the official PDF first page, alignment with OpenAccess records, targeted re-checks on ambiguous cases.
SAM3 inference in incomplete environments. Original agent assumes files and conditions (e.g., CUDA) are available — fails when a path is missing. After evolution: environment precheck, treating missing output directory as non-blocking, searching for nearby assets, adapting to system constraints (monkey-patching CUDA → CPU).
Why It Works — A Deeper Analysis
SkillClaw works for three fundamental reasons:
1. Natural experiment as evidence base. Multiple users invoking the same skill in different contexts creates a quasi-controlled experiment. The skill is the controlled factor; contexts, tasks, and users are variables. This makes it possible to separate generalizable improvements from one-off workarounds — something statistically impossible from single-user data.
2. Monotonic deployment policy. Nightly validation with an “accept only if better” rule ensures the deployed skill pool never degrades. This is crucial for user trust — the system can improve but never get worse. Most candidate skills are rejected (out of 6 Social Interaction candidates, only 1 was accepted), demonstrating the system’s conservatism.
3. Layered evolution mirrors real dependencies. Search & Retrieval first fixes file validation (foundation layer), then moves to constraint-aware planning (upper layer). This isn’t artificial — higher-level reasoning genuinely doesn’t work when lower-level input handling is unreliable. The system discovers this hierarchy automatically.
But why does deadline parsing improve by only 6.9%? Because skill evolution is inherently procedural — it optimizes step sequences, tool arguments, validations. Tasks requiring deeper semantic reasoning (date parsing, context interpretation) can’t easily be codified in a procedural skill. This is a fundamental limitation of skill-based approaches.
Limitations
- Test scale — 8 users, 6 days, 1 benchmark, 1 backbone model (Qwen3-Max). Behavior at production scale (hundreds of users, months) remains unverified
- Nightly validation cost — double execution for each candidate increases token consumption
- No convergence analysis — it’s unclear whether the evolution loop converges to an optimal skill set or enters cycles
- Privacy — user trajectories aggregated centrally, raising concerns about sensitive data in enterprise settings
- Procedural limitation — evolution is weakest for tasks requiring reasoning rather than procedures
- Status: work in progress — results missing for 2 out of 6 benchmark categories (Productivity Flow, Code Intelligence)
Takeaways
What to remember:
- SkillClaw turns static skill libraries into dynamically evolving ecosystems driven by multi-user interactions
- Key mechanism: grouping sessions by skills creates natural ablation, enabling the separation of generalizable improvements from one-off workarounds
- Agentic evolver (LLM agent) instead of rigid rules — flexible, context-aware skill updates
- Nightly validation ensures monotonicity — the system never deploys a degraded skill
- Results on WildClawBench: +11.72% to +88.41% relative improvement across 4 categories, with 8 users over 6 days
Looking ahead: SkillClaw is a conceptually elegant proposal that shifts the paradigm from “an agent learns by itself” to “agents learn collectively.” The biggest open question is scaling — will the mechanism maintain its effectiveness with hundreds of users and thousands of skills? And will nightly validation become a bottleneck? We’ll have to wait for a more complete version of the work for those answers.
Sources and materials:
- 📄 Paper: SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
- 💻 Project/Code: GitHub — AMAP-ML/SkillClaw