Imagine 8 people in a company using the same AI assistant. Each of them hits the same problems — wrong API port, missing file, malformed argument — and each time independently discovers a workaround. The next day, someone else falls into the exact same hole. The system doesn’t learn from its users’ experience. What if a nightly “editorial shift” automatically analyzed all the day’s interactions, drew conclusions, and served improved procedures to everyone the next morning?

That’s exactly what SkillClaw does — a framework that turns isolated multi-user interactions into continuous, collective evolution of LLM agent skills.

Motivation: Why Static Skills Are a Problem

Modern agent systems like OpenClaw rely on skills — structured procedures that encode how an agent should use tools and solve tasks. Users install skills from a central hub, and these skills become the fundamental building blocks of agent behavior.

The problem? These skills are static. They don’t evolve after deployment.

In practice, this means:

  • The same workflows, tool usage patterns, and failure modes are repeatedly rediscovered independently by different users
  • Solutions found during interaction don’t survive beyond the session — they vanish when the window closes
  • Knowledge doesn’t accumulate at the system level — every user starts from scratch

Existing approaches don’t fully solve this:

| Approach | What it does | What’s missing |
|---|---|---|
| Memory-based (Reflexion, ExpEL) | Stores trajectories for retrieval | Hard to generalize into behavioral improvements |
| Skill-based (Voyager, SkillWeaver) | Compresses experience into instructions | Treats the library as a static resource |
| Local adaptation | Improves a single agent | Improvements don’t propagate to other users |

What’s missing is a mechanism that turns everyday interactions into continuous skill evolution — and does so collectively, across users.

The Key Idea: Collective Evolution via an Agentic Evolver

Before we get to formalisms, let’s build intuition.

Think of a company’s standard operating procedures (SOP) manual. Eight people use it, each in a slightly different context. Someone discovers that procedure X fails when the file is in PDF format instead of TXT. Someone else discovers that the API port changed from 9100 to 9110. Yet another person invents a better ordering of steps.

In a typical system, these discoveries stay in private notebooks. In SkillClaw, it works differently:

  1. Evidence collection: Every interaction is recorded with its full causal chain — what the agent did, what response it got, what went wrong
  2. Grouping: Sessions are grouped by the skills they used — all interactions with the same skill land in one “evidence pool”
  3. Autonomous editor: An LLM agent (the evolver) analyzes successes and failures and decides: fix the existing skill? Create a new one? Leave it unchanged?
  4. Nightly validation: Candidate changes are tested under real conditions. Only confirmed improvements make it to deployment
  5. Synchronization: In the morning, all users get the improved version

The key insight behind this approach: different users invoking the same skill in different contexts create a natural ablation. When 5 users invoke a “Slack summarization” skill and 3 succeed while 2 fail — the comparison reveals exactly where the skill breaks. A single user generates too little signal to distinguish a generalizable improvement from a one-off workaround. Aggregation across users provides a stable evidence base.

Formalization: How SkillClaw Works

Formally, let $S = \{s_1, \ldots, s_M\}$ denote a shared skill set, where each skill is a reusable procedural artifact. Each user interaction produces a session trajectory $\tau$ that records the complete loop: prompt, agent actions, environment feedback, and final response.

The system’s goal can be expressed in a single equation:

$$ S' = \Phi(S, T) \tag{1} $$

where:

  • $S$ — current skill set
  • $T = \{\tau_i\}$ — set of trajectories collected from multiple users
  • $\Phi$ — evolution operator
  • $S'$ — updated skill set

Interpretation: Take the current skill set and collected user interactions, produce an improved version. The crucial point: $T$ contains trajectories from multiple users — a perspective unavailable to any single agent.

From Isolated Sessions to Shared Evidence

Evolution happens in two stages: structuring and aggregation.

Structuring. Each raw session is converted into a representation that preserves the causal chain:

$$ \text{prompt} \rightarrow \text{action} \rightarrow \text{feedback} \rightarrow \cdots \rightarrow \text{agent response} \tag{2} $$

Why does this matter? Most skill-level failures are procedural — wrong argument format, missing validation step, incorrect tool call ordering. These problems are invisible in the agent’s final response. They can only be diagnosed from the full action-feedback chain.

Lightweight metadata is also extracted from each session: (i) which skills were referenced, (ii) whether tool errors occurred, (iii) a coarse quality estimate.
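The structuring step above can be sketched as a small data model. This is a minimal illustration, not the paper's actual schema: the `Step`/`SessionEvidence` types, field names, and the error-detection heuristic are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str    # tool call the agent issued
    feedback: str  # environment response (stdout, error, API reply)

@dataclass
class SessionEvidence:
    prompt: str
    steps: list                 # ordered action -> feedback chain (Eq. 2)
    response: str               # final agent response
    skills_used: set = field(default_factory=set)
    had_tool_error: bool = False
    quality: float = 0.0        # coarse quality estimate in [0, 1]

def structure(prompt, steps, response, skills_used):
    """Convert a raw session into structured evidence plus lightweight metadata."""
    # Placeholder heuristic: flag the session if any feedback mentions an error.
    had_error = any("error" in s.feedback.lower() for s in steps)
    quality = 0.0 if had_error else 1.0
    return SessionEvidence(prompt, list(steps), response,
                           set(skills_used), had_error, quality)
```

The point of keeping the full `steps` chain, rather than only the final response, is exactly the diagnostic argument above: a wrong port or a missing validation step is visible only in the action-feedback sequence.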

Aggregation and grouping. Sessions are grouped by the skills they invoked:

$$ G(s) = \{\tau_i \mid s \in K_i\} \tag{3} $$

where:

  • $G(s)$ — group of sessions invoking skill $s$
  • $K_i$ — set of skills used in session $\tau_i$
  • $G(\emptyset)$ — sessions that didn’t use any skill

Interpretation: This grouping goes beyond data organization. When multiple sessions invoke the same skill but produce different outcomes across different users, tasks, and environments — the skill becomes a controlled factor, and the comparison directly reveals where the skill works and where it fails. This natural ablation enables two operations that would be unreliable from single-user data:

  1. Evaluating how an existing skill actually performs under diverse real-world usage
  2. Identifying recurring procedures that no existing skill covers (from patterns in $G(\emptyset)$)
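Equation (3) amounts to an inverted index from skills to sessions. A sketch, assuming each session is a dict carrying the set $K_i$ of skills it referenced:

```python
from collections import defaultdict

def group_by_skill(sessions):
    """Build G(s) for every skill s, plus G(empty) for skill-free sessions."""
    groups = defaultdict(list)  # skill -> list of sessions, i.e. G(s)
    no_skill = []               # G(empty): sessions that used no skill
    for tau in sessions:
        if tau["skills"]:       # K_i: skills referenced in session tau_i
            for s in tau["skills"]:
                groups[s].append(tau)
        else:
            no_skill.append(tau)
    return dict(groups), no_skill
```

Note that a session invoking two skills lands in both groups — membership is per skill, not a partition.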

Agentic Skill Evolution — Refine, Create, Skip

The heart of the system is an agentic evolver — an LLM agent equipped with a structured harness that supplies grouped session evidence, skill definitions, and permitted actions. The harness provides structure but does not constrain the evolver’s reasoning.

For a given skill $s$ and its session group $G(s)$, the evolver examines both successes and failures, then selects one of three actions:

  • Refine — update the skill to correct identified errors or improve robustness
  • Create — introduce a new skill when $G(s)$ reveals recurring sub-procedures not captured by existing skills
  • Skip — leave the skill unchanged when evidence is insufficient

Why an agent instead of rules? Skill failures are heterogeneous — different formats, context lengths, error types. Predefined rules (e.g., pattern matching on error messages) would be too brittle for real-world scenarios. An LLM agent can reason end-to-end over sessions of arbitrary length and format.

The critical principle: joint analysis of successes and failures. Successes define invariants — parts of the skill that work and must not be altered. Failures define targets — specific behaviors requiring correction. This joint perspective prevents naive fixing where patching one bug breaks something that previously worked. Each update corrects deficiencies while preserving what successes have validated — evolution is cumulative.
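The harness around the evolver might look like the following sketch. Everything here is illustrative: `llm_decide` stands in for the actual model call, and the `MIN_EVIDENCE` threshold is an assumption the paper does not specify.

```python
ACTIONS = {"refine", "create", "skip"}
MIN_EVIDENCE = 3  # assumed threshold, not from the paper

def evolve_skill(skill, group, llm_decide):
    """Choose refine/create/skip for one skill from its evidence pool G(s)."""
    if len(group) < MIN_EVIDENCE:
        return "skip", skill  # too little signal to act on
    # Successes define invariants, failures define targets; the model
    # sees both jointly and returns an action plus a candidate skill text.
    successes = [t for t in group if not t["failed"]]
    failures = [t for t in group if t["failed"]]
    action, candidate = llm_decide(skill, successes, failures)
    assert action in ACTIONS
    return action, candidate
```

The split into `successes` and `failures` is the code-level form of the joint-analysis principle: the prompt to the model presents both, so a patch for one failure is checked against what already works.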

The Synchronization Loop and Nightly Validation

After evolution, candidate updates undergo validation before entering the shared repository. Validation runs at night in available idle user environments:

  1. For skill $s$ and its candidate update $s'$, the system selects relevant tasks from the day’s interactions
  2. Both versions are executed under identical conditions — full toolchain, multi-step interactions, intermediate feedback
  3. The model compares outcomes: if $s'$ performs better → Accept; otherwise → Reject
  4. Accepted updates are merged into the shared repository and synchronized to all agents the next day

This strategy introduces monotonic deployment: since only improvements are accepted, the deployed skill pool never degrades over time. Users always interact with the best validated pool from the previous night.
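The accept-only-if-better rule reduces to a small A/B comparison. A sketch under assumed interfaces: `run_and_score` stands in for executing a task with a given skill and having the model judge the outcome.

```python
def nightly_validate(skill, candidate, tasks, run_and_score):
    """A/B-test a candidate update against the current skill.

    run_and_score(skill, task) -> float executes the task with the given
    skill under identical conditions and returns a quality score.
    """
    old = sum(run_and_score(skill, t) for t in tasks)
    new = sum(run_and_score(candidate, t) for t in tasks)
    return candidate if new > old else skill  # monotonic: never worse
```

The strict `>` is what makes deployment monotonic: on a tie or a regression, the existing skill survives, so the deployed pool can only move upward.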

The complete loop:

$$ \text{Interaction} \rightarrow \text{Evidence} \rightarrow \text{Evolution} \rightarrow \text{Validation} \rightarrow \text{Deployment} \tag{4} $$

From the user’s perspective — none of this is visible. Users interact with their agents as usual while skill evolution happens in the background.

The Algorithm Step by Step

The complete collective skill evolution algorithm:

Algorithm: Agentic Collective Skill Evolution
Input: Skill repository S, user sessions T
Output: Updated repository S'

1. Convert T into structured evidence E
2. Group E by referenced skills → {G(s)} and G(∅)
3. S' ← S
4. For each group G(s):
   a. Agentic evolver analyzes recurring success/failure patterns
   b. Select action from {refine, create, skip}
   c. Generate candidate skill update if evidence supports modification
   d. Apply conservative editing and validation
   e. Merge approved updates into S'
5. Analyze G(∅) for missing but reusable procedures
6. Add validated new skills into S'
7. Synchronize S' back to all agents
8. Return S'
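The steps above can be sketched end to end as a toy loop. The interfaces (`group_fn`, `evolver`, `validate`, skills as a name-to-text dict) are illustrative stand-ins, and step 5 (mining $G(\emptyset)$) is omitted for brevity.

```python
def collective_evolution(S, T, group_fn, evolver, validate):
    """One round of Eq. (1): S' = Phi(S, T)."""
    groups, no_skill = group_fn(T)  # steps 1-2: structure and group
    S_new = dict(S)                 # step 3: start from the current pool
    for name, G in groups.items():  # step 4: per-skill evolution
        action, candidate = evolver(S[name], G)
        if action == "refine":
            # Only a validated winner replaces the current version.
            S_new[name] = validate(S[name], candidate)
        elif action == "create":
            new_name, body = candidate  # evolver proposes a new skill
            S_new.setdefault(new_name, body)
        # "skip" leaves S_new[name] untouched
    # Steps 5-6 (new skills from G(empty)) omitted in this sketch.
    return S_new                    # steps 7-8: sync back and return
```

The sketch makes the division of labor explicit: the evolver only proposes, and validation alone decides what enters the shared pool.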

Three key properties follow from this design:

  1. Collective evolution — sessions aggregated across users; knowledge from one interaction propagates to the shared ecosystem
  2. Full automation — the entire pipeline (recording → evolution → synchronization) runs without manual curation or user intervention
  3. Agentic adaptability — updates produced through open-ended reasoning, not predefined rules

Experiments on WildClawBench

Experimental Setup

The authors evaluate SkillClaw on WildClawBench — a benchmark of 60 complex tasks across 6 domains (productivity, code, social, retrieval, creative, safety). Key benchmark properties:

  • Full execution environment (Linux container with tools)
  • Multimodal inputs (text, code, image, video)
  • 15-50 steps per task
  • Critical errors → zero score

Setup: 6 days (6 rounds), 8 concurrent users, day/night cycle, Qwen3-Max as backbone model.

Main Results

| Category | Day 1 (baseline) | Day 6 (post-evolution) | Abs. gain | Rel. gain |
|---|---|---|---|---|
| Social Interaction | 54.01% | 60.34% | +6.33 | +11.72% |
| Search & Retrieval | 22.73% | 34.55% | +11.82 | +52.00% |
| Creative Synthesis | 11.57% | 21.80% | +10.23 | +88.41% |
| Safety & Alignment | 24.00% | 32.00% | +8.00 | +33.33% |

Creative Synthesis shows the largest relative jump (+88.41%) — but note: the main bottleneck wasn’t content generation but environment setup (file validation, directory configuration, multimedia pipelines).

Controlled validation on 3 isolated tasks confirms the mechanism:

| Task | Baseline | Post-evolution | Gain |
|---|---|---|---|
| basic extraction | 21.7% | 69.6% | +47.8% |
| deadline parsing | 41.1% | 48.0% | +6.9% |
| save report | 28.3% | 100.0% | +71.7% |
Evolution is most effective for procedural gaps (save report: missing save procedure → full correction). Weaker for tasks requiring deeper reasoning (deadline parsing: +6.9%).

Per-Category Evolution Analysis

Each category exhibits a distinct evolution pattern:

Social Interaction — sharp jump on Day 2, then stabilization. Main improvement: the “cross-dept Slack summarization” skill rewritten from a descriptive instruction into an explicit procedural workflow. One skill update was enough.

Search & Retrieval — gradual improvement (Day 2: +7.27, Day 4: +4.55). Layered evolution: first file and path validation (foundation layer), then constraint-aware retrieval planning (upper layer). Higher-level reasoning becomes effective only after lower-level reliability is ensured.

Creative Synthesis — large jump on Day 2, then plateau. The accepted skill: validate-tmp-workspace-inputs — validating the /tmp_workspace directory before creative tasks. Later, more complex multimedia pipelines failed validation within the 6-day window.

Safety & Alignment — improvement only from Day 5 onward. Skills focused on reliability under real-world conditions: Git push fallback without credentials, correct directory cloning procedures, safe execution in non-interactive environments.

Case Studies: How Evolution Improves Execution

Slack message analysis. Original agent: retrieves all messages, processes them uniformly, trial-and-error on tool errors (e.g., wrong API port 9100 instead of 9110). After evolution: skill encodes the correct port, introduces preview scanning → selective full content retrieval → task extraction. Three improvements: task decomposition, proactive error correction, selective retrieval.

ICCV 2025 paper affiliation analysis. Original agent: heuristic university name matching — counts appearances without distinguishing first affiliation from subsequent ones. After evolution: strict “first affiliation” definition from the official PDF first page, alignment with OpenAccess records, targeted re-checks on ambiguous cases.

SAM3 inference in incomplete environments. Original agent assumes files and conditions (e.g., CUDA) are available — fails when a path is missing. After evolution: environment precheck, treating missing output directory as non-blocking, searching for nearby assets, adapting to system constraints (monkey-patching CUDA → CPU).

Why It Works — A Deeper Analysis

SkillClaw works for three fundamental reasons:

1. Natural experiment as evidence base. Multiple users invoking the same skill in different contexts creates a quasi-controlled experiment. The skill is the controlled factor; contexts, tasks, and users are variables. This makes it possible to separate generalizable improvements from one-off workarounds — something statistically impossible from single-user data.

2. Monotonic deployment policy. Nightly validation with an “accept only if better” rule ensures the deployed skill pool never degrades. This is crucial for user trust — the system can improve but never get worse. Most candidate skills are rejected (out of 6 Social Interaction candidates, only 1 was accepted), demonstrating the system’s conservatism.

3. Layered evolution mirrors real dependencies. Search & Retrieval first fixes file validation (foundation layer), then moves to constraint-aware planning (upper layer). This isn’t artificial — higher-level reasoning genuinely doesn’t work when lower-level input handling is unreliable. The system discovers this hierarchy automatically.

But why does deadline parsing improve by only 6.9%? Because skill evolution is inherently procedural — it optimizes step sequences, tool arguments, validations. Tasks requiring deeper semantic reasoning (date parsing, context interpretation) can’t easily be codified in a procedural skill. This is a fundamental limitation of skill-based approaches.

Limitations

  • Test scale — 8 users, 6 days, 1 benchmark, 1 backbone model (Qwen3-Max). Behavior at production scale (hundreds of users, months) remains unverified
  • Nightly validation cost — double execution for each candidate increases token consumption
  • No convergence analysis — it’s unclear whether the evolution loop converges to an optimal skill set or enters cycles
  • Privacy — user trajectories aggregated centrally, raising concerns about sensitive data in enterprise settings
  • Procedural limitation — evolution is weakest for tasks requiring reasoning rather than procedures
  • Status: work in progress — results missing for 2 out of 6 benchmark categories (Productivity Flow, Code Intelligence)

Takeaways

What to remember:

  • SkillClaw turns static skill libraries into dynamically evolving ecosystems driven by multi-user interactions
  • Key mechanism: grouping sessions by skills creates natural ablation, enabling the separation of generalizable improvements from one-off workarounds
  • Agentic evolver (LLM agent) instead of rigid rules — flexible, context-aware skill updates
  • Nightly validation ensures monotonicity — the system never deploys a degraded skill
  • Results on WildClawBench: +11.72% to +88.41% relative improvement across 4 categories, with 8 users over 6 days

Looking ahead: SkillClaw is a conceptually elegant proposal that shifts the paradigm from “an agent learns by itself” to “agents learn collectively.” The biggest open question is scaling — will the mechanism maintain its effectiveness with hundreds of users and thousands of skills? And will nightly validation become a bottleneck? We’ll have to wait for a more complete version of the work for those answers.
