Think of a doctor diagnosing a patient. You could evaluate the doctor solely by whether the patient recovered. That matters – but it tells you nothing about whether the right tests were ordered, the right lab results were read, or the doctor simply got lucky with a broad-spectrum antibiotic. If you want to improve the diagnostic process, you need to instrument the intermediate steps.

Coding-agent benchmarks have the same problem. SWE-bench, SWE-bench Verified, SWE-bench Multilingual – they all give you a single bit per issue: pass or fail. Did the patch make the tests green? Useful, but it hides an enormous amount of signal. When an agent fails, why did it fail? Wrong reasoning? Wrong edit? Or did it simply never look at the right code?

The paper “SWE-Explore: Benchmarking How Coding Agents Explore Repositories” (arXiv 2606.07297, June 2026) by Shaoqiu Zhang, Yuhang Wang, Jialiang Liang et al. from Shanghai Jiao Tong University makes a compelling case that the bottleneck is often upstream of patch generation. SWE-Explore isolates the exploration phase, scores it with fine-grained metrics, and shows that a single number – context efficiency – predicts downstream resolve rate with a Pearson correlation of 0.950. That is a remarkably clean signal from a task that most benchmarks treat as invisible.


The Problem: Why Pass/Fail Hides the Real Bottleneck

A typical SWE-bench run involves three phases: (1) explore the repository to understand the codebase and the bug, (2) localize the relevant code regions, and (3) generate a patch. In current benchmarks these phases are completely entangled – the agent reads files, edits files, runs tests, reads more files – and the only feedback is the final patch verdict.

This makes it impossible to answer basic diagnostic questions:

  • Coverage: Did the agent find all the files and lines it needed?
  • Precision: How much of what it read was actually relevant?
  • Ranking: Did it surface the most important evidence first, or burn its context window on boilerplate?
  • Efficiency: What fraction of the agent’s reading was load-bearing?

SWE-Explore answers all of these by cleanly separating exploration from patching and introducing a metric suite designed for ranked code-region retrieval.


What Is SWE-Explore?

SWE-Explore is a benchmark of 848 issues across 10 programming languages and 203 open-source repositories, drawn from SWE-bench Verified, SWE-bench-Pro, and SWE-bench Multilingual. The dataset is dominated by Python (64.5%, reflecting upstream SWE-bench distributions) but includes Go (9.9%), JavaScript (6.0%), Rust (3.7%), Java (3.5%), PHP (3.3%), TypeScript (3.2%), Ruby (2.6%), C (2.5%), and C++ (0.8%).

The key design choice: every instance comes with trajectory-derived ground truth – not just the patch files, but the code regions that multiple successful agents independently read while solving the same issue.

Task Formulation

The exploration task is defined as a function:

$$f : (q, \mathcal{R}) \rightarrow P = (r_1, r_2, \ldots, r_K) \tag{1}$$

where:

  • $q$ – the issue description (natural language)
  • $\mathcal{R}$ – the full repository at the relevant commit
  • $P$ – a ranked list of $K$ predicted code regions
  • $r_i = (p_i, s_i, e_i)$ – a triple of file path, start line, and end line (1-indexed, closed interval)

In plain terms: the explorer receives exactly what a human developer would see – a bug report and a codebase – and must return the $K$ most relevant code regions, ranked by importance. The default is $K = 5$, which aligns with the average of 4.7 ground-truth regions per instance.
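To make the interface concrete, here is a minimal sketch of the region triple and the explorer signature from Equation 1. The names `Region`, `Explorer`, and `truncate_to_k` are illustrative assumptions, not identifiers from the paper or its tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Region:
    """One predicted code region r_i = (p_i, s_i, e_i)."""
    path: str   # file path relative to the repository root
    start: int  # first line, 1-indexed
    end: int    # last line, inclusive (closed interval)

    def __post_init__(self) -> None:
        if self.start < 1 or self.end < self.start:
            raise ValueError(f"invalid span {self.start}-{self.end} in {self.path}")

    def lines(self) -> set:
        """Expand the closed interval into (path, line_number) pairs."""
        return {(self.path, n) for n in range(self.start, self.end + 1)}


# Hypothetical explorer type: (issue text, repository root) -> ranked regions.
Explorer = Callable[[str, str], List[Region]]


def truncate_to_k(ranked: List[Region], k: int = 5) -> List[Region]:
    """Keep only the top-K regions, preserving rank order."""
    return ranked[:k]
```

Expanding regions into `(path, line)` pairs is a convenient representation for line-level scoring, since overlap between predictions and ground truth reduces to set operations.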

Notice that this is a retrieval task with a twist: the search space is not a document collection but a structured codebase, and the retrieval units are not documents but line-level spans within files. This makes classical information retrieval baselines (BM25, TF-IDF) applicable but, as we will see, woefully insufficient.

Ground Truth: Consensus From Successful Trajectories

Here is where SWE-Explore gets creative. Traditional code-localization benchmarks use either patch-file annotations (which files were modified?) or manual expert labels. Both have problems: patch annotations tell you what was changed but not what needed to be read to understand the change, and manual labels are expensive and subjective.

SWE-Explore constructs ground truth from the intersection of successful agent trajectories. The process works in four stages.

Stage 1 – Trajectory Collection. Five frontier LLMs (GPT-5.4, Gemini-3-Pro, Sonnet-4.6, GLM-5.1, Kimi-K2.6) independently attempt each issue. Only trajectories that pass the original benchmark’s executable test harness are kept. Each instance requires at least 2 successful trajectories.

Stage 2 – Read Extraction. From each successful trajectory $\tau$, the system extracts every observable file-reading action – editor views, command-line reads (cat, head, tail, sed -n), grep outputs with line numbers – and normalizes them into $(p, s, e)$ tuples.
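The normalization step can be sketched as interval merging: once every read is a $(p, s, e)$ tuple, overlapping or adjacent spans in the same file collapse into one. This is a minimal sketch under that assumption; the paper does not describe its exact merge rule, and `merge_reads` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, int, int]  # (path, start_line, end_line), 1-indexed, inclusive


def merge_reads(reads: Iterable[Read]) -> Dict[str, List[Tuple[int, int]]]:
    """Group reads by file and merge overlapping or adjacent line spans."""
    by_file = defaultdict(list)
    for path, start, end in reads:
        by_file[path].append((start, end))

    merged = {}
    for path, spans in by_file.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = out[-1]
            if start <= last_end + 1:  # overlaps or touches the previous span
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[path] = out
    return merged
```

For example, reads of lines 10-20, 15-30, and 31-40 in the same file merge into a single span 10-40, while a later read of lines 50-60 stays separate.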

Stage 3 – Core Intersection. The core ground truth is the file-wise, line-level intersection across all successful trajectories:

$$R_{\text{core}}^{\text{raw}} = \bigcap_{\tau \in T} R(\tau) \tag{2}$$

where:

  • $T$ – the set of successful trajectories for a given issue
  • $R(\tau)$ – the merged set of line regions read by trajectory $\tau$
  • $\bigcap$ – file-wise, line-level intersection (a line is in the core only if every successful trajectory read it)

In plain terms: if GPT-5.4, Gemini-3-Pro, and Sonnet-4.6 all read the same 20 lines of a utils.py function while solving the same bug, those 20 lines are almost certainly load-bearing evidence. This consensus-based approach is more robust than any single trajectory and avoids the cost of manual annotation.

Stage 4 – Refinement. Regions read by some but not all trajectories form the optional context $R_{\text{opt}}^{\text{raw}} = \left(\bigcup_{\tau \in T} R(\tau)\right) \setminus R_{\text{core}}^{\text{raw}}$. An LLM refinement step promotes optional regions that are repeatedly visited, adjacent to core evidence, or near the patch-modified lines. Every promoted region undergoes manual audit to verify relevance.
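The raw core and raw optional sets from Stages 3 and 4 are plain set algebra over lines. The sketch below computes both from a list of trajectories; it does not model the LLM refinement or the manual audit, and the function names are assumptions for illustration.

```python
from typing import List, Set, Tuple

Line = Tuple[str, int]                   # (path, line_number)
Trajectory = List[Tuple[str, int, int]]  # reads as (path, start, end), inclusive


def lines_read(trajectory: Trajectory) -> Set[Line]:
    """R(tau): every (path, line) touched by one trajectory's reads."""
    return {(p, n) for p, s, e in trajectory for n in range(s, e + 1)}


def core_and_optional(trajectories: List[Trajectory]) -> Tuple[Set[Line], Set[Line]]:
    """Return (raw core, raw optional) line sets: intersection, and union minus core."""
    if len(trajectories) < 2:
        raise ValueError("need at least two successful trajectories")
    per_traj = [lines_read(t) for t in trajectories]
    core = set.intersection(*per_traj)
    optional = set.union(*per_traj) - core
    return core, optional
```

With three trajectories that read lines 10-30, 20-40, and 15-29 of the same file, the core is lines 20-29: the only lines every trajectory saw. Everything else any trajectory read lands in the optional set.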

The result: ground truth that reflects what agents actually need to read, not just what they need to change. On average, each instance has 4.3 ground-truth files, 4.7 regions, and 1,578 ground-truth lines – substantially more than the 1.4 patch-modified files per issue. Reading requirements far exceed editing requirements.


The Metrics: How to Score an Explorer

SWE-Explore introduces a suite of metrics organized along three axes: coverage and accuracy, ranking quality under budget, and context efficiency.

Coverage and Accuracy

The most familiar metrics operate at the line level. Let $L(P)$ denote the set of all lines covered by the predicted regions and $Y$ the set of ground-truth lines:

$$\operatorname{Prec} = \frac{|L(P) \cap Y|}{|L(P)|} \tag{3}$$

$$\operatorname{Rec}_{\ell} = \frac{|L(P) \cap Y|}{|Y|} \tag{4}$$

$$F_1 = \frac{2 \cdot \operatorname{Prec} \cdot \operatorname{Rec}_{\ell}}{\operatorname{Prec} + \operatorname{Rec}_{\ell}} \tag{5}$$

where:

  • $L(P)$ – the union of all lines in the predicted regions
  • $Y$ – the union of all lines in the ground-truth regions
  • $\operatorname{Prec}$ – what fraction of predicted lines are actually relevant
  • $\operatorname{Rec}_{\ell}$ – what fraction of relevant lines were found

In plain terms: these are standard precision and recall, but applied at line granularity rather than file or document level. This distinction is crucial – an agent that opens the right file but reads the wrong function scores well on file-level metrics but poorly on line-level ones. As the experiments show, that distinction is exactly where modern agents struggle.
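Equations 3 to 5 translate directly into set operations on `(path, line)` pairs. The following is a minimal sketch; the helper names and the convention of returning 0.0 for empty inputs are assumptions, not taken from the paper.

```python
from typing import Iterable, Set, Tuple

Region = Tuple[str, int, int]  # (path, start_line, end_line), 1-indexed, inclusive


def to_lines(regions: Iterable[Region]) -> Set[Tuple[str, int]]:
    """Union of all lines covered by a list of regions."""
    return {(p, n) for p, s, e in regions for n in range(s, e + 1)}


def line_prf(predicted: Iterable[Region], ground_truth: Iterable[Region]):
    """Line-level precision, recall, and F1 (Equations 3-5)."""
    pred, gold = to_lines(predicted), to_lines(ground_truth)
    overlap = len(pred & gold)
    prec = overlap / len(pred) if pred else 0.0
    rec = overlap / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


# Right file, mostly wrong lines: the ground truth is lines 100-149 of parser.py,
# but the prediction covers lines 1-40 and 140-159 of the same file.
gold = [("parser.py", 100, 149)]
pred = [("parser.py", 1, 40), ("parser.py", 140, 159)]
prec, rec, f1 = line_prf(pred, gold)
print(f"Prec={prec:.3f} Rec={rec:.3f} F1={f1:.3f}")
```

In this example the prediction touches the correct file, so a file-level metric would score it as a hit, yet only 10 of its 60 lines are relevant and only 10 of the 50 ground-truth lines are recovered.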

Two coarser metrics complement the line-level view:

  • HitFile – the fraction of ground-truth files that have at least one predicted region overlapping them
  • HitRegion – the fraction of core ground-truth regions that have at least one predicted region intersecting them
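The two coarser metrics can be sketched as follows. This assumes that a predicted region "overlaps" a ground-truth file when it lies in that file, and that two regions "intersect" when they share a path and at least one line; the function names are illustrative.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (path, start_line, end_line), 1-indexed, inclusive


def spans_intersect(a: Region, b: Region) -> bool:
    """True when both regions are in the same file and share at least one line."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def hit_file(predicted: List[Region], ground_truth: List[Region]) -> float:
    """Fraction of ground-truth files containing at least one predicted region."""
    gold_files = {path for path, _, _ in ground_truth}
    pred_files = {path for path, _, _ in predicted}
    return len(gold_files & pred_files) / len(gold_files) if gold_files else 0.0


def hit_region(predicted: List[Region], core_regions: List[Region]) -> float:
    """Fraction of core regions intersected by at least one predicted region."""
    if not core_regions:
        return 0.0
    hits = sum(any(spans_intersect(p, g) for p in predicted) for g in core_regions)
    return hits / len(core_regions)
```

A prediction of `[("a.py", 1, 12), ("c.py", 1, 50)]` against core regions in `a.py` (lines 10-20 and 100-120) and `b.py` (lines 5-9) hits one of two ground-truth files and one of three core regions.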

Ranking Under Budget: nDCG

Because context windows are finite and expensive, we care not just about what the explorer found but when it found it. SWE-Explore adapts normalized discounted cumulative gain (nDCG) to a line-budget setting:

$$\operatorname{DCG}@B = \sum_{i \in P_{\leq B}} \frac{g_i}{\log_2(i + 2)} \tag{6}$$

where:

  • $P_{\leq B}$ – the longest prefix of the prediction list whose cumulative visible lines do not exceed budget $B$
  • $g_i$ – the gain of the $i$-th predicted region, equal to the number of newly covered core lines (lines not already covered by higher-ranked predictions)
  • $\log_2(i + 2)$ – the standard DCG position discount

In plain terms: a prediction that surfaces 50 new ground-truth lines in its first region gets more credit than one that surfaces the same 50 lines in its fourth region. The line budget $B$ acts as a proxy for context-window pressure: at $B = 100$, we ask “given only 100 lines of context, how much ground truth did you capture?”

The normalized version divides by the ideal DCG, computed greedily:

$$\operatorname{nDCG}@B = \frac{\operatorname{DCG}@B}{\text{IDCG}@B} \tag{7}$$
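Equations 6 and 7 can be sketched in a few functions. Several details here are assumptions rather than facts from the paper: positions are counted from 0 (so the first region has discount $\log_2 2 = 1$), "cumulative visible lines" is taken to mean unique lines, and the ideal ranking is built greedily from the core regions by marginal gain.

```python
import math
from typing import List, Set, Tuple

Region = Tuple[str, int, int]  # (path, start_line, end_line), 1-indexed, inclusive
Line = Tuple[str, int]


def region_lines(r: Region) -> Set[Line]:
    path, start, end = r
    return {(path, n) for n in range(start, end + 1)}


def budget_prefix(ranked: List[Region], budget: int) -> List[Region]:
    """Longest prefix whose cumulative unique lines stay within the budget."""
    seen: Set[Line] = set()
    prefix = []
    for r in ranked:
        grown = seen | region_lines(r)
        if len(grown) > budget:
            break
        seen = grown
        prefix.append(r)
    return prefix


def dcg_at_budget(ranked: List[Region], core_lines: Set[Line], budget: int) -> float:
    covered: Set[Line] = set()
    total = 0.0
    for i, r in enumerate(budget_prefix(ranked, budget)):  # i starts at 0 (assumption)
        gain = len((region_lines(r) & core_lines) - covered)  # newly covered core lines
        covered |= region_lines(r)
        total += gain / math.log2(i + 2)
    return total


def greedy_ideal(core_regions: List[Region], budget: int) -> List[Region]:
    """Repeatedly pick the core region with the largest marginal gain that still fits."""
    remaining = list(core_regions)
    covered: Set[Line] = set()
    order = []
    while remaining:
        fitting = [r for r in remaining if len(covered | region_lines(r)) <= budget]
        if not fitting:
            break
        best = max(fitting, key=lambda r: len(region_lines(r) - covered))
        order.append(best)
        covered |= region_lines(best)
        remaining.remove(best)
    return order


def ndcg_at_budget(ranked: List[Region], core_regions: List[Region], budget: int) -> float:
    core_lines = set().union(*(region_lines(r) for r in core_regions))
    ideal = dcg_at_budget(greedy_ideal(core_regions, budget), core_lines, budget)
    return dcg_at_budget(ranked, core_lines, budget) / ideal if ideal else 0.0
```

Because the budget is counted in lines rather than regions, an explorer that leads with one long irrelevant span pays twice: the span earns zero gain, and it pushes later useful regions either down the ranking or out of the budget entirely.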

Context Efficiency

Perhaps the most important single metric in the paper:

$$\operatorname{CtxEff} = \frac{|L(P) \cap (L(R_{\text{core}}) \cup L(R_{\text{opt}}))|}{|L(P)|} \tag{8}$$

where:

  • $L(R_{\text{core}}) \cup L(R_{\text{opt}})$ – all lines in both core and optional ground truth
  • Numerator – predicted lines that overlap with any ground truth (core or optional)
  • Denominator – total predicted lines

In plain terms: context efficiency measures what fraction of the explorer’s output is grounded evidence rather than noise. An explorer that returns 500 lines, 450 of which are ground truth, has a context efficiency of 0.90. One that returns 500 lines with only 50 relevant has 0.10.
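Equation 8 is a one-line computation once regions are expanded into line sets. The sketch below reproduces the two cases just described; the file name and the split between core and optional lines are made up for illustration.

```python
from typing import Iterable, Set, Tuple

Region = Tuple[str, int, int]  # (path, start_line, end_line), 1-indexed, inclusive


def to_lines(regions: Iterable[Region]) -> Set[Tuple[str, int]]:
    return {(p, n) for p, s, e in regions for n in range(s, e + 1)}


def context_efficiency(predicted, core, optional) -> float:
    """Fraction of predicted lines that fall in core or optional ground truth."""
    pred = to_lines(predicted)
    relevant = to_lines(core) | to_lines(optional)
    return len(pred & relevant) / len(pred) if pred else 0.0


predicted = [("engine.py", 1, 500)]  # the explorer returns 500 lines

# Case 1: 450 of the 500 lines are ground truth (300 core + 150 optional).
high = context_efficiency(predicted, [("engine.py", 1, 300)], [("engine.py", 301, 450)])

# Case 2: only 50 of the 500 lines are ground truth.
low = context_efficiency(predicted, [("engine.py", 1, 50)], [])

print(high, low)
```

Note that the numerator counts optional lines as well as core lines, so an explorer is not penalized for returning context that some, but not all, successful trajectories found useful.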

This metric has a Pearson correlation of 0.950 with downstream resolve rate – the strongest single predictor in the entire metric suite.


Numbers That Convince

The authors evaluate a wide spectrum of explorers: classical retrieval baselines (BM25, TF-IDF, RAG with Potion embeddings), general-purpose coding agents (OpenHands, Mini-SWE-Agent, AweAgent, Claude Code, Codex), and specialized localization methods (AutoCodeRover, LocAgent, OrcaLoca, CoSIL).

Agentic vs. Classical Retrieval

The first result is dramatic but unsurprising: agentic explorers obliterate classical retrieval. BM25 achieves a HitFile of 0.079 and a line-level recall of 0.021. TF-IDF does slightly better at 0.140 and 0.049. Meanwhile, even mainstream agents like OpenHands, Claude Code, and Codex cluster around HitFile 0.645-0.667 and recall 0.154-0.194.

The downstream resolve rates tell the same story even more starkly:

| Explorer | Resolve Rate (%) |
|---|---|
| Oracle | 59.7 |
| CoSIL | 59.3 |
| Codex | 50.3 |
| Mini-SWE-Agent | 50.0 |
| Claude Code | 48.0 |
| OpenHands | 47.7 |
| AutoCodeRover | 44.7 |
| TF-IDF | 26.0 |
| BM25 | 12.7 |
| Random | 4.7 |

BM25 resolves only 12.7% of issues, closer to the random baseline (4.7%) than to any agentic explorer. TF-IDF, at 26.0%, reaches less than 60% of the rate of the weakest agentic method (AutoCodeRover, 44.7%). The lesson: repository exploration is fundamentally an agentic task. Static retrieval methods, no matter how well-tuned, cannot navigate the semantic structure of a codebase the way an agent with tool use can.

The Line-Level Recall Bottleneck

Here is the most striking finding in the paper. Look at the full exploration quality table:

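| Explorer | HitReg | Prec | Rec_l | F1 | HitFile | nDCG@500 | FUH | CtxEff | NoiseReg |
|---|---|---|---|---|---|---|---|---|---|
| Oracle | 0.915 | 1.000 | 0.953 | 0.964 | 0.923 | 0.858 | 1.000 | 1.000 | 0.000 |
| Random | 0.003 | 0.002 | 0.004 | 0.002 | 0.004 | 0.004 | 0.006 | 0.002 | 0.997 |
| BM25 | 0.065 | 0.055 | 0.021 | 0.024 | 0.079 | 0.132 | 0.141 | 0.087 | 0.910 |
| TF-IDF | 0.121 | 0.117 | 0.049 | 0.054 | 0.140 | 0.223 | 0.240 | 0.190 | 0.821 |
| OpenHands | 0.514 | 0.489 | 0.179 | 0.209 | 0.645 | 0.867 | 0.895 | 0.737 | 0.245 |
| Mini-SWE-Agent | 0.505 | 0.530 | 0.151 | 0.190 | 0.640 | 0.885 | 0.907 | 0.754 | 0.253 |
| AweAgent | 0.534 | 0.577 | 0.140 | 0.182 | 0.682 | 0.954 | 0.975 | 0.829 | 0.191 |
| AutoCodeRover | 0.272 | 0.680 | 0.233 | 0.291 | 0.280 | 0.720 | 0.730 | 0.738 | 0.034 |
| LocAgent | 0.472 | 0.642 | 0.191 | 0.241 | 0.540 | 0.950 | 0.977 | 0.799 | 0.195 |
| CoSIL | 0.544 | 0.581 | 0.788 | 0.602 | 0.544 | 0.824 | 0.920 | 0.898 | 0.471 |
| Claude Code | 0.531 | 0.598 | 0.154 | 0.202 | 0.667 | 0.938 | 0.963 | 0.829 | 0.186 |
| Codex | 0.516 | 0.523 | 0.194 | 0.223 | 0.649 | 0.901 | 0.936 | 0.762 | 0.249 |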

Focus on the contrast between HitFile and Rec_l for the general-purpose agents. Claude Code hits 66.7% of ground-truth files but recalls only 15.4% of ground-truth lines. OpenHands: 64.5% files, 17.9% lines. Codex: 64.9% files, 19.4% lines.

The pattern holds for every general-purpose agent in the table: they are reasonably good at finding the right files (HitFile 0.640-0.682) but poor at finding the right lines within those files (Rec_l 0.140-0.194).

This is a profound insight. The bottleneck is not navigation (finding which file matters) but granular localization (identifying which function, which class, which block of code within that file is relevant). File-level metrics, which most prior work focuses on, dramatically overstate agent capability.

The one exception is CoSIL, which achieves a line-level recall of 0.788 – more than three times that of any other explorer and the closest to the Oracle’s 0.953. CoSIL uses iterative code-graph search, walking dependency edges and call chains to expand its context. There’s a catch: CoSIL’s NoiseReg is 0.471 (meaning nearly half its predicted regions overlap no ground truth), compared to Claude Code’s 0.186 or AutoCodeRover’s remarkably low 0.034. CoSIL emits broad, often whole-file regions, which lifts recall at the cost of extra noise.

And yet: CoSIL’s downstream resolve rate is 59.3%, virtually identical to the Oracle’s 59.7%. For the patching agent downstream, having all the evidence with some noise is far better than having clean but incomplete evidence.

Why Does the LLM Choice Not Fix This?

The authors run Mini-SWE-Agent with six different LLM backends to disentangle the effect of the model from the effect of the exploration framework:

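| LLM Backend | HitReg | Prec | Rec_l | F1 | HitFile | nDCG@500 | CtxEff |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 0.516 | 0.542 | 0.154 | 0.194 | 0.655 | 0.905 | 0.771 |
| GPT-5.4-mini | 0.531 | 0.509 | 0.185 | 0.215 | 0.649 | 0.924 | 0.754 |
| Kimi-K2.6 | 0.413 | 0.475 | 0.117 | 0.149 | 0.509 | 0.739 | 0.676 |
| Sonnet-4.5 | 0.428 | 0.519 | 0.118 | 0.154 | 0.535 | 0.779 | 0.715 |
| GLM-4.7 | 0.289 | 0.414 | 0.122 | 0.148 | 0.343 | 0.557 | 0.536 |
| Gemini-3-Pro | 0.268 | 0.420 | 0.052 | 0.079 | 0.369 | 0.605 | 0.540 |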

The top-tier models (GPT-5.4, GPT-5.4-mini) cluster together. The mid-tier models (Kimi-K2.6, Sonnet-4.5) form another cluster. The weaker models (GLM-4.7, Gemini-3-Pro) form a third. But across all clusters, the same pattern holds: HitFile is roughly 3-7x higher than Rec_l (from 2.8x for GLM-4.7 to 7.1x for Gemini-3-Pro). The LLM choice shifts the operating point up or down, but it does not change the fundamental bottleneck.

This tells us the line-level recall problem is not a model capability issue – it is a task design issue. Current exploration frameworks ask agents to read files and then move on. They do not incentivize the agent to carefully identify which specific spans within a file are relevant. Improving this likely requires changes to the exploration loop itself: better tool design, structured code-graph traversal (as CoSIL demonstrates), or explicit span-selection mechanisms.


Missing Context Hurts More Than Noise

The paper’s context degradation analysis delivers one of the cleanest experimental findings in recent agent benchmarks. The setup: expose only $\alpha\%$ of the core ground-truth regions from the explorer’s output, and optionally fill the freed budget with random non-core regions (to keep total context volume constant). Then run a fixed patching agent and measure resolve rate.

Two conditions are tested:

  • Missing-context: Expose only $\alpha\%$ of core regions (the rest is hidden)
  • Redundant-context: Expose $\alpha\%$ of core regions and fill the remaining budget with random noise
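The two conditions can be sketched as a single sampling function. This is an illustration only: it counts the budget in regions rather than lines, rounds the number of kept regions down, and samples uniformly with a fixed seed, none of which is confirmed by the paper. The name `degrade_context` is hypothetical.

```python
import math
import random
from typing import List, Tuple

Region = Tuple[str, int, int]  # (path, start_line, end_line), 1-indexed, inclusive


def degrade_context(core: List[Region], noise_pool: List[Region], alpha: float,
                    redundant: bool, seed: int = 0) -> List[Region]:
    """Expose a fraction alpha (0..1) of core regions; optionally pad with noise.

    redundant=False -> missing-context: the hidden core regions are simply dropped.
    redundant=True  -> redundant-context: each hidden core region is replaced by a
                       random non-core region so the region count stays constant.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    rng = random.Random(seed)
    n_keep = math.floor(alpha * len(core))
    kept = rng.sample(core, n_keep)
    if not redundant:
        return kept
    n_fill = min(len(core) - n_keep, len(noise_pool))
    exposed = kept + rng.sample(noise_pool, n_fill)
    rng.shuffle(exposed)  # avoid leaking which regions are real via their position
    return exposed
```

Running the same patching agent on both outputs for each value of alpha separates the cost of missing evidence from the cost of extra noise, which is the comparison the analysis depends on.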

The results reveal a threshold effect. Performance stays near the random baseline through $\alpha = 25\%$ and $\alpha = 50\%$, then jumps sharply between $\alpha = 50\%$ and $\alpha = 75\%$. Multiple pieces of core evidence must be present simultaneously – partial evidence is almost as useless as no evidence.

Crucially, once above the threshold, the missing-context and redundant-context curves converge. Modern patching agents (especially stronger ones like GPT-5.4) tolerate extra irrelevant code quite well when the essential evidence is present. This explains CoSIL’s success: its high recall ensures the critical evidence is almost always present, and its high noise is tolerable because the downstream patcher can filter it.

The implication for explorer design is clear: optimize for recall first, precision second. A noisy explorer that covers all the evidence will outperform a precise explorer that misses critical regions. Missing a load-bearing region causes the patch to fail; including an irrelevant region merely wastes a few context tokens. The cost structure is deeply asymmetric.


Metric Correlations: Context Efficiency Rules

The correlation analysis between exploration metrics and downstream resolve rate is worth examining in detail:

| Metric | Pearson $r$ | Spearman $\rho$ |
|---|---|---|
| Context Efficiency | +0.950 | |
| First Useful Hit | +0.928 | |
| Rec@100 | +0.926 | +0.845 |
| HitFile | +0.925 | |
| nDCG@500 | +0.921 | |
| F1 | +0.810 | |
| NoiseReg | -0.812 | -0.562 |

Context efficiency – the fraction of predicted lines that overlap ground truth – is the single best predictor. The intuition is surprisingly clean: it simultaneously captures whether the explorer found relevant evidence (numerator) and whether it avoided wasting context on noise (denominator). It is essentially line-level precision, computed against the union of core and optional ground truth (Equation 8).

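The negative correlation with NoiseReg confirms the converse: explorers that output many irrelevant regions tend to do worse downstream. But notice that the Spearman $\rho$ for NoiseReg is only -0.562 compared to a Pearson $r$ of -0.812. A linear correlation much stronger than the rank correlation suggests the trend is driven by the extremes: the classical baselines, with NoiseReg above 0.8, resolve very few issues, while the ordering among moderately noisy explorers is inconsistent. CoSIL has the highest NoiseReg of any agentic method (0.471) and the highest resolve rate, whereas AutoCodeRover has the lowest NoiseReg (0.034) and the weakest agentic resolve rate. A moderate amount of noise is tolerable; near-total noise is not.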
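To see how the two coefficients can diverge, the sketch below computes both on the ten explorers that appear in both the resolve-rate table and the exploration-quality table. This is an illustrative recomputation on the numbers quoted in this article, not the paper's analysis, so the values need not match the reported ones.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Order: Oracle, CoSIL, Codex, Mini-SWE-Agent, Claude Code, OpenHands,
# AutoCodeRover, TF-IDF, BM25, Random (values copied from the two tables).
noise_reg = [0.000, 0.471, 0.249, 0.253, 0.186, 0.245, 0.034, 0.821, 0.910, 0.997]
resolve = [59.7, 59.3, 50.3, 50.0, 48.0, 47.7, 44.7, 26.0, 12.7, 4.7]

r_noise = pearson(noise_reg, resolve)
rho_noise = spearman(noise_reg, resolve)
print(f"Pearson r = {r_noise:.3f}, Spearman rho = {rho_noise:.3f}")
```

On this subset the rank correlation is again weaker in magnitude than the linear one, mirroring the pattern in the table.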


Who Is This For?

For practitioners

If you build or maintain coding-agent pipelines, SWE-Explore gives you a diagnostic you never had before: where does exploration fail? You can now benchmark your agent’s file-finding and line-finding independently, identify whether the bottleneck is navigation or localization, and choose an exploration strategy accordingly. The key practical insight: if your agent finds the right files but misses critical lines, you need better in-file search tools or code-graph traversal – not a bigger model.

For researchers

SWE-Explore opens a clean research surface. The line-level recall gap (50-70% file hit rate vs. 14-19% line recall) is an invitation to rethink exploration architectures. Code-graph methods like CoSIL show that structured traversal can close this gap, but at the cost of noise. The open question: can we build explorers that achieve CoSIL’s recall with AutoCodeRover’s precision? Context efficiency as a single optimization target (Pearson $r = 0.950$) provides a clear north star.


Technical Details

For the advanced reader – two points worth highlighting.

The ground-truth construction is cleverer than it looks. The core intersection (Equation 2) only includes lines read by every successful trajectory. This is conservative by design – it avoids inflating ground truth with lines that happened to appear in one agent’s output by chance. The optional-context refinement step then selectively expands coverage using proximity heuristics and manual audit. The result is a two-tier ground truth (core + optional) that separates “definitely needed” from “probably helpful.”

nDCG adaptation to line budgets is non-trivial. Standard nDCG operates on ranked lists of documents. SWE-Explore’s version operates on ranked lists of code regions with variable length (each region spans a different number of lines). The budget $B$ constrains the total visible lines, not the number of regions. This means a single long region can consume most of the budget, which correctly penalizes explorers that return overly broad spans. The ideal DCG is computed greedily – at each step, the remaining ground-truth region with the largest marginal uncovered-line gain is selected, subject to the same budget.


Limitations

  • Selection bias. The benchmark only includes issues with at least two successful trajectories from the five-model pool (see Stage 1). Truly hard issues – ones that defeat all, or all but one, of the frontier models – are excluded. The benchmark may underestimate exploration difficulty on the hardest real-world problems.
  • Trajectory-derived ground truth is approximate. The intersection of successful trajectories captures what agents did read, not the minimal set of what they needed to read. Some valid solution paths may rely on different evidence than any observed trajectory.
  • Python dominance. At 64.5% Python, the benchmark inherits SWE-bench’s language skew. Results on underrepresented languages (C++ at 0.8%, C at 2.5%) should be interpreted cautiously.
  • Fixed $K = 5$. The default of returning 5 regions may disadvantage explorers that work better at higher or lower $K$ values.
  • Restricted-context protocol is synthetic. The downstream validation hides all code outside the explorer’s predictions. Real agents would have the option to read more code if the initial exploration was insufficient. This controlled setup is necessary for fair comparison but overstates the penalty of poor exploration in practice.

Summary

  1. Exploration as the true bottleneck. Context efficiency predicts downstream resolve rate with $r = 0.950$. If an agent reads the right code, the patcher gets close to the Oracle ceiling of 59.7%. If it does not, no amount of clever editing can compensate.
  2. File-level metrics as a mirage. Modern agents hit 50-70% of the ground-truth files but recall only 14-19% of the relevant lines. The challenge is not navigating the repository – it is identifying exactly which code spans matter.
  3. Recall over precision for downstream success. CoSIL’s strategy of broad, high-recall exploration ($\operatorname{Rec}_{\ell} = 0.788$, NoiseReg = 0.471) achieves 59.3% resolve rate – essentially matching the Oracle at 59.7%. Missing critical evidence is far more costly than including irrelevant context.
  4. Classical retrieval as a dead end. BM25 and TF-IDF resolve 12.7% and 26.0% of issues, against 44.7% or more for every agentic explorer. Repository exploration requires agentic behavior: tool use, code-graph navigation, iterative hypothesis testing.
  5. The exploration loop, not the LLM, as the design lever. Swapping models shifts performance up or down but does not change the fundamental pattern. Improving exploration requires better tooling and framework design, not just bigger models.

For the first time, we can see exactly where a coding agent’s understanding breaks down – not “the patch was wrong” but “the agent never read lines 142-167 of parser.py.” That granularity is what makes progress possible.
