Code-Localization

Think of a doctor diagnosing a patient. You could evaluate the doctor solely by whether the patient recovered. That matters – but it tells you nothing about whether the right tests were ordered, the right lab results were read, or the doctor simply got lucky with a broad-spectrum antibiotic. If you want to improve the diagnostic process, you need to instrument the intermediate steps. Coding-agent benchmarks have the same problem. SWE-bench, SWE-bench Verified, SWE-bench Multilingual – they all give you a single bit per issue: pass or fail. Did the patch make the tests green? Useful, but it hides an enormous amount of signal. When an agent fails, why did it fail? Wrong reasoning? Wrong edit? Or did it simply never look at the right code? ...