🧠 Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Agentic Search is one of the most promising applications of modern AI. Imagine a virtual assistant that doesn’t just look up information for you but can autonomously search the web, navigate pages, find facts, and return well-structured answers with citations. That’s the idea behind tools like OpenAI’s Deep Research.

But how do we evaluate whether such an AI is actually doing a good job?

Enter Mind2Web 2 – a benchmark and evaluation framework introduced in the paper arXiv:2506.21506. It tackles the tricky problem of judging complex web search agents using a new concept: Agent-as-a-Judge.


🔍 What is Mind2Web 2?

Mind2Web 2 is a benchmark suite of 130 real-world search tasks across topics like:

  • product comparisons,
  • booking services,
  • financial insights,
  • and summarizing information from multiple sources.

Each task simulates how a human would actually perform a deep dive into the web to answer complex questions.

Examples include:

  • “Compare three cloud hosting providers by price and support”,
  • “Find the latest electric cars under $35,000 and summarize key specs”,
  • “Summarize expert reviews for the top 3 hiking backpacks in 2024.”

To support these tasks, Mind2Web 2 draws on a large dataset of realistic queries and human-labeled web interactions, representing over 1,000 hours of annotation effort. Conceptually, a single task might look something like the sketch below.
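
Purely for illustration, here is what a task record in such a benchmark could look like. The field names and values are assumptions made for this post, not the benchmark’s actual schema.

    # Hypothetical sketch of a Mind2Web 2-style task record.
    # All field names and values are illustrative assumptions, not the real schema.
    task = {
        "task_id": "cloud-hosting-comparison-001",
        "query": "Compare three cloud hosting providers by price and support",
        "answer_requirements": [            # what a correct answer must contain
            "names three distinct providers",
            "gives a current price for each, with a citation",
            "summarizes each provider's support options",
        ],
        "requires_live_browsing": True,     # the agent must visit real web pages
        "citations_required": True,         # factual claims need source URLs
    }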


🧑‍⚖️ Agent-as-a-Judge: A New Evaluation Paradigm

Evaluating such tasks manually is expensive. The paper introduces Agent-as-a-Judge, an AI-powered rubric-based evaluator:

  • It uses tree-structured rubrics for task-specific correctness.
  • It evaluates both factual accuracy and the quality of cited sources.
  • It mimics how a human would check if an AI did its homework.

This approach allows scalable, automated evaluation across hundreds of tasks.
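
To make the idea concrete, here is a minimal sketch of how a tree-structured rubric could be represented and scored. The class, its fields, and the aggregation rules are assumptions made for this post, not the paper’s actual implementation; in Mind2Web 2, leaf judgments come from an AI judge that checks the agent’s answer and its cited sources, rather than being filled in by hand as in this toy example.

    from dataclasses import dataclass, field
    from typing import List, Optional

    # Hypothetical sketch of a tree-structured rubric (not the paper's actual code).
    # Leaf nodes check one verifiable criterion; internal nodes aggregate their
    # children's scores, either strictly (all must pass) or with partial credit.

    @dataclass
    class RubricNode:
        description: str                             # what this criterion checks
        children: List["RubricNode"] = field(default_factory=list)
        aggregation: str = "all"                     # "all" (AND) or "mean" (partial credit)
        passed: Optional[bool] = None                # set on leaf nodes by the judge

        def score(self) -> float:
            """Return a score in [0, 1] for this subtree."""
            if not self.children:                    # leaf: judged directly
                return 1.0 if self.passed else 0.0
            child_scores = [child.score() for child in self.children]
            if self.aggregation == "all":            # strict: every sub-criterion must hold
                return 1.0 if all(s == 1.0 for s in child_scores) else 0.0
            return sum(child_scores) / len(child_scores)

    # Example rubric for "Compare three cloud hosting providers by price and support"
    rubric = RubricNode(
        description="Answer compares three providers on price and support",
        aggregation="mean",
        children=[
            RubricNode("Names three distinct cloud hosting providers", passed=True),
            RubricNode("States a current price for each provider, with a citation", passed=False),
            RubricNode("Summarizes support options for each provider", passed=True),
        ],
    )

    print(f"Rubric score: {rubric.score():.2f}")     # 0.67 under the "mean" rule

Under the strict "all" rule the same answer would score 0.0; allowing partial credit at higher levels of the tree is one way an evaluator like this can distinguish partially correct answers from completely wrong ones.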


📊 Experimental Findings

Nine agentic search systems were tested, including OpenAI’s Deep Research.

  • The best current systems reach roughly 50–70% of human performance.
  • They complete tasks in about half the time humans need.

Despite impressive results, there is still a clear gap in answer quality and judgment, especially when it comes to verifying claims and tracing them back to proper sources.


🧩 Why it matters

Mind2Web 2 raises the bar for evaluating next-gen web agents. With an increasing reliance on autonomous systems to do research, summarize news, or make purchase recommendations, trust and verifiability are key.

This benchmark helps:

  • Researchers improve agent strategies (e.g., memory, planning).
  • Developers tune multi-step browsing workflows.
  • Users get safer, more reliable answers from AI.