🧠 Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Agentic Search is one of the most promising applications of modern AI. Imagine a virtual assistant that doesn’t just look up information for you but can autonomously search the web, navigate pages, find facts, and return well-structured answers with citations. That’s the idea behind tools like OpenAI’s Deep Research.
But how do we tell whether such an agent is actually doing a good job?
Enter Mind2Web 2 – a benchmark and evaluation framework introduced in the paper arXiv:2506.21506, which tackles the evaluation of complex web search agents using a new concept: Agent-as-a-Judge.
🔍 What is Mind2Web 2?
Mind2Web 2 is a benchmark suite of 130 real-world search tasks across topics like:
- product comparisons,
- booking services,
- financial insights,
- and summarizing information from multiple sources.
Each task simulates how a human would actually perform a deep dive into the web to answer complex questions.
Examples include:
- “Compare three cloud hosting providers by price and support”,
- “Find the latest electric cars under $35,000 and summarize key specs”,
- “Summarize expert reviews for the top 3 hiking backpacks in 2024.”
To support these tasks, the authors built task-specific evaluation rubrics by hand, an effort they report took over 1,000 hours of human labor.
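To make the setup concrete, here is a hypothetical sketch of what a single task entry could look like as data. The field names are my own assumption for illustration; the post does not describe the released format.

```python
# Illustrative only: a guess at the shape of one benchmark task.
# Field names are hypothetical, not the actual dataset schema.
task = {
    "task_id": "cloud-hosting-comparison",          # hypothetical identifier
    "instruction": "Compare three cloud hosting providers by price and support.",
    "requires_live_browsing": True,                  # answers depend on current web content
    "rubric": "tree-structured criteria written by human annotators",
}
```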
🧑‍⚖️ Agent-as-a-Judge: A New Evaluation Paradigm
Manually grading long, open-ended answers to such tasks is expensive and slow. The paper introduces Agent-as-a-Judge, an AI judge that scores answers against task-specific rubrics:
- It uses tree-structured rubrics for task-specific correctness.
- It evaluates both factual accuracy and the quality of cited sources.
- It mimics how a human would check if an AI did its homework.
This approach allows scalable, automated evaluation across hundreds of tasks.
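The rubric idea is easiest to see as a small tree: leaf criteria are checked individually and parent nodes aggregate the results. Below is a minimal sketch of that structure in Python, assuming leaf checks are answered by an LLM judge (stubbed here with placeholder lambdas). This is my illustration of the concept, not the paper's actual implementation.

```python
# Minimal sketch of a tree-structured rubric evaluator (illustrative only).
# Leaf nodes carry a yes/no check that a judge agent would answer using the
# agent's answer and its cited pages; internal nodes aggregate their children.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class RubricNode:
    description: str
    children: List["RubricNode"] = field(default_factory=list)
    # Placeholder for an LLM judge call that inspects answer + citations.
    check: Optional[Callable[[str, List[str]], bool]] = None

    def evaluate(self, answer: str, citations: List[str]) -> bool:
        if self.check is not None:                     # leaf: ask the judge
            return self.check(answer, citations)
        # internal node: require every sub-criterion to pass (AND aggregation)
        return all(child.evaluate(answer, citations) for child in self.children)

# Hypothetical rubric for a price-comparison task
rubric = RubricNode(
    description="Compare three cloud hosting providers by price and support",
    children=[
        RubricNode("Names exactly three providers",
                   check=lambda ans, cites: True),             # stub judge call
        RubricNode("Each stated price is backed by a cited page",
                   check=lambda ans, cites: len(cites) >= 3),  # stub judge call
    ],
)

print(rubric.evaluate(answer="...", citations=["https://example.com/pricing"]))
```

Other aggregation rules (weighted or partial credit) fit the same structure; the key point is that correctness is decomposed into checkable, task-specific criteria rather than a single holistic score.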
📊 Experimental Findings
Nine agentic search systems were tested, including OpenAI’s Deep Research.
- The best-performing system, OpenAI's Deep Research, reaches roughly 50–70% of human performance,
- while spending about half the time a human needs.
Despite impressive results, there’s still a gap in judgment quality, especially around verifying claims and tracing them to proper sources.
🧩 Why it matters
Mind2Web 2 raises the bar for evaluating next-gen web agents. With an increasing reliance on autonomous systems to do research, summarize news, or make purchase recommendations, trust and verifiability are key.
This benchmark helps:
- Researchers improve agent strategies (e.g., memory, planning).
- Developers tune multi-step browsing workflows.
- Users get safer, more reliable answers from AI.
📎 Links
- 📄 Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge, arXiv:2506.21506 (https://arxiv.org/abs/2506.21506)