🧠 Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Agentic Search is one of the most promising applications of modern AI. Imagine a virtual assistant that doesn’t just look up information for you but can autonomously search the web, navigate pages, find facts, and return well-structured answers with citations. That’s the idea behind tools like OpenAI’s Deep Research.

But how do we evaluate whether such an AI is actually doing a good job?

Enter Mind2Web 2 – a benchmark and evaluation framework introduced in the paper arXiv:2506.21506. It tackles the tricky problem of judging complex web search agents using a new concept: Agent-as-a-Judge.


🔍 What is Mind2Web 2?

Mind2Web 2 is a benchmark suite of 130 real-world search tasks across topics like:

  • product comparisons,
  • booking services,
  • financial insights,
  • and summarizing information from multiple sources.

Each task simulates how a human would actually perform a deep dive into the web to answer complex questions.

Examples include:

  • “Compare three cloud hosting providers by price and support”,
  • “Find the latest electric cars under $35,000 and summarize key specs”,
  • “Summarize expert reviews for the top 3 hiking backpacks in 2024.”

To support these tasks, Mind2Web 2 draws on a large dataset of realistic queries and human-labeled web interactions, representing over 1,000 hours of annotation effort. Conceptually, a single task might look something like the sketch below.
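
Purely for illustration, here is what a task record in such a benchmark could look like. The field names and values are assumptions made for this post, not the benchmark’s actual schema.

    # Hypothetical sketch of a Mind2Web 2-style task record.
    # All field names and values are illustrative assumptions, not the real schema.
    task = {
        "task_id": "cloud-hosting-comparison-001",
        "query": "Compare three cloud hosting providers by price and support",
        "answer_requirements": [            # what a correct answer must contain
            "names three distinct providers",
            "gives a current price for each, with a citation",
            "summarizes each provider's support options",
        ],
        "requires_live_browsing": True,     # the agent must visit real web pages
        "citations_required": True,         # factual claims need source URLs
    }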


🧑‍⚖️ Agent-as-a-Judge: A New Evaluation Paradigm

Evaluating such tasks manually is expensive. The paper introduces Agent-as-a-Judge, an AI-powered rubric-based evaluator:

  • It uses tree-structured rubrics for task-specific correctness.
  • It evaluates both factual accuracy and the quality of cited sources.
  • It mimics how a human would check if an AI did its homework.

This approach allows scalable, automated evaluation across hundreds of tasks.
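
To make the idea concrete, here is a minimal sketch of how a tree-structured rubric could be represented and scored. The class, its fields, and the aggregation rules are assumptions made for this post, not the paper’s actual implementation; in Mind2Web 2, leaf judgments come from an AI judge that checks the agent’s answer and its cited sources, rather than being filled in by hand as in this toy example.

    from dataclasses import dataclass, field
    from typing import List, Optional

    # Hypothetical sketch of a tree-structured rubric (not the paper's actual code).
    # Leaf nodes check one verifiable criterion; internal nodes aggregate their
    # children's scores, either strictly (all must pass) or with partial credit.

    @dataclass
    class RubricNode:
        description: str                             # what this criterion checks
        children: List["RubricNode"] = field(default_factory=list)
        aggregation: str = "all"                     # "all" (AND) or "mean" (partial credit)
        passed: Optional[bool] = None                # set on leaf nodes by the judge

        def score(self) -> float:
            """Return a score in [0, 1] for this subtree."""
            if not self.children:                    # leaf: judged directly
                return 1.0 if self.passed else 0.0
            child_scores = [child.score() for child in self.children]
            if self.aggregation == "all":            # strict: every sub-criterion must hold
                return 1.0 if all(s == 1.0 for s in child_scores) else 0.0
            return sum(child_scores) / len(child_scores)

    # Example rubric for "Compare three cloud hosting providers by price and support"
    rubric = RubricNode(
        description="Answer compares three providers on price and support",
        aggregation="mean",
        children=[
            RubricNode("Names three distinct cloud hosting providers", passed=True),
            RubricNode("States a current price for each provider, with a citation", passed=False),
            RubricNode("Summarizes support options for each provider", passed=True),
        ],
    )

    print(f"Rubric score: {rubric.score():.2f}")     # 0.67 under the "mean" rule

Under the strict "all" rule the same answer would score 0.0; allowing partial credit at higher levels of the tree is one way an evaluator like this can distinguish partially correct answers from completely wrong ones.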


📊 Experimental Findings

Nine agentic search systems were tested, including OpenAI’s Deep Research.

  • The best current systems reach roughly 50–70% of human performance.
  • They complete tasks in about half the time humans need.

Despite impressive results, there is still a clear gap in answer quality and judgment, especially when it comes to verifying claims and tracing them back to proper sources.


🧩 Why it matters

Mind2Web 2 raises the bar for evaluating next-gen web agents. With an increasing reliance on autonomous systems to do research, summarize news, or make purchase recommendations, trust and verifiability are key.

This benchmark helps:

  • Researchers improve agent strategies (e.g., memory, planning).
  • Developers tune multi-step browsing workflows.
  • Users get safer, more reliable answers from AI.