What if AI could not just answer questions, but actively plan scientific research? Not merely generating text, but producing coherent, novel experiment plans that domain experts rate on par with, and sometimes above, human-written ones. Sounds like science fiction? Researchers from Meta AI and partners have just done it.

The Problem: How Do You Grade Scientific Creativity?

Training models for “closed” tasks (math, coding) is relatively straightforward — the answer is correct or not. But how do you evaluate a research plan?

  • There’s no single “correct” answer
  • Novelty is subjective
  • Feasibility depends on context
  • Experts disagree

Traditional approaches would require thousands of hours of expert time evaluating each generated plan. Impossible to scale.

The Solution: Let Science Grade Science

The authors had an elegant idea: extract evaluation criteria from existing scientific publications.

Step 1: Rubric Extraction

From each scientific paper, you can extract:

  1. Research goal — what did the authors want to achieve?
  2. Grading rubric — what criteria must a good plan for this goal satisfy?

For example, for a fraud detection paper:

  • Goal: “Detect credit card fraud in real-time”
  • Rubric:
    • Does the plan address class imbalance?
    • Does it consider latency requirements?
    • Does it propose metrics beyond accuracy?
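
To make the output of this step concrete, here is a minimal sketch of what an extracted task could look like as a data structure, using the fraud-detection example above. The class and field names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    question: str        # a criterion a good plan should satisfy
    weight: float = 1.0  # optional importance weight (illustrative)

@dataclass
class ExtractedTask:
    goal: str                                            # research goal distilled from the paper
    rubric: list[RubricItem] = field(default_factory=list)

# The fraud-detection example above, expressed in this form:
task = ExtractedTask(
    goal="Detect credit card fraud in real-time",
    rubric=[
        RubricItem("Does the plan address class imbalance?"),
        RubricItem("Does it consider latency requirements?"),
        RubricItem("Does it propose metrics beyond accuracy?"),
    ],
)
```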

Step 2: Self-Assessment with Frozen Grader

Here’s the key innovation. Instead of asking external experts to evaluate:

  1. Take the initial model version and freeze it (frozen grader)
  2. Train a new model version to generate plans
  3. Have the frozen model grade those plans against rubrics
  4. Use grades as reward signal in RL

Why does this work? The frozen model creates a stable reference point. Even if its grades aren’t perfect, they’re consistent — and that’s enough for learning.

$$ L = - E_{x} \left[ \sum_i r_i(G(x), \text{rubric}_i) \right] $$

where $G$ is the plan generator, and $r_i$ is the grade from the frozen grader for the $i$-th criterion.
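
In code, the reward computation implied by this objective might look like the sketch below. The `generate` and `grade` interfaces are placeholders standing in for the trained policy and the frozen grader, not the paper's actual API.

```python
def rubric_reward(generator, frozen_grader, goal, rubric):
    """Score one generated plan against its rubric using the frozen grader.

    `generator` is the policy being trained; `frozen_grader` is the frozen
    initial checkpoint. `generate` and `grade` are placeholder interfaces.
    """
    plan = generator.generate(goal)                  # G(x): propose a research plan
    scores = [frozen_grader.grade(plan, criterion)   # r_i(G(x), rubric_i)
              for criterion in rubric]
    return sum(scores)                               # reward used by RL; L = -E[reward]
```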

Results: Numbers That Surprise

Expert Preferences

In blind tests with domain experts:

Comparison                    Preference for AI
vs. baseline (Qwen3-30B)      70%
vs. plans from papers         52%

The model trained with this method was preferred 70% of the time over the baseline. Moreover, in roughly half of the comparisons experts preferred the AI-generated plan over the original plan from the publication!

Rubric Quality

84% of automatically extracted rubrics were rated by humans as “good or very good”. The system learned on its own to recognize what makes a research plan valuable.

Generalization

The model trained primarily on ML generalizes to:

  • Medical research — without access to clinical data
  • Physics — with different notation conventions
  • Biology — with completely different methodology

The 12-22% improvement over baseline holds across these domains.

Why This Matters

For Scientists

This isn’t a tool to “replace” scientists. It’s a planning assistant:

  • Generates alternative approaches to problems
  • Points out gaps in existing plans
  • Suggests methodologies from other fields
  • Accelerates brainstorming

For the AI Industry

The frozen-grader self-assessment technique is general:

  • Doesn’t require experts for labeling
  • Scales to any number of examples
  • Works in domains without ground truth

For Philosophy of Science

What does it mean that AI can write “better” research plans? Is scientific creativity more systematic than we thought?

Limitations

The authors are honest about limitations:

  1. No plan execution — the model generates plans but doesn’t execute them. A good plan isn’t the same as good results.

  2. Data bias — rubrics come from existing publications, which may favor conventional approaches.

  3. AI-based evaluation — much of the grading is done by models, and even the human experts reviewing outputs might miss subtle problems the model propagates.

Technical Details

For the advanced reader:

Base model: Qwen3-30B-A3B (Mixture of Experts)

Training data:

  • ~10k papers from arXiv (primarily ML)
  • Automatic extraction of goals and rubrics
  • Filtering by rubric quality
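
The filtering step could be sketched as follows, assuming an automatic judge that assigns each extracted rubric a quality score; both the scoring interface and the threshold are illustrative, not from the paper.

```python
def filter_by_rubric_quality(tasks, judge, min_score=0.7):
    """Drop tasks whose extracted rubric fails an automatic quality check.

    `judge.score_rubric` is assumed to return a quality score in [0, 1];
    the interface and the 0.7 threshold are illustrative assumptions.
    """
    return [t for t in tasks
            if judge.score_rubric(t.goal, t.rubric) >= min_score]
```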

RL Training:

  • PPO with KL penalty relative to base model
  • Multi-reward: separate reward for each rubric criterion
  • Frozen grader = checkpoint from before training
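
A rough sketch of how the per-criterion rewards and the KL penalty might be combined into a single scalar for PPO. The additive shaping and the coefficient value are assumptions; the paper's exact formulation may differ.

```python
def shaped_reward(rubric_scores, logp_policy, logp_base, kl_coef=0.1):
    """Combine rubric rewards with a KL penalty toward the frozen base model.

    rubric_scores: one grade per rubric criterion from the frozen grader.
    logp_policy / logp_base: summed token log-probabilities of the generated
    plan under the trained policy and under the base model.
    The additive combination and kl_coef value are illustrative assumptions.
    """
    kl_estimate = logp_policy - logp_base        # single-sample KL estimate
    return sum(rubric_scores) - kl_coef * kl_estimate
```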

Evaluation:

  • 225 hours of evaluation by domain experts
  • Frontier models (GPT-4, Claude) as additional evaluators
  • Out-of-domain tests on medicine and physics

Conclusion

This paper shows we can train AI for tasks requiring “creativity” — if we cleverly define what that creativity is. Instead of asking “is this creative?”, we ask “does it meet criteria that scientists themselves consider important?”.

This is a fundamental perspective shift: from subjective evaluation to operationalizing expertise.

Will AI replace scientists? No. But AI as a collaborator — that’s happening now.