What if AI could not just answer questions, but actively plan scientific research? Not just generating text, but creating coherent, novel experiment plans that experts rate as better than human-written ones. Sounds like science fiction? Researchers from Meta AI and partners just achieved this.
The Problem: How Do You Grade Scientific Creativity?
Training models for “closed” tasks (math, coding) is relatively straightforward — the answer is correct or not. But how do you evaluate a research plan?
- There’s no single “correct” answer
- Novelty is subjective
- Feasibility depends on context
- Experts disagree
Traditional approaches would require thousands of hours of expert time evaluating each generated plan. Impossible to scale.
The Solution: Let Science Grade Science
The authors had an elegant idea: extract evaluation criteria from existing scientific publications.
Step 1: Rubric Extraction
From each scientific paper, you can extract:
- Research goal — what did the authors want to achieve?
- Grading rubric — what criteria must a good plan for this goal satisfy?
For example, for a fraud detection paper:
- Goal: “Detect credit card fraud in real-time”
- Rubric:
  - Does the plan address class imbalance?
  - Does it consider latency requirements?
  - Does it propose metrics beyond accuracy?
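In code, one such extracted record might look roughly like this (a minimal sketch; the dataclass and field names are illustrative, not the paper's actual schema):

```python
# Hypothetical shape of one extracted training record.
# The class and field names are illustrative, not the paper's schema.
from dataclasses import dataclass

@dataclass
class ExtractedTask:
    goal: str          # research goal distilled from the paper
    rubric: list[str]  # criteria a good plan for this goal should satisfy

fraud_task = ExtractedTask(
    goal="Detect credit card fraud in real-time",
    rubric=[
        "Does the plan address class imbalance?",
        "Does it consider latency requirements?",
        "Does it propose metrics beyond accuracy?",
    ],
)
```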
Step 2: Self-Assessment with Frozen Grader
Here’s the key innovation. Instead of asking external experts to evaluate:
- Take the initial model version and freeze it (frozen grader)
- Train a new model version to generate plans
- Have the frozen model grade those plans against rubrics
- Use grades as reward signal in RL
Why does this work? The frozen model creates a stable reference point. Even if its grades aren’t perfect, they’re consistent — and that’s enough for learning.
$$ L = -\,\mathbb{E}_{x} \left[ \sum_i r_i\big(G(x), \text{rubric}_i\big) \right] $$
where $G$ is the plan generator, and $r_i$ is the grade from the frozen grader for the $i$-th criterion.
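A minimal sketch of how such a reward could be computed with a frozen grader (the `frozen_grader.generate` call, the prompt wording, and the 0-1 score format are assumptions, not the paper's actual interface):

```python
# Sketch of the frozen-grader reward: score a generated plan against each
# rubric criterion and sum the grades. All names and calls are placeholders.

def grade_criterion(frozen_grader, goal: str, plan: str, criterion: str) -> float:
    """Ask the frozen model whether `plan` satisfies one rubric criterion.

    Returns a score in [0, 1], parsed from the grader's output.
    """
    prompt = (
        f"Research goal: {goal}\n"
        f"Proposed plan: {plan}\n"
        f"Criterion: {criterion}\n"
        "Reply with a single score between 0 and 1."
    )
    return float(frozen_grader.generate(prompt))  # placeholder API

def plan_reward(frozen_grader, goal: str, plan: str, rubric: list[str]) -> float:
    """Sum of per-criterion grades; used as the RL reward for one sampled plan."""
    return sum(
        grade_criterion(frozen_grader, goal, plan, criterion)
        for criterion in rubric
    )
```

Because the grader stays frozen while the generator is trained, the reward scale does not drift during RL, which is what makes the signal usable even if individual grades are noisy.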
Results: Numbers That Surprise
Expert Preferences
In blind tests with domain experts:
| Comparison | AI plan preferred |
|---|---|
| vs. baseline (Qwen3-30B) | 70% |
| vs. plans from papers | 52% |
The model trained with this method was preferred 70% of the time over the baseline. Moreover, in roughly half of the comparisons (52%) experts preferred the AI plan over the original plan from the publication!
Rubric Quality
84% of automatically extracted rubrics were rated by humans as "good or very good". In other words, the automatic extraction pipeline largely captures what makes a research plan valuable, without any human annotation.
Generalization
The model trained primarily on ML generalizes to:
- Medical research — without access to clinical data
- Physics — with different notation conventions
- Biology — with completely different methodology
The 12-22% improvement over the baseline holds across all of these domains.
Why This Matters
For Scientists
This isn’t a tool to “replace” scientists. It’s a planning assistant:
- Generates alternative approaches to problems
- Points out gaps in existing plans
- Suggests methodologies from other fields
- Accelerates brainstorming
For the AI Industry
The frozen-grader self-assessment technique is general:
- Doesn’t require experts for labeling
- Scales to any number of examples
- Works in domains without ground truth
For Philosophy of Science
What does it mean that AI can write “better” research plans? Is scientific creativity more systematic than we thought?
Limitations
The authors are honest about limitations:
No plan execution — the model generates plans but doesn’t execute them. A good plan isn’t the same as good results.
Data bias — rubrics come from existing publications, which may favor conventional approaches.
AI-assisted evaluation — part of the grading relies on models, and even human experts might miss subtle problems that the model propagates.
Technical Details
For the advanced reader:
Base model: Qwen3-30B-A3B (Mixture of Experts)
Training data:
- ~10k papers from arXiv (primarily ML)
- Automatic extraction of goals and rubrics
- Filtering by rubric quality
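A toy illustration of these two data-preparation steps (the prompt wording, the `llm` callable, and the quality threshold are assumptions, not the paper's pipeline):

```python
# Toy sketch of goal/rubric extraction and quality filtering.
# Prompt wording, the `llm` callable, and the threshold are assumptions.

EXTRACT_PROMPT = (
    "Read the paper below. State (1) the research goal in one sentence and "
    "(2) a rubric of yes/no criteria that a good experiment plan for this "
    "goal should satisfy.\n\nPaper:\n{paper_text}"
)

def extract_task(llm, paper_text: str) -> str:
    """Ask an LLM to turn a paper into a goal + rubric record."""
    return llm(EXTRACT_PROMPT.format(paper_text=paper_text))

def keep_rubric(quality_score: float, threshold: float = 0.8) -> bool:
    """Drop records whose rubric-quality score falls below a threshold."""
    return quality_score >= threshold
```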
RL Training:
- PPO with KL penalty relative to base model
- Multi-reward: separate reward for each rubric criterion (combined as sketched below)
- Frozen grader = checkpoint from before training
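A rough sketch of how the per-criterion grades and the KL penalty might be combined into a single scalar reward for PPO (`beta` and the log-probability inputs are illustrative assumptions):

```python
# Sketch of reward shaping for PPO: rubric rewards minus a KL penalty that
# keeps the policy close to the frozen base model. `beta` is illustrative.

def shaped_reward(
    criterion_scores: list[float],  # grades from the frozen grader, one per rubric item
    logp_policy: list[float],       # per-token log-probs of the plan under the trained policy
    logp_base: list[float],         # per-token log-probs under the frozen base model
    beta: float = 0.05,
) -> float:
    rubric_reward = sum(criterion_scores)
    # Monte Carlo estimate of KL(policy || base) on the sampled tokens.
    kl_estimate = sum(p - b for p, b in zip(logp_policy, logp_base))
    return rubric_reward - beta * kl_estimate
```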
Evaluation:
- 225 hours of evaluation by domain experts
- Frontier models (GPT-4, Claude) as additional evaluators
- Out-of-domain tests on medicine and physics
Conclusion
This paper shows we can train AI for tasks requiring “creativity” — if we cleverly define what that creativity is. Instead of asking “is this creative?”, we ask “does it meet criteria that scientists themselves consider important?”.
This is a fundamental perspective shift: from subjective evaluation to operationalizing expertise.
Will AI replace scientists? No. But AI as a collaborator — that’s happening now.
📎 Links
- Based on the publication 📄 2512.23707