What if AI could not just answer questions, but actively plan scientific research? Not just generating text, but creating coherent, novel experiment plans that experts rate as better than human-written ones. Sounds like science fiction? Researchers from Meta AI and partners just achieved this.
The Problem: How Do You Grade Scientific Creativity?
Training models for “closed” tasks (math, coding) is relatively straightforward — the answer is correct or not. But how do you evaluate a research plan?
- There’s no single “correct” answer
- Novelty is subjective
- Feasibility depends on context
- Experts disagree
Traditional approaches would require thousands of hours of expert time evaluating each generated plan. Impossible to scale.
The Solution: Let Science Grade Science
The authors had an elegant idea: extract evaluation criteria from existing scientific publications.
Step 1: Rubric Extraction
From each scientific paper, you can extract:
- Research goal — what did the authors want to achieve?
- Grading rubric — what criteria must a good plan for this goal satisfy?
For example, for a fraud detection paper:
- Goal: “Detect credit card fraud in real-time”
- Rubric:
  - Does the plan address class imbalance?
  - Does it consider latency requirements?
  - Does it propose metrics beyond accuracy?
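In code, one such extracted record might look roughly like this (a minimal sketch; the dataclass and field names are illustrative, not the paper's actual schema):

```python
# Hypothetical shape of one extracted training record.
# The class and field names are illustrative, not the paper's schema.
from dataclasses import dataclass

@dataclass
class ExtractedTask:
    goal: str          # research goal distilled from the paper
    rubric: list[str]  # criteria a good plan for this goal should satisfy

fraud_task = ExtractedTask(
    goal="Detect credit card fraud in real-time",
    rubric=[
        "Does the plan address class imbalance?",
        "Does it consider latency requirements?",
        "Does it propose metrics beyond accuracy?",
    ],
)
```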
Step 2: Self-Assessment with Frozen Grader
Here’s the key innovation. Instead of asking external experts to evaluate:
- Take the initial model version and freeze it (frozen grader)
- Train a new model version to generate plans
- Have the frozen model grade those plans against rubrics
- Use grades as reward signal in RL
Why does this work? The frozen model creates a stable reference point. Even if its grades aren’t perfect, they’re consistent — and that’s enough for learning.
$$ L = -\,\mathbb{E}_{x} \left[ \sum_i r_i\big(G(x), \text{rubric}_i\big) \right] $$
where $G$ is the plan generator, and $r_i$ is the grade from the frozen grader for the $i$-th criterion.
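A minimal sketch of how such a reward could be computed with a frozen grader (the `frozen_grader.generate` call, the prompt wording, and the 0-1 score format are assumptions, not the paper's actual interface):

```python
# Sketch of the frozen-grader reward: score a generated plan against each
# rubric criterion and sum the grades. All names and calls are placeholders.

def grade_criterion(frozen_grader, goal: str, plan: str, criterion: str) -> float:
    """Ask the frozen model whether `plan` satisfies one rubric criterion.

    Returns a score in [0, 1], parsed from the grader's output.
    """
    prompt = (
        f"Research goal: {goal}\n"
        f"Proposed plan: {plan}\n"
        f"Criterion: {criterion}\n"
        "Reply with a single score between 0 and 1."
    )
    return float(frozen_grader.generate(prompt))  # placeholder API

def plan_reward(frozen_grader, goal: str, plan: str, rubric: list[str]) -> float:
    """Sum of per-criterion grades; used as the RL reward for one sampled plan."""
    return sum(
        grade_criterion(frozen_grader, goal, plan, criterion)
        for criterion in rubric
    )
```

Because the grader stays frozen while the generator is trained, the reward scale does not drift during RL, which is what makes the signal usable even if individual grades are noisy.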
Results: Numbers That Surprise
Expert Preferences
In blind tests with domain experts:
| Comparison | AI plan preferred |
|---|---|
| vs. baseline (Qwen3-30B) | 70% |
| vs. plans from papers | 52% |
The model trained with this method was preferred 70% of the time over the baseline. Moreover, in roughly half of the comparisons (52%) experts preferred the AI plan over the original plan from the publication!
Rubric Quality
84% of automatically extracted rubrics were rated by humans as "good or very good". In other words, the automatic extraction pipeline largely captures what makes a research plan valuable, without any human annotation.
Generalization
The model trained primarily on ML generalizes to:
- Medical research — without access to clinical data
- Physics — with different notation conventions
- Biology — with completely different methodology
The 12-22% improvement over the baseline holds across all of these domains.
Why This Matters
For Scientists
This isn’t a tool to “replace” scientists. It’s a planning assistant:
- Generates alternative approaches to problems
- Points out gaps in existing plans
- Suggests methodologies from other fields
- Accelerates brainstorming
For the AI Industry
The frozen-grader self-assessment technique is general:
- Doesn’t require experts for labeling
- Scales to any number of examples
- Works in domains without ground truth
For Philosophy of Science
What does it mean that AI can write “better” research plans? Is scientific creativity more systematic than we thought?
Limitations
The authors are honest about limitations:
No plan execution — the model generates plans but doesn’t execute them. A good plan isn’t the same as good results.
Data bias — rubrics come from existing publications, which may favor conventional approaches.
AI-assisted evaluation — part of the grading relies on models, and even human experts might miss subtle problems that the model propagates.
Technical Details
For the advanced reader:
Base model: Qwen3-30B-A3B (Mixture of Experts)
Training data:
- ~10k papers from arXiv (primarily ML)
- Automatic extraction of goals and rubrics
- Filtering by rubric quality
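A toy illustration of these two data-preparation steps (the prompt wording, the `llm` callable, and the quality threshold are assumptions, not the paper's pipeline):

```python
# Toy sketch of goal/rubric extraction and quality filtering.
# Prompt wording, the `llm` callable, and the threshold are assumptions.

EXTRACT_PROMPT = (
    "Read the paper below. State (1) the research goal in one sentence and "
    "(2) a rubric of yes/no criteria that a good experiment plan for this "
    "goal should satisfy.\n\nPaper:\n{paper_text}"
)

def extract_task(llm, paper_text: str) -> str:
    """Ask an LLM to turn a paper into a goal + rubric record."""
    return llm(EXTRACT_PROMPT.format(paper_text=paper_text))

def keep_rubric(quality_score: float, threshold: float = 0.8) -> bool:
    """Drop records whose rubric-quality score falls below a threshold."""
    return quality_score >= threshold
```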
RL Training:
- PPO with KL penalty relative to base model
- Multi-reward: separate reward for each rubric criterion (combined as sketched below)
- Frozen grader = checkpoint from before training
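A rough sketch of how the per-criterion grades and the KL penalty might be combined into a single scalar reward for PPO (`beta` and the log-probability inputs are illustrative assumptions):

```python
# Sketch of reward shaping for PPO: rubric rewards minus a KL penalty that
# keeps the policy close to the frozen base model. `beta` is illustrative.

def shaped_reward(
    criterion_scores: list[float],  # grades from the frozen grader, one per rubric item
    logp_policy: list[float],       # per-token log-probs of the plan under the trained policy
    logp_base: list[float],         # per-token log-probs under the frozen base model
    beta: float = 0.05,
) -> float:
    rubric_reward = sum(criterion_scores)
    # Monte Carlo estimate of KL(policy || base) on the sampled tokens.
    kl_estimate = sum(p - b for p, b in zip(logp_policy, logp_base))
    return rubric_reward - beta * kl_estimate
```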
Evaluation:
- 225 hours of evaluation by domain experts
- Frontier models (GPT-4, Claude) as additional evaluators
- Out-of-domain tests on medicine and physics
Conclusion
This paper shows we can train AI for tasks requiring “creativity” — if we cleverly define what that creativity is. Instead of asking “is this creative?”, we ask “does it meet criteria that scientists themselves consider important?”.
This is a fundamental perspective shift: from subjective evaluation to operationalizing expertise.
Will AI replace scientists? No. But AI as a collaborator — that’s happening now.
📎 Links
- Based on the publication 📄 2512.23707