Ask any language model to write a 10,000-word story. On page one, the hero has blue eyes. By page five — brown. In chapter three it’s Thursday; in chapter six, the same day is suddenly Saturday. A character who died on page seven is chatting away on page ten.
Sound familiar? The paper “Lost in Stories: Consistency Bugs in Long Story Generation by LLMs” systematically investigates this problem for the first time — and the results are sobering. Models average on the order of one consistency error per 10,000 words, and human experts catch only 17% of them.
The Problem: The Longer the Text, the More Lies
Language models can generate impressively fluent text. But narrative consistency (keeping facts, characters, world rules, and chronology in agreement within a single text) over long texts is an entirely different challenge from single-sentence quality.
Existing benchmarks evaluate models on grammar, logic, and general knowledge — but none systematically measured whether a model can maintain consistency within a single long text.
ConStory-Bench fills that gap.
Error Taxonomy: 5 Categories, 19 Subtypes
The authors identified five major categories of consistency errors:
1. Chronology and Plot Logic
Six subtypes — the most frequent category:
- Absolute time contradictions — “It was Wednesday” → a few paragraphs later the same day is Friday
- Temporal contradictions — a journey simultaneously takes 2 hours and 3 days
- Simultaneity — a character is in two places at once
- Effects without causes — a character reacts to something that hasn’t happened yet
- Broken causal logic — events follow from each other in contradictory ways
- Abandoned threads — a foreshadowed plot line is never resolved
2. Character Consistency
- Memory contradictions — a character forgets what they said
- Knowledge contradictions — a character knows something they shouldn’t
- Skill fluctuations — an expert suddenly can’t handle basics
- Forgotten abilities — magical powers appear and disappear without explanation
3. World and Setting
- Broken world rules — magic works differently than established
- Geographic contradictions — cities change location
- Social norm violations — characters behave contrary to established rules
4. Facts and Details
- Appearance changes — eye color, hair, height
- Name confusion — characters swap names
- Numerical contradictions — “five knights” becomes “three”
5. Narration and Style
- Perspective shifts — sudden jumps between 1st and 3rd person
- Tone inconsistency — a thriller suddenly becomes a comedy
- Stylistic jumps — formal prose turns into slang
ConStory-Bench: 2,000 Prompts, 4 Scenarios
Test Scenarios
| Scenario | Prompts | Description |
|---|---|---|
| Generation | 751 (37.5%) | Creating a narrative from scratch with minimal plot |
| Continuation | 432 (21.6%) | Extending an existing fragment |
| Expansion | 422 (21.1%) | Building a story from an outline |
| Infilling | 395 (19.8%) | Filling a gap between a beginning and an ending |
Target length: 8,000–10,000 words. Prompts were collected from seven corpora and deduplicated with MinHash, an algorithm for fast estimation of set similarity, used here to detect and remove near-duplicate texts.
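MinHash is a standard technique; a minimal character-shingle sketch is below. The parameter choices (64 hash functions, 5-character shingles) are illustrative, not taken from the paper.

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_size=5):
    """Build a MinHash signature from the text's character shingles."""
    shingles = {text[i:i + shingle_size]
                for i in range(len(text) - shingle_size + 1)}
    signature = []
    for seed in range(num_hashes):
        # The minimum over seeded hashes approximates one random permutation.
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ≈ Jaccard similarity of the sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate prompts get near-identical signatures, so a threshold on `estimated_jaccard` flags pairs to drop without comparing full texts.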
ConStory-Checker: The Automated Detective
Manual analysis of 10,000-word texts is impractical — human experts detect only 17.1% of errors (recall). The authors built a four-stage automated detection pipeline:
Pipeline
Stage 1: Extraction — Pull out fragments prone to contradictions, separately for each category
Stage 2: Pairwise Classification — Compare extracted fragments: “Consistent” or “Contradictory”
Stage 3: Evidence Chain — Build justifications with exact quotes and character positions in the text
Stage 4: Structured Output — JSON with quotes, locations, error types, and explanations
Evaluation model: o4-mini.
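The four stages could be wired together roughly as follows. This is a structural sketch only: `extract_fragments` and `classify_pair` are keyword stubs standing in for the per-category model prompts and the o4-mini call, and all names are my own, not the paper's.

```python
import json
from itertools import combinations

def extract_fragments(story, category):
    """Stage 1 (stub): collect sentences that may carry category-relevant facts.
    The real pipeline prompts a model per category; a keyword filter stands in."""
    keywords = {"facts_details": ["eyes", "hair", "knights"]}
    return [s.strip() for s in story.split(".")
            if any(k in s for k in keywords.get(category, []))]

def classify_pair(frag_a, frag_b):
    """Stage 2 (stub): label a fragment pair 'Consistent' or 'Contradictory'.
    Stands in for the model call; here we only compare color words."""
    colors = {"blue", "brown", "green"}
    a, b = colors & set(frag_a.split()), colors & set(frag_b.split())
    return "Contradictory" if a and b and a != b else "Consistent"

def check_story(story, category="facts_details"):
    """Stages 3-4: attach evidence (quotes + character positions), emit JSON."""
    findings = []
    for frag_a, frag_b in combinations(extract_fragments(story, category), 2):
        if classify_pair(frag_a, frag_b) == "Contradictory":
            findings.append({
                "category": category,
                "quotes": [frag_a, frag_b],
                "positions": [story.index(frag_a), story.index(frag_b)],
                "explanation": "fragments assert conflicting values",
            })
    return json.dumps(findings, indent=2)
```

Running it on a toy story with a blue-then-brown eye color yields one structured finding with both quotes and their offsets, mirroring the pipeline's JSON output format.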
Effectiveness
| Metric | ConStory-Checker | Experts |
|---|---|---|
| Precision | 88.4% | — |
| Recall | 55.0% | 17.1% |
| F1-score | 0.678 | 0.229 |
ConStory-Checker detects 3.2x more errors than manual expert analysis (recall 55.0% vs 17.1%).
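The F1 values follow directly from precision and recall; a quick sanity check:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Checker: P = 0.884, R = 0.550 -> F1 ≈ 0.678, matching the table above.
checker_f1 = f1_score(0.884, 0.550)
```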
Results: Ranking 20+ Models
Metrics
CED (Consistency Error Density) — the number of consistency errors per 10,000 words; lower is better. For model $m$ on prompt $i$, with $e_{m,i}$ detected errors in a response of $w_{m,i}$ words:
$$\text{CED} = \frac{e_{m,i}}{w_{m,i} / 10000}$$
GRR (Group Relative Rank) — a quality score that accounts for prompt difficulty: models are ranked within the group of responses to the same prompt, yielding a fairer comparison; lower is better.
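In code, CED is a one-liner, and GRR can be sketched as a within-group rank. The ranking and tie handling below are my reading of the definition, not the paper's exact scheme.

```python
def ced(num_errors, num_words):
    """Consistency Error Density: errors per 10,000 generated words."""
    return num_errors / (num_words / 10000)

def group_relative_rank(ced_by_model):
    """Rank each model within the group of responses to one prompt
    (1 = fewest errors); tied scores share a rank."""
    ordered = sorted(set(ced_by_model.values()))
    return {m: ordered.index(v) + 1 for m, v in ced_by_model.items()}
```

For example, 9 errors in an 8,000-word story gives a CED of 11.25, and two models with identical error densities share first place within their prompt group.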
Top Models
| Model | CED (↓ better) | GRR (↓ better) |
|---|---|---|
| GPT-5-Reasoning | 0.113 | 3.05 |
| Gemini-2.5-Pro | 0.305 | 7.79 |
| Claude-Sonnet-4.5 | 0.520 | 4.90 |
| GLM-4.6 | 0.528 | — |
| Qwen3-32B | 0.537 | — |
GPT-5-Reasoning dominates — nearly 3x fewer errors than Gemini and almost 5x fewer than Claude.
Worst Scenarios
Generation tasks (from scratch) consistently produce the most errors — the model has no “anchors” to base consistency on.
When Do Models Get It Wrong?
Errors Cluster in the Middle
Positional analysis reveals a clear pattern:
- Establishing facts concentrate in the 15–30% region of the text
- Contradictions accumulate in the 40–60% region
In other words: the model sets up rules at the beginning and loses them in the middle — exactly when the context window is filling up but the story is still developing. Even models with long contexts lose attention to earlier fragments as generation progresses.
The Model Knows It’s Wrong
The most fascinating finding: text fragments containing errors have significantly higher entropy, i.e., the model is measurably less certain about its next-token choices there:
- Qwen3-30B: +12.03% higher entropy in erroneous fragments
- Qwen3-4B: +19.24% higher entropy
“The model does not err unconsciously; rather, it makes incorrect decisions when facing greater uncertainty.”
Entropy can serve as an early warning signal — a trigger for consistency verification during generation.
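Token-level entropy is cheap to compute from the model's output distribution; a sketch of the early-warning idea follows. The 1.15 trigger factor is an invented illustration, not a threshold from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def flag_uncertain_tokens(token_entropies, baseline, factor=1.15):
    """Indices whose entropy exceeds baseline * factor — candidate spans
    for an in-flight consistency re-check (factor is illustrative)."""
    return [i for i, h in enumerate(token_entropies)
            if h > baseline * factor]
```

A uniform two-way choice yields entropy ln 2 ≈ 0.693 nats, while a fully confident prediction yields 0; spans rising well above the running baseline would trigger verification.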
Errors Come in Pairs
Co-occurrence analysis shows that facts and details errors are the central node — strongly correlated with errors in:
- Characterization (r=0.304)
- World building (r=0.255)
- Chronology (r=0.176)
When a model gets eye color wrong, it’s likely to get other details wrong too. Style errors, however, are independent (r≈0).
Distance Between Fact and Contradiction
How far apart in the text are a fact and its contradiction?
| Error Type | Average Distance |
|---|---|
| Geographic contradictions | 31.0% of text length |
| Temporal contradictions | 29.7% |
| Perspective shifts | 4.7% |
Geographic and temporal contradictions are “long-range” errors — the model forgets facts established many pages earlier. Perspective errors are local failures at the paragraph level.
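Measured on character offsets, these distances reduce to a simple normalization. The 10% long-range/local cutoff below is my illustration, not a boundary the paper defines.

```python
def normalized_distance(fact_pos, contradiction_pos, text_length):
    """Fact-to-contradiction distance as a fraction of total text length."""
    return abs(contradiction_pos - fact_pos) / text_length

def error_range(distance, threshold=0.10):
    """Rough long-range vs. local split (the 10% threshold is illustrative)."""
    return "long-range" if distance > threshold else "local"
```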
Summary
ConStory-Bench is the first systematic narrative consistency benchmark for LLMs. Key takeaways:
- No model is error-free — even GPT-5-Reasoning, the leader, still averages 0.113 consistency errors per 10,000 words
- Human experts are worse than automation — ConStory-Checker detects 3.2x more errors
- Errors accumulate in the middle of the text, when the model loses touch with initial establishments
- Entropy is a signal — the model “senses” uncertainty before making an error
- Errors cluster — getting one detail wrong increases the risk of more
Practical application: the ConStory-Checker pipeline can run in real time as a verification layer in long-text generation systems — from AI novels to documentation, reports, and screenplays.
Links
- Based on the publication arXiv:2603.05890