Ask any language model to write a 10,000-word story. On page one, the hero has blue eyes. By page five — brown. In chapter three it’s Thursday; in chapter six, the same day is suddenly Saturday. A character who died on page seven is chatting away on page ten.

Sound familiar? The paper “Lost in Stories: Consistency Bugs in Long Story Generation by LLMs” systematically investigates this problem for the first time, and the results are sobering. Models produce, on average, about one consistency error per 10,000 words, and human experts catch only about 17% of them.


The Problem: The Longer the Text, the More Lies

Language models can generate impressively fluent text. But narrative consistency in long texts is an entirely different challenge from single-sentence quality. Narrative consistency means maintaining agreement between facts, characters, world rules, and chronology within a single text: when a character has blue eyes on p. 1 and brown on p. 5, that is a consistency failure.

Existing benchmarks evaluate models on grammar, logic, and general knowledge, but none systematically measured whether a model can maintain consistency within a single long text.

ConStory-Bench fills that gap.


Error Taxonomy: 5 Categories, 19 Subtypes

The authors identified five major categories of consistency errors:

1. Chronology and Plot Logic

Six subtypes — the most frequent category:

  • Absolute time contradictions — “It was Wednesday” → a few paragraphs later the same day is Friday
  • Temporal contradictions — a journey simultaneously takes 2 hours and 3 days
  • Simultaneity — a character is in two places at once
  • Effects without causes — a character reacts to something that hasn’t happened yet
  • Broken causal logic — events follow from each other in contradictory ways
  • Abandoned threads — a foreshadowed plot line is never resolved

2. Character Consistency

  • Memory contradictions — a character forgets what they said
  • Knowledge contradictions — a character knows something they shouldn’t
  • Skill fluctuations — an expert suddenly can’t handle basics
  • Forgotten abilities — magical powers appear and disappear without explanation

3. World and Setting

  • Broken world rules — magic works differently than established
  • Geographic contradictions — cities change location
  • Social norm violations — characters behave contrary to established rules

4. Facts and Details

  • Appearance changes — eye color, hair, height
  • Name confusion — characters swap names
  • Numerical contradictions — “five knights” becomes “three”

5. Narration and Style

  • Perspective shifts — sudden jumps between 1st and 3rd person
  • Tone inconsistency — a thriller suddenly becomes a comedy
  • Stylistic jumps — formal prose turns into slang

ConStory-Bench: 2,000 Prompts, 4 Scenarios

Test Scenarios

| Scenario | Prompts | Description |
|---|---|---|
| Generation | 751 (37.5%) | Creating a narrative from scratch with minimal plot |
| Continuation | 432 (21.6%) | Extending an existing fragment |
| Expansion | 422 (21.1%) | Building a story from an outline |
| Infilling | 395 (19.8%) | Filling a gap between a beginning and an ending |

Target length: 8,000–10,000 words. Prompts were collected from seven corpora and deduplicated with MinHash, an algorithm for fast estimation of set similarity that is widely used to detect near-duplicates in large text collections.
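MinHash deduplication can be sketched in a few lines. The version below is a minimal illustration, not the authors' implementation; the shingle size (5 characters) and signature length (64 hashes) are assumptions:

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_size=5):
    """Summarize a text's set of character shingles as a short
    signature; similar texts produce similar signatures."""
    shingles = {text[i:i + shingle_size]
                for i in range(len(text) - shingle_size + 1)}
    # one salted hash family per signature slot; keep each slot's minimum
    return [min(int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8,
                                               salt=str(seed).encode()).digest(),
                               "big")
                for s in shingles)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity
    of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Prompt pairs whose estimated similarity exceeds some threshold (say, 0.8) would be treated as duplicates and collapsed.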


ConStory-Checker: The Automated Detective

Manual analysis of 10,000-word texts is impractical — human experts detect only 17.1% of errors (recall). The authors built a four-stage automated detection pipeline:

Pipeline

Stage 1: Extraction — Pull out fragments prone to contradictions, separately for each category

Stage 2: Pairwise Classification — Compare extracted fragments: “Consistent” or “Contradictory”

Stage 3: Evidence Chain — Build justifications with exact quotes and character positions in the text

Stage 4: Structured Output — JSON with quotes, locations, error types, and explanations
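The data flow through the four stages can be mocked end to end. In this toy sketch, a regex extractor and a hard-coded eye-colour rule stand in for the LLM-based extraction and pairwise judgment (the real pipeline uses o4-mini); only the structure and the JSON output shape mirror the description above:

```python
import json
import re

def extract_fragments(text, pattern=r"[^.]*\beyes\b[^.]*\."):
    """Stage 1 (toy): pull fragments prone to contradiction -- here,
    sentences mentioning 'eyes'. The real pipeline extracts fragments
    separately for each error category."""
    return [(m.start(), m.group().strip()) for m in re.finditer(pattern, text)]

def classify_pair(frag_a, frag_b):
    """Stage 2 (toy): label a fragment pair. A rule on eye colours
    stands in for the LLM judge."""
    palette = ("blue", "brown", "green")
    a = {c for c in palette if c in frag_a}
    b = {c for c in palette if c in frag_b}
    return "Contradictory" if a and b and a != b else "Consistent"

def check(text):
    """Stages 3-4: attach evidence chains (exact quotes plus character
    offsets) and emit structured JSON."""
    frags = extract_fragments(text)
    findings = []
    for i in range(len(frags)):
        for j in range(i + 1, len(frags)):
            (pos_a, quote_a), (pos_b, quote_b) = frags[i], frags[j]
            if classify_pair(quote_a, quote_b) == "Contradictory":
                findings.append({
                    "type": "facts_and_details/appearance",
                    "evidence": [{"quote": quote_a, "position": pos_a},
                                 {"quote": quote_b, "position": pos_b}],
                })
    return json.dumps(findings)
```

Run on a story where eye colour flips, `check` yields one finding carrying both quotes and their character offsets, which is exactly the shape a downstream verifier or UI needs.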

Evaluation model: o4-mini.

Effectiveness

| Metric | ConStory-Checker | Experts |
|---|---|---|
| Precision | 88.4% | — |
| Recall | 55.0% | 17.1% |
| F1-score | 0.678 | 0.229 |

ConStory-Checker detects 3.2x more errors than manual expert analysis (55.0% vs. 17.1% recall), at 88.4% precision.
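The reported F1 score is simply the harmonic mean of precision and recall, which you can verify directly:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# ConStory-Checker: precision 88.4%, recall 55.0%
print(round(f1(0.884, 0.550), 3))  # → 0.678
```

The experts' 0.229 follows the same formula, combining their 17.1% recall with a precision figure that is not broken out in this summary.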


Results: Ranking 20+ Models

Metrics

CED (Consistency Error Density): the number of consistency errors per 10,000 words; lower is better. For model $m$ on prompt $i$, with $e_{m,i}$ detected errors in a response of $w_{m,i}$ words:

$$\text{CED} = \frac{e_{m,i}}{w_{m,i} / 10000}$$

GRR (Group Relative Rank): a ranking that accounts for prompt difficulty. Models are ranked within the group of responses to the same prompt, which yields a fairer comparison.
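In code, the density metric is a one-liner; the example numbers below are illustrative, not from the paper:

```python
def ced(num_errors, num_words):
    """Consistency Error Density: errors per 10,000 generated words
    (lower is better)."""
    return num_errors / (num_words / 10_000)

# e.g. 3 detected errors in a 9,200-word story
print(round(ced(3, 9_200), 3))  # → 3.261
```

Read the other way around, a CED of 0.113 means roughly one error for every nine 10,000-word stories.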

Top Models

| Model | CED (↓ better) | GRR (↓ better) |
|---|---|---|
| GPT-5-Reasoning | 0.113 | 3.05 |
| Gemini-2.5-Pro | 0.305 | 7.79 |
| Claude-Sonnet-4.5 | 0.520 | 4.90 |
| GLM-4.6 | 0.528 | — |
| Qwen3-32B | 0.537 | — |

GPT-5-Reasoning dominates on error density: nearly 3x fewer errors than Gemini-2.5-Pro and almost 5x fewer than Claude-Sonnet-4.5.

Worst Scenarios

Generation tasks (from scratch) consistently produce the most errors — the model has no “anchors” to base consistency on.


When Do Models Get It Wrong?

Errors Cluster in the Middle

Positional analysis reveals a clear pattern:

  • Facts (establishments) are concentrated in the 15–30% region of the text
  • Contradictions accumulate in the 40–60% region

In other words: the model sets up rules at the beginning and loses them in the middle, exactly when the context is already crowded but the story is still developing. Even models with long context windows (the amount of text a model “sees” at once) can lose attention to earlier fragments as generation progresses.
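The positional analysis boils down to normalizing each finding's character offset by text length and bucketing. A minimal sketch (decile bins are an assumption; the paper may bin differently):

```python
from collections import Counter

def decile_histogram(offsets, text_length):
    """Bucket error positions into ten equal-width bins over the
    normalized text span [0, 1)."""
    return Counter(min(int(10 * off / text_length), 9) for off in offsets)
```

Pooled over many stories, establishment offsets would pile up in bins 1–3 and contradiction offsets in bins 4–6, reproducing the 15–30% / 40–60% pattern.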

The Model Knows It’s Wrong

The most fascinating finding: text fragments containing errors show significantly higher entropy, a measure of model uncertainty (high entropy means the model is unsure about the next token; low entropy means it is confident in its choice):

  • Qwen3-30B: +12.03% higher entropy in erroneous fragments
  • Qwen3-4B: +19.24% higher entropy

“The model does not err unconsciously; rather, it makes incorrect decisions when facing greater uncertainty.”

Entropy can serve as an early warning signal — a trigger for consistency verification during generation.
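Such a trigger could look like the sketch below. The `threshold` (flag a window whose mean entropy exceeds the running baseline by ~12%, the Qwen3-30B gap) is an assumption, not a recipe from the paper:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_consistency_check(window_probs, baseline_entropy, threshold=1.12):
    """Flag a generation window whose mean token entropy exceeds the
    running baseline by the threshold factor."""
    mean_h = sum(token_entropy(p) for p in window_probs) / len(window_probs)
    return mean_h > threshold * baseline_entropy
```

A flagged window would then be handed to a verifier like ConStory-Checker before generation continues.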

Errors Come in Pairs

Co-occurrence analysis shows that facts-and-details errors form the central node, correlating most strongly with errors in:

  • Characterization (r=0.304)
  • World building (r=0.255)
  • Chronology (r=0.176)

When a model gets eye color wrong, it’s likely to get other details wrong too. Style errors, however, are independent (r≈0).
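Assuming the r values are Pearson correlations over per-story error counts (the exact aggregation is a detail of the paper), the computation itself is simple:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series,
    e.g. per-story error counts for two categories."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# perfectly co-varying counts
print(pearson_r([1, 2, 3], [2, 4, 6]))  # → 1.0
```

Values near 0, as for style errors, mean knowing one count tells you nothing about the other.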


Distance Between Fact and Contradiction

How far apart in the text are a fact and its contradiction?

| Error Type | Average Distance |
|---|---|
| Geographic contradictions | 31.0% of text length |
| Temporal contradictions | 29.7% |
| Perspective shifts | 4.7% |

Geographic and temporal contradictions are “long-range” errors — the model forgets facts established many pages earlier. Perspective errors are local failures at the paragraph level.
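Normalizing by text length is what makes these distances comparable across stories of different sizes; the example numbers below are illustrative:

```python
def establishment_distance(fact_offset, contradiction_offset, text_length):
    """Distance between a fact and its contradiction, expressed as a
    fraction of total text length."""
    return abs(contradiction_offset - fact_offset) / text_length

# a fact at offset 5,000 contradicted at offset 20,500
# in a 50,000-character story
print(establishment_distance(5_000, 20_500, 50_000))  # → 0.31
```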


Summary

ConStory-Bench is the first systematic narrative consistency benchmark for LLMs. Key takeaways:

  1. No model is error-free: even GPT-5-Reasoning still averages 0.113 consistency errors per 10,000 words
  2. Human experts are worse than automation — ConStory-Checker detects 3.2x more errors
  3. Errors accumulate in the middle of the text, when the model loses touch with initial establishments
  4. Entropy is a signal — the model “senses” uncertainty before making an error
  5. Errors cluster — getting one detail wrong increases the risk of more

Practical application: the ConStory-Checker pipeline can run in real time as a verification layer in long-text generation systems — from AI novels to documentation, reports, and screenplays.