Ask any language model to write a 10,000-word story. On page one, the hero has blue eyes. By page five — brown. In chapter three it’s Thursday; in chapter six, the same day is suddenly Saturday. A character who died on page seven is chatting away on page ten.
Sound familiar? The paper “Lost in Stories: Consistency Bugs in Long Story Generation by LLMs” systematically investigates this problem for the first time — and the results are sobering. Models average on the order of one consistency error per 10,000 words, and human experts catch only 17% of them.
The Problem: The Longer the Text, the More Lies
Language models can generate impressively fluent text. But narrative consistency (keeping facts, characters, world rules, and chronology in agreement within a single text) over long texts is an entirely different challenge from single-sentence quality.
Existing benchmarks evaluate models on grammar, logic, and general knowledge — but none systematically measured whether a model can maintain consistency within a single long text.
ConStory-Bench fills that gap.
Error Taxonomy: 5 Categories, 19 Subtypes
The authors identified five major categories of consistency errors:
1. Chronology and Plot Logic
Six subtypes — the most frequent category:
- Absolute time contradictions — “It was Wednesday” → a few paragraphs later the same day is Friday
- Temporal contradictions — a journey simultaneously takes 2 hours and 3 days
- Simultaneity — a character is in two places at once
- Effects without causes — a character reacts to something that hasn’t happened yet
- Broken causal logic — events follow from each other in contradictory ways
- Abandoned threads — a foreshadowed plot line is never resolved
2. Character Consistency
- Memory contradictions — a character forgets what they said
- Knowledge contradictions — a character knows something they shouldn’t
- Skill fluctuations — an expert suddenly can’t handle basics
- Forgotten abilities — magical powers appear and disappear without explanation
3. World and Setting
- Broken world rules — magic works differently than established
- Geographic contradictions — cities change location
- Social norm violations — characters behave contrary to established rules
4. Facts and Details
- Appearance changes — eye color, hair, height
- Name confusion — characters swap names
- Numerical contradictions — “five knights” becomes “three”
5. Narration and Style
- Perspective shifts — sudden jumps between 1st and 3rd person
- Tone inconsistency — a thriller suddenly becomes a comedy
- Stylistic jumps — formal prose turns into slang
ConStory-Bench: 2,000 Prompts, 4 Scenarios
Test Scenarios
| Scenario | Prompts | Description |
|---|---|---|
| Generation | 751 (37.5%) | Creating a narrative from scratch with minimal plot |
| Continuation | 432 (21.6%) | Extending an existing fragment |
| Expansion | 422 (21.1%) | Building a story from an outline |
| Infilling | 395 (19.8%) | Filling a gap between a beginning and an ending |
Target length: 8,000–10,000 words. Prompts were collected from seven corpora and deduplicated with MinHash, an algorithm for fast estimation of set similarity, used here to detect and remove near-duplicate texts.
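MinHash is a standard technique; a minimal character-shingle sketch is below. The parameter choices (64 hash functions, 5-character shingles) are illustrative, not taken from the paper.

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_size=5):
    """Build a MinHash signature from the text's character shingles."""
    shingles = {text[i:i + shingle_size]
                for i in range(len(text) - shingle_size + 1)}
    signature = []
    for seed in range(num_hashes):
        # The minimum over seeded hashes approximates one random permutation.
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ≈ Jaccard similarity of the sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate prompts get near-identical signatures, so a threshold on `estimated_jaccard` flags pairs to drop without comparing full texts.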
ConStory-Checker: The Automated Detective
Manual analysis of 10,000-word texts is impractical — human experts detect only 17.1% of errors (recall). The authors built a four-stage automated detection pipeline:
Pipeline
Stage 1: Extraction — Pull out fragments prone to contradictions, separately for each category
Stage 2: Pairwise Classification — Compare extracted fragments: “Consistent” or “Contradictory”
Stage 3: Evidence Chain — Build justifications with exact quotes and character positions in the text
Stage 4: Structured Output — JSON with quotes, locations, error types, and explanations
Evaluation model: o4-mini.
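The four stages could be wired together roughly as follows. This is a structural sketch only: `extract_fragments` and `classify_pair` are keyword stubs standing in for the per-category model prompts and the o4-mini call, and all names are my own, not the paper's.

```python
import json
from itertools import combinations

def extract_fragments(story, category):
    """Stage 1 (stub): collect sentences that may carry category-relevant facts.
    The real pipeline prompts a model per category; a keyword filter stands in."""
    keywords = {"facts_details": ["eyes", "hair", "knights"]}
    return [s.strip() for s in story.split(".")
            if any(k in s for k in keywords.get(category, []))]

def classify_pair(frag_a, frag_b):
    """Stage 2 (stub): label a fragment pair 'Consistent' or 'Contradictory'.
    Stands in for the model call; here we only compare color words."""
    colors = {"blue", "brown", "green"}
    a, b = colors & set(frag_a.split()), colors & set(frag_b.split())
    return "Contradictory" if a and b and a != b else "Consistent"

def check_story(story, category="facts_details"):
    """Stages 3-4: attach evidence (quotes + character positions), emit JSON."""
    findings = []
    for frag_a, frag_b in combinations(extract_fragments(story, category), 2):
        if classify_pair(frag_a, frag_b) == "Contradictory":
            findings.append({
                "category": category,
                "quotes": [frag_a, frag_b],
                "positions": [story.index(frag_a), story.index(frag_b)],
                "explanation": "fragments assert conflicting values",
            })
    return json.dumps(findings, indent=2)
```

Running it on a toy story with a blue-then-brown eye color yields one structured finding with both quotes and their offsets, mirroring the pipeline's JSON output format.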
Effectiveness
| Metric | ConStory-Checker | Experts |
|---|---|---|
| Precision | 88.4% | — |
| Recall | 55.0% | 17.1% |
| F1-score | 0.678 | 0.229 |
ConStory-Checker detects 3.2x more errors than manual expert analysis (recall 55.0% vs 17.1%).
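The F1 values follow directly from precision and recall; a quick sanity check:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Checker: P = 0.884, R = 0.550 -> F1 ≈ 0.678, matching the table above.
checker_f1 = f1_score(0.884, 0.550)
```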
Results: Ranking 20+ Models
Metrics
CED (Consistency Error Density) — the number of consistency errors per 10,000 words; lower is better. For model $m$ on prompt $i$, with $e_{m,i}$ detected errors in a response of $w_{m,i}$ words:
$$\text{CED} = \frac{e_{m,i}}{w_{m,i} / 10000}$$
GRR (Group Relative Rank) — a quality score that accounts for prompt difficulty: models are ranked within the group of responses to the same prompt, yielding a fairer comparison; lower is better.
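In code, CED is a one-liner, and GRR can be sketched as a within-group rank. The ranking and tie handling below are my reading of the definition, not the paper's exact scheme.

```python
def ced(num_errors, num_words):
    """Consistency Error Density: errors per 10,000 generated words."""
    return num_errors / (num_words / 10000)

def group_relative_rank(ced_by_model):
    """Rank each model within the group of responses to one prompt
    (1 = fewest errors); tied scores share a rank."""
    ordered = sorted(set(ced_by_model.values()))
    return {m: ordered.index(v) + 1 for m, v in ced_by_model.items()}
```

For example, 9 errors in an 8,000-word story gives a CED of 11.25, and two models with identical error densities share first place within their prompt group.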
Top Models
| Model | CED (↓ better) | GRR (↓ better) |
|---|---|---|
| GPT-5-Reasoning | 0.113 | 3.05 |
| Gemini-2.5-Pro | 0.305 | 7.79 |
| Claude-Sonnet-4.5 | 0.520 | 4.90 |
| GLM-4.6 | 0.528 | — |
| Qwen3-32B | 0.537 | — |
GPT-5-Reasoning dominates — nearly 3x fewer errors than Gemini and almost 5x fewer than Claude.
Worst Scenarios
Generation tasks (from scratch) consistently produce the most errors — the model has no “anchors” to base consistency on.
When Do Models Get It Wrong?
Errors Cluster in the Middle
Positional analysis reveals a clear pattern:
- Establishing facts concentrate in the 15–30% region of the text
- Contradictions accumulate in the 40–60% region
In other words: the model sets up rules at the beginning and loses them in the middle — exactly when the context window is filling up but the story is still developing. Even models with long contexts lose attention to earlier fragments as generation progresses.
The Model Knows It’s Wrong
The most fascinating finding: text fragments containing errors have significantly higher entropy, i.e., the model is measurably less certain about its next-token choices there:
- Qwen3-30B: +12.03% higher entropy in erroneous fragments
- Qwen3-4B: +19.24% higher entropy
“The model does not err unconsciously; rather, it makes incorrect decisions when facing greater uncertainty.”
Entropy can serve as an early warning signal — a trigger for consistency verification during generation.
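Token-level entropy is cheap to compute from the model's output distribution; a sketch of the early-warning idea follows. The 1.15 trigger factor is an invented illustration, not a threshold from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def flag_uncertain_tokens(token_entropies, baseline, factor=1.15):
    """Indices whose entropy exceeds baseline * factor — candidate spans
    for an in-flight consistency re-check (factor is illustrative)."""
    return [i for i, h in enumerate(token_entropies)
            if h > baseline * factor]
```

A uniform two-way choice yields entropy ln 2 ≈ 0.693 nats, while a fully confident prediction yields 0; spans rising well above the running baseline would trigger verification.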
Errors Come in Pairs
Co-occurrence analysis shows that facts and details errors are the central node — strongly correlated with errors in:
- Characterization (r=0.304)
- World building (r=0.255)
- Chronology (r=0.176)
When a model gets eye color wrong, it’s likely to get other details wrong too. Style errors, however, are independent (r≈0).
Distance Between Fact and Contradiction
How far apart in the text are a fact and its contradiction?
| Error Type | Average Distance |
|---|---|
| Geographic contradictions | 31.0% of text length |
| Temporal contradictions | 29.7% |
| Perspective shifts | 4.7% |
Geographic and temporal contradictions are “long-range” errors — the model forgets facts established many pages earlier. Perspective errors are local failures at the paragraph level.
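Measured on character offsets, these distances reduce to a simple normalization. The 10% long-range/local cutoff below is my illustration, not a boundary the paper defines.

```python
def normalized_distance(fact_pos, contradiction_pos, text_length):
    """Fact-to-contradiction distance as a fraction of total text length."""
    return abs(contradiction_pos - fact_pos) / text_length

def error_range(distance, threshold=0.10):
    """Rough long-range vs. local split (the 10% threshold is illustrative)."""
    return "long-range" if distance > threshold else "local"
```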
Summary
ConStory-Bench is the first systematic narrative consistency benchmark for LLMs. Key takeaways:
- No model is error-free — even GPT-5-Reasoning, the leader, still averages 0.113 consistency errors per 10,000 words
- Human experts are worse than automation — ConStory-Checker detects 3.2x more errors
- Errors accumulate in the middle of the text, when the model loses touch with initial establishments
- Entropy is a signal — the model “senses” uncertainty before making an error
- Errors cluster — getting one detail wrong increases the risk of more
Practical application: the ConStory-Checker pipeline can run in real time as a verification layer in long-text generation systems — from AI novels to documentation, reports, and screenplays.
Links
- Based on the publication arXiv:2603.05890