Attention as a Compass – Teaching Reasoning Models to Explore Smarter
Large Language Models (LLMs) are no longer just text generators — they are becoming reasoners, capable of solving mathematical problems, logical puzzles, or planning tasks step by step. One of the key challenges is how to improve the quality of this reasoning. Traditional Reinforcement Learning (RL) rewards only the final outcome, but in complex reasoning it makes more sense to evaluate each intermediate step. This is called process-supervised RL (PSRL). ...