The paper introduces RLVMR, a framework for reinforcement learning (RL) that integrates verifiable meta‑reasoning rewards to strengthen long‑horizon performance. Agents generate internal explanatory signals and are explicitly evaluated against meta‑reasoning criteria, which improves robustness and planning over extended trajectories.
Contributions
- A formal definition of meta‑reasoning rewards: agents receive additional reward signals based on the verifiability of reasoning chains.
- A verification protocol: checkable reasoning traces used to assess the agent's justifications.
- Empirical validation on long‑horizon RL tasks showing improved performance over standard RL baselines.
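The idea of a checkable reasoning trace can be made concrete with a small sketch. The `ReasoningStep` structure and the evidence-citation rule below are illustrative assumptions, not the paper's actual protocol:

```python
# Hypothetical sketch of a checkable reasoning trace with a rule-based
# verifier; field names and the verification rule are assumptions.
from dataclasses import dataclass

@dataclass
class ReasoningStep:
    claim: str     # what the agent asserts at this step
    evidence: str  # observation or prior step the claim cites

def verify(trace):
    """A trace counts as verifiable only if every step makes a
    non-empty claim backed by non-empty cited evidence."""
    return all(step.claim.strip() and step.evidence.strip() for step in trace)

trace = [ReasoningStep("the key opens the chest", "picked up key at t=3")]
print(verify(trace))  # True under this toy rule
```

The key property is that verification is a deterministic, rule-based check on the trace itself, so the resulting reward signal is reproducible rather than a learned judgment.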
Method
Let the agent generate a reasoning chain $r = (r_1,\dots,r_T)$ alongside actions $a_t$. The total reward is: $$ R_{\text{total}} = \sum_t R_{\text{env}}(a_t) + \lambda\,R_{\text{meta}}(r), $$ where $R_{\text{meta}}(r)$ is high only if the reasoning can be verified according to the protocol, and $\lambda$ tunes the meta‑reasoning influence.
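The reward combination above can be sketched in a few lines. Here `verify_chain` is a stand-in for the paper's verification protocol (an assumption), and $R_{\text{meta}}$ is treated as binary for simplicity:

```python
# Minimal sketch of R_total = sum_t R_env(a_t) + lambda * R_meta(r).
# verify_chain is a hypothetical placeholder for the actual protocol.

def verify_chain(chain):
    """Toy check: every reasoning step is a non-empty string."""
    return all(isinstance(s, str) and s.strip() for s in chain)

def total_reward(env_rewards, chain, lam=0.5):
    # Binary meta-reward: 1.0 if the chain passes verification, else 0.0.
    r_meta = 1.0 if verify_chain(chain) else 0.0
    return sum(env_rewards) + lam * r_meta

print(total_reward([0.0, 0.0, 1.0], ["plan route", "execute step"], lam=0.5))
# 1.5: environment return of 1.0 plus 0.5 * verified meta-reward
```

Note that $\lambda$ controls how strongly the verified-reasoning signal shapes the policy relative to the raw environment return; an unverifiable chain simply forfeits the bonus rather than incurring a penalty in this toy version.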
Experiments
Tested on several long‑horizon environments; results indicate that RLVMR agents maintain more consistent performance and avoid semantic shortcuts better than plain RL baselines.
Conclusion
RLVMR offers a promising direction: combining environment rewards and verifiable reasoning feedback yields agents that act more robustly in settings requiring deep reasoning.
📚 Link
👉 Based on the publication 📄 arXiv:2507.22844