Reinforcement Learning

Attention as a Compass – Teaching Reasoning Models to Explore Smarter

Large Language Models (LLMs) are no longer just text generators — they are becoming reasoners, capable of solving mathematical problems, logical puzzles, or planning tasks step by step. One of the key challenges is how to improve the quality of this reasoning. Traditional Reinforcement Learning (RL) rewards only the final outcome, but in complex reasoning it makes more sense to evaluate each intermediate step. This is called process-supervised RL (PSRL). ...

Quantum Trading – AI and Quantum Computing in Investing

Imagine your computer not only analyzing financial charts but also learning to make investment decisions on its own – faster and smarter than a human. Now add a touch of quantum physics. Sounds like science fiction? Yet, recent research shows that combining reinforcement learning, quantum-inspired neural networks, and classical financial data can provide a real edge in trading. This is exactly the focus of a publication from National Taiwan Normal University and Wells Fargo. The researchers built a trading agent that uses quantum-enhanced neural networks to trade the USD/TWD (US Dollar/Taiwan Dollar) currency pair. ...

Reinforcement Learning in Pinterest Ads – DRL-PUT in action!

Can the effectiveness of an advertising system be improved by almost 10% simply by tuning the weights in the ranking function more intelligently? It turns out the answer is yes – and that’s exactly what the paper Deep Reinforcement Learning for Ranking Utility Tuning in the Ad Recommender System at Pinterest (arXiv:2509.05292) is about. Traditionally, ad ranking relies on a utility function – a linear combination of multiple model predictions, such as CTR (click-through rate), conversion probability, or other business metrics. The problem? The weights of these predictors were historically tuned manually by engineers. This approach: ...

Look Inside Seamless Flow's Hyper-Efficient Training

We are in the midst of an AI gold rush, where companies are investing billions to build increasingly intelligent models. The final, crucial step in this process is often Reinforcement Learning (RL), the “finishing school” where an AI agent learns to master complex tasks through trial and error. However, this training process at an industrial scale is plagued by two crippling problems: crippling inefficiency and maddening complexity. It’s like trying to run a state-of-the-art factory where half the machines are always idle and every product requires a complete retooling of the assembly line. ...

Dynamic Fine-Tuning (DFT): How a Single Line of Code is Revolutionizing AI Training

In an era where Large Language Models (LLMs) like GPT-4 or Llama seem to understand the world, a fundamental challenge remains: how to teach them effectively and efficiently? The standard method is Supervised Fine-Tuning (SFT), which involves “feeding” the model thousands of examples of correct responses. However, as the groundbreaking paper “On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification” (arXiv:2508.05629) points out, SFT has a hidden flaw that limits its true potential. ...

Optimizing Call Center Operations with Reinforcement Learning: PPO vs. Value Iteration

Can AI improve how call centers operate? The paper “Optimising Call Centre Operations using Reinforcement Learning: Value Iteration versus Proximal Policy Optimisation” by Kwong Ho Li and Wathsala Karunarathne shows that it can — and with strong results. The authors compare two reinforcement learning (RL) approaches to optimize call routing: the classical Value Iteration (VI) and the modern Proximal Policy Optimisation (PPO). What is Reinforcement Learning? Reinforcement Learning is an AI method where an agent takes actions in an environment and receives rewards based on how good those actions are. The goal is to maximize the cumulative reward — essentially, to learn the best decisions. ...

SOPHIA: Enhancing Slow‑Thinking in Large Vision‑Language Models

In recent years, Large Vision‑Language Models (LVLMs) have shown impressive abilities to understand and generate text about images—but they often struggle with long, multi‑step reasoning. The paper “SOPHIA: Semi‑Off‑Policy Reinforcement Learning for Slow‑Thinking in LVLMs” presents a new approach that significantly improves their capacity for slow‑thinking reasoning. What Is Slow‑Thinking? Slow‑thinking is a deliberate, step‑by‑step reasoning process where the model: Breaks down complex problems into smaller steps, Verifies intermediate conclusions, Provides transparency into each decision. This contrasts with fast, intuitive “snap” judgments and helps avoid hallucinations—invented details not supported by the image. ...

The Role of AI in Managing Satellite Constellations

Modern satellite mega-constellations—groups of hundreds or thousands of small satellites working together—are transforming how we connect the world. Yet, managing these networks presents unique challenges: constantly moving nodes, limited onboard computing power, and a need to minimize communication delays. The ConstellAI project, supported by the European Space Agency, explores how artificial intelligence (AI) can optimize two critical tasks: Data Routing: Choosing the best path through the network to send data quickly and reliably. Resource Allocation: Distributing limited resources (bandwidth, power, time slots) among satellites and ground stations. Data Routing with Reinforcement Learning Traditional routing algorithms, like finding the shortest path on a map, don’t account for traffic jams (long queues) at network nodes. ConstellAI uses a technique called reinforcement learning (RL). In RL, a software agent learns from experience: it tries different routes, observes delays, and gradually discovers which paths minimize overall transit time. ...