Can the effectiveness of an advertising system be improved by almost 10% simply by tuning the weights in the ranking function more intelligently?
It turns out the answer is yes – and that’s exactly what the paper Deep Reinforcement Learning for Ranking Utility Tuning in the Ad Recommender System at Pinterest (arXiv:2509.05292) is about.
Traditionally, ad ranking relies on a utility function – a linear combination of multiple model predictions, such as CTR (click-through rate), conversion probability, or other business metrics.
The problem? The weights of these predictors were historically tuned manually by engineers. This approach:
- offers simplicity and transparency,
- but is inefficient, inflexible, and lacks personalization.
Pinterest introduces a new approach: DRL-PUT (Deep Reinforcement Learning for Personalized Utility Tuning).
The Idea in Brief
The system frames the task of choosing utility weights as a reinforcement learning (RL) problem:
- the state represents the ad request context (user, session, ad features),
- the action is the selection of appropriate weights or hyperparameters for ranking,
- the reward reflects business goals such as CTR, LC-CTR, or composite objectives.
This way, instead of manually tuning weights, the model learns how to adaptively select the best ranking strategy for each user and situation.
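To make the framing concrete, here is a minimal sketch of the state/action/reward triple in Python; the field names and the reward placeholder are illustrative, not taken from the paper:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class AdRequestState:
    """Context of one ad request (field names are illustrative)."""
    user: Dict[str, float]     # user-level features
    session: Dict[str, float]  # e.g. hour of day, device type
    request: Dict[str, float]  # ad-request characteristics

@dataclass
class UtilityAction:
    """The action: one weight per predictor in the utility function."""
    weights: List[float]       # e.g. [w_ctr, w_conversion, w_long_click]

# One logged decision: the policy saw `state`, chose `action`,
# and the logged engagement later yields a scalar `reward`.
LoggedStep = Tuple[AdRequestState, UtilityAction, float]
```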
DRL-PUT Architecture
The approach is based on policy learning, not value estimation of $Q(s, a)$.
1. State Representation
The state $s$ includes:
- user features,
- session context (time, device type),
- ad request characteristics.
In effect, this is a high-dimensional feature vector fed into the policy network.
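A hedged sketch of how such a vector could be assembled; the concrete features below are hypothetical examples, not Pinterest's actual schema:

```python
from typing import Dict, List
import numpy as np

def build_state_vector(user: Dict[str, float],
                       session: Dict[str, float],
                       request: Dict[str, float],
                       feature_order: List[str]) -> np.ndarray:
    """Flatten user/session/request features into one dense vector.

    `feature_order` fixes the position of every feature so the policy
    network always sees the same layout; missing features default to 0.
    """
    merged = {**user, **session, **request}
    return np.array([merged.get(name, 0.0) for name in feature_order],
                    dtype=np.float32)

# Example usage (hypothetical features):
state = build_state_vector(
    user={"hist_ctr": 0.021, "age_bucket": 3.0},
    session={"hour_of_day": 14.0, "is_mobile": 1.0},
    request={"num_candidates": 250.0},
    feature_order=["hist_ctr", "age_bucket", "hour_of_day",
                   "is_mobile", "num_candidates"],
)
```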
2. Policy Network
The model is a deep neural network policy $\pi_\theta(a|s)$ that directly predicts the utility weights.
Instead of estimating the value of every possible action, it outputs an action directly.
- Input: state representation $s$,
- Output: weight vector $w$ that defines the ranking function:
$$
U(x) = \sum_i w_i \cdot f_i(x),
$$
where $f_i(x)$ are the individual predictors (CTR, conversion, long-click probability, etc.).
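A minimal PyTorch sketch of such a policy and the resulting utility scoring; the layer sizes, the softplus used to keep weights positive, and the three predictors are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class UtilityWeightPolicy(nn.Module):
    """Maps a state vector to a vector of utility weights (illustrative sizes)."""
    def __init__(self, state_dim: int, num_predictors: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_predictors),
            nn.Softplus(),  # keep weights positive; an assumption, not from the paper
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def utility_scores(weights: torch.Tensor, predictions: torch.Tensor) -> torch.Tensor:
    """U(x) = sum_i w_i * f_i(x) for every candidate ad.

    weights:     (num_predictors,)         one weight vector for this request
    predictions: (num_ads, num_predictors) f_i(x) for each candidate ad
    """
    return predictions @ weights  # (num_ads,)

# Usage: score and rank candidates for one request.
policy = UtilityWeightPolicy(state_dim=5, num_predictors=3)
state = torch.randn(5)
preds = torch.rand(10, 3)  # e.g. [p_ctr, p_conversion, p_long_click] per ad
ranking = utility_scores(policy(state), preds).argsort(descending=True)
```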
3. Reward
Pinterest experimented with different reward definitions:
- CTR,
- LC-CTR (long click-through rate),
- combined business metrics.
The reward $R$ is derived offline from online logs, avoiding the need for costly live training.
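For illustration, one way a reward could be computed from logged engagement; the click/long-click weighting below is hypothetical, not the paper's definition:

```python
def logged_reward(log_entry: dict, long_click_bonus: float = 1.0) -> float:
    """Scalar reward for one logged ad impression.

    `log_entry` is assumed to carry binary engagement signals; the
    composite weighting is illustrative, not Pinterest's actual metric.
    """
    click = float(log_entry.get("clicked", 0))
    long_click = float(log_entry.get("long_clicked", 0))  # dwell above a threshold
    return click + long_click_bonus * long_click

# Example: an impression that led to a long click.
r = logged_reward({"clicked": 1, "long_clicked": 1})  # -> 2.0
```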
4. Training
The method applies direct policy learning:
- instead of estimating $Q(s,a)$ (which is unstable and data-intensive),
- the model optimizes the policy directly, minimizing a loss linked to expected rewards.
Results
A/B testing in Pinterest’s production ad system, against the manually tuned utility function baseline, showed:
- CTR increased by 9.7%,
- LC-CTR increased by 7.7%.
In an industry where even a fraction of a percent matters, this is a major improvement.
Why It Works
- Personalization – different users and contexts get different utility weights.
- Adaptability – the system adjusts to seasonal changes and new ad campaigns.
- Automation – engineers no longer need to manually tune parameters.
Math Behind the Scenes
The task is modeled as a Markov Decision Process (MDP):
- $s_t$ – state (user/context features),
- $a_t$ – action (utility weights),
- $r_t$ – reward (CTR/LC-CTR),
- $\pi_\theta$ – policy.
The objective is to maximize expected reward:
$$
J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right].
$$
Policy gradient update:
$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(a_i|s_i) R_i.
$$
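A compact REINFORCE-style sketch of this update in PyTorch; modeling the logged utility weights with a Gaussian policy is an assumption made here for illustration, not necessarily the paper's exact training objective:

```python
import torch
import torch.nn as nn

def policy_gradient_step(policy: nn.Module,
                         optimizer: torch.optim.Optimizer,
                         states: torch.Tensor,    # (N, state_dim)
                         actions: torch.Tensor,   # (N, num_predictors) logged weights
                         rewards: torch.Tensor,   # (N,) logged rewards
                         sigma: float = 0.1) -> float:
    """One update that maximizes (1/N) * sum_i log pi_theta(a_i|s_i) * R_i."""
    mean = policy(states)                            # policy output = mean weight vector
    dist = torch.distributions.Normal(mean, sigma)   # Gaussian policy (assumption)
    log_prob = dist.log_prob(actions).sum(dim=-1)    # log pi_theta(a_i | s_i)
    loss = -(log_prob * rewards).mean()              # minimize the negative objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```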
Conclusion
This work shows that:
- reinforcement learning can be applied successfully to large-scale ad systems,
- DRL-PUT outperforms manual tuning,
- real-time personalization and adaptability are possible at production scale.
It’s a strong example of how theory meets practice: RL moving from research papers into billion-dollar business applications.
📎 Links
- Based on the publication 📄 arXiv:2509.05292 (https://arxiv.org/abs/2509.05292)