The paper “PinFM: Foundation Model for User Activity Sequences at a Billion-Scale Visual Discovery Platform” introduces a transformer with more than 20 billion parameters, pretrained on Pinterest user interaction sequences. Its goal is to build a universal sequence model applicable to various recommendation tasks, including content ranking, related Pins, and personalized feeds.
Background and Motivation
Traditional recommendation systems rely on specialized models for each task. The explosion of data volume and signal diversity calls for a generalized pretraining–finetuning paradigm. PinFM was developed to:
- Leverage extensive interaction histories (views, clicks, saves) to better model user preferences.
- Reduce maintenance overhead of many task-specific models.
- Accelerate deployment of new recommendation features.
Model Architecture
PinFM uses a transformer with a novel Deduplicated Cross-Attention Transformer (DCAT) mechanism. Key elements:
- Input: a sequence of (item_id, action_type) pairs of length $T$, embedded into vectors.
- Cross-Attention: instead of full quadratic attention, DCAT aggregates only unique token embeddings, reducing compute from $O(T^2)$ to $O(TU)$ where $U$ is the number of unique tokens.
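This summary does not specify how the item and action embeddings are fused into a single input vector; a minimal sketch of one common choice (summing the two lookups), with hypothetical vocabulary sizes, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ITEMS, NUM_ACTIONS, D = 1000, 4, 16  # hypothetical sizes for this sketch

# Separate lookup tables for item IDs and action types.
item_emb = rng.normal(size=(NUM_ITEMS, D))
action_emb = rng.normal(size=(NUM_ACTIONS, D))

def embed_sequence(item_ids, action_types):
    # Fuse each (item_id, action_type) event into one vector by summing
    # the two lookups; the result is the (T, D) transformer input.
    return item_emb[item_ids] + action_emb[action_types]
```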
The self-attention mechanism is defined as:
$$ \text{Attention}(Q,K,V) = \mathrm{softmax}\Bigl(\frac{QK^\top}{\sqrt{d}}\Bigr)V, $$
where $Q,K,V\in\mathbb{R}^{T\times d}$.
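The deduplication idea can be illustrated with a small example: identical key rows receive identical softmax scores, so attending over the $U$ unique keys with multiplicity counts reproduces full attention exactly while scoring only $T \times U$ pairs. The following is an illustrative NumPy sketch of that equivalence, not the paper's DCAT implementation:

```python
import numpy as np

def full_attention(Q, K, V, d):
    # Standard softmax attention over all T key/value rows: O(T^2) scores.
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ V

def dedup_attention(Q, K, V, d):
    # Collapse duplicate key rows; identical keys get identical scores,
    # so a count-weighted softmax over the U unique keys is exact.
    Ku, inv, counts = np.unique(K, axis=0, return_inverse=True,
                                return_counts=True)
    Vu = np.zeros((len(Ku), V.shape[1]))
    Vu[inv] = V  # duplicate keys are assumed to share a value row
    scores = Q @ Ku.T / np.sqrt(d)  # O(T*U) instead of O(T^2)
    w = np.exp(scores - scores.max(axis=1, keepdims=True)) * counts
    return (w / w.sum(axis=1, keepdims=True)) @ Vu
```

The savings grow with repetition: when a user's history revisits the same items many times, $U \ll T$ and the score matrix shrinks accordingly.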
Performance Optimizations
To meet strict latency and throughput demands, the team introduced:
- 4-bit (int4) embedding quantization – storing embeddings in 4 bits instead of 32 cuts their memory footprint by 8× (an 87.5% reduction versus float32) with minimal performance loss.
- Sequence mapping – grouping adjacent tokens to further reduce cross-attention calls.
- Asynchronous pipelining – parallel execution of Transformer layers.
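As an illustration of the int4 idea (a sketch under simple assumptions, not Pinterest's production kernel), two 4-bit codes can be packed per byte with one float scale per embedding row:

```python
import numpy as np

def quantize_int4(x):
    # Symmetric per-row quantization to the int4 range [-8, 7].
    # Assumes an even embedding dimension so codes pair up cleanly.
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    # Pack two 4-bit codes per byte -> 8x smaller than float32 storage
    # (plus one float scale per row).
    lo = q[:, ::2].astype(np.uint8) & 0x0F
    hi = q[:, 1::2].astype(np.uint8) & 0x0F
    return (lo | (hi << 4)).astype(np.uint8), scale

def dequantize_int4(packed, scale):
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    # Sign-extend the 4-bit codes back to signed integers.
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    q[:, ::2], q[:, 1::2] = lo, hi
    return q * scale
```

The per-row scale bounds the round-trip error at roughly half a quantization step per element, which is what makes the "minimal performance loss" claim plausible for embedding tables.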
Experimental Results and Insights
A/B tests on 100M users showed:
- 600% throughput increase on the existing infrastructure.
- 20% engagement uplift for recommendations involving previously unseen items.
- 30% operational cost savings through consolidation of multiple models.
Conclusion and Future Work
PinFM demonstrates that large-scale sequence models can be production-ready. Novel optimizations enable deployment on platforms handling billions of daily events. Future extensions include multimodal signals (images, text) and online adaptive learning.
📎 Links
- Based on the publication 📄 arXiv:2507.12704