Imagine you’re learning to play chess. You master all the rules, strategies, and openings. You become a pretty good player. Now, someone introduces a new piece with completely new rules of movement. As you learn to play with this new piece, do you forget how to move a pawn or a knight? Of course not. Your brain can integrate new knowledge without losing what it has already acquired. Unfortunately, for many artificial intelligence systems, this is a huge challenge, known as “catastrophic forgetting”.
Machine learning models are typically trained on a huge dataset all at once. When new data appears, the model has to be retrained from scratch on the combined old and new data, which is time-consuming and expensive. Fine-tuning the model on only the new data, on the other hand, often makes it “forget” what it learned before.
Scientists Lecheng Kong, Theodore Vasiloudis, Seongjun Yun, Han Xie, and Xiang Song in their publication “Dynamic Mixture-of-Experts for Incremental Graph Learning” (arXiv:2508.09974) propose an innovative solution to this problem, especially in the context of graph data.
What is graph data?
Think about your network of friends on Facebook. You are a node, your friends are other nodes, and the connections between you are edges. This is a graph. Graphs are everywhere: in social networks, recommendation systems (e.g., Netflix recommending movies to you based on what people with similar tastes have watched), biology (protein interaction networks), or logistics (road connection maps). Analyzing these graphs allows for the discovery of valuable information and patterns.
Experts Who Learn Together
The key to the authors’ solution is the concept of a “Mixture-of-Experts” (MoE). Instead of one monolithic model that has to know everything, they create a group of “experts”: smaller, specialized neural networks.
When new data appears (e.g., new users and their interactions on a social network), the system does not try to forcefully change the existing experts. Instead… it adds a new expert! This new expert specializes in handling precisely this new data.
It’s a bit like a company, instead of retraining the entire team to handle a new type of client, hiring a new specialist who is an expert in that field.
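The idea can be sketched in a few lines of Python. This is a toy illustration, not the paper’s implementation: the class name, the callable “experts,” and the hand-picked gating weights are all made up for the example.

```python
# Toy sketch of the dynamic Mixture-of-Experts idea (hypothetical names):
# experts are plain callables; when data from a new distribution arrives,
# we append a new expert instead of retraining the old ones.

class DynamicMixtureOfExperts:
    def __init__(self):
        self.experts = []  # previously trained experts, left untouched

    def add_expert(self, expert):
        # New data -> new specialist; existing experts are not modified.
        self.experts.append(expert)

    def predict(self, x, gate_weights):
        # gate_weights[i] is the gating network's confidence in expert i;
        # here we pass the weights in by hand instead of learning them.
        assert len(gate_weights) == len(self.experts)
        return sum(w * e(x) for w, e in zip(gate_weights, self.experts))

moe = DynamicMixtureOfExperts()
moe.add_expert(lambda x: 2 * x)   # "old" expert
moe.add_expert(lambda x: x + 1)   # newly added expert for the new data
print(moe.predict(3.0, [0.5, 0.5]))  # 0.5*6 + 0.5*4 = 5.0
```

In a real system each expert would be a neural network and the gating weights would come from a learned gating network, but the structural point is the same: growth by addition, not by overwriting.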
Mathematics in the Service of Memory
How does the system know which “expert” to use for a given input? A special “gating network” decides which experts are most competent to process that input. The final result is a weighted sum of the individual experts’ predictions.
This can be written in a simplified form as:
$$ \text{Output} = \sum_{i=1}^{n} G(x)_i \cdot E_i(x) $$
Where:
- $n$ is the number of experts.
- $x$ is the input data.
- $E_i(x)$ is the output of the $i$-th expert.
- $G(x)_i$ is the “weight” or “confidence” that the gating network assigns to the $i$-th expert for the data $x$.
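The formula can be illustrated directly. In the sketch below the gating scores are supplied by hand and turned into weights with a softmax; in the actual method the gating network is learned, so treat the numbers and function names here as assumptions for illustration only.

```python
import math

def softmax(scores):
    # Turn raw gating scores into non-negative weights that sum to 1.
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_output(x, experts, gating_scores):
    # Output = sum_i G(x)_i * E_i(x), with G(x) = softmax(gating scores).
    weights = softmax(gating_scores)
    return sum(w * expert(x) for w, expert in zip(weights, experts))

experts = [lambda x: x * 2, lambda x: x + 10]
# Equal scores -> equal weights of 0.5 each:
print(moe_output(4.0, experts, [0.0, 0.0]))  # 0.5*8 + 0.5*14 = 11.0
```

Raising one expert’s gating score shifts weight toward it, which is exactly how the system “routes” an input to the most competent specialist.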
Moreover, the authors introduce a special regularization loss, which ensures that the older experts retain their knowledge while also helping the new expert learn. It’s a kind of teamwork between models.
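The paper’s exact loss function is not reproduced here. As a loose, hypothetical illustration of the general pattern, one common way to keep old knowledge stable during incremental training is to add a distillation-style penalty for drifting away from the old experts’ pre-update predictions; the function name and the weighting factor below are invented for the example.

```python
def regularized_loss(task_loss, old_output, old_output_snapshot, lam=0.1):
    # Hypothetical sketch, NOT the paper's exact formulation:
    # total loss = task loss on the new data
    #            + lam * penalty for the old experts' output drifting
    #              away from a snapshot taken before the update.
    drift = (old_output - old_output_snapshot) ** 2
    return task_loss + lam * drift

# No drift -> the penalty vanishes and only the task loss remains:
print(regularized_loss(0.5, 1.0, 1.0))  # 0.5
```

The key intuition is shared with the paper: the new expert is trained on the new data, while a regularization term keeps the previously learned behavior from being overwritten.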
A Real-Life Example: Music Recommendation System
Imagine a music streaming service that learns your taste. At first, you listen mainly to rock, so the system creates a “rock expert” that perfectly recommends new bands to you. After some time, you discover jazz. A traditional system, while learning about jazz, might “forget” about your rock preferences.
Thanks to the DyMoE (Dynamic Mixture-of-Experts) method, the system, instead of modifying the “rock expert,” would create a new “jazz expert.” When you ask for a song, the gating network asks: “Is this more of a rock or a jazz query?” Based on the answer, it activates the appropriate expert (or both, if you’re listening to fusion!). And most importantly, as you add more genres, the system adds more experts, becoming smarter and smarter without “resetting” its knowledge.
Conclusion
The approach proposed in the publication is an important step towards creating more flexible and adaptive AI systems. It allows for continuous learning in a dynamically changing world, without the fear of losing valuable, previously acquired knowledge. This is a technology that could revolutionize fields such as recommendation systems, social network analysis, and drug discovery.
📎 Links
- Based on the publication 📄 arXiv:2508.09974 (PDF)