Time series forecasting is one of the most important applications of machine learning, spanning demand prediction, infrastructure monitoring, and flood forecasting. The problem? Standard models optimize for typical cases. Yet it’s precisely the atypical ones — extreme events — that are often most important to predict. M²FMoE is a model that learns to predict both the typical and the extreme.
The Problem: Extreme Events Break Standard Models
Time series forecasting has made remarkable progress. Transformers, frequency-domain methods, and hybrid architectures achieve impressive results on benchmarks. But there’s a catch.
Most models optimize for average error across all timesteps. This means:
- They learn patterns that work most of the time
- Extreme events are rare, so they contribute little to the loss
- The model “ignores” outliers to minimize overall error
Result? When a flood approaches, the model predicts “slightly above normal” — because that’s what minimizes average error. But you don’t care about average error when water is rising.
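A toy calculation makes this imbalance concrete. The numbers below are synthetic, not from the paper: a constant mean forecast over a series with 1% injected spikes shows how little those spikes weigh in an averaged loss.

```python
import numpy as np

# Hypothetical 1000-step series: 990 "normal" points and 10 extreme spikes.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=1000)
y[:10] += 20.0  # inject 10 extreme events (1% of the data)

# A "predict the average" model: a constant forecast at the sample mean.
pred = np.full_like(y, y.mean())

err = (y - pred) ** 2
mse_all = err.mean()
mse_extreme = err[:10].mean()   # error on the 10 extreme steps
mse_normal = err[10:].mean()    # error on the 990 normal steps

# Each extreme step is hugely mispredicted, yet the 10 spikes supply
# only ~1% of the terms in the averaged loss, so the mean error stays low.
```

The per-step error on the spikes is orders of magnitude larger than on normal steps, yet the overall MSE barely notices.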
Why is this so hard?
Extreme events have fundamentally different characteristics:
- Rare — maybe 1-5% of all observations
- Heavy-tailed distributions — not Gaussian, not predictable from averages
- Different dynamics — normal fluctuations follow seasonal patterns; extreme events follow storm systems
- High stakes — errors during extreme events cost orders of magnitude more
Traditional solution? Label extreme events manually and train specialized models. But labels are expensive, subjective, and often unavailable in real-time.
The Solution: Frequency Experts Without Labels
M²FMoE (Multi-Resolution Multi-View Frequency Mixture-of-Experts) takes a different approach. Instead of labeling extreme events, it learns to recognize them through frequency signatures.
The key insight: extreme events look different in the frequency domain.
- Normal patterns → strong periodic components (daily, weekly, seasonal)
- Extreme events → sudden energy in unusual frequency bands
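This signature is visible with a plain FFT. The signals below are synthetic illustrations, not data from the paper: a clean daily-style cycle versus the same cycle with one injected spike.

```python
import numpy as np

n = 512
t = np.arange(n)
periodic = np.sin(2 * np.pi * t / 24)   # clean daily-style cycle
spiky = periodic.copy()
spiky[300] += 10.0                      # one sudden extreme event

def high_band_energy(x):
    # Total magnitude in the upper half of the real FFT spectrum.
    mag = np.abs(np.fft.rfft(x))
    return mag[len(mag) // 2:].sum()

e_normal = high_band_energy(periodic)
e_extreme = high_band_energy(spiky)
# The spike spreads energy across high-frequency bins that the
# clean periodic signal leaves almost empty.
```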
Architecture Overview
M²FMoE combines three modules:
```
Input → [Multi-View Frequency MoE] → [Multi-Resolution Fusion] → [Temporal Gating] → Prediction
                  │                            │                         │
          Fourier + Wavelet              coarse → fine            long- vs. short-term
           expert routing             hierarchical fusion               balance
```
Module 1: Multi-View Frequency Mixture-of-Experts
This is the core innovation. Instead of one monolithic model, M²FMoE uses specialized experts for different frequency bands — and does this from two complementary perspectives.
Fourier View
Fourier transform decomposes a signal into pure frequencies. Each expert specializes in a frequency band:
- Expert 1: Low frequencies (long-term trends)
- Expert 2: Medium frequencies (weekly/daily patterns)
- Expert 3: High frequencies (rapid fluctuations)
The routing mechanism decides which expert handles each input:
$$\alpha = \text{Softmax}(\tilde{G}(\tilde{M}))$$
where $\tilde{M}$ is the magnitude spectrum and $\tilde{G}$ is a learned gating network.
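The routing step can be sketched as follows. The even band split, the log-energy summary, and the stand-in for the learned gate $\tilde{G}$ are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def route(x, n_experts=3):
    # Magnitude spectrum (the M~ in the routing formula).
    mag = np.abs(np.fft.rfft(x))
    # Summarize into per-band energies: low / mid / high.
    bands = np.array_split(mag, n_experts)
    energy = np.array([b.mean() for b in bands])
    # A fixed log-energy map stands in for the learned gate G~.
    return softmax(np.log1p(energy))    # alpha: weights over experts

slow = np.sin(2 * np.pi * np.arange(256) / 128)  # low-frequency signal
alpha = route(slow)
# For a slow signal, most routing weight lands on the low-frequency expert.
```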
Wavelet View
Fourier has a limitation: it loses temporal information. A sudden spike looks the same whether it happened yesterday or last month.
Wavelets preserve both frequency and time localization. M²FMoE adds wavelet experts that can detect when unusual frequencies appear — crucial for extreme events.
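A one-level Haar transform (a simple stand-in; the paper's wavelet choice may differ) shows the localization property: the detail coefficients pinpoint where a spike occurred, which a Fourier magnitude spectrum cannot.

```python
import numpy as np

def haar_detail(x):
    # High-pass (detail) branch of a one-level Haar transform:
    # differences of adjacent pairs, so time position is preserved.
    x = x.reshape(-1, 2)
    return (x[:, 0] - x[:, 1]) / np.sqrt(2)

sig = np.sin(2 * np.pi * np.arange(256) / 32)
sig[200] += 8.0                     # extreme event at t = 200

d = haar_detail(sig)
peak = int(np.abs(d).argmax())      # index in half-rate time
# The largest detail coefficient sits at pair index 200 // 2 = 100,
# i.e. exactly where the spike happened.
```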
Cross-View Alignment
Here’s the clever part. Fourier and Wavelet domains use different scales. How do you ensure experts are looking at the same phenomena?
Theorem 1 establishes a mapping between Fourier frequency $f$ and wavelet scale $a$:
$$a = \frac{\gamma}{f}$$
where $\gamma$ is the wavelet center frequency. This ensures both views split the spectrum consistently.
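Numerically, the mapping just inverts each band edge. The value of $\gamma$ below is an illustrative constant, not taken from the paper:

```python
import numpy as np

# Theorem 1's mapping between Fourier frequency f and wavelet scale a.
# gamma is the wavelet's center frequency; 0.8125 is only a placeholder.
gamma = 0.8125

def scale_for_frequency(f):
    return gamma / f

# Fourier band edges (low -> high frequency) map to wavelet scale edges
# (large -> small scale), so both views split the spectrum consistently.
freqs = np.array([0.05, 0.1, 0.25])
scales = scale_for_frequency(freqs)
```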
Module 2: Multi-Resolution Adaptive Fusion
Not all patterns operate at the same scale. Seasonal trends need months of context; sudden spikes need hours.
M²FMoE processes the signal at multiple resolutions:
- Coarse resolution — captures long-term trends
- Medium resolution — captures weekly/daily cycles
- Fine resolution — captures rapid changes
These are then hierarchically fused from coarse to fine, allowing the model to build a complete picture.
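The coarse-to-fine idea can be sketched with plain average pooling. The real model fuses learned features; fixed pooling and upsampling stand in here:

```python
import numpy as np

def pool(x, k):
    # Average-pool x with window k (truncating any remainder).
    return x[: len(x) // k * k].reshape(-1, k).mean(axis=1)

def coarse_to_fine(x, factors=(8, 4, 1)):
    # Accumulate views from coarsest (k=8) to finest (k=1).
    fused = np.zeros_like(x)
    for k in factors:
        level = np.repeat(pool(x, k), k)   # upsample back to full length
        fused = fused + level
    return fused / len(factors)

x = np.sin(2 * np.pi * np.arange(64) / 16)
out = coarse_to_fine(x)   # same length, blending all three resolutions
```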
Module 3: Temporal Gating Integration
The final module balances two types of information:
- Long-term trends ($H_r$) — slow-moving baselines
- Frequency-aware features ($H_h$) — detected patterns from experts
A learned gating mechanism combines them:
$$\text{Output} = G \odot H_r + (1-G) \odot H_h$$
where $G$ is a sigmoid gate. During normal periods, the model relies more on trends. During anomalies, it shifts weight to frequency features.
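The mixing formula itself is a few lines. In the paper $G$ is produced by a learned network; the hand-picked gate pre-activations below are made up so the shift toward frequency features at an anomaly is visible:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H_r = np.array([1.0, 1.0, 1.0, 1.0])      # slow trend baseline
H_h = np.array([0.0, 0.0, 9.0, 0.0])      # expert features: anomaly at t=2
logits = np.array([3.0, 3.0, -3.0, 3.0])  # gate opens toward H_h at t=2
G = sigmoid(logits)

# Output = G * H_r + (1 - G) * H_h
out = G * H_r + (1 - G) * H_h
# Normal steps track the trend; the anomalous step follows the
# frequency-aware features instead.
```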
Training: The Loss Function
M²FMoE uses a composite loss with three terms:
- Forecasting loss (MSE) — standard prediction error
- Diversity loss — encourages experts to specialize in different bands
- Consistency loss — aligns Fourier and Wavelet expert outputs
$$\mathcal{L} = \mathcal{L}_{\text{forecast}} + \lambda_1 \mathcal{L}_{\text{diversity}} + \lambda_2 \mathcal{L}_{\text{consistency}}$$
The diversity loss is key: without it, all experts would learn the same thing.
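One plausible instantiation (not necessarily the paper's exact form) penalizes overlap between the experts' band-affinity profiles via pairwise cosine similarity:

```python
import numpy as np

def forecast_loss(pred, target):
    return ((pred - target) ** 2).mean()

def diversity_loss(expert_weights):
    # expert_weights: (n_experts, n_bands) band affinity per expert.
    # Mean off-diagonal cosine similarity: 0 if experts are orthogonal,
    # 1 if they have all collapsed onto the same profile.
    W = expert_weights / np.linalg.norm(expert_weights, axis=1, keepdims=True)
    sim = W @ W.T
    off_diag = sim - np.eye(len(W))
    return off_diag.sum() / (len(W) * (len(W) - 1))

pred, target = np.array([1.0, 2.0]), np.array([1.0, 3.0])
specialized = np.eye(3)        # each expert owns one band
collapsed = np.ones((3, 3))    # all experts identical

total = forecast_loss(pred, target) + 0.1 * diversity_loss(specialized)
# Specialized experts pay no diversity penalty; collapsed experts
# pay the maximum, pushing them apart during training.
```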
Experiments: California Reservoir Data
The authors tested M²FMoE on real hydrological data from five reservoirs in Santa Clara County, California:
- Almaden, Coyote, Lexington, Stevens Creek, Vasona
- 28 years of hourly water level measurements (1991-2019)
- Heavy-tailed distributions with clear extreme events
Baselines (13 models)
| Category | Models |
|---|---|
| Attention-based | CATS, TQNet, iTransformer |
| Frequency-domain | FreqMoE, Umixer |
| Linear/Hybrid | KAN, CycleNet, PatchTST, TimesNet, TimeMixer |
| Extreme-event (with labels) | DAN, MCANN |
Results: 8-Hour Forecast
| Dataset | M²FMoE | Best Baseline | Improvement |
|---|---|---|---|
| Almaden | 7.99 | 14.73 (FreqMoE) | 45.7% |
| Coyote | 48.80 | 80.94 (iTransformer) | 39.7% |
| Lexington | 251.96 | 386.99 (iTransformer) | 34.9% |
Results: 72-Hour Forecast
For longer horizons, M²FMoE achieves:
- 22.30% average improvement over best baselines (without extreme labels)
- 9.19% improvement over methods that use extreme-event labels
This is remarkable: M²FMoE beats models that have access to information it doesn’t have.
Statistical Significance
All improvements are statistically significant (p < 0.05) via Wilcoxon signed-rank test.
Ablation Studies: What Matters?
The authors systematically removed components to measure their importance:
| Removed Component | Performance Drop |
|---|---|
| Wavelet view | Significant |
| Multi-resolution fusion | Significant |
| Temporal gating | Moderate |
| Diversity loss | Moderate |
Key finding: Wavelet experts activate more strongly during extreme events, while Fourier experts handle regular patterns. The dual-view design is essential.
Expert Count
How many experts are optimal?
- Too few (1-2): Can’t specialize enough
- Too many (5+): Overhead without benefit
- Sweet spot: 3-4 experts
Why This Matters
For practitioners
If you’re forecasting time series with occasional extreme events (floods, demand spikes, equipment failures), M²FMoE offers:
- No labeling required — learns extreme patterns automatically
- Interpretable — you can see which experts activate when
- Practical horizons — tested on 8h and 72h forecasts
For researchers
M²FMoE demonstrates:
- Frequency-domain expertise can replace explicit event labels
- Multi-view (Fourier + Wavelet) beats single-view approaches
- Mixture-of-Experts scales well for time series
Limitations
- Domain-specific tuning: Lookback window and expert count may need adjustment
- Computational cost: More expensive than simple linear models
- Evaluation: Tested primarily on hydrological data
Summary
M²FMoE shows that you don’t need to label extreme events to predict them. By combining:
- Dual frequency views (Fourier for spectrum, Wavelet for localization)
- Specialized experts for different frequency bands
- Multi-resolution fusion for different time scales
- Temporal gating for adaptive combination
…the model learns to recognize extreme events through their frequency signatures. On California reservoir data, it beats 13 baselines including methods that use extreme-event labels.
The broader lesson: instead of treating rare events as noise to be ignored, we can design architectures that naturally learn to handle them.
Links
- Paper: arXiv:2601.08631
- Accepted at: AAAI 2026