Time series forecasting is one of the most important applications of machine learning — from demand prediction and infrastructure monitoring to flood forecasting. The problem? Standard models optimize for typical cases, yet it's precisely the atypical ones — extreme events — that are often most important to predict. M²FMoE is a model that learns to predict both.

The Problem: Extreme Events Break Standard Models

Time series forecasting has made remarkable progress. Transformers, frequency-domain methods, and hybrid architectures achieve impressive results on benchmarks. But there’s a catch.

Most models optimize for average error across all timesteps. This means:

  • They learn patterns that work most of the time
  • Extreme events are rare, so they contribute little to the loss
  • The model “ignores” outliers to minimize overall error

Result? When a flood approaches, the model predicts “slightly above normal” — because that’s what minimizes average error. But you don’t care about average error when water is rising.
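To see why, here is a minimal NumPy sketch (with made-up numbers) of how minimizing mean squared error pulls a predictor toward the typical value:

```python
import numpy as np

# Illustrative only: 1,000 "normal" observations around 1.0 plus a few
# rare extremes at 10.0, mimicking a heavy-tailed series.
rng = np.random.default_rng(0)
y = rng.normal(1.0, 0.1, size=1000)
y[rng.choice(1000, size=10, replace=False)] = 10.0  # ~1% extreme events

# The constant predictor that minimizes MSE is the sample mean...
best_constant = y.mean()  # ~1.09: barely moved by the extremes
print(f"MSE-optimal constant: {best_constant:.2f}")

# ...so its error *during* an extreme event is enormous, yet the average
# loss it pays for that is tiny, because extremes are rare.
print(f"Squared error at an extreme: {(10.0 - best_constant) ** 2:.1f}")
```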

Why is this so hard?

Extreme events have fundamentally different characteristics:

  • Rare — maybe 1-5% of all observations
  • Heavy-tailed distributions — not Gaussian, not predictable from averages
  • Different dynamics — normal fluctuations follow seasonal patterns; extreme events follow storm systems
  • High stakes — errors during extreme events cost orders of magnitude more

Traditional solution? Label extreme events manually and train specialized models. But labels are expensive, subjective, and often unavailable in real-time.

The Solution: Frequency Experts Without Labels

M²FMoE (Multi-Resolution Multi-View Frequency Mixture-of-Experts) takes a different approach. Instead of labeling extreme events, it learns to recognize them through frequency signatures.

The key insight: extreme events look different in the frequency domain.

  • Normal patterns → strong periodic components (daily, weekly, seasonal)
  • Extreme events → sudden energy in unusual frequency bands (see the sketch below)
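A quick NumPy sketch, using synthetic signals of my own choosing rather than the paper's data, makes the contrast concrete:

```python
import numpy as np

t = np.arange(1024)
daily = np.sin(2 * np.pi * t / 24)   # clean periodic pattern
spike = np.zeros(1024)
spike[500:504] = 5.0                 # short transient "event"

for name, x in [("periodic", daily), ("spike", spike)]:
    mag = np.abs(np.fft.rfft(x))
    # Fraction of spectral energy concentrated in the top 3 frequency bins:
    top3 = np.sort(mag)[-3:].sum() / mag.sum()
    print(f"{name}: top-3 bins hold {top3:.0%} of spectral energy")
# The periodic signal packs its energy into a few bins; the spike smears
# it across the whole spectrum, a distinct frequency signature.
```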

Architecture Overview

M²FMoE combines three modules:

Input → [Multi-View Frequency MoE] → [Multi-Resolution Fusion] → [Temporal Gating] → Prediction
              ↓                              ↓                         ↓
        Fourier + Wavelet              Coarse → Fine            Long vs Short term
        Expert Routing                 Hierarchical                Balance

Module 1: Multi-View Frequency Mixture-of-Experts

This is the core innovation. Instead of one monolithic model, M²FMoE uses specialized experts for different frequency bands — and does this from two complementary perspectives.

Fourier View

The Fourier transform decomposes a signal into pure frequencies. Each expert specializes in one frequency band:

  • Expert 1: Low frequencies (long-term trends)
  • Expert 2: Medium frequencies (weekly/daily patterns)
  • Expert 3: High frequencies (rapid fluctuations)

The routing mechanism decides which expert handles each input:

$$\alpha = \text{Softmax}(\tilde{G}(\tilde{M}))$$

where $\tilde{M}$ is the magnitude spectrum and $\tilde{G}$ is a learned gating network.
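Here is a minimal PyTorch sketch of this routing. The equal band-per-expert split, the linear experts, and the gate acting on the full magnitude spectrum are illustrative assumptions; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class FourierMoE(nn.Module):
    """Sketch: split the rFFT spectrum into bands, one expert per band,
    with a learned softmax gate over the magnitude spectrum."""
    def __init__(self, seq_len: int, n_experts: int = 3, d_out: int = 64):
        super().__init__()
        self.n_bins = seq_len // 2 + 1                 # rFFT output size
        self.band = self.n_bins // n_experts           # bins per expert
        self.experts = nn.ModuleList(
            [nn.Linear(self.band, d_out) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(self.n_bins, n_experts)  # G~ in the formula

    def forward(self, x):                              # x: (batch, seq_len)
        mag = torch.fft.rfft(x, dim=-1).abs()          # M~: magnitude spectrum
        alpha = torch.softmax(self.gate(mag), dim=-1)  # routing weights α
        outs = torch.stack(
            [expert(mag[:, i * self.band:(i + 1) * self.band])
             for i, expert in enumerate(self.experts)],
            dim=1,
        )                                              # (batch, n_experts, d_out)
        return (alpha.unsqueeze(-1) * outs).sum(dim=1) # α-weighted mixture
```

For example, `FourierMoE(seq_len=96)(torch.randn(32, 96))` returns a `(32, 64)` feature tensor.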

Wavelet View

The Fourier transform has a limitation: it discards temporal information. In the magnitude spectrum, a sudden spike looks the same whether it happened yesterday or last month.

Wavelets preserve both frequency and time localization. M²FMoE adds wavelet experts that can detect when unusual frequencies appear — crucial for extreme events.
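A small PyWavelets sketch shows this time localization on a synthetic signal (illustrative only):

```python
import numpy as np
import pywt  # PyWavelets

t = np.arange(1024)
signal = np.sin(2 * np.pi * t / 24)  # regular daily cycle
signal[700:704] += 5.0               # transient extreme event

# Continuous wavelet transform: rows = scales, columns = time steps.
coefs, freqs = pywt.cwt(signal, scales=np.arange(1, 33), wavelet="morl")
energy_over_time = (coefs ** 2).sum(axis=0)
print("event localized near t =", energy_over_time.argmax())  # ≈ 700
# Unlike the Fourier magnitude spectrum, the wavelet energy map tells us
# *when* the unusual frequencies appeared.
```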

Cross-View Alignment

Here’s the clever part. Fourier and Wavelet domains use different scales. How do you ensure experts are looking at the same phenomena?

Theorem 1 establishes a mapping between Fourier frequency $f$ and wavelet scale $a$:

$$a = \frac{\gamma}{f}$$

where $\gamma$ is the wavelet center frequency. This ensures both views split the spectrum consistently.
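You can verify this mapping with PyWavelets, which uses the same convention; the hourly sampling rate in the example is my own choice:

```python
import pywt

gamma = pywt.central_frequency("morl")  # wavelet center frequency γ ≈ 0.81

def freq_to_scale(f_hz: float, fs: float) -> float:
    """Map a Fourier frequency (in Hz) to the wavelet scale a = γ / f,
    with f normalized by the sampling rate fs."""
    return gamma / (f_hz / fs)

fs = 1.0            # one sample per hour
daily = 1.0 / 24.0  # once-a-day frequency, in cycles per hour
a = freq_to_scale(daily, fs)
print(f"scale a = {a:.1f}")                   # ≈ 19.5 for the Morlet wavelet
print(pywt.scale2frequency("morl", a) * fs)   # sanity check: ≈ 1/24
```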

Module 2: Multi-Resolution Adaptive Fusion

Not all patterns operate at the same scale. Seasonal trends need months of context; sudden spikes need hours.

M²FMoE processes the signal at multiple resolutions:

  1. Coarse resolution — captures long-term trends
  2. Medium resolution — captures weekly/daily cycles
  3. Fine resolution — captures rapid changes

These are then hierarchically fused from coarse to fine, allowing the model to build a complete picture.
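A minimal PyTorch sketch of such a scheme, assuming average pooling for the coarser resolutions and a concatenate-then-mix fusion rule (both are my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineFusion(nn.Module):
    """Sketch: pool the input to several resolutions, embed each, and fuse
    hierarchically from coarse to fine. The concatenate-then-mix fusion
    rule is an assumption, not necessarily the paper's exact mechanism."""
    def __init__(self, seq_len: int, d_model: int = 64, factors=(8, 4, 1)):
        super().__init__()
        # seq_len should be divisible by every pooling factor.
        self.factors = factors  # coarse → fine
        self.embed = nn.ModuleList(
            [nn.Linear(seq_len // k, d_model) for k in factors]
        )
        self.fuse = nn.ModuleList(
            [nn.Linear(2 * d_model, d_model) for _ in factors[1:]]
        )

    def forward(self, x):  # x: (batch, seq_len)
        views = [
            F.avg_pool1d(x.unsqueeze(1), k).squeeze(1) if k > 1 else x
            for k in self.factors
        ]
        h = self.embed[0](views[0])  # start from the coarsest view
        for emb, fuse, v in zip(self.embed[1:], self.fuse, views[1:]):
            h = fuse(torch.cat([h, emb(v)], dim=-1))  # refine step by step
        return h
```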

Module 3: Temporal Gating Integration

The final module balances two types of information:

  • Long-term trends ($H_r$) — slow-moving baselines
  • Frequency-aware features ($H_h$) — detected patterns from experts

A learned gating mechanism combines them:

$$\text{Output} = G \odot H_r + (1-G) \odot H_h$$

where $G$ is a sigmoid gate. During normal periods, the model relies more on trends. During anomalies, it shifts weight to frequency features.
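In code, the gate might look like this minimal sketch; how $G$ is conditioned is my guess, not a detail stated above:

```python
import torch
import torch.nn as nn

class TemporalGate(nn.Module):
    """Sketch of Output = G ⊙ H_r + (1 − G) ⊙ H_h. Conditioning the gate
    on both streams is an assumption; the paper may compute G differently."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_r, h_h):  # trend features vs frequency features
        g = torch.sigmoid(self.gate(torch.cat([h_r, h_h], dim=-1)))
        return g * h_r + (1.0 - g) * h_h  # elementwise (⊙) blend
```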

Training: The Loss Function

M²FMoE uses a composite loss with three terms:

  1. Forecasting loss (MSE) — standard prediction error
  2. Diversity loss — encourages experts to specialize in different bands
  3. Consistency loss — aligns Fourier and Wavelet expert outputs

$$\mathcal{L} = \mathcal{L}_{\text{forecast}} + \lambda_1 \mathcal{L}_{\text{diversity}} + \lambda_2 \mathcal{L}_{\text{consistency}}$$

The diversity loss is key: without it, all experts would learn the same thing.
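Here is one plausible PyTorch instantiation. The MSE forecasting term is standard, but the exact diversity and consistency terms are not spelled out here, so the forms below (a pairwise expert-similarity penalty and an MSE between view features) are assumptions:

```python
import torch
import torch.nn.functional as F

def composite_loss(pred, target, expert_outs, h_fourier, h_wavelet,
                   lam1=0.1, lam2=0.1):
    """Sketch of L = L_forecast + λ1·L_diversity + λ2·L_consistency.
    The diversity and consistency forms below are assumptions."""
    l_forecast = F.mse_loss(pred, target)

    # Diversity (assumed form): penalize cosine similarity between every
    # pair of expert outputs, pushing experts toward different bands.
    e = F.normalize(expert_outs, dim=-1)       # (batch, n_experts, d)
    sim = torch.einsum("bid,bjd->bij", e, e)   # pairwise similarities
    off_diag = sim - torch.eye(e.shape[1], device=sim.device)
    l_diversity = off_diag.abs().mean()

    # Consistency (assumed form): the two views should roughly agree.
    l_consistency = F.mse_loss(h_fourier, h_wavelet)

    return l_forecast + lam1 * l_diversity + lam2 * l_consistency
```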

Experiments: California Reservoir Data

The authors tested M²FMoE on real hydrological data from five reservoirs in Santa Clara County, California:

  • Almaden, Coyote, Lexington, Stevens Creek, Vasona
  • 28 years of hourly water level measurements (1991-2019)
  • Heavy-tailed distributions with clear extreme events

Baselines (13 models)

| Category | Models |
|---|---|
| Attention-based | CATS, TQNet, iTransformer |
| Frequency-domain | FreqMoE, Umixer |
| Linear/Hybrid | KAN, CycleNet, PatchTST, TimesNet, TimeMixer |
| Extreme-event (with labels) | DAN, MCANN |

Results: 8-Hour Forecast

| Dataset | M²FMoE | Best Baseline | Improvement |
|---|---|---|---|
| Almaden | 7.99 | 14.73 (FreqMoE) | 45.7% |
| Coyote | 48.80 | 80.94 (iTransformer) | 39.7% |
| Lexington | 251.96 | 386.99 (iTransformer) | 34.9% |

Results: 72-Hour Forecast

For longer horizons, M²FMoE achieves:

  • 22.30% average improvement over the best baselines that do not use extreme-event labels
  • 9.19% improvement over methods that do use extreme-event labels

This is remarkable: M²FMoE beats models that have access to information it doesn’t have.

Statistical Significance

All improvements are statistically significant (p < 0.05) under the Wilcoxon signed-rank test.
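For reference, a paired test like this can be run with SciPy; the error values below are made up and only show the mechanics:

```python
from scipy.stats import wilcoxon

# Made-up paired per-window errors, illustrating the mechanics only.
errors_m2fmoe   = [7.8, 8.1, 7.5, 8.3, 7.9, 8.0, 7.7, 8.2]
errors_baseline = [14.5, 15.1, 14.2, 15.0, 14.8, 14.6, 14.9, 14.7]

stat, p = wilcoxon(errors_m2fmoe, errors_baseline)
print(f"Wilcoxon statistic = {stat}, p = {p:.4f}")  # p < 0.05 → significant
```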

Ablation Studies: What Matters?

The authors systematically removed components to measure their importance:

| Removed Component | Performance Drop |
|---|---|
| Wavelet view | Significant |
| Multi-resolution fusion | Significant |
| Temporal gating | Moderate |
| Diversity loss | Moderate |

Key finding: Wavelet experts activate more strongly during extreme events, while Fourier experts handle regular patterns. The dual-view design is essential.

Expert Count

How many experts are optimal?

  • Too few (1-2): Can’t specialize enough
  • Too many (5+): Overhead without benefit
  • Sweet spot: 3-4 experts

Why This Matters

For practitioners

If you’re forecasting time series with occasional extreme events (floods, demand spikes, equipment failures), M²FMoE offers:

  • No labeling required — learns extreme patterns automatically
  • Interpretable — you can see which experts activate when
  • Practical horizons — tested on 8h and 72h forecasts

For researchers

M²FMoE demonstrates:

  • Frequency-domain expertise can replace explicit event labels
  • Multi-view (Fourier + Wavelet) beats single-view approaches
  • Mixture-of-Experts scales well for time series

Limitations

  • Domain-specific tuning: Lookback window and expert count may need adjustment
  • Computational cost: More expensive than simple linear models
  • Evaluation: Tested primarily on hydrological data

Summary

M²FMoE shows that you don’t need to label extreme events to predict them. By combining:

  1. Dual frequency views (Fourier for spectrum, Wavelet for localization)
  2. Specialized experts for different frequency bands
  3. Multi-resolution fusion for different time scales
  4. Temporal gating for adaptive combination

…the model learns to recognize extreme events through their frequency signatures. On California reservoir data, it beats 13 baselines including methods that use extreme-event labels.

The broader lesson: instead of treating rare events as noise to be ignored, we can design architectures that naturally learn to handle them.