Time series forecasting is one of the most important applications of machine learning — from demand prediction and infrastructure monitoring to flood forecasting. The problem? Standard models optimize for typical cases, yet it's precisely the atypical ones — extreme events — that are often most important to predict. M²FMoE is a model that learns to predict both.

The Problem: Extreme Events Break Standard Models

Time series forecasting has made remarkable progress. Transformers, frequency-domain methods, and hybrid architectures achieve impressive results on benchmarks. But there’s a catch.

Most models optimize for average error across all timesteps. This means:

  • They learn patterns that work most of the time
  • Extreme events are rare, so they contribute little to the loss
  • The model “ignores” outliers to minimize overall error

Result? When a flood approaches, the model predicts “slightly above normal” — because that’s what minimizes average error. But you don’t care about average error when water is rising.
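To see why, here is a minimal NumPy sketch (with made-up numbers) of how minimizing mean squared error pulls a predictor toward the typical value:

```python
import numpy as np

# Illustrative only: 1,000 "normal" observations around 1.0 plus a few
# rare extremes at 10.0, mimicking a heavy-tailed series.
rng = np.random.default_rng(0)
y = rng.normal(1.0, 0.1, size=1000)
y[rng.choice(1000, size=10, replace=False)] = 10.0  # ~1% extreme events

# The constant predictor that minimizes MSE is the sample mean...
best_constant = y.mean()  # ~1.09: barely moved by the extremes
print(f"MSE-optimal constant: {best_constant:.2f}")

# ...so its error *during* an extreme event is enormous, yet the average
# loss it pays for that is tiny, because extremes are rare.
print(f"Squared error at an extreme: {(10.0 - best_constant) ** 2:.1f}")
```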

Why is this so hard?

Extreme events have fundamentally different characteristics:

  • Rare — maybe 1-5% of all observations
  • Heavy-tailed distributions — not Gaussian, not predictable from averages
  • Different dynamics — normal fluctuations follow seasonal patterns; extreme events follow storm systems
  • High stakes — errors during extreme events cost orders of magnitude more

Traditional solution? Label extreme events manually and train specialized models. But labels are expensive, subjective, and often unavailable in real-time.

The Solution: Frequency Experts Without Labels

M²FMoE (Multi-Resolution Multi-View Frequency Mixture-of-Experts) takes a different approach. Instead of labeling extreme events, it learns to recognize them through frequency signatures.

The key insight: extreme events look different in the frequency domain.

  • Normal patterns → strong periodic components (daily, weekly, seasonal)
  • Extreme events → sudden energy in unusual frequency bands (see the sketch below)
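A quick NumPy sketch, using synthetic signals of my own choosing rather than the paper's data, makes the contrast concrete:

```python
import numpy as np

t = np.arange(1024)
daily = np.sin(2 * np.pi * t / 24)   # clean periodic pattern
spike = np.zeros(1024)
spike[500:504] = 5.0                 # short transient "event"

for name, x in [("periodic", daily), ("spike", spike)]:
    mag = np.abs(np.fft.rfft(x))
    # Fraction of spectral energy concentrated in the top 3 frequency bins:
    top3 = np.sort(mag)[-3:].sum() / mag.sum()
    print(f"{name}: top-3 bins hold {top3:.0%} of spectral energy")
# The periodic signal packs its energy into a few bins; the spike smears
# it across the whole spectrum, a distinct frequency signature.
```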

Architecture Overview

M²FMoE combines three modules:

Input → [Multi-View Frequency MoE] → [Multi-Resolution Fusion] → [Temporal Gating] → Prediction
              ↓                              ↓                         ↓
        Fourier + Wavelet              Coarse → Fine            Long vs Short term
        Expert Routing                 Hierarchical                Balance

Module 1: Multi-View Frequency Mixture-of-Experts

This is the core innovation. Instead of one monolithic model, M²FMoE uses specialized experts for different frequency bands — and does this from two complementary perspectives.

Fourier View

The Fourier transform decomposes a signal into pure frequencies. Each expert specializes in one frequency band:

  • Expert 1: Low frequencies (long-term trends)
  • Expert 2: Medium frequencies (weekly/daily patterns)
  • Expert 3: High frequencies (rapid fluctuations)

The routing mechanism decides which expert handles each input:

$$\alpha = \text{Softmax}(\tilde{G}(\tilde{M}))$$

where $\tilde{M}$ is the magnitude spectrum and $\tilde{G}$ is a learned gating network.
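Here is a minimal PyTorch sketch of this routing. The equal band-per-expert split, the linear experts, and the gate acting on the full magnitude spectrum are illustrative assumptions; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class FourierMoE(nn.Module):
    """Sketch: split the rFFT spectrum into bands, one expert per band,
    with a learned softmax gate over the magnitude spectrum."""
    def __init__(self, seq_len: int, n_experts: int = 3, d_out: int = 64):
        super().__init__()
        self.n_bins = seq_len // 2 + 1                 # rFFT output size
        self.band = self.n_bins // n_experts           # bins per expert
        self.experts = nn.ModuleList(
            [nn.Linear(self.band, d_out) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(self.n_bins, n_experts)  # G~ in the formula

    def forward(self, x):                              # x: (batch, seq_len)
        mag = torch.fft.rfft(x, dim=-1).abs()          # M~: magnitude spectrum
        alpha = torch.softmax(self.gate(mag), dim=-1)  # routing weights α
        outs = torch.stack(
            [expert(mag[:, i * self.band:(i + 1) * self.band])
             for i, expert in enumerate(self.experts)],
            dim=1,
        )                                              # (batch, n_experts, d_out)
        return (alpha.unsqueeze(-1) * outs).sum(dim=1) # α-weighted mixture
```

For example, `FourierMoE(seq_len=96)(torch.randn(32, 96))` returns a `(32, 64)` feature tensor.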

Wavelet View

The Fourier transform has a limitation: it discards temporal information. In the magnitude spectrum, a sudden spike looks the same whether it happened yesterday or last month.

Wavelets preserve both frequency and time localization. M²FMoE adds wavelet experts that can detect when unusual frequencies appear — crucial for extreme events.
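A small PyWavelets sketch shows this time localization on a synthetic signal (illustrative only):

```python
import numpy as np
import pywt  # PyWavelets

t = np.arange(1024)
signal = np.sin(2 * np.pi * t / 24)  # regular daily cycle
signal[700:704] += 5.0               # transient extreme event

# Continuous wavelet transform: rows = scales, columns = time steps.
coefs, freqs = pywt.cwt(signal, scales=np.arange(1, 33), wavelet="morl")
energy_over_time = (coefs ** 2).sum(axis=0)
print("event localized near t =", energy_over_time.argmax())  # ≈ 700
# Unlike the Fourier magnitude spectrum, the wavelet energy map tells us
# *when* the unusual frequencies appeared.
```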

Cross-View Alignment

Here’s the clever part. Fourier and Wavelet domains use different scales. How do you ensure experts are looking at the same phenomena?

Theorem 1 establishes a mapping between Fourier frequency $f$ and wavelet scale $a$:

$$a = \frac{\gamma}{f}$$

where $\gamma$ is the wavelet center frequency. This ensures both views split the spectrum consistently.
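You can verify this mapping with PyWavelets, which uses the same convention; the hourly sampling rate in the example is my own choice:

```python
import pywt

gamma = pywt.central_frequency("morl")  # wavelet center frequency γ ≈ 0.81

def freq_to_scale(f_hz: float, fs: float) -> float:
    """Map a Fourier frequency (in Hz) to the wavelet scale a = γ / f,
    with f normalized by the sampling rate fs."""
    return gamma / (f_hz / fs)

fs = 1.0            # one sample per hour
daily = 1.0 / 24.0  # once-a-day frequency, in cycles per hour
a = freq_to_scale(daily, fs)
print(f"scale a = {a:.1f}")                   # ≈ 19.5 for the Morlet wavelet
print(pywt.scale2frequency("morl", a) * fs)   # sanity check: ≈ 1/24
```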

Module 2: Multi-Resolution Adaptive Fusion

Not all patterns operate at the same scale. Seasonal trends need months of context; sudden spikes need hours.

M²FMoE processes the signal at multiple resolutions:

  1. Coarse resolution — captures long-term trends
  2. Medium resolution — captures weekly/daily cycles
  3. Fine resolution — captures rapid changes

These are then hierarchically fused from coarse to fine, allowing the model to build a complete picture.
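A minimal PyTorch sketch of such a scheme, assuming average pooling for the coarser resolutions and a concatenate-then-mix fusion rule (both are my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineFusion(nn.Module):
    """Sketch: pool the input to several resolutions, embed each, and fuse
    hierarchically from coarse to fine. The concatenate-then-mix fusion
    rule is an assumption, not necessarily the paper's exact mechanism."""
    def __init__(self, seq_len: int, d_model: int = 64, factors=(8, 4, 1)):
        super().__init__()
        # seq_len should be divisible by every pooling factor.
        self.factors = factors  # coarse → fine
        self.embed = nn.ModuleList(
            [nn.Linear(seq_len // k, d_model) for k in factors]
        )
        self.fuse = nn.ModuleList(
            [nn.Linear(2 * d_model, d_model) for _ in factors[1:]]
        )

    def forward(self, x):  # x: (batch, seq_len)
        views = [
            F.avg_pool1d(x.unsqueeze(1), k).squeeze(1) if k > 1 else x
            for k in self.factors
        ]
        h = self.embed[0](views[0])  # start from the coarsest view
        for emb, fuse, v in zip(self.embed[1:], self.fuse, views[1:]):
            h = fuse(torch.cat([h, emb(v)], dim=-1))  # refine step by step
        return h
```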

Module 3: Temporal Gating Integration

The final module balances two types of information:

  • Long-term trends ($H_r$) — slow-moving baselines
  • Frequency-aware features ($H_h$) — detected patterns from experts

A learned gating mechanism combines them:

$$\text{Output} = G \odot H_r + (1-G) \odot H_h$$

where $G$ is a sigmoid gate. During normal periods, the model relies more on trends. During anomalies, it shifts weight to frequency features.
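In code, the gate might look like this minimal sketch; how $G$ is conditioned is my guess, not a detail stated above:

```python
import torch
import torch.nn as nn

class TemporalGate(nn.Module):
    """Sketch of Output = G ⊙ H_r + (1 − G) ⊙ H_h. Conditioning the gate
    on both streams is an assumption; the paper may compute G differently."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_r, h_h):  # trend features vs frequency features
        g = torch.sigmoid(self.gate(torch.cat([h_r, h_h], dim=-1)))
        return g * h_r + (1.0 - g) * h_h  # elementwise (⊙) blend
```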

Training: The Loss Function

M²FMoE uses a composite loss with three terms:

  1. Forecasting loss (MSE) — standard prediction error
  2. Diversity loss — encourages experts to specialize in different bands
  3. Consistency loss — aligns Fourier and Wavelet expert outputs

$$\mathcal{L} = \mathcal{L}_{\text{forecast}} + \lambda_1 \mathcal{L}_{\text{diversity}} + \lambda_2 \mathcal{L}_{\text{consistency}}$$

The diversity loss is key: without it, all experts would learn the same thing.
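Here is one plausible PyTorch instantiation. The MSE forecasting term is standard, but the exact diversity and consistency terms are not spelled out here, so the forms below (a pairwise expert-similarity penalty and an MSE between view features) are assumptions:

```python
import torch
import torch.nn.functional as F

def composite_loss(pred, target, expert_outs, h_fourier, h_wavelet,
                   lam1=0.1, lam2=0.1):
    """Sketch of L = L_forecast + λ1·L_diversity + λ2·L_consistency.
    The diversity and consistency forms below are assumptions."""
    l_forecast = F.mse_loss(pred, target)

    # Diversity (assumed form): penalize cosine similarity between every
    # pair of expert outputs, pushing experts toward different bands.
    e = F.normalize(expert_outs, dim=-1)       # (batch, n_experts, d)
    sim = torch.einsum("bid,bjd->bij", e, e)   # pairwise similarities
    off_diag = sim - torch.eye(e.shape[1], device=sim.device)
    l_diversity = off_diag.abs().mean()

    # Consistency (assumed form): the two views should roughly agree.
    l_consistency = F.mse_loss(h_fourier, h_wavelet)

    return l_forecast + lam1 * l_diversity + lam2 * l_consistency
```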

Experiments: California Reservoir Data

The authors tested M²FMoE on real hydrological data from five reservoirs in Santa Clara County, California:

  • Almaden, Coyote, Lexington, Stevens Creek, Vasona
  • 28 years of hourly water level measurements (1991-2019)
  • Heavy-tailed distributions with clear extreme events

Baselines (13 models)

| Category | Models |
|---|---|
| Attention-based | CATS, TQNet, iTransformer |
| Frequency-domain | FreqMoE, Umixer |
| Linear/Hybrid | KAN, CycleNet, PatchTST, TimesNet, TimeMixer |
| Extreme-event (with labels) | DAN, MCANN |

Results: 8-Hour Forecast

| Dataset | M²FMoE | Best Baseline | Improvement |
|---|---|---|---|
| Almaden | 7.99 | 14.73 (FreqMoE) | 45.7% |
| Coyote | 48.80 | 80.94 (iTransformer) | 39.7% |
| Lexington | 251.96 | 386.99 (iTransformer) | 34.9% |

Results: 72-Hour Forecast

For longer horizons, M²FMoE achieves:

  • 22.30% average improvement over the best baselines that do not use extreme-event labels
  • 9.19% improvement over methods that do use extreme-event labels

This is remarkable: M²FMoE beats models that have access to information it doesn’t have.

Statistical Significance

All improvements are statistically significant (p < 0.05) under the Wilcoxon signed-rank test.
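For reference, a paired test like this can be run with SciPy; the error values below are made up and only show the mechanics:

```python
from scipy.stats import wilcoxon

# Made-up paired per-window errors, illustrating the mechanics only.
errors_m2fmoe   = [7.8, 8.1, 7.5, 8.3, 7.9, 8.0, 7.7, 8.2]
errors_baseline = [14.5, 15.1, 14.2, 15.0, 14.8, 14.6, 14.9, 14.7]

stat, p = wilcoxon(errors_m2fmoe, errors_baseline)
print(f"Wilcoxon statistic = {stat}, p = {p:.4f}")  # p < 0.05 → significant
```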

Ablation Studies: What Matters?

The authors systematically removed components to measure their importance:

| Removed Component | Performance Drop |
|---|---|
| Wavelet view | Significant |
| Multi-resolution fusion | Significant |
| Temporal gating | Moderate |
| Diversity loss | Moderate |

Key finding: Wavelet experts activate more strongly during extreme events, while Fourier experts handle regular patterns. The dual-view design is essential.

Expert Count

How many experts are optimal?

  • Too few (1-2): Can’t specialize enough
  • Too many (5+): Overhead without benefit
  • Sweet spot: 3-4 experts

Why This Matters

For practitioners

If you’re forecasting time series with occasional extreme events (floods, demand spikes, equipment failures), M²FMoE offers:

  • No labeling required — learns extreme patterns automatically
  • Interpretable — you can see which experts activate when
  • Practical horizons — tested on 8h and 72h forecasts

For researchers

M²FMoE demonstrates:

  • Frequency-domain expertise can replace explicit event labels
  • Multi-view (Fourier + Wavelet) beats single-view approaches
  • Mixture-of-Experts scales well for time series

Limitations

  • Domain-specific tuning: Lookback window and expert count may need adjustment
  • Computational cost: More expensive than simple linear models
  • Evaluation: Tested primarily on hydrological data

Summary

M²FMoE shows that you don’t need to label extreme events to predict them. By combining:

  1. Dual frequency views (Fourier for spectrum, Wavelet for localization)
  2. Specialized experts for different frequency bands
  3. Multi-resolution fusion for different time scales
  4. Temporal gating for adaptive combination

…the model learns to recognize extreme events through their frequency signatures. On California reservoir data, it beats 13 baselines including methods that use extreme-event labels.

The broader lesson: instead of treating rare events as noise to be ignored, we can design architectures that naturally learn to handle them.