Modern emotion-recognition systems increasingly combine data from multiple sources, ranging from physiological signals (e.g., heart rate, skin conductance) to facial video. The goal is to capture the richness of human feeling, in which several emotions often co-occur. Traditional approaches, however, have focused on single-label classification (e.g., “happy” or “sad”).

The paper “HeLo: Heterogeneous Multi-Modal Fusion with Label Correlation for Emotion Distribution Learning” tackles this with emotion distribution learning: rather than picking one label, the model predicts a probability distribution over the basic emotions, i.e., how strongly each one is present.


1. A Simple Intuition

Imagine we want to predict how much we feel:

  1. Surprise,
  2. Joy,
  3. Worry,
  4. Calm.

Instead of labeling one emotion, we want to say:

“This sample is 60% joy, 30% surprise, and 10% worry.”

We use two information sources:

  • Facial video—processed by a CNN or lightweight transformer to extract visual features,
  • Physiological signals (e.g., ECG)—processed by a small MLP.

The model attends to both representations and learns which features indicate joy versus surprise. It also learns that some emotions co-occur more often (e.g., joy and surprise) than others (e.g., worry and calm).
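Before any fusion happens, each modality is mapped into a shared feature space by its own encoder. Below is a minimal PyTorch sketch of what the two encoders might look like; the layer sizes, `d_model`, and module names are illustrative assumptions rather than the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Toy stand-in for the CNN / lightweight transformer that embeds facial-video frames."""
    def __init__(self, in_channels: int = 3, d_model: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling over the spatial grid
            nn.Flatten(),
            nn.Linear(32, d_model),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, channels, height, width) -> (batch, d_model)
        return self.backbone(frames)

class PhysioEncoder(nn.Module):
    """Small MLP that embeds a window of physiological measurements (e.g., ECG)."""
    def __init__(self, in_dim: int = 256, d_model: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, d_model)
        )

    def forward(self, signal: torch.Tensor) -> torch.Tensor:
        # signal: (batch, in_dim) -> (batch, d_model)
        return self.mlp(signal)
```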


2. Core HeLo Components

2.1. Cross‐Attention Fusion

  • Query: vectors from physiological signals,
  • Key/Value: vectors from facial-video features,
  • The attention mechanism aligns elements of the two modalities and fuses them into a joint multimodal vector (a compact sketch follows this list).
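A compact sketch of this cross-attention step using PyTorch’s built-in `nn.MultiheadAttention`; the token counts, batch size, and mean pooling at the end are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

physio_tokens = torch.randn(8, 16, d_model)  # queries: physiological features
video_tokens  = torch.randn(8, 32, d_model)  # keys/values: facial-video features

# Each physiological token attends over the video tokens, so the output is a
# multimodal representation aligned to the physiological sequence.
fused, attn_weights = cross_attn(query=physio_tokens,
                                 key=video_tokens,
                                 value=video_tokens)

z = fused.mean(dim=1)  # (batch, d_model): pooled joint multimodal vector
```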

2.2. Heterogeneity Module (Optimal Transport)

  • Treat each modality’s representation as a distribution of points.
  • Define a cost matrix $ C_{ij} = \|p_i - b_j\|^2 $ between physiological features $ p_i $ and video features $ b_j $, and solve an entropic optimal transport problem via Sinkhorn iterations.
  • The resulting transport plan weights how strongly to align each physiological feature with each video feature.
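A generic entropic-OT sketch with Sinkhorn iterations; it assumes uniform mass on both point sets and a squared-Euclidean cost, which matches the description above but is not necessarily identical to HeLo’s exact module.

```python
import torch

def sinkhorn(C: torch.Tensor, eps: float = 0.05, n_iters: int = 50) -> torch.Tensor:
    """Entropic OT between two uniform point clouds given an (n x m) cost matrix C.
    Returns a transport plan pi approximately minimizing <pi, C> - eps * H(pi)."""
    n, m = C.shape
    mu = torch.full((n,), 1.0 / n)   # uniform mass on physiological features
    nu = torch.full((m,), 1.0 / m)   # uniform mass on video features
    K = torch.exp(-C / eps)          # Gibbs kernel
    u = torch.ones(n)
    for _ in range(n_iters):         # alternating row / column normalization
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return torch.diag(u) @ K @ torch.diag(v)

# Cost between physiological features p (n x d) and video features b (m x d):
p, b = torch.randn(16, 128), torch.randn(32, 128)
C = torch.cdist(p, b) ** 2      # C_ij = ||p_i - b_j||^2
C = C / C.max()                 # rescale costs for numerical stability
pi = sinkhorn(C)                # soft alignment weights between the two modalities
```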

2.3. Label‐Correlation Embeddings

  • Assign each emotion a label vector in the same space as the multimodal features.
  • Compute an empirical correlation matrix $ R $ from training labels.
  • Add a loss term
    $$ \mathcal{L}_{corr} = \bigl\|\mathrm{softmax}(L L^\top) - R\bigr\|_F^2, $$ so that learned embeddings reflect true emotion correlations (sketched below).
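The correlation loss is straightforward to write down; one plausible reading, assumed here, applies the softmax row-wise over the label-similarity matrix.

```python
import torch
import torch.nn.functional as F

def correlation_loss(label_emb: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """|| softmax(L L^T) - R ||_F^2 for label embeddings L (K x d) and an
    empirical label-correlation matrix R (K x K) estimated from training labels."""
    sim = label_emb @ label_emb.t()                  # pairwise label similarities
    return (F.softmax(sim, dim=-1) - R).pow(2).sum()

K, d = 4, 128                                        # e.g., surprise, joy, worry, calm
label_emb = torch.nn.Parameter(torch.randn(K, d))    # learned label embeddings L
R = torch.eye(K)                                     # placeholder; use the empirical matrix in practice
loss_corr = correlation_loss(label_emb, R)
```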

2.4. Label‐Guided Attention

  • Query: the label embeddings $ L $,
  • Key/Value: the fused multimodal vector $ z $,
  • Each emotion “attends” to relevant aspects of $ z $, producing a $ K\times d $ matrix of label-specific features; a small prediction head then maps each row to that emotion’s share of the final distribution (sketched below).
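A sketch of label-guided attention in the same style as the fusion step; treating the fused vector $ z $ as a single key/value token and using a one-unit prediction head per emotion are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads, K = 128, 4, 4
label_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

label_emb = torch.randn(1, K, d_model).repeat(8, 1, 1)  # label embeddings L as queries
z_tokens  = torch.randn(8, 1, d_model)                  # fused multimodal vector z as key/value

# Each of the K emotions attends to z, yielding a (K x d_model) feature block per sample.
label_features, _ = label_attn(query=label_emb, key=z_tokens, value=z_tokens)

# A small head turns each label-specific feature into a score; softmax over the
# K emotions produces the predicted emotion distribution.
head = nn.Linear(d_model, 1)
pred_dist = torch.softmax(head(label_features).squeeze(-1), dim=-1)  # (batch, K)
```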

3. Mathematical Foundations and Model Losses

  1. Entropic Optimal Transport
    $$ \min_{\pi\in U(\mu,\nu)} \langle \pi, C\rangle - \varepsilon H(\pi), $$ where $ U(\mu,\nu) $ is the set of couplings with marginals $ \mu $ and $ \nu $; it is solved efficiently by alternating row/column normalization (Sinkhorn iterations).

  2. Correlation Loss
    $$ \mathcal{L}_{corr} = \Bigl\|\mathrm{softmax}(L L^\top) - R\Bigr\|_F^2. $$

  3. Overall Objective
    $$ \mathcal{L} = \mathrm{EMD}(Y, Y_{gt}) + \lambda_{OT}\,\mathcal{L}_{OT} + \lambda_{corr}\,\mathcal{L}_{corr}. $$
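Putting the pieces together, the total loss is a weighted sum of the three terms. The cumulative-sum EMD surrogate and the specific $ \lambda $ values below are common choices used here for illustration, not values reported by the paper.

```python
import torch

def emd_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Squared 1-D Earth Mover's Distance between predicted and ground-truth
    emotion distributions, computed from cumulative sums over the K emotions."""
    cdf_diff = torch.cumsum(pred, dim=-1) - torch.cumsum(target, dim=-1)
    return cdf_diff.pow(2).sum(dim=-1).mean()

# Toy values standing in for the model outputs and the two auxiliary losses.
pred_dist   = torch.softmax(torch.randn(8, 4), dim=-1)   # Y
target_dist = torch.softmax(torch.randn(8, 4), dim=-1)   # Y_gt
loss_ot, loss_corr = torch.tensor(0.3), torch.tensor(0.05)

lambda_ot, lambda_corr = 0.1, 0.01                        # assumed trade-off weights
total_loss = emd_loss(pred_dist, target_dist) + lambda_ot * loss_ot + lambda_corr * loss_corr
```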


4. Future Directions

  1. Incorporate more modalities (audio, text)
  2. Cross‐cultural emotion‐correlation studies
  3. Social robotics—emotionally aware avatars and assistants
  4. Computational efficiency—lighter OT modules for mobile devices