Google researchers have proposed a new transformer architecture, the Multimodal Bottleneck Transformer (MBT), for audiovisual fusion and explored different fusion strategies using cross-attention between latent tokens in a new paper titled 'Attention Bottlenecks for Multimodal Fusion'.
Machine perception models are usually modality-specific and optimised for unimodal benchmarks, so late-stage fusion of final representations or predictions from each modality ('late fusion') remains the dominant paradigm for multimodal video classification. MBT instead uses 'fusion bottlenecks' to fuse modalities at multiple layers. Compared with traditional pairwise self-attention, MBT forces information between modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most important information in each modality and share only what is necessary.
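As a rough illustration, here is a minimal sketch of one such bottleneck fusion layer in PyTorch: each modality runs self-attention over its own tokens concatenated with a small set of shared bottleneck tokens, and the updated bottlenecks from the two streams are then averaged, so any cross-modal exchange must squeeze through those few tokens. The class name, tensor shapes, and the use of a single attention sub-layer per stream (omitting the usual feed-forward blocks and layer norms) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """Hypothetical sketch of one MBT-style fusion layer: each modality
    attends only to its own tokens plus shared bottleneck tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio, video, bottleneck):
        # Audio stream: self-attention over [audio tokens; bottleneck tokens].
        a_in = torch.cat([audio, bottleneck], dim=1)
        a_out, _ = self.audio_attn(a_in, a_in, a_in)
        audio, b_audio = a_out[:, : audio.size(1)], a_out[:, audio.size(1) :]

        # Video stream: self-attention over [video tokens; bottleneck tokens].
        v_in = torch.cat([video, bottleneck], dim=1)
        v_out, _ = self.video_attn(v_in, v_in, v_in)
        video, b_video = v_out[:, : video.size(1)], v_out[:, video.size(1) :]

        # Average the per-stream bottleneck updates: all cross-modal
        # information is forced through these few shared latents.
        bottleneck = (b_audio + b_video) / 2
        return audio, video, bottleneck


# Toy usage with assumed shapes (batch=2, dim=256, B=4 bottleneck tokens).
layer = BottleneckFusionLayer(dim=256)
audio = torch.randn(2, 100, 256)
video = torch.randn(2, 196, 256)
bottleneck = torch.randn(2, 4, 256)
audio, video, bottleneck = layer(audio, video, bottleneck)
```

Note the contrast with vanilla cross-attention, where every audio token can attend to every video token (and vice versa): here the attention cost of cross-modal exchange scales with the handful of bottleneck tokens rather than with the full token count of the other modality.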
The researchers showed that restricting cross-modal attention to a small set of fusion bottlenecks achieved state-of-the-art results on a number of video classification benchmarks while also reducing computational cost compared with vanilla cross-attention models.