Audio-Visual Understanding: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

We define a new fine-grained audio-visual learning task, termed Audio-Visual Understanding (AVU), which aims at region-aware, frame-level, and high-quality sound source understanding. To support this goal, we construct two corresponding datasets: fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene). Moreover, we propose AVUFormer, an Audio-Visual Understanding Transformer benchmark that performs both sound source segmentation and sound region description with a multi-modal-input, multi-modal-output Transformer architecture.

Datasets

Dataset Analysis and Statistics

Basic information

  • We construct two corresponding datasets: fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene).
  • Both datasets contain annotated sound source masks and frame-by-frame textual descriptions.
  • f-Music focuses on music scenes with complex instrument mixing and background noise, while f-Lifescene covers diverse sounding objects in everyday life scenarios.

Basic statistics

  • f-Music dataset includes 3,976 samples across 22 scene types
  • f-Lifescene dataset contains 6,156 samples across 61 scene types

Statistics of sounding scene category from two datasets

Data annotation process and labeling system

In this system, the video data is first uploaded. The SAM model then produces initial masks, and based on these, TAM propagates frame-level video masks across the clip. Finally, the masked regions of each frame are fed into Chat-UniVi to obtain region-aware descriptions.
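The annotation flow can be sketched in Python as below. Note that `segment_first_frame`, `track_masks`, and `describe_region` are hypothetical stand-ins for the SAM, TAM, and Chat-UniVi stages (the real tools have their own APIs); they are stubbed here with dummy logic only to show how the stages chain together.

```python
import numpy as np

def segment_first_frame(frame):
    # SAM stage (stub): produce an initial binary mask for the sounding object.
    return frame.mean(axis=-1) > 0.5

def track_masks(frames, init_mask):
    # TAM stage (stub): propagate the initial mask to every frame of the clip.
    return [init_mask for _ in frames]

def describe_region(frame, mask):
    # Chat-UniVi stage (stub): caption the masked region of a frame.
    return f"sounding region covering {int(mask.sum())} pixels"

def annotate_clip(frames):
    """Run the SAM -> TAM -> Chat-UniVi annotation pipeline on one clip."""
    init_mask = segment_first_frame(frames[0])
    masks = track_masks(frames, init_mask)
    descriptions = [describe_region(f, m) for f, m in zip(frames, masks)]
    return masks, descriptions

frames = [np.random.rand(32, 32, 3) for _ in range(4)]  # toy 4-frame clip
masks, descriptions = annotate_clip(frames)
```

The result is one mask and one textual description per frame, matching the frame-level annotation granularity of f-Music and f-Lifescene.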

Model

Our framework AVUFormer and key modules

AVUFormer Framework

AVUFormer: a fine-grained Audio-Visual Understanding benchmark. On the left of the architecture, the audio and video streams are fed into their encoders and mapped to tokens. The multi-modal features are then fused with an attention mechanism, and the fused features are passed to the task decoders for mask and description generation.


Multi-Modality Integration (MMI)

Multi-modality integration with attention mechanisms. Self-attention uses the same input for Q, K, and V, whereas cross-attention takes Q from one modality and K, V from the other.
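The distinction between the two attention variants can be illustrated with a minimal NumPy sketch. The token counts and feature dimension are assumed for illustration; the actual MMI module uses learned, multi-headed attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
audio = rng.normal(size=(10, 64))   # 10 audio tokens, dim 64 (assumed sizes)
video = rng.normal(size=(20, 64))   # 20 video tokens, same dim

# Self-attention: Q, K, V all come from the same modality.
self_out = attention(audio, audio, audio)

# Cross-attention: Q from one modality, K and V from the other.
cross_out = attention(audio, video, video)
```

In both cases the output keeps the query's token count, so the audio stream can attend over video tokens (and vice versa) without changing its own length.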


Mask Collaboration Module for task interaction

Mask Collaboration Module for task interaction. (a) Plain multi-task outputs without interaction. (b) The mask collaboration module introduces interaction between the two tasks. Since AVU is a fine-grained task, region-aware visual information provides additional detail for text description; combining the multi-modal representations with these regional visual features therefore yields more accurate fine-grained captions.
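The idea of feeding region-pooled visual features back into the caption branch can be sketched as follows. This is a simplified NumPy illustration; the feature sizes and the concatenation scheme are assumptions, not the module's exact design.

```python
import numpy as np

def masked_region_pool(feat_map, mask):
    """Average-pool visual features over the predicted sound-source mask."""
    w = mask.astype(float).reshape(-1, 1)          # (H*W, 1) mask weights
    f = feat_map.reshape(-1, feat_map.shape[-1])   # (H*W, C) flattened features
    return (w * f).sum(axis=0) / max(w.sum(), 1.0) # (C,) region descriptor

rng = np.random.default_rng(0)
feat_map = rng.normal(size=(16, 16, 64))  # decoder visual features (assumed size)
pred_mask = rng.random((16, 16)) > 0.5    # mask from the segmentation branch
fused = rng.normal(size=(64,))            # fused audio-visual representation

# Mask collaboration: enrich the caption input with region-aware features.
region_feat = masked_region_pool(feat_map, pred_mask)
caption_input = np.concatenate([fused, region_feat])
```

The caption decoder then conditions on `caption_input` rather than on the fused representation alone, so the description is grounded in the segmented sound-source region.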

Results

Demo Video
Visual Results on f-music Dataset
Visual Results on f-life Dataset
More Visual Results with Descriptions

Experiments

Extensive experiments on our two datasets verify the feasibility of the task, evaluate the usefulness of the datasets, and demonstrate the superiority of AVUFormer, which achieves state-of-the-art performance on the Audio-Visual Understanding benchmark.
Quantitative comparisons with single-task models on dataset f-Music

SSS: Sound Source Segmentation; AVC: Audio-Visual Caption; SSU-D: Sound-Source Understanding with Description Only; SSU-S: Sound-Source Understanding with Segmentation Only.

Quantitative comparisons with single-task models on dataset f-Lifescene

SSS: Sound Source Segmentation; AVC: Audio-Visual Caption; SSU-D: Sound-Source Understanding with Description Only; SSU-S: Sound-Source Understanding with Segmentation Only.

Comparison with Multi-Modal Large Models on the sound object description task (averaged over the two datasets)
Quantitative comparisons on ablation analysis
Computational efficiency of the proposed method with comparisons

Params: model parameters; FPS: frames inferred per second; FLOPS: floating point operations per second; Inference Time: inference time for a single sample.