We define a new fine-grained audio-visual learning task, termed Audio-Visual Understanding (AVU), which aims at region-aware, frame-level, and high-quality sound-source understanding. To support this goal, we construct two corresponding datasets: fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene). Moreover, we propose AVUFormer, an Audio-Visual Understanding Transformer benchmark that performs both sound-source segmentation and sound-region description with a multi-modal-input, multi-modal-output Transformer architecture.
In this system, the video data is first uploaded. The SAM model is then used to obtain the initial masks, from which TAM derives the frame-level video masks. Finally, the masked regions of the frame-level images are fed into Chat-UniVi to obtain the region-aware descriptions.
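The three-stage annotation pipeline above can be sketched as follows. This is a minimal illustrative sketch only: the functions `segment_anything`, `track_masks`, and `describe_region` are hypothetical stand-ins for the SAM, TAM, and Chat-UniVi models, not their real APIs.

```python
# Hypothetical sketch of the SAM -> TAM -> Chat-UniVi annotation pipeline.
# All three model wrappers are placeholder stand-ins, not real APIs.

def segment_anything(first_frame):
    """Stand-in for SAM: produce initial masks on the first frame."""
    return [{"frame": 0, "mask": "initial"}]

def track_masks(frames, initial_masks):
    """Stand-in for TAM: propagate initial masks to every frame."""
    return [{"frame": i, "mask": m["mask"]}
            for i in range(len(frames)) for m in initial_masks]

def describe_region(frame_idx, mask):
    """Stand-in for Chat-UniVi: caption the masked region of one frame."""
    return f"description of masked region in frame {frame_idx}"

def annotate_video(frames):
    """Run the full pipeline: initial masks -> frame-level masks -> captions."""
    initial = segment_anything(frames[0])
    per_frame = track_masks(frames, initial)
    return [describe_region(m["frame"], m["mask"]) for m in per_frame]
```

The orchestration mirrors the caption's order: segmentation first, mask propagation second, region description last.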
AVUFormer: a fine-grained Audio-Visual Understanding benchmark. On the left of the architecture, the audio and video are fed into their encoders and mapped to tokens. The multi-modal features are then fused with the attention mechanism, and the fused features are passed to the task decoders for mask and description generation.
Multi-modality integration with attention mechanisms. Self-attention uses the same input for Q, K, and V; cross-attention takes Q from one modality and K, V from the other.
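The two fusion variants can be sketched with plain scaled dot-product attention. This is a minimal NumPy sketch that treats the token matrices as already projected to Q, K, V; the real model would apply learned projections and multiple heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def self_attention(x):
    # Self-attention: the same modality's tokens supply Q, K, and V.
    return scaled_dot_product_attention(x, x, x)

def cross_attention(x, y):
    # Cross-attention: Q from one modality, K and V from the other.
    return scaled_dot_product_attention(x, y, y)
```

Note that the cross-attention output keeps the query modality's sequence length while drawing its content from the other modality's tokens.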
Mask Collaboration Module for task interaction. (a) Plain multi-task output without interaction. (b) The mask collaboration module introduces interaction between the two tasks. Since AVU is a fine-grained task, region-aware visual information provides more detail for the text description; combining the multi-modal representations with additional regional visual features therefore yields more accurate fine-grained captions.
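One simple way to realize the interaction in (b) is to pool the visual features inside the predicted sound-source mask and feed them to the caption branch alongside the fused multi-modal representation. The sketch below is an assumption about one plausible form of this module (mask-weighted average pooling plus concatenation), not the paper's exact design.

```python
import numpy as np

def mask_pool(visual_feats, mask):
    """Average the per-location visual features, weighted by the
    predicted soft mask, to get one region-aware feature vector."""
    w = mask.reshape(-1, 1)                      # (HW, 1) soft weights
    return (visual_feats * w).sum(0) / (w.sum() + 1e-6)

def mask_collaboration(multimodal_feat, visual_feats, pred_mask):
    """Fuse the segmentation branch's mask into the caption branch:
    concatenate region-pooled visual features with the fused
    multi-modal representation before description decoding."""
    region_feat = mask_pool(visual_feats, pred_mask)
    return np.concatenate([multimodal_feat, region_feat])
```

The concatenated vector gives the description decoder explicit access to features from the segmented region, which is the interaction the figure contrasts against the plain multi-task output in (a).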
SSS: Sound Source Segmentation, AVC: Audio-Visual Caption, SSU-D: Sound-Source Understanding with Description Only, SSU-S: Sound-Source Understanding with Segmentation Only.
Params: Model Parameters, FPS: Frames Inferred Per Second, FLOPS: Floating-Point Operations Per Second, Inference Time: Inference Time for a Single Sample.