Spatio-temporal Relation Modeling for Few-shot Action Recognition (CVPR 2022)
- Anirudh Thatipelli (MBZUAI)
- Sanath Narayan (IIAI)
- Salman Khan (MBZUAI, ANU)
- Rao Mohammad Anwer (MBZUAI, Aalto University)
- Fahad Shahbaz Khan (MBZUAI, Linköping University)
- Bernard Ghanem (KAUST)
Our proposed model, STRM, attains the current state of the art for few-shot action recognition.
We propose a novel few-shot action recognition framework, STRM, which enhances class-specific feature discriminability while simultaneously learning higher-order temporal representations. The focus of our approach is a novel spatio-temporal enrichment module that aggregates spatial and temporal contexts with dedicated local patch-level and global frame-level feature enrichment sub-modules. Local patch-level enrichment captures the appearance-based characteristics of actions. On the other hand, global frame-level enrichment explicitly encodes the broad temporal context, thereby capturing the relevant object features over time. The resulting spatio-temporally enriched representations are then utilized to learn the relational matching between query and support action sub-sequences. We further introduce a query-class similarity classifier on the patch-level enriched features to enhance class-specific feature discriminability by reinforcing the feature learning at different stages in the proposed framework.
Experiments are performed on four few-shot action recognition benchmarks: Kinetics, SSv2, HMDB51 and UCF101. Our extensive ablation study reveals the benefits of the proposed contributions. Furthermore, our approach sets a new state-of-the-art on all four benchmarks. On the challenging SSv2 benchmark, our approach achieves an absolute gain of 3.5% in classification accuracy, as compared to the best existing method in the literature.
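The patch-level and frame-level enrichment described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the STRM implementation: plain, weight-free self-attention stands in for both sub-modules, mean pooling over patches is an assumption, and the names `self_attend` and `enrich` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attend(tokens):
    """Residual scaled dot-product self-attention over a list of vectors.

    Assumption: no learned projections; this is only a stand-in for an
    enrichment sub-module.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    enriched = []
    for q in tokens:
        weights = softmax([dot(q, k) / scale for k in tokens])
        context = [sum(w * k[d] for w, k in zip(weights, tokens)) for d in range(dim)]
        enriched.append([qi + ci for qi, ci in zip(q, context)])
    return enriched


def enrich(video):
    """video[frame][patch] is a feature vector; returns one vector per frame."""
    # Local patch-level enrichment: patches attend to patches of the same frame.
    patch_enriched = [self_attend(frame) for frame in video]
    # Pool patches into one vector per frame (mean pooling is an assumption).
    dim = len(video[0][0])
    frame_feats = [
        [sum(patch[d] for patch in frame) / len(frame) for d in range(dim)]
        for frame in patch_enriched
    ]
    # Global frame-level enrichment: frames attend across the whole clip.
    return self_attend(frame_feats)


# Toy clip: 3 frames, 2 patches per frame, 2-dimensional features.
video = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.1], [0.2, 0.2]],
    [[0.0, 1.0], [1.0, 0.0]],
]
enriched = enrich(video)
print(len(enriched), len(enriched[0]))  # 3 2
```

Keeping the two stages separate mirrors the split in the text: the first stage only mixes information within a frame, while the second only mixes information across frames.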
State-of-the-art comparison on four few-shot (FS) action recognition datasets.
Our STRM outperforms existing FS action recognition methods on all four datasets.
Below you will find qualitative results with attention maps for few-shot action recognition. TRX struggles with the spatial and temporal context variations that are commonly encountered in actions performed with different objects and backgrounds, e.g., the fifth and sixth frames from the left in (b), where the regions corresponding to actions are not emphasized. Similarly, the action in the second and third frames from the left in (d) is not accurately captured due to the distractor motion from the moving hand of another person. Our STRM approach explicitly enhances class-specific feature discriminability through spatio-temporal context aggregation and intermediate latent feature classification. This leads to better matching between query and limited support action instances.
Below you will find quantitative comparisons between Baseline TRM and STRM with respect to tuple matches. For the query tuple in green, the best match obtained by our STRM (2nd support video) is more representative than the best match of Baseline TRM (1st support video).
The Baseline TRM fails to obtain support tuples that are representative enough for the query tuples in red and green. Our STRM alleviates this issue and obtains good representative matches (and support videos), since it enhances feature discriminability through patch-level as well as frame-level enrichment and learns higher-order temporal representations.
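To make the notion of a "best match" for a query tuple concrete, here is a small sketch of tuple matching. It assumes that a tuple is an ordered subset of frames whose features are concatenated, and that the best match is the support tuple with the smallest squared Euclidean distance. Both choices, and the names `tuples_of` and `best_support_match`, are illustrative assumptions rather than the matching procedure used by STRM or Baseline TRM.

```python
from itertools import combinations


def tuples_of(frames, cardinality=2):
    """List (frame_indices, concatenated_features) for ordered frame tuples."""
    result = []
    for idx in combinations(range(len(frames)), cardinality):
        feat = [value for i in idx for value in frames[i]]
        result.append((idx, feat))
    return result


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_support_match(query_feat, support_videos, cardinality=2):
    """Return (video_index, frame_indices, distance) of the closest support tuple."""
    best = None
    for v, frames in enumerate(support_videos):
        for idx, feat in tuples_of(frames, cardinality):
            d = sq_dist(query_feat, feat)
            if best is None or d < best[2]:
                best = (v, idx, d)
    return best


# Toy data: each frame is a 1-dimensional feature vector.
query_frames = [[0.0], [1.0], [2.0]]
support_videos = [
    [[5.0], [5.0], [5.0]],  # support video 0: unrelated features
    [[0.0], [1.0], [2.1]],  # support video 1: close to the query
]

# Query tuple built from frames 0 and 2.
query_feat = query_frames[0] + query_frames[2]
video_index, frame_indices, _ = best_support_match(query_feat, support_videos)
print(video_index, frame_indices)  # 1 (0, 2)
```

In this toy setting, more discriminative frame features directly change which support tuple (and hence which support video) is selected, which is the effect the comparison above illustrates.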