Spatio-Temporal-based Multi-level Aggregation Network for Physical Action Recognition

Yuhang Wang1

  1. School of Physical Education, Harbin University
    Harbin, Heilongjiang 150086, China
    wangyuhang1978@hrbu.edu.cn

Abstract

This paper introduces a spatio-temporal-based multi-level aggregation network (ST-MANet) for action recognition. It uses the correlations between different spatial positions and between different temporal positions on the feature map to capture long-range spatial and temporal dependencies, respectively, and generates spatial and temporal attention maps that assign different weights to features at different spatial and temporal locations. In addition, a multi-scale behavior recognition framework is proposed that models various visual rhythms while capturing multi-scale spatiotemporal information. A spatial diversity constraint is then proposed that encourages the spatial attention maps at different scales to focus on distinct areas. Each scale thus emphasizes the spatial information unique to it, so the multi-scale features incorporate more diverse spatial information. Finally, ST-MANet is compared with existing approaches and demonstrates high accuracy on three datasets.
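
The attention and diversity ideas described in the abstract can be illustrated with a minimal sketch. It is not the ST-MANet implementation: the abstract does not specify the similarity measure, the normalization, or the form of the constraint, so dot-product correlation, softmax normalization, and a cosine-similarity penalty are assumptions made here, and the function names `position_attention` and `diversity_penalty` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def position_attention(features):
    """Return one attention weight per position (weights sum to 1).

    features: list of N feature vectors, one per spatial (or temporal)
    position, each of length C. Dot-product correlation is an assumption.
    """
    n = len(features)
    # Pairwise correlation between every two positions (N x N).
    affinity = [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
                for fi in features]
    # Normalize each row so it describes where position i attends.
    rows = [softmax(r) for r in affinity]
    # Weight of position j: average attention it receives from all positions.
    return [sum(rows[i][j] for i in range(n)) / n for j in range(n)]


def diversity_penalty(map_a, map_b):
    """Cosine similarity between two attention maps (assumed penalty form).

    A lower value means the two maps focus on more distinct positions.
    """
    dot = sum(a * b for a, b in zip(map_a, map_b))
    norm = (math.sqrt(sum(a * a for a in map_a))
            * math.sqrt(sum(b * b for b in map_b)))
    return dot / norm
```

Under these assumptions, the same `position_attention` call yields a temporal attention map when each input vector summarizes one frame instead of one spatial location, and `diversity_penalty` could be added to a training loss for each pair of scales so that their spatial maps are pushed apart.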

Key words

Action recognition, spatial and temporal attention, multi-level aggregation network

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS240418060W

Publication information

Volume 21, Issue 4 (September 2024)
Year of Publication: 2024
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

Full text

Available in PDF (Portable Document Format)

How to cite

Wang, Y.: Spatio-Temporal-based Multi-level Aggregation Network for Physical Action Recognition. Computer Science and Information Systems, Vol. 21, No. 4, 1823–1843. (2024), https://doi.org/10.2298/CSIS240418060W