\$cat /posts/2019-11-07-cvpr2019-collaborative-spatiotemporal-feature-learning-for-video-action-recognition.md

Collaborative Spatiotemporal Feature Learning for Video Action Recognition

Chao Li, Qiaoyong Zhong, Di Xie, Shiliang Pu, Hikvision Research Institute

Chen Xiaoyuan, 2019/11/7

1. Background

To extract spatiotemporal features, existing deep neural networks learn spatial and temporal features either independently (C2D) or jointly (C3D), the latter with unconstrained parameters. C3D and its variants work well in many models, but they require much more computing resources.

2. Motivation

[Figure: three views of a video.] The top-left view is the normal one, which we can readily understand. However, if we treat the height and time dimensions as the "spatial" ones (and width as the "temporal" one), we get the top-right image. It looks strange, but it still shows some "spatial" patterns, such as edges and color blobs, so models trained on static images may be useful here. That is the core idea of CoST (Collaborative Spatio-Temporal).

3. Methods (including framework)

Let $x$ be the input data of size $T\times H\times W\times C$. The three views of the video are

$$x_{hw}=x\otimes w_{1\times3\times3} \\ x_{tw}=x\otimes w_{3\times1\times3} \\ x_{th}=x\otimes w_{3\times3\times1}$$

where $\otimes$ denotes 3D convolution, and the kernel $w_{3\times3}$ is shared by these three views (after a simple reshape). The three outputs are then aggregated by weighted summation:

$$y=\left[a_{hw},a_{tw},a_{th}\right]\left[\begin{matrix}x_{hw}\\x_{tw}\\x_{th}\\\end{matrix}\right]$$
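Since the authors have not released code, here is a minimal NumPy sketch of a single-channel CoST operation under my own assumptions: `view_kernels`, `conv3d_same`, and `cost_unit` are hypothetical names, the view coefficients are given constants rather than learned, and the 2D-to-3D kernel mapping is one plausible reading of "shared after a simple reshaping".

```python
import numpy as np

def view_kernels(w2d):
    # Lift one shared 3x3 kernel into the three view-specific 3D kernels
    # (assumed mapping; the paper only says the weights are shared via reshaping).
    k = np.asarray(w2d).reshape(3, 3)
    return (
        k.reshape(1, 3, 3),  # H-W view: w_{1x3x3}
        k.reshape(3, 1, 3),  # T-W view: w_{3x1x3}
        k.reshape(3, 3, 1),  # T-H view: w_{3x3x1}
    )

def conv3d_same(x, w):
    # Naive single-channel 3D cross-correlation with zero padding
    # ("same" output size); slow, for illustration only.
    kt, kh, kw = w.shape
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    xp = np.pad(x, ((pt, pt), (ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    T, H, W = x.shape
    for t in range(T):
        for h in range(H):
            for ww in range(W):
                out[t, h, ww] = np.sum(xp[t:t + kt, h:h + kh, ww:ww + kw] * w)
    return out

def cost_unit(x, w2d, a):
    # a = (a_hw, a_tw, a_th): per-view weights (learned in the paper; fixed here).
    k_hw, k_tw, k_th = view_kernels(w2d)
    return (a[0] * conv3d_same(x, k_hw)
            + a[1] * conv3d_same(x, k_tw)
            + a[2] * conv3d_same(x, k_th))
```

As a sanity check, a delta kernel (1 at the center, 0 elsewhere) makes each view convolution an identity, so the output reduces to `(a_hw + a_tw + a_th) * x`.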

There are two different ways of generating these coefficients; the resulting variants are called CoST(a) and CoST(b) respectively:
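One plausible way to produce such coefficients (a sketch of the general idea, not necessarily the paper's exact CoST(a) or CoST(b) scheme; `view_coefficients`, `W1`, and `W2` are hypothetical names) is to pool each view's feature map, pass the pooled vector through a small two-layer network, and softmax-normalize:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def view_coefficients(f_hw, f_tw, f_th, W1, W2):
    # Hypothetical coefficient generator: global-average-pool each view's
    # feature map, map through two fc layers (ReLU in between), and
    # softmax-normalize into per-view weights that sum to 1.
    pooled = np.array([f.mean() for f in (f_hw, f_tw, f_th)])  # shape (3,)
    hidden = np.maximum(W1 @ pooled, 0.0)                      # fc + ReLU
    logits = W2 @ hidden                                       # shape (3,)
    return softmax(logits)                                     # (a_hw, a_tw, a_th)
```

The softmax keeps the coefficients non-negative and summing to one, which is what later makes them interpretable as relative contributions of the three views.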

4. Experiments (data corpus, evaluation metrics, compared methods)

Datasets: Moments in Time; Kinetics.

Evaluation metrics: Top-1 Acc., Top-5 Acc.

Compared methods: C2D, C3D, I3D, R(2+1)D, S3D-G, etc.

Backbone: ResNet-50, ResNet-101.

The reported results are all on Kinetics, and all are single-model results (RGB modality only).

5. Pros. & Cons.

Pros:

1. More accurate with fewer parameters (compared with C3D).
2. The contributions of spatial (H-W) and temporal (T-W, T-H) information can be quantitatively interpreted over entire videos, via the learned coefficients.
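To illustrate what this interpretation looks like, suppose some layer learned coefficients (a_hw, a_tw, a_th) = (0.5, 0.3, 0.2) (made-up values for illustration; the real ones are learned per layer):

```python
# Hypothetical learned view coefficients; the spatial share is a_hw,
# and the temporal share is the combined weight of the T-W and T-H views.
a_hw, a_tw, a_th = 0.5, 0.3, 0.2
spatial = a_hw
temporal = a_tw + a_th
print(f"spatial: {spatial:.0%}, temporal: {temporal:.0%}")
```

Here the layer would attend equally to spatial and temporal information; a layer with a larger `a_hw` leans more heavily on static appearance.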

Cons:

1. The T-W and T-H views are not intuitive at all.

(And since the authors have not published the source code yet, we have to implement it ourselves /sad.)