Brais Martinez, Davide Modolo, Yuanjun Xiong, Joseph Tighe
Most existing methods treat action recognition as a generic classification problem whose only difference from ImageNet classification is that the input is now video frames. However, recognizing actions correctly requires attention to finer details.
This paper takes inspiration from Yaming W. et al.: human activities are complex concepts, and recognizing them requires a good understanding of fine details.
Moreover, because only the last layers of the network are changed, the impact on computational cost is low.
3. Methods (including framework)
This paper modifies the last layers of existing models. In addition to preserving the original global pooling branch, the authors propose two more branches that capture very localized structures, as shown in this figure:
Here, C is the number of classes and N is a hyper-parameter indicating how many classifiers are associated with each class. So z_xchannel can be read as a very localized prediction, z_avg as a global one, and z_max as something in between. The final prediction is simply the sum of these three outputs.
Also, the full ResNet down-samples the video to such a high degree that the filters cannot learn finer details from the resulting highly abstract feature map. The authors therefore propose a local feature branch, in which the data is up-sampled before being fed to the subsequent blocks.
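To make the three outputs concrete, here is a minimal NumPy sketch of how such a head could be wired. The notes above only state that the three predictions are summed; the specific layout below (1x1x1 filters followed by global max pooling, a linear classifier for z_max, and averaging each class's N responses for z_xchannel) is an assumption borrowed from the general discriminative-filter-bank idea, and the names `three_branch_logits`, `w_avg`, `w_local`, and `w_max` are hypothetical.

```python
import numpy as np

def three_branch_logits(feat, w_avg, w_local, w_max, num_classes, n_per_class):
    """Sum of global, max-pooled, and cross-channel predictions for one clip.

    feat:    (D, T, H, W) feature map from the backbone.
    w_avg:   (C, D) classifier on the globally average-pooled feature.
    w_local: (C*N, D) bank of 1x1x1 filters, N per class (assumed layout).
    w_max:   (C, C*N) classifier on the max-pooled filter responses.
    """
    d = feat.shape[0]
    flat = feat.reshape(d, -1)            # (D, T*H*W): one column per location

    # Global branch: average over all locations, then classify.
    z_avg = w_avg @ flat.mean(axis=1)     # (C,)

    # Local branch: a 1x1x1 convolution is a matrix product per location.
    responses = w_local @ flat            # (C*N, T*H*W)
    peak = responses.max(axis=1)          # strongest response of each filter

    # In-between branch: linear classifier on the peak responses.
    z_max = w_max @ peak                  # (C,)

    # Most localized branch: average the N peak responses tied to each class.
    z_xchannel = peak.reshape(num_classes, n_per_class).mean(axis=1)  # (C,)

    return z_avg + z_max + z_xchannel

# Toy usage with random weights (C=3 classes, N=2 filters per class, D=4).
rng = np.random.default_rng(0)
C, N, D = 3, 2, 4
logits = three_branch_logits(
    rng.normal(size=(D, 2, 5, 5)),
    rng.normal(size=(C, D)),
    rng.normal(size=(C * N, D)),
    rng.normal(size=(C, C * N)),
    num_classes=C,
    n_per_class=N,
)
print(logits.shape)
```

In this sketch z_xchannel has no learned weights after the filter bank, which is why it behaves as the most localized of the three predictions: each class score depends only on the peak responses of its own N filters.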
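As a rough illustration of what up-sampling buys the local branch, the sketch below doubles the spatial resolution of a feature map with nearest-neighbour repetition. These notes do not say which up-sampling operator or factor the paper uses, so both are assumptions here, and `upsample_nearest` is a hypothetical helper.

```python
import numpy as np

def upsample_nearest(feat, factor=2):
    """Repeat every spatial location `factor` times along H and W.

    feat: (D, T, H, W) feature map; the temporal axis is left untouched.
    """
    return feat.repeat(factor, axis=2).repeat(factor, axis=3)

# A heavily down-sampled 7x7 map becomes 14x14, so the blocks that follow
# see four times as many spatial locations.
coarse = np.zeros((8, 4, 7, 7))
fine = upsample_nearest(coarse)
print(coarse.shape, fine.shape)
```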
Datasets: Kinetics-400 and Something-Something-V1.
Implementation framework: 2D/3D TSN, with ResNet backbone.
Evaluation metrics: Top-1 and Top-5.
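For reference, Top-1 and Top-5 are the fraction of clips whose true label is among the 1 or 5 highest-scoring classes. A small sketch of the computation follows; the function name `top_k_accuracy` and the toy scores are illustrative only and are not results from the paper.

```python
import numpy as np

def top_k_accuracy(logits, labels, k):
    """Fraction of samples whose true label is among the k highest scores.

    logits: (num_samples, num_classes) array of scores.
    labels: (num_samples,) array of integer class indices.
    """
    # Indices of the k largest scores per row (order within the k is irrelevant).
    top_k = np.argsort(-logits, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy scores for 3 clips and 3 classes (made-up numbers).
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
truth = np.array([1, 1, 0])
top1 = top_k_accuracy(scores, truth, k=1)   # only the first clip is correct
top2 = top_k_accuracy(scores, truth, k=2)   # the first two clips are correct
```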
5. Pros. & Cons.
- The model can now capture more localized details.
- The top-1 and top-5 accuracies are higher than in last week's paper ;-)
- No source code.
- Slightly increases the computational cost.
6. Comments (e. g., improvements)
- I don’t think the potential capacity of the newly proposed branches is fully explored. The model still seems to be dominated by the “global pooling” branch. Mining more localized features may well be the right direction, but there may be better ways to extract them.
- The up-sampling blocks are too expensive. Why not attach the “discriminative filter banks” earlier in the network, at a stage where the down-sampling factor is still acceptable?
7. Closely related papers (less than 3)
Yaming W., et al., Learning a discriminative filter bank within a CNN for fine-grained recognition, in CVPR, 2018.