---
title: "CVPR2019-Action Recognition from Single Timestamp Supervision in Untrimmed Videos"
date: 2019-11-13
tags: []
draft: false
---

## Background

1. Marking the start and end time of every action instance is expensive and hard to acquire, yet action recognition needs ever larger datasets to improve performance.
2. Weak video-level supervision has been used successfully for recognition in untrimmed videos, but it struggles as the number of different action instances per video increases.

## Motivation

The idea of single timestamp supervision is inspired by similar approaches that use single-point annotations in image-based semantic segmentation. It is a compromise between annotation cost and performance. Furthermore, these single timestamps can even be collected from audio narrations and video subtitles.

## Methods & Framework

## 1. Sampling Near the Timestamps: The Plateau Function

The paper proposes the following plateau function to model the probability density of the sampling distributions:

$$ g(x|c, w, s) = \frac{1}{(e^{s(x-c-w)}+1)(e^{s(-x+c-w)}+1)} $$

Each timestamp gets its own initial plateau function instance, with $c^v_i=a^v_i$, $w=45$ and $s=0.75$. First, the notation:

- $a^v_i$: the $i$-th single timestamp in an untrimmed video $v$
- $y^v_i$: the corresponding class label
- $\beta^v_i=(c^v_i, w^v_i, s^v_i)$, where $c^v_i=a^v_i$: the parameters of a plateau function instance
- $G(\beta^v_i)$: the corresponding plateau function instance
- $t$: the index into the frame list $\mathcal{F}^k$

where $i\in\{1..N_v\}$ and $v\in\{1..M\}$. Let

$$ \mathcal{F}^k=( x \gets G(\beta^v_i) \, : \, y^v_i=k,\forall i\in\{1..N_v\}, \forall v\in\{1..M\} ) \\ s.t.\,\,P(k|\mathcal{F}^k_{t-1})\ge P(k|\mathcal{F}^k_t) $$

be the list of frames sampled for class $k$, sorted by decreasing class score. Only the top-$T$ frames of this list are fed into training, where $T=h\,|\mathcal{F}^k|, h\in[0, 1]$. $h$ is increased gradually as training progresses.
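
To make the sampling step concrete, here is a minimal NumPy sketch of the plateau function, of drawing frames from it, and of the top-$T$ selection. The function names (`plateau`, `sample_frames`, `top_t`), the exponent clipping, and the use of frame indices as $x$ are my own assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def plateau(x, c, w, s):
    """g(x | c, w, s): close to 1 inside [c - w, c + w], decaying to 0 outside."""
    x = np.asarray(x, dtype=float)
    # Clip the exponents so np.exp cannot overflow far away from the plateau
    # (a numerical safeguard added for this sketch).
    left = np.exp(np.clip(s * (x - c - w), -50.0, 50.0))
    right = np.exp(np.clip(s * (-x + c - w), -50.0, 50.0))
    return 1.0 / ((left + 1.0) * (right + 1.0))

def sample_frames(num_frames, c, w, s, n, rng):
    """Draw n frame indices with probability proportional to g."""
    frames = np.arange(num_frames)
    weights = plateau(frames, c, w, s)
    return rng.choice(frames, size=n, p=weights / weights.sum())

def top_t(frames, scores, h):
    """Keep the top T = h * |F^k| frames, ranked by class score P(k | x)."""
    frames, scores = np.asarray(frames), np.asarray(scores)
    t = int(h * len(frames))
    order = np.argsort(-scores)  # indices sorted by descending score
    return frames[order[:t]]

rng = np.random.default_rng(0)
sampled = sample_frames(500, c=100, w=45, s=0.75, n=1000, rng=rng)
```

Note that $g$ equals roughly $0.5$ at $x = c \pm w$, so $w$ acts as the half-width of the plateau and $s$ controls how steep its sides are.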
## 2. Updating the Distribution Function

The initial sampling functions are usually not precise enough, so we need a policy to update them. It consists of three steps (this note would be too long if I covered them in detail...).

## 2.1 Finding Update Proposals

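We denote each update proposal with $\gamma^v_j=(c^v_j, w^v_j, s^v_j)$. A proposal is valid for $\beta^v_i$ only if its centre lies between the centres of the two neighbouring plateau functions, so the set of update proposals for $\beta^v_i$ is:

$$ \mathcal{Q}^v_i=\{\gamma^v_j\,:\,c^v_{i-1}<c^v_j<c^v_{i+1}\} $$

## 2.2 Selecting the Update Proposals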

First, select the frames such that $g(x|\beta^v_i)>0.5$, and denote this set with $\mathcal{X}$. The confidence of a plateau function is the average class score over these frames:

$$ \rho(\beta^v_i)=\frac1{|\mathcal{X}|}\sum_{x\in\mathcal{X}}P(y^v_i|x) $$

Then, for each $\beta^v_i$, we select the proposal $\widehat{\gamma^v_i}$ with the highest confidence gain:

$$ \widehat{\gamma^v_i}=\text{arg max}_{\gamma^v_j\in\mathcal{Q}^v_i}(\rho(\gamma^v_j)-\rho(\beta^v_i)) $$

## 2.3 Updating Proposals
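
The selection rule from 2.2 and a per-parameter update can be sketched together as follows. This is only an illustration under my own assumptions: how the proposals are generated is not covered in this note, so they are passed in as plain `(c, w, s)` tuples, and the interpolation rule in `update` (each parameter moves towards the proposal at its own rate) is a guess at a reasonable update, not necessarily the paper's exact formula.

```python
import numpy as np

def plateau(x, c, w, s):
    """Plateau function g(x | c, w, s), with clipped exponents for stability."""
    x = np.asarray(x, dtype=float)
    left = np.exp(np.clip(s * (x - c - w), -50.0, 50.0))
    right = np.exp(np.clip(s * (-x + c - w), -50.0, 50.0))
    return 1.0 / ((left + 1.0) * (right + 1.0))

def rho(params, frame_scores):
    """Mean class score P(y | x) over the frames where g(x | params) > 0.5."""
    frames = np.arange(len(frame_scores))
    inside = plateau(frames, *params) > 0.5
    return float(frame_scores[inside].mean()) if inside.any() else 0.0

def select_proposal(beta, proposals, frame_scores, c_prev, c_next):
    """Pick the valid proposal with the largest confidence gain over beta."""
    # Valid proposals have their centre between the neighbouring centres.
    valid = [g for g in proposals if c_prev < g[0] < c_next]
    if not valid:
        return None
    base = rho(beta, frame_scores)
    return max(valid, key=lambda g: rho(g, frame_scores) - base)

def update(beta, gamma, rates):
    """Move each parameter of beta towards gamma at its own rate (assumed rule)."""
    return tuple(b - lr * (b - g) for b, g, lr in zip(beta, gamma, rates))

# Toy per-frame scores for one class: the action really happens in frames 120-159.
frame_scores = np.zeros(200)
frame_scores[120:160] = 0.9

beta = (60, 20, 0.75)
proposals = [(140, 20, 0.75), (70, 20, 0.75), (190, 20, 0.75)]
best = select_proposal(beta, proposals, frame_scores, c_prev=0, c_next=180)
new_beta = update(beta, best, rates=(0.5, 0.25, 0.0))
```

In this toy run the proposal centred at 140 wins, and the centre of `beta` moves halfway towards it while `w` and `s` stay put because the proposal shares their values.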

Nothing special. Each parameter can have its own update rate.

## Experiments

- Datasets: THUMOS 14, BEOID, EPIC Kitchens.
- Architecture: BN-Inception, pretrained on Kinetics, embedded in the TSN framework.
- Evaluation metric: top-1 accuracy.

The labels used in the results are:

- `APV` means the average number of unique Actions Per training Video.
- `Video-level` means video-level supervision (whether or not the video contains the action).
- `TS` means $a_i \sim U[\sigma_i-1sec, \epsilon_i+1sec]$, with $\sigma_i$ and $\epsilon_i$ the ground-truth start and end times of action $i$.
- `TS in GT` means $a_i \sim N(\frac{\sigma_i+\epsilon_i}2, 1sec)$.
- `Full` means giving the exact extent of every action instance to the model.

## Pros. & Cons.

- Pros: Annotation is much easier, which makes larger datasets possible.
- Cons: In real-world applications, no timestamps can be supplied.

## Comments

- What if the supervision were weakened even more radically: provide only the number of action instances, and let the sampling function learn where to go by itself?
- It seems that the refinement process of action boundaries is applicable in many traditional models / frameworks.
- There are too many plateau function instances! What if every action class shared a common instance?

## Closely related Papers

- [Alayrac J, Bojanowski P, et al. Unsupervised learning from narrated instruction videos. In *CVPR*, 2016.](https://arxiv.org/abs/1506.09215)
- [Bengio Y, Louradour J, et al. Curriculum learning. In *ICML*, 2009.](https://dl.acm.org/citation.cfm?id=1553380)