# [Paper-Reading] CVPR2019-Action Recognition from Single Timestamp Supervision in Untrimmed Videos

## Background

- Annotating the start and end time of every action instance is expensive and hard to obtain at scale, yet action recognition needs ever larger datasets to improve performance.
- Weak video-level supervision has been successfully exploited for recognition in untrimmed videos; however, it struggles as the number of different action instances per video increases.

## Motivation

The idea, single timestamp supervision, is inspired by similar approaches that use single-point annotations in image-based semantic segmentation. It is a compromise between annotation cost and performance. Furthermore, these single timestamps can even be collected from audio narrations and video subtitles.

## Methods & Framework

### 1. Sampling Near the Timestamps: The Plateau Function

This paper proposes the following *plateau function* to model the probability density of the sampling distributions:

$$ g(x|c, w, s) = \frac{1}{(e^{s(x-c-w)}+1)(e^{s(-x+c-w)}+1)} $$
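To get a feel for the shape, here is a minimal Python sketch that evaluates the formula above. The function name `plateau` and the centre `c=100` are illustrative assumptions of mine; `w=45` and `s=0.75` are the initial values quoted in this note.

```python
import math

def plateau(x, c, w, s):
    """g(x | c, w, s): close to 1 for |x - c| < w, falling to 0 outside.

    s sets how steep the two edges are. For very large s * |x - c|,
    math.exp can overflow; this sketch does not guard against that.
    """
    right_edge = math.exp(s * (x - c - w)) + 1.0   # grows once x > c + w
    left_edge = math.exp(s * (-x + c - w)) + 1.0   # grows once x < c - w
    return 1.0 / (right_edge * left_edge)

# Hypothetical plateau centred on frame 100.
centre_value = plateau(100, c=100, w=45, s=0.75)   # on the plateau
edge_value = plateau(145, c=100, w=45, s=0.75)     # exactly at c + w
far_value = plateau(200, c=100, w=45, s=0.75)      # well outside
```

The value sits near 1 on the plateau, passes through 0.5 at \(c \pm w\), and vanishes further away, so \(w\) acts as a half-width and \(s\) as the edge steepness.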

Each timestamp gets its own initial *plateau function* instance, with \(c^v_i=a^v_i\), \(w=45\) and \(s=0.75\). The notation is as follows:

- \(a^v_i\): the \(i\)-*th* single timestamp in an untrimmed video \(v\)
- \(y^v_i\): the corresponding class label
- \(\beta^v_i=(c^v_i, w^v_i, s^v_i)\), where \(c^v_i=a^v_i\): the parameters of a *plateau function* instance
- \(G(\beta^v_i)\): the corresponding *plateau function* instance
- \(t\): the index into the frame list \(\mathcal{F}^k\)

where \(i\in\{1..N_v\}\) and \(v\in\{1..M\}\).

And let:
$$
\mathcal{F}^k=\left(
x \gets G(\beta^v_i)
\,:\,
y^v_i=k,\ \forall i\in\{1..N_v\},\
\forall v\in\{1..M\}
\right)
\quad \text{s.t.} \quad P(k|\mathcal{F}^k_{t-1})\ge P(k|\mathcal{F}^k_t)
$$
be the list of frames sampled for class \(k\), ordered so that the classifier's score for \(k\) is non-increasing along the list. The top-\(T\) frames of this list are then fed into training, where \(T=h\,|\mathcal{F}^k|\) and \(h\in[0, 1]\). \(h\) is increased slowly as the model trains.
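The sampling and top-\(T\) selection can be sketched roughly as follows. This is only an illustration under my own assumptions: frames are drawn with probability proportional to \(g\) over a fixed range of frame indices, `frame_score` is a made-up stand-in for the classifier output \(P(k|x)\), and the names `sample_frames` and `top_t` are hypothetical.

```python
import math
import random

def plateau(x, c, w, s):
    """Plateau function g(x | c, w, s) from the formula above."""
    return 1.0 / ((math.exp(s * (x - c - w)) + 1.0) * (math.exp(s * (-x + c - w)) + 1.0))

def sample_frames(c, w, s, num_frames, n_samples, rng):
    """Draw frame indices with probability proportional to g(x | c, w, s)."""
    frames = list(range(num_frames))
    weights = [plateau(x, c, w, s) for x in frames]
    # random.choices treats weights as relative, so g needs no normalisation here.
    return rng.choices(frames, weights=weights, k=n_samples)

def top_t(frames, score_fn, h):
    """Sort frames by score (descending) and keep the top T = h * |F|."""
    ranked = sorted(frames, key=score_fn, reverse=True)
    return ranked[:int(h * len(ranked))]

def frame_score(x):
    """Hypothetical stand-in for P(k | x); a trained model would supply this."""
    return math.exp(-((x - 120) / 20.0) ** 2)

rng = random.Random(0)
pool = sample_frames(c=100, w=45, s=0.75, num_frames=300, n_samples=200, rng=rng)
selected = top_t(pool, frame_score, h=0.25)
```

With `h=0.25`, only the quarter of the sampled frames that the classifier currently trusts most is used; raising `h` over time lets in more of the pool.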

### 2. Updating the Distribution Function

The initial sampling functions are often not precise enough, so we need a policy to update them. It consists of three steps (this note would be too long if I covered them in detail…).

#### 2.1 Finding Update Proposals

We denote each update proposal with \(\gamma^v_j=(c^v_j, w^v_j, s^v_j)\). The set of update proposals for \(\beta^v_i\) is thus:

$$ \mathcal{Q}^v_i=\{\gamma^v_j \,:\, c^v_{i-1}<c^v_j<c^v_{i+1}\} $$

#### 2.2 Selecting the Update Proposals

First, select the frames \(x\) such that \(g(x|\beta^v_i)>0.5\), and denote this set by \(\mathcal{X}\). Then, for each \(\beta^v_i\), we select the proposal \(\widehat{\gamma^v_i}\) with the highest confidence:

$$ \widehat{\gamma^v_i}=\arg\max_{\gamma^v_j\in\mathcal{Q}^v_i}\left(\rho(\gamma^v_j)-\rho(\beta^v_i)\right) $$

where \(\rho(\beta^v_i)=\frac1{|\mathcal{X}|}\sum_{x\in\mathcal{X}}P(y^v_i|x)\).
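A rough sketch of the confidence score \(\rho\) and of picking the highest-confidence proposal is given below. Several things here are my assumptions rather than the paper's: the helper names, the toy `class_prob` standing in for \(P(y^v_i|x)\), the example plateau and proposals, returning `0.0` when no frame passes the 0.5 threshold, and scoring a proposal over its own above-0.5 frames.

```python
import math

def plateau(x, c, w, s):
    """Plateau function g(x | c, w, s)."""
    return 1.0 / ((math.exp(s * (x - c - w)) + 1.0) * (math.exp(s * (-x + c - w)) + 1.0))

def confidence(params, class_prob, num_frames):
    """rho: mean class probability over the frames where g(x | params) > 0.5."""
    c, w, s = params
    support = [x for x in range(num_frames) if plateau(x, c, w, s) > 0.5]
    if not support:
        return 0.0  # assumption: an empty support gets zero confidence
    return sum(class_prob(x) for x in support) / len(support)

def select_proposal(beta, proposals, class_prob, num_frames):
    """Return the proposal with the largest rho(gamma) - rho(beta), or None."""
    if not proposals:
        return None
    base = confidence(beta, class_prob, num_frames)
    return max(proposals, key=lambda gamma: confidence(gamma, class_prob, num_frames) - base)

def class_prob(x):
    """Hypothetical classifier output: the action really spans frames 130..170."""
    return 1.0 if 130 <= x <= 170 else 0.1

beta = (100, 45, 0.75)                          # current plateau (c, w, s)
proposals = [(150, 20, 0.75), (60, 20, 0.75)]   # hypothetical update proposals
best = select_proposal(beta, proposals, class_prob, num_frames=300)
```

In this toy setup the proposal centred on frame 150 covers only high-probability frames, so it wins over the one centred on frame 60.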

#### 2.3 Updating Proposals

Nothing special here: each parameter can have its own update rate.
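This note does not spell out the update rule, so the following is only one plausible reading: each of \(c\), \(w\) and \(s\) moves a fraction of the way toward the selected proposal, with a separate rate per parameter. The function name `update_plateau` and the rate values are assumptions for illustration.

```python
def update_plateau(beta, gamma, rates):
    """Move each of (c, w, s) toward the selected proposal by its own rate in [0, 1]."""
    return tuple(b - rate * (b - g) for b, g, rate in zip(beta, gamma, rates))

beta = (100.0, 45.0, 0.75)    # current plateau (c, w, s)
gamma = (150.0, 20.0, 0.75)   # selected proposal
# Hypothetical rates: the centre moves faster than the width and steepness.
updated = update_plateau(beta, gamma, rates=(0.5, 0.25, 0.25))
```

A rate of 0 leaves a parameter untouched and a rate of 1 jumps straight to the proposal; anything in between gives a gradual refinement.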

## Experiments

Datasets: THUMOS 14, BEOID, EPIC Kitchens.

Architecture: BN-Inception, pretrained on Kinetics, embedded in the TSN framework.

Evaluation metrics: Top-1 accuracy.

Where:

- `APV`: the average number of unique **A**ctions **P**er training **V**ideo.
- `Video-level`: video-level supervision (whether the video contains the action or not).
- `TS`: \(a_i \sim U[\sigma_i-1\,\text{sec},\ \epsilon_i+1\,\text{sec}]\)
- `TS in GT`: \(a_i \sim N(\frac{\sigma_i+\epsilon_i}2,\ 1\,\text{sec})\)
- `Full`: the exact extent of every action instance is given to the model.
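To make the two timestamp settings concrete, here is a small simulation sketch. It assumes that \(\sigma_i\) and \(\epsilon_i\) are the ground-truth start and end times in seconds and that the 1 sec in the normal distribution is its standard deviation; the function names and the example segment are hypothetical.

```python
import random

def sample_ts(start, end, rng, tol=1.0):
    """TS: timestamp drawn uniformly from [start - tol, end + tol] (seconds)."""
    return rng.uniform(start - tol, end + tol)

def sample_ts_in_gt(start, end, rng, std=1.0):
    """TS in GT: timestamp drawn from a normal centred on the segment midpoint."""
    return rng.gauss((start + end) / 2.0, std)

rng = random.Random(0)
# Hypothetical action spanning 5 s to 10 s.
ts_samples = [sample_ts(5.0, 10.0, rng) for _ in range(2000)]
gt_samples = [sample_ts_in_gt(5.0, 10.0, rng) for _ in range(2000)]
gt_mean = sum(gt_samples) / len(gt_samples)
```

`TS` timestamps can land up to a second outside the action, whereas `TS in GT` timestamps cluster around its middle.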

## Pros. & Cons.

**Pros**: Annotation becomes much easier, which makes larger datasets feasible.

**Cons**: In real-world applications, no timestamps can be supplied.

## Comments

- What if the supervision were relaxed even more radically: provide only the number of action instances, and let the sampling function learn where to go by itself?
- It seems that the refinement process of action boundaries is applicable in many traditional models / frameworks.
- There are too many *plateau function* instances! What if every action class shared a common instance?