A wide range of human activities can be defined as a spatiotemporal repetition of certain motion patterns. Therefore, activities can be robustly recognized by explicitly identifying these patterns and their layout in the video. To this end, an activity can be represented by a graphical model whose nodes correspond to latent motion primitives and whose edges encode their probabilistic affinities for grouping into observable repetitive patterns. A video of the activity can then be viewed as probabilistically sampled from the graphical model. The probabilistic sampling: (i) picks a subset of primitives; (ii) varies their intrinsic characteristics (e.g., motion); and (iii) groups them into larger patterns, forming the video. This talk will show that (i)--(iii) can be formulated as an efficient Kronecker multiplication of the model's affinity matrix, leading to computationally efficient inference and learning algorithms. In comparison with the state of the art, our experiments on individual, structured, and collective human activities demonstrate good scalability and superior performance on benchmark datasets, including the UCF YouTube, Olympic, and Collective Activities datasets.
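As a rough illustration only (not the talk's actual model), the idea of lifting pairwise affinities between primitives to affinities over larger groupings via a Kronecker product can be sketched in NumPy; the affinity matrix `A` below is a hypothetical toy example:

```python
import numpy as np

# Hypothetical affinity matrix over 2 latent motion primitives
# (entry [i, j]: probabilistic affinity of primitives i and j
# for grouping into an observable repetitive pattern).
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# The Kronecker product lifts pairwise affinities to affinities
# over ordered pairs of primitives, i.e., larger patterns built
# from smaller ones, without forming those patterns explicitly.
K = np.kron(A, A)

print(K.shape)   # one row/column per ordered pair of primitives
print(K[0, 0])   # affinity of the pattern (0,0) with itself
```

Because `np.kron` is a structured multiplication rather than an enumeration of all pattern combinations, this is the kind of operation that keeps inference and learning computationally efficient.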
Back to Graduate Summer School: Computer Vision