VIDEO UNDERSTANDING WITH DEEP NETWORKS


ABSTRACT

Video understanding is one of the fundamental problems in computer vision. Videos add a temporal component to the image recognition task, through which motion and other cues can be exploited. Encouraged by the success of deep convolutional neural networks (CNNs) on image classification, we extend deep convolutional networks to video understanding by modeling both spatial and temporal information.

To effectively utilize deep networks, we need a comprehensive understanding of convolutional neural networks. We first study the network in the domain of image retrieval. We show that for instance-level image retrieval, lower layers often perform better than the final layers of convolutional neural networks. We present an approach for extracting convolutional features from different layers of the network and adopt VLAD encoding to aggregate the features into a single vector for each image. Our work provides guidance for transferring deep convolutional networks to other tasks.
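As a concrete illustration (not the exact pipeline of this dissertation), the sketch below treats activations from an intermediate convolutional layer as local descriptors and VLAD-encodes them into a single vector per image; the layer choice, codebook size, and normalization steps are placeholder assumptions.

    import numpy as np

    def vlad_encode(descriptors, codebook):
        # Hard-assign each local descriptor (N x D) to its nearest center (K x D),
        # accumulate residuals per center, then power- and L2-normalize.
        dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        K, D = codebook.shape
        vlad = np.zeros((K, D), dtype=np.float32)
        for k in range(K):
            members = descriptors[assignments == k]
            if len(members):
                vlad[k] = (members - codebook[k]).sum(axis=0)
        vlad = vlad.ravel()
        vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))   # signed square-root normalization
        return vlad / (np.linalg.norm(vlad) + 1e-12)

    # A (C, H, W) feature map from an intermediate layer becomes H*W descriptors of dimension C.
    C, H, W = 256, 13, 13
    feature_map = np.random.randn(C, H, W).astype(np.float32)   # placeholder activations
    descriptors = feature_map.reshape(C, H * W).T
    codebook = np.random.randn(64, C).astype(np.float32)        # e.g. k-means centers from training data
    image_vector = vlad_encode(descriptors, codebook)           # one fixed-length vector per image

In practice the codebook would be learned with k-means over descriptors from many images, and the same encoding can be applied to different layers to compare their retrieval performance.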

We then propose and evaluate several deep neural network architectures that combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first explores various convolutional temporal feature pooling architectures, examining the design choices that must be made when adapting a CNN to this task. The second explicitly models the video as an ordered sequence of frames, employing a recurrent neural network with Long Short-Term Memory (LSTM) cells connected to the output of the underlying CNN.
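To make the second method concrete, the minimal sketch below (with an illustrative stand-in for the frame CNN and hypothetical dimensions) feeds per-frame CNN features to an LSTM and classifies the final time step; a temporal feature pooling variant would instead average or max-pool the per-frame features before classification.

    import torch
    import torch.nn as nn

    class CNNLSTMClassifier(nn.Module):
        # Minimal sketch: a per-frame encoder (stand-in for a pretrained CNN)
        # followed by an LSTM over time; only the last hidden state is classified.
        def __init__(self, feature_dim=512, hidden_dim=256, num_classes=101):
            super().__init__()
            self.frame_encoder = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feature_dim))
            self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, clip):                      # clip: (batch, time, 3, H, W)
            b, t = clip.shape[:2]
            feats = self.frame_encoder(clip.flatten(0, 1)).view(b, t, -1)
            outputs, _ = self.lstm(feats)             # (batch, time, hidden_dim)
            return self.classifier(outputs[:, -1])    # predict from the last time step

    logits = CNNLSTMClassifier()(torch.randn(2, 16, 3, 64, 64))   # two clips of 16 frames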

Next, we propose ActionFlowNet, a multitask learning model that trains a single-stream convolutional network directly from raw pixels to jointly estimate optical flow while recognizing actions, capturing both appearance and motion in a single model. Experiments show that our model effectively learns video representations from motion information on unlabeled videos.
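A minimal sketch of the multitask idea is given below, assuming a pair of RGB frames stacked along the channel dimension, an illustrative shared encoder, and a placeholder loss weight; it is not the actual ActionFlowNet architecture, only an example of sharing one representation between action classification and flow regression.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTaskNet(nn.Module):
        # Illustrative single-stream setup: a shared encoder over raw frames feeds
        # both an action classifier and an optical-flow head (layer sizes are made up).
        def __init__(self, num_classes=101):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(6, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
            self.action_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                             nn.Linear(128, num_classes))
            self.flow_head = nn.Conv2d(128, 2, 3, padding=1)   # 2-channel flow field

        def forward(self, frame_pair):                # (batch, 6, H, W): two RGB frames
            shared = self.encoder(frame_pair)
            return self.action_head(shared), self.flow_head(shared)

    model = MultiTaskNet()
    frames = torch.randn(4, 6, 64, 64)
    action_labels = torch.randint(0, 101, (4,))
    target_flow = torch.randn(4, 2, 16, 16)           # flow at the head's output resolution

    action_logits, pred_flow = model(frames)
    # Joint objective: classification loss plus a (placeholder) weighted flow-regression loss.
    loss = F.cross_entropy(action_logits, action_labels) + 0.1 * F.l1_loss(pred_flow, target_flow)
    loss.backward()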

While recent deep models for video show improvement by incorporating optical flow or by aggregating high-level appearance across frames, they focus on modeling either long-term temporal relations or short-term motion. We propose Temporal Difference Networks (TDN), which model both long-term relations and short-term motion. We leverage a simple but effective motion representation, the difference of CNN features between frames, and jointly model motion at multiple scales within a single CNN.
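The sketch below illustrates the core idea of feature differencing, assuming per-frame feature maps from some convolutional layer and an illustrative fusion convolution; the actual TDN design combines such differences at multiple scales within the network.

    import torch
    import torch.nn as nn

    class FeatureDifferenceBlock(nn.Module):
        # Motion is represented as the difference of CNN feature maps between
        # neighbouring frames, then fused back into the appearance stream.
        def __init__(self, channels=256):
            super().__init__()
            self.fuse = nn.Conv2d(channels, channels, 3, padding=1)   # illustrative fusion layer

        def forward(self, feats):                       # feats: (batch, time, C, H, W)
            diff = feats[:, 1:] - feats[:, :-1]         # (batch, time-1, C, H, W)
            b, t = diff.shape[:2]
            motion = self.fuse(diff.flatten(0, 1)).view(b, t, *diff.shape[2:])
            return feats[:, :-1] + motion               # appearance plus motion features

    feats = torch.randn(2, 8, 256, 14, 14)              # per-frame features from some conv layer
    fused = FeatureDifferenceBlock()(feats)             # (2, 7, 256, 14, 14)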


Introduction

Motivation

Video understanding is one of the fundamental problems in computer vision. Videos add a temporal component to the image recognition task, through which motion and other cues can be exploited. Encouraged by the success of deep convolutional neural networks (CNNs) on image classification, in this dissertation we extend deep convolutional networks to video understanding by modeling both spatial and temporal information.

Traditionally, video action recognition research has been very successful at extracting local features that encode local spatio-temporal patterns. Hand-crafted features such as Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF), Motion Boundary Histograms (MBH), and trajectories are extracted from the videos. These local descriptors are then encoded with Bag-of-Words (BoW), VLAD, or Fisher vector encodings to produce a global video-level feature representation. By aggregating spatio-temporal local features into global video representations, these approaches obtain state-of-the-art results on a wide range of video recognition benchmarks.
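For illustration, a minimal Bag-of-Words encoding is sketched below, with hypothetical descriptor and codebook sizes; VLAD and Fisher vector encodings follow the same assign-then-aggregate pattern but keep richer per-cluster statistics.

    import numpy as np

    def bow_encode(descriptors, codebook):
        # Each local descriptor (N x D) votes for its nearest visual word (K x D);
        # the video is represented by the L1-normalized histogram of word counts.
        d2 = ((descriptors ** 2).sum(1)[:, None] + (codebook ** 2).sum(1)[None, :]
              - 2.0 * descriptors @ codebook.T)          # squared distances, (N, K)
        words = d2.argmin(axis=1)
        hist = np.bincount(words, minlength=len(codebook)).astype(np.float32)
        return hist / (hist.sum() + 1e-12)

    # e.g. HOG/HOF/MBH descriptors pooled from one video and a k-means codebook (sizes are illustrative)
    video_vector = bow_encode(np.random.randn(2000, 96), np.random.randn(1000, 96))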

Deep convolutional networks have shown great success in large-scale image classification. By learning a hierarchy of feature representations through end-to-end optimization, CNNs give superior performance compared to traditional hand-crafted features. They also transfer successfully to related tasks such as object detection, semantic segmentation, and image retrieval. A series of increasingly accurate network architectures has been proposed, including AlexNet, VGG, Inception, and ResNet.

There are several challenges in applying deep networks to video understanding. First, appropriate models are needed to learn both spatial appearance and temporal information; short-term motion as well as long-term context is required for a full understanding of videos. Second, because videos have one more dimension than images, processing them is computationally expensive, and efficient algorithms are needed to handle the large amounts of data.

In this dissertation, we study the problem of video classification with deep networks. We present multiple approaches for video action recognition that focus on aggregating long-term temporal information as well as capturing short-term motion and local appearance.

Approaches

To effectively utilize deep networks, we need a comprehensive understanding of convolutional neural networks. We first study the network in the domain of image retrieval. We then propose different network architectures for video classification.

First, we study network architectures for full-length video classification. Second, we introduce methods for learning short-term motion representation for video action recognition. Finally, we propose a framework to jointly model low-level motion and high-level temporal relations.