Deep learning based visual recognition and localization is one of the pillars of computer vision and is the driving force behind applications like self-driving cars, visual search, video surveillance, augmented reality, to name a few. This thesis identifies key bottlenecks in state-of-the-art visual recognition pipelines which use convolutional neural networks and proposes effective solutions to push their limits. A few shortcomings of convolutional neural networks are, lack of scale invariance which poses a challenge for tasks like object detection, fixed structure of the network which restricts their usage when presented with new class labels, and difficulty in modeling long range spatial/temporal dependencies. We provide evidence of these problems and then design effective solutions to overcome them.

In the first part, an analysis of different techniques for recognizing and detecting objects under extreme scale variation is presented. Since small and large objects are difficult to recognize at smaller and larger scales of an image pyramid respectively, we present a novel training scheme called Scale Normalization for Image Pyramids (SNIP) which selectively back-propagates the gradients of object instances of different sizes as a function of the image scale. As SNIP ignores gradients of objects at extreme resolutions, following up on this idea, we developed SNIPER (Scale Normalization for Image Pyramids with Efficient Re-sampling), an algorithm for performing efficient multi-scale training for instance level visual recognition tasks. Instead of processing every pixel in an image pyramid, SNIPER processes context regions (512x512 pixels) around ground-truth instances at the appropriate scale. For background sampling, these context-regions are generated using proposals extracted from a region proposal network trained with a short learning schedule. Hence, the number of chips generated per image during training adaptively changes based on the scene complexity. SNIPER brings training of instance level recognition tasks like object detection closer to the protocol for image classification and suggests that the commonly accepted guideline that it is important to train on high resolution images for instance level visual recognition tasks might not be correct.

Next, we present a real-time large-scale object detector (R-FCN-3000) for detecting thousands of classes where objectness detection and classification are decoupled. To obtain the detection score for an RoI, we multiply the objectness score with the fine-grained classification score. We show that the objectness learned by R-FCN-3000 generalizes to novel classes and the performance increases with the number of training object classes supporting the hypothesis that it is possible to learn a universal objectness detector. Because of generalized objectness, we can train object detectors for new classes, just with classification data, without even requiring bounding boxes.

Finally, we present a multi-stream bi-directional recurrent neural network for action detection. This was the first deep learning based system which could perform action localization in long videos and it could do it just with RGB data, without requiring any skeletal models or performing intermediate tasks like pose-estimation. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion while suppressing background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. We show that our bi-directional LSTM network utilizes about 8 seconds of the video sequence to predict an action label and outperforms state-of-the-art methods on multiple benchmarks.


Visual recognition is a fundamental problem in computer vision and has wide ranging applications like autonomous driving, visual search, surveillance, augmented reality, to name a few. Modern recognition pipelines primarily rely on data driven model based approaches, which are instantiated in the form of convolutional neural networks [54]. These networks distill information from high dimensional input spaces by applying multiple nonlinear operators in an iterative fashion to compress the input to a low dimensional semantic space. For different problems like object detection, semantic segmentation, action detection, problem specific architectural changes are made to enable appropriate information flow through the network cascade for generating meaningful semantic representations towards the end.

A major bottleneck in convolutional neural networks is that they are not invariant to scale. For example, if we change the input size of the image or video, the semantic features which will be generated at the end of the network would be different. Another bottleneck is that the models are not flexible. By flexible, it means that after training the network on 1,000 classes if we want to add another class, we need to re-train the network again. This makes it impractical to use them in many applications where online learning is needed. Another failure mode arises while modeling long range spatial/temporal dependencies. This is because these networks aggregate contextual information by applying successive convolution and pooling operations to increase their receptive field. Although in theory their receptive field grows, in practice it does not happen [65]. This is similar to the visual cortex and if we want to focus on two small objects which are far away simultaneously, it gets very difficult. In our mind, we retain information about other parts of the image or video in our memory cells which helps us in processing long term contextual information. In this thesis, we provide evidence of these problems by designing controlled experiments and then propose effective solutions to alleviate them.