TOWARDS GENERALIZED FRAMEWORKS FOR OBJECT RECOGNITION

Postgraduate

ABSTRACT

Over the past few years, deep convolutional neural network (DCNN) based approaches have been immensely successful in tackling a diverse range of object recognition problems. Popular DCNN architectures like deep residual networks (ResNets) are highly generic, not just for classification, but also for high level tasks like detection/tracking which rely on classification DCNNs as their backbone. The generality of DCNNs however doesn’t extend to image-to-image(Im2Im) regression tasks (eg: super-resolution, denoising, rgb-to-depth, relighting, etc). For such tasks, DCNNs are often highly task-specific and require specific ancillary post-processing methods. The major issue plaguing the design of generic architectures for such tasks is the tradeoff between context/locality given a fixed computation/memory budget.

We first present a generic DCNN architecture for Im2Im regression that can be trained end-to-end without any further machinery. Our proposed architecture, the Recursively Branched Deconvolutional Network (RBDN), which features a cheap early multi-context image representation, an efficient recursive branching scheme with extensive parameter sharing and learnable upsampling. We provide qualitative/quantitative results on 3 diverse tasks: relighting, denoising and colorization and show that our proposed RBDN architecture obtains comparable results to the state-of-the-art on each of these tasks when used off-the-shelf without any post processing or task-specific architectural modifications.

Second, we focus on gradient flow and optimization in ResNets. In particular, we theoretically analyze why pre-activation(v2) ResNets outperform the original ResNets(v1) on CIFAR datasets but not on ImageNet. Our analysis reveals that although v1-ResNets lack ensembling properties, they can have a higher effective depth in comparison to v2-ResNes. Subsequently, we show that downsampling projections (while only few in number) have a significantly detrimental effect on performance. We show that by simply replacing downsampling-projections with identity-like dense-reshape shortcuts, the classification results of standard residual architectures like ResNets, ResNeXts and SE-Nets improve by up to 1.2% on ImageNet, without any increase in computational complexity (FLOPs).

Finally, we present a robust non-parametric probabilistic ensemble method for multi-classification, which outperforms the state-of-the-art ensemble methods on several machine learning and computer vision datasets for object recognition with statistically significant improvements. The approach is particularly geared towards multi-classification problems with very low training data and/or a fairly high proportion of outliers, for which training end-to-end DCNNs is not very beneficial.


Introduction

Given any image/video, it is often trivial for even small children to develop a basic understanding of the underlying scene. Humans have a potentially limitless capacity for remembering and identifying objects, which often makes recognition tasks seem fairly easy. Furthermore, humans also use context in a very natural way to develop complex relationships between various objects in a scene. For a machine however, such tasks are far from a trivial proposition. While we humans can perform these tasks at ease, we have yet to provide a thorough comprehensive description of precisely how we do them, despite decades of research in neurosciences. Any machine needed to automate even the most basic of tasks unfortunately doesn’t have intuition or common sense at its disposal, and requires a precise set of instructions to either (a) directly perform the task or (b) learn from training data to develop a potentially complex set of instructions for performing the task.

Fully automated scene understanding has always been and is still considered the holy-grail of computer vision. While the problem remains unsolved at large, there has nevertheless been tremendous progress over the past few decades. A bulk of the progress has infact come over the last few years via deep learning approaches which have been made possible with major breakthroughs in hardware and compute capabilities. Broadly speaking, deep learning relies on massive cascades of modular building blocks, each comprising of several basic compute units refered to as neurons (which are biologically inspired). A major reason for its widespread success is not just its extremely high modeling capacity, but the ability to learn end-to-end models for solving tasks. Deep Convolutional Neural Networks (DCNNs) in particular directly operate on the input image/video and learn to produce the desired output without relying on hand-crafted features or multi-step pipelines which are often ad-hoc/suboptimal.


The Importance of Image Classification

Ever since the advent of computer vision, the approach to solving complex high-level tasks has always been to break them heirarchically into progressively simpler low-level vision tasks. This approach works well since high level tasks are almost always strongly dependent on the output of one or more low-level tasks and can often be described as a repeated application of a low-level task. Figure 1.1 illustrates this for an example of traffic-jam analysis from video-footage. For analyzing such a video, the ability to track specific vehicles across video frames is often a pre-requisite. Tracking vehicles on the other hand requires the ability to reliably detect vehicles in individual frames. Finally, detecting vehicles involves applying a vehicle-classifier on various image-patches at various scales followed by non-maximal suppression. So, the vehicle classification task lies at the heart of this problem and is a critical component.