FINDING OBJECTS IN COMPLEX SCENES


ABSTRACT

Object detection is one of the fundamental problems in computer vision, with great practical impact. Current object detectors work well under certain conditions, but challenges arise as scenes become more complex. Scenes are often cluttered, and object detectors trained on Internet-collected data fail when objects' appearance varies widely.

We believe the key to tackling these challenges is to understand the rich context of objects in scenes, which includes: the appearance variations of an object due to changes in viewpoint and lighting; the relationships between objects and their typical environment; and the composition of multiple objects in the same scene. This dissertation studies the complexity of scenes from these aspects.

To facilitate collecting training data with large variations, we design a novel user interface, ARLabeler, utilizing the power of Augmented Reality (AR) devices. Instead of labeling images from the Internet passively, we put an observer in the real world with full control over the scene complexities. Users walk around freely and observe objects from multiple angles. Lighting can be adjusted. Objects can be added to and/or removed from the scene to create rich compositions. Our tool opens new possibilities for preparing data for complex scenes.

We also study the challenges of deploying object detectors in real-world scenes: detecting curb ramps in street view images. We propose a system, Tohme, that combines detection results from detectors with human crowdsourcing verifications. One core component is a meta-classifier that estimates the complexity of a scene and assigns it to humans (accurate but costly) or computers (low cost but error-prone) accordingly.

One of the insights from Tohme is that context is crucial for detecting objects. To understand the complex relationship between objects and their environment, we propose a standalone context model that predicts where an object can occur in an image. Combined with object detection, this model can find regions where an expected object is missing. It can also be used to find out-of-context objects.

To take a step beyond single object based detections, we explicitly model the geometrical relationships between groups of objects and use the layout information to represent scenes as a whole. We show that such a strategy is useful in retrieving indoor furniture scenes with natural language inputs.


Background

Computer vision is changing the world. With the ever-growing amount of visual data available (e.g., Google Street View, Flickr, and Instagram), powerful new dedicated hardware (e.g., Nvidia GPUs and Intel deep learning chips), and evolving machine learning algorithms with highly nonlinear characteristics (deep neural networks), we stand at the dawn of a new era in which intelligent systems will substantially improve society's quality of life. For example, self-driving cars, which could reduce traffic and save lives once they reach the mass market, use object detection, semantic segmentation, and 3D reconstruction algorithms from computer vision; grocery stores use facial recognition and tracking algorithms to enable a cashier-free shopping experience; and segmentation and classification algorithms are used in medical imaging to diagnose diseases.

As one of the fundamental problems in computer vision, object detection attracts a great deal of research attention and has great practical impact. The task is to find the positions of all objects of interest in input images. Usually, object locations are represented by rectangular boxes that tightly fit the objects, as in SSD and YOLO. Another choice is to produce a pixel-level probability map of where an object is; for example, human part detection often uses heat maps to account for uncertainty in predicting different parts.
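To make the box representation concrete, here is a minimal sketch (not tied to any specific detector) of how a bounding box can be stored as corner coordinates, and how the standard intersection-over-union (IoU) measure quantifies how tightly a predicted box fits a ground-truth box:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two partially overlapping 10x10 boxes share a 5x5 patch: IoU = 25/175 = 1/7.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

IoU is the usual criterion for deciding whether a box-based detection counts as correct (e.g., a threshold of 0.5 in PASCAL VOC evaluation).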

Object detection is an essential building block in many computer vision applications. Once the location of an object is known, more sophisticated analysis, such as person identification or attribute analysis, can be performed on the identified object regions. Because such analysis takes more computational resources, it is crucial to reduce the computational load by focusing on true object locations. Performing object detection efficiently is also important, especially for time-sensitive applications such as obstacle detection in self-driving cars.

The main goal of this dissertation is to investigate and develop algorithms to improve object detection in complex scenes.

Challenges in Real-World Object Detection

Although extensively studied, object detection is still a difficult task in complex real-world scenes for the following reasons.

Standard object detection datasets such as PASCAL VOC and COCO collect images from the Internet. General-purpose object detectors trained on those data are assumed to be useful in real-world scenes. However, such detectors hardly work out of the box for objects with arbitrary viewing angles and lighting conditions. There is a reality gap, in terms of objects' appearance variations, between images collected from the Internet and test images in real-world applications. For example, there are barely any back-view images of computer monitors in the PASCAL VOC data, while in a real use case a detector should work for monitors in any orientation.

While state-of-the-art object detection systems certainly perform reasonably well, some applications require even higher accuracy. How to combine human knowledge with detection systems to improve the performance of a fully automatic system remains an open research problem.

Objects in an unconstrained environment often appear together with other objects, resulting in a cluttered scene. For example, a laptop might sit on top of a desk, surrounded by monitors, books, and keyboards. Without knowledge of which objects tend to appear together, detecting each object individually can be difficult due to occlusion and distraction from similar-looking objects.

The relationships between objects provide useful information for localizing them in a complex scene. Modeling objects' interactions, however, is not an easy task. One key challenge is that the parametric space is prohibitively large: O(K²) for pairwise relations, and O(Kⁿ) for n-tuple relations of K objects.
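To make this growth concrete, the number of unordered pairwise relations among K objects is K(K−1)/2, and the number of n-tuples is the binomial coefficient C(K, n). A quick check with Python's standard library shows how fast these counts grow:

```python
from math import comb

# Count pairwise and triple relations for increasing numbers of objects K.
for k in (5, 10, 20):
    print(f"K={k:2d}: pairs={comb(k, 2):4d}, triples={comb(k, 3):5d}")
```

Even at K = 20 objects there are 190 pairs and over a thousand triples, which is why explicit n-tuple relation models quickly become intractable.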

Most state-of-the-art object detectors treat objects in a scene as individuals. For example, region-proposal-based approaches such as Faster RCNN discard information outside the proposal boxes. In an end-to-end bounding box regression approach such as SSD, global information is used, but the relationships between objects are not modeled explicitly.

Understanding interactions between multiple objects is a challenging task, yet it is naturally the next research step over the current single detection based methods.

Towards Better Performance in Complex Scenes

While object detection is often done by looking at isolated local regions of an image, we believe it is crucial to understand the elements of the whole scene that affect how we find objects.

When an object appears in a scene, viewpoint and lighting changes alter its appearance. The presence of other objects provides information about whether a related object is likely to co-occur. Some objects are subject to strict constraints on where they can occur: for example, a train should be on rail tracks.

Understanding the characteristics of a scene helps a detection system to: 1) improve performance on objects with large variations; 2) reduce false detections by considering a typical environment of objects; 3) detect object layouts as an intermediate representation of scenes. This dissertation aims to study these effects of scene complexity on object detection tasks.

To facilitate collecting training data with large variations in poses and lighting conditions, we design a novel user interface, ARLabeler, utilizing the power of Augmented Reality (AR) devices. Instead of labeling images from the Internet passively, we put an observer in the real world and collect training labels for object detection with full control over the scene complexities. Users walk around freely and observe objects from multiple angles. Lighting can be adjusted. Objects can be added to and/or removed from the scene to create rich compositions. Our labeling tool opens new possibilities for preparing data for complex scenes.

We also study the challenges of deploying state-of-the-art object detectors in real-world scenes. In particular, our task is to detect curb ramps in street view images. This is challenging because the object of interest is not visually salient and street view images contain many distractors. We propose a system, Tohme, that combines detection results from computer vision algorithms with human crowdsourcing verifications for this task. One core component is a meta-classifier that estimates the complexity of a scene and assigns detection tasks to humans (accurate but costly) or computers (low cost but error-prone) accordingly.
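The routing idea behind such a meta-classifier can be sketched abstractly. The threshold and the `route_scene` helper below are hypothetical illustrations, not the actual mechanism used in Tohme: a complexity score per scene decides whether the (costly, accurate) human path or the (cheap, error-prone) automatic path handles it:

```python
def route_scene(complexity_score, threshold=0.5):
    """Hypothetical workflow router: scenes the detector is expected
    to fail on go to crowd workers; easy scenes stay automatic."""
    return "human" if complexity_score >= threshold else "computer"

# Route a batch of scenes by their estimated complexity scores.
scores = (0.1, 0.7, 0.4, 0.9)
assignments = [route_scene(s) for s in scores]
print(assignments)
```

The interesting part of the real system is how the complexity score itself is learned; the routing step, once a score exists, is this simple cost/accuracy trade-off.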

One of the insights from Tohme is that context is very important for detecting some objects. For example, driveway ramps are visually almost identical to curb ramps, but they occur in very different contexts: driveway ramps are attached to houses, while curb ramps sit at street intersections. To understand the complex relationship between objects and their environment, we propose a standalone context model that predicts where an object can occur in an image. Combining this model with object detection reduces false positives. It can also be used for novel tasks such as finding regions where an expected object is missing, or finding out-of-context objects.

To take a step beyond single object based detections, we explicitly model the geometrical relationships between groups of objects and use the layout information to represent scenes as a whole. We show that such a strategy is useful in retrieving indoor furniture scenes with natural language inputs.

In the following subsections, we give a detailed introduction to each work and its connection to the main goal of this dissertation: finding objects in complex scenes.