VISUAL ANALYTICS FOR OPEN-ENDED TASKS IN TEXT MINING

Postgraduate

ABSTRACT

An overview of documents built with topic modeling and multidimensional scaling helps in understanding how topics are distributed. While we can spot clusters visually, it is challenging to characterize them. My research investigates an interactive method for identifying clusters by assigning attributes and examining the resulting distributions.

ParallelSpaces examines how topic modeling can be understood when applied to Yelp business reviews, where businesses and their reviews each constitute a separate visual space. Exploring these spaces enables the characterization of each space using the other. However, the scatterplot-based approach in ParallelSpaces does not generalize to categorical variables because of overplotting. For those cases, my research proposes an improved layout algorithm in our follow-up work, Gatherplots, which eliminates overplotting in scatterplots while preserving individual objects. Another limitation of clustering methods is that the number of clusters is fixed as a hyperparameter. TopicLens is a Magic Lens-type interaction technique in which the documents under the lens are clustered by topic in real time. While ParallelSpaces helps characterize the clusters, the available attributes are sometimes limited. To extend the analysis by creating a custom mixture of attributes, we present CommentIQ, a comment moderation tool in which moderators can adjust model parameters according to the context or goals. To help users analyze documents semantically, we develop a technique for user-driven text mining that builds a dictionary for topics or concepts. This follow-up study, ConceptVector, uses word embeddings to generate dictionaries interactively and then uses those dictionaries to analyze the documents.

My dissertation contributes interactive methods for overviewing documents that integrate the user into text mining loops that are currently non-interactive. The case studies presented in this dissertation provide concrete, operational techniques for directly improving several state-of-the-art text mining algorithms. We summarize the generalizable lessons from these studies and discuss the limitations of the visual analytics approach.

Introduction

An open-ended task can be defined as a task for which a single true answer is inappropriate and several answers are acceptable, depending on the context and situation. The outputs of such tasks are not bound to a fixed set of possible answers. Closed-ended tasks are the opposite: there is a set of possible answers, and given a ground truth it is trivial to check whether an answer is correct. One main challenge with open-ended tasks is that it is hard to judge the correctness of an output automatically. For example, explaining why a comment or photo is funny may be an intractable computer science problem, at least with current technology. Humans, on the other hand, have no problem doing these types of open-ended tasks because of their common sense, their sympathy, and their ability to perceive semantics and context. However, human judgment is challenging and costly to scale.


Open-ended Tasks

If we compare humans and machines, there are things that machines can do much better than humans, such as arithmetic operations. The list of such tasks is growing rapidly as tasks that were once thought impossible for machines are solved with novel algorithms, such as the neural-network-based model that beat a human champion at Go. Some tasks, however, remain as difficult for machines as they are for humans. Examples are NP-hard problems, such as the “traveling salesman” problem. The theoretical limits that keep us from (easily) solving these problems have been, to some extent, circumvented by methods that approximate the solution with a reasonable amount of resources.
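
As a small illustration of trading exactness for tractability, the sketch below applies the nearest-neighbor heuristic to a toy traveling salesman instance. The text does not name a particular approximation method, so this heuristic is my own choice for illustration. The function names and the city coordinates are hypothetical, and the tour it returns is not guaranteed to be optimal or even close to optimal.

```python
import math


def tour_length(points, order):
    """Total length of the closed tour that visits points in the given order."""
    n = len(order)
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % n]])
        for i in range(n)
    )


def nearest_neighbor_tour(points, start=0):
    """Greedy heuristic: from the current city, move to the closest unvisited one.

    Runs in O(n^2) time instead of enumerating all (n - 1)! tours,
    at the cost of giving up any guarantee of optimality.
    """
    unvisited = set(range(len(points))) - {start}
    order = [start]
    while unvisited:
        here = points[order[-1]]
        # Ties are broken by the lower city index so the result is deterministic.
        nxt = min(unvisited, key=lambda j: (math.dist(here, points[j]), j))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


# Hypothetical toy instance: five cities in the plane.
cities = [(0, 0), (0, 1), (2, 1), (2, 0), (1, 3)]
order = nearest_neighbor_tour(cities)
length = tour_length(cities, order)
```

The heuristic produces a valid tour quickly, which is often enough in practice, but checking how far that tour is from the best possible one still requires either an exact solver or a known bound.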