SOFTWARE INFRASTRUCTURE FOR VISUAL AND INTEGRATIVE ANALYSIS OF MICROBIOME DATA

Diploma

ABSTRACT

Microbiome sequencing allows researchers to reconstruct bacterial community census profiles at a higher resolution than earlier methodologies allowed. As a result, increasingly large numbers of these taxonomic community profiles are now generated, analyzed, and published by researchers in the field. In this work, I present new methods and software infrastructure for visualizing and sharing microbiome data. The overall goal is to enable a researcher to complete cycles of exploratory and confirmatory analysis over metagenomic data. I describe Metaviz, an interactive statistical and visual analysis tool specifically designed for effective navigation of the taxonomic hierarchy and selection of features for data analysis. I next detail the incorporation of Metaviz into the Human Microbiome Project Data Portal. I then show a novel method, built as an extension of Metaviz, to visualize longitudinal data across multiple features. Finally, because previous work has shown that specific subjects in an experimental cohort can be identified from their microbiome data, I developed software that uses a secure multiparty computation library to complete comparative analyses of metagenomic data across cohorts without directly revealing feature count values for individuals.


Introduction

Microbiome sequencing

A microbiome is the collection of microbial organisms in an environment. High-throughput DNA sequencing provides a mechanism to generate a microbial community census. Current research focuses on identifying the microbiome in human body sites and in different ecological domains. For human health, studies are designed either as large observational epidemiological studies or as smaller controlled experiments. Initial large observational studies focused on identifying the microbiome of healthy individuals, examining known pathogens and detecting novel ones in diarrheal diseases, and observing the relationship between obesity and an individual’s microbiome. One large epidemiological study of note is the Global Enteric Multi-Center Study, which gathered stool samples from children with diarrheal disease and matched controls in four countries to identify associations between microbiome structure and disease status. Another prominent study examined the microbiome of individuals with Inflammatory Bowel Disease, with a focus on Crohn’s disease. Recent and ongoing work in the field investigates the feasibility and effectiveness of modifying the microbiome of an organism to potentially alter host health.

Researchers create a microbiome profile for a community by first taking a sample and extracting DNA. Next, one of two high-throughput sequencing methods is employed. The first method, marker gene sequencing, amplifies specific variable regions of the 16S ribosomal RNA gene. After the amplified products are sequenced, the reads are clustered and annotated against a taxonomic reference database. The number of times a given taxonomic unit is observed in each sample is tallied into a count table that serves as the main object of downstream analysis. The other method is whole metagenome shotgun sequencing. The reads from this approach are aligned to reference genomes, assembled, used in k-mer-based taxonomic classification, or compared against clade-specific gene catalogues to produce taxonomic profiles. Marker gene surveys are more accessible than whole metagenome shotgun sequencing and are used more often. Metagenome sequencing allows for gene-level resolution and functional profiling, whereas marker gene surveys must rely on inferred estimates of function.
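
To make the count table concrete, the following is a minimal sketch of the tallying step, assuming each annotated read has already been reduced to a (sample, taxonomic unit) pair. The function name `build_count_table`, the sample identifiers, and the example annotations are hypothetical and chosen only for illustration; this is not the pipeline used to produce any dataset discussed in this work.

```python
from collections import Counter


def build_count_table(assignments):
    """Tally annotated reads into a taxon-by-sample count table.

    `assignments` is an iterable of (sample_id, taxon) pairs, one pair per
    annotated read. Returns (samples, taxa, rows), where rows[i][j] is the
    number of reads from samples[j] that were assigned to taxa[i].
    """
    pair_counts = Counter(assignments)
    samples = sorted({sample for sample, _ in pair_counts})
    taxa = sorted({taxon for _, taxon in pair_counts})
    # A Counter returns 0 for pairs that were never observed, so taxa that
    # are absent from a sample get an explicit zero in the table.
    rows = [[pair_counts[(sample, taxon)] for sample in samples] for taxon in taxa]
    return samples, taxa, rows


# Hypothetical annotations, made up for illustration only.
example_assignments = [
    ("sample_A", "Bacteroides"),
    ("sample_A", "Bacteroides"),
    ("sample_A", "Prevotella"),
    ("sample_B", "Bacteroides"),
    ("sample_B", "Escherichia"),
]

samples, taxa, rows = build_count_table(example_assignments)

# Print the table with taxa as rows and samples as columns.
print("taxon", *samples, sep="\t")
for taxon, row in zip(taxa, rows):
    print(taxon, *row, sep="\t")
```

Rows correspond to taxonomic units and columns to samples, so a taxonomic unit that was never observed in a sample appears as an explicit zero rather than a missing entry.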