Data and algorithms are ubiquitous in all scientific, industrial and personal domains. Data now come in many forms (text, images, video, web, sensors, etc.), are massive, and require increasingly complex processing beyond mere indexing or the computation of simple statistics, such as recognizing objects in images or translating texts.
Cumulative Distribution Networks: Inference and Estimation of Graphical Models for Cumulative Distribution Functions
I present a class of graphical models for directly representing the joint cumulative distribution function (CDF) of many random variables, called "cumulative distribution networks" (CDNs). I will present properties of such graphical models, such as efficient marginalization, marginal and conditional independence and connections to extreme value theory.
In the context of variable selection for model-based clustering, the problem of comparing two nested subsets of variables is recast as a model comparison problem and addressed using approximate Bayes factors. A greedy search algorithm is proposed for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the clustering model simultaneously, for either continuous or discrete data. We present the results of applying the method to several datasets; removing irrelevant variables often improves performance.
In this paper we give an asymptotic theory for the eigenvalues of the sample covariance matrix of a multivariate time series when the number of components p goes to infinity with the sample size. The time series constitutes a linear process across time and between components. The input noise of the linear process has regularly varying tails with index between 0 and 4; in particular, the time series has infinite fourth moment.
Prediction problems typically assume the training data are independent samples, but in many modern applications samples come from individuals connected by a network. For example, in adolescent health studies of risk-taking behaviors, information on the subjects’ social networks is often available and plays an important role through network cohesion, the empirically observed phenomenon of friends behaving similarly. Taking cohesion into account should allow us to improve prediction.
Modern datasets are often in the form of matrices or arrays, potentially having correlations along each set of data indices. For example, researchers often gather relational data measured on pairs of units, where the population of units may consist of people, genes, websites or some other set of objects. Multivariate relational data include multiple relational measurements on the same set of units, possibly gathered under different conditions or at different time points. Such data can be represented as a multiway array, or tensor.
Suppose you have constructed a data-based estimation rule, perhaps a logistic regression model, and would like to know its performance as a predictor of future cases. There are two main theories concerning prediction error: (1) methods like Cp, AIC, and SURE (Stein's Unbiased Risk Estimator) that relate to the covariance between data points and the corresponding predictions, and (2) cross-validation and related techniques such as the nonparametric bootstrap. This talk concerns the relationship between the two theories.
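As a toy illustration of the two theories, the sketch below computes both a Mallows' Cp-style covariance-penalty estimate and the closed-form leave-one-out cross-validation error for simple least-squares regression. The data and the assumption of a known noise variance are hypothetical, purely for illustration:

```python
# Two routes to estimating prediction error for simple OLS (one predictor + intercept).
# Route 1 (covariance penalty): a Cp-style estimate using a known noise variance.
# Route 2 (resampling): leave-one-out cross-validation, closed form via leverages.

def fit_ols(x, y):
    """Least-squares fit of y = a + b*x; also returns xbar and Sxx for leverages."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    return intercept, slope, xbar, sxx

def prediction_error_estimates(x, y, sigma2):
    """Return (cp, loocv): the covariance-penalty and cross-validation estimates."""
    n, p = len(x), 2  # p = number of fitted parameters (intercept and slope)
    a, b, xbar, sxx = fit_ols(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    rss = sum(r ** 2 for r in resid)
    cp = rss / n + 2 * p * sigma2 / n  # covariance-penalty route
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # hat-matrix diagonal
    loocv = sum((r / (1 - h)) ** 2 for r, h in zip(resid, lev)) / n  # CV route
    return cp, loocv
```

Both quantities estimate the same target, the error on future cases, but by different accounting: one penalizes apparent error, the other re-fits with each point held out.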
We discuss trend filtering, a recently proposed tool of Kim et al. (2009) for nonparametric regression. The trend filtering estimate is defined as the minimizer of a penalized least squares criterion, in which the penalty term sums the absolute kth order discrete derivatives over the input points.
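A minimal sketch of the trend filtering criterion, assuming unit-spaced input points. The helper computes kth-order discrete differences; the objective is the penalized least squares criterion described above. The actual estimate minimizes this over beta with a convex solver, which is not shown here:

```python
def diff_k(v, k):
    """k-th order discrete differences of a sequence (k=1: v[i+1] - v[i])."""
    for _ in range(k):
        v = [v[i + 1] - v[i] for i in range(len(v) - 1)]
    return v

def trend_filter_objective(beta, y, lam, k=2):
    """Penalized least squares criterion for trend filtering:
    0.5 * sum_i (y_i - beta_i)^2 + lam * sum |D^k beta|,
    where D^k is the k-th order discrete difference operator."""
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    penalty = lam * sum(abs(d) for d in diff_k(list(beta), k))
    return fit + penalty
```

For k = 2 the penalty vanishes exactly on piecewise-linear fits, which is why the resulting estimates are piecewise polynomials with adaptively chosen knots.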
Inference problems with incomplete observations often aim at estimating population properties of unobserved quantities. One simple way to accomplish this estimation is to impute the unobserved quantities of interest at the individual level and then take an empirical average of the imputed values. We show that this simple imputation estimator can provide partial protection against model misspecification. The main advantages of imputation estimation are its universality, ease of implementation, and computational efficiency.
Part 1: Overview of Adaptive Algorithms Lab at the University of Toronto --- Part 2: ALGONQUIN: Fast Variational Methods for Robust Speech Recognition or Variational Methods Explained using Cartoons
Part 1: Overview of Adaptive Algorithms Lab at the University of Toronto.
I'll review the projects underway in my new group at the University of Toronto. These projects include Bayesian networks for video processing, codes on graphs and iterative algorithms, probabilistic phase unwrapping for MRI and SAR imaging, variational techniques for speech recognition in noisy environments, automated medical diagnosis and the open QMR network.
Part 2: ALGONQUIN: Fast Variational Methods for Robust Speech Recognition or Variational Methods Explained using Cartoons.
We consider the problem of estimating a normal mean constrained to be in a convex polyhedral cone in Euclidean space. We say that the true mean is sparse if it belongs to a low dimensional face of the cone. We show that, in a certain natural subclass of these problems, the maximum likelihood estimator automatically adapts to sparsity in the underlying true mean. We discuss the problems of convex regression and univariate and bivariate isotonic regression as examples.
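In the univariate isotonic case, the maximum likelihood (least squares) estimator over the cone of nondecreasing sequences can be computed exactly by the classical pool-adjacent-violators algorithm. A minimal pure-Python sketch:

```python
def isotonic_fit(y):
    """Least-squares projection of y onto nondecreasing sequences via the
    pool-adjacent-violators algorithm (PAVA). Adjacent blocks whose averages
    violate monotonicity are merged; each block then takes its mean value."""
    blocks = []  # each block is [total, count]; its fitted value is total/count
    for v in y:
        blocks.append([v, 1])
        # merge backwards while the previous block's average exceeds this one's
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out
```

The "sparsity" the talk refers to corresponds here to the fit having few distinct blocks, i.e. lying on a low-dimensional face of the monotone cone.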
The capability of flow cytometry to offer rapid quantification of multidimensional characteristics for millions of cells has made this technology indispensable for health research, medical diagnosis, and treatment. However, the lack of statistical and bioinformatics tools to parallel recent high-throughput technological advancements has hindered this technology from reaching its full potential. Traditional methods for flow cytometry (FCM) data processing have relied on manual gating of cell events to define cell populations for statistical analysis.
Advisor: Thomas Richardson
Guaranteed Learning of Latent Variable Models: Overlapping Community Models and Overcomplete Representations
Incorporating latent or hidden variables is a crucial aspect of statistical modeling. I will present a statistical and a computational framework for guaranteed learning of a wide range of latent variable models. I will focus on two instances, viz., community detection and overcomplete representations.
There have been several recent instances of papers in the scientific literature insisting that significance tests are misleading. I will first go back to some of the original writing (by Fisher, Jeffreys, Neyman and Pearson) on testing. Then I will discuss the issue of how to set the null hypothesis, the logic of tests of significance as part of the scientific method, and the distinction between real correlation and physical relationship.
Advisor: Peter Guttorp
Machine learning from modern datasets presents novel opportunities and challenges. Larger and more diverse datasets enable us to answer more complex statistical questions, but present computational challenges in designing algorithms that can scale. In this talk I will present two results, the first one about computational challenges and the second about an opportunity enabled by modern datasets in the context of representation learning.
Smooth James-Stein Model Selection in Wavelet Smoothing, Parametric Linear Model and Inverse Problem
Motivated by a gravitational wave burst detection problem from several detectors, we derive smooth James-Stein (SJS) thresholding-based estimators in three settings: nonparametric and parametric regression, and inverse problems. SJS estimators enjoy smoothness like ridge regression and perform variable selection like the lasso. They have added flexibility thanks to more than one regularization parameter, and the ability to select these parameters well thanks to an unbiased and smooth estimate of the risk.
Advisors: Judy Zeh & David Madigan
Modern technologies generate vast amounts of data at unprecedented speed. This ubiquitous technological trend is driving the need for increasingly sophisticated algorithms to find subtle statistical patterns in massive amounts of data, and extract actionable information.
Examples of such problems arise in healthcare, social networks, and recommendation systems.
The human microbiome is the collection of microorganisms which live inside and on top of us; recent studies have established the centrality of the microbiome to human health. These studies raise a number of questions: Given a collection of samples from a single body location, which samples indicate a healthy versus unhealthy phenotype? Do these samples fall into natural "types" from which one can generalize? What are the "units" of microbial communities, and what are the significant synergistic and antagonistic interactions between microbes?
Advisor: Finbarr O'Sullivan
How can we maximally leverage available resources--such as computation, communication, multiprocessors, or even privacy--when performing machine learning? In this talk, I will suggest statistical risk (a rigorous notion of the accuracy of learning procedures) as a way to incorporate such criteria in a framework for development of algorithms.
Advisor: Jon Wellner

We discuss applications of convex analysis to shape constrained density estimation. The dissertation consists of three parts. In the first part we introduce convex transformed densities as a multivariate generalization of known classes of densities defined by shape constraints based on convexity. We study the properties of the nonparametric maximum likelihood estimator of a convex-transformed density in several dimensions and prove basic properties: existence and consistency.
Advisor: Finbarr O'Sullivan
In this talk I will consider learning with lower information costs, focusing on linear regression. Linear regression is one of the most widely used methods for prediction and forecasting, with widespread uses in many fields such as natural sciences, economy and medicine. I will show how to improve the information costs of linear regression in two settings. First, I will present a new estimation algorithm for the standard supervised regression setting.
Advisor: Mark Handcock
Advisor: Julian Besag
Adapting group sequential methods to observational drug and vaccine safety surveillance studies using large electronic healthcare data
Gaps in medical product safety evidence have spurred the development of new national post-licensure systems that prospectively monitor large observational cohorts of health plan enrollees. These multi-site systems, which include CDC's Vaccine Safety Datalink (VSD) and FDA's Mini-Sentinel (MS) Pilot Program for the Sentinel Initiative, attempt to leverage the vast amount of administrative and clinical information that is captured during the course of routine medical care and contained within computerized health plan databases.
We develop a framework for the modeling of high-dimensional data that is robust to a variety of data types and modeling paradigms. In particular, we focus on several classes of models that each employ conditional independence assumptions to derive estimators. In this presentation we pay particular attention to the problem of model averaging in instrumental variable models, which rely on conditional independence assumptions between subsets of variables to form estimators possessing desirable properties for causal inference.
Advisor: David Ford
Probabilistic topic models provide a suite of tools for analyzing large document collections. Topic modeling algorithms discover the latent themes that underlie the documents and identify how each document exhibits those themes. Topic modeling can be used to help explore, summarize, and form predictions about documents.
Tandem mass spectrometry experiments generate from thousands to millions of spectra that can be used to identify the presence of proteins in complex samples. In this work, we propose a new method to identify peptides based on clustered tandem mass spectrometry data. In contrast to previously proposed approaches, which identify one representative spectrum for each cluster using traditional database searching algorithms, our method scores all the spectra in a cluster against candidate peptides using Bayesian model selection.
Drawing upon fields as disparate as economics, psychology, operations research and statistics, the subfield of statistical machine learning has provided practically successful tools ranging from search engines to medical diagnosis, image processing, speech recognition, and a wide array of problems in science and engineering. However, over the past decade, faced with modern data settings, off-the-shelf statistical machine learning methods have frequently proven insufficient.
Clustering involves placing entities into mutually exclusive categories. We wish to relax the requirement of mutual exclusivity, allowing objects to belong simultaneously to multiple classes, a formulation that we refer to as "feature allocation." The first step is a theoretical one. In the case of clustering the class of probability distributions over exchangeable partitions of a dataset has been characterized (via exchangeable partition probability functions and the Kingman paintbox).
Advisor: Mark Handcock

The current A (H1N1) influenza pandemic has posed questions to policymakers about the most effective interventions and resource mobilization strategies. Furthermore, mutation of the A (H5N1) "avian" influenza virus could also cause a pandemic with an estimated 60% case mortality rate in humans, requiring fast analysis of intervention and containment strategies. A stochastic simulation model can help determine the best strategies in case of a pandemic.
Deep learning and unsupervised feature learning offer the potential to transform many domains such as vision, speech, and natural language processing. However, these methods have been fundamentally limited by our computational abilities, and typically applied to small-sized problems. In this talk, I describe the key ideas that enabled scaling deep learning algorithms to train a large model on a cluster of 16,000 CPU cores (2000 machines). This network has 1.15 billion parameters, which is more than 100x larger than the next largest network reported in the literature.
In genomic sciences, the amount of data has grown faster than statistical methodologies necessary to analyze those data. Furthermore, the complex underlying structure of these data means that simple, unstructured statistical models do not perform well. We consider the problem of identifying multiple, functionally independent, co-localized genetic regulators of gene transcription. Sparse regression techniques have been critical to multi-SNP association mapping because of their computational tractability in large data settings.
Mathematical models are instrumental in understanding the progression of contact-based infectious diseases. These models implicitly rely on an underlying network of partnerships, and it is important for these partnership networks to reflect important structures in the population of interest. In particular, concurrent partnerships and assortative mixing can greatly affect the prevalence of a disease in a population and explain heterogeneity we observe. We show how mixing totals and concurrency, as given by momentary degree distributions, can simultaneously be estimated from egocentric data.
Game theory is a rich mathematical framework to model and analyze the interactions of multiple decision makers with possibly conflicting objectives. Finite games in strategic form (i.e., those with a finite number of players, each with finitely many possible actions, that simultaneously and independently choose their action) are particularly important and well-studied.
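As a small concrete instance of a finite game in strategic form, a two-player game can be stored as a pair of payoff matrices, and its pure-strategy Nash equilibria found by checking best responses. This is only a sketch of the basic object; the talk itself may concern richer solution concepts:

```python
def pure_nash(payoff_a, payoff_b):
    """Enumerate pure-strategy Nash equilibria of a two-player finite game.
    payoff_a[i][j] and payoff_b[i][j] are the row and column players' payoffs
    when row plays action i and column plays action j."""
    m, n = len(payoff_a), len(payoff_a[0])
    eqs = []
    for i in range(m):
        for j in range(n):
            # (i, j) is an equilibrium iff neither player gains by deviating alone
            row_best = all(payoff_a[i][j] >= payoff_a[k][j] for k in range(m))
            col_best = all(payoff_b[i][j] >= payoff_b[i][l] for l in range(n))
            if row_best and col_best:
                eqs.append((i, j))
    return eqs
```

For the prisoner's dilemma, this enumeration recovers mutual defection as the unique pure equilibrium, even though both players would prefer mutual cooperation.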
In classical quantitative genetics, the correlation between the phenotypes of individuals with unknown genotypes and a known pedigree relationship is expressed in terms of probabilities of IBD states. In existing models of the inverse problem where genotypes are observed but pedigree relationships are not, probabilities and correlations have either a Bayesian or a hybrid interpretation. We introduce a generative evolutionary model of the inverse problem based on the classic infinite allele mutation process, IBF (Identity by Function).
I will show an application of statistical learning to identification of different cell populations in lymph node images taken with a microscope using hyperspectral imaging. The whole system called GemIdent is available as a Java package and enables the identification and localization of millions of cells of up to 5 different types.
This is joint work with Adam Kapelner and Adam Guetz.
We consider the problem of regression in three scenarios: (a) random design under the assumption that the model F is well-specified, (b) distribution-free statistical learning with respect to a reference class F; and (c) online regression with no assumption on the generative process. The first problem is often studied in the literature on nonparametric estimation, the second falls within the purview of statistical learning theory, and the third is studied within the online learning community.
Faculty Host: Carlos Guestrin
Stat Liaison: Emily Fox

Abstract: Society is witnessing remarkable technological and scientific advances as numerous disciplines are adopting more advanced statistical and computational methodologies. Along with this progress comes an increasing need for scalable algorithms with solid theoretical foundations; the hope is that algorithms which address efficiency (with regards to both statistical and computational perspectives) can further facilitate breakthroughs.
Exact probabilistic inference for graphical models is known to be NP-hard. For dense graphs with many cycles, one has to resort to tractable approximate methods such as loopy belief propagation. It has been shown that loopy BP is equivalent to the minimization of the so-called Bethe free energy in variational methods. First, a review of some recently developed approximate algorithms in this context will be given. Then we will present some theory about the Bethe region graph and a convex relaxation method for energy minimization.
In this talk, I will explore the state of the art in the analysis and modeling of player tracking data in the NBA. In the past, player tracking data has been used primarily for visualization, such as understanding the spatial distribution of a player's shooting characteristics, or to extract summary statistics, such as the distance traveled by a player in a given game. In this talk, I will present how we're using advanced statistics and machine learning tools to answer previously unanswerable questions about the NBA.
Host: Daniela Witten, Tyler McCormick
Respondent-Driven Sampling employs a variant of a link-tracing network sampling strategy to collect data from hard-to-reach populations. By tracing the links in the underlying social network, the process exploits the social structure to expand the sample and reduce its dependence on the initial (convenience) sample. Current estimation focuses on estimating population averages in the hard-to-reach population. These estimates are based on strong assumptions allowing the sample to be treated as a probability sample.
One of the challenges of building statistical models for large data sets is balancing the correctness of inference procedures against computational realities. In the context of Bayesian procedures, the pain of such computations has been particularly acute as it has appeared that algorithms such as Markov chain Monte Carlo necessarily need to touch all of the data at each iteration in order to arrive at a correct answer. Several recent proposals have been made to use subsets (or "minibatches") of data to perform MCMC in ways analogous to stochastic gradient descent.
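One concrete method in this family is stochastic gradient Langevin dynamics (Welling and Teh, 2011), which perturbs minibatch gradient steps with injected Gaussian noise so that the iterates approximately sample the posterior. A sketch for the simple case of a Gaussian mean with known unit variance; the hyperparameter values are hypothetical:

```python
import math
import random

def sgld_gaussian_mean(data, n_steps=2000, batch_size=10, eps=1e-3, prior_var=100.0):
    """Stochastic gradient Langevin dynamics for the posterior of a Gaussian
    mean with known unit variance and a N(0, prior_var) prior.
    Each step touches only a minibatch of the data, not the full dataset."""
    n = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        batch = random.sample(data, batch_size)
        # unbiased minibatch estimate of the full-data log-likelihood gradient
        grad_lik = (n / batch_size) * sum(x - theta for x in batch)
        grad_prior = -theta / prior_var
        # gradient step plus injected noise with variance eps
        theta += 0.5 * eps * (grad_lik + grad_prior) + random.gauss(0.0, math.sqrt(eps))
        samples.append(theta)
    return samples
```

The noise term is what distinguishes this from plain stochastic gradient descent: without it the iterates collapse to the posterior mode; with it they wander with approximately the posterior's spread.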
Random matrices now play a role in many areas of theoretical, applied, and computational mathematics. Therefore, it is desirable to have tools for studying random matrices that are flexible, easy to use, and powerful. Over the last fifteen years, researchers have developed a remarkable family of results, called matrix concentration inequalities, that balance these criteria.