Can machine learning survive the artificial intelligence revolution?

Speaker: Francis Bach

Data and algorithms are ubiquitous in all scientific, industrial, and personal domains. Data now come in multiple forms (text, images, video, web, sensors, etc.), are massive, and require increasingly complex processing that goes beyond mere indexing or the computation of simple statistics, such as recognizing objects in images or translating texts.

Room: 105

Cumulative Distribution Networks: Inference and Estimation of Graphical Models for Cumulative Distribution Functions

Speaker: Jim Huang

I present a class of graphical models, called "cumulative distribution networks" (CDNs), for directly representing the joint cumulative distribution function (CDF) of many random variables. I will discuss properties of these models, including efficient marginalization, marginal and conditional independence, and connections to extreme value theory.
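
For orientation, a CDN factors the joint CDF itself into a product of local functions over the graph; a minimal statement of the idea, as I understand the construction, is

\[ F(x_1, \dots, x_n) \;=\; \prod_{c \in \mathcal{C}} \phi_c(x_c), \]

where each \(\phi_c\) depends only on its own subset of variables \(x_c\) and the factors are chosen so that the product is a valid CDF.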

Room: 045

Variable Selection for Model-Based Clustering

Speaker: Nema Dean

In the context of variable selection for model-based clustering, the problem of comparing two nested subsets of variables is recast as a model comparison problem and addressed using approximate Bayes factors. A greedy search algorithm is proposed for finding a local optimum in model space. The resulting method selects the variables (or features), the number of clusters, and the clustering model simultaneously, for either continuous or discrete data. We present the results of applying the method to several datasets; in general, removing irrelevant variables often improves performance.
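
As a rough illustration of the greedy step (my sketch of the general recipe, not the authors' implementation), the approximate Bayes factor for a candidate variable can be computed from BIC differences between "the candidate helps define the clusters" and "the candidate is explained by the current clustering variables":

    # Approximate (2 x log) Bayes factor for adding one candidate variable,
    # using BIC as the approximation; illustrative sketch only.
    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.linear_model import LinearRegression

    def best_clustering_bic(X, max_k=5):
        # Best BIC over the number of mixture components (negated so that
        # higher is better, the opposite of sklearn's convention).
        return max(-GaussianMixture(n_components=k, n_init=3, random_state=0)
                   .fit(X).bic(X) for k in range(1, max_k + 1))

    def regression_bic(y, X):
        # BIC of a Gaussian regression of the candidate on the current variables
        # (the "candidate carries no extra clustering information" model).
        n = y.size
        resid = y - LinearRegression().fit(X, y).predict(X)
        loglik = -0.5 * n * (np.log(2 * np.pi * resid.var()) + 1)
        return 2 * loglik - (X.shape[1] + 2) * np.log(n)

    def approx_twice_log_bayes_factor(X_current, x_candidate):
        x = x_candidate.reshape(-1, 1)
        with_candidate = best_clustering_bic(np.hstack([X_current, x]))
        without_candidate = best_clustering_bic(X_current) + regression_bic(x.ravel(), X_current)
        return with_candidate - without_candidate   # > 0 favours adding the candidate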

Room: 031

Big n, Big p: Eigenvalues for Covariance Matrices of Heavy-Tailed Multivariate Time Series

Speaker: Richard Davis

In this paper we give an asymptotic theory for the eigenvalues of the sample covariance matrix of a multivariate time series when the number of components p goes to infinity with the sample size. The time series constitutes a linear process across time and between components. The input noise of the linear process has regularly varying tails with index between 0 and 4; in particular, the time series has infinite fourth moment.
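
A quick numerical illustration of the regime in question (my sketch, not the paper's asymptotics): simulate iid heavy-tailed noise with tail index between 0 and 4 and look at the largest eigenvalues of the sample covariance matrix.

    # Largest eigenvalues of a sample covariance matrix built from
    # heavy-tailed input noise with infinite fourth moment.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, alpha = 2000, 200, 3.0                 # tail index alpha in (0, 4)
    signs = rng.choice([-1.0, 1.0], size=(n, p))
    X = signs * rng.pareto(alpha, size=(n, p))   # symmetric, regularly varying tails
    S = X.T @ X / n                              # sample covariance (mean is ~0)
    print(np.linalg.eigvalsh(S)[-5:])            # a few huge eigenvalues dominate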

Room: 037

Interpretable Prediction Models for Network-Linked Data

Speaker: Liza Levina

Prediction problems typically assume the training data are independent samples, but in many modern applications samples come from individuals connected by a network. For example, in adolescent health studies of risk-taking behaviors, information on the subjects’ social networks is often available and plays an important role through network cohesion, the empirically observed phenomenon of friends behaving similarly. Taking cohesion into account should allow us to improve prediction.
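
One common way to encode network cohesion in a regression (a minimal sketch under that assumption, not necessarily the speaker's exact estimator) is to give each node its own effect and shrink the effects of connected nodes towards each other via the graph Laplacian.

    # Linear regression with node effects alpha penalized by the network
    # Laplacian L: minimize ||y - X beta - alpha||^2 + lam * alpha' L alpha.
    import numpy as np

    def cohesion_regression(X, y, L, lam=1.0):
        n, p = X.shape
        Z = np.hstack([np.eye(n), X])            # joint design for (alpha, beta)
        P = np.zeros((n + p, n + p))
        P[:n, :n] = L                            # penalize only the node effects
        coef, *_ = np.linalg.lstsq(Z.T @ Z + lam * P, Z.T @ y, rcond=None)
        return coef[:n], coef[n:]                # alpha (node effects), beta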

Room: 125

Mean and Covariance Models for Tensor-Valued Data

Speaker: Peter D Hoff

Modern datasets are often in the form of matrices or arrays, potentially having correlations along each set of data indices. For example, researchers often gather relational data measured on pairs of units, where the population of units may consist of people, genes, websites or some other set of objects. Multivariate relational data include multiple relational measurements on the same set of units, possibly gathered under different conditions or at different time points. Such data can be represented as a multiway array, or tensor.

Room: 045

Prediction Error: Covariance Penalties and Cross-Validation

Speaker: Bradley Efron

Suppose you have constructed a data-based estimation rule, perhaps a logistic regression model, and would like to know its performance as a predictor of future cases. There are two main theories concerning prediction error: (1) methods like Cp, AIC, and SURE (Stein's Unbiased Risk Estimator) that relate to the covariance between data points and the corresponding predictions, and (2) cross-validation and related techniques such as the nonparametric bootstrap. This talk concerns the relationship between the two theories.
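
At the heart of the first family of methods is the optimism identity (stated here for squared error as a reminder of the covariance-penalty idea):

\[ \mathbb{E}[\mathrm{Err}] \;=\; \mathbb{E}[\mathrm{err}] \;+\; \frac{2}{n} \sum_{i=1}^{n} \operatorname{Cov}(\hat{y}_i, y_i), \]

where \(\mathrm{err}\) is the apparent (training) error and \(\mathrm{Err}\) the prediction error; Cp, AIC, and SURE estimate the covariance term analytically, while cross-validation and the bootstrap estimate \(\mathrm{Err}\) more directly.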

Room: 105

Adaptive Piecewise Polynomial Estimation via Trend Filtering

Speaker: Ryan Tibshirani

We discuss trend filtering, a recently proposed tool of Kim et al. (2009) for nonparametric regression. The trend filtering estimate is defined as the minimizer of a penalized least squares criterion, in which the penalty term sums the absolute kth order discrete derivatives over the input points.
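
In symbols (for evenly spaced inputs), the estimate solves

\[ \hat{\beta} \;=\; \operatorname*{arg\,min}_{\beta \in \mathbb{R}^n} \; \tfrac{1}{2} \|y - \beta\|_2^2 \;+\; \lambda \sum_i \big| (D\beta)_i \big|, \]

where \(D\) is the kth order discrete derivative operator over the input points and \(\lambda \ge 0\) controls the amount of smoothing.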

Room: 037

Experiments with Imputation Estimators Under Model Misspecification

Speaker: Vladimir Minin

Inference problems with incomplete observations often aim at estimating population properties of unobserved quantities. One simple way to accomplish this estimation is to impute the unobserved quantities of interest at the individual level and then take an empirical average of the imputed values. We show that this simple imputation estimator can provide partial protection against model misspecification. The main advantages of imputation estimation are its universality, ease of implementation, and computational efficiency.
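
A minimal sketch of such an imputation estimator on a toy missing-data problem (my illustration; the talk's models are of course richer):

    # Impute the missing responses at the individual level from a fitted
    # (possibly misspecified) regression, then average the imputed values.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 5000
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)     # target population quantity: E[y]
    observed = rng.random(n) < 0.6             # y is missing for ~40% of units

    slope, intercept = np.polyfit(x[observed], y[observed], 1)
    y_imputed = np.where(observed, y, intercept + slope * x)
    print(y_imputed.mean())                    # imputation estimate of E[y]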

Room: 045

Part 1: Overview of Adaptive Algorithms Lab at the University of Toronto --- Part 2: ALGONQUIN: Fast Variational Methods for Robust Speech Recognition or Variational Methods Explained using Cartoons

Speaker: Brendan J. Frey

Part 1: Overview of Adaptive Algorithms Lab at the University of Toronto.

I'll review the projects underway in my new group at the University of Toronto. These projects include Bayesian networks for video processing, codes on graphs and iterative algorithms, probabilistic phase unwrapping for MRI and SAR imaging, variational techniques for speech recognition in noisy environments, automated medical diagnosis and the open QMR network.

Part 2: ALGONQUIN: Fast Variational Methods for Robust Speech Recognition or Variational Methods Explained using Cartoons.

Room: 105

Adaptation in Some Shape-Constrained Regression Problems

Speaker: Adityanand Guntuboyina

We consider the problem of estimating a normal mean constrained to be in a convex polyhedral cone in Euclidean space. We say that the true mean is sparse if it belongs to a low dimensional face of the cone. We show that, in a certain natural subclass of these problems, the maximum likelihood estimator automatically adapts to sparsity in the underlying true mean. We discuss the problems of convex regression and univariate and bivariate isotonic regression as examples.
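
As a concrete instance of one of the examples mentioned, univariate isotonic regression is the least-squares projection onto the monotone cone, and a piecewise-constant true mean sits on a low-dimensional face of that cone (a minimal illustration, not code from the talk):

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 200)
    mu = np.repeat([0.0, 1.0, 3.0], [100, 60, 40])   # few constant pieces = "sparse"
    y = mu + rng.normal(scale=0.5, size=x.size)
    fit = IsotonicRegression().fit_transform(x, y)   # the monotone least-squares fit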

Room: 037

Statistical Issues in the Analysis of Flow Cytometry Data and the FlowCAP Competition

Speaker: Raphael Gottardo

The capability of flow cytometry to offer rapid quantification of multidimensional characteristics for millions of cells has made this technology indispensable for health research, medical diagnosis, and treatment. However, the lack of statistical and bioinformatics tools to parallel recent high-throughput technological advancements has hindered this technology from reaching its full potential. Traditional methods for flow cytometry (FCM) data processing have relied on manual gating of cell events to define cell populations for statistical analysis.

Room: 045

Guaranteed Learning of Latent Variable Models: Overlapping Community Models and Overcomplete Representations

Speaker: Anima Anandkumar

Incorporating latent or hidden variables is a crucial aspect of statistical modeling. I will present a statistical and a computational framework for guaranteed learning of a wide range of latent variable models. I will focus on two instances, viz., community detection and overcomplete representations.

Room: 105

On the Use and Misuse of Tests of Significance

Speaker: Peter Guttorp

There have been several recent instances of papers in the scientific literature insisting that significance tests are misleading. I will first go back to some of the original writing (by Fisher, Jeffreys, Neyman and Pearson) on testing. Then I will discuss how to set the null hypothesis, the logic of tests of significance as part of the scientific method, and the distinction between real correlation and physical relationship.

Room: 045

Representation, modeling and computation: Opportunities and challenges of modern datasets

Speaker: Alekh Agarwal

Machine learning from modern datasets presents novel opportunities and challenges. Larger and more diverse datasets enable us to answer more complex statistical questions, but present computational challenges in designing algorithms that can scale. In this talk I will present two results, the first one about computational challenges and the second about an opportunity enabled by modern datasets in the context of representation learning.

Room: 105

Smooth James-Stein Model Selection in Wavelet Smoothing, Parametric Linear Model and Inverse Problem

Speaker: Sylvain Sardy

Motivated by a gravitational wave burst detection problem from several detectors, we derive smooth James-Stein (SJS) thresholding-based estimators in three settings: nonparametric and parametric regression, and inverse problem. SJS estimators enjoy smoothness like ridge regression and perform variable selection like lasso. They gain flexibility from having more than one regularization parameter, and these parameters can be selected well thanks to an unbiased and smooth estimate of the risk.

Room: 045

Reasoning About Uncertainty in High-Dimensional Data Analysis

Speaker: Adel Javanmard

Modern technologies generate vast amounts of data at unprecedented speed. This ubiquitous technological trend is driving the need for increasingly sophisticated algorithms to find subtle statistical patterns in massive amounts of data and extract actionable information. Examples of such problems arise in healthcare, social networks, and recommendation systems.

Room: 105

New Theory and Software for the Analysis of Microbial Communities

Speaker: Frederick A. Matsen

The human microbiome is the collection of microorganisms which live inside and on top of us; recent studies have established the centrality of the microbiome to human health. These studies raise a number of questions: Given a collection of samples from a single body location, which samples indicate a healthy versus unhealthy phenotype? Do these samples fall into natural "types" from which one can generalize? What are the "units" of microbial communities, and what are the significant synergistic and antagonistic interactions between microbes?

Room: 045

Computation, Communication, and Privacy Constraints on Statistical Learning

Speaker: John Duchi

How can we maximally leverage available resources--such as computation, communication, multiprocessors, or even privacy--when performing machine learning? In this talk, I will suggest statistical risk (a rigorous notion of the accuracy of learning procedures) as a way to incorporate such criteria in a framework for development of algorithms.

Room: 125

Convex Analysis Methods in Shape Constrained Estimation

Speaker: Arseni V. Seregin

Advisor: Jon Wellner

We discuss applications of convex analysis to shape constrained density estimation. The dissertation consists of three parts. In the first part we introduce convex-transformed densities as a multivariate generalization of known classes of densities defined by shape constraints based on convexity. We study the nonparametric maximum likelihood estimator of a convex-transformed density in several dimensions and prove basic properties: existence and consistency.

Room: 403

Learning with Lower Information Costs

Speaker: Sivan Sabato

In this talk I will consider learning with lower information costs, focusing on linear regression. Linear regression is one of the most widely used methods for prediction and forecasting, with widespread use in fields such as the natural sciences, economics, and medicine. I will show how to improve the information costs of linear regression in two settings. First, I will present a new estimation algorithm for the standard supervised regression setting.

Room: 105

Adapting group sequential methods to observational drug and vaccine safety surveillance studies using large electronic healthcare data

Speaker: Jennifer Nelson

Gaps in medical product safety evidence have spurred the development of new national post-licensure systems that prospectively monitor large observational cohorts of health plan enrollees. These multi-site systems, which include CDC’s Vaccine Safety Datalink (VSD) and FDA’s Mini-Sentinel (MS) Pilot Program for the Sentinel Initiative, attempt to leverage the vast amount of administrative and clinical information that is captured during the course of routine medical care and contained within computerized health plan databases.

Room: 037

Bayesian Model Averaging and Multivariate Conditional Independence Structures

Speaker: Alex F. Lenkoski

We develop a framework for the modeling of high-dimensional data that is robust to a variety of data types and modeling paradigms. In particular, we focus on several classes of models that each employ conditional independence assumptions to derive estimators. In this presentation we pay particular attention to the problem of model averaging in instrumental variable models, which rely on conditional independence assumptions between subsets of variables to form estimators possessing desirable properties for causal inference.

Room: 403

Probabilistic Topic Models of Text and Users

Speaker: Dave Blei

Probabilistic topic models provide a suite of tools for analyzing large document collections. Topic modeling algorithms discover the latent themes that underlie the documents and identify how each document exhibits those themes. Topic modeling can be used to help explore, summarize, and form predictions about documents.
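
A minimal sketch with standard tooling (scikit-learn's LDA implementation, not the speaker's own software): fit a two-topic model to a toy document collection and inspect the top words per topic.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["the cat sat on the mat", "dogs and cats are pets",
            "stocks fell as markets dropped", "investors sold shares today"]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)                       # word counts per document
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    vocab = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):       # word weights per topic
        print(k, [vocab[i] for i in topic.argsort()[-4:]])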

Room: 125

A Statistical Approach to Peptide Identification From Clustered Tandem Mass Spectrometry Data

Speaker: Soyoung Ryu

Tandem mass spectrometry experiments generate from thousands to millions of spectra that can be used to identify the presence of proteins in complex samples. In this work, we propose a new method to identify peptides based on clustered tandem mass spectrometry data. In contrast to previously proposed approaches, which identify one representative spectrum for each cluster using traditional database searching algorithms, our method scores all the spectra in a cluster against candidate peptides using Bayesian model selection.

Room: 303

Statistical Machine Learning and Big-p, Big-n, Complex Data

Speaker: Pradeep Ravikumar

Drawing upon fields as disparate as economics, psychology, operations research, and statistics, the subfield of statistical machine learning has provided practically successful tools ranging from search engines to medical diagnosis, image processing, speech recognition, and a wide array of problems in science and engineering. However, over the past decade, faced with modern data settings, off-the-shelf statistical machine learning methods are frequently proving insufficient.

Room: 105

Feature allocations, probability functions, and paintboxes

Speaker: Tamara Broderick

Clustering involves placing entities into mutually exclusive categories. We wish to relax the requirement of mutual exclusivity, allowing objects to belong simultaneously to multiple classes, a formulation that we refer to as "feature allocation." The first step is a theoretical one. In the case of clustering, the class of probability distributions over exchangeable partitions of a dataset has been characterized (via exchangeable partition probability functions and the Kingman paintbox).
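
For concreteness, the Indian buffet process is the canonical exchangeable prior over feature allocations, in which each object may hold several features (my illustrative sketch; the talk's characterization is more general):

    # One draw from the Indian buffet process: object i takes an existing
    # feature held by m objects with probability m / i, then adds
    # Poisson(alpha / i) brand-new features.
    import numpy as np

    def indian_buffet(n_objects, alpha=2.0, seed=0):
        rng = np.random.default_rng(seed)
        counts, rows = [], []
        for i in range(1, n_objects + 1):
            row = [rng.random() < m / i for m in counts]
            for k, taken in enumerate(row):
                counts[k] += taken
            n_new = rng.poisson(alpha / i)
            row += [True] * n_new
            counts += [1] * n_new
            rows.append(row)
        width = len(counts)
        return np.array([r + [False] * (width - len(r)) for r in rows])

    print(indian_buffet(6).astype(int))   # binary object-by-feature matrix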

Room: 105

Estimation of Social Contact Networks to Improve Influenza Simulation Models

Speaker: Gail E. Potter

Advisor: Mark Handcock

The current A(H1N1) influenza pandemic has posed questions to policymakers about the most effective interventions and resource mobilization strategies. Furthermore, mutation of the A(H5N1) "avian" influenza virus could also cause a pandemic, with an estimated 60% case mortality rate in humans, requiring fast analysis of intervention and containment strategies. A stochastic simulation model can help determine the best strategies in case of a pandemic.

Room: 303

Scaling deep learning to 10,000 cores and beyond

Speaker: Quoc V Le

Deep learning and unsupervised feature learning offer the potential to transform many domains such as vision, speech, and natural language processing. However, these methods have been fundamentally limited by our computational abilities, and typically applied to small-sized problems. In this talk, I describe the key ideas that enabled scaling deep learning algorithms to train a large model on a cluster of 16,000 CPU cores (2000 machines). This network has 1.15 billion parameters, which is more than 100x larger than the next largest network reported in the literature.

Room: 105

Bayesian Structured Sparsity for Genetic Association Mapping

Speaker: Barbara Engelhardt

In genomic sciences, the amount of data has grown faster than statistical methodologies necessary to analyze those data. Furthermore, the complex underlying structure of these data means that simple, unstructured statistical models do not perform well. We consider the problem of identifying multiple, functionally independent, co-localized genetic regulators of gene transcription. Sparse regression techniques have been critical to multi-SNP association mapping because of their computational tractability in large data settings.
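
A minimal sketch of the kind of sparse baseline referred to above (my illustration, not the speaker's structured model): lasso regression of a gene's expression on many SNPs, with the selected coefficients read off at the end.

    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(4)
    n, p = 300, 1000
    G = rng.binomial(2, 0.3, size=(n, p)).astype(float)    # SNP genotypes (0/1/2)
    beta = np.zeros(p)
    beta[[10, 250, 700]] = [0.8, -0.6, 0.5]                # three true regulators
    y = G @ beta + rng.normal(size=n)                      # expression phenotype
    fit = LassoCV(cv=5).fit(G, y)
    print(np.flatnonzero(fit.coef_))                       # SNPs selected by the lasso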

Room: 037

Models for Heterogeneity in Heterosexual Partnership Networks

Speaker: Ryan M. Admiraal

Mathematical models are instrumental in understanding the progression of contact-based infectious diseases. These models implicitly rely on an underlying network of partnerships, and it is important for these partnership networks to reflect important structures in the population of interest. In particular, concurrent partnerships and assortative mixing can greatly affect the prevalence of a disease in a population and explain heterogeneity we observe. We show how mixing totals and concurrency, as given by momentary degree distributions, can simultaneously be estimated from egocentric data.

Room: 403

Flows and Decompositions of Games; Harmonic and Potential Games

Speaker: Pablo Parrilo

Game theory is a rich mathematical framework to model and analyze the interactions of multiple decision makers with possibly conflicting objectives. Finite games in strategic form (i.e., those with a finite number of players, each with finitely many possible actions, that simultaneously and independently choose their action) are particularly important and well-studied.

Room: 125

Functional Quantitative Genetics and the Missing Heritability Problem

Speaker: Serge Sverdlov

In classical quantitative genetics, the correlation between the phenotypes of individuals with unknown genotypes and a known pedigree relationship is expressed in terms of probabilities of IBD states. In existing models of the inverse problem where genotypes are observed but pedigree relationships are not, probabilities and correlations have either a Bayesian or a hybrid interpretation. We introduce a generative evolutionary model of the inverse problem based on the classic infinite allele mutation process, IBF (Identity by Function).

Room: 042

Image Segmentation: Applied and Semi-Supervised Learning

Speaker: Susan Holmes

I will show an application of statistical learning to identification of different cell populations in lymph node images taken with a microscope using hyperspectral imaging. The whole system called GemIdent is available as a Java package and enables the identification and localization of millions of cells of up to 5 different types.

This is joint work with Adam Kapelner and Adam Guetz.

Room: 303

Learning and Estimation: Separated at Birth, Reunited at Last

Speaker: Alexander (Sasha) Rakhlin

We consider the problem of regression in three scenarios: (a) random design, under the assumption that the model F is well-specified; (b) distribution-free statistical learning with respect to a reference class F; and (c) online regression, with no assumption on the generative process. The first problem is often studied in the literature on nonparametric estimation, the second falls within the purview of statistical learning theory, and the third is studied within the online learning community.

Room: 037

Some Algorithmic Challenges in Statistics: Convexity, Non-Convexity, and Depth

Speaker: Sham Kakade

Faculty Host: Carlos Guestrin
Stat Liaison: Emily Fox

Society is witnessing remarkable technological and scientific advances as numerous disciplines adopt more advanced statistical and computational methodologies. Along with this progress comes an increasing need for scalable algorithms with solid theoretical foundations; the hope is that algorithms which address efficiency (with regard to both statistical and computational perspectives) can further facilitate breakthroughs.

Room: 105

Approximate Inference Algorithms for Graphical Models

Speaker: Ming Su

Exact probabilistic inference for graphical models is known to be NP-hard. For dense graphs with many cycles, one has to resort to tractable approximate methods such as loopy belief propagation. It has been shown that loopy BP is equivalent to the minimization of the so-called Bethe free energy in variational methods. I will first review some recently developed approximate algorithms in this context, and then present some theory about the Bethe region graph and a convex relaxation method for energy minimization.
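
For reference, the Bethe free energy being minimized, in the standard Yedidia, Freeman, and Weiss factor-graph form, is

\[ F_{\mathrm{Bethe}}(b) \;=\; \sum_{a} \sum_{x_a} b_a(x_a) \ln \frac{b_a(x_a)}{f_a(x_a)} \;-\; \sum_{i} (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i), \]

where \(a\) indexes factors, \(i\) indexes variables, \(d_i\) is the number of factors containing variable \(i\), and the beliefs \(b_a, b_i\) are constrained to be locally consistent; fixed points of loopy BP are stationary points of this objective.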

Room: M-306

XY - Basketball meets Big Data

Speaker: Luke Bornn

In this talk, I will explore the state of the art in the analysis and modeling of player tracking data in the NBA. In the past, player tracking data has been used primarily for visualization, such as understanding the spatial distribution of a player’s shooting characteristics, or to extract summary statistics, such as the distance traveled by a player in a given game. In this talk, I will present how we're using advanced statistics and machine learning tools to answer previously unanswerable questions about the NBA.

Room: 037

Inference in Network-Based Respondent-Driven Sampling

Speaker: Krista J. Gile

Respondent-Driven Sampling employs a variant of a link-tracing network sampling strategy to collect data from hard-to-reach populations. By tracing the links in the underlying social network, the process exploits the social structure to expand the sample and reduce its dependence on the initial (convenience) sample. Current estimation focuses on estimating population averages in the hard-to-reach population. These estimates are based on strong assumptions allowing the sample to be treated as a probability sample.

Room: 403

Accelerating Exact MCMC with Subsets of Data

Speaker: Ryan Adams

One of the challenges of building statistical models for large data sets is balancing the correctness of inference procedures against computational realities. In the context of Bayesian procedures, the pain of such computations has been particularly acute as it has appeared that algorithms such as Markov chain Monte Carlo necessarily need to touch all of the data at each iteration in order to arrive at a correct answer. Several recent proposals have been made to use subsets (or "minibatches") of data to perform MCMC in ways analogous to stochastic gradient descent.
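
For context on the minibatch proposals mentioned in the last sentence, here is a sketch of one of them, stochastic-gradient Langevin dynamics (Welling and Teh, 2011), for a Gaussian mean; unlike the exact subset-based MCMC that is the subject of the talk, this scheme is approximate.

    import numpy as np

    rng = np.random.default_rng(3)
    data = rng.normal(loc=2.0, scale=1.0, size=100_000)
    N, batch, eps = data.size, 100, 1e-5
    theta, samples = 0.0, []
    for _ in range(3000):
        x = rng.choice(data, size=batch, replace=False)
        # Gradient of the log posterior: N(0, 10^2) prior plus the
        # minibatch log-likelihood rescaled by N / batch.
        grad = -theta / 100.0 + (N / batch) * np.sum(x - theta)
        theta += 0.5 * eps * grad + rng.normal(scale=np.sqrt(eps))
        samples.append(theta)
    print(np.mean(samples[1000:]))   # close to the posterior mean (~ data mean)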

Room: 037

Applied Random Matrix Theory

Speaker: Joel Tropp

Random matrices now play a role in many areas of theoretical, applied, and computational mathematics. Therefore, it is desirable to have tools for studying random matrices that are flexible, easy to use, and powerful. Over the last fifteen years, researchers have developed a remarkable family of results, called matrix concentration inequalities, that balance these criteria.
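
A representative example is the matrix Bernstein inequality: if \(X_1, \dots, X_m\) are independent, mean-zero, symmetric \(d \times d\) random matrices with \(\|X_k\| \le L\) almost surely, and \(\sigma^2 = \big\| \sum_k \mathbb{E}[X_k^2] \big\|\), then for all \(t \ge 0\)

\[ \Pr\Big\{ \Big\| \sum_{k=1}^{m} X_k \Big\| \ge t \Big\} \;\le\; 2d \exp\!\Big( \frac{-t^2/2}{\sigma^2 + L t / 3} \Big), \]

so the spectral norm of a sum of independent random matrices is controlled by a variance proxy and a uniform bound, in close analogy with the scalar case.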

Room: 125