### Padelford

# Statistical methods for adaptive immune receptor repertoire analysis and comparison

B and T cell receptors, also known as adaptive immune receptors, perform key roles in adaptive immunity.

These proteins identify and deal with foreign invaders like viruses or bacteria, allowing for robust and long-lasting immunological protection.

The DNA sequences coding for these receptors arise by a complex recombination process followed by a series of productivity-based filters, as well as affinity maturation for B cells, giving considerable diversity to the circulating pool of these sequences.

# Quantifying Uncertainty in Causal Discovery with Bayesian Causal Model Selection

Causal Discovery algorithms attack the challenging problem of learning the causal relationships among a set of variables from observational data, but are often partly ad-hoc and give the researcher no measure of confidence in the correctness of the learned causal structure. I introduce Bayesian Causal Model Selection (BCMS), a Bayesian framework for causal discovery that unifies existing methods by expressing identifiability assumptions through the model prior.

# Laplace approximations and ordinal models for continuous spatial and spatio-temporal health mapping applications

With the increasing ability to collect myriad types of spatial data, we find ourselves regularly presented with new modeling problems that require novel solutions, but many of the available options for fitting spatial statistical models have limited applicability. Here we describe, evaluate and critique Template Model Builder (TMB), an existing but relatively unknown and unvetted (within the statistics community) modeling tool that leverages Laplace approximations to fit a large class of mixed effects models, including many spatial and spatial-temporal models.

# Faculty Meeting - Monday, March 2, 2020

# Faculty Meeting - Monday, February 24, 2020

# Faculty Meeting - Monday, February 10, 2020

- Call to Order
- Chair's Remarks
- 2020-2021 Faculty Teaching Preferences
- February 21 - Research Exchange presentation by Provost Mark Richards
- March 31 - VRI deadline at 11:59pm (no exceptions)
- April 4 - Admitted Student Preview Day for College of Arts and Sciences
- May 7 - Amanda Cox, Graduate School Public Lecture.

# Faculty Meeting - Monday, January 27, 2020

# Functional Estimation in Nonparametric Regression

Consider the heteroscedastic nonparametric regression model with random design $Y_i = f(X_i) + V^{1/2}(X_i)\varepsilon_i, \quad i=1,2,\ldots,n$, with $f(\cdot)$ and $V(\cdot)$ $\alpha$- and $\beta$-Hölder smooth, respectively. We show that the minimax rate of estimating $V(\cdot)$ under both local and global squared risks is of the order $n^{-\frac{8\alpha\beta}{4\alpha\beta + 2\alpha + \beta}} \vee n^{-\frac{2\beta}{2\beta+1}}$, where $a\vee b := \max\{a,b\}$ for any two real numbers $a,b$.
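To make the two-term rate concrete, the following minimal sketch evaluates the exponent $r$ for which the stated rate equals $n^{-r}$. The function name `variance_rate_exponent` is hypothetical and introduced only for illustration; the formula itself is copied from the abstract above.

```python
from fractions import Fraction


def variance_rate_exponent(alpha, beta):
    """Return r such that the stated minimax rate equals n**(-r).

    The rate is the maximum of n**(-first) and n**(-second); for n > 1
    the larger of the two is the one with the smaller exponent.
    """
    a, b = Fraction(alpha), Fraction(beta)
    first = 8 * a * b / (4 * a * b + 2 * a + b)   # term depending on both smoothness levels
    second = 2 * b / (2 * b + 1)                  # classical one-dimensional nonparametric rate
    return min(first, second)


if __name__ == "__main__":
    # A smooth mean function: the classical term determines the rate.
    print(variance_rate_exponent(1, 1))
    # A rough mean function: the first term determines the rate.
    print(variance_rate_exponent(Fraction(1, 8), 1))
```

Comparing the two exponents algebraically shows that the first term determines the rate exactly when $\alpha < \beta/(4\beta+2)$; for smoother mean functions the rate reduces to the familiar $n^{-2\beta/(2\beta+1)}$.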

# Faculty Meeting - Monday, November 18, 2019

A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, November 18th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

**Chair’s Remarks**

Daniel Pollack informed the faculty that there will be faculty lunches with the chair candidates on Tuesday and Thursday (Nov. 19 & 21). If any faculty would like to sign up, please reach out to Kristine Chan.

# Flexible spatial models for household survey data in low and middle income countries

The need for rigorous and timely health and demographic summaries has led to an explosion in geographic studies, particularly in low and middle income countries. While household surveys are a major source of data in this context, they present challenges for statistical modeling. These challenges include biases due to oversampling certain population segments, nonlinear interactions between covariates, and multiple scales of prediction. However, many common statistical methods have never been tested rigorously in these settings.

# Faculty Meeting - Monday, December 9, 2019

# Faculty Meeting - Monday, December 2, 2019

A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, December 2nd, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

**Chair’s Remarks**

Daniel Pollack reminded the faculty about the Holiday Party on Dec. 11; food and drinks will be provided. Vickie Graybeal has sent an email asking faculty, postdocs, and staff to RSVP by Dec. 4 at 12:00pm.

# Faculty Meeting - Monday, November 4, 2019

The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, November 4th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

The meeting began with approval of the previous meeting’s minutes from October 21, 2019.

# Faculty Meeting - Friday, November 8, 2019

A special meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, November 8th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

The meeting began with approval of the previous meeting’s minutes from November 4, 2019.

**Adjournment**

There being no chair’s remarks, announcements, committee reports, or new business, the meeting passed into executive session at 12:37pm and was adjourned at 2:00pm.

# Faculty Meeting - Monday, December 9, 2019

A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, December 9th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

**Announcements**

Daniel Pollack announced that the Senior Lecturer position has been posted on the department website.

**Committee Reports**

The GSRs reported on students’ interactions with, and feedback on, each chair candidate.

# Faculty Meeting - Monday, October 21, 2019

The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, October 21st, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

The meeting began with approval of the previous meeting’s minutes from October 7, 2019.

**Chair’s Remarks**

Daniel Pollack announced Vickie Graybeal’s twenty years of service award.

# Faculty Meeting - Monday, October 7, 2019

The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, October 7th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

The meeting began with approval of the previous meeting’s minutes from September 23, 2019.

**Chair’s Remarks**

Daniel Pollack went over existing department policies that were up for renewal. Of the policies, the delegation of authority, merit review process, and retention consultation were reviewed, voted on, and approved.

# Faculty Meeting - Monday, September 23, 2019

The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, September 23rd, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

**Chair’s Remarks**

Daniel Pollack reported there will be no faculty retreat this Autumn quarter. Planning for the retreat will be revisited in the Spring.

He also provided updates and timelines on the two ongoing searches: Full Professor & Chair and Assistant Professor.

# Latent Variable Models for Indirectly or Imprecisely Measured Networks

In the social sciences, social networks are important structures which represent the relationships and interactions between actors in a population of study. The most common methods for measuring networks are to survey study participants about who their connections are and to collect interaction activity between pairs of actors. However, directly measuring the exact network of interest can be challenging.

# Estimation and testing under shape constraints

Over the last few decades, shape-constrained methods have gained importance in statistical inference as attractive alternatives to traditional nonparametric methods, which often require tuning parameters and restrictive smoothness assumptions. This talk focuses on the application of shape constraints like unimodality and log-concavity in comparing the outcomes of two HIV vaccine trials. To this end, we develop shape-constrained tests of stochastic dominance and a shape-constrained plug-in estimator of the Hellinger distance between two densities.

# Realized genome sharing in random effects models for quantitative genetic traits

DNA copies inherited from the same ancestral copy by related individuals are said to be identical by descent (IBD). IBD gives rise to genetic similarities between related individuals. In quantitative genetics, two fundamental problems are heritability estimation and gene mapping for genetic traits. IBD plays a critical role in the study of both problems. When working with population-based samples where pedigree information is unavailable, it is essential to estimate IBD accurately from genetic marker data using pedigree-free methods.

# Inferring Network Structure From Partially Observed Graphs

Collecting social network data is notoriously difficult, meaning that indirectly observed or missing observations are very common. In this talk, we address two such scenarios: inference on network measures without network observations and inference of regression coefficients when actors in the network have latent block memberships.

# High-dimensional independence testing with maxima of rank correlations

Testing mutual independence for high-dimensional observations is a fundamental statistical challenge. Popular tests based on linear and simple rank correlations are known to be incapable of detecting non-linear, non-monotone relationships, calling for methods that can account for such dependences. To address this challenge, we propose a family of tests that are constructed using maxima of pairwise rank correlations that permit consistent assessment of pairwise independence.
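The general shape of a statistic built from maxima of pairwise rank correlations can be sketched as follows. This is an illustration only, not the proposed family of tests: Kendall's tau is swapped in because it is easy to compute by hand, even though it is one of the simple rank correlations that cannot detect non-monotone dependence, and no critical value is derived. The function names are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length sequences without ties."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)   # +1 for a concordant pair, -1 for a discordant pair
    return score / (n * (n - 1) / 2)


def max_abs_pairwise_tau(columns):
    """Largest absolute rank correlation over all pairs of variables."""
    return max(abs(kendall_tau(u, v)) for u, v in combinations(columns, 2))


if __name__ == "__main__":
    # Three variables observed on four units (made-up numbers).
    data = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 1, 4, 3]]
    print(max_abs_pairwise_tau(data))
```

A test in the spirit of the abstract would replace `kendall_tau` with a rank correlation that is consistent against all pairwise dependence and would compare the maximum with a suitable critical value.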

# Recursive Inversion Models for Partially Ranked Data

Can we do exact and tractable inference in Mallows-like models for incomplete data? I will show that the answer is yes for the most general form of Mallows-type model and a large class of partial orders known as partial rankings (including special cases like top-t rankings). I will also demonstrate that, despite partial rankings lacking a sufficient statistic, exact inference is possible with overhead that is at most polynomial in O(nN) and that, in practice, the overhead per data point is negligible.

# Fitting Stochastic Epidemic Models to Multiple Data Types

Traditional infectious disease epidemiology focuses on fitting deterministic and stochastic epidemic models to surveillance case count data. Recently, researchers began to make use of infectious disease agent genetic data to complement statistical analyses of case count data. Such genetic analyses rely on the field of phylodynamics, a set of population genetics tools that aim to reconstruct the demographic history of a population based on molecular sequences of individuals sampled from the population of interest.

# Large-Scale B Cell Receptor Sequence Analysis Using Phylogenetics and Machine Learning

The adaptive immune system synthesizes antibodies, the soluble form of B cell receptors (BCRs), to bind to and neutralize pathogens that enter our body. B cells are able to generate a diverse set of high affinity antibodies through the affinity maturation process. During maturation, "naive" BCR sequences first accumulate mutations according to a neutral evolutionary process called somatic hypermutation (SHM), which may modify the associated binding affinities, and then are subject to natural selection by clonal expansion, which promotes the higher affinity antibodies.

# Gradient Group Lasso Identifies Sparse Functional Basis for Molecular Manifolds

We present a method for analyzing low-energy paths between molecular conformations by combining techniques in both manifold learning, which identifies such paths, and functional regression, which can parameterize them by explanatory non-linear functions. Unsupervised manifold learning approaches are useful for understanding molecular dynamics simulations since they disregard small-scale information such as peripheral hydrogen vibrations that can nevertheless drastically affect the observed energy.

# Fast nonconvex changepoint detection

In recent years, new technologies in neuroscience have made it possible to measure the activities of large numbers of neurons in behaving animals. For each neuron, a fluorescence trace is measured; this can be seen as a first-order approximation of the neuron's activity over time. Determining the exact time at which a neuron spikes on the basis of its fluorescence trace is an important open problem in the field of computational neuroscience. Recently, a convex optimization problem involving an L1 penalty was proposed for this task.

# Estimating Mortality at the Subnational Level in a Low and Medium Income Context

Child mortality, and in particular the under-five mortality rate (U5MR), is an important indicator of the overall health of a population. Subnational estimation of U5MR is a relatively new endeavor.

# Statistical Methods for Manifold Recovery and C^{1, 1} Regression on Manifolds

High-dimensional data sets often have lower-dimensional structure taking the form of a submanifold of a Euclidean space. It is challenging but necessary to develop statistical methods for these data sets that respect the manifold structure. We present research from two different areas: manifold learning (i.e., support estimation) and smooth regression on manifolds.

# Space-Time Contour Models for Sea Ice Forecasting

The amount of sea ice (frozen ocean water) found in the Arctic is declining rapidly as a result of climate change. This has increased the need for accurate forecasts of where sea ice will be located. Of particular interest is predicting the sea ice edge contour, or the boundary of the region where at least 15% of the area is ice-covered. Current sea ice forecasts are issued from deterministic numerical prediction systems.

# Nonparametric inference on monotone functions, with applications to observational studies

In this dissertation, we study general strategies for constructing nonparametric monotone function estimators in two broad statistical settings. In the first setting, a sensible initial estimator of the monotone function of interest is available, but may fail to be monotone. We study the correction of such an estimator obtained via projection onto the space of functions monotone over a finite grid in the domain.

# Bayesian Methods for Graphical Models with Limited Data

Scientific studies in many fields involve understanding and characterizing dependence relationships among large numbers of variables. This can be challenging in settings where data is limited and noisy. Take survey data as an example: understanding the associations between questions may help researchers better explain themes among related questions and impute missing values. Yet such data typically contains a combination of binary, continuous, and categorical variables; a high proportion of missing values; and complex data structures.

# Preferential sampling and model checking in phylodynamic inference

Estimating population size fluctuations is one of the key tasks in ecology. However, traditional sampling-based approaches to this task have limitations when populations of interest are extinct or hard to reach, as is the case for individuals infected for a short time period by a pathogen.

# Analysis of Incomplete Network Data

Collecting social network data is notoriously difficult, meaning that indirectly observed or missing observations are very common. In this talk, we address two such scenarios: inference on network measures without any direct network observations and inference of regression coefficients when important features are missing.

# Parameter Identification and Assessment of Independence in Multivariate Statistical Modeling

In this talk we define a new class of multivariate nonparametric measures of dependence that we refer to as symmetric rank covariances. This new class generalizes many existing classical rank measures of dependence, such as Kendall's tau and Hoeffding's D, as well as the more recently discovered Bergsma–Dassios sign covariance. Symmetric rank covariances make explicit the implicit symmetries hidden in the standard definitions of the above measures and, in doing so, lead naturally to multivariate extensions of the Bergsma–Dassios sign covariance.

# Latent Variable Models for Imprecisely or Indirectly Measured Networks

In the social sciences, social networks are important structures which represent the relationships and interactions between actors in a population of study. In these fields, the most common method for measuring networks is to directly survey study participants about who their connections are. However, directly measuring the network of interest can be challenging. Participants do not always provide accurate accounts of their connections, which can result in mismeasurement of the network.

# Causal Discovery with non-Gaussian Data

In this talk, we consider causal discovery when the underlying structure corresponds to a linear structural equation model with error terms which are non-Gaussian. Previous work by Shimizu et al. (2006) has shown that under this framework, a unique directed acyclic graph (not simply an equivalence class) can be identified from infinite data. We extend that result in two directions. First, we show that a unique graph can still be consistently recovered in the high-dimensional setting where p, the number of variables, exceeds n, the number of observed samples.

# Faculty Meeting - Monday, January 29, 2018

Agenda:

- Faculty Search discussion

# Composite Likelihood Estimation for Binary Network Models

We develop a scalable method to estimate the parameters in models of very large binary network datasets. Maximum likelihood estimates are generally impossible to obtain because the full likelihood involves an intractable high dimensional integral. Also, full-likelihood Bayesian estimation is impractical for very large datasets as the MCMC algorithm is very slow.

# Quarterly Pedagogy Meeting - March 7, 2016

Time: 12:30-1:30pm March 7, 2016

Place: Padelford Hall, C-301

Agenda:

- 12:30 - Pedagogy Meeting

# Faculty Meeting - January 9, 2017

Time: 12:30-1:30pm January 9, 2017

Place: Padelford Hall, C-301

Agenda:

- Welcome back & Updates (Thomas R.)
- Mentoring & Diversity (Jessica G.)
- Consulting / Paul Sampson Replacement (Thomas R.)

# Faculty Meeting - February 13, 2017

Time: 12:30-1:30pm February 13, 2017

Place: Padelford Hall, C-301

Agenda:

- Updates (Thomas R.)
- 3-year Affiliate/Adjunct Renewals (Thomas R.)
- Affiliate/Adjunct Re-Appointments (Not up for periodic 3-year review associated with renewal) (Thomas R.)
- Case for Promotion to Affiliate Associate Professor (Thomas R.)
- Paul Sampson (Thomas R.)

# Faculty Meeting - February 27, 2017

Time: 12:30-1:30pm February 27, 2017

Place: Padelford Hall, C-301

Agenda:

- Renew Policies (Thomas R.)
- Biostatistics Search (Thomas R.)
- Affiliate Appointment for Jon Azose (Thomas R., Adrian R.)
- Loyce Adams (Emeritus Professor for AMATH) for Senator (Thomas R.)
- Annual Student Review (Michael P.)

# Faculty Meeting - March 6, 2017

Time: 12:30-1:30pm March 6, 2017

Place: Padelford Hall, C-301

Agenda:

- Annual Student Review (Michael P.)

# Faculty Meeting - April 3, 2017

Time: 12:30-1:30pm April 3, 2017

Place: Padelford Hall, C-301

Agenda:

- Upcoming talk by Nature Editor, 4/5, Physics/Astronomy Auditorium A118
- Computing Staff Updates (Thomas/Kris)
- Web-site overhaul (Thomas/Kris)
- Discuss and vote on Affiliate appointment for Sam Clark
- Update on Faculty Search for Full-Time Lecturer in Consulting (Elena)
- Search request for next year (Thomas)
- New learning spaces / scheduling policy (commencing Spring 2018): https://registrar.washington.edu/learning-spaces-faq/

# Faculty Meeting - April 10, 2017

Time: 12:30-1:30pm April 10, 2017

Place: Padelford Hall, C-301

Agenda:

- Meeting for Full + Assoc. Professors Only

# Faculty Meeting - April 24, 2017

Time: 12:30-1:30pm April 24, 2017

Place: Padelford Hall, C-301

Agenda:

- MS Student Review

# Faculty Meeting - May 1, 2017

Time: 12:30-1:30pm May 1, 2017

Place: Padelford Hall, C-301

Agenda:

- FTL Consulting Search
- Personnel Matter (Full Professors only)

# Faculty Meeting - May 8, 2017

Time: 12:30-1:30pm May 8, 2017

Place: Padelford Hall, C-301

Agenda:

- TCC Meeting

# Faculty Meeting - May 22, 2017

Time: 12:30-1:30pm May 22, 2017

Place: Padelford Hall, C-301

Agenda:

- FTL Consulting Search

# Faculty Meeting - June 5, 2017

Time: 12:30-1:30pm June 5, 2017

Place: Padelford Hall, C-301

Agenda:

- Update and discussion on searches.
- PhD Admission Policy; TOEFL Scores; TA Requirement for PhDs.
- College Absence Policy; also Effective Personnel Vote rule.
- 10 Year Department Review.

# Faculty Meeting - October 2, 2017

Time: 12:30-1:30pm October 2, 2017

Place: Padelford Hall, C-301

Agenda:

- Updates
- Discussion of upcoming Search
- Adjunct Appointment (Amy Willis)
- Research Prelim and 572

# Recovery of Item Rankings Under Nonnormal Fitting Distributions in MML Parameter Estimation

In a simulation study, data are generated under a variety of conditions with respect to underlying ability distribution, test length, and sample size. Item parameter estimates are obtained under two conditions: in one, the assumed ability distribution matches the underlying ability distribution; in the other, it does not. The item parameter estimates from the matching condition are compared to those from the nonmatching condition to determine the effect on the recovery of parameter estimates and item rankings.

# Finite Sampling Exponential Bounds with Applications to Two-Sample Kolmogorov-Smirnov Statistics

Advisor: Jon Wellner

In this talk, we discuss exponential tail inequalities for the sum in the context of sampling without replacement. Using an exponential inequality due to Serfling as the basis for investigation, we consider the special case of sampling from a finite population containing only 0s and 1s. This leads to considering exponential bounds for the Hypergeometric distribution.
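As a rough illustration of the setting, the sketch below compares the exact upper tail of a Hypergeometric count with two exponential bounds for a population of 0s and 1s: Hoeffding's bound, which ignores the finite population, and Serfling's bound, which sharpens it through the sampling fraction $(n-1)/N$. The function names and the numbers in the demo are assumptions made for this example; the refined bounds developed in the talk are not reproduced here.

```python
import math


def hypergeom_upper_tail(N, D, n, k):
    """Exact P(X >= k), where X counts the 1s in n draws without
    replacement from a population of N items containing D ones."""
    total = math.comb(N, n)
    upper = min(n, D)
    ways = sum(math.comb(D, j) * math.comb(N - D, n - j)
               for j in range(max(k, 0), upper + 1))
    return ways / total


def hoeffding_bound(n, t):
    """Hoeffding bound on P(X/n - D/N >= t) for values in [0, 1], t > 0."""
    return math.exp(-2 * n * t ** 2)


def serfling_bound(N, n, t):
    """Serfling bound on the same event; (n - 1) / N is the sampling fraction."""
    f = (n - 1) / N
    return math.exp(-2 * n * t ** 2 / (1 - f))


if __name__ == "__main__":
    N, D, n, k = 100, 50, 40, 26       # made-up population and sample sizes
    t = k / n - D / N                  # deviation of the sample mean from D/N
    print(hypergeom_upper_tail(N, D, n, k))
    print(serfling_bound(N, n, t))
    print(hoeffding_bound(n, t))
```

Because the factor $1 - (n-1)/N$ is smaller than one, the Serfling bound is never larger than the Hoeffding bound, and the gap widens as the sample approaches the whole population.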

# Geostatistical Model Averaging for Probabilistic Quantitative Precipitation Forecasting

Advisor: Tilmann Gneiting

Accurate weather forecasts benefit society in crucial functions, including agriculture, transportation, recreation, and basic human and infrastructural safety. Over the past two decades, ensembles of numerical weather prediction models have been developed, in which multiple estimates of the current state of the atmosphere are used to generate probabilistic forecasts for future weather events. However, ensemble systems are uncalibrated and biased, and thus need to be statistically postprocessed. Bayesian model averaging (BMA) is a preferred way of doing this.

# Introduction to Model-Based Clustering

I will talk briefly about how I got involved in research in Model-Based Clustering in my final year of undergrad (and subsequently here) and give a brief outline of research I did then. The main part of the talk will be about different extensions to the model-based clustering methodology that I'm working on. I'll mainly be focusing on research on variable selection with model-based clustering but I'll also talk, if I have time, about ideas I'll be working on for the next year.

# Likelihood-Based Inference for Partially Observed Multi-Type Markov Branching Processes

Advisor: Vladimir Minin

Markov branching processes are a class of continuous-time Markov chains (CTMCs) with ubiquitous modeling applications. Multi-type processes are necessary to model phenomena such as competition, predation, or infection, but often feature large or uncountable state spaces, rendering general CTMC techniques impractical. We present new methodology motivated by processes arising in molecular epidemiology, cellular differentiation, and infectious disease dynamics.

# A Comparison of Alternative Methodologies for Estimation of HIV Incidence

# Trend Estimation Using Wavelets

Advisors: Peter Guttorp & Don Percival

# Model-Based Penalized Inference

It is well known that many penalized regression problems can be interpreted as estimating unknown regression coefficients having assumed a specific statistical model. This includes the lasso when tuning parameters are estimated from the marginal likelihood of the data, the Bayesian lasso, Gaussian random effects models, ridge regression, etc. In the first part, we consider estimating a mean matrix from a single noisy realization. We assume possibly sparse elementwise effects and use a lasso penalty.

# Statistical Methods in Medical Imaging: Application to Mammography

Medical professionals and researchers use a variety of imaging techniques in their clinical practice and scientific investigations. In this talk I will focus on mammography, which is used for breast examinations and routine breast cancer screening. While mammographic images have proved to be a useful non-invasive tool for clinical monitoring, the images often lack detail and clarity. For example, in addition to having limited spatial resolution, the skin-air boundary of the imaged breast is often obscured. Locating this boundary is, however, an important initial step in breast density estimation.

# MS Thesis Presentation: A resampling approach to clustering with confidence

We propose a method for estimating the number of groups in a data set. Our method is an extension of Generalized Single Linkage clustering (GSL) (Stuetzle and Nugent 2010), a nonparametric clustering method based on the premise that groups in the data correspond to modes of the underlying data density. GSL starts with a nonparametric density estimate. It recursively splits the data into high density regions separated by valleys. The leaves of the resulting cluster tree correspond to modes of the density estimate.

# Analysis of Haplotype Structure: Application to the DARC Gene Region

# Modeling Preferential Sampling Reduces Bias and Improves Precision When Estimating Effective Population Size Trajectories

Advisor: Vladimir Minin

The field of phylodynamics seeks to estimate effective population size fluctuations from molecular sequences of individuals sampled from the population of interest. One way to accomplish this task is to formulate an observed sequence data likelihood by using a coalescent model for the sampled individuals' genealogy and then integrating over all possible genealogies via Monte Carlo or, less efficiently, by conditioning on one genealogy estimated from sequence data. These strategies also work when molecular sequences are sampled serially through time.

# Factor Models with Non-Normality: Robust, Skewed Distribution MLE and Bayes Estimation

Advisor: R. Douglas Martin

The literature on the use of robust estimates, skewed distribution MLEs and non-normal distribution hierarchical Bayes models for multi-factor models in finance is surprisingly thin, and limited for the most part to single factor models (SFMs). The ultimate goal of our research is the study of the relative merits of robust versus non-normal MLE estimation of multi-factor models and the use of hierarchical Bayes modeling of multi-factor models using skewed fat-tailed distributions.

# Learning the "Epitome" of an Image

I will describe a new model of image data that we call the "epitome". The epitome of an image is its miniature, condensed version containing the essence of the textural and shape properties of the image. As opposed to previously used simple image models, such as templates or basis functions, the size of the epitome is considerably smaller than the size of the image or object it represents, but the epitome still contains most constitutive elements needed to reconstruct the image.

# Bayesian Methods for Inferring Gene Regulatory Networks

Advisors - Adrian Raftery and Ka Yee Yeung (UW Tacoma)

# Combining Probability Forecasts

We propose a method for combining probability forecasts from different sources. The commonly used method of linearly combining probability forecasts has limitations, in that a weighted combination of distinct calibrated forecasts is necessarily uncalibrated. In view of this, we propose a recalibration method. We illustrate our findings with simulation examples and a case study on operational probability of precipitation forecasts.
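The calibration problem with linear combinations can be seen in a toy construction, worked out exactly below. Everything in it is an assumption made for illustration (two independent fair-coin signals, the event probabilities, the equal weights, and all function names); it is neither the case study nor the recalibration method from the talk. Each forecaster sees one signal and reports the conditional event probability given that signal, so each forecast is calibrated on its own.

```python
from fractions import Fraction
from itertools import product


def p_event(a, b):
    """Hypothetical P(Y = 1 | A = a, B = b) for binary signals a and b."""
    return Fraction(1, 10) + Fraction(2, 5) * a + Fraction(2, 5) * b


def forecast_given_one_signal(s):
    """Calibrated forecast of someone who sees one signal and
    averages over the other, unseen fair coin."""
    return (p_event(s, 0) + p_event(s, 1)) / 2


def calibration_table():
    """Map each equal-weight linear-pool value to the true
    conditional probability of the event given that value."""
    mass = {}   # pool value -> probability of observing that pool value
    hits = {}   # pool value -> probability of that pool value and Y = 1
    for a, b in product((0, 1), repeat=2):
        pool = (forecast_given_one_signal(a) + forecast_given_one_signal(b)) / 2
        weight = Fraction(1, 4)          # each signal pair is equally likely
        mass[pool] = mass.get(pool, 0) + weight
        hits[pool] = hits.get(pool, 0) + weight * p_event(a, b)
    return {pool: hits[pool] / mass[pool] for pool in mass}


if __name__ == "__main__":
    for pool, truth in sorted(calibration_table().items()):
        print(f"pool says {pool}, event occurs with probability {truth}")
```

In this construction the pooled forecast is too timid at both ends: when it reports 3/10 the event occurs with probability 1/10, and when it reports 7/10 the event occurs with probability 9/10. A recalibration step would aim to push the pooled values back toward the extremes.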

# Algorithms and Software for the Automated Identification of Minerals Using Field Spectra or Hyperspectral Imagery

Over the last few years, the speaker (and collaborators Leanne Bischof and Jon Huntington) have been developing fast and sophisticated algorithms and software for identifying pure minerals and mixtures of minerals from shortwave infrared spectra. The software, called The Spectral Assistant (TSA), has been designed to be used with a particular FIELD-PORTABLE spectrometer, the PIMA-II, which is about the size of a shoe box and can be used by geologists collecting samples in the field.

# Topics in Graph Clustering

Advisor: Marina Meila

# Statistical Methodology for Longitudinal Social Network Data

Social interaction data are generated from the interaction or relationship between two or more actors; thus the observational units are pairs, trios, etc. of actors. Data of this type are common in all fields of social science (e.g. political science, sociology, anthropology, and economics), for the interaction of actors is a key element in social science theory.

# Modeling Heterogeneity Within and Between Arrays

Data that can be represented in the form of an array is present in many of the social and biological sciences. In this talk we address two statistical problems concerning these data. The first problem is modeling the heterogeneity along the dimensions of an array. Previously developed models are either non-stochastic and difficult to interpret, or require a large number of parameters prohibiting likelihood based inference for some arrays.

# Robust Bayesian Analysis of Gene Expression Microarray Data

Microarrays are part of a new class of biotechnologies that can be used to measure expression levels (DNA or RNA abundance) for thousands of genes at a time. This new technology is being applied increasingly in biological and medical research to address a wide range of problems, such as the classification of tumors or the study of host responses to bacterial infections. DNA microarray experiments raise numerous statistical questions in fields as diverse as image analysis, experimental design, hypothesis testing, cluster analysis, etc.

# Bayesian Modeling of International Migration

Advisor: Adrian Raftery

The future of international migration is a topic of great social and political importance, and yet international migration is hard to even estimate, let alone predict. The unreliability of point projections of migration indicates a need for better quantification of uncertainty in migration projections. We accomplish this quantification of uncertainty with a Bayesian hierarchical autoregressive model on net migration rates. In an initial model, we assume error terms are independent across countries.

# Nonparametric Estimation of the Bivariate Survivor Function

Correlated failure time data arise often in many application areas. For example, in genetic epidemiology studies, the disease occurrence times of pairs of family members are often correlated, and the degree of correlation may provide important leads with respect to disease etiology. Univariate failure time data methods are well established, including the Kaplan-Meier method, the censored data rank test and the Cox regression method. However, standard tools for multivariate failure time data analysis are not yet available.

# Logistic Regression with Covariate Measurement Error: Estimation and a New Measurement Model

Advisors: Ross Prentice and Ching-Yun Wang

# Discovering Interactions In Multivariate Time Series

In large collections of multivariate time series it is of interest to determine interactions between each pair of time series. We study methods for inferring time series interactions in three domains: 1) conditional independencies between time series, 2) Granger and instantaneous causality estimation in subsampled and mixed frequency time series, and 3) Granger causality estimation in multivariate categorical data. First, we explore a Bayesian framework for inferring graphical models of time series.

# Statistical Methods for Analyzing Incomplete Financial Data with Heavy Tails

A common problem with historical financial data is that the series often have histories of unequal length. Examples include country market indices, currency rates and hedge fund returns histories. Practitioners often deal with this by truncating all the series so that the remaining data have the same length, which is clearly not an ideal solution. We discuss existing statistical methods that utilize the full data set, such as maximum likelihood estimation and multiple imputation.

# Parameter Priors for Directed Acyclic Graphical Models and the Characterization of Several Probability Distributions

**Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop**

We develop simple methods for constructing parameter priors for model choice among Directed Acyclic Graphical (DAG) models. In particular, we introduce several assumptions that permit the construction of parameter priors for a large number of DAG models from a small set of assessments. We then present a method for directly computing the marginal likelihood of every DAG model given a random sample with no missing observations.

# Large-Scale B-Cell Receptor Sequence Analysis Using Phylogenetics and Machine Learning

Co-chairs: Vladimir Minin & Erick Matsen

# Bayesian Hierarchical Self-Modeling Warping Regression with Application to Network Inferences

Functional data often exhibit a common shape but also variations in amplitude and phase across curves. The analysis often proceeds by synchronization of the data through curve registration. We propose a Bayesian hierarchical model for curve registration. Our model provides a formal account of amplitude and phase variability while borrowing strength from the data across curves in the estimation of the model parameters.

# Bayesian Modeling of Health Data in Space and Time

In recent years spatial-temporal modeling has become increasingly popular in the field of public health and epidemiology. Motivated by two datasets, we address three issues in the Bayesian modeling of health data in space and time.

# Learning in Spectral Clustering

Spectral segmentation is a technique used to group data based on pairwise similarities. A similarity matrix is used as input into a spectral clustering algorithm and a clustering over the data is output. The clustering criterion is such that similar points are put in the same cluster and dissimilar points are put in different clusters. Generally, this similarity matrix is assumed known, while in reality this matrix is usually constructed by hand, a very time-consuming process.
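The input-output pipeline described above can be sketched as follows. This is a standard normalized-Laplacian bipartition written with NumPy, not the speaker's method; the function name `spectral_bipartition` and the toy similarity matrix are assumptions made for illustration.

```python
import numpy as np

def spectral_bipartition(S):
    """Split points into two clusters from a symmetric similarity matrix S."""
    d = S.sum(axis=1)                      # degree of each point
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} S D^{-1/2}
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    # Eigenvector of the second-smallest eigenvalue, rescaled by D^{-1/2}
    fiedler = eigvecs[:, 1] * d_inv_sqrt
    # The sign of each entry gives the cluster label.
    return (fiedler > 0).astype(int)

# Toy similarity matrix (hypothetical): points 0-2 are mutually similar,
# points 3-5 are mutually similar, and the two groups are weakly linked.
S = np.full((6, 6), 0.05)
S[:3, :3] = 1.0
S[3:, 3:] = 1.0
labels = spectral_bipartition(S)
```

Which group receives label 0 and which receives label 1 is arbitrary, because an eigenvector is only defined up to sign. Here the similarity matrix is typed in by hand, which is exactly the step the abstract describes as time-consuming.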

# Discrete-Time Threshold Regression for Survival Data with Time-Dependent Covariates

Advisor: Professor Gary Chan

# Directed Markov Point Processes

Spatial point processes are often modeled as Markov fields, and inference for such models is sometimes either inefficient or computationally intensive because of difficulties in evaluating the normalizing constant. Simulation studies for such processes are hard. We exploit the partial order in the plane to introduce a class of Markov point processes known as "Directed Markov Point Processes" and investigate their properties. This Markov structure enables us to study some of the well-known spatial processes in detail.

# Hamiltonian Monte Carlo in Bayesian Empirical Likelihood Computation

We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Carlo method for sampling from the posterior distribution of the parameters of interest. The proposed method uses hitherto unknown properties of the gradient of the underlying log-empirical likelihood function. It is seen that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood.

# Nonparametric Estimation of Multivariate Monotone Densities

I will discuss the most important results obtained on nonparametric estimation of two multivariate families of densities that satisfy monotonicity constraints and can otherwise be characterized as certain mixture models. The discussion will emphasize characterizations of the estimators and their strong consistency, and will then turn to rates of convergence of these estimators, in both the global and the local sense.

# On the Geometry of Graphical Models

**Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop**

We provide a classification of graphical models according to their representation as subfamilies of exponential families.

# Methods for Estimation and Inference for High-Dimensional Models

Advisor: Mathias Drton & Ali Shojaie

Modern statistical problems are increasingly high-dimensional, with the number of covariables p potentially vastly exceeding sample size N. Fortunately, significant progress has been made in developing rigorous statistical tools for tackling such problems, but these methods have primarily targeted prediction, point estimation, and/or variable selection.

# Algorithms for Estimating the Cluster Tree of a Density

The goal of clustering is to identify distinct groups in a data set and assign a group label to each observation. To cast clustering as a statistical problem, we regard the data as an iid sample from some unknown probability density p. We adopt the premise that groups correspond to modes of the density. Our goal then is to find the modes and assign each observation to the "domain of attraction" of a mode. We do this by estimating the cluster tree of the density, a representation of the hierarchical structure of its level sets.

# Analyzing Time Series Data for Endemic Cholera in Bangladesh with Mechanistic Models of Infectious Disease Dynamics

Despite seasonal cholera outbreaks in Bangladesh, little is known about the relationship between environmental conditions and cholera cases. We seek to develop a predictive model for cholera outbreaks in Bangladesh based on environmental predictors. To do this, we must estimate the environmental parameters in the context of a disease transmission model. We develop a method to simultaneously estimate the transmission parameters and the environmental parameters in a Susceptible-Infectious-Recovered-Susceptible (SIRS) model.

# The Career Leap from Academia to Data Science

The amount of data we generate as a global civilization is growing exponentially. What is more important, however, is that storing, accessing and analyzing data is getting cheaper and faster. Organizations all over the world have realized that data is a prized commodity, and many in the industry are scrambling to extract value from their complex data sets. For this endeavor, they need individuals with the right skills and experience, and the quantitative disciplines in academia are a great source of such individuals. In this talk, I will briefly describe my journey from a Ph.D.

# Population Genetic Variation: A Computationally Tractable Model for Large Samples Typed at Many Loci

Haplotypes are specific combinations of alleles on the same chromosome, and various methods exist for the analysis of haplotype data from unrelated individuals. However, humans are diploid and studies of genetic variation might consist of unphased genotype data, where an unordered pair of alleles is observed at each locus. There is a growing need for less computationally intensive models that may be directly applied to unphased genotype data from thousands of individuals at thousands of loci. In this talk, we present such a model for genetic variation.

# Likelihood-Based Inference for Partially Observed Multi-type Branching Processes

Advisor: Vladimir Minin

Branching processes are a class of continuous-time Markov chains (CTMCs) frequently used in stochastic modeling with ubiquitous applications. One-dimensional cases such as birth-death processes are well studied, but it is often necessary to model systems with more than one species; bivariate or other multi-type processes are commonly used to model phenomena such as competition, predation, or infection.

# Factor Model Monte Carlo Methods for General Fund-of-Funds Portfolio Management

The general Fund-of-Funds (GFoF) class of investment organizations includes fund-of-hedge funds (FoHF), family offices, endowments, pension plans and asset management companies. GFoF portfolios are characterized by two important types of returns problems among others. The first is that the returns histories of the portfolio assets are unequal, sometimes quite short and often contain multiple frequencies, resulting in structured missing data problems. The second is that the returns have fat-tailed and skewed distributions to varying degrees.

# Estimation in Generalized Linear Mixed Models: Comparison of Maximum Likelihood with Iterative Bias Correction

Advisor: Brian Leroux

# John's Walk

We present an affine-invariant random walk for drawing uniform random samples from a convex body for which the maximum volume inscribed ellipsoid, known as John's ellipsoid, may be computed. We consider a polytope where as a special case. Our algorithm makes steps using uniform sampling from the John's ellipsoid of the symmetrization of at the current point. We show that from a warm start, the random walk mixes in steps. This sampling algorithm thus offers improvement over the affine-invariant walk known as the Dikin Walk (which mixes in steps from a warm start) for applications in which .

# Conditional Tests for Localizing Trait Genes

# Graphical Models from Phylogenies, Coalescents, and Migration

**Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop**

# Realized Genome Sharing in Random Effects Models for Quantitative Genetic Traits

Advisor: Elizabeth Thompson

# TBD

# Separable covariance testing and estimating for sociomatrices

We consider the problem of testing and estimating separable covariances for relational data sets. We propose to model these data as matrix normal distributions with separate row and column covariance matrices. The existing literature on testing and estimation in the context of a matrix normal distribution requires multiple observations of the matrix, which rarely occurs for relational data sets.

# Survival Analysis by Threshold Regression with Time-Dependent Covariates

A natural approach to survival analysis in many settings is to model the subject’s “health” status as a latent stochastic process, where the terminal event is represented by the first time that the process crosses a threshold. “Threshold regression” models the covariate effects on the latent process. Much of the literature on threshold regression assumes that the process is one-dimensional Wiener, where crossing times have a tractable inverse Gaussian distribution but where the process characteristics are fixed at baseline.
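As a worked illustration of the inverse Gaussian crossing time mentioned above, consider the following sketch. The notation is an assumption introduced here, not taken from the talk: the latent process starts at $y_0 > 0$, has drift $\mu < 0$ and volatility $\sigma > 0$, and the threshold is at zero.

$$
Y(t) = y_0 + \mu t + \sigma W(t), \qquad T = \inf\{t > 0 : Y(t) \le 0\},
$$

$$
f_T(t) = \frac{y_0}{\sigma\sqrt{2\pi t^{3}}}\,\exp\!\left\{-\frac{(y_0 + \mu t)^{2}}{2\sigma^{2} t}\right\}, \qquad t > 0 .
$$

Under these assumptions $T$ is inverse Gaussian with mean $y_0/|\mu|$ and shape parameter $y_0^{2}/\sigma^{2}$. Covariates measured at baseline would typically enter through $y_0$ and $\mu$, which is why the process characteristics are fixed at baseline in this formulation.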

# Bayesian Modeling of Survey Data in Space and Time

Advisor: Jon Wakefield

Public health data are frequently obtained from surveys, which often have complex sampling designs. It is crucial that analyses account for the design to give appropriate inference. We describe two scenarios, both with important spatial components. The first example is motivated by Behavioral Risk Factor Surveillance System (BRFSS) data. Empirical Bayes and Bayes hierarchical models for small area estimation have been used extensively for surveys like BRFSS.

# Clustering with Confidence

One of the fundamental goals of nonparametric cluster analysis is to estimate the cluster tree of a density. I will define and illustrate the cluster tree and describe a graph-based procedure for its estimation. The cluster tree will usually have spurious leaves due to variability in the density estimate. I will introduce a bootstrap-based method for eliminating spurious leaves and “clustering with confidence”.

# Adaptive Higher-order Spectral Estimators

Advisor: Peter Hoff

Many applications involve estimation of a signal matrix from a noisy data matrix. In such cases, it has been observed that estimators that shrink or truncate the singular values of the data matrix perform well when the signal matrix has approximately low rank. In this talk, we generalize this approach to the estimation of a tensor of parameters from noisy tensor data. We develop new classes of estimators that shrink or threshold the mode-specific singular values from the higher-order singular value decomposition.
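The matrix case that the talk generalizes can be sketched as follows. This is a plain soft-thresholding of singular values written with NumPy, offered only as an illustration of the idea; the function name `soft_threshold_svd`, the threshold value and the simulated data are assumptions, and the tensor estimators of the talk are not reproduced here.

```python
import numpy as np

def soft_threshold_svd(Y, lam):
    """Shrink each singular value of Y by lam, truncating at zero."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    # Rebuild the matrix from the shrunken singular values.
    return (U * s_shrunk) @ Vt

# Simulated example (hypothetical): a rank-one signal plus small Gaussian noise.
rng = np.random.default_rng(0)
u = rng.normal(size=(20, 1))
v = rng.normal(size=(1, 15))
signal = 3.0 * u @ v
noisy = signal + 0.1 * rng.normal(size=(20, 15))

# lam is chosen above the largest singular value expected from the noise alone.
estimate = soft_threshold_svd(noisy, lam=1.5)
```

With this threshold the singular values attributable to noise are set to zero, so the estimate has low rank like the signal, while the leading singular value is shrunk by the same amount.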

# Geostatistical Model Averaging

Probabilistic weather forecasting is becoming an increasingly important and active area of research. Most current statistical post-processing techniques account for forecast bias and predictive variance without regard to forecast location. We will discuss a technique that adjusts bias and predictive variance locally, called geostatistical model averaging (GMA). In particular, GMA allows the parameters of the predictive distribution to vary over the model grid.

# Identification of an Infinite AR Model

# Manifold Learning Using Kernel Density Estimation and Local PCA

High-dimensional datasets often have lower-dimensional structure, which frequently takes the form of a manifold. There are many algorithms (e.g., Isomap) that are used in practice to fit manifolds and thus reduce the dimensionality of a given dataset. In our work, we consider the problem of recovering a d-dimensional submanifold M of R^n when provided with noiseless samples from M. Ideally, the estimate M_hat of M should be an actual manifold. Generally speaking, existing manifold learning algorithms do not meet these criteria.

# Probabilistic Wind Forecasting Using Bayesian Model Averaging

Bayesian model averaging has been shown to be a useful method for developing probabilistic weather forecasts for quantities (such as temperature) that can be represented by univariate normal distributions. This talk will discuss how these methods can be extended to other distributions, using wind forecasting as an example.

# Graphical Markov Models for Partially Observed Data Generating Mechanisms

**Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop**

Graphical Markov models represent statistical dependencies by combining two simple yet powerful mathematical concepts: graphs and conditional independence. A graphical Markov model is constructed by specifying local dependencies for each node of the graph in terms of its immediate neighbors, yet can represent a highly varied and complex system of multivariate dependencies by means of the global structure of the graph.

# Bayesian Graphical Models with Limited Data and External Information

Advisor: Tyler McCormick

# Likelihood Inference for Population Structure, Using the Coalescent

# Learning and Manifolds: Leveraging the Intrinsic Geometry

We explore and exploit the use of differential operators on manifolds, the Laplace-Beltrami operator in particular, in learning tasks. Specifically, we are interested in uncovering the geometric structure of data (unsupervised learning) and in exploiting information contained in unlabeled data for regression and classification tasks (semi-supervised learning). First, building on the Laplacian Eigenmap and Diffusion Maps framework, we propose a new paradigm that offers a guarantee, under reasonable assumptions, that any manifold learning algorithm will preserve the geometry of a data set.

# Degeneracy, Duration, and Co-evolution: Extensions of Exponential Random Graph Model (ERGM) for Social Network

We will address three aspects of statistical methodology for Exponential family Random Graph Models (ERGMs) in the context of applications to social network analysis. We start by addressing the topic of degeneracy in ERGMs. This is a problem often misunderstood to characterize the entire family of ERGMs, but is properly understood as a more limited issue of model misspecification.

# A Bayesian Surveillance System for Detecting Clusters of Non-Infectious Diseases

Advisor: Jon Wakefield

We consider the problem of detecting clusters of non-infectious and rare diseases. Cluster detection is the routine surveillance of a large expanse of small administrative regions to identify individual 'hot-spots' of elevated residual spatial risk without any preconceptions about their locations. A class of cluster detection procedures known as moving-window methods superimposes a large number of circular regions onto the study area.

# Probability and Inference for Random Fields

In recent decades, there has been much progress and interest in spatial statistics, with applications in agriculture, epidemiology, geology and other areas of environmental science and in image analysis. Two contrasting approaches have emerged, one based on Markov random fields, the other on geostatistics. The development of Markov Chain Monte Carlo as a computational tool has been phenomenal and has made Bayesian inference for spatial models relatively easy to perform, whereas frequentist inference still presents difficult problems.

# A Finite Population Likelihood Ratio Test of the Sharp Null Hypothesis for Compliers

Advisor: Thomas Richardson

# Portfolio Optimization and Asset Pricing with Skewed Fat-Tailed Distributions

# Estimation with Bivariate Interval Censored Data

# Improved estimation of bilateral migration flows

I propose a method for estimating migration flows between all pairs of countries, including breakdowns by place of birth. My estimator is a pseudo-Bayes estimator which smooths a set of state-of-the-art estimates of migration flows towards a simpler estimate which contains fewer structural zeroes. The smoothing process provides a natural way to bypass the state-of-the-art estimator's unrealistic assumption that the number of global migrants is as small as possible.

# Probabilistic Projections of Fertility Using a Bayesian Hierarchical

The United Nations Population Division produces estimates and projections of the total fertility rate for all countries in the world every two years. For countries with fertility above replacement level, future levels are projected by choosing one out of three scenarios describing the pace of future fertility decline.

I will discuss a Bayesian hierarchical model for producing country-specific projections of the total fertility rate, and assessing the uncertainty in these predictions. Results for various countries will be presented.

# A Survey of the Markov Properties of Directed, Undirected, and Mixed Graphs

**Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop**

# Exploring Rates and Patterns of Variability in Gene Conversion and Crossover in the Human Genome

Meiotic recombination is a biological process that shuffles our genetic material before we pass it along to our offspring. There are two known outcomes of recombination: crossover and gene conversion. Recently, fine-scale human crossover rates have been inferred with some success using statistical methodology applied to population data (i.e. genetic data on random samples of individuals from a population). However, reliable estimation of gene conversion rates has proven more difficult to come by.

# Estimating coancestry among multiple individuals in populations

Segments of genome inherited from a common ancestor by multiple individuals are said to be identical by descent (IBD). Dense genotyping platforms permit the detection of IBD segments less than 5 centiMorgans long, which arise due to coancestry on the order of dozens of generations ago. Generalizations of classical pedigree-based linkage methods use this inferred IBD and can be applied in situations where pedigree data is incomplete. We present a method for inferring IBD in groups of individuals without pedigrees.

# Running Markov Chain without Markov Basis

The methodology of Markov basis initiated by Diaconis and Sturmfels (1998) stimulated active research on Markov bases for more than a decade. It also motivated improvements of algorithms for Gröbner basis computation for toric ideals, such as those implemented in 4ti2.

# Seeing the Trees Through the Forest: A Competition Model for Growth and Mortality

Advisor: Peter Guttorp

Local competition between trees affects growth and mortality, from which emerge spatial patterns of surviving trees. Often, the patterns resulting from this unspecified process are treated as instances of spatial patterns and analyzed with point process methods. Alternatively, forest simulation models assume mechanistic processes and parameters to examine the effects of these assumptions on tree patterns over time, and assess sensitivity to changing conditions, such as climate.

# Goodness of Fit Through Empirical Likelihood: Berk-Jones, Reversed Berk-Jones, and Generalizations

# The Likelihood Pivot: Performing Inference with Confidence

Advisor: Peter Hoff

Maximum likelihood estimation is a popular method of statistical inference in part due to its efficiency. Unfortunately, much of the efficiency is lost when the model has been misspecified. To account for possible model misspecification, the sandwich estimate of variance can be used with MLE inference to generate asymptotically correct confidence intervals, but these intervals typically perform poorly at small sample sizes. In this talk, we present a pivot-based method that performs better than the sandwich and its adjustments at small sample sizes.

# Modeling Competition in Forest Development

Analysis of the patterns of entities and their attributes in space is a common and useful endeavor in ecology. Often, the end of a statistical analysis is a general characterization of the observed pattern or series of patterns. However, a good description of the outcome may be somewhat dissatisfying to the practicing scientist or resources manager in that the mechanisms and processes that led to the outcomes remain unknown.

# Is the Classical t-Test of the Slope Really Invalid in Linear Regression Models?

# Likelihood-based haplotype frequency modeling using variable-order Markov chains

The localized haplotype-cluster model uses variable-order Markov chains to create an empirical model for haplotype probabilities that adapts to the changing structure of linkage disequilibrium (LD) across the genome. By clustering haplotypes based on the Markov property, the model is able to take advantage of conditional independencies to improve estimates of haplotype frequencies while still respecting the dependencies induced by LD.

# Log-Linear Models for Heterogeneity in Bipartite Networks

# Counterfactuals and Bayesian Graphical Models

**Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop**

# Influence Functions in Finance: Statistical Analysis of Portfolio Risk and Performance Measures

# Testing for Differences between Least Squares and Robust Regression Estimates

At the present time there is no well accepted test for comparing least squares and robust linear regression coefficient estimates. To fill this gap we propose and demonstrate the efficacy of two Wald-like statistical tests for the above purposes, using for robust regression the class of MM-estimators.

# Hierarchical modelling of spatial structure of epidermal nerve fibers

Epidermal nerve fiber (ENF) density and morphology are used to diagnose small fiber involvement in diabetic and other small fiber neuropathies. ENF density and summed length of ENFs per epidermal surface area are reduced in diabetic subjects. Furthermore, based on mainly visual inspection, it has been reported that ENFs of subjects with diabetic neuropathy seem to appear more clustered than ENFs of healthy subjects. Therefore, it is important to understand the spatial structure of ENFs in healthy and diseased subjects.

# Bayesian Spatial and Temporal Methods for Public Health Data

Advisors: Adrian Dobra and Jon Wakefield

Understanding the relationships between disease incidence and risk factors such as demographic characteristics, lifestyle factors, and environmental contaminants is a central goal in public health and epidemiology. Often outcomes and risk factors are measured at specific locations or at particular times. We present flexible Bayesian models for spatial and temporal data to address important public health questions in two examples. In the first example, we consider low birthweight and preterm birth along with three risk factors in North Carolina.

# MS Thesis Presentation - Modeling the Game of Soccer Using Potential Functions

Advisor: Peter Guttorp

# Nonparametric Estimation of a k-Monotone Density: New Asymptotic Distribution Theory

# Scalable Methods for Inference of Multiple IBD

Advisor: Elizabeth Thompson

A major topic in statistical genetics is discovering the locations of genes contributing to complex traits through linkage analysis. The likelihood of a genetic marker controlling the expression of the trait is calculated using estimated identity-by-descent (IBD) graphs, which indicate whether copies of the marker shared among individuals are inherited from a common ancestor. Methods for estimating IBD graphs either use pedigree or population relationships between the individuals, and do not scale to a large number of individuals.

# Statistical Analysis of Portfolio Risk and Performance Measures - The Influence Function Approach

Advisor: R. Douglas Martin

# Pairwise Clustering by Random Walks

In a similarity-based clustering task, one defines a "similarity function" between pairs of points and then formulates a criterion (e.g. maximum intracluster similarity) that the clustering must optimize. The optimality criterion quantifies the intuitive notion that points in the same cluster should be similar while points in different clusters should be dissimilar. Most sensible criteria are NP-hard to optimize. An alternative view that has been successful in recent years is represented by spectral methods, where clustering is based on the first few eigenvectors of a matrix.

# Inference of Identity by Descent for Linkage Analysis

Advisor: Professor Elizabeth Thompson

Identity by descent (IBD) describes the pattern of shared inheritance of DNA among individuals. Two or more copies of DNA are identical by descent if they are inherited from the same common ancestor. IBD underlies the genetic similarity between individuals and thus similarity in observed genetic traits. In a family study of a genetic disease, estimated IBD among individuals in the family is used to identify potential locations of the gene that causes the disease.

# Up-and-Down and the Percentile-Finding Problem

A problem encountered across many fields in science, engineering and medicine is finding a specific percentile of a binary-response threshold distribution (for example, finding the ED50 of a medication). Statisticians have designed two popular sequential solutions to this challenge: 'Up-and-Down' (U&D), a 1940s-vintage method; and Bayesian designs, most prominently the 'Continual Reassessment Method' (CRM, O'Quigley et al., 1990), a design tailored to Phase I clinical trials. U&D generates a random walk revolving around the target percentile.
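The random walk generated by U&D can be illustrated with a short simulation. This is a minimal sketch of the classical median-targeting rule (step down after a response, step up after a non-response); the function name `up_and_down`, the dose grid and the logistic response curve are all assumptions made for illustration.

```python
import math
import random

def up_and_down(doses, response_prob, start, n_trials, seed=0):
    """Simulate the classical up-and-down design, which targets the median (ED50)."""
    rng = random.Random(seed)
    level = start
    path = []
    for _ in range(n_trials):
        path.append(level)
        responded = rng.random() < response_prob(doses[level])
        if responded:
            # Response observed: move one dose level down (stay at the lowest).
            level = max(level - 1, 0)
        else:
            # No response: move one dose level up (stay at the highest).
            level = min(level + 1, len(doses) - 1)
    return path

# Hypothetical dose grid and response curve with ED50 at dose 4.
doses = [1, 2, 3, 4, 5, 6, 7]

def response_prob(d):
    return 1.0 / (1.0 + math.exp(-(d - 4.0)))

path = up_and_down(doses, response_prob, start=0, n_trials=200)
```

Each entry of `path` is the index of the dose given at that trial, so plotting `path` against the trial number shows the walk drifting from the starting level toward the neighborhood of the target and then fluctuating there.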

# Lattice Conditional Independence Models for Incomplete Multivariate Data and for Seemingly Unrelated Regressions

Advisor: Michael Perlman

# Nonparametric Estimation for Current Status Data with Competing Risks

We study the nonparametric maximum likelihood estimator (MLE) for current status data with competing risks. These data arise naturally in cross-sectional survival studies with several failure causes, and generalizations arise in HIV vaccine clinical trials. Until now, the asymptotic properties of the MLE have been largely unknown. We resolve this issue by proving consistency, the rate of convergence, and the limiting distribution of the MLE.

# TBD

# Bayesian Population Reconstruction: A Method for Estimating Age- and Sex-specific Vital Rates and Population Counts with Uncertainty from Fragmentary Data

Current methods for reconstructing human populations of the past by age and sex are deterministic or do not formally account for measurement error. I propose "Bayesian reconstruction", a method for simultaneously estimating age-specific population counts, fertility rates, mortality rates and net international migration flows from fragmentary data, that incorporates measurement error. Expert opinion is incorporated formally through informative priors. Inference is based on joint posterior probability distributions which yield fully probabilistic interval estimates.

# Improving on the Sandwich

Advisor: Peter Hoff

# Parametrizations of Discrete Graphical Models

Advisor: Thomas Richardson

Graphical models provide an intuitive way of representing conditional independence relations over multivariate distributions. We work with a very general class of graphs we dub Mixed Euphonious Graphs (MEGs), which include DAGs, undirected graphs and ancestral graphs as special cases. Markov properties and parametrizations of discrete distributions obeying the global Markov property for MEGs were found by Richardson (2003, 2009). We discuss this parametrization, and a Maximum Likelihood fitting algorithm which uses it.

# Assessing the Detrended Fluctuation Analysis Method of Estimating the Hurst Coefficient

# High Dimensional Inference of Graphical Models Using Regularized Score Matching

Advisor: Mathias Drton

# Postulating Monotonicity in Bayesian Nonparametric Regression

It is often reasonable, by using earlier empirical evidence or theoretical understanding of the considered applied context, to assume that the regression surface corresponding to a response variable, as a function of the model covariates, is either monotonically increasing or monotonically decreasing, but then otherwise leave the form of such a function unspecified. In this talk we consider the practical implications of making such a postulate when applying variable dimensional Bayesian modeling, MCMC, and model averaging.

# Markov Equivalence Classes for Bayesian Belief Networks

Acyclic digraphs are used to represent the underlying relationships of some Bayesian belief networks, which are in turn used in expert systems and other representations of statistically interdependent items. But the set of such digraphs turns out to be too big and, instead, a smaller number of equivalence classes truly represent the set of possible networks. Until now, little has been known about the combinatorial properties of these classes, such as their asymptotic growth with number of vertices or the average class size.

# Parameter Identification and Assessment of Independence in Multivariate Statistical Modeling

Linear (causal) relationships between random variables can be conveniently encoded using a mixed graph (a graph with both directed and bidirected edges) where a directed edge implies a direct linear effect and a bidirected edge captures the existence of unobserved confounding. Even when there is a known mixed graph that accurately reflects the data generating mechanism, that is, all causal relationships are known and linear, confounding can make it impossible to infer parameters of interest. More concretely, many mixed graphs have (generically) unidentifiable parameters.

# Latent Class Transition Model Extensions with Covariates for the Chronically Disabled U.S. Elderly Population

# Explicit Limit Results for Markov Chains and Other Markov Processes

The statistical literature abounds with limit results (central limit theorems, laws of large numbers and laws of the iterated logarithm) for Markov chains, Markov renewal processes, and Markov additive processes. However, most of the general results are not applicable in practice because the limiting quantities are generally not available in explicit form.

# Likelihood Inference for Population Structure, Using Coalescent

# Bayesian Nonparametric Inference of Effective Population Size Trajectories from Genomic Data

Phylodynamics is an area at the intersection of phylogenetics and population genetics that aims to reconstruct population size trajectories from genetic data.

# Covariance Estimation and Testing for the Array Normal Model

Advisor: Peter Hoff

# Bayesian Nonparametric Inference of Population Trajectories with Gaussian Processes

Advisor: Vladimir Minin

Changes in population size influence genetic diversity of the population and, as a result, leave imprints in genomes of individuals in the population. We are interested in an inverse problem of reconstructing past population dynamics from genomic data. We start with a standard framework based on the coalescent, a stochastic process that generates genealogies connecting randomly sampled individuals from the population of interest. These genealogies serve as a glue between the population demographic history and genomic sequences.
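
As a rough illustration of the coalescent mentioned above, the sketch below simulates the waiting times between coalescent events. It assumes Kingman's coalescent with a constant population size and time measured in coalescent units, which is a simplification chosen here (the abstract concerns changing population sizes). The function names are hypothetical and not taken from the talk.

```python
import random


def simulate_coalescent_times(n_samples, rng):
    """Return inter-coalescent waiting times for n_samples sampled lineages.

    Assumption: constant population size, time in coalescent units, so that
    while k lineages remain the time to the next merger is exponential with
    rate k * (k - 1) / 2.
    """
    times = []
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2
        times.append(rng.expovariate(rate))
    return times


def mean_tmrca(n_samples, n_reps, seed=0):
    """Monte Carlo estimate of the expected time to the most recent common ancestor."""
    rng = random.Random(seed)
    total = sum(sum(simulate_coalescent_times(n_samples, rng)) for _ in range(n_reps))
    return total / n_reps


if __name__ == "__main__":
    n = 10
    print("simulated mean TMRCA:", mean_tmrca(n, 20000))
    # Under the constant-size assumption the expectation is 2 * (1 - 1/n).
    print("theoretical mean TMRCA:", 2 * (1 - 1 / n))
```

Under these assumptions the simulated mean should land close to the theoretical value; with a time-varying population size the rates would have to be rescaled by the size trajectory, which this sketch does not attempt.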

# Robust Bayesian Analysis of Gene Expression Microarray Data

# Bayesian Methods for Inferring Gene Regulatory Networks

Advisor: Adrian Raftery

Gene regulatory networks are an important piece in understanding the functioning of living cells. As more and more gene expression data is becoming available, researchers need fast, reliable techniques for inferring these networks. I have developed ScanBMA, a fast Bayesian model averaging algorithm, used to infer networks from time-series data. I have also developed Model-based Clustering with Data Correction (MCDC), a method for automatically detecting and correcting errors that systematically affect some but not all data.

# Learning Transcriptional Networks from the Integration of ChIP-chip and Expression Data in a Nonparametric Model

We have developed LeTICE, an algorithm for learning a transcriptional regulatory network from ChIP-chip location and expression data. The network is specified by a binary matrix of transcription factor – gene interactions which partitions the genes into a collection of modules (groups of genes regulated by the same TFs) and a background (a group of genes which do not belong to any module). We define a likelihood of a network given location and expression data and then search for the network optimizing the likelihood using numerical optimization.

# MS Thesis Presentation - On Left-Stochastic Decomposition Clustering

Advisor: Maya Gupta

# A Sharp Multiplier Inequality with Applications to Heavy-Tailed Regression Problems

Advisor: Professor Jon A. Wellner

We develop a sharp multiplier inequality used to study the size of the multiplier empirical process $(\sum_{i=1}^n \xi_i f(X_i))_{f \in \mathcal{F}}$, where the $\xi_i$'s and $\mathcal{F}$ are multipliers and an indexing function class, respectively. We show that in general the size of the suprema of the multiplier empirical process is determined jointly by the growth order of the corresponding empirical process, and the worst size of the maxima of the multipliers.

# Bayesian Hierarchical Curve Registration

A number of scientific fields, ranging from biomedicine and economics to molecular biology, generate functional data. The statistical analysis of a sample of curves, known as Functional Data Analysis (FDA), has as one of its goals describing how variation in the functional outcome can be explained by predictors. However, these curves tend to be misaligned, exhibiting variation not only in amplitude, but also in phase. Teasing apart these sources of variation is a central issue in FDA.

# Statistical Approaches to Analyze Mass Spectrometry Data

Advisors: Vladimir Minin & David Goodlett

Proteomics attempts to understand biological functions of an organism through the lens of expressed proteins, basic building blocks of all living cells. Mass spectrometry is used in the field of shotgun proteomics to generate mass spectra that are in turn used to identify and quantify proteins in a given sample.

# The Up-and-Down Percentile-Finding Method: Stochastic Properties, Estimation and Design

# Models and Inference for Network and Attribute Data

Latent variable network models provide low-dimensional representations of relational patterns in terms of additive and multiplicative actor-specific effects. In this talk we discuss these models in two contexts. First, we extend this class of models to estimate and make inference on the dependencies between a set of network relations and actor-specific attributes. Approaches to this problem typically condition on either the relations or attributes and are unable to provide predictions simultaneously for missing attribute and network information.

# Gravimetric Anomaly Detection Using Compressed Sensing

Advisor: Marina Meila

We address the problem of identifying underground anomalies (e.g. holes) based on gravity measurements. This is a theoretically well-studied and difficult problem. In all except a few special cases, the inverse problem has multiple solutions, and additional constraints are needed to regularize it. Our approach makes general assumptions about the shape of the anomaly that can also be seen as sparsity assumptions. We can then adapt recently developed sparse reconstruction algorithms to bear on this problem.

# Estimating Social Contact Networks to Improve Influenza Simulation Models

Advisor: Mark Handcock

Influenza pandemics pose a serious global health concern. The recent A (H1N1) influenza pandemic caused 18,500 lab-confirmed deaths, and mutation of the A (H5N1) "avian" influenza virus could also cause a pandemic with an estimated 60% case mortality rate in humans, requiring fast analysis of intervention and containment strategies. When a new influenza virus emerges with pandemic potential, stochastic simulation models are used to assess the effectiveness of different strategies.

# Introduction to Graphical Models

I will give a brief introduction to graphical models that will be followed by an outline of a few topics that future students of Michael Perlman and Thomas Richardson could work on.

# Phylogenetic Stochastic Mapping

Advisor: Vladimir Minin

# Semiparametric Copula Models for Diverse Types of Dependent Data

In multivariate analysis, we are often interested in studying the dependence structure among diverse types of data, including continuous, ordinal, and non-ordered categorical data. One approach to analyzing these data is to use copula models. In this talk, I will discuss a method extending copula models to mixed continuous and ordinal data and study its asymptotic properties. Then I will introduce a new model incorporating copula models and model-based clustering ideas to deal with mixed continuous, ordinal and categorical data.

# Ergodic Limit Laws for Stochastic Optimization Problems

Propp and Wilson's coupling from the past (CFTP) algorithm provides exact samples and, thus, an elegant alternative to convergence diagnostics for standard MCMC samplers. I shall explain how this method works and discuss some practicalities regarding its use in MCMC sampling. Unfortunately the CFTP technique is only applicable when the distribution to be sampled possesses certain special properties. We propose a way to use the method's basic idea more generally and demonstrate that our algorithm works well in some quite challenging applications.
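
To make the idea of coupling from the past concrete, here is a minimal sketch for a toy chain: a reflecting random walk on {0, ..., k}, chosen here purely for illustration because its update rule is monotone. The chain, the function names and the parameters are assumptions of this example and are not taken from the talk. The sampler runs a lowest and a highest chain from time -T to time 0 with shared randomness, doubling T and reusing the random numbers already drawn until the two chains meet.

```python
import random


def update(x, u, k, p):
    """One step of the toy chain: move up if u < p, otherwise down, clamped to [0, k].

    For a fixed u this map is monotone in x, which is what lets us track
    only the bottom and top states.
    """
    return min(x + 1, k) if u < p else max(x - 1, 0)


def cftp_sample(k, p, rng):
    """Draw one exact sample from the stationary law of the toy chain via monotone CFTP."""
    us = []  # us[i] drives the step from time -(i + 1) to time -i
    t = 1
    while True:
        # Extend the randomness further into the past; earlier draws are reused.
        while len(us) < t:
            us.append(rng.random())
        lo, hi = 0, k
        for i in range(t - 1, -1, -1):
            lo = update(lo, us[i], k, p)
            hi = update(hi, us[i], k, p)
        if lo == hi:
            return lo  # coalescence: the value at time 0 is an exact draw
        t *= 2


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [cftp_sample(4, 0.5, rng) for _ in range(5000)]
    # With p = 0.5 the transition matrix is symmetric, so the target is uniform.
    print([draws.count(s) / len(draws) for s in range(5)])
```

Reusing the stored random numbers when the starting time is pushed back is essential; drawing fresh randomness for the steps near time 0 on every attempt would bias the output.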

# Applications of Robust Statistical Methods in Quantitative Finance

Advisor: Douglas Martin

Financial asset returns and fundamental factor exposure data often contain outliers, observations that are inconsistent with the majority of the data. Both academic finance researchers and quantitative finance professionals are well aware of the occurrence of outliers in financial data, and seek to limit the influence of such observations in data analyses. Commonly used outlier mitigation techniques assume that it is sufficient to deal with outliers in each variable separately.

# Wavelet Variance Analysis for Time Series and Random Fields

Wavelets give rise to the concept of wavelet variance that decomposes the variance of a time series on a scale by scale basis and that has considerable appeal when physical phenomena are analyzed in terms of variations operating over a range of different scales. The wavelet variance has been applied to a variety of time series and is useful as an exploratory tool to identify important scales, to assess the exponent parameter of a power law process, to detect inhomogeneity and to estimate a time varying spectral density function.
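
The scale-by-scale decomposition can be illustrated with a small sketch. It uses the orthonormal Haar discrete wavelet transform on a series whose length is a power of two, which is a simplifying assumption made here; the estimators discussed in the wavelet variance literature are typically more refined. The function name is hypothetical. Because the transform preserves energy, the per-scale contributions add up to the sample variance of the series.

```python
import math
import random
import statistics


def haar_wavelet_variance(x):
    """Split the (population) sample variance of x into per-scale contributions.

    Assumption: len(x) is a power of two. Entry j of the result is the
    contribution of changes at scale 2**j, computed from Haar wavelet
    coefficients.
    """
    n = len(x)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two and at least 2")
    approx = list(x)
    contributions = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(b - a) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        contributions.append(sum(d * d for d in detail) / n)
    return contributions


if __name__ == "__main__":
    rng = random.Random(0)
    series = [rng.gauss(0, 1) for _ in range(64)]
    parts = haar_wavelet_variance(series)
    print("per-scale contributions:", parts)
    print("sum of contributions:", sum(parts))
    print("sample variance:", statistics.pvariance(series))
```

A series that alternates between two values puts all of its variance at the finest scale, while a slowly varying series shifts it toward the coarser scales.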

# Hammersley's Process with Sources and Sinks

Hammersley (1972) initiated a very interesting "hydrodynamical" approach to the study of the behavior of the lengths of longest increasing subsequences of random permutations. In the nineties Aldous and Diaconis (1995) introduced a modified version of the interacting particle process, studied in Hammersley (1972), and used this modification in a proof of the fact that the length of a longest increasing subsequence of a (uniform) random permutation of length $n$, divided by $\sqrt{n}$, converges in probability to 2.
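
The limit statement can be checked numerically with a short simulation. The sketch below is only an illustration and is unrelated to the particle-process proof: it computes the length of a longest increasing subsequence with the standard patience-sorting method and averages the ratio to $\sqrt{n}$ over a few random permutations. Function names and the choice of $n$ are assumptions of this example.

```python
import bisect
import math
import random


def lis_length(seq):
    """Length of a longest strictly increasing subsequence, in O(n log n).

    tails[i] holds the smallest possible last element of an increasing
    subsequence of length i + 1 seen so far.
    """
    tails = []
    for value in seq:
        pos = bisect.bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)


def mean_scaled_lis(n, n_reps, seed=0):
    """Average of LIS length divided by sqrt(n) over n_reps uniform random permutations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        perm = list(range(n))
        rng.shuffle(perm)
        total += lis_length(perm) / math.sqrt(n)
    return total / n_reps


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, mean_scaled_lis(n, 20))
```

For moderate $n$ the ratio typically sits somewhat below 2 and approaches it slowly as $n$ grows.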

# Robust Estimation of Factor Models in Finance

# Statistical inference using Kronecker structured covariance

We consider the problem of testing and estimation of separable covariances for relational data sets in the context of the matrix-variate normal distribution. Relational data are often represented as a square matrix, the entries of which record the relationships between pairs of objects. Many statistical methods for the analysis of such data assume some degree of similarity or dependence between objects in terms of the way they relate to each other. However, formal tests for such dependence have not been developed.

# Predictive Modeling of Cholera Outbreaks in Bangladesh

Advisors: Vladimir Minin and Ira Longini

Despite seasonal cholera outbreaks in Bangladesh, little is known about the relationship between environmental conditions and cholera cases. We seek to develop a predictive model for cholera outbreaks in Bangladesh based on environmental predictors. To do this, we estimate the contribution of environmental variables, such as water depth and water temperature, to cholera outbreaks in the context of two different disease transmission models.

# Modeling Longitudinal Multivariate Data with Mixed Outcomes: Hierarchical Latent Trait and Individual-Level Mixture Models

Advisor: Elena Erosheva

I develop Bayesian hierarchical latent variable models for the study of longitudinal multivariate data. The latent variable models seek to represent multivariate data with a reduced number of dimensions while the hierarchical formulation enables the description of the latent structure evolution over time as well as factors associated with this evolution. Research on cognitive assessments and scientific interest in relating cognitive decline to neuroimaging results and biomarker information motivate these models.

# TBD

There will be a riveting introduction to social networks and the latent space model used for modeling networks. I will discuss the difficulties in estimating the parameters of this model by traditional methods and explain the estimator we came up with to deal with these issues.

# Testing high-dimensional covariance/correlation structures

Advisor: Mathias Drton

Abstract:

Two hypothesis testing problems related to high-dimensional covariance/correlation structures will be presented.

# Estimation of Convex-Transformed Densities

A convex-transformed density is a quasi-concave (or a quasi-convex) density which is a composition of monotone and convex functions. We consider a scale of such families of multivariate densities indexed by a parameter which is a monotone function. The exponential function corresponds to log-concave densities, while power functions correspond to heavier tailed densities or densities concentrated on the positive orthant.

# Classification by Opinion-Changing Behavior: A Mixture Model Approach

Popular theories in political science regarding opinion-changing behavior postulate the existence of one or both of two broad categories of people: those who hold their opinions over time; and those who hold no solid opinion and, when asked to make a choice, do so seemingly at random. This study explores evidence for a third category: durable changers. This group of people will change their opinion in a rational, informed manner, after being exposed to new information.

# Scalable Manifold Learning and Related Topics

Advisor: Marina Meila

# Robust Statistics and Heavy-Tailed Distributions in Portfolio Optimization

# Postprocessing of Precipitation Forecasts with an SPDE Based Spatio-temporal Model for Large Data

We introduce a hierarchical Bayesian model (HBM) for precipitation monitoring data that incorporates numerical weather prediction (NWP) model output at high spatial and temporal resolution and a physics-based stochastic partial differential equation (SPDE). The SPDE explicitly models phenomena such as advection and diffusion that occur in many natural processes. We approximate the solution of the SPDE in the spectral space using the method of eigenfunctions to reduce the dimensionality of the problem.

# Allele-Sharing Methods for Linkage Detection Using Extended Pedigrees

Allele-sharing methods provide a robust approach to linkage detection for complex traits using pedigree data. Affected related individuals have increased probability of sharing genes identical-by-descent (IBD) at trait loci and hence also at linked marker loci at which they therefore show increased similarity over that predicted under Mendelian segregation. Relatives of discordant phenotype have decreased probability of sharing genes IBD at trait loci and hence have decreased similarity at linked markers.

# Estimating the Treatment Effect of Non-Randomized Educational Interventions: The Case of Special Education

A central goal of the education literature is to demonstrate that specific educational interventions have a treatment effect on student test performance. Researchers often have access to student test scores for students in the treatment and control groups both prior to and after the intervention, but usually must estimate the treatment effect from observational data in which the intervention has not been randomly assigned to units. This talk begins with a discussion of the assumptions that underlie common approaches to estimating a treatment effect with observational data.

# Classifying Immune Responses in Peptide Microarray Immunoassays

Advisor: Dr. Raphael Gottardo

Peptide microarrays tiling immunogenic regions of pathogens (e.g. envelope proteins of a virus) have become an important high throughput tool for querying and mapping antibody binding. Antibodies play a key role in the immune system by preventing and controlling infection. Antibody binding locations provide crucial information for understanding natural infection and for deriving effective vaccines. In the context of vaccine development, the peptide microarray can reveal patterns of antibody response stimulated via vaccine treatment.

# Portfolio Optimization with Tail Risk Measures and Non-Normal Returns

Advisor: R. Douglas Martin

# TBD

The Capital Asset Pricing Model (CAPM) is today's most important financial model for estimating cost of capital and asset allocation. Its centerpieces are variables, commonly called betas and alphas, estimated using ordinary least squares (OLS) regression. Since financial returns typically have an asymmetric and heavy-tailed distribution, OLS estimates can be severely biased. In this talk we will introduce robust regression estimates with zero bias in beta and low bias in alpha (even under asymmetric distributions) but 99% asymptotic efficiency at the Gaussian model.
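
For reference, the OLS estimates of alpha and beta come from a simple regression of asset excess returns on market excess returns. The sketch below shows that calculation and how a single outlying return can move the OLS beta. The numbers are made up purely for illustration, the function name is hypothetical, and the robust estimators discussed in the talk are not implemented here.

```python
def ols_alpha_beta(asset_excess, market_excess):
    """Return (alpha, beta) from an OLS fit of asset_excess on market_excess."""
    n = len(asset_excess)
    if n != len(market_excess) or n < 2:
        raise ValueError("need two equally long series with at least two points")
    mean_a = sum(asset_excess) / n
    mean_m = sum(market_excess) / n
    sxx = sum((m - mean_m) ** 2 for m in market_excess)
    sxy = sum((m - mean_m) * (a - mean_a)
              for m, a in zip(market_excess, asset_excess))
    beta = sxy / sxx               # slope: sample covariance over market variance
    alpha = mean_a - beta * mean_m  # intercept
    return alpha, beta


if __name__ == "__main__":
    # Illustrative, made-up returns: the asset follows 0.001 + 1.2 * market exactly.
    market = [-0.02, -0.01, 0.0, 0.01, 0.02]
    asset = [0.001 + 1.2 * m for m in market]
    print("clean data:", ols_alpha_beta(asset, market))

    # Replace one observation with an extreme return to show OLS sensitivity.
    contaminated = asset[:-1] + [-0.20]
    print("one outlier:", ols_alpha_beta(contaminated, market))
```

On the clean series the fit recovers the generating intercept and slope, while the single contaminated observation is enough to change the sign of the estimated beta, which is the kind of sensitivity that motivates robust alternatives.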

# Model-based and model-free community recovery in graphs

Advisor: Marina Meila

# Extensions of Latent Class Transition Models with Application to Chronic Disability Survey Data

Latent class transition models (LCTMs) are used to study the movement of individuals among homogeneous subgroups through time. Traditional LCTMs assume a complete set of observations for each individual. However, many longitudinal surveys have a rolling enrollment design, with late entry and early exit. Thus, methodology is needed to account for all the possible times at which individuals can be observed.

# Logic Regression

Advisors: Charles Kooperberg & Michael LeBlanc

# Maximum-Likelihood Inference after Model Selection

Standard statistical techniques often fail in the presence of data-driven model selection, yielding inefficient estimators and hypothesis tests that fail to achieve nominal type-I error rates. In particular, the observed data is constrained to lie in a subset of the original sample space that is determined by the selected model. This often makes the post-selection likelihood of the observed data intractable and inference difficult. Recently, novel methodologies have been proposed for performing valid inference in selected models.

# Jump Estimation in Inverse Regression Models

We provide an asymptotic theory for penalized least squares estimators of locally constant functions with finitely many jumps which are blurred by an operator and random noise. Differences from the direct case are highlighted; in particular, it turns out that a $\sqrt{n}$ rate of convergence for estimation of the jump locations is generic in the inverse case. Moreover, the locations of jumps are jointly asymptotically normal, which allows the construction of confidence regions for the graph of a function with a finite number of jumps.

# Nonstationary Modeling Through Dimension Expansion

If atmospheric, agricultural, and other environmental systems share one underlying theme it is complex spatial structures, being influenced by such features as topography and weather. Ideally we might model these effects directly; however, information on the underlying causes is often not routinely available. Hence, when modeling environmental systems there exists a need for a class of spatial models which does not rely on the assumption of stationarity. In this talk, we propose a novel approach to modeling nonstationary spatial fields.

# Using the Structure of d-Connecting Paths as a Qualitative Measure of the Strength of Dependence

# Restricted Covariance Priors with Applications in Spatial Statistics

We present a Bayesian model for area-level count data that uses Gaussian random effects with a novel type of G-Wishart prior on the inverse variance-covariance matrix. The usual G-Wishart prior restricts off-diagonal elements of the precision matrix to 0 according to the neighborhood structure of the study region. This preserves conditional independence of non-neighboring regions but is more flexible than the traditional intrinsic autoregression prior.

# R-Squared Inference Under Non-Normal Error

Advisor: Professor Ross L. Prentice

# Covariance Estimation in the Presence of Diverse Types of Data

Advisor: Peter Hoff

# TBD

MURI week continues this Friday. I'll be talking about probabilistic weather forecasting using Bayesian Model Averaging, an altogether different approach than the probabilistic forecasting method described by Tilmann in seminar earlier this week. I'll be discussing my work on forecasting of wind and rain, and looking at a modification of the EM algorithm for mixed continuous/discrete distributions.

# Bayes and Empirical Bayes Methods for Social Network Analysis

Advisor: Peter Hoff

# Estimates and Projections of the Total Fertility Rate

# Bayesian Analysis of Deterministic Models

Advisor: Adrian Raftery

# Maximum-Likelihood Inference after Model Selection

Co-Advisors: Mathias Drton & Raphael Gottardo

Standard statistical techniques often fail in the presence of data-driven model selection, yielding inefficient estimators and hypothesis tests that fail to achieve nominal type-I error rates. In particular, the observed data is constrained to lie in a subset of the original sample space that is determined by the selected model. This often makes the post-selection likelihood of the observed data intractable and inference difficult. Recently, novel methodologies have been proposed for performing valid inference in selected models.

# Probabilistic Weather Forecasting with Spatial Dependence

# Bayesian Modeling For Multivariate Mixed Outcomes With Applications To Cognitive Testing Data

This talk describes new multivariate regression and model-based clustering methods for statistical inference with multivariate mixed outcomes. We use the term mixed outcomes to refer to binary, ordered categorical, count, continuous and other ordered outcomes in combination. Such data structures are common in social, behavioral, and medical sciences. We develop two regression approaches, the semiparametric Bayesian latent variable model and the semiparametric reduced rank multivariate regression model, for mixed outcome data.

# Peptide Sequencing Using Tandem Mass Spectrometry

Tandem mass spectrometry has become a leading technology for protein identification. Much research has been done to automate the task of matching spectra to peptides.

In this study, we propose a probabilistic sequencing algorithm. It includes a probabilistic network to model the chemistry in the generation of the theoretical spectrum, a pair hidden Markov model to match the theoretical and observed spectra, and a probabilistic score function to rank the candidate sequences.

# Gravimetric Anomaly Detection Using Compressed Sensing

We address the problem of identifying underground anomalies (e.g. holes) based on gravity measurements. This is a theoretically well-studied and difficult problem. In all except a few special cases, the inverse problem has multiple solutions, and additional constraints are needed to regularize it. Our approach makes general assumptions about the shape of the anomaly that can be seen as sparsity assumptions. Then we adapt recently developed sparse reconstruction algorithms to bear on this problem.

# Bayesian Space-Time Smoothing Models for Small Area Estimation

Advisor: Jon Wakefield

Area and time-specific estimates of disease rates, cause-specific mortality rates and other key health indicators are of great interest for health care and policy purposes. Such estimates provide the information needed to identify areas with increased risk, effectively allocate resources, and target interventions. A wide variety of data, such as vital statistics, complex surveys, demographic surveillance sites, and disease registries, are used for these purposes.

# Bayesian Inference for Exponential-family Random Graph Models for Social Networks

The exponential-family random graph model (ERGM) has been widely applied in social network analysis, genetics (e.g. protein interaction networks), information theory, etc. Because of the intractability of the likelihood function, Markov chain Monte Carlo (MCMC) approximation is typically applied to obtain maximum likelihood estimators (Geyer and Thompson 1992). However, ERGMs still suffer from inferential degeneracy and computational deficiency. In this talk, we present Bayesian inference for ERGMs.

# A New Goodness of Fit Test: The Reversed Berk-Jones Statistic

In classical testing problems, we often use statistics based on the empirical distribution function to test whether or not the underlying distribution of the data is what we think it might be. Berk and Jones introduced such a statistic in 1979. I'll talk about a statistic which is related to theirs (called the reversed Berk-Jones statistic), and some of its properties. Along the way we'll chat about what exactly the empirical distribution function is, and why I think it's so cool. That is all.

# Improving Serfling's Inequality for the Hypergeometric Distribution

Advisor: Jon Wellner

Abstract:

We discuss a method for obtaining finite sample Gaussian bounds for the tail of the hypergeometric distribution. The method is based on Tusnády's approach (1975) to bounding the tail of symmetric binomial random variables. In this talk, we review Tusnády's result, and discuss how it can be adapted to and extended in the hypergeometric case.
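
As background for the title, the sketch below compares the exact upper tail of a hypergeometric count with Serfling's exponential bound. The bound is written in the form in which it is commonly quoted for 0/1 populations, $\exp(-2n\varepsilon^2/(1 - f_n^*))$ with $f_n^* = (n-1)/N$; treat that form, the function names and the example parameters as assumptions of this illustration. The improved bounds discussed in the talk are not reproduced here.

```python
import math


def hypergeom_upper_tail(pop_size, n_success, n_draws, k):
    """Exact P(X >= k) when X counts successes in n_draws draws without replacement."""
    total = math.comb(pop_size, n_draws)
    upper = min(n_draws, n_success)
    num = sum(
        math.comb(n_success, j) * math.comb(pop_size - n_success, n_draws - j)
        for j in range(max(k, 0), upper + 1)
    )
    return num / total


def serfling_bound(pop_size, n_success, n_draws, k):
    """Serfling-type bound on P(X >= k), assuming the form quoted in the lead-in.

    Here eps = k / n_draws - n_success / pop_size is the excess of the sample
    proportion over the population proportion; the bound is trivial if eps <= 0.
    """
    eps = k / n_draws - n_success / pop_size
    if eps <= 0:
        return 1.0
    f_star = (n_draws - 1) / pop_size
    return math.exp(-2 * n_draws * eps ** 2 / (1 - f_star))


if __name__ == "__main__":
    N, D, n = 100, 40, 20  # illustrative population size, successes, and draws
    for k in range(9, 16):
        exact = hypergeom_upper_tail(N, D, n, k)
        bound = serfling_bound(N, D, n, k)
        print(k, exact, bound)
```

Printing the two columns side by side shows how much slack the bound leaves in the tail for this particular configuration.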

# Statistical Solutions to Some Problems in Medical Imaging

# To Sample or Not to Sample? Why is That the Question for Census 2000?

Public Talk

# A General Approach to Nonparametric Monotone Function Estimation

For several important monotone parameters, such as the distribution function, monotone density function, and monotone regression function, sensible nonparametric estimators can be obtained by minimizing the empirical risk based on an appropriate loss function. For more complex monotone parameters, such as a monotone covariate-adjusted dose-response curve, or in the context of more complex data structures, this approach may not be possible and alternative approaches are needed. We discuss general strategies for monotone function estimation in two important settings.

# On Monotonicity Constraints in High-Dimensional Optimization: Convexity and Mixture Models

# Whole-Genome Quantitative Trait Prediction and Heritability Mapping via an Infinite Allele Model

The paradox of missing heritability refers to the common finding that in complex genetic traits with high heritability as estimated by methods such as twin studies, only a small fraction of the population variance is explained by the few Single Nucleotide Polymorphism (SNP) markers which are found to be individually significantly associated with the trait. Human height, with heritability estimates as high as 80% largely unexplained by individual SNPs, is the canonical example of such a trait.