B and T cell receptors, also known as adaptive immune receptors, perform key roles in adaptive immunity.
These proteins recognize and neutralize foreign invaders such as viruses and bacteria, providing robust and long-lasting immunological protection.
The DNA sequences coding for these receptors arise by a complex recombination process followed by a series of productivity-based filters, as well as affinity maturation for B cells, giving considerable diversity to the circulating pool of these sequences.

Causal discovery algorithms tackle the challenging problem of learning the causal relationships among a set of variables from observational data, but they are often partly ad hoc and give the researcher no measure of confidence in the correctness of the learned causal structure. I introduce Bayesian Causal Model Selection (BCMS), a Bayesian framework for causal discovery that unifies existing methods by expressing identifiability assumptions through the model prior.

With the increasing ability to collect myriad types of spatial data, we find ourselves regularly presented with new modeling problems that require novel solutions, but many of the available options for fitting spatial statistical models have limited applicability. Here we describe, evaluate and critique Template Model Builder (TMB), an existing but relatively unknown and unvetted (within the statistics community) modeling tool that leverages Laplace approximations to fit a large class of mixed effects models, including many spatial and spatial-temporal models.

  1. Call to Order
  2. Chair's Remarks
    • 2020-2021 Faculty Teaching Preferences
    • February 21 - Research Exchange presentation by Provost Mark Richards
    • March 31 - VRI deadline at 11:59pm (no exceptions)
    • April 4 - Admitted Student Preview Day for College of Arts and Sciences
    • May 7 - Amanda Cox, Graduate School Public Lecture

Consider the heteroscedastic nonparametric regression model with random design $Y_i = f(X_i) + V^{1/2}(X_i)\varepsilon_i, \quad i=1,2,\ldots,n$, with $f(\cdot)$ and $V(\cdot)$ $\alpha$- and $\beta$-H\"older smooth, respectively. We show that the minimax rate of estimating $V(\cdot)$ under both local and global squared risks is of the order $n^{-\frac{8\alpha\beta}{4\alpha\beta + 2\alpha + \beta}} \vee n^{-\frac{2\beta}{2\beta+1}}$, where $a\vee b := \max\{a,b\}$ for any two real numbers $a,b$.
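As a side note (my own algebra, not part of the abstract), one can check directly which term of the maximum governs the rate; recall that for polynomial rates the maximum corresponds to the smaller exponent:

```latex
% Cross-multiplying (all quantities positive) and dividing by beta:
\[
\frac{8\alpha\beta}{4\alpha\beta + 2\alpha + \beta} \ge \frac{2\beta}{2\beta + 1}
\iff 16\alpha\beta + 8\alpha \ge 8\alpha\beta + 4\alpha + 2\beta
\iff \alpha \ge \frac{\beta}{4\beta + 2}.
\]
```

So when the mean is sufficiently smooth relative to the variance, $\alpha \ge \beta/(4\beta+2)$, the classical nonparametric rate $n^{-2\beta/(2\beta+1)}$ for a $\beta$-smooth function is attained and not knowing $f$ carries no asymptotic cost; otherwise the first exponent determines the rate.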

A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, November 18th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

Chair’s Remarks
Daniel Pollack informed the faculty that there will be faculty lunches with the chair candidates on Tuesday and Thursday (Nov. 19 & 21). Faculty who would like to sign up should reach out to Kristine Chan.

The need for rigorous and timely health and demographic summaries has led to an explosion in geographic studies, particularly in low- and middle-income countries. While household surveys are a major source of data in this context, they present challenges for statistical modeling. These challenges include biases due to oversampling certain population segments, nonlinear interactions between covariates, and multiple scales of prediction. However, many common statistical methods have never been tested rigorously in these settings.

A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, December 2nd, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

Chair’s Remarks
Daniel Pollack reminded the faculty about the Holiday Party happening on Dec. 11; food and drinks will be provided. Vickie Graybeal has sent an email for faculty, postdocs, and staff to RSVP by Dec. 4 at 12:00pm.

The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, November 4th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

The meeting began with approval of previous meeting’s minutes from October 21, 2019.

A special meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, November 8th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

The meeting began with approval of previous meeting’s minutes from November 4, 2019.

There being no chair remarks, announcements, committee reports, or new business, the meeting passed into executive session at 12:37pm and was adjourned at 2:00pm.

A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, December 9th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

Daniel Pollack announced that the Senior Lecturer position has been posted on the department website.

Committee Reports
The GSRs reported on students’ interactions with, and feedback on, each chair candidate.

The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, October 21st, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

The meeting began with approval of previous meeting’s minutes from October 7, 2019.

Chair’s Remarks
Daniel Pollack announced Vickie Graybeal’s twenty years of service award.

The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, October 7th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

The meeting began with approval of previous meeting’s minutes from September 23, 2019.

Chair’s Remarks
Daniel Pollack went over existing department policies that were up for renewal. Of these, the delegation of authority, merit review process, and retention consultation policies were reviewed, voted on, and approved.

The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, September 23rd, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.

Chair’s Remarks
Daniel Pollack reported there will be no faculty retreat this Autumn quarter. Planning for the retreat will be revisited in the Spring.

He also provided updates and timelines on the two ongoing searches: Full Professor & Chair and Assistant Professor. 

In the social sciences, social networks are important structures which represent the relationships and interactions between actors in a population of study. The most common methods for measuring networks are to survey study participants about who their connections are and to collect interaction activity between pairs of actors. However, directly measuring the exact network of interest can be challenging.

Over the last few decades, shape-constrained methods have gathered increasing importance in statistical inference as attractive alternatives to traditional nonparametric methods, which often require tuning parameters and restrictive smoothness assumptions. This talk focuses on the application of shape constraints like unimodality and log-concavity in comparing the outcomes of two HIV vaccine trials. To this end, we develop shape-constrained tests of stochastic dominance and a shape-constrained plug-in estimator of the Hellinger distance between two densities.

DNA copies inherited from the same ancestral copy by related individuals are said to be identical by descent (IBD). IBD gives rise to genetic similarities between related individuals. In quantitative genetics, two fundamental problems are heritability estimation and gene mapping for genetic traits. IBD plays a critical role in the study of both problems. When working with population-based samples where pedigree information is unavailable, it is essential to estimate IBD accurately from genetic marker data using pedigree-free methods.

Collecting social network data is notoriously difficult, meaning that indirectly observed or missing observations are very common. In this talk, we address two such scenarios: inference on network measures without network observations and inference of regression coefficients when actors in the network have latent block memberships.

Testing mutual independence for high-dimensional observations is a fundamental statistical challenge. Popular tests based on linear and simple rank correlations are known to be incapable of detecting non-linear, non-monotone relationships, calling for methods that can account for such dependences. To address this challenge, we propose a family of tests that are constructed using maxima of pairwise rank correlations that permit consistent assessment of pairwise independence.
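The abstract does not specify the statistic, so as a hedged illustration (not the authors' construction), the sketch below builds a test from the maximum of pairwise Hoeffding's D statistics, a classical rank-based measure that, unlike linear or simple rank correlations, does detect non-monotone dependence, with a permutation null. All function names and parameter choices here are my own assumptions.

```python
import numpy as np

def hoeffding_d(x, y):
    """Hoeffding's D for continuous (tie-free) samples; sensitive to
    non-monotone dependence that linear/simple rank correlations miss."""
    n = len(x)
    r = np.argsort(np.argsort(x)) + 1            # ranks 1..n
    s = np.argsort(np.argsort(y)) + 1
    # c[i] = number of points strictly below point i in both coordinates
    c = np.sum((x[None, :] < x[:, None]) & (y[None, :] < y[:, None]), axis=1)
    d1 = np.sum(c * (c - 1))
    d2 = np.sum((r - 1) * (r - 2) * (s - 1) * (s - 2))
    d3 = np.sum((r - 2) * (s - 2) * c)
    return 30.0 * ((n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3) / (
        n * (n - 1) * (n - 2) * (n - 3) * (n - 4))

def max_pairwise_stat(X):
    """Maximum of Hoeffding's D over all variable pairs."""
    p = X.shape[1]
    return max(hoeffding_d(X[:, i], X[:, j])
               for i in range(p) for j in range(i + 1, p))

def independence_test(X, n_perm=200, seed=0):
    """Permutation test of mutual independence: permuting each column
    independently mimics the null of mutual independence."""
    rng = np.random.default_rng(seed)
    stat = max_pairwise_stat(X)
    exceed = sum(
        max_pairwise_stat(np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])])) >= stat
        for _ in range(n_perm))
    return stat, (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 4))
X[:, 1] = X[:, 0] ** 2 + 0.1 * rng.normal(size=n)  # non-monotone dependence
stat, pval = independence_test(X, n_perm=99)
```

Note the design choice: because the null of mutual independence is preserved by permuting each coordinate separately, the permutation p-value remains valid regardless of which pairwise statistic is plugged in.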

Can we do exact and tractable inference in Mallows-like models for incomplete data? I will show that the answer is yes for the most general form of Mallows-type model and a large class of partial orders known as partial rankings (including special cases like top-t rankings). I will also demonstrate that, despite partial rankings lacking a sufficient statistic, exact inference is possible with overhead that is at most polynomial in nN and that, in practice, the overhead per data point is negligible.

Traditional infectious disease epidemiology focuses on fitting deterministic and stochastic epidemic models to surveillance case count data. Recently, researchers have begun to use infectious disease agent genetic data to complement statistical analyses of case count data. Such genetic analyses rely on the field of phylodynamics, a set of population genetics tools that aim at reconstructing the demographic history of a population based on molecular sequences of individuals sampled from the population of interest.

The adaptive immune system synthesizes antibodies, the soluble form of B cell receptors (BCRs), to bind to and neutralize pathogens that enter our body. B cells are able to generate a diverse set of high-affinity antibodies through the affinity maturation process. During maturation, "naive" BCR sequences first accumulate mutations according to a neutral evolutionary process called somatic hypermutation (SHM), which may modify the associated binding affinities, and then are subject to natural selection by clonal expansion, which promotes the higher-affinity antibodies.

We present a method for analyzing low-energy paths between molecular conformations by combining techniques in both manifold learning, which identifies such paths, and functional regression, which can parameterize them by explanatory non-linear functions. Unsupervised manifold learning approaches are useful for understanding molecular dynamics simulations since they disregard small-scale information such as peripheral hydrogen vibrations that can nevertheless drastically affect the observed energy.

In recent years, new technologies in neuroscience have made it possible to measure the activities of large numbers of neurons in behaving animals. For each neuron, a fluorescence trace is measured; this can be seen as a first-order approximation of the neuron's activity over time. Determining the exact time at which a neuron spikes on the basis of its fluorescence trace is an important open problem in the field of computational neuroscience. Recently, a convex optimization problem involving an L1 penalty was proposed for this task.
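The abstract does not give the optimization problem. One common formulation in this literature (and an assumption on my part, not necessarily the exact proposal mentioned in the talk) models the fluorescence as an AR(1) calcium trace driven by nonnegative spikes and penalizes the spike vector with an L1 term. A minimal proximal-gradient (ISTA) sketch:

```python
import numpy as np

def deconvolve_spikes(y, gamma=0.9, lam=0.2, n_iter=2000):
    """L1-penalized spike deconvolution via proximal gradient (ISTA).

    Assumed model (illustrative): calcium c = L @ s with
    L[t, k] = gamma**(t - k) for t >= k, observed y = c + noise.
    We minimize 0.5*||y - L s||^2 + lam*||s||_1 subject to s >= 0."""
    n = len(y)
    t = np.arange(n)
    L = np.where(t[:, None] >= t[None, :],
                 gamma ** np.maximum(t[:, None] - t[None, :], 0), 0.0)
    step = 1.0 / np.linalg.norm(L, 2) ** 2   # 1 / Lipschitz constant of grad
    s = np.zeros(n)
    for _ in range(n_iter):
        grad = L.T @ (L @ s - y)
        # Nonnegative soft-threshold: prox of lam*||.||_1 + indicator{s >= 0}
        s = np.maximum(s - step * grad - step * lam, 0.0)
    return s

# Synthetic trace: two spikes, AR(1) decay, light noise
rng = np.random.default_rng(0)
n = 60
s_true = np.zeros(n)
s_true[10], s_true[35] = 1.0, 1.5
t = np.arange(n)
L = np.where(t[:, None] >= t[None, :],
             0.9 ** np.maximum(t[:, None] - t[None, :], 0), 0.0)
y = L @ s_true + 0.05 * rng.normal(size=n)
s_hat = deconvolve_spikes(y)
```

The L1 penalty trades a small downward bias in spike amplitude for sparsity: entries of `s_hat` away from true spike times are driven exactly to zero.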

Child mortality, and in particular under-five mortality (U5MR), is an important indicator of the overall health of a population. Subnational estimation of U5MR is a relatively new endeavor.

High-dimensional data sets often have lower-dimensional structure taking the form of a submanifold of a Euclidean space. It is challenging but necessary to develop statistical methods for these data sets that respect the manifold structure. We present research from two different areas: manifold learning (i.e., support estimation) and smooth regression on manifolds.

The amount of sea ice (frozen ocean water) found in the Arctic is declining rapidly as a result of climate change. This has increased the need for accurate forecasts of where sea ice will be located.  Of particular interest is predicting the sea ice edge contour, or the boundary of the region where at least 15% of the area is ice-covered. Current sea ice forecasts are issued from deterministic numerical prediction systems.

In this dissertation, we study general strategies for constructing nonparametric monotone function estimators in two broad statistical settings. In the first setting, a sensible initial estimator of the monotone function of interest is available, but may fail to be monotone. We study the correction of such an estimator obtained via projection onto the space of functions monotone over a finite grid in the domain.

Scientific studies in many fields involve understanding and characterizing dependence relationships among large numbers of variables. This can be challenging in settings where data is limited and noisy. Take survey data as an example: understanding the associations between questions may help researchers better explain themes amongst related questions and impute missing values. Yet such data typically contain a combination of binary, continuous, and categorical variables; a high proportion of missing values; and complex data structures.

Estimating population size fluctuations is one of the key tasks in ecology. However, traditional sampling-based approaches to this task have limitations when populations of interest are extinct or hard to reach, as is the case for individuals infected for a short time period by a pathogen.

Collecting social network data is notoriously difficult, meaning that indirectly observed or missing observations are very common. In this talk, we address two such scenarios: inference on network measures without any direct network observations and inference of regression coefficients when important features are missing.

In this talk we define a new class of multivariate nonparametric measures of dependence that we refer to as symmetric rank covariances. This new class generalizes many existing classical rank measures of dependence, such as Kendall's tau and Hoeffding's D, as well as the more recently discovered Bergsma--Dassios sign covariance. Symmetric rank covariances make explicit the implicit symmetries hidden in the standard definitions of the above measures and, in doing so, lead naturally to multivariate extensions of the Bergsma--Dassios sign covariance.

In the social sciences, social networks are important structures which represent the relationships and interactions between actors in a population of study. In these fields, the most common method for measuring networks is to directly survey study participants about who their connections are. However, directly measuring the network of interest can be challenging. Participants do not always provide accurate accounts of their connections, which can result in mismeasurement of the network.

In this talk, we consider causal discovery when the underlying structure corresponds to a linear structural equation model with non-Gaussian error terms. Previous work by Shimizu et al. (2006) has shown that under this framework, a unique directed acyclic graph, not simply an equivalence class, can be identified from infinite data. We extend that result in two directions. First, we show that a unique graph can still be consistently recovered in the high-dimensional setting where p, the number of variables, exceeds n, the number of observed samples.


  1. Faculty Search discussion

We develop a scalable method to estimate the parameters in models of very large binary network datasets. Maximum likelihood estimates are generally impossible to obtain because the full likelihood involves an intractable high dimensional integral. Also, full-likelihood Bayesian estimation is impractical for very large datasets as the MCMC algorithm is very slow.

Time: 12.30-1.30pm March 7, 2016 
Place: Padelford Hall, C-301 

  1. 12:30 - Pedagogy Meeting

Time: 12.30-1.30pm January 9, 2017 
Place: Padelford Hall, C-301 

  1. Welcome back & Updates (Thomas R.)
  2. Mentoring & Diversity (Jessica G.)
  3. Consulting / Paul Sampson Replacement (Thomas R.)


Time: 12.30-1.30pm February 13, 2017 
Place: Padelford Hall, C-301 

  1. Updates (Thomas R.)
  2. 3-year Affiliate/Adjunct Renewals (Thomas R.)
  3. Affiliate/Adjunct Re-Appointments (Not up for periodic 3-year review associated with renewal) (Thomas R.)
  4. Case for Promotion to Affiliate Associate Professor (Thomas R.)
  5. Paul Sampson (Thomas R.)

Time: 12.30-1.30pm February 27, 2017 
Place: Padelford Hall, C-301 

  1. Renew Policies (Thomas R.)
  2. Biostatistics Search (Thomas R.)
  3. Affiliate Appointment for Jon Azose (Thomas R., Adrian R.)
  4. Loyce Adams (Emeritus Professor for AMATH) for Senator (Thomas R.)
  5. Annual Student Review (Michael P.)

Time: 12.30-1.30pm March 6, 2017 
Place: Padelford Hall, C-301 

  1. Annual Student Review (Michael P.)

Time: 12.30-1.30pm April 3, 2017 
Place: Padelford Hall, C-301 

  1. Upcoming talk by Nature Editor, 4/5, Physics/Astronomy Auditorium A118
  2. Computing Staff Updates (Thomas/Kris)
  3. Web-site overhaul (Thomas/Kris)
  4. Discuss and vote on Affiliate appointment for Sam Clark
  5. Update on Faculty Search for Full-Time Lecturer in Consulting (Elena)
  6. Search request for next year (Thomas)
  7. New learning spaces / scheduling policy (commencing Spring 2018):

Time: 12.30-1.30pm April 10, 2017 
Place: Padelford Hall, C-301 

  1. Meeting for Full + Assoc. Professors Only

Time: 12.30-1.30pm April 24, 2017 
Place: Padelford Hall, C-301 

  1. MS Student Review

Time: 12.30-1.30pm May 1, 2017 
Place: Padelford Hall, C-301 

  1. FTL Consulting Search
  2. Personnel Matter (Full Professors only)

Time: 12.30-1.30pm May 8, 2017 
Place: Padelford Hall, C-301 

  1. TCC Meeting

Time: 12.30-1.30pm May 22, 2017 
Place: Padelford Hall, C-301 

  1. FTL Consulting Search

Time: 12.30-1.30pm June 5, 2017 
Place: Padelford Hall, C-301 

  1. Update and discussion on searches.
  2. PhD Admission Policy; TOEFL Scores; TA Requirement for PhDs.
  3. College Absence Policy; also Effective Personnel Vote rule.
  4. 10 Year Department Review.

Time: 12:30-1:30pm October 2, 2017 
Place: Padelford Hall, C-301 

  1. Updates
  2. Discussion of upcoming Search
  3. Adjunct Appointment (Amy Willis)
  4. Research Prelim and 572

In a simulation study, data are generated under a variety of conditions with respect to underlying ability distribution, test length, and sample size. Item parameter estimates are obtained under two conditions: in one, the assumed ability distribution matches the underlying ability distribution; in the other, it does not. The item parameter estimates from the matching condition are compared to those from the nonmatching condition to determine the effect on the recovery of parameter estimates and item rankings.

Advisor: Jon Wellner

In this talk, we discuss exponential tail inequalities for sums in the context of sampling without replacement. Using an exponential inequality due to Serfling as the basis for investigation, we consider the special case of sampling from a finite population containing only 0s and 1s. This leads to considering exponential bounds for the hypergeometric distribution.

Advisor: Tilmann Gneiting

Accurate weather forecasts benefit society in crucial functions, including agriculture, transportation, recreation, and basic human and infrastructural safety. Over the past two decades, ensembles of numerical weather prediction models have been developed, in which multiple estimates of the current state of the atmosphere are used to generate probabilistic forecasts for future weather events. However, ensemble systems are uncalibrated and biased, and thus need to be statistically postprocessed. Bayesian model averaging (BMA) is a preferred way of doing this.

I will talk briefly about how I got involved in research in model-based clustering in my final year of undergrad (and subsequently here) and give a brief outline of the research I did then. The main part of the talk will be about different extensions to the model-based clustering methodology that I'm working on. I'll mainly be focusing on research on variable selection with model-based clustering, but I'll also talk, if I have time, about ideas I'll be working on for the next year.

Advisor: Vladimir Minin

Markov branching processes are a class of continuous-time Markov chains (CTMCs) with ubiquitous modeling applications. Multi-type processes are necessary to model phenomena such as competition, predation, or infection, but often feature large or uncountable state spaces, rendering general CTMC techniques impractical. We present new methodology motivated by processes arising in molecular epidemiology, cellular differentiation, and infectious disease dynamics.

Advisors: Peter Guttorp & Don Percival

It is well known that many penalized regression problems can be interpreted as estimating unknown regression coefficients having assumed a specific statistical model. This includes the lasso when tuning parameters are estimated from the marginal likelihood of the data, the Bayesian lasso, Gaussian random effects models, ridge regression, etc. In the first part, we consider estimating a mean matrix from a single noisy realization. We assume possibly sparse elementwise effects and use a lasso penalty.

Medical professionals and researchers use a variety of imaging techniques in their clinical practice and scientific investigations. In this talk I will focus on mammography, which is used for breast examinations and routine breast cancer screening. While mammographic images have proved to be a useful non-invasive tool for clinical monitoring, the images often lack detail and clarity. For example, in addition to having limited spatial resolution, the skin-air boundary of the imaged breast is often obscured. Locating this boundary is, however, an important initial step in breast density estimation.

We propose a method for estimating the number of groups in a data set. Our method is an extension of Generalized Single Linkage clustering (GSL) (Stuetzle and Nugent 2010), a nonparametric clustering method based on the premise that groups in the data correspond to modes of the underlying data density. GSL starts with a nonparametric density estimate. It recursively splits the data into high density regions separated by valleys. The leaves of the resulting cluster tree correspond to modes of the density estimate.

Advisor: Vladimir Minin

The field of phylodynamics seeks to estimate effective population size fluctuations from molecular sequences of individuals sampled from the population of interest. One way to accomplish this task is to formulate an observed sequence data likelihood by using a coalescent model for the sampled individuals’ genealogy and then integrating over all possible genealogies via Monte Carlo or, less efficiently, by conditioning on one genealogy estimated from sequence data. These strategies also work when molecular sequences are sampled serially through time.

Advisor: R. Douglas Martin

The literature on the use of robust estimates, skewed-distribution MLEs, and non-normal-distribution hierarchical Bayes models for multi-factor models in finance is surprisingly thin, and limited for the most part to single-factor models (SFMs). The ultimate goal of our research is to study the relative merits of robust versus non-normal MLE estimation of multi-factor models, and the use of hierarchical Bayes modeling of multi-factor models with skewed, fat-tailed distributions.

I will describe a new model of image data that we call the "epitome". The epitome of an image is its miniature, condensed version containing the essence of the textural and shape properties of the image. As opposed to previously used simple image models, such as templates or basis functions, the size of the epitome is considerably smaller than the size of the image or object it represents, but the epitome still contains most constitutive elements needed to reconstruct the image.

Advisors: Adrian Raftery and Ka Yee Yeung (UW Tacoma)

We propose a method for combining probability forecasts from different sources. The commonly used method of linearly combining probability forecasts has limitations, in that a weighted combination of distinct calibrated forecasts is necessarily uncalibrated. In view of this, we propose a recalibration method. We illustrate our findings with simulation examples and a case study on operational probability of precipitation forecasts.

Over the last few years, the speaker (and collaborators Leanne Bischof and Jon Huntington) have been developing fast and sophisticated algorithms and software for identifying pure minerals and mixtures of minerals from shortwave infrared spectra. The software, called The Spectral Assistant (TSA), has been designed to be used with a particular FIELD-PORTABLE spectrometer, the PIMA-II, which is about the size of a shoe box and can be used by geologists collecting samples in the field.

Advisor: Marina Meila

Social interaction data are generated from the interaction or relationship between two or more actors; thus the observational units are pairs, trios, etc. of actors. This type of data is common in all fields of social science (e.g., political science, sociology, anthropology, and economics), because the interaction of actors is a key element in social science theory.

Data that can be represented in the form of an array arise in many of the social and biological sciences. In this talk we address two statistical problems concerning such data. The first problem is modeling the heterogeneity along the dimensions of an array. Previously developed models are either non-stochastic and difficult to interpret, or require a large number of parameters, prohibiting likelihood-based inference for some arrays.

Microarrays are part of a new class of biotechnologies that can be used to measure expression levels (DNA or RNA abundance) for thousands of genes at a time. This new technology is being applied increasingly in biological and medical research to address a wide range of problems, such as the classification of tumors or the study of host responses to bacterial infections. DNA microarray experiments raise numerous statistical questions in fields as diverse as image analysis, experimental design, hypothesis testing, cluster analysis, etc.

Advisor: Adrian Raftery

The future of international migration is a topic of great social and political importance, and yet international migration is hard to even estimate, let alone predict. The unreliability of point projections of migration indicates a need for better quantification of uncertainty in migration projections. We accomplish this quantification of uncertainty with a Bayesian hierarchical autoregressive model on net migration rates. In an initial model, we assume error terms are independent across countries.

Correlated failure time data arise often in many application areas. For example, in genetic epidemiology studies, the disease occurrence times of pairs of family members are often correlated, and the degree of correlation may provide important leads with respect to disease etiology. Univariate failure time methods are well established, including the Kaplan-Meier method, censored-data rank tests, and Cox regression. However, standard tools for multivariate failure time data analysis are not yet available.

Advisors: Ross Prentice and Ching-Yun Wang

In large collections of multivariate time series it is of interest to determine interactions between each pair of time series. We study methods for inferring time series interactions in three domains: 1) conditional independencies between time series, 2) Granger and instantaneous causality estimation in subsampled and mixed frequency time series, and 3) Granger causality estimation in multivariate categorical data. First, we explore a Bayesian framework for inferring graphical models of time series.

A common problem with financial historical data is that the series often have unequal lengths of histories. Examples include country market indices, currency rates, and hedge fund return histories. Practitioners often deal with such issues by truncating all the series so that the remaining data have the same length, which is clearly not an ideal solution. We discuss existing statistical methods that utilize the full data set, such as maximum likelihood estimation and multiple imputation.

Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop

We develop simple methods for constructing parameter priors for model choice among Directed Acyclic Graphical (DAG) models. In particular, we introduce several assumptions that permit the construction of parameter priors for a large number of DAG models from a small set of assessments. We then present a method for directly computing the marginal likelihood of every DAG model given a random sample with no missing observations.

Co-chairs: Vladimir Minin & Erick Matsen

Functional data often exhibit a common shape but also variations in amplitude and phase across curves. The analysis often proceeds by synchronizing the data through curve registration. We propose a Bayesian hierarchical model for curve registration. Our model provides a formal account of amplitude and phase variability while borrowing strength from the data across curves in the estimation of the model parameters.

In recent years spatial-temporal modeling has become increasingly popular in the field of public health and epidemiology. Motivated by two datasets, we address three issues in the Bayesian modeling of health data in space and time.

Spectral segmentation is a technique used to group data based on pairwise similarities. A similarity matrix is used as input into a spectral clustering algorithm and a clustering over the data is output. The clustering criterion is such that similar points are put in the same cluster and dissimilar points are put in different clusters. Generally, this similarity matrix is assumed known, while in reality it is usually constructed by hand, a very time-consuming process.

Advisor: Professor Gary Chan


Spatial point processes are often modeled as Markov fields, and inference for such models is sometimes inefficient or computationally intensive due to difficulties in evaluating the normalizing constant. Simulation of such processes is also hard. We exploit the partial order in the plane and introduce a class of Markov point processes known as "Directed Markov Point Processes" and investigate their properties. This Markov structure enables us to study some of the well-known spatial processes in detail.

We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Carlo method for sampling from the posterior distribution of the parameters of interest. The proposed method uses hitherto unknown properties of the gradient of the underlying log-empirical likelihood function. It is seen that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood.

I will discuss the most important results obtained on nonparametric estimation for two multivariate families of densities that exhibit monotonicity constraints and can otherwise be characterized as certain mixture models. The discussion will emphasize characterizations of the estimators and their strong consistency, and will then turn to rates of convergence of these estimators, in both the global and the local sense.

Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop

We provide a classification of graphical models according to their representation as subfamilies of exponential families.

Advisor: Mathias Drton & Ali Shojaie

Modern statistical problems are increasingly high-dimensional, with the number of covariates p potentially vastly exceeding the sample size N. Fortunately, significant progress has been made in developing rigorous statistical tools for tackling such problems, but these methods have primarily targeted prediction, point estimation, and/or variable selection.

The goal of clustering is to identify distinct groups in a data set and assign a group label to each observation. To cast clustering as a statistical problem, we regard the data as an iid sample from some unknown probability density p. We adopt the premise that groups correspond to modes of the density. Our goal then is to find the modes and assign each observation to the "domain of attraction" of a mode. We do this by estimating the cluster tree of the density, a representation of the hierarchical structure of its level sets.

Despite seasonal cholera outbreaks in Bangladesh, little is known
about the relationship between environmental conditions and cholera
cases. We seek to develop a predictive model for cholera outbreaks
in Bangladesh based on environmental predictors. To do this, we must
estimate the environmental parameters in the context of a disease
transmission model. We develop a method to simultaneously estimate
the transmission parameters and the environmental parameters in a
Susceptible-Infectious-Recovered-Susceptible (SIRS) model.
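The SIRS dynamics referred to above can be sketched with a simple Euler discretization of the deterministic model; the rate values below are illustrative placeholders, and the environmental forcing that is the focus of the work is omitted.

```python
def sirs_step(S, I, R, beta, gamma, rho, dt=0.01):
    """One Euler step of the deterministic SIRS model.
    beta: transmission rate; gamma: recovery rate; rho: rate of immunity loss.
    (Illustrative only: the environmental drivers central to the talk are omitted.)
    """
    N = S + I + R
    dS = -beta * S * I / N + rho * R    # susceptibles: infection out, waning in
    dI = beta * S * I / N - gamma * I   # infectious: infection in, recovery out
    dR = gamma * I - rho * R            # recovered: recovery in, waning out
    return S + dt * dS, I + dt * dI, R + dt * dR

S, I, R = 990.0, 10.0, 0.0    # placeholder initial conditions
for _ in range(100_000):      # integrate to t = 1000
    S, I, R = sirs_step(S, I, R, beta=0.5, gamma=0.2, rho=0.05)
```

Because the three derivatives sum to zero, the total population is conserved along the trajectory, a useful sanity check on any implementation.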

The amount of data we generate as a global civilization is growing exponentially. More importantly, storing, accessing and analyzing data is getting cheaper and faster. Organizations all over the world have realized that data is a prized commodity, and many in industry are scrambling to extract value from their complex data sets. For this endeavor, they need individuals with the right skills and experience, and the quantitative disciplines in academia are a great source of such individuals. In this talk, I will briefly describe my journey from a Ph.D.

Haplotypes are specific combinations of alleles on the same chromosome, and various methods exist for the analysis of haplotype data from unrelated individuals. However, humans are diploid, and studies of genetic variation might consist of unphased genotype data, where an unordered pair of alleles is observed at each locus. There is a growing need for less computationally intensive models that may be directly applied to unphased genotype data from thousands of individuals at thousands of loci. In this talk, we present such a model for genetic variation.

Advisor: Vladimir Minin

Branching processes are a class of continuous-time Markov chains (CTMCs) frequently used in stochastic modeling with ubiquitous applications. One-dimensional cases such as birth-death processes are well studied, but it is often necessary to model systems with more than one species --- bivariate or other multi-type processes are commonly used to model phenomena such as competition, predation, or infection.

The general Fund-of-Funds (GFoF) class of investment organizations includes fund-of-hedge funds (FoHF), family offices, endowments, pension plans and asset management companies. GFoF portfolios are characterized by two important types of returns problems among others. The first is that the returns histories of the portfolio assets are unequal, sometimes quite short and often contain multiple frequencies, resulting in structured missing data problems. The second is that the returns have fat-tailed and skewed distributions to varying degrees.

Advisor: Brian Leroux

We present an affine-invariant random walk for drawing uniform random samples from a convex body for which the maximum-volume inscribed ellipsoid, known as John's ellipsoid, may be computed; a polytope is an important special case. Our algorithm makes steps by sampling uniformly from the John's ellipsoid of the symmetrization of the body at the current point. We show that, from a warm start, the random walk mixes in polynomially many steps, and thus offers an improvement over the affine-invariant walk known as the Dikin walk (which also mixes in polynomially many steps from a warm start) in certain regimes of the problem dimensions.

Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop

Advisor: Elizabeth Thompson

We consider the problem of testing and estimating separable covariances for relational data sets. We propose to model these data as matrix normal distributions with separate row and column covariance matrices. The existing literature on testing and estimation in the context of a matrix normal distribution requires multiple observations of the matrix, which rarely occurs for relational data sets.

A natural approach to survival analysis in many settings is to model the subject’s “health” status as a latent stochastic process, where the terminal event is represented by the first time that the process crosses a threshold. “Threshold regression” models the covariate effects on the latent process. Much of the literature on threshold regression assumes that the process is one-dimensional Wiener, where crossing times have a tractable inverse Gaussian distribution but where the process characteristics are fixed at baseline.

Advisor: Jon Wakefield

Public health data are frequently obtained from surveys, which often have complex sampling designs. It is crucial that analyses account for the design in order to give appropriate inference. We describe two scenarios, both of which have important spatial components. The first example is motivated by Behavioral Risk Factor Surveillance System (BRFSS) data. Empirical Bayes and Bayesian hierarchical models for small area estimation have been used extensively for surveys like BRFSS.

One of the fundamental goals of nonparametric cluster analysis is to estimate the cluster tree of a density. I will define and illustrate the cluster tree and describe a graph-based procedure for its estimation. The cluster tree will usually have spurious leaves due to variability in the density estimate. I will introduce a bootstrap-based method for eliminating spurious leaves and “clustering with confidence”.

Advisor: Peter Hoff

Many applications involve estimation of a signal matrix from a noisy data matrix. In such cases, it has been observed that estimators that shrink or truncate the singular values of the data matrix perform well when the signal matrix has approximately low rank. In this talk, we generalize this approach to the estimation of a tensor of parameters from noisy tensor data. We develop new classes of estimators that shrink or threshold the mode-specific singular values from the higher-order singular value decomposition.
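For intuition, here is a minimal sketch of the matrix special case: soft-thresholding the singular values of a noisy data matrix (the tensor estimators in the talk apply this idea mode by mode to the higher-order SVD). The rank-one signal, noise level, and threshold below are illustrative choices.

```python
import numpy as np

def soft_threshold_svd(Y, lam):
    """Estimate a low-rank signal by soft-thresholding the singular values
    of the noisy data matrix Y (matrix special case of the talk's approach)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
signal = np.outer(rng.normal(size=20), rng.normal(size=15))  # rank-one signal
Y = signal + 0.1 * rng.normal(size=signal.shape)             # noisy observation
est = soft_threshold_svd(Y, lam=1.0)                         # lam chosen ad hoc
```

When the noise singular values fall below the threshold, the shrinkage estimate is closer to the signal than the raw data matrix is.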

Probabilistic weather forecasting is becoming an increasingly important and active area of research. Most current statistical post-processing techniques account for forecast bias and predictive variance without regard to forecast location. We will discuss a technique that adjusts bias and predictive variance locally, called geostatistical model averaging (GMA). In particular, GMA allows the parameters of the predictive distribution to vary over the model grid.

High-dimensional datasets often have lower-dimensional structure, which frequently takes the form of a manifold. There are many algorithms (e.g., Isomap) that are used in practice to fit manifolds and thus reduce the dimensionality of a given dataset. In our work, we consider the problem of recovering a d-dimensional submanifold M of R^n when provided with noiseless samples from M. Ideally, the estimate M_hat of M should be an actual manifold. Generally speaking, existing manifold learning algorithms do not meet these criteria.

Bayesian model averaging has been shown to be a useful method for developing probabilistic weather forecasts for quantities (such as temperature) that can be represented by univariate normal distributions. This talk will discuss how these methods can be extended to other distributions, using wind forecasting as an example.

Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop

Graphical Markov models represent statistical dependencies by combining two simple yet powerful mathematical concepts: graphs and conditional independence. A graphical Markov model is constructed by specifying local dependencies for each node of the graph in terms of its immediate neighbors, yet can represent a highly varied and complex system of multivariate dependencies by means of the global structure of the graph.

Advisor: Tyler McCormick

We explore and exploit the use of differential operators on manifolds, the Laplace-Beltrami operator in particular, in learning tasks. We are interested in uncovering the geometric structure of data (unsupervised learning) and in exploiting information contained in unlabeled data for regression and classification tasks (semi-supervised learning). First, building on the Laplacian Eigenmap and Diffusion Maps framework, we propose a new paradigm that offers a guarantee, under reasonable assumptions, that any manifold learning algorithm will preserve the geometry of a data set.

We will address three aspects of statistical methodology for Exponential family Random Graph Models (ERGMs) in the context of applications to social network analysis. We start by addressing the topic of degeneracy in ERGMs. This is a problem often misunderstood to characterize the entire family of ERGMs, but is properly understood as a more limited issue of model misspecification.

Advisor: Jon Wakefield

We consider the problem of detecting clusters of non-infectious and rare diseases. Cluster detection is the routine surveillance over a large expanse of small administrative regions to identify individual 'hot-spots' of elevated residual spatial risk without any preconceptions about their locations. A class of cluster detection procedures known as moving-window methods superimpose a large number of circular regions onto the study area.

In recent decades, there has been much progress and interest in spatial statistics, with applications in agriculture, epidemiology, geology and other areas of environmental science and in image analysis. Two contrasting approaches have emerged, one based on Markov random fields, the other on geostatistics. The development of Markov Chain Monte Carlo as a computational tool has been phenomenal and has made Bayesian inference for spatial models relatively easy to perform, whereas frequentist inference still presents difficult problems.

Advisor: Thomas Richardson

I propose a method for estimating migration flows between all pairs of countries, including breakdowns by place of birth. My estimator is a pseudo-Bayes estimator which smooths a set of state-of-the-art estimates of migration flows towards a simpler estimate which contains fewer structural zeroes. The smoothing process provides a natural way to bypass the state-of-the-art estimator's unrealistic assumption that the number of global migrants is as small as possible.

The United Nations Population Division produces estimates and projections of the total fertility rate for all countries in the world every two years. For countries with fertility above replacement level, future levels are projected by choosing one out of three scenarios describing the pace of future fertility decline.

I will discuss a Bayesian hierarchical model for producing country-specific projections of the total fertility rate, and assessing the uncertainty in these predictions. Results for various countries will be presented.

Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop

Meiotic recombination is a biological process that shuffles our genetic material before we pass it along to our offspring. There are two known outcomes of recombination: crossover and gene conversion. Recently, fine-scale human crossover rates have been inferred with some success using statistical methodology applied to population data (i.e. genetic data on random samples of individuals from a population). However, reliable estimation of gene conversion rates has proven more difficult to come by.

Segments of genome inherited from a common ancestor by multiple individuals are said to be identical by descent (IBD). Dense genotyping platforms permit the detection of IBD segments less than 5 centiMorgans long, which arise due to coancestry on the order of dozens of generations ago. Generalizations of classical pedigree-based linkage methods use this inferred IBD and can be applied in situations where pedigree data is incomplete. We present a method for inferring IBD in groups of individuals without pedigrees.

The methodology of Markov bases initiated by Diaconis and Sturmfels (1998) stimulated active research on Markov bases for more than a decade. It also motivated improvements of algorithms for Gröbner basis computation for toric ideals, such as those implemented in 4ti2.

Advisor: Peter Guttorp

Local competition between trees affects growth and mortality, from which spatial patterns of surviving trees emerge. Often, the patterns resulting from this unspecified process are treated as instances of spatial patterns and analyzed with point process methods. Alternatively, forest simulation models assume mechanistic processes and parameters to examine the effects of these assumptions on tree patterns over time, and assess sensitivity to changing conditions, such as climate.

Advisor: Peter Hoff

Maximum likelihood estimation is a popular method of statistical inference, in part due to its efficiency. Unfortunately, much of the efficiency is lost when the model has been misspecified. To account for possible model misspecification, the sandwich estimate of variance can be used with MLE inference to generate asymptotically correct confidence intervals, but these intervals typically perform poorly at small sample sizes. In this talk, we present a pivot-based method that performs better than the sandwich and its adjustments at small sample sizes.
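For reference, a minimal sketch of the sandwich (heteroscedasticity-consistent) covariance estimate that the pivot-based method improves upon, for an OLS fit; the data-generating model below is a made-up example in which the constant-variance working model is misspecified, and the pivot method itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])                 # design: intercept + slope
y = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=n)   # heteroscedastic errors

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS fit
resid = y - X @ beta_hat
bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * (resid ** 2)[:, None])             # sum_i r_i^2 x_i x_i'
sandwich = bread @ meat @ bread                      # robust covariance of beta_hat
```

The resulting 2x2 matrix replaces the usual sigma^2 (X'X)^{-1} when the error variance is not constant.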

Analysis of the patterns of entities and their attributes in space is a common and useful endeavor in ecology. Often, the end of a statistical analysis is a general characterization of the observed pattern or series of patterns. However, a good description of the outcome may be somewhat dissatisfying to the practicing scientist or resources manager in that the mechanisms and processes that led to the outcomes remain unknown.

The localized haplotype-cluster model uses variable-order Markov chains to create an empirical model for haplotype probabilities that adapts to the changing structure of linkage disequilibrium (LD) across the genome. By clustering haplotypes based on the Markov property, the model is able to take advantage of conditional independencies to improve estimates of haplotype frequencies while still respecting the dependencies induced by LD.

Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop

At the present time there is no well-accepted test for comparing least squares and robust linear regression coefficient estimates. To fill this gap we propose two Wald-like statistical tests for this purpose and demonstrate their efficacy, using the class of MM-estimators for the robust regressions.

Epidermal nerve fiber (ENF) density and morphology are used to diagnose small fiber involvement in diabetic and other small fiber neuropathies. ENF density and summed length of ENFs per epidermal surface area are reduced in diabetic subjects. Furthermore, based on mainly visual inspection, it has been reported that ENFs of subjects with diabetic neuropathy seem to appear more clustered than ENFs of healthy subjects. Therefore, it is important to understand the spatial structure of ENFs in healthy and diseased subjects.

Advisors: Adrian Dobra and Jon Wakefield

Understanding the relationships between disease incidence and risk factors such as demographic characteristics, lifestyle factors, and environmental contaminants is a central goal in public health and epidemiology. Often outcomes and risk factors are measured at specific locations or at particular times. We present flexible Bayesian models for spatial and temporal data to address important public health questions in two examples. In the first example, we consider low birthweight and preterm birth along with three risk factors in North Carolina.

Advisor: Peter Guttorp

Advisor: Elizabeth Thompson

A major topic in statistical genetics is discovering the locations of genes contributing to complex traits through linkage analysis. The likelihood of a genetic marker controlling the expression of the trait is calculated using estimated identity-by-descent (IBD) graphs, which indicate whether copies of the marker shared among individuals are inherited from a common ancestor. Methods for estimating IBD graphs use either pedigree or population relationships between the individuals, and do not scale to a large number of individuals.

Advisor: R. Douglas Martin

In a similarity-based clustering task, one defines a "similarity function" between pairs of points and then formulates a criterion (e.g. maximum intracluster similarity) that the clustering must optimize. The optimality criterion quantifies the intuitive notion that points in the same clusters should be similar while points in different clusters should be dissimilar. Most sensible criteria are NP-hard to optimize. An alternative view that has been successful in recent years is represented by spectral methods, where clustering is based on the first few eigenvectors of a matrix.

Advisor: Professor Elizabeth Thompson

Inference of identity by descent for linkage analysis

Identity by descent (IBD) describes the pattern of shared inheritance of DNA among individuals. Two or more copies of DNA are identical by descent if they are inherited from the same common ancestor. IBD underlies the genetic similarity between individuals and thus similarity in observed genetic traits. In a family study of a genetic disease, estimated IBD among individuals in the family is used to identify potential locations of the gene that causes the disease.

A problem encountered across many fields in science, engineering and medicine is finding a specific percentile of a binary-response threshold distribution (for example, finding the ED50 of a medication). Statisticians have designed two popular sequential solutions to this challenge: 'Up-and-Down' (U&D), a 1940s-vintage method, and Bayesian designs, most prominently the 'Continual Reassessment Method' (CRM; O'Quigley et al., 1990), a design tailored to Phase I clinical trials. U&D generates a random walk revolving around the target percentile.
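The U&D random walk is easy to simulate. Below is a minimal sketch; the logistic threshold distribution, the true ED50, the step size, and the burn-in are all illustrative assumptions, not part of the talk.

```python
import math
import random

def up_and_down(true_ed50, n_trials, start=0.0, step=1.0, seed=42):
    """Classical Up-and-Down: decrease the dose after a response, increase it
    after a non-response. The resulting random walk hovers around the median
    threshold (ED50). Logistic thresholds and all settings here are
    illustrative assumptions."""
    rng = random.Random(seed)
    dose, doses = start, []
    for _ in range(n_trials):
        doses.append(dose)
        p = 1.0 / (1.0 + math.exp(-(dose - true_ed50)))  # P(response at dose)
        dose = dose - step if rng.random() < p else dose + step
    return doses

doses = up_and_down(true_ed50=3.0, n_trials=2000)
est = sum(doses[100:]) / len(doses[100:])  # crude estimate: mean after burn-in
```

Averaging the visited doses after a burn-in gives a simple (if crude) estimate of the median threshold.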

Advisor: Michael Perlman

We study the nonparametric maximum likelihood estimator (MLE) for current status data with competing risks. These data arise naturally in cross-sectional survival studies with several failure causes, and generalizations arise in HIV vaccine clinical trials. Until now, the asymptotic properties of the MLE have been largely unknown. We resolve this issue by proving consistency, the rate of convergence, and the limiting distribution of the MLE.

Current methods for reconstructing human populations of the past by age and sex are deterministic or do not formally account for measurement error. I propose "Bayesian reconstruction", a method for simultaneously estimating age-specific population counts, fertility rates, mortality rates and net international migration flows from fragmentary data, that incorporates measurement error. Expert opinion is incorporated formally through informative priors. Inference is based on joint posterior probability distributions which yield fully probabilistic interval estimates.

Advisor: Peter Hoff

Advisor: Thomas Richardson

Graphical models provide an intuitive way of representing conditional independence relations over multivariate distributions. We work with a very general class of graphs we dub Mixed Euphonious Graphs (MEGs), which include DAGs, undirected graphs and ancestral graphs as special cases. Markov properties and parametrizations of discrete distributions obeying the global Markov property for MEGs were found by Richardson (2003, 2009). We discuss this parametrization, and a maximum likelihood fitting algorithm which uses it.

Advisor: Mathias Drton

It is often reasonable, based on earlier empirical evidence or theoretical understanding of the applied context under consideration, to assume that the regression surface corresponding to a response variable, as a function of the model covariates, is either monotonically increasing or monotonically decreasing, while otherwise leaving the form of the function unspecified. In this talk we consider the practical implications of such a postulate when applying variable-dimensional Bayesian modeling, MCMC, and model averaging.

Acyclic digraphs are used to represent the underlying relationships of some Bayesian belief networks, which are in turn used in expert systems and other representations of statistically interdependent items. But the set of such digraphs turns out to be too big and, instead, a smaller number of equivalence classes truly represent the set of possible networks. Until now, little has been known about the combinatorial properties of these classes, such as their asymptotic growth with number of vertices or the average class size.

Linear (causal) relationships between random variables can be conveniently encoded using a mixed graph (a graph with both directed and bidirected edges), where a directed edge implies a direct linear effect and a bidirected edge captures the existence of unobserved confounding. Even when a mixed graph that accurately reflects the data generating mechanism is known, that is, when all causal relationships are known and linear, confounding can make it impossible to infer parameters of interest. More concretely, many mixed graphs have (generically) unidentifiable parameters.

The statistical literature abounds with limit results (central limit theorems, laws of large numbers and laws of the iterated logarithm) for Markov chains, Markov renewal processes, and Markov additive processes. However, most of the general results are not applicable in practice because the limiting quantities are, in general, not available in explicit form.

Phylodynamics is an area at the intersection of phylogenetics and population genetics that aims to reconstruct population size trajectories from genetic data.

Advisor: Peter Hoff

Advisor: Vladimir Minin

Changes in population size influence genetic diversity of the population and, as a result, leave imprints in genomes of individuals in the population. We are interested in an inverse problem of reconstructing past population dynamics from genomic data. We start with a standard framework based on the coalescent, a stochastic process that generates genealogies connecting randomly sampled individuals from the population of interest. These genealogies serve as a glue between the population demographic history and genomic sequences.

Advisor: Adrian Raftery

Gene regulatory networks are an important piece in understanding the functioning of living cells. As more and more gene expression data become available, researchers need fast, reliable techniques for inferring these networks. I have developed ScanBMA, a fast Bayesian model averaging algorithm, used to infer networks from time-series data. I have also developed Model-based Clustering with Data Correction (MCDC), a method for automatically detecting and correcting errors that systematically affect some but not all data.

We have developed LeTICE, an algorithm for learning a transcriptional regulatory network from ChIP-chip location and expression data. The network is specified by a binary matrix of transcription factor – gene interactions which partitions the genes into a collection of modules (groups of genes regulated by the same TFs) and a background (a group of genes which do not belong to any module). We define a likelihood of a network given location and expression data and then search for the network optimizing the likelihood using numerical optimization.

Advisor: Maya Gupta

Advisor: Professor Jon A. Wellner

We develop a sharp multiplier inequality used to study the size of the multiplier empirical process $(\sum_{i=1}^n \xi_i f(X_i))_{f \in \mathcal{F}}$, where the $\xi_i$'s and $\mathcal{F}$ are the multipliers and an indexing function class, respectively. We show that in general the size of the suprema of the multiplier empirical process is determined jointly by the growth order of the corresponding empirical process and the worst size of the maxima of the multipliers.

A number of different scientific fields, ranging from biomedicine and economics to molecular biology, generate functional data. The statistical analysis of a sample of curves, known as Functional Data Analysis (FDA), has as one of its goals describing how variation in the functional outcome relates to predictors. However, these curves tend to be misaligned, exhibiting variation not only in amplitude but also in phase. Teasing apart these sources of variation is a central issue in FDA.

Advisors: Vladimir Minin & David Goodlett

Proteomics attempts to understand biological functions of an organism through the lens of expressed proteins, basic building blocks of all living cells. Mass spectrometry is used in the field of shotgun proteomics to generate mass spectra that are in turn used to identify and quantify proteins in a given sample.

Latent variable network models provide low-dimensional representations of relational patterns in terms of additive and multiplicative actor-specific effects. In this talk we discuss these models in two contexts. First, we extend this class of models to estimate and make inference on the dependencies between a set of network relations and actor-specific attributes. Approaches to this problem typically condition on either the relations or attributes and are unable to provide predictions simultaneously for missing attribute and network information.

Advisor: Marina Meila

We address the problem of identifying underground anomalies (e.g. holes) based on gravity measurements. This is a theoretically well-studied and difficult problem. In all except a few special cases, the inverse problem has multiple solutions, and additional constraints are needed to regularize it. Our approach makes general assumptions about the shape of the anomaly that can also be seen as sparsity assumptions. We can then adapt recently developed sparse reconstruction algorithms to bear on this problem.

Advisor: Mark Handcock

Influenza pandemics pose a serious global health concern. The recent A (H1N1) influenza pandemic caused 18,500 lab-confirmed deaths, and mutation of the A (H5N1) "avian" influenza virus could also cause a pandemic with an estimated 60% case mortality rate in humans, requiring fast analysis of intervention and containment strategies. When a new influenza virus emerges with pandemic potential, stochastic simulation models are used to assess the effectiveness of different strategies.

I will give a brief introduction to graphical models that will be followed by an outline of a few topics that future students of Michael Perlman and Thomas Richardson could work on.

Advisor: Vladimir Minin

In multivariate analysis, we are often interested in studying the dependence structure among diverse types of data, including continuous, ordinal, and non-ordered categorical data. One approach to analyze these data is using copula models. In this talk, I will discuss a method extending copula models to mixed continuous and ordinal data and study its asymptotic properties. Then I will introduce a new model incorporating copula models and model-based clustering ideas to deal with mixed continuous, ordinal and categorical data.
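A small sketch of the Gaussian-copula view of mixed data: latent bivariate normal draws whose second coordinate is cut into ordered categories, yielding one continuous and one ordinal margin that remain dependent. The correlation and cutpoints below are illustrative assumptions.

```python
import numpy as np

# Latent bivariate normals with correlation rho; the first coordinate is kept
# continuous, the second is cut into three ordered categories.
rng = np.random.default_rng(0)
rho = 0.7
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=5000)
x_cont = z[:, 0]                            # continuous margin
x_ord = np.digitize(z[:, 1], [-0.5, 0.5])   # ordinal margin: levels 0, 1, 2
```

Discretizing attenuates but does not destroy the latent dependence, which is why copula methods can recover the latent correlation from mixed margins.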

Propp and Wilson's coupling from the past (CFTP) algorithm provides exact samples and, thus, an elegant alternative to convergence diagnostics for standard MCMC samplers. I shall explain how this method works and discuss some practicalities regarding its use in MCMC sampling. Unfortunately the CFTP technique is only applicable when the distribution to be sampled possesses certain special properties. We propose a way to use the method's basic idea more generally and demonstrate that our algorithm works well in some quite challenging applications.
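To make the basic CFTP idea concrete, here is a minimal monotone implementation for a small lazy birth-death chain (a textbook illustration, not the generalized algorithm proposed in the talk). Reused randomness is prepended as the start time moves further into the past, which is what makes the returned sample exact.

```python
import random

def cftp_sample(n_states=5, p_up=0.3, seed=0):
    """Exact sampling via monotone coupling from the past for a lazy
    birth-death chain on {0, ..., n_states-1}."""
    rng = random.Random(seed)
    us = []    # fixed randomness; us[-1] always drives the step into time 0
    T = 1
    while True:
        us = [rng.random() for _ in range(T)] + us   # extend further into past
        lo, hi = 0, n_states - 1                     # extremal starting states
        for u in us:                                 # run both chains to time 0
            move = 1 if u < p_up else (-1 if u < 2 * p_up else 0)
            lo = min(max(lo + move, 0), n_states - 1)
            hi = min(max(hi + move, 0), n_states - 1)
        if lo == hi:          # chains coalesced: exact stationary sample
            return lo
        T *= 2                # otherwise look further back and retry
```

This chain's stationary distribution is uniform on the five states, so independent calls with different seeds should average about 2.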

Advisor: Douglas Martin

Financial asset returns and fundamental factor exposure data often contain outliers, observations that are inconsistent with the majority of the data. Both academic finance researchers and quantitative finance professionals are well aware of the occurrence of outliers in financial data, and seek to limit the influence of such observations in data analyses. Commonly used outlier mitigation techniques assume that it is sufficient to deal with outliers in each variable separately.

Wavelets give rise to the concept of wavelet variance that decomposes the variance of a time series on a scale by scale basis and that has considerable appeal when physical phenomena are analyzed in terms of variations operating over a range of different scales. The wavelet variance has been applied to a variety of time series and is useful as an exploratory tool to identify important scales, to assess the exponent parameter of a power law process, to detect inhomogeneity and to estimate a time varying spectral density function.
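A sketch of this scale-by-scale decomposition using non-decimated Haar filters; for white noise the wavelet variance should halve with each doubling of scale, which the toy example below reproduces. This simplified estimator (ignoring boundary handling) is an illustrative stand-in for the full MODWT-based machinery.

```python
import numpy as np

def haar_wavelet_variance(x, max_level=4):
    """Scale-by-scale wavelet variance using non-decimated Haar filters:
    at scale tau = 2^(j-1), coefficients are half-differences of adjacent
    block means of length tau (a simplified sketch of the MODWT estimator)."""
    out = {}
    for j in range(1, max_level + 1):
        tau = 2 ** (j - 1)
        kernel = np.concatenate([np.full(tau, 0.5 / tau),
                                 np.full(tau, -0.5 / tau)])
        w = np.convolve(x, kernel, mode="valid")   # wavelet coefficients
        out[tau] = float(np.mean(w ** 2))          # variance at this scale
    return out

rng = np.random.default_rng(0)
wv = haar_wavelet_variance(rng.normal(size=10_000))  # white noise input
# For unit-variance white noise the scale-tau variance is 0.5 / tau.
```

Departures from this halving pattern are exactly what make the wavelet variance useful for detecting power-law behavior and inhomogeneity.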

Hammersley (1972) initiated a very interesting "hydrodynamical" approach to studying the behavior of the lengths of longest increasing subsequences of random permutations. In the nineties, Aldous and Diaconis (1995) introduced a modified version of the interacting particle process studied in Hammersley (1972), and used this modification to prove that the length of a longest increasing subsequence of a (uniform) random permutation of length n, divided by $\sqrt{n}$, converges in probability to 2.
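The limit is easy to check empirically: patience sorting computes the length of a longest increasing subsequence in O(n log n), and for a random permutation of length n = 100,000 the ratio to sqrt(n) is already close to 2.

```python
import bisect
import math
import random

def lis_length(perm):
    """Length of a longest increasing subsequence via patience sorting,
    O(n log n): piles[i] is the smallest possible tail of an increasing
    subsequence of length i + 1."""
    piles = []
    for v in perm:
        i = bisect.bisect_left(piles, v)
        if i == len(piles):
            piles.append(v)
        else:
            piles[i] = v
    return len(piles)

rng = random.Random(0)
n = 100_000
perm = list(range(n))
rng.shuffle(perm)
ratio = lis_length(perm) / math.sqrt(n)   # should already be close to 2
```

(The small downward bias of the ratio at finite n is consistent with the known n^{1/6}-order correction.)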

We consider the problem of testing and estimation of separable covariances for relational data sets in the context of the matrix-variate normal distribution. Relational data are often represented as a square matrix, the entries of which record the relationships between pairs of objects. Many statistical methods for the analysis of such data assume some degree of similarity or dependence between objects in terms of the way they relate to each other. However, formal tests for such dependence have not been developed.

Advisors: Vladimir Minin and Ira Longini

Despite seasonal cholera outbreaks in Bangladesh, little is known about the relationship between environmental conditions and cholera cases. We seek to develop a predictive model for cholera outbreaks in Bangladesh based on environmental predictors. To do this, we estimate the contribution of environmental variables, such as water depth and water temperature, to cholera outbreaks in the context of two different disease transmission models.

Advisor: Elena Erosheva

I develop Bayesian hierarchical latent variable models for the study of longitudinal multivariate data. The latent variable models seek to represent multivariate data with a reduced number of dimensions, while the hierarchical formulation enables the description of the latent structure evolution over time as well as factors associated with this evolution. Research on cognitive assessments and scientific interest in relating cognitive decline to neuroimaging results and biomarker information motivate these models.

There will be a riveting introduction to social networks and the latent space model used for modeling networks. I will discuss the difficulties in estimating the parameters of this model by traditional methods and explain the estimator we came up with to deal with these issues.

Advisor: Mathias Drton


Two hypothesis testing problems related to high-dimensional covariance/correlation structures will be presented.

A convex-transformed density is a quasi-concave (or a quasi-convex) density which is a composition of monotone and convex functions. We consider a scale of such families of multivariate densities indexed by a parameter which is a monotone function. The exponential function corresponds to log-concave densities, while power functions correspond to heavier tailed densities or densities concentrated on the positive orthant.

Popular theories in political science regarding opinion-changing behavior postulate the existence of one or both of two broad categories of people: those who hold their opinions over time, and those who hold no solid opinion and, when asked to make a choice, do so seemingly at random. This study explores evidence for a third category: durable changers. This group of people will change their opinion in a rational, informed manner, after being exposed to new information.

Advisor: Marina Meila

We introduce a hierarchical Bayesian model (HBM) for precipitation monitoring data that incorporates numerical weather prediction (NWP) model output at high spatial and temporal resolution and a physics-based stochastic partial differential equation (SPDE). The SPDE explicitly models phenomena such as advection and diffusion that occur in many natural processes. We approximate the solution of the SPDE in the spectral space using the method of eigenfunctions to reduce the dimensionality of the problem.

Allele-sharing methods provide a robust approach to linkage detection for complex traits using pedigree data. Affected related individuals have increased probability of sharing genes identical-by-descent (IBD) at trait loci and hence also at linked marker loci at which they therefore show increased similarity over that predicted under Mendelian segregation. Relatives of discordant phenotype have decreased probability of sharing genes IBD at trait loci and hence have decreased similarity at linked markers.
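The Mendelian baseline referred to above can be checked by simulation. As a hypothetical sketch for the full-sib case only (the talk concerns general pedigrees; the function name and setup are mine), a full-sib pair shares 0, 1, or 2 alleles IBD at a locus with probabilities 1/4, 1/2, 1/4:

```python
import random

def sib_pair_ibd(n_pairs, seed=1):
    """Simulate single-locus IBD sharing for full-sib pairs under
    Mendelian segregation: each sib independently inherits one of
    two parental alleles from each parent."""
    rng = random.Random(seed)
    counts = [0, 0, 0]
    for _ in range(n_pairs):
        # which parental allele (0 or 1) each sib receives, (maternal, paternal)
        s1 = (rng.randint(0, 1), rng.randint(0, 1))
        s2 = (rng.randint(0, 1), rng.randint(0, 1))
        ibd = (s1[0] == s2[0]) + (s1[1] == s2[1])
        counts[ibd] += 1
    return [c / n_pairs for c in counts]
```

Allele-sharing tests then ask whether observed sharing among affected relatives at a marker exceeds this null expectation.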

A central goal of the education literature is to demonstrate that specific educational interventions have a treatment effect on student test performance. Researchers often have access to student test scores for students in the treatment and control groups both prior to and after the intervention, but usually must estimate the treatment effect from observational data in which the intervention has not been randomly assigned to units. This talk begins with a discussion of the assumptions that underlie common approaches to estimating a treatment effect with observational data.

Advisor: Dr. Raphael Gottardo

Peptide microarrays tiling immunogenic regions of pathogens (e.g. envelope proteins of a virus) have become an important high throughput tool for querying and mapping antibody binding. Antibodies play a key role in the immune system by preventing and controlling infection. Antibody binding locations provide crucial information for understanding natural infection and for deriving effective vaccines. In the context of vaccine development, the peptide microarray can reveal patterns of antibody response stimulated via vaccine treatment.

Advisor: R. Douglas Martin

The Capital Asset Pricing Model (CAPM) is today's most important financial model for estimating the cost of capital and for asset allocation. Its centerpieces are parameters, commonly called betas and alphas, estimated using ordinary least squares (OLS) regression. Since financial returns typically have an asymmetric and heavy-tailed distribution, OLS estimates can be severely biased. In this talk we will introduce robust regression estimates with zero bias in beta and low bias in alpha (even under asymmetric distributions) but 99% asymptotic efficiency at the Gaussian model.
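To illustrate how a single heavy-tailed observation distorts an OLS beta, here is a minimal sketch contrasting OLS with a generic Huber M-estimator fit by iteratively reweighted least squares. This is an illustration only, not the efficient estimator proposed in the talk; all function names and tuning constants are mine:

```python
import numpy as np

def ols_beta(x, y):
    """OLS fit of y = alpha + beta * x; returns (alpha, beta)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

def huber_beta(x, y, c=1.345, n_iter=50):
    """Huber M-estimate of (alpha, beta) via iteratively reweighted
    least squares, with a MAD-based robust residual scale."""
    X = np.column_stack([np.ones_like(x), x])
    coef = ols_beta(x, y)
    for _ in range(n_iter):
        r = y - X @ coef
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12  # MAD scale
        w = np.clip(c * s / np.maximum(np.abs(r), 1e-12), None, 1.0)
        sw = np.sqrt(w)
        coef = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return coef
```

On clean Gaussian-like data the two estimates nearly coincide; a single gross return moves the OLS beta substantially while the M-estimate barely changes.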

Advisor: Marina Meila


Latent class transition models (LCTMs) are used to study the movement of individuals among homogeneous subgroups through time. Traditional LCTMs assume a complete set of observations for each individual. However, many longitudinal surveys have a rolling enrollment design, with late entry and early exit. Thus, methodology is needed to account for all the possible times at which individuals can be observed.

Advisors: Charles Kooperberg & Michael LeBlanc

Standard statistical techniques often fail in the presence of data-driven model selection, yielding inefficient estimators and hypothesis tests that fail to achieve nominal type-I error rates. In particular, the observed data is constrained to lie in a subset of the original sample space that is determined by the selected model. This often makes the post-selection likelihood of the observed data intractable and inference difficult. Recently, novel methodologies have been proposed for performing valid inference in selected models.

We provide an asymptotic theory for penalized least squares estimators of locally constant functions with finitely many jumps which are blurred by an operator and random noise. Differences from the direct case are highlighted; in particular, it turns out that a sqrt(n) rate of convergence for estimation of the jump locations is generic in the inverse case. Moreover, the jump locations are jointly asymptotically normal, which allows construction of confidence regions for the graph of a function with a finite number of jumps.

If atmospheric, agricultural, and other environmental systems share one underlying theme, it is complex spatial structure, driven by features such as topography and weather. Ideally we might model these effects directly; however, information on the underlying causes is often not routinely available. Hence, when modeling environmental systems there is a need for a class of spatial models that does not rely on the assumption of stationarity. In this talk, we propose a novel approach to modeling nonstationary spatial fields.

We present a Bayesian model for area-level count data that uses Gaussian random effects with a novel type of G-Wishart prior on the inverse variance-covariance matrix. The usual G-Wishart prior restricts off-diagonal elements of the precision matrix to 0 according to the neighborhood structure of the study region. This preserves conditional independence of non-neighboring regions but is more flexible than the traditional intrinsic autoregression prior.

Advisor: Professor Ross L. Prentice

Advisor: Peter Hoff

MURI week continues this Friday. I'll be talking about probabilistic weather forecasting using Bayesian Model Averaging, an altogether different approach from the probabilistic forecasting method described by Tilmann in seminar earlier this week. I'll discuss my work on forecasting wind and rain, and look at a modification of the EM algorithm for mixed continuous/discrete distributions.

Advisor: Peter Hoff


Advisor: Adrian Raftery

Co-Advisors: Mathias Drton & Raphael Gottardo

This talk describes new multivariate regression and model-based clustering methods for statistical inference with multivariate mixed outcomes. We use the term mixed outcomes to refer to binary, ordered categorical, count, continuous and other ordered outcomes in combination. Such data structures are common in social, behavioral, and medical sciences. We develop two regression approaches, the semiparametric Bayesian latent variable model and the semiparametric reduced rank multivariate regression model, for mixed outcome data.

Tandem mass spectrometry has become a leading technology for protein identification. Much research has been done to automate the task of matching spectra to peptides.

In this study, we propose a probabilistic sequencing algorithm. It includes a probabilistic network to model the chemistry generating the theoretical spectrum, a pair hidden Markov model to match the theoretical spectrum to the observed spectrum, and a probabilistic score function to rank the candidate sequences.

We address the problem of identifying underground anomalies (e.g. holes) based on gravity measurements. This is a theoretically well-studied and difficult problem. In all but a few special cases, the inverse problem has multiple solutions, and additional constraints are needed to regularize it. Our approach makes general assumptions about the shape of the anomaly that can be seen as sparsity assumptions. We then adapt recently developed sparse reconstruction algorithms to this problem.

Advisor: Jon Wakefield

Area and time-specific estimates of disease rates, cause-specific mortality rates and other key health indicators are of great interest for health care and policy purposes. Such estimates provide the information needed to identify areas with increased risk, effectively allocate resources, and target interventions. A wide variety of data, such as vital statistics, complex surveys, demographic surveillance sites, and disease registries, are used for these purposes.

Exponential-family random graph models (ERGMs) have been widely applied in social network analysis, genetics (e.g. protein interaction networks), information theory, and other fields. Because the likelihood function is intractable, Markov chain Monte Carlo (MCMC) approximation is typically used to obtain maximum likelihood estimators (Geyer and Thompson 1992). However, ERGMs still suffer from inferential degeneracy and computational inefficiency. In this talk, we present a Bayesian approach to inference for ERGMs.

In classical testing problems, we often use statistics based on the empirical distribution function to test whether or not the underlying distribution of the data is what we think it might be. Berk and Jones introduced such a statistic in 1979. I'll talk about a statistic which is related to theirs (called the reversed Berk-Jones statistic), and some of its properties. Along the way we'll chat about what exactly the empirical distribution function is, and why I think it's so cool. That is all.
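For concreteness, here is a small sketch of the empirical distribution function and of the plain (not reversed) Berk-Jones statistic, which takes the supremum of a pointwise Bernoulli likelihood-ratio distance between F_n and the hypothesized F. The names and this schematic formulation are mine, not the talk's:

```python
import bisect
import math

def bernoulli_kl(p, q):
    """K(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)), with 0 log 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def ecdf(sample):
    """Return the empirical distribution function F_n of a sample:
    F_n(t) is the fraction of observations <= t."""
    xs = sorted(sample)
    n = len(xs)
    return lambda t: bisect.bisect_right(xs, t) / n

def berk_jones(sample, F):
    """sup_x K(F_n(x), F(x)), evaluated at the jump points of F_n,
    where F_n takes the values i/n and (i+1)/n."""
    xs = sorted(sample)
    n = len(xs)
    stat = 0.0
    for i, x in enumerate(xs):
        u = F(x)
        if 0.0 < u < 1.0:
            stat = max(stat, bernoulli_kl((i + 1) / n, u),
                       bernoulli_kl(i / n, u))
    return stat
```

Replacing the likelihood-ratio distance with a plain sup-norm distance would recover the Kolmogorov-Smirnov statistic from the same ECDF machinery.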

Advisor: Jon Wellner

We discuss a method for obtaining finite sample Gaussian bounds for the tail of the hypergeometric distribution. The method is based on Tusnády's approach (1975) to bounding the tail of symmetric binomial random variables. In this talk, we review Tusnády's result and discuss how it can be adapted to and extended in the hypergeometric case.
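The exact hypergeometric tail that such Gaussian bounds control is easy to compute directly, which is handy for checking a candidate bound numerically. A minimal sketch (function names are mine; the bound itself is not reproduced here):

```python
import math

def hypergeom_pmf(k, N, K, n):
    """P(X = k): probability of k successes in n draws without
    replacement from a population of N items containing K successes."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

def hypergeom_tail(k, N, K, n):
    """Upper tail P(X >= k)."""
    return sum(hypergeom_pmf(j, N, K, n)
               for j in range(k, min(K, n) + 1))
```

A Tusnády-type bound can then be validated by comparing `hypergeom_tail` against the corresponding Gaussian tail over a grid of (k, N, K, n).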

Public Talk

For several important monotone parameters, such as the distribution function, monotone density function, and monotone regression function, sensible nonparametric estimators can be obtained by minimizing the empirical risk based on an appropriate loss function. For more complex monotone parameters, such as a monotone covariate-adjusted dose-response curve, or in the context of more complex data structures, this approach may not be possible and alternative approaches are needed. We discuss general strategies for monotone function estimation in two important settings.
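For the simplest of these monotone parameters, minimizing the empirical squared-error risk over nondecreasing functions has a classical solution, the pool-adjacent-violators algorithm (PAVA). A minimal sketch, with names of my own choosing, to show the empirical-risk-minimization baseline that the talk generalizes beyond:

```python
def pava(y, w=None):
    """Least-squares nondecreasing fit to y (optionally weighted) by
    pool-adjacent-violators: merge adjacent blocks whenever their
    means violate monotonicity, replacing them by a weighted mean."""
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    blocks = []  # each block: [mean, total weight, size]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, s2 = blocks.pop()
            m1, w1, s1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, s1 + s2])
    fit = []
    for m, _, s in blocks:
        fit.extend([m] * s)
    return fit
```

For more complex monotone parameters, as the abstract notes, no such closed-form risk minimizer is available and different strategies are required.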

The paradox of missing heritability refers to the common finding that in complex genetic traits with high heritability as estimated by methods such as twin studies, only a small fraction of the population variance is explained by the few Single Nucleotide Polymorphism (SNP) markers found to be individually significantly associated with the trait. Human height, with heritability estimates as high as 80% largely unexplained by individual SNPs, is the canonical example of such a trait.