This talk explores two statistical tasks involving high-dimensional time series. The first task is to forecast high-dimensional time series using Bayesian hierarchical models (BHM). The data being modeled relate to the smoking epidemic and to human mortality measures obtained from multiple populations around the world. I propose a BHM for estimating and forecasting the all-age smoking attributable fraction (ASAF), which serves as a summary statistical measure of the effect of smoking on mortality.
Estimating and Forecasting the Smoking-attributable Mortality Fraction for Both Sexes Jointly in 69 Countries
Smoking is one of the preventable threats to human health and is a major risk factor for lung cancer, upper aero-digestive cancer, and chronic obstructive pulmonary disease. Estimating and forecasting the smoking attributable fraction (SAF) of mortality can yield insights into smoking epidemics and also provide a basis for more accurate mortality and life expectancy projection.
An individualized treatment rule (ITR) is a treatment rule which assigns treatments to individuals based on (a subset of) their measured covariates. An optimal ITR is the ITR which maximizes the population mean outcome. In any given problem, there is no guarantee that the optimal ITR will outperform standard practice. The utility of personalization can be explored using a confidence interval for the mean outcome under the optimal rule.
In this talk, we consider structural equation models represented by a mixed graph that encodes both direct causal relationships and latent confounding. First, we use an empirical likelihood approach to fit structural equation models without explicitly assuming a distributional form for the errors. Through simulations, we show that when the errors are skewed, the empirical likelihood approach may provide a more efficient estimator than methods assuming a Gaussian likelihood.
The multivariate skew normal distribution extends the class of normal distributions by the addition of a shape parameter. It allows the modeling of phenomena whose empirical outcomes behave in a non-normal fashion but still retain some similarity to the normal distribution. It was introduced in Azzalini and Dalla Valle (1996), and further probabilistic properties as well as statistical aspects were explored in Azzalini and Capitanio (1999).
Learning an ensemble of models instead of a single one can be a remarkably effective way to reduce predictive error. Ensemble methods include bagging, boosting, stacking, error-correcting output codes, and others. But how can we explain the amazing success of these methods (and hopefully design better ones as a result)? Many different explanations have been proposed, using concepts like the bias-variance tradeoff, margins, and Bayesian model averaging. However, each of these explanations has significant shortcomings.
One of the first topics to come up in an introductory statistics course is means and medians. But why do we have more than one way of measuring the "middle" of a set of data? This talk will show how different metrics (ways of defining distance) give rise to different measures of "middle", as well as looking at some of the practical reasons we might pick one measure over another.
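As a small illustration of the idea (a sketch, not taken from the talk): the mean is the point that minimizes the sum of squared distances to the data, while the median minimizes the sum of absolute distances. The data values, the grid search, and the function names below are hypothetical choices made for the example.

```python
from statistics import mean, median

def squared_loss(c, data):
    # Sum of squared distances from candidate centre c to each data point.
    return sum((x - c) ** 2 for x in data)

def absolute_loss(c, data):
    # Sum of absolute distances from candidate centre c to each data point.
    return sum(abs(x - c) for x in data)

def grid_argmin(loss, data, lo, hi, step=0.01):
    # Brute-force search over an evenly spaced grid (illustrative only).
    n = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n + 1)]
    return min(candidates, key=lambda c: loss(c, data))

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical data with one outlier

best_sq = grid_argmin(squared_loss, data, 0.0, 100.0)
best_abs = grid_argmin(absolute_loss, data, 0.0, 100.0)

print("squared-distance minimiser:", round(best_sq, 2), "mean:", mean(data))
print("absolute-distance minimiser:", round(best_abs, 2), "median:", median(data))
```

The single outlier pulls the squared-distance minimiser far from the bulk of the data, while the absolute-distance minimiser stays put, which is one practical reason to prefer one measure over the other.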
In this age of exponential growth in science, engineering, and technology, the capability to evaluate the performance, reliability, and safety of complex systems presents new challenges. Today's methodology must respond to the ever-increasing demands for such evaluations to provide key information for decision and policy makers at all levels of government and industry on problems ranging from national security to space exploration.
We consider a parametric version of the two-sided matching model described in Roth and Sotomayor (1990). We develop the model for the analysis of matching data, where the data consist of pairs of individuals, with one individual from each of two distinct populations (for example employers and employees, or men and women). Individuals agree to form pairs based on utilities they have for one another, resulting in a stable set of matches between the two populations.
Many medical studies collect both repeated measures data and survival data. In this talk, I discuss jointly modeling these two kinds of data in a study of patients with chronic kidney disease in which longitudinal biomarkers of kidney function and time to cardiovascular events were recorded. Jointly modeling these processes is important because the measurements of kidney function are error-prone. Ignoring this error (e.g., using a simpler time-varying covariates model) can give biased estimates of the effect of kidney function on the risk of cardiovascular events.
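To make the notion of a stable set of matches concrete, here is a minimal sketch using the classical deferred-acceptance (Gale-Shapley) procedure. This is not the parametric model or the estimation method of the talk. It assumes two populations of equal size, strict preferences, and that everyone prefers any match to remaining unmatched. The utility tables and function names are hypothetical.

```python
def deferred_acceptance(u_prop, u_recv):
    """Return a stable matching {proposer: receiver}.

    u_prop[i][j] is the utility proposer i has for receiver j;
    u_recv[j][i] is the utility receiver j has for proposer i.
    """
    n = len(u_prop)
    # Each proposer ranks receivers from highest to lowest utility.
    prefs = [sorted(range(n), key=lambda j: -u_prop[i][j]) for i in range(n)]
    next_choice = [0] * n          # index of the next receiver to propose to
    match_of_recv = [None] * n     # current tentative partner of each receiver
    free = list(range(n))
    while free:
        i = free.pop()
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        current = match_of_recv[j]
        if current is None:
            match_of_recv[j] = i
        elif u_recv[j][i] > u_recv[j][current]:
            match_of_recv[j] = i   # receiver trades up; old partner is free again
            free.append(current)
        else:
            free.append(i)         # proposal rejected
    return {match_of_recv[j]: j for j in range(n)}

def is_stable(matching, u_prop, u_recv):
    # A matching is stable if no pair (i, j) both prefer each other
    # to their assigned partners (no "blocking pair").
    partner_of_recv = {j: i for i, j in matching.items()}
    for i, j_cur in matching.items():
        for j in range(len(u_recv)):
            if (u_prop[i][j] > u_prop[i][j_cur]
                    and u_recv[j][i] > u_recv[j][partner_of_recv[j]]):
                return False
    return True

# Hypothetical utilities for three proposers and three receivers.
u_prop = [[3, 2, 1], [3, 1, 2], [1, 3, 2]]
u_recv = [[1, 3, 2], [2, 1, 3], [3, 2, 1]]

matching = deferred_acceptance(u_prop, u_recv)
print(matching, is_stable(matching, u_prop, u_recv))
```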
In retrospective surveys, classification errors are usually of a systematic nature, since respondents tend to be consistent in their answers and to forget about past changes in their labor market status.
In many areas of science, models involve unseen latent variables. Often these variables are such that, were we able to observe them, the testing of scientific hypotheses would be straightforward. A classical example is that of Bernoulli trials (tosses of a fair coin) observed with error. If the number of successes (heads) is observed, testing that the coin is fair is straightforward, but how should uncertainty in observation be taken into account?
This talk is about the influences of kinships on first migration. Event history data collected in a special survey of several Mexican communities are used to show that the migration decisions of individuals are affected by what fathers and brothers do or have done in the past. Instead of simple individual models, the models proposed are designed to retrieve fixed and time dependent effects on the joint migration risks of two members of a pair while simultaneously reducing or eliminating the impact of unmeasured common conditions shared by the members of the pairs.
Network science is an interdisciplinary endeavor, with methods and applications drawn from across the natural, social, and information sciences. In addition to theoretical developments, electronic databases currently provide detailed records of human communication and interaction patterns, offering novel avenues to map and explore the structure of social networks. I will talk about the structure of a social network based on the cell phone communication patterns of millions of individuals, and what implications it has for diffusion processes on social networks.
Many studies yield functional data, where the ideal units of observation are curves and the observed data are sampled on a fine grid. These curves frequently have irregular features requiring spatially adaptive nonparametric representations. I will discuss new methods for modeling these data using functional mixed models, which treat the curves as responses and relate them to covariates using nonparametric fixed and random effect functions.
The dramatic variations in HIV spread among different populations of the world have led to much research on the causal mechanisms of transmission. Some researchers focus on bio-medical factors, such as the enhanced infectivity caused by other sexually transmitted diseases; others focus on the sexual behavior and network patterns that spread the pathogen. This talk will explore several examples from the network paradigm, with data from the United States, Uganda, and Thailand.
Probabilistic forecasts of continuous variables take the form of predictive densities or predictive cumulative distribution functions. We propose a diagnostic approach to the evaluation of predictive performance that is based on the paradigm of maximizing the sharpness of the predictive distributions subject to calibration. Calibration refers to the statistical consistency between the distributional forecasts and the observations, and is a joint property of the predictions and the events that materialize.
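One common calibration diagnostic is the probability integral transform (PIT): if the predictive CDF is evaluated at each observation, a calibrated forecast yields values that look uniform on [0, 1]. The sketch below is an illustration under simplifying assumptions, not the procedure of the talk: the observations are simulated from a standard normal distribution, both forecasts are fixed Gaussian distributions, and the names are hypothetical.

```python
import random
from statistics import NormalDist

def pit_histogram(forecast, observations, bins=10):
    # Count how many PIT values F(y) fall into each of `bins` equal bins.
    counts = [0] * bins
    for y in observations:
        u = forecast.cdf(y)
        counts[min(int(u * bins), bins - 1)] += 1
    return counts

rng = random.Random(0)
obs = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # simulated "truth"

calibrated = NormalDist(0.0, 1.0)   # matches the data-generating process
too_wide = NormalDist(0.0, 3.0)     # overdispersed, hence less sharp

hist_cal = pit_histogram(calibrated, obs)
hist_wide = pit_histogram(too_wide, obs)

print("calibrated PIT counts:", hist_cal)      # roughly flat
print("overdispersed PIT counts:", hist_wide)  # piled up in the middle
print("sharpness (predictive sd):", calibrated.stdev, too_wide.stdev)
```

A hump-shaped PIT histogram signals a predictive distribution that is too wide; among calibrated forecasts, the one with the smaller predictive spread is the sharper.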
By comparison with modern parametric Bayesian statistics, practicable and robust methods for exploration and data analysis in nonparametric settings are underdeveloped. The rapid development of non-Bayesian methods and ranges of ad-hoc non-parametric tools for data mining reflect the need for a non-parametric Bayesian approach to exploring and managing data sets in even moderate dimensional problems. I will address this issue by presenting multivariate Polya tree based methods for modeling multidimensional probability distributions.
Suppose we have two bivariate random vectors and we want to compare them according to the degree of dependence between their component random variables. Rather than using a single summary statistic like the correlation coefficient, it is more useful and informative to compare the whole distributions or some aspects of them. For this purpose, several partial orders have been introduced in the literature, assuming that the marginal distributions are identical. But in many problems of practical interest, this is not the case.
In this talk models for claim frequency and average claim size in non-life insurance are considered. Both covariates and spatial random effects are included allowing the modelling of a spatial dependency pattern. We assume a Poisson model for the number of claims, while claim size is modelled using a Gamma distribution. However, in contrast to the usual compound Poisson model going back to Lundberg (1903), we allow for dependencies between claim size and claim frequency. A fully Bayesian approach is followed; parameters are estimated using Markov Chain Monte Carlo (MCMC).
Advisor: David Madigan
Expressing weather forecasts as probabilities has been a regular (though small) part of operational meteorological forecasting in the U.S. since at least 1965, when the Weather Bureau produced its first probability of precipitation forecasts. In fact, the concept that weather forecasts are uncertain has been understood since the early days of weather forecasting (e.g., in the late 1800s, Cleveland Abbe, the “Father” of weather forecasting in the U.S., called his forecasts “probabilities”).
This presentation outlines Bayesian selection methodology for semiparametric components in two different scenarios. The first is in models that involve additive semiparametric function estimation, while the second is in time-space varying coefficient models. While these statistical models may differ, the approach to modeling the flexible components involved is similar. It entails the adoption of proper shrinkage priors, coupled with a point mass at zero. In effect, this corresponds to adopting both traditional Bayesian regularization and model averaging simultaneously.
Texture is a powerful cue in visual perception, and texture analysis and synthesis has been an active research area in computer vision. We present a statistical theory for texture modeling and random field approximation, which combines multi-channel filtering and random field modeling via the maximum entropy principle. Our theory characterizes a texture by a random field, the modeling of which consists of two steps.
Lévy Noise Induced Transitions Between Meta-Stable States in Stochastic (Partial) Differential Equations
A spectral analysis of the time series representing average temperatures during the last ice age featuring the Dansgaard-Oeschger events reveals an α-stable noise component with α ≈ 1.78. Based on this observation, papers in the physics literature attempted a qualitative interpretation by studying diffusion equations that describe simple dynamical systems perturbed by small Lévy noise. We study exit and transition problems for solutions of stochastic differential equations and stochastic reaction-diffusion equations derived from this prototype.
Tens of thousands of individuals undergo polygraph security screening examinations in the U.S. every year. How good is the polygraph in detecting deception in such settings? Is there a scientific underpinning for the detection of deception? Are there suitable alternatives to the polygraph for security screening? Two years ago, the NAS-NRC Committee to Review the Scientific Evidence on the Polygraph released its report, “The Polygraph and Lie Detection,” addressing these issues.
This talk will focus on topics related to Bayesian networks, a type of graphical model. In general, a graphical model has a qualitative part, a graph over a set of variables, and a quantitative part, a joint distribution over the set of variables. The qualitative part represents a set of independence constraints true of the joint distribution. In the case of a Bayesian network the graph is a directed acyclic graph.
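As a toy illustration of the two parts (not an example from the talk), consider a hypothetical three-variable chain A -> B -> C with binary variables. The graph is the qualitative part; the conditional probability tables, whose product gives the joint distribution, are the quantitative part. The probabilities below are made up for the example.

```python
from itertools import product

# Qualitative part: a directed acyclic graph, given as parents of each node.
parents = {"A": (), "B": ("A",), "C": ("B",)}
order = ["A", "B", "C"]  # a topological order of the graph

# Quantitative part: P(var = 1 | parent values) for each node (hypothetical).
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0,): 0.1, (1,): 0.6},
}

def joint(assignment):
    # The joint factorizes as the product of P(node | parents(node)).
    p = 1.0
    for var in order:
        pa = tuple(assignment[q] for q in parents[var])
        p1 = cpt[var][pa]
        p *= p1 if assignment[var] == 1 else 1.0 - p1
    return p

def prob(**fixed):
    # Marginal probability of the event described by `fixed`, by enumeration.
    total = 0.0
    for values in product((0, 1), repeat=len(order)):
        a = dict(zip(order, values))
        if all(a[k] == v for k, v in fixed.items()):
            total += joint(a)
    return total

# The graph implies that C is independent of A given B.
p_c_given_ab = prob(A=1, B=1, C=1) / prob(A=1, B=1)
p_c_given_b = prob(B=1, C=1) / prob(B=1)
print(p_c_given_ab, p_c_given_b)
```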
The availability of molecular genetic markers enables the dissection of a quantitative trait into quantitative trait loci (QTL), i.e., chromosomal regions that show strong association with the observed phenotypic trait variance. In plants, the first QTL experiments were targeted to a single mapping population that was derived from crossing two extreme, often fully-inbred, individuals. This simple design allowed regression and Maximum Likelihood methods for data analysis. However, the success of the identified QTL was hampered by several factors.
The Capital Asset Pricing Model (CAPM) is today's most important financial model for estimating cost of capital and asset allocation. Its centerpieces are variables, commonly called betas and alphas, estimated using ordinary least squares (OLS) regression. Since financial returns typically have an asymmetric and heavy-tailed distribution, OLS estimates can be severely biased. In this talk we will introduce robust regression estimates with zero bias in beta and low bias in alpha (even under asymmetric distributions) but 99% asymptotic efficiency at the Gaussian model.
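For reference, the OLS estimates of alpha and beta come from a simple regression of asset returns on market returns. The sketch below shows only this OLS baseline and how one extreme return can move beta. It does not implement the robust estimator of the talk. The return series are hypothetical and are assumed to be already expressed as excess returns.

```python
def ols_alpha_beta(asset, market):
    # Simple linear regression: asset = alpha + beta * market + error.
    n = len(asset)
    mean_a = sum(asset) / n
    mean_m = sum(market) / n
    cov = sum((m - mean_m) * (a - mean_a) for m, a in zip(market, asset))
    var = sum((m - mean_m) ** 2 for m in market)
    beta = cov / var
    alpha = mean_a - beta * mean_m
    return alpha, beta

# Hypothetical excess returns: the asset follows 0.001 + 1.5 * market exactly.
market = [0.01, -0.02, 0.03, 0.00]
asset = [0.001 + 1.5 * m for m in market]
alpha, beta = ols_alpha_beta(asset, market)

# Replace one asset return with an extreme value to mimic a heavy tail.
asset_outlier = asset[:-1] + [0.20]
alpha_out, beta_out = ols_alpha_beta(asset_outlier, market)

print("clean data:   alpha=%.4f beta=%.3f" % (alpha, beta))
print("with outlier: alpha=%.4f beta=%.3f" % (alpha_out, beta_out))
```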
The observed frequency of a particular outcome in data-based simulation, known as bootstrap probability (BP) of Felsenstein (1985), is very useful as a confidence level of data analysis with discrete outcomes such as estimating the phylogenetic tree from aligned DNA sequences or identifying the clusters from microarray expression profiles. We argue that the length of simulated data sets should be
When considering a graphical Gaussian model N_G, Markov with respect to a decomposable graph G, the parameter space of interest for the precision parameter is the cone P_G of positive definite matrices with fixed zeros corresponding to the missing edges of G. The parameter space for the scale parameter of N_G is the cone Q_G, dual to P_G, of incomplete matrices with submatrices corresponding to the cliques of G being positive definite. We construct on the cones Q_G and P_G two families of Wishart distributions, namely the type I and type II Wisharts.
Covariance structure is of fundamental importance in many areas of statistical inference and a wide range of applications, including genomics, fMRI analysis, risk management, and web search problems. In the high dimensional setting where the dimension p can be much larger than the sample size n, classical methods and results based on fixed p and large n are no longer applicable. In this talk, I will discuss some recent results on optimal estimation of covariance/precision matrices as well as sparse linear discriminant analysis with high-dimensional data.
We develop a Bayesian “sum-of-trees” model, named BART, where each tree is constrained by a prior to be a weak learner. Fitting and inference are accomplished via an iterative backfitting MCMC algorithm. This model is motivated by ensemble methods in general, and boosting algorithms in particular. Like boosting, each weak learner (i.e., each weak tree) contributes a small amount to the overall model. However, our procedure is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm.
Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. Likewise, fusion of real-time satellite data with in situ sea surface temperature measurements for ecological modeling remains taxing for probabilistic spatial-temporal models on a global scale.
The understanding of the limiting behavior of empirical processes has found much application in a wide variety of problems in statistics. In particular, empirical processes are natural quantities to look at when studying the asymptotics of infinite-dimensional data.
In this paper we consider the analysis of semiparametric models for binary panel data with state dependence. A hierarchical modeling approach is used for dealing with the initial conditions problem, for addressing heterogeneity, and for incorporating correlation between the covariates and the random effects. We consider a semiparametric model in which a Markov process prior is used to model an unknown regression function.
Legislative voting records are widely used in political science to characterize revealed preferences among the members of a deliberative assembly. In this context, item-response models (a class of latent factor models) such as NOMINATE and IDEAL are the preeminent quantitative tools used for analysis. This class of models assumes that members' choices can be explained by continuous latent features, often called ideal points. For unidimensional latent spaces, this often results in a ranking of members along the liberal-conservative spectrum.
Model uncertainty is central to economics, where researchers attempt to discriminate among alternative theories in robustness analyses. Bayesian Model Averaging (BMA) is an approach designed to address model uncertainty as part of the empirical strategy. Applications of BMA to economics are widespread; however, it is often unclear whether subtle differences in the choice of parameter and model priors affect inference. We present an integrated procedure, based on 12 popular, noninformative parameter priors and any given model prior, to conduct sensitivity analysis in BMA.
From the inception of the proportional representation movement, it has been an issue whether larger parties are favored at the expense of smaller parties in one apportionment of seats as compared to another apportionment. A number of methods have been proposed and are used in countries with a proportional representation system. These methods exhibit a regularity of order that captures the preferential treatment of larger versus smaller parties. This order, namely majorization, permits the comparison of seat allocation in two apportionments.
With modern "Big Data" settings, off-the-shelf statistical machine learning methods are frequently proving insufficient. A key challenge posed by these modern settings is that the data might have a large number of features, in what we will call "Big-p" data, to denote the fact that the dimension "p" of the data is large, potentially even larger than the number of samples.
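The majorization order can be checked directly: for two allocations with the same total, x majorizes y when every partial sum of x's largest entries is at least the corresponding partial sum for y. The sketch below uses the standard definition. The seat counts are hypothetical and the function name is chosen for the example.

```python
from itertools import accumulate

def majorizes(x, y):
    """Return True if allocation x majorizes allocation y."""
    if len(x) != len(y) or sum(x) != sum(y):
        raise ValueError("allocations must have equal length and equal total")
    # Compare cumulative sums of the entries sorted from largest to smallest.
    px = accumulate(sorted(x, reverse=True))
    py = accumulate(sorted(y, reverse=True))
    return all(a >= b for a, b in zip(px, py))

# Two hypothetical apportionments of 100 seats among four parties.
seats_a = [50, 30, 15, 5]
seats_b = [47, 30, 16, 7]

print(majorizes(seats_a, seats_b))  # seats_a concentrates seats in larger parties
print(majorizes(seats_b, seats_a))
```

Here the first apportionment majorizes the second, meaning it is at least as favorable to the larger parties at every cumulative cut-off.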
Learning Transcriptional Regulatory Modules from the Integration of ChIP-Chip and Gene Expression Data
The most common way for governments to protect the population from environmental insults, such as air or water pollution, is to set a standard. Most standards consist of two parts: a cutoff value beyond which health risks are deemed unacceptable, and an implementation rule, specifying how compliance with the standard will be ascertained. We illustrate the concepts with two US environmental standards, one for air pollution and one for water pollution. From a statistical point of view, the US EPA implementation rules in these examples have poor performance characteristics.
Events in an online social network can be categorized roughly into endogenous events, where users just respond to the actions of their neighbors within the network, or exogenous events, where users take actions due to drives external to the network. How much external drive should be provided to each user, such that the network activity can be steered towards a target state?
Advisors: Thomas Richardson and Dan Goldhaber
Lord (1967) describes a hypothetical “paradox” in which two statisticians, analyzing the same dataset using different but defensible methods, come to very different conclusions about the effects of an intervention on student outcomes.
Advisor: Marina Meila
Abstract: This talk investigates the methodology and scalability of non-linear dimension reduction techniques. With data being observed in increasingly higher dimensions and on a larger scale than before, the demand for non-linear dimension reduction is growing. There is very little consensus, however, on how non-linear dimension reduction should be performed. The goal of Manifold Learning (ML) is to embed the data into s-dimensional Euclidean space (where manifold dimension < s < observed dimension) without distorting the geometry. Existing ML algorithms (e.g.
The PRISM Group (formerly known as the Spatial Climate Analysis Service) at Oregon State University is the de facto climate mapping center for the United States, and a leader in the emerging discipline of geospatial climatology. Under funding from the USDA-NRCS, NOAA, NPS, USFS, and other agencies, The PRISM Group has mapped the long-term mean climate on a monthly basis for all US states and possessions. These maps are the official climate data sets of the USDA, and have been used in thousands of applications worldwide.
Besag and Newell (1991) provide what might be viewed as a statistician's version of Stan Openshaw's Geographical Analysis Machine (GAM), with the specific aim of identifying spatially localized anomalies ("clusters") in a database comparing the addresses at the time of diagnosis of all registered cases of childhood leukemias in Great Britain between 1966 and 1983 with the nominal populations at risk in more than 100,000 census enumeration districts (ED's).