Advisor: Tyler McCormick
In the social sciences, statistical associations among variables are of particular interest in a variety of scientific questions. For example, understanding the associations between questions in a survey may help researchers identify themes among related questions or improve imputation for missing data. In demand estimation, as another example, learning product competition structure from data can help economists model consumer choices more effectively. In these settings and others in the social sciences, variables are often high-dimensional (the number of questions collected in a survey and the number of products on the market, for example), but obtaining data can be onerous, so observations are usually limited. As a result, researchers often draw on external domain knowledge from a variety of sources.
Our work develops scalable Bayesian graphical models that allow flexible incorporation of additional information to improve both inferential and predictive tasks. In the first part, we propose a Bayesian framework to infer latent graphical models from data that consist of both continuous and binary variables. We show that our method improves estimation of both the underlying correlation matrix and the latent graph structure. We use our method to estimate the distribution of deaths by cause using verbal autopsy (VA) surveys. VA is a commonly used tool for assessing cause of death in areas without complete-coverage civil registration; it relies on an interview with caregivers of the decedent to elicit data describing the signs and symptoms leading up to the death. By extending our method to latent Gaussian mixture models, we show improved accuracy in cause-of-death assignment. In practice, Bayesian stochastic search over the space of large precision matrices can be computationally challenging and sensitive to prior specifications. Accordingly, in the second part, we propose a deterministic alternative for estimating Gaussian graphical models using the Expectation Conditional Maximization (ECM) algorithm. This model extends the EM approach in Bayesian variable selection to graphical model estimation. We show that the ECM approach finds posterior modes quickly, enables fast posterior exploration under a sequence of mixture priors, and can incorporate multiple sources of information. We also extend this approach to copula graphical models for non-Gaussian data. Finally, we describe a further extension of the ECM approach to conditional graphical models with high-dimensional covariates, with an application to discovering product competition in a large marketplace for laptops and tablets.