Analysis of Incomplete Network Data

Mengjie Pan

Collecting social network data is notoriously difficult, so indirectly observed or missing data are very common. In this talk, we address two such scenarios: inference on network measures without any direct network observations, and inference on regression coefficients when important features are missing.

Direct network data is often expensive to collect because it requires soliciting connections between all members of the population. Collecting aggregate relational data (ARD), that is, responses to questions of the form “How many of your social connections have trait k?”, is much more cost-effective. In the first part of the talk, we show that ARD can be used to estimate individual and global network properties. Building on the latent surface model proposed by McCormick and Zheng (2015), we connect ARD to a network formation model, which allows us to obtain draws from the posterior distribution over graphs given the ARD response vector. We can then compute network statistics based on these posterior samples. We demonstrate our method using evidence from simulation and by replicating results from cases where the complete graph was observed.
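As a rough illustration of the last step, the sketch below summarizes network statistics across a set of graph draws. It does not implement the latent surface model or the ARD likelihood: the draws are stand-ins generated from a hypothetical constant edge-probability matrix, and the names `sample_graph`, `summarize_draws`, and `edge_prob` are assumptions made only for this example.

```python
import numpy as np

def sample_graph(edge_prob, rng):
    """Draw one undirected graph (symmetric 0/1 adjacency, no self-loops)."""
    n = edge_prob.shape[0]
    # Sample each potential edge once, in the strict upper triangle.
    upper = np.triu(rng.random((n, n)) < edge_prob, k=1)
    return (upper | upper.T).astype(int)

def summarize_draws(graphs):
    """Summarize degree and density over a stack of graph draws."""
    graphs = np.asarray(graphs)                    # shape (num_draws, n, n)
    n = graphs.shape[1]
    degrees = graphs.sum(axis=2)                   # degree of each node per draw
    density = degrees.sum(axis=1) / (n * (n - 1))  # 2 * edges / (n * (n - 1))
    return {
        "mean_degree": degrees.mean(axis=0),       # individual-level summary
        "density_mean": density.mean(),            # global summary
        "density_interval": np.quantile(density, [0.025, 0.975]),
    }

# Stand-in for posterior draws: in the setting described above these would
# come from the fitted network formation model, not from a fixed matrix.
rng = np.random.default_rng(0)
n = 30
edge_prob = np.full((n, n), 0.1)                   # hypothetical edge probabilities
draws = [sample_graph(edge_prob, rng) for _ in range(200)]
summary = summarize_draws(draws)
```

Any statistic that can be computed on a single graph (centrality, clustering, and so on) can be handled the same way: compute it on each draw, then report the mean and quantiles across draws.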

In the second part of the talk, we discuss how to make inference on the coefficients of a linear regression in which the outcome is the interaction between a pair of nodes and there are unobserved clusters of nodes. Building on the exchangeable errors proposed by Marrs, McCormick, and Fosdick (2017), we propose block-exchangeable errors and a two-step procedure for estimating the standard errors.
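For orientation, a regression of this kind can be written as follows. The notation is assumed here for illustration and is not taken from the talk: y_ij is the interaction from node i to node j, x_ij is a covariate vector for the pair, and Omega is the covariance matrix of the stacked errors. The last line is the standard sandwich form of the least-squares variance; the error structure determines how Omega is parameterized and estimated.

```latex
\begin{align*}
  y_{ij} &= x_{ij}^{\top}\beta + \varepsilon_{ij}, \qquad i \neq j, \\
  \hat{\beta} &= (X^{\top}X)^{-1}X^{\top}y, \\
  \operatorname{Var}(\hat{\beta} \mid X) &= (X^{\top}X)^{-1}X^{\top}\,\Omega\,X\,(X^{\top}X)^{-1},
  \qquad \Omega = \operatorname{Var}(\varepsilon \mid X).
\end{align*}
```

Because pairs that share a node are generally correlated, Omega is not diagonal, which is why the usual independent-error standard errors are not appropriate in this setting.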