One of the challenges of building statistical models for large data sets is balancing the correctness of inference procedures against computational realities. In the context of Bayesian procedures, the pain of such computations has been particularly acute as it has appeared that algorithms such as Markov chain Monte Carlo necessarily need to touch all of the data at each iteration in order to arrive at a correct answer. Several recent proposals have been made to use subsets (or "minibatches") of data to perform MCMC in ways analogous to stochastic gradient descent. Unfortunately, these proposals have only provided approximations, although in some cases it has been possible to bound the error of the resulting stationary distribution.
In this talk I will discuss two new, complementary algorithms for using subsets of data to perform faster MCMC. In both cases, these procedures yield stationary distributions that are exactly the desired target posterior distribution. The first of these, "Firefly Monte Carlo", is an auxiliary variable method that uses randomized subsets of data to achieve valid transition operators, with connections to recent developments in pseudo-marginal MCMC. The second approach I will discuss, parallel predictive prefetching, uses subsets of data to parallelize Markov chain Monte Carlo across multiple cores, while still leaving the target distribution intact. These methods have both yielded significant gains in wallclock performance in sampling from posterior distributions with millions of data.