In this work, we develop scalable learning methods for sequential data models with latent (hidden) states. Two popular examples of such models are state space models (SSMs) and recurrent neural networks (RNNs). By augmenting an observed sequence with a sequence of latent states, SSMs and RNNs model complex temporal dynamics with a simpler, smaller parametrization. Unfortunately, learning the parameters of these latent state sequence models requires processing the latent states along the entire sequence, which scales poorly to long sequences.

To tackle this challenge, we propose scalable learning methods that use stochastic gradients computed on subsequences instead of full sequences. Although subsampling yields the desired computational speed-ups, subsequence-based stochastic gradients break temporal dependencies and are therefore biased. We develop theory to analyze the effect of this bias on learning, and we develop efficient estimators to control it. For SSMs, we use buffered stochastic gradient estimates, which reduce the bias by passing additional messages through a buffer around each subsequence. For RNNs, we adaptively truncate backpropagation by estimating the relative bias, saving computation and memory when possible. On both synthetic and real data sets with millions of time points, we find that these methods provide significant speed-ups while controlling the error due to bias.
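To make the two ideas concrete, here is a minimal sketch, not the actual estimators developed in this work. The first function estimates a full-sequence prediction loss from a randomly sampled subsequence with a left buffer; a toy exponential smoother stands in for the SSM forward message, and the buffer lets the recursion "forget" its arbitrary initialization before the core subsequence, reducing the boundary bias. The second function illustrates adaptive truncation: given assumed per-lag gradient-norm contributions, it returns the shortest horizon whose estimated relative bias falls below a tolerance. All names, the smoother, and the rescaling by `T / S` are illustrative assumptions.

```python
import numpy as np

def buffered_subseq_loss(y, alpha, S, B, rng):
    """Stochastic estimate of the full-sequence one-step prediction loss
    sum_t (y[t] - m[t-1])^2, where the latent message m is updated as
    m[t] = (1 - alpha) * m[t-1] + alpha * y[t] (a toy stand-in for the
    SSM forward message). A left buffer of width B warms up m before the
    core subsequence, so the arbitrary initialization m = 0 has largely
    been forgotten by the time losses are accumulated."""
    T = len(y)
    s = rng.integers(0, T - S + 1)       # core subsequence is [s, s + S)
    lo = max(0, s - B)                   # buffer start, clipped at 0
    m = 0.0                              # arbitrary latent-message init
    loss = 0.0
    for t in range(lo, s + S):
        if t >= s:                       # accumulate loss only on the core
            loss += (y[t] - m) ** 2
        m = (1 - alpha) * m + alpha * y[t]
    return (T / S) * loss                # rescale to full-sequence scale

def adaptive_truncation(grad_norms, delta):
    """Given estimated gradient-norm contributions by lag (grad_norms[k]
    is the contribution of backpropagating k additional steps), return
    the smallest truncation horizon K whose estimated relative bias --
    the tail mass beyond K divided by the total -- is at most delta."""
    total = sum(grad_norms)
    tail = total
    for K, g in enumerate(grad_norms):
        if tail / total <= delta:
            return K
        tail -= g
    return len(grad_norms)
```

Because the core loss is rescaled by `T / S`, averaging many buffered estimates approximates the full-sequence loss at a per-step cost of `S + B` rather than `T`; the buffer width and the truncation horizon both trade a little extra computation for lower bias.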