Probabilistic topic models provide a suite of tools for analyzing large document collections. Topic modeling algorithms discover the latent themes that underlie the documents and identify how each document exhibits those themes. Topic modeling can be used to help explore, summarize, and form predictions about documents.
Traditional topic modeling algorithms analyze a document collection and estimate its latent thematic structure. However, many collections contain an additional type of data: how people use the documents. For example, readers click on articles on a newspaper website, scientists place articles in their personal libraries, and lawmakers vote on a collection of bills. Behavior data is essential both for making predictions about users (such as for a recommendation system) and for understanding how a collection and its users are organized.
In this talk, I will review the basics of topic modeling and describe our recent research on collaborative topic models, which simultaneously analyze a collection of texts and its corresponding user behavior. We studied collaborative topic models on the personal libraries of 80,000 scientists, a collection that contains 250,000 articles.
With this analysis, I will show how we can build interpretable recommendation systems that point scientists to articles they will like. Further, the same analysis lets us organize the scientific literature according to discovered patterns of readership. For example, we can identify articles that are important within a field and articles that transcend disciplinary boundaries.
More broadly, topic modeling is a case study in the large field of applied probabilistic modeling. I will close by surveying some recent advances in this field and showing how modern probabilistic modeling gives data scientists a rich language for expressing statistical assumptions and scalable algorithms for uncovering hidden patterns in massive data.
David Blei is an associate professor of Computer Science at Princeton University. He earned his Bachelor's degree in Computer Science and Mathematics from Brown University and his PhD in Computer Science from the University of California, Berkeley. His research focuses on probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference. He works on a variety of applications, including text, images, music, social networks, user behavior, and scientific data.