Topic Modeling

Topic Modeling of Latin Text was my senior capstone project for the computer science major. I completed the project with a group of five other people from Fall 2017 through Winter 2018.

The project consists of two parts: an algorithm and an application. Although the focus throughout was on computer science, this was, at its heart, a digital humanities project. We of course needed to design a robust algorithm that would render useful and substantive results, but we also sought to create a user interface tool that would allow users to easily and intuitively navigate the results of the algorithm. After all, what’s the point of the algorithm if the intended user can’t explore the results?

The Algorithm

Topic modeling is a method of statistically identifying abstract topics that are present throughout a set of documents by grouping together words that tend to occur together in those documents. The “topics” are the resulting lists of words. Our project was an implementation of the Latent Dirichlet Allocation (LDA) algorithm, one of the most popular topic modeling algorithms, with some specific tweaks to make it perform well on Latin text. The algorithm was written in Python, and was largely a collaborative effort. I worked on several parts of the algorithm, including implementing the statistical calculations that make up the meat of the algorithm, and generating some of the data structures that hold the per-word metadata used to build the user interface.
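To give a feel for what an LDA implementation does, here is a minimal sketch in Python. It is not our project's code: it assumes collapsed Gibbs sampling as the inference method (one common way to fit LDA), leaves out any Latin-specific preprocessing, and every name in it (lda_gibbs, top_words, alpha, beta) is made up for illustration. It also returns a topic for every word in the text, which is the kind of per-word information an annotated-text view can be built on.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Illustrative sketch only. Returns (assignments, topic_word), where
    assignments[d][i] is the topic id of token i in document d and
    topic_word[k] counts how often each word is assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight of topic k: how much document d uses k, times how
                # strongly k favors word w (both smoothed by alpha and beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return assignments, topic_word


def top_words(topic_word, n=3):
    """List the n most frequent words of each topic, skipping zero counts."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


if __name__ == "__main__":
    # Toy input: each document is a list of already-tokenized Latin words.
    docs = [
        ["arma", "virum", "bellum", "arma", "miles"],
        ["amor", "puella", "cor", "amor", "puella"],
        ["bellum", "miles", "arma", "bellum"],
        ["cor", "amor", "puella", "cor"],
    ]
    assignments, topic_word = lda_gibbs(docs, num_topics=2)
    print(top_words(topic_word))
    print(assignments[0])
```

On a toy input like this the sampler may or may not separate the two themes cleanly; real runs use far more text and many more iterations. A fixed seed keeps the output repeatable between runs.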

The Application

The purpose of the application, as stated above, is to allow the user to easily and intuitively navigate the results of the algorithm. The interface has four parts: 1) metadata about the inputted model, 2) a re-creation of the text with each word annotated with its topic, 3) heat maps of topics, and 4) word clouds of topics. We wrote the application in HTML, CSS, and JavaScript, and also used the D3 library. I worked specifically on the annotated text. The application may be visually sparse (no one on our team had much design experience), but there’s a lot of thought and user-testing invested in the usability of the individual parts and the smooth integration between them. I think the application is a thorough proof of concept with a strong functional foundation; it’s just waiting for a glow-up.

To see the website we created for the project, which includes access to the source code and many more details on the algorithm and application, click here.

For a much more robust overview of the project, check out the video of our presentation at Carleton’s Computer Science Comps Gala: