Topic Modeling of Latin Text was my senior capstone project for the computer science major. I completed the project with a group of five other students from Fall 2017 through Winter 2018.
The project consists of two parts: an algorithm and an application. Although the focus throughout was on computer science, this was, at its heart, a digital humanities project. We needed to design a robust algorithm that would produce useful, substantive results, but we also wanted to build a user interface that would let users navigate those results easily and intuitively. After all, what’s the point of the algorithm if the intended user can’t explore the results?
Topic modeling is a statistical method for identifying the abstract topics that run through a set of documents by grouping together related words from those documents. The “topics” are the resulting lists of words. Our project was an implementation of the Latent Dirichlet Allocation (LDA) algorithm, one of the most popular topic modeling algorithms, with some adjustments to make it perform well on Latin text. The algorithm was written in Python and was largely a collaborative effort. I worked on several parts of it, including implementing the statistical calculations at the core of the algorithm and generating some of the data structures that hold the per-word metadata used to build the user interface.
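To make the idea concrete, here is a minimal sketch of LDA using collapsed Gibbs sampling, one common way to fit the model. It is not our project's code: the function names (lda_gibbs, top_words), the hyperparameter values, and the tiny example corpus are all assumptions made for illustration, and it leaves out the Latin-specific adjustments described above.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative sketch only).

    docs is a list of tokenized documents (lists of strings).
    Returns one {word: count} dict per topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]            # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                          # tokens assigned to each topic

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight each topic by how much this document uses it and
                # how strongly the topic favors this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return topic_word


def top_words(topic_word, n=3):
    """Return the n most frequent words for each topic."""
    return [sorted(counts, key=counts.get, reverse=True)[:n] for counts in topic_word]


if __name__ == "__main__":
    # Hypothetical toy corpus; a real run would use preprocessed Latin text.
    corpus = [
        ["arma", "virum", "cano", "arma", "bellum"],
        ["amor", "vincit", "omnia", "amor"],
        ["bellum", "arma", "virum"],
        ["amor", "omnia", "vincit"],
    ]
    for words in top_words(lda_gibbs(corpus, num_topics=2)):
        print(words)
```

Each word list printed at the end is one “topic” in the sense described above. On a corpus this small the groupings are unstable and depend on the random seed, so the output is only meant to show the shape of the result.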
To see the website we created for the project, which includes access to the source code and many more details on the algorithm and application, click here.
For a more thorough overview of the project, check out the video of our presentation at Carleton’s Computer Science Comps Gala: