Journal Article

Abstract

Given a small, well-understood corpus of interest to a Humanities scholar, we propose sub-corpus topic modeling (STM) as a tool for discovering meaningful passages in a larger collection of less well-understood texts. STM allows Humanities scholars to discover unknown passages in the vast sea of works that Moretti calls the “great unread” and significantly increases the researcher’s ability to discuss aspects of influence and the development of intellectual movements across a broader swath of the literary landscape. In this article, we test STM on three typical Humanities research problems: in the first, a researcher wants to find text passages that exhibit similarities to a collection of influential non-literary texts by a single author (here, Darwin); in the second, a researcher wants to discover literary passages related to a well-understood corpus of literary texts (here, emblematic texts from the Modern Breakthrough); and in the third, a researcher hopes to understand the influence that a particular domain (here, folklore) has had on the realm of literature over a series of decades. We explore each of these problems in a separate experiment.