Data on Narrative Voices in Early Novels


Spanning works published between 1660 and 1850, the Early Novels Database (END) is the product of a collaboration between faculty, librarians, and students at nearby Swarthmore, Bryn Mawr, Haverford, and Penn. The project team pulled two thousand novels from special collections and recorded physical information about the copy (dimensions, format, pages), bibliographic metadata about the work (author name, publication date), and — going beyond what library MARC records typically contain — interpretive information about the text (the mode of narrative address, the author’s gender). You can browse categories collected by the team and filter the data by plugging their GitHub repository into a Flat Viewer.

With the END, we can visualize some trends in first person, third person, and epistolary novels, the gender of their authors, and how those categories evolved over three centuries.

This is just a sketch, but it suggests some starting points for more in-depth research questions: why does it seem that female authors predominantly published epistolary novels in the late 18th century before transitioning to third-person narration in the early 19th? Why did men remain entrenched in the first person? These super interesting questions now have entry points thanks to undergraduate research and the humanistically informed deliberation those students engaged in to carefully curate this data.

Two other things I really like about this project: first, its collaborative ethic. Because the team works in GitHub, every contribution made to the repository is tracked, so that each student’s work is legible and they get credit for their research.

Second, the project’s thoughtful adherence to linked open data principles means that it contributes to a broader research ecosystem and allows researchers to cross-reference its curation methods with similar data.

Nicholas Paige published one such dataset on Zenodo alongside his monograph Technologies of the Novel: Quantitative Data and the Evolution of Literary Systems.1 (I was introduced to this book by Akrish Adhikari, one of the CDH’s Humanities Data Teaching Fellows.) We could compare the END novels with this other dataset of French- and English-language works published between 1601 and 1828, performing a comparative analysis of narrative voices in two languages.2

Count of novels classified by modes of narrative address in the Early Novels Database, and English and French datasets for Technologies of the Novel.

| Mode of narrative address | end | tn-en | tn-fr |
|---|---|---|---|
| 3rd Person | 895 | 283 | 911 |
| 1st Person | 455 | 117 | 310 |
| Epistolary | 432 | 95 | 87 |
| Dramatic Dialogue | 30 | 2 | 3 |
| Total | 1,812 | 497 | 1,311 |
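Because the three datasets differ so much in size, raw counts are hard to compare directly. As a rough illustration, the short Python sketch below turns the counts from the table into percentages of each dataset's total. The dictionary layout and the `shares` helper are my own illustrative choices, not part of either project's code, and the figures cover only novels tagged with a single mode of address.

```python
# Counts copied from the table above (novels with a single mode of address).
counts = {
    "end":   {"3rd Person": 895, "1st Person": 455, "Epistolary": 432, "Dramatic Dialogue": 30},
    "tn-en": {"3rd Person": 283, "1st Person": 117, "Epistolary": 95, "Dramatic Dialogue": 2},
    "tn-fr": {"3rd Person": 911, "1st Person": 310, "Epistolary": 87, "Dramatic Dialogue": 3},
}


def shares(dataset_counts):
    """Return each mode's percentage of the dataset total, rounded to one decimal."""
    total = sum(dataset_counts.values())
    return {mode: round(100 * n / total, 1) for mode, n in dataset_counts.items()}


# Hypothetical helper output: one percentage table per dataset.
percentages = {name: shares(modes) for name, modes in counts.items()}

for name, modes in percentages.items():
    print(name, modes)
```

Read this way, epistolary novels make up roughly a quarter of the END sample but well under a tenth of the French one. That gap is only a prompt for further questions, since the datasets were sampled differently.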

But cross-checking the trends would require taking into account the different approach Paige’s book takes to sampling. Like the END researchers, Paige began with a systematic sample, pulling a few novels each year as he moved chronologically forward from the early 17th century. Once he reached the mid-18th century, he switched to cluster sampling, pulling works published in one or two years of each decade to represent that ten-year period. The dataset’s resulting coverage looks like this:

The number of French novels sampled per year, 1601–1828, by Nicholas Paige for Technologies of the Novel.

In the book’s Annex section, describing its “premises and protocols,” Paige explains how his methods evolved alongside his understanding of the subject.

The determination of proper cluster size – how many years of a decade should stand in for the decade as a whole – is dependent on the number of novels published. Basically, small populations must be sampled much more heavily than large populations. But because before starting I had no clear idea of what the “population” of any given decade was, my initial determinations were made on the basis of assumptions about the production of novels – notably, that the French production was at the start very small and grew steadily. This turned out not to be the case; as a result, some of the early decades might be considered “oversampled” (results would have been largely similar with somewhat smaller clusters), and I did have to increase my cluster size in some later decades. (208)
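To make the difference between the two sampling strategies concrete, here is a minimal Python sketch. It is not Paige's actual procedure. The catalogue is synthetic (five invented titles per year for 1700 to 1729), the function names and parameters are hypothetical, and I am assuming that a cluster sample keeps every work published in a chosen year.

```python
import random
from collections import defaultdict


def systematic_sample(records, per_year, rng):
    """Pull up to `per_year` randomly chosen records from every year."""
    by_year = defaultdict(list)
    for rec in records:
        by_year[rec["year"]].append(rec)
    sample = []
    for year in sorted(by_year):
        pool = by_year[year]
        sample.extend(rng.sample(pool, min(per_year, len(pool))))
    return sample


def cluster_sample(records, years_per_decade, rng):
    """Choose `years_per_decade` years in each decade and keep all their records."""
    years_by_decade = defaultdict(set)
    for rec in records:
        years_by_decade[rec["year"] // 10 * 10].add(rec["year"])
    chosen = set()
    for years in years_by_decade.values():
        k = min(years_per_decade, len(years))
        chosen.update(rng.sample(sorted(years), k))
    return [rec for rec in records if rec["year"] in chosen]


# Synthetic catalogue (invented data): five placeholder titles per year.
rng = random.Random(0)
catalogue = [
    {"title": f"novel-{year}-{i}", "year": year}
    for year in range(1700, 1730)
    for i in range(5)
]

sys_sample = systematic_sample(catalogue, per_year=2, rng=rng)
clu_sample = cluster_sample(catalogue, years_per_decade=2, rng=rng)

print(len({r["year"] for r in sys_sample}))  # 30 years covered
print(len({r["year"] for r in clu_sample}))  # 6 years covered
```

The systematic sample touches every year thinly, while the cluster sample covers only a few years per decade but covers them completely. Any year-by-year chart built from clustered data will therefore show gaps that reflect the sampling design rather than the history of publishing.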

For those who follow Andrew Piper’s contention that poor generalization is a genuine problem in literary studies, traversing and comparing linked data is one way to generalize better.3 We just have to be aware of the different methods used to produce each dataset, and how the questions we want to ask of that data might require some finessing as a result.

  1. Nicholas D. Paige, Technologies of the Novel: Quantitative Data and the Evolution of Literary Systems (Cambridge: Cambridge University Press, 2020).

  2. This table only includes counts of novels marked with a single form of narrative address. Another 200 or so are tagged with multiple forms.

  3. Andrew Piper, Can We Be Wrong? The Problem of Textual Evidence in a Time of Data (Cambridge: Cambridge University Press, 2020).