Experiments with Speculative Fiction in HathiTrust


I just shared a Jupyter notebook for working with three thousand speculative fiction novels using HathiTrust Research Center (HTRC) Analytics. Rather than the full text of the novels, the notebook uses “Extracted Features,” a data format devised by HathiTrust to enable text analysis of post-1926 books still under copyright protection.

Beginning with a print book that looks like this…

A page from the print edition of H. G. Wells's The First Men in the Moon.

…then scanning and OCRing it to grab its text…



 As I sit down to write here amidst the 
 shadows of vine-leaves under the blue sky of 
 southern Italy, it comes to me with a certain 
 quality of astonishment that my participation 
 in these amazing adventures of Mr. Cavor 
 was, after all, the outcome of the purest acci- 
 dent. It might have been any one. I fell 
 into these things at a time when I thought 
 myself removed from the slightest possibility 
 of disturbing experiences. I had gone to 
 Lympne because I had imagined it the most 
 uneventful place in the world. " Here, at any 
 rate," said I, " I shall find peace and a chance 
 to work ! " 
 ' And this book is the sequel. So utterly at

…HTRC finally transforms that text into Extracted Features: a compressed .json file no longer readable by human eyes (“consumptive” reading), yet containing “quantitative abstractions of a book’s written content” that we can explore through text analysis (“non-consumptive” reading):


Each element you see in the .json sample above is a feature, a “quantifiable marker of something measurable, a datum,” as Peter Organisciak and Boris Capitanu put it in their Programming Historian tutorial on text mining with HTRC. They continue:

A computer cannot understand the meaning of a sentence implicitly, but it can understand the counts of various words and word forms, or the presence or absence of stylistic markers, from which it can be trained to better understand text. Many text features are non-consumptive in that they don’t retain enough information to reconstruct the book text.
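To make that idea concrete, here is a minimal Python sketch that reduces the opening sentence of the OCR'd page quoted earlier to a bag of word counts. It is my own illustration, not HTRC's pipeline: the `token_counts` helper and its tokenizing regex are assumptions, and I have rejoined the line-break hyphenation ("acci- dent") by hand.

```python
import re
from collections import Counter

# Opening sentence of the OCR'd page quoted earlier in this post,
# with the line-break hyphenation ("acci- dent") rejoined by hand.
page_text = (
    "As I sit down to write here amidst the shadows of vine-leaves "
    "under the blue sky of southern Italy, it comes to me with a certain "
    "quality of astonishment that my participation in these amazing "
    "adventures of Mr. Cavor was, after all, the outcome of the purest accident."
)


def token_counts(text):
    """Lowercase the text, split it into word tokens, and count each token.

    Hypothetical helper for illustration; the regex keeps internal hyphens
    so that "vine-leaves" stays a single token.
    """
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(tokens)


counts = token_counts(page_text)

# The most frequent tokens on this "page".
print(counts.most_common(5))

# Listing the vocabulary alphabetically shows that word order is gone.
print(sorted(counts))
```

Once the sentence is reduced to counts, nothing records which word followed which, and that loss of order is what keeps the features from standing in for the readable text.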

Extracted Features files allow researchers not only to count “tokens” (words) in each “volume” (published book), but also to filter by parts of speech, browse extensive bibliographic metadata, view quantitative information about each printed page in the dataset, use named entity recognition (NER) to identify people, places, or organizations in the text, graph these elements, and more.
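As a rough sketch of what counting tokens and filtering by part of speech can look like, the Python below works on a hand-made stand-in for two pages of a volume. Everything in it is an assumption for illustration: the field names (`seq`, `body`, `tokenPosCount`) follow my understanding of the Extracted Features layout in simplified form, the tokens and counts are invented, and `counts_for_pos` is a hypothetical helper. Check the real files and the notebook before relying on any of these names.

```python
from collections import Counter

# Hand-made stand-in for two pages of an Extracted Features file.
# The layout is simplified, and every token and count is invented.
pages = [
    {"seq": "00000011", "body": {"tokenPosCount": {
        "moon": {"NN": 2},
        "sphere": {"NN": 1},
        "write": {"VB": 1},
        "the": {"DT": 5},
    }}},
    {"seq": "00000012", "body": {"tokenPosCount": {
        "moon": {"NN": 1},
        "work": {"NN": 1, "VB": 1},
        "the": {"DT": 3},
    }}},
]


def counts_for_pos(pages, prefix):
    """Sum token counts across pages, keeping only tags that start with `prefix`.

    Hypothetical helper; tags here are Penn Treebank style, so the prefix
    "NN" selects nouns and "VB" selects verbs.
    """
    totals = Counter()
    for page in pages:
        for token, by_pos in page["body"]["tokenPosCount"].items():
            for pos, n in by_pos.items():
                if pos.startswith(prefix):
                    totals[token] += n
    return totals


nouns = counts_for_pos(pages, "NN")
verbs = counts_for_pos(pages, "VB")
print(nouns.most_common())
print(verbs.most_common())
```

Because each count is stored per page and per tag, the same structure supports page-level questions (where does a word cluster?) as well as volume-level totals.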

You can follow the instructions in the Jupyter notebook over at my GitHub repo: Experiments with 20th-Century Speculative Fiction in HathiTrust.

—Princeton, August 2021