A Running List of Speculative Fiction Datasets


Of all literary genres, speculative fiction seems to be the most suited for the quantitative lens. Think of fan-driven databases like Memory Alpha, in-world galactic encyclopedias, or hand-curated anthologies, zines, and collections like those featured in The Stuff of Science Fiction. These narrative worlds are primed for counting.

Unfortunately, many databases and bibliographies of SF are inherently exclusionary. Suzanne Boswell, who uses The Internet Speculative Fiction Database (ISFDB) in a network analysis of women writers in pulp magazines of the 1920s-40s (article | data), argues that standard bibliographies of SF are almost never representative:

The ISFDB chooses what constitutes science fiction, horror, and fantasy by deciding what to archive in its database. For the early twentieth century, this mainly means the pulps. This decision makes it difficult to track the contributions of women to the science fiction genre: if the pulps excluded women, and bibliographic archives only count the pulps as science fiction, where do we find the women? Another example: in the early twentieth century, most science fiction by Black authors was published outside of the pulps (W. E. B. Dubois’s “The Comet” [1920]; Pauline Hopkins’s Of One Blood [1902–1903]; George Schuyler’s short stories in The Pittsburgh Courier [1936–1937]). The ISFDB will have the bibliographic information for, say, George Schuyler—but it does not have bibliographic information for other Black-dominated magazines, or Black fiction pamphlets, where other Black speculative fiction writers may exist outside of sf archives. Marginalized authors who write outside the pulps enter science fiction archives as exceptions: their community does not come with them. In this way, science fiction archives repeat the exclusionary patterns of the early twentieth century.

Beginning a research project with databases that reflect standard bibliographies can severely narrow the range of voices recovered from 20th-century speculative fiction. Some of the datasets and projects below directly grapple with this history. Others present an opportunity to do so.

Text Corpuses

fulltext ¦ Gutenberg SF Bookshelf, 235 full-text books as collected by Project Gutenberg in 2007. A broader, ongoing list of ~1,000 works is also available for browsing. The latter can be downloaded by following either the roboting guidelines (e.g., wget or curl) or the mirroring guidelines (e.g., rsync).

fulltext ¦ SciFi Stories Text Corpus, extracted by Robin Sloan from the Internet Archive’s Pulp Magazine Archive, itself a collection of 12,000 volumes with bad metadata but good OCR. The Text Corpus presents that Archive as a single, monster, 150MB .txt file. Sloan assembled this text to create a model for his amazing “Writing with the machine” project: enter some text, and watch an AI trained on pulp fiction autocomplete your prose.
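
A single file of that size is easiest to handle by streaming it rather than loading it whole. Here is a minimal sketch of a word-frequency count in Python, assuming the corpus has been saved locally as plain text. The function name, the tokenizing pattern, and the file name in the usage note are illustrative assumptions, not part of Sloan's project.

```python
from collections import Counter
import re

# Crude tokenizer (an assumption for illustration): runs of letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def count_words(path, encoding="utf-8"):
    """Stream a large text file line by line and tally lowercased tokens."""
    counts = Counter()
    # errors="replace" keeps stray OCR bytes from halting the read.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            counts.update(TOKEN.findall(line.lower()))
    return counts
```

Calling `count_words("scifi_corpus.txt").most_common(20)` (the file name is hypothetical) would list the twenty most frequent tokens. Because the file is read one line at a time, memory use depends on the size of the vocabulary, not on the size of the file.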

nonconsumptive ¦ HathiTrust Research Center’s 20th Century English-Language Speculative Fiction, as collected by Laure Thompson and David Mimno. Contains 2,454 “Volumes of speculative fiction identified both through matching titles and authors to Worlds Without End (WWE), an extensive fan-built database of speculative fiction, and via computational text similarity analysis techniques.” The corpus can be explored using HTRC’s Feature Reader or their browser-based Token Count and Tag Cloud Creator, Named Entity Recognizer, and Topic Model Explorer.

nonconsumptive ¦ Code and data in the GitHub repository supporting Ted Underwood’s article “The Life Cycles of Genres” in Cultural Analytics, primarily using HathiTrust.

nonconsumptive ¦ Temple Digital Scholarship Center’s New-Wave SF Corpus, also in HathiTrust. At last check, roughly 250 SF books and magazines with another 1,000 in the queue. Scanned from the Paskow Science Fiction Collection. Here’s a post on the digitization process, and an introduction to text mining that collection.

Awards & Anthologies

Classics of Science Fiction database, by Jim Harris and Mike Jorgensen. The most prominent novels and short stories in the history of the genre, as measured by 1) the number of awards / nominations they’ve received, and 2) how many “citations” (i.e. inclusions in republished anthologies) they’ve received. Documentation is available on the project’s methods. It also includes many sublists, including most cited works by year, by author, etc. A nicely designed v. 4 website contains blog posts and updates on the project.

Science Fiction Awards Database, 1951–present, maintained by Locus.

For more in-depth numbers on the Hugo Awards, see these Hugo Award Stats with numbers of nominating ballots from 2010–2019.

Related: an extensive series of lists detailing the work published in Fantasy & Science Fiction magazine, including a (complete?) list of pseudonyms and given author names, translated works, letters from readers organized by author, cover art and artists, etc.

Novums & Neologies

Technovelgy, a dictionary of 3,300 inventions organized by date of their appearance in works of SF from 1634–present. Includes speculative inventions that have since been realized and purely fictional creations from ablative heat shield (1934) to zeroentropy spray (1943).

Jesse Sheidlower’s Historical Dictionary of Science Fiction.

Speculative W@nderverse, an ontology of story and novum type as catalogued by famous collector Bob Gibson. Displayed in a beautiful dendrogram.

Not quite neology or novum, but for a project showcasing the emergence of science fictional concepts, see 100 Years of Science Fiction, which scrapes “reader comments, plot descriptions, and user-generated tags” from 2,633 SF novels (“all Sci Fi novels published since 1900”) on Goodreads. The finished product is an impressive, interactive network visualization of broad concepts, keywords, and authors.

Standard Databases

Internet Speculative Fiction Database, extensively maintained by a community of readers. Includes subgenre tags. MySQL data accessible via R (one example here) or via Beautiful Soup and Python (one example here).
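
Once bibliographic data of this kind sits in a local SQL database, counts by year or by type take a single query. The sketch below uses Python's built-in sqlite3 module in place of MySQL so that it runs without a server. The table name, column names, and rows are hypothetical placeholders for illustration and do not reproduce the ISFDB's actual schema or records.

```python
import sqlite3


def novels_per_year(conn):
    """Return (year, count) pairs for rows typed as novels, earliest year first."""
    query = """
        SELECT substr(pub_date, 1, 4) AS year, COUNT(*) AS n
        FROM titles
        WHERE title_type = 'NOVEL'
        GROUP BY year
        ORDER BY year
    """
    return conn.execute(query).fetchall()


# Toy in-memory stand-in for a local copy of a bibliographic database.
# Schema and rows are hypothetical placeholders, not real records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (title TEXT, title_type TEXT, pub_date TEXT)")
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [
        ("Placeholder Novel A", "NOVEL", "1926-04-01"),
        ("Placeholder Novel B", "NOVEL", "1926-11-01"),
        ("Placeholder Story C", "SHORTFICTION", "1926-06-01"),
        ("Placeholder Novel D", "NOVEL", "1931-02-01"),
    ],
)
result = novels_per_year(conn)
print(result)
```

The same query pattern carries over to a MySQL connection, provided the table and column names are swapped for those of the real database.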

Worlds Without End, another reader-maintained database.

Premios y Lista, Comparativas: Ciencia ficción.

Encyclopaedia of Science Fiction, major project originally published in print by the scholars John Clute and Peter Nicholls. Now in its 4th edition.

中文科幻数据库 (CSFDB, the Chinese Science Fiction Database).

The Science Fiction and Fantasy Research Database, for locating secondary sources and criticism by work.

Noosfere, French-language SF database begun in 1999.

Tailored Bibliographies

A Crash Course in the History of Black SF, by Nisi Shawl on Worlds Without End.

SF by Women Writers, a list on Worlds Without End. The original list was produced by the Classics of Science Fiction project.

Last Updated May 2022