ARCHER: A Representative Corpus of Historical English Registers

Period: 1650-1999
Size: 1.8 million words
ARCHER is a multi-genre corpus of British and American English covering the period 1650-1999, first constructed by Douglas Biber and Edward Finegan in the 1990s. It is managed as an ongoing project by a consortium of participants at fourteen universities in seven countries. ARCHER is available upon request.


Chadwyck-Healey Literature Collection

Period: c.1500-c.1950
Size: c. 138 million words
Julia Schlüter has compiled a manual for using these text collections as a linguistic corpus (in German).

The Chadwyck-Healey Literature Collection consists of several parts that have to be queried individually:

Early English Prose Fiction
More than 200 works from the period 1500–1700, exploring the rich diversity of prose fiction in English in the period preceding the emergence of the realist novel as its dominant form.

Eighteenth-Century Fiction
Eighteenth-Century Fiction brings together 96 complete works of English prose from the period 1700–1780 by writers from the British Isles. It is the largest collection of literature from the period available in electronic form.

Nineteenth-Century Fiction
Nineteenth-Century Fiction collects 250 British and Irish novels from the period 1782 to 1903, stretching from the golden age of Gothic fiction to the Decadent and New Woman novels of the 1890s. Major novelists of the period such as Austen, Scott, Mary Shelley, Dickens, Eliot, Hardy and the Brontës feature alongside popular romances, sensation fiction, colonial adventure novels and children’s literature.

English Prose Drama
Prose Drama contains more than 1,600 plays written by more than 350 different authors from the Renaissance to the end of the nineteenth century. The database includes plays, masques, entertainments, and certain closet dramas.

Early American Fiction
Early American Fiction 1789–1875 is the latest product of an ongoing collaboration between ProQuest and the University of Virginia Library. In 1996 the University of Virginia Library received a grant from The Andrew W. Mellon Foundation to digitize and publish its unique collections of early American fiction. This made possible the first Chadwyck-Healey Early American Fiction 1789–1850 database, completed in 2000, which offered preservation-quality facsimile page images and keyword-searchable full text for more than four hundred works of American fiction published before 1850. Early American Fiction 1789–1875, the second phase of this project, has been made possible by further sponsorship from the Mellon Foundation.

American Drama
Containing more than 1,500 dramatic works from the colonial period to the beginning of the twentieth century, American Drama 1714–1915 is the largest electronic collection of American dramatic writing of its kind. It provides literary researchers and historians with a comprehensive survey of American dramaturgy from its origins up to the era of sensational melodrama and manners comedy exemplified by the work of such playwrights as David Belasco, Clyde Fitch and William Vaughn Moody.


Corpus of Historical American English (COHA)

"The Corpus of Historical American English (COHA) is the largest structured corpus of historical English. It is related to many other corpora of English that we have created. These corpora were formerly known as the "BYU Corpora", and they offer unparalleled insight into variation in English. If you are interested in historical corpora, you might also look at our Google Books (see comparison), Hansard, and TIME corpora.

COHA contains more than 475 million words of text from the 1820s-2010s (which makes it 50-100 times as large as other comparable historical corpora of English) and the corpus is balanced by genre decade by decade. The creation of the corpus results from a grant from the National Endowment for the Humanities (NEH) from 2008-2010."

(https://www.english-corpora.org/coha/, 08.08.2023)


Penn Parsed Corpora of Historical English

Period: 1150-1914
Size: c. 3.9 million words
The Penn Historical Corpora, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE), are syntactically annotated corpora of prose text samples of English from the indicated time periods. Their syntactic annotation (parsing) permits searching not only for words and word sequences, but also for syntactic structure. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language. The three components of the Penn Parsed Corpora of Historical English are available upon request.
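
As a rough illustration of what searching for syntactic structure (rather than word sequences) can look like, the following Python sketch reads a sentence in labeled-bracket notation and retrieves every constituent with a given label. The sample sentence, the node labels and the helper names (parse_bracketed, find_nodes, words) are invented for this illustration; they are assumptions, not taken from the corpus files or their documentation, and this is not the query tool distributed with the corpora.

```python
import re


def parse_bracketed(text):
    """Parse a labeled-bracket string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return read_node()


def find_nodes(node, label):
    """Yield every subtree whose label matches exactly."""
    node_label, children = node
    if node_label == label:
        yield node
    for child in children:
        if isinstance(child, tuple):
            yield from find_nodes(child, label)


def words(node):
    """Return the words dominated by a node, in order."""
    out = []
    for child in node[1]:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out


# Invented example sentence and labels, for illustration only.
SAMPLE = "(IP (NP-SBJ (D the) (N king)) (VBD sent) (NP-OB1 (D a) (N letter)))"

tree = parse_bracketed(SAMPLE)
subjects = [" ".join(words(n)) for n in find_nodes(tree, "NP-SBJ")]
print(subjects)
```

A plain word search could find "king" but could not tell whether it is part of a subject; the structural search returns only the phrase carrying the subject label.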


The Corpus of Late Modern English Texts, version 3.0 (CLMET)

Period: 1710-1920
Size: 34 million words
CLMET3.0 is a principled collection of public domain texts drawn from various online archiving projects. In total, the corpus contains some 34 million words of running text. The corpus covers five major genres: narrative fiction, narrative non-fiction, drama, letters and treatises, in addition to a number of unclassified texts. The corpus is free; see the website for details.


ProQuest Historical Newspapers


A Corpus of English Dialogues 1560-1760 (CED)

Period: 1560-1760
Size: 1.2 million words
Released in Spring 2006, A Corpus of English Dialogues 1560-1760 (CED) is a 1.2-million-word computerized corpus of Early Modern English speech-related texts. The CED is part of the research project “Exploring spoken interaction of the Early Modern English period (1560-1760)” (see e.g. Culpeper and Kytö 1997, 2000, and forthcoming), and was compiled by Merja Kytö and Jonathan Culpeper, in collaboration with Terry Walker and Dawn Archer, at Uppsala and Lancaster Universities. The CED is available upon request.


A Linguistic Atlas of Early Middle English (LAEME)

Period: 1150-1325
Size: 650,000 words
Complete texts (or large samples of very long texts) have been diplomatically transcribed from original manuscripts or facsimiles. Each word and each derivational and inflectional morpheme in the text is lexico-grammatically tagged. The present LAEME CTT consists of 650,000 words tagged at this unprecedented level of detail, enabling investigations at all linguistic levels. The CTT is searchable on the website under LAEME TASKS: TAGGED TEXTS. From each tagged text is derived a text dictionary, which lists all the linguistic material in the tagged texts, arranged by lexico-grammatical tag. The text dictionaries are searchable under LAEME TASKS: TEXT DICTIONARIES. The full tagged texts and text dictionaries are also accessible from the individual entries in the Index of Sources, to be found on the website under Auxiliary Data Sets. Considerable editorial and textual commentary accompanies each tagged text. The corpus has provided the source material for all the related publications listed in the LAEME bibliography (to be found on the website under Auxiliary Data Sets).


Oxford Text Archive (OTA)

The Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. Several corpora from the OTA are available upon request:

  • Complete Corpus of Old English
    The Old English electronic corpus is a complete record of surviving Old English except for some variant manuscripts of individual texts. A list of included texts can be found here.
  • Corpus of Biblical Texts in Scots
  • Corpus of Early English Correspondence Sampler
    A manual can be found here.
  • Corpus of Late Modern English Prose
  • Dictionary of Old English Corpus in Electronic Form
  • The English Language of the North-West in the late Modern English period
  • Helsinki Corpus of English Texts
  • Older Scottish Texts (Edinburgh DOST Corpus)
  • The Helsinki Corpus of Older Scots
  • York-Helsinki Parsed Corpus of Old English Poetry
  • York-Toronto-Helsinki Parsed Corpus of Old English Prose

Last modified: Tuesday, 8 August 2023, 2:25 PM