English Gigaword (Fifth Edition)

Size: c. 4 billion tokens (25 GB of raw text)
The English Gigaword Corpus is a collection of English newswire texts from several news agencies (Los Angeles Times, Agence France-Presse …), acquired by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. The English component and smaller datasets in Spanish and French are available upon request.

TIME Magazine Corpus of American English

275,000 texts (100 million words) from TIME magazine, covering the period from 1923 to the 1990s. The corpus is searchable through the BYU interface.

SUSANNE Corpus (Geoffrey Sampson)

The SUSANNE Corpus was created, with the sponsorship of the Economic and Social Research Council (UK), as part of the process of developing a comprehensive language-engineering-oriented taxonomy and annotation scheme for the (logical and surface) grammar of English. The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.

LUCY Corpus (Geoffrey Sampson)

The LUCY Corpus is an electronic sample of modern written English produced in the UK by a spectrum of writers ranging from skilled published authors to young children, equipped with detailed annotation identifying grammatical and other linguistic structure. Compilation of the LUCY Corpus was sponsored by the Economic and Social Research Council (UK), under grant R000 238146, 2000-03, and was carried out at the University of Sussex.

The Wellington Corpus of Written New Zealand English (WWC)

One million words of written New Zealand English collected from writings published in the years 1986 to 1990. The WWC has the same basic categories as the Brown Corpus of written American English (1961) and the Lancaster-Oslo-Bergen corpus (LOB) of written British English (1961). The corpus also parallels the structure of the Macquarie Corpus of written Australian English (1986). The WWC consists of 2,000-word excerpts on a variety of topics. Text categories include press material, religious texts, skills, trades and hobbies, popular lore, biography, scholarly writing and fiction. This corpus is part of the ICAME CD-ROM and is available upon request.

Oxford Text Archive (OTA)

The Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. The following written corpora are available upon request:

  • Edinburgh Associative Thesaurus (EAT)
  • Edited Polytechnic of Wales Corpus (EPOW)

The Blog Authorship Corpus

Size: over 140 million words
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person.

Scientext: Learner Corpus of English

Size: 1.1 million words
The LLS research group at the Université de Savoie collected this corpus of academic texts (1.1 million words) written by French learners of English in the 2nd and 3rd year of a university degree. It is accessible through an online interface.

WestburyLAB Usenet Corpus

Size: 30 billion words
This corpus is a collection of public USENET postings collected between October 2005 and January 2011. It covers 47,860 English-language, non-binary-file newsgroups. The corpus is untagged raw text.

WaCKy corpora (ukWaC)

Size: 2 billion words
The WaCKy corpora are a set of four large corpora sampled from the World Wide Web by means of seed word lists taken from other corpora. The corpora are available in four languages: German, Italian, French and British English. They can be obtained from the authors upon request.

International Corpus of Learner English (ICLE)

Size: 3.7 million words
The International Corpus of Learner English compiled at the Université catholique de Louvain is a corpus of writing by higher intermediate to advanced learners of English. It contains 3.7 million words of EFL writing from learners representing 16 mother tongue backgrounds (Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish and Tswana). ICLE can be accessed via DBIS. The printed manual is available upon request.

EF-Cambridge Open Language Database (EFCAMDAT 2)

Size: 83.5 million words
EFCAMDAT is an electronic resource containing samples of written language production by thousands of adult learners of English as a second language, covering all CEFR levels. It consists of short written assignments submitted to the Englishtown online language school. All files are tagged for part of speech and fully parsed, and more than two thirds of the material has been graded by instructors.

Thomson Reuters Text Research Collection (TRC2)

Size: 2,871,075,221 bytes (c. 2.9 GB)
The Thomson Reuters Text Research Collection is a corpus intended for information retrieval and text mining. It consists of 1,800,370 news stories covering the period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14. The corpus can be made available upon request.

Topic Annotated Enron Email Corpus

Size: c. 5000 email messages
This corpus contains about 5000 emails from Enron Corporation sent from January 2001 to December 2001. The emails are categorized to reflect the business activities and interests of Enron employees in that year. The corpus is a subset of the complete Enron Email corpus, which is freely available at http://www.cs.cmu.edu/~enron/. The Topic Annotated Enron Email Corpus is available upon request.

Last modified: Tuesday, 30 October 2018, 1:32 PM