Other Corpora: Written
English Gigaword (Fifth Edition)
Size: c. 4 billion tokens (25 GB of raw text)
The English Gigaword Corpus is a collection of English newswire texts from
several news agencies (Los Angeles Times, Agence France-Presse …)
acquired by the LDC at the University of Pennsylvania. The
English component and smaller datasets in Spanish and French are
available upon request.
TIME Magazine Corpus of American English
275,000 texts (100 million words) from TIME magazine, covering the period from 1923 to the 1990s. Easily searchable with the BYU interface.
SUSANNE Corpus (Geoffrey Sampson)
The SUSANNE Corpus was created, with the sponsorship of the Economic and
Social Research Council (UK), as part of the process of developing a
comprehensive language-engineering-oriented taxonomy and annotation
scheme for the (logical and surface) grammar of English. The SUSANNE
scheme attempts to provide a method of representing all aspects of
English grammar which are sufficiently definite to be susceptible of
formal annotation, with the categories and boundaries between categories
specified in sufficient detail that, ideally, two analysts
independently annotating the same text and referring to the same scheme
must produce the same structural analysis.
LUCY Corpus (Geoffrey Sampson)
The LUCY Corpus is an electronic sample of modern written English produced in the UK by a spectrum of writers ranging from skilled published authors to young children, equipped with detailed annotation identifying grammatical and other linguistic structure. Compilation of the LUCY Corpus was sponsored by the Economic and Social Research Council (UK), under grant R000 238146, 2000-03, and was carried out at the University of Sussex.
The Wellington Corpus of Written New Zealand English (WWC)
One million words of written New Zealand English collected from writings published in the years 1986 to 1990. The WWC has the same basic categories as the Brown Corpus of written American English (1961) and the Lancaster-Oslo-Bergen corpus (LOB) of written British English (1961). The corpus also parallels the structure of the Macquarie Corpus of written Australian English (1986). The WWC consists of 2,000-word excerpts on a variety of topics. Text categories include press material, religious texts, skills, trades and hobbies, popular lore, biography, scholarly writing and fiction. This corpus is part of the ICAME CD-ROM and is available upon request.
Oxford Text Archive (OTA)
The Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. The following written corpora are available upon request:
- Edinburgh Associative Thesaurus (EAT)
- Edited Polytechnic of Wales Corpus (EPOW)
The Blog Authorship Corpus
Size: 170 million words
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person.
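The per-blogger averages quoted in this description can be checked from the stated totals. The following is a minimal sketch in Python; the variable names are illustrative, and the word count is taken as exactly 140 million even though the description says "over 140 million", so the words-per-blogger result is a lower bound.

```python
# Totals as quoted in the corpus description.
bloggers = 19_320
posts = 681_288
words = 140_000_000  # assumption: "over 140 million" treated as a lower bound

# Simple averages per blogger.
posts_per_blogger = posts / bloggers
words_per_blogger = words / bloggers

print(f"posts per blogger: {posts_per_blogger:.1f}")
print(f"words per blogger: {words_per_blogger:.0f}")
```

Under these assumptions the averages come out at roughly 35.3 posts and about 7,246 words per blogger, in line with the rounded figures given in the description.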
Scientext: Learner Corpus of English
Size: 1.1 million words
The LLS research group at the Université de Savoie collected this corpus
of academic texts written by French learners of English (1.1 million words) in
the 2nd and 3rd year of a university degree. It is accessible through
an online interface.
WestburyLAB Usenet Corpus
Size: 30 billion words
This corpus is a collection of public USENET postings collected between Oct 2005 and Jan 2011. It covers 47,860 English-language, non-binary-file newsgroups. The corpus is untagged raw text.
WaCKy corpora (UKwac)
Size: 2 billion words
The WaCKy corpora are a set of four large corpora sampled from the world
wide web by means of a seed word list taken from other corpora. The
corpora are available in four languages: German, Italian, French and
British English. The corpora can be obtained from the authors upon
request.
International Corpus of Learner English (ICLE)
Size: 3.7 million words
The International Corpus of Learner English compiled at the Université
catholique de Louvain is a corpus of writing by higher intermediate to
advanced learners of English. It contains 3.7 million words of EFL
writing from learners representing 16 mother tongue backgrounds
(Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian,
Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish and
Tswana). The printed manual is available
upon request. For access, please contact one of the staff members.
EF-Cambridge Open Language Database (EFCAMDAT 2)
Size: 83.5 million words
EFCAMDAT is an electronic resource
containing samples of written language production by thousands of
adult learners of English as a second language, covering all CEFR levels. It consists of short written assignments submitted to the Englishtown online language school. All files are tagged for parts-of-speech and fully parsed, and more than two thirds of the material has been graded by instructors.
Thomson Reuters Text Research Collection (TRC2)
Size: 2,871,075,221 bytes
The Thomson Reuters Text Research Collection is a corpus aimed at
information retrieval and text mining purposes. It consists of
1,800,370 news stories covering the period from 2008-01-01 00:00:03 to
2009-02-28 23:54:14. The corpus can be made available upon request.
Topic Annotated Enron Email Corpus
Size: c. 5000 email messages
This corpus contains about 5000 emails from Enron Corporation sent from
January 2001 to December 2001. The emails are categorized to reflect the
business activities and interests of Enron employees in that year. The
corpus is a subset of the complete Enron Email corpus freely available
at http://www.cs.cmu.edu/~enron/. The Topic Annotated Enron Email Corpus is available upon request.