Size: c. 4 billion tokens (25 GB of raw text)
The English Gigaword Corpus is a collection of English newswire texts from several news agencies (Los Angeles Times, Agence France-Presse …) acquired by the LDC at the University of Pennsylvania. The English component and smaller datasets in Spanish and French are available upon request.
The SUSANNE Corpus was created, with the sponsorship of the Economic and
Social Research Council (UK), as part of the process of developing a
comprehensive language-engineering-oriented taxonomy and annotation
scheme for the (logical and surface) grammar of English. The SUSANNE
scheme attempts to provide a method of representing all aspects of
English grammar which are sufficiently definite to be susceptible of
formal annotation, with the categories and boundaries between categories
specified in sufficient detail that, ideally, two analysts
independently annotating the same text and referring to the same scheme
must produce the same structural analysis.
The Wellington Corpus of Written New Zealand English (WWC) comprises one million words of written New Zealand English collected from writings published in the years 1986 to 1990. The WWC has the same basic categories as the Brown Corpus of written American English (1961) and the Lancaster-Oslo-Bergen corpus (LOB) of written British English (1961). The corpus also parallels the structure of the Macquarie Corpus of written Australian English (1986). The WWC consists of 2,000-word excerpts on a variety of topics. Text categories include press material, religious texts, skills, trades and hobbies, popular lore, biography, scholarly writing and fiction. This corpus is part of the ICAME CD-ROM and is available upon request.
The Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. The following written corpora are available upon request:
- Edinburgh Associative Thesaurus (EAT)
- Edited Polytechnic of Wales Corpus (EPOW)
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person.
Size: 1.1 million words
The LLS research group at the Université de Savoie collected this corpus of academic texts (1.1 million words) written by French learners of English in the 2nd and 3rd year of a university degree. It is accessible through an online interface.
This corpus is a collection of public USENET postings gathered between October 2005 and January 2011. It covers 47,860 English-language, non-binary-file newsgroups. The corpus is untagged raw text.
Size: 2 billion words
The WaCKy corpora are a set of four large corpora sampled from the world wide web by means of a seed word list taken from other corpora. The corpora are available in four languages: German, Italian, French and British English. The corpora can be obtained from the authors upon request.
Size: 3.7 million words
The International Corpus of Learner English compiled at the Université catholique de Louvain is a corpus of writing by higher intermediate to advanced learners of English. It contains 3.7 million words of EFL writing from learners representing 16 mother tongue backgrounds (Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish and Tswana). ICLE can be accessed via DBIS. The printed manual is available upon request.
Size: 83.5 million words
EFCAMDAT is an electronic resource containing samples of written language production by thousands of adult learners of English as a second language, covering all CEFR levels. It consists of short written assignments submitted to the Englishtown online language school. All files are tagged for part of speech and fully parsed, and more than two thirds of the material has been graded by instructors.
Size: 2,871,075,221 bytes
The Thomson Reuters Text Research Collection is a corpus intended for information retrieval and text mining. It consists of 1,800,370 news stories covering the period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14. The corpus can be made available upon request.
Size: c. 5000 email messages
This corpus contains about 5000 emails from Enron Corporation sent from January 2001 to December 2001. The emails are categorized to reflect the business activities and interests of Enron employees in that year. The corpus is a subset of the complete Enron Email corpus freely available at http://www.cs.cmu.edu/~enron/. The Topic Annotated Enron Email Corpus is available upon request.