Linguistic Resources: Synchronic multi-purpose Corpora

The Brown family (Brown, Frown, LOB, FLOB): 1st generation major corpora

Size: 1 million words each
The Brown Corpus was the first computer-readable general corpus of texts prepared for linguistic research on modern English. It was compiled by W. Nelson Francis and Henry Kučera at Brown University in the 1960s and consists of over 1 million words (500 samples of 2000+ words each) of running text of edited English prose printed in the United States during the calendar year 1961. The Brown Corpus has inspired a whole family of corpora, including the Lancaster-Oslo/Bergen Corpus (LOB), Brown's British English counterpart, as well as Frown and FLOB, the 1990s equivalents of Brown and LOB respectively. Manuals for the corpora can be found here. The Brown family is part of the ICAME-CD-ROM.

British National Corpus 1994 (BNC1994)

Size: 100 million words
The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century. The latest edition is the BNC XML Edition, released in 2007.

British National Corpus 2014 (BNC2014)

Size: 100 million words
The British National Corpus 2014 (BNC2014) is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of present-day British English. The spoken section is publicly available via CQPweb (hosted at either Uni Bamberg or Uni Lancaster) and can be downloaded (see Uni Lancaster for details: http://cass.lancs.ac.uk/bnc2014/). At the time of writing, the written section is only available via #LancsBox X (http://corpora.lancs.ac.uk/lancsbox/downloadx.php).

Corpus of Contemporary American English (COCA)

Size: 440+ million words
Compiled by Mark Davies' team at Brigham Young University, the Corpus of Contemporary American English (COCA) is the largest freely available corpus of English, and the only large and balanced corpus of American English. COCA was released in 2008 and is now used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). It includes spoken as well as written texts. Note, however, that this corpus is to a large extent based on internet data and that the spoken data are mostly compiled from TV and radio shows. There are more corpora available from Mark Davies.

OpenANC (Subset of the American National Corpus)

Size: 14 million words
The Open ANC includes over 14 million words from the Second Release of the ANC that can be freely downloaded and distributed.

International Corpus of English

Size: 1 million words each
The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation. The following varieties are already available:

ICE-GB (Great Britain)

ICE-GB is the only fully POS-tagged and parsed corpus of the family. It can be used with its own software, ICECUP.

Other ICE varieties

Canada
East Africa
Hong Kong
India
Ireland (North & South)
Jamaica
New Zealand (hint: the first-person pronoun I is spelled in lower case and mistagged as a numeral!)
Singapore
Nigeria
USA (written only)

POS-tagged (CLAWS7) and semantically tagged (USAS) versions of ICE Canada, Hong Kong, India, New Zealand, Philippines, Singapore and the written parts of ICE-USA and Nigeria are available upon request. Please note that ICE Nigeria uses a different POS tag format (vertical Penn Treebank) from the other components (CLAWS).
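
The hint about ICE New Zealand (the pronoun I written in lower case and tagged as a numeral) can be worked around before querying the tagged files. The following is only a minimal sketch under stated assumptions: it assumes tokens are stored as word_TAG pairs separated by whitespace and that the mistagged pronoun carries the CLAWS7 cardinal tag MC. The actual file layout and the tag used may differ, so check a sample of the data first. The function name retag_pronoun_i is hypothetical.

```python
# Assumption: each token has the form word_TAG and tokens are separated by whitespace.
# Assumption: the mistagged pronoun appears as lowercase "i" with the tag MC.
NUMERAL_TAGS = {"MC"}   # assumed numeral tag(s) wrongly assigned to the pronoun
PRONOUN_TAG = "PPIS1"   # CLAWS7 tag for the first-person singular subjective pronoun


def retag_pronoun_i(line: str) -> str:
    """Return the line with lowercase 'i' tokens retagged from numeral to pronoun."""
    fixed = []
    for token in line.split():
        # Split on the last underscore so that words containing "_" stay intact.
        word, sep, tag = token.rpartition("_")
        if sep and word == "i" and tag in NUMERAL_TAGS:
            fixed.append(f"{word}_{PRONOUN_TAG}")
        else:
            fixed.append(token)
    return " ".join(fixed)


if __name__ == "__main__":
    print(retag_pronoun_i("i_MC think_VV0 so_RR"))
```

Note that this sketch also retags a genuine lowercase Roman numeral i (for example in list numbering), so the results should be spot-checked in a concordance.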

Last modified: Wednesday, 29 November 2023, 12:03 PM