Synchronic multi-purpose Corpora
The corpora in this group
- usually sample texts from many different registers in order to represent the language as it is used in one national variety.
- are synchronic, i.e. contain only material from one specific point in time
- contain present-day English
The Brown family (Brown, Frown, LOB, FLOB): 1st generation major corpora
Size: 1 million words each
The Brown Corpus was the first computer-readable general corpus of texts
prepared for linguistic research on modern English. It was compiled by
W. Nelson Francis and Henry Kučera at Brown University in the 1960s and
contains over 1 million words (500 samples of 2000+ words each) of
running text of edited English prose printed in the United States during
the calendar year 1961. The Brown Corpus has inspired a whole family of
corpora, including the Lancaster-Oslo/Bergen Corpus (LOB), Brown's
British English counterpart, as well as Frown and FLOB, the 1990s
equivalents of Brown and LOB respectively. Manuals for the corpora can
be found here. The Brown family is part of the ICAME-CD-ROM.
British National Corpus 1994 (BNC1994)
Size: 100 million words
The British National Corpus (BNC) is a 100 million word collection of
samples of written and spoken language from a wide range of sources,
designed to represent a wide cross-section of British English from the
later part of the 20th century. The latest
edition is the BNC XML Edition, released in 2007.
British National Corpus 2014 (BNC2014)
Size: 100 million words
The British National Corpus 2014 (BNC2014) is a 100 million word collection of
samples of written and spoken language from a wide range of sources,
designed to represent a wide cross-section of present-day British English. The spoken section is publicly available via CQPweb (hosted at either Uni Bamberg or Uni Lancaster) and can be downloaded (see Uni Lancaster for details: http://cass.lancs.ac.uk/bnc2014/). At the time of writing, the written section is only available via #LancsBox X (http://corpora.lancs.ac.uk/lancsbox/downloadx.php).
Corpus of Contemporary American English (COCA)
Size: 440+ million words
Compiled by Mark Davies' team at Brigham Young University, the
Corpus of Contemporary American English (COCA) is the largest
freely available corpus of English, and the only large and balanced
corpus of American English. COCA was released in 2008 and it is now used
by tens of thousands of users every month (linguists, teachers,
translators, and other researchers). It includes spoken as well as
written texts. Note, however, that this corpus is to a large extent
based on internet data and that spoken data are mostly compiled from TV
and radio shows. There are more corpora available from Mark Davies.
OpenANC (Subset of the American National Corpus)
Size: 14 million words
The Open ANC includes over 14 million words from the Second Release of the ANC that can be freely downloaded and distributed.
International Corpus of English
Size: 1 million words each
The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation. The following varieties are already available:
ICE-GB (Great Britain)
ICE-GB is the only fully POS-tagged and parsed corpus of the family. It can be used with its own software, ICECUP.
Other ICE varieties
- Canada
- East Africa
- Hong Kong
- India
- Ireland (North & South)
- Jamaica
- New Zealand (note: the first-person pronoun I is spelled in lowercase and mistagged as a numeral)
- Singapore
- Nigeria
- USA (written only)
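The New Zealand note in the list above can be worked around before querying. The following is a minimal sketch, not a documented procedure. It assumes that the tagged text uses the CLAWS horizontal layout (word_TAG tokens separated by whitespace), that the mistagged pronoun carries the CLAWS7 cardinal-number tag MC, and that the intended tag is PPIS1 (first-person singular subjective pronoun). Check a sample of the actual files before relying on any of these assumptions. The function name is hypothetical.

```python
import re

# Assumptions (not confirmed by the corpus documentation):
# - tokens look like word_TAG and are separated by whitespace
# - the mistagged pronoun appears as i_MC (MC = CLAWS7 cardinal number)
# - the intended tag is PPIS1 (CLAWS7 first-person singular subjective pronoun)
MISTAGGED_I = re.compile(r"(?<!\S)i_MC(?!\S)")


def retag_lowercase_i(line: str) -> str:
    """Replace standalone i_MC tokens with i_PPIS1 and leave all other tokens as they are."""
    return MISTAGGED_I.sub("i_PPIS1", line)


# Invented sample line for illustration only, not taken from the corpus.
sample = "well_RR i_MC think_VV0 chapter_NN1 ii_MC is_VBZ fine_JJ"
print(retag_lowercase_i(sample))
```

The lookarounds restrict the match to a whole token, so forms such as ii_MC stay untouched. A genuine lowercase Roman numeral i would also be retagged, so the hits are worth checking by hand.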
POS-tagged (CLAWS7) and semantically tagged (USAS) versions of ICE Canada, Hong Kong, India, New Zealand, Philippines, Singapore and the written parts of ICE-USA and Nigeria are available upon request. Please note that ICE Nigeria uses a different POS tag format (vertical Penn Treebank) from the other components (CLAWS).
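Because of this format difference, scripts written for the other components may need a small adapter for ICE Nigeria. The sketch below shows one way to read a vertical file and emit word_TAG lines. The exact file layout is an assumption here: one token per line, token and tag separated by a tab, and a blank line between sentences. The function names and the sample data are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def read_vertical(lines: Iterable[str]) -> Iterator[TaggedSentence]:
    """Yield sentences as lists of (token, tag) pairs.

    Assumed layout: "token<TAB>tag" per line, blank line between sentences.
    """
    sentence: TaggedSentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            # Skip lines that are not token/tag pairs (e.g. markup).
            continue
        sentence.append((parts[0], parts[1]))
    if sentence:
        yield sentence


def to_horizontal(sentence: TaggedSentence) -> str:
    """Join (token, tag) pairs into a single word_TAG line."""
    return " ".join(f"{token}_{tag}" for token, tag in sentence)


# Invented sample with Penn Treebank tags, for illustration only.
vertical = ["She\tPRP\n", "works\tVBZ\n", "here\tRB\n", ".\t.\n", "\n", "Thanks\tNNS\n"]
for sent in read_vertical(vertical):
    print(to_horizontal(sent))
```

This only changes the layout. The tags remain Penn Treebank tags, so queries written for CLAWS7 tags still need to be adapted to the different tagset.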