Synchronic multi-purpose Corpora
The corpora in this group
- usually sample texts from many different registers in order to represent the language as it is used in one national variety.
- are synchronic, i.e. contain only material from one specific point in time
- contain present-day English
Size: 1 million words each
The Brown Corpus was the first computer-readable general corpus of texts prepared for linguistic research on modern English. It was compiled by W. Nelson Francis and Henry Kučera at Brown University in the 1960s and consists of over 1 million words (500 samples of 2000+ words each) of running text of edited English prose printed in the United States during the calendar year 1961. The Brown Corpus has inspired a whole family of corpora, including the Lancaster-Oslo/Bergen Corpus (LOB), Brown's British English counterpart, as well as Frown and FLOB, the 1990s equivalents of Brown and LOB respectively. Manuals for the corpora can be found here. The Brown family is part of the ICAME CD-ROM.
Size: 100 million words
The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century. The latest edition is the BNC XML Edition, released in 2007. Three different interfaces can be used for this corpus:
Please note: BNCweb is only available from inside the university's network or with an enabled VPN connection. You can register for a personal account here.
Mark Davies' web interface, hosted at Brigham Young University in Utah. Easy to use, with fancy features like timeline diagrams. However, only small sections of the corpus texts are displayed due to copyright restrictions.
Size: 440+ million words
Compiled by Mark Davies' team at Brigham Young University, the Corpus of Contemporary American English (COCA) is the largest freely available corpus of English, and the only large and balanced corpus of American English. COCA was released in 2008 and is now used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). It includes spoken as well as written texts. Note, however, that this corpus is to a large extent based on internet data and that the spoken data are mostly compiled from TV and radio shows. There are more corpora available from Mark Davies.
Size: 14 million words
The Open ANC includes over 14 million words from the Second Release of the American National Corpus (ANC) that can be freely downloaded and distributed.
The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation. The following varieties are already available:
Other ICE varieties
- East Africa
- Ireland (North & South)
- New Zealand (hint: 1st person pronoun I is spelled in small letters and mistagged as numeral!)
- USA (written only)
These corpora are neither tagged nor parsed and can be used with WordSmith Tools. Note that the markup of ICE Nigeria does not follow the criteria used by the other components.
POS-tagged (CLAWS7) and semantically tagged (USAS) versions of ICE Canada, Hong Kong, India, New Zealand, Philippines, Singapore and the written parts of ICE-USA and Nigeria are available upon request. Please note that ICE Nigeria employs a different POS tag format (vertical Penn Treebank) than the other components.
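The mistagging noted for ICE New Zealand (lowercase "i" tagged as a numeral) can be worked around before querying a tagged version. The following is only a minimal sketch: it assumes that tokens are stored as whitespace-separated `word_TAG` pairs, which may not match the layout of the files you receive, and the function name `retag_lowercase_i` is hypothetical. `PPIS1` is the CLAWS7 tag for the subject pronoun "I", and CLAWS7 numeral tags begin with `M` (e.g. `MC`).

```python
# Assumption: one token per whitespace-separated item, written as word_TAG.
# The actual layout of the tagged ICE files may differ.
NUMERAL_PREFIX = "M"    # CLAWS7 numeral tags (MC, MC1, MD, ...) start with M
PRONOUN_TAG = "PPIS1"   # CLAWS7 tag for the subject pronoun "I"


def retag_lowercase_i(line: str) -> str:
    """Retag lowercase 'i' from a numeral tag to the pronoun tag."""
    fixed = []
    for token in line.split():
        # Split on the last underscore so words containing "_" stay intact.
        word, sep, tag = token.rpartition("_")
        if sep and word == "i" and tag.startswith(NUMERAL_PREFIX):
            token = f"{word}_{PRONOUN_TAG}"
        fixed.append(token)
    return " ".join(fixed)


if __name__ == "__main__":
    print(retag_lowercase_i("i_MC think_VV0 so_RR"))
```

Note that this sketch also retags genuine lowercase Roman numerals (e.g. "i" in an enumeration), so the results should be spot-checked against the corpus texts.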
Genres and their corresponding text codes can be found here.