Other Corpora: Spoken
Pop Lyrics Corpus (by Valentin Werner)
Web interface to the Pop Lyrics Corpus (by Valentin Werner). Available upon request.
The IViE Corpus: English Intonation in the British Isles
The IViE Corpus contains recordings of nine urban dialects of English spoken in the British Isles. Recordings of male and female speakers were made in London, Cambridge, Cardiff, Liverpool, Bradford, Leeds, Newcastle, Belfast in Northern Ireland and Dublin in the Republic of Ireland. The IViE Corpus is available upon request.
Santa Barbara Corpus of Spoken American English (SBCSAE)
The Santa Barbara Corpus of Spoken American English is based on a large
body of recordings of naturally occurring spoken interaction from all
over the United States. The Santa Barbara Corpus represents a wide
variety of people of different regional origins, ages, occupations,
genders, and ethnic and social backgrounds. The predominant form of
language use represented is face-to-face conversation, but the corpus
also documents many other ways that people use language in their
everyday lives: telephone conversations, card games, food preparation,
on-the-job talk, classroom lectures, sermons, story-telling, town hall
meetings, tour-guide spiels, and more. The corpus is available upon
request.
Diachronic Corpus of Present Day Spoken English (DCPSE)
DCPSE is a new parsed corpus of spoken English available on CD-ROM. It contains more than 400,000 words from ICE-GB (collected in the early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s-early 1980s). The orthographic transcriptions have been normalised and annotated according to the same criteria. ICE-GB was used as a gold standard for the parsing of DCPSE. The parsing has been corrected by a variety of methods to provide as high a quality of result as possible. DCPSE is an incomparable resource for examining recent change in the grammar of spoken English. DCPSE is available upon request.
CHRISTINE Corpus (Geoffrey Sampson)
The CHRISTINE Corpus is a structurally-annotated sample of spoken
English. The sample is based on extracts from the
“demographically-sampled” speech section of the British National Corpus.
It therefore forms a suitable resource for studying grammatical and
other structural features in the spontaneous, informal usage of a
cross-section of speakers drawn from all social classes and regions of
the United Kingdom in the 1990s.
The Wellington Corpus of Spoken New Zealand English (WSC)
One million words of spoken New Zealand English collected in the years
1988 to 1994. The corpus consists of 2,000-word extracts (where
possible) and comprises different proportions of formal, semi-formal and
informal speech. Both monologue and dialogue categories are included
and there is broadcast as well as private material collected in a range
of settings. Seventy-five percent of the corpus is informal dialogue.
WSC is part of the ICAME CD-ROM.
Vienna-Oxford International Corpus of English (VOICE)
VOICE is based on audio-recordings of 151 naturally-occurring,
non-scripted, face-to-face interactions involving 753 identified
individuals from 49 different first language backgrounds using English
as a lingua franca (ELF), i.e. English used as a common means of
communication among speakers from different first-language backgrounds.
Size: 1,023,127 orthographically defined words, totalling 110 hours 35
minutes and 56 seconds of recording. The corpus is available for
download in XML format (including POS tags).
NPS Chat Corpus
The Naval Postgraduate School (NPS) has compiled a corpus of roughly
10,000 posts that have been privacy masked, POS tagged and dialogue-act
tagged. It is available upon request.