Linguistic Resources: Other Corpora: Spoken

Pop Lyrics Corpus (by Valentin Werner)

Web interface to the Pop Lyrics Corpus (by Valentin Werner). Available upon request.

The IViE Corpus: English Intonation in the British Isles

The IViE corpus contains recordings of nine urban dialects of English spoken in the British Isles. Recordings of male and female speakers were made in London, Cambridge, Cardiff, Liverpool, Bradford, Leeds, Newcastle, Belfast (Northern Ireland) and Dublin (Republic of Ireland). The IViE Corpus is available upon request.

Santa Barbara Corpus of Spoken American English (SBCSAE)

The Santa Barbara Corpus of Spoken American English is based on a large body of recordings of naturally occurring spoken interaction from all over the United States. The Santa Barbara Corpus represents a wide variety of people of different regional origins, ages, occupations, genders, and ethnic and social backgrounds. The predominant form of language use represented is face-to-face conversation, but the corpus also documents many other ways that people use language in their everyday lives: telephone conversations, card games, food preparation, on-the-job talk, classroom lectures, sermons, story-telling, town hall meetings, tour-guide spiels, and more. The corpus is available upon request.

Diachronic Corpus of Present Day Spoken English (DCPSE)

DCPSE is a new parsed corpus of spoken English available on CD-ROM. It contains more than 400,000 words from ICE-GB (collected in the early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s-early 1980s). The orthographic transcriptions have been normalised and annotated according to the same criteria. ICE-GB was used as a gold standard for the parsing of DCPSE. The parsing has been corrected by a variety of methods to provide as high a quality of result as possible. DCPSE is an incomparable resource for examining recent change in the grammar of spoken English. DCPSE is available upon request.

CHRISTINE Corpus (Geoffrey Sampson)

The CHRISTINE Corpus is a structurally-annotated sample of spoken English. The sample is based on extracts from the “demographically-sampled” speech section of the British National Corpus. It therefore forms a suitable resource for studying grammatical and other structural features in the spontaneous, informal usage of a cross-section of speakers drawn from all social classes and regions of the United Kingdom in the 1990s.

The Wellington Corpus of Spoken New Zealand English (WSC)

The WSC comprises one million words of spoken New Zealand English collected in the years 1988 to 1994. The corpus consists of 2,000-word extracts (where possible) and comprises different proportions of formal, semi-formal and informal speech. Both monologue and dialogue categories are included, and there is broadcast as well as private material collected in a range of settings. Seventy-five percent of the corpus is informal dialogue. WSC is part of the ICAME CD-ROM.

Vienna-Oxford International Corpus of English (VOICE)

VOICE is based on audio-recordings of 151 naturally-occurring, non-scripted, face-to-face interactions involving 753 identified individuals from 49 different first language backgrounds using English as a lingua franca (ELF), i.e. English used as a common means of communication among speakers from different first-language backgrounds. Size: 1,023,127 orthographically defined words, totalling 110 hours 35 minutes and 56 seconds of recording. The corpus is available for download in XML format (including POS tags).

NPS Chat Corpus

The Naval Postgraduate School (NPS) has compiled a corpus of roughly 10,000 posts that have been privacy-masked, POS-tagged and dialogue-act tagged. It is available upon request.

Last modified: Tuesday, 8 August 2023, 2:23 PM