Querying Corpora: Simple Queries with wildcards
Corpora nowadays come in all forms and flavours. While there is a tendency to access corpora via web interfaces (e.g. BNCweb, COCA, COHA, …), in some cases it is desirable to store the corpus files on your own hard drive and search them with specialized programs like AntConc or WordSmith Tools.
Whichever interface is used for corpus queries, the search syntax for simple queries is the same: just type in the 'word' you are looking for. While the basic search functions support wildcards and searches for part-of-speech tags and lemmata, some of the more complex queries can only be performed with regular expressions.
The following article will present you with a brief overview of the different simple query functions for each interface.
-- AntConc --
AntConc is probably the best-known freeware corpus program out there. It comes with a wide range of useful tools for working with corpora (concordance, cluster search, n-grams, collocates, (key)word lists) and has built-in support for regular expressions and tags. It is also fast and completely free of charge.
Even when working with the simple search function (“Search Term” set to “Words”), AntConc supports wildcards and part-of-speech tags. The next chapters can serve as a quick reference for which wildcards are supported and how to query plain text and tagged corpora alike.
Supported wildcards
Wildcard | function |
---|---|
* | match zero or more characters |
+ | match zero or one character |
? | match any one character |
@ | match zero or one word |
# | match any one word |
| | boolean operator 'OR' |
& | match non word (e.g. !,.?@) |
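To make the table more concrete, the following minimal Python sketch translates most of these wildcards into ordinary regular expressions. It is only an approximation and not AntConc's own implementation: the function name `wildcard_to_regex` is made up for this illustration, a "word" is assumed to be any run of non-space characters, matching is case-sensitive, and the `@` and `&` wildcards are left out.

```python
import re

# Assumed regex equivalents (an approximation, not AntConc's internals).
# A "word" is treated here as a run of non-space characters.
WILDCARDS = {
    "*": r"\S*",  # zero or more characters
    "+": r"\S?",  # zero or one character
    "?": r"\S",   # any one character
    "#": r"\S+",  # any one word
    "|": "|",     # boolean OR
}


def wildcard_to_regex(query):
    """Translate a wildcard query into a compiled regular expression."""
    body = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in query)
    # The lookarounds make sure a match starts and ends at a word edge.
    return re.compile(r"(?<!\S)(?:" + body + r")(?!\S)")


print(wildcard_to_regex("fun+").findall("fun funny fund"))
# ['fun', 'fund']
print(wildcard_to_regex("colo?r").findall("color colour colomr"))
# ['colour', 'colomr']
```

Because `|` is translated into a plain regex alternation, it splits the whole query string into alternatives rather than just the neighbouring words.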
Working with plain text corpora
Plain text corpora contain no part-of-speech information, so abstract queries like “singular nouns starting with h” prove to be difficult, if not impossible.
Query Examples
search string | description | match |
---|---|---|
fun* | “fun” + zero or more characters | fun, funny, funeral |
fun+ | “fun” + zero or one character | fun, fund |
colo?r | “colo” + any one character + “r” | colour, colosr, colomr, but not color |
colo+r | “colo” + zero or one character + “r” | colour, color, colomr |
you # him | “you” + any one word + “him” | you tell him, you love him, you send him |
I|you|he # him | “I” OR “you” OR “he” + any one word + “him” | I, you, he told him, he send him; but not I send him! |
I # him|you # him|he # him | “I” + any one word + “him” OR “you” + any one word + “him” OR “he” + any one word + “him” | I send him, you told him, he told him |
Working with part-of-speech tags
Example for POS-tagged data in horizontal format:
This_DT0 is_VBZ a_AT0 sample_NN1 text_NN1 to_TO0 illustrate_VVI the_AT0 format_SENT ._PUN
Using AntConc with part-of-speech (POS) tagged corpora enables us to perform quite powerful searches. Tagged corpora come in various data structures, i.e. horizontal, vertical and XML. The following chapter assumes that your POS-tagged corpus is in a horizontal format: WORD_TAG. Most of the ICE (International Corpus of English) corpora come with a POS-tagged version.
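If you ever need to process such a file outside of AntConc, the horizontal WORD_TAG format is easy to take apart. The following Python sketch assumes that tokens are separated by whitespace and that the tag follows the last underscore of each token. The function name `parse_horizontal` is purely illustrative.

```python
def parse_horizontal(line):
    """Split a horizontally formatted line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split at the LAST underscore, so a word that itself
        # contains "_" stays intact.
        word, sep, tag = token.rpartition("_")
        if not sep:
            # No underscore at all: treat the token as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


sample = "This_DT0 is_VBZ a_AT0 sample_NN1 text_NN1"
for word, tag in parse_horizontal(sample):
    print(word, tag)
```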
There are several different schemes for tagging (also called tagsets). Please make sure you know which tagset your corpus is tagged with. The most common ones are CLAWS 5 (BNCweb), CLAWS 7 (ICE corpora, COCA, COHA, newer versions of BNCweb) and the Penn Treebank.
Query Examples
The difference between working with plain text corpora and POS-tagged corpora is that we can include the tags (in this case we're working with the CLAWS 7 tagset) in our search syntax.
search string | description | match |
---|---|---|
pre*_JJ | “pre” + zero or more characters + “_JJ”: search for adjectives starting with pre | precious_JJ, pregnant_JJ, pretty_JJ |
A*_N* | “A” + zero or more characters + “_N” + zero or more characters: search for nouns starting with “A” | Asians_NN2, apple_NN1, airport_NN1, Adams_NP1 |
I_* like_* | “I_” + zero or more characters + space + “like_” + zero or more characters: search for “I like”, ignore tagging | I_PPIS1 like_VV0, i_mc1 like_II (mistagged!) |
*_PP?S& *_V* *_PP?O& | zero or more characters + “_PP” + any one character + “S” + non-word + space + zero or more characters + “_V” + zero or more characters + space + zero or more characters + “_PP” + any one character + “O” + non-word: personal pronoun (subjective) + verb + personal pronoun (objective) | I_PPIS1 asked_VVD her_PPHO1, She_PPHS1 remembers_VVZ me_PPIO1 |
*_AT* ???*ing_N* of | zero or more characters + “_AT” + zero or more characters + space + any one character (three times) + zero or more characters + “ing_N” + zero or more characters + space + “of”: article + noun ending in “ing” (at least 6 characters long) + of | the_AT cuddling_ of_IO, a_AT1 beginning_NN1 of_IO |
&_& *_I* | non-word + “_” + non-word + space + zero or more characters + “_I” + zero or more characters: non-words + prepositions | ._. By_II, ._. To_II, ._. For_IF, ?_? Like_II, “_” to_II |
*_PPY|*_PPIS*|*_PPH1|*_PPHS* | (zero or more characters + “_PPY”) OR (zero or more characters + “_PPIS” + zero or more characters) OR (zero or more characters + “_PPH1”) OR (zero or more characters + “_PPHS” + zero or more characters) | All subject personal pronouns: I, you, he, she, it, we, you, they. Corresponding CLAWS7 tags: PPY, PPIS1, PPIS2, PPH1, PPHS1, PPHS2 |
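The logic behind a tag-sequence query such as “pronoun + verb + pronoun” can be sketched in a few lines of Python. This is not how AntConc works internally; it is a rough illustration that assumes the horizontal WORD_TAG format, looks only at the tags, and uses Python's `fnmatch` patterns, where `?` is any one character and `*` is zero or more characters. The function name `find_tag_sequences` and the sample text are made up for this example.

```python
from fnmatch import fnmatchcase


def find_tag_sequences(tagged_text, tag_patterns):
    """Return word sequences whose tags match the patterns in order."""
    # Each token becomes (word, "_", tag), split at the last underscore.
    tokens = [tok.rpartition("_") for tok in tagged_text.split()]
    size = len(tag_patterns)
    hits = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(fnmatchcase(tag, pattern)
               for (_, _, tag), pattern in zip(window, tag_patterns)):
            hits.append(" ".join(word for word, _, _ in window))
    return hits


text = "I_PPIS1 asked_VVD her_PPHO1 ._. She_PPHS1 remembers_VVZ me_PPIO1 ._."
print(find_tag_sequences(text, ["PP?S*", "V*", "PP?O*"]))
# ['I asked her', 'She remembers me']
```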
-- BNCweb --
In BNCweb, we have three options to choose from when searching: CQP Syntax, Simple Query Syntax (case insensitive), Simple Query Syntax (case sensitive).
The CQP Syntax will be described further down when we're dealing with CQPweb (e.g. our version of COCA). It is recommended to use Simple Query Syntax (case insensitive). There is already a very good description of this query syntax with good examples:
BNCweb query syntax cheat sheet.
(BNCweb is tagged with CLAWS 5 tagset)
-- CQPweb --
While the corpora hosted on our server look strikingly similar to BNCweb, the syntax is somewhat different. This is mostly a result of these corpora being tagged with CLAWS 7. Also, these corpora do not support simplified tags as BNCweb does. In the following chapter you will find an overview of the CQP Syntax and the Simplified Query Syntax.
Please note that the POS tagging in our version of COCA & COHA is case-sensitive and lowercase! So, for example, use vvg instead of VVG. The updated version of COCA (i.e. COCA 2017 update), on the other hand, uses upper case letters for POS tags! The easiest way to find out which corpus uses which POS tags is to search for a random word and hover over the search word with the mouse.
CQP Syntax
The CQP syntax is very easy to learn and convenient to use. Don't forget to set “Query mode” to “CQP syntax”.
Examples
Query | Description | Match |
---|---|---|
[lemma="house"] | search for lemma “house” | house, houses, housing, … |
[lemma="house" & pos="nn1"] | search for lemma house (tagged as singular noun nn1) | house |
[lemma="house" & pos!="nn1"] | search for lemma house (not tagged as nn1) | houses, housing |
[word="apples"] | search for apples | apples |
[pos="jj"][lemma="house" & pos="nn1"] | search for general adjective + lemma house (tagged as nn1) | green house, white house, strict house, main house, … |
"as" []{1,3} "as"; | as + one to three random tokens + as | as good as, as well as, as immediate to me as, as much reliable as |
"no" "sooner" []* "than" within s; | no + sooner + random tokens + than (all within the same sentence) | no sooner than, no sooner satisfied than, no sooner does he eat chocolate than |
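To illustrate what a gap such as []{1,3} means, here is a small Python sketch that scans a list of tokens for a first word, a bounded number of arbitrary tokens and a last word. It is a rough model only: the function name `find_with_gap` is hypothetical, the comparison is an exact, case-sensitive string match, and the sketch reports every qualifying span without trying to reproduce how CQP itself chooses between overlapping matches.

```python
def find_with_gap(tokens, first, last, min_gap=1, max_gap=3):
    """Find `first`, then min_gap..max_gap arbitrary tokens, then `last`."""
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + gap + 1  # position where `last` would have to be
            if j < len(tokens) and tokens[j] == last:
                hits.append(" ".join(tokens[i:j + 1]))
    return hits


tokens = "as good as new , but this is as immediate to me as ever".split()
print(find_with_gap(tokens, "as", "as"))
# ['as good as', 'as immediate to me as']
```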
Simple Query Syntax
Do not confuse “Simple Query Syntax” with SQL (“Structured Query Language”, which is something else entirely!) or with the simple query examples, which are meant to be examples of simple queries in each syntax (regular expressions, CQP, simple query syntax).
The default query mode in CQPweb is “simple query”. Simple queries may not be as powerful as full regular expressions, but they are convenient and provide handy functions (e.g. proximity searches with “<<range>>”) for corpus queries. While the syntax does support some of the character classes (\w, \d, \s, …), it does not support lookarounds, grouping or some quantifiers (e.g. {3,}).
Examples
Query | Description | Match |
---|---|---|
s?ng | s + one arbitrary character + ng | sing, sang, sung, ssng |
day* | day + zero or more characters | day, days, daylight, daydream |
+able | one or more characters + able | capable, readable, searchable, but not able |
??+able | three or more characters + able | capable, readable, but not unable, able, stable |
{lay} | search for lemma lay | lay, lays, laid |
day <<4>> night | search for all occurrences of day which appear in a 4-token window of night | day and night, “during the day and during the night” |
day <<5<< night | search for all occurrences of day which appear in a 5-token window after night | night (…) day |
day >>5>> night | search for all occurrences of day which appear in a 5-token window before night | day (…) night |
+_at green_jj +_nn1 | one or more characters (tagged as article) + green (tagged as adjective) + one or more characters (tagged as singular noun) | the green ball |
+_r* (green_jj | red_jj ) {house} | one or more characters (tagged as adverb) + either green or red (tagged as adjective) + lemma house | very green houses, incredibly red house |
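The idea behind the proximity searches in the table can be sketched in Python as follows. This is only an illustration of the windowing logic, not CQPweb's implementation: the function name `within_window` and its `direction` parameter are hypothetical, and the sketch assumes that an n-token window means a distance of at most n tokens between the two words.

```python
def within_window(tokens, target, anchor, window, direction="both"):
    """Return positions of `target` that have `anchor` within `window` tokens.

    direction="both"  is meant to mirror  target <<n>> anchor
    direction="left"  is meant to mirror  target <<n<< anchor  (anchor first)
    direction="right" is meant to mirror  target >>n>> anchor  (anchor later)
    """
    hits = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        if direction == "left":
            context = left
        elif direction == "right":
            context = right
        else:
            context = left + right
        if anchor in context:
            hits.append(i)
    return hits


tokens = "during the day and during the night".split()
print(within_window(tokens, "day", "night", 4))  # [2]
print(within_window(tokens, "day", "night", 3))  # []
```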
-- www.english-corpora.org (formerly BYU; COCA, COHA, GloWbE...) --
Please consult the help pages of www.english-corpora.org or the Introduction to Corpus Basics (KorPLUS)