Advanced Corpus Queries with Regular Expressions
Corpora nowadays come in all forms and flavours. While there is a tendency to access corpora via web interfaces (e.g. BNCweb, COCA, COHA, …), in some cases it is desirable to store the corpus files on your own hard drive and search them with specialized programs like AntConc or WordSmith Tools.
Whichever interface is used for corpus queries, the search syntax for simple queries is the same: just type in the 'word' you are looking for. While the basic search functions support wildcards and searches for part-of-speech tags and lemmata, some of the more complex queries can only be performed with regular expressions. Unfortunately, each interface/program has its own implementation of regular expressions (regex) or supports only certain types of regexes.
The following article will try to give a brief overview of how regular expressions can be used with the various interfaces and provide examples for corpus queries.
[A] regular expression (…) is a sequence of characters that
forms a search pattern, mainly for use in pattern matching with strings,
or string matching, i.e. “find and replace”-like operations.
http://en.wikipedia.org/wiki/Regular_expression
Regular Expression: Basics
Before you continue reading, please make yourself familiar with basic regular expressions. There are many great tutorials and instructions out there. For detailed instructions click here, or here for a cheat sheet.
You should be at least familiar with the terms escaping, character classes, quantifiers and wildcards.
-- AntConc --
OUTDATED - The implementation of regular expressions has changed in more recent versions of AntConc. This section will be updated soon.
The basic query (“Search term” set to “word”) already supports basic wildcards - I'd suggest using regular expressions only if you need to compile a more complex search string!
AntConc supports a wide variety of regular expressions, including lookahead & lookbehind assertions. Apart from the POSIX character classes (e.g. [:alpha:]), AntConc supports almost everything described on the cheat sheet.
Please note: the dot also matches linebreaks in AntConc. Make it non-greedy and include an end point for this wildcard:
Instead of .*? better use .*?\s .
Make sure to check “Regex” in the concordance view.
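The effect of an unbounded dot can be tried out outside AntConc. The following is a minimal sketch in Python, whose `re` module is not the engine AntConc uses; the `re.DOTALL` flag is only assumed here to imitate a dot that also matches linebreaks, and the sample text is made up.

```python
import re

# Made-up three-line "corpus"
text = "The art of war.\nAn artist paints.\nEnd of file."

# re.DOTALL lets the dot match "\n" as well, imitating the behaviour described above
greedy = re.search(r"art.*", text, flags=re.DOTALL).group()
bounded = re.search(r"art.*?\s", text, flags=re.DOTALL).group()

print(repr(greedy))   # runs from the first "art" to the very end of the text
print(repr(bounded))  # 'art '
```

The greedy version swallows everything up to the end of the text, while the non-greedy version with `\s` as an end point stops at the first whitespace.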
Supported Regex Functions
- Anchors
- \b Word boundary
- \B Not word boundary
- Character Classes
- \s White space
- \S Not white space
- \d Digit
- \D Not digit
- \w Word
- \W Not Word
- Assertions
- (?=)
- (?!)
- (?<=)
- (?<!)
- Quantifiers (add ? after quantifier to make it non-greedy)
- * 0 or more
- + 1 or more
- ? 0 or 1
- {3} exactly 3 times
- {3,} three or more
- {3,5} three, four or five
- Escape character: \
- Groups and Ranges
- . Any character
- (a|b) a or b
- () Group
- [abc] Range (a, b or c)
- [^abc] Not (a, b or c)
- [a-q] Range (a to q)
- [A-Q] Range (A to Q)
- [3-8] Range (3 to 8)
- Switches
- (?i) insert at beginning of regex to make search case insensitive
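To get a feeling for the quantifiers and the case-insensitivity switch listed above, here is a small sketch in Python. Python's `re` module happens to accept the same notation for these constructs, but it is not AntConc's engine, and the word list is invented for illustration.

```python
import re

words = "a be sea tree stone castle"

exactly_three = re.findall(r"\b\w{3}\b", words)    # {3}: exactly 3 times
three_or_more = re.findall(r"\b\w{3,}\b", words)   # {3,}: three or more
three_to_five = re.findall(r"\b\w{3,5}\b", words)  # {3,5}: three, four or five

# (?i) at the beginning of the regex makes the search case insensitive
case_insensitive = re.findall(r"(?i)\bit\b", "It is it")

print(exactly_three)     # ['sea']
print(three_or_more)     # ['sea', 'tree', 'stone', 'castle']
print(three_to_five)     # ['sea', 'tree', 'stone']
print(case_insensitive)  # ['It', 'it']
```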
Working with Plain Text Corpora
Plain text corpora contain no part-of-speech information, so abstract queries like “singular nouns starting with h” prove to be difficult, if not impossible. However, due to the flexibility of regular expressions, many advanced queries can still be performed. It is not necessary to use the word boundary anchor, but it helps to keep your regex from going rogue, i.e. matching everything (which, depending on the size of your corpus, crashes AntConc).
Simple Query Examples
regex | description | match |
---|---|---|
\bart\w*\b | word boundary (anchor) + “art” (character string) + word (character class) + 0 or more times (quantifier) + word boundary (anchor) | art, artists, articulated; but not: prearticulate, cart |
\b\w{1}art\b | word boundary (anchor) + word (character class) + 1 time (quantifier) + “art” (character string) + word boundary (anchor) | matches: part, cart, but not thwart |
\b\w{4,6}\b | word boundary (anchor) + word (character class) + 4-6 times (quantifier) + word boundary (anchor) | four to six letter words; matches: part, moment, danger, … |
\b[a-dA-D]\w*\b | word boundary (anchor) + single character from a to d or A to D (range) + word (character class) + 0 or more times (quantifier) + word boundary (anchor) | words starting with a, A, b, B, c, C, d or D; matches: cart, Art, Benjamin, Artist, artist |
\b[I]t\b | word boundary (anchor) + single character I (range) + “t” (string) + word boundary (anchor) | matches “It” but not “it” |
colou?r | “colo” (character string) + “u” (character string) + 0 or 1 times (quantifier) + “r” (character string) | color, colour |
\b\w+ness\b | word boundary (anchor) + word (character class) + 1 or more (quantifier) + “ness” (string) + word boundary (anchor) | otherness, awareness, peacefulness, business |
\byou\b\s\bsee\b | word boundary (anchor) + “you” (string) + word boundary (anchor) + whitespace (character class) + word boundary (anchor) + “see” (string) + word boundary (anchor) | you see |
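Some of the simple queries above can be tested on a single sentence before running them on a corpus. The sketch below uses Python's `re` module as a stand-in (an assumption, since AntConc's own engine may behave differently in details); the sample sentence and the variable names are made up.

```python
import re

# Invented test sentence containing the words discussed in the table
sample = ("An artist articulated that part of the cart is art; "
          "it can thwart awareness of colour or color.")

art_initial = re.findall(r"\bart\w*\b", sample)      # words starting with "art"
one_plus_art = re.findall(r"\b\w{1}art\b", sample)   # one character + "art"
ness_words = re.findall(r"\b\w+ness\b", sample)      # words ending in "ness"
colour_color = re.findall(r"colou?r", sample)        # optional "u"

print(art_initial)   # ['artist', 'articulated', 'art']
print(one_plus_art)  # ['part', 'cart']
print(ness_words)    # ['awareness']
print(colour_color)  # ['colour', 'color']
```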
Advanced Query Examples
regex | description | match |
---|---|---|
\bappl(?!ied)\w*\b | word boundary (anchor) + “appl” (character string) + GROUP( ?! (negative lookahead) + “ied” (string)) + word (character class) + 0 or more times (quantifier) + word boundary (anchor) | apple, applicable, application; but not: applied |
(?<!the\s)(?<!sell-out\s)\b\w*house\w*\b | GROUP( ?<! (negative lookbehind) + “the” (string) + whitespace (character class)) + GROUP( ?<! (negative lookbehind) + “sell-out” (string) + whitespace (character class)) + word boundary (anchor) + word (character class) + 0 or more times (quantifier) + “house” (character string) + word (character class) + 0 or more times (quantifier) + word boundary (anchor) | matches: a cathouse, a green warehouse, some houses, his house, but not: the house, the cathouse, the warehouse; but (!): the green warehouse, the green house |
\bgreen\b\s(?=house) | word boundary (anchor) + “green” (string) + word boundary (anchor) + whitespace (character class) + GROUP( ?=(positive lookahead) + “house” (string)) | the green house, a gray green house, but not: gray house, the house, a red house |
It is a good idea to test your regex on a single file before querying an entire corpus. Since AntConc strips the corpus of linebreaks, the greedy dot (.*) matches the entire corpus! Also, keep in mind that lookarounds do not support nesting, and that lookbehinds do not support variable-length quantifiers such as * or +! Here is a good and detailed description of lookarounds with many useful examples.
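The lookaround examples from the table can be checked on a few short strings. This is a sketch in Python, assuming its `re` module as a test bed rather than AntConc itself; the test strings are invented, and note that only the text actually consumed by the regex is returned, not the context checked by the lookaround.

```python
import re

# Negative lookahead: words starting with "appl", but not "applied"
appl = re.findall(r"\bappl(?!ied)\w*\b",
                  "apple applied applicable application")

# Two negative lookbehinds: "house" words not directly preceded by "the " or "sell-out "
houses = re.findall(r"(?<!the\s)(?<!sell-out\s)\b\w*house\w*\b",
                    "the house, a cathouse, the warehouse, the green warehouse, his house")

# Positive lookahead: "green" + whitespace, only if "house" follows
green = re.findall(r"\bgreen\b\s(?=house)",
                   "the green house and a gray house")

print(appl)    # ['apple', 'applicable', 'application']
print(houses)  # ['cathouse', 'warehouse', 'house']
print(green)   # ['green ']
```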
Working with part-of-speech tags
Example for POS-tagged data in horizontal format:
This_DT0 is_VBZ a_AT0 sample_NN1 text_NN1 to_TO0 illustrate_VVI the_AT0 format_NN1 ._PUN
Using AntConc with part-of-speech (POS) tagged corpora enables us to perform quite powerful searches. Tagged corpora come in various data structures, i.e. horizontal, vertical & XML. The following chapter assumes that your POS-tagged corpus is in a horizontal format: WORD_TAG . Most of the ICE (International Corpus of English) corpora come with a POS-tagged version.
There are several different schemes for tagging (also called tagsets). Please make sure you know which tagset your corpus is tagged with. The most common ones are CLAWS 5 (BNCweb), CLAWS 7 (ICE corpora, COCA, COHA, newer versions of BNCweb) and the Penn Treebank.
Simple Query Examples
The difference between working with plain text corpora and POS-tagged corpora is that we can include the tags (in this case we're working with the CLAWS 7 tagset) in our search syntax.
regex | description | match |
---|---|---|
\bart\w*\b | word boundary (anchor) + “art” (character string) + word (character class) + 0 or more times (quantifier) + word boundary (anchor) | art_NN1, artists_NN2, articulated_VVN; but not: cart_NN1 |
\b\w{1}art_\w*\b | word boundary (anchor) + word (character class) + 1 time (quantifier) + “art_” (character string) + word (character class) + 0 or more times (quantifier) + word boundary (anchor) | matches: part_NN1, cart_NN1, but not thwart_VVI |
\b\w{4,}_NN\w*\b | word boundary (anchor) + word (character class) + 4 or more times (quantifier) + “_NN” (string) + word (character class) + 0 or more times (quantifier) + word boundary (anchor) | singular and plural nouns longer than 3 letters; matches: work_NN1, complexity_NN1, ways_NN2, people_NN |
\b\w*_VVI\b | word boundary (anchor) + word (character class) + 0 or more times (quantifier) + “_VVI” (string) + word boundary (anchor) | words ending with _VVI tag, i.e. verbs in infinitive |
\blove\w*?_V\w*\b | word boundary (anchor) + “love” (string) + word (character class) + 0 or more times (quantifier) + non-greedy marker + “_V” (string) + word (character class) + 0 or more times (quantifier) + word boundary (anchor) | love as verb, e.g. love_VV0, loves_VVZ, love_VVI |
\b\w*?_PPY\b|\b\w*?_PPIS\w*?\b|\b\w*?_PPHS\w*?\b|\b\w*?_PPH1\b | (word boundary (anchor) + word (character class) + 0 or more times (quantifier) + non-greedy marker + “_PPY” (string) + word boundary (anchor)) OR (boolean operator) (word boundary (anchor) + word (character class) + 0 or more times (quantifier) + non-greedy marker + “_PPIS” (string) + word (character class) + 0 or more times (quantifier) + non-greedy marker + word boundary (anchor)) OR (boolean operator) (word boundary (anchor) + word (character class) + 0 or more times (quantifier) + non-greedy marker + “_PPHS” (string) + word (character class) + 0 or more times (quantifier) + non-greedy marker + word boundary (anchor)) OR (boolean operator) (word boundary (anchor) + word (character class) + 0 or more times (quantifier) + non-greedy marker + “_PPH1” (string) + word boundary (anchor)) | All personal pronouns: I, you, he, she, it, we, you, they; corresponding CLAWS7 tags: PPY, PPH1, PPHS1, PPHS2, PPIS1, PPIS2 |
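The tag-based queries can be tried on a short hand-tagged line in the horizontal WORD_TAG format. The following Python sketch is only an approximation of what AntConc does; the sample line and its CLAWS 7 tags were assigned by hand for illustration and are not taken from a real corpus.

```python
import re

# Hand-made sample in horizontal format (illustrative CLAWS 7 tags)
tagged = ("I_PPIS1 love_VV0 art_NN1 and_CC she_PPHS1 loves_VVZ "
          "to_TO thwart_VVI artists_NN2 and_CC people_NN")

infinitives = re.findall(r"\b\w*_VVI\b", tagged)
love_as_verb = re.findall(r"\blove\w*?_V\w*\b", tagged)
long_nouns = re.findall(r"\b\w{4,}_NN\w*\b", tagged)
pronouns = re.findall(
    r"\b\w*?_PPY\b|\b\w*?_PPIS\w*?\b|\b\w*?_PPHS\w*?\b|\b\w*?_PPH1\b", tagged)

print(infinitives)   # ['thwart_VVI']
print(love_as_verb)  # ['love_VV0', 'loves_VVZ']
print(long_nouns)    # ['artists_NN2', 'people_NN']
print(pronouns)      # ['I_PPIS1', 'she_PPHS1']
```

Note that art_NN1 is not returned by the noun query because the word has fewer than four letters.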
Advanced Query Examples
Don't get frustrated when working with regexes. They might seem confusing at the beginning, but are, once mastered, an incredibly powerful tool for your research. It is always a good idea to test your regular expression on a small file in an editor (e.g. Notepad++).
regex | description | match |
---|---|---|
(\b\w*_RGR\b|\b\w*_RGT\b)\s\b\w*_RR\w*\b | GROUP (word boundary (anchor) + word (character class) + 0 or more times (quantifier) + “_RGR” (string) + word boundary (anchor) OR word boundary (anchor) + word (character class) + 0 or more times (quantifier) + “_RGT” (string) + word boundary (anchor)) + whitespace (character class) + word boundary (anchor) + word (character class) + 0 or more times (quantifier) + “_RR” (string) + word (character class) + 0 or more times (quantifier) + word boundary (anchor) | matches all adverbs preceded by a comparative or superlative degree adverb; matches: less_RGR readily_RR, more_RGR critically_RR, most_RGT easily_RR, … |
(\w*_II\b\s|\w*_IO\b\s|\w*_IF\b\s|\w*_IW\b\s)(?!\w{1,}_AT\w{0,1}\b\s)((\b\w*_\w*\b\s){0,3}\b\w*_NN1\b) | GROUP (word (character class) + 0 or more times (quantifier) + “_II” (string) + word boundary (anchor) + whitespace (character class) OR word (character class) + 0 or more times (quantifier) + “_IO” (string) + word boundary (anchor) + whitespace (character class) OR word (character class) + 0 or more times (quantifier) + “_IF” (string) + word boundary (anchor) + whitespace (character class) OR word (character class) + 0 or more times (quantifier) + “_IW” (string) + word boundary (anchor) + whitespace (character class)) + GROUP (?! (negative lookahead) + word (character class) + 1 or more times (quantifier) + “_AT” (string) + word (character class) + 0 or 1 times (quantifier) + word boundary (anchor) + whitespace (character class)) + GROUP (GROUP (word boundary (anchor) + word (character class) + 0 or more times (quantifier) + “_” (string) + word (character class) + 0 or more times (quantifier) + word boundary (anchor) + whitespace (character class)) + 0 to 3 times (quantifier) + word boundary (anchor) + word (character class) + 0 or more times (quantifier) + “_NN1” (string) + word boundary (anchor)) | matches pattern: any preposition + NOT article + up to 3 optional words + singular noun; e.g. in_II danger_NN1, in_II very_RG great_JJ danger_NN1, but NOT in_II a_AT1 great_JJ danger_NN1 |
(\w*?_II\b\s|\w*?_IO\b\s|\w*?_IF\b\s|\w*?_IW\b\s)(?!\w{1,}_AT\w{0,1}\b\s)(((\b\w*?_R\w*?\b\s){0,2}\b\w*?_JJ\w*?\b\s){0,3}\b\w*?_NN1\b) | GROUP A (word (character class) + 0 or more times (quantifier, non-greedy) + “_II” (string) + word boundary (anchor) + whitespace (character class) OR word (character class) + 0 or more times (quantifier, non-greedy) + “_IO” (string) + word boundary (anchor) + whitespace (character class) OR word (character class) + 0 or more times (quantifier, non-greedy) + “_IF” (string) + word boundary (anchor) + whitespace (character class) OR word (character class) + 0 or more times (quantifier, non-greedy) + “_IW” (string) + word boundary (anchor) + whitespace (character class)) + GROUP B (?! (negative lookahead) + word (character class) + 1 or more times (quantifier) + “_AT” (string) + word (character class) + 0 or 1 times (quantifier) + word boundary (anchor) + whitespace (character class)) + GROUP C1 (GROUP C2 (GROUP C3 (word boundary (anchor) + word (character class) + 0 or more times (quantifier, non-greedy) + “_R” (string) + word (character class) + 0 or more times (quantifier, non-greedy) + word boundary (anchor) + whitespace (character class)) + 0 to 2 times (quantifier) + word boundary (anchor) + word (character class) + 0 or more times (quantifier, non-greedy) + “_JJ” (string) + word (character class) + 0 or more times (quantifier, non-greedy) + word boundary (anchor) + whitespace (character class)) + 0 to 3 times (quantifier) + word boundary (anchor) + word (character class) + 0 or more times (quantifier, non-greedy) + “_NN1” (string) + word boundary (anchor)) | matches pattern: any preposition + NOT article + up to 3 optional adjectives (each preceded by up to 2 optional adverbs) + singular noun; e.g. in_II danger_NN1, in_II very_RG great_JJ danger_NN1, but NOT in_II a_AT1 great_JJ danger_NN1 |
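The first advanced query (degree adverb + adverb) can be tried on a hand-tagged line. This is a sketch in Python under the assumption that its `re` module treats this pattern like AntConc does; the sample line and its CLAWS 7 tags are made up. Because the regex contains a group, `finditer` with `group(0)` is used to get the whole match instead of only the group.

```python
import re

# Hand-made sample line (illustrative CLAWS 7 tags)
tagged = ("She_PPHS1 answered_VVD more_RGR critically_RR and_CC "
          "most_RGT easily_RR but_CCB very_RG quickly_RR")

degree_adverb = re.compile(r"(\b\w*_RGR\b|\b\w*_RGT\b)\s\b\w*_RR\w*\b")

# group(0) is the complete match, not just the parenthesised part
hits = [m.group(0) for m in degree_adverb.finditer(tagged)]

print(hits)  # ['more_RGR critically_RR', 'most_RGT easily_RR']
```

The sequence very_RG quickly_RR is not returned, since RG is neither the comparative (RGR) nor the superlative (RGT) tag.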
In detail
Since the last two advanced examples probably look like nightmares to beginners, it's a good idea to discuss one of them in greater detail here. The following regex
(\w*_II\b\s|\w*_IO\b\s|\w*_IF\b\s|\w*_IW\b\s)(?!\w{1,}_AT\w{0,1}\b\s)(((\b\w*?_R\w*?\b\s){0,2}\b\w*?_JJ\w*?\b\s){0,3}\b\w*?_NN1\b)
can be split into 3 groups.
Our first group is responsible for matching all tokens that are tagged as prepositions:
(\w*_II\b\s|\w*_IO\b\s|\w*_IF\b\s|\w*_IW\b\s)
So, in essence we've created a group that matches 4 different possibilities. Namely:
- WORD_II + whitespace, e.g. (“through_II ”)
- WORD_IO + whitespace, (“of_IO ”)
- WORD_IF + whitespace, (“for_IF ”)
- WORD_IW + whitespace, (“with_IW ”)
If we have a match for this search string, the regex engine moves on to our next search string, the second group:
(?!\w{1,}_AT\w{0,1}\b\s)
This bit (a negative lookahead) tells the regex engine to restrict our previous match to prepositions that are not followed by an article (tags: _AT & _AT1).
Please note: lookarounds (esp. positive/negative lookbehind) do not support all quantifiers. So the following syntax is invalid: (?<!\w+\b) , while this syntax is perfectly fine: (?<!\w{3}\b) .
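The note on quantifiers inside lookbehinds can be reproduced in Python, whose `re` module likewise requires lookbehinds to have a fixed width. This is only an illustration with a different engine; the helper name `compiles` is hypothetical, and AntConc may report such errors differently.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

variable_width = compiles(r"(?<!\w+\b)")   # + makes the lookbehind variable-length
fixed_width = compiles(r"(?<!\w{3}\b)")    # {3} keeps the lookbehind fixed-length

print(variable_width)  # False
print(fixed_width)     # True
```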
Once these requirements have been met, the engine goes on looking for our third group:
(((\b\w*?_R\w*?\b\s){0,2}\b\w*?_JJ\w*?\b\s){0,3}\b\w*?_NN1\b)
Here we have a group consisting of
- an optional group which can occur 0 to 3 times (each occurrence: up to 2 adverbs + 1 adjective)
((\b\w*?_R\w*?\b\s){0,2}\b\w*?_JJ\w*?\b\s){0,3}
- and an essential part which matches our noun.
\b\w*?_NN1\b
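The three groups discussed above can also be assembled from separate strings, which keeps long regexes readable while testing. The following Python sketch is a test bed only, not AntConc; the variable names are hypothetical, and the test strings are the examples given in the table.

```python
import re

# Group 1: any token tagged as a preposition, followed by whitespace
prep = r"(\w*_II\b\s|\w*_IO\b\s|\w*_IF\b\s|\w*_IW\b\s)"
# Group 2: negative lookahead, the next token must not be an article
no_article = r"(?!\w{1,}_AT\w{0,1}\b\s)"
# Group 3: optional adverb/adjective sequences + singular noun
noun_phrase = r"(((\b\w*?_R\w*?\b\s){0,2}\b\w*?_JJ\w*?\b\s){0,3}\b\w*?_NN1\b)"

pattern = re.compile(prep + no_article + noun_phrase)

def first_match(text):
    """Return the first complete match in text, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

bare = first_match("in_II danger_NN1")
modified = first_match("in_II very_RG great_JJ danger_NN1")
with_article = first_match("in_II a_AT1 great_JJ danger_NN1")

print(bare)          # in_II danger_NN1
print(modified)      # in_II very_RG great_JJ danger_NN1
print(with_article)  # None
```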
-- BNCweb --
In BNCweb, we have three options to choose from when searching: CQP Syntax, Simple Query Syntax (case insensitive), Simple Query Syntax (case sensitive).
The CQP Syntax will be described further down when we're dealing with
CQPweb (e.g. our version of COCA). It is recommended to use Simple Query Syntax (case insensitive). There is already a very good description for this query syntax with good examples:
BNCweb query syntax cheat sheet.
(BNCweb is tagged with CLAWS 5 tagset)
-- CQPweb --
While the corpora hosted on our server look strikingly similar to BNCweb, the syntax is somewhat different. This is mostly a result of these corpora being tagged with CLAWS 7. Also, these corpora do not support simplified tags as BNCweb does. In the following chapter you will find an overview of the CQP Syntax and the Simple Query Syntax. Both are a curious and powerful mix of a simple, comfortable syntax paired with support for regular expressions.
There's also a YouTube channel on CQPweb.
Don't forget to set “Query mode” to “CQP syntax”.
CQP Syntax
The CQP syntax is very easy to learn and very comfortable. It even supports advanced queries with regular expressions.
Examples
Query | Description | Match |
---|---|---|
[lemma="house"] | search for lemma “house” | house, houses, housing, … |
[lemma="house" & pos="nn1"] | search for lemma house (tagged as singular noun nn1) | house |
[lemma="house" & pos!="nn1"] | search for lemma house (not tagged as nn1) | houses, housing |
[word="apples"] | search for apples | apples |
[pos="jj"][lemma="house" & pos="nn1"] | search for general adjective + lemma house (tagged as nn1) | green house, white house, strict house, main house, … |
"as" []{1,3} "as"; | as + one to three random tokens + as | as good as, as well as, as immediate to me as, as much reliable as |
"no" "sooner" []* "than" within s; | no + sooner + random tokens + than (all within the same sentence) | no sooner than, no sooner satisfied than, no sooner does he eat chocolate than |
"no" "sooner" []{0,2} [pos="pp\w*"] []{0,2} "than" within s; | no + sooner + zero to two random tokens + all tags starting with “pp” (i.e. personal pronouns) + zero to two random tokens + than (all within the same sentence) | no sooner had he said it than, no sooner would I leave than, no sooner did we arrive than |
"no" "sooner" [pos!="pp\w*"]{1,3} "than" within s; | no + sooner + one to three tokens not tagged as personal pronouns (i.e. tag not starting with “pp”) + than (all within the same sentence) | no sooner within than, no sooner get back in than, no sooner left her mouth than |
a:[] [pos="PR."] c:[] :: a.word = c.word; | WORD1 + preposition + WORD2 where WORD1 = WORD2 | Can be used to find NPNs, such as shoulder to shoulder, hand in hand, etc.; in the world of regular expressions, this function is called back-referencing |
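For comparison, the back-referencing mentioned in the last table row looks like this in a plain regular expression. The sketch below uses Python on untagged text and is only a rough analogue of the CQP query: since there are no POS tags, the prepositions are spelled out in a short, hypothetical list (to, in, by), and the sample sentence is invented.

```python
import re

text = ("They stood shoulder to shoulder and walked hand in hand, "
        "day by day, from door to window.")

# (\w+) captures WORD1; \1 requires the very same string again as WORD2
npn = re.compile(r"\b(\w+)\s(to|in|by)\s\1\b")

npn_hits = [m.group(0) for m in npn.finditer(text)]

print(npn_hits)  # ['shoulder to shoulder', 'hand in hand', 'day by day']
```

The sequence door to window is not returned because the word after the preposition differs from the captured one.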
Simple Query Syntax
Do not confuse “Simple Query Syntax” with SQL (“Structured Query Language”, which is something else entirely!) or with the simple query examples, which are meant to be examples of simple queries in each syntax (regular expressions, CQP, simple query syntax).
The default query mode in CQPweb is set to “simple query”. The “simple query” may not be as powerful as CQP syntax paired with PCRE-style regular expressions, but it is very comfortable and provides handy functions (e.g. proximity searches with <<range>>) for corpus queries. While it does support some of the character classes (\w, \d, \s, …), it does not support lookarounds, grouping and some quantifiers (e.g. {3,}).
-- www.english-corpora.org (formerly BYU; COCA, COHA, GloWbE...) --
Please consult the help pages of www.english-corpora.org or the Introduction to Corpus Basics (KorPLUS).