Linguistic Resources: Querying Corpora: Simple Queries with wildcards

Corpora nowadays come in all forms and flavours. While there is a tendency to access corpora via web interfaces (e.g. BNCweb, COCA, COHA, …), in some cases it is desirable to store the corpus files on your own hard drive and search them with specialized programs like AntConc or WordSmith Tools.

Whichever interface is used for corpus queries, the search syntax for simple queries is the same: just type in the 'word' you are looking for. While the basic search functions support wildcards and searches for part-of-speech tags and lemmata, some of the more complex queries can only be performed with regular expressions.

The following article will present you with a brief overview of the different simple query functions for each interface.

-- AntConc --

AntConc is probably the freeware corpus program out there. It comes with a wide range of useful tools for working with corpora (concordance, cluster search, n-grams, collocates, (key)word lists), has built-in support for regular expressions and for working with tags, and is fast and completely free of charge.

Even when working with the simple search function (“Search Term” set to “Words”), AntConc supports wildcards and part-of-speech tags. The next chapters can serve as a quick reference for which wildcards are supported and how to query plain text and tagged corpora alike.

Supported wildcards

Wildcard	function
*	match zero or more characters
+	match zero or one character
?	match any one character
@	match zero or one word
#	match any one word
|	boolean operator 'OR'
&	match non-word (e.g. !,.?@)
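
The character-level wildcards in the table above can be read as shorthand for small regular expressions. The following Python sketch illustrates that reading for `*`, `+` and `?` only. The mapping and the helper names `wildcard_to_regex` and `matches` are assumptions made for illustration; this is not how AntConc is implemented internally.

```python
import re

# Assumed mapping from the character-level wildcards in the table
# to regular-expression fragments (an illustration, not AntConc's code).
WILDCARDS = {
    "*": r"\w*",  # zero or more characters
    "+": r"\w?",  # zero or one character
    "?": r"\w",   # any one character
}


def wildcard_to_regex(query: str) -> re.Pattern:
    """Compile a single-word wildcard query into a case-insensitive regex."""
    parts = [WILDCARDS.get(ch, re.escape(ch)) for ch in query]
    return re.compile("".join(parts), re.IGNORECASE)


def matches(query: str, word: str) -> bool:
    """True if the whole word is covered by the wildcard query."""
    return wildcard_to_regex(query).fullmatch(word) is not None


# Try a few queries against a small word list.
words = ("fun", "fund", "funny", "color", "colour")
for q in ("fun*", "fun+", "colo?r", "colo+r"):
    print(q, [w for w in words if matches(q, w)])
```

The word-level wildcards (`@`, `#`) and the `|` operator work on sequences of words rather than on characters and are deliberately left out of this sketch.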

Working with plain text corpora

Plain text corpora contain no part-of-speech information, so abstract queries like “singular nouns starting with h” prove to be difficult, if not impossible.

Query Examples

search string	description	match
fun*	“fun” + zero or more characters	fun, funny, funeral
fun+	“fun” + zero or one character	fun, fund
colo?r	“colo” + any one character + “r”	colour, colosr, colomr, but not color
colo+r	“colo” + zero or one character + “r”	colour, color, colomr
you # him	“you” + any one word + “him”	you tell him, you love him, you send him
I|you|he # him	“I” OR “you” OR (“he” + any one word + “him”): the OR separates whole alternatives, so only “he” is followed by “# him”	I, you, he told him, he send him; but not I send him!
I # him|you # him|he # him	“I” + any one word + “him” OR “you” + any one word + “him” OR “he” + any one word + “him”	I send him, you told him, he told him
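
The last two rows of the table hinge on how OR is grouped. A minimal Python sketch of that behaviour follows. It assumes that the query is first split into alternatives at `|` and that `#` stands for exactly one word, as the examples in the table suggest. The function names `match_at` and `find` are hypothetical, and the code is not taken from AntConc.

```python
def match_at(pattern: list[str], tokens: list[str], start: int) -> bool:
    """True if the word pattern matches the tokens beginning at index start."""
    if start + len(pattern) > len(tokens):
        return False
    return all(
        p == "#" or p.lower() == t.lower()  # "#" accepts any one word
        for p, t in zip(pattern, tokens[start:])
    )


def find(query: str, text: str) -> list[str]:
    """Return every token span matched by any '|' alternative of the query."""
    tokens = text.split()
    # Assumption: OR binds more loosely than the word sequence.
    alternatives = [alt.split() for alt in query.split("|")]
    hits = []
    for i in range(len(tokens)):
        for alt in alternatives:
            if match_at(alt, tokens, i):
                hits.append(" ".join(tokens[i:i + len(alt)]))
    return hits


print(find("I|you|he # him", "I send him"))
print(find("I # him|you # him|he # him", "I send him"))
```

Under these assumptions the first query only finds the bare word “I”, while the second one finds the full phrase, which is why the longer query is needed to cover all three pronouns.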

Working with part-of-speech tags

Example for POS-tagged data in horizontal format:
This_DT0 is_VBZ a_AT0 sample_NN1 text_NN1 to_TO0 illustrate_VVI the_AT0 format_NN1 ._PUN

Using AntConc with part-of-speech (POS) tagged corpora enables us to perform quite powerful searches. Tagged corpora come in various data structures, i.e. horizontal, vertical and XML. The following section assumes that your POS-tagged corpus is in a horizontal format: WORD_TAG. Most of the ICE (International Corpus of English) corpora come with a POS-tagged version.
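
To make the horizontal WORD_TAG format more concrete, here is a small Python sketch that splits such a line into (word, tag) pairs. The function name `parse_horizontal` and the choice of “_” as the default separator are assumptions for illustration; some corpora use a different separator.

```python
def parse_horizontal(line: str, sep: str = "_") -> list[tuple[str, str]]:
    """Split 'WORD_TAG WORD_TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        if sep in token:
            # Split on the LAST separator so that words which themselves
            # contain the separator stay intact.
            word, _, tag = token.rpartition(sep)
        else:
            # Untagged token: keep the word and leave the tag empty.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


sample = "This_DT0 is_VBZ a_AT0 sample_NN1 text_NN1"
for word, tag in parse_horizontal(sample):
    print(word, tag)
```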

There are several different schemes for tagging (also called tagsets). Please make sure you know which tagset your corpus is tagged with. The most common ones are CLAWS 5 (BNCweb), CLAWS 7 (ICE corpora, COCA, COHA, newer versions of BNCweb) and the Penn Treebank.

Query Examples

The difference between working with plain text corpora and POS-tagged corpora is that we can include the tags (in this case we're working with the CLAWS 7 tagset) in our search syntax.

search string	description	match
pre*_JJ	“pre” + zero or more characters + “_JJ”: search for adjectives starting with pre	precious_JJ, pregnant_JJ, pretty_JJ
A*_N*	“A” + zero or more characters + “_N” + zero or more characters: search for nouns starting with “A”	Asians_NN2, apple_NN1, airport_NN1, Adams_NP1
I_* like_*	“I_” + zero or more characters + space + “like_” + zero or more characters: search for “I like”, ignore tagging	I_PPIS1 like_VV0, i_mc1 like_II (mistagged!)
*_PP?S& *_V* *_PP?O&	zero or more characters + “_PP” + any one character + “S” + non-word + space + zero or more characters + “_V” + zero or more characters + space + zero or more characters + “_PP” + any one character + “O” + non-word: personal pronoun (subjective) + verb + personal pronoun (objective)	I_PPIS1 asked_VVD her_PPHO1, She_PPHS1 remembers_VVZ me_PPIO1
*_AT* ???*ing_N* of	zero or more characters + “_AT” + zero or more characters + space + any one character (three times) + zero or more characters + “ing_N” + zero or more characters + “of”: article + nouns ending with ing (at least 6 characters long) + of	the_AT cuddling_ of_IO, a_AT1 beginning_NN1 of_IO
&_& *_I*	non-word + “_” + non-word + space + zero or more characters + “_I” + zero or more characters: non-words + prepositions	._. By_II, ._. To_II, ._. For_IF, ?_? Like_II, “_” to_II
*_PPY|*_PPIS*|*_PPH1|*_PPHS*	(zero or more characters + “_PPY”) OR (zero or more characters + “_PPIS” + zero or more characters) OR (zero or more characters + “_PPH1”) OR (zero or more characters + “_PPHS” + zero or more characters)	All personal pronouns: I, you, he, she, it, we, you, they. Corresponding CLAWS7 tags: PPY, PPIS1, PPIS2, PPH1, PPHS1, PPHS2
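
Queries such as pre*_JJ or I_* like_* can be tried out on a small tagged sample outside AntConc. The sketch below uses Python's standard `fnmatch` module, whose `*` and `?` happen to behave like the wildcards of the same name. It is only an approximation: `+`, `@`, `#`, `&` and `|` are not handled, `[` has a special meaning in `fnmatch`, and the function name `find_tagged` and the sample line are made up for illustration.

```python
from fnmatch import fnmatchcase


def find_tagged(query: str, corpus: str) -> list[str]:
    """Return token sequences in a WORD_TAG corpus matching a wildcard query."""
    # Lowercase both sides to imitate a case-insensitive search.
    patterns = query.lower().split()
    tokens = corpus.split()
    hits = []
    for i in range(len(tokens) - len(patterns) + 1):
        window = tokens[i:i + len(patterns)]
        if all(fnmatchcase(t.lower(), p) for t, p in zip(window, patterns)):
            hits.append(" ".join(window))
    return hits


# Invented sample line in horizontal format.
corpus = "I_PPIS1 like_VV0 pretty_JJ houses_NN2 ._."

print(find_tagged("pre*_JJ", corpus))
print(find_tagged("I_* like_*", corpus))
print(find_tagged("*_N*", corpus))
```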

-- BNCweb --

In BNCweb, we have three options to choose from when searching: CQP Syntax, Simple Query Syntax (case insensitive), Simple Query Syntax (case sensitive). The CQP Syntax will be described further down when we're dealing with CQPweb (e.g. our version of COCA). It is recommended to use Simple Query Syntax (case insensitive). There is already a very good description for this query syntax with good examples:

BNCweb query syntax cheat sheet.
(BNCweb is tagged with the CLAWS 5 tagset)

-- CQPweb --

While the corpora hosted on our server look strikingly similar to BNCweb, the syntax is somewhat different. This is mostly a result of these corpora being tagged with CLAWS 7. Also, these corpora do not support simplified tags as BNCweb does. In the following section you will find an overview of the CQP Syntax and the Simple Query Syntax.

Please note that the POS tagging in our version of COCA & COHA is case-sensitive and lowercase! So, for example, use vvg instead of VVG. The updated version of COCA (i.e. COCA 2017 update), on the other hand, uses upper case letters for POS tags! The easiest way to find out which corpus uses which POS tags is to search for a random word and hover with the mouse over the search word.

CQP Syntax

The CQP syntax is very easy to learn and very comfortable. Don't forget to set “Query mode” to “CQP syntax”.

Examples

Query	Description	Match
[lemma="house"]	search for lemma “house”	house, houses, housing, …
[lemma="house" & pos="nn1"]	search for lemma house (tagged as singular noun nn1)	house
[lemma="house" & pos!="nn1"]	search for lemma house (not tagged as nn1)	houses, housing
[word="apples"]	search for apples	apples
[pos="jj"][lemma="house" & pos="nn1"]	search for general adjective + lemma house (tagged as nn1)	green house, white house, strict house, main house, …
"as" []{1,3} "as";	as + one to three random tokens + as	as good as, as well as, as immediate to me as, as much reliable as
"no" "sooner" []* "than" within s;	no + sooner + random tokens + than (all within the same sentence)	no sooner than, no sooner satisfied than, no sooner does he eat chocolate than
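
The idea behind the bracket notation is that every token carries several attributes (word form, lemma, POS tag) and each pair of brackets states conditions on one token. The following Python sketch imitates this on a hand-annotated toy sentence. The names `Token` and `find_sequence` and the sample annotations are invented for illustration, and the sketch compares values with plain equality, whereas CQP treats attribute values as regular expressions.

```python
from typing import Callable, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str


Constraint = Callable[[Token], bool]


def find_sequence(tokens: list[Token], constraints: list[Constraint]) -> list[str]:
    """Return the word forms of every run of tokens satisfying the constraints in order."""
    n = len(constraints)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(check(tok) for check, tok in zip(constraints, window)):
            hits.append(" ".join(tok.word for tok in window))
    return hits


# Hand-annotated toy data; lemma and tag values are made up for illustration.
sentence = [
    Token("the", "the", "at"),
    Token("white", "white", "jj"),
    Token("house", "house", "nn1"),
    Token("houses", "house", "vvz"),
    Token("guests", "guest", "nn2"),
]

# Rough equivalent of: [pos="jj"][lemma="house" & pos="nn1"]
query = [
    lambda t: t.pos == "jj",
    lambda t: t.lemma == "house" and t.pos == "nn1",
]
print(find_sequence(sentence, query))
```

Each lambda plays the role of one pair of square brackets, so a longer query is simply a longer list of conditions.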

Simple Query Syntax

Do not confuse “Simple Query Syntax” with SQL (“Structured Query Language”, which is something else entirely!) or with the simple query examples, which are meant to be examples of simple queries in each syntax (regular expressions, CQP, simple query syntax).

The default query mode in CQPweb is set to “simple query”. The “simple query” may not be as powerful as regular expressions, but it is very comfortable and provides handy functions for corpus queries (e.g. proximity searches with “<<range>>”). While it does support some of the character classes (\w, \d, \s …), it does not support lookarounds, grouping and some quantifiers (e.g. {3,}).

Examples

Query	Description	Match
s?ng	s + one arbitrary character + ng	sing, sang, sung, ssng
day*	day + zero or more characters	day, days, daylight, daydream
+able	one or more characters + able	capable, readable, searchable, but not able
??+able	three or more characters + able	capable, readable, but not unable, able, stable
{lay}	search for lemma lay	lay, lays, laid
day <<4>> night	search for all occurrences of day which appear in a 4-token window of night	day and night, “during the day and during the night”
day <<5<< night	search for all occurrences of day which appear in a 5-token window after night	night (…) day
day >>5>> night	search for all occurrences of day which appear in a 5-token window before night	day (…) night
+_at green_jj +_nn1	one or more characters (tagged as article) + green (tagged as adjective) + one or more characters (tagged as singular noun)	the green ball
+_r* (green_jj | red_jj ) {house}	one or more characters (tagged as adverb) + either green or red (tagged as adjective) + Lemma house	very green houses, incredibly red house
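
The proximity operator from the table (day <<4>> night) can be pictured as a distance check between token positions. The Python sketch below shows one possible reading. It assumes that a window of 4 means a position difference of at most 4 in either direction, which fits the example “during the day and during the night”, but the exact counting used by CQPweb is not confirmed here. The function name `within_window` is hypothetical.

```python
def within_window(tokens: list[str], target: str, anchor: str, span: int) -> list[int]:
    """Indices of `target` that have `anchor` no more than `span` tokens away."""
    lowered = [t.lower() for t in tokens]
    anchors = [i for i, t in enumerate(lowered) if t == anchor]
    return [
        i
        for i, t in enumerate(lowered)
        # Assumption: distance is the difference in token positions.
        if t == target and any(0 < abs(i - a) <= span for a in anchors)
    ]


tokens = "during the day and during the night".split()
print(within_window(tokens, "day", "night", 4))
```

A directional variant would replace `abs(i - a)` with a signed difference, keeping only anchors that precede or only anchors that follow the target.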

-- www.english-corpora.org (formerly BYU; COCA, COHA, GloWbE...) --

Please consult the help pages of www.english-corpora.org or the Introduction to Corpus Basics (KorPLUS)

Last modified: Tuesday, 8 August 2023, 2:06 PM