This appendix contains a brief introduction to regular expressions that should nevertheless cover what you need. It documents regular expressions in the form available within KatePart, which is compatible neither with the regular expressions of Perl nor with those of, for example, grep.
Regular expressions provide a way to describe the possible contents of a text string in a form that a small piece of software can understand, so that it can check whether a text matches and, in more advanced applications, save parts or all of the matching text.
An example: Say you want to search a text for paragraphs that start with either of the names “Henrik” or “Pernille”, followed by some form of the verb “say”.
With a normal search, you would start out searching for the first name, “Henrik”, maybe followed by “sa”, like this: Henrik sa. While looking for matches, you would have to discard those that are not at the beginning of a paragraph, as well as those in which the word starting with the letters “sa” is not “says”, “said” or a similar form. And then, of course, you would have to repeat all of that with the next name.
With regular expressions, that task can be accomplished with a single search, and with a greater degree of precision.
To achieve this, regular expressions define rules for expressing in detail a generalization of the string to match. Our example, which we might literally express like this: “A line starting with either ‘Henrik’ or ‘Pernille’ (possibly preceded by up to 4 blanks or tab characters), followed by a blank, followed by ‘sa’ and then either ‘ys’ or ‘id’”, could be expressed with the following regular expression:
^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)
The above example demonstrates all four major concepts of modern Regular Expressions, namely:
Patterns
Assertions
Quantifiers
Back references
The caret (^) starting the expression is an assertion, which is true only if the following matching string is at the start of a line.
The strings [ \t] and (Henrik|Pernille) sa(ys|id) are patterns. The first one is a character class that matches either a blank or a (horizontal) tab character; the other pattern contains first a subpattern matching either Henrik or Pernille, then a piece matching the exact string sa (preceded by a blank), and finally a subpattern matching either ys or id.
The string {0,4} is a quantifier saying “anywhere from 0 up to 4 of the previous”.
Regular expression software that supports back references saves the entire matching part of the string, as well as the parts matched by subpatterns enclosed in parentheses. Given some means of access to those references, we could get our hands on the whole match (when searching a text document in an editor with a regular expression, that is often marked as selected), the name found, or the last part of the verb.
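The saved pieces described above can be illustrated with a minimal sketch. It assumes Python's re module rather than KatePart's own engine; the two dialects differ in places, but this particular expression is written the same way in both. The function name describe_match and the sample sentence are hypothetical, chosen only for illustration.

```python
import re

# The example expression from this appendix. Assumption: Python's re module
# stands in for KatePart's engine; re.MULTILINE makes ^ match at every line start.
PATTERN = re.compile(r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)", re.MULTILINE)


def describe_match(text):
    """Return the whole match and both saved subpatterns, or None if no match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return {
        "whole": m.group(0),   # the entire matching part of the string
        "name": m.group(1),    # first parenthesized subpattern
        "ending": m.group(2),  # second parenthesized subpattern
    }


if __name__ == "__main__":
    print(describe_match("  Pernille said that she would come."))
```

Here group 0 plays the role of the whole match that an editor would mark as selected, while groups 1 and 2 hold the name and the verb ending.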
Altogether, the expression will match where we wanted it to, and only there.
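To see which lines the expression accepts and which it rejects, the following sketch runs it over a few sample lines. As an assumption, it uses Python's re module in place of KatePart's engine (the syntax of this expression is the same in both); the helper matching_lines and the sample text are hypothetical.

```python
import re

# The example expression from this appendix, applied one line at a time,
# so ^ only needs to match at the start of each string.
PATTERN = re.compile(r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)")


def matching_lines(text):
    """Return the lines of text that the expression matches."""
    return [line for line in text.splitlines() if PATTERN.search(line)]


SAMPLE = "\n".join([
    "Henrik says hello.",          # matches: name at the start of the line
    "\tPernille said goodbye.",    # matches: one leading tab is allowed
    "     Henrik said nothing.",   # rejected: five blanks exceed {0,4}
    "Yesterday Henrik said so.",   # rejected: name is not at the line start
    "Pernille sang a song.",       # rejected: "sa" is not followed by ys or id
])

if __name__ == "__main__":
    for line in matching_lines(SAMPLE):
        print(repr(line))
```

Only the first two sample lines are returned: the assertion rejects names that appear mid-line, the quantifier rejects more than four leading blanks or tabs, and the final subpattern rejects words such as “sang”.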
The following sections describe in detail how to construct and use patterns, character classes, assertions, quantifiers and back references, and the final section gives a few useful examples.