Appendix A. Regular Expressions

Anders Lund

This appendix contains a brief but hopefully sufficient
introduction to the world of regular
expressions. It documents regular expressions in the form
available within KatePart, which is not compatible with the regular
expressions of Perl, nor with those of, for example,
grep.

Introduction

Regular expressions provide us with a way to describe some possible contents of a text string in a form understood by a small piece of software, so that it can investigate whether a text matches and, in the case of advanced applications, save pieces of the matching text.

An example: Say you want to search a text for paragraphs that start with either of the names Henrik or Pernille, followed by some form of the verb say.

With a normal search, you would start out searching for the first name, Henrik, maybe followed by sa, like this: Henrik sa. While looking through the matches, you would have to discard those that are not at the beginning of a paragraph, as well as those in which the word starting with the letters sa is neither says nor said. And then, of course, you would repeat all of that with the next name...

With regular expressions, that task can be accomplished with a single search, and with a greater degree of precision.

To achieve this, regular expressions define rules for expressing in detail a generalization of a string to match. Our example, which we might literally express like this: "A line starting with either Henrik or Pernille (possibly preceded by up to 4 blanks or tab characters), followed by a blank, followed by sa and then either ys or id", could be expressed with the following regular expression:

^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)

The above example demonstrates all four major concepts of modern regular expressions, namely:

  • Patterns

  • Assertions

  • Quantifiers

  • Back references

The caret (^) starting the expression is an assertion, which is true only if the matching string that follows begins at the start of a line.

The strings [ \t] and (Henrik|Pernille) sa(ys|id) are patterns. The first one is a character class that matches either a blank or a (horizontal) tab character; the other pattern contains first a subpattern matching either Henrik or Pernille, then a piece matching the exact string " sa" (a blank followed by sa), and finally a subpattern matching either ys or id.

The string {0,4} is a quantifier meaning anywhere from 0 up to 4 occurrences of the preceding pattern, here the character class [ \t].
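The effect of the quantifier can be tried out in isolation. The following is a minimal sketch that uses Python's re module as a stand-in, on the assumption that it treats this small character class and quantifier the same way; it is not KatePart's own engine, and the helper name is_short_indent is purely illustrative.

```python
import re

# The character class and quantifier from the example, on their own.
# Assumption: Python's "re" module is used here only for illustration.
INDENT = re.compile(r"[ \t]{0,4}")

def is_short_indent(s):
    """Return True if s consists solely of 0 to 4 blanks or tab characters."""
    # fullmatch requires the whole string to be covered by the pattern
    return INDENT.fullmatch(s) is not None

print(is_short_indent(""))       # zero repetitions are allowed
print(is_short_indent(" \t "))   # three characters from the class
print(is_short_indent("     "))  # five blanks exceed the upper bound
```

The first two calls report a match and the third does not, since {0,4} accepts no more than four repetitions.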

Regular expression software that supports back references saves the entire matching part of the string as well as the parts matched by subpatterns enclosed in parentheses. Given some means of access to those references, we could get our hands on the whole match (when searching a text document in an editor with a regular expression, that is often marked as selected), the name found, or the last part of the verb.
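As a rough illustration of what "access to those references" can look like, here is a sketch using Python's re module, where saved parts are read through numbered groups. This is an assumption made for demonstration only; how KatePart exposes the saved parts may differ, and the function name parts is hypothetical.

```python
import re

# The expression from the introduction, unchanged.
# Assumption: Python's "re" module stands in for a back-reference-capable engine.
PATTERN = re.compile(r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)")

def parts(line):
    """Return (whole match, name, verb ending), or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    # group(0) is the entire match; groups 1 and 2 are the parenthesized subpatterns
    return m.group(0), m.group(1), m.group(2)

print(parts("  Pernille said that the build was green."))
# prints ('  Pernille said', 'Pernille', 'id')
```

Group 0 holds the entire match including the leading blanks, group 1 the name, and group 2 the last part of the verb.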

Altogether, the expression will match where we wanted it to, and only there.
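To see both the wanted matches and the rejected near misses, the expression can be run against a few sample lines. The sketch below again assumes Python's re module as a stand-in for illustration; the sample text and the helper matching_lines are invented for this example. Note that in Python the caret only matches at the start of every line when the MULTILINE flag is given.

```python
import re

# Assumption: Python's "re" module, with MULTILINE so that ^ matches at each line start.
PATTERN = re.compile(r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)", re.MULTILINE)

# Invented sample text, one case per line.
TEXT = "\n".join([
    "Henrik says the parser is done.",       # matches
    "\tPernille said it needs more tests.",  # matches, one leading tab
    "Then Henrik said nothing.",             # name is not at the start of the line
    "Pernille sang a song.",                 # sa is not followed by ys or id
    "      Henrik says hello.",              # six blanks, more than the four allowed
])

def matching_lines(text):
    """Return the 1-based numbers of the lines in which the pattern matches."""
    # The line number is one more than the count of newlines before the match
    return [text.count("\n", 0, m.start()) + 1 for m in PATTERN.finditer(text)]

print(matching_lines(TEXT))  # prints [1, 2]
```

Only the first two lines are reported; each of the remaining lines fails on exactly one part of the expression, namely the assertion, the final subpattern, or the quantifier.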

The following sections describe in detail how to construct and use patterns, character classes, assertions, quantifiers and back references, and the final section gives a few useful examples.