CPSC 229 | Foundations of Computation | Spring 2024 |
Regular expressions are patterns that can be matched against strings. Regular expressions are important tools for text processing. Many text editors and most programming languages have some built-in support for regular expressions. Unfortunately, the syntax is not completely standardized. However, most of the basics are supported by most implementations.
Certain characters have special purposes in regular expressions. These are called meta-characters or meta-symbols. Meta-characters are not part of the strings that are matched by a pattern. Instead, they are part of the syntax that is used for representing patterns. Typically, the following characters are meta-characters:
. * | ? + ( ) [ ] { } ^ $ \
(The first thing on the preceding line is a period.) These characters have special meaning in regular expressions. For example, parentheses are used for grouping. If you want to use a meta-character as a regular character instead of with its special meaning, you have to "escape" it by preceding it with a backslash, such as \*, \(, \$, or \\. (You might run into a few implementations where some of these characters are not treated as meta-characters, and the special meta-character meaning is obtained by using the backslash. For example, "(" and ")" might be considered regular characters, while "\(" and "\)" are used for grouping in the regular expression syntax.)
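As a small illustration of escaping, here is a sketch that assumes Python's re module (one of many Perl-style implementations; the sample strings are made up). It shows the difference between a bare period and an escaped one, and uses re.escape to add the backslashes automatically.

```python
import re

# Unescaped, "." is a meta-character that matches any single character.
dot_any = re.fullmatch(r"3.14", "3x14") is not None        # True

# Escaped, "\." matches only a literal period.
dot_literal = re.fullmatch(r"3\.14", "3x14") is not None   # False
dot_period = re.fullmatch(r"3\.14", "3.14") is not None    # True

# re.escape puts a backslash in front of every meta-character, so the
# resulting pattern matches the original text literally.
literal = re.escape("(a+b)*")
escaped_ok = re.fullmatch(literal, "(a+b)*") is not None   # True
```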
In addition to the special symbols listed above, certain other things can be represented by escaped characters. For example, an escaped t, "\t", represents a tab character, while "\s" represents any whitespace character.
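The two escapes just mentioned can be tried out with a short sketch, again assuming Python's re module; the input strings are invented for the example.

```python
import re

# "\t" in a pattern matches a tab character.
has_tab = re.search(r"\t", "name\tvalue") is not None   # True

# "\s" matches any whitespace character (space, tab, newline, ...),
# so "\s+" can be used to split text on runs of whitespace.
fields = re.split(r"\s+", "a\tb  c\nd")                 # ['a', 'b', 'c', 'd']
```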
In the rest of this document, I will discuss Perl-style regular expressions. Perl is a programming language that was one of the first to introduce a rich regular expression syntax. The same syntax has been adopted (with some variations) by many other languages, including Java, JavaScript, Python, and Microsoft's .NET framework. It is also used in some text editors.
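Regular expressions are patterns. A string can be classified as either matching or not matching the pattern. Regular expressions are used in at least three different ways: (1) to test whether an entire string matches the pattern, (2) to find substrings that match the pattern, and (3) to find and replace substrings that match the pattern.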
Theoretical discussions, such as in CPSC 229, often consider only the first use. Practical applications on a computer often use the second ("find") or third ("find and replace") operations.
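The three kinds of use might look like this in practice. This is only a sketch: it assumes Python's re module, takes the first use to mean testing whether an entire string matches, and uses a made-up pattern and made-up text.

```python
import re

pattern = r"[0-9]+"
text = "room 12, floor 3"

# 1. Test whether an entire string matches the pattern.
whole = re.fullmatch(pattern, "2024") is not None     # True
not_whole = re.fullmatch(pattern, text) is not None   # False

# 2. Find the substrings that match the pattern.
found = re.findall(pattern, text)                     # ['12', '3']

# 3. Find and replace the substrings that match the pattern.
replaced = re.sub(pattern, "N", text)                 # 'room N, floor N'
```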
Here is a table giving examples of some of the more common types of patterns and the strings (or substrings) that they match:
Pattern | Matches |
a | matches the single character "a". Any non-special character matches only itself. (Note that a space is a non-special character so that a space in a regular expression matches a space in the string.) |
. | a period matches any single character except (usually) end-of-line characters. The new line and carriage return characters can (generally) be matched by \n and \r. (\n marks end-of-line in UNIX, while \r\n is used in Windows.) |
[abc] | matches one of the single characters "a", "b", or "c". [ and ] make a character class that matches any single character that is among those listed between the brackets. |
[a-zA-Z] | matches any single alphabetic character; a hyphen inside a character class indicates a range of characters, so that [a-d] is the same as [abcd] |
[^a-zA-Z] | matches any single non-alphabetic character; a "^" at the beginning of a character class negates the class so that it matches any character that is not listed. |
[+\-*/] | matches any one of the usual arithmetical operators. (In a character class, most special characters lose their special status and can be used without backslashes. However, "\", "^", and "]" are still special and must be escaped, and "-" becomes special. So the "-" in the example has to be written "\-".) |
ab | matches the string "ab"; when patterns are concatenated, the strings that they match are also concatenated |
[a-z][a-z][0-9] | matches a string consisting of two lower-case letters followed by one digit |
a|b | matches either "a" or "b"; a "|" between patterns means "or" and the overall pattern matches any string that matches either of the sub-patterns |
a|bc | matches either "a" or "bc"; the "|" has lower precedence than concatenation so that "a|bc" means the same thing as "a|(bc)" |
a* | matches the empty string and any string of a's; a "*" after a pattern means repeat the pattern zero or more times |
ab* | matches a string consisting of an a followed by zero or more b's; * has higher precedence than concatenation so that "ab*" means "a(b*)" |
(ab)* | matches the empty string and the strings ab, abab, ababab, abababab, ... |
a+ | matches a sequence of one or more a's (does not match the empty string); "+" means "one or more repetitions of the preceding pattern"; + has the same precedence as * |
a? | matches the empty string and the string "a"; "?" means "optional" or "either empty or matching the previous pattern"; ? has the same precedence as * |
a{6} | matches aaaaaa; {n} means "matching exactly n copies of the preceding pattern," where n is a positive integer. {n} has the same precedence as * |
a{3,5} | matches aaa, aaaa, and aaaaa; {m,n} means "matching m through n copies of the preceding pattern," where m and n are non-negative integers; "{m,}" matches m or more copies; "{n,n}" matches exactly n copies and so is the same as {n}. {m,n} has the same precedence as * |
"[^"]*" | matches a string enclosed in double quotes, including the quotation marks, where the quoted string cannot contain any embedded double quotes; the pattern ".*" would match strings with nested quotation marks, such as: "one" two "three" |
\w | matches a single "word" character; this is an abbreviation for [a-zA-Z0-9_] (letters, digits, and the underscore); other abbreviations include: \W = any non-word character, \d = any digit, \s = any whitespace character, \S = any non-whitespace character |
^a | matches an a at the beginning of a line; a "^" does not match any characters itself but "anchors" the expression to the start of the line. |
a$ | matches an a at the end of a line; a "$" does not match any characters itself but "anchors" the expression to the end of the line. |
\bfoo\b | matches foo as a complete word; that is, foo must be bounded on both ends by a non-word character or by a start or end of line; \b does not itself match any character but "anchors" the pattern to a word boundary. Similarly, \B matches any non-word-boundary. |
\b\w+\b | matches any entire word. (Note: keep in mind that digits and the underscore are word characters.) |
^.+$ | matches an entire non-empty line. (Remember that "." matches any character, "+" means "one or more", and the ^ and $ anchor the pattern to the beginning and end of the line) |
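Several rows of the table can be checked directly. The sketch below assumes Python's re module; the helper matches and all of the sample strings are invented for illustration. One Python-specific detail: ^ and $ anchor to the start and end of each line only when the re.MULTILINE flag is given.

```python
import re

def matches(pattern, s):
    """Return True if the entire string s matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, s) is not None

# Character classes and concatenation.
two_letters_digit = matches(r"[a-z][a-z][0-9]", "ab7")        # True
operator = matches(r"[+\-*/]", "-")                           # True

# Precedence: "|" binds loosest, "*" binds tightest.
alt = [matches(r"a|bc", s) for s in ("a", "bc", "ac")]        # [True, True, False]
star = [matches(r"ab*", s) for s in ("a", "abbb", "abab")]    # [True, True, False]
group = [matches(r"(ab)*", s) for s in ("", "abab", "aba")]   # [True, True, False]

# Bounded repetition: a{3,5} accepts 3, 4, or 5 a's.
counted = [matches(r"a{3,5}", "a" * n) for n in range(2, 7)]
# [False, True, True, True, False]

# Quoted strings: the negated class stops at the next quotation mark,
# while ".*" runs on to the last quotation mark on the line.
text = 'say "one" then "two"'
quoted = re.findall(r'"[^"]*"', text)    # ['"one"', '"two"']
greedy = re.findall(r'".*"', text)       # ['"one" then "two"']

# Word boundaries; digits and "_" count as word characters.
whole_foo = re.findall(r"\bfoo\b", "foo food foo_bar (foo)")  # ['foo', 'foo']
words = re.findall(r"\b\w+\b", "x1 = y_2 + 30")               # ['x1', 'y_2', '30']

# Non-empty lines; the empty middle line is skipped.
lines = re.findall(r"^.+$", "first\n\nthird", re.MULTILINE)   # ['first', 'third']
```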
There is one more important aspect of regular expressions on a computer: backreferences. A backreference is a way of referring to a substring that was matched by an earlier part of the expression. Backreferences take the form \1, \2, \3, ..., \9. \1 represents the part of the string that was matched by the first parenthesized sub-expression in the regular expression; \2, the part that was matched by the second parenthesized sub-expression, and so on. For example, the expression
^(\w+).*\1$
matches a line of text that begins and ends with the same word. The \1 matches whatever sequence of characters was matched by the \w+ that is enclosed in the first (and only) set of parentheses in the expression. The numbering of sub-expressions is done by counting left parentheses, and sub-expressions can be nested. For example, in
((\d+)\s*[+\-*/]\s*(\d+))=(\d+)
\1 would refer to whatever matched (\d+)\s*[+\-*/]\s*(\d+), \2 would refer to whatever matched the first \d+, \3 to whatever matched the second \d+, and \4 to whatever matched the final \d+.
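Both backreference examples can be tried out with a short sketch, assuming Python's re module (the test strings are made up). In Python, the text captured by each parenthesized sub-expression is available through the match object's group method, which makes the numbering easy to inspect.

```python
import re

# A line that begins and ends with the same word.
same_word = re.compile(r"^(\w+).*\1$")
begins_and_ends = same_word.search("hello world hello") is not None   # True
different = same_word.search("hello world goodbye") is not None       # False

# Groups are numbered by counting left parentheses, left to right.
m = re.fullmatch(r"((\d+)\s*[+\-*/]\s*(\d+))=(\d+)", "12 + 30=42")
g1 = m.group(1)   # '12 + 30'  (the outer parentheses)
g2 = m.group(2)   # '12'       (the first \d+)
g3 = m.group(3)   # '30'       (the second \d+)
g4 = m.group(4)   # '42'       (the final \d+)
```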
When doing a "find and replace" operation with regular expressions, it is usually possible to use backreferences in the replacement string. This means that it is possible to include selected pieces of the original string in the replacement. This is actually the most interesting use of backreferences. In the replacement text, \0 can be used to represent the entire matched substring. (In some implementations, including Java's, backreferences in the replacement string are written as $0, $1, etc., instead of using "\".)
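Here is a minimal sketch of backreferences in a replacement string, assuming Python's re.sub; the names and sentences are invented. Python is one of the implementations that deviates slightly from the description above: \1, \2, and so on work in the replacement, but the entire match is written \g<0> rather than \0.

```python
import re

# Swap "Last, First" into "First Last" by reusing the two captured words.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")   # 'Ada Lovelace'

# Wrap every number in angle brackets; \g<0> stands for the entire match.
bracketed = re.sub(r"\d+", r"<\g<0>>", "add 12 and 7")         # 'add <12> and <7>'
```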
(Warning: CPSC 229 students should note that "regular expressions" that contain backreferences might not be regular expressions at all in the usual sense. That is, the language that is represented by a "regular expression" with backreferences might not be a regular language! Backreferences extend the power of regular expressions beyond what can be done with the regular expressions introduced in Section 3.2 of the CPSC 229 textbook.)
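To make the warning concrete, here is one example (chosen for this sketch, not taken from the textbook), written with Python's re module. The pattern (a*)b\1 matches exactly the strings consisting of n a's, then a b, then the same number of a's. That language is a standard example of a non-regular language, so no expression without backreferences can describe it.

```python
import re

pattern = r"(a*)b\1"   # some a's, a single b, then the same a's again

candidates = ("b", "aba", "aabaa", "aabaaa", "aaba")
results = {s: re.fullmatch(pattern, s) is not None for s in candidates}
# Only the strings with equal numbers of a's on both sides match:
# {'b': True, 'aba': True, 'aabaa': True, 'aabaaa': False, 'aaba': False}
```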