Regular Expressions on the Computer


Regular expressions are patterns that can be matched against strings. Regular expressions are important tools for text processing. Many text editors and most programming languages have some built-in support for regular expressions. Unfortunately, the syntax is not standardized; however, most of the basics are supported by most implementations. (Note in particular that the syntax is not the same as the syntax in Section 3.2 of the CPSC 229 book. There are a lot more options in computer implementations, and the special character "|" takes on the role played by "+" in the book (and to make it worse, "+" has a different meaning).)

Certain characters have special purposes in regular expressions. These are called meta-characters or meta-symbols. Meta-characters are not part of the strings that are matched by a pattern. Instead, they are part of the syntax that is used for representing patterns. Typically, the following characters are meta-characters:

          .   *   |   ?   +   (   )   [   ]   {   }   ^   $   \

(The first thing on the preceding line is a period.) These characters have special meaning in regular expressions. For example, parentheses are used for grouping, just as they are in arithmetic. If you want to use a meta-character as a regular character instead of with its special meaning, you have to "escape" it by preceding it with a backslash, such as \*, \(, \$, or \\. (You might run into a few implementations where some of these characters are not treated as meta-characters, and the special meta-character meaning is obtained by using the backslash. For example, "(" and ")" might be considered regular characters, while "\(" and "\(" are used for grouping in the regular expression syntax.)

In addition to the special symbols listed above, certain other things can be represented by escaped characters. For example, an escaped t, "\t", represents a tab character, while "\s" represents any whitespace character.

In the rest of this document, I will discuss Perl-style regular expressions. Perl is a programming language that was one of the first to introduce a rich regular expression syntax. The same syntax has ben adopted (with maybe a few variations) by many other languages, including Java, JavaScript, Python, and Microsoft's .NET framework. It is also used in some text editors, such as kate and the Eclipse text editor. And it can be used on the command line using the egrep command.


Patterns and Pattern Matching

Regular expressions are patterns. A string can be classified as either matching or not matching the pattern. Regular expressions are used in at least three different ways:

Theoretical discussions, such as in CPSC 229, often consider only the first use. Practical applications on a computer often use the second ("find") or third ("find and replace") operations.

Here is a table giving examples of some of the more common types of patterns and the strings (or substrings) that they match:

Pattern Matches
a the single character "a". Any non-special character matches only itself. (Note that a space is a non-special character so that a space in a regular expression matches a space in the string.)
. a period matches any single character except (usually) end-of-line characters. The new line and carriage return characters can (generally) be matched by \n and \r. (\n marks end-of-line in UNIX, while \r\n is used in Windows.)
[abc] one of the single characters "a", "b", or "c". [ and ] make a character class that matches any single character that is among those listed between the brackets.
[a-zA-Z] any single alphabetic character; a hyphen inside a character class indicates a range of characters, so that [a-d] is the same as [abcd]
[^a-zA-Z] any single non-alphabetic character; a "^" at the beginning of a character class negates the class so that it matches any character that is not listed.
[+\-*/] matches any one of the usual arithmetical operators. (In a character class, most special characters lose their special status and can be used without blackslashes. However "\", "^", and "]" are still special and must be escaped and "-" becomes special.)
ab Matches the string "ab"; when patterns are concatenated, the strings that they match are also concatenated
[a-z][a-z][0-9] matches a string consisting of two lower-case letters followed by one digit
a|b matches either "a" or "b"; a "|" between patterns means "or" and the overall pattern matches any string that matches either of the sub-patterns
a|bc matches either "a" or "bc"; the "|" has lower precedence than concatenation so that "a|bc" means the same thing as "a|(bc)"
a* matches the empty string and any string of a's; a "*" after a pattern means repeat the pattern zero or more times
ab* matches a string consisting of an a followed by zero or more b's; * has higher precedence than concatenation so that "ab*" means "a(b*)"
(ab)* matches the empty string and the strings ab, abab, ababab, abababab, ...
a+ matches a sequence of one or more a's (does not match the empty string); "+" means "one or more repetitions of the preceding pattern"; + has the same precedence as *
a? matches the empty string and the string "a"; "?" means "optional" or "either empty or matching the previous pattern"; ? has the same precedence as *
a{3,5} matches aaa, aaaa, and aaaaa; {m,n} means "matching m through n copies of the preceding pattern, where m and n are non-negative integers; "{m,}" matches m or more copies; "{n,n}" matches exactly n copies. {m,n} has the same precedence as *
"[^"]*" matches a string enclosed in double quotes, including the quotation marks, where the quoted string cannot contain any embedded double quotes; the pattern ".*" would match strings with nested quotation marks, such as: "one" two "three"
\w matches a single "word" character; this is an abbreviation for [a-zA-Z0-9]; other abbreviations include: \W = any non-word character, \s = any whitespace character, \S any non-whitespace character
^a matches an a at the beginning of a line; a "^" does not match any characters itself but "anchors" the expression to the start of the line.
a$ matches an a at the end of a line; a "$" does not match any characters itself but "anchors" the expression to the end of the line.
\bfoo\b matches foo as a complete word; that is, foo must be bounded on both ends by a non-word character or by a start or end of line; \b does not itself match any character but "anchors" the pattern to a word boundary. Similarly, \B matches any non-word-boundary.
\b\w+\b matches any entire word. (Note: keep in mind that digits are word characters.)
^.+$ matches an entire non-empty line. (Remember that "." matches any character, "+" means "one or more", and the ^ and $ anchor the pattern to the beginning and end of the line)



Backreferences

There is one more important aspect of regular expressions on a computer: backreferences. A backreference is a way of referring to a substring that was matched by an earlier part of the expression. Backreferences take the form \1, \2, \3, ..., \9. \1 represents the part of the string that was matched by the first parenthesized sub-expression in the regular expression; \2, the part that was matched by the second parenthesized sub-expression, and so on. For example, the expression

          ^(\w+).*\1$

matches a line of text that begins and ends with the same word. The \1 matches whatever sequence of characters were matches the by the \w+ that is enclosed in the first (and only) set of parentheses in the expression. The numbering of sub-expressions is done by counting left parentheses, and sub-expressions can be nested. For example, in

          ((\d+)\s*[+\-*/]\s*(\d+))=(\d+)

\1 would refer to whatever matched (\d+)\s*[+\-*/]\s*(\d+), \2 would refer to the stuff that matched the first \d+, and \4 would refer to the stuff that matches the final, \d+.

When doing a "find and replace" operation with regular expressions, it is usually possible to use backreferences in the replacement string. This means that is is possible to include selected pieces of the original string in the replacement. This is actually the most interesting use of backreferences. In the replacement text, \0 can be used to represent the entire matched substring. (In some implementations, including Java's, backreferences in the replacement string are written as $0, $1, etc., instead of using "\".)

(Warning: CPSC 229 students should note that "regular expressions" that contain backreferences might not be regular expressions at all in the usual sense. That is, the language that is represented by a "regular expression" with backreferences might not be a regular language! Backreferences extend the power of regular expression beyond what can be done with the regular expressions introduced in Section 3.2 of the CPSC 229 textbook.)


The egrep Command

grep is a command line utility that is commonly available on UNIX computers, including Linux and Mac OS. Its basic purpose is to find all the lines in a file that contain a substring that matches a given regular expression. It prints any matching lines to standard output. By default, grep uses a simple form of regular expression, but you can use grep -E or egrep to use the full regular expression syntax described above. I will use egrep here. The command:

           egrep  'regular-expression'  file-name

will print all lines from the file named file-name that match the regular-expression (where "match" means "contains a substring that matches"). The single quotes around the regular expression can be omitted if the regular expression does not contain any characters that have special meaning on the command line; if in doubt, use the single quotes. You can have multiple file names in the command.

Like most UNIX command-line utility programs, egrep has a lot of options. Some of the more useful ones include:

UNIX utility program such as grep are designed to work together using the "pipe" operator, "|". On the command line, this operator sends the output from one program into the input of the next program. For example, here is a command that will print an alphabetical list of all "tags" that are used in an html file, with duplicates removed:

          egrep -o '<\w+[ />]' index.html | egrep -o '\w+' | sort -u 

The first command, "egrep -o '<\w+[ />]'" prints all substrings from the file that match the regular expression <\w+[ />]. This will include strings such as "<hr>" and "<table " that are complete HTML tags or the beginning of HTML tags. The next command in the pipe, "egrep -o '\w+'", allows only the alphabetic parts of these tags to pass through. The final sort command puts the list into alphabetical order; the -u option tells it to remove duplicates. The output from the final command is printed on standard output.


Regular Expressions in Text Editors

Many text editors are capable of doing regular expression search and regular expression search and replace. However, this is usually an option that has to be turned on by checking an option in the find or replace dialog. Usually the box is named something like "Regular Expression", but I've also seen "Use Grep" used. (In the kate text editor's Replace dialog box, you also have to check a box labeled "Use Placeholders" if you want to be able to use backreferences in the replacement text for a regular expression search and replace.)

Note that sometimes, just as in grep, matches are only done within a line, so that it is not possible to match strings that extend over several lines. However, some editors allow \n to be used to explicitly match the new-line character, which allows a matched string to extend over several lines.

Regular expression find-and-replace can be a powerful tool for reformatting a text file, especially when applied to an entire file at once with a "Replace All" command.


Regular Expressions in Java

Java has support for regular expressions, provided by the classes java.util.regex.Pattern and java.util.regex.Matcher. Full details can be found in the API documentation for those classes.

Java regular expressions are specified by strings that use the syntax described above. There is one unfortunate complication when specifying a regular expression as a String literal in Java: String literals themselves have special characters that have to be escaped. For example, suppose you want to write the regular expression

          \([^"]*\)

in Java. This expression matches a string that starts with a left parenthesis, ends with a right parenthesis, and contains no double quotation marks. To write this as a Java string literal, you have to escape the special characters \ and " with backslashes and enclose it in parentheses:

          "\\([^\"]\\)"

This can get very ugly, but it is unavoidable. Remember in particular that you have to use two backslashes to "double escape" any character that is supposed to be escaped in the regular expression.

When Java does regular expression search and replace, the syntax for backreferences in the replacement text uses dollar signs rather than backslashes: $0 represents the entire string that was matched; $1 represents the string that matched the first parenthesized sub-expression, and so on. If you want to include a literal $ or \ in the replacement text, you have to escape them with a backslash: \$ and \\.

Regular expressions come up in several places in the Java API. For example, the delimiter used by a Scanner is specified as a regular expression. The String class includes several instance methods that make it easy to use regular expressions for several purposes. The following methods are defined for an object of type String:

To do fancier stuff with regular expression, you have to use the Pattern and Matcher classes. A Pattern represents a "compiled" regular expression. A pattern object is created using a static function in the Pattern class:

          Pattern regexPattern = Pattern.compile(regexString);

where regexString specifies the expression as a string. A second parameter can be added to specify one or more flags such as Pattern.CASE_INSENSITIVE and Pattern.MULTILINE. Multiple flags can be combined with the bitwise or operator, |. In fact, the only case you are really likely to use is:

          Pattern regexPattern = Pattern.compile(regexString, Pattern.CASE_INSENSITIVE);

In order to match the Pattern against a string, you have to create a Matcher:

          Matcher regexMatcher = regexPattern.matcher(String stringToBeMatched);

There are three reasons to use a Matcher instead of simply using stringToBeMatched.matches(regexString):

First, it is possible to search for a matching substring, instead of just trying to match the entire string. Do this with regexMatcher.find(), which returns a boolean to indicate whether or not a matching substring was found. By default, the next call to find() after a successful search will start looking for a match at the end of the string that was found by the previous match.

Second, it is possible to set a "region" within the string that is being matched. The region is the substring that is considered for the match. A successful find() operation sets the beginning of the region to be the position at the end of the substring that was found, but the region can also be set explicitly by calling regexMatcher.reset(startIndex,endIndex). By default, the regular expression "anchor" characters ^ and $ match the beginning and end of the region.

And third, it is possible to discover the entire string that was matched and the substrings that were matched by parenthesized sub-expressions. This is done by calling regexMatcher.group(n) after a successful match. This returns the entire matched string when n = 0 or the string that matched the n-th parenthesized sub-expression when n > 0. You can also find the starting and ending positions for group number n, by calling regexMatcher.start(n) and regexMatcher.end(n).


David Eck