You can use a regular expression to find patterns in strings: for example, to look for a specific name in a phone list or all of the names that start with the letter a. Pattern matching is one of Perl's most powerful and probably least understood features. But after you read this chapter, you'll be able to handle regular expressions almost as well as a Perl guru. With a little practice, you'll be able to do some incredibly handy things.
There are three main uses for regular expressions in Perl: matching, substitution, and translation. The matching operation uses the m// operator, which evaluates to a true or false value. The substitution operation substitutes one expression for another; it uses the s/// operator. The translation operation translates one set of characters to another and uses the tr/// operator. These operators are summarized in Table 10.1.
Operator | Description |
---|---|
m/PATTERN/ | This operator returns true if PATTERN is found in $_. |
s/PATTERN/REPLACEMENT/ | This operator replaces the substring matched by PATTERN with REPLACEMENT. |
tr/CHARACTERS/REPLACEMENTS/ | This operator replaces characters specified by CHARACTERS with the characters in REPLACEMENTS. |
All three regular expression operators work with $_ as the string to search. You can use the binding operators (see the section "The Binding Operators (=~ and !~)" later in this chapter) to search a variable other than $_.
Both the matching (m//) and the substitution (s///) operators perform variable interpolation on the PATTERN and REPLACEMENT strings. This comes in handy if you need to read the pattern from the keyboard or a file.
If the match pattern evaluates to the empty string, the last successfully matched pattern is used. So, if you see a statement like print if //; in a Perl program, look for the previous regular expression operator to see what the pattern really is. The substitution operator also uses this interpretation of the empty pattern.
In this chapter, you learn about pattern delimiters and then about each type of regular expression operator. After that, you learn how to create patterns in the section "How to Create Patterns". Then, the "Pattern Examples" section shows you some situations and how regular expressions can be used to resolve them.
m//;
In this statement, you see the two standard delimiters - the slashes
(//). However, you can use almost any non-alphanumeric, non-whitespace
character as the delimiter. This feature is useful if you want to use the
slash character inside your pattern. For
instance, to match a file name you would normally use:
m/\/root\/home\/random.dat/
This match statement is hard to read
because all of the slashes seem to run together (some programmers say they look
like teepees). If you use an alternate delimiter, it might look like this:
m!/root/home/random.dat!
or
m{/root/home/random.dat}
You can see that these examples are a
little clearer. The last example also shows that if a left bracket is used as
the starting delimiter, then the ending delimiter must be the right bracket.
Errata Note |
The printed version of this book shows the above examples as m!\/root\/home\/random.dat! and as m{\/root\/home\/random.dat}. While I was writing the book it did not occur to me that the / character was not a metacharacter and only needed to be escaped because of the delimiters. Obviously, if the / character is the delimiter, it needs to be escaped in order to use it inside the pattern. However, if an alternative delimiter is used, it no longer needs to be escaped. This fact was pointed out to me by Garen Deve. |
Both the match and substitution operators let you use variable interpolation. You can take advantage of this to use a single-quoted string that does not require the slash to be escaped. For instance:
$file = '/root/home/random.dat';
m/$file/;
You might find that this technique yields clearer code than
simply changing the delimiters.
If you choose the single quote as your delimiter character, then no variable interpolation is performed on the pattern. However, you still need to use the backslash character to escape any of the meta-characters discussed in the "How to Create Patterns" section later in this chapter.
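To see the difference between the two kinds of delimiters, here is a minimal sketch. The $price and $text variables are made up for this illustration; they are not part of the earlier examples.

```perl
use strict;
use warnings;

my $price = "42";               # hypothetical variable
my $text  = 'total: $price';    # single-quoted, so it holds a literal "$price"

# With slashes, $price is interpolated and the pattern becomes /42/.
my $interpolated = ($text =~ m/$price/) ? 1 : 0;

# With single-quote delimiters nothing is interpolated, but the dollar
# sign is still a meta-character and must be escaped to match literally.
my $literal = ($text =~ m'\$price') ? 1 : 0;

print "interpolated: $interpolated, literal: $literal\n";
```

The first match fails because the string contains no "42"; the second succeeds because the escaped dollar sign matches the literal character.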
Tip
I tend to avoid delimiters that might be confused
with characters in the pattern. For example, using the plus sign as a
delimiter (m+abc+) does not help program readability. A casual
reader might think that you intend to add two expressions instead of
matching them.
Caution |
The ? has a special meaning when used as a match pattern delimiter. It works like the / delimiter except that it matches only once between calls to the reset() function. This feature may be removed in future versions of Perl, so avoid using it. |
The next few sections look at the matching, substitution, and translation operators in more detail.
The matching operator searches the $_ variable by default. This makes the match statement shorter because you don't need to specify where to search. Here is a quick example:
$_ = "AAA bbb AAA";
print "Found bbb\n" if m/bbb/;
The print statement is executed only if
the bbb character sequence is found in the $_ variable. In
this particular case, bbb will be found, so the program will display
the following:
Found bbb
The matching operator allows you to use variable
interpolation in order to create the pattern. For example:
$needToFind = "bbb";
$_ = "AAA bbb AAA";
print "Found bbb\n" if m/$needToFind/;
Using the matching operator is
so commonplace that Perl allows you to leave off the m from the
matching operator as long as slashes are used as delimiters:
$_ = "AAA bbb AAA";
print "Found bbb\n" if /bbb/;
Using the matching operator to find a
string inside a file is very easy because the defaults are designed to
facilitate this activity. For example:
$target = "M";
# Stop with a message if the file cannot be opened.
open(INPUT, "<findstr.dat") or die("Cannot open findstr.dat: $!");
while (<INPUT>) {
    if (/$target/) {
        print "Found $target on line $.\n";
    }
}
close(INPUT);
Note |
The $. special variable keeps track of the record number. Every time the diamond operators read a line, this variable is incremented. |
This example reads every line of the input file, searching for the letter M. When an M is found, the print statement is executed. The print statement prints the letter that is found and the line number it was found on.
Option | Description |
---|---|
g | This option finds all occurrences of the pattern in the string. A list of matches is returned or you can iterate over the matches using a loop statement. |
i | This option ignores the case of characters in the string. |
m | This option treats the string as multiple lines. When it is used, the ^ and $ anchors match at the beginning and end of each line inside the string instead of only at the beginning and end of the whole string. |
o | This option compiles the pattern only once. You can achieve some small performance gains with this option. It should be used with variable interpolation only when the value of the variable will not change during the lifetime of the program. |
s | This option treats the string as a single line. When it is used, the . meta-character also matches the newline character. |
x | This option lets you use extended regular expressions. Basically, this means that Perl will ignore whitespace that's not escaped with a backslash or within a character class. I highly recommend this option so you can use spaces to make your regular expressions more readable. See the section "Example: Extension Syntax" later in this chapter for more information. |
All options are specified after the last pattern delimiter. For instance, if you want the match to ignore the case of the characters in the string, you can do this:
$_ = "AAA BBB AAA";
print "Found bbb\n" if m/bbb/i;
This program finds a match even though
the pattern uses lowercase and the string uses uppercase because the /i
option was used, telling Perl to ignore the case.
The result from a global pattern match can be assigned to an array variable or used inside a loop. This feature comes in handy after you learn about meta-characters in the section called "How to Create Patterns" later in this chapter.
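As a rough sketch of how several of these options behave, the following program uses a two-line string that I made up for the purpose. It only uses simple patterns, so you can follow it before reading about meta-characters in detail.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";   # hypothetical sample data

# /g in list context returns every match; /m lets ^ match at each line start.
my @line_starts = ($text =~ /^(\w+)/mg);

# Without /s the dot will not match the embedded newline; with /s it will.
my $dot_plain  = ($text =~ /line.second/)  ? 1 : 0;
my $dot_single = ($text =~ /line.second/s) ? 1 : 0;

# /i ignores case; assigning the match list to () counts the matches.
my $count = () = ($text =~ /LINE/ig);

print "@line_starts\n";
print "$dot_plain $dot_single $count\n";
```

This program displays "first second" on the first line and "0 1 2" on the second.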
s/a/z/;
This statement changes the first a in $_
into a z. Not too complicated, huh? Things won't get complicated until
we start talking about regular expressions in earnest in the section "How to
Create Patterns" later in the chapter.
You can use variable interpolation with the substitution operator just as you can with the matching operator. For instance:
$needToReplace = "bbb";
$replacementText = "1234567890";
$_ = "AAA bbb AAA";
$result = s/$needToReplace/$replacementText/;
Note |
You can use variable interpolation in the replacement pattern as shown here, but none of the meta-characters described later in the chapter can be used in the replacement pattern. |
This program changes the $_ variable to hold "AAA 1234567890 AAA" instead of its original value, and the $result variable will be equal to 1 - the number of substitutions made.
Frequently, the substitution operator is used to remove substrings. For instance, if you want to remove the "bbb" sequence of characters from the $_ variable, you could do this:
s/bbb//;
By replacing the matched string with nothing, you have
effectively deleted it.
If brackets of any type are used as delimiters for the search pattern, you need to use a second set of brackets to enclose the replacement pattern. For instance:
$_ = "AAA bbb AAA";
$result = s{bbb}{1234567890};
Option | Description |
---|---|
e | This option forces Perl to evaluate the replacement pattern as an expression. |
g | This option replaces all occurrences of the pattern in the string. |
i | This option ignores the case of characters in the string. |
m | This option treats the string as multiple lines. When it is used, the ^ and $ anchors match at the beginning and end of each line inside the string instead of only at the beginning and end of the whole string. |
o | This option compiles the pattern only once. You can achieve some small performance gains with this option. It should be used with variable interpolation only when the value of the variable will not change during the lifetime of the program. |
s | This option treats the string as a single line. When it is used, the . meta-character also matches the newline character. |
x | This option lets you use extended regular expressions. Basically, this means that Perl ignores whitespace that is not escaped with a backslash or within a character class. I highly recommend this option so you can use spaces to make your regular expressions more readable. See the section "Example: Extension Syntax" later in this chapter for more information. |
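The /e option changes the interpretation of the replacement pattern: it is evaluated as a Perl expression, and this happens even if single quotes are used as delimiters (which would otherwise turn off variable interpolation). In Perl 4, using back quotes as delimiters caused the replacement pattern to be executed as a DOS or UNIX command, with the output of the command used as the replacement text. Perl 5 treats back quotes as ordinary delimiters, so do not rely on that behavior.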
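The effect of the /e option is easiest to see side by side with an ordinary substitution. This is only a sketch; the $order string is invented for the example.

```perl
use strict;
use warnings;

my $order = "3 apples and 5 pears";   # hypothetical sample data

# /e evaluates the replacement as Perl code; /g applies it to every match.
(my $doubled = $order) =~ s/(\d+)/$1 * 2/ge;

# Without /e the replacement is ordinary interpolated text.
(my $plain = $order) =~ s/(\d+)/$1 * 2/g;

print "$doubled\n";
print "$plain\n";
```

This program displays "6 apples and 10 pears" followed by "3 * 2 apples and 5 * 2 pears".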
tr/a/z/;
This statement translates all occurrences of a
into z. If you specify more than one character in the match character
list, you can translate multiple characters at a time. For instance:
tr/ab/z/;
translates all a and all b characters
into the z character. If the replacement list of characters is shorter
than the target list of characters, the last character in the replacement list
is repeated as often as needed. However, if more than one replacement character
is given for a matched character, only the first is used. For instance:
tr/WWW/ABC/;
results in all W characters being converted
to an A character. The rest of the replacement list is ignored.
Unlike the matching and substitution operators, the translation operator doesn't perform variable interpolation.
Note |
The tr operator gets its name from the UNIX
tr utility. If you are familiar with the tr utility, then you already know
how to use the tr operator.
The UNIX sed utility uses a y to indicate translations. To make learning Perl easier for sed users, y is supported as a synonym for tr. |
Option | Description |
---|---|
c | This option complements the match character list. In other words, the translation is done for every character that does not match the character list. |
d | This option deletes any character in the match list that does not have a corresponding character in the replacement list. |
s | This option reduces repeated instances of matched characters to a single instance of that character. |
Normally, if the match list is longer than the replacement list, the last character in the replacement list is used as the replacement for the extra characters. However, when the d option is used, the matched characters are simply deleted.
If the replacement list is empty, then no translation is done. The operator will still return the number of characters that matched, though. This is useful when you need to know how often a given letter appears in a string. This feature also can compress repeated characters using the s option.
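The counting, squeezing, and deleting behaviors can be sketched in a few lines. The word "Mississippi" is just a convenient test string that I picked for this illustration.

```perl
use strict;
use warnings;

my $text = "Mississippi";   # hypothetical sample data

# An empty replacement list changes nothing but still counts the matches.
my $s_count = ($text =~ tr/s//);

# The s option squeezes each run of a repeated character down to one.
(my $squeezed = $text) =~ tr/a-z//s;

# The d option deletes matched characters that have no replacement.
(my $no_vowels = $text) =~ tr/aeiou//d;

print "$s_count $squeezed $no_vowels\n";
```

This program displays "4 Misisipi Msssspp". Notice that $text itself is never changed; the second and third statements work on copies.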
Tip
UNIX programmers may be familiar with using the tr
utility to convert lowercase characters to uppercase characters, or vice
versa. Perl now has the lc() and uc() functions that can
do this much quicker.
$scalar = "The root has many leaves";
$match = $scalar =~ m/root/;
$substitution = $scalar =~ s/root/tree/;
$translate = $scalar =~ tr/h/H/;
print("\$match = $match\n");
print("\$substitution = $substitution\n");
print("\$translate = $translate\n");
print("\$scalar = $scalar\n");
This program displays the
following:
$match = 1
$substitution = 1
$translate = 2
$scalar = THe tree Has many leaves
This example uses all three of
the regular expression operators with the regular binding operator. Each of the
regular expression operators was bound to the $scalar variable instead
of $_. This example also shows the return values of the regular
expression operators. If you don't need the return values, you could do this:
$scalar = "The root has many leaves";
print("String has root.\n") if $scalar =~ m/root/;
$scalar =~ s/root/tree/;
$scalar =~ tr/h/H/;
print("\$scalar = $scalar\n");
This program displays the following:
String has root.
$scalar = THe tree Has many leaves
The left operand of the binding
operator is the string to be searched, modified, or transformed; the right
operand is the regular expression operator to be evaluated.
The complementary binding operator is valid only when used with the matching regular expression operator. If you use it with the substitution or translation operator, you get the following message if you're using the -w command-line option to run Perl:
Useless use of not in void context at test.pl line 4.
You can see
that the !~ is the opposite of =~ by replacing the =~
in the previous example:
$scalar = "The root has many leaves";
print("String has root.\n") if $scalar !~ m/root/;
$scalar =~ s/root/tree/;
$scalar =~ tr/h/H/;
print("\$scalar = $scalar\n");
This program displays the following:
$scalar = THe tree Has many leaves
The first print line does not
get executed because the complementary binding operator returns false.
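One place where the two binding operators often appear together is when splitting a list into lines that match and lines that don't. Here is a small sketch; the @lines array is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @lines = ("The root has many leaves", "The branch is bare", "Roots run deep");

# =~ keeps the lines that contain "root"; !~ keeps the ones that do not.
my @with    = grep { $_ =~ m/root/ } @lines;
my @without = grep { $_ !~ m/root/ } @lines;

print scalar(@with), " with, ", scalar(@without), " without\n";
```

This program displays "1 with, 2 without". The third line ends up in the second list because the match is case-sensitive and "Roots" starts with an uppercase letter.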
When creating patterns, the meta-meaning will always be the default. If you really intend to match the literal character, you need to prefix the meta-character with a backslash. You might recall that the backslash is used to create an escape sequence.
Patterns can have many different components. These components all combine to provide you with the power to match any type of string. The following list of components will give you a good idea of the variety of ways that patterns can be created. The section "Pattern Examples" later in this chapter shows many examples of these rules in action.
Meta-Character | Description |
---|---|
^ | This meta-character - the caret - will match the beginning of a string or if the /m option is used, matches the beginning of a line. It is one of two pattern anchors - the other anchor is the $. |
. | This meta-character will match any character except for the newline unless the /s option is specified. If the /s option is specified, then the newline will also be matched. |
$ | This meta-character will match the end of a string or if the /m option is used, matches the end of a line. It is one of two pattern anchors - the other anchor is the ^. |
| | This meta-character - called alternation - lets you specify two values that can cause the match to succeed. For instance, m/a|b/ means that the $_ variable must contain the "a" or "b" character for the match to succeed. |
* | This meta-character indicates that the "thing" immediately to the left should be matched 0 or more times in order to be evaluated as true. |
+ | This meta-character indicates that the "thing" immediately to the left should be matched 1 or more times in order to be evaluated as true. |
? | This meta-character indicates that the "thing" immediately to the left should be matched 0 or 1 times in order to be evaluated as true. When used in conjunction with the +, *, ?, or {n,m} meta-characters and brackets, it means that the regular expression should be non-greedy and match the smallest possible string. |
Meta-Brackets | Description |
---|---|
() | The parentheses let you affect the order of pattern evaluation and act as a form of pattern memory. See the section "Pattern Memory" later in this chapter for more information. |
(?...) | If a question mark immediately follows the left parentheses, it indicates that an extended mode component is being specified. See the section "Example: Extension Syntax" later in this chapter for more information. |
{n,m} | The curly braces let you specify how many times the "thing" immediately to the left should be matched. {n} means that it should be matched exactly n times. {n,} means it must be matched at least n times. {n,m} means that it must be matched at least n times and not more than m times. |
[] | The square brackets let you create a character class. For instance, m/[abc]/ will evaluate to true if any of "a", "b", or "c" is contained in $_. The square brackets are a more readable alternative to the alternation meta-character. |
Meta-Sequences | Description |
---|---|
\ | This meta-character "escapes" the following character. This means that any special meaning normally attached to that character is ignored. For instance, if you need to include a dollar sign in a pattern, you must use \$ to avoid Perl's variable interpolation. Use \\ to specify the backslash character in your pattern. |
\nnn | Any Octal byte. Use zero padding for values from \000 to \077 inclusively. For larger values simply use the three-digit number (like \100 or \323). |
\a | Alarm. |
\A | This meta-sequence represents the beginning of the string. Its meaning is not affected by the /m option. |
\b | This meta-sequence represents the backspace character inside a character class; otherwise, it represents a word boundary. A word boundary is the spot between word (\w) and non-word (\W) characters. Perl thinks that the \W meta-sequence matches the imaginary characters off the ends of the string. |
\B | Match a non-word boundary. |
\cn | Any control character. |
\d | Match a single digit character. |
\D | Match a single non-digit character. |
\e | Escape. |
\E | Terminate the \L or \U sequence. |
\f | Form Feed. |
\G | Match only where the previous m//g left off. |
\l | Change the next character to lowercase. |
\L | Change the following characters to lowercase until a \E sequence is encountered. |
\n | Newline. |
\Q | Quote Regular Expression meta-characters literally until the \E sequence is encountered. |
\r | Carriage Return. |
\s | Match a single whitespace character. |
\S | Match a single non-whitespace character. |
\t | Tab. |
\u | Change the next character to uppercase. |
\U | Change the following characters to uppercase until a \E sequence is encountered. |
\v | Vertical Tab. |
\w | Match a single word character. Word characters are the alphanumeric and underscore characters. |
\W | Match a single non-word character. |
\xnn | Any Hexadecimal byte. |
\Z | This meta-sequence represents the end of the string. Its meaning is not affected by the /m option. |
\$ | Dollar Sign. |
\@ | At Sign. |
\% | Percent Sign. |
Tip |
Some programmers like to enclose the alternation
sequence inside parentheses to help indicate where the sequence begins
and ends.
m/(dog|cat)/;
However, this will affect something called pattern memory, which you'll be learning about in the section "Example: Pattern Memory" later in the chapter. |
Errata Note |
The printed version of this book says: m/0123456789/;. The square brackets were missing and the semi-colon is extraneous since this is an example of an expression, not a statement. Randal Schwartz was kind enough to point out this problem. |
Errata Note |
The printed version of this book states "The caret is always the first character in the pattern when used as an anchor". However, this is not strictly true when the alternation meta-character is used. For example, /Jack$|^John/ will match when "Jack" is at the end of a string or when "John" is at the beginning of a string. Randal Schwartz was kind enough to mention that this concept needs clarification. |
The \B meta-sequence will match everywhere except at a word boundary.
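The difference between \b and \B can be sketched by counting matches in a short string. The string is made up for this illustration.

```perl
use strict;
use warnings;

my $text = "cat concat cat";   # hypothetical sample data

# \bcat\b matches "cat" only when it stands alone as a word.
my $whole_words = () = $text =~ /\bcat\b/g;

# \Bcat matches "cat" only when a word character comes right before it.
my $inside = () = $text =~ /\Bcat/g;

print "$whole_words whole words, $inside inside another word\n";
```

This program displays "2 whole words, 1 inside another word". The "cat" at the end of "concat" is the only one that does not start at a word boundary.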
The power of patterns is that you don't always know in advance the value of the string that you will be searching. If you need to match the first word in a string that was read in from a file, you probably have no idea how long it might be; therefore, you need to build a pattern. You might start with the \w symbolic character class, which will match any single alphanumeric or underscore character. So, assuming that the string is in the $_ variable, you can match a one-character word like this:
m/\w/;
If you need to match both a one-character word and a
two-character word, you can do this:
m/\w|\w\w/;
This pattern says to match a single word character or
two consecutive word characters. You could continue to add alternation
components to match the different lengths of words that you might expect to see,
but there is a better way.
You can use the + quantifier to say that the match should succeed only if the component is matched one or more times. It is used this way:
m/\w+/;
If the value of $_ was "AAA BBB", then
m/\w+/; would match the "AAA" in the string. If $_
was blank, full of whitespace, or full of other non-word characters, a
false value would be returned.
The preceding pattern will let you determine if $_ contains a word but does not let you know what the word is. In order to accomplish that, you need to enclose the matching components inside parentheses. For example:
m/(\w+)/;
By doing this, you force Perl to store the matched
string into the $1 variable. The $1 variable can be considered
as pattern memory.
This introduction to pattern components describes most of the details you need to know in order to create your own patterns or regular expressions. However, some of the components deserve a bit more study. The next few sections look at character classes, quantifiers, pattern memory, pattern precedence, and the extension syntax. Then the rest of the chapter is devoted to showing specific examples of when to use the different components.
You can use variable interpolation inside the character class, but you must be careful when doing so. For example,
$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList]/;
will display
matched
This is because the variable interpolation results in a
character class of [ADE]. If you use the variable as one-half of a character
range, you need to ensure that you don't mix letters and digits. For example,
$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList-9]/;
will result in the following error
message when executed:
/[ADE-9]/: invalid [] range in regexp at test.pl line 4.
At times,
it's necessary to match on any character except for a given character list. This
is done by complementing the character class with the caret. For example,
$_ = "AAABBBCCC";
print "matched" if m/[^ABC]/;
will display nothing. This match returns
true only if a character besides A, B, or C is in the
searched string. If you complement a list with just the letter A,
$_ = "AAABBBCCC";
print "matched" if m/[^A]/;
then the string "matched" will be
displayed because B and C are part of the string - in other
words, a character besides the letter A.
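Perl has shortcuts for some character classes that are frequently used. Here is a list of what I call symbolic character classes: \w (word characters), \W (non-word characters), \s (whitespace characters), \S (non-whitespace characters), \d (digits), and \D (non-digits).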
You can use these symbols inside other character classes but not as endpoints of a range. For example, you can do the following:
$_ = "\tAAA";
print "matched" if m/[\d\s]/;
which will display
matched
because the value of $_ includes the tab
character.
Tip
Meta-characters that appear inside the square
brackets that define a character class are used in their literal sense.
They lose their meta-meaning. This may be a little confusing at first. In
fact, I have a tendency to forget this when evaluating
patterns.
Note |
I think that most of the confusion regarding regular expressions lies in the fact that each character of a pattern might have several possible meanings. The caret could be an anchor, it could be a caret, or it could be used to complement a character class. Therefore, it is vital that you decide which context any given pattern character or symbol is in before assigning a meaning to it. |
Quantifier | Description |
---|---|
* | The component must be present zero or more times. |
+ | The component must be present one or more times. |
? | The component must be present zero or one times. |
{n} | The component must be present n times. |
{n,} | The component must be present at least n times. |
{n,m} | The component must be present at least n times and no more than m times. |
If you need to match a word whose length is unknown, you need to use the + quantifier. You can't use an * because a zero length word makes no sense. So, the match statement might look like this:
m/^\w+/;
This pattern will match "QQQ" and
"AAAAA" but not "" or " BBB ". In order to account
for the leading whitespace, which may or may not be at the beginning of a string,
you need to use the asterisk (*) quantifier in conjunction with the
\s symbolic character class in the following way:
m/\s*\w+/;
Tip
Be careful when using the * quantifier
because it can match an empty string, which might not be your intention.
The pattern /b*/ will match any string - even one without any
b characters.
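Here is a minimal sketch of that pitfall. It relies on the $& variable, which holds the text matched by the last successful match; the test string is made up.

```perl
use strict;
use warnings;

my $text = "xyz";   # hypothetical sample data with no "b" in it

# /b*/ succeeds by matching zero characters at the start of the string.
my $star_matched = ($text =~ /b*/) ? 1 : 0;
my $star_length  = length($&);

# /b+/ needs at least one "b", so it fails.
my $plus_matched = ($text =~ /b+/) ? 1 : 0;

print "$star_matched $star_length $plus_matched\n";
```

This program displays "1 0 0": the * version matched, but what it matched was the empty string.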
Errata Note |
The printed version of this book has the first match
statement as m/\w+/;. Notice that the pattern anchor was left out. |
At times, you may need to match an exact number of components. The following match statement will be true only if five words are present in the $_ variable:
$_ = "AA AB AC AD AE";
m/^(\w+(\W+|$)){5}$/;
In this example, we are matching at least one word
character followed by either one or more non-word characters or the end of
the string. The alternative is needed because \W must match an actual
character, and the end of a string is not one, so the last word has nothing
after it for \W+ to match. The {5}
quantifier is used to ensure that that combination of components is present five
times.
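The counted quantifiers are perhaps easier to see with fixed-width data. This is only a sketch, and the telephone-style strings and short words are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @candidates = ("555-1234", "55-1234", "555-12345");

# {3} and {4} demand exactly three digits, a dash, and exactly four digits.
my @valid = grep { /^\d{3}-\d{4}$/ } @candidates;

# {2,4} accepts words that are two to four characters long.
my @short_words = grep { /^\w{2,4}$/ } qw(a ab abcd abcde);

print "@valid\n";
print "@short_words\n";
```

This program displays "555-1234" and then "ab abcd". The anchors matter here: without them, the patterns would also succeed on longer strings that merely contain a fitting substring.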
Errata Note |
The printed version of the book used the pattern m/(\w+\s*){5}/; in order to match the five words. This is incorrect since the pattern \w+\s* matches a single character (remember that * matches zero or more instances of a character). Therefore m/(\w+\s*){5}/; matches "AAAA" as well as "A A A A A". |
The * and + quantifiers are greedy. They match as many characters as possible. This may not always be the behavior that you need. You can create non-greedy components by following the quantifier with a ?.
Use the following file specification in order to look at the * and + quantifiers more closely:
$_ = '/user/Jackie/temp/names.dat';
The regular expression
.* will match the entire file specification. This can be seen in the
following small program:
$_ = '/user/Jackie/temp/names.dat';
m/.*/;
print $&;
This program displays
/user/Jackie/temp/names.dat
You can see that the *
quantifier is greedy. It matched the whole string. If you add the ?
modifier to make the .* component non-greedy, what do you think the
program would display?
$_ = '/user/Jackie/temp/names.dat';
m/.*?/;
print $&;
This program displays nothing because the smallest number of
characters that the * matches is zero. If we change the * to a
+, then the program will display
/
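Non-greedy matching becomes more useful when something follows the quantified component. As a sketch, using the same file specification, the following program captures the text between slashes with both kinds of quantifier. The variable names are my own.

```perl
use strict;
use warnings;

my $path = '/user/Jackie/temp/names.dat';

# Greedy: .+ grabs as much as it can and gives back only what it must,
# so the match ends at the last slash.
my ($greedy) = $path =~ m{/(.+)/};

# Non-greedy: .+? stops at the first slash that lets the match succeed.
my ($lazy) = $path =~ m{/(.+?)/};

print "$greedy\n";
print "$lazy\n";
```

This program displays "user/Jackie/temp" and then "user".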
Next, let's look at the concept of pattern memory, which lets
you keep bits of matched string around after the match is complete.
You saw a simple example of this earlier right after the component descriptions. That example looked for the first word in a string and stored it into the first buffer, $1. The following small program
$_ = "AAA BBB CCC";
m/(\w+)/;
print("$1\n");
will display
AAA
You can use as many buffers as you need. Each time you add a set of parentheses, another buffer is used. The pattern matched by the first set is placed into $1. The pattern matched by the second set is placed into $2. And so on.
If you want to find all the words in the string, you need to use the /g match option. In order to find all the words, you can use a loop statement that loops until the match operator returns false.
$_ = "AAA BBB CCC";
while (m/(\w+)/g) {
print("$1\n");
}
The program will display
AAA
BBB
CCC
If looping through the matches is not the right approach for your
needs, perhaps you need to create an array consisting of the matches.
$_ = "AAA BBB CCC";
@matches = m/(\w+)/g;
print("@matches\n");
The program will display
AAA BBB CCC
Perl also has a few special variables to help you know
what matched and what did not. These variables will occasionally save you from
having to add parentheses to find information.
Tip |
If you need to save the value of the matched strings stored in the pattern memory, make sure to assign them to other variables. Pattern memory is local to the enclosing block and lasts only until another match is done. |
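One way to follow this advice is to assign the result of the match in a list context, which copies the buffers into your own variables. The following is a sketch with made-up variable names.

```perl
use strict;
use warnings;

my $line = "AAA BBB CCC";

# In a list context the match returns the contents of the buffers,
# so they can be copied into named variables in one step.
my ($first, $second) = $line =~ /(\w+)\s+(\w+)/;

# A later successful match overwrites $1, but the copies are safe.
"ZZZ" =~ /(\w+)/;
my $after = $1;

print "$first $second $after\n";
```

This program displays "AAA BBB ZZZ".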
m/a|b+/
In a pattern like the one above, it's hard to tell if the pattern should be
m/(a|b)+/ # match any sequence of "a" and "b" characters
# in any order.
or
m/a|(b+)/ # match either the "a" character or the "b" character
# repeated one or more times.
The order of precedence shown
in Table 10.7 is designed to solve problems like this. By looking at the table,
you can see that quantifiers have a higher precedence than alternation.
Therefore, the second interpretation is correct.
Precedence Level | Component |
---|---|
1 | Parentheses |
2 | Quantifiers |
3 | Sequences and Anchors |
4 | Alternation |
Tip
You can use parentheses to affect the order that
components are evaluated because they have the highest precedence.
However, unless you use the extended syntax, you will be affecting the
pattern memory.
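You can check the precedence rule for yourself by looking at what each form of the pattern actually matches. This sketch uses the $& variable and a test string that I made up.

```perl
use strict;
use warnings;

my $text = "abb";   # hypothetical sample data

# No parentheses: the quantifier binds to "b" only, so this is a|(b+).
$text =~ /a|b+/;
my $default = $&;

# Parentheses force the alternation to be repeated as a unit.
$text =~ /(a|b)+/;
my $grouped = $&;

# Spelling out the default grouping gives the same result as the first match.
$text =~ /a|(b+)/;
my $explicit = $&;

print "$default $grouped $explicit\n";
```

This program displays "a abb a", which shows that the unparenthesized pattern behaves like the second interpretation.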
At this time, Perl recognizes five extensions. These vary widely in functionality - from adding comments to setting options. Table 10.8 lists the extensions and gives a short description of each.
Extension | Description |
---|---|
(?# TEXT) | This extension lets you add comments to your regular expression. The TEXT value is ignored. |
(?:...) | This extension lets you add parentheses to your regular expression without causing a pattern memory position to be used. |
(?=...) | This extension lets you match values without including them in the $& variable. |
(?!...) | This extension lets you specify what should not follow your pattern. For instance, /blue(?!bird)/ means that "bluebox" and "bluesy" will be matched but not "bluebird". |
(?sxi) | This extension lets you specify an embedded option in the pattern rather than adding it after the last delimiter. This is useful if you are storing patterns in variables and using variable interpolation to do the matching. |
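Three of these extensions can be sketched in one short program. The word list and the $pattern variable are assumptions made for the example.

```perl
use strict;
use warnings;

my @words = qw(bluebox bluesy bluebird);   # hypothetical sample data

# (?!...) rejects "blue" when it is followed by "bird".
my @not_bird = grep { /blue(?!bird)/ } @words;

# (?=...) requires "bird" to follow but leaves it out of $&.
"bluebird" =~ /blue(?=bird)/;
my $matched = $&;

# (?i) embeds the ignore-case option in a pattern stored in a variable.
my $pattern = '(?i)perl';
my $found = ("I like PERL" =~ /$pattern/) ? 1 : 0;

print "@not_bird\n";
print "$matched $found\n";
```

This program displays "bluebox bluesy" and then "blue 1".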
By far the most useful feature of extended mode, in my opinion, is the ability to add comments directly inside your patterns. For example, would you rather see a pattern that looks like this:
# Match a string with two words. $1 will be the
# first word. $2 will be the second word.
m/^\s*(\w+)\W+(\w+)\s*$/;
or one that looks like this:
m/
(?# This pattern will match any string with two)
(?# and only two words in it. The matched words)
(?# will be available in $1 and $2 if the match)
(?# is successful.)
^ (?# Anchor this match to the beginning)
(?# of the string)
\s* (?# skip over any whitespace characters)
(?# use the * because there may be none)
(\w+) (?# Match the first word, we know it's)
(?# the first word because of the anchor)
(?# above. Place the matched word into)
(?# pattern memory.)
\W+ (?# Match at least one non-word)
(?# character, there may be more than one)
(\w+) (?# Match another word, put into pattern)
(?# memory also.)
\s* (?# skip over any whitespace characters)
(?# use the * because there may be none)
$ (?# Anchor this match to the end of the)
(?# string. Because both ^ and $ anchors)
(?# are present, the entire string will)
(?# need to match the pattern. A)
(?# sub-string that fits the pattern will)
(?# not match.)
/x;
Of course, the commented pattern is much longer, but both patterns take the
same amount of time to execute. In addition, it will be much easier to maintain
the commented pattern because each component is explained. When you know what
each component is doing in relation to the rest of the pattern, it becomes easy
to modify its behavior when the need arises.
Extensions also let you change the order of evaluation without affecting pattern memory. For example,
m/(?:a|b)+/;
matches the a or b characters
repeated one or more times in any order. The pattern memory will not be
affected.
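As a small illustration of the difference, the sketch below runs the same match with and without the (?:...) extension. The sample string and array names are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $text = "abba-42";

# Regular parentheses: the letters use $1, so the digits land in $2.
my @with_capture = $text =~ m/(a|b)+-(\d+)/;

# (?:...) groups without using pattern memory, so the digits land in $1.
my @without_capture = $text =~ m/(?:a|b)+-(\d+)/;

print(scalar(@with_capture), " values: @with_capture\n");        # 2 values: a 42
print(scalar(@without_capture), " value: @without_capture\n");   # 1 value: 42
```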
At times, you might like to include a pattern component in your pattern without including it in the $& variable that holds the matched string. The technical term for this is a zero-width positive look-ahead assertion. You can use this to ensure that the string following the matched component is correct without affecting the matched value. For example, if you have some data that looks like this:
David Veterinarian 56
Jackie Orthopedist 34
Karen Veterinarian 28
and you want to find all veterinarians and store
the value of the first column, you can use a look-ahead assertion. This will do
both tasks in one step. For example:
while (<>) {
push(@array, $&) if m/^\w+(?=\s+Vet)/;
}
print("@array\n");
This program will display:
David Karen
Let's look at the pattern with comments added using
the extended mode. In this case, it doesn't make sense to add comments directly
to the pattern because the pattern is part of the if statement
modifier. Adding comments in that location would make the comments hard to
format. So let's use a different tactic.
$pattern = '^\w+ (?# Match the first word in the string)
(?=\s+ (?# Use a look-ahead assertion to match)
(?# one or more whitespace characters)
Vet) (?# In addition to the whitespace, make)
(?# sure that the next column starts)
(?# with the character sequence "Vet")
';
while (<>) {
push(@array, $&) if m/$pattern/x;
}
print("@array\n");
Here we used a variable to hold the pattern and then
used variable interpolation in the pattern with the match operator. You might
want to pick a more descriptive variable name than $pattern, however.
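Another way to keep the comments next to the pattern, assuming Perl 5.005 or later, is the qr// operator, which compiles a pattern into a variable. The sketch below uses made-up in-memory data instead of reading from a file, and it stores the name in pattern memory rather than relying on $&.

```perl
use strict;
use warnings;

# Hypothetical in-memory copy of the sample data.
my @lines = (
    "David Veterinarian 56\n",
    "Jackie Orthopedist 34\n",
    "Karen Veterinarian 28\n",
);

# Under /x, whitespace is ignored and "#" starts a comment.
my $vet_name = qr{
    ^(\w+)        # capture the first word on the line
    (?=\s+Vet)    # look ahead for whitespace followed by "Vet"
}x;

my @names;
foreach my $line (@lines) {
    push(@names, $1) if $line =~ $vet_name;
}
print("@names\n");   # David Karen
```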
Tip
A pattern is not limited to one look-ahead assertion, and an assertion does not need to be the last pattern component. Because look-ahead assertions are zero-width, you can place several of them anywhere in a pattern.
The last extension that we'll discuss is the zero-width negative look-ahead assertion. This type of component is used to specify values that shouldn't follow the matched string. For example, using the same data as in the previous example, you can look for everyone who is not a veterinarian. Your first inclination might be to simply replace the (?=...) with the (?!...) in the previous example.
while (<>) {
push(@array, $&) if m/^\w+(?!\s+Vet)/;
}
print("@array\n");
Unfortunately, this program displays
Davi Jackie Kare
which is not what you need. The problem is that
the look-ahead fails when \w+ matches the whole word, so Perl backtracks and
gives up the last character of the word. The look-ahead then sees a word
character instead of whitespace followed by Vet, and the shortened match
succeeds. In order to correctly match the first word, you
need to explicitly tell Perl that the first word ends at a word boundary, like
this:
while (<>) {
push(@array, $&) if m/^\w+\b(?!\s+Vet)/;
}
print("@array\n");
This program displays
Jackie
which is correct.
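To see both behaviors side by side, here is a minimal sketch that runs the two patterns over the same data. The data is held in a made-up array, and the names are stored in pattern memory instead of $&.

```perl
use strict;
use warnings;

my @lines = (
    "David Veterinarian 56",
    "Jackie Orthopedist 34",
    "Karen Veterinarian 28",
);

my (@without_boundary, @with_boundary);
foreach my $line (@lines) {
    # \w+ can give back its last character to satisfy the look-ahead.
    push(@without_boundary, $1) if $line =~ m/^(\w+)(?!\s+Vet)/;
    # \b pins the end of the word, so a partial word cannot match.
    push(@with_boundary, $1) if $line =~ m/^(\w+)\b(?!\s+Vet)/;
}

print("without \\b: @without_boundary\n");   # Davi Jackie Kare
print("with \\b:    @with_boundary\n");      # Jackie
```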
Tip
There are many ways of matching any value. If the
first method you try doesn't work, try breaking the value into smaller
components and match each boundary. If all else fails, you can always ask
for help on the comp.lang.perl.misc
newsgroup.
m/(.)\1/;
This pattern uses pattern memory to store a single
character. Then a back-reference (\1) is used to repeat the first
character. The back-reference is used to reference the pattern memory while
still inside the pattern. Anywhere else in the program, use the $1
variable. After this statement, $1 will hold the repeated character.
This pattern will match any non-newline character that appears twice in a row.
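A short sketch of the back-reference at work follows. The sample strings and the %doubled hash are made up for the illustration.

```perl
use strict;
use warnings;

my @samples = ("ABC AA ABC", "bookkeeper", "Perl");
my %doubled;

foreach my $text (@samples) {
    if ($text =~ m/(.)\1/) {
        # $1 holds the character that \1 had to repeat.
        $doubled{$text} = $1;
        print("'$text' doubles '$1'\n");
    }
    else {
        print("'$text' has no doubled character\n");
    }
}
```

Only the first doubled character is reported, because the match stops at the leftmost position where the pattern succeeds.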
m/^\s*(\w+)/;
After this statement, $1 will hold the
first word in the string. Any whitespace at the beginning of the string will
be skipped by the \s* meta-character sequence. Then the \w+
meta-character sequence will match the next word. Note that the * -
which matches zero or more - is used to match the whitespace because there may
not be any. The + - which matches one or more - is used for the word.
m/
(\w+) (?# Match a word, store its value into pattern memory)
[.!?]? (?# Some strings might hold a sentence. If so, this)
(?# component will match zero or one punctuation)
(?# characters)
\s* (?# Match trailing whitespace using the * because there)
(?# might not be any)
$ (?# Anchor the match to the end of the string)
/x;
After this statement, $1 will hold the last word in the
string. You may need to expand the character class, [.!?], by adding more
punctuation characters.
m/^(\w+)\W+(\w+)$/x;
After this statement, $1 will hold
the first word and $2 will hold the second word, assuming that the
pattern matches. The pattern starts with a caret and ends with a dollar sign,
which means that the entire string must match the pattern. The \w+
meta-character sequence matches one word. The \W+ meta-character
sequence matches the non-word characters, such as whitespace, between words. You can test for additional
words by adding one \W+(\w+) meta-character sequence for each
additional word to match.
m/^\s*(\w+)\W+(\w+)\s*$/;
After this statement, $1 will
hold the first word and $2 will hold the second word, assuming that
the pattern matches. The \s* meta-character sequence will match any
leading or trailing whitespace.
$_ = "This is the way to San Jose.";
$word = '\w+'; # match a whole word.
$space = '\W+'; # match at least one non-word character, such as whitespace
$string = '.*'; # match any number of anything except
# for the newline character.
($one, $two, $rest) = (m/^($word) $space ($word) $space ($string)/x);
After
this statement, $one will hold the first word, $two will
hold the second word, and $rest will hold everything else in the
$_ variable. This example uses variable interpolation to, hopefully,
make the match pattern easier to read. This technique also emphasizes which
meta-sequence is used to match words and whitespace. It lets the reader focus
on the whole of the pattern rather than the individual pattern components by
adding a level of abstraction.
$result = m/
^ (?# Anchor the pattern to the start of the string)
[\$\@\%] (?# Use a character class to match the first)
(?# character of a variable name)
[a-z] (?# Use a character class to ensure that the)
(?# first character of the name is a letter)
\w* (?# Use a character class to ensure that the)
(?# rest of the variable name is either an)
(?# alphanumeric or an underscore character)
$ (?# Anchor the pattern to the end of the)
(?# string. This means that for the pattern to)
(?# match, the variable name must be the only)
(?# value in the searched string.)
/ix; # Use the /i option so that the search is
# case-insensitive and use the /x option to
# allow whitespace and comments in the pattern.
After this statement,
$result will be true if $_ contains a legal variable name
and false if it does not.
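One way to try the pattern against several candidates is to wrap it in a subroutine. This is only a sketch: is_variable_name is a hypothetical helper, the candidate strings are made up, and the pattern is written on one line without the comments.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the candidate passes the pattern, else 0.
sub is_variable_name {
    my ($candidate) = @_;
    return $candidate =~ m/^[\$\@\%][a-z]\w*$/i ? 1 : 0;
}

foreach my $candidate ('$total', '@list', '%h', '$9lives', 'total', '$a b') {
    my $verdict = is_variable_name($candidate) ? "ok" : "rejected";
    printf("%-8s %s\n", $candidate, $verdict);
}
```

The first three candidates pass and the last three are rejected. Note that the pattern also rejects names that begin with an underscore, which Perl itself allows.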
$result = m/
(?# First check for just numbers in the string)
^ (?# Anchor to the start of the string)
\d+ (?# Match one or more digits)
$ (?# Anchor to the end of the string)
| (?# or)
(?# Now check for hexadecimal numbers)
^ (?# Anchor to the start of the string)
0x (?# The "0x" sequence starts a hexadecimal number)
[\da-f]+ (?# Match one or more hexadecimal characters)
$ (?# Anchor to the end of the string)
/ix;
After this statement, $result will be true if $_
contains an integer literal and false if it does not.
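@results = m/\b(?:0x[\da-f]+|\d+)\b/gi;
After this statement,
@results will contain a list of all integer literals in $_.
@results will contain an empty list if no literals were found.
The \b word boundaries keep the pattern from matching digits that are embedded in a longer word.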
m/\w\W/;
After this statement is executed, $& will
hold the last character of the first word and the next character that follows
it. If you want only the last character, use pattern memory,
m/(\w)\W/;. Then $1 will be equal to the last character of
the first word. If you use the global option with pattern memory, @array = m/(\w)\W/g;,
then you can create an array that holds the last character of each word in the
string that is followed by a non-word character.
m/\W\w/;
After this statement, $& will hold the
first character of the second word and the non-word character (usually whitespace) that
immediately precedes it. While this pattern is the opposite of the pattern
that matches the end of words, it will not match the beginning of the first
word! This is because of the \W meta-character. Simply adding a
* meta-character to the pattern after the \W does not help,
because then it would match on zero non-word characters and therefore match
every word character in the string.
$_ = '/user/Jackie/temp/names.dat';
m!^.*/(.*)!;
After this match statement, $1 will equal
names.dat. The match is anchored to the beginning of the string, and
the .* component matches everything up to the last slash because
quantifiers are greedy. Then the next (.*) matches the file
name and stores it into pattern memory. You can store the file path into
pattern memory by placing parentheses around the first .* component.
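For instance, a version with parentheses around both components might look like the sketch below. The variable names are made up, and the binding operator is used so that $_ is not needed.

```perl
use strict;
use warnings;

my $fullName = '/user/Jackie/temp/names.dat';

# The first (.*) is greedy, so it takes everything up to the last slash.
my ($path, $file) = $fullName =~ m!^(.*)/(.*)!;

print("path: $path\n");   # /user/Jackie/temp
print("file: $file\n");   # names.dat
```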
m/(?:rock|monk)fish/x;
The alternation meta-character (|) is used to
say that either rock or monk followed by fish needs
to be found. If you need to know which alternative was found, then use regular
parentheses in the pattern. After the match, $1 will be equal to
either rock or monk.
# read the whole file into memory.
open(FILE, "<fndstr.dat") or die("Cannot open fndstr.dat: $!");
@array = <FILE>;
close(FILE);
# specify which string to find.
$stringToFind = "A";
# iterate over the array looking for the
# string. The $#array notation is used to
# determine the index of the last element
# in the array.
for ($index = 0; $index <= $#array; $index++) {
    last if $array[$index] =~ /$stringToFind/;
}
# Use $index to print two lines before
# and two lines after the line that contains
# the match. Clamp the range so that it
# stays inside the array.
$first = $index - 2 < 0 ? 0 : $index - 2;
$last = $index + 2 > $#array ? $#array : $index + 2;
foreach $lineNum ($first..$last) {
    print("$lineNum: $array[$lineNum]");
}
There are many ways to perform this type of search, and this is
just one of them. This technique is only good for relatively small files
because the entire file is read into memory at once. In addition, the program
assumes that the input file always contains the string that you are looking
for.
s/^\s+//;
This pattern uses the \s predefined character
class to match any whitespace character. The plus sign means to match one or
more whitespace characters, and the caret means match only at the beginning of
the string.
s/\s+$//;
This pattern uses the \s predefined character
class to match any whitespace character. The plus sign means to match one or
more whitespace characters, and the dollar sign means match only at the end of
the string.
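The two substitutions are often used together. A minimal sketch, assuming a hypothetical trim helper and a made-up sample string:

```perl
use strict;
use warnings;

# Hypothetical helper that applies both substitutions to a copy of its argument.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;   # remove leading whitespace
    $text =~ s/\s+$//;   # remove trailing whitespace
    return $text;
}

my $trimmed = trim("  \t San Jose \n");
print("[$trimmed]\n");   # [San Jose]
```

Whitespace inside the string is left alone because both patterns are anchored.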
$prefix = "A";
s/^(.*)/$prefix$1/;
When the substitution is done, the value in the
$prefix variable will be added to the beginning of the $_
variable. This is done by using variable interpolation and pattern memory. Of
course, you might also consider using the string concatenation operator; for
instance, $_ = "A" . $_;, which is probably faster.
$suffix = "Z";
s/^(.*)/$1$suffix/;
When the substitution is done, the value in the
$suffix variable will be added to the end of the $_
variable. This is done by using variable interpolation and pattern memory. Of
course, you might also consider using the string concatenation operator; for
instance, $_ .= "Z";, which is probably faster.
s/^\s*(\w+)\W+(\w+)/$2 $1/;
This substitution statement uses the
pattern memory variables $1 and $2 to reverse the first two
words in a string. You can use a similar technique to manipulate columns of
information, the last two words, or even to change the order of more than two
matches.
s/\w/$& x 2/eg;
When the substitution is done, each word
character in $_ will be repeated. If the original string was
"123abc", the new string would be "112233aabbcc". The
e option is used to force evaluation of the replacement string. The
$& special variable is used in the replacement pattern to
reference the matched string, which is then repeated by the string repetition
operator.
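For comparison, the sketch below doubles each word character in two ways: with the /e option and the repetition operator, and with plain interpolation of pattern memory. It uses $1 instead of $& and made-up variable names.

```perl
use strict;
use warnings;

# With /e, the replacement is evaluated as a Perl expression.
my $with_eval = "123abc";
$with_eval =~ s/(\w)/$1 x 2/eg;

# Without /e, the same effect comes from interpolating $1 twice.
my $without_eval = "123abc";
$without_eval =~ s/(\w)/$1$1/g;

print("$with_eval\n");      # 112233aabbcc
print("$without_eval\n");   # 112233aabbcc
```

The /e form becomes more useful when the repeat count is computed rather than fixed.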
s/(\w+)/\u$1/g;
When the substitution is done, each word in
$_ will have its first letter capitalized. The /g option
means that each word - the \w+ meta-sequence - will be matched and
placed in $1. Then it will be replaced by \u$1. The
\u will capitalize whatever follows it; in this case, it's the
matched word.
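A quick illustration, using a made-up string and copying it first so that the original is kept:

```perl
use strict;
use warnings;

my $title = "the way to san jose";

# Copy $title, then capitalize the first letter of each word in the copy.
(my $capitalized = $title) =~ s/(\w+)/\u$1/g;

print("$capitalized\n");   # The Way To San Jose
```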
$_ = "!!!!";
$char = "!";
$insert = "AAA";
s{
($char) # look for the specified character.
(?=$char) # look for it again, but don't include
# it in the matched string, so the next
} # search will also find it.
{
$char . $insert # concatenate the specified character
# with the string to insert.
}xeg; # use extended mode, evaluate the
# replacement pattern, and match all
# possible strings.
print("$_\n");
This example uses the extended mode to add comments
directly inside the regular expression. This makes it easy to relate the
comment directly to a specific pattern element. The match pattern does not
directly reflect the originally stated goal of inserting a string between two
repeated characters. Instead, the example was quietly restated. The new goal
is to substitute all instances of $char with $char .
$insert, if $char is followed by $char. As you can see, the
end result is the same. Remember that sometimes you need to think outside the
box.
s/(\$\w+)/$1/eeg;
This is a simple example of secondary variable
interpolation. If $firstVar = "AAA" and $_ = '$firstVar',
then $_ would be equal to "AAA" after the substitution was
made. The key is that the replacement pattern is evaluated twice. This
technique is very powerful. It can be used to develop error messages used with
variable interpolation.
$errMsg = "File too large";
$fileName = "DATA.OUT";
$_ = 'Error: $errMsg for the file named $fileName';
s/(\$\w+)/$1/eeg;
print;
When this program is run, it will display
Error: File too large for the file named DATA.OUT
The values of
the $errMsg and $fileName variables were interpolated into
the replacement pattern as needed.
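Because the /ee form evaluates text as Perl code, it is best kept for templates that you control. A different technique, sketched below, looks the names up in a hash and needs only one /e. The %values hash, the $message variable, and the $unknown placeholder are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical lookup table standing in for the program's variables.
my %values = (
    errMsg   => "File too large",
    fileName => "DATA.OUT",
);

my $message = 'Error: $errMsg for the file named $fileName and $unknown';

# The replacement is a hash lookup, not evaluated text.
# Names missing from the table are left in place.
$message =~ s/\$(\w+)/exists($values{$1}) ? $values{$1} : "\$$1"/eg;

print("$message\n");
```

This displays the same message as the /ee version, with the unrecognized $unknown left untouched.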
$cnt = tr/Aa//;
After this statement executes, $cnt
will hold the number of times the letter a appears in $_.
The tr operator does not have an option to ignore the case of the
string, so both upper- and lowercase need to be specified.
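As a small check, the sketch below counts the letter in a made-up string and confirms that the string itself is not changed.

```perl
use strict;
use warnings;

my $text  = "A banana and an Apple";
my $count = ($text =~ tr/Aa//);

print("$count\n");   # 7
print("$text\n");    # A banana and an Apple
```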
tr [\200-\377] [\000-\177];
This statement uses the square
brackets to delimit the character lists. Notice that spaces can be used
between the pairs of brackets to enhance readability of the lists. The octal
values are used to specify the character ranges. The translation operator is
more efficient - in this instance - than using logical operators and a loop
statement. This is because the translation can be done by creating a simple
lookup table.
s/^\s+//;
@array = split;
After this statement executes, @array will
be an array of words. When split is called with no arguments, it splits $_ on
whitespace and ignores any leading whitespace, so the substitution is a
safeguard rather than a requirement. If you split on an explicit pattern such
as /\s+/, leading whitespace creates an empty first element in the array, and this is
probably not what you want.
$line =~ s/^\s+//;
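@array = split(/\W+/, $line);
After this statement executes,
@array will be an array of words. The + quantifier keeps a run of non-word characters, such as a comma followed by a space, from creating empty array elements.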
@array = split(//);
After this statement executes,
@array will be an array of characters. split recognizes the
empty pattern as a request to make every character into a separate array
element.
@array = split(/:/);
@array will be an array of strings
consisting of the values between the delimiters. If there are repeated
delimiters - :: in this example - then an empty array element will be
created. Use /:+/ as the delimiter to match in order to eliminate the
empty array elements.
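The sketch below compares several forms of split on made-up strings to show how each one treats leading whitespace and repeated delimiters. The array names are assumptions.

```perl
use strict;
use warnings;

my $line = "  San Jose  ";

# A single space as the pattern splits on whitespace and skips leading whitespace.
my @awk_style = split(' ', $line);     # ("San", "Jose")
# An explicit pattern keeps an empty first element for the leading whitespace.
my @by_regex  = split(/\s+/, $line);   # ("", "San", "Jose")

# Repeated delimiters create empty elements unless the pattern absorbs them.
my @fields    = split(/:/, "a::b:c");    # ("a", "", "b", "c")
my @squeezed  = split(/:+/, "a::b:c");   # ("a", "b", "c")

print(scalar(@awk_style), " ", scalar(@by_regex), " ",
      scalar(@fields), " ", scalar(@squeezed), "\n");   # 2 3 4 3
```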
While the slash character is the default pattern delimiter, you can use any character in its place. This feature is useful if the pattern contains the slash character. If you use an opening bracket or parenthesis as the beginning delimiter, use the closing bracket or parenthesis as the ending delimiter. Using the single-quote as the delimiter will turn off variable interpolation for the pattern.
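A brief sketch of the delimiter choices follows. The path, the $dir variable, and the result names are made up for the illustration.

```perl
use strict;
use warnings;

my $path = '/user/Jackie/temp/names.dat';
my $dir  = 'Jackie';

# Default delimiter: every slash in the pattern must be escaped.
my $slashes = ($path =~ m/\/user\/$dir\//) ? 1 : 0;
# Alternate delimiters: no escaping needed, and $dir is still interpolated.
my $bangs   = ($path =~ m!/user/$dir/!) ? 1 : 0;
my $braces  = ($path =~ m{/user/$dir/}) ? 1 : 0;
# Single quotes: $dir is not interpolated, so the pattern no longer
# describes this path and the match fails.
my $quotes  = ($path =~ m'/user/$dir/') ? 1 : 0;

print("slashes=$slashes bangs=$bangs braces=$braces quotes=$quotes\n");
# slashes=1 bangs=1 braces=1 quotes=0
```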
The matching operator has six options: /g, /i, /m, /o, /s, and /x. These options were described in Table 10.2. I've found that the /x option is very helpful for creating maintainable, commented programs. The /g option, used to find all matches in a string, is also very useful. And, of course, the capability to create case-insensitive patterns using the /i option is crucial in many cases.
The substitution operator has the same options as the matching operator and one more - the /e option. The /e option lets you evaluate the replacement pattern and use the new value as the replacement string. Perl 5 treats back-quotes as ordinary delimiters; the replacement pattern is not executed as a DOS or UNIX command.
The translation operator has three options: /c, /d, and /s. These options are used to complement the match character list, delete matched characters that have no corresponding replacement character, and squeeze runs of the same translated character into a single character. The operator returns the number of matched characters; if no replacement list is specified and the /d option is not used, the string is left unchanged. This is handy if you need to know how many times a given character appears in a string.
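The options can be tried on a few made-up strings, as in the sketch below. Each translation works on a copy so that the effect of one option is visible on its own; the variable names are assumptions.

```perl
use strict;
use warnings;

# /c and /d together: delete every character that is not a lowercase letter.
(my $letters = "hello, world!") =~ tr/a-z//cd;
# /d alone: delete matched characters that have no replacement.
(my $deleted = "hello, world!") =~ tr/lo//d;
# /s: squeeze runs of the same character into one.
(my $squeezed = "bookkeeper") =~ tr/a-z//s;
# No options and no replacement list: count without changing the string.
my $text  = "hello, world!";
my $count = ($text =~ tr/l//);

print("$letters\n");    # helloworld
print("$deleted\n");    # he, wrd!
print("$squeezed\n");   # bokeper
print("$count\n");      # 3
```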
The binding operators are used to force the matching, substitution, and translation operators to search a variable other than $_. The =~ operator can be used with all three of the regular expression operators, while the !~ operator can be used only with the matching operator.
Quite a bit of space was devoted to creating patterns, and the topic deserves even more space. This is easily one of the more involved features of the Perl language. One key concept is that a character can have multiple meanings. For example, the plus sign can mean a plus sign in one instance (its literal meaning), and in another it means match something one or more times (its meta-meaning).
You learned about regular expression components and that they can be combined in an infinite number of ways. Table 10.5 listed most of the meta-meanings for different characters. You read about character classes, alternation, quantifiers, anchors, pattern memory, word boundaries, and extended components.
The last section of the chapter was devoted to presenting numerous examples of how to use regular expressions to accomplish specific goals. Each situation was described, and a pattern that matched that situation was shown. Some commentary was given for each example.
In the next chapter, you'll read about how to present information by using formats. Formats are used to help relieve some of the programming burden from the task of creating reports.
m{.*];
/AA[.<]$]ER/
$_ = 'AB AB AC';
print m/c$/i;