<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>My Learning Website</title>
<link href="/styles/styles.css" rel="stylesheet" type="text/css">
<link href="/programming/styles/styles.css" rel="stylesheet" type="text/css">
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<div class="banner">
<h1 class="courselink">Perl 5 Essential Training</h1>
<h2 class="lecturer">LinkedIn Learning : Bill Weinman</h2>
<h2 class="episodetitle">Regular Expressions</h2>
</div>
<article>
<h2 class="sectiontitle">About Regular Expressions</h2>
<p>Regular Expressions (or regex) is a pattern matching language, used to find and replace strings of text and commonly used in Unix utilities such as sed and awk.</p>
<p>Regular expressions use operators to do the searching and replacing and the simplest is probably m, which has the syntax</p>
<p class="inset">m/PATTERN/modifiers</p>
<p>The m is actually optional so we could also write this as</p>
<p class="inset">/PATTERN/modifiers</p>
<p>As with quote words, we can also use different delimiters: either paired characters such as {}, (), [] or &lt;&gt;, or the same character repeated, such as ||. Again, because we rarely see them in text, pipe characters are commonly used. However, if you omit the m, you must use the forward slash as the delimiter.</p>
<p>The m operator will return a true or false value depending on whether a match is found.</p>
<p>Similar to this is the s operator which searches for text and replaces it with whatever text you provide to the operator. The syntax is</p>
<p class="inset">s/PATTERN/REPLACEMENT/modifiers</p>
<p>Unlike m, the s cannot be omitted, but we can use it with other delimiters, for example s{PATTERN}{REPLACEMENT}modifiers.</p>
<p>If you have a commonly used search pattern, you can precompile it with the qr operator. To use it, write the pattern with qr in place of the m, assign the result to a variable and then use that variable wherever a pattern is expected.</p>
<pre class="inset">
my $myregex = qr/my.STRING/is;
s/$myregex/foo/;</pre>
<p>This is equivalent to</p>
<p class="inset">s/my.STRING/foo/is;</p>
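<p>As a minimal sketch of how a precompiled pattern carries its modifiers with it, the following applies the qr pattern from above to a string of our own. The variable names $text and $changed and the sample string are assumptions made for this illustration.</p>
<pre class="inset">
use 5.28.0;
use warnings;

# The i and s modifiers are stored inside the compiled pattern.
my $myregex = qr/my.STRING/is;

# Sample text (our own): upper-case MY, then a newline where the . is.
my $text = "here is MY\nstring to change";

# Copy the text, then substitute in the copy.
( my $changed = $text ) =~ s/$myregex/foo/;

say $changed;    # here is foo to change
</pre>
<p>The substitution itself has no modifiers, yet it still ignores case and lets the . match the newline, because both settings came from the qr pattern.</p>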
<p>The binding operator, =~, is used to bind a variable or an expression to a regular expression and is used for both match and replace operations. For a match operation, it will look something like this</p>
<p class="inset">$var =~ m/pattern/modifiers;</p>
<p>In a scalar context, this returns a true value if a match is found and false otherwise. In a list context, it returns a list of the parts of the string captured by any parentheses in the pattern.</p>
<p>The negated match operator is !~ and this returns the opposite. That is, it will return a false value if a match is found and true otherwise.</p>
<p class="inset">$var !~ m/pattern/modifiers;</p>
<p>If we use the match operator with s, it will look something like this</p>
<p class="inset">$var =~ s/pattern/replacement/modifiers;</p>
<p>This returns the number of substitutions made, or a false value if nothing was replaced. The string in $var is modified in place.</p>
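<p>To check what a substitution actually evaluates to, here is a small sketch. The sample string and the variable names $count and $copy are our own, and the r modifier used at the end (available from Perl 5.14) is not covered in these notes.</p>
<pre class="inset">
use 5.28.0;
use warnings;

my $var = "one fish, two fish";

# In scalar context, s/// evaluates to the number of substitutions made.
my $count = ( $var =~ s/fish/cat/g );
say "substitutions: $count";    # 2
say "string is now: $var";      # one cat, two cat

# With the r modifier, the original is left alone and the modified
# copy is returned instead.
my $copy = ( $var =~ s/cat/dog/r );
say "copy:     $copy";          # one dog, two cat
say "original: $var";           # one cat, two cat
</pre>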
<p>We haven’t mentioned what the pattern is and this can be a simple piece of text, but we wouldn’t really need regular expressions for that. The power in regular expressions lies in the fact that we can match patterns of text using special characters, some of which are listed in figure 59.</p>
<table>
<tr>
<th>CHARACTER</th>
<th>MATCHES</th>
<th>EXAMPLE</th>
</tr>
<tr>
<td>.</td>
<td>Any single character</td>
<td>/w.sh/ will match with wish or wash but not bash</td>
</tr>
<tr>
<td>*</td>
<td>Zero or more instances of the preceding character</td>
<td>/wa*sh/ will match with wash, waaaaaash or wsh</td>
</tr>
<tr>
<td>+</td>
<td>One or more instances of the preceding character</td>
<td>/bo+k/ will match with bok, book or booooook but not bk or bake</td>
</tr>
<tr>
<td>^</td>
<td>Anchors the start of a string</td>
<td>^w will match any string that starts with a w</td>
</tr>
<tr>
<td>$</td>
<td>Anchors the end of a string</td>
<td>w$ will match any string that ends with a w</td>
</tr>
</table>
<p class="caption">Figure 59 - some of the special characters used in regular expressions</p>
<p>We can also use modifiers as shown in the syntax examples above. The most common is probably i which tells regex to ignore the case.</p>
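<p>The examples in figure 59 can be checked with a short script. This is only a sketch: the list of test pairs and the string 'now' are our own choices, and the patterns are held as plain strings so that they can be printed alongside the result.</p>
<pre class="inset">
use 5.28.0;
use warnings;

# Each entry is [ pattern, string to test ].
my @tests = (
    [ 'w.sh',  'wish' ],
    [ 'w.sh',  'bash' ],
    [ 'wa*sh', 'wsh'  ],
    [ 'bo+k',  'bk'   ],
    [ '^w',    'wash' ],
    [ 'w$',    'now'  ],    # 'now' is our own example string
);

my @results;
for my $test (@tests) {
    my ( $pattern, $string ) = @$test;
    my $result = $string =~ /$pattern/ ? 'match' : 'no match';
    push @results, $result;
    say "/$pattern/ against $string: $result";
}

# The i modifier makes the comparison ignore case.
my $ignore_case = 'WASH' =~ /wash/i ? 'match' : 'no match';
say "/wash/i against WASH: $ignore_case";
</pre>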
<h2 class="sectiontitle">Matching Text</h2>
<p>Matching text is the fundamental purpose of regular expressions. Since it returns a logical value, it is often used in conditional statements.</p>
<p>Consider the code in figure 60.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8.
9. if ( $s =~ /line/ ) {
10. say 'True';
11. } else {
12. say 'False';
13. }</pre>
<p class="caption">Figure 60 - match.pl which uses a regex in a conditional statement</p>
<p>The conditional is on line 9. It will return the value of the regex so if the regex finds a match, it will return a true value and the code in the if clause will be executed. If it returns a false value, the code in the else clause will be executed.</p>
<p>As it is, the result of executing the code is that True is output. If we, for example, changed the regex to Line rather than line, since we are not using the i modifier, the match is case-sensitive so a match will not be found and False will be output.</p>
<p>If we add the i modifier, we will see True output in both cases. This is a simple and common way to test a regular expression. Another way is to use a precompiled regular expression and a ternary operator. This would look something like the code shown in figure 61.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8. my $re = qr/line/;
9.
10. say $s =~ $re ? 'True' : 'False';</pre>
<p class="caption">Figure 61 - using a ternary operator to test a regex</p>
<h2 class="sectiontitle">Common Modifiers</h2>
<p>We saw in the previous example how the use of i as a modifier caused the matching operation to ignore case. Another modifier we might use is the global modifier, g. To demonstrate this, we have a modified version of match.pl in figure 62.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8.
9. say foreach $s =~ /i/;
10.
11. if ( $s =~ /line/ ) {
12. say 'True';
13. } else {
14. say 'False';
15. }</pre>
<p class="caption">Figure 62 - the modified version of match.pl without a g modifier</p>
<p>In the code in figure 62, we have added line 9. The foreach statement supplies a list context, so the match is evaluated in a list context rather than a scalar context. When we run this, in addition to True being output, as it was before, we see that 1 has been output. Without the g modifier and without any parentheses in the pattern, a successful match in a list context returns the single-element list (1), which simply indicates success. It is not a count of the matches: the search stops as soon as the first match is found.</p>
<p>If we add the g modifier so line 9 becomes</p>
<p class="inset">say foreach $s =~ /i/g ;</p>
<p>the expression returns a list of all matches and the foreach statement outputs them in sequence. So, we see the letter i output three times and then True.</p>
<p>In essence, the g modifier causes the whole string to be checked for matches and, in a list context, a list is returned containing all of those matches. In a scalar context, each evaluation of the match instead finds the next match in the string, which is why g is often seen in the condition of a while loop.</p>
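<p>The g modifier also has a use in a scalar context, which the following sketch illustrates. The counter $count is our own addition, and pos() is the built-in function that reports where the last match on a string ended.</p>
<pre class="inset">
use 5.28.0;
use warnings;

my $s = "This is a line of text";

# In scalar context, each evaluation of a g match finds the next
# match, so a while loop visits the matches one at a time.
my $count = 0;
while ( $s =~ /i/g ) {
    $count++;
    say "match $count ends at offset ", pos($s);
}

say "total matches: $count";    # 3
</pre>
<p>Unlike the version without g, this does give a reliable count because the loop keeps going until no further match is found.</p>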
<p>The s modifier causes a string to be treated as a single line, which in practice means that the . character is allowed to match a newline. To demonstrate this, we will modify the code in match.pl again and this modified code is shown in figure 63.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text\nmore text\nmore text";
8.
9. if ( $s =~ /t.m/ ) {
10. say 'True';
11. } else {
12. say 'False';
13. }</pre>
<p class="caption">Figure 63 - the modified version of match.pl without an s modifier</p>
<p>On line 7, we have added a couple of newline characters and we are trying to match /t.m/, that is, a t followed by any character and then an m. So, as an example, if we had tom in the text, this would give us a match.</p>
<p>If we run this, the output is False, indicating that a match wasn’t found. The only place where a t and an m are one character apart is at the end of a line, where the character between them is \n, and by default the . character matches any character except a newline. The search does continue past the newline, but nothing in the rest of the string matches either.</p>
<p>Now, we will change line 9 to</p>
<p class="inset">if ( $s =~ /t.m/s ) {</p>
<p>When we run the code again, the output is True. A match has been found. This is because the s modifier treats the text as a single line, which means that the . character is allowed to match a newline. Our pattern therefore matches the t followed by a newline character followed by an m.</p>
<p>Note that \n is a single character in this context.</p>
<p>Now, consider what happens if we change line 9 like so</p>
<p class="inset">if ( $s =~ /^m/ ) {</p>
<p>We might expect a match to be found because we are looking for an m at the start of a line, and m is the first character of the second line. However, the output is False because, by default, ^ only matches at the very start of the string, which here is the T of This.</p>
<p>If we modify the code to put the s modifier in</p>
<p class="inset">if ( $s =~ /^m/s ) {</p>
<p>and run the code again, we still get False output. The s modifier only changes what the . character matches; ^ still anchors to the start of the whole string, so we would only get a match if there was an m at the start of the first line.</p>
<p>We will modify line 9 again like so</p>
<p class="inset">if ( $s =~ /^m/m ) {</p>
<p>Here, we are using the m modifier which causes the text to be treated as multiple lines. If we run the code now, the m at the start of the second line will be matched and the output is now True.</p>
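<p>The effect of the s and m modifiers on ^ and on the . character can be compared side by side. The following is a sketch using the same string as figure 63 (with two newline characters); the @checks list and its labels are our own.</p>
<pre class="inset">
use 5.28.0;
use warnings;

my $s = "This is a line of text\nmore text\nmore text";

# Each entry is [ label, result of the match ].
my @checks = (
    [ '/^m/  ', $s =~ /^m/   ? 'True' : 'False' ],  # ^ is start of string only
    [ '/^m/s ', $s =~ /^m/s  ? 'True' : 'False' ],  # s does not affect ^
    [ '/^m/m ', $s =~ /^m/m  ? 'True' : 'False' ],  # m lets ^ match after a newline
    [ '/t.m/ ', $s =~ /t.m/  ? 'True' : 'False' ],  # . will not match the newline
    [ '/t.m/s', $s =~ /t.m/s ? 'True' : 'False' ],  # s lets . match the newline
);

my @results;
for my $check (@checks) {
    my ( $label, $result ) = @$check;
    push @results, $result;
    say "$label gives $result";
}
</pre>
<p>Only the third and fifth checks give True, which shows that m changes the behaviour of ^ while s changes the behaviour of the . character.</p>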
<p>The m modifier can also be combined with the x modifier, which allows the regex to be spread over multiple lines. As an example, let’s say we modify line 9 like so</p>
<p class="inset">if ( $s =~ /^m.*text$/m ) {</p>
<p>This is looking for a match with m followed by any number of characters and with text at the end of the line. If we run it as is, we will get True output because a match was found.</p>
<p>Let’s say we then decide to spread this out a little by adding some white space to make it more readable. So the line becomes</p>
<pre class="inset">
if ( $s =~ /
^m
.*text
$
/m ) {</pre>
<p>If we run this code, we will now see False being output because the white space we have introduced is included in the pattern and the text does not match this. However, if we add the x modifier</p>
<pre class="inset">
if ( $s =~ /
^m
.*text
$
/mx ) {</pre>
<p>The x modifier tells Perl to ignore white space (and to allow comments) inside the pattern, so the regex can be laid out over several lines and we get the output as True again.</p>
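<p>Because the x modifier also permits comments inside a pattern, a longer regex can be documented piece by piece. The sketch below assumes the same three-line string as before; the comments, the qr variable $re and the second test string are our own additions.</p>
<pre class="inset">
use 5.28.0;
use warnings;

my $s = "This is a line of text\nmore text\nmore text";

my $re = qr/
    ^m        # an m at the start of a line
    .*text    # any characters, then the word text
    $         # the end of that line
/mx;

my $multi_line = $s =~ $re ? 'True' : 'False';
say $multi_line;    # True

# Under x, a space that should be matched literally has to be
# escaped (or written as \s).
my $literal_space = "a line" =~ /a\ line/x ? 'True' : 'False';
say $literal_space;    # True
</pre>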
<p>There are other modifiers as well. More information on all of these can be found at <a href="https://perldoc.perl.org/perlre#Modifiers">https://perldoc.perl.org/perlre#Modifiers</a>.</p>
<h2 class="sectiontitle">Extracting Matches</h2>
<p>We saw how we could use the fact that, in a scalar context, match returns a true or false in a conditional statement. However, there may be occasions where we want to actually see what the match was and we can return this by putting the pattern in parentheses. Let’s look at the first version of match.pl which was shown in figure 60.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8.
9. if ( $s =~ /line/ ) {
10. say 'True';
11. } else {
12. say 'False';
13. }</pre>
<p class="caption">Figure 64 - the original version of our match.pl code</p>
<p>We want to modify line 9 so that the pattern is in parentheses.</p>
<p class="inset">if ( $s =~ /(line)/ ) {</p>
<p>We also want to change the output so we can display the match, so we will change line 10 to</p>
<p class="inset">say "match is: $1";</p>
<p>We’ll also change line 12 so that, if no match is found, the output is ‘no match’ rather than False. If we run this, we get the output</p>
<p class="inset">match is: line</p>
<p>Obviously, this is more useful with wildcards. That is, where we may not know in advance what the match will be. So if we change line 9 to</p>
<p class="inset">if ( $s =~ /(..is)/ ) {</p>
<p>for example, this will give us the output</p>
<p class="inset">match is: This</p>
<p>We can also extract more than one match with $2, $3 and so on. So we could do something like this</p>
<p class="inset">if ( $s =~ /(..is).*(..ne).(..)/ ) {</p>
<p>This matches any two characters followed by is (..is), then any number of characters (.*), then any two characters followed by ne (..ne), then a single character (.) and finally any two characters (..). It matches ‘This is a line of’. Note that the numbers of characters are very specific; the only place where a variable number of characters is allowed is the .* in the middle. The pattern could be simplified if we just wanted to test for a match, but the placement of the parentheses determines what we extract: the first four characters (..is), the four characters ending in ne (..ne) and the last two characters (..). For example, after the ne we match three additional characters but only the last two are in parentheses.</p>
<p>We can then add two additional lines so that we are outputting $2 and $3 as well. The code for match.pl with these modifications is shown in figure 65.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8.
9. if ( $s =~ /(..is).*(..ne).(..)/ ) {
10. say "match #1 is $1";
11. say "match #2 is $2";
12. say "match #3 is $3";
13. } else {
14. say 'No match.';
15. }</pre>
<p class="caption">Figure 65 - outputting several matches</p>
<p>When we run this, we get the output</p>
<pre class="inset">
match #1 is This
match #2 is line
match #3 is of</pre>
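<p>Since a match in a list context returns the captured groups, they can also be assigned straight to named variables instead of being read from $1, $2 and $3. This is a sketch of that alternative; the names $first, $second and $third are our own.</p>
<pre class="inset">
use 5.28.0;
use warnings;

my $s = "This is a line of text";

# In list context the match returns the captured groups, so they can
# be assigned to named variables in one step.
my ( $first, $second, $third ) = $s =~ /(..is).*(..ne).(..)/;

if ( defined $first ) {
    say "match #1 is $first";
    say "match #2 is $second";
    say "match #3 is $third";
} else {
    say 'No match.';
}
</pre>
<p>If the pattern does not match, the list is empty and all three variables are left undefined, which is what the defined test relies on.</p>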
<h2 class="sectiontitle">Getting a List of Matches</h2>
<p>Sometimes, we want to get a number of matches from a regular expression without knowing how many matches to expect. Imagine, for instance, we want a list of all the characters in our string that appear after the letter i. We can create an array to hold the results like this</p>
<p class="inset">my @array = $s =~ /i(.)/g;</p>
<p>So, the regular expression will return a list of characters and we will initialise the array with this list. Note that we have used the g modifier so that we get all matches returned.</p>
<p>We can then output these with a foreach loop</p>
<p class="inset">say foreach @array;</p>
<p>The full version of the code is shown in figure 66.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8.
9. my @array = $s =~ /i(.)/g;
10. say foreach @array;</pre>
<p class="caption">Figure 66 - using an array to capture an unspecified number of matches</p>
<p>In this simple example, of course, we can see that this should return 2 s characters and one n and we see this in the output.</p>
<pre class="inset">
s
s
n</pre>
<p>Note that it is the parentheses after the i in our regex that causes us to capture just the first character after the i.</p>
<p>It is interesting to note that if we change our string as follows</p>
<p class="inset">my $s = "This iis a line of text";</p>
<p>we might expect 4 results, s, i, s and n. Notice that the middle letter of iis is after an i, but since it is itself an i, the s that follows it is also after an i. In fact, we get three characters returned.</p>
<pre class="inset">
s
i
n</pre>
<p>I would deduce from this that each character is only consumed once. The regular expression matches i followed by any character, so it finds the first i and takes the s after it, it finds the second i and takes the i after it and finally it finds the i in line and takes the n after it.</p>
<p>So, after finding the second i and taking the i after it, the search resumes after the characters already matched and does not go back to see whether the character it just took was also an i. With a string such as ‘iiiiiiiiiiii’ (twelve i’s), the regex returns 6 matches, so every letter is either the i or the character that follows it but never both. Running a quick test shows that this is the case.</p>
<p>I guess that this would be analogous to looking through a deck of cards and discarding every card that comes after an ace. If the first four cards are aces, you would check the first card, discard the next, check the next card and discard the next so each card is either checked or discarded, but we don’t check the card that is being discarded. If we did, we would discard three cards in this scenario rather than two.</p>
<p>This may be significant because obviously it may have an impact on the number of results you get. For example, in the card analogy, you would be expecting to discard four cards and may be surprised that only two are actually discarded.</p>
<p>One final point here is that we initialised our array with the list returned by the regex, but this is not necessary. We could simply remove the array and change our foreach loop to</p>
<p class="inset">say foreach $s =~ /i(.)/g;</p>
<p>and this gives exactly the same result.</p>
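<p>The non-overlapping behaviour described above can be confirmed with a quick test. The sketch below also shows one possible way to get the overlapping results, using a lookahead, written (?=...), which is a Perl regex feature not otherwise covered in these notes. The variable names are our own.</p>
<pre class="inset">
use 5.28.0;
use warnings;

my $s = 'i' x 12;    # twelve consecutive i characters

# Each match consumes two characters, so matches cannot overlap.
my @pairs = $s =~ /i(.)/g;
say 'consuming matches: ', scalar @pairs;          # 6

# A lookahead checks the next character without consuming it, so
# every i that has a character after it is reported.
my @overlapping = $s =~ /i(?=(.))/g;
say 'lookahead matches: ', scalar @overlapping;    # 11

# The same idea applied to the string with iis in it.
my @after_i = "This iis a line of text" =~ /i(?=(.))/g;
say "characters after an i: @after_i";             # s i s n
</pre>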
<h2 class="sectiontitle">Simple Matches</h2>
<p>We have seen some examples of simple matches such as</p>
<p class="inset">if ( $s =~ /i/ ) {</p>
<p>which looks for a match with the character i. We can also anchor this to the start or end of the string so it only matches if the character (I should really say pattern, it just so happens that in this example the pattern is a single character) is at the start or the end of the string.</p>
<p>We can search for more complex patterns. For example</p>
<p class="inset">if ( $s =~ /i{3}/ ) {</p>
<p>would match if there were three consecutive instances of this character. We could also search for, let’s say, anything between 3 and 5 instances like so</p>
<p class="inset">if ( $s =~ /i{3,5}/ ) {</p>
<p>We can test this with the code shown in figure 67.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match2.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8.
9. if ( $s =~ /a{3,5}/ ) {
10. say 'True';
11. } else {
12. say 'False';
13. }</pre>
<p class="caption">Figure 67 - match2.pl which is testing a match for between 3 and 5 consecutive instances of the letter a</p>
<p>As is, this code will give the output as False. If we edit line 7 so that we have three a’s as in</p>
<p class="inset">my $s = "This is aaa line of text";</p>
<p>we now get True as the output. We can also run this with 4 or 5 a’s and we still get True. If we have more than 5 a’s as in</p>
<p class="inset">my $s = "This is aaaaaaaaa line of text";</p>
<p>we might expect a False output since the number of a’s is outside the specified range. But, of course, it does include 5 consecutive a’s so it will give us a True output. In that sense, using the pattern like this, the upper limit is really redundant since it will give us a match in any scenario where there are at least three consecutive a’s. However, if we specified exactly 3 a’s with a{3} and searched globally, the 9 a’s in the example above would give us three matches. It’s possible that in some circumstances, this might lead to a misleading result.</p>
<p>If we want our pattern to match with 3 or more instances of the letter, we can do that by omitting the upper limit, so</p>
<p class="inset">if ( $s =~ /a{3,}/ ) {</p>
<p>We can amend the code so that it will look for a match with only between 3 and 5 instances of the letter. We do this by specifying the match as a returned value by putting the patterns in parentheses. This is shown in the modified version of match2.pl shown in figure 68.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match2.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is aaaaaaaaa line of text";
8.
9. if ( $s =~ /(a{3,5})/ ) {
10. say "Match is: $1";
11. } else {
12. say 'Match not found';
13. }</pre>
<p class ="caption">Figure 68 - the modified version of match2.pl which matches between 3 and 5 instances of a character</p>
<p>Be aware that when this is executed, it will still find a match but it will only match 5 instances of the character so we will see aaaaa in the output.</p>
<p>Actually, it looks to me as though this is doing the same thing as the original version without the parentheses, it is simply that we are displaying the result. In a sense, it is meaningless to query whether the original version matched with 5 or 9 instances, since we only asked for a true or false response so in either case, the result would be the same.</p>
<p>Essentially, I guess that this needs some context to be meaningful. For example, if you are counting instances of matches found (assuming this is important) this can make a real difference. To illustrate this, let’s amend match2 so that it only returns a match with three consecutive a’s. We will keep the parentheses and display the match found which should be aaa.</p>
<p>I will also add three extra lines after line 10 to output matches 2, 3 and 4. My expectation is that I will see three matches and get an error on the fourth. Actually, the output when I run this is</p>
<pre class="inset">
Match #1 is: aaa
Use of uninitialized value $2 in concatenation (.) or string at match2.pl line 11.
Match #2 is:
Use of uninitialized value $3 in concatenation (.) or string at match2.pl line 12.
Match #3 is:
Use of uninitialized value $4 in concatenation (.) or string at match2.pl line 13.
Match #4 is:</pre>
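<p>I forgot to add the global modifier, so it stopped searching after finding the first match. I will add that in and run it again. Actually, the output is the same, which surprised me although it shouldn’t have: $2, $3 and $4 refer to the second, third and fourth sets of parentheses in the pattern, not to successive matches, and this pattern has only one set of parentheses. To see every match, we need to evaluate the regex in a list context.</p>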
<p>Figure 69 shows a modified version of this code using the correct context.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match2.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is aaaaaaaaa line of text";
8.
9. say foreach $s =~ /(a{3})/g;</pre>
<p class="caption">Figure 69 - match2.pl modified to search for 3 consecutive a's and output all the instances that it finds</p>
<p>The output we get from running the code in figure 69 is</p>
<pre class="inset">
aaa
aaa
aaa </pre>
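<p>If only the number of matches is needed rather than the matches themselves, a common Perl idiom is to force a list context with an empty list and assign the result to a scalar. The following is a sketch of that idiom; the variable $count is our own.</p>
<pre class="inset">
use 5.28.0;
use warnings;

my $s = "This is aaaaaaaaa line of text";

# Assigning the match to an empty list forces list context; assigning
# that in turn to a scalar gives the number of elements returned.
my $count = () = $s =~ /a{3}/g;

say "a{3} matched $count times";    # 3
</pre>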
<h2 class="sectiontitle">A Brief Digression</h2>
<p>I am interested to find out what happens if we modify the match to find between 3 and 5 consecutive a’s. My expectation is that we will match 5 a’s if that is possible and then 4 or 3 but I don’t think that this is completely obvious. For example, it may match 3 and then stop looking for a fourth and fifth a since 3 is the minimum number that will match the pattern.</p>
<p>I think that since 5 consecutive a’s also matches the pattern, it will take the first 5 a’s, match with those and then take the remaining 4, which is also a match.</p>
<p>The modified code is shown in figure 70.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match2.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is aaaaaaaaa line of text";
8.
9. say foreach $s =~ /(a{3,5})/g;</pre>
<p class="caption">Figure 70 - the modified version of the code shown in figure 69 where we are matching with between 3 and 5 consecutive a's</p>
<p>The output from this is</p>
<pre class="inset">
aaaaa
aaaa</pre>
<p>Actually, this does make sense if you think about it because if it stopped searching when the minimal version of the pattern was found (aaa), there wouldn’t be any point in specifying a range.</p>
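<p>For comparison, Perl can be asked to stop at the minimum by placing a ? after the range, which makes it match as few characters as it can. The sketch below runs both forms on the same string; the array names @greedy and @lazy are our own.</p>
<pre class="inset">
use 5.28.0;
use warnings;

my $s = "This is aaaaaaaaa line of text";

# Default behaviour: take as many a's as the range allows.
my @greedy = $s =~ /(a{3,5})/g;

# With ? after the range: take as few a's as the range allows.
my @lazy = $s =~ /(a{3,5}?)/g;

say "default: @greedy";    # aaaaa aaaa
say "with ?:  @lazy";      # aaa aaa aaa
</pre>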
<h2 class="sectiontitle">Matching Wildcards</h2>
<p>We mentioned wildcards previously but here, we will take a look at them in a little bit of detail and look at some examples. Firstly, look at the code sample shown in figure 71.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # warnings.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8.
9. if ( $s =~ /(text)/ ) {
10. say "Match is: $1";
11. } else {
12. say "No match found";
13. }</pre>
<p class="caption">Figure 71 - warnings.pl</p>
<p>This is something we have seen before, it searches for the pattern /text/ in our string and we have put this in parentheses so we can use it as a return value in line 10. As we would expect, the string ‘text’ is output.</p>
<p>Now, let’s say that we amend the pattern in line 9 so that we have</p>
<p class="inset">if ( $s =~ /(t.xt)/ ) {</p>
<p>If we run this again, we will see that we get exactly the same result. We have replaced the e with a . which will match with any character, and this includes the e so it will still find the same match.</p>
<p>However, with the wildcard, we could change the text to something else such as</p>
<p class="inset">my $s = "This is a line of tExt";</p>
<p>or</p>
<p class="inset">my $s = "This is a line of t9xt";</p>
<p>and it would still find a match and output it.</p>
<p>Recall that the . matches exactly one character, whatever that character is, so if we change our text to</p>
<p class="inset">my $s = "This is a line of txt";</p>
<p>we won’t find a match.</p>
<p>If we change our pattern to</p>
<p class="inset">if ( $s =~ /(t+)/ ) {</p>
<p>This will try to match with 1 or more instances of the letter t. Note, we haven’t used the i modifier so this is case-sensitive.</p>
<p>When we run this, we will find and output the letter t. If we change our text to</p>
<p class="inset">my $s = "This is a line of tttttext";</p>
<p>and run it, we will get ttttt as the output.</p>
<p>If we want to match with 0 or more instances of a character, we would use the * for that. If we change the pattern to</p>
<p class="inset">if ( $s =~ /(lin*)/ ) {</p>
<p>this will give the output lin. If we amend the text to</p>
<p class="inset">my $s = "This is a lie of text";</p>
<p>we will still get a match when we run this and we will get the output as li. So, this wouldn’t match if we used + which matches with one or more instances, but it’s fine with * because here, it matches with an l, an i and 0 instances of the letter n.</p>
<p>It is quite common to see both the . and * being used together. We can see an example of this in figure 72.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # wildcards2.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8.
9. if ( $s =~ /line (.*)/ ) {
10. say "Match is: $1";
11. } else {
12. say "No match found";
13. }</pre>
<p class="caption">Figure 72 - wildcards2.pl which shows an example of using the . and * characters together</p>
<p>Here, we are matching line followed by any number of any character. In other words, if the characters line are found we will match with this and then any number of characters up to the end of the line, but notice that line itself is not included in the returned pattern.</p>
<p>As such, the output from running this is</p>
<p class="inset">of text</p>
<p>Note that there is no space at the start of the output because there is a space in the pattern so really, I should have said that line followed by a space is not included in the output. You can confirm that this is the case by removing the space between line and of in line 7 and the result of running the code will now be that there is no match found.</p>
<p>So, we can summarise this simply by saying that .* will match any number of characters, whatever those characters are, similar to the way in which you will often see the * character on its own in, for example, a Windows environment.</p>
<p>Regular expression wildcards are, by default, greedy. This means that they will match as many characters as possible in order to satisfy the pattern. We saw this in the previous section (Simple Matches) and this is the reason why we saw that when we were matching between 3 and 5 instances of a character, the pattern would always match with five characters if it could.</p>
<p>As a further example of this, we can change the pattern on line 9 to</p>
<p class="inset">if ( $s =~ /(l.*e)/ ) {</p>
<p>So this is going to match an l followed by any number of characters and finally an e. The characters in line would be a valid match and is the shortest match we can make. However, if we run this, the output we get is</p>
<p class="inset">line of te</p>
<p>So, in this case, there was a longer string of characters which also matched the pattern and we have returned the longer of these two possible matches.</p>
<p>This might be a problem if you wanted to match something quite specific, for instance if you want to extract the word line on its own here because you would have to be sure that there wasn’t another e in the text after the pattern you are looking for.</p>
<p>Obviously, in most cases, you wouldn’t know whether that was the case because you probably wouldn’t need to search for a pattern if you knew exactly what was in the string. We can get around this problem, though, by specifying a non-greedy search and we do that by putting a ? after the wildcard. This would mean changing line 9 to</p>
<p class="inset">if ( $s =~ /(l.*?e)/ ) {</p>
<p>Now, we get line as the output when we run the code. In this case, the search is being lazy rather than greedy in that it matches the fewest characters possible that give us a valid match.</p>
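<p>The difference between the greedy and the lazy form is easiest to see by running both against the same string. This is a minimal sketch; the variable names $greedy and $lazy are our own.</p>
<pre class="inset">
use 5.28.0;
use warnings;

my $s = "This is a line of text";

# Greedy: .* runs as far as it can, then backs up to the last e.
my ($greedy) = $s =~ /(l.*e)/;

# Lazy: .*? stops at the first e it can reach.
my ($lazy) = $s =~ /(l.*?e)/;

say "greedy: $greedy";    # line of te
say "lazy:   $lazy";      # line
</pre>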
<p>Be aware that using the ? character with a wildcard is different to using it on its own. So, if we change line 9 to</p>
<p class="inset">if ( $s =~ /(lin?e)/ ) {</p>
<p>this has the effect of making the n optional since we will match with 0 or 1 instances of that character. So both line and lie would match this pattern but, for example, like would not since there is no k in the pattern.</p>
<p>As an exercise, I tried to modify the pattern so that it would also match like and this is quite simple, we need to modify the pattern like so</p>
<p class="inset">if ( $s =~ /(li.?e)/ ) {</p>
<p>This will match with li followed by any character and an e, but the question mark means that we will match up with 0 or 1 instances of the character. As a result, this will match, again, with both line and lie, but will also now match with like.</p>
<p>Note that this pattern will not match with either linke or linne.</p>
<p>So, when ? follows a quantifier such as * or +, it makes that quantifier lazy, but when it follows a single character, as it does here, it makes that character optional. This also applies when the single character is a wildcard such as the . character.</p>
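<p>The claims about which words each pattern accepts can be checked with a short loop. In this sketch, the anchors ^ and $ are our own addition so that each word has to fit the pattern as a whole, and the %matches hash is just there to record the results.</p>
<pre class="inset">
use 5.28.0;
use warnings;

my %matches;
for my $word (qw( line lie like linke linne )) {
    # ^ and $ are added so that the whole word has to fit the pattern
    my $n_optional   = $word =~ /^lin?e$/ ? 'yes' : 'no';
    my $any_optional = $word =~ /^li.?e$/ ? 'yes' : 'no';
    $matches{$word} = "$n_optional $any_optional";
    say "$word: lin?e $n_optional, li.?e $any_optional";
}
</pre>
<p>Running this reports yes for both patterns with line and lie, yes for li.?e only with like, and no for both patterns with linke and linne.</p>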
<h2 class="sectiontitle">Matching Classes of Characters</h2>
<p>In addition to matching characters we specify, we can also match groups of characters. For example, if we want to match white space (which might be spaces, tabs, vertical tabs, new lines, form feeds), we can do that with \s.</p>
<p>The inverse of this is \S, and it is quite common with this type of match for the upper-case letter to be the negation of the lower-case letter.</p>
<p>Consider the example in figure 73.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # words.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of txt";
8.
9. if ( my @a = $s =~ /(\S)/g ) {
10. say "Match is: ";
11. say foreach @a;
12. } else {
13. say "No match found";
14. }</pre>
<p class="caption">Figure 73 - words.pl which demonstrates a pattern that matches non-white space</p>
<p>The output that we see from this is</p>
<pre class="inset">
Match is:
T
h
i
s
i
s
a
l
i
n
e
o
f
t
x
t</pre>
<p>As we can see, the output is simply every character from the string that is not white space. We have taken the matched pattern, which remember is returned as a list, and initialised an array with it and then output the contents of that array using a foreach loop. Note that we used the g modifier to ensure all matches are returned.</p>
<p>We can also add a + character to \S, so line 9 becomes</p>
<p class="inset">if ( my @a = $s =~ /(\S+)/g ) {</p>
<p>This has the effect of grouping the non-white space characters, effectively returning a list of words rather than individual characters and this gives us the output</p>
<pre class="inset">
Match is:
This
is
a
line
of
text</pre>
<p>There are some useful predefined classes of characters:</p>
<pre class="inset">
\d digits
\D anything that is not a digit
\w word-class characters, that is anything that might be in a word including letters, digits or the underscore
\W anything that is not a word-class character</pre>
<p>And again, we can use these with or without a + sign, depending on whether we want the characters grouped.</p>
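<p>As a rough illustration of these classes, the sketch below runs three of them against the same string. The sample text is hypothetical and was chosen to include digits, an underscore and some punctuation.</p>
<pre class="inset">
use strict;
use warnings;
use feature 'say';

# Hypothetical sample text for this sketch
my $s = "Order 66 shipped to flat_4b on 2024-01-15";

my @digits = $s =~ /(\d+)/g;   # runs of digits
my @words  = $s =~ /(\w+)/g;   # runs of letters, digits or underscores
my @others = $s =~ /(\W+)/g;   # runs of anything that is not a word character

say "digits: @digits";
say "words: @words";
say "number of non-word runs: ", scalar @others;</pre>
<p>Note that flat_4b comes back as a single word because the underscore counts as a word character, whereas the date is broken up at each hyphen.</p>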
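<p>We can also create our own classes of characters by putting them in square brackets. For example, if we want to match any digit from 0 to 6, we can do that by listing the characters, with no separators between them since a comma or space inside the brackets would itself become part of the class</p>
<p class="inset">[0123456]</p>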
<p>We can shorten this by using a range so</p>
<p class="inset">[0-6]</p>
<p>or</p>
<p class="inset">[0-9]</p>
<p>will match, respectively, any digit between 0 and 6 and any digit between 0 and 9. Again, we can use a + sign to group the characters.</p>
<p>Figure 74, further down, shows a modified version of words.pl, which we saw in figure 73, where this time we are also looking to match groups of numeric digits.</p>
<p>We can negate a class by putting the ^ character immediately after the opening bracket. As with the ?, the meaning depends on the context (outside a class, ^ anchors a pattern to the start of a string).</p>
<p>So, if we want to match a character that is not a digit, we would use</p>
<p class="inset">[^0-9]</p>
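<p>A short sketch of both meanings of ^ may help here. The telephone number and the two test strings are just sample values, and the variable names are my own.</p>
<pre class="inset">
use strict;
use warnings;
use feature 'say';

my $phone = "0131 655 6500";   # sample value for this sketch

# Inside a class, ^ negates: remove everything that is not a digit
(my $digits_only = $phone) =~ s/[^0-9]//g;
say $digits_only;

# Outside a class, ^ anchors: keep only strings that start with a digit
my @starts_with_digit = grep { /^[0-9]/ } ("42 apples", "apples 42");
say foreach @starts_with_digit;</pre>
<p>The first say prints 01316556500 and the loop prints only 42 apples.</p>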
<p>As with numbers, we can create a class for letters</p>
<p class="inset">[a-z]</p>
<p>or</p>
<p class="inset">[A-Z]</p>
<p>If we want to match any letter, regardless of case, we can combine the two with</p>
<p class="inset">[a-zA-Z]</p>
<p>We can also combine both numbers and letters in the same way so if we want to match any letter (upper or lower case) or any digit, we would use</p>
<p class="inset">[a-zA-Z0-9]</p>
<p>Figure 74 demonstrates the use of this type of class.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # words.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text and this is a telephone number - 0131 655 6500. Let's see what we get";
8.
9. if ( my @a = $s =~ /([a-zA-Z0-9]+)/g ) {
10. say "Match is: ";
11. say foreach @a;
12. } else {
13. say "No match found";
14. }</pre>
<p class="caption">Figure 74 - words.pl modified to demonstrate our own class</p>
<p>The text on line 7 has been expanded to include some groups of numeric digits. The pattern is looking for any character that is either a letter (regardless of case) or a digit and we have used the + sign to group the characters. We have also used the g modifier to return all matches. The output from this is</p>
<pre class="inset">
Match is:
This
is
a
line
of
text
and
this
is
a
telephone
number
0131
655
6500
Let
s
see
what
we
get</pre>
<p>This kind of looks like we are matching anything that isn’t white space, and that is true given the string we are matching against in this example. But we are also not matching any other characters that are not in our class such as punctuation.</p>
<p>As an example, let’s say we modified the string to replace the word telephone with tele!phone. In the output, we would see tele and phone as two separately matched groupings. This has the same effect as changing the word to tele phone. In short, we are not just ignoring white space, we are ignoring any characters that do not conform to the pattern</p>
<p class="inset">[a-zA-Z0-9]</p>
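<p>To see the difference from simply skipping white space, this sketch runs the class and \S+ over the same shortened string. The string is an assumption based on the tele!phone example above.</p>
<pre class="inset">
use strict;
use warnings;
use feature 'say';

# Shortened, hypothetical version of the string from figure 74
my $s = "a tele!phone number - 0131 655 6500. Let's see";

my @class_runs    = $s =~ /([a-zA-Z0-9]+)/g;   # letters and digits only
my @nonspace_runs = $s =~ /(\S+)/g;            # anything except white space

say "class:     ", join("|", @class_runs);
say "non-space: ", join("|", @nonspace_runs);</pre>
<p>With the class, tele and phone are separate matches and the punctuation disappears. With \S+, tele!phone stays in one piece and the hyphen, the full stop and the apostrophe are all kept.</p>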
<h2 class="sectiontitle">Matching Metacharacters</h2>
<p>Consider the code shown in figure 75.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # metacharacters.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a (line) of text";
8.
9. if ( my @a = $s =~ /(line)/ ) {
10. say "Match is: $1";
11. } else {
12. say "No match found";
13. } </pre>
<p class="caption">Figure 75 - metacharacters.pl</p>
<p>This is a fairly straightforward example, much like those we have seen before. When we run this, we get the output Match is: line. Note that in the string, the word line is in parentheses, but these do not appear in the output because the parentheses in the pattern form a capture group rather than matching literal characters.</p>
<p>If we want to extract the parentheses as well, we need to escape these characters in the pattern so line 9 becomes</p>
<p class="inset">if ( my @a = $s =~ /(\(line\))/ ) {</p>
<p>Not forgetting, of course, that if we want to have this returned, we also need another, unescaped set of parentheses around the set that we are escaping. The escaped parentheses are now literal characters that form part of the pattern, so on their own they would no longer capture anything and $1 would not be set.</p>
<p>Now, if we run this, we get the output</p>
<p class="inset">Match is: (line)</p>
<p>In this example, we knew in advance what was in the string. A more generalised version of this that will return the parentheses and everything inside them would use a wildcard, so line 9 would look like this</p>
<p class="inset">if ( my @a = $s =~ /(\(.*\))/ ) {</p>
<p>In this example, the result would be exactly the same although now, we could change the position of the parentheses and still get a pattern matched whereas in the previous version, we would also have needed to change the pattern in order to match something different between the parentheses.</p>
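<p>One thing to be aware of with the wildcard version is that .* is greedy. The sketch below assumes a string with two parenthesised sections, which is not in the course example, and shows a negated class as one possible way to pick out each section separately.</p>
<pre class="inset">
use strict;
use warnings;
use feature 'say';

# Hypothetical string with two sets of parentheses
my $s = "This (is) a (line) of text";

# Greedy wildcard: runs from the first ( to the last )
my ($greedy) = $s =~ /(\(.*\))/;
say "greedy: $greedy";

# Negated class: anything except a closing parenthesis, so each pair is separate
my @each = $s =~ /(\([^)]*\))/g;
say "each: $_" foreach @each;</pre>
<p>The greedy version returns (is) a (line) as a single match, whereas the negated class returns (is) and (line) as two matches.</p>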
<p>Any character that has a special meaning in Perl regular expressions must be escaped if you want to match it literally in a pattern. These characters are shown in figure 76.</p>
<img src="images/image1.png" alt="Regular expression meta characters">
<p class="caption">Figure 76 - characters that have a special meaning in Perl regex which have to be escaped if you want to match them in a pattern</p>
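<p>Rather than escaping each character by hand, Perl can do it for us with the built-in quotemeta function or with \Q and \E inside a pattern. This was not covered in this part of the course, so treat the following as a sketch; the sample string and variable names are made up.</p>
<pre class="inset">
use strict;
use warnings;
use feature 'say';

my $s       = 'Total (approx.) is 9.99';   # hypothetical sample text
my $literal = '(approx.)';                 # text we want to find exactly

# Unescaped, the parentheses group and the . is a wildcard
my ($loose) = $s =~ /($literal)/;
say "loose: $loose";

# \Q...\E escapes every metacharacter in the interpolated text
my ($exact) = $s =~ /(\Q$literal\E)/;
say "exact: $exact";

# quotemeta returns the escaped form as a string
say quotemeta($literal);</pre>
<p>The loose match returns approx. without the parentheses, the exact match returns (approx.) and quotemeta shows the escaped pattern \(approx\.\) that \Q builds behind the scenes.</p>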
<h2 class="sectiontitle">Search and Replace</h2>
<p>We have seen a few examples, now, where we have used a pattern to match some part of a text string. Figure 77 shows some sample code where we are replacing the matched text.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # replace.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8.
9. $s =~ s/[se]/x/g;
10. say $s; </pre>
<p class="caption">Figure 77 - replace.pl which demonstrates matching and replacing some text</p>
<p>The statement in line 9 is using the s operator so we have the pattern that we are searching for, that is any instance of s or e in the string, the text to replace this with which is x and the global modifier. If we omit the g, the code would replace the first character it found that matched the pattern and leave the rest of the string unchanged.</p>
<p>With the g, all instances of s or e are replaced with an x and this gives us the output</p>
<p class="inset">Thix ix a linx of txxt</p>
<p>We can also extract and reuse part of the string. For example, we could change line 9 to</p>
<p class="inset">$s =~ s/(l\w+)/$1 $1 $1 $1/;</p>
<p>Here, we are matching an l followed by a word character (any letter or digit) and the + sign means that we will match 1 or more instances. In effect, this means any characters until something that is not a word character is encountered and the result is that it will give you the first whole word starting with l.</p>
<p>We have seen $1 used before; it holds the text captured by the first set of parentheses in the pattern. In this case, the captured text is line and it is being replaced with four copies of itself. The output is therefore</p>
<p class="inset">This is a line line line line of text</p>
<p>We can extract more than one match in this way. If we change line 9 to</p>
<p class="inset">$s =~ s/(l\w+)\s+(\w+)/$1 $1 $2 $2 $2 $1 $1/;</p>
<p>Again, we are going to match the first word that starts with l, which we know is line, followed by one or more white space characters and then the next group of word characters (that is, the next word). The replacement uses $1 twice, $2 three times and then $1 twice more, so the output from this is</p>
<p class="inset">This is a line line of of of line line text</p>
<p>We can make our pattern match a little bit more complicated than this. For example</p>
<p class="inset">$s =~ s/^(\w+)\s(\w+)\s(\w+)\s(\w+)/$4 $3 $2 $1/;</p>
<p>Note that we are not searching explicitly for the word line here. The pattern is anchored to the start of the string and is basically a word followed by a white space character, repeated until we have matched the first four words. We then write these back in reverse order, giving us the output</p>
<p class="inset">line a is This of text</p>
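<p>The same capture and reorder idea can be applied to other fixed layouts. As a hedged example that is not from the course, this sketch rearranges a date; the date value and the output format are assumptions made for illustration.</p>
<pre class="inset">
use strict;
use warnings;
use feature 'say';

my $date = "2024-01-15";   # hypothetical date in year-month-day order

# Copy the string, then reorder the three captured groups in the copy
(my $reordered = $date) =~ s/^(\d{4})-(\d{2})-(\d{2})$/$3.$2.$1/;

say $date;        # the original is left unchanged
say $reordered;</pre>
<p>This prints 2024-01-15 followed by 15.01.2024. Copying into a new variable before substituting keeps the original string intact.</p>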
<p>Now, let’s look at a practical application of search and replace. Consider the code shown in figure 78.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # match.pl by Philip Osztromok
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $n = 1234567890;
8. while ( $n =~ s/^(-?\d+)(\d{3})/$1,$2/ ) {}
9. say $n;</pre>
<p class="caption">Figure 78 - a practical application of search and replace, inserting commas in a high-value number</p>
<p>We have a 10 digit number, $n, on line 7. On line 8, we have a while loop whose condition is a substitution. The pattern is anchored to the start of the string and matches an optional - followed by one or more digits, captured as $1, followed by exactly 3 digits, captured as $2. The replacement is $1 followed by a comma and then $2, so each successful substitution inserts one comma.</p>
<p>The s operator returns true when it makes a replacement, so the loop (which has an empty body) repeats the substitution until the pattern no longer matches. Each pass inserts a comma in front of another group of 3 digits, so we get a comma before every group and not just one. As a result, the output is</p>
<pre class="inset">
1,234,567,890</pre>
<p>So we are basically just formatting the number in a conventional manner. This is the most complex example of a regular expression that we have seen so far, although it would be a relatively simple matter to use it with the pattern provided here.</p>
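<p>It was not completely clear to me, at first, how this works. The key is that \d+ is greedy, so on the first pass $1 grabs as many digits as it can while still leaving 3 digits for $2, which puts the first comma in front of 890. On the next pass, \d+ stops at that comma (a comma is not a digit), so $1 can only extend as far as 1234567 and the next comma goes in front of 567. The commas are therefore inserted from right to left, and the loop ends when fewer than 4 digits remain before the first comma.</p>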
<p>I note that if the number is 11 or 12 digits, the output is still properly formatted.</p>
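<p>To check what the loop does on each pass, we can record the value of $n inside the loop body instead of leaving it empty. This is a small sketch based on figure 78; the @steps array is my own addition.</p>
<pre class="inset">
use strict;
use warnings;
use feature 'say';

my $n = 1234567890;
my @steps;   # added for this sketch: the value of $n after each pass

while ( $n =~ s/^(-?\d+)(\d{3})/$1,$2/ ) {
    push @steps, $n;
}

say foreach @steps;</pre>
<p>This prints 1234567,890 then 1234,567,890 and finally 1,234,567,890, which shows the commas being added from the right-hand end of the number.</p>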
<h2 class="sectiontitle">Splitting Strings</h2>
<p>The code in figure 79 demonstrates another useful technique we can use with regular expressions, splitting a string.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # split.pl by Bill Weinman &lt;http://bw.org/contact/&gt;
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "This is a line of text";
8.
9. my @a = split(/\s+/, $s);
10. say foreach @a;</pre>
<p class="caption">Figure 79 - split.pl, demonstrating the use of a pattern to split a string into its constituent parts</p>
<p>Here, we are using the split function and we are passing two arguments to it. The first is a pattern, in this case one or more white space characters, and this specifies the delimiter that we want split to use. The second argument is the string we want to split.</p>
<p>The result of this is going to be a list of the parts of the string that lie between the delimiters. In line 10, we have a foreach loop that displays the contents of the array (@a) that we initialized with our list.</p>
<p>A practical application here is to break down an IP address into its constituent parts. This is shown in figure 80.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # split.pl by Bill Weinman &lt;http://bw.org/contact/&gt;
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "127.0.0.1";
8.
9. my @a = split(/\.+/, $s);
10. say foreach @a; </pre>
<p class="caption">Figure 80 - split.pl showing a practical application where we are extracting the four distinct octets from an IP address</p>
<p>Note that the . is a wildcard character and so it has been escaped in the pattern.</p>
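<p>To see why the escape matters, this sketch splits the same address with and without it. The variable names are my own.</p>
<pre class="inset">
use strict;
use warnings;
use feature 'say';

my $ip = "127.0.0.1";

# Escaped: the delimiter is a literal dot
my @octets = split /\./, $ip;
say "octets: @octets";

# Unescaped: every character is a delimiter, so every field is empty
# and split discards them all
my @nothing = split /./, $ip;
say "fields without the escape: ", scalar @nothing;</pre>
<p>The escaped version gives the four octets 127, 0, 0 and 1, whereas the unescaped version returns an empty list.</p>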
<p>The example in figure 81 uses the : as a separator and a slightly different method of producing the output. Rather than assign the output of split to a variable and then use that in a foreach loop, we are simply using the split function as the input for the foreach loop.</p>
<pre class="inset">
1. #!/usr/bin/perl
2. # split.pl by Bill Weinman &lt;http://bw.org/contact/&gt;
3.
4. use 5.28.0;
5. use warnings;
6.
7. my $s = "value:another value:yet another value:one more here";
8.
9. say foreach split(/:/, $s); </pre>
<p class="caption">Figure 81 - split.pl with : as the delimiter</p>
<p>There are spaces in some of the constituent parts of the string, but these are being treated as any other character, in this case we might say anything that is not a colon. If we had leading spaces with something like this on line 7</p>
<p class="inset">my $s = "value: another value: yet another value: one more here";</p>
<p>We can accommodate this by amending the delimiter like this.</p>
<p class="inset">say foreach split(/:\s*/, $s);</p>
<p>The \s after the colon means that the delimiter is now a colon followed by white space. If we omit the asterisk, the delimiter is quite literal - a colon followed by exactly one white space character, so if we have a string of</p>
<p class="inset">my $s = "value:another value: yet another value: one more here";</p>
<p>This will give the output</p>
<pre class="inset">
value:another value
yet another value
one more here</pre>
<p>Note that the colon alone is no longer recognised as a delimiter and where there were leading spaces, they are still there except for the first one which is being recognised as part of the delimiter.</p>
<p>The asterisk changes the meaning so that rather than one space, we can have any number, including zero so if we put it back in, the string shown above will give the output</p>
<pre class="inset">
value
another value
yet another value
one more here</pre>
<p>In these examples, there is only a single delimiter character. If we need more than one, let’s say a colon or a comma, as in a string such as</p>
<p class="inset">my $s = "value:another value:yet another value,one more here";</p>
<p>We can accommodate this by making a class of delimiters such as</p>
<p class="inset">say foreach split(/[:,]/, $s);</p>
<p>In this example, we removed the leading space but if necessary, we can combine the two. That is, we replace the colon with the class but leave the rest of the pattern unchanged. If our string is</p>
<p class="inset">my $s = "value: another value: yet another value, one more here";</p>
<p>We can use</p>
<p class="inset">say foreach split(/[:,]\s*/, $s);</p>
<p>Again, this gives the output</p>
<pre class="inset">
value
another value
yet another value
one more here</pre>
<p>Notice that this is the same pattern previously used, but the single character in the delimiter, the colon, has just been swapped out for a class which includes the colon and the comma.</p>
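<p>If the white space can appear before the delimiter as well as after it, the same idea extends to both sides. The following sketch assumes an untidy, made-up string and is only one way of handling it.</p>
<pre class="inset">
use strict;
use warnings;
use feature 'say';

# Hypothetical string with untidy spacing around the delimiters
my $s = "value : another value ,yet another value,  one more here";

# Optional white space, then a colon or comma, then optional white space
my @parts = split /\s*[:,]\s*/, $s;
say foreach @parts;</pre>
<p>This gives the same four values as before, value, another value, yet another value and one more here, with the stray spaces around the delimiters removed.</p>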
</article>
<div class="btngroup">
<button class="button" onclick="window.location.href='operators.html';">
Previous Chapter - Operators
</button>
<button class="button" onclick="window.location.href='functions.html';">
Next Chapter - Functions
</button>
<button class="button" onclick="window.location.href='perl5essentialtraining.html'">
Course Contents
</button>
<button class="button" onclick="window.location.href='/programming/programming.html'">
Programming Page
</button>
<button class="button" onclick="window.location.href='/index.html'">
Home
</button>
</div>
</body>
</html>