main.append(addParagraph("Regular expressions are very common in AWK and in many areas of the Unix world. A regular expression (regex) is a special string that defines a pattern, and in general you can test other strings against the regex to see whether they contain a substring that matches the pattern. So, for example, if the pattern is abc, a string can be said to match that regex if it contains the substring \"abc\"."))
main.append(addParagraph("In AWK a regex can sometimes be written in quotes but it is more usual to see them written between two / characters."))
main.append(addParagraph("If you use a string where AWK expects a regex, that string will be converted to a regex, much as we saw with converting between strings and numbers."))
main.append(addParagraph("The simplest type of regex is a string like /abc/ which will match, as I mentioned, if a string contains those three characters in sequence."))
main.append(addParagraph("Regexes in AWK are case-sensitive and you will often see them used as the pattern in a pattern-action pair. This will perform the action for any line of the input that matches the regex in the pattern. For example, we saw this earlier when we used regexes to print out only the lines that contain the word UP or the word DOWN."))
main.append(addParagraph("As a reminder, that command was"))
main.append(addParagraph("You might recall that this matched with any line containing one of the two words and matched twice with the line containing both."))
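main.append(addParagraph("As a rough illustration of why a line can be printed twice, here is a sketch that assumes two separate pattern-action pairs and a hypothetical input file named status.txt. It is not necessarily the exact command used earlier."))
main.append(addSyntax("awk '/UP/ { print } /DOWN/ { print }' status.txt"))
main.append(addParagraph("Each pair is tested against every line independently, so a line containing both words satisfies both patterns and is printed once by each action."))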
main.append(addParagraph("A common way to use regexes in AWK is with the tilde character, ~, which compares a string with a regex and returns true if the pattern described by the regex is found in the string and similarly !~ will return true if the pattern is not found."))
main.append(addParagraph("You can use these expressions anywhere a comparison can be used and that includes in the pattern of a pattern-action statement as in"))
main.append(addParagraph("In this case we have a string which is the fourth field in the input and we are comparing this string to our pattern. Rather than outputting any line that contains the word up, this has the effect of outputting any line where the fourth word in the line is up."))
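main.append(addParagraph("A minimal sketch of such a command, assuming a hypothetical input file named status.txt and a lower-case pattern (neither is confirmed by these notes), might look like this."))
main.append(addSyntax("awk '$4 ~ /up/ { print }' status.txt"))
main.append(addParagraph("Under the same assumptions, the negated form, which prints the lines where the fourth field does not contain up, would be"))
main.append(addSyntax("awk '$4 !~ /up/ { print }' status.txt"))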
main.append(addParagraph("You can use metacharacters in AWK to make a regex more flexible. For example a . is used to match with any character."))
main.append(addParagraph("So"))
main.append(addSyntax("/a.c/ will match with abc, a.c, axc, a3c etc."))
main.append(addParagraph("The dot matches exactly one character, so"))
main.append(addSyntax("/a.c/ will not match with ac or abbc"))
main.append(addParagraph("You can use a backslash to escape a special character, so if you want your pattern to match only with the literal string \"a.c\" you would use the pattern"))
main.append(addSyntax("/a\\.c/"))
main.append(addParagraph("Similarly, you can escape the backslash character, so if you want a pattern that will match the literal string \"a\\c\" you would use the pattern"))
main.append(addSyntax("/a\\\\c/"))
main.append(addParagraph("You can also escape a forward slash which would normally denote the start or end of the regex so to match with the literal string \"a/c\" you would use the pattern"))
main.append(addSyntax("/a\\/c/"))
main.append(addParagraph("You can use the characters ^ and $ to denote the start and end of a string respectively, so the pattern"))
main.append(addSyntax("/^abc/"))
main.append(addParagraph("will match with any string that starts with the letters abc."))
main.append(addSyntax("/abc$/"))
main.append(addParagraph("will match with any string that ends with the letters abc. If you use metacharacters, the pattern will match any string that starts (or ends) with text matching that pattern. For instance"))
main.append(addSyntax("/^a.c/"))
main.append(addParagraph("will match any string that starts with an a followed by any character followed by a c, and"))
main.append(addSyntax("/a.c$/"))
main.append(addParagraph("will match with any string where the last three characters are an a and a c with any character between these."))
main.append(addParagraph("Let's look at one more example."))
main.append(addParagraph("So this looks like it will match with any line where the third word is <em>the</em>, but it's a regex, so it will actually match with any line where the third field includes these letters in sequence. It will match if the word in the third field is <em>the</em> but will also match if the field contains the word <em>them</em> or <em>neither</em>."))
main.append(addParagraph("We could add the ^ character"))
main.append(addParagraph("That will exclude any line that has <em>neither</em> as the third field but will still match where the third field contains the word <em>them</em>. I would guess that you could add a $ so"))
main.append(addParagraph("Now our regex will only match if <em>the</em> is both the start and end of the field so it will only match a line where the third field contains only the word <em>the</em>."))
main.append(addParagraph("Note that this is really only of academic interest because we have now modified the regex to the point where it only matches the literal string and if that's the case, you don't actually need a regex. A much simpler and neater solution to that particular problem would be a plain string comparison on the third field."))
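main.append(addParagraph("As a minimal sketch of the variants discussed in this example, assuming a hypothetical input file named sample.txt (the file name is not from the original notes), the four commands would look something like this. The first three use a regex on the third field and the last uses a plain string comparison."))
main.append(addSyntax("awk '$3 ~ /the/' sample.txt"))
main.append(addSyntax("awk '$3 ~ /^the/' sample.txt"))
main.append(addSyntax("awk '$3 ~ /^the$/' sample.txt"))
main.append(addSyntax("awk '$3 == \"the\"' sample.txt"))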
main.append(addHeader("WORKING WITH CHARACTER CLASSES AND QUANTIFIERS"))
main.append(addParagraph("A regex can be refined using character classes, which are sets of characters to be potentially matched. For example, if you want to match a pattern of three characters where you know the first and third characters but the middle character can be any one from a small set, you can use a character class."))
main.append(addParagraph("For example, the pattern"))
main.append(addSyntax("/a[xyz]c/"))
main.append(addParagraph("will match with \"axc\", \"ayc\" or \"azc\". So that's an a followed by any one (and only one) character from the set and a c. It won't match with \"ac\", \"abc\" or \"axyzc\". Note that the characters in the set are not separated by commas. A comma inside the brackets would be treated as one more character in the set."))
main.append(addParagraph("If we want to expand this so that the second character can be any lower-case letter, we can replace the set with a range, so"))
main.append(addSyntax("/a[a-z]c/"))
main.append(addParagraph("will now match any three characters where the first character is an a, the second character is any lower-case letter and the third character is a c."))
main.append(addParagraph("A range is just a set of characters with a start and end point. It doesn't have to be the whole alphabet, so we could rewrite"))
main.append(addSyntax("/a[xyz]c/"))
main.append(addParagraph("as a range"))
main.append(addSyntax("/a[x-z]c/"))
main.append(addParagraph("and that will have the same effect. You can also combine ranges so, for example, if the second character can be any letter, upper or lower case, you can specify the pattern as"))
main.append(addSyntax("/a[a-zA-Z]c/"))
main.append(addParagraph("If the first character inside the square brackets is a ^, the class is negated and matches any single character that is not in the set. So"))
main.append(addSyntax("/a[^a-z]c/"))
main.append(addParagraph("will only match if the second character (between an a and a c) is NOT a lower-case letter. So this wouldn't match with \"abc\" but it would match with \"aBc\"."))
main.append(addParagraph("If you put an asterisk after a character or a character class, this will match zero or more instances of it. So"))
main.append(addSyntax("/ab*c/"))
main.append(addParagraph("will match with \"ac\", \"abc\", \"abbc\" and so on. A plus sign works in the same way, except that it matches one or more instances, so"))
main.append(addSyntax("/ab+c/"))
main.append(addParagraph("will not match with \"ac\" but will match with \"abc\", \"abbc\" and so on. This is exactly the same as"))
main.append(addSyntax("/abb*c/"))
main.append(addParagraph("and the pattern here is an a followed by a b followed by zero or more b's and finally a c."))
main.append(addParagraph("If you want to specify a set number of instances as a minimum, let's say you want an a followed by 3 or more b's and then a c, you can use either"))
main.append(addSyntax("/abbb+c/"))
main.append(addParagraph("or"))
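main.append(addParagraph("One form that would fit here, assuming the alternative follows the same reasoning as /abb*c/ above (this pattern is an assumption rather than something taken from the course), is three literal b's followed by zero or more b's."))
main.append(addSyntax("/abbbb*c/"))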
main.append(addParagraph("A question mark will match 0 or 1 instances of the character, so you would use it to specify that the character is optional."))
main.append(addSyntax("/ab?c/"))
main.append(addParagraph("will match with either \"ac\" or \"abc\" but it won't match if there are multiple instances of that character so \"abbc\", \"abbbc\" and so on will not match."))
main.append(addParagraph("If you put a number in curly brackets after a character or character class, this will match exactly the specified number of instances, so"))
main.append(addSyntax("/ab{3}c/"))
main.append(addParagraph("will only match with \"abbbc\"."))
main.append(addParagraph("We can also specify a range in curly braces so"))
main.append(addSyntax("/ab{3,5}c/"))
main.append(addParagraph("will match up with \"abbbc\", \"abbbbc\" or \"abbbbbc\"."))
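main.append(addParagraph("One way to check a quantifier like this, sketched here with printf supplying a few hypothetical test strings, is to pipe the candidates through AWK and see which ones are printed. With the pattern below, only the strings with three to five b's between the a and the c should appear in the output. Interval expressions in curly brackets are part of POSIX AWK, but some older implementations do not support them, so treat this as a sketch rather than something guaranteed to work everywhere."))
main.append(addSyntax("printf 'abbc\\nabbbc\\nabbbbbc\\nabbbbbbc\\n' | awk '/ab{3,5}c/'"))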
main.append(addParagraph("As I mentioned earlier, the quantifier relates to the character or character class immediately preceding it, but if you use an asterisk or plus sign after a character or particularly a range, these will match greedily. In other words, it will match with as many characters as it can."))
main.append(addParagraph("To give an example of this, let's say that you want a pattern that will match with any HTML element. You might try using a pattern like this"))
main.append(addSyntax("/<.+>/"))
main.append(addParagraph("So this is going to match a string where the first character is <, the last character is > and in between there is any number of any characters but there must be at least one."))
main.append(addParagraph("Let's say that our HTML element is"))
main.append(addSyntax("<i>italic text</i>"))
main.append(addParagraph("We could break that down into what each part of the pattern is matching. The first part is a literal character, <, so that matches with the < at the start of the element."))
main.append(addParagraph("This is followed by a . and a + sign, which means one or more of any character. Because the match is greedy, that will match with"))
main.append(addSyntax("i>italic text</i>"))
main.append(addParagraph("and then we have our final character, >, but there's nothing left to match with. The overall match still succeeds, because the .+ can stop one character short and leave the last > for the final part of the pattern, but the result is that the pattern matches the whole line from the first < to the last > rather than a single tag."))
main.append(addParagraph("We can modify this so that we exclude the > sign from the second part of the pattern. The solution to this problem as given in the course video is"))
main.append(addSyntax("/<[^>]+>/"))
main.append(addParagraph("Actually, both solutions seem to work in pretty much the same way as you can see in the image below."))
main.append(addImageWithCaption("./images/html_elements.jpg","Using a regex to try to extract complete HTML elements from a file."))
main.append(addParagraph("In both cases, the pattern is matching any HTML tag, opening or closing, rather than a complete HTML element so it does extract the paragraph, which is a complete element. It has both an opening tag at the start and a closing tag at the end."))
main.append(addParagraph("Both also returned the HTML tags which do represent a complete element but not on a single line. In both cases, it looks like the pattern is matching anything where there is what looks like an HTML tag, and you can see that in this image."))
main.append(addImageWithCaption("./images/html_elements1.jpg","Using a regex to try to extract complete HTML elements from a file, part 2."))
main.append(addParagraph("I added a self-closing (img) element to see if the patterns would extract that as an element and they both did. I also added a bit of text before the tag to see if the patterns would still match, even though that line doesn't start with an opening tag, and in both cases they do still match."))
main.append(addParagraph("From this, it seems clear that both patterns are just going to match with any line that contains what seems to be an HTML tag and I guess that means that there is a < symbol and a > symbol somewhere in the line."))
main.append(addParagraph("In pattern-action statements this will not usually be a problem, but it can be important to know how much of a string is being matched in some of the functions that we will see later in the course."))
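main.append(addParagraph("To illustrate how much of a string each pattern consumes, here is a small sketch using the standard sub function, which replaces the first match in the line with a given string. The test string comes from the example above and the replacement text X is an arbitrary choice."))
main.append(addSyntax("echo '<i>italic text</i>' | awk '{ sub(/<.+>/, \"X\"); print }'"))
main.append(addParagraph("This should print just X, because the greedy pattern matches the whole line from the first < to the last > in one go."))
main.append(addSyntax("echo '<i>italic text</i>' | awk '{ sub(/<[^>]+>/, \"X\"); print }'"))
main.append(addParagraph("This version should replace only the opening tag, leaving the text and the closing tag in place, because the character class stops the match at the first > it meets."))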