import { addBanner, addArticle, addTitle, addHeader, addParagraph, addSubHeader } from '/scripts/article.js';
import { addInset, addInsetList, addInsetCodeListing, addInsetBulletList } from '/scripts/inset.js';
import { addImageWithCaption, addButtonGroup } from '/scripts/visuals.js';
import { addSidebar} from '/scripts/sidebar.js';
import { addSyntax } from '/scripts/code.js';
import { menu } from '/scripts/web_dev_buttons.js';
import { global_menu } from '/scripts/grid_layout1.js';
import { local_menu } from '/scripts/linux.js';
const heading = document.querySelector(".heading");
const global = document.querySelector(".global_menu");
const local = document.querySelector(".local_menu");
const sidebar = document.querySelector(".sidebar");
const main = document.querySelector(".main_content");
heading.append(addTitle("AWK Essential Training"));
heading.append(addParagraph("David D. Levine - LinkedIn Learning - November 2022"));
heading.append(addParagraph("Chapter 5 - Using Control Structures"));
main.append(addHeader("BUILDING CONTROL STRUCTURES!"))
main.append(addParagraph("It's important to remember that AWK is a programming language and it has the same sort of control structures you would find in a language like C. For example, an if statement has the syntax"))
main.append(addInsetCodeListing(["if ( condition ) {", " code-block", "}"]))
main.append(addParagraph("You can omit the curly brackets if there is only one statement in the code-block, but they do help with readability."))
main.append(addParagraph("If you use a number or expression that returns a number as the condition, 0 will be considered false and anything else will be considered true. For example"))
main.append(addInsetCodeListing(["if (a-5) {", " print \"a is not 5\";", "}"]))
main.append(addParagraph("Where a=5, the result of the expression is 0 so the code would not be executed."))
main.append(addParagraph("Similarly, for a string, an empty string would be considered false and a non-empty string would be considered true."))
main.append(addParagraph("You can also add an else clause to your if statement."))
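main.append(addParagraph("As a minimal sketch of an if statement with an else clause, the following would label each input line as long or short. The threshold of 40 characters is a hypothetical value chosen for illustration and is not taken from the course."))
main.append(addInsetCodeListing(["{", "  # the threshold of 40 is an illustrative assumption", "  if ( length($0) > 40 ) {", "    print \"long\";", "  } else {", "    print \"short\";", "  }", "}"]))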
main.append(addParagraph("The AWK program can be saved in a file and executed using the -f flag"))
main.append(addSyntax("awk -f myprogram.awk input.txt"))
main.append(addParagraph("You can also put your AWK program into a shell script on a Unix-type system, and the program can include line breaks. The following image shows the file shortlong.awk, which is the AWK version of the program."))
main.append(addImageWithCaption("./images/shortlong.jpg", "The AWK version of the shortlong program."))
main.append(addParagraph("For comparison this is what the same program looks like in a shell script. Note that the code includes line breaks and we want to be sure that the shell will recognise these as being part of the program so the code is again enclosed in single quotes."))
main.append(addImageWithCaption("./images/shortlong1.jpg", "The shell version of the shortlong program."))
main.append(addParagraph("You can run shortlong.sh in the same way that you would run any shell script. If you make it executable, you can then run it with just the script name, although you may need to specify its location. In the simplest scenario, the file will be in the current directory so you will be able to run it with"))
main.append(addSyntax("./shortlong.sh"))
main.append(addParagraph("In AWK, a for loop is similar to a for loop in C so the syntax for that is"))
main.append(addInsetCodeListing(["for ( initialisation; condition; increment ) {", " body", "}"]))
main.append(addParagraph("An example of this can be seen in the following program"))
main.append(addImageWithCaption("./images/firstthree.jpg", "The firstthree.awk program."))
main.append(addParagraph("For each line of the input file, this will output the line number, field number and the value in that field for each of the first three fields, as shown below."))
main.append(addImageWithCaption("./images/firstthree1.jpg", "The output from the firstthree.awk program."))
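main.append(addParagraph("For readers who cannot see the images, a program along these lines might look like the following sketch. The exact output layout is an assumption and may differ from firstthree.awk."))
main.append(addInsetCodeListing(["{", "  # assumed layout: line number, field number, field value", "  for ( i=1; i<=3; i++ ) {", "    print NR, i, $i;", "  }", "}"]))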
main.append(addHeader("CREATING AN HTML TABLE"))
main.append(addParagraph("I mentioned earlier that AWK is good for generating HTML so let's look at an example of that and this will help to reinforce what we have learned so far."))
main.append(addParagraph("Let's say that we have a tab-separated file as shown below."))
main.append(addImageWithCaption("./images/scores.jpg", "The scores.txt file."))
main.append(addParagraph("This shows the name of each bowler and 7 rounds of scores. The aim is to convert that into an HTML table."))
main.append(addParagraph("The first step is to create a header and as it appears at the top of the table, we will use the special pattern, BEGIN."))
main.append(addSyntax("BEGIN"))
main.append(addParagraph("We will then specify tab as the field separator."))
main.append(addSyntax("FS=\"\\t\";"))
main.append(addParagraph("Next is the start of the table element"))
main.append(addSyntax("print \"<table>\";"))
main.append(addParagraph("and the first row."))
main.append(addSyntax("print \"<tr>\";"))
main.append(addParagraph("and then the table header, which we will indent with a tab"))
main.append(addSyntax("print \"\\t<th>Bowler</th>\";"))
main.append(addParagraph("We want headers for each of the 7 rounds of scores so we will use a for loop to generate them."))
main.append(addSyntax("for ( i=1; i<=7; i++ ) {"))
main.append(addSyntax(" print \"\\t<th>Round \" i \"</th>\";"))
main.append(addSyntax("}"))
main.append(addParagraph("We will finish this part of the HTML by closing the row."))
main.append(addSyntax("print \"</tr>\";"))
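main.append(addParagraph("Putting these pieces together, the header section might be assembled as in the following sketch. The enclosing braces for the BEGIN action and the for loop are included here to make the block complete."))
main.append(addInsetCodeListing(["BEGIN {", "  FS = \"\\t\";", "  print \"<table>\";", "  print \"<tr>\";", "  print \"\\t<th>Bowler</th>\";", "  for ( i=1; i<=7; i++ ) {", "    print \"\\t<th>Round \" i \"</th>\";", "  }", "  print \"</tr>\";", "}"]))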
main.append(addParagraph("So that's the header completed. Next, we will use an action without a pattern. This will generate a row for each of the bowlers and will also record the running total for each round."))
main.append(addParagraph("You might be expecting to use a for loop to iterate through each of the bowlers, but we don't need to do that because, of course, AWK will process every line of the input!"))
main.append(addParagraph("The process here is similar to the one we followed when we were doing the header, so rather than go through it line by line, I will show you the completed section and then mention points of interest. So that is"))
main.append(addInsetCodeListing(["{", " print \"<tr>\";", " print \"\\t<td>\" $1 \"</td>\";", " for (i=2; i<=8; i++) {", " print \"\\t<td>\" $i \"</td>\";", " total[i] += $i;", " }", " print \"</tr>\";", "}"]))
main.append(addParagraph("Notice that we are outputting the first field, the bowler's name, before we start the for loop, which will output the scores which are stored in fields 2 to 8."))
main.append(addParagraph("This will still work correctly if we delete some of the records or add more because of the fact that AWK will process every line of the input so it doesn't matter how many there are."))
main.append(addParagraph("However, if we want to change the number of rounds, this is hard-coded into the for loop so we would have to change the condition"))
main.append(addSyntax("i <= 8"))
main.append(addParagraph("to match the number of scores so, for instance, if we added 3 more rounds to give 10 in total the condition would become"))
main.append(addSyntax("i <= 11"))
main.append(addParagraph("but otherwise, the program would be the same. We could use NF to make the program more flexible, in which case the for loop would be"))
main.append(addSyntax("for ( i=2; i<=NF; i++ )"))
main.append(addParagraph("As an exercise you might consider writing a second version of the program that does use NF and see how this affects the program."))
main.append(addParagraph("The problem we have here is that if we are using NF, we would have to read in at least one line of the input before we output the header."))
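main.append(addParagraph("One possible way around this, offered as a sketch rather than as the course's answer, is to print the header from an action that runs only on the first record (NR == 1), by which time NF is known. This assumes that every line has the same number of fields and that the rule appears before the action that prints the rows."))
main.append(addInsetCodeListing(["BEGIN {", "  FS = \"\\t\";", "  print \"<table>\";", "}", "NR == 1 {", "  # NF is only known once the first record has been read", "  print \"<tr>\";", "  print \"\\t<th>Bowler</th>\";", "  for ( i=2; i<=NF; i++ ) {", "    print \"\\t<th>Round \" (i-1) \"</th>\";", "  }", "  print \"</tr>\";", "}"]))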
main.append(addParagraph("To finish this off, we want to output the footer which will display the average score in each round so again, I will give you the code and then go through any points of interest."))
main.append(addInsetCodeListing(["END {", " print \"<tr>\";", " print \"\\t<td><b><i>Average</i></b></td>\";", " for (i=2; i<=8; i++) {", " print \"\\t<td>\" int(total[i] / NR) \"</td>\";", " }", " print \"</tr>\";", " print \"</table>\";", "}"]))
main.append(addParagraph("We started with BEGIN to denote the start of the input and here we use END to denote the end of the input so the remaining code is being executed after the last line of input."))
main.append(addParagraph("You may also find the expression used to calculate the averages of interest."))
main.append(addSyntax("int (total[i] / NR)"))
main.append(addParagraph("If you have any programming experience, this will look pretty familiar but it is of some interest here because it shows, specifically, how this is done in AWK, or rather that this can be done in AWK."))
main.append(addParagraph("The expression in parentheses takes the total for each round and divides it by the number of rows (NR). This will most likely yield a float so it is converted to an integer for output."))
main.append(addParagraph("I mentioned earlier that the program is flexible in terms of the number of records, which means that we also need the same flexibility when calculating the averages so that they will be correct if we add or remove rows. For that reason, the averages are calculated using NR rather than a hard-coded number of records."))
main.append(addParagraph("We can run the program with the command"))
main.append(addSyntax("awk -f scores.awk scores.txt > scores.html"))
main.append(addParagraph("For reference, you can see the result by clicking <a href='scores.html'>here</a>."))
main.append(addHeader("CHALLENGE"))
main.append(addParagraph("The challenge is to iterate over the lines in a file containing HTML code and output any line that consists of an entire HTML element, in other words a line that starts with an opening HTML tag and ends with a closing HTML tag."))
main.append(addHeader("SOLUTION"))
main.append(addParagraph("The approach I took in devising a solution to this is different to the approach described in the course video so I will describe both, starting with my solution. Consider a typical HTML element."))
main.append(addSyntax("<p>This is a paragraph.</p>"))
main.append(addParagraph("This is a typical HTML element where the entire element is contained in a single line starting with the opening tag and ending with the closing tag. The approach that I have taken is simply to search for a line that has both of these things so we are using a pattern that will look something like this."))
main.append(addSyntax("$0 ~ /^<[a-zA-Z]+>/"))
main.append(addParagraph("This is looking for a pattern at the start of the line and that pattern is a less than symbol followed by 1 or more letters and ending with a greater than symbol so this is an HTML opening tag. Similarly"))
main.append(addSyntax("$0 ~ /<\\/[a-zA-Z]+>$/"))
main.append(addParagraph("is looking for a pattern at the end of the line which is similar to the previous pattern with the addition of a forward slash after the less than symbol so this is looking for a closing tag."))
main.append(addParagraph("The program uses an if statement that looks for both of these things so it looks for a line that starts with an opening tag and finishes with a closing tag which essentially means that the line contains a complete HTML element."))
main.append(addImageWithCaption("./images/solution1.jpg", "The solution to the problem of printing out a line from a file where that line is a complete HTML element."))
main.append(addParagraph("Here, we have the solution displayed with cat, the test data also displayed with cat and the output from running the program against that test data. Note that it prints out the only two lines in the test data that consist of a complete HTML element."))
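main.append(addParagraph("Since the program itself is only shown as an image, the following is a minimal sketch of how the two patterns might be combined in a single if statement. It illustrates the idea and is not a copy of the program in the image."))
main.append(addInsetCodeListing(["{", "  # print the line only if it starts with an opening tag and ends with a closing tag", "  if ( $0 ~ /^<[a-zA-Z]+>/ && $0 ~ /<\\/[a-zA-Z]+>$/ ) {", "    print;", "  }", "}"]))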
main.append(addHeader("SOLUTION"))
main.append(addParagraph("We will look at the official solution now which, as I mentioned, takes a different approach. In this solution, we are breaking the line down into fields, so, for example, the opening tag will (by default) be the first field."))
main.append(addParagraph("It specifies a field separator like this"))
main.append(addSyntax("FS=\"[<>]\""))
main.append(addParagraph("So the field separator is any character in this small set, which means either < which is the start of a tag or > which is the end of a tag. This actually helps to overcome one of the shortcomings of my solution. Let's say that a line starts with the opening tag of a list item so"))
main.append(addSyntax("<li>"))
main.append(addParagraph("The line starts with a field separator which means that $1 is empty. Between the opening < and the next field separator >, which terminates the tag, we have the name of the tag. In this case li which means that li is the second field."))
main.append(addParagraph("In more general terms, we can say that on any line that starts with an opening HTML tag, $2 represents the contents of that opening tag."))
main.append(addParagraph("This means that when we look at the closing tag, we can check to see if the contents of the closing tag, minus the forward slash, are the same."))
main.append(addParagraph("In the main part of the program, we have an if statement"))
main.append(addSyntax("if ( $(NF-1) == ( \"/\" $2 ) ) {"))
main.append(addParagraph("This is comparing two fields and remember that because of the defined field separator, a field is defined as any sequence of characters between two angled brackets. Technically, this could be a sequence of characters between any two angled brackets, but if we assume that the input file is correctly formed HTML, this means that the text inside a tag will be treated as a field. It does also mean that the text between tags will also be considered a field although in this context, we don't use it."))
main.append(addParagraph("The condition is testing two of the fields, $(NF-1) and $2 to see if they are equal. Just as the first field will be empty if the first character is a < the last field will also be empty if the final character is a >."))
main.append(addParagraph("Let's look at an example"))
main.append(addSyntax("<li>list item</li>"))
main.append(addParagraph("This contains four field separators so there are five fields as follows"))
main.append(addInsetCodeListing(["$1 is an empty field", "$2 is li", "$3 is the text between the opening and closing tags.", "$4 is /li", "$5 is an empty field"]))
main.append(addParagraph("As there are five fields, NF=5 so we are comparing $2 and $4 in the if statement's condition. Note that the second part of the condition is"))
main.append(addSyntax("( \"/\" $2 )"))
main.append(addParagraph("which means that we are checking to see if the fourth field is equal to field 2 with a forward slash in front of it. In this example, the fourth field is /li and if we put a forward slash in front of field 2, this gives us /li so in this case it will return true and the line will be printed."))
main.append(addParagraph("In more general terms, we can say that if the line starts with an HTML tag and ends with an HTML tag and if the contents of these tags is the same except for the / in the closing tag, the condition will evaluate to true and the line will be printed."))
main.append(addParagraph("Note that we also have this line in front of the if statement"))
main.append(addSyntax("/^<.*>$/"))
main.append(addParagraph("This essentially is the pattern in a pattern-action statement with the if statement representing the action and it means that the code will ignore any line that doesn't start with < and end with >. So it is effectively filtering out any lines that are clearly not complete HTML elements but it also filters out lines where there is white space at the start or end of the line."))
main.append(addParagraph("This means that a line that consists of, for example, a tab character followed by a complete HTML element would not, in this context, be considered a complete HTML element. In other words, it ignores indented elements. However, if the pattern is left out, the program would ignore any other characters before the opening tag since we are not checking $1 so this line"))
main.append(addSyntax("some text <li>Linked item</li>"))
main.append(addParagraph("would be incorrectly identified as being a line containing only a complete HTML element and would be printed out."))
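main.append(addParagraph("Assembling the pieces described above, the course solution might look something like the following sketch. It is pieced together from the fragments in this section, so the layout may differ from the original."))
main.append(addInsetCodeListing(["BEGIN {", "  FS = \"[<>]\";", "}", "/^<.*>$/ {", "  # $2 is the opening tag name, $(NF-1) is the closing tag", "  if ( $(NF-1) == ( \"/\" $2 ) ) {", "    print;", "  }", "}"]))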
main.append(addParagraph("Aside from the fact that it fails to identify any element where there is white space at the start or end of the line, the solution has some other weaknesses too."))
main.append(addParagraph("It fails to recognise an element as being complete if the case is different. For example, the line"))
main.append(addSyntax("<Li>List item</li>"))
main.append(addParagraph("would not be identified as a complete HTML element. It will also, as was the case with my solution, fail to identify elements which have attributes such as"))
main.append(addSyntax("<a href= \"url\">URL</a>"))
main.append(addParagraph("In my solution, this was because I assumed a tag name consisted of letters only. In the course solution, it fails because the attributes are only in the opening tag and this will cause the if statement to return false even if the tags do match."))
main.append(addParagraph("Finally, it doesn't check that the tags do match up with each other. In other words, the <li> tag at the start might match with the </li> at the end but the solution doesn't check that these tags belong to the same element."))
main.append(addParagraph("For example"))
main.append(addParagraph("would print the second line as a complete item, but the </li> at the end of that line is terminating the list item on line 1. I'm not sure if that is a valid concern because this doesn't look like valid HTML unless you consider this to be a list item within a list item, in which case the second line is a complete HTML element. I don't know of any situation where the HTML in the middle would violate that constraint and still be valid HTML (or at least, well-formed HTML) so I don't consider this to be a problem."))
main.append(addParagraph("Both solutions do tend to reinforce an earlier point which is that it is hard to reliably parse HTML with AWK."))
addSidebar("linux");