import { addBanner, addArticle, addHeader, addParagraph, addSubHeader } from "/scripts/article.js";
import { addInset, addInsetList, addInsetCodeListing, addInsetBulletList } from "/scripts/inset.js";
import { addImageWithCaption, addButtonGroup } from "/scripts/visuals.js";
import { menu } from "/scripts/buttongroups.js";
const main = document.querySelector("main");
main.append(addBanner("<a href=\"https://www.linkedin.com/learning/awk-essential-training\">AWK Essential Training</a>", "David D Levine", "November 2022"))
main.append(addArticle())
const article = document.querySelector("article")
article.append(addHeader("EXPLORING BASIC INPUT FIELD SEPARATORS"))
article.append(addParagraph("The notion of records and fields is fundamental to AWK, so it is necessary to be clear about what constitutes a record or a field. By default, awk considers each line of a file to be a record and a field to be anything on the line that is separated from its neighbours by white space."))
article.append(addParagraph("If we run the command"))
article.append(addInset("awk '{ print $2 }'"))
article.append(addParagraph("and use spaces to separate our input"))
article.append(addInset("one two three"))
article.append(addParagraph("the output is"))
article.append(addInset("two"))
article.append(addParagraph("If we use tabs rather than spaces to separate the input"))
article.append(addInset("one two three"))
article.append(addParagraph("the output is still"))
article.append(addInset("two"))
article.append(addParagraph("The key point here is that there is white space between the fields and it doesn't matter whether we use spaces, tabs or any combination of the two. For example, let's say we have 3 spaces between the first and second fields and a space followed by a tab and then two more spaces between the second and third fields"))
article.append(addInset("one two three"))
article.append(addParagraph("we still get the same output"))
article.append(addInset("two"))
article.append(addParagraph("You might recall that we can use the -F flag to specify a different separator. It is quite common for that to be a comma, or a comma followed by a space, which means that any other white space is seen as part of the field. For example, if our command is"))
article.append(addInset("awk -F ', ' '{print $2}'"))
article.append(addParagraph("and our input is"))
article.append(addInset("one one, two two, three three"))
article.append(addParagraph("the output will be"))
article.append(addInset("two two"))
article.append(addParagraph("One scenario where this can be useful is if you have a file where the fields are separated by tabs and there are embedded spaces within those fields. If we specify tabs as the separator, we can read fields that include spaces. For example"))
article.append(addInset("awk -F '\\t' '{print $2}'"))
article.append(addParagraph("with the input"))
article.append(addInset("John Smith Alan Davies Peter Piper"))
article.append(addParagraph("will generate the output"))
article.append(addInset("Alan Davies"))
article.append(addParagraph("One interesting feature when using the default separator (any white space) is that it is not possible to have an empty field and the reason for this should be fairly obvious. Everything you type as input would be either part of a field or a separator so an empty field would consist of nothing with white space before and after it."))
article.append(addParagraph("However, awk would just see that as one separator."))
article.append(addParagraph("As an experiment, it might be worth trying a command that prints the first field, such as awk '{ print $1 }',"))
article.append(addParagraph("and providing input with white space at the beginning of the line."))
article.append(addInset("   one two three"))
article.append(addParagraph("The output from this will either be"))
article.append(addInset("one"))
article.append(addParagraph("or a blank line. If it is a blank line, this would show that you can have an empty field using the default separator, as long as it is the first field. I would assume that the same would then apply to the last field, and you could try a similar experiment to test that theory."))
article.append(addParagraph("On the other hand, if awk ignores the white space at the start and end of the input and looks for the first part of the input that is not white space as the first field, this would demonstrate the fact that you can't have empty fields using the default separator."))
article.append(addParagraph("When running this test, you will see that the output is one so this shows that awk will, under these circumstances, ignore any white space at the start or end of the line."))
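/*
A quick way to check this behaviour without typing input by hand is sketched below. The sample line and the shell variable names are illustrative assumptions, and NF is awk's built-in count of the fields in the current record.

```shell
# Hypothetical input line with leading and trailing spaces.
line='   one two three   '

# With the default separator, awk ignores leading and trailing white space,
# so the first field is "one" and there are exactly three fields.
first=$(printf '%s\n' "$line" | awk '{ print $1 }')
count=$(printf '%s\n' "$line" | awk '{ print NF }')

echo "first field: $first"
echo "field count: $count"
```

If awk treated the leading spaces as an empty first field, the first value would be blank and the count would be greater than 3.
*/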
article.append(addParagraph("If you specify a separator, let's say a comma, it is easy to indicate an empty field. The input"))
article.append(addInset("one,,three"))
article.append(addParagraph("has the second field empty."))
article.append(addInset(",two,three"))
article.append(addParagraph("has the first field empty"))
article.append(addInset("one,two,"))
article.append(addParagraph("has the third and in this case last field empty."))
article.append(addParagraph("The field separator doesn't have to be a single character and we saw that earlier when we specified this to be a comma followed by a space. We can actually use more or less any string so this might be, for example, ABC. With that separator, the input"))
article.append(addInset("oneABCtwoABCthree"))
article.append(addParagraph("has three fields (none of which are empty) and if we use awk to output $2 we will get"))
article.append(addInset("two"))
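/*
The command for the ABC example is not shown above, so here is one way it could be written. This is a sketch that pipes the sample input in with printf; the variable name is an assumption.

```shell
# Use the three-character string ABC as the field separator.
second=$(printf 'oneABCtwoABCthree\n' | awk -F 'ABC' '{ print $2 }')

echo "$second"
```

A separator longer than one character is interpreted by awk as a regular expression, which is harmless for plain letters like these but matters if the string contains characters such as a dot or an asterisk.
*/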
article.append(addParagraph("We can also use a regex as the separator. Take the following example."))
article.append(addInset("awk -F '[,!]' '{print $2}'"))
article.append(addParagraph("The regex in this example is"))
article.append(addInset("'[,!]'"))
article.append(addParagraph("which defines the separator to be either a comma or an exclamation mark. Note that as we are typing the command directly into the shell, we have to enclose the regex in quote marks for the same reason that we had to do that for the awk program. It contains characters that are special to the shell."))
article.append(addParagraph("If our input is"))
article.append(addInset("one!two,three"))
article.append(addParagraph("the output will be"))
article.append(addInset("two"))
article.append(addHeader("SPECIFYING FIELD AND RECORD SEPARATORS WITH VARIABLES"))
article.append(addParagraph("We talked about variables in AWK earlier and I mentioned that AWK has some pre-defined special variables. We can use one of these, FS, to specify the field separator within an AWK program."))
article.append(addParagraph("Let's look at an example of that."))
article.append(addInset("awk '{FS = \",\"; print $2}'"))
article.append(addParagraph("Note that we are specifying a comma as the field separator and this is enclosed in quotes so that AWK will recognise it as a string. The way in which this works is a little bit strange, but we will look at some input/output first."))
article.append(addParagraph("With the input"))
article.append(addInset("one,two,three"))
article.append(addParagraph("we get a blank line as the output. If we try that again with"))
article.append(addInset("four,five,six"))
article.append(addParagraph("the output we get is"))
article.append(addInset("five"))
article.append(addParagraph("The first time we ran this, awk didn't seem to recognise the fact that we were using the comma as our field separator and that is, in fact, what happened. You may have noticed that there is a semicolon within the awk program and, as you would probably expect, this is a statement separator. The catch is that awk reads a record and splits it into fields using the current value of FS before it runs the action, so assigning FS inside the action is too late for the record being processed and only takes effect from the next record. For the first record, AWK is still expecting the fields to be separated by white space and so it sees the input"))
article.append(addInset("one,two,three"))
article.append(addParagraph("as being a single field. When we run it a second time (by which I mean we provide a second record as input) it works as expected and does recognise the comma as our field separator."))
article.append(addParagraph("There are a couple of ways to get around this. One way is to use the special pattern, BEGIN, with an associated action, in this case {FS=\",\"}, and then add a second action to print the second field, {print $2}."))
article.append(addParagraph("In our first version of this, the program"))
article.append(addInset("awk '{FS=\",\"; print $2}'"))
article.append(addParagraph("contains only an action, with no pattern, so the action runs once for every record. By the time it runs, awk has already split the current record, which is why the new value of FS isn't used until the next record."))
article.append(addParagraph("The second version has the BEGIN pattern with its own action. That action runs once, before any input is read, so AWK will recognise the comma as the separator for the first record we provide as input. So for the program"))
article.append(addInset("awk 'BEGIN{FS=\",\"} { print $2}'"))
article.append(addParagraph("and the input"))
article.append(addInset("one,two,three"))
article.append(addParagraph("the output is now"))
article.append(addInset("two"))
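/*
BEGIN is only one of the ways to set the separator before the first record is read. The sketch below compares it with the -F flag used earlier and with awk's standard -v option, which assigns a variable before the program starts. The -v form is not covered in the text above and the shell variable names are assumptions.

```shell
input='one,two,three'

# Three ways to make the comma the field separator from the first record on.
with_begin=$(printf '%s\n' "$input" | awk 'BEGIN { FS = "," } { print $2 }')
with_flag=$(printf '%s\n' "$input" | awk -F, '{ print $2 }')
with_var=$(printf '%s\n' "$input" | awk -v FS=, '{ print $2 }')

echo "$with_begin $with_flag $with_var"
```

All three set FS before awk splits the first record, which is what the version that assigns FS inside the main action fails to do.
*/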
article.append(addParagraph("By default, AWK defines a record as being a line but this can be a problem for two reasons. Firstly, what constitutes a line? This will normally be terminated with either a CR or an LF character, or even both. This is dependent on the OS and we can usually allow the OS to take care of that."))
article.append(addParagraph("For example, in the dukeofyork.txt file, as we saw earlier, there were 8 lines in the file and 8 records, but we didn't have to worry about how these lines were terminated because effectively the OS will tell AWK that there are 8 lines."))
article.append(addParagraph("The second reason might be more of a worry for us and that is that we might have some data that has multiple records that are not separated by lines. An example of this is the file onebigline.txt and if we cat this file, we will see"))
article.append(addInset("cat onebigline.txt"))
article.append(addInset("one,two,three!four,five,six!seven,eight,nine!ten,eleven,twelve$"))
article.append(addParagraph("Note that the $ at the end is not part of the file, this is the command prompt and it is on the same line as the file output because there are no newline characters in the file and so the command prompt is not nudged on to the next line."))
article.append(addParagraph("In the same way that we used FS to define the field separator, we can use RS to define the record separator. If we use the BEGIN pattern, we can use both of these to set the exclamation mark as the record separator and the comma as the field separator and then output the second field in each record."))
article.append(addInset("awk 'BEGIN{RS=\"!\"; FS=\",\"} {print $2}' onebigline.txt"))
article.append(addParagraph("The file has four records based on this definition and each record has three fields. This gives us the output"))
article.append(addInset("two"))
article.append(addInset("five"))
article.append(addInset("eight"))
article.append(addInset("eleven"))
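/*
To try this without the exercise file, the same data can be piped in with printf, which, like the file, does not add a trailing newline. This is a sketch: the variable name is an assumption, and the END action with the built-in NR variable is added only to confirm the record count.

```shell
# Records are separated by "!" and fields by ",".
records=$(printf 'one,two,three!four,five,six!seven,eight,nine!ten,eleven,twelve' |
  awk 'BEGIN { RS = "!"; FS = "," } { print $2 } END { print NR " records" }')

echo "$records"
```

The last line of output reports 4 records, matching the count given above.
*/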
article.append(addParagraph("If we have records separated by blank lines, as in the example multiaddress.txt, it looks as though it would be quite difficult to read individual records. Each field within a record is on a line by itself and there is one blank line between each record."))
article.append(addParagraph("We can use the newline character as the field separator, that is, \"\\n\"."))
article.append(addParagraph("To separate the records, we would use the empty string and this would mean that the records are separated by one or more blank lines."))
article.append(addParagraph("If we want to output the records with each one on a single line with commas separating the output and no blank lines between the records, we can do it like this."))
article.append(addInset("awk 'BEGIN {RS=\"\";FS=\"\\n\"} {print $1 \", \" $2 \", \" $3}' multiaddress.txt"))
article.append(addParagraph("We could also do the same thing using awk variables, which might be useful for someone reading the code since this provides some info on what the purpose of each field is."))
article.append(addInset("awk 'BEGIN {RS=\"\";FS=\"\\n\"} {name=$1;address=$2;citystatezip=$3; print name \", \" address \", \" citystatezip}' multiaddress.txt"))
article.append(addParagraph("In both cases, we get the same output. Note that the commas here are literal strings that we placed between the fields in the print statement, and that awk ends each record it prints with a newline character, which is the default output record separator."))
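/*
The sketch below runs the same program on made-up data so that it can be tried without multiaddress.txt. The two addresses and the variable name are invented for illustration and are not the contents of the exercise file.

```shell
# Two hypothetical three-line records separated by one blank line.
addresses=$(printf 'John Smith\n1 Main St\nSpringfield\n\nJane Doe\n2 Oak Ave\nShelbyville\n' |
  awk 'BEGIN { RS = ""; FS = "\n" } { print $1 ", " $2 ", " $3 }')

echo "$addresses"
```

Each record comes out on a single line, for example "John Smith, 1 Main St, Springfield".
*/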
article.append(addParagraph("However, just as we used FS and RS to define separators for our input, we can use OFS and ORS to define separators for the output. For example, we'll go back to our example with names, names.txt, where each record consists of two fields, a first name followed by a second name."))
article.append(addParagraph("Let's say that we want to output this with these names reversed and separated by a comma and a space (OFS=\", \") and with each record separated by an exclamation mark (ORS=\"!\")."))
article.append(addParagraph("We would do that with"))
article.append(addInset("awk 'BEGIN{OFS=\", \";ORS=\"!\"} {print $2,$1}' names.txt"))
article.append(addParagraph("Remember that the comma within that print statement tells awk to use the output field separator which gave us a space between the fields because that is the default for OFS. In this case, since we set OFS to a comma followed by a space, that's what we will see in the output."))
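/*
As a rough illustration of the difference the comma makes, the sketch below prints the same two fields once with a comma and once with a space between them in the print statement. The names are made-up sample data, not the contents of names.txt.

```shell
# Comma in print: awk inserts OFS between the fields and ORS after the record.
with_ofs=$(printf 'John Smith\nJane Doe\n' |
  awk 'BEGIN { OFS = ", "; ORS = "!" } { print $2, $1 }')

# Space in print: the fields are concatenated, so OFS is not used at all.
without_ofs=$(printf 'John Smith\nJane Doe\n' |
  awk 'BEGIN { OFS = ", "; ORS = "!" } { print $2 $1 }')

echo "$with_ofs"
echo "$without_ofs"
```

The first command produces Smith, John!Doe, Jane! and the second produces SmithJohn!DoeJane!, with no newline at the end of either because ORS has replaced it.
*/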
article.append(addHeader("CHALLENGE"))
article.append(addParagraph("For this challenge, we have a file with comma separated records where each record has three fields and we want to convert this to a file where the fields are separated by tabs."))
article.append(addHeader("SOLUTION"))
article.append(addInset("awk 'BEGIN {FS=\",\";OFS=\"\\t\"} {print $1,$2,$3}' nameemailavg.csv"))
article.append(addParagraph("We are telling awk that the input file has records where the fields are separated by commas by setting FS to \",\" and that the field separator for the output should be a tab by setting OFS to \"\\t\"."))
article.append(addParagraph("We then just need to print each of the three fields and we use a comma in the print statement to tell awk to use the OFS which is the tab character and this gives us the desired output."))
article.append(addParagraph("You may be tempted to use this as your solution."))
article.append(addInset("awk 'BEGIN {FS=\",\";OFS=\"\\t\"} {print}'"))
article.append(addParagraph("This seems like a neat solution because it outputs the whole record so if it works for this example, it would work for other examples where the number of fields is different and this could include scenarios where we don't know in advance how many fields each record has or where the number of fields is not the same for every record."))
article.append(addParagraph("Unfortunately, it doesn't because print on its own will output the whole record without doing anything else with it so that output will still have its fields separated by commas."))
article.append(addParagraph("If you have a file where the fields are separated by commas and there are, let's say, six fields in each record, you can easily modify the solution to accommodate this."))
article.append(addParagraph("However, if you need a solution that will work for any csv records with any number of fields or where different records have different numbers of fields, you would need to use a for loop to iterate over the fields of each record, reading every field in turn and outputting it followed by your tab separator (unless it is the last field)."))
article.append(addParagraph("We will see how to use for loops in awk later in the course."))
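/*
As a preview, the loop described above might look something like the sketch below. It is not the course's solution: the sample data and variable name are invented, and it relies on the built-in NF variable (the number of fields in the current record) and on printf, which prints without adding a separator of its own.

```shell
# Hypothetical input with a different number of fields on each line.
converted=$(printf 'a,b,c\nd,e\n' |
  awk 'BEGIN { FS = "," }
       {
         # Print each field followed by a tab, or by a newline if it is the last.
         for (i = 1; i <= NF; i++)
           printf "%s%s", $i, (i < NF ? "\t" : "\n")
       }')

echo "$converted"
```

Like the rest of this section, the sketch splits on every comma, so it would not cope with quoted csv fields that contain commas themselves.
*/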
main.append(menu("awk"))