Operating Systems and Systems Integration
Regular Expressions, sed and awk: Solutions
Nick Urbanik nicku(at)vtc.edu.hk ver. 1.3

1 Background

For the background information you need to answer these questions, please refer to the shell programming lecture notes.

The program egrep stands for extended grep. It supports so-called extended regular expressions. Since I have taught you the syntax of extended regular expressions in the lecture, I suggest that you use egrep rather than grep with regular expressions. Both grep and egrep support an option -o to print only the part of the line that matches the expression.

The GNU sed program supports the -r option, which tells sed to use extended regular expressions, as egrep does. So when you use sed, use it as sed -r.

The GNU awk program has an option --posix that makes awk behave like egrep, so that it understands {n} and {n,m} without your having to put in extra backslashes ‘\’.

You might wonder what the difference is between “extended regular expressions” and the regular expressions that grep uses. The difference is explained at the end of the info page:

    $ info '(grep)Regular Expressions'

where it says, “In basic regular expressions the metacharacters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ lose their special meaning; instead use the backslashed versions ‘\?’, ‘\+’, ‘\{’, ‘\|’, ‘\(’, and ‘\)’.”

2 Questions

2.1 egrepping through the dictionary

Your dictionary is a file /usr/share/dict/words. Use egrep to:

1. Find all words containing three letter ‘a’s.

    $ egrep 'a.*a.*a' /usr/share/dict/words

2. Find all words containing no vowels. (A vowel is one of the letters ‘a’, ‘e’, ‘i’, ‘o’ and ‘u’.)

    $ LANG=C egrep -i '^[^aeiou]+$' /usr/share/dict/words

   Note that LANG=C means, “Don’t use Unicode, use just plain old ASCII.”

3. Find all words containing at least 5 vowels. Count the number of matching words.

    $ egrep '[aeiou][^aeiou]*[aeiou][^aeiou]*'\
    '[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou]' \
        /usr/share/dict/words | wc -l
    5306

4. Find all words containing exactly 5 vowels. Count the number of matching words.

    $ egrep '^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*'\
    '[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$' \
        /usr/share/dict/words | wc -l
    3798

2.2 egrep: Selecting data from student records

1. Save the file http://ictlab.tyict.vtc.edu.hk/snm/lab/regular-expressions/artificial-student-data.txt to your local directory. For this data, write a regular expression that will select each of the following. Test it on the data using egrep -o.

2. The student number

    $ egrep -o '[0-9]{9}' artificial-student-data.txt

3. The Hong Kong ID. Count the number of Hong Kong IDs.

    $ egrep -oi '[A-Z][0-9]{6}\([A0-9]\)' \
        artificial-student-data.txt | wc -l
    560

4. The course code. Count the number of courses. The course and year are shown in this case on the sixth line: 2241/2. The course is 2241; this is the second year of study.

    $ egrep -o '^[0-9]{4}' artificial-student-data.txt
    $ egrep -o '^[0-9]{4}' artificial-student-data.txt \
        | sort -n | uniq -c
          3 2241
          4 2245
          1 2262
          4 2545
          3 2565

5. The year of study

    $ egrep -o '^[0-9]{4}/[0-9]' artificial-student-data.txt \
        | sed 's!.*/!!'

6. The company the student works for

   Here is one possible solution:

    $ egrep '[A-Z][0-9]{6}\([A0-9]\)' artificial-student-data.txt \
        | cut -b80-125 \
        | grep -v '^ *$' \
        | sed 's/ *$//'

   This works by:
   • listing all lines containing a HKID;
   • selecting only the columns containing the company name, with cut, i.e., columns 80 to 125;
   • removing blank lines (student records with no company);
   • removing the trailing spaces from the end of each line.

   There are many other methods.

7. The home telephone number

   This one just finds the 8-digit number that comes after the Hong Kong ID. With -o, egrep prints only the ID and the number, so the number is the second field:

    $ egrep -o '[A-Z][0-9]{6}\([A0-9]\) +[0-9]{8}' \
        artificial-student-data.txt \
        | awk '{print $2}'

8. The gender of the student

   We can find the M or F just before a nine-digit number:

    $ egrep -o '[MF] +[0-9]{9}' artificial-student-data.txt \
        | awk '{print $1}'
    $ egrep -o '[MF] +[0-9]{9}' artificial-student-data.txt \
        | awk '{print $1}' \
        | sort \
        | uniq -c
         66 F
        494 M

   The second pipeline counts the number of males and the number of females.

9. The student’s name

    $ egrep '[A-Z][0-9]{6}\([A0-9]\)' artificial-student-data.txt \
        | cut -b6-42 \
        | sed 's/ *$//'

   This solution uses the same approach as the pipeline that prints the company names.

2.3 Using sed

Write a sed expression to output only the data for which you wrote each of the regular expressions above. For example, write a sed command that will print only the HK IDs, and all of the HK IDs, from the file, using the regular expression you wrote for question 3 of section 2.2. You should write eight sed expressions.

2. The student number

    $ sed -rn 's/.*([0-9]{9}).*/\1/p' artificial-student-data.txt

3. The Hong Kong ID

    $ sed -rn 's/.*([A-Za-z][0-9]{6}\([Aa0-9]\)).*/\1/p' \
        artificial-student-data.txt

4. The course code

    $ sed -rn 's/^([0-9]{4}).*/\1/p' artificial-student-data.txt

5. The year of study

    $ sed -rn 's!^[0-9]{4}/([0-9]).*!\1!p' artificial-student-data.txt

6. The company the student works for

   Here is one rather complicated solution:

    $ sed -rn 's/^.{79}([A-Za-z0-9()/&",+.'\''-]+'\
    '( [A-Za-z0-9()/&",+.'\''-]+)+) .*/\1/p' artificial-student-data.txt

   Part of the complication is that you cannot put a single quote “'” in a single-quoted string in the shell.
   To achieve this effect, you need to:
   • end the single-quoted string with “'”, then
   • quote a single quote with a backslash like this: “\'”, then
   • start a new single-quoted string with “'”.

   This means that to put the word “it's” in a single-quoted string in the shell, we would need to write something as horrible as this:

    'it'\''s'

   Another reason why the character class “[A-Za-z0-9()/&",+.'\''-]” above looks so long and horrible is that there are so many different characters in the company names. The “-” needs to go at the end of the character class, otherwise it means a range of characters, as in “[A-Z]”.

   Here is another that just uses the position of the company data:

    $ sed -rn 's/^.{79}([^ ]+( [^ ]+)+) .*/\1/p' \
        artificial-student-data.txt

7. The home telephone number

   We can just use the same approach as before: find the 8-digit number after the HK ID. The ‘.*’ at each end of the pattern consumes the rest of the line, so that only the telephone number is printed:

    $ sed -rn 's/.*[A-Z][0-9]{6}\([A0-9]\) +([0-9]{8}) .*/\1/p' \
        artificial-student-data.txt

8. The gender of the student

   We can find the M or F just before a nine-digit number:

    $ sed -rn 's/.* ([MF]) +[0-9]{9} .*/\1/p' artificial-student-data.txt

9. The student’s name

   Just for variety, I’ll take a different approach here from the one I took when using egrep above. I’ll find the name just before the gender, which is just before the nine-digit student number:

    $ sed -rn 's/.{,6}([A-Za-z]{2,},?( [A-Za-z]+)+) +[MF] +[0-9]{9} .*/\1/p' \
        artificial-student-data.txt

   Note that we cannot put ‘.*’ at the beginning of the pattern, because that will gobble up the beginning of the student’s name. That is because ‘*’ is “greedy,” and matches as much as it possibly can. We can limit the match to “at most 6 characters” with ‘.{,6}’.

2.4 Using awk

Use awk and ls to add up the size of all the files in your current directory.

    $ ls -l | awk '{sum += $5} END{print sum}'
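The awk pipeline in section 2.4 adds up the fifth field of every line that ls -l prints, which includes the sizes that ls reports for subdirectories. The following is only a sketch of one possible variation, not part of the original solution: it counts regular files only, and it runs in a throwaway directory so that the result is predictable. The directory and the names a.txt, b.txt and subdir are made up for the illustration.

```shell
#!/bin/sh
# Sketch only: sum the sizes of regular files in a scratch directory.
# The file and directory names below are hypothetical.
dir=$(mktemp -d)
printf 'hello'      > "$dir/a.txt"    # 5 bytes
printf '0123456789' > "$dir/b.txt"    # 10 bytes
mkdir "$dir/subdir"                   # a directory, which should not be counted

# /^-/ selects only lines whose permission string starts with '-',
# i.e. regular files; this skips the "total" line and directories.
# "sum + 0" makes awk print 0 rather than an empty line when nothing matched.
total=$(cd "$dir" && ls -l | awk '/^-/ {sum += $5} END {print sum + 0}')
echo "$total"    # prints 15

rm -r "$dir"
```

The pattern /^-/ is a design choice for this sketch; dropping it gives back the behaviour of the pipeline in section 2.4, which adds up whatever appears in the fifth column.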