\documentclass[solutions]{ictlab}

\RCS $Revision: 1.1 $

\usepackage{verbatim,alltt}
\usepackage[pdfpagemode=None,pdfauthor={Nick Urbanik}]{hyperref}

\newcommand*{\labTitle}{Introduction to Regular Expressions in Perl}

\begin{document}
For each exercise where you write a program, keep the original program
from each exercise and modify a copy for the next exercise.  You will
need to submit all your programs by email to me at
\url{nicku@vtc.edu.hk}.  See the subject web site for the format of
the subject line of your assignments.


\paragraph{Background:}

We have seen how a character class matches any one character from a
set.  For example, the pattern \texttt{/[0-9]/} matches any one digit,
and \texttt{/[0-9][0-9]/} matches any two digits, one after the other.

This next topic is like gold mining: extracting useful information
from among other less useful material.

You can use parentheses in a regular expression.  If the pattern
matches, the variable \texttt{\$1} is set to the text matched by the
part of the pattern inside the first set of parentheses.  For example,
this code:
\begin{verbatim}
my $line = "STUDENT REGISTER           2001/02 2nd Term  MODE : PTE";
if ( $line =~ /MODE : ([Pp][Tt][Ee])/ )
{
    print "The mode of study is $1\n";
}
\end{verbatim}%$
 prints:
\begin{verbatim}
The mode of study is PTE
\end{verbatim}
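
A pattern may contain more than one set of parentheses.  Perl numbers
them from left to right by their opening parenthesis, so the second
set fills \texttt{\$2}.  The following is only a sketch that reuses
the sample line above; the variable names \texttt{\$year} and
\texttt{\$mode} are chosen here for illustration, and the pattern
assumes that the academic year always appears before the mode on the
same line:
\begin{verbatim}
#! /usr/bin/perl -w
use strict;

my $line = "STUDENT REGISTER           2001/02 2nd Term  MODE : PTE";
# First parentheses capture the academic year, second the mode of study.
# The slash inside the year needs a backslash, since / ends the pattern.
if ( $line =~ /([0-9][0-9][0-9][0-9]\/[0-9][0-9]).*MODE : ([Pp][Tt][Ee])/ )
{
    my ( $year, $mode ) = ( $1, $2 );
    print "Year $year, mode $mode\n";
}
\end{verbatim}
 prints:
\begin{verbatim}
Year 2001/02, mode PTE
\end{verbatim}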
 
\begin{enumerate}
\item Download the artificial student data from
  \url{http://nicku.org/snm/lab/regular-expressions/artificial-student-data.txt}.
  This file is in the old format of the student registration system,
  but contains no real data about any student.  We will work toward
  generating system accounts from this file over the next few classes.

\item Write a Perl program that reads all the lines of this file when
  the file name is given as a command line parameter, and displays
  them on standard output.  For example, if your program is called
  \texttt{printit}, then the following command will display the
  content of the big file of student data:
\begin{verbatim}
$ printit artificial-student-data.txt
\end{verbatim}%$
\begin{verbatim}
#! /usr/bin/perl -w

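# <> reads one line at a time from the files named on the command line
while ( <> )
{
    print;    # with no argument, print shows the current line
}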
\end{verbatim}
or more simply,
\begin{verbatim}
#! /usr/bin/perl -w
print <>;
\end{verbatim}

\item Make a copy of your program, and modify it so that it prints
  only the lines that contain a number with eight or more digits.
\begin{verbatim}
#! /usr/bin/perl -w

while ( <> )
{
    print if /[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/;
}
\end{verbatim}
The ``backwards'' \texttt{if} statement (a statement modifier, written
after the statement it controls) is convenient here.

\item Modify the last program so that it prints all lines that contain
  a Hong Kong ID\@.
\begin{verbatim}
#! /usr/bin/perl -w

while ( <> )
{
    print if /[a-zA-Z][0-9][0-9][0-9][0-9][0-9][0-9]\([0-9A-Z]\)/;
}
\end{verbatim}
  Note that here we need to put a backslash in front of each of the
  two literal parentheses around the last letter or digit of the Hong
  Kong \acro{ID}.  Without the backslashes, the parentheses would
  capture that last letter or digit into \texttt{\$1} instead of
  matching the parentheses in the text, and the pattern would not
  match a Hong Kong \acro{ID}.

\item Modify this last program further so that, for each matching
  line, it prints only the Hong Kong ID and nothing else.  Print one
  Hong Kong ID per line.  There should be no other output from your
  program.
\begin{verbatim}
#! /usr/bin/perl -w

while ( <> )
{
    # Outer parentheses capture the whole ID; the backslashed ones are literal.
    print "$1\n" if /([a-zA-Z][0-9][0-9][0-9][0-9][0-9][0-9]\([0-9A-Z]\))/;
}
\end{verbatim}%$
Oh, they were all too easy, weren't they?  Here I have used just the
methods shown so far in lectures, but you could shorten the patterns
using \texttt{\bs d} to represent one digit, and using the construct
\texttt{\{6\}} to mean exactly six repetitions of the item just before
it, so the patterns could become:
\begin{verbatim}
/\d{8}/
/[a-zA-Z]\d{6}\([\dA-Z]\)/
\end{verbatim}
\end{enumerate}
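
As a quick check that the shortened Hong Kong ID pattern behaves like
the long-hand one, the sketch below tries both on a few strings.  The
strings in \texttt{@samples} are made up for this illustration and are
not taken from the data file.  They are plain ASCII, and on such input
\texttt{\bs d} and \texttt{[0-9]} match the same characters:
\begin{verbatim}
#! /usr/bin/perl -w
use strict;

# Made-up test strings, not real identity numbers.
my @samples = ( "A123456(7)", "z654321(A)", "A12345(7)", "A1234567" );

foreach my $s ( @samples )
{
    my $long  = $s =~ /[a-zA-Z][0-9][0-9][0-9][0-9][0-9][0-9]\([0-9A-Z]\)/;
    my $short = $s =~ /[a-zA-Z]\d{6}\([\dA-Z]\)/;
    # Both patterns should agree on every sample.
    print $s, ": ", ( $long ? "match" : "no match" ), ", ",
        ( $short ? "match" : "no match" ), "\n";
}
\end{verbatim}
 prints:
\begin{verbatim}
A123456(7): match, match
z654321(A): match, match
A12345(7): no match, no match
A1234567: no match, no match
\end{verbatim}
The third sample has only five digits and the fourth has no
parentheses, so neither pattern matches them.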
\end{document}