Now that we know how regular expressions work, let's have a look at Notebook #3:
Date Site Evil(mvad)
May 29 2010 (Hartnell) 1029.3
May 30 2010 (Hartnell) 1119.2
June 1 2010 (Hartnell) 1319.4
May 29 2010 (Troughton) 1419.3
May 30 2010 (Troughton) 1420.0
June 1 2010 (Troughton) 1419.8
⋮ ⋮ ⋮
It has the date as three fields, the site name in parentheses, and then the reading. We know how to parse dates in this format, and the fields are separated by spaces, but how do we match those parentheses? The parentheses we have seen in regular expressions so far haven't matched characters: they have created groups.
The way we solve this problem—i.e., the way we match a literal left
parenthesis '(' or right parenthesis ')'—is to put a backslash in front
of it. This is another example of an escape
sequence: just as we use the
two-character sequence '\t'
in a string to represent a literal tab
character, we use the two-character sequence '\('
or '\)'
in a
regular expression to match the literal character '(' or ')'.
To get that backslash '\' into the string, though, we have to escape
it by doubling it up. This has nothing to do with regular expressions:
it is Python's rule for putting backslashes in strings. The string
representation of the regular expression that matches an opening
parenthesis is therefore '\\('
. This can be confusing, so let's take a
look at the various layers involved.
Our program text—i.e., what's stored in our .py
file—looks like this:
# find '()' in text
m = re.search('\\(\\)', text)
⋮ ⋮ ⋮
The file has two backslashes, an open parenthesis, two backslashes, and a close parenthesis inside quotes. When Python reads that file in, it turns the two-character sequence '\\' into a single '\' character in the string in memory. This is the first round of escaping. When we hand that string '\(\)' to the regular expression library, it takes the two-character sequence '\(' and turns it into an arc in the finite state machine that matches a literal parenthesis:
Turning this over, if we want a literal parenthesis to be matched, we
have to give the regular expression library '\('. If we want to put
'\(' in a string, we have to write it in our .py
file as '\\('.
With that out of the way, let's go back to Notebook #3. The regular
expression that will extract the five fields from each record is
'([A-Z][a-z]+) ([0-9]{1,2}) ([0-9]{4}) \\((.+)\\) (.+)'
, which is:
If we take a closer look at that "stuff", '\\('
and '\\)'
are how we
write the regular expressions that match a literal open parenthesis '('
or close parenthesis ')' character in our data. The two inner
parentheses that don't have backslashes in front of them create a group,
but don't match any literal characters. We create that group so that we
can save the results of the match (in this case, the name of the site).
Now that we know how to work with backslahes in regular expressions, we
can take a look at character sets that come up frequently enough to
deserve their own abbreviations. If we use '\d'
in a regular
expression it matches the digits 0 through 9. If we use '\s'
, it
matches the whitespace characters, space, tab, carriage return, and
newline. '\w'
matches word characters; it's equivalent to the set
'[A-Za-z0-9_]'
of upper-case letters, lower-case letters, digits, and
the underscore. (It's actually the set of characters that can appear in
a variable name in a programming language like C or Python.) Again, in
order to write one of these regular expressions as a string in Python,
we have to double the backslashes.
Now that we've seen these character sets, we can take a look at an
example of really bad design. The regular expression '\S'
means
"non-space characters", i.e., everything that isn't a space, tab,
carriage return, or newline. That might seem to contradict what we said
in the previous paragraph, but if we look closely, that's an upper-case
'S', not a lower-case 's'.
Similarly, and equally unfortunately, '\W'
means "non-word characters"
provided it's an upper-case 'W'. Upper- and lower-case 'S' and 'W' look
very similar, particularly when there aren't other characters right next
to them to give context. This means that these sequences are very easy
to mis-type and mis-read. Everyone eventually uses an upper-case 'S'
when they meant to use a lower-case 's' or vice versa, and then wastes a
few hours trying to track it down. So please, if you're ever designing a
library that's likely to be widely used, try to choose a notation that
doesn't make mistakes this easy.
Along with the abbreviations for character sets, the regular expression
library recognizes a few shortcuts for things that aren't actual
characters. For example, if we put a circumflex '^'
at the start of a
pattern, it matches the beginning of the input text. (Note that there's
no backslash in front of it.) This means that the pattern '^mask'
will
match the text 'mask size'
, because the letters 'mask' come at the
start of the string, but will not match the word 'unmask'
. Going to
the other end, if dollar sign '$'
is the last character in the
pattern, it matches the end of the line rather than a literal '$', so
'temp$' will match the string 'high-temp', but not the string
'temperature'.
The full rule is slightly more complicated. By default, regular
expressions act as if newline characters were the ends of records. For
example, the '.'
pattern matches everything except a newline. This
normally doesn't matter, since most I/O routines return one line of text
at a time, but if we read a whole file into a single string, then try to
match across line boundaries, we may not get the behavior we expect. We
can use the MULTILINE
option in our matches to prevent this; please
see the regular expression documentation for details.
A third shortcut that's often useful is '\b'
, often called "break". It
doesn't match any characters; instead, it matches the boundary between
word and non-word characters (where "word" means upper and lower case
characters, digits, and the underscore). For example, the regular
expression '\bage\b'
will match the string 'the age of'
because
there's a non-word character right before the 'a' and another non-word
character right after the 'e'. That same pattern will not match the word
'phage'
because there isn't a transition from non-word to word
characters, or vice versa, right before the 'a'. And remember: to get
that regular expression int our program, we have to escape the
backslashes using '\\bage\\b'
.