Let's start by reading data from two files named notebook-1.txt and notebook-2.txt.
For each file, we will discard the header line and keep the first 6 lines after it:
readings = []
for filename in ('notebook-1.txt', 'notebook-2.txt'):
    # Read the whole file, then split it into a list of lines.
    with open(filename, 'r') as source:
        lines = source.read().strip().split('\n')
    readings += lines[1:7]  # Skip the header line (lines[0]) and keep the next 6.
for r in readings:
    print(r)
This puts 6 lines from the first data file and 6 from the second into the list readings:
Baker 1 2009-11-17 1223.0
Baker 1 2010-06-24 1122.7
Baker 2 2009-07-24 2819.0
Baker 2 2010-08-25 2971.6
Baker 1 2011-01-05 1410.0
Baker 2 2010-09-04 4671.6
Davison/May 23, 2010/1724.7
Pertwee/May 24, 2010/2103.8
Davison/June 19, 2010/1731.9
Davison/July 6, 2010/2010.7
Pertwee/Aug 4, 2010/1731.3
Pertwee/Sept 3, 2010/4981.0
Each element of readings is a record that the grad students have created.
We will test our regular expressions against each record as we go along, to see how well
they match the different record formats.
As defined in the Introduction, a regular expression is just a pattern that can match a string. In this section we will aim to derive a regular expression which matches the records (strings of data) based on the date field. This will allow us to, for example, list all the records that were created in May.
Without regular expressions, we can check whether a record r contains the month "06" using the in operator (i.e., we are asking 'does the record r contain the string "06"?'):
for r in readings:
    if '06' in r:
        print(r)
Baker 1 2010-06-24 1122.7
If we want to select all the records for two months, we also have to use the or keyword:
for r in readings:
    if ('06' in r) or ('07' in r):
        print(r)
Baker 1 2010-06-24 1122.7
Baker 2 2009-07-24 2819.0
But if we say '05' in r, it can match the day "05" as well as the month "05". This is not the behaviour we want.
We could write a more complicated test that only looks for the two-digit month in a particular place in the string, but
let's try using a regular expression to do this instead.
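To make the problem concrete, here is a minimal sketch that uses one record copied from the output above. The variable names record and found are only for illustration, and we assume the fields are separated by tabs.

```python
# One record from the sample above: the month is 01 and the day is 05.
record = 'Baker 1\t2011-01-05\t1410.0'

# A plain substring test cannot tell the day field from the month field,
# so this reports a hit even though the reading was not taken in May.
found = '05' in record
print(found)  # True
```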
We will work up to a solution that uses a regular expression in stages.
We start by importing the regular expressions library, called re, then examine each record in the list readings.
We then use the function re.search to try to find a match for the string '06' in a given record.
If the function finds a match, we print that record out.
import re
for r in readings:
    if re.search('06', r):
        print(r)
Baker 1 2010-06-24 1122.7
Note that the first argument to re.search is the pattern we are searching for,
written as a string. The second argument is the data we are searching
in. It's easy to reverse these accidentally, i.e., to put the data first
and the pattern second. This can be hard to track down, so please be careful.
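The sketch below suggests why a reversed call is hard to track down: Python raises no error, it just treats the record as the pattern and finds nothing. The names record, right, and wrong are ours, and the record is copied from the sample data, assuming tab-separated fields.

```python
import re

record = 'Baker 1\t2010-06-24\t1122.7'

# Correct order: pattern first, then the data to search in.
right = re.search('06', record)

# Reversed order: the record is used as the pattern and '06' as the data.
# No exception is raised; the search fails quietly and returns None.
wrong = re.search(record, '06')

print(right is not None)  # True
print(wrong is None)      # True
```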
So far, our code that uses re.search does the same thing as '06' in r. But if we want to match
'06' or '07', regular expressions let us combine the two comparisons
in a single expression:
for r in readings:
    if re.search('06|07', r):
        print(r)
Baker 1 2010-06-24 1122.7
Baker 2 2009-07-24 2819.0
The vertical bar in the pattern means "or". It tells the regular expression library that we want to match either the text to the left of the vertical bar or the text to the right of it.
We are going to be throwing a lot of regular expressions against our
data, so to help us check whether each one is doing the right thing,
let's write a function that will tell us which records match a
particular pattern. Our function show_matches takes a pattern and a
list of strings as arguments. It prints out two stars as a marker if the
pattern matches a string, and just indents with blanks if it does not:
def show_matches(pattern, strings):
    for s in strings:
        if re.search(pattern, s):
            print('**', s)  # Two stars mark a record that matches.
        else:
            print('  ', s)  # Two blanks keep non-matching records aligned.
If we use this function to match '06|07' against the data we read in
earlier, it prints stars beside the two records that have month '06' or month '07':
show_matches('06|07', readings)
   Baker 1 2009-11-17 1223.0
** Baker 1 2010-06-24 1122.7
** Baker 2 2009-07-24 2819.0
   Baker 2 2010-08-25 2971.6
   Baker 1 2011-01-05 1410.0
   Baker 2 2010-09-04 4671.6
   Davison/May 23, 2010/1724.7
   Pertwee/May 24, 2010/2103.8
   Davison/June 19, 2010/1731.9
   Davison/July 6, 2010/2010.7
   Pertwee/Aug 4, 2010/1731.3
   Pertwee/Sept 3, 2010/4981.0
But if we change the pattern to '06|7' (without a '0' in front of the
'7'), the pattern seems to match a lot of things that don't have the
month '06' or '07':
show_matches('06|7', readings)
** Baker 1 2009-11-17 1223.0
** Baker 1 2010-06-24 1122.7
** Baker 2 2009-07-24 2819.0
** Baker 2 2010-08-25 2971.6
   Baker 1 2011-01-05 1410.0
** Baker 2 2010-09-04 4671.6
** Davison/May 23, 2010/1724.7
   Pertwee/May 24, 2010/2103.8
** Davison/June 19, 2010/1731.9
** Davison/July 6, 2010/2010.7
** Pertwee/Aug 4, 2010/1731.3
   Pertwee/Sept 3, 2010/4981.0
To understand why, think back to mathematics. The expression ab+c means "a times b, plus c" because multiplication has higher precedence than addition. If we want to force the other meaning, we have to use parentheses and write a(b+c).
The same is true for regular expressions. Adjacency has higher
precedence than "or", so the pattern '06|7' means, "Either '06' or
the digit '7'". If we look back at our data, there are a lot of 7's in
our file, and this pattern matches every record that contains one.
If we want to match '06' or '07' without repeating the digit '0', we
have to parenthesize it as '0(6|7)'. Having said that, most people
probably find the expression '06|07' more readable anyway. Note that
the expression inside the parentheses is a regular expression in its own right,
and is therefore referred to as a sub-expression.
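As a quick check of the precedence rule, the following sketch runs both patterns over three dates taken from the sample records. The list name samples and the result names loose and strict are ours.

```python
import re

# Three dates taken from the records above.
samples = ['2010-06-24', '2009-07-24', '2009-11-17']

# '06|7' means "'06', or the single digit '7'".
loose = [s for s in samples if re.search('06|7', s)]

# '0(6|7)' means "'0' followed by either '6' or '7'".
strict = [s for s in samples if re.search('0(6|7)', s)]

print(loose)   # ['2010-06-24', '2009-07-24', '2009-11-17']
print(strict)  # ['2010-06-24', '2009-07-24']
```

The third date has no '06' or '07' in it, but it does contain a '7', which is enough for the unparenthesized pattern.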
Let's go back to our function and our data. If we use the pattern
'05', then as we said earlier, we will match records that have '05' as
the day as well as those with '05' as the month. We can force our match
to do the right thing by taking advantage of context. If the date is
formatted as YYYY-MM-DD then there should be a dash '-' before and
after the month, but only before the day. The pattern '-05-' should
therefore only match a month of '05'. Sure enough, if we give this
pattern to our function it doesn't match any records. This is the
correct answer, since none of the YYYY-MM-DD records in this sample are
from May (the May readings in the second notebook use a different date format).
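Because the sample has no YYYY-MM-DD record from May, the sketch below adds one made-up record so that we can see the pattern match as well as fail. That second record is hypothetical and does not come from the notebooks; the variable names are ours too.

```python
import re

records = [
    'Baker 1\t2011-01-05\t1410.0',  # real record: day 05, month 01
    'Baker 1\t2010-05-12\t1500.0',  # made-up record: month 05
]

# Without context, '05' matches both the day and the month.
plain = [r for r in records if re.search('05', r)]

# With a dash on each side, only a month of 05 can match.
in_context = [r for r in records if re.search('-05-', r)]

print(len(plain))       # 2
print(len(in_context))  # 1
```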
Matching is useful, but what we really want to do is extract the year, the month, and the day from our data so that we can reformat them. Parentheses can help here too: when a regular expression matches a piece of text, the library automatically remembers what matched against every parenthesized sub-expression.
Here's a simple example:
match = re.search('(2009|2010|2011)',
                  'Baker 1\t2009-11-17\t1223.0')
print(match.group(1))
The first string is our pattern. It will match 2009, 2010, or 2011, and
the parentheses around it will make the library remember which of those
three strings was matched. The second string is just the first record
from our data. (Remember, '\t' represents a tab.)
When re.search is called, it returns None if it doesn't find a
match, or a special match object if it does. The call to match.group
returns the text that matched the sub-expression inside the specified
set of parentheses, counting from the left. Since this pattern only has
one set of parentheses, match.group(1) returns whatever text matched what's inside them.
The way sub-expressions are numbered sometimes trips people up. While
Python normally counts from 0, the first group in a regular expression
is extracted with match.group(1), the second with match.group(2), and so forth. The
reason is that match.group(0) returns all of the text that the entire
pattern matched.
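The difference between group 0 and group 1 is easier to see when the pattern contains something outside the parentheses. In this sketch we add a trailing dash to the pattern purely for illustration, and we also check for None before calling group, since calling it on None would raise an error. Both records are copied from the sample data.

```python
import re

pattern = '(2009|2010|2011)-'

# This record has a year followed by a dash, so the search succeeds.
match = re.search(pattern, 'Baker 1\t2009-11-17\t1223.0')
if match is not None:
    print(match.group(0))  # 2009-  (everything the whole pattern matched)
    print(match.group(1))  # 2009   (only the parenthesized part)

# This record has a year but no dash after it, so the search fails.
no_match = re.search(pattern, 'Davison/May 23, 2010/1724.7')
print(no_match)  # None
```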
What if we want to match the month as well as the year? A regular
expression to match legal months would be
'(01|02|03|04|05|06|07|08|09|10|11|12)'. An expression to match days
would be almost three times longer, which would be hard to type and (more
importantly) hard to read.
Instead, we can use the dot character '.' to match any single
character. For example, the expression '....' matches exactly
four characters, and '....-..-..' matches four characters, a dash, two
more characters, another dash, and two more characters. If we put each
set of dots in parentheses as '(....)-(..)-(..)', the three groups
should record the year, month, and day each time there's a successful
match.
Let's test that out by calling re.search with the pattern we just
described and the first record from our data:
match = re.search('(....)-(..)-(..)',
                  'Baker 1\t2009-11-17\t1223.0')
print(match.group(1), match.group(2), match.group(3))
2009 11 17
When we print out the three groups, we get '2009', '11', and '17',
just as we wanted. Try doing that with substring searches…
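To apply the same pattern to more than one record, we need to allow for records it does not match. The sketch below wraps the search in a small helper; the name extract_date is hypothetical, and the two records are copied from the sample data. Keep in mind that '.' accepts any character, so this pattern would also accept text that is not made of digits.

```python
import re

records = [
    'Baker 1\t2009-11-17\t1223.0',
    'Davison/May 23, 2010/1724.7',
]

def extract_date(record):
    """Return (year, month, day) as strings, or None if nothing matches."""
    match = re.search('(....)-(..)-(..)', record)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)

for r in records:
    print(extract_date(r))
# ('2009', '11', '17')
# None
```

The second record has no dashes at all, so the search returns None and the helper passes that on instead of failing.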
Tip: If you want to match an actual dot (period, full stop) character '.', you must place a single backslash before it (i.e. '\.').
A single dot on its own (i.e. '.') matches any single character, as demonstrated above.
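Here is a small sketch of the difference. The string '1223x0' is made up for the test. We write the escaped pattern as a raw string (the r prefix), which is our addition rather than something the lesson requires, so that Python passes the backslash to the re library unchanged.

```python
import re

# An unescaped dot matches any character, so 'x' is accepted here.
any_char = re.search('1223.0', '1223x0')

# An escaped dot only matches a literal '.'.
escaped_miss = re.search(r'1223\.0', '1223x0')
escaped_hit = re.search(r'1223\.0', '1223.0')

print(any_char is not None)     # True
print(escaped_miss is not None) # False
print(escaped_hit is not None)  # True
```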
A literal string (e.g. '05' or '06') is a regular expression in its own right, and only matches itself. For example, '06' will not match the string "6", nor will it match "60"; it will only match "06". Also, 'A' only matches an upper-case A, not a lower-case one.
'|' means "or", and can be used to combine two comparisons in a single regular expression.
The '.' character matches any single character.
We use parentheses to enforce grouping and to remember what matched.

Write a regular expression that matches each of the following binary strings: 000, 101, and 111.
Write a regular expression to match all binary strings that are at least 1 digit long and at most 4 digits long.
Write a program which reads in a file containing the following words: hello, working, telling, as, meaningful, cold, world, caring, ingrid, and uses a regular expression to match all the words that end in 'ing'.