Hello and welcome to Chapter 11, Regular
Expressions

from the book Python for Informatics:
Exploring Information.

As always, these slides are copyright
Creative Commons Attribution, as well as

the audio and the video that you're
watching or listening to right now.

And, so regular expressions are an
interesting thing.

You've seen from, in the chapters up till
now, I've,

I've had a singular focus on sort of
pulling information out of data.

Raw data, this mailbox file that perhaps
you're getting tired of already.

But it's a lot of fun, because I can have

you go look for something and, and
pick it out.

And you're doing something that like would
be really painful to do sort of by hand.

And while it's not all of computing, I
mean, there's games

and there's, you know, things like
weather computations that do calculations,

pulling and extracting data out is a big
part of computing.

And so there's actually a library that's
built specifically to do this.

And, and if you start doing a few finds
and slicing, it gets kind of

long after a while, and that's why split,
for example,

really saved us a lot of time.

But sometimes the data that you are
looking for is a little

more sophisticated than broken into spaces
or colons or something like that.

And you just want to like tell something
to go find

I see what I want, and I see where it's
embedded in the string, go get it for me.

And regular expressions are themselves a
programming language.

They're like a really smart wild card for
searching.

So we've used wild cards in various

things in search, but they're, they're a
really smart version of a wild card.

And so, regular expressions are quite
powerful and they're very cryptic.

And as a matter of fact, you don't even
need

to learn them if you don't feel like it,
right?

I've got this little guide.

I need a guide for myself when I do
regular expressions.

It sometimes takes me a few minutes to
write

a regular expression to do exactly what I
want.

So in a way, writing a regular expression
is like program, writing a program.

It's highly specialized to searching and
extracting data from strings.

But it's like writing a program and it
takes a while to get

it right and you kind of like, oh, change
this, what about a slash there?

And, so, you, but they actually are kind
of fun.

And, and they are a great way to sort of
exchange little program snippets

to say, oh yeah, I'm looking for this, oh
here's a little reg expression you might

try and then, so they're, they're like
programs themselves.

It is this language of marker characters,
so when we

look for regular expressions, some
characters like A, B, C, have meaning

as A, B, C but some characters like caret or
dollar sign mean

at the beginning of the line, or at the
end of the line.

And so we encode in this string a, a
program, basically.

And so it's a rather old-school language.
It's from

a long time ago.

It predates Python, which is over 20 years
old, and so

it's, it also marks you as sort of a
little cool, right?

It's a, it's a distinct marking that makes

it so that you know something other people
don't.

Right? So you can know how to program, but
if you know regular expressions

it'll be like woah, I tried to look at those
and they're kind of tough.

In a way, knowing regular expressions is

kind of like a tattoo.

So I, it's casual Friday and that's why
I'm wearing a T-shirt

today and so I figured I would come in
today in a T-shirt,

but seeing as it's the first time I'm wearing
a short-sleeved shirt, it's

also the first time I can show you my,
show my real tattoo here.

So, here's my real tattoo and in the
middle is Sakai,

the open source learning management system
always close to my heart.

And then you have the IMS logo, which

is IMS Learning Tools Interoperability,
which is a standard,

it means a lot to me.
Blackboard, OLAT, Learning Objects, Angel,

Moodle, Instructure, Jenzabar, and
Desire2Learn.

I call this the ring of compliance,
because these are all

of the first six or seven learning
management systems that complied

with the IMS Learning Tools

Interoperability standards
specification, which is

something that I spent a lot of my life
making work.

So

I figured I'd make a tattoo and just
kind of

part of my rough, tough image and,
and actually

regular expressions are indeed part of my
rough, tough image,

because I'm like, I'm down with
regular expressions.

And people are like impressed with my
regular expression knowledge.

But as impressive as I am, I still need a
cheat sheet, so I'll have a cheat

sheet that you can download hopefully on
the pythonlearn

website or whatever, and I just, it

doesn't have to be much.

It's really just a kind of a, a crutch,
and these are the characters that have

special meaning, like caret or
dollar sign

match the beginning or end of line,
respectively.

So they're not really matching a dollar
sign, they match, they,

they mean something in our little mini
string-like programming language.

So, like many things that we do in Python
going forward, once you want some

sophisticated capability, it comes with
Python, but

it comes in the form of a library.

And so the regular expression library we
have to say import r-e

at the beginning of our programs to import
the regular expression library.

Then we call re.search to say I'm

looking for search from the regular
expression library.

There's two basic functions or method,
two, two basic

capabilities inside this library that
we're going to look at.

One is search, that replaces find, it's
like a smart find, and then

findall is a combination of a smart find
and an automatic extraction.

So we'll look at both of those in turn.

And I'll do it by comparing them to
existing

Python that you kind of already should
know at this point.

So here's some code that's, say, looking
for lines that

have the word fr-, have the string From
colon in them.

Right, so, we're going to open a file,
we're going to strip the white space.

We use find to hunt within line for
From.

If it's greater than or equal to zero then
we'll print it. And so this

is just going to give us a number. If it's,
if it's not found, it's negative one.

So it's only going to print the lines that
that have From in them.

Here is the equivalent using

regular expressions.
So these two things are equivalent.

So we have to import the library, like I

mentioned before, and all the rest of it's
the same.

The if test is re.search. That says within

the library re, call the search utility
and then

pass in the line, the string we're looking
for

and the line, the actual text we're
looking in.

So this is like look for From inside of
line and return me a

True or a False, whichever, depending on
whether you find it or not.
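
Here is a rough sketch of those two equivalent versions side by side. It is not necessarily the exact code on the slide: a short list of made-up lines stands in for the mailbox file, and the variable names are assumptions. One detail worth hedging is that re.search does not return a literal True or False. It returns a match object when it finds something and None when it does not, which behaves like true or false in an if test.

```python
import re

# Made-up lines standing in for the mailbox file (an assumption for illustration).
lines = [
    'From: stephen.marquard@uct.ac.za\n',
    'Subject: hello\n',
    'Reply sent From: the office\n',
]

# Version 1: find() returns the position of the text, or -1 when it is absent.
found_with_find = []
for line in lines:
    line = line.rstrip()
    if line.find('From:') >= 0:
        found_with_find.append(line)

# Version 2: re.search() returns a match object (truthy) or None (falsy).
found_with_search = []
for line in lines:
    line = line.rstrip()
    if re.search('From:', line):
        found_with_search.append(line)

print(found_with_find)
print(found_with_search)
```

Both lists end up holding the same two lines, because at this point the regular expression contains no special characters and is doing nothing smarter than find.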

Now you might say, I, you just got done
telling me that it, it was more dense.

And the answer is, there's a few more
characters here.

But we'll see in a second how you

can quickly add more power to the regular
expression.

Find, you have to start adding more

Python lines to make it more sophisticated
where in

the regular expression you start changing,

you change the search string to give more of

the direction of what you're looking for,
and that's what

we'll be doing, pretty much, is changing
the search string.

So now if we wanted to switch to say,
wait, wait, wait, we don't

just want the From anywhere in the line,
we want it to start with From.

So we would change it to
line.startswith('From'),

and that's either going to be true or false

depending on whether or not the
line starts with From.

Now, we do the same thing with

regular expressions by changing the
search string.

So now we are in regular expressions.

So this really just isn't a string, it's a
string plus

characters that are interpreted as

commands by the regular expression
library.

So the caret, which is the first one on
our,

our little regular expression sheet, matches
the beginning of the line.

It's not actually a caret.

So that says, the first character, this
two-character sequence, caret F,

means F but in column one, in the first
character of the line.

And so, again, this is going to give us a

True or a False, if this regular
expression matches.

The, the beginning of the line, From: and
it's the same as

this, it's, does it start with From.
So again, these two are equivalent.
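
A small sketch of that comparison, assuming a few made-up lines rather than the real file. The list names are hypothetical; only startswith and re.search come from the discussion above.

```python
import re

# Made-up sample lines (assumptions for illustration).
lines = [
    'From: stephen.marquard@uct.ac.za',
    'Reply sent From: the office',
    'Subject: From: appears later in this line',
]

# Plain Python: keep lines whose first characters are 'From:'.
by_startswith = [ln for ln in lines if ln.startswith('From:')]

# Regular expression: the caret anchors the match to the start of the line.
by_regex = [ln for ln in lines if re.search('^From:', ln)]

print(by_startswith)
print(by_regex)
```

Only the first line survives in both cases, since the other two have From: somewhere other than the start of the line.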

But you see the pattern where we're
going to do something to this string using

these characters that have meaning, okay?
So, the next thing that's

most commonly done other than caret and
dollar sign for the end of line, is

the wildcard characters and so, we've used
wildcards

possibly in like DOS, where we can use ?

or * in like a dir command. dir *.* if
you're familiar with that,

or even a Unix command like ls, you
know, star dot whatever.

This is not how regular expressions

work. And the problem is that dot

matches a single character in
regular expressions.

Asterisk means any number of times.

So if I look at this, if I look at
this and color-code this to make a

little more sense, the caret is actually
kind of part of the

regular expect, regular expression
programming language. Says I'm, I'm

I'm a virtual character matching the
beginning of line.

The X is a real character.

The dot is part of the regular expression
programming language, any character.

Star is part of the regular expression
programming, it says

the immediate previous character many
times, zero or more times.

And then colon matches the colon.

And so if you look at lines, these are the
kinds of lines that will give me a True.

Because they start with an X,

followed by some number of characters,
followed by a colon.

So that's true.

Start with a X, followed by some number of
characters, followed by a colon.

Okay?

And so that's basically how this works.
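
To make the five-character pattern concrete, here is a minimal sketch. The header lines are made up for illustration and the list name is an assumption.

```python
import re

# Made-up header lines (assumptions for illustration).
lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'Subject: this one does not start with X',
]

# ^  start of line      X  a literal X
# .  any character      *  the previous item, zero or more times
# :  a literal colon
matching = [ln for ln in lines if re.search('^X.*:', ln)]

for ln in matching:
    print(ln)
```

The three lines that begin with X and contain a colon somewhere later are kept, and the Subject line is skipped.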

And so this little, this, in this

five-character string there are, you know,
some of

these things are like instructions and
some of

them are the actual characters we're
looking for.

So the X and the colon

are the characters we're looking

for, and the caret, dot, and star are
programming.

Right? They are logic that we're adding
to the string.

Okay.

So let's say, for example, you're... 
Part of any of these things,

and part of the stuff we have done so far,

has to assume that the data is some
level of being clean and

so the data that I have been giving you,
mbox.txt, is not inconsistent.

Right? It doesn't have like too much
weirdness in it.

I'm not trying to trick you and
mislead you, although

we've had situations where you sort of get
a traceback because

you think there's going to be five words
you, you grab a line,

you break it, and there's only two
words and then you get

a traceback because you're looking at the
fifth word, or something like that.

But if your data is less clean, or even
if you just

want to be real careful, you can
fine-tune your matching.
fine-tune your matching.

So, here's that same match.

Give me a character X, followed by any
number of

characters, followed by a colon, and that's
what I'm looking for.

Give me lines that match that pattern.

So this X starts at any number of
characters,

colon, great, this, any number of
characters good, great.

Oh wait, and there's an email X that says

X Plane is two weeks behind sch, behind
schedule, colon, two weeks.

Well, the regular expression didn't know
that the dash made sense to you.

And you just assumed that everything that
started

with a capital X had a dash after it.

So X is what it starts with, any number of
any character, and then

a colon. So this becomes True.

This may not make you happy, right? It may
not be what you're looking for.

Because you haven't been specific enough
in your regular expression.

So, we can be more specific in our regular
expression.

So for example, this is a more specific
regular expression.

It still says start with an X as the first
character, then a dash,

that's a real character not a, then this

next thing, instead of being a dot, this
backslash capital S.

It's on the sheet.

Whoa. It's not on the sheet.

I lost the sheet. Come back, sheet.

I lost the sheet.

I can't live without my sheet.

Backslash capital S means a
non-whitespace character.

So that means spaces won't match.

And then I changed the asterisk, zero or
more times thing, to a plus.

And that means one or more times.

Here is a character, a non-whitespace.
These two things kind of work together.

A non-whitespace character at least one
time, as many as we like.

And then, a colon.

So, if we look here, it starts with X dash,

any number of non-whitespace
characters, and ends in colon.

Starts with X dash, any number

of non-whitespace characters, ends
in a colon.

True. True.

This one starts with an X, but doesn't
start with an X dash.

Oh, as a matter of fact, these characters
are blanks, so this becomes a False.

It does have an X and it does have a colon
and match the previous one,

but this one here is more specific.

Okay? So it's more specific and so it
matches what you want.
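
A sketch comparing the loose pattern with the more specific one, assuming three made-up lines including the troublesome "X Plane" one. A raw string (the r prefix) is used so that Python passes the backslash through to the regular expression library untouched.

```python
import re

# Made-up lines (assumptions for illustration).
lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X Plane is behind schedule: two weeks',
]

# Loose: X, then any characters, then a colon.
loose = [ln for ln in lines if re.search('^X.*:', ln)]

# Specific: X, a dash, one or more non-whitespace characters, then a colon.
strict = [ln for ln in lines if re.search(r'^X-\S+:', ln)]

print(len(loose), 'lines match the loose pattern')
print(len(strict), 'lines match the specific pattern')
```

The loose pattern accepts all three lines. The specific one rejects the "X Plane" line, because its second character is a blank rather than a dash.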

Now it depends on what you are looking for.

Maybe you do want this line,

and so you're looking for X. I don't
know. But if you want, you can be

increasingly sophisticated in what

you're looking for in a regular
expression.

So now, let's talk about extracting data.

So everything we've done so far is,
is it there or is it not.

But it's really common once

you find something that you want to
break it into pieces.

So we can combine the searching and the
parsing into one statement.

And instead of using search, which returns
for us a true/false, we are going to use

findall.
So in this example, I'm going to show

you a new syntax. The square bracket in
regular expression language means

a way to list a set of characters.

So this says, this is a single character
that says,

I want to match anything in the range
0 through 9.

Plus means one or more of those.

So that says, so this is, this whole thing
says one or more digits.

That's a regular expression that says one
or more digits.

You can put other things inside here.

You can put like, you know,

you could make a thing that says a b c d.
And that would say, I'm

going to match a single character that's
a or b or c or d. Or you could say like,

you know, 1 3 5 7, bracket.

That's a single character

that's either a 1 or a 3 or a 5 or a 7.

So the bracket is a list of matching

characters and the dash inside the
bracket means range.

We'll see in a second that you can stick a
not inside the bracket. It's on this.

So, so again, remember in this little

mini-language, we are programming, right?

We are giving instructions to the regular
expression engine, as it were. Okay?

So, if we do this, and here is an
expression that

says I would like to find, you know, things
that are one or more digits.

And so,

so it's one or more digits and, and so
it's going to look

through here and it's going to find it as
many times as it can.

So there is one or more digits, there is
one or more digits,

and there is one or more digits.

And so what findall gives us back is a
list of strings.

So it found it.

Where do I match?
Where do I match?

It's looking the whole time and then,
it says, oh, I've got it.

2, 19, 42.

So it actually extracts the strings that
match

and gives you a Python list of strings.

Python list of strings.

Kind of like split, except it's like a
super smart split, right?

It's split, but I've directed it what to
look for, and if,

so here's an example of, you know, that's
the one I just did.

Find me one or more digits and extract
them, so 2, 19, 42.
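
Here is a small sketch of findall with the bracket syntax. The sample strings are assumptions, chosen so that the first call produces the 2, 19, 42 mentioned above.

```python
import re

# Sample text (an assumption for illustration).
x = 'My 2 favorite numbers are 19 and 42'

# [0-9] is one character in the range 0 through 9; + means one or more.
digits = re.findall('[0-9]+', x)
print(digits)   # ['2', '19', '42']

# A bracket can also list individual characters: one character that is 1, 3, 5, or 7.
odd = re.findall('[1357]', 'x1 y2 z3 w8 v7')
print(odd)      # ['1', '3', '7']
```

Note that the results are strings, not integers, so they would still need int() or float() before doing arithmetic.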

Here I'm saying, using the same bracket
syntax, to look for a single

character A, capital A E I O or U, and one
or more

of those. And if you look, there are no
upper-case vowels in my string.

So it says I'm going to find all the
things that match

A E I O U. So things like AA would match
and, you know, OU would match.

And so that's what we, we would get if
they were in the string.

But because there are none, we get an
empty list.

So even if there are none, you get an
empty list.

So it always returns a list.

It may be a zero-length list, and that's
what you have

to check. Okay?
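
A quick sketch of the "nothing found" case, assuming the same kind of sample string as before. What findall hands back is always a list, so the thing to check is whether that list has length zero.

```python
import re

# Sample text with no upper-case vowels (an assumption for illustration).
x = 'My 2 favorite numbers are 19 and 42'

# One or more characters from the set A, E, I, O, U.
y = re.findall('[AEIOU]+', x)
print(y)        # []

if len(y) == 0:
    print('No upper-case vowels found')

# With text that does contain runs of upper-case vowels, they are extracted.
z = re.findall('[AEIOU]+', 'AA then OU')
print(z)        # ['AA', 'OU']
```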

Okay, now

matching has this notion of greedy,

where when you put one of these pluses

or asterisks it kind of has this outward
pushing feeling, right?

And so when you say,

I'm looking for something that starts with
an

F at the beginning of the line, followed

by one or more characters, followed by a

colon, you can think of this as pushing
outward.

So if we look at a line here that has From
colon using the colon

character, it will try to expand, so it
certainly has

to match the F and it's looking for a
colon, any number of characters,

but it's trying to make the string that
matches as big as possible.

So it skips over this colon and goes to
that

colon and so the thing that we get is
here.

And so, it ignored this and said I will
make as large a string as I can.

So, that that's the plus that's doing it.

Dot plus pushes, it's like, I've got a

colon, but is there another colon out
there?

So you push it, okay?

So that's greedy matching.

It can get you in some trouble, like being
greedy

in general, and both asterisk and plus sort
of behave

in a greedy way because they're zero or more
or one

or more characters, so they can sort of
push outward, okay?

Now you can turn this off.

It's a programming language, we can tweak
it, okay?

And so we add a question mark.

So this is a three-character sequence now.
So if you say dot plus question

mark, that says one or more of any
characters, push,

but instead of being greedy and pushing as
far as you can, this means stop

at the first. Stop at the first.

Oops, stop at the first.

I can never draw on this thing fast
enough.

Stop at the first.

Okay?

And that's it, just don't be greedy, don't

try to make the string as large as
possible.

Go with the smaller one, the smaller
possible one.

We still need to find an F, and we still
need

to find a colon, but when you find the
first colon, stop.

And so what this does is this changes it
so that

what we match is from colon instead of
going all the way.

So the greedy match pushes as far as it
can. The non-greedy match

is satisfied with the first thing that
meets the criterion of the string.

So this is a little three-character
programming sequence,

any character one or more times and not
greedy.
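
The difference is easiest to see with both patterns run against the same string. This is a minimal sketch, assuming a sample line that contains two colons.

```python
import re

# Sample line with two colons (an assumption for illustration).
x = 'From: Using the : character'

# Greedy: .+ pushes outward to the last colon it can reach.
greedy = re.findall('^F.+:', x)
print(greedy)       # ['From: Using the :']

# Non-greedy: .+? stops at the first colon that satisfies the pattern.
non_greedy = re.findall('^F.+?:', x)
print(non_greedy)   # ['From:']
```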

If, for example, we were trying to solve the
problem

of pulling the email address out of a
string.

Right?

We can make good use of this non-blank
character

and so the at sign is just a character and

then we can say, I want at least one
non-blank

character before it and at least one
non-blank character after it.

So the way regular expressions does it
says, okay, I find my at sign and

I push in a greedy manner outwards, as

long as there are non-blank characters,
push, push, push, push,

push, push, push, oops, stop.
Push, push, push, push, push, stop.

Okay?

So it's some number of non-blank
characters, an

at sign, followed by some number of
non-blank characters.

So it's, that's using greedy matching. It,
it's doing that, okay?

And so this is where we get Stephen
Marquard, we can, and,

and we would know if it wasn't there by
the empty list, right?

And so we get stephen.marquard@uct.ac.za.
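
A sketch of that extraction, assuming a sample line shaped like the From lines in the mailbox file. The second string is made up to show what comes back when there is no at sign.

```python
import re

# Sample line (an assumption modeled on the mailbox data).
x = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# \S+ is one or more non-whitespace characters, on each side of the at sign.
addresses = re.findall(r'\S+@\S+', x)
print(addresses)    # ['stephen.marquard@uct.ac.za']

# No at sign means nothing to extract, so the list is empty.
nothing = re.findall(r'\S+@\S+', 'No address on this line')
print(nothing)      # []
```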

Now, we can also fine-tune what we
extract, right?

In the previous slide, we extracted
whatever matched.

Right?

Whatever this matched, it looked across
the whole string and found it,

found the thing, shoved it over, and gave
us what it matched.

But it's possible to make the match larger
than what's extracted,

to extract a subset of the match, and we'll
see that on this next slide.

Okay?

So here's this same thing, which is an at
sign followed, and then

with non-blank characters as far as the
eye can see in either direction.

But I'm going to add to it caret From
space.

So, so this has to start at the beginning of the line,
that's what the caret means,

it's gotta have the word From,

it's gotta have one space and then,
immediately, it's gotta find this, right?

It's gotta find a series of non-blanks,
followed by an at sign,

followed by another series of one or
more non-blanks. And then

what we do, so this, if we didn't put
these parentheses

in, it would match and we would get all of
this data.

It would go to here.

But what we can do with the parentheses,
the parentheses are part

of the regular expression language,
saying,

okay, I want to match the whole thing.

The parentheses aren't matched against the
string up here.

I want to match the whole thing, but

I only want to extract this part in
parentheses.

So this whole thing is a regular
expression that's matched

and then the parentheses part is what's
retrieved for you.

And so this makes it so that the only time
it's going to

look for at signs is, are on lines that
start with From space.

It is going to want the immediate next
character to be a non-blank.

Some number of non-blank characters
followed by an at sign,

some number of non-blank characters, it's
going to stop right there.

And it's only going to extract from here
to here,

and so we get out Stephen Marquard.

But this is a pretty narrowly scoped thing
because

the first four characters have to be From
space.

And so that's a way to combine a stricter
match,

even though you don't actually want
all the data.

So you can add those things all over the
place.

Okay? Okay.
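
Here is a sketch of what the parentheses change, assuming the same kind of sample line. The "Reply to" line is made up to show that a line without From at the start gives nothing back.

```python
import re

# Sample line (an assumption modeled on the mailbox data).
x = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Without parentheses, findall returns everything that matched.
whole = re.findall(r'^From \S+@\S+', x)
print(whole)    # ['From stephen.marquard@uct.ac.za']

# With parentheses, the whole pattern must still match,
# but only the part inside the parentheses is returned.
part = re.findall(r'^From (\S+@\S+)', x)
print(part)     # ['stephen.marquard@uct.ac.za']

# A line that does not start with 'From ' is rejected even though it has an address.
other = re.findall(r'^From (\S+@\S+)', 'Reply to stephen.marquard@uct.ac.za')
print(other)    # []
```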

Then, we, we, we can compare the different
ways of extracting data.

So if we look at how we extract the host
name.

Remember how we did this many chapters ago.

So we did a data.find, which says oh,

the first at sign is at 21.
So the first at sign is at 21.

Then we say we want to find the space
after that.

So that's the space position, that's 31.
And then we want to extract the data

that's one beyond the at up to but not
including the space.

And that is the variable that we are
going to print out, host.

And so we've extracted this bit of
information and out comes the host.

Quite nice. Okay?
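
For reference, the find-and-slice approach might look like the sketch below. The variable names atpos and sppos are assumptions; the sample line is chosen so the positions come out as the 21 and 31 mentioned above.

```python
# Sample line (an assumption modeled on the mailbox data).
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Position of the first at sign.
atpos = data.find('@')
print(atpos)    # 21

# Position of the first space after the at sign.
sppos = data.find(' ', atpos)
print(sppos)    # 31

# Slice from one beyond the at sign up to but not including the space.
host = data[atpos + 1:sppos]
print(host)     # uct.ac.za
```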

We also saw another technique, and by the
way, all these techniques are okay.

All these techniques are fine.

Another technique we saw, once we sort of
played

with split and lists, was what we, what I

called a double split version of this,
where the

first thing we do is we split that line.

The first thing we do is split the line
on blanks, and then we know

that the second thing, which is the
sub one, words sub one,

is the entire email address. Then this is
the double split.

We take the email address and we split it by

an at sign and then we get a list of the

pieces of the email address, the email
name and the

email host, and then we grab the, the
sub one of that,

and then we have the host.

So that's a double, the double split way
of doing this, right?

Now in this, we still haven't done
the From yet,

but it is the double split way to do this.
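
A sketch of the double split, assuming the same sample line. The variable names are hypothetical.

```python
# Sample line (an assumption modeled on the mailbox data).
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# First split: break the line on blanks; the address is the second word.
words = line.split()
email = words[1]

# Second split: break the address on the at sign into name and host.
pieces = email.split('@')
host = pieces[1]

print(email)    # stephen.marquard@uct.ac.za
print(host)     # uct.ac.za
```

This version trusts that the line has at least two words and that the second word contains an at sign. On a line where that is not true, it would give a traceback or the wrong answer.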

So, if we think about how we would do
this in a regular expression, okay?

We're going to say, look through the
string, findall, we're going to,

use the findall, and the regular
expression exploded up says

look through the string for an at.
Do, do, do, do, do, do, got an at.

Then, oh, start extracting. End extracting.

And then this is another form of the

bracket syntax: this is one character, it's a

single character, match any non-blank
character, and

zero or more of them. Okay?

So find an at sign, start extracting,

end extracting, match, this is one character.

That is a set of possible matches, and
that's some character, this means not.

Okay? Not a blank, that's a blank

right there, that's a blank character
right there.

Not a blank, as many times as you want.

You might want to, we might want to turn

that into a plus to guarantee at least one.

So that might be better done as a plus
right there.

So this is, probably make more sense as a
plus, to say, I

want at least, after the at sign, I want
at least one non-blank character.

And the parentheses simply say, I don't
want the at sign.

So if the at sign, I really want those
non-blank characters after the at sign.

Okay? So that's what I want to extract.

So it's like, go find the at sign.

Okay, great, found the at sign. Start

extracting, look for non-blank characters,
end extracting.

So pull that part out and put it right
there.
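
Here is the regular expression version as a sketch, assuming the same sample line. The second half, with a made-up "meet @ noon" string, shows why the plus might be the safer choice than the star.

```python
import re

# Sample line (an assumption modeled on the mailbox data).
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Find an at sign, then extract zero or more characters that are not a blank.
host_star = re.findall('@([^ ]*)', line)
print(host_star)    # ['uct.ac.za']

# Same idea, but require at least one non-blank character after the at sign.
host_plus = re.findall('@([^ ]+)', line)
print(host_plus)    # ['uct.ac.za']

# With a bare at sign, the star version "succeeds" with an empty string,
# while the plus version reports no match at all.
print(re.findall('@([^ ]*)', 'meet @ noon'))    # ['']
print(re.findall('@([^ ]+)', 'meet @ noon'))    # []
```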

Now an even cooler version of this that

you probably kind of imagined right away is,

we say, you know what, I would like this
first character, the first

part of the line to be From, with a blank,
followed by any number of characters,

followed by an at sign, so the at sign is
real, then start

extracting, then any number of non-blank
characters, end extracting.

So this is a, this is like eight or nine
lines of Python

all rolled into one thing, okay?

So, start at the beginning of the line.
Look for string From, with a space.

Then skip a bunch of characters looking
for an at sign, skip characters until

you encounter an at sign, then start

extracting, match any non-blank, a single
non-blank character.

This is kind of like one non-blank

character, one non-blank character, but
once you

suffix it with the asterisk that changes it to
be many non-blank characters.

And then stop extracting, okay?

And so, you know, and so it's like find
From followed by a space, great.

That's the first part.

Now throw away characters until you find
an at sign.

Then start extracting.

Keep going with non-blank characters until
you hit

the first blank character and pull that
part out.

Now the result is we get the exact same

data. But with this added to it, we are
much more narrow in the kind of things

that we're looking for and if we get
noisy data that like, something like,

you know, meet at Joe's, right?
We don't want that.

That won't match, right?

We want that to be like a False.

And, and it allows us to sort of really
fine-tune our matching

and extracting. And this is just the
beginning, they are very, very powerful.
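
A sketch of the stricter version, assuming the same sample line plus one made-up noisy line. It compares the anchored pattern with the looser one to show what the extra From requirement buys.

```python
import re

# Sample lines (assumptions for illustration).
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
noisy = "Let's meet @Joe's at noon"

# Start of line, 'From ', skip characters up to an at sign,
# then extract the non-blank characters that follow it.
strict = '^From .*@([^ ]*)'
loose = '@([^ ]*)'

print(re.findall(strict, line))     # ['uct.ac.za']
print(re.findall(strict, noisy))    # []

# The looser pattern is fooled by the noisy line.
print(re.findall(loose, noisy))     # ["Joe's"]
```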

So, the last thing that I will show you is
sort of a program that is kind of like one

of the programs that we did in a previous
section,

except now we're going to use regular
expressions to do it.

So if you remember, we had this thing where

we're doing spam confidence, where we're
looking for lines and

you know, and pulling this number out and then

calculating the average, or the
maximum, or whatever.

And so here is a, we import the regular
expression library, we open the file,

we're going to do this by appending
to a list.

We'll put the numbers in a list rather
than doing the calculation in a loop.

We strip the data.

Now, here's the key thing, right?

We're going to have a regular expression
that says,

look for the first character being X,
followed by

a dash, followed by all this,
all this

exactly has to match literally, followed
by a colon.

And then there's a space, and then we
begin extracting and we are looking for

the digit 0 through 9 or a dot and
we are looking for one or

more, and then we end extracting.

So that's the, the parentheses are telling
us what to pull out.

So that just means that we're going to
pull out those numbers, all

the digits and the numbers, until we get
something other, I mean,

all the digits and the period, and we'll
get something other than

a digit and a period, and we, and then
we'll be done, okay?

And so if we, and so this is going to pull
those numbers out and give us back a list.

Now the thing about it is, we have

to realize that sometimes this is not
going to match, because

we're sending every line, not just the
ones that start

with X, we're sending every line through
this and so

we need to know when we didn't get a
match.

And that, the way we know we didn't get a
match is if the list, the

number of items in the list that we got
back, is zero, then we're going to continue.

So this is kind of our if where we're
searching for the needle in the haystack.

But then once we find what we are looking

for, the actual number that we are
interested in,

is already sitting here in stuff sub zero.
Okay?

And then we convert it to a float, we append it.

And when the loop is all done, we print
out the maximum.

Okay?
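
Put together, the program might look like the sketch below. It is a reconstruction from the description, not necessarily the exact code on the slide. To keep it runnable on its own, a short list of made-up lines stands in for the open file handle, and the header name X-DSPAM-Confidence is an assumption about which line carries the spam confidence number.

```python
import re

# Made-up lines standing in for the mailbox file (assumptions for illustration).
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n',
    'X-DSPAM-Confidence: 0.8475\n',
    'X-DSPAM-Probability: 0.0000\n',
    'X-DSPAM-Confidence: 0.6178\n',
    'X-DSPAM-Confidence: 0.9907\n',
]

numlist = []
for line in lines:
    line = line.rstrip()
    # Literal header text, a colon, a space, then extract digits and dots.
    stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line)
    # Every line goes through the pattern, so most of the time nothing matches.
    if len(stuff) == 0:
        continue
    # The extracted text is a string; convert it before storing it.
    num = float(stuff[0])
    numlist.append(num)

print('Maximum:', max(numlist))
```

With a real file, the for loop would iterate over the file handle instead of the list; everything else would stay the same.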

And so this is sort of encoding a number
of things

and ending up with a very, a very solid and
safe matching.

So we're really, it's hard for this to
find a line that's wrong and

you could even improve this a little bit
to make it even a little tighter

where we'd go find a number like 0.999.
You could say, oh, it's

all the numbers are zero dot, so

you could make this a little, a little more
precise.

So it wouldn't, so it would even skip
things that

you can make it, so it looks exactly the
way you want it to look.

So, I emphasize that this

is kind of a weird language and you need
some kind of thing.

We talked about all these.

We have the beginning of the line, we have
the end

of the line, matching any character,

matching space characters, matching
non-whitespace characters.

Star is a modifier that says zero or more
times.

Star question mark is a modifier that says
zero or more times non-greedy.

Plus is one or more times.

Plus question mark is one or more times
non-greedy.

When you have bracket syntax, it's a set,

it's a single character that's in the
listed set.

So that's lower-case vowels.

You can also have the first, if the first

character of this is a caret, that flips it.

So that means everything except capital
X, capital Y, capital Z.

So it's everything that's not in the set,

capital X, capital Y, capital Z, and then

you can also put dashes in to represent
ranges.

Bracket a through z and 0 through 9,
and lower-case

letters and digits will match, but again,
this is a single character.

Now, you can put a plus or a star after

these guys to make them happen more than
one time.

And you can even put them in twice.

So if I wanted a two-digit number, I could
say 0 dash 9, 0 dash 9.

Oops. This is one character.

This is one character and this is the
possible things.

So that's, you know, 0 0
would match.

1 0 would match, 99 would match, etc.

Okay?
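
The bracket forms from this summary can be tried out in a few lines. This is a sketch with made-up sample strings; each bracket still stands for a single character unless a plus or star follows it.

```python
import re

# One character from a listed set: lower-case vowels.
vowels = re.findall('[aeiou]', 'regular')
print(vowels)       # ['e', 'u', 'a']

# A caret as the first character inside the bracket flips the set:
# any single character except X, Y, or Z.
not_xyz = re.findall('[^XYZ]', 'XaYbZc')
print(not_xyz)      # ['a', 'b', 'c']

# Ranges: lower-case letters and digits, one or more times.
runs = re.findall('[a-z0-9]+', 'Hello World 42')
print(runs)         # ['ello', 'orld', '42']

# Two brackets in a row: exactly two digits side by side.
two_digits = re.findall('[0-9][0-9]', 'room 7, seat 42, code 100')
print(two_digits)   # ['42', '10']
```

Notice that the two-digit pattern picks up the first two digits of 100, because nothing in the pattern says the digits have to stand alone.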

And then the parentheses are the things
that if you

are in the middle of a big long matching
string and

you don't want to extract the whole thing,
you can limit the

things you're extracting to, to the stuff
that's just in there.

With all these characters that have all
this meaning,

we have to have a way to match those
characters.

So dollar sign is the end of a line.

But what if we're looking for something that

actually has a dollar sign in the string?

And that's what the backslash is for.

So if you put the backslash in front of

an otherwise meaningful character,
it becomes the actual character.

So this is saying match a dollar sign.

Those two characters say match a dollar
sign.

And then this says one character that's
0 through 9 or a, or a dot.

And then we put the plus modifier to say

at least one or more times and so that sort
of is

a greedy, of course.

So that will get us this and extract it,
including the dollar sign.

So the escape character is the backslash.
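
A short sketch of escaping, assuming a made-up sentence with a price in it. The raw string keeps Python from interpreting the backslash before the regular expression library sees it.

```python
import re

# Sample text (an assumption for illustration).
x = 'We just received $10.00 for cookies.'

# \$ is a real dollar sign; [0-9.]+ is one or more digits or dots.
amounts = re.findall(r'\$[0-9.]+', x)
print(amounts)      # ['$10.00']
```

Inside the square brackets the dot is just a dot, so it does not need a backslash there.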

Okay. So there we are.

Now we're done.

So this is a little bit cryptic.

It's, it's kind of a puzzle.

It's kind of fun.

And it's extremely powerful.
And you don't have to know it.

You don't have to learn it.

But if you do, you'll find that it's very
useful as we sort

of dig through data and are trying to
write things that are pretty quick.

And, and, and they, the thing I like about
regular expressions is that they

tend to be, if you write them well, they
tend to be less sensitive to bad data.

They tend to ignore data, they're, you

can put more detail, I exactly want this.

Whereas you're,

if you're writing find and extract, you're

making a lot of assumptions about the
data.

That it's clean and you're not going to,
you know, mis-hit on something.

So, okay, well, good luck, and you're

used to regular expressions, and we'll
see you later.