Python for Informatics - Chapter 11 - Regular Expressions

0:00 - 0:03

Hello and welcome to Chapter 11, Regular
Expressions
0:03 - 0:07

from the book Python for Informatics:
Exploring Information.
0:08 - 0:12

As always, these slides are copyright
Creative Commons Attribution, as well as
0:12 - 0:15

the audio and the video that you're
watching or listening to right now.
0:16 - 0:23

And, so regular expressions are an
interesting thing.
0:23 - 0:26

You've seen from, in the chapters up till
now, I've,
0:26 - 0:31

I've had a singular focus on sort of
pulling information out of data.
0:31 - 0:34

Raw data, this mailbox file that perhaps
you're getting tired of already.
0:34 - 0:36

But it's a lot of fun, because I can have
0:36 - 0:38

you go look for something and, and
pick it out.
0:38 - 0:42

And you're doing something that like would
be really painful to do sort of by hand.
0:45 - 0:47

And while it's not all of computing, I
mean, there's games
0:47 - 0:51

and there's, you know, things like
weather computations that do calculations,
0:52 - 0:57

pulling and extracting data out is a big
part of computing.
0:57 - 1:01

And so there's actually a library that's
built specifically to do this.
1:01 - 1:06

And, and if you start doing a few finds
and slicing, it gets kind of
1:06 - 1:08

long after a while and that's like split,
for example,
1:08 - 1:11

really saved us a lot of time.
1:11 - 1:14

But sometimes the data that you are
looking for is a little
1:14 - 1:18

more sophisticated than broken into spaces
or colons or something like that.
1:18 - 1:21

And you just want to like tell something
to go find
1:21 - 1:26

I see what I want, and I see where it's
embedded in the string, go get it for me.
1:26 - 1:29

And regular expressions are themselves a
programming language.
1:29 - 1:34

They're like a really smart wild card for
searching.
1:34 - 1:35

So we've used wild cards in various
1:35 - 1:40

things in search, but they're, they're a
really smart version of a wild card.
1:42 - 1:47

And so, regular expressions are quite
powerful and they're very cryptic.
1:47 - 1:49

And as a matter of fact, you don't even
need
1:49 - 1:51

to learn them if you don't feel like it,
right?
1:52 - 1:53

I've got this little guide.
1:53 - 1:56

I need a guide for myself when I do
regular expressions.
1:56 - 1:58

It sometimes takes me a few minutes to
write
1:58 - 2:00

a regular expression to do exactly what I
want.
2:00 - 2:05

So in a way, writing a regular expression
is like program, writing a program.
2:05 - 2:09

It's highly specialized to searching and
extracting data from strings.
2:09 - 2:12

But it's like writing a program and it
takes a while to get
2:12 - 2:15

it right and you kind of like, oh, change
this, what about a slash there?
2:15 - 2:18

And, so, you, but they actually are kind
of fun.
2:18 - 2:22

And, and they are a great way to sort of
exchange little program snippets
2:22 - 2:25

to say, oh yeah, I'm looking for this, oh
here's a little reg expression you might
2:25 - 2:28

try and then, so they're, they're like
programs themselves.
2:30 - 2:33

It is this language of marker characters,
so when we
2:33 - 2:37

look for regular expressions, some
characters like A, B, C, have meaning
2:37 - 2:41

as A, B, C but some characters like caret or
dollar sign mean
2:41 - 2:43

at the beginning of the line, or at the
end of the line.
2:43 - 2:47

And so we encode in this string a, a
program, basically.
2:47 - 2:51

And so it's a rather old-school language.
It's from
2:51 - 2:52

a long time ago.
2:52 - 2:55

It predates Python, which is over 20 years
old, and so
2:55 - 3:01

it's, it also marks you as sort of a
little cool, right?
3:01 - 3:04

It's a, it's a distinct marking that makes
3:04 - 3:06

it so that you know something other people
don't.
3:06 - 3:10

Right? So you can know how to program, but
if you know regular expressions
3:10 - 3:13

it'll be like woah, I tried to look at those
and they're kind of tough.
3:13 - 3:16

In a way, knowing regular expressions is
3:16 - 3:18

kind of like a tattoo.
3:18 - 3:21

So I, it's casual Friday and that's why
I'm wearing a T-shirt
3:21 - 3:24

today and so I figured I would come in
today in a T-shirt,
3:24 - 3:26

but seeing as it's the first time I'm wearing
a short-sleeved shirt, it's
3:26 - 3:29

also the first time I can show you my,
show my real tattoo here.
3:29 - 3:33

So, here's my real tattoo and in the
middle is Sakai,
3:33 - 3:36

the open source learning management system
always close to my heart.
3:36 - 3:38

And then you have the IMS logo, which
3:38 - 3:41

is IMS Learning Tools Interoperability,
which is a standard,
3:41 - 3:46

it means a lot to me.
Blackboard, OLAT, Learning Objects, Angel,
3:46 - 3:52

Moodle, Instructure, Jenzabar, and
Desire2Learn.
3:52 - 3:54

I call this the ring of compliance,
because these are all
3:54 - 4:00

of the first six or seven learning
management systems that complied
4:00 - 4:01

with the IMS Learning Tools
4:01 - 4:03

Interoperability standards
specification, which is
4:03 - 4:06

something that I spent a lot of my life
making work.
4:06 - 4:07

So
4:07 - 4:10

I figured I'd make a tattoo and just
kind of
4:10 - 4:13

part of my rough, tough image and,
and actually
4:13 - 4:16

regular expressions are indeed part of my
rough, tough image,
4:16 - 4:19

because I'm like, I'm down with
regular expressions.
4:19 - 4:23

And people are like impressed with my
regular expression knowledge.
4:23 - 4:27

But as impressive as I am, I still need a
cheat sheet, so I'll have a cheat
4:27 - 4:29

sheet that you can download hopefully on
the pythonlearn
4:29 - 4:32

website or whatever, and I just, it
4:32 - 4:33

doesn't have to be much.
4:33 - 4:36

It's really just a kind of a, a crutch,
and these are the characters that have
4:36 - 4:38

special meaning, like caret or
dollar sign
4:38 - 4:41

match the beginning or end of line,
respectively.
4:41 - 4:44

So they're not really matching a dollar
sign, they match, they,
4:44 - 4:47

they mean something in our little mini
string-like programming language.
4:49 - 4:53

So, like many things that we do in Python
going forward, once you want some
4:53 - 4:56

sophisticated capability, it comes with
Python, but
4:56 - 4:58

it comes in the form of a library.
4:58 - 5:01

And so the regular expression library we
have to say import r-e
5:01 - 5:04

at the beginning of our programs to import
the regular expression library.
5:04 - 5:06

Then we call re.search to say I'm
5:06 - 5:09

looking for search from the regular
expression library.
5:09 - 5:12

There's two basic functions or method,
two, two basic
5:12 - 5:14

capabilities inside this library that
we're going to look at.
5:14 - 5:19

One is search, that replaces find, it's
like a smart find, and then
5:19 - 5:24

findall is a combination of a smart find
and an automatic extraction.
5:24 - 5:26

So we'll look at both of those in turn.
5:26 - 5:29

And I'll do it by comparing them to
existing
5:29 - 5:31

Python that you kind of already should
know at this point.
5:34 - 5:37

So here's some code that's, say, looking
for lines that
5:37 - 5:40

have the word fr-, have the string From
colon in them.
5:40 - 5:44

Right, so, we're going to open a file,
we're going to strip the white space.
5:44 - 5:48

If we find we, hunt within line for
From.
5:48 - 5:51

If it's greater than or equal to zero then
we'll print it. And so this
5:51 - 5:55

is just going to give us a number. If it's,
if it's not found, it's negative one.
5:55 - 5:58

So it's only going to print the lines that
have From in them.
5:58 - 6:00

Here is the equivalent using
6:00 - 6:03

regular expressions.
So these two things are equivalent.
6:03 - 6:05

So we have to import the library, like I
6:05 - 6:07

mentioned before, and all the rest of it's
the same.
6:07 - 6:11

The if test is re.search. That says within
6:11 - 6:15

the library re, call the search utility
and then
6:15 - 6:18

pass in the line, the string we're looking
for
6:18 - 6:20

and the line, the actual text we're
looking in.
6:20 - 6:25

So this is like look for From: inside of
line and give me back something that acts like a
6:25 - 6:29

True or a False in an if test, depending on
whether you find it or not.
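
A minimal sketch of the two equivalent tests just described. The sample lines are made up and stand in for the mailbox file. Note that re.search gives back a match object or None rather than a literal True or False, which is why it works directly in an if test.

```python
import re

# Hypothetical sample lines standing in for the mailbox file.
lines = [
    "From: stephen.marquard@uct.ac.za",
    "Subject: sakai build",
    "Reply-To: From: the committee",
]

# Plain string version: find() returns -1 when the text is absent.
found_with_find = [line for line in lines if line.find("From:") >= 0]

# Regular expression version: search() returns a match object or None.
found_with_search = [line for line in lines if re.search("From:", line)]

print(found_with_find)
print(found_with_search)
```

Both lists hold the same two lines, since "From:" appears anywhere in each of them.
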
6:29 - 6:33

Now you might say, I, you just got done
telling me that it, it was more dense.
6:33 - 6:35

And the answer is, there's a few more
characters here.
6:35 - 6:36

But we'll see in a second how you
6:36 - 6:39

can quickly add more power to the regular
expression.
6:39 - 6:41

Find, you have to start adding more
6:41 - 6:43

Python lines to make it more sophisticated
where in
6:43 - 6:46

the regular expression you start changing,
6:46 - 6:50

you change the search string to give more of
6:50 - 6:52

the direction of what you're looking for,
and that's what
6:52 - 6:55

we'll be doing, pretty much, is changing
the search string.
6:55 - 6:58

So now if we wanted to switch to say,
wait, wait, wait, we don't
6:58 - 7:03

just want the From anywhere in the line,
we want it to start with From.
7:03 - 7:06

So we would change
line.startswith('From'),
7:06 - 7:07

and that's either going to be true or false
7:07 - 7:10

depending on whether or not the
line starts with From.
7:10 - 7:12

Now, we do the same thing with
7:12 - 7:15

regular expressions by changing the
search string.
7:16 - 7:17

So now we are in regular expressions.
7:17 - 7:20

So this really just isn't a string, it's a
string plus
7:20 - 7:22

characters that are interpreted as
7:22 - 7:24

commands by the regular expression
library.
7:24 - 7:28

So the caret, which is the first one on
our,
7:28 - 7:32

our little regular expression sheet, matches
the beginning of the line.
7:32 - 7:33

It's not actually a caret.
7:33 - 7:37

So that says, the first character, this
two-character sequence, caret F,
7:37 - 7:41

means F but in column one, in the first
character of the line.
7:41 - 7:43

And so, again, this is going to give us a
7:43 - 7:46

True or a False, if this regular
expression matches.
7:46 - 7:50

The, the beginning of the line, From: and
it's the same as
7:50 - 7:54

this, it's, does it start with From.
So again, these two are equivalent.
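
A small sketch of the startswith test next to the caret version, again using made-up sample lines rather than the real file:

```python
import re

# Hypothetical sample lines standing in for the mailbox file.
lines = [
    "From: stephen.marquard@uct.ac.za",
    "Subject: sakai build",
    "Reply-To: From: the committee",
]

# String method: True only when the line begins with the text.
starts_plain = [line for line in lines if line.startswith("From:")]

# Regular expression: the caret anchors the match to the start of the line.
starts_regex = [line for line in lines if re.search("^From:", line)]

print(starts_plain)
print(starts_regex)
```

Only the first line is kept by either test, because the third line has From: in the middle rather than at the start.
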
7:54 - 8:00

But you see the pattern where we're
going to do something to this string using
8:00 - 8:06

these characters that have meaning, okay?
So, the next thing that's
8:06 - 8:12

most commonly done other than caret and
dollar sign for the end of line, is
8:12 - 8:16

the wildcard characters and so, we've used
wildcards
8:16 - 8:20

possibly in like DOS, where we can use ?
8:20 - 8:25

or * in like a dir command, dir *.* if
you're familiar with that,
8:25 - 8:30

or even a Unix command like ls, you
know, star dot whatever.
8:30 - 8:32

This is not how regular expressions
8:32 - 8:34

work. And the difference is that dot
8:34 - 8:38

matches any single character in
regular expressions.
8:38 - 8:41

Asterisk means any number of times.
8:41 - 8:47

So if I look at this, if I look at
this and color-code this to make a
8:47 - 8:52

little more sense, the caret is actually
kind of part of the
8:52 - 8:57

regular expect, regular expression
programming language. Says I'm, I'm
8:57 - 8:59

I'm a virtual character matching the
beginning of line.
8:59 - 9:01

The X is a real character.
9:01 - 9:05

The dot is part of the regular expression
programming language, any character.
9:05 - 9:08

Star is part of the regular expression
programming, it says
9:08 - 9:12

the immediate previous character many
times, zero or more times.
9:12 - 9:15

And then colon matches the colon.
9:15 - 9:20

And so if you look at lines, these are the
kinds of lines that will give me a True.
9:20 - 9:22

Because they start with an X,
9:22 - 9:26

followed by some number of characters,
followed by a colon.
9:26 - 9:27

So that's true.
9:27 - 9:31

Start with a X, followed by some number of
characters, followed by a colon.
9:31 - 9:32

Okay?
9:32 - 9:35

And so that's basically how this works.
9:35 - 9:39

And so this little, this, in this
9:39 - 9:42

five-character string there are, you know,
some of
9:42 - 9:44

these things are like instructions and
some of
9:44 - 9:46

them are the actual characters we're
looking for.
9:46 - 9:48

So the X and the colon
9:48 - 9:49

are the characters we're looking
9:49 - 9:55

for, and the caret, dot, and star are
programming.
9:55 - 9:57

Right? They are logic that we're adding
to the string.
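
A sketch of the five-character pattern described above, caret X dot star colon. The header lines here are invented samples, not taken from the slide:

```python
import re

# Hypothetical sample lines; the header names are only for illustration.
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "Subject: hello",
    "Received: from X",
]

# ^  start of line      X  a literal X
# .  any one character  *  zero or more of the previous thing
# :  a literal colon
pattern = "^X.*:"

matches = [line for line in lines if re.search(pattern, line)]
print(matches)
```

The last sample line contains an X and a colon but does not start with X, so it is not kept.
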
10:00 - 10:01

Okay.
10:01 - 10:05

So let's say, for example, you're...
Part of any of these things,
10:05 - 10:07

and part of the stuff we have done so far,
10:07 - 10:11

has to assume that the data is some
level of being clean and
10:11 - 10:14

so the data that I have been giving you,
mbox.txt, is not inconsistent.
10:15 - 10:18

Right? It doesn't have like too much
weirdness in it.
10:18 - 10:20

I'm not trying to trick you and
mislead you, although
10:20 - 10:23

we've had situations where you sort of get
a traceback because
10:23 - 10:25

you think there's going to be five words
you, you grab a line,
10:25 - 10:28

you break it, and there's only two
words and then you get
10:28 - 10:31

a traceback because you're looking at the
fifth word, or something like that.
10:33 - 10:35

But if your data is less clean, or even if
you just
10:35 - 10:40

want to be real careful, you can
fine-tune your matching.
10:40 - 10:43

So, here's that same match.
10:43 - 10:45

Give me a character X, followed by any
number of
10:45 - 10:48

characters, followed by a colon, and that's
what I'm looking for.
10:48 - 10:50

Give me lines that match that pattern.
10:50 - 10:52

So this one starts with X, any number of
characters,
10:52 - 10:55

colon, great, this, any number of
characters good, great.
10:55 - 10:57

Oh wait, and there's an email X that says
10:57 - 11:01

X Plane is two weeks behind sch, behind
schedule, colon, two weeks.
11:01 - 11:06

Well, the regular expression didn't know
that the dash made sense to you.
11:06 - 11:07

And you just assumed that everything that
started
11:07 - 11:09

with a capital X had a dash after it.
11:09 - 11:15

So X is what it starts with, any number of
any character, and then
11:15 - 11:17

a colon. So this becomes True.
11:17 - 11:22

This may not make you happy, right? It may
not be what you're looking for.
11:22 - 11:26

Because you haven't been specific enough
in your regular expression.
11:26 - 11:31

So, we can be more specific in our regular
expression.
11:31 - 11:35

So for example, this is a more specific
regular expression.
11:35 - 11:40

It still says start with an X as the first
character, then a dash,
11:40 - 11:43

that's a real character not a, then this
11:43 - 11:47

next thing, instead of being a dot, this
backslash capital S.
11:47 - 11:50

It's on the sheet.
11:50 - 11:51

Whoa. It's not on the sheet.
11:51 - 11:54

I lost the sheet. Come back, sheet.
11:55 - 11:55

I lost the sheet.
11:56 - 11:59

I can't live without my sheet.
12:01 - 12:06

Backslash capital S means a
non-whitespace character.
12:06 - 12:09

So that means spaces won't match.
12:09 - 12:14

And then I changed the asterisk, zero or
more times thing, to a plus.
12:14 - 12:16

And that means one or more times.
12:16 - 12:20

Here is a character, a non-whitespace.
These two things kind of work together.
12:20 - 12:25

A non-whitespace character at least one
time, as many as we like.
12:25 - 12:26

And then, a colon.
12:27 - 12:31

So, if we look here, it starts with X dash,
12:31 - 12:35

any number of non-whitespace
characters, and ends in colon.
12:35 - 12:37

Starts with X dash, any number
12:37 - 12:40

of non-whitespace characters, ends
in a colon.
12:40 - 12:42

True. True.
12:42 - 12:46

This one starts with an X, but doesn't
start with an X dash.
12:46 - 12:49

Oh, as a matter of fact, these characters
are blanks, so this becomes a False.
12:49 - 12:53

It does have an X and it does have a colon
and match the previous one,
12:53 - 12:56

but this one here is more specific.
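
A sketch contrasting the loose pattern with the more specific one. The sample lines are assumptions made up to mirror the cases described, including the X Plane line:

```python
import re

# Hypothetical sample lines mirroring the cases described above.
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X Plane is behind schedule: two weeks",
]

loose = "^X.*:"       # X, then anything at all, then a colon
strict = r"^X-\S+:"   # X, a dash, one or more non-whitespace characters, a colon

loose_matches = [line for line in lines if re.search(loose, line)]
strict_matches = [line for line in lines if re.search(strict, line)]

print(loose_matches)
print(strict_matches)
```

The loose pattern keeps all three lines, while the strict one drops the X Plane line because it has no dash after the X.
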
13:00 - 13:03

Okay? So it's more specific and so it
matches what you want.
13:03 - 13:04

Now it depends on what you are looking for.
13:04 - 13:05

Maybe you do want this line,
13:05 - 13:09

and so you're looking for X. I don't
know. But if you want, you can be
13:09 - 13:13

increasingly sophisticated in what
13:13 - 13:15

you're looking for in a regular
expression.
13:15 - 13:20

So now, let's talk about extracting data.
13:20 - 13:24

So everything we've done so far is,
is it there or is it not.
13:24 - 13:25

But it's really common once
13:25 - 13:27

you find something that you want to
break it into pieces.
13:27 - 13:32

So we can combine the searching and the
parsing into one statement.
13:33 - 13:37

And instead of using search, which returns
for us a true/false, we are going to use
13:37 - 13:42

findall.
So in this example, I'm going to show
13:42 - 13:51

you a new syntax. The square bracket in
regular expression language means
13:51 - 13:53

a way to list a set of characters.
13:53 - 13:58

So this says, this is a single character
that says,
13:58 - 14:00

I want to match anything in the range
0 through 9.
14:02 - 14:04

Plus means one or more of those.
14:04 - 14:09

So that says, so this is, this whole thing
says one or more digits.
14:09 - 14:12

That's a regular expression that says one
or more digits.
14:12 - 14:13

You can put other things inside here.
14:15 - 14:16

You can put like, you know,
14:17 - 14:22

you could make a thing that says a b c d.
And that would say, I'm
14:22 - 14:26

going to match a single character that's
a or b or c or d. Or you could say like,
14:27 - 14:32

you know, 1 3 5 7, bracket.
14:32 - 14:33

That's a single character
14:33 - 14:35

that's either a 1 or a 3 or a 5 or a 7.
14:35 - 14:37

So the bracket is a list of matching
14:37 - 14:41

characters and the dash inside the
bracket means range.
14:41 - 14:45

We'll see in a second that you can stick a
not inside the bracket. It's on this.
14:45 - 14:47

So, so again, remember in this little
14:47 - 14:50

mini-language, we are programming, right?
14:50 - 14:55

We are giving instructions to the regular
expression engine, as it were. Okay?
14:58 - 15:03

So, if we do this, and here is an
expression that
15:03 - 15:09

says I would like to find, you know, things
that are one or more digits.
15:09 - 15:10

And so,
15:14 - 15:17

so it's one or more digits and, and so
it's going to look
15:17 - 15:19

through here and it's going to find it as
many times as it can.
15:21 - 15:24

So there is one or more digits, there is
one or more digits,
15:24 - 15:27

and there is one or more digits.
15:27 - 15:30

And so what findall gives us back is a
list of strings.
15:30 - 15:32

So it found it.
15:32 - 15:33

Where do I match?
Where do I match?
15:33 - 15:38

It's looking the whole time and then,
it says, oh, I've got it.
15:38 - 15:39

2, 19, 42.
15:39 - 15:43

So it actually extracts the strings that
match
15:43 - 15:47

and gives you a Python list of strings.
15:47 - 15:48

Python list of strings.
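
A sketch of findall with the one-or-more-digits pattern. The sample sentence is an assumption chosen so that the numbers 2, 19, and 42 mentioned above come out:

```python
import re

# Hypothetical sample string containing the numbers 2, 19 and 42.
x = "My 2 favorite numbers are 19 and 42"

# [0-9] is one character in the range 0 through 9; + means one or more.
numbers = re.findall("[0-9]+", x)
print(numbers)  # ['2', '19', '42']
```

The results are strings, so they would still need int() or float() before doing arithmetic.
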
15:48 - 15:53

Kind of like split, except it's like a
super smart split, right?
15:53 - 15:57

It's split, but I've directed it what to
look for, and if,
16:01 - 16:05

so here's an example of, you know, that's
the one I just did.
16:05 - 16:10

Find me one or more digits and extract
them, so 2, 19, 42.
16:10 - 16:14

Here I'm saying, using the same bracket
syntax, to look for a single
16:14 - 16:20

character A, capital A E I O or U, and one
or more
16:20 - 16:25

of those. And if you look, there are no
upper-case vowels in my string.
16:25 - 16:27

So it says I'm going to find all the
things that match
16:27 - 16:36

A E I O U. So things like AA would match
and, you know, OU would match.
16:37 - 16:39

And so that's what we, we would get if
they were in the string.
16:41 - 16:44

But because there are none, we get an
empty list.
16:44 - 16:46

So even if there are none, you get an
empty list.
16:46 - 16:48

So it always returns a list.
16:48 - 16:52

It may be a zero-length list, and that's
what you have
16:52 - 16:54

to check. Okay?
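
A sketch of the upper-case vowel search, using assumed sample strings. It shows that findall hands back an empty list when nothing matches:

```python
import re

# Hypothetical sample strings for illustration.
x = "My 2 favorite numbers are 19 and 42"
y = "AA then OU"

# One or more characters, each of which is an upper-case vowel.
none_found = re.findall("[AEIOU]+", x)
some_found = re.findall("[AEIOU]+", y)

print(none_found)  # []
print(some_found)  # ['AA', 'OU']

# Checking for "nothing found" is a length test on the list.
if len(none_found) == 0:
    print("no upper-case vowels")
```
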
17:00 - 17:02

Okay, now
17:03 - 17:06

matching has this notion of greedy,
17:07 - 17:10

where when you put one of these pluses
17:10 - 17:16

or asterisks it kind of has this outward
pushing feeling, right?
17:16 - 17:17

And so when you say,
17:17 - 17:19

I'm looking for something that starts with
an
17:19 - 17:22

F at the beginning of the line, followed
17:22 - 17:24

by one or more characters, followed by a
17:24 - 17:27

colon, you can think of this as pushing
outward.
17:27 - 17:32

So if we look at a line here that has From
colon using the colon
17:32 - 17:37

character, it will try to expand, so it
certainly has
17:37 - 17:43

to match the F and it's looking for a
colon, any number of characters,
17:43 - 17:47

but it's trying to make the string that
matches as big as possible.
17:47 - 17:50

So it skips over this colon and goes to
that
17:50 - 17:52

colon and so the thing that we get is
here.
17:52 - 17:56

And so, it ignored this and said I will
make as large a string as I can.
17:57 - 17:59

So, that's the plus that's doing it.
17:59 - 18:04

Dot plus pushes, it's like, I've got a
18:04 - 18:07

colon, but is there another colon out
there?
18:07 - 18:09

So you push it, okay?
18:09 - 18:11

So that's greedy matching.
18:11 - 18:15

It can get you in some trouble, like being
greedy
18:15 - 18:18

in general, and both asterisk and plus sort
of behave
18:18 - 18:20

in a greedy way because they're zero or more
or one
18:20 - 18:24

or more characters, so they can sort of
push outward, okay?
18:26 - 18:28

Now you can turn this off.
18:28 - 18:32

It's a programming language, we can tweak
it, okay?
18:32 - 18:36

And so we add a question mark.
18:36 - 18:41

So this is a three-character sequence now.
So if you say dot plus question
18:41 - 18:46

mark, that says one or more of any
characters, push,
18:46 - 18:52

but instead of being greedy and pushing as
far as you can, this means stop
18:52 - 18:57

at the first. Stop at the first.
18:57 - 18:59

Oops, stop at the first.
18:59 - 19:02

I can never draw on this thing fast
enough.
19:02 - 19:03

Stop at the first.
19:03 - 19:04

Okay?
19:04 - 19:06

And that's it, just don't be greedy, don't
19:06 - 19:08

try to make the string as large as
possible.
19:08 - 19:11

Go with the smaller one, the smaller
possible one.
19:11 - 19:13

We still need to find an F, and we still
need
19:13 - 19:17

to find a colon, but when you find the
first colon, stop.
19:17 - 19:19

And so what this does is this changes it
so that
19:19 - 19:23

what we match is From colon instead of
going all the way.
19:23 - 19:27

So the greedy match pushes as far as it
can. The non-greedy match
19:27 - 19:33

is satisfied with the first thing that
meets the criterion of the string.
19:33 - 19:36

So this is a little three-character
programming sequence,
19:36 - 19:39

any character one or more times and not
greedy.
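
A sketch of greedy and non-greedy matching side by side. The sample line is an assumption built to have two colons in it:

```python
import re

# Hypothetical sample line with two colons in it.
x = "From: Using the : character"

# Greedy: .+ pushes outward to the last colon it can reach.
greedy = re.findall("^F.+:", x)

# Non-greedy: .+? stops at the first colon that satisfies the pattern.
non_greedy = re.findall("^F.+?:", x)

print(greedy)      # ['From: Using the :']
print(non_greedy)  # ['From:']
```
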
19:48 - 19:51

If, for example, we were trying to solve the
problem
19:51 - 19:53

of pulling the email address out of a
string.
19:55 - 19:55

Right?
19:57 - 20:01

We can make good use of this non-blank
character
20:01 - 20:04

and so the at sign is just a character and
20:04 - 20:08

then we can say, I want at least one
non-blank
20:08 - 20:12

character before it and at least one
non-blank character after it.
20:12 - 20:16

So the way regular expressions does it
says, okay, I find my at sign and
20:16 - 20:20

I push in a greedy manner outwards, as
20:20 - 20:22

long as there are non-blank characters,
push, push, push, push,
20:22 - 20:27

push, push, push, oops, stop.
Push, push, push, push, push, stop.
20:27 - 20:27

Okay?
20:27 - 20:30

So it's some number of non-blank
characters, an
20:30 - 20:33

at sign, followed by some number of
non-blank characters.
20:33 - 20:38

So it's, that's using greedy matching. It,
it's doing that, okay?
20:38 - 20:41

And so this is where we get Stephen
Marquard, we can, and,
20:41 - 20:46

and we would know if there wasn't one there by
the empty list, right?
20:46 - 20:51

And so we get stephen.marquard@uct.ac.za.
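
A sketch of pulling an email address out with non-blank characters on both sides of the at sign. The full sample line is an assumption, built around the address mentioned above:

```python
import re

# Hypothetical mailbox-style line containing the address mentioned above.
x = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# \S+ is one or more non-whitespace characters; @ is just a literal at sign.
addresses = re.findall(r"\S+@\S+", x)
print(addresses)  # ['stephen.marquard@uct.ac.za']

# A line with no at sign gives back an empty list.
nothing = re.findall(r"\S+@\S+", "Subject: sakai build")
print(nothing)  # []
```
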
20:53 - 20:59

Now, we can also fine-tune what we
extract, right?
20:59 - 21:05

In the previous slide, we extracted
whatever matched.
21:05 - 21:06

Right?
21:06 - 21:10

Whatever this matched, it looked across
the whole string and found it,
21:10 - 21:15

found the thing, shoved it over, and gave
us what it matched.
21:15 - 21:19

But it's possible to make the match larger
than what's extracted,
21:19 - 21:23

to extract a subset of the match, and we'll
see that on this next slide.
21:23 - 21:24

Okay?
21:24 - 21:30

So here's this same thing, which is an at
sign followed, and then
21:30 - 21:34

with non-blank characters as far as the
eye can see in either direction.
21:34 - 21:37

But I'm going to add to it caret From
space.
21:37 - 21:44

So, so this has to start at the beginning of
the line, that's what the caret means,
21:44 - 21:46

it's gotta have the word From,
21:46 - 21:51

it's gotta have one space and then,
immediately, it's gotta find this, right?
21:51 - 21:54

It's gotta find a series of non-blanks,
followed by an at sign,
21:54 - 21:58

followed by another series of one or
more non-blanks. And then
21:58 - 22:00

what we do, so this, if we didn't put
these parentheses
22:00 - 22:04

in, it would match and we would get all of
this data.
22:04 - 22:05

It would go to here.
22:06 - 22:09

But what we can do with the parentheses,
the parentheses are part
22:09 - 22:12

of the regular expression language,
saying,
22:12 - 22:15

okay, I want to match the whole thing.
22:15 - 22:17

The parentheses don't match any characters in
the string up here.
22:17 - 22:19

I want to match the whole thing, but
22:19 - 22:21

I only want to extract this part in
parentheses.
22:22 - 22:25

So this whole thing is a regular
expression that's matched
22:25 - 22:29

and then the parentheses part is what's
retrieved for you.
22:29 - 22:32

And so this makes it so that the only time
it's going to
22:32 - 22:35

look for at signs is, are on lines that
start with From space.
22:35 - 22:39

It is going to want the immediate next
character to be a non-blank.
22:41 - 22:43

Some number of non-blank characters
followed by an at sign,
22:43 - 22:46

some number of non-blank characters, it's
going to stop right there.
22:46 - 22:48

And it's only going to extract from here
to here,
22:48 - 22:51

and so we get out Stephen Marquard.
22:51 - 22:56

But this is a pretty narrowly scoped thing
because
22:56 - 22:58

the first five characters have to be From
space.
22:58 - 23:01

And so that's a way to combine a stricter
match,
23:01 - 23:04

even though you don't actually want
all the data.
23:04 - 23:06

So you can add those things all over the
place.
23:06 - 23:09

Okay? Okay.
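
A sketch of matching more than is extracted, using parentheses. The sample lines are assumptions; the second one is there only to show that lines not starting with From and a space are skipped:

```python
import re

# Hypothetical sample lines for illustration.
from_line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
other_line = "Reply-To: someone@example.com"

# The whole pattern must match, but only the part in parentheses is returned.
pattern = r"^From (\S+@\S+)"

extracted = re.findall(pattern, from_line)
skipped = re.findall(pattern, other_line)

print(extracted)  # ['stephen.marquard@uct.ac.za']
print(skipped)    # []
```
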
23:09 - 23:15

Then, we, we, we can compare the different
ways of extracting data.
23:15 - 23:20

So if we look at how we extract the host
name.
23:20 - 23:23

Remember how we did this many chapters ago.
23:23 - 23:26

So we did a data.find, which says oh,
23:26 - 23:30

the first at sign is at 21.
So the first at sign is at 21.
23:30 - 23:34

Then we say we want to find the space
after that.
23:34 - 23:39

So starting from the at position, that's 31.
And then we want to extract the data
23:39 - 23:44

that's one beyond the at up to but not
including the space.
23:46 - 23:48

And that is the variable that we are
going to print out, host.
23:48 - 23:52

And so we've extracted this bit of
information and out comes the host.
23:52 - 23:53

Quite nice. Okay?
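
A sketch of the find-and-slice approach just described. The sample line is an assumption chosen so that the at sign lands at position 21 and the following space at 31, and the variable names are illustrative:

```python
# Hypothetical mailbox-style line; positions match the ones quoted above.
data = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

atpos = data.find("@")         # position of the first at sign
sppos = data.find(" ", atpos)  # first space at or after the at sign

# One beyond the at sign, up to but not including the space.
host = data[atpos + 1:sppos]

print(atpos, sppos)
print(host)
```
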
23:54 - 23:57

We also saw another technique, and by the
way, all these techniques are okay.
23:59 - 24:00

All these techniques are fine.
24:00 - 24:02

Another technique we saw, once we sort of
played
24:02 - 24:04

with split and lists, was what we, what I
24:04 - 24:08

called a double split version of this,
where the
24:08 - 24:10

first thing we do is we split that line.
24:12 - 24:16

The first thing we do is split the line
on blanks, and then we know
24:19 - 24:24

that the second thing, which is the
sub one, words sub one,
24:24 - 24:29

is the entire email address. Then this is
the double split.
24:29 - 24:32

We take the email address and we split it by
24:32 - 24:35

an at sign and then we get a list of the
24:35 - 24:38

pieces of the email address, the email
name and the
24:38 - 24:44

email host, and then we grab the, the
sub one of that,
24:44 - 24:45

and then we have the host.
24:45 - 24:50

So that's a double, the double split way
of doing this, right?
24:50 - 24:53

Now in this, we still haven't done
the From yet,
24:53 - 24:57

but it is the double split way to do this.
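
A sketch of the double split approach, assuming the same made-up sample line and illustrative variable names:

```python
# Hypothetical mailbox-style line for illustration.
data = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

words = data.split()       # first split: on blanks
email = words[1]           # the second word is the whole address
pieces = email.split("@")  # second split: on the at sign
host = pieces[1]           # name is pieces[0], host is pieces[1]

print(email)
print(host)
```
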
24:57 - 25:04

So, if we think about how we would do
this in a regular expression, okay?
25:04 - 25:12

We're going to say, look through the
string, findall, we're going to,
25:12 - 25:15

use the findall, and the regular
expression exploded up says
25:16 - 25:21

look through the string for an at.
Do, do, do, do, do, do, got an at.
25:21 - 25:26

Then, oh, start extracting. End extracting.
25:26 - 25:29

And then this is another form of the
25:29 - 25:31

this is one character, it's a
25:31 - 25:35

single character, match any non-blank
character, and
25:35 - 25:37

zero or more of them. Okay?
25:37 - 25:42

So find an at sign, start extracting,
25:42 - 25:48

end extracting, match, this is one character.
25:48 - 25:54

That is a set of possible matches, and
that's some character, this means not.
25:57 - 25:59

Okay? Not a blank, that's a blank
25:59 - 26:01

right there, that's a blank character
right there.
26:01 - 26:04

Not a blank, as many times as you want.
26:04 - 26:05

You might want to, we might want to turn
26:05 - 26:08

that into a plus to guarantee at least one.
26:08 - 26:10

So that might be better done as a plus
right there.
26:14 - 26:16

So this would probably make more sense as a
plus, to say, I
26:16 - 26:21

want at least, after the at sign, I want
at least one non-blank character.
26:26 - 26:31

And the parentheses simply say, I don't
want the at sign.
26:31 - 26:36

So not the at sign, I really want those
non-blank characters after the at sign.
26:36 - 26:39

Okay? So that's what I want to extract.
26:39 - 26:42

So it's like, go find the at sign.
26:42 - 26:44

Okay, great, found the at sign. Start
26:44 - 26:48

extracting, look for non-blank characters,
end extracting.
26:48 - 26:50

So pull that part out and put it right
there.
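
A sketch of the regular expression version of the host extraction, with both the asterisk form and the plus form suggested above. The sample line is an assumption:

```python
import re

# Hypothetical mailbox-style line for illustration.
data = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# @        a literal at sign (matched but not extracted)
# ( ... )  start and end of the part to extract
# [^ ]     one character that is not a blank
# *        zero or more of them
host_star = re.findall("@([^ ]*)", data)

# Same idea with + to insist on at least one non-blank character.
host_plus = re.findall("@([^ ]+)", data)

print(host_star)  # ['uct.ac.za']
print(host_plus)  # ['uct.ac.za']
```
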
26:53 - 26:56

Now an even cooler version of this that
26:56 - 26:59

you probably kind of imagined right away is,
27:01 - 27:07

we say, you know what, I would like this
first character, the first
27:07 - 27:13

part of the line to be From, with a blank,
followed by any number of characters,
27:17 - 27:21

followed by an at sign, so the at sign is
real, then start
27:21 - 27:26

extracting, then any number of non-blank
characters, end extracting.
27:27 - 27:32

So this is a, this is like eight or nine
lines of Python
27:32 - 27:36

all rolled into one thing, okay?
27:39 - 27:44

So, start at the beginning of the line.
Look for string From, with a space.
27:44 - 27:50

Then skip a bunch of characters looking
for an at sign, skip characters until
27:50 - 27:53

you encounter an at sign, then start
27:53 - 27:58

extracting, match any non-blank, a single
non-blank character.
27:58 - 28:01

This is kind of like one non-blank
28:01 - 28:04

character, one non-blank character, but
once you
28:04 - 28:08

suffix it with the asterisk that changes it to
be many non-blank characters.
28:11 - 28:13

And then stop extracting, okay?
28:14 - 28:19

And so, you know, and so it's like find
From followed by a space, great.
28:21 - 28:22

That's the first part.
28:22 - 28:25

Now throw away characters until you find
an at sign.
28:26 - 28:28

Then start extracting.
28:28 - 28:31

Keep going with non-blank characters until
you hit
28:31 - 28:34

the first blank character and pull that
part out.
28:34 - 28:36

Now the result is we get the exact same
28:36 - 28:42

data. But with this added to it, we are
much more narrow in the kind of things
28:42 - 28:47

that we're looking for and if we get
noisy data that like, something like,
28:47 - 28:53

you know, meet at Joe's, right?
We don't want that.
28:53 - 28:54

That won't match, right?
28:54 - 28:56

We want that to be like a False.
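
A rough sketch of the difference, assuming the pattern is written the way it was just described (start of line, From and a space, any characters, an at sign, then the captured non-blanks). The two sample lines are made up for illustration.

```python
import re

# Assumed sample lines; the first imitates a mailbox "From " line.
good = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
noisy = 'Please meet @Joe for lunch'

loose = r'@([^ ]*)'           # any at sign, then capture non-blanks
strict = r'^From .*@([^ ]*)'  # the line must also start with "From "

strict_good = re.findall(strict, good)
strict_noisy = re.findall(strict, noisy)
loose_noisy = re.findall(loose, noisy)

print(strict_good)   # ['uct.ac.za']
print(strict_noisy)  # []  (an empty list counts as False in an if test)
print(loose_noisy)   # ['Joe']  (the loose pattern is fooled by the noise)
```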
28:56 - 28:59

And, and it allows us to sort of really
fine-tune our matching
28:59 - 29:03

and extracting. And this is just the
beginning, they are very, very powerful.
29:03 - 29:09

So, the last thing that I will show you is
sort of a program that is kind of like one
29:09 - 29:12

of the programs that we did in a previous
section,
29:12 - 29:15

except now we're going to use regular
expressions to do it.
29:15 - 29:16

So if you remember, we had this thing where
29:16 - 29:20

we're doing spam confidence, where we're
looking for lines and
29:21 - 29:23

you know, and pulling this number out and then
29:23 - 29:26

calculating the average, or the
maximum, or whatever.
29:26 - 29:32

And so here is a, we import the regular
expression library, we open the file,
29:32 - 29:35

we're going to do this by appending
to a list.
29:35 - 29:38

We'll put the numbers in a list rather
than doing the calculation in a loop.
29:39 - 29:40

We strip the data.
29:40 - 29:42

Now, here's the key thing, right?
29:42 - 29:45

We're going to have a regular expression
that says,
29:46 - 29:49

look for the first character being X,
followed by
29:49 - 29:51

a dash, followed by all this, and all this
29:51 - 29:55

has to match exactly, literally, followed
by a colon.
29:55 - 30:01

And then there's a space, and then we
begin extracting and we are looking for
30:01 - 30:06

a digit 0 through 9 or a dot, and
we are looking for one or
30:06 - 30:10

more, and then we end extracting.
30:10 - 30:13

So that's the, the parentheses are telling
us what to pull out.
30:13 - 30:15

So that just means that we're going to
pull out those numbers, all
30:15 - 30:18

the digits and the periods, until we get
30:18 - 30:21

to something other than
30:21 - 30:24

a digit or a period, and then
we'll be done, okay?
30:24 - 30:30

And so if we, and so this is going to pull
those numbers out and give us back a list.
30:30 - 30:31

Now the thing about it is, we have
30:31 - 30:35

to realize that sometimes this is not
going to match, because
30:35 - 30:38

we're sending every line, not just the
ones that start
30:38 - 30:41

with X, we're sending every line through
this and so
30:41 - 30:44

we need to know when we didn't get a
match.
30:44 - 30:48

And the way we know we didn't get a
match is if the
30:48 - 30:52

number of items in the list that we got
back is zero; then we're going to continue.
30:52 - 30:57

So this is kind of our if where we're
searching for the needle in the haystack.
30:57 - 31:00

But then once we find what we are looking
31:00 - 31:02

for, the actual number that we are
interested in,
31:05 - 31:08

is already sitting here in stuff sub zero.
Okay?
31:08 - 31:11

And then we convert it to a float, we append it.
31:11 - 31:14

And when the loop is all done, we print
out the maximum.
31:14 - 31:15

Okay?
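
Here is a sketch of the program as described. The header name X-DSPAM-Confidence is an assumption based on the "spam confidence" lines mentioned above, and a small hard-coded list of made-up lines stands in for the opened file so the example runs on its own.

```python
import re

# Hypothetical stand-in for the lines of the mailbox file.
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n',
    'X-DSPAM-Confidence: 0.8475\n',
    'X-DSPAM-Probability: 0.0000\n',
    'X-DSPAM-Confidence: 0.6178\n',
]

numlist = []
for line in lines:
    line = line.rstrip()
    # Literal header, colon, space, then capture digits and dots.
    stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) == 0:
        continue  # no match on this line, so skip it
    numlist.append(float(stuff[0]))

print('Maximum:', max(numlist))  # Maximum: 0.8475
```

With a real file you would loop over the open file handle instead of the list.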
31:15 - 31:17

And so this is sort of encoding a number
of things
31:17 - 31:22

and ending up with a very solid and
safe matching.
31:22 - 31:26

So we're really, it's hard for this to
find a line that's wrong and
31:26 - 31:30

you could even improve this a little bit
to make it even a little tighter
31:30 - 31:35

where we'd go find a number like 0.999.
You could say, oh,
31:35 - 31:41

all the numbers start with zero dot, so
31:41 - 31:47

you could make this a little more
precise.
31:47 - 31:49

So it would even skip
things that don't look like that;
31:49 - 31:53

you can make it match exactly the
way you want it to look.
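
One possible way to tighten it, as a sketch only: require a literal zero and dot before the digits, and anchor the end of the line. The header name and the malformed sample value are assumptions made up for the illustration.

```python
import re

loose = r'^X-DSPAM-Confidence: ([0-9.]+)'
tight = r'^X-DSPAM-Confidence: (0\.[0-9]+)$'  # zero, a real dot, digits, end

ok = 'X-DSPAM-Confidence: 0.999'
odd = 'X-DSPAM-Confidence: 1.2.3'   # made-up malformed value

tight_ok = re.findall(tight, ok)
tight_odd = re.findall(tight, odd)
loose_odd = re.findall(loose, odd)

print(tight_ok)   # ['0.999']
print(tight_odd)  # []
print(loose_odd)  # ['1.2.3']  (the looser pattern accepts it)
```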
31:53 - 31:55

So, I emphasize that this
31:55 - 31:57

is kind of a weird language and you need
some kind of thing.
31:57 - 31:59

We talked about all these.
31:59 - 32:02

We have the beginning of the line, we have
the end
32:02 - 32:04

of the line, matching any character,
32:04 - 32:08

matching whitespace characters, matching
non-whitespace characters.
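
A quick sketch of those five pieces in Python's re syntax (caret, dollar sign, dot, backslash s, backslash capital S). The sample line is assumed for illustration.

```python
import re

line = 'X-Sieve: CMU Sieve 2.3'   # assumed sample line

starts_with_x = bool(re.search(r'^X', line))  # ^ is the beginning of the line
ends_with_3 = bool(re.search(r'3$', line))    # $ is the end of the line
x_plus_any = re.findall(r'^X.', line)         # . matches any one character
spaces = re.findall(r'\s', line)              # \s is one whitespace character
words = re.findall(r'\S+', line)              # \S is one non-whitespace character

print(starts_with_x, ends_with_3)  # True True
print(x_plus_any)                  # ['X-']
print(spaces)                      # [' ', ' ', ' ']
print(words)                       # ['X-Sieve:', 'CMU', 'Sieve', '2.3']
```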
32:08 - 32:13

Star is a modifier that says zero or more
times.
32:13 - 32:18

Star question mark is a modifier that says
zero or more times non-greedy.
32:18 - 32:21

Plus is one or more times.
32:21 - 32:25

Plus question mark is one or more times
non-greedy.
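
To see greedy against non-greedy side by side, here is a small sketch. Both sample strings are made up for the illustration.

```python
import re

text = 'From: Using the : character'
greedy_plus = re.findall(r'^F.+:', text)   # pushes out to the last colon
lazy_plus = re.findall(r'^F.+?:', text)    # stops at the first colon

tags = '<b>bold</b>'
greedy_star = re.findall(r'<.*>', tags)    # one long match
lazy_star = re.findall(r'<.*?>', tags)     # the shortest matches

print(greedy_plus)  # ['From: Using the :']
print(lazy_plus)    # ['From:']
print(greedy_star)  # ['<b>bold</b>']
print(lazy_star)    # ['<b>', '</b>']
```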
32:25 - 32:27

When you have bracket syntax, it's a set,
32:27 - 32:31

it's a single character that's in the
listed set.
32:31 - 32:33

So that's lower-case vowels.
32:34 - 32:35

Also, if the first
32:35 - 32:39

character of this is a caret, that flips it.
32:39 - 32:43

So that means everything except capital
X, capital Y, capital Z.
32:43 - 32:45

So it's everything that's not in the set,
32:45 - 32:48

capital X, capital Y, capital Z, and then
32:48 - 32:51

you can also put dashes in to represent
ranges.
32:51 - 32:53

Bracket a through z and 0 through 9
means lower-case
32:53 - 32:58

letters and digits will match, but again,
this is a single character.
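
The three kinds of sets just listed, as a small sketch with made-up input strings. Each bracket expression matches exactly one character, so findall returns single characters.

```python
import re

vowels = re.findall(r'[aeiou]', 'Regular')      # one lower-case vowel
not_xyz = re.findall(r'[^XYZ]', 'XaYbZ')        # anything except X, Y, Z
lower_or_digit = re.findall(r'[a-z0-9]', 'Ab3-Z9')  # ranges with dashes

print(vowels)          # ['e', 'u', 'a']
print(not_xyz)         # ['a', 'b']
print(lower_or_digit)  # ['b', '3', '9']
```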
32:58 - 33:01

Now, you can put a plus or a star after
33:01 - 33:04

these guys to make them happen more than
one time.
33:04 - 33:06

And you can even put them in twice.
33:06 - 33:12

So if I wanted a two-digit number, I could
say 0 dash 9, 0 dash 9.
33:13 - 33:15

Oops. This is one character.
33:15 - 33:18

This is one character and this is the
possible things.
33:18 - 33:22

So that's, you know, 0 0
would match.
33:22 - 33:26

1 0 would match, 99 would match, etc.
33:26 - 33:27

Okay?
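
The two-digit idea as a sketch, with a made-up input string. Writing the set twice demands exactly two digits in a row, while a plus after a single set accepts any run of digits.

```python
import re

text = 'codes 00, 10, 99 and 7'   # assumed sample text

two_digits = re.findall(r'[0-9][0-9]', text)  # exactly two digits in a row
any_digits = re.findall(r'[0-9]+', text)      # one or more digits

print(two_digits)  # ['00', '10', '99']
print(any_digits)  # ['00', '10', '99', '7']
```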
33:29 - 33:32

And then the parentheses are the things
that if you
33:32 - 33:34

are in the middle of a big long matching
string and
33:34 - 33:37

you don't want to extract the whole thing,
you can limit the
33:37 - 33:40

things you're extracting to, to the stuff
that's just in there.
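
A small sketch of what the parentheses change, using an assumed mailbox-style line. The whole pattern still has to match in both cases; the parentheses only limit what comes back.

```python
import re

line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

whole = re.findall(r'\S+@\S+', line)    # no parentheses: the full match
part = re.findall(r'\S+@(\S+)', line)   # parentheses: just that piece

print(whole)  # ['stephen.marquard@uct.ac.za']
print(part)   # ['uct.ac.za']
```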
33:41 - 33:44

With all these characters that have all
this meaning,
33:44 - 33:46

we have to have a way to match those
characters.
33:46 - 33:50

So dollar sign is the end of a line.
33:50 - 33:52

But what if we're looking for something that
33:52 - 33:53

actually has a dollar sign in the string?
33:55 - 33:57

And that's what the backslash is for.
33:57 - 33:58

So if you put the backslash in front of
33:58 - 34:04

an otherwise meaningful character,
it becomes the actual character.
34:04 - 34:07

So this is saying match a dollar sign.
34:07 - 34:09

Those two characters say match a dollar
sign.
34:09 - 34:14

And then this says one character that's
0 through 9 or a dot.
34:14 - 34:17

And then we put the plus modifier to say
34:17 - 34:20

one or more times, and that
34:20 - 34:21

is greedy, of course.
34:21 - 34:25

So that will get us this and extract it,
including the dollar sign.
34:25 - 34:28

So the escape character is the backslash.
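
The escaped dollar sign as a sketch. The sample sentence is assumed for illustration; the pattern follows the description above (backslash dollar sign, then one or more characters from the set of digits and dot).

```python
import re

x = 'We just received $10.00 for cookies.'   # assumed sample text

# \$ is a literal dollar sign; [0-9.]+ is one or more digits or dots.
y = re.findall(r'\$[0-9.]+', x)
print(y)  # ['$10.00']
```

Inside the brackets the dot is already literal, so it needs no backslash there.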
34:29 - 34:31

Okay. So there we are.
34:31 - 34:32

Now we're done.
34:32 - 34:35

So this is a little bit cryptic.
34:35 - 34:38

It's, it's kind of a puzzle.
34:38 - 34:39

It's kind of fun.
34:39 - 34:43

And it's extremely powerful.
And you don't have to know it.
34:43 - 34:44

You don't have to learn it.
34:45 - 34:49

But if you do, you'll find that it's very
useful as we sort
34:49 - 34:53

of dig through data and are trying to
write things that are pretty quick.
34:53 - 34:59

And the thing I like about
regular expressions is that they
34:59 - 35:03

tend to be, if you write them well, they
tend to be less sensitive to bad data.
35:05 - 35:07

They tend to ignore data; you
35:07 - 35:10

can put in more detail: I want exactly this.
35:10 - 35:10

Whereas,
35:10 - 35:12

if you're writing find and extract, you're
35:12 - 35:14

making a lot of assumptions about the
data.
35:14 - 35:17

That it's clean and you're not going to,
you know, mis-hit on something.
35:17 - 35:22

So, okay, well, good luck getting
35:22 - 35:24

used to regular expressions, and we'll
see you later.

Title:: Python for Informatics - Chapter 11 - Regular Expressions
Description:: Regular Expressions http://www.pythonlearn.com/
All Lectures:
http://www.youtube.com/playlist?list=PLlRFEj9H3Oj4JXIwMwN1_ss1Tk8wZShEJ

Video Language:: English
Team:: Captions Requested
Duration:: 35:24
