Python for Informatics - Chapter 11 - Regular Expressions
-
0:00 - 0:03Hello and welcome to Chapter 11, Regular
Expressions -
0:03 - 0:07from the book Python for Informatics:
Exploring Information. -
0:08 - 0:12As always, these slides are copyright
Creative Commons Attribution, as well as -
0:12 - 0:15the audio and the video that you're
watching or listening to right now. -
0:16 - 0:23And, so regular expressions are an
interesting thing. -
0:23 - 0:26You've seen from, in the chapters up till
now, I've, -
0:26 - 0:31I've had a singular focus on sort of
pulling information out of data. -
0:31 - 0:34Raw data, this mailbox file that perhaps
you're getting tired of already. -
0:34 - 0:36But it's a lot of fun, because I can have
-
0:36 - 0:38you go look for something and, and
pick it out. -
0:38 - 0:42And you're doing something that like would
be really painful to do sort of by hand. -
0:45 - 0:47And while it's not all of computing, I
mean, there's games -
0:47 - 0:51and there's, you know, things like
weather computations that do calculations, -
0:52 - 0:57pulling and extracting data out is a big
part of computing. -
0:57 - 1:01And so there's actually a library that's
built specifically to do this. -
1:01 - 1:06And if you start doing a few finds
and slicing, it gets kind of -
1:06 - 1:08long after a while, and that's why split,
for example, -
1:08 - 1:11really saved us a lot of time.
-
1:11 - 1:14But sometimes the data that you are
looking for is a little -
1:14 - 1:18more sophisticated than broken into spaces
or colons or something like that. -
1:18 - 1:21And you just want to like tell something
to go find -
1:21 - 1:26I see what I want, and I see where it's
embedded in the string, go get it for me. -
1:26 - 1:29And regular expressions are themselves a
programming language. -
1:29 - 1:34They're like a really smart wild card for
searching. -
1:34 - 1:35So we've used wild cards in various
-
1:35 - 1:40things in search, but they're, they're a
really smart version of a wild card. -
1:42 - 1:47And so, regular expressions are quite
powerful and they're very cryptic. -
1:47 - 1:49And as a matter of fact, you don't even
need -
1:49 - 1:51to learn them if you don't feel like it,
right? -
1:52 - 1:53I've got this little guide.
-
1:53 - 1:56I need a guide for myself when I do
regular expressions. -
1:56 - 1:58It sometimes takes me a few minutes to
write -
1:58 - 2:00a regular expression to do exactly what I
want. -
2:00 - 2:05So in a way, writing a regular expression
is like writing a program. -
2:05 - 2:09It's highly specialized to searching and
extracting data from strings. -
2:09 - 2:12But it's like writing a program and it
takes a while to get -
2:12 - 2:15it right and you kind of like, oh, change
this, what about a slash there? -
2:15 - 2:18And, so, you, but they actually are kind
of fun. -
2:18 - 2:22And, and they are a great way to sort of
exchange little program snippets -
2:22 - 2:25to say, oh yeah, I'm looking for this, oh
here's a little reg expression you might -
2:25 - 2:28try and then, so they're, they're like
programs themselves. -
2:30 - 2:33It is this language of marker characters,
so when we -
2:33 - 2:37look for regular expressions, some
characters like A, B, C, have meaning -
2:37 - 2:41as A, B, C but some characters like caret or
dollar sign mean -
2:41 - 2:43at the beginning of the line, or at the
end of the line. -
2:43 - 2:47And so we encode in this string a, a
program, basically. -
2:47 - 2:51And so it's a rather old-school language.
It's from a -
2:51 - 2:52long time ago.
-
2:52 - 2:55It predates Python, which is over 20 years
old, and so -
2:55 - 3:01it's, it also marks you as sort of a
little cool, right? -
3:01 - 3:04It's a, it's a distinct marking that makes
-
3:04 - 3:06it so that you know something other people
don't. -
3:06 - 3:10Right? So you can know how to program, but
if you know regular expressions -
3:10 - 3:13it'll be like woah, I tried to look at those
and they're kind of tough. -
3:13 - 3:16In a way, knowing regular expressions is
-
3:16 - 3:18kind of like a tattoo.
-
3:18 - 3:21So I, it's casual Friday and that's why
I'm wearing a T-shirt -
3:21 - 3:24today and so I figured I would come in
today in a T-shirt, -
3:24 - 3:26but seeing as it's the first time I'm wearing
a short-sleeved shirt, it's -
3:26 - 3:29also the first time I can show you my,
show my real tattoo here. -
3:29 - 3:33So, here's my real tattoo and in the
middle is Sakai, -
3:33 - 3:36the open source learning management system
always close to my heart. -
3:36 - 3:38And then you have the IMS logo, which
-
3:38 - 3:41is IMS Learning Tools Interoperability,
which is a standard, -
3:41 - 3:46it means a lot to me.
Blackboard, OLAT, Learning Objects, Angel, -
3:46 - 3:52Moodle, Instructure, Jenzabar, and
Desire2Learn. -
3:52 - 3:54I call this the ring of compliance,
because these are all -
3:54 - 4:00of the first six or seven learning
management systems that complied -
4:00 - 4:01with the IMS Learning Tools
-
4:01 - 4:03Interoperability standards
specification, which is -
4:03 - 4:06something that I spent a lot of my life
making work. -
4:06 - 4:07So
-
4:07 - 4:10I figured I'd make a tattoo and just
kind of -
4:10 - 4:13part of my rough, tough image and,
and actually -
4:13 - 4:16regular expressions are indeed part of my
rough, tough image, -
4:16 - 4:19because I'm like, I'm down with
regular expressions. -
4:19 - 4:23And people are like impressed with my
regular expression knowledge. -
4:23 - 4:27But as impressive as I am, I still need a
cheat sheet, so I'll have a cheat -
4:27 - 4:29sheet that you can download hopefully on
the pythonlearn -
4:29 - 4:32website or whatever, and I just, it
-
4:32 - 4:33doesn't have to be much.
-
4:33 - 4:36It's really just a kind of a, a crutch,
and these are the characters that have -
4:36 - 4:38special meaning, like caret or
dollar sign -
4:38 - 4:41match the beginning or end of line,
respectively. -
4:41 - 4:44So they're not really matching a dollar
sign, they match, they, -
4:44 - 4:47they mean something in our little mini
string-like programming language. -
4:49 - 4:53So, like many things that we do in Python
going forward, once you want some -
4:53 - 4:56sophisticated capability, it comes with
Python, but -
4:56 - 4:58it comes in the form of a library.
-
4:58 - 5:01And so the regular expression library we
have to say import r-e -
5:01 - 5:04at the beginning of our programs to import
the regular expression library. -
5:04 - 5:06Then we call re.search to say I'm
-
5:06 - 5:09looking for search from the regular
expression library. -
5:09 - 5:12There's two basic functions or method,
two, two basic -
5:12 - 5:14capabilities inside this library that
we're going to look at. -
5:14 - 5:19One is search, that replaces find, it's
like a smart find, and then -
5:19 - 5:24findall is a combination of a smart find
and a automatic extraction. -
5:24 - 5:26So we'll look at both of those in turn.
-
5:26 - 5:29And I'll do it by comparing them to
existing -
5:29 - 5:31Python that you kind of already should
know at this point. -
5:34 - 5:37So here's some code that's, say, looking
for lines that -
5:37 - 5:40have the word fr-, have the string From
colon in them. -
5:40 - 5:44Right, so, we're going to open a file,
we're going to strip the white space. -
5:44 - 5:48We use find to hunt within line for
From. -
5:48 - 5:51If it's greater than or equal to zero then
we'll print it. And so this -
5:51 - 5:55is just going to give us a number. If it's,
if it's not found, it's negative one. -
5:55 - 5:58So it's only going to print the lines that
that have From in them. -
5:58 - 6:00Here is the equivalent using
-
6:00 - 6:03regular expressions.
So these two things are equivalent. -
6:03 - 6:05So we have to import the library, like I
-
6:05 - 6:07mentioned before, and all the rest of it's
the same. -
6:07 - 6:11The if test is re.search. That says within
-
6:11 - 6:15the library re, call the search utility
and then -
6:15 - 6:18pass in the search string, the string we're looking
for -
6:18 - 6:20and the line, the actual text we're
looking in. -
6:20 - 6:25So this is like look for From inside of
line and return me a -
6:25 - 6:29match or None, which the if treats as True or
False, depending on whether you find it or not. -
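As a rough sketch of the two versions just compared, here they are side by side. The sample lines are made up for illustration and stand in for the mailbox file, which is not reproduced here.

```python
import re

# Hypothetical sample lines standing in for the mailbox file.
lines = [
    "From: stephen.marquard@uct.ac.za",
    "Subject: meeting notes",
    "Reply-To: see the From: header",
]

# Version with find(): it returns a position, or -1 when not found.
found_with_find = [line for line in lines if line.find("From:") >= 0]

# Version with re.search(): it returns a match object or None,
# and the if test treats those as true and false.
found_with_search = [line for line in lines if re.search("From:", line)]

for line in found_with_search:
    print(line)
```

Both lists should hold the same lines, since the pattern here has no special characters in it yet.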
6:29 - 6:33Now you might say, I, you just got done
telling me that it, it was more dense. -
6:33 - 6:35And the answer is, there's a few more
characters here. -
6:35 - 6:36But we'll see in a second how you
-
6:36 - 6:39can quickly add more power to the regular
expression. -
6:39 - 6:41With find, you have to start adding more
-
6:41 - 6:43Python lines to make it more sophisticated,
whereas in -
6:43 - 6:46the regular expression you start changing,
-
6:46 - 6:50you change the search string to give more of
-
6:50 - 6:52the direction of what you're looking for,
and that's what -
6:52 - 6:55we'll be doing, pretty much, is changing
the search string. -
6:55 - 6:58So now if we wanted to switch to say,
wait, wait, wait, we don't -
6:58 - 7:03just want the From anywhere in the line,
we want it to start with From. -
7:03 - 7:06So we would change it to
line.startswith('From'), -
7:06 - 7:07and that's either going to be true or false
-
7:07 - 7:10depending on whether or not the
line starts with From. -
7:10 - 7:12Now, we do the same thing with
-
7:12 - 7:15regular expressions by changing the
search string. -
7:16 - 7:17So now we are in regular expressions.
-
7:17 - 7:20So this isn't really just a string, it's a
string plus -
7:20 - 7:22characters that are interpreted as
-
7:22 - 7:24commands by the regular expression
library. -
7:24 - 7:28So the caret, which is the first one on
our, -
7:28 - 7:32our little regular expression sheet, matches
the beginning of the line. -
7:32 - 7:33It's not actually a caret.
-
7:33 - 7:37So that says, the first character, this
two-character sequence, caret F, -
7:37 - 7:41means F but in column one, in the first
character of the line. -
7:41 - 7:43And so, again, this is going to give us a
-
7:43 - 7:46True or a False, if this regular
expression matches. -
7:46 - 7:50The, the beginning of the line, From: and
it's the same as -
7:50 - 7:54this, it's, does it start with From.
So again, these two are equivalent. -
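A small sketch of that equivalence, using made-up sample lines. The only change on the regular expression side is the caret added to the search string.

```python
import re

# Hypothetical sample lines; only the first one starts with "From:".
lines = [
    "From: stephen.marquard@uct.ac.za",
    "Reply-To: see the From: header",
]

# Plain Python: does the line start with the text?
starts_plain = [line for line in lines if line.startswith("From:")]

# Regular expression: ^ anchors the match to the beginning of the line.
starts_regex = [line for line in lines if re.search("^From:", line)]

print(starts_plain == starts_regex)
```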
7:54 - 8:00But you see the pattern where we're
going to do something to this string using -
8:00 - 8:06these characters that have meaning, okay?
So, the next thing that's -
8:06 - 8:12most commonly done other than caret and
dollar sign for the end of line, is -
8:12 - 8:16the wildcard characters and so, we've used
wildcards -
8:16 - 8:20possibly in like DOS, where we can use ?
-
8:20 - 8:25or * in like a dir command. dir *.* if
you're familiar with that, -
8:25 - 8:30or even a Unix command like ls, you
know, star dot whatever. -
8:30 - 8:32This is not how regular expressions
-
8:32 - 8:34work. And the difference is that dot
-
8:34 - 8:38matches any single character in
regular expressions. -
8:38 - 8:41Asterisk means any number of times.
-
8:41 - 8:47So if I look at this, if I look at
this and color-code this to make a -
8:47 - 8:52little more sense, the caret is actually
kind of part of the -
8:52 - 8:57regular expect, regular expression
programming language. Says I'm, I'm -
8:57 - 8:59I'm a virtual character matching the
beginning of line. -
8:59 - 9:01The X is a real character.
-
9:01 - 9:05The dot is part of the regular expression
programming language, any character. -
9:05 - 9:08Star is part of the regular expression
programming, it says -
9:08 - 9:12the immediate previous character many
times, zero or more times. -
9:12 - 9:15And then colon matches the colon.
-
9:15 - 9:20And so if you look at lines, these are the
kinds of lines that will give me a True. -
9:20 - 9:22Because they start with an X,
-
9:22 - 9:26followed by some number of characters,
followed by a colon. -
9:26 - 9:27So that's true.
-
9:27 - 9:31Start with a X, followed by some number of
characters, followed by a colon. -
9:31 - 9:32Okay?
-
9:32 - 9:35And so that's basically how this works.
-
9:35 - 9:39And so this little, this, in this
-
9:39 - 9:42five-character string there are, you know,
some of -
9:42 - 9:44these things are like instructions and
some of -
9:44 - 9:46them are the actual characters we're
looking for. -
9:46 - 9:48So the X and the colon
-
9:48 - 9:49are the characters we're looking
-
9:49 - 9:55for, and the caret, dot, and star are
programming. -
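Here is a minimal sketch of that five-character pattern at work. The sample lines are invented to look like mail headers and are only an assumption about what the data contains.

```python
import re

# Hypothetical header-like lines.
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "Received: from murder",
    "X marks the spot",
]

# ^  beginning of line (instruction)
# X  a literal X (real character)
# .  any single character (instruction)
# *  the previous thing, zero or more times (instruction)
# :  a literal colon (real character)
matching = [line for line in lines if re.search("^X.*:", line)]

for line in matching:
    print(line)
```

The last sample line starts with X but has no colon, so it is left out.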
9:55 - 9:57Right? They are logic that we're adding
to the string. -
10:00 - 10:01Okay.
-
10:01 - 10:05So let's say, for example, you're...
Part of any of these things, -
10:05 - 10:07and part of the stuff we have done so far,
-
10:07 - 10:11has to assume that the data is at some
level clean, and -
10:11 - 10:14so the data that I have been giving you,
mbox.txt, is not inconsistent. -
10:15 - 10:18Right? It doesn't have like too much
weirdness in it. -
10:18 - 10:20I'm not trying to trick you and
mislead you, although -
10:20 - 10:23we've had situations where you sort of get
a traceback because -
10:23 - 10:25you think there's going to be five words
you, you grab a line, -
10:25 - 10:28you break it, and there's only two
words and then you get -
10:28 - 10:31a traceback because you're looking at the
fifth word, or something like that. -
10:33 - 10:35But if your data is less clean, or even if
you just -
10:35 - 10:40want to be real careful, you can
fine-tune your matching. -
10:40 - 10:43So, here's that same match.
-
10:43 - 10:45Give me a character X, followed by any
number of -
10:45 - 10:48characters, followed by a colon, and that's
what I'm looking for. -
10:48 - 10:50Give me lines that match that pattern.
-
10:50 - 10:52So this one starts with X, any number of
characters, -
10:52 - 10:55colon, great. This one, any number of
characters, colon, great. -
10:55 - 10:57Oh wait, and there's an email X that says
-
10:57 - 11:01X Plane is two weeks behind sch, behind
schedule, colon, two weeks. -
11:01 - 11:06Well, the regular expression didn't know
that the dash made sense to you. -
11:06 - 11:07And you just assumed that everything that
started -
11:07 - 11:09with a capital X had a dash after it.
-
11:09 - 11:15So X is what it starts with, any number of
any character, and then -
11:15 - 11:17a colon. So this becomes True.
-
11:17 - 11:22This may not make you happy, right? It may
not be what you're looking for. -
11:22 - 11:26Because you haven't been specific enough
in your regular expression. -
11:26 - 11:31So, we can be more specific in our regular
expression. -
11:31 - 11:35So for example, this is a more specific
regular expression. -
11:35 - 11:40It still says start with an X as the first
character, then a dash, -
11:40 - 11:43that's a real character not a, then this
-
11:43 - 11:47next thing, instead of being a dot, this
backslash capital S. -
11:47 - 11:50It's on the sheet.
-
11:50 - 11:51Whoa. It's not on the sheet.
-
11:51 - 11:54I lost the sheet. Come back, sheet.
-
11:55 - 11:55I lost the sheet.
-
11:56 - 11:59I can't live without my sheet.
-
12:01 - 12:06Backslash capital S means a
non-whitespace character. -
12:06 - 12:09So that means spaces won't match.
-
12:09 - 12:14And then I changed the asterisk, zero or
more times thing, to a plus. -
12:14 - 12:16And that means one or more times.
-
12:16 - 12:20Here is a character, a non-whitespace.
These two things kind of work together. -
12:20 - 12:25A non-whitespace character at least one
time, as many as we like. -
12:25 - 12:26And then, a colon.
-
12:27 - 12:31So, if we look here, it starts with X dash,
-
12:31 - 12:35any number of non-whitespace
characters, and ends in colon. -
12:35 - 12:37Starts with X dash, any number
-
12:37 - 12:40of non-whitespace characters, ends
in a colon. -
12:40 - 12:42True. True.
-
12:42 - 12:46This one starts with an X, but doesn't
start with an X dash. -
12:46 - 12:49Oh, as a matter of fact, these characters
are blanks, so this becomes a False. -
12:49 - 12:53It does have an X and it does have a colon
and it matched the previous one, -
12:53 - 12:56but this one here is more specific.
-
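To see the difference between the loose pattern and the more specific one, here is a short sketch. The three sample lines are assumptions chosen to mirror the ones talked through above.

```python
import re

# Hypothetical lines: two real-looking headers and one stray sentence.
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X Plane is behind schedule: two weeks",
]

# Loose: X, then any characters at all, then a colon.
loose = [line for line in lines if re.search(r"^X.*:", line)]

# Specific: X, a literal dash, one or more non-whitespace
# characters (\S+), then a colon.
specific = [line for line in lines if re.search(r"^X-\S+:", line)]

print(len(loose), len(specific))
```

The raw string prefix (r"...") is a common habit so that the backslash in \S reaches the regular expression library unchanged.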
13:00 - 13:03Okay? So it's more specific and so it
matches what you want. -
13:03 - 13:04Now it depends on what you are looking for.
-
13:04 - 13:05Maybe you do want this line,
-
13:05 - 13:09and so you're looking for X. I don't
know. But if you want, you can be -
13:09 - 13:13increasingly sophisticated in what
-
13:13 - 13:15you're looking for in a regular
expression. -
13:15 - 13:20So now, let's talk about extracting data.
-
13:20 - 13:24So everything we've done so far is,
is it there or is it not. -
13:24 - 13:25But it's really common once
-
13:25 - 13:27you find something that you want to
break it into pieces. -
13:27 - 13:32So we can combine the searching and the
parsing into one statement. -
13:33 - 13:37And instead of using search, which returns
for us a true/false, we are going to use -
13:37 - 13:42findall.
So in this example, I'm going to show -
13:42 - 13:51you a new syntax. The square bracket in
regular expression language means -
13:51 - 13:53a way to list a set of characters.
-
13:53 - 13:58So this says, this is a single character
that says, -
13:58 - 14:00I want to match anything in the range
0 through 9. -
14:02 - 14:04Plus means one or more of those.
-
14:04 - 14:09So that says, so this is, this whole thing
says one or more digits. -
14:09 - 14:12That's a regular expression that says one
or more digits. -
14:12 - 14:13You can put other things inside here.
-
14:15 - 14:16You can put like, you know,
-
14:17 - 14:22you could make a thing that says a b c d.
And that would say, I'm -
14:22 - 14:26going to match a single character that's
a or b or c or d. Or you could say like, -
14:27 - 14:32you know, 1 3 5 7, bracket.
-
14:32 - 14:33That's a single character
-
14:33 - 14:35that's either a 1 or a 3 or a 5 or a 7.
-
14:35 - 14:37So the bracket is a list of matching
-
14:37 - 14:41characters and the dash inside the
bracket means range. -
14:41 - 14:45We'll see in a second that you can stick a
not inside the bracket. It's on this. -
14:45 - 14:47So, so again, remember in this little
-
14:47 - 14:50mini-language, we are programming, right?
-
14:50 - 14:55We are giving instructions to the regular
expression engine, as it were. Okay? -
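A quick sketch of the bracket syntax, using tiny made-up strings so the results are easy to check by eye.

```python
import re

# A bracket lists the characters that one position may match.
odd_digits = re.findall("[1357]", "1234567")

# Any one of a, b, c or d.
letters = re.findall("[abcd]", "back")

# A dash inside the bracket means a range; + means one or more.
digit_runs = re.findall("[0-9]+", "a12b345")

# A caret as the first thing inside the bracket means "not".
not_digits = re.findall("[^0-9]", "a1b2")

print(odd_digits, letters, digit_runs, not_digits)
```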
14:58 - 15:03So, if we do this, and here is an
expression that -
15:03 - 15:09says I would like to find, you know, things
that are one or more digits. -
15:09 - 15:10And so,
-
15:14 - 15:17so it's one or more digits and, and so
it's going to look -
15:17 - 15:19through here and it's going to find it as
many times as it can. -
15:21 - 15:24So there is one or more digits, there is
one or more digits, -
15:24 - 15:27and there is one or more digits.
-
15:27 - 15:30And so what findall gives us back is a
list of strings. -
15:30 - 15:32So it found it.
-
15:32 - 15:33Where do I match?
Where do I match? -
15:33 - 15:38It's looking the whole time and then,
it says, oh, I've got it. -
15:38 - 15:392, 19, 42.
-
15:39 - 15:43So it actually extracts the strings that
match -
15:43 - 15:47and gives you a Python list of strings.
-
15:47 - 15:48Python list of strings.
-
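A minimal sketch of that extraction. The sentence in x is an assumption, picked so that it contains the numbers 2, 19 and 42 mentioned above.

```python
import re

# Assumed sample text containing three runs of digits.
x = "My 2 favorite numbers are 19 and 42"

# findall returns every non-overlapping match as a list of strings.
numbers = re.findall("[0-9]+", x)

print(numbers)  # ['2', '19', '42']
```

Note that the results are strings, so they would need int() before doing arithmetic with them.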
15:48 - 15:53Kind of like split, except it's like a
super smart split, right? -
15:53 - 15:57It's split, but I've directed it what to
look for, and if, -
16:01 - 16:05so here's an example of, you know, that's
the one I just did. -
16:05 - 16:10Find me one or more digits and extract
them, so 2, 19, 42. -
16:10 - 16:14Here I'm saying, using the same bracket
syntax, to look for a single -
16:14 - 16:20character A, capital A E I O or U, and one
or more -
16:20 - 16:25of those. And if you look, there are no
upper-case vowels in my string. -
16:25 - 16:27So it says I'm going to find all the
things that match -
16:27 - 16:36A E I O U. So things like AA would match
and, you know, OU would match. -
16:37 - 16:39And so that's what we, we would get if
they were in the string. -
16:41 - 16:44But because there are none, we get an
empty list. -
16:44 - 16:46So even if there are none, you get an
empty list. -
16:46 - 16:48So it always returns a list.
-
16:48 - 16:52It may be a zero-length list, and that's
what you have -
16:52 - 16:54to check. Okay?
-
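Here is a small sketch of the no-match case. The first string is an assumed sample with no upper-case vowels; the second is made up to show what a match would look like.

```python
import re

# Assumed sample text with no upper-case vowels in it.
x = "My 2 favorite numbers are 19 and 42"
no_vowels = re.findall("[AEIOU]+", x)

# A made-up string where runs of upper-case vowels do appear.
some_vowels = re.findall("[AEIOU]+", "AA and OU")

print(no_vowels)    # []
print(some_vowels)  # ['AA', 'OU']

# The thing to check is whether the result has zero length.
if len(no_vowels) == 0:
    print("nothing matched")
```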
17:00 - 17:02Okay, now
-
17:03 - 17:06matching has this notion of greedy,
-
17:07 - 17:10where when you put one of these pluses
-
17:10 - 17:16or asterisks it kind of has this outward
pushing feeling, right? -
17:16 - 17:17And so when you say,
-
17:17 - 17:19I'm looking for something that starts with
an -
17:19 - 17:22F at the beginning of the line, followed
-
17:22 - 17:24by one or more characters, followed by a
-
17:24 - 17:27colon, you can think of this as pushing
outward. -
17:27 - 17:32So if we look at a line here that has From
colon using the colon -
17:32 - 17:37character, it will try to expand, so it
certainly has -
17:37 - 17:43to match the F and it's looking for a
colon, any number of characters, -
17:43 - 17:47but it's trying to make the string that
matches as big as possible. -
17:47 - 17:50So it skips over this colon and goes to
that -
17:50 - 17:52colon and so the thing that we get is
here. -
17:52 - 17:56And so, it ignored this and said I will
make as large a string as I can. -
17:57 - 17:59So, that's the plus that's doing it.
-
17:59 - 18:04Dot plus pushes, it's like, I've got a
-
18:04 - 18:07colon, but is there another colon out
there? -
18:07 - 18:09So you push it, okay?
-
18:09 - 18:11So that's greedy matching.
-
18:11 - 18:15It can get you in some trouble, like being
greedy -
18:15 - 18:18in general, and both asterisk and plus sort
of behave -
18:18 - 18:20in a greedy way because they're zero or more
or one -
18:20 - 18:24or more characters, so they can sort of
push outward, okay? -
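A short sketch of greedy matching. The line in x is an assumed example with two colons in it, so there is something for the pattern to push past.

```python
import re

# Assumed sample line with two colons.
x = "From: Using the : character"

# .+ is greedy: it grabs as much as it can while still
# leaving a colon for the end of the pattern.
greedy = re.findall("^F.+:", x)

print(greedy)  # ['From: Using the :']
```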
18:26 - 18:28Now you can turn this off.
-
18:28 - 18:32It's a programming language, we can tweak
it, okay? -
18:32 - 18:36And so we add a question mark.
-
18:36 - 18:41So this is a three-character sequence now.
So if you say dot plus question -
18:41 - 18:46mark, that says one or more of any
characters, push, -
18:46 - 18:52but instead of being greedy and pushing as
far as you can, this means stop -
18:52 - 18:57at the first. Stop at the first.
-
18:57 - 18:59Oops, stop at the first.
-
18:59 - 19:02I can never draw on this thing fast
enough. -
19:02 - 19:03Stop at the first.
-
19:03 - 19:04Okay?
-
19:04 - 19:06And that's it, just don't be greedy, don't
-
19:06 - 19:08try to make the string as large as
possible. -
19:08 - 19:11Go with the smaller one, the smaller
possible one. -
19:11 - 19:13We still need to find an F, and we still
need -
19:13 - 19:17to find a colon, but when you find the
first colon, stop. -
19:17 - 19:19And so what this does is this changes it
so that -
19:19 - 19:23what we match is From colon instead of
going all the way. -
19:23 - 19:27So the greedy match pushes as far as it
can. The non-greedy match -
19:27 - 19:33is satisfied with the first thing that
meets the criterion of the string. -
19:33 - 19:36So this is a little three-character
programming sequence, -
19:36 - 19:39any character one or more times and not
greedy. -
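And the non-greedy version as a sketch, on the same kind of assumed two-colon line, with the greedy result next to it for comparison.

```python
import re

# Assumed sample line with two colons.
x = "From: Using the : character"

# .+? is the non-greedy form: one or more characters,
# but stop at the first place the rest of the pattern fits.
non_greedy = re.findall("^F.+?:", x)

# The greedy form, for comparison.
greedy = re.findall("^F.+:", x)

print(non_greedy)  # ['From:']
print(greedy)      # ['From: Using the :']
```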
19:48 - 19:51If, for example, we were trying to solve the
problem -
19:51 - 19:53of pulling the email address out of a
string. -
19:55 - 19:55Right?
-
19:57 - 20:01We can make good use of this non-blank
character -
20:01 - 20:04and so the at sign is just a character and
-
20:04 - 20:08then we can say, I want at least one
non-blank -
20:08 - 20:12character before it and at least one
non-blank character after it. -
20:12 - 20:16So the way regular expressions does it
says, okay, I find my at sign and -
20:16 - 20:20I push in a greedy manner outwards, as
-
20:20 - 20:22long as there are non-blank characters,
push, push, push, push, -
20:22 - 20:27push, push, push, oops, stop.
Push, push, push, push, push, stop. -
20:27 - 20:27Okay?
-
20:27 - 20:30So it's some number of non-blank
characters, an -
20:30 - 20:33at sign, followed by some number of
non-blank characters. -
20:33 - 20:38So it's, that's using greedy matching. It,
it's doing that, okay? -
20:38 - 20:41And so this is where we get Stephen
Marquard, we can, and, -
20:41 - 20:46and we would know if there wasn't one there by
the empty list, right? -
20:46 - 20:51And so we get stephen.marquard@uct.ac.za.
-
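A minimal sketch of pulling out the address. The full From line is an assumption about what a line in the mailbox file looks like; only the address itself comes from the discussion above.

```python
import re

# Assumed mailbox-style line containing the address.
x = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# \S+ is one or more non-blank characters, on each side of the @.
addresses = re.findall(r"\S+@\S+", x)
print(addresses)  # ['stephen.marquard@uct.ac.za']

# A line with no at sign gives back an empty list.
nothing = re.findall(r"\S+@\S+", "Subject: meeting notes")
print(nothing)  # []
```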
20:53 - 20:59Now, we can also fine-tune what we
extract, right? -
20:59 - 21:05In the previous slide, we extracted
whatever matched. -
21:05 - 21:06Right?
-
21:06 - 21:10Whatever this matched, it looked across
the whole string and found it, -
21:10 - 21:15found the thing, shoved it over, and gave
us what it matched. -
21:15 - 21:19But it's possible to make the match larger
than what's extracted, -
21:19 - 21:23to extract a subset of the match, and we'll
see that on this next slide. -
21:23 - 21:24Okay?
-
21:24 - 21:30So here's this same thing, which is an at
sign followed, and then -
21:30 - 21:34with non-blank characters as far as the
eye can see in either direction. -
21:34 - 21:37But I'm going to add to it caret From
space. -
21:37 - 21:44So this has to start, the
caret means the beginning of the line, and then, -
21:44 - 21:46it's gotta have the word From,
-
21:46 - 21:51it's gotta have one space and then,
immediately, it's gotta find this, right? -
21:51 - 21:54It's gotta find a series of non-blanks,
followed by an at sign, -
21:54 - 21:58followed by another series of one or
more non-blanks. And then -
21:58 - 22:00what we do, so this, if we didn't put
these parentheses -
22:00 - 22:04in, it would match and we would get all of
this data. -
22:04 - 22:05It would go to here.
-
22:06 - 22:09But what we can do with the parentheses,
the parentheses are part -
22:09 - 22:12of the regular expression language,
saying, -
22:12 - 22:15okay, I want to match the whole thing.
-
22:15 - 22:17The parentheses aren't part of the care-,
a string up here. -
22:17 - 22:19I want to match the whole thing, but
-
22:19 - 22:21I only want to extract this part in
parentheses. -
22:22 - 22:25So this whole thing is a regular
expression that's matched -
22:25 - 22:29and then the parentheses part is what's
retrieved for you. -
22:29 - 22:32And so this makes it so that the only time
it's going to -
22:32 - 22:35look for at signs is, are on lines that
start with From space. -
22:35 - 22:39It is going to want the immediate next
character to be a non-blank. -
22:41 - 22:43Some number of non-blank characters
followed by an at sign, -
22:43 - 22:46some number of non-blank characters, it's
going to stop right there. -
22:46 - 22:48And it's only going to extract from here
to here, -
22:48 - 22:51and so we get out Stephen Marquard.
-
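Here is a sketch of what the parentheses change, using an assumed From line. The same pattern is run with and without them.

```python
import re

# Assumed mailbox-style line.
x = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# Without parentheses, findall gives back everything that matched.
whole_match = re.findall(r"^From \S+@\S+", x)

# With parentheses, the whole pattern must still match,
# but only the part inside the parentheses is returned.
just_address = re.findall(r"^From (\S+@\S+)", x)

print(whole_match)   # ['From stephen.marquard@uct.ac.za']
print(just_address)  # ['stephen.marquard@uct.ac.za']
```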
22:51 - 22:56But this is a pretty narrowly scoped thing
because -
22:56 - 22:58the first five characters have to be From
space. -
22:58 - 23:01And so that's a way to combine a stricter
match, -
23:01 - 23:04even though you don't actually want
all the data. -
23:04 - 23:06So you can add those things all over the
place. -
23:06 - 23:09Okay? Okay.
-
23:09 - 23:15Then, we, we, we can compare the different
ways of extracting data. -
23:15 - 23:20So if we look at how we extract the host
name. -
23:20 - 23:23Remember how we did this many chapters ago.
-
23:23 - 23:26So we did a data.find, which says oh,
-
23:26 - 23:30the first at sign is at 21.
So the first at sign is at 21. -
23:30 - 23:34Then we say we want to find the space
after that. -
23:34 - 23:39So searching from the at position, the space is at 31.
And then we want to extract the data -
23:39 - 23:44that's one beyond the at up to but not
including the space. -
23:46 - 23:48And that is the variable that we are
going to print out, host. -
23:48 - 23:52And so we've extracted this bit of
information and out comes the host. -
23:52 - 23:53Quite nice. Okay?
-
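The find-and-slice approach can be sketched like this. The line in data and the variable names atpos and sppos are assumptions made for the example.

```python
# Assumed mailbox-style line.
data = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# Position of the first at sign.
atpos = data.find("@")

# Position of the first space at or after the at sign.
sppos = data.find(" ", atpos)

# One beyond the at sign, up to but not including the space.
host = data[atpos + 1 : sppos]

print(atpos, sppos)  # 21 31
print(host)          # uct.ac.za
```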
23:54 - 23:57We also saw another technique, and by the
way, all these techniques are okay. -
23:59 - 24:00All these techniques are fine.
-
24:00 - 24:02Another technique we saw, once we sort of
played -
24:02 - 24:04with split and lists, was what we, what I
-
24:04 - 24:08called a double split version of this,
where the -
24:08 - 24:10first thing we do is we split that line.
-
24:12 - 24:16The first thing we do is split the line on
blanks, and then we know -
24:19 - 24:24that the second thing, which is the
sub one, words sub one, -
24:24 - 24:29is the entire email address. Then this is
the double split. -
24:29 - 24:32We take the email address and we split it by
-
24:32 - 24:35an at sign and then we get a list of the
-
24:35 - 24:38pieces of the email address, the email
name and the -
24:38 - 24:44email host, and then we grab the, the
sub one of that, -
24:44 - 24:45and then we have the host.
-
24:45 - 24:50So that's a double, the double split way
of doing this, right? -
24:50 - 24:53Now in this, we still haven't done
the From yet, -
24:53 - 24:57but it is the double split way to do this.
-
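A rough sketch of the double split, again on an assumed From line. The variable names are only for illustration.

```python
# Assumed mailbox-style line.
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# First split: break the line into words on blanks.
words = line.split()
email = words[1]

# Second split: break the address at the at sign.
pieces = email.split("@")
host = pieces[1]

print(email)  # stephen.marquard@uct.ac.za
print(host)   # uct.ac.za
```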
24:57 - 25:04So, if we think about how we would do
this in a regular expression, okay? -
25:04 - 25:12We're going to say, look through the
string, findall, we're going to, -
25:12 - 25:15use the findall, and the regular
expression exploded up says -
25:16 - 25:21look through the string for an at.
Do, do, do, do, do, do, got an at. -
25:21 - 25:26Then, oh, start extracting. End extracting.
-
25:26 - 25:29And then this is another form of the
-
25:29 - 25:31this is one character, it's a
-
25:31 - 25:35single character, match any non-blank
character, and -
25:35 - 25:37zero or more of them. Okay?
-
25:37 - 25:42So find an at sign, start extracting,
-
25:42 - 25:48end extracting, match, this is one character.
-
25:48 - 25:54That is a set of possible matches, and
that's some character, this means not. -
25:57 - 25:59Okay? Not a blank, that's a blank
-
25:59 - 26:01right there, that's a blank character
right there. -
26:01 - 26:04Not a blank, as many times as you want.
-
26:04 - 26:05You might want to, we might want to turn
-
26:05 - 26:08that into a plus to guarantee at least one.
-
26:08 - 26:10So that might be better done as a plus
right there. -
26:14 - 26:16So this is, probably make more sense as a
plus, to say, I -
26:16 - 26:21want at least, after the at sign, I want
at least one non-blank character. -
26:26 - 26:31And the parentheses simply say, I don't
want the at sign. -
26:31 - 26:36So if the at sign, I really want those
non-blank characters after the at sign. -
26:36 - 26:39Okay? So that's what I want to extract.
-
26:39 - 26:42So it's like, go find the at sign.
-
26:42 - 26:44Okay, great, found the at sign. Start
-
26:44 - 26:48extracting, look for non-blank characters,
end extracting. -
26:48 - 26:50So pull that part out and put it right
there. -
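The regular expression version can be sketched as follows, with an assumed From line. Both the asterisk form and the plus form are shown, since the plus is probably the safer choice.

```python
import re

# Assumed mailbox-style line.
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# @       a literal at sign (matched, but not extracted)
# (       start extracting
# [^ ]*   any character that is not a blank, zero or more times
# )       end extracting
host_star = re.findall("@([^ ]*)", line)

# Same idea with + so that at least one non-blank is required.
host_plus = re.findall("@([^ ]+)", line)

print(host_star)  # ['uct.ac.za']
print(host_plus)  # ['uct.ac.za']
```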
26:53 - 26:56Now an even cooler version of this that
-
26:56 - 26:59you probably kind of imagined right away is,
-
27:01 - 27:07we say, you know what, I would like this
first character, the first -
27:07 - 27:13part of the line to be From, with a blank,
followed by any number of characters, -
27:17 - 27:21followed by an at sign, so the at sign is
real, then start -
27:21 - 27:26extracting, then any number of non-blank
characters, end extracting. -
27:27 - 27:32So this is a, this is like eight or nine
lines of Python -
27:32 - 27:36all rolled into one thing, okay?
-
27:39 - 27:44So, start at the beginning of the line.
Look for string From, with a space. -
27:44 - 27:50Then skip a bunch of characters looking
for an at sign, skip characters until -
27:50 - 27:53you encounter an at sign, then start
-
27:53 - 27:58extracting, match any non-blank, a single
non-blank character. -
27:58 - 28:01This is kind of like one non-blank
-
28:01 - 28:04character, one non-blank character, but
once you -
28:04 - 28:08suffix it with the asterisk that changes it to
be many non-blank characters. -
28:11 - 28:13And then stop extracting, okay?
-
28:14 - 28:19And so, you know, and so it's like find
From followed by a space, great. -
28:21 - 28:22That's the first part.
-
28:22 - 28:25Now throw away characters until you find
an at sign. -
28:26 - 28:28Then start extracting.
-
28:28 - 28:31Keep going with non-blank characters until
you hit -
28:31 - 28:34the first blank character and pull that
part out. -
28:34 - 28:36Now the result is we get the exact same
-
28:36 - 28:42data. But with this added to it, we are
much more narrow in the kind of things -
28:42 - 28:47that we're looking for and if we get
noisy data that like, something like, -
28:47 - 28:53you know, meet at Joe's, right?
We don't want that. -
28:53 - 28:54That won't match, right?
-
28:54 - 28:56We want that to be like a False.
-
28:56 - 28:59And, and it allows us to sort of really
fine-tune our matching -
28:59 - 29:03and extracting. And this is just the
beginning, they are very, very powerful. -
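Here is a small sketch of that narrower pattern, assuming it is written as '^From .*@([^ ]*)'. Both sample lines are made up for illustration; the second one stands in for the "meet at Joe's" kind of noise.

```python
import re

# Anchored pattern: line must start with "From ", then anything,
# then an at sign, then the non-blank characters we extract.
PATTERN = r'^From .*@([^ ]*)'

mail_line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
noisy_line = "Let's meet @Joes after class"

from_mail = re.findall(PATTERN, mail_line)
from_noise = re.findall(PATTERN, noisy_line)
loose_noise = re.findall(r'@([^ ]*)', noisy_line)

print(from_mail)    # ['uct.ac.za']
print(from_noise)   # []
print(loose_noise)  # ['Joes']
```

The unanchored pattern happily picks up 'Joes', while the anchored one returns an empty list for the noisy line, which is the "False" result we want.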
29:03 - 29:09So, the last thing that I will show you is
sort of a program that is kind of like one -
29:09 - 29:12of the programs that we did in a previous
section, -
29:12 - 29:15except now we're going to use regular
expressions to do it. -
29:15 - 29:16So if you remember, we had this thing where
-
29:16 - 29:20we're doing spam confidence, where we're
looking for lines and -
29:21 - 29:23you know, and pulling this number out and then
-
29:23 - 29:26calculating the average, or the
maximum, or whatever. -
29:26 - 29:32And so here is a, we import the regular
expression library, we open the file, -
29:32 - 29:35we're going to do this by appending
to a list. -
29:35 - 29:38We'll put the numbers in a list rather
than doing the calculation in a loop. -
29:39 - 29:40We strip the data.
-
29:40 - 29:42Now, here's the key thing, right?
-
29:42 - 29:45We're going to have a regular expression
that says, -
29:46 - 29:49look for the first character being X,
followed by -
29:49 - 29:51a dash, followed by all this,
all this -
29:51 - 29:55exactly has to match literally, followed
by a colon. -
29:55 - 30:01And then there's a space, and then we
begin extracting and we are looking for -
30:01 - 30:06the digit 0 through 9 or a dot and
we are looking for one or -
30:06 - 30:10more, and then we end extracting.
-
30:10 - 30:13So that's the, the parentheses are telling
us what to pull out. -
30:13 - 30:15So that just means that we're going to
pull out those numbers, all -
30:15 - 30:18the digits and the periods,
-
30:18 - 30:21until we
get something other than -
30:21 - 30:24a digit or a period, and then
we'll be done, okay? -
30:24 - 30:30And so if we, and so this is going to pull
those numbers out and give us back a list. -
30:30 - 30:31Now the thing about it is, we have
-
30:31 - 30:35to realize that sometimes this is not
going to match, because -
30:35 - 30:38we're sending every line, not just the
ones that start -
30:38 - 30:41with X, we're sending every line through
this and so -
30:41 - 30:44we need to know when we didn't get a
match. -
30:44 - 30:48And that, the way we know we didn't get a
match is if the list, the -
30:48 - 30:52number of items in the list that we got
back, is zero, then we're going to continue. -
30:52 - 30:57So this is kind of our if where we're
searching for the needle in the haystack. -
30:57 - 31:00But then once we find what we are looking
-
31:00 - 31:02for, the actual number that we are
interested in, -
31:05 - 31:08is already sitting here in stuff sub zero.
Okay? -
31:08 - 31:11And then we convert it to a float, we append it.
-
31:11 - 31:14And when the loop is all done, we print
out the maximum. -
31:14 - 31:15Okay?
-
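The program walked through above could be sketched roughly like this. It is not a copy of the slide: the header name X-DSPAM-Confidence is assumed from the course's mailbox file, the function name is hypothetical, and a short list of made-up lines stands in for the opened file so the sketch runs on its own.

```python
import re

def max_spam_confidence(lines):
    """Return the largest confidence value found, or None if there is none."""
    numbers = []
    for line in lines:
        line = line.rstrip()
        # Literal header text, a space, then extract digits and dots.
        stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line)
        if len(stuff) == 0:
            continue  # this line did not match, skip it
        numbers.append(float(stuff[0]))
    return max(numbers) if numbers else None

# Made-up sample lines standing in for the mailbox file.
sample = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n',
    'X-DSPAM-Confidence: 0.8475\n',
    'X-DSPAM-Probability: 0.0000\n',
    'X-DSPAM-Confidence: 0.9907\n',
]
print('Maximum:', max_spam_confidence(sample))
```

To run it against a real file, pass the open file handle instead of the list, since looping over a handle also yields one line at a time.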
31:15 - 31:17And so this is sort of encoding a number
of things -
31:17 - 31:22and ending up with a very, a very solid and
safe matching. -
31:22 - 31:26So we're really, it's hard for this to
find a line that's wrong and -
31:26 - 31:30you could even improve this a little bit
to make it even a little tighter -
31:30 - 31:35where we'd go find a number like 0.999.
You could say, oh, it's -
31:35 - 31:41all the numbers are zero dot, so
-
31:41 - 31:47you could make this a little, a little more
precise. -
31:47 - 31:49So it would even skip
things that -
31:49 - 31:53don't look exactly the
way you want them to look. -
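One way the tighter version might look, as a sketch only. The pattern names and the odd-looking sample line are hypothetical, and the tight pattern bakes in the assumption that every value starts with "0."

```python
import re

LOOSE = r'^X-DSPAM-Confidence: ([0-9.]+)'
# Tighter: a literal zero, a literal dot, then one or more digits.
TIGHT = r'^X-DSPAM-Confidence: (0\.[0-9]+)'

good = 'X-DSPAM-Confidence: 0.999'
odd = 'X-DSPAM-Confidence: 12.5.3'

loose_good = re.findall(LOOSE, good)
tight_good = re.findall(TIGHT, good)
loose_odd = re.findall(LOOSE, odd)
tight_odd = re.findall(TIGHT, odd)

print(loose_good, tight_good)  # ['0.999'] ['0.999']
print(loose_odd, tight_odd)    # ['12.5.3'] []
```

The loose pattern extracts '12.5.3', which float() would then reject with an error, while the tight pattern simply skips that line. The trade-off is that the tight pattern would also skip a value such as 1.0, so it only fits if the data really is always zero dot something.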
31:53 - 31:55So, I emphasize that this
-
31:55 - 31:57is kind of a weird language and you need
some kind of thing. -
31:57 - 31:59We talked about all these.
-
31:59 - 32:02We have the beginning of the line, we have
the end -
32:02 - 32:04of the line, matching any character,
-
32:04 - 32:08matching whitespace characters, matching
non-whitespace characters. -
32:08 - 32:13Star is a modifier that says zero or more
times. -
32:13 - 32:18Star question mark is a modifier that says
zero or more times non-greedy. -
32:18 - 32:21Plus is one or more times.
-
32:21 - 32:25Plus question mark is one or more times
non-greedy. -
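A quick sketch of those four modifiers. The sample strings are made up for illustration.

```python
import re

text = 'From: Using the : character'

# Greedy plus: grabs as much as it can, so it runs to the last colon.
greedy = re.findall(r'^F.+:', text)
# Non-greedy plus: stops at the first colon it can.
lazy = re.findall(r'^F.+?:', text)
print(greedy)  # ['From: Using the :']
print(lazy)    # ['From:']

# Star allows zero repetitions, plus needs at least one.
star = re.findall(r'ab*', 'a ab abb')
plus = re.findall(r'ab+', 'a ab abb')
print(star)  # ['a', 'ab', 'abb']
print(plus)  # ['ab', 'abb']
```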
32:25 - 32:27When you have bracket syntax, it's a set,
-
32:27 - 32:31it's a single character that's in the
listed set. -
32:31 - 32:33So that's lower-case vowels.
-
32:34 - 32:35You can also have the first, if the first
-
32:35 - 32:39character of this is a caret, that flips it.
-
32:39 - 32:43So that means everything except capital
X, capital Y, capital Z. -
32:43 - 32:45So it's everything that's not in the set,
-
32:45 - 32:48capital X, capital Y, capital Z, and then
-
32:48 - 32:51you can also put dashes in to represent
ranges. -
32:51 - 32:53Bracket a through z and 0 through 9,
and lower-case -
32:53 - 32:58letters and digits will match, but again,
this is a single character. -
32:58 - 33:01Now, you can put a plus or a star after
-
33:01 - 33:04these guys to make them happen more than
one time. -
33:04 - 33:06And you can even put them in twice.
-
33:06 - 33:12So if I wanted a two-digit number, I could
say 0 dash 9, 0 dash 9. -
33:13 - 33:15Oops. This is one character.
-
33:15 - 33:18This is one character and this is the
possible things. -
33:18 - 33:22So that's, you know, 0 0
would match. -
33:22 - 33:261 0 would match, 99 would match, etc.
-
33:26 - 33:27Okay?
-
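The bracket syntax can be tried out with a few one-liners. This is only a sketch, and the strings being searched are invented for the example.

```python
import re

# A single character from the listed set (lower-case vowels).
vowels = re.findall(r'[aeiou]', 'regular')
# A leading caret flips the set: anything except X, Y or Z.
not_xyz = re.findall(r'[^XYZ]', 'XaYbZ')
# Ranges with a dash; the plus repeats the single-character set.
lower_or_digit = re.findall(r'[a-z0-9]+', 'abc123 DEF x9')
# Two sets in a row: exactly two digits.
two_digits = re.findall(r'[0-9][0-9]', 'rooms 00, 10 and 99')

print(vowels)          # ['e', 'u', 'a']
print(not_xyz)         # ['a', 'b']
print(lower_or_digit)  # ['abc123', 'x9']
print(two_digits)      # ['00', '10', '99']
```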
33:29 - 33:32And then the parentheses are the things
that if you -
33:32 - 33:34are in the middle of a big long matching
string and -
33:34 - 33:37you don't want to extract the whole thing,
you can limit the -
33:37 - 33:40things you're extracting to, to the stuff
that's just in there. -
33:41 - 33:44With all these characters that have all
this meaning, -
33:44 - 33:46we have to have a way to match those
characters. -
33:46 - 33:50So dollar sign is the end of a line.
-
33:50 - 33:52But what if we're looking for something that
-
33:52 - 33:53actually has a dollar sign in the string?
-
33:55 - 33:57And that's what the backslash is for.
-
33:57 - 33:58So if you put the backslash in front of
-
33:58 - 34:04an otherwise meaningful character,
it becomes the actual character. -
34:04 - 34:07So this is saying match a dollar sign.
-
34:07 - 34:09Those two characters say match a dollar
sign. -
34:09 - 34:14And then this says one character that's
0 through 9 or a, or a dot. -
34:14 - 34:17And then we put the plus modifier to say
-
34:17 - 34:20one or more times, and so that
is -
34:20 - 34:21greedy, of course.
-
34:21 - 34:25So that will get us this and extract it,
including the dollar sign. -
34:25 - 34:28So the escape character is the backslash.
-
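A minimal sketch of the escaped dollar sign, assuming the pattern is written as '\$[0-9.]+'. The sentence being searched is a made-up example.

```python
import re

text = 'We just received $10.00 for cookies.'

# Backslash-dollar matches a real dollar sign, then one or more
# characters that are digits or dots (greedy).
amounts = re.findall(r'\$[0-9.]+', text)
print(amounts)  # ['$10.00']

# Without the backslash, the dollar sign means end of line,
# and nothing can follow the end, so there is no match.
unescaped = re.findall(r'$[0-9.]+', text)
print(unescaped)  # []
```

There are no parentheses in the pattern, so findall returns the whole match, dollar sign included.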
34:29 - 34:31Okay. So there we are.
-
34:31 - 34:32Now we're done.
-
34:32 - 34:35So this is a little bit cryptic.
-
34:35 - 34:38It's, it's kind of a puzzle.
-
34:38 - 34:39It's kind of fun.
-
34:39 - 34:43And it's extremely powerful.
And you don't have to know it. -
34:43 - 34:44You don't have to learn it.
-
34:45 - 34:49But if you do, you'll find that it's very
useful as we sort -
34:49 - 34:53of dig through data and are trying to
write things that are pretty quick. -
34:53 - 34:59And, and, and they, the thing I like about
regular expressions is that they -
34:59 - 35:03tend to be, if you write them well, they
tend to be less sensitive to bad data. -
35:05 - 35:07They tend to ignore data, they're, you
-
35:07 - 35:10can put more detail, I exactly want this.
-
35:10 - 35:10Whereas you're,
-
35:10 - 35:12if you're writing find and extract, you're
-
35:12 - 35:14making a lot of assumptions about the
data. -
35:14 - 35:17That it's clean and you're not going to,
you know, mis-hit on something. -
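To make that contrast concrete, here is a sketch comparing a find-and-slice approach with a regular expression on a line that has no at sign. Both helper functions and both sample lines are hypothetical, written only for this illustration.

```python
import re

def host_with_find(line):
    # Assumes an at sign exists and a space follows the host.
    at = line.find('@')
    end = line.find(' ', at)
    return line[at + 1:end]

def host_with_regex(line):
    # Returns None when the line does not look the way we expect.
    found = re.findall(r'^From .*@([^ ]*)', line)
    return found[0] if found else None

clean = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
bad = 'From the desk of nobody'

print(host_with_find(clean), host_with_regex(clean))
print(repr(host_with_find(bad)), host_with_regex(bad))
```

On the clean line both give 'uct.ac.za'. On the bad line, find returns -1 twice and the slice quietly hands back 'From the desk of nobod', while the regular expression version reports that nothing matched.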
35:17 - 35:22So, okay, well, good luck getting
-
35:22 - 35:24used to regular expressions, and we'll
see you later.
Title: Python for Informatics - Chapter 11 - Regular Expressions
Description: Regular Expressions http://www.pythonlearn.com/
All Lectures: http://www.youtube.com/playlist?list=PLlRFEj9H3Oj4JXIwMwN1_ss1Tk8wZShEJ
Video Language: English
Team: Captions Requested
Duration: 35:24