Python for Informatics - Chapter 11 - Regular Expressions

  • 0:00 - 0:03
    Hello and welcome to Chapter 11, Regular
    Expressions
  • 0:03 - 0:07
    from the book Python for Informatics:
    Exploring Information.
  • 0:08 - 0:12
    As always, these slides are copyright
    Creative Commons Attribution, as well as
  • 0:12 - 0:15
    the audio and the video that you're
    watching or listening to right now.
  • 0:16 - 0:23
    And, so regular expressions are an
    interesting thing.
  • 0:23 - 0:26
    You've seen, in the chapters up till
    now, that
  • 0:26 - 0:31
    I've had a singular focus on sort of
    pulling information out of data.
  • 0:31 - 0:34
    Raw data, this mailbox file that perhaps
    you're getting tired of already.
  • 0:34 - 0:36
    But it's a lot of fun, because I can have
  • 0:36 - 0:38
    you go look for something and, and
    pick it out.
  • 0:38 - 0:42
    And you're doing something that like would
    be really painful to do sort of by hand.
  • 0:45 - 0:47
    And while it's not all of computing, I
    mean, there's games
  • 0:47 - 0:51
    and there's, you know, things like
    weather computations that do calculations,
  • 0:52 - 0:57
    pulling and extracting data out is a big
    part of computing.
  • 0:57 - 1:01
    And so there's actually a library that's
    built specifically to do this.
  • 1:01 - 1:06
    And, and if you start doing a few finds
    and slicing, it gets kind of
  • 1:06 - 1:08
    long after a while, and that's why split,
    for example,
  • 1:08 - 1:11
    really saved us a lot of time.
  • 1:11 - 1:14
    But sometimes the data that you are
    looking for is a little
  • 1:14 - 1:18
    more sophisticated than broken into spaces
    or colons or something like that.
  • 1:18 - 1:21
    And you just want to like tell something
    to go find
  • 1:21 - 1:26
    I see what I want, and I see where it's
    embedded in the string, go get it for me.
  • 1:26 - 1:29
    And regular expressions are themselves a
    programming language.
  • 1:29 - 1:34
    They're like a really smart wild card for
    searching.
  • 1:34 - 1:35
    So we've used wild cards in various
  • 1:35 - 1:40
    things in search, but they're, they're a
    really smart version of a wild card.
  • 1:42 - 1:47
    And so, regular expressions are quite
    powerful and they're very cryptic.
  • 1:47 - 1:49
    And as a matter of fact, you don't even
    need
  • 1:49 - 1:51
    to learn them if you don't feel like it,
    right?
  • 1:52 - 1:53
    I've got this little guide.
  • 1:53 - 1:56
    I need a guide for myself when I do
    regular expressions.
  • 1:56 - 1:58
    It sometimes takes me a few minutes to
    write
  • 1:58 - 2:00
    a regular expression to do exactly what I
    want.
  • 2:00 - 2:05
    So in a way, writing a regular expression
    is like writing a program.
  • 2:05 - 2:09
    It's highly specialized to searching and
    extracting data from strings.
  • 2:09 - 2:12
    But it's like writing a program and it
    takes a while to get
  • 2:12 - 2:15
    it right and you kind of like, oh, change
    this, what about a slash there?
  • 2:15 - 2:18
    And, so, you, but they actually are kind
    of fun.
  • 2:18 - 2:22
    And, and they are a great way to sort of
    exchange little program snippets
  • 2:22 - 2:25
    to say, oh yeah, I'm looking for this, oh
    here's a little reg expression you might
  • 2:25 - 2:28
    try and then, so they're, they're like
    programs themselves.
  • 2:30 - 2:33
    It is this language of marker characters,
    so when we
  • 2:33 - 2:37
    look for regular expressions, some
    characters like A, B, C, have meaning
  • 2:37 - 2:41
    as A, B, C but some characters like caret or
    dollar sign mean
  • 2:41 - 2:43
    at the beginning of the line, or at the
    end of the line.
  • 2:43 - 2:47
    And so we encode in this string a, a
    program, basically.
  • 2:47 - 2:51
    And so it's a rather old-school language.
    It's from a
  • 2:51 - 2:52
    long time ago.
  • 2:52 - 2:55
    It predates Python, which is over 20 years
    old, and so
  • 2:55 - 3:01
    it's, it also marks you as sort of a
    little cool, right?
  • 3:01 - 3:04
    It's a, it's a distinct marking that makes
  • 3:04 - 3:06
    it so that you know something other people
    don't.
  • 3:06 - 3:10
    Right? So you can know how to program, but
    if you know regular expressions
  • 3:10 - 3:13
    it'll be like woah, I tried to look at those
    and they're kind of tough.
  • 3:13 - 3:16
    In a way, knowing regular expressions is
  • 3:16 - 3:18
    kind of like a tattoo.
  • 3:18 - 3:21
    So I, it's casual Friday and that's why
    I'm wearing a T-shirt
  • 3:21 - 3:24
    today and so I figured I would come in
    today in a T-shirt,
  • 3:24 - 3:26
    but seeing as it's the first time I'm wearing
    a short-sleeved shirt, it's
  • 3:26 - 3:29
    also the first time I can show you my,
    show my real tattoo here.
  • 3:29 - 3:33
    So, here's my real tattoo and in the
    middle is Sakai,
  • 3:33 - 3:36
    the open source learning management system
    always close to my heart.
  • 3:36 - 3:38
    And then you have the IMS logo, which
  • 3:38 - 3:41
    is IMS Learning Tools Interoperability,
    which is a standard,
  • 3:41 - 3:46
    it means a lot to me.
    Blackboard, OLAT, Learning Objects, Angel,
  • 3:46 - 3:52
    Moodle, Instructure, Jenzabar, and
    Desire2Learn.
  • 3:52 - 3:54
    I call this the ring of compliance,
    because these are all
  • 3:54 - 4:00
    of the first six or seven learning
    management systems that complied
  • 4:00 - 4:01
    with the IMS Learning Tools
  • 4:01 - 4:03
    Interoperability standards
    specification, which is
  • 4:03 - 4:06
    something that I spent a lot of my life
    making work.
  • 4:06 - 4:07
    So
  • 4:07 - 4:10
    I figured I'd make a tattoo and just
    kind of
  • 4:10 - 4:13
    part of my rough, tough image and,
    and actually
  • 4:13 - 4:16
    regular expressions are indeed part of my
    rough, tough image,
  • 4:16 - 4:19
    because I'm like, I'm down with
    regular expressions.
  • 4:19 - 4:23
    And people are like impressed with my
    regular expression knowledge.
  • 4:23 - 4:27
    But as impressive as I am, I still need a
    cheat sheet, so I'll have a cheat
  • 4:27 - 4:29
    sheet that you can download hopefully on
    the pythonlearn
  • 4:29 - 4:32
    website or whatever, and I just, it
  • 4:32 - 4:33
    doesn't have to be much.
  • 4:33 - 4:36
    It's really just a kind of a, a crutch,
    and these are the characters that have
  • 4:36 - 4:38
    special meaning, like caret or
    dollar sign
  • 4:38 - 4:41
    match the beginning or end of line,
    respectively.
  • 4:41 - 4:44
    So they're not really matching a dollar
    sign, they match, they,
  • 4:44 - 4:47
    they mean something in our little mini
    string-like programming language.
  • 4:49 - 4:53
    So, like many things that we do in Python
    going forward, once you want some
  • 4:53 - 4:56
    sophisticated capability, it comes with
    Python, but
  • 4:56 - 4:58
    it comes in the form of a library.
  • 4:58 - 5:01
    And so the regular expression library we
    have to say import r-e
  • 5:01 - 5:04
    at the beginning of our programs to import
    the regular expression library.
  • 5:04 - 5:06
    Then we call re.search to say I'm
  • 5:06 - 5:09
    looking for search from the regular
    expression library.
  • 5:09 - 5:12
    There's two basic functions or method,
    two, two basic
  • 5:12 - 5:14
    capabilities inside this library that
    we're going to look at.
  • 5:14 - 5:19
    One is search, that replaces find, it's
    like a smart find, and then
  • 5:19 - 5:24
    findall is a combination of a smart find
    and an automatic extraction.
  • 5:24 - 5:26
    So we'll look at both of those in turn.
  • 5:26 - 5:29
    And I'll do it by comparing them to
    existing
  • 5:29 - 5:31
    Python that you kind of already should
    know at this point.
  • 5:34 - 5:37
    So here's some code that's, say, looking
    for lines that
  • 5:37 - 5:40
    have the string From
    colon in them.
  • 5:40 - 5:44
    Right, so, we're going to open a file,
    we're going to strip the white space.
  • 5:44 - 5:48
    We use find to hunt within line for
    From colon.
  • 5:48 - 5:51
    If it's greater than or equal to zero then
    we'll print it. And so this
  • 5:51 - 5:55
    is just going to give us a number. If it's,
    if it's not found, it's negative one.
  • 5:55 - 5:58
    So it's only going to print the lines that
    that have From in them.
  • 5:58 - 6:00
    Here is the equivalent using
  • 6:00 - 6:03
    regular expressions.
    So these two things are equivalent.
  • 6:03 - 6:05
    So we have to import the library, like I
  • 6:05 - 6:07
    mentioned before, and all the rest of it's
    the same.
  • 6:07 - 6:11
    The if test is re.search. That says within
  • 6:11 - 6:15
    the library re, call the search utility
    and then
  • 6:15 - 6:18
    pass in the string we're looking
    for
  • 6:18 - 6:20
    and the line, the actual text we're
    looking in.
  • 6:20 - 6:25
    So this is like look for From inside of
    line and return me a
  • 6:25 - 6:29
    True or a False, whichever, depending on
    whether you find it or not.
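The comparison between find and re.search can be sketched as follows. The sample lines are made up to stand in for the mailbox file, and note that re.search actually hands back a match object or None, which behave like True and False in an if test.

```python
import re

# Made-up sample lines standing in for the mailbox file (an assumption).
lines = [
    "From: stephen.marquard@uct.ac.za",
    "Subject: hello",
    "Reply sent From: another address",
]

with_find = []
with_search = []
for line in lines:
    line = line.rstrip()
    # String method: find gives an index, or -1 when the text is absent.
    if line.find("From:") >= 0:
        with_find.append(line)
    # Regular expression: search gives a match object (truthy) or None (falsy).
    if re.search("From:", line):
        with_search.append(line)

print(with_find == with_search)
```

Both loops keep the same lines, since a plain string with no special characters is searched for literally.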
  • 6:29 - 6:33
    Now you might say, I, you just got done
    telling me that it, it was more dense.
  • 6:33 - 6:35
    And the answer is, there's a few more
    characters here.
  • 6:35 - 6:36
    But we'll see in a second how you
  • 6:36 - 6:39
    can quickly add more power to the regular
    expression.
  • 6:39 - 6:41
    Find, you have to start adding more
  • 6:41 - 6:43
    Python lines to make it more sophisticated
    where in
  • 6:43 - 6:46
    the regular expression you start changing,
  • 6:46 - 6:50
    you change the search string to give more of
  • 6:50 - 6:52
    the direction of what you're looking for,
    and that's what
  • 6:52 - 6:55
    we'll be doing, pretty much, is changing
    the search string.
  • 6:55 - 6:58
    So now if we wanted to switch to say,
    wait, wait, wait, we don't
  • 6:58 - 7:03
    just want the From anywhere in the line,
    we want it to start with From.
  • 7:03 - 7:06
    So we would change it to
    line.startswith('From'),
  • 7:06 - 7:07
    and that's either going to be true or false
  • 7:07 - 7:10
    depending on whether or not the
    line starts with From.
  • 7:10 - 7:12
    Now, we do the same thing with
  • 7:12 - 7:15
    regular expressions by changing the
    search string.
  • 7:16 - 7:17
    So now we are in regular expressions.
  • 7:17 - 7:20
    So this really isn't just a string, it's a
    string plus
  • 7:20 - 7:22
    characters that are interpreted as
  • 7:22 - 7:24
    commands by the regular expression
    library.
  • 7:24 - 7:28
    So the caret, which is the first one on
    our,
  • 7:28 - 7:32
    our little regular expression sheet, matches
    the beginning of the line.
  • 7:32 - 7:33
    It's not actually a caret.
  • 7:33 - 7:37
    So that says, the first character, this
    two-character sequence, caret F,
  • 7:37 - 7:41
    means F but in column one, in the first
    character of the line.
  • 7:41 - 7:43
    And so, again, this is going to give us a
  • 7:43 - 7:46
    True or a False, depending on whether this regular
    expression matches
  • 7:46 - 7:50
    From: at the beginning of the line, and
    it's the same as
  • 7:50 - 7:54
    this: does it start with From.
    So again, these two are equivalent.
    So again, these two are equivalent.
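A small sketch of that equivalence, using made-up sample lines (an assumption, not the course data): startswith and a pattern beginning with caret select the same lines.

```python
import re

# Made-up sample lines for illustration.
lines = [
    "From: a@example.com",
    "Reply sent From: b@example.com",
    "X-From: c@example.com",
]

with_startswith = []
with_caret = []
for line in lines:
    # String method: does the line begin with From: ?
    if line.startswith("From:"):
        with_startswith.append(line)
    # Regular expression: ^ anchors the match to the beginning of the line.
    if re.search("^From:", line):
        with_caret.append(line)

print(with_startswith)
print(with_caret)
```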
  • 7:54 - 8:00
    But you see the pattern where we're
    going to do something to this string using
  • 8:00 - 8:06
    these characters that have meaning, okay?
    So, the next thing that's
  • 8:06 - 8:12
    most commonly done other than caret and
    dollar sign for the end of line, is
  • 8:12 - 8:16
    the wildcard characters and so, we've used
    wildcards
  • 8:16 - 8:20
    possibly in like DOS, where we can use ?
  • 8:20 - 8:25
    or * in like a dir command. dir *.* if
    you're familiar with that,
  • 8:25 - 8:30
    or even a Unix command like ls, you
    know, star dot whatever.
  • 8:30 - 8:32
    This is not how regular expressions
  • 8:32 - 8:34
    work. And the problem is that dot
  • 8:34 - 8:38
    matches any single character in
    regular expressions.
  • 8:38 - 8:41
    Asterisk means any number of times.
  • 8:41 - 8:47
    So if I look at this, if I look at
    this and color-code this to make a
  • 8:47 - 8:52
    little more sense, the caret is actually
    kind of part of the
  • 8:52 - 8:57
    regular expression
    programming language. It says, I'm
  • 8:57 - 8:59
    a virtual character matching the
    beginning of line.
  • 8:59 - 9:01
    The X is a real character.
  • 9:01 - 9:05
    The dot is part of the regular expression
    programming language, any character.
  • 9:05 - 9:08
    Star is part of the regular expression
    programming language, it says
  • 9:08 - 9:12
    the immediate previous character many
    times, zero or more times.
  • 9:12 - 9:15
    And then colon matches the colon.
  • 9:15 - 9:20
    And so if you look at lines, these are the
    kinds of lines that will give me a True.
  • 9:20 - 9:22
    Because they start with an X,
  • 9:22 - 9:26
    followed by some number of characters,
    followed by a colon.
  • 9:26 - 9:27
    So that's true.
  • 9:27 - 9:31
    Start with a X, followed by some number of
    characters, followed by a colon.
  • 9:31 - 9:32
    Okay?
  • 9:32 - 9:35
    And so that's basically how this works.
  • 9:35 - 9:39
    And so this little, this, in this
  • 9:39 - 9:42
    five-character string there are, you know,
    some of
  • 9:42 - 9:44
    these things are like instructions and
    some of
  • 9:44 - 9:46
    them are the actual characters we're
    looking for.
  • 9:46 - 9:48
    So the X and the colon
  • 9:48 - 9:49
    are the characters we're looking
  • 9:49 - 9:55
    for, and the caret, dot, and star are
    programming.
  • 9:55 - 9:57
    Right? They are logic that we're adding
    to the string.
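Here is a minimal sketch of the five-character pattern in action. The sample lines are invented for illustration; only the pattern itself comes from the lecture.

```python
import re

# ^  beginning of line (instruction)
# X  a real X
# .  any single character (instruction)
# *  the previous thing, zero or more times (instruction)
# :  a real colon
pattern = "^X.*:"

# Invented sample lines.
samples = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "Received: from X",
    "X without a colon",
]

results = []
for text in samples:
    results.append(bool(re.search(pattern, text)))

print(results)
```

The third line fails because it does not start with X, and the fourth fails because there is no colon anywhere after the X.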
  • 10:00 - 10:01
    Okay.
  • 10:01 - 10:05
    So let's say, for example, you're...
    Part of any of these things,
  • 10:05 - 10:07
    and part of the stuff we have done so far,
  • 10:07 - 10:11
    has to assume that the data is at some
    level clean and
  • 10:11 - 10:14
    so the data that I have been giving you,
    mbox.txt, is not inconsistent.
  • 10:15 - 10:18
    Right? It doesn't have like too much
    weirdness in it.
  • 10:18 - 10:20
    I'm not trying to trick you and
    mislead you, although
  • 10:20 - 10:23
    we've had situations where you sort of get
    a traceback because
  • 10:23 - 10:25
    you think there's going to be five words
    you, you grab a line,
  • 10:25 - 10:28
    you break it, and there's only two
    words and then you get
  • 10:28 - 10:31
    a traceback because you're looking at the
    fifth word, or something like that.
  • 10:33 - 10:35
    But if your data is less clean, or even
    if you just
  • 10:35 - 10:40
    want to be real careful, you can
    fine-tune your matching.
  • 10:40 - 10:43
    So, here's that same match.
  • 10:43 - 10:45
    Give me a character X, followed by any
    number of
  • 10:45 - 10:48
    characters, followed by a colon, and that's
    what I'm looking for.
  • 10:48 - 10:50
    Give me lines that match that pattern.
  • 10:50 - 10:52
    So this starts with X, any number of
    characters,
  • 10:52 - 10:55
    colon, great, this, any number of
    characters good, great.
  • 10:55 - 10:57
    Oh wait, and there's an email X that says
  • 10:57 - 11:01
    X Plane is behind
    schedule, colon, two weeks.
  • 11:01 - 11:06
    Well, the regular expression didn't know
    that the dash made sense to you.
  • 11:06 - 11:07
    And you just assumed that everything that
    started
  • 11:07 - 11:09
    with a capital X had a dash after it.
  • 11:09 - 11:15
    So X is what it starts with, any number of
    any character, and then
  • 11:15 - 11:17
    a colon. So this becomes True.
  • 11:17 - 11:22
    This may not make you happy, right? It may
    not be what you're looking for.
  • 11:22 - 11:26
    Because you haven't been specific enough
    in your regular expression.
  • 11:26 - 11:31
    So, we can be more specific in our regular
    expression.
  • 11:31 - 11:35
    So for example, this is a more specific
    regular expression.
  • 11:35 - 11:40
    It still says start with an X as the first
    character, then a dash,
  • 11:40 - 11:43
    that's a real character not a, then this
  • 11:43 - 11:47
    next thing, instead of being a dot, this
    backslash capital S.
  • 11:47 - 11:50
    It's on the sheet.
  • 11:50 - 11:51
    Whoa. It's not on the sheet.
  • 11:51 - 11:54
    I lost the sheet. Come back, sheet.
  • 11:55 - 11:55
    I lost the sheet.
  • 11:56 - 11:59
    I can't live without my sheet.
  • 12:01 - 12:06
    Backslash capital S means a
    non-whitespace character.
  • 12:06 - 12:09
    So that means spaces won't match.
  • 12:09 - 12:14
    And then I changed the asterisk, zero or
    more times thing, to a plus.
  • 12:14 - 12:16
    And that means one or more times.
  • 12:16 - 12:20
    Here is a character, a non-whitespace.
    These two things kind of work together.
  • 12:20 - 12:25
    A non-whitespace character at least one
    time, as many as we like.
  • 12:25 - 12:26
    And then, a colon.
  • 12:27 - 12:31
    So, if we look here, it starts with X dash,
  • 12:31 - 12:35
    any number of non-whitespace
    characters, and ends in colon.
  • 12:35 - 12:37
    Starts with X dash, any number
  • 12:37 - 12:40
    of non-whitespace characters, ends
    in a colon.
  • 12:40 - 12:42
    True. True.
  • 12:42 - 12:46
    This one starts with an X, but doesn't
    start with an X dash.
  • 12:46 - 12:49
    Oh, as a matter of fact, these characters
    are blanks, so this becomes a False.
  • 12:49 - 12:53
    It does have an X and it does have a colon
    and match the previous one,
  • 12:53 - 12:56
    but this one here is more specific.
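To make the difference concrete, here is a sketch comparing the loose pattern with the more specific one. The sample lines are assumptions written to resemble the ones discussed.

```python
import re

loose = "^X.*:"        # X, then anything, then a colon
strict = r"^X-\S+:"    # X, a dash, one or more non-whitespace characters, a colon

# Sample lines assumed for illustration.
header = "X-DSPAM-Result: Innocent"
sentence = "X Plane is behind schedule: two weeks"

loose_header = bool(re.search(loose, header))
loose_sentence = bool(re.search(loose, sentence))
strict_header = bool(re.search(strict, header))
strict_sentence = bool(re.search(strict, sentence))

print(loose_header, loose_sentence)
print(strict_header, strict_sentence)
```

The loose pattern accepts both lines. The strict one rejects the sentence because the character after X is a blank rather than a dash.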
  • 13:00 - 13:03
    Okay? So it's more specific and so it
    matches what you want.
  • 13:03 - 13:04
    Now it depends on what you are looking for.
  • 13:04 - 13:05
    Maybe you do want this line,
  • 13:05 - 13:09
    and so you're looking for X. I don't
    know. But if you want, you can be
  • 13:09 - 13:13
    increasingly sophisticated in what
  • 13:13 - 13:15
    you're looking for in a regular
    expression.
  • 13:15 - 13:20
    So now, let's talk about extracting data.
  • 13:20 - 13:24
    So everything we've done so far is,
    is it there or is it not.
  • 13:24 - 13:25
    But it's really common once
  • 13:25 - 13:27
    you find something that you want to
    break it into pieces.
  • 13:27 - 13:32
    So we can combine the searching and the
    parsing into one statement.
  • 13:33 - 13:37
    And instead of using search, which returns
    for us a true/false, we are going to use
  • 13:37 - 13:42
    findall.
    So in this example, I'm going to show
  • 13:42 - 13:51
    you a new syntax. The square bracket in
    regular expression language means
  • 13:51 - 13:53
    a way to list a set of characters.
  • 13:53 - 13:58
    So this says, this is a single character
    that says,
  • 13:58 - 14:00
    I want to match anything in the range
    0 through 9.
  • 14:02 - 14:04
    Plus means one or more of those.
  • 14:04 - 14:09
    So that says, so this is, this whole thing
    says one or more digits.
  • 14:09 - 14:12
    That's a regular expression that says one
    or more digits.
  • 14:12 - 14:13
    You can put other things inside here.
  • 14:15 - 14:16
    You can put like, you know,
  • 14:17 - 14:22
    you could make a thing that says a b c d.
    And that would say, I'm
  • 14:22 - 14:26
    going to match a single character that's
    a or b or c or d. Or you could say like,
  • 14:27 - 14:32
    you know, 1 3 5 7, bracket.
  • 14:32 - 14:33
    That's a single character
  • 14:33 - 14:35
    that's either a 1 or a 3 or a 5 or a 7.
  • 14:35 - 14:37
    So the bracket is a list of matching
  • 14:37 - 14:41
    characters and the dash inside the
    bracket means range.
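The bracket syntax can be tried out with a few short searches. This is only a sketch: the test strings are invented, and group() is the standard match-object method that returns the matched text.

```python
import re

# [0-9] is one character in the range 0 through 9; + repeats it one or more times.
first_number = re.search("[0-9]+", "My 2 favorite numbers are 19 and 42").group()

# [abcd] is one character that is a, b, c, or d; [a-d] says the same with a range.
listed = re.search("[abcd]", "xyzc").group()
ranged = re.search("[a-d]", "xyzc").group()

# [1357] is one character that is 1, 3, 5, or 7; nothing in "2468" qualifies.
odd = re.search("[1357]", "2468")

print(first_number, listed, ranged, odd)
```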
  • 14:41 - 14:45
    We'll see in a second that you can stick a
    not inside the bracket. It's on this.
  • 14:45 - 14:47
    So, so again, remember in this little
  • 14:47 - 14:50
    mini-language, we are programming, right?
  • 14:50 - 14:55
    We are giving instructions to the regular
    expression engine, as it were. Okay?
  • 14:58 - 15:03
    So, if we do this, and here is an
    expression that
  • 15:03 - 15:09
    says I would like to find, you know, things
    that are one or more digits.
  • 15:09 - 15:10
    And so,
  • 15:14 - 15:17
    so it's one or more digits and, and so
    it's going to look
  • 15:17 - 15:19
    through here and it's going to find it as
    many times as it can.
  • 15:21 - 15:24
    So there is one or more digits, there is
    one or more digits,
  • 15:24 - 15:27
    and there is one or more digits.
  • 15:27 - 15:30
    And so what findall gives us back is a
    list of strings.
  • 15:30 - 15:32
    So it found it.
  • 15:32 - 15:33
    Where do I match?
    Where do I match?
  • 15:33 - 15:38
    It's looking the whole time and then,
    it says, oh, I've got it.
  • 15:38 - 15:39
    2, 19, 42.
  • 15:39 - 15:43
    So it actually extracts the strings that
    match
  • 15:43 - 15:47
    and gives you a Python list of strings.
  • 15:47 - 15:48
    Python list of strings.
  • 15:48 - 15:53
    Kind of like split, except it's like a
    super smart split, right?
  • 15:53 - 15:57
    It's split, but I've directed it what to
    look for, and if,
  • 16:01 - 16:05
    so here's an example of, you know, that's
    the one I just did.
  • 16:05 - 16:10
    Find me one or more digits and extract
    them, so 2, 19, 42.
  • 16:10 - 16:14
    Here I'm saying, using the same bracket
    syntax, to look for a single
  • 16:14 - 16:20
    character A, capital A E I O or U, and one
    or more
  • 16:20 - 16:25
    of those. And if you look, there are no
    upper-case vowels in my string.
  • 16:25 - 16:27
    So it says I'm going to find all the
    things that match
  • 16:27 - 16:36
    A E I O U. So things like AA would match
    and, you know, OU would match.
  • 16:37 - 16:39
    And so that's what we, we would get if
    they were in the string.
  • 16:41 - 16:44
    But because there are none, we get an
    empty list.
  • 16:44 - 16:46
    So even if there are none, you get an
    empty list.
  • 16:46 - 16:48
    So it always returns a list.
  • 16:48 - 16:52
    It may be a zero-length list, and that's
    what you have
  • 16:52 - 16:54
    to check. Okay?
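A short sketch of the two findall calls described here. The sample string is an assumption chosen to contain the numbers 2, 19, and 42 and no upper-case vowels. Note that findall hands back a Python list, which is empty when nothing matches.

```python
import re

# Assumed sample string: three numbers, no upper-case vowels.
x = "My 2 favorite numbers are 19 and 42"

# One or more digits, found as many times as possible.
digits = re.findall("[0-9]+", x)

# One or more upper-case vowels; none are present, so the list is empty.
vowels = re.findall("[AEIOU]+", x)

print(digits)
print(vowels)

# An empty list is falsy, so this is the usual way to check for no matches.
if not vowels:
    print("no upper-case vowels found")
```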
  • 17:00 - 17:02
    Okay, now
  • 17:03 - 17:06
    matching has this notion of greedy,
  • 17:07 - 17:10
    where when you put one of these pluses
  • 17:10 - 17:16
    or asterisks it kind of has this outward
    pushing feeling, right?
  • 17:16 - 17:17
    And so when you say,
  • 17:17 - 17:19
    I'm looking for something that starts with
    an
  • 17:19 - 17:22
    F at the beginning of the line, followed
  • 17:22 - 17:24
    by one or more characters, followed by a
  • 17:24 - 17:27
    colon, you can think of this as pushing
    outward.
  • 17:27 - 17:32
    So if we look at a line here that has From
    colon using the colon
  • 17:32 - 17:37
    character, it will try to expand, so it
    certainly has
  • 17:37 - 17:43
    to match the F and it's looking for a
    colon, any number of characters,
  • 17:43 - 17:47
    but it's trying to make the string that
    matches as big as possible.
  • 17:47 - 17:50
    So it skips over this colon and goes to
    that
  • 17:50 - 17:52
    colon and so the thing that we get is
    here.
  • 17:52 - 17:56
    And so, it ignored this and said I will
    make as large a string as I can.
  • 17:57 - 17:59
    So, that that's the plus that's doing it.
  • 17:59 - 18:04
    Dot plus pushes, it's like, I've got a
  • 18:04 - 18:07
    colon, but is there another colon out
    there?
  • 18:07 - 18:09
    So you push it, okay?
  • 18:09 - 18:11
    So that's greedy matching.
  • 18:11 - 18:15
    It can get you in some trouble, like being
    greedy
  • 18:15 - 18:18
    in general, and both asterisk and plus sort
    of behave
  • 18:18 - 18:20
    in a greedy way because they're zero or more
    or one
  • 18:20 - 18:24
    or more characters, so they can sort of
    push outward, okay?
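Greedy matching is easy to see with a line that has two colons. The sample line below is an assumption modeled on the one described.

```python
import re

# Assumed sample line with two colons in it.
x = "From: Using the : character"

# ^F  an F at the beginning of the line
# .+  one or more of any character, pushing outward as far as possible
# :   a colon
greedy = re.findall("^F.+:", x)

print(greedy)
```

The match runs past the first colon and stops at the last one, because `.+` takes as much as it can while still leaving a colon to finish on.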
  • 18:26 - 18:28
    Now you can turn this off.
  • 18:28 - 18:32
    It's a programming language, we can tweak
    it, okay?
  • 18:32 - 18:36
    And so we add a question mark.
  • 18:36 - 18:41
    So this is a three-character sequence now.
    So if you say dot plus question
  • 18:41 - 18:46
    mark, that says one or more of any
    characters, push,
  • 18:46 - 18:52
    but instead of being greedy and pushing as
    far as you can, this means stop
  • 18:52 - 18:57
    at the first. Stop at the first.
  • 18:57 - 18:59
    Oops, stop at the first.
  • 18:59 - 19:02
    I can never draw on this thing fast
    enough.
  • 19:02 - 19:03
    Stop at the first.
  • 19:03 - 19:04
    Okay?
  • 19:04 - 19:06
    And that's it, just don't be greedy, don't
  • 19:06 - 19:08
    try to make the string as large as
    possible.
  • 19:08 - 19:11
    Go with the smaller one, the smaller
    possible one.
  • 19:11 - 19:13
    We still need to find an F, and we still
    need
  • 19:13 - 19:17
    to find a colon, but when you find the
    first colon, stop.
  • 19:17 - 19:19
    And so what this does is this changes it
    so that
  • 19:19 - 19:23
    what we match is From colon instead of
    going all the way.
  • 19:23 - 19:27
    So the greedy match pushes as far as it
    can. The non-greedy match
  • 19:27 - 19:33
    is satisfied with the first thing that
    meets the criterion of the string.
  • 19:33 - 19:36
    So this is a little three-character
    programming sequence,
  • 19:36 - 19:39
    any character one or more times and not
    greedy.
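As a sketch of the non-greedy form, the question mark after the plus makes the match stop at the first colon. The sample line is assumed for illustration.

```python
import re

# Assumed sample line with two colons in it.
x = "From: Using the : character"

# .+? is the three-character sequence: any character, one or more times, not greedy.
non_greedy = re.findall("^F.+?:", x)

print(non_greedy)
```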
  • 19:48 - 19:51
    If, for example, we were trying to solve the
    problem
  • 19:51 - 19:53
    of pulling the email address out of a
    string.
  • 19:55 - 19:55
    Right?
  • 19:57 - 20:01
    We can make good use of this non-blank
    character
  • 20:01 - 20:04
    and so the at sign is just a character and
  • 20:04 - 20:08
    then we can say, I want at least one
    non-blank
  • 20:08 - 20:12
    character before it and at least one
    non-blank character after it.
  • 20:12 - 20:16
    So the way regular expressions does it
    says, okay, I find my at sign and
  • 20:16 - 20:20
    I push in a greedy manner outwards, as
  • 20:20 - 20:22
    long as there are non-blank characters,
    push, push, push, push,
  • 20:22 - 20:27
    push, push, push, oops, stop.
    Push, push, push, push, push, stop.
  • 20:27 - 20:27
    Okay?
  • 20:27 - 20:30
    So it's some number of non-blank
    characters, an
  • 20:30 - 20:33
    at sign, followed by some number of
    non-blank characters.
  • 20:33 - 20:38
    So it's, that's using greedy matching. It,
    it's doing that, okay?
  • 20:38 - 20:41
    And so this is where we get Stephen
    Marquard's address,
  • 20:41 - 20:46
    and we would know if it wasn't there by
    the empty list, right?
  • 20:46 - 20:51
    And so we get stephen.marquard@uct.ac.za.
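Pulling the address out can be sketched like this. The full sample line is an assumption built around the address mentioned above.

```python
import re

# Assumed sample line built around the address from the lecture.
x = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# \S+  one or more non-blank characters, pushing outward greedily
# @    a real at sign
# \S+  one or more non-blank characters
addresses = re.findall(r"\S+@\S+", x)

# A line with no at sign gives back an empty list.
nothing = re.findall(r"\S+@\S+", "Subject: no address here")

print(addresses)
print(nothing)
```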
  • 20:53 - 20:59
    Now, we can also fine-tune what we
    extract, right?
  • 20:59 - 21:05
    In the previous slide, we extracted
    whatever matched.
  • 21:05 - 21:06
    Right?
  • 21:06 - 21:10
    Whatever this matched, it looked across
    the whole string and found it,
  • 21:10 - 21:15
    found the thing, shoved it over, and gave
    us what it matched.
  • 21:15 - 21:19
    But it's possible to make the match larger
    than what's extracted,
  • 21:19 - 21:23
    to extract a subset of the match, and we'll
    see that on this next slide.
  • 21:23 - 21:24
    Okay?
  • 21:24 - 21:30
    So here's this same thing, which is an at
    sign,
  • 21:30 - 21:34
    with non-blank characters as far as the
    eye can see in either direction.
  • 21:34 - 21:37
    But I'm going to add to it caret From
    space.
  • 21:37 - 21:44
    So this has to start at the beginning of the
    line, that's the caret;
  • 21:44 - 21:46
    it's gotta have the word From,
  • 21:46 - 21:51
    it's gotta have one space and then,
    immediately, it's gotta find this, right?
  • 21:51 - 21:54
    It's gotta find a series of non-blanks,
    followed by an at sign,
  • 21:54 - 21:58
    followed by another series of one or
    more non-blanks. And then
  • 21:58 - 22:00
    what we do, so this, if we didn't put
    these parentheses
  • 22:00 - 22:04
    in, it would match and we would get all of
    this data.
  • 22:04 - 22:05
    It would go to here.
  • 22:06 - 22:09
    But what we can do with the parentheses,
    the parentheses are part
  • 22:09 - 22:12
    of the regular expression language,
    saying,
  • 22:12 - 22:15
    okay, I want to match the whole thing.
  • 22:15 - 22:17
    The parentheses aren't part of the care-,
    a string up here.
  • 22:17 - 22:19
    I want to match the whole thing, but
  • 22:19 - 22:21
    I only want to extract this part in
    parentheses.
  • 22:22 - 22:25
    So this whole thing is a regular
    expression that's matched
  • 22:25 - 22:29
    and then the parentheses part is what's
    retrieved for you.
  • 22:29 - 22:32
    And so this makes it so that the only time
    it's going to
  • 22:32 - 22:35
    look for at signs is on lines that
    start with From space.
  • 22:35 - 22:39
    It is going to want the immediate next
    character to be a non-blank.
  • 22:41 - 22:43
    Some number of non-blank characters
    followed by an at sign,
  • 22:43 - 22:46
    some number of non-blank characters, it's
    going to stop right there.
  • 22:46 - 22:48
    And it's only going to extract from here
    to here,
  • 22:48 - 22:51
    and so we get out Stephen Marquard.
  • 22:51 - 22:56
    But this is a pretty narrowly scoped thing
    because
  • 22:56 - 22:58
    the first five characters have to be From
    space.
  • 22:58 - 23:01
    And so that's a way to combine a stricter
    match,
  • 23:01 - 23:04
    even though you don't actually want
    all the data.
  • 23:04 - 23:06
    So you can add those things all over the
    place.
  • 23:06 - 23:09
    Okay? Okay.
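The effect of the parentheses can be sketched by running the pattern with and without them. The sample lines are assumptions for illustration.

```python
import re

# Assumed sample lines.
x = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
other = "Cc: someone@example.com"

# Without parentheses, findall returns everything that matched.
whole = re.findall(r"^From \S+@\S+", x)

# With parentheses, the whole pattern must still match,
# but only the part inside the parentheses is extracted.
extracted = re.findall(r"^From (\S+@\S+)", x)

# A line that does not start with From and a space gives an empty list.
skipped = re.findall(r"^From (\S+@\S+)", other)

print(whole)
print(extracted)
print(skipped)
```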
  • 23:09 - 23:15
    Then, we, we, we can compare the different
    ways of extracting data.
  • 23:15 - 23:20
    So if we look at how we extract the host
    name.
  • 23:20 - 23:23
    Remember how we did this many chapters ago.
  • 23:23 - 23:26
    So we did a data.find, which says oh,
  • 23:26 - 23:30
    the first at sign is at 21.
    So the first at sign is at 21.
  • 23:30 - 23:34
    Then we say we want to find the space
    after that.
  • 23:34 - 23:39
    So we start from the at position, and that's 31.
    And then we want to extract the data
  • 23:39 - 23:44
    that's one beyond the at up to but not
    including the space.
  • 23:46 - 23:48
    And that is the variable that we are
    going to print out, host.
  • 23:48 - 23:52
    And so we've extracted this bit of
    information and out comes the host.
  • 23:52 - 23:53
    Quite nice. Okay?
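The find-and-slice approach can be sketched as below. The sample line is an assumption chosen so that the positions come out as 21 and 31, and the variable names are illustrative.

```python
# Assumed sample line; the at sign sits at index 21 and the next space at 31.
data = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# Position of the first at sign.
atpos = data.find("@")

# Position of the first space at or after the at sign.
sppos = data.find(" ", atpos)

# One beyond the at sign, up to but not including the space.
host = data[atpos + 1:sppos]

print(atpos, sppos, host)
```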
  • 23:54 - 23:57
    We also saw another technique, and by the
    way, all these techniques are okay.
  • 23:59 - 24:00
    All these techniques are fine.
  • 24:00 - 24:02
    Another technique we saw, once we sort of
    played
  • 24:02 - 24:04
    with split and lists, was what we, what I
  • 24:04 - 24:08
    called a double split version of this,
    where the
  • 24:08 - 24:10
    first thing we do is we split that line.
  • 24:12 - 24:16
    The first thing we do is split the line
    on blanks, and then we know
  • 24:19 - 24:24
    that the second thing, which is the
    sub one, words sub one,
  • 24:24 - 24:29
    is the entire email address. Then this is
    the double split.
  • 24:29 - 24:32
    We take the email address and we split it by
  • 24:32 - 24:35
    an at sign and then we get a list of the
  • 24:35 - 24:38
    pieces of the email address, the email
    name and the
  • 24:38 - 24:44
    email host, and then we grab the, the
    sub one of that,
  • 24:44 - 24:45
    and then we have the host.
  • 24:45 - 24:50
    So that's a double, the double split way
    of doing this, right?
  • 24:50 - 24:53
    Now in this, we still haven't done
    the From yet,
  • 24:53 - 24:57
    but it is the double split way to do this.
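A short sketch of the double split version, again assuming the sample line below. The names words, email and pieces are illustrative.

```python
# Sample line in the style of the lecture's mailbox data (assumed content).
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# First split: break the line on blanks; words[1] is the whole email address.
words = data.split()
email = words[1]

# Second split: break the address on the at sign into name and host.
pieces = email.split('@')
host = pieces[1]
print(host)
```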
  • 24:57 - 25:04
    So, if we think about how we would do
    this in a regular expression, okay?
  • 25:04 - 25:12
    We're going to say, look through the
    string, findall, we're going to,
  • 25:12 - 25:15
    use the findall, and the regular
    expression exploded up says
  • 25:16 - 25:21
    look through the string for an at.
    Do, do, do, do, do, do, got an at.
  • 25:21 - 25:26
    Then, oh, start extracting. End extracting.
  • 25:26 - 25:29
    And then this is another form of the
  • 25:29 - 25:31
    this is one character, it's a
  • 25:31 - 25:35
    single character, match any non-blank
    character, and
  • 25:35 - 25:37
    zero or more of them. Okay?
  • 25:37 - 25:42
    So find an at sign, start extracting,
  • 25:42 - 25:48
    end extracting, match, this is one character.
  • 25:48 - 25:54
    That is a set of possible matches, and
    that's some character, this means not.
  • 25:57 - 25:59
    Okay? Not a blank, that's a blank
  • 25:59 - 26:01
    right there, that's a blank character
    right there.
  • 26:01 - 26:04
    Not a blank, as many times as you want.
  • 26:04 - 26:05
    You might want to, we might want to turn
  • 26:05 - 26:08
    that into a plus to guarantee at least one.
  • 26:08 - 26:10
    So that might be better done as a plus
    right there.
  • 26:14 - 26:16
    So this would probably make more sense as a
    plus, to say, I
  • 26:16 - 26:21
    want at least, after the at sign, I want
    at least one non-blank character.
  • 26:26 - 26:31
    And the parentheses simply say, I don't
    want the at sign.
  • 26:31 - 26:36
    So not the at sign itself; I really want those
    non-blank characters after the at sign.
  • 26:36 - 26:39
    Okay? So that's what I want to extract.
  • 26:39 - 26:42
    So it's like, go find the at sign.
  • 26:42 - 26:44
    Okay, great, found the at sign. Start
  • 26:44 - 26:48
    extracting, look for non-blank characters,
    end extracting.
  • 26:48 - 26:50
    So pull that part out and put it right
    there.
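The regular expression version described above, sketched with both the star and the plus. The second string is a made-up example to show where the two differ.

```python
import re

# Sample line in the style of the lecture's mailbox data (assumed content).
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Find an at sign, then extract zero or more non-blank characters after it.
hosts_star = re.findall('@([^ ]*)', data)

# Same idea, but require at least one non-blank character after the at sign.
hosts_plus = re.findall('@([^ ]+)', data)
print(hosts_star, hosts_plus)

# Made-up line where the at sign is followed directly by a blank.
lone = 'meet @ noon'
print(re.findall('@([^ ]*)', lone))  # the star accepts an empty extraction
print(re.findall('@([^ ]+)', lone))  # the plus finds no match at all
```

With the star, a bare at sign still counts as a match and yields an empty string, which is one reason the plus may be the safer choice here.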
  • 26:53 - 26:56
    Now an even cooler version of this that
  • 26:56 - 26:59
    you probably kind of imagined right away is,
  • 27:01 - 27:07
    we say, you know what, I would like this
    first character, the first
  • 27:07 - 27:13
    part of the line to be From, with a blank,
    followed by any number of characters,
  • 27:17 - 27:21
    followed by an at sign, so the at sign is
    real, then start
  • 27:21 - 27:26
    extracting, then any number of non-blank
    characters, end extracting.
  • 27:27 - 27:32
    So this is a, this is like eight or nine
    lines of Python
  • 27:32 - 27:36
    all rolled into one thing, okay?
  • 27:39 - 27:44
    So, start at the beginning of the line.
    Look for string From, with a space.
  • 27:44 - 27:50
    Then skip a bunch of characters looking
    for an at sign, skip characters until
  • 27:50 - 27:53
    you encounter an at sign, then start
  • 27:53 - 27:58
    extracting, match any non-blank, a single
    non-blank character.
  • 27:58 - 28:01
    This is kind of like one non-blank
  • 28:01 - 28:04
    character, one non-blank character, but
    once you
  • 28:04 - 28:08
    suffix it with the asterisk that changes it to
    be many non-blank characters.
  • 28:11 - 28:13
    And then stop extracting, okay?
  • 28:14 - 28:19
    And so, you know, and so it's like find
    From followed by a space, great.
  • 28:21 - 28:22
    That's the first part.
  • 28:22 - 28:25
    Now throw away characters until you find
    an at sign.
  • 28:26 - 28:28
    Then start extracting.
  • 28:28 - 28:31
    Keep going with non-blank characters until
    you hit
  • 28:31 - 28:34
    the first blank character and pull that
    part out.
  • 28:34 - 28:36
    Now the result is we get the exact same
  • 28:36 - 28:42
    data. But with this added to it, we are
    much more narrow in the kind of things
  • 28:42 - 28:47
    that we're looking for and if we get
    noisy data that like, something like,
  • 28:47 - 28:53
    you know, meet at Joe's, right?
    We don't want that.
  • 28:53 - 28:54
    That won't match, right?
  • 28:54 - 28:56
    We want that to be like a False.
  • 28:56 - 28:59
    And, and it allows us to sort of really
    fine-tune our matching
  • 28:59 - 29:03
    and extracting. And this is just the
    beginning, they are very, very powerful.
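A sketch of the narrower pattern, compared with the looser one on a noisy line. The noisy string is a made-up stand-in for the kind of data mentioned above.

```python
import re

# Sample line in the style of the lecture's mailbox data (assumed content).
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Start of line, "From ", skip characters up to an at sign,
# then extract the non-blank characters that follow it.
pattern = '^From .*@([^ ]*)'
host = re.findall(pattern, data)
print(host)

# Made-up noisy line that happens to contain an at sign.
noisy = "Let's meet @Joe's at noon"
loose_hit = re.findall('@([^ ]*)', noisy)   # the looser pattern mis-hits
strict_hit = re.findall(pattern, noisy)     # the narrower pattern does not match
print(loose_hit, strict_hit)
```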
  • 29:03 - 29:09
    So, the last thing that I will show you is
    sort of a program that is kind of like one
  • 29:09 - 29:12
    of the programs that we did in a previous
    section,
  • 29:12 - 29:15
    except now we're going to use regular
    expressions to do it.
  • 29:15 - 29:16
    So if you remember, we had this thing where
  • 29:16 - 29:20
    we're doing spam confidence, where we're
    looking for lines and
  • 29:21 - 29:23
    you know, and pulling this number out and then
  • 29:23 - 29:26
    calculating the average, or the
    maximum, or whatever.
  • 29:26 - 29:32
    And so here is a, we import the regular
    expression library, we open the file,
  • 29:32 - 29:35
    we're going to do this by appending
    to a list.
  • 29:35 - 29:38
    We'll put the numbers in a list rather
    than doing the calculation in a loop.
  • 29:39 - 29:40
    We strip the data.
  • 29:40 - 29:42
    Now, here's the key thing, right?
  • 29:42 - 29:45
    We're going to have a regular expression
    that says,
  • 29:46 - 29:49
    look for the first character being X,
    followed by
  • 29:49 - 29:51
    a dash, followed by all this,
    all this
  • 29:51 - 29:55
    exactly has to match literally, followed
    by a colon.
  • 29:55 - 30:01
    And then there's a space, and then we
    begin extracting and we are looking for
  • 30:01 - 30:06
    the digits 0 through 9 or a dot and
    we are looking for one or
  • 30:06 - 30:10
    more, and then we end extracting.
  • 30:10 - 30:13
    So that's the, the parentheses are telling
    us what to pull out.
  • 30:13 - 30:15
    So that just means that we're going to
    pull out those numbers, all
  • 30:15 - 30:18
    the digits and the periods, until we get
    something other than
  • 30:18 - 30:21
    a digit or a period,
  • 30:21 - 30:24
    and then we'll be done, okay?
  • 30:24 - 30:30
    And so if we, and so this is going to pull
    those numbers out and give us back a list.
  • 30:30 - 30:31
    Now the thing about it is, we have
  • 30:31 - 30:35
    to realize that sometimes this is not
    going to match, because
  • 30:35 - 30:38
    we're sending every line, not just the
    ones that start
  • 30:38 - 30:41
    with X, we're sending every line through
    this and so
  • 30:41 - 30:44
    we need to know when we didn't get a
    match.
  • 30:44 - 30:48
    And that, the way we know we didn't get a
    match is if the list, the
  • 30:48 - 30:52
    number of items in the list that we got
    back, is zero, then we're going to continue.
  • 30:52 - 30:57
    So this is kind of our if where we're
    searching for the needle in the haystack.
  • 30:57 - 31:00
    But then once we find what we are looking
  • 31:00 - 31:02
    for, the actual number that we are
    interested in,
  • 31:05 - 31:08
    is already sitting here in stuff sub zero.
    Okay?
  • 31:08 - 31:11
    And then we convert it to a float, we append it.
  • 31:11 - 31:14
    And when the loop is all done, we print
    out the maximum.
  • 31:14 - 31:15
    Okay?
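The program walked through above might look roughly like the sketch below. To keep it self-contained it reads from a small list of made-up lines instead of opening the mailbox file, and the header name X-DSPAM-Confidence is assumed from the spam confidence lines used earlier in the course. The function name max_confidence is illustrative.

```python
import re

def max_confidence(lines):
    """Return the largest spam-confidence value found, or None if there is none."""
    numlist = []
    for line in lines:
        line = line.rstrip()
        # Must start with the literal header, a colon and a space;
        # the parentheses extract one or more digits or dots.
        stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
        # Every line goes through findall, so most lines give an empty list.
        if len(stuff) == 0:
            continue
        numlist.append(float(stuff[0]))
    return max(numlist) if numlist else None

# Made-up sample lines standing in for the mailbox file.
sample = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n',
    'X-DSPAM-Confidence: 0.8475\n',
    'X-DSPAM-Probability: 0.0000\n',
    'X-DSPAM-Confidence: 0.9907\n',
]
result = max_confidence(sample)
print(result)
```

Collecting the numbers in a list first means the same loop can feed max, min, or an average afterwards.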
  • 31:15 - 31:17
    And so this is sort of encoding a number
    of things
  • 31:17 - 31:22
    and ending up with a very, a very solid and
    safe matching.
  • 31:22 - 31:26
    So we're really, it's hard for this to
    find a line that's wrong and
  • 31:26 - 31:30
    you could even improve this a little bit
    to make it even a little tighter
  • 31:30 - 31:35
    where we'd go find a number like 0.999.
    You could say, oh, it's
  • 31:35 - 31:41
    all the numbers are zero dot, so
  • 31:41 - 31:47
    you could make this a little, a little more
    precise.
  • 31:47 - 31:49
    So it would even skip things that
    don't look
  • 31:49 - 31:53
    exactly the way you want them to
    look.
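One possible way to tighten the pattern along these lines, offered as an assumption rather than the version on the slide: require a literal zero and dot before the digits. The test strings are made up.

```python
import re

# Looser pattern: any run of digits and dots.
loose = r'^X-DSPAM-Confidence: ([0-9.]+)'
# Tighter pattern (an assumption): a zero, a literal dot, then one or more digits.
tight = r'^X-DSPAM-Confidence: (0\.[0-9]+)'

good = 'X-DSPAM-Confidence: 0.999'
odd = 'X-DSPAM-Confidence: 1.2.3'

print(re.findall(loose, good), re.findall(tight, good))
# The loose pattern accepts a value that float() could not convert;
# the tight pattern skips that line entirely.
print(re.findall(loose, odd), re.findall(tight, odd))
```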
  • 31:53 - 31:55
    So, I emphasize that this
  • 31:55 - 31:57
    is kind of a weird language and you need
    some kind of thing.
  • 31:57 - 31:59
    We talked about all these.
  • 31:59 - 32:02
    We have the beginning of the line, we have
    the end
  • 32:02 - 32:04
    of the line, matching any character,
  • 32:04 - 32:08
    matching whitespace characters, matching
    non-whitespace characters.
  • 32:08 - 32:13
    Star is a modifier that says zero or more
    times.
  • 32:13 - 32:18
    Star question mark is a modifier that says
    zero or more times non-greedy.
  • 32:18 - 32:21
    Plus is one or more times.
  • 32:21 - 32:25
    Plus question mark is one or more times
    non-greedy.
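A small sketch of greedy versus non-greedy matching. The sample string is made up so that it contains two colons.

```python
import re

# Made-up string with two colons.
text = 'From: Using the : character'

# Greedy: .+ grabs as much as it can, so the match runs to the last colon.
greedy = re.findall('^F.+:', text)

# Non-greedy: .+? grabs as little as it can, so the match stops at the first colon.
lazy = re.findall('^F.+?:', text)
print(greedy, lazy)
```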
  • 32:25 - 32:27
    When you have bracket syntax, it's a set,
  • 32:27 - 32:31
    it's a single character that's in the
    listed set.
  • 32:31 - 32:33
    So that's lower-case vowels.
  • 32:34 - 32:35
    You can also have the first, if the first
  • 32:35 - 32:39
    character of this is a caret, that flips it.
  • 32:39 - 32:43
    So that means everything except capital
    X, capital Y, capital Z.
  • 32:43 - 32:45
    So it's everything that's not in the set,
  • 32:45 - 32:48
    capital X, capital Y, capital Z, and then
  • 32:48 - 32:51
    you can also put dashes in to represent
    ranges.
  • 32:51 - 32:53
    With bracket a through z and 0 through 9,
    lower-case
  • 32:53 - 32:58
    letters and digits will match, but again,
    this is a single character.
  • 32:58 - 33:01
    Now, you can put a plus or a star after
  • 33:01 - 33:04
    these guys to make them happen more than
    one time.
  • 33:04 - 33:06
    And you can even put them in twice.
  • 33:06 - 33:12
    So if I wanted a two-digit number, I could
    say 0 dash 9, 0 dash 9.
  • 33:13 - 33:15
    Oops. This is one character.
  • 33:15 - 33:18
    This is one character and this is the
    possible things.
  • 33:18 - 33:22
    So that's, you know, 0 0
    would match.
  • 33:22 - 33:26
    1 0 would match, 99 would match, etc.
  • 33:26 - 33:27
    Okay?
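The bracket sets described above, tried out on a few made-up strings.

```python
import re

# A set matches one character from the list: lower-case vowels here.
vowels = re.findall('[aeiou]', 'regular')

# A leading caret flips the set: anything except X, Y or Z.
not_xyz = re.findall('[^XYZ]', 'XaYbZ')

# Dashes give ranges; the plus repeats the single-character match.
runs = re.findall('[a-z0-9]+', 'Chapter 11 regex')

# Two sets in a row match exactly two digits.
two_digit = re.findall('[0-9][0-9]', 'rooms 00, 10 and 99')

print(vowels, not_xyz, runs, two_digit)
```

Note that the capital C in "Chapter" is not in the lower-case range, so the first run starts at the h.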
  • 33:29 - 33:32
    And then the parentheses are the things
    that if you
  • 33:32 - 33:34
    are in the middle of a big long matching
    string and
  • 33:34 - 33:37
    you don't want to extract the whole thing,
    you can limit the
  • 33:37 - 33:40
    things you're extracting to, to the stuff
    that's just in there.
  • 33:41 - 33:44
    With all these characters that have all
    this meaning,
  • 33:44 - 33:46
    we have to have a way to match those
    characters.
  • 33:46 - 33:50
    So dollar sign is the end of a line.
  • 33:50 - 33:52
    But what if we're looking for something that
  • 33:52 - 33:53
    actually has a dollar sign in the string?
  • 33:55 - 33:57
    And that's what the backslash is for.
  • 33:57 - 33:58
    So if you put the backslash in front of
  • 33:58 - 34:04
    an otherwise meaningful character,
    it becomes the actual character.
  • 34:04 - 34:07
    So this is saying match a dollar sign.
  • 34:07 - 34:09
    Those two characters say match a dollar
    sign.
  • 34:09 - 34:14
    And then this says one character that's
    0 through 9 or a, or a dot.
  • 34:14 - 34:17
    And then we put the plus modifier to say
  • 34:17 - 34:20
    one or more times, and that
    is
  • 34:20 - 34:21
    greedy, of course.
  • 34:21 - 34:25
    So that will get us this and extract it,
    including the dollar sign.
  • 34:25 - 34:28
    So the escape character is the backslash.
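A sketch of the escaped dollar sign. The sentence being searched is a made-up example.

```python
import re

# Made-up sentence containing a dollar amount.
x = 'We just received $10.00 for cookies.'

# \$ matches a real dollar sign; [0-9.]+ then takes digits and dots greedily.
amounts = re.findall(r'\$[0-9.]+', x)
print(amounts)

# Without the backslash, $ means end of line, so nothing can follow it.
unescaped = re.findall('$[0-9.]+', x)
print(unescaped)
```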
  • 34:29 - 34:31
    Okay. So there we are.
  • 34:31 - 34:32
    Now we're done.
  • 34:32 - 34:35
    So this is a little bit cryptic.
  • 34:35 - 34:38
    It's, it's kind of a puzzle.
  • 34:38 - 34:39
    It's kind of fun.
  • 34:39 - 34:43
    And it's extremely powerful.
    And you don't have to know it.
  • 34:43 - 34:44
    You don't have to learn it.
  • 34:45 - 34:49
    But if you do, you'll find that it's very
    useful as we sort
  • 34:49 - 34:53
    of dig through data and are trying to
    write things that are pretty quick.
  • 34:53 - 34:59
    And, and, and they, the thing I like about
    regular expressions is that they
  • 34:59 - 35:03
    tend to be, if you write them well, they
    tend to be less sensitive to bad data.
  • 35:05 - 35:07
    They tend to ignore data; you
  • 35:07 - 35:10
    can put in more detail: I want exactly this.
  • 35:10 - 35:10
    Whereas,
  • 35:10 - 35:12
    if you're writing find and extract, you're
  • 35:12 - 35:14
    making a lot of assumptions about the
    data.
  • 35:14 - 35:17
    That it's clean and you're not going to,
    you know, mis-hit on something.
  • 35:17 - 35:22
    So, okay, well, good luck getting
  • 35:22 - 35:24
    used to regular expressions, and we'll
    see you later.
Title:
Python for Informatics - Chapter 11 - Regular Expressions
Description:

Regular Expressions http://www.pythonlearn.com/
All Lectures:
http://www.youtube.com/playlist?list=PLlRFEj9H3Oj4JXIwMwN1_ss1Tk8wZShEJ

Video Language:
English
Team:
Captions Requested
Duration:
35:24