0:00:00.360,0:00:03.288 Hello and welcome to Chapter 11, Regular[br]Expressions 0:00:03.288,0:00:06.660 from the book Python for Informatics:[br]Exploring Information. 0:00:07.730,0:00:12.290 As always, these slides are copyright[br]Creative Commons Attribution, as well as 0:00:12.290,0:00:15.280 the audio and the video that you're[br]watching or listening to right now. 0:00:16.329,0:00:22.870 And, so regular expressions are an[br]interesting thing. 0:00:22.870,0:00:25.530 You've seen from, in the chapters up till[br]now, I've, 0:00:25.530,0:00:30.520 I've had a singular focus on sort of[br]pulling information out of data. 0:00:30.520,0:00:34.290 Raw data, this mailbox file that perhaps[br]you're getting tired of already. 0:00:34.290,0:00:35.860 But it's a lot of fun, because I can have 0:00:35.860,0:00:38.030 you go look for something and, and[br]pick it out. 0:00:38.030,0:00:42.240 And you're doing something that like would[br]be really painful to do sort of by hand. 0:00:45.090,0:00:47.060 And while it's not all of computing, I[br]mean, there's games 0:00:47.060,0:00:50.670 and there's, you know, things like[br]weather computations that do calculations, 0:00:52.460,0:00:56.730 pulling and extracting data out is a big[br]part of computing. 0:00:56.730,0:01:01.350 And so there's actually a library that's[br]built specifically to do this. 0:01:01.350,0:01:06.230 And, and if you start doing a few finds[br]and slicing, it gets kind of 0:01:06.230,0:01:08.110 long after a while and that's like split,[br]for example, 0:01:08.110,0:01:10.590 really saved us a lot of time. 0:01:10.590,0:01:13.760 But sometimes the data that you are[br]looking for is a little 0:01:13.760,0:01:18.130 more sophisticated than broken into spaces[br]or colons or something like that. 0:01:18.130,0:01:21.050 And you just want to like tell something[br]to go find 0:01:21.050,0:01:25.960 I see what I want, and I see where it's[br]embedded in the string, go get it for me. 
0:01:25.960,0:01:29.160 And regular expressions are themselves a[br]programming language. 0:01:29.160,0:01:33.680 They're like a really smart wild card for[br]searching. 0:01:33.680,0:01:35.230 So we've used wild cards in various 0:01:35.230,0:01:40.180 things in search, but they're, they're a[br]really smart version of a wild card. 0:01:42.010,0:01:47.040 And so, regular expressions are quite[br]powerful and they're very cryptic. 0:01:47.040,0:01:49.080 And as a matter of fact, you don't even[br]need 0:01:49.080,0:01:50.740 to learn them if you don't feel like it,[br]right? 0:01:51.870,0:01:53.420 I've got this little guide. 0:01:53.420,0:01:56.040 I need a guide for myself when I do[br]regular expressions. 0:01:56.040,0:01:58.380 It sometimes takes me a few minutes to[br]write 0:01:58.380,0:02:00.270 a regular expression to do exactly what I[br]want. 0:02:00.270,0:02:05.380 So in a way, writing a regular expression[br]is like program, writing a program. 0:02:05.380,0:02:08.930 It's highly specialized to searching and[br]extracting data from strings. 0:02:08.930,0:02:11.500 But it's like writing a program and it[br]takes a while to get 0:02:11.500,0:02:15.408 it right and you kind of like, oh, change[br]this, what about a slash there? 0:02:15.408,0:02:18.130 And, so, you, but they actually are kind[br]of fun. 0:02:18.130,0:02:22.160 And, and they are a great way to sort of[br]exchange little program snippets 0:02:22.160,0:02:25.330 to say, oh yeah, I'm looking for this, oh[br]here's a little reg expression you might 0:02:25.330,0:02:28.380 try and then, so they're, they're like[br]programs themselves. 0:02:29.660,0:02:32.540 It is this language of marker characters,[br]so when we 0:02:32.540,0:02:37.210 look for regular expressions, some[br]characters like A, B, C, have meaning 0:02:37.210,0:02:40.750 as A, B, C but some characters like caret or[br]dollar sign mean 0:02:40.750,0:02:42.880 at the beginning of the line, or at the[br]end of the line. 
0:02:42.880,0:02:47.420 And so we encode in this string a, a[br]program, basically. 0:02:47.420,0:02:50.940 And so it's a rather old-school language.[br]It's from 0:02:50.940,0:02:51.610 long time. 0:02:51.610,0:02:55.460 It predates Python, which is over 20 years[br]old, and so 0:02:55.460,0:03:00.630 it's, it also marks you as sort of a[br]little cool, right? 0:03:00.630,0:03:03.570 It's a, it's a distinct marking that makes 0:03:03.570,0:03:06.320 it so that you know something other people[br]don't. 0:03:06.320,0:03:09.560 Right? So you can know how to program, but[br]if you know regular expressions 0:03:09.560,0:03:13.380 it'll be like woah, I tried to look at those[br]and they're kind of tough. 0:03:13.380,0:03:16.030 In a way, knowing regular expressions is 0:03:16.030,0:03:17.980 kind of like a tattoo. 0:03:17.980,0:03:20.790 So I, it's casual Friday and that's why[br]I'm wearing a T-shirt 0:03:20.790,0:03:24.030 today and so I figured I would come in[br]today in a T-shirt, 0:03:24.030,0:03:26.250 but seeing as it's the first time I'm wearing[br]a short-sleeved shirt, it's 0:03:26.250,0:03:29.450 also the first time I can show you my,[br]show my real tattoo here. 0:03:29.450,0:03:32.590 So, here's my real tattoo and in the[br]middle is Sakai, 0:03:32.590,0:03:36.160 the open source learning management system[br]always close to my heart. 0:03:36.160,0:03:37.780 And then you have the IMS logo, which 0:03:37.780,0:03:41.205 is IMS Learning Tools Interoperability,[br]which a standard, 0:03:41.205,0:03:46.438 it means a lot to me.[br]Blackboard, OLAT, Learning Objects, Angel, 0:03:46.438,0:03:51.790 Moodle, Instructure, Jenzabar, and[br]Desire2Learn. 
0:03:51.790,0:03:54.420 I call this the ring of compliance,[br]because these are all 0:03:54.420,0:03:59.800 of the first six or seven learning[br]management systems that complied 0:03:59.800,0:04:00.910 with the IMS Learning Tools 0:04:00.910,0:04:03.210 Interoperability standards[br]specification, which is 0:04:03.210,0:04:06.250 something that I spent a lot of my life[br]making work. 0:04:06.250,0:04:06.940 So 0:04:06.940,0:04:09.750 I figured I'd make a tattoo and just[br]kind of 0:04:09.750,0:04:12.810 part of my rough, tough image and,[br]and actually 0:04:12.810,0:04:15.945 regular expressions are indeed part of my[br]rough, tough image, 0:04:15.945,0:04:18.870 because I'm like, I'm down with[br]regular expressions. 0:04:18.870,0:04:22.800 And people are like impressed with my[br]regular expression knowledge. 0:04:22.800,0:04:26.710 But as impressive as I am, I still need a[br]cheat sheet, so I'll have a cheat 0:04:26.710,0:04:29.230 sheet that you can download hopefully on[br]the pythonlearn 0:04:29.230,0:04:31.950 website or whatever, and I just, it 0:04:31.950,0:04:32.750 doesn't have to be much. 0:04:32.750,0:04:36.370 It's really just a kind of a, a crutch,[br]and these are the characters that have 0:04:36.370,0:04:38.440 special meaning, like caret or[br]dollar sign 0:04:38.440,0:04:41.030 match the beginning or end of line,[br]respectively. 0:04:41.030,0:04:44.310 So they're not really matching a dollar[br]sign, they match, they, 0:04:44.310,0:04:47.492 they mean something in our little mini[br]string-like programming language. 0:04:48.800,0:04:52.910 So, like many things that we do in Python[br]going forward, once you want some 0:04:52.910,0:04:55.500 sophisticated capability, it comes with[br]Python, but 0:04:55.500,0:04:57.610 it comes in the form of a library. 
0:04:57.610,0:05:00.870 And so for the regular expression library we[br]have to say import re 0:05:00.870,0:05:04.110 at the beginning of our programs to import[br]the regular expression library. 0:05:04.110,0:05:06.380 Then we call re.search to say I'm 0:05:06.380,0:05:09.240 looking for search from the regular[br]expression library. 0:05:09.240,0:05:11.592 There are two basic functions,[br]two basic 0:05:11.592,0:05:14.232 capabilities inside this library that[br]we're going to look at. 0:05:14.232,0:05:18.940 One is search, which replaces find, it's[br]like a smart find, and then 0:05:18.940,0:05:24.130 findall is a combination of a smart find[br]and an automatic extraction. 0:05:24.130,0:05:25.670 So we'll look at both of those in turn. 0:05:25.670,0:05:28.760 And I'll do it by comparing them to[br]existing 0:05:28.760,0:05:31.230 Python that you should[br]already know at this point. 0:05:34.320,0:05:37.080 So here's some code that's, say, looking[br]for lines that 0:05:37.080,0:05:40.100 have the string From[br]colon in them. 0:05:40.100,0:05:43.540 Right, so, we're going to open a file,[br]we're going to strip the white space. 0:05:43.540,0:05:47.620 We use find to hunt within line for[br]From colon. 0:05:47.620,0:05:51.410 If it's greater than or equal to zero then[br]we'll print it. And so find 0:05:51.410,0:05:55.010 is just going to give us a number. If it's[br]not found, it's negative one. 0:05:55.010,0:05:58.040 So it's only going to print the lines[br]that have From colon in them. 0:05:58.040,0:05:59.520 Here is the equivalent using 0:05:59.520,0:06:03.180 regular expressions.[br]So these two things are equivalent. 0:06:03.180,0:06:04.820 So we have to import the library, like I 0:06:04.820,0:06:07.430 mentioned before, and all the rest of it's[br]the same. 0:06:07.430,0:06:10.930 The if test is re.search.
That says within 0:06:10.930,0:06:15.260 the library re, call the search utility[br]and then 0:06:15.260,0:06:17.950 pass in the line, the string we're looking[br]for 0:06:17.950,0:06:20.480 and the line, the actual text we're[br]looking in. 0:06:20.480,0:06:24.920 So this is like look for From inside of[br]line and return me a 0:06:24.920,0:06:28.930 True or a False, whichever, depending on[br]whether you find it or not. 0:06:28.930,0:06:32.800 Now you might say, I, you just got done[br]telling me that it, it was more dense. 0:06:32.800,0:06:34.730 And the answer is, there's a few more[br]characters here. 0:06:34.730,0:06:36.070 But we'll see in a second how you 0:06:36.070,0:06:39.080 can quickly add more power to the regular[br]expression. 0:06:39.080,0:06:40.730 Find, you have to start adding more 0:06:40.730,0:06:42.910 Python lines to make it more sophisticated[br]where in 0:06:42.910,0:06:45.950 the regular expression you start changing, 0:06:45.950,0:06:49.950 you change the search string to give more of 0:06:49.950,0:06:51.940 the direction of what you're looking for,[br]and that's what 0:06:51.940,0:06:54.550 we'll be doing, pretty much, is changing[br]the search string. 0:06:54.550,0:06:58.420 So now if we wanted to switch to say,[br]wait, wait, wait, we don't 0:06:58.420,0:07:02.900 just want the From anywhere in the line,[br]we want it to start with From. 0:07:02.900,0:07:05.730 So we would change[br]line.startswith('From'), 0:07:05.730,0:07:06.530 and that's either going to be true or false 0:07:06.530,0:07:10.490 depending on whether or not the[br]line starts with From. 0:07:10.490,0:07:11.920 Now, we do the same thing with 0:07:11.920,0:07:14.720 regular expressions by changing the[br]search string. 0:07:15.950,0:07:17.290 So now we are in regular expressions. 
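As a rough sketch of the comparison just described, the following puts the string-method version and the regular expression version side by side. The sample lines are made up for illustration and stand in for the mailbox file; note that `re.search` returns a match object or `None`, which an `if` test treats as true or false.

```python
import re

# Made-up sample lines standing in for the mailbox file (an assumption).
lines = [
    "From: stephen.marquard@uct.ac.za",
    "Subject: note From: an older thread",
    "X-DSPAM-Result: Innocent",
]

# Plain string method: find() returns -1 when the text is absent.
found_with_find = [ln for ln in lines if ln.find("From:") >= 0]

# re.search() returns a match object (truthy) or None (falsy).
found_with_search = [ln for ln in lines if re.search("From:", ln)]

# Anchoring to the start of the line: startswith() versus a caret.
starts_with_str = [ln for ln in lines if ln.startswith("From:")]
starts_with_re = [ln for ln in lines if re.search("^From:", ln)]

print(found_with_search)
print(starts_with_re)
```

In both pairs the two approaches select the same lines; only the search string changes when the requirement changes.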
0:07:17.290,0:07:19.980 So this really isn't just a string, it's a[br]string plus 0:07:19.980,0:07:21.660 characters that are interpreted as 0:07:21.660,0:07:24.348 commands by the regular expression[br]library. 0:07:24.348,0:07:27.970 So the caret, which is the first one on[br]our, 0:07:27.970,0:07:31.830 our little regular expression sheet, matches[br]the beginning of the line. 0:07:31.830,0:07:32.865 It doesn't match an actual caret. 0:07:32.865,0:07:37.353 So that says, this[br]two-character sequence, caret F, 0:07:37.353,0:07:40.909 means F but in column one, in the first[br]character of the line. 0:07:40.909,0:07:43.110 And so, again, this is going to give us a 0:07:43.110,0:07:46.434 True or a False, if this regular[br]expression matches 0:07:46.434,0:07:49.902 the beginning of the line, From: and[br]it's the same as 0:07:49.902,0:07:54.442 this, does it start with From colon.[br]So again, these two are equivalent. 0:07:54.442,0:08:00.238 But you see the pattern where we're[br]going to do something to this string using 0:08:00.238,0:08:05.912 these characters that have meaning, okay?[br]So, the next thing that's 0:08:05.912,0:08:11.918 most commonly done other than caret and[br]dollar sign for the end of line, is 0:08:11.918,0:08:16.195 the wildcard characters and so, we've used[br]wildcards 0:08:16.195,0:08:19.512 possibly in like DOS, where we can use ? 0:08:19.512,0:08:25.132 or * in like a dir command. dir *.* if[br]you're familiar with that, 0:08:25.132,0:08:29.508 or even a Unix command like ls, you[br]know, star dot whatever. 0:08:29.508,0:08:31.518 This is not how regular expressions 0:08:31.518,0:08:33.729 work. The difference is that dot 0:08:33.729,0:08:38.020 matches any single character in[br]regular expressions. 0:08:38.020,0:08:41.450 Asterisk means repeat the previous item any number of times.
0:08:41.450,0:08:46.620 So if I look at[br]this and color-code this to make a 0:08:46.620,0:08:52.050 little more sense, the caret is actually[br]part of the 0:08:52.050,0:08:56.555 regular expression[br]programming language. It says 0:08:56.555,0:08:58.910 I'm a virtual character matching the[br]beginning of line. 0:08:58.910,0:09:00.620 The X is a real character. 0:09:00.620,0:09:04.590 The dot is part of the regular expression[br]programming language, any character. 0:09:04.590,0:09:07.590 Star is part of the regular expression[br]programming language, it says 0:09:07.590,0:09:12.220 the immediately previous character many[br]times, zero or more times. 0:09:12.220,0:09:14.850 And then colon matches the colon. 0:09:14.850,0:09:19.910 And so if you look at lines, these are the[br]kinds of lines that will give me a True. 0:09:19.910,0:09:22.380 Because they start with an X, 0:09:22.380,0:09:25.750 followed by some number of characters,[br]followed by a colon. 0:09:25.750,0:09:26.900 So that's true. 0:09:26.900,0:09:30.990 Start with an X, followed by some number of[br]characters, followed by a colon. 0:09:30.990,0:09:32.270 Okay? 0:09:32.270,0:09:35.180 And so that's basically how this works. 0:09:35.180,0:09:38.840 And so in this 0:09:38.840,0:09:42.150 five-character string, you know,[br]some of 0:09:42.150,0:09:44.320 these things are like instructions and[br]some of 0:09:44.320,0:09:46.440 them are the actual characters we're[br]looking for. 0:09:46.440,0:09:47.670 So the X and the colon 0:09:47.670,0:09:49.060 are the characters we're looking 0:09:49.060,0:09:55.000 for, and the caret, dot, and star are[br]programming. 0:09:55.000,0:09:57.450 Right? They are logic that we're adding[br]to the string. 0:09:59.990,0:10:00.620 Okay. 0:10:00.620,0:10:04.840 So let's say, for example, you're... 
[br]Part of any of these things, 0:10:04.840,0:10:07.340 and part of the stuff we have done so far, 0:10:07.340,0:10:10.530 has to assume that the data is some[br]level of being clean and 0:10:10.530,0:10:14.440 so the data that I have been giving you,[br]mbox.txt, is not inconsistent. 0:10:15.480,0:10:17.571 Right? It doesn't have like too much[br]weirdness in it. 0:10:17.571,0:10:20.121 I'm not trying to trick you and[br]mislead you, although 0:10:20.121,0:10:22.824 we've had situations where you sort of get[br]a traceback because 0:10:22.824,0:10:25.017 you think there's going to be five words[br]you, you grab a line, 0:10:25.017,0:10:27.567 you break it, and there's only two[br]words and then you get 0:10:27.567,0:10:31.250 a traceback because you're looking at the[br]fifth word, or something like that. 0:10:32.580,0:10:35.380 But if your data is less clean, or even[br]you just are 0:10:35.380,0:10:39.890 want to be real careful, you can[br]fine-tune your matching. 0:10:39.890,0:10:42.520 So, here's that same match. 0:10:42.520,0:10:45.120 Give me a character X, followed by any[br]number of 0:10:45.120,0:10:48.090 characters, followed by a colon, and that's[br]what I'm looking for. 0:10:48.090,0:10:50.100 Give me lines that match that pattern. 0:10:50.100,0:10:52.215 So this X starts at any number of[br]characters, 0:10:52.215,0:10:55.290 colon, great, this, any number of[br]characters good, great. 0:10:55.290,0:10:57.422 Oh wait, and there's an email X that says 0:10:57.422,0:11:01.020 X Plane is two weeks behind sch, behind[br]schedule, colon, two weeks. 0:11:01.020,0:11:05.610 Well, the regular expression didn't know[br]that the dash made sense to you. 0:11:05.610,0:11:07.300 And you just assumed that everything that[br]started 0:11:07.300,0:11:09.490 with a capital X had a dash after it. 0:11:09.490,0:11:15.130 So X is what it starts with, any number of[br]any character, and then 0:11:15.130,0:11:17.430 a colon. So this becomes True. 
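A minimal sketch of the loose pattern discussed above, caret X dot star colon. The sample lines are assumptions chosen to resemble the mailbox headers; the third one is the kind of line that matches even though it was probably not intended.

```python
import re

# Made-up sample lines (an assumption, not taken from the real file).
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X Plane is behind schedule: two weeks",
    "Received: from murder",
]

# ^ start of line, X literal, .* any characters zero or more times, : literal
loose_pattern = "^X.*:"
loose_matches = [ln for ln in lines if re.search(loose_pattern, ln)]

for ln in loose_matches:
    print(ln)
```

All three lines beginning with X are selected, including the "X Plane" line, because the pattern says nothing about a dash.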
0:11:17.430,0:11:21.940 This may not make you happy, right? It may[br]not be what you're looking for. 0:11:21.940,0:11:26.290 Because you haven't been specific enough[br]in your regular expression. 0:11:26.290,0:11:30.550 So, we can be more specific in our regular[br]expression. 0:11:30.550,0:11:35.310 So for example, this is a more specific[br]regular expression. 0:11:35.310,0:11:40.390 It still says start with an X as the first[br]character, then a dash, 0:11:40.390,0:11:43.220 that's a real character not a, then this 0:11:43.220,0:11:47.455 next thing, instead of being a dot, this[br]backslash capital S. 0:11:47.455,0:11:49.510 It's on the sheet. 0:11:49.510,0:11:51.410 Whoa. It's not on the sheet. 0:11:51.410,0:11:53.900 I lost the sheet. Come back, sheet. 0:11:54.900,0:11:55.400 I lost the sheet. 0:11:56.070,0:11:58.730 I can't live without my sheet. 0:12:00.820,0:12:06.180 Backslash capital S means a[br]non-whitespace character. 0:12:06.180,0:12:09.040 So that means spaces won't match. 0:12:09.040,0:12:14.430 And then I changed the asterisk, zero or[br]more times thing, to a plus. 0:12:14.430,0:12:16.340 And that means one or more times. 0:12:16.340,0:12:20.440 Here is a character, a non-whitespace.[br]These two things kind of work together. 0:12:20.440,0:12:25.170 A non-whitespace character at least one[br]time, as many as we like. 0:12:25.170,0:12:26.230 And then, a colon. 0:12:27.390,0:12:30.680 So, if we look here, it starts with X dash, 0:12:30.680,0:12:35.430 any number of non-whitespace[br]characters, and ends in colon. 0:12:35.430,0:12:37.150 Starts with X dash, any number 0:12:37.150,0:12:39.850 of non-whitespace characters, ends[br]in a colon. 0:12:39.850,0:12:41.520 True. True. 0:12:41.520,0:12:45.610 This one starts with an X, but doesn't[br]start with an X dash. 0:12:45.610,0:12:49.340 Oh, as a matter of fact, these characters[br]are blanks, so this becomes a False. 
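The stricter pattern described above can be sketched the same way. The sample lines are again made up for illustration; a raw string is used so the backslash in `\S` reaches the regular expression library unchanged.

```python
import re

# Made-up sample lines (an assumption).
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X Plane is behind schedule: two weeks",
]

# ^ start of line, X- literal, \S+ one or more non-whitespace characters, : literal
strict_pattern = r"^X-\S+:"
strict_matches = [ln for ln in lines if re.search(strict_pattern, ln)]

for ln in strict_matches:
    print(ln)
```

The "X Plane" line no longer matches: it has no dash after the X, and the blanks before the colon cannot be matched by `\S`.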
0:12:49.340,0:12:52.710 It does have an X and it does have a colon[br]and it matched the previous pattern, 0:12:52.710,0:12:55.500 but this one here is more specific. 0:12:59.720,0:13:02.680 Okay? So it's more specific and so it[br]matches what you want. 0:13:02.680,0:13:04.000 Now it depends on what you are looking for. 0:13:04.000,0:13:05.090 Maybe you do want this line, 0:13:05.090,0:13:08.740 and so you're looking for X. I don't[br]know. But if you want, you can be 0:13:08.740,0:13:12.770 increasingly sophisticated in what 0:13:12.770,0:13:15.000 you're looking for in a regular[br]expression. 0:13:15.000,0:13:19.948 So now, let's talk about extracting data. 0:13:19.948,0:13:23.550 So everything we've done so far is,[br]is it there or is it not. 0:13:23.550,0:13:24.740 But it's really common once 0:13:24.740,0:13:27.130 you find something that you want to[br]break it into pieces. 0:13:27.130,0:13:31.560 So we can combine the searching and the[br]parsing into one statement. 0:13:32.590,0:13:36.710 And instead of using search, which gives[br]us a true/false, we are going to use 0:13:36.710,0:13:41.870 findall.[br]So in this example, I'm going to show 0:13:41.870,0:13:51.010 you a new syntax. The square bracket in[br]regular expression language is 0:13:51.010,0:13:52.848 a way to list a set of characters. 0:13:52.848,0:13:57.620 So this matches a single character.[br]It says, 0:13:57.620,0:14:00.490 I want to match anything in the range[br]0 through 9. 0:14:01.920,0:14:04.110 Plus means one or more of those. 0:14:04.110,0:14:08.560 So this whole thing[br]says one or more digits. 0:14:08.560,0:14:11.590 That's a regular expression that says one[br]or more digits. 0:14:11.590,0:14:13.310 You can put other things inside here.
0:14:14.820,0:14:16.040 You can put like, you know, 0:14:17.280,0:14:21.670 you could make a thing that says a b c d.[br]And that would say, I'm 0:14:21.670,0:14:26.090 going to match a single character that's[br]a or b or c or d. Or you could say like, 0:14:26.950,0:14:32.300 you know, 1 3 5 7, bracket. 0:14:32.300,0:14:33.180 That's a single character 0:14:33.180,0:14:35.030 that's either a 1 or a 3 or a 5 or a 7. 0:14:35.030,0:14:37.080 So the bracket is a list of matching 0:14:37.080,0:14:41.350 characters and the dash inside the[br]bracket means range. 0:14:41.350,0:14:44.605 We'll see in a second that you can stick a[br]not inside the bracket. It's on this. 0:14:44.605,0:14:47.330 So, so again, remember in this little 0:14:47.330,0:14:49.920 mini-language, we are programming, right? 0:14:49.920,0:14:54.660 We are giving instructions to the regular[br]expression engine, as it were. Okay? 0:14:58.070,0:15:03.370 So, if we do this, and here is an[br]expression that 0:15:03.370,0:15:09.330 says I would like to find, you know, things[br]that are one or more digits. 0:15:09.330,0:15:09.890 And so, 0:15:13.700,0:15:16.640 so it's one or more digits and, and so[br]it's going to look 0:15:16.640,0:15:19.450 through here and it's going to find it as[br]many times as it can. 0:15:20.550,0:15:24.470 So there is one or more digits, there is[br]one or more digits, 0:15:24.470,0:15:26.720 and there is one or more digits. 0:15:26.720,0:15:30.400 And so what findall gives us back is a[br]list of strings. 0:15:30.400,0:15:31.800 So it found it. 0:15:31.800,0:15:33.180 Where do I match?[br]Where do I match? 0:15:33.180,0:15:37.830 It's looking the whole time and then,[br]it says, oh, I've got it. 0:15:37.830,0:15:39.410 2, 19, 42. 0:15:39.410,0:15:43.400 So it actually extracts the strings that[br]match 0:15:43.400,0:15:46.590 and gives you a Python list of strings. 0:15:46.590,0:15:48.035 Python list of strings. 
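Here is a small sketch of the findall extraction just walked through. The sentence is a made-up sample chosen so that the digits 2, 19, and 42 appear in it; the variable names are illustrative only.

```python
import re

# Made-up sample text containing the numbers mentioned above (an assumption).
text = "My 2 favorite numbers are 19 and 42"

# [0-9] is one character from the range 0 through 9; + means one or more.
numbers = re.findall("[0-9]+", text)
print(numbers)  # ['2', '19', '42']

# A bracket can also list individual characters: a single 1, 3, 5, or 7.
listed = re.findall("[1357]", text)
print(listed)  # ['1']
```

The results are strings, not integers, so a conversion such as `int(n)` would still be needed before doing arithmetic on them.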
0:15:48.035,0:15:53.360 Kind of like split, except it's like a[br]super smart split, right? 0:15:53.360,0:15:56.940 It's split, but I've directed it what to[br]look for. 0:16:01.320,0:16:04.530 So here's an example of, you know, that's[br]the one I just did. 0:16:04.530,0:16:10.320 Find me one or more digits and extract[br]them, so 2, 19, 42. 0:16:10.320,0:16:14.330 Here I'm saying, using the same bracket[br]syntax, to look for a single 0:16:14.330,0:16:19.900 character, capital A E I O or U, and one[br]or more 0:16:19.900,0:16:24.520 of those. And if you look, there are no[br]upper-case vowels in my string. 0:16:24.520,0:16:26.850 So it says I'm going to find all the[br]things that match 0:16:26.850,0:16:35.880 A E I O U. So things like AA would match[br]and, you know, OU would match. 0:16:36.990,0:16:39.430 And so that's what we would get if[br]they were in the string. 0:16:40.520,0:16:43.830 But because there are none, we get an[br]empty list. 0:16:43.830,0:16:45.640 So even if there are none, you get an[br]empty list. 0:16:45.640,0:16:48.260 So it always returns a list. 0:16:48.260,0:16:51.910 It may be a zero-length list, and that's[br]what you have 0:16:51.910,0:16:54.466 to check. Okay? 0:17:00.466,0:17:02.426 Okay, now 0:17:03.426,0:17:05.730 matching has this notion of greedy, 0:17:06.730,0:17:10.119 where when you put one of these pluses 0:17:10.119,0:17:15.650 or asterisks it kind of has this outward[br]pushing feeling, right? 0:17:15.650,0:17:17.300 And so when you say, 0:17:17.300,0:17:19.300 I'm looking for something that starts with[br]an 0:17:19.300,0:17:21.500 F at the beginning of the line, followed 0:17:21.500,0:17:23.700 by one or more characters, followed by a 0:17:23.700,0:17:27.210 colon, you can think of this as pushing[br]outward.
0:17:27.210,0:17:32.100 So if we look at a line here that has From[br]colon using the colon 0:17:32.100,0:17:37.400 character, it will try to expand, so it[br]certainly has 0:17:37.400,0:17:42.590 to match the F and it's looking for[br]any number of characters and a colon, 0:17:42.590,0:17:46.950 but it's trying to make the string that[br]matches as big as possible. 0:17:46.950,0:17:49.730 So it skips over this colon and goes to[br]that 0:17:49.730,0:17:51.950 colon and so the thing that we get is[br]here. 0:17:51.950,0:17:56.110 And so, it ignored this and said I will[br]make as large a string as I can. 0:17:57.270,0:17:59.490 So, that's the plus that's doing it. 0:17:59.490,0:18:04.100 Dot plus pushes, it's like, I've got a 0:18:04.100,0:18:06.660 colon, but is there another colon out[br]there? 0:18:06.660,0:18:09.010 So you push it, okay? 0:18:09.010,0:18:10.970 So that's greedy matching. 0:18:10.970,0:18:14.860 It can get you in some trouble, like being[br]greedy 0:18:14.860,0:18:18.210 in general, and both asterisk and plus sort[br]of behave 0:18:18.210,0:18:20.420 in a greedy way because they're zero or more[br]or one 0:18:20.420,0:18:24.240 or more characters, so they can sort of[br]push outward, okay? 0:18:26.330,0:18:28.110 Now you can turn this off. 0:18:28.110,0:18:31.800 It's a programming language, we can tweak[br]it, okay? 0:18:31.800,0:18:35.790 And so we add a question mark. 0:18:35.790,0:18:40.830 So this is a three-character sequence now.[br]So if you say dot plus question 0:18:40.830,0:18:46.070 mark, that says one or more of any[br]characters, push, 0:18:46.070,0:18:51.680 but instead of being greedy and pushing as[br]far as you can, this means stop 0:18:51.680,0:18:57.167 at the first. Stop at the first. 0:18:57.167,0:18:59.450 Oops, stop at the first. 0:18:59.450,0:19:01.800 I can never draw on this thing fast[br]enough. 0:19:01.800,0:19:03.260 Stop at the first. 0:19:03.260,0:19:04.020 Okay?
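The difference between the greedy and non-greedy forms can be sketched on a single string. The sample text is an assumption built to contain two colons, as in the example being described.

```python
import re

# Made-up sample with two colons (an assumption).
text = "From: Using the : character"

# Greedy: .+ pushes outward to the last colon it can reach.
greedy = re.findall("^F.+:", text)
print(greedy)  # ['From: Using the :']

# Non-greedy: .+? stops at the first colon that completes the match.
non_greedy = re.findall("^F.+?:", text)
print(non_greedy)  # ['From:']
```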
0:19:04.020,0:19:05.910 And that's it, just don't be greedy, don't 0:19:05.910,0:19:08.260 try to make the string as large as[br]possible. 0:19:08.260,0:19:11.170 Go with the smaller one, the smaller[br]possible one. 0:19:11.170,0:19:13.150 We still need to find an F, and we still[br]need 0:19:13.150,0:19:16.620 to find a colon, but when you find the[br]first colon, stop. 0:19:16.620,0:19:18.850 And so what this does is this changes it[br]so that 0:19:18.850,0:19:22.690 what we match is from colon instead of[br]going all the way. 0:19:22.690,0:19:26.920 So the greedy match pushes as far as it[br]can. The non-greedy match 0:19:26.920,0:19:32.700 is satisfied with the first thing that[br]meets the criterion of the string. 0:19:32.700,0:19:35.780 So this is a little three-character[br]programming sequence, 0:19:35.780,0:19:38.780 any character one or more times and not[br]greedy. 0:19:48.460,0:19:50.570 If, for example, we were trying to solve the[br]problem 0:19:50.570,0:19:53.360 of pulling the email address out of a[br]string. 0:19:54.510,0:19:55.010 Right? 0:19:57.260,0:20:00.880 We can make good use of this non-blank[br]character 0:20:00.880,0:20:04.350 and so the at sign is just a character and 0:20:04.350,0:20:07.680 then we can say, I want at least one[br]non-blank 0:20:07.680,0:20:11.500 character before it and at least one[br]non-blank character after it. 0:20:11.500,0:20:15.980 So the way regular expressions does it[br]says, okay, I find my at sign and 0:20:15.980,0:20:19.800 I push in a greedy manner outwards, as 0:20:19.800,0:20:22.170 long as there are non-blank characters,[br]push, push, push, push, 0:20:22.170,0:20:26.590 push, push, push, oops, stop.[br]Push, push, push, push, push, stop. 0:20:26.590,0:20:27.270 Okay? 0:20:27.270,0:20:30.460 So it's some number of non-blank[br]characters, an 0:20:30.460,0:20:33.040 at sign, followed by some number of[br]non-blank characters. 0:20:33.040,0:20:38.080 So it's, that's using greedy matching. 
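A short sketch of the email extraction described above, using non-blank characters on both sides of the at sign. The header line is a sample written to resemble a mailbox From line; treat it and the variable names as assumptions.

```python
import re

# Sample header line resembling the mailbox data (an assumption).
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# \S+ pushes outward greedily on both sides of the @ until it hits a blank.
addresses = re.findall(r"\S+@\S+", line)
print(addresses)  # ['stephen.marquard@uct.ac.za']

# When nothing matches, findall returns an empty list.
none_found = re.findall(r"\S+@\S+", "no address on this line")
print(none_found)  # []
```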
It,[br]it's doing that, okay? 0:20:38.080,0:20:41.380 And so this is where we get Stephen[br]Marquard, 0:20:41.380,0:20:45.870 and we would know if an address wasn't there by[br]the empty list, right? 0:20:45.870,0:20:51.040 And so we get stephen.marquard@uct.ac.za. 0:20:53.040,0:20:59.350 Now, we can also fine-tune what we[br]extract, right? 0:20:59.350,0:21:05.470 In the previous slide, we extracted[br]whatever matched. 0:21:05.470,0:21:06.070 Right? 0:21:06.070,0:21:10.310 Whatever this matched, it looked across[br]the whole string and found it, 0:21:10.310,0:21:14.630 found the thing, shoved it over, and gave[br]us what it matched. 0:21:14.630,0:21:18.580 But it's possible to make the match larger[br]than what's extracted, 0:21:18.580,0:21:22.860 to extract a subset of the match, and we'll[br]see that on this next slide. 0:21:22.860,0:21:23.790 Okay? 0:21:23.790,0:21:29.950 So here's this same thing, which is an at[br]sign 0:21:29.950,0:21:33.890 with non-blank characters as far as the[br]eye can see in either direction. 0:21:33.890,0:21:37.448 But I'm going to add to it caret From[br]space. 0:21:37.448,0:21:44.468 So the match has to start at the[br]beginning of the line, that's the caret, 0:21:44.468,0:21:45.810 it's gotta have the word From, 0:21:45.810,0:21:50.560 it's gotta have one space and then,[br]immediately, it's gotta find this, right? 0:21:50.560,0:21:53.500 It's gotta find a series of non-blanks,[br]followed by an at sign, 0:21:53.500,0:21:57.620 followed by another series of one or[br]more non-blanks. And then 0:21:57.620,0:22:00.490 what we do, so this, if we didn't put[br]these parentheses 0:22:00.490,0:22:03.900 in, it would match and we would get all of[br]this data. 0:22:03.900,0:22:04.780 It would go to here.
0:22:05.900,0:22:09.220 But what we can do with the parentheses,[br]the parentheses are part 0:22:09.220,0:22:12.330 of the regular expression language,[br]saying, 0:22:12.330,0:22:14.620 okay, I want to match the whole thing. 0:22:14.620,0:22:17.190 The parentheses aren't part of the care-,[br]a string up here. 0:22:17.190,0:22:18.550 I want to match the whole thing, but 0:22:18.550,0:22:20.620 I only want to extract this part in[br]parentheses. 0:22:21.800,0:22:24.960 So this whole thing is a regular[br]expression that's matched 0:22:24.960,0:22:28.680 and then the parentheses part is what's[br]retrieved for you. 0:22:28.680,0:22:31.620 And so this makes it so that the only time[br]it's going to 0:22:31.620,0:22:35.140 look for at signs is, are on lines that[br]start with From space. 0:22:35.140,0:22:39.220 It is going to want the immediate next[br]character to be a non-blank. 0:22:40.588,0:22:42.920 Some number of non-blank characters[br]followed by an at sign, 0:22:42.920,0:22:45.580 some number of non-blank characters, it's[br]going to stop right there. 0:22:45.580,0:22:48.110 And it's only going to extract from here[br]to here, 0:22:48.110,0:22:50.560 and so we get out Stephen Marquard. 0:22:50.560,0:22:55.860 But this is a pretty narrowly scoped thing[br]because 0:22:55.860,0:22:57.690 the first four characters have to be From[br]space. 0:22:57.690,0:23:00.642 And so that's a way to combine a stricter[br]match, 0:23:00.642,0:23:03.970 even though you don't actually want[br]all the data. 0:23:03.970,0:23:05.858 So you can add those things all over the[br]place. 0:23:05.858,0:23:09.330 Okay? Okay. 0:23:09.330,0:23:15.450 Then, we, we, we can compare the different[br]ways of extracting data. 0:23:15.450,0:23:19.730 So if we look at how we extract the host[br]name. 0:23:19.730,0:23:23.200 Remember how we did this many chapters ago. 
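The parenthesized extraction described above can be sketched as follows. The two sample lines are assumptions; the second shows that an address on a line not starting with From and a space is ignored.

```python
import re

# Sample lines (assumptions made up for illustration).
from_line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
other_line = "Reply-To: someone@example.com"

# The whole pattern must match, but only the parenthesized part is returned.
pattern = r"^From (\S+@\S+)"

extracted = re.findall(pattern, from_line)
print(extracted)  # ['stephen.marquard@uct.ac.za']

skipped = re.findall(pattern, other_line)
print(skipped)  # []
```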
0:23:23.200,0:23:26.085 So we did a data.find, which says oh, 0:23:26.085,0:23:29.850 the first at sign is at 21.[br]So the first at sign is at 21. 0:23:29.850,0:23:34.330 Then we say we want to find the space[br]after that. 0:23:34.330,0:23:38.970 So that's the at position, that's 31.[br]And then we want to extract the data 0:23:38.970,0:23:44.460 that's one beyond the at up to but not[br]including the space. 0:23:45.710,0:23:47.540 And that is the variable that we are[br]going to print out, host. 0:23:47.540,0:23:51.610 And so we've extracted this bit of[br]information and out comes the host. 0:23:51.610,0:23:52.880 Quite nice. Okay? 0:23:53.880,0:23:57.310 We also saw another technique, and by the[br]way, all these techniques are okay. 0:23:58.680,0:24:00.320 All these techniques are fine. 0:24:00.320,0:24:02.300 Another technique we saw, once we sort of[br]played 0:24:02.300,0:24:04.300 with split and lists, was what we, what I 0:24:04.300,0:24:07.910 called a double split version of this,[br]where the 0:24:07.910,0:24:09.740 first thing we do is we split that line. 0:24:11.890,0:24:15.740 The first thing we do is split the line[br]and then we know, and blanks, 0:24:19.040,0:24:23.750 that the second thing, which is the[br]sub one, words sub one, 0:24:23.750,0:24:28.720 is the entire email address. Then this is[br]the double split. 0:24:28.720,0:24:32.260 We take the email address and we split it by 0:24:32.260,0:24:34.950 an at sign and then we get a list of the 0:24:34.950,0:24:38.180 pieces of the email address, the email[br]name and the 0:24:38.180,0:24:44.000 email host, and then we grab the, the[br]sub one of that, 0:24:44.000,0:24:45.420 and then we have the host. 0:24:45.420,0:24:49.532 So that's a double, the double split way[br]of doing this, right? 0:24:49.532,0:24:53.292 Now in this, we still haven't done[br]the From yet, 0:24:53.292,0:24:57.151 but it is the double split way to do this. 
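The two string-method approaches described above, find with slicing and the double split, can be sketched like this. The line is a sample resembling the mailbox data, and the variable names are illustrative assumptions.

```python
# Sample header line resembling the mailbox data (an assumption).
data = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# Approach 1: find the at sign, find the next space, then slice between them.
atpos = data.find("@")           # position of the at sign
sppos = data.find(" ", atpos)    # first space after the at sign
host_by_slice = data[atpos + 1:sppos]
print(atpos, sppos, host_by_slice)  # 21 31 uct.ac.za

# Approach 2: the double split. Split on blanks, then split the address on @.
words = data.split()
email = words[1]
pieces = email.split("@")
host_by_split = pieces[1]
print(host_by_split)  # uct.ac.za
```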
0:24:57.151,0:25:03.501 So, if we think about how we would do[br]this in a regular expression, okay? 0:25:03.501,0:25:12.321 We're going to say, look through the[br]string, findall, we're going to, 0:25:12.321,0:25:15.365 use the findall, and the regular[br]expression exploded up says 0:25:16.365,0:25:20.830 look through the string for an at.[br]Do, do, do, do, do, do, got an at. 0:25:20.940,0:25:25.970 Then, oh, start extracting. End extracting. 0:25:25.970,0:25:28.520 And then this is another form of the 0:25:28.520,0:25:31.150 this is one character, it's a 0:25:31.150,0:25:35.300 single character, match any non-blank[br]character, and 0:25:35.300,0:25:37.340 zero or more of them. Okay? 0:25:37.340,0:25:42.224 So find an at sign, start extracting, 0:25:42.224,0:25:47.980 end extracting, match, this is one character. 0:25:47.980,0:25:53.740 That is a set of possible matches, and[br]that's some character, this means not. 0:25:56.880,0:25:58.990 Okay? Not a blank, that's a blank 0:25:58.990,0:26:01.100 right there, that's a blank character[br]right there. 0:26:01.100,0:26:03.900 Not a blank, as many times as you want. 0:26:03.900,0:26:05.050 You might want to, we might want to turn 0:26:05.050,0:26:07.520 that into a plus to guarantee at least one. 0:26:07.520,0:26:09.780 So that might be better done as a plus[br]right there. 0:26:13.680,0:26:15.880 So this is, probably make more sense as a[br]plus, to say, I 0:26:15.880,0:26:21.030 want at least, after the at sign, I want[br]at least one non-blank character. 0:26:26.210,0:26:30.800 And the parentheses simply say, I don't[br]want the at sign. 0:26:30.800,0:26:35.620 So if the at sign, I really want those[br]non-blank characters after the at sign. 0:26:35.620,0:26:38.550 Okay? So that's what I want to extract. 0:26:38.550,0:26:41.870 So it's like, go find the at sign. 0:26:41.870,0:26:43.640 Okay, great, found the at sign. Start 0:26:43.640,0:26:48.000 extracting, look for non-blank characters,[br]end extracting. 
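The walkthrough above corresponds to a single findall call. The sketch below assumes the same sample From line and shows both the star and the plus form of the pattern. The `noisy` string is a made-up example added only to show where the two forms differ.

```python
import re

# Sample line in the style of the mailbox file (assumed for this sketch).
data = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# Reading the pattern piece by piece:
#   @      find a literal at sign
#   (      start extracting
#   [^ ]   one character that is not a blank
#   *      zero or more of those ("+" would require at least one)
#   )      end extracting
hosts_star = re.findall("@([^ ]*)", data)
hosts_plus = re.findall("@([^ ]+)", data)
print(hosts_star)  # ['uct.ac.za']
print(hosts_plus)  # ['uct.ac.za']

# Hypothetical noisy input: the at sign is followed directly by a blank.
noisy = "meet @ noon"
noisy_star = re.findall("@([^ ]*)", noisy)
noisy_plus = re.findall("@([^ ]+)", noisy)
print(noisy_star)  # [''] because the star accepts zero characters
print(noisy_plus)  # []   because the plus insists on at least one
```

The empty string in the star result is why the plus is probably the better choice here.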
0:26:48.000,0:26:50.440 So pull that part out and put it right[br]there. 0:26:53.010,0:26:56.290 Now an even cooler version of this that 0:26:56.290,0:26:59.070 you probably kind of imagined right away is, 0:27:01.360,0:27:07.470 we say, you know what, I would like this[br]first character, the first 0:27:07.470,0:27:13.350 part of the line to be From, with a blank,[br]followed by any number of characters, 0:27:17.160,0:27:20.930 followed by an at sign, so the at sign is[br]real, then start 0:27:20.930,0:27:25.870 extracting, then any number of non-blank[br]characters, end extracting. 0:27:27.350,0:27:32.420 So this is a, this is like eight or nine[br]lines of Python 0:27:32.420,0:27:35.750 all rolled into one thing, okay? 0:27:38.800,0:27:44.200 So, start at the beginning of the line.[br]Look for string From, with a space. 0:27:44.200,0:27:50.030 Then skip a bunch of characters looking[br]for an at sign, skip characters until 0:27:50.030,0:27:53.370 you encounter an at sign, then start 0:27:53.370,0:27:58.430 extracting, match any non-blank, a single[br]non-blank character. 0:27:58.430,0:28:00.642 This is kind of like one non-blank 0:28:00.642,0:28:03.860 character, one non-blank character, but[br]once you 0:28:03.860,0:28:08.500 suffix it with the asterisk that changes it to[br]be many non-blank characters. 0:28:10.600,0:28:13.020 And then stop extracting, okay? 0:28:14.050,0:28:19.430 And so, you know, and so it's like find[br]From followed by a space, great. 0:28:20.590,0:28:22.250 That's the first part. 0:28:22.250,0:28:25.130 Now throw away characters until you find[br]an at sign. 0:28:26.130,0:28:28.110 Then start extracting. 0:28:28.110,0:28:31.480 Keep going with non-blank characters until[br]you hit 0:28:31.480,0:28:34.180 the first blank characters and pull that[br]part out. 0:28:34.180,0:28:35.790 Now the result is we get the exact same 0:28:35.790,0:28:42.070 data. 
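The stricter version just described might look like the sketch below. The pattern is written from the spoken description (beginning of line, From and a space, any characters, an at sign, then the extracted non-blanks). The second input line is a made-up piece of noisy data, and `PATTERN` and `results` are names chosen for this illustration.

```python
import re

# Pattern assembled from the description above:
#   ^From   the line must start with "From" followed by a space
#   .*      skip any number of characters
#   @       a literal at sign
#   ([^ ]*) extract the run of non-blank characters after it
PATTERN = "^From .*@([^ ]*)"

lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008",
    "Let's meet @Joe's after class",  # hypothetical noisy line
]

results = [re.findall(PATTERN, line) for line in lines]
for line, found in zip(lines, results):
    print(found)
# ['uct.ac.za']
# []
```

One thing to keep in mind: the `.*` is greedy, so on a line with several at signs it skips ahead to the last one before extracting.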
But with this added to it, we are[br]much more narrow in the kind of things 0:28:42.070,0:28:46.690 that we're looking for and if we get[br]noisy data that like, something like, 0:28:46.690,0:28:52.820 you know, meet at Joe's, right?[br]We don't want that. 0:28:52.820,0:28:53.840 That won't match, right? 0:28:53.840,0:28:55.950 We want that to be like a False. 0:28:55.950,0:28:59.400 And, and it allows us to sort of really[br]fine-tune our matching 0:28:59.400,0:29:02.950 and extracting. And this is just the[br]beginning, they are very, very powerful. 0:29:02.950,0:29:08.850 So, the last thing that I will show you is[br]sort of a program that is kind of like one 0:29:08.850,0:29:11.830 of the programs that we did in a previous[br]section, 0:29:11.830,0:29:14.560 except now we're going to use regular[br]expressions to do it. 0:29:14.560,0:29:16.260 So if you remember, we had this thing where 0:29:16.260,0:29:19.910 we're doing spam confidence, where we're[br]looking for lines and 0:29:21.450,0:29:23.310 you know, and pulling this number out and then 0:29:23.310,0:29:26.430 calculating the average, or the[br]maximum, or whatever. 0:29:26.430,0:29:31.640 And so here is a, we import the regular[br]expression library, we open the file, 0:29:31.640,0:29:35.290 we're going to do this with the, appending[br]to the, a list, we'll put the list. 0:29:35.290,0:29:37.720 We'll put the numbers in a list rather[br]than doing the calculation in a loop. 0:29:39.180,0:29:40.310 We strip the data. 0:29:40.310,0:29:42.160 Now, here's the key thing, right? 0:29:42.160,0:29:44.830 We're going to have a regular expression[br]that says, 0:29:46.200,0:29:49.020 look for the first character being X,[br]followed by 0:29:49.020,0:29:51.060 a dash, followed by all this,[br]all this 0:29:51.060,0:29:54.740 exactly has to match literally, followed[br]by a colon. 
0:29:54.740,0:30:00.950 And then there's a space, and then we[br]begin extracting and we are looking for 0:30:00.950,0:30:06.430 the digit 0 through 9 or a dot and[br]we are looking for one or 0:30:06.430,0:30:09.780 more, and then we end extracting. 0:30:09.780,0:30:12.720 So that's the, the parentheses are telling[br]us what to pull out. 0:30:12.720,0:30:15.400 So that just means that we're going to[br]pull out those numbers, all 0:30:15.400,0:30:18.070 the digits and the numbers, until we get[br]something other, I mean, 0:30:18.070,0:30:21.010 all the digits and the period, and we'll[br]get something other than 0:30:21.010,0:30:24.380 a digit and a period, and we, and then[br]we'll be done, okay? 0:30:24.380,0:30:30.030 And so if we, and so this is going to pull[br]those numbers out and give us back a list. 0:30:30.030,0:30:31.470 Now the thing about it is, we have 0:30:31.470,0:30:34.710 to realize that sometimes this is not[br]going to match, because 0:30:34.710,0:30:37.700 we're sending every line, not just the[br]ones that start 0:30:37.700,0:30:41.200 with X, we're sending every line through[br]this and so 0:30:41.200,0:30:44.260 we need to know when we didn't get a[br]match. 0:30:44.260,0:30:48.000 And that, the way we know we didn't get a[br]match is if the list, the 0:30:48.000,0:30:52.460 number of items in the list that we got[br]back, is zero, then we're going to continue. 0:30:52.460,0:30:56.990 So this is kind of our if where we're[br]searching for the needle in the haystack. 0:30:56.990,0:31:00.010 But then once we find what we are looking 0:31:00.010,0:31:02.450 for, the actual number that we are[br]interested in, 0:31:04.560,0:31:07.980 is already sitting here in stuff sub zero.[br]Okay? 0:31:07.980,0:31:10.570 And then we convert it to a float, we append it. 0:31:10.570,0:31:14.100 And when the loop is all done, we print[br]out the maximum. 0:31:14.100,0:31:14.810 Okay? 
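The program walked through above can be sketched as follows, in Python 3. Two assumptions are worth stating: the header name `X-DSPAM-Confidence` is taken from the book's spam confidence exercise rather than spelled out in this audio, and a short in-memory list of sample lines stands in for the opened mailbox file so the sketch runs on its own.

```python
import re

# Hypothetical sample lines standing in for the mailbox file.
lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n",
    "X-DSPAM-Confidence: 0.8475\n",
    "X-DSPAM-Probability: 0.0000\n",
    "Subject: [sakai] svn commit\n",
    "X-DSPAM-Confidence: 0.6178\n",
    "X-DSPAM-Confidence: 0.9907\n",
]

numlist = []
for line in lines:
    line = line.rstrip()
    # Literal header, colon, space, then extract one or more digits or dots.
    stuff = re.findall("^X-DSPAM-Confidence: ([0-9.]+)", line)
    if len(stuff) == 0:
        continue                     # no match on this line, skip it
    numlist.append(float(stuff[0]))  # stuff sub zero is the number as a string

print(max(numlist))                  # 0.9907
```

Every line goes through findall, and the length check does the job that a startswith test did in the earlier version. If the numbers are always of the form zero dot something, one possible tightening would be to require a literal `0` and an escaped dot before the remaining digits.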
0:31:14.810,0:31:17.180 And so this is sort of encoding a number[br]of things 0:31:17.180,0:31:22.100 and ending up with a very solid and[br]safe match. 0:31:22.100,0:31:25.910 So it's hard for this to[br]find a line that's wrong and 0:31:25.910,0:31:29.590 you could even improve this a little bit[br]to make it even a little tighter 0:31:29.590,0:31:35.338 where we'd go find a number like 0.999.[br]You could say, oh, 0:31:35.338,0:31:41.042 all the numbers are zero dot, so 0:31:41.042,0:31:46.750 you could make this a little more[br]precise. 0:31:46.750,0:31:49.453 So it would even skip[br]things that don't look like that; 0:31:49.453,0:31:52.580 you can make it match exactly the[br]way you want it to look. 0:31:52.580,0:31:54.690 So, I emphasize that this 0:31:54.690,0:31:57.380 is kind of a weird language and you need[br]some kind of guide. 0:31:57.380,0:31:58.917 We talked about all these. 0:31:58.917,0:32:01.500 We have the beginning of the line, we have[br]the end 0:32:01.500,0:32:03.831 of the line, matching any character, 0:32:03.831,0:32:07.617 matching whitespace characters, matching[br]non-whitespace characters. 0:32:07.617,0:32:12.750 Star is a modifier that says zero or more[br]times. 0:32:12.750,0:32:18.326 Star question mark is a modifier that says[br]zero or more times non-greedy. 0:32:18.326,0:32:20.703 Plus is one or more times. 0:32:20.703,0:32:24.537 Plus question mark is one or more times[br]non-greedy. 0:32:24.537,0:32:27.275 When you have bracket syntax, it's a set, 0:32:27.275,0:32:30.610 it's a single character that's in the[br]listed set. 0:32:30.610,0:32:32.530 So that's lower-case vowels. 0:32:33.710,0:32:35.280 If the first 0:32:35.280,0:32:38.680 character of this is a caret, that flips it. 0:32:38.680,0:32:42.850 So that means everything except capital[br]X, capital Y, capital Z.
0:32:42.850,0:32:45.403 So it's everything that's not in the set, 0:32:45.403,0:32:47.956 capital X, capital Y, capital Z, and then 0:32:47.956,0:32:51.080 you can also put dashes in to represent[br]ranges. 0:32:51.080,0:32:53.390 Bracket a through z and 0 through 9,[br]and lower-case 0:32:53.390,0:32:58.450 letters and digits will match, but again,[br]this is a single character. 0:32:58.450,0:33:00.750 Now, you can put a plus or a star after 0:33:00.750,0:33:04.440 these guys to make them happen more than[br]one time. 0:33:04.440,0:33:05.680 And you can even put them in twice. 0:33:05.680,0:33:12.240 So if I wanted a two-digit number, I could[br]say 0 dash 9, 0 dash 9. 0:33:12.810,0:33:14.869 Oops. This is one character. 0:33:14.869,0:33:18.350 This is one character and this is the[br]possible things. 0:33:18.350,0:33:22.340 So that's, you know, 0 0[br]would match. 0:33:22.340,0:33:26.276 1 0 would match, 99 would match, etc. 0:33:26.276,0:33:26.980 Okay? 0:33:29.020,0:33:31.980 And then the parentheses are the things[br]that if you 0:33:31.980,0:33:34.340 are in the middle of a big long matching[br]string and 0:33:34.340,0:33:37.250 you don't want to extract the whole thing,[br]you can limit the 0:33:37.250,0:33:40.470 things you're extracting to, to the stuff[br]that's just in there. 0:33:41.480,0:33:43.990 With all these characters that have all[br]this meaning, 0:33:43.990,0:33:46.310 we have to have a way to match those[br]characters. 0:33:46.310,0:33:50.100 So dollar sign is the end of a line. 0:33:50.100,0:33:51.840 But what if we're looking for something that 0:33:51.840,0:33:53.360 actually has a dollar sign in the string? 0:33:54.760,0:33:56.830 And that's what the backslash is for. 0:33:56.830,0:33:58.470 So if you put the backslash in front of 0:33:58.470,0:34:04.320 a otherwise meaningful character, you[br]don't, it becomes the actual character. 0:34:04.320,0:34:06.970 So this is saying match a dollar sign. 
0:34:06.970,0:34:09.250 Those two characters say match a dollar[br]sign. 0:34:09.250,0:34:13.699 And then this says one character that's[br]0 through 9 or a dot. 0:34:13.699,0:34:16.940 And then we put the plus modifier to say 0:34:16.940,0:34:19.920 one or more times, and that[br]is 0:34:19.920,0:34:21.360 greedy, of course. 0:34:21.360,0:34:25.179 So that will get us this and extract it,[br]including the dollar sign. 0:34:25.179,0:34:28.270 So the escape character is the backslash. 0:34:29.290,0:34:31.179 Okay. So there we are. 0:34:31.179,0:34:32.370 Now we're done. 0:34:32.370,0:34:34.550 So this is a little bit cryptic. 0:34:34.550,0:34:38.040 It's kind of a puzzle. 0:34:38.040,0:34:38.760 It's kind of fun. 0:34:38.760,0:34:42.850 And it's extremely powerful.[br]And you don't have to know it. 0:34:42.850,0:34:43.750 You don't have to learn it. 0:34:45.239,0:34:48.880 But if you do, you'll find that it's very[br]useful as we sort 0:34:48.880,0:34:53.239 of dig through data and are trying to[br]write things that are pretty quick. 0:34:53.239,0:34:58.520 And the thing I like about[br]regular expressions is that, 0:34:58.520,0:35:03.480 if you write them well, they[br]tend to be less sensitive to bad data. 0:35:04.670,0:35:06.610 They tend to ignore bad data, because you 0:35:06.610,0:35:09.795 can put in more detail: I want exactly this. 0:35:09.795,0:35:10.170 Whereas 0:35:10.170,0:35:12.240 if you're writing find and extract, you're 0:35:12.240,0:35:14.290 making a lot of assumptions about the[br]data: 0:35:14.290,0:35:17.440 that it's clean and you're not going to,[br]you know, mis-hit on something. 0:35:17.440,0:35:21.510 So, okay, well, good luck, and get 0:35:21.510,0:35:23.540 used to regular expressions, and we'll[br]see you later.
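As a recap of the quick guide, here is a small Python 3 sketch that exercises the greedy and non-greedy modifiers, the bracket sets, ranges, and the backslash escape. All the sample strings and variable names are made up for illustration and are only meant to show each rule in isolation.

```python
import re

# Greedy versus non-greedy: "+" grabs as much as it can, "+?" as little.
text = "From: Using the : character"
greedy = re.findall("^F.+:", text)
lazy = re.findall("^F.+?:", text)
print(greedy)  # ['From: Using the :']
print(lazy)    # ['From:']

# A bracket set matches a single character from the listed set.
vowels = re.findall("[aeiou]", "regular")
print(vowels)  # ['e', 'u', 'a']

# A caret as the first character inside the brackets flips the set.
not_xyz = re.findall("[^XYZ]", "XaYbZ")
print(not_xyz)  # ['a', 'b']

# Dashes give ranges: lower-case letters and digits, still one character each.
lower_or_digit = re.findall("[a-z0-9]", "A1b2")
print(lower_or_digit)  # ['1', 'b', '2']

# Two sets in a row match a two-digit number.
two_digits = re.findall("[0-9][0-9]", "00 10 99 7")
print(two_digits)  # ['00', '10', '99']

# Backslash turns the dollar sign into an ordinary character to match.
prices = re.findall(r"\$[0-9.]+", "We just received $10.00 for cookies.")
print(prices)  # ['$10.00']
```

The last pattern is written as a raw string (the leading `r`) so that Python hands the backslash to the regular expression library untouched.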