WEBVTT 00:00:00.200 --> 00:00:01.660 Welcome to Chapter Seven. 00:00:01.660 --> 00:00:03.650 Python for Informatics: Exploring Information. 00:00:03.650 --> 00:00:04.530 I'm Charles Severance. 00:00:04.530 --> 00:00:09.720 I'm the author of the book and your host. And, as always, this is brought to you by. 00:00:09.720 --> 00:00:10.410 No, I'm sorry. 00:00:10.410 --> 00:00:14.650 It's all creative copyright, Creative Commons Attribution. 00:00:14.650 --> 00:00:18.680 The audio, the video, the slides, and even the book. 00:00:18.680 --> 00:00:21.080 So, here we go. 00:00:21.080 --> 00:00:25.418 Oh, and so, frankly, where we've been working 00:00:25.418 --> 00:00:34.280 all along is, we have been writing code and talking to the CPU. 00:00:34.280 --> 00:00:37.477 Hang on, let me go get my CPU and stuff. 00:00:37.477 --> 00:00:42.151 Hang on, be right back. 00:00:44.151 --> 00:00:49.673 [SOUND] Okay. 00:00:49.673 --> 00:00:53.730 Here we go. Here we go. 00:00:53.730 --> 00:00:58.830 Here's all the stuff. Remember the stuff from the first lecture? 00:01:00.980 --> 00:01:01.710 There we go with that. 00:01:02.860 --> 00:01:05.840 Remember the motherboard from the first lecture? 00:01:05.840 --> 00:01:08.100 This is kind of the picture of what's on the screen. 00:01:08.800 --> 00:01:12.158 The motherboard, the CPU plugs in here, memory plugs in here. 00:01:12.158 --> 00:01:18.120 And remember how the CPU is sort of the brains, as 00:01:18.120 --> 00:01:22.790 much brains as there is, for the operation. The CPU is asking what next. 00:01:22.790 --> 00:01:26.240 The instructions come in through these little pins. 00:01:26.240 --> 00:01:30.131 There's data inside, and it stores sort of semi-permanent 00:01:30.131 --> 00:01:33.480 data, variables, are all stored pretty much here in RAM.
00:01:34.820 --> 00:01:37.970 And we write our programs, and so your Python programs, they're sitting here 00:01:37.970 --> 00:01:44.140 in this RAM, and they're being fed to this CPU through those chips. 00:01:44.140 --> 00:01:45.280 Through those pins, right? 00:01:45.280 --> 00:01:47.850 The pins, I mean it doesn't really connect like that. 00:01:47.850 --> 00:01:51.850 And so, so frankly, up to now, everything that we've been doing 00:01:51.850 --> 00:01:54.520 is just the Python programming language. 00:01:54.520 --> 00:01:58.210 And so the only place we've really been operating is here. 00:01:59.650 --> 00:02:02.800 We have been putting Python into the main memory. 00:02:02.800 --> 00:02:05.870 And the main memory. And we have 00:02:05.870 --> 00:02:09.596 been effectively feeding instructions to the CPU, 00:02:09.596 --> 00:02:14.080 the central processing unit, as it needed them, and then the program would stop. 00:02:14.080 --> 00:02:15.580 And everything we've done so far 00:02:15.580 --> 00:02:17.260 everything 00:02:17.260 --> 00:02:22.290 is just sort of fiddling around here. We have never escaped it. 00:02:22.290 --> 00:02:25.500 So now we are finally going to escape 00:02:25.500 --> 00:02:28.160 from the central processing unit and the memory. 00:02:29.180 --> 00:02:31.720 We'll still write programs and have variables in here. 00:02:32.920 --> 00:02:38.660 But now we're going to use the disk, the secondary storage, the 00:02:38.660 --> 00:02:44.300 permanent media, right? So if I go grab my Raspberry Pi, 00:02:44.300 --> 00:02:46.000 alright, that goes right there. 00:02:46.000 --> 00:02:51.290 Here's my Raspberry Pi, so here we've got the Raspberry Pi, which is the small version, 00:02:51.290 --> 00:02:55.990 which of course has a CPU, memory, and 00:02:55.990 --> 00:02:58.770 graphics processor, all in this little chip right here. 
00:02:58.770 --> 00:03:02.850 But the secondary memory for the, is this little 00:03:02.850 --> 00:03:05.710 SD card that is the secondary memory for Raspberry Pi. 00:03:05.710 --> 00:03:07.500 So the structure of the Raspberry Pi is 00:03:07.500 --> 00:03:09.300 exactly the same as the structure of any other 00:03:09.300 --> 00:03:13.370 personal computer, it's just smaller and less expensive. 00:03:13.370 --> 00:03:14.630 And so in the Raspberry Pi, if you're 00:03:14.630 --> 00:03:17.780 programming the Raspberry Pi, you're sort of finally escaping. 00:03:17.780 --> 00:03:19.550 All your programs were in here. 00:03:19.550 --> 00:03:24.350 Your CPU is in here and that's pretty much how, how far you've got to run. 00:03:24.350 --> 00:03:28.730 But now, of course when you save your files, you save them to here. 00:03:28.730 --> 00:03:34.618 But now we are going to start looking at data on the disk drive and so it's time 00:03:34.618 --> 00:03:38.880 to escape to the secondary memory. Okay? 00:03:38.880 --> 00:03:41.000 Time to escape to the secondary memory. 00:03:41.000 --> 00:03:43.930 And Raspberry Pi, you can go right there. Okay? 00:03:43.930 --> 00:03:45.790 So it's time to find some data to mess with. 00:03:45.790 --> 00:03:48.690 So a lot of what we've been doing so far is just 00:03:48.690 --> 00:03:52.740 kind of the pre-work to get to the point where we can do this. 00:03:52.740 --> 00:03:54.600 And in here we're going to have data files. 00:03:54.600 --> 00:03:55.760 Now, we've been making data files. 00:03:55.760 --> 00:04:00.010 You've been writing, every Python program that you write on your computer gets saved 00:04:00.010 --> 00:04:03.180 as a file. Then Python reads the file and runs it. 00:04:04.380 --> 00:04:07.200 But now we're actually going to start messing with some data. 00:04:09.060 --> 00:04:11.530 And so, files are where we're going to be working. 
00:04:11.530 --> 00:04:16.750 And so, one of the things about secondary memory is it's much larger. 00:04:18.779 --> 00:04:21.480 And this is, main memory of the computer is pretty large, it's just 00:04:21.480 --> 00:04:26.090 not large enough to hold everything that the computer is capable of holding. 00:04:26.090 --> 00:04:28.010 So the files that we're going to work with. 00:04:28.010 --> 00:04:32.230 Now we're not talking about image files or QuickTime movies or things like that. 00:04:32.230 --> 00:04:34.390 We're going to work with text files because the 00:04:34.390 --> 00:04:37.540 theme of this course is digging through text. 00:04:37.540 --> 00:04:39.090 Sometimes we'll pull it off the Internet. 00:04:39.090 --> 00:04:42.170 Sometimes we'll read files, but it's digging through and 00:04:42.170 --> 00:04:44.030 using all the things that we've learned so far, 00:04:44.030 --> 00:04:46.040 looping and strings, and all those things, 00:04:46.040 --> 00:04:49.400 to make sense of a sequence of information. 00:04:50.520 --> 00:04:51.540 Okay? 00:04:51.540 --> 00:04:57.670 Now, to access file information, we have to do this thing called opening the file. 00:04:57.670 --> 00:05:02.400 We can't just say, yo, the information is just omnipresent because there is 00:05:02.400 --> 00:05:06.170 so much data that you can't have Python sort of know all the data. 00:05:06.170 --> 00:05:09.220 You literally have hundreds of thousands of files on 00:05:09.220 --> 00:05:12.420 your computer's hard drive. And you, 00:05:12.420 --> 00:05:13.500 which one are you going to read? 00:05:13.500 --> 00:05:15.770 So there's a step that you have to do, 00:05:15.770 --> 00:05:19.030 where you call this built-in function called open. 00:05:19.030 --> 00:05:21.880 And say, oh, this is the file that I want to work with, 00:05:21.880 --> 00:05:23.850 of the hundreds of thousands, and then once you do, 00:05:23.850 --> 00:05:27.510 you've kind of got this little connector into it.
00:05:27.510 --> 00:05:31.520 And the open is a built-in function inside Python. 00:05:31.520 --> 00:05:34.300 Hang on a sec, let's say good bye to that. The open 00:05:34.300 --> 00:05:39.680 function is a built-in function in Python, and you, it takes two parameters. 00:05:39.680 --> 00:05:45.810 The first parameter is the name of the file, like mbox.txt, 00:05:45.810 --> 00:05:48.600 and then the second is how you're going to read it. 00:05:48.600 --> 00:05:49.270 Are you going to read it? 00:05:49.270 --> 00:05:50.280 are you going to write it? et cetera. 00:05:50.280 --> 00:05:53.190 Now most of the time we'll be reading our files. 00:05:53.190 --> 00:05:55.730 So we call the open function and pass it in the name of 00:05:55.730 --> 00:05:59.310 the file we want to open, and then how we want to read it. 00:05:59.310 --> 00:06:02.300 Now you can leave this second parameter off and it 00:06:02.300 --> 00:06:04.630 assumes that you're going to want to read the file. 00:06:04.630 --> 00:06:05.130 Now. 00:06:08.920 --> 00:06:11.930 When the open is successful, it doesn't actually read all 00:06:11.930 --> 00:06:16.980 of the data because the memory is small, small compared to 00:06:16.980 --> 00:06:19.140 the hard drive, and so you have to sort of 00:06:19.140 --> 00:06:22.180 step through the data, you'll tell it when to read it. 00:06:22.180 --> 00:06:26.700 So the act of opening it is not actually reading all data. 00:06:26.700 --> 00:06:30.510 It is creating kind of like a connection between the 00:06:30.510 --> 00:06:33.220 memory and the data that's on the hard drive, right? 00:06:33.220 --> 00:06:34.470 It's connecting 00:06:34.470 --> 00:06:38.450 between, oh listen to this. Oh that's going to fall down. 00:06:38.450 --> 00:06:42.014 Is it going to stand up that way? 00:06:42.014 --> 00:06:45.330 Oh, I should come up with a way to make that stand. 00:06:46.369 --> 00:06:48.080 So it's a connection. 
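As a rough sketch of what that looks like in code: the example below calls open with and without the second parameter and prints the connection it gets back. It is written for Python 3, and the file name sample.txt and its contents are made up so the example can run on its own.

```python
import os
import tempfile

# Create a small sample file so the sketch is self-contained.
# The file name and its contents are made up for illustration.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.txt")
with open(path, "w") as out:
    out.write("first line\nsecond line\n")

# First parameter: the file name. Second parameter: the mode ("r" = read).
fhand = open(path, "r")

# Printing the result shows a description of the connection,
# not the contents of the file.
print(fhand)

# Leaving the second parameter off also opens the file for reading.
same_mode = open(path)
print(fhand.mode, same_mode.mode)

fhand.close()
same_mode.close()
```

Nothing has been read from the disk at the point where the handle is printed; the data only moves into memory when the program asks for it.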
00:06:48.080 --> 00:06:50.290 So the, your program's kind of running in here. 00:06:50.290 --> 00:06:53.910 And the, and the file handle is just sort of it's 00:06:53.910 --> 00:06:57.760 like a phone call between your memory and your disk drive. 00:06:57.760 --> 00:07:00.150 It's not the actual data. The actual data is still 00:07:00.150 --> 00:07:06.010 sitting on the disk drive, okay? So, a graphical way to take a look at this 00:07:06.010 --> 00:07:11.680 is, the file handle, the thing that comes back from the open request. 00:07:11.680 --> 00:07:14.990 The open goes and finds the file out on the disk drive and 00:07:14.990 --> 00:07:20.250 yada, yada, yada, and then the handle is something that lives in the memory. 00:07:20.250 --> 00:07:22.100 that is sort of like the thing that 00:07:22.100 --> 00:07:25.830 maintains its connection to where all the data is 00:07:25.830 --> 00:07:28.910 on the disk or on the SD RAM that's in it. 00:07:28.910 --> 00:07:30.820 So the handle is not all the data, but it is 00:07:30.820 --> 00:07:34.280 a mechanism that you can use to get at the data. 00:07:34.280 --> 00:07:37.990 So if you print it out, it doesn't have all the data from the file, 00:07:37.990 --> 00:07:44.100 it says, I am a file handle that's opened this file and we're in read mode. 00:07:44.100 --> 00:07:46.120 So, that doesn't actually have the data, 00:07:46.120 --> 00:07:48.440 even though this is the data that's in the file. 00:07:48.440 --> 00:07:51.050 And then we have operations that we do to the handle like open it, 00:07:51.050 --> 00:07:53.470 close it, read it, write it. So we do things. 00:07:53.470 --> 00:07:56.370 So, so the handle and then through the handle it actually changes 00:07:56.370 --> 00:07:58.860 what's on the disk or reads what's on the disk. 00:07:58.860 --> 00:08:01.560 So the handle is kind of a thing that's not there. 00:08:02.890 --> 00:08:06.460 If you attempt to open a file and the name of the file. 
00:08:06.460 --> 00:08:08.660 Now the way we're going to do these is these need to be 00:08:08.660 --> 00:08:14.490 in the same folder on your computer as in, as your Python code. 00:08:14.490 --> 00:08:16.110 Now, there are trickier ways to do it, but 00:08:16.110 --> 00:08:17.180 we're going to keep it simple. 00:08:17.180 --> 00:08:18.670 This is the name of a file in the 00:08:18.670 --> 00:08:21.590 same folder as the Python code that you're running. 00:08:21.590 --> 00:08:28.100 [SOUND] And if it's not, then we get, of course, a traceback and we're 00:08:28.100 --> 00:08:32.289 used to using, reading tracebacks by now, no such file or directory stuff.txt. 00:08:32.289 --> 00:08:34.710 Oh, of course, I forgot to save it or I typed it wrong. 00:08:37.820 --> 00:08:38.840 So. 00:08:38.840 --> 00:08:42.659 The next thing we have to learn is the notion of the newline character. 00:08:42.659 --> 00:08:44.390 You haven't seen this so far, 00:08:44.390 --> 00:08:47.960 but there's a special character in files 00:08:47.960 --> 00:08:52.030 that is used to indicate the end of a line. 00:08:52.030 --> 00:08:53.780 Because these text files that we've been writing, 00:08:53.780 --> 00:08:57.720 including Python programs that you have, are organized into lines. 00:08:57.720 --> 00:08:59.690 Each line has variable length and there is 00:08:59.690 --> 00:09:02.870 a special non-printing character that you just don't see. 00:09:02.870 --> 00:09:05.840 Now you see it because you see a line, 00:09:05.840 --> 00:09:10.710 multiple lines, but you don't see the character itself. 00:09:10.710 --> 00:09:13.130 So it turns out that this character is very 00:09:13.130 --> 00:09:15.690 important because the data is just a stream of 00:09:15.690 --> 00:09:18.850 characters on disk and then it's punctuated by newlines 00:09:18.850 --> 00:09:22.200 that tell it when it's time to end the line. 
00:09:22.200 --> 00:09:29.368 So if we are building a string, the constant for newline is backslash n. 00:09:29.368 --> 00:09:32.973 And so, when we make a string that we want to 00:09:32.973 --> 00:09:38.380 have a newline in it, we'll say Hello backslash n World. 00:09:38.380 --> 00:09:41.370 And then if you print it out one way, you actually see the backslash n. 00:09:41.370 --> 00:09:44.230 But then if you use the print to print it out, you see sort of 00:09:44.230 --> 00:09:49.580 like the, it moves back down, you know, to the left margin and down. 00:09:49.580 --> 00:09:55.900 So, so, sometimes you see the slash n and sometimes it's shown as movement. 00:09:55.900 --> 00:09:57.140 Right? You, it moves it. 00:09:58.670 --> 00:10:00.000 The other thing that's important is even 00:10:00.000 --> 00:10:02.130 though we represent this as two characters, 00:10:02.130 --> 00:10:06.300 the backslash n is represented as two characters in a string, it's actually one character. 00:10:06.300 --> 00:10:10.250 So if we print it out, we see X newline Y 00:10:10.250 --> 00:10:13.280 and if we ask how many characters are in stuff, 00:10:13.280 --> 00:10:17.436 which is this string, it says 3. That's important. 00:10:17.436 --> 00:10:18.118 Okay? 00:10:18.118 --> 00:10:22.070 There is one, two, three. The newline is a single character. 00:10:22.070 --> 00:10:26.890 This is a just a syntax that we use to sort of encode a newline in a string. 00:10:27.890 --> 00:10:28.390 Okay? 00:10:29.450 --> 00:10:33.710 So, even though these are just a 00:10:33.710 --> 00:10:36.590 long sequence of characters punctuated by newlines, 00:10:36.590 --> 00:10:40.930 visually, text editors and operating systems show them, show 00:10:40.930 --> 00:10:43.950 these files to us as a sequence of lines. 00:10:43.950 --> 00:10:46.280 And it doesn't take very long to just start thinking about them 00:10:46.280 --> 00:10:47.650 as a sequence of lines. 
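A small sketch of the newline behavior just described, in Python 3. The variable names are only for illustration; repr is used here to stand in for "printing it one way" so the backslash n stays visible.

```python
hello = "Hello\nWorld"

# repr shows the newline as the two-character escape sequence.
print(repr(hello))

# print shows the newline as movement down to the left margin.
print(hello)

stuff = "X\nY"
print(stuff)

# The newline counts as a single character: X, newline, Y.
print(len(stuff))   # 3
```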
00:10:47.650 --> 00:10:50.570 As a matter of fact, maybe you wish I'd never told you about newlines. 00:10:51.830 --> 00:10:53.080 But when we start reading files, we're 00:10:53.080 --> 00:10:54.990 going to have to deal with these newlines. 00:10:54.990 --> 00:10:59.260 So the way that we sort of have to mentally visualize what these text 00:10:59.260 --> 00:11:03.980 files look like is they have a newline that punctuates the end of the line. 00:11:03.980 --> 00:11:09.200 Now in reality, if we look at this, this R really comes right after it. 00:11:09.200 --> 00:11:09.440 Right? 00:11:09.440 --> 00:11:13.410 This is all a bunch of characters and the newlines are punctuation, okay? 00:11:13.410 --> 00:11:16.720 To say this is first line, second line, third line, and fourth line. 00:11:16.720 --> 00:11:18.730 So, you gotta think that each of these things 00:11:18.730 --> 00:11:21.710 is here, sitting at the end of the line. 00:11:21.710 --> 00:11:24.950 And so the number of characters in this line includes that newline. 00:11:24.950 --> 00:11:26.930 Now the newline is one character. 00:11:26.930 --> 00:11:31.920 Okay? So, how do we read these files? 00:11:31.920 --> 00:11:36.130 Well, we've already talked about doing an open xfile. 00:11:36.130 --> 00:11:39.100 And this xfile, again that's just a mnemonic 00:11:39.100 --> 00:11:41.840 name that I made up. This is a handle. 00:11:41.840 --> 00:11:43.690 Remember, it's not all the data. 00:11:43.690 --> 00:11:46.070 But the handle is the way that we can read the data. 00:11:46.070 --> 00:11:48.690 We can use it as an access point. 00:11:48.690 --> 00:11:52.000 The coolest way to read a file, if it's a text file in multiple 00:11:52.000 --> 00:11:58.270 lines, is to use a definite loop, a for loop. for cheese in xfile. 00:11:58.270 --> 00:12:03.090 So this, remember we would put a list of numbers or a string here.
00:12:03.090 --> 00:12:04.150 Now we've put a file 00:12:04.150 --> 00:12:05.200 handle here. 00:12:05.200 --> 00:12:09.315 Python knows automatically that each time we are going to run this 00:12:09.315 --> 00:12:11.980 loop, it's going to go to the next line of the file. 00:12:11.980 --> 00:12:16.092 Automatically. And cheese is just a stupid name that I came up with. 00:12:16.092 --> 00:12:20.232 It would be better to call it line rather than cheese, but for cheese in and then it goes 00:12:20.232 --> 00:12:22.812 dot, dot, dot, dot, dot, dot, dot, each file 00:12:22.812 --> 00:12:25.760 and then it stops when it reads the whole file. 00:12:25.760 --> 00:12:29.270 So this line will print out every line 00:12:29.270 --> 00:12:33.840 in the file, that's how you do it. These three lines open a file, 00:12:35.510 --> 00:12:42.400 read every line in the file, okay? So a file handle itself is a special kind 00:12:42.400 --> 00:12:47.170 of a sequence, much like a list of numbers or a string is a sequence of characters. 00:12:47.170 --> 00:12:48.860 So one of the things we can do to combine one of 00:12:48.860 --> 00:12:51.930 our counting idioms is count the number of lines in a file. 00:12:53.345 --> 00:12:54.340 Okay? And so how we 00:12:54.340 --> 00:12:56.970 would do that is we would open the file, set a 00:12:56.970 --> 00:13:00.580 counter to zero, this time I'll use a mnemonic variable called count. 00:13:00.580 --> 00:13:02.950 For line in fhand, that says run this 00:13:02.950 --> 00:13:05.740 indented text once for each line in the file. 00:13:05.740 --> 00:13:08.410 For each line in the file, add count equals count plus 1. 00:13:08.410 --> 00:13:10.760 When the for loop is done, print the count. 00:13:13.010 --> 00:13:14.480 Pretty straightforward. 00:13:14.480 --> 00:13:18.240 Very few other languages are capable of writing that program in 00:13:18.240 --> 00:13:22.160 as quick and as dense and succinct a way as Python is.
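The line-counting program described above might look roughly like this in Python 3. The names fhand and count follow the lecture; the sample file lines.txt is made up so the sketch runs without mbox.txt.

```python
import os
import tempfile

# Made-up three-line file so the example is self-contained.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "lines.txt")
with open(path, "w") as out:
    out.write("one\ntwo\nthree\n")

fhand = open(path)
count = 0
# The for loop hands us one line of the file per iteration.
for line in fhand:
    count = count + 1
fhand.close()

print("Line count:", count)
```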
00:13:22.160 --> 00:13:25.080 Python does a really, really nice job of this. 00:13:25.080 --> 00:13:28.140 Okay? So that's how you count the lines. 00:13:28.140 --> 00:13:31.250 Open it, write a for loop, and then add one. 00:13:31.250 --> 00:13:35.980 Now we, we can't just say, so what you can't do, and this gives you a sense. 00:13:35.980 --> 00:13:37.125 You can't say len, 00:13:37.125 --> 00:13:40.300 fhand. 00:13:40.300 --> 00:13:42.620 And that's because this isn't really the data. 00:13:42.620 --> 00:13:45.390 That's sort of, you have to like pull the, pull it 00:13:45.390 --> 00:13:48.080 and read it to get the data out of it. 00:13:48.080 --> 00:13:49.990 Although we'll see another way of reading it later. 00:13:51.020 --> 00:13:53.270 Okay? So that's counting the lines in a file. 00:13:55.100 --> 00:13:57.460 It turns out you can also read the entire file. 00:13:58.980 --> 00:14:02.100 Now if you read the entire file, it's not broken into lines. 00:14:02.100 --> 00:14:04.000 You're getting all the characters punctuated 00:14:04.000 --> 00:14:06.320 by newlines and you get everything. 00:14:06.320 --> 00:14:09.820 Now you don't want to read this if it's too big, so it's 00:14:09.820 --> 00:14:12.610 going to all try to read it into the memory of the computer. 00:14:12.610 --> 00:14:16.080 And if the memory is not big enough, you're going to slow down to a crawl. 00:14:16.080 --> 00:14:18.905 But if it's a real tiny file, this works just fine. 00:14:18.905 --> 00:14:22.120 And so, so we have sort of real, we open 00:14:22.120 --> 00:14:26.990 a file and we say fhand.read, this is basically saying, hey, 00:14:26.990 --> 00:14:30.840 dear fhand, read it all and return it to me as a string. 00:14:31.950 --> 00:14:34.350 So that's a string with all the lines of the file concatenated 00:14:34.350 --> 00:14:38.790 together with newlines, which is actually exactly what's in the file. 00:14:38.790 --> 00:14:39.800 It's the raw data. 
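A minimal sketch of reading a whole file in one go, in Python 3. The name inp is borrowed from the lecture slides; the file tiny.txt and its contents are invented for illustration, and this approach is only sensible for small files.

```python
import os
import tempfile

# Made-up two-line file; the address is not from mbox.txt.
text = "From alice@example.com\nSubject: hello\n"
folder = tempfile.mkdtemp()
path = os.path.join(folder, "tiny.txt")
with open(path, "w") as out:
    out.write(text)

fhand = open(path)
# read() returns the entire file as one string, newlines included.
inp = fhand.read()
fhand.close()

print(len(inp))      # total number of characters, counting newlines
print(repr(inp))     # the raw data, with each newline shown as \n
```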
00:14:39.800 --> 00:14:42.400 That for loop sort of looks for the newline 00:14:42.400 --> 00:14:44.410 and does all of the stuff automatically for us. 00:14:44.410 --> 00:14:45.160 It's quite nice. 00:14:46.410 --> 00:14:49.670 So then we can, like, because inp is a string at this point, 00:14:49.670 --> 00:14:50.590 we can just print the length of it. 00:14:50.590 --> 00:14:53.110 And we can say, oh, there's 94,626 00:14:53.110 --> 00:14:56.780 characters that came from that file. 00:14:56.780 --> 00:15:01.910 It reads the whole thing, whole file, reads the whole file. 00:15:01.910 --> 00:15:04.450 We can also do things like, you know, slice it now. 00:15:04.450 --> 00:15:10.330 And so this is the first 20 characters, up from zero up to, but not including, 20. 00:15:10.330 --> 00:15:12.790 So this, this is our file. Okay? 00:15:12.790 --> 00:15:15.640 So that's reading through the whole file. 00:15:15.640 --> 00:15:18.390 So, let me go back a little bit, this is the file that we're 00:15:18.390 --> 00:15:19.230 going to play with. 00:15:20.370 --> 00:15:24.920 This file here that we're going to play with in this class is a mailbox file. 00:15:24.920 --> 00:15:27.120 And this is actual real data. And these are real people. 00:15:27.120 --> 00:15:28.890 And these are real dates, having to do with 00:15:28.890 --> 00:15:31.680 an open source project that I worked on called Sakai. 00:15:31.680 --> 00:15:35.990 I actually have a tattoo of Sakai here on my shoulder. 00:15:35.990 --> 00:15:38.220 Maybe in an upcoming lecture, I'll have a 00:15:38.220 --> 00:15:40.430 short-sleeved shirt, and show you my tattoo. 00:15:40.430 --> 00:15:44.500 But for now, I can't because I've got, got clothes on. 00:15:44.500 --> 00:15:52.480 So, but this is real data. It's the mbox.txt, mbox.txt file. 00:15:52.480 --> 00:15:56.270 So, so that's the file that we're going to use for most of the next few assignments. 00:15:56.270 --> 00:15:57.960 It'll be the same file. 
You'll get tired of it. 00:15:57.960 --> 00:16:00.330 And you'll get to know all these people, Stephen, 00:16:00.330 --> 00:16:02.110 Chen Wen, and all the people in the file. 00:16:05.360 --> 00:16:06.020 Okay, so. 00:16:07.440 --> 00:16:10.470 We can search for lines that have a prefix. 00:16:10.470 --> 00:16:14.380 This is kind of the find pattern from the looping lecture. 00:16:14.380 --> 00:16:17.860 So we're going to go through a list of, of lines in a file, 00:16:17.860 --> 00:16:20.910 and we're going to only print out the ones that match a certain thing. 00:16:20.910 --> 00:16:22.810 So again, we open the file up. 00:16:22.810 --> 00:16:25.410 We're going to write a for loop that's going to say, for each line in the 00:16:25.410 --> 00:16:30.420 file, if the line and then we can call a, a utility function 00:16:30.420 --> 00:16:32.660 inside of string, because line is a string. 00:16:32.660 --> 00:16:35.230 If line startswith From, print it out. 00:16:35.230 --> 00:16:37.860 So this means it's going to loop through all of the lines in the 00:16:37.860 --> 00:16:43.180 file and it's going to print the ones that start with the string 'From:' 00:16:44.530 --> 00:16:45.700 Okay? 00:16:45.700 --> 00:16:49.780 Again, four lines, complete Python program to read this 00:16:49.780 --> 00:16:52.760 file and print the lines that have a prefix of from. 00:16:54.950 --> 00:16:59.020 So, if you run this program, and I suggest that you do, 00:17:01.050 --> 00:17:02.710 this is what the output's going to look like. 00:17:03.840 --> 00:17:07.160 And it's like, wait a second, I'm seeing the lines, 00:17:09.680 --> 00:17:13.990 seeing the lines that have the froms, but then I get these blank lines. 00:17:16.530 --> 00:17:18.950 And why is that? Why are these blank lines there? 00:17:18.950 --> 00:17:24.390 If I look at the program, I mean, I'm not printing blank lines. 00:17:24.390 --> 00:17:26.369 I'm only printing lines that start with from. 
00:17:26.369 --> 00:17:27.520 I'm not doing that, so why? 00:17:30.520 --> 00:17:31.020 What do you think? 00:17:31.790 --> 00:17:32.530 I'll give you a second. 00:17:34.740 --> 00:17:38.080 I've certainly done enough foreshadowing in this lecture. 00:17:38.080 --> 00:17:41.100 Well it turns out these newlines are the problem. 00:17:41.100 --> 00:17:43.850 So it turns out that the print, we've been doing this 00:17:43.850 --> 00:17:46.580 all along, you just, we didn't make a fuss about it. 00:17:46.580 --> 00:17:49.930 The print adds a newline at the end of everything that it prints. 00:17:49.930 --> 00:17:53.270 So the yellow newlines are coming from the print statement. 00:17:53.270 --> 00:17:57.500 But when we read the file, each line ends in a newline. 00:17:57.500 --> 00:18:00.490 So these green newlines are actually from the file. 00:18:03.170 --> 00:18:05.960 They're the ones from the file. 00:18:05.960 --> 00:18:08.060 So what's happening is we're seeing two 00:18:08.060 --> 00:18:10.530 newlines, and so that turns into a blank line. 00:18:11.870 --> 00:18:14.140 So, how do we deal with that? 00:18:14.140 --> 00:18:19.140 Well, we've got a string function that conveniently solves that problem, okay? 00:18:19.140 --> 00:18:21.040 And that is we're going to call rstrip. 00:18:21.040 --> 00:18:25.200 If you recall, we had strip, lstrip, and rstrip to strip 00:18:25.200 --> 00:18:28.380 white space on one side, on the other side, or on both sides. 00:18:28.380 --> 00:18:29.510 So in this one, 00:18:29.510 --> 00:18:30.570 we're going to use rstrip. 00:18:30.570 --> 00:18:33.130 We're going to say, we're going to read the line, that 00:18:33.130 --> 00:18:35.545 this line is going to have a newline in it. 00:18:35.545 --> 00:18:40.200 rstrip says pull white space, and the newlines are also counted as white space. 00:18:40.200 --> 00:18:42.870 Blanks or newlines are white space. 
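Here is one way the rstrip fix could be sketched in Python 3. The file mail.txt and the addresses in it are made up; the list named matches is an addition so the result can be inspected afterwards.

```python
import os
import tempfile

# Made-up mail-like file so the sketch runs without mbox.txt.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "mail.txt")
with open(path, "w") as out:
    out.write("From: alice@example.com\nSubject: hi\nFrom: bob@example.com\n")

matches = []
fhand = open(path)
for line in fhand:
    # Each line from the file still ends with its newline; strip it
    # from the right-hand side before doing anything else.
    line = line.rstrip()
    if line.startswith("From:"):
        matches.append(line)
        # print adds exactly one newline, so no blank lines appear.
        print(line)
fhand.close()
```

Without the rstrip call, each printed line would carry two newlines, one from the file and one from print, which is where the blank lines come from.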
00:18:42.870 --> 00:18:46.610 And then we're going to replace this with no newline in it. 00:18:46.610 --> 00:18:50.108 Then we're going to ask if it starts with a from and then we're going to print it 00:18:50.108 --> 00:18:51.804 out, and then we go and we're going to 00:18:51.804 --> 00:18:55.130 see exactly what we're looking for in this file. 00:18:55.130 --> 00:18:56.040 And there's no newlines. 00:18:56.040 --> 00:19:01.360 So the newline that's coming out here is the one from the print, not the 00:19:01.360 --> 00:19:03.930 one from the file, because the one from 00:19:03.930 --> 00:19:06.610 the file got wiped out by that particular line. 00:19:07.950 --> 00:19:08.450 Okay? 00:19:09.720 --> 00:19:13.360 So another general pattern of these file-based loops 00:19:13.360 --> 00:19:17.510 that we have done this, is a skipping pattern. 00:19:17.510 --> 00:19:20.490 Now, you can do, the, the non-skipping pattern 00:19:20.490 --> 00:19:22.960 is where you're saying, I'm going to look for lines 00:19:22.960 --> 00:19:25.640 that start with from and do something to them. 00:19:25.640 --> 00:19:30.420 Sometimes you'll want to do something to all, to, to the to, to, you want to say, 00:19:30.420 --> 00:19:32.790 here's a bunch of lines I'm going to skip, and then I'm going to do something. 00:19:32.790 --> 00:19:36.520 So the skipping pattern uses continue. 00:19:36.520 --> 00:19:38.840 And so the first few lines here are the same. 00:19:38.840 --> 00:19:41.760 We open a file, we read each line in the file, 00:19:41.760 --> 00:19:43.780 but we're going to strip off the white space. 00:19:43.780 --> 00:19:45.640 You're going to get tired of typing these three lines, 00:19:45.640 --> 00:19:47.280 because you're going to do it a lot. 00:19:47.280 --> 00:19:51.890 Open the file, start reading the file, strip the whitespace for each line. 00:19:51.890 --> 00:19:57.740 And you can make it so that you can look for some fact. 
00:19:57.740 --> 00:20:01.260 In this case, I'm going to say, if not line startswith From, this 00:20:01.260 --> 00:20:05.220 means this is true for all the lines that don't start with from, 00:20:05.220 --> 00:20:08.600 continue. And if you remember, continue goes up. 00:20:08.600 --> 00:20:10.960 So the continue says I'm done, it finishes 00:20:10.960 --> 00:20:14.230 the iteration, and it doesn't do anything down here. 00:20:14.230 --> 00:20:15.130 Okay? 00:20:15.130 --> 00:20:18.210 And so it, this is a, and then, we can do something. 00:20:18.210 --> 00:20:21.110 So, I've kind of flipped this, where I said, these are the 00:20:21.110 --> 00:20:24.830 things I'm interesting, interested in, that's lines that start with from. 00:20:24.830 --> 00:20:26.270 So, I'm going to skip the lines that don't. 00:20:26.270 --> 00:20:27.880 So I'm going to use continue. 00:20:27.880 --> 00:20:32.420 Either way you can do it, depending on the complexity or how much. 00:20:32.420 --> 00:20:34.100 Often when you're, this is a good pattern when 00:20:34.100 --> 00:20:36.400 you have lots of lines of code down here 00:20:36.400 --> 00:20:37.850 that you're going to do a lot of cool stuff with. 00:20:39.290 --> 00:20:42.780 You can also use things like in to select lines. 00:20:42.780 --> 00:20:43.320 Right? 00:20:43.320 --> 00:20:51.199 So I'm going to, I'm going to look for lines that have @uct.ac.za in them. 00:20:51.199 --> 00:20:53.073 So again, I'm going to open it up. 00:20:53.073 --> 00:20:55.830 I'm going to open these, go through each line in the file. 00:20:55.830 --> 00:21:00.510 I'm going to strip the white space out, and [COUGH] 00:21:00.510 --> 00:21:03.070 if not u-c-t, 00:21:03.070 --> 00:21:07.930 if this string is not in line, then I'm going to continue. 00:21:07.930 --> 00:21:12.270 So it's a way for me to skip all of the lines that don't have this string in it. 
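The skipping pattern with continue and in can be sketched as follows in Python 3. The sample lines are invented, and the list named kept is an addition for checking the result. The test is written as "not in", which is equivalent to the "if not ... in line" form read out in the lecture.

```python
import os
import tempfile

# Made-up sample data; only two lines contain the string we want.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "mail.txt")
with open(path, "w") as out:
    out.write(
        "From: someone@uct.ac.za\n"
        "From: other@example.com\n"
        "X-Note: forwarded by someone@uct.ac.za\n"
    )

wanted = "@uct.ac.za"
kept = []
fhand = open(path)
for line in fhand:
    line = line.rstrip()
    # Skip every line that does not contain the string.
    if wanted not in line:
        continue
    # Only lines that made it past the continue reach this point.
    kept.append(line)
    print(line)
fhand.close()
```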
00:21:14.000 --> 00:21:19.260 So these lines do, that one has it too, and then we're going to print it out. 00:21:19.260 --> 00:21:23.750 It will print out the ones that make it past here, okay? 00:21:23.750 --> 00:21:28.440 So, but in is another way to do searching, right, starts with, 00:21:28.440 --> 00:21:28.940 et cetera. 00:21:30.640 --> 00:21:37.550 So one more thing that you might want to try is, so we can count, right? 00:21:37.550 --> 00:21:40.270 Now, and this is a pattern for prompting for a file name. 00:21:41.920 --> 00:21:45.850 And so, so here you, you'll get tired of sort of 00:21:45.850 --> 00:21:48.770 changing your code every time you want to open a different file. 00:21:48.770 --> 00:21:50.849 because you probably want to run the program 00:21:50.849 --> 00:21:53.621 with mbox once and mbox-short because, just so you 00:21:53.621 --> 00:21:57.850 can test it with different things of data. So here's just another pattern. 00:21:57.850 --> 00:22:01.910 We add this line to say raw_input, enter the file name. 00:22:01.910 --> 00:22:04.700 And there you go, we'll type in the file name. 00:22:04.700 --> 00:22:08.240 And then the thing that we open is whatever we entered as the file name. 00:22:08.240 --> 00:22:11.280 And then the rest of it is pretty much yada yada. 00:22:11.280 --> 00:22:14.060 So here I'm, it's reading the whole file. 00:22:14.060 --> 00:22:17.230 If the line starts with subject, count equals count plus one. 00:22:17.230 --> 00:22:19.340 And then there were 1797 subject 00:22:19.340 --> 00:22:21.850 lines in mbox.txt. 00:22:21.850 --> 00:22:26.500 There were 27 subject lines in mbox-short.txt, okay? 00:22:26.500 --> 00:22:29.020 So that's prompting for the file names. 00:22:29.020 --> 00:22:31.310 Now, open. 00:22:31.310 --> 00:22:35.450 The open statement fails if the file name doesn't exist. 
00:22:35.450 --> 00:22:37.290 So, you might want to add a try and 00:22:37.290 --> 00:22:39.840 except around that if you want to, if you're just writing 00:22:39.840 --> 00:22:42.530 code for yourself and you assume that everything's okay, 00:22:42.530 --> 00:22:44.610 then you don't have to write try except but if 00:22:44.610 --> 00:22:50.610 you want to catch it [SOUND] and catch a bad file name, 00:22:50.610 --> 00:22:55.860 then you take the open and turn it into these four lines. 00:22:55.860 --> 00:22:58.480 So this is the code that we think might blow up, 00:22:59.500 --> 00:23:01.330 and it's going to blow up, we know it's going to blow up. 00:23:01.330 --> 00:23:03.510 If they enter a bad file name like 00:23:03.510 --> 00:23:06.510 na na boo boo, right, this is going to blow up. 00:23:06.510 --> 00:23:08.940 So what do we do? We use try and except. 00:23:08.940 --> 00:23:09.780 We put try 00:23:09.780 --> 00:23:10.390 around that. 00:23:10.390 --> 00:23:14.210 We're going to take out some insurance on that particular line. 00:23:14.210 --> 00:23:16.540 And then, if it fails, we're going to print 00:23:16.540 --> 00:23:20.500 this message and then say exit, to get out. 00:23:20.500 --> 00:23:22.920 So if you get a good file, 00:23:25.500 --> 00:23:27.930 if you get a good file, it works, skips the 00:23:27.930 --> 00:23:31.610 except, then runs the thing, prints out the count. 00:23:31.610 --> 00:23:35.930 That's what's happening here. If, on the other hand, you get a bad file, 00:23:36.990 --> 00:23:41.940 it comes here, open blows up, runs the except, prints this out, and then quits. 00:23:43.210 --> 00:23:45.680 So that's how this one works with a bad file. 00:23:46.820 --> 00:23:48.538 And now, no traceback, right? 00:23:53.538 --> 00:23:55.386 So we are 00:23:56.690 --> 00:24:00.270 It's kind of a short lecture. We're done with Chapter Seven. 00:24:01.480 --> 00:24:03.940 We open a file. 00:24:03.940 --> 00:24:05.670 We read the file.
00:24:05.670 --> 00:24:09.380 We take out white space at the end with rstrip. 00:24:09.380 --> 00:24:11.670 We had used string functions. 00:24:11.670 --> 00:24:14.650 So, this is kind of putting it all together. 00:24:14.650 --> 00:24:17.280 And it's kind of short little programs now. 00:24:17.280 --> 00:24:22.100 So, it's not. And you know, starting now, 00:24:22.100 --> 00:24:25.255 we are going to start putting these things together and start actually doing work. 00:24:25.255 --> 00:24:28.100 Because now, we have, from the first few chapters, 00:24:28.100 --> 00:24:32.390 we have basic capabilities of Python. Now we have some data to work with. 00:24:32.390 --> 00:24:33.180 Now going forward 00:24:33.180 --> 00:24:36.570 we are going to do increasingly sophisticated things with that data. 00:24:36.570 --> 00:24:38.240 So I can't wait to see in the next lecture.
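Looking back at the try and except pattern from this lecture, here is a hedged sketch in Python 3. The lecture's version prompts with raw_input and calls exit on failure; this sketch instead wraps the logic in a hypothetical function, count_subject_lines, that returns None for a bad file name, so it can run without a prompt. The sample file and its contents are made up.

```python
import os
import tempfile


def count_subject_lines(fname):
    """Count lines starting with 'Subject:'; return None if open fails.

    Hypothetical helper for illustration, not code from the lecture.
    """
    try:
        # This is the line we are taking out insurance on.
        fhand = open(fname)
    except OSError:
        print("File cannot be opened:", fname)
        return None

    count = 0
    for line in fhand:
        if line.startswith("Subject:"):
            count = count + 1
    fhand.close()
    return count


# Made-up sample file with two subject lines.
folder = tempfile.mkdtemp()
good_name = os.path.join(folder, "mail.txt")
with open(good_name, "w") as out:
    out.write("Subject: one\nFrom: a@example.com\nSubject: two\n")

good_result = count_subject_lines(good_name)
bad_result = count_subject_lines(os.path.join(folder, "na na boo boo"))
print(good_result, bad_result)
```

In an interactive program, the file name would come from a prompt (raw_input in Python 2, input in Python 3) and the except branch would end the program rather than return.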