0:00:00.200,0:00:01.660 Welcome to Chapter Seven. 0:00:01.660,0:00:03.650 Python for Informatics: Exploring[br]Information. 0:00:03.650,0:00:04.530 I'm Charles Severance. 0:00:04.530,0:00:09.720 I'm the author of the book and your host.[br]And, as always, this is brought to you by. 0:00:09.720,0:00:10.410 No, I'm sorry. 0:00:10.410,0:00:14.650 It's all creative copyright, Creative[br]Commons Attribution. 0:00:14.650,0:00:18.680 The audio, the video, the slides, and even[br]the book. 0:00:18.680,0:00:21.080 So, here we go. 0:00:21.080,0:00:25.418 Oh, and so, frankly, where[br]we've been working 0:00:25.418,0:00:34.280 all along is, we have been writing code[br]and talking to the CPU. 0:00:34.280,0:00:37.477 Hang on, let me, let me go get[br]my CPU and stuff. 0:00:37.477,0:00:42.151 Hang on, be right back. 0:00:44.151,0:00:49.673 [SOUND][br]Okay. 0:00:49.673,0:00:53.730 Here we go. Here we go. 0:00:53.730,0:00:58.830 Here's all the stuff. Remember the stuff[br]from the first lecture? 0:01:00.980,0:01:01.710 There we go with that. 0:01:02.860,0:01:05.840 Remember the motherboard from the first[br]lecture? 0:01:05.840,0:01:08.100 This is kind of the picture of what's on[br]the screen. 0:01:08.800,0:01:12.158 The motherboard, the CPU plugs in here,[br]memory plugs in here. 0:01:12.158,0:01:18.120 And remember how the CPU is sort of the[br]brains, as 0:01:18.120,0:01:22.790 much brains as there is, for the operation.[br]The CPU is asking what next. 0:01:22.790,0:01:26.240 The instructions come in through these[br]little pins. 0:01:26.240,0:01:30.131 There's data inside, and it stores sort of[br]semi-permanent 0:01:30.131,0:01:33.480 data, variables, are all stored pretty[br]much here in RAM. 0:01:34.820,0:01:37.970 And we write our programs, and so your[br]Python programs, they're sitting here 0:01:37.970,0:01:44.140 in this RAM, and they're being fed to this[br]CPU through those chips. 0:01:44.140,0:01:45.280 Through those pins, right?
0:01:45.280,0:01:47.850 The pins, I mean it doesn't really connect[br]like that. 0:01:47.850,0:01:51.850 And so, so frankly, up to now, everything[br]that we've been doing 0:01:51.850,0:01:54.520 is just the Python programming language. 0:01:54.520,0:01:58.210 And so the only place we've really been[br]operating is here. 0:01:59.650,0:02:02.800 We have been putting Python into the main[br]memory. 0:02:02.800,0:02:05.870 And the main memory. And we have 0:02:05.870,0:02:09.596 been effectively feeding instructions to[br]the CPU, 0:02:09.596,0:02:14.080 the central processing unit, as it needed[br]them, and then the program would stop. 0:02:14.080,0:02:15.580 And everything we've done so far 0:02:15.580,0:02:17.260 everything 0:02:17.260,0:02:22.290 is just sort of fiddling around here.[br]We have never escaped it. 0:02:22.290,0:02:25.500 So now we are finally going to escape 0:02:25.500,0:02:28.160 from the central processing unit and the[br]memory. 0:02:29.180,0:02:31.720 We'll still write programs and have[br]variables in here. 0:02:32.920,0:02:38.660 But now we're going to use the disk,[br]the secondary storage, the 0:02:38.660,0:02:44.300 permanent media, right?[br]So if I go grab my Raspberry Pi, 0:02:44.300,0:02:46.000 alright, that goes right there. 0:02:46.000,0:02:51.290 Here's my Raspberry Pi, so here we've got[br]the Raspberry Pi, which is the small version, 0:02:51.290,0:02:55.990 which of course has a CPU, memory, and 0:02:55.990,0:02:58.770 graphics processor, all in this little chip[br]right here. 0:02:58.770,0:03:02.850 But the secondary memory for the,[br]is this little 0:03:02.850,0:03:05.710 SD card that is the secondary memory for[br]Raspberry Pi. 0:03:05.710,0:03:07.500 So the structure of the Raspberry Pi is 0:03:07.500,0:03:09.300 exactly the same as the structure[br]of any other 0:03:09.300,0:03:13.370 personal computer, it's just smaller and[br]less expensive. 
0:03:13.370,0:03:14.630 And so in the Raspberry Pi, if you're 0:03:14.630,0:03:17.780 programming the Raspberry Pi, you're sort[br]of finally escaping. 0:03:17.780,0:03:19.550 All your programs were in here. 0:03:19.550,0:03:24.350 Your CPU is in here and that's pretty much[br]how, how far you've got to run. 0:03:24.350,0:03:28.730 But now, of course when you save your files,[br]you save them to here. 0:03:28.730,0:03:34.618 But now we are going to start looking at[br]data on the disk drive and so it's time 0:03:34.618,0:03:38.880 to escape to the secondary memory.[br]Okay? 0:03:38.880,0:03:41.000 Time to escape to the secondary memory. 0:03:41.000,0:03:43.930 And Raspberry Pi, you can go right there.[br]Okay? 0:03:43.930,0:03:45.790 So it's time to find some data to mess[br]with. 0:03:45.790,0:03:48.690 So a lot of what we've been doing so far[br]is just 0:03:48.690,0:03:52.740 kind of the pre-work to get to the point[br]where we can do this. 0:03:52.740,0:03:54.600 And in here we're going to have data files. 0:03:54.600,0:03:55.760 Now, we've been making data files. 0:03:55.760,0:04:00.010 You've been writing, every Python program[br]that you write on your computer gets saved 0:04:00.010,0:04:03.180 as a file. Then Python reads the file and runs it. 0:04:04.380,0:04:07.200 But now we're actually going to start[br]messing with some data. 0:04:09.060,0:04:11.530 And so, files are where we're going to be[br]working. 0:04:11.530,0:04:16.750 And so, one of things about secondary memory[br]is it's much larger. 0:04:18.779,0:04:21.480 And this is, main memory of the computer[br]is pretty large, it's just 0:04:21.480,0:04:26.090 not large enough to hold everything that[br]the computer is capable of holding. 0:04:26.090,0:04:28.010 So the files that we're going to work with. 0:04:28.010,0:04:32.230 Now we're not talking about image files or[br]Quicktime movies or things like that. 
0:04:32.230,0:04:34.390 We're going to work with text files[br]because the 0:04:34.390,0:04:37.540 theme of this course is digging through[br]text. 0:04:37.540,0:04:39.090 Sometimes we'll pull it off the Internet. 0:04:39.090,0:04:42.170 Sometimes we'll read files, but it's[br]digging through and 0:04:42.170,0:04:44.030 using all the things that we've learned so[br]far, 0:04:44.030,0:04:46.040 looping and strings, and all those things, 0:04:46.040,0:04:49.400 to make sense of a sequence of[br]information. 0:04:50.520,0:04:51.540 Okay? 0:04:51.540,0:04:57.670 Now, to access file information, we have[br]to do this thing called opening the file. 0:04:57.670,0:05:02.400 We can't just say, yo, the information is[br]just omnipresent because there are 0:05:02.400,0:05:06.170 so much data that you can't have Python[br]sort of know all the data. 0:05:06.170,0:05:09.220 You literally have hundreds of thousands[br]of files on 0:05:09.220,0:05:12.420 your computer's hard drive.[br]And you, 0:05:12.420,0:05:13.500 which one are you going to read? 0:05:13.500,0:05:15.770 So there's a step that you have to do, 0:05:15.770,0:05:19.030 that you call this built-in function[br]called open. 0:05:19.030,0:05:21.880 And say, oh, this is the file that I[br]want to work with, 0:05:21.880,0:05:23.850 of the hundreds of thousands, and then[br]once you do, 0:05:23.850,0:05:27.510 you've kind of got this little[br]connector into it. 0:05:27.510,0:05:31.520 And the open is a built-in function inside[br]Python. 0:05:31.520,0:05:34.300 Hang on a sec, let's say good bye to that.[br]The open 0:05:34.300,0:05:39.680 function is a built-in function in Python,[br]and you, it takes two parameters. 0:05:39.680,0:05:45.810 The first parameter is the name of the[br]file, like mbox.txt, 0:05:45.810,0:05:48.600 and then the second is how you're going to[br]read it. 0:05:48.600,0:05:49.270 Are you going to read it? 0:05:49.270,0:05:50.280 are you going to write it? et cetera. 
0:05:50.280,0:05:53.190 Now most of the time we'll be reading our[br]files. 0:05:53.190,0:05:55.730 So we call the open function and pass it[br]in the name of 0:05:55.730,0:05:59.310 the file we want to open, and then how we[br]want to read it. 0:05:59.310,0:06:02.300 Now you can leave this second parameter[br]off and it 0:06:02.300,0:06:04.630 assumes that you're going to want to read[br]the file. 0:06:04.630,0:06:05.130 Now. 0:06:08.920,0:06:11.930 When the open is successful, it doesn't[br]actually read all 0:06:11.930,0:06:16.980 of the data because the memory is small,[br]small compared to 0:06:16.980,0:06:19.140 the hard drive, and so you have to sort of 0:06:19.140,0:06:22.180 step through the data, you'll tell it when[br]to read it. 0:06:22.180,0:06:26.700 So the act of opening it is not[br]actually reading all data. 0:06:26.700,0:06:30.510 It is creating kind of like a connection[br]between the 0:06:30.510,0:06:33.220 memory and the data that's on the hard[br]drive, right? 0:06:33.220,0:06:34.470 It's connecting 0:06:34.470,0:06:38.450 between, oh listen to this.[br]Oh that's going to fall down. 0:06:38.450,0:06:42.014 Is it going to stand up that way? 0:06:42.014,0:06:45.330 Oh, I should come up with a way to[br]make that stand. 0:06:46.369,0:06:48.080 So it's a connection. 0:06:48.080,0:06:50.290 So the, your program's kind of running in[br]here. 0:06:50.290,0:06:53.910 And the, and the file handle is just sort[br]of it's 0:06:53.910,0:06:57.760 like a phone call between your memory and[br]your disk drive. 0:06:57.760,0:07:00.150 It's not the actual data.[br]The actual data is still 0:07:00.150,0:07:06.010 sitting on the disk drive, okay?[br]So, a graphical way to take a look at this 0:07:06.010,0:07:11.680 is, the file handle, the thing that comes[br]back from the open request. 
0:07:11.680,0:07:14.990 The open goes and finds the file out on[br]the disk drive and 0:07:14.990,0:07:20.250 yada, yada, yada, and then the handle is[br]something that lives in the memory. 0:07:20.250,0:07:22.100 that is sort of like the thing that 0:07:22.100,0:07:25.830 maintains its connection to where all the[br]data is 0:07:25.830,0:07:28.910 on the disk or on the SD RAM that's in it. 0:07:28.910,0:07:30.820 So the handle is not all the data, but it is 0:07:30.820,0:07:34.280 a mechanism that you can use to get at the[br]data. 0:07:34.280,0:07:37.990 So if you print it out, it doesn't have[br]all the data from the file, 0:07:37.990,0:07:44.100 it says, I am a file handle that's opened[br]this file and we're in read mode. 0:07:44.100,0:07:46.120 So, that doesn't actually have the data, 0:07:46.120,0:07:48.440 even though this is the data that's [br]in the file. 0:07:48.440,0:07:51.050 And then we have operations that we do to[br]the handle like open it, 0:07:51.050,0:07:53.470 close it, read it, write it.[br]So we do things. 0:07:53.470,0:07:56.370 So, so the handle and then through the[br]handle it actually changes 0:07:56.370,0:07:58.860 what's on the disk or reads[br]what's on the disk. 0:07:58.860,0:08:01.560 So the handle is kind of a thing that's[br]not there. 0:08:02.890,0:08:06.460 If you attempt to open a file and the name[br]of the file. 0:08:06.460,0:08:08.660 Now the way we're going to do these is[br]these need to be 0:08:08.660,0:08:14.490 in the same folder on your computer as in,[br]as your Python code. 0:08:14.490,0:08:16.110 Now, there are trickier ways to do it, but 0:08:16.110,0:08:17.180 we're going to keep it simple. 0:08:17.180,0:08:18.670 This is the name of a file in the 0:08:18.670,0:08:21.590 same folder as the Python code that you're[br]running. 
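A minimal sketch of the open call and the handle it returns, as described above. It is written for Python 3, and it first creates a small throwaway file (a hypothetical `sample.txt`, not the lecture's `mbox.txt`) so that it runs anywhere. The `with` block and the write step are only setup and are not part of the lecture.

```python
import os
import tempfile

# Setup only: create a small hypothetical file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("From: a@example.com\nSubject: hello\n")

# First argument: the file name. Second argument: the mode.
# "r" means read; it is also the default, so open(path) does the same thing.
fhand = open(path, "r")

# Printing the handle shows a description of the connection
# (file name, mode), not the contents of the file.
print(fhand)

mode = fhand.mode   # the mode the handle was opened with
fhand.close()       # end the "phone call" to the disk
```

The exact text printed for the handle differs between Python versions, so no expected output is shown here.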
0:08:21.590,0:08:28.100 [SOUND] And if it's not, then we get, of[br]course, a traceback and we're 0:08:28.100,0:08:32.289 used to using, reading tracebacks by[br]now, no such file or directory stuff.txt. 0:08:32.289,0:08:34.710 Oh, of course, I forgot to save it or I[br]typed it wrong. 0:08:37.820,0:08:38.840 So. 0:08:38.840,0:08:42.659 The next thing we have to learn is the[br]notion of the newline character. 0:08:42.659,0:08:44.390 You haven't seen this so far, 0:08:44.390,0:08:47.960 but there's a special character in files 0:08:47.960,0:08:52.030 that is used to indicate the end of a line. 0:08:52.030,0:08:53.780 Because these text files that we've been[br]writing, 0:08:53.780,0:08:57.720 including Python programs that you have,[br]are organized into lines. 0:08:57.720,0:08:59.690 Each line has variable length and there is 0:08:59.690,0:09:02.870 a special non-printing character that you[br]just don't see. 0:09:02.870,0:09:05.840 Now you see it because you see a line, 0:09:05.840,0:09:10.710 multiple lines, but you don't see the[br]character itself. 0:09:10.710,0:09:13.130 So it turns out that this character is[br]very 0:09:13.130,0:09:15.690 important because the data is just a[br]stream of 0:09:15.690,0:09:18.850 characters on disk and then it's[br]punctuated by newlines 0:09:18.850,0:09:22.200 that tell it when it's time to end the[br]line. 0:09:22.200,0:09:29.368 So if we are building a string, the[br]constant for newline is backslash n. 0:09:29.368,0:09:32.973 And so, when we make a string that we[br]want to 0:09:32.973,0:09:38.380 have a newline in it, we'll say Hello[br]backslash n World. 0:09:38.380,0:09:41.370 And then if you print it out one way, you[br]actually see the backslash n. 0:09:41.370,0:09:44.230 But then if you use the print to print it[br]out, you see sort of 0:09:44.230,0:09:49.580 like the, it moves back down, you know,[br]to the left margin and down. 
0:09:49.580,0:09:55.900 So, so, sometimes you see the slash n[br]and sometimes it's shown as movement. 0:09:55.900,0:09:57.140 Right? You, it moves it. 0:09:58.670,0:10:00.000 The other thing that's important is even 0:10:00.000,0:10:02.130 though we represent this as two[br]characters, 0:10:02.130,0:10:06.300 the backslash n is represented as two characters[br]in a string, it's actually one character. 0:10:06.300,0:10:10.250 So if we print it out, we see[br]X newline Y 0:10:10.250,0:10:13.280 and if we ask how many characters are[br]in stuff, 0:10:13.280,0:10:17.436 which is this string, it says 3.[br]That's important. 0:10:17.436,0:10:18.118 Okay? 0:10:18.118,0:10:22.070 There is one, two, three.[br]The newline is a single character. 0:10:22.070,0:10:26.890 This is a just a syntax that we use to[br]sort of encode a newline in a string. 0:10:27.890,0:10:28.390 Okay? 0:10:29.450,0:10:33.710 So, even though these are just a 0:10:33.710,0:10:36.590 long sequence of characters punctuated by[br]newlines, 0:10:36.590,0:10:40.930 visually, text editors and operating[br]systems show them, show 0:10:40.930,0:10:43.950 these files to us as a sequence of lines. 0:10:43.950,0:10:46.280 And it doesn't take very long to just[br]start thinking about them 0:10:46.280,0:10:47.650 as a sequence of lines. 0:10:47.650,0:10:50.570 As a matter of fact, maybe you never, wish[br]I'd never told you about newlines. 0:10:51.830,0:10:53.080 But when we start reading files, we're 0:10:53.080,0:10:54.990 going to have to deal with these newlines. 0:10:54.990,0:10:59.260 So the way that we sort of have to[br]mentally visualize of what these text 0:10:59.260,0:11:03.980 files look like is they have a newline[br]that punctuates the end of the line. 0:11:03.980,0:11:09.200 Now in reality, if we look at this, this[br]R really comes right after it. 0:11:09.200,0:11:09.440 Right? 0:11:09.440,0:11:13.410 This is all a bunch of characters and the[br]newlines are punctuation, okay? 
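The newline behavior described above can be tried directly. This is a small Python 3 sketch; `repr` stands in for "printing it one way" (what the interactive prompt shows), which is an assumption about what the slide displayed.

```python
stuff = "X\nY"        # backslash-n is typed as two characters in the source code

print(repr(stuff))    # shows the escape sequence as written: 'X\nY'
print(stuff)          # here the newline shows up as movement: X, then Y on the next line

length = len(stuff)   # X, the newline, and Y
print(length)         # 3
```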
0:11:13.410,0:11:16.720 To say this is first line, second line,[br]third line, and fourth line. 0:11:16.720,0:11:18.730 So, you gotta think that each of these[br]things 0:11:18.730,0:11:21.710 is here, sitting at the end of the line. 0:11:21.710,0:11:24.950 And so the number of characters in this[br]line includes that newline. 0:11:24.950,0:11:26.930 Now the newline is one character. 0:11:26.930,0:11:31.920 Okay? So, how do we read these files? 0:11:31.920,0:11:36.130 Well, we've already talked about doing an[br]open xfile. 0:11:36.130,0:11:39.100 And I'm just, this xfile, again that's[br]just a mnemonic 0:11:39.100,0:11:41.840 name that I made up. This is a handle. 0:11:41.840,0:11:43.690 Remember, it's not all the data. 0:11:43.690,0:11:46.070 But the handle is the way that we can read[br]the data. 0:11:46.070,0:11:48.690 We can use it as an access point. 0:11:48.690,0:11:52.000 The coolest way to read a file, if it's a[br]text file in multiple 0:11:52.000,0:11:58.270 lines, is to use a determinate loop, a[br]for loop. for cheese in xfile. 0:11:58.270,0:12:03.090 So this, remember we would put a list of[br]numbers or a string here. 0:12:03.090,0:12:04.150 Now we've put a file 0:12:04.150,0:12:05.200 handle here. 0:12:05.200,0:12:09.315 Python knows automatically that each time[br]we are going to run this 0:12:09.315,0:12:11.980 loop, it's going to go to the next line of[br]the file. 0:12:11.980,0:12:16.092 Automatically, for, a cheese is just a[br]stupid name that I came up with. 0:12:16.092,0:12:20.232 It would be better to call it line rather than[br]cheese, but for cheese in and then it goes 0:12:20.232,0:12:22.812 dot, dot, dot, dot, dot, dot, dot,[br]each file 0:12:22.812,0:12:25.760 and then it stops when it reads[br]the whole file.
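The for loop over a file handle can be sketched as follows in Python 3. The file contents are made up for illustration, and `repr` is used when printing so that the newline at the end of each line is visible.

```python
import os
import tempfile

# Setup only: a hypothetical three-line file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("first line\nsecond line\nthird line\n")

xfile = open(path)
lines_seen = []
for cheese in xfile:             # each pass hands over the next line of the file
    lines_seen.append(cheese)    # the line still ends with its newline character
    print(repr(cheese))          # repr makes that trailing \n visible
xfile.close()
```

The loop stops by itself once the last line has been read; no counter or end-of-file check is needed.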
0:12:25.760,0:12:29.270 So this line will print out every line 0:12:29.270,0:12:33.840 in the file, that's how you do it.[br]These three lines open a file, 0:12:35.510,0:12:42.400 read every line in the file, okay?[br]So a file handle itself is a special kind 0:12:42.400,0:12:47.170 of a sequence, much like a list of numbers[br]or a string is a sequence of characters. 0:12:47.170,0:12:48.860 So one of the things we can do to combine[br]one of 0:12:48.860,0:12:51.930 our counting idioms is count the number of[br]lines in a file. 0:12:53.345,0:12:54.340 Okay? And so how we 0:12:54.340,0:12:56.970 would do that is we would open[br]the file, set a 0:12:56.970,0:13:00.580 counter to zero, this time I'll use a[br]mnemonic variable called count. 0:13:00.580,0:13:02.950 For line in fhand, that says run this 0:13:02.950,0:13:05.740 indented text once for each line in the[br]file. 0:13:05.740,0:13:08.410 For each line in the file, add count equals[br]count plus 1. 0:13:08.410,0:13:10.760 When the for loop is done, print the[br]count. 0:13:13.010,0:13:14.480 Pretty straightforward. 0:13:14.480,0:13:18.240 Very few other languages are capable of[br]writing that program in 0:13:18.240,0:13:22.160 as quick and as dense and succinct a way as[br]Python is. 0:13:22.160,0:13:25.080 Python does a really, really nice[br]job of this. 0:13:25.080,0:13:28.140 Okay? So that's how you count the lines. 0:13:28.140,0:13:31.250 Open it, write a for loop, and then add[br]one. 0:13:31.250,0:13:35.980 Now we, we can't just say, so what you[br]can't do, and this gives you a sense. 0:13:35.980,0:13:37.125 You can't say len, 0:13:37.125,0:13:40.300 fhand. 0:13:40.300,0:13:42.620 And that's because this isn't really the[br]data. 0:13:42.620,0:13:45.390 That's sort of, you have to like pull the,[br]pull it 0:13:45.390,0:13:48.080 and read it to get the data out of it. 0:13:48.080,0:13:49.990 Although we'll see another way of reading[br]it later. 0:13:51.020,0:13:53.270 Okay? 
So that's counting the lines in a[br]file. 0:13:55.100,0:13:57.460 It turns out you can also read the entire[br]file. 0:13:58.980,0:14:02.100 Now if you read the entire file, it's not[br]broken into lines. 0:14:02.100,0:14:04.000 You're getting all the characters[br]punctuated 0:14:04.000,0:14:06.320 by newlines and you get everything. 0:14:06.320,0:14:09.820 Now you don't want to read this if it's[br]too big, so it's 0:14:09.820,0:14:12.610 going to all try to read it into the[br]memory of the computer. 0:14:12.610,0:14:16.080 And if the memory is not big enough,[br]you're going to slow down to a crawl. 0:14:16.080,0:14:18.905 But if it's a real tiny file, this works[br]just fine. 0:14:18.905,0:14:22.120 And so, so we have sort of real, we open 0:14:22.120,0:14:26.990 a file and we say fhand.read, this is[br]basically saying, hey, 0:14:26.990,0:14:30.840 dear fhand, read it all and return it to[br]me as a string. 0:14:31.950,0:14:34.350 So that's a string with all the lines of[br]the file concatenated 0:14:34.350,0:14:38.790 together with newlines, which is actually[br]exactly what's in the file. 0:14:38.790,0:14:39.800 It's the raw data. 0:14:39.800,0:14:42.400 That for loop sort of looks for the newline 0:14:42.400,0:14:44.410 and does all of the stuff[br]automatically for us. 0:14:44.410,0:14:45.160 It's quite nice. 0:14:46.410,0:14:49.670 So then we can, like, because inp is a[br]string at this point, 0:14:49.670,0:14:50.590 we can just print the length of it. 0:14:50.590,0:14:53.110 And we can say, oh, there's 94,626 0:14:53.110,0:14:56.780 characters that came from that file. 0:14:56.780,0:15:01.910 It reads the whole thing, whole file,[br]reads the whole file. 0:15:01.910,0:15:04.450 We can also do things like, you know, slice[br]it now. 0:15:04.450,0:15:10.330 And so this is the first 20 characters,[br]up from zero up to, but not including, 20. 0:15:10.330,0:15:12.790 So this, this is our file. Okay? 
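Both ways of reading discussed above, counting lines with a for loop and pulling in the whole file with `read`, can be sketched side by side. This is Python 3, and the file contents and addresses are hypothetical stand-ins for `mbox.txt`.

```python
import os
import tempfile

# Setup only: hypothetical contents standing in for the lecture's mailbox file.
text = "From: one@example.com\nSubject: test\nFrom: two@example.com\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write(text)

# Counting pattern: run the indented line once for each line in the file.
fhand = open(path)
count = 0
for line in fhand:
    count = count + 1
fhand.close()
print("Line count:", count)

# len(fhand) would not work: the handle is a connection, not the data.
# Reading the whole file returns one string, newlines included.
fhand = open(path)
inp = fhand.read()
fhand.close()
print(len(inp))     # number of characters in the whole file
print(inp[:20])     # the first 20 characters: positions 0 up to, but not including, 20
```

Reading everything at once is only a good idea for files that fit comfortably in memory, as the lecture notes.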
0:15:12.790,0:15:15.640 So that's reading through the whole file. 0:15:15.640,0:15:18.390 So, let me go back a little bit, this is[br]the file that we're 0:15:18.390,0:15:19.230 going to play with. 0:15:20.370,0:15:24.920 This file here that we're going to play[br]with in this class is a mailbox file. 0:15:24.920,0:15:27.120 And this is actual real data.[br]And these are real people. 0:15:27.120,0:15:28.890 And these are real dates, having to do[br]with 0:15:28.890,0:15:31.680 an open source project that I worked on[br]called Sakai. 0:15:31.680,0:15:35.990 I actually have a tattoo of Sakai here on[br]my shoulder. 0:15:35.990,0:15:38.220 Maybe in an upcoming lecture, I'll have a 0:15:38.220,0:15:40.430 short-sleeved shirt, and show you my[br]tattoo. 0:15:40.430,0:15:44.500 But for now, I can't because I've got, got[br]clothes on. 0:15:44.500,0:15:52.480 So, but this is real data.[br]It's the mbox.txt, mbox.txt file. 0:15:52.480,0:15:56.270 So, so that's the file that we're going to[br]use for most of the next few assignments. 0:15:56.270,0:15:57.960 It'll be the same file. You'll get tired of it. 0:15:57.960,0:16:00.330 And you'll get to know all these people,[br]Stephen, 0:16:00.330,0:16:02.110 Chen Wen, and all the people in the file. 0:16:05.360,0:16:06.020 Okay, so. 0:16:07.440,0:16:10.470 We can search for lines that have a[br]prefix. 0:16:10.470,0:16:14.380 This is kind of the find pattern from the[br]looping lecture. 0:16:14.380,0:16:17.860 So we're going to go through a list of, of[br]lines in a file, 0:16:17.860,0:16:20.910 and we're going to only print out the ones[br]that match a certain thing. 0:16:20.910,0:16:22.810 So again, we open the file up. 0:16:22.810,0:16:25.410 We're going to write a for loop that's[br]going to say, for each line in the 0:16:25.410,0:16:30.420 file, if the line and then we can call a,[br]a utility function 0:16:30.420,0:16:32.660 inside of string, because line is a string. 
0:16:32.660,0:16:35.230 If line startswith From, print it out. 0:16:35.230,0:16:37.860 So this means it's going to loop through[br]all of the lines in the 0:16:37.860,0:16:43.180 file and it's going to print the ones that[br]start with the string 'From:' 0:16:44.530,0:16:45.700 Okay? 0:16:45.700,0:16:49.780 Again, four lines, complete Python program[br]to read this 0:16:49.780,0:16:52.760 file and print the lines that have a[br]prefix of from. 0:16:54.950,0:16:59.020 So, if you run this program, and I suggest[br]that you do, 0:17:01.050,0:17:02.710 this is what the output's going to look like. 0:17:03.840,0:17:07.160 And it's like, wait a second, I'm seeing[br]the lines, 0:17:09.680,0:17:13.990 seeing the lines that have the froms, but[br]then I get these blank lines. 0:17:16.530,0:17:18.950 And why is that?[br]Why are these blank lines there? 0:17:18.950,0:17:24.390 If I look at the program, I mean, I'm not[br]printing blank lines. 0:17:24.390,0:17:26.369 I'm only printing lines that[br]start with from. 0:17:26.369,0:17:27.520 I'm not doing that, so why? 0:17:30.520,0:17:31.020 What do you think? 0:17:31.790,0:17:32.530 I'll give you a second. 0:17:34.740,0:17:38.080 I've certainly done enough foreshadowing[br]in this lecture. 0:17:38.080,0:17:41.100 Well it turns out these newlines are the[br]problem. 0:17:41.100,0:17:43.850 So it turns out that the print, we've been[br]doing this 0:17:43.850,0:17:46.580 all along, you just, we didn't make a fuss[br]about it. 0:17:46.580,0:17:49.930 The print adds a newline at the end of[br]everything that it prints. 0:17:49.930,0:17:53.270 So the yellow newlines are coming from[br]the print statement. 0:17:53.270,0:17:57.500 But when we read the file, each line ends[br]in a newline. 0:17:57.500,0:18:00.490 So these green newlines are actually from[br]the file. 0:18:03.170,0:18:05.960 They're the ones from the file. 
0:18:05.960,0:18:08.060 So what's happening is we're seeing two 0:18:08.060,0:18:10.530 newlines, and so that turns into a[br]blank line. 0:18:11.870,0:18:14.140 So, how do we deal with that? 0:18:14.140,0:18:19.140 Well, we've got a string function that[br]conveniently solves that problem, okay? 0:18:19.140,0:18:21.040 And that is we're going to call rstrip. 0:18:21.040,0:18:25.200 If you recall, we had strip, lstrip, and[br]rstrip to strip 0:18:25.200,0:18:28.380 white space on one side, on the other[br]side, or on both sides. 0:18:28.380,0:18:29.510 So in this one, 0:18:29.510,0:18:30.570 we're going to use rstrip. 0:18:30.570,0:18:33.130 We're going to say, we're going to read[br]the line, that 0:18:33.130,0:18:35.545 this line is going to have a newline in it. 0:18:35.545,0:18:40.200 rstrip says pull white space, and the[br]newlines are also counted as white space. 0:18:40.200,0:18:42.870 Blanks or newlines are white space. 0:18:42.870,0:18:46.610 And then we're going to replace this with[br]no newline in it. 0:18:46.610,0:18:50.108 Then we're going to ask if it starts with[br]a from and then we're going to print it 0:18:50.108,0:18:51.804 out, and then we go and we're going to 0:18:51.804,0:18:55.130 see exactly what we're looking for[br]in this file. 0:18:55.130,0:18:56.040 And there's no newlines. 0:18:56.040,0:19:01.360 So the newline that's coming out here[br]is the one from the print, not the 0:19:01.360,0:19:03.930 one from the file, because the one from 0:19:03.930,0:19:06.610 the file got wiped out by that particular[br]line. 0:19:07.950,0:19:08.450 Okay? 0:19:09.720,0:19:13.360 So another general pattern of these[br]file-based loops 0:19:13.360,0:19:17.510 that we have done this, is a skipping[br]pattern. 0:19:17.510,0:19:20.490 Now, you can do, the, the non-skipping[br]pattern 0:19:20.490,0:19:22.960 is where you're saying, I'm going to look[br]for lines 0:19:22.960,0:19:25.640 that start with from and do something to[br]them. 
0:19:25.640,0:19:30.420 Sometimes you'll want to do something to[br]all, to, to the to, to, you want to say, 0:19:30.420,0:19:32.790 here's a bunch of lines I'm going to[br]skip, and then I'm going to do something. 0:19:32.790,0:19:36.520 So the skipping pattern uses continue. 0:19:36.520,0:19:38.840 And so the first few lines here are the[br]same. 0:19:38.840,0:19:41.760 We open a file, we read each line[br]in the file, 0:19:41.760,0:19:43.780 but we're going to strip off the white[br]space. 0:19:43.780,0:19:45.640 You're going to get tired of typing these[br]three lines, 0:19:45.640,0:19:47.280 because you're going to do it a lot. 0:19:47.280,0:19:51.890 Open the file, start reading the file,[br]strip the whitespace for each line. 0:19:51.890,0:19:57.740 And you can make it so that you can look[br]for some fact. 0:19:57.740,0:20:01.260 In this case, I'm going to say, if not[br]line startswith From, this 0:20:01.260,0:20:05.220 means this is true for all the lines that[br]don't start with from, 0:20:05.220,0:20:08.600 continue. And if you remember, continue[br]goes up. 0:20:08.600,0:20:10.960 So the continue says I'm done, it[br]finishes 0:20:10.960,0:20:14.230 the iteration, and it doesn't do anything[br]down here. 0:20:14.230,0:20:15.130 Okay? 0:20:15.130,0:20:18.210 And so it, this is a, and then, we can do[br]something. 0:20:18.210,0:20:21.110 So, I've kind of flipped this, where I[br]said, these are the 0:20:21.110,0:20:24.830 things I'm interesting, interested in,[br]that's lines that start with from. 0:20:24.830,0:20:26.270 So, I'm going to skip the lines that[br]don't. 0:20:26.270,0:20:27.880 So I'm going to use continue. 0:20:27.880,0:20:32.420 Either way you can do it, depending on the[br]complexity or how much. 0:20:32.420,0:20:34.100 Often when you're, this is a good pattern[br]when 0:20:34.100,0:20:36.400 you have lots of lines of code down here 0:20:36.400,0:20:37.850 that you're going to do a lot of cool[br]stuff with. 
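The rstrip step and the skipping pattern with continue can be sketched together. This is Python 3 with a made-up four-line file; the `kept` list is added only so the result can be inspected afterwards.

```python
import os
import tempfile

# Setup only: a hypothetical file with a mix of matching and non-matching lines.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("From: one@example.com\n"
              "Subject: test\n"
              "From: two@example.com\n"
              "X-Note: not From a sender\n")

fhand = open(path)
kept = []
for line in fhand:
    line = line.rstrip()                 # drop the newline (and other white space) on the right
    if not line.startswith("From:"):     # true for every line we are not interested in
        continue                         # skip the rest of the loop body for this line
    kept.append(line)
    print(line)                          # print adds its own single newline, so no blank lines
fhand.close()
```

Without the `rstrip` call, each printed line would be followed by a blank line, because the newline from the file and the newline from print would both appear.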
0:20:39.290,0:20:42.780 You can also use things like in to select[br]lines. 0:20:42.780,0:20:43.320 Right? 0:20:43.320,0:20:51.199 So I'm going to, I'm going to look for[br]lines that have @uct.ac.za in them. 0:20:51.199,0:20:53.073 So again, I'm going to open it up. 0:20:53.073,0:20:55.830 I'm going to open these, go through each[br]line in the file. 0:20:55.830,0:21:00.510 I'm going to strip the white space out,[br]and [COUGH] 0:21:00.510,0:21:03.070 if not u-c-t, 0:21:03.070,0:21:07.930 if this string is not in line, then I'm[br]going to continue. 0:21:07.930,0:21:12.270 So it's a way for me to skip all of the[br]lines that don't have this string in it. 0:21:14.000,0:21:19.260 So these lines do, that one has it too,[br]and then we're going to print it out. 0:21:19.260,0:21:23.750 It will print out the ones that make it past[br]here, okay? 0:21:23.750,0:21:28.440 So, but in is another way to do searching,[br]right, starts with, 0:21:28.440,0:21:28.940 et cetera. 0:21:30.640,0:21:37.550 So one more thing that you might want to[br]try is, so we can count, right? 0:21:37.550,0:21:40.270 Now, and this is a pattern for prompting[br]for a file name. 0:21:41.920,0:21:45.850 And so, so here you, you'll get tired of[br]sort of 0:21:45.850,0:21:48.770 changing your code every time you want to[br]open a different file. 0:21:48.770,0:21:50.849 because you probably want to run the[br]program 0:21:50.849,0:21:53.621 with mbox once and mbox-short because,[br]just so you 0:21:53.621,0:21:57.850 can test it with different things of data.[br]So here's just another pattern. 0:21:57.850,0:22:01.910 We add this line to say raw_input, enter[br]the file name. 0:22:01.910,0:22:04.700 And there you go, we'll type in the file[br]name. 0:22:04.700,0:22:08.240 And then the thing that we open is[br]whatever we entered as the file name. 0:22:08.240,0:22:11.280 And then the rest of it is pretty much[br]yada yada. 0:22:11.280,0:22:14.060 So here I'm, it's reading the whole file. 
0:22:14.060,0:22:17.230 If the line starts with subject, count[br]equals count plus one. 0:22:17.230,0:22:19.340 And then there were 1797 subject 0:22:19.340,0:22:21.850 lines in mbox.txt. 0:22:21.850,0:22:26.500 There were 27 subject lines in[br]mbox-short.txt, okay? 0:22:26.500,0:22:29.020 So that's prompting for the file names. 0:22:29.020,0:22:31.310 Now, open. 0:22:31.310,0:22:35.450 The open function fails if the file name[br]doesn't exist. 0:22:35.450,0:22:37.290 So, you might want to add a try and 0:22:37.290,0:22:39.840 except around that if you want to, if[br]you're just writing 0:22:39.840,0:22:42.530 code for yourself and you assume that[br]everything's okay, 0:22:42.530,0:22:44.610 then you don't have to write try except[br]but if 0:22:44.610,0:22:50.610 you want to catch it [SOUND][br]and catch a bad file name, 0:22:50.610,0:22:55.860 then you take the open which, and turn it[br]into these four lines. 0:22:55.860,0:22:58.480 So this is the code that we think might[br]blow up, 0:22:59.500,0:23:01.330 and it's going to blow up, we know it's[br]going to blow up. 0:23:01.330,0:23:03.510 If they enter a bad file name like 0:23:03.510,0:23:06.510 na na boo boo, right, this is going to[br]blow up. 0:23:06.510,0:23:08.940 So what do we do?[br]We use try and except. 0:23:08.940,0:23:09.780 We put try 0:23:09.780,0:23:10.390 around that. 0:23:10.390,0:23:14.210 We're going to take out some insurance on[br]that particular line. 0:23:14.210,0:23:16.540 And then, if it fails, we're going to[br]print 0:23:16.540,0:23:20.500 this message and then say exit, to get[br]out. 0:23:20.500,0:23:22.920 So if you get a good file, 0:23:25.500,0:23:27.930 if you get a good file, it works, skips the 0:23:27.930,0:23:31.610 except, then runs the thing, prints out[br]the count. 0:23:31.610,0:23:35.930 That's what's happening here.
If, on the[br]other hand, you get a bad file, 0:23:36.990,0:23:41.940 it comes here, open blows up, runs the[br]except, prints this out, and then quits. 0:23:43.210,0:23:45.680 So that's how this one works with a bad[br]file. 0:23:46.820,0:23:48.538 And now, no traceback, right? 0:23:53.538,0:23:55.386 So we are 0:23:56.690,0:24:00.270 It's kind of a short lecture.[br]We're done with Chapter Seven. 0:24:01.480,0:24:03.940 We open a file. 0:24:03.940,0:24:05.670 We read the file. 0:24:05.670,0:24:09.380 We take out white space at the end with[br]rstrip. 0:24:09.380,0:24:11.670 We had used string functions. 0:24:11.670,0:24:14.650 So, this is kind of putting it all[br]together. 0:24:14.650,0:24:17.280 And it's kind of short little programs[br]now. 0:24:17.280,0:24:22.100 So, it's not.[br]And you know, starting now, 0:24:22.100,0:24:25.255 we are going to start putting these things[br]together and start actually doing work. 0:24:25.255,0:24:28.100 Because now, we have, from the first few[br]chapters, 0:24:28.100,0:24:32.390 we have basic capabilities of Python.[br]Now we have some data to work with. 0:24:32.390,0:24:33.180 Now going forward 0:24:33.180,0:24:36.570 we are going to do increasingly[br]sophisticated things with that data. 0:24:36.570,0:24:38.240 So I can't wait to see in the next[br]lecture.
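Putting the chapter together: a sketch of prompting for a file name, guarding the open with try and except, and counting the lines with a given prefix. Several things here are assumptions rather than the lecture's exact code: it is Python 3, so `input` replaces the lecture's `raw_input`; the helper names `count_lines_with_prefix` and `main` are hypothetical; the except clause catches `OSError` instead of everything; and the helper returns `None` on a bad file name where the lecture's version prints a message and exits.

```python
import os
import tempfile

def count_lines_with_prefix(fname, prefix):
    """Return how many lines start with prefix, or None if the file cannot be opened."""
    try:
        fhand = open(fname)              # the one line that might blow up
    except OSError:
        print("File cannot be opened:", fname)
        return None
    count = 0
    for line in fhand:
        if line.startswith(prefix):
            count = count + 1
    fhand.close()
    return count

def main():
    # Interactive version: call main() yourself to be prompted for a file name.
    fname = input("Enter the file name: ")
    count = count_lines_with_prefix(fname, "Subject:")
    if count is not None:
        print("There were", count, "subject lines in", fname)

# Non-interactive demonstration with a hypothetical file.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.txt")
with open(path, "w") as out:
    out.write("From: one@example.com\nSubject: first\n\nSubject: second\n")

good = count_lines_with_prefix(path, "Subject:")
bad = count_lines_with_prefix(os.path.join(folder, "na_na_boo_boo.txt"), "Subject:")
print(good, bad)
```

Returning `None` keeps the helper reusable from other code; a short script for personal use could call `exit()` in the except block instead, as the lecture does.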