1 00:00:00,200 --> 00:00:01,660 Welcome to Chapter Seven. 2 00:00:01,660 --> 00:00:03,650 Python for Informatics: Exploring Information. 3 00:00:03,650 --> 00:00:04,530 I'm Charles Severence. 4 00:00:04,530 --> 00:00:09,720 I'm the author of the book and your host. And, as always, this is brought to you by. 5 00:00:09,720 --> 00:00:10,410 No, I'm sorry. 6 00:00:10,410 --> 00:00:14,650 It's all creative copyright, Creative Commons Attribution. 7 00:00:14,650 --> 00:00:18,680 The audio, the video, the slides, and even the book. 8 00:00:18,680 --> 00:00:21,080 So, here we go. 9 00:00:21,080 --> 00:00:25,418 Oh, and and so, frankly, where we've been working 10 00:00:25,418 --> 00:00:34,280 all along is, we have been writing code and talking to the CPU. 11 00:00:34,280 --> 00:00:37,477 Hang on, let me, let me go get my CPU and stuff. 12 00:00:37,477 --> 00:00:42,151 Hang on, be right back. 13 00:00:44,151 --> 00:00:49,673 [SOUND] Okay. 14 00:00:49,673 --> 00:00:53,730 Here we go. Here we go. 15 00:00:53,730 --> 00:00:58,830 Here's all the stuff. Remember the stuff from the first lecture? 16 00:01:00,980 --> 00:01:01,710 There we go with that. 17 00:01:02,860 --> 00:01:05,840 Remember the motherboard from the first lecture? 18 00:01:05,840 --> 00:01:08,100 This is kind of the picture of what's on the screen. 19 00:01:08,800 --> 00:01:12,158 The motherboard, the CPU plugs in here, memory plugs in here. 20 00:01:12,158 --> 00:01:18,120 And remember how the CPU is sort of the brains, as 21 00:01:18,120 --> 00:01:22,790 much brains as there is, for the operation. The CPU is asking what next. 22 00:01:22,790 --> 00:01:26,240 The instructions come in through these little pins. 23 00:01:26,240 --> 00:01:30,131 There's data inside, and it stores sort of semi-permanent 24 00:01:30,131 --> 00:01:33,480 data, variables, are all stored pretty much here in RAM. 25 00:01:34,820 --> 00:01:37,970 And we write our programs, and so your Python programs, they're sitting here 26 00:01:37,970 --> 00:01:44,140 in this RAM, and they're being fed to this CPU through those chips. 27 00:01:44,140 --> 00:01:45,280 Through those pins, right? 28 00:01:45,280 --> 00:01:47,850 The pins, I mean it doesn't really connect like that. 29 00:01:47,850 --> 00:01:51,850 And so, so frankly, up to now, everything that we've been doing 30 00:01:51,850 --> 00:01:54,520 is just the Python programming language. 31 00:01:54,520 --> 00:01:58,210 And so the only place we've really been operating is here. 32 00:01:59,650 --> 00:02:02,800 We have been putting Python into the main memory. 33 00:02:02,800 --> 00:02:05,870 And the main memory. And we have 34 00:02:05,870 --> 00:02:09,596 been effectively feeding instructions to the CPU, 35 00:02:09,596 --> 00:02:14,080 the central processing unit, as it needed them, and then the program would stop. 36 00:02:14,080 --> 00:02:15,580 And everything we've done so far 37 00:02:15,580 --> 00:02:17,260 everything 38 00:02:17,260 --> 00:02:22,290 is just sort of fiddling around here. We have never escaped it. 39 00:02:22,290 --> 00:02:25,500 So now we are finally going to escape 40 00:02:25,500 --> 00:02:28,160 from the central processing unit and the memory. 41 00:02:29,180 --> 00:02:31,720 We'll still write programs and have variables in here. 42 00:02:32,920 --> 00:02:38,660 But now we're going to use the disk, the secondary storage, the 43 00:02:38,660 --> 00:02:44,300 permanent media, right? So if I go grab my Raspberry Pi, 44 00:02:44,300 --> 00:02:46,000 alright, that goes right there. 45 00:02:46,000 --> 00:02:51,290 Here's my Raspberry Pi, so here we've got the Raspberry Pi, which is the small version, 46 00:02:51,290 --> 00:02:55,990 which of course has a CPU, memory, and 47 00:02:55,990 --> 00:02:58,770 graphics processor, all in this little chip right here. 48 00:02:58,770 --> 00:03:02,850 But the secondary memory for the, is this little 49 00:03:02,850 --> 00:03:05,710 SD card that is the secondary memory for Raspberry Pi. 50 00:03:05,710 --> 00:03:07,500 So the structure of the Raspberry Pi is 51 00:03:07,500 --> 00:03:09,300 exactly the same as the structure of any other 52 00:03:09,300 --> 00:03:13,370 personal computer, it's just smaller and less expensive. 53 00:03:13,370 --> 00:03:14,630 And so in the Raspberry Pi, if you're 54 00:03:14,630 --> 00:03:17,780 programming the Raspberry Pi, you're sort of finally escaping. 55 00:03:17,780 --> 00:03:19,550 All your programs were in here. 56 00:03:19,550 --> 00:03:24,350 Your CPU is in here and that's pretty much how, how far you've got to run. 57 00:03:24,350 --> 00:03:28,730 But now, of course when you save your files, you save them to here. 58 00:03:28,730 --> 00:03:34,618 But now we are going to start looking at data on the disk drive and so it's time 59 00:03:34,618 --> 00:03:38,880 to escape to the secondary memory. Okay? 60 00:03:38,880 --> 00:03:41,000 Time to escape to the secondary memory. 61 00:03:41,000 --> 00:03:43,930 And Raspberry Pi, you can go right there. Okay? 62 00:03:43,930 --> 00:03:45,790 So it's time to find some data to mess with. 63 00:03:45,790 --> 00:03:48,690 So a lot of what we've been doing so far is just 64 00:03:48,690 --> 00:03:52,740 kind of the pre-work to get to the point where we can do this. 65 00:03:52,740 --> 00:03:54,600 And in here we're going to have data files. 66 00:03:54,600 --> 00:03:55,760 Now, we've been making data files. 67 00:03:55,760 --> 00:04:00,010 You've been writing, every Python program that you write on your computer gets saved 68 00:04:00,010 --> 00:04:03,180 as a file. Then Python reads the file and runs it. 69 00:04:04,380 --> 00:04:07,200 But now we're actually going to start messing with some data. 70 00:04:09,060 --> 00:04:11,530 And so, files are where we're going to be working. 71 00:04:11,530 --> 00:04:16,750 And so, one of things about secondary memory is it's much larger. 72 00:04:18,779 --> 00:04:21,480 And this is, main memory of the computer is pretty large, it's just 73 00:04:21,480 --> 00:04:26,090 not large enough to hold everything that the computer is capable of holding. 74 00:04:26,090 --> 00:04:28,010 So the files that we're going to work with. 75 00:04:28,010 --> 00:04:32,230 Now we're not talking about image files or Quicktime movies or things like that. 76 00:04:32,230 --> 00:04:34,390 We're going to work with text files because the 77 00:04:34,390 --> 00:04:37,540 theme of this course is digging through text. 78 00:04:37,540 --> 00:04:39,090 Sometimes we'll pull it off the Internet. 79 00:04:39,090 --> 00:04:42,170 Sometimes we'll read files, but it's digging through and 80 00:04:42,170 --> 00:04:44,030 using all the things that we've learned so far, 81 00:04:44,030 --> 00:04:46,040 looping and strings, and all those things, 82 00:04:46,040 --> 00:04:49,400 to make sense of a sequence of information. 83 00:04:50,520 --> 00:04:51,540 Okay? 84 00:04:51,540 --> 00:04:57,670 Now, to access file information, we have to do this thing called opening the file. 85 00:04:57,670 --> 00:05:02,400 We can't just say, yo, the information is just omnipresent because there are 86 00:05:02,400 --> 00:05:06,170 so much data that you can't have Python sort of know all the data. 87 00:05:06,170 --> 00:05:09,220 You literally have hundreds of thousands of files on 88 00:05:09,220 --> 00:05:12,420 your computer's hard drive. And you, 89 00:05:12,420 --> 00:05:13,500 which one are you going to read? 90 00:05:13,500 --> 00:05:15,770 So there's a step that you have to do, 91 00:05:15,770 --> 00:05:19,030 that you call this built-in function called open. 92 00:05:19,030 --> 00:05:21,880 And say, oh, this is the file that I want to work with, 93 00:05:21,880 --> 00:05:23,850 of the hundreds of thousands, and then once you do, 94 00:05:23,850 --> 00:05:27,510 you've kind of got this little connector into it. 95 00:05:27,510 --> 00:05:31,520 And the open is a built-in function inside Python. 96 00:05:31,520 --> 00:05:34,300 Hang on a sec, let's say good bye to that. The open 97 00:05:34,300 --> 00:05:39,680 function is a built-in function in Python, and you, it takes two parameters. 98 00:05:39,680 --> 00:05:45,810 The first parameter is the name of the file, like mbox.txt, 99 00:05:45,810 --> 00:05:48,600 and then the second is how you're going to read it. 100 00:05:48,600 --> 00:05:49,270 Are you going to read it? 101 00:05:49,270 --> 00:05:50,280 are you going to write it? et cetera. 102 00:05:50,280 --> 00:05:53,190 Now most of the time we'll be reading our files. 103 00:05:53,190 --> 00:05:55,730 So we call the open function and pass it in the name of 104 00:05:55,730 --> 00:05:59,310 the file we want to open, and then how we want to read it. 105 00:05:59,310 --> 00:06:02,300 Now you can leave this second parameter off and it 106 00:06:02,300 --> 00:06:04,630 assumes that you're going to want to read the file. 107 00:06:04,630 --> 00:06:05,130 Now. 108 00:06:08,920 --> 00:06:11,930 When the open is successful, it doesn't actually read all 109 00:06:11,930 --> 00:06:16,980 of the data because the memory is small, small compared to 110 00:06:16,980 --> 00:06:19,140 the hard drive, and so you have to sort of 111 00:06:19,140 --> 00:06:22,180 step through the data, you'll tell it when to read it. 112 00:06:22,180 --> 00:06:26,700 So the act of opening it is not actually reading all data. 113 00:06:26,700 --> 00:06:30,510 It is creating kind of like a connection between the 114 00:06:30,510 --> 00:06:33,220 memory and the data that's on the hard drive, right? 115 00:06:33,220 --> 00:06:34,470 It's connecting 116 00:06:34,470 --> 00:06:38,450 between, oh listen to this. Oh that's going to fall down. 117 00:06:38,450 --> 00:06:42,014 Is it going to stand up that way? 118 00:06:42,014 --> 00:06:45,330 Oh, I should come up with a way to make that stand. 119 00:06:46,369 --> 00:06:48,080 So it's a connection. 120 00:06:48,080 --> 00:06:50,290 So the, your program's kind of running in here. 121 00:06:50,290 --> 00:06:53,910 And the, and the file handle is just sort of it's 122 00:06:53,910 --> 00:06:57,760 like a phone call between your memory and your disk drive. 123 00:06:57,760 --> 00:07:00,150 It's not the actual data. The actual data is still 124 00:07:00,150 --> 00:07:06,010 sitting on the disk drive, okay? So, a graphical way to take a look at this 125 00:07:06,010 --> 00:07:11,680 is, the file handle, the thing that comes back from the open request. 126 00:07:11,680 --> 00:07:14,990 The open goes and finds the file out on the disk drive and 127 00:07:14,990 --> 00:07:20,250 yada, yada, yada, and then the handle is something that lives in the memory. 128 00:07:20,250 --> 00:07:22,100 that is sort of like the thing that 129 00:07:22,100 --> 00:07:25,830 maintains its connection to where all the data is 130 00:07:25,830 --> 00:07:28,910 on the disk or on the SD RAM that's in it. 131 00:07:28,910 --> 00:07:30,820 So the handle is not all the data, but it is 132 00:07:30,820 --> 00:07:34,280 a mechanism that you can use to get at the data. 133 00:07:34,280 --> 00:07:37,990 So if you print it out, it doesn't have all the data from the file, 134 00:07:37,990 --> 00:07:44,100 it says, I am a file handle that's opened this file and we're in read mode. 135 00:07:44,100 --> 00:07:46,120 So, that doesn't actually have the data, 136 00:07:46,120 --> 00:07:48,440 even though this is the data that's in the file. 137 00:07:48,440 --> 00:07:51,050 And then we have operations that we do to the handle like open it, 138 00:07:51,050 --> 00:07:53,470 close it, read it, write it. So we do things. 139 00:07:53,470 --> 00:07:56,370 So, so the handle and then through the handle it actually changes 140 00:07:56,370 --> 00:07:58,860 what's on the disk or reads what's on the disk. 141 00:07:58,860 --> 00:08:01,560 So the handle is kind of a thing that's not there. 142 00:08:02,890 --> 00:08:06,460 If you attempt to open a file and the name of the file. 143 00:08:06,460 --> 00:08:08,660 Now the way we're going to do these is these need to be 144 00:08:08,660 --> 00:08:14,490 in the same folder on your computer as in, as your Python code. 145 00:08:14,490 --> 00:08:16,110 Now, there are trickier ways to do it, but 146 00:08:16,110 --> 00:08:17,180 we're going to keep it simple. 147 00:08:17,180 --> 00:08:18,670 This is the name of a file in the 148 00:08:18,670 --> 00:08:21,590 same folder as the Python code that you're running. 149 00:08:21,590 --> 00:08:28,100 [SOUND] And if it's not, then we get, of course, a traceback and we're 150 00:08:28,100 --> 00:08:32,289 used to using, reading tracebacks by now, no such file or directory stuff.txt. 151 00:08:32,289 --> 00:08:34,710 Oh, of course, I forgot to save it or I typed it wrong. 152 00:08:37,820 --> 00:08:38,840 So. 153 00:08:38,840 --> 00:08:42,659 The next thing we have to learn is the notion of the newline character. 154 00:08:42,659 --> 00:08:44,390 You haven't seen this so far, 155 00:08:44,390 --> 00:08:47,960 but there's a special character in files 156 00:08:47,960 --> 00:08:52,030 that is used to indicate the end of a line. 157 00:08:52,030 --> 00:08:53,780 Because these text files that we've been writing, 158 00:08:53,780 --> 00:08:57,720 including Python programs that you have, are organized into lines. 159 00:08:57,720 --> 00:08:59,690 Each line has variable length and there is 160 00:08:59,690 --> 00:09:02,870 a special non-printing character that you just don't see. 161 00:09:02,870 --> 00:09:05,840 Now you see it because you see a line, 162 00:09:05,840 --> 00:09:10,710 multiple lines, but you don't see the character itself. 163 00:09:10,710 --> 00:09:13,130 So it turns out that this character is very 164 00:09:13,130 --> 00:09:15,690 important because the data is just a stream of 165 00:09:15,690 --> 00:09:18,850 characters on disk and then it's punctuated by newlines 166 00:09:18,850 --> 00:09:22,200 that tell it when it's time to end the line. 167 00:09:22,200 --> 00:09:29,368 So if we are building a string, the constant for newline is backslash n. 168 00:09:29,368 --> 00:09:32,973 And so, when we make a string that we want to 169 00:09:32,973 --> 00:09:38,380 have a newline in it, we'll say Hello backslash n World. 170 00:09:38,380 --> 00:09:41,370 And then if you print it out one way, you actually see the backslash n. 171 00:09:41,370 --> 00:09:44,230 But then if you use the print to print it out, you see sort of 172 00:09:44,230 --> 00:09:49,580 like the, it moves back down, you know, to the left margin and down. 173 00:09:49,580 --> 00:09:55,900 So, so, sometimes you see the slash n and sometimes it's shown as movement. 174 00:09:55,900 --> 00:09:57,140 Right? You, it moves it. 175 00:09:58,670 --> 00:10:00,000 The other thing that's important is even 176 00:10:00,000 --> 00:10:02,130 though we represent this as two characters, 177 00:10:02,130 --> 00:10:06,300 the backslash n is represented as two characters in a string, it's actually one character. 178 00:10:06,300 --> 00:10:10,250 So if we print it out, we see X newline Y 179 00:10:10,250 --> 00:10:13,280 and if we ask how many characters are in stuff, 180 00:10:13,280 --> 00:10:17,436 which is this string, it says 3. That's important. 181 00:10:17,436 --> 00:10:18,118 Okay? 182 00:10:18,118 --> 00:10:22,070 There is one, two, three. The newline is a single character. 183 00:10:22,070 --> 00:10:26,890 This is a just a syntax that we use to sort of encode a newline in a string. 184 00:10:27,890 --> 00:10:28,390 Okay? 185 00:10:29,450 --> 00:10:33,710 So, even though these are just a 186 00:10:33,710 --> 00:10:36,590 long sequence of characters punctuated by newlines, 187 00:10:36,590 --> 00:10:40,930 visually, text editors and operating systems show them, show 188 00:10:40,930 --> 00:10:43,950 these files to us as a sequence of lines. 189 00:10:43,950 --> 00:10:46,280 And it doesn't take very long to just start thinking about them 190 00:10:46,280 --> 00:10:47,650 as a sequence of lines. 191 00:10:47,650 --> 00:10:50,570 As a matter of fact, maybe you never, wish I'd never told you about newlines. 192 00:10:51,830 --> 00:10:53,080 But when we start reading files, we're 193 00:10:53,080 --> 00:10:54,990 going to have to deal with these newlines. 194 00:10:54,990 --> 00:10:59,260 So the way that we sort of have to mentally visualize of what these text 195 00:10:59,260 --> 00:11:03,980 files look like is they have a newline that punctuates the end of the line. 196 00:11:03,980 --> 00:11:09,200 Now in reality, if we look at this, this R really comes right after it. 197 00:11:09,200 --> 00:11:09,440 Right? 198 00:11:09,440 --> 00:11:13,410 This is all a bunch of characters and the newlines are punctuation, okay? 199 00:11:13,410 --> 00:11:16,720 To say this is first line, second line, third line, and fourth line. 200 00:11:16,720 --> 00:11:18,730 So, you gotta think that each of these things 201 00:11:18,730 --> 00:11:21,710 is here, sitting at the end of the line. 202 00:11:21,710 --> 00:11:24,950 And so the number of characters in this line include that newline. 203 00:11:24,950 --> 00:11:26,930 Now the newline is one character. 204 00:11:26,930 --> 00:11:31,920 Okay? So, how do we read these files? 205 00:11:31,920 --> 00:11:36,130 Well, we've already talked about doing an open xfile. 206 00:11:36,130 --> 00:11:39,100 And I'm just, this xfile, again that's just a mneumonic 207 00:11:39,100 --> 00:11:41,840 name that I made up. This is a handle. 208 00:11:41,840 --> 00:11:43,690 Remember, it's not all the data. 209 00:11:43,690 --> 00:11:46,070 But the handle is the way that we can read the data. 210 00:11:46,070 --> 00:11:48,690 We can use it as a access point. 211 00:11:48,690 --> 00:11:52,000 The coolest way to read a file, if it's a text file in multiple 212 00:11:52,000 --> 00:11:58,270 lines, is to use a determinant loop, a for loop. for cheese in xfile. 213 00:11:58,270 --> 00:12:03,090 So this, remember we would put a list of numbers or a string here. 214 00:12:03,090 --> 00:12:04,150 Now we've put a file 215 00:12:04,150 --> 00:12:05,200 handle here. 216 00:12:05,200 --> 00:12:09,315 Python knows automatically that each time we are going to run this 217 00:12:09,315 --> 00:12:11,980 loop, it's going to go to the next line of the file. 218 00:12:11,980 --> 00:12:16,092 Automatically, for, a cheese is just a stupid name that I came up with it. 219 00:12:16,092 --> 00:12:20,232 I would be better to call line rather than cheese, but for cheese in and then it goes 220 00:12:20,232 --> 00:12:22,812 dot, dot, dot, dot, dot, dot, dot, each file 221 00:12:22,812 --> 00:12:25,760 and then it stops when it reads the whole file. 222 00:12:25,760 --> 00:12:29,270 So this line will print out every line 223 00:12:29,270 --> 00:12:33,840 in the file, that's how you do it. These three lines open a file, 224 00:12:35,510 --> 00:12:42,400 read every line in the file, okay? So a file handle itself is a special kind 225 00:12:42,400 --> 00:12:47,170 of a sequence, much like a list of numbers or a string is a sequence of characters. 226 00:12:47,170 --> 00:12:48,860 So one of the things we can do to combine one of 227 00:12:48,860 --> 00:12:51,930 our counting idioms is count the number of lines in a file. 228 00:12:53,345 --> 00:12:54,340 Okay? And so how we 229 00:12:54,340 --> 00:12:56,970 would do that is we would open the file, set a 230 00:12:56,970 --> 00:13:00,580 counter to zero, this time I'll use a mnemonic variable called count. 231 00:13:00,580 --> 00:13:02,950 For line in fhand, that says run this 232 00:13:02,950 --> 00:13:05,740 indented text once for each line in the file. 233 00:13:05,740 --> 00:13:08,410 For each line in the file, add count equals count plus 1. 234 00:13:08,410 --> 00:13:10,760 When the for loop is done, print the count. 235 00:13:13,010 --> 00:13:14,480 Pretty straightforward. 236 00:13:14,480 --> 00:13:18,240 Very few other languages are capable of writing that program in 237 00:13:18,240 --> 00:13:22,160 as quick and as dense and succinct a way as Python is. 238 00:13:22,160 --> 00:13:25,080 Python does a really, really nice job of this. 239 00:13:25,080 --> 00:13:28,140 Okay? So that's how you count the lines. 240 00:13:28,140 --> 00:13:31,250 Open it, write a for loop, and then add one. 241 00:13:31,250 --> 00:13:35,980 Now we, we can't just say, so what you can't do, and this gives you a sense. 242 00:13:35,980 --> 00:13:37,125 You can't say len, 243 00:13:37,125 --> 00:13:40,300 fhand. 244 00:13:40,300 --> 00:13:42,620 And that's because this isn't really the data. 245 00:13:42,620 --> 00:13:45,390 That's sort of, you have to like pull the, pull it 246 00:13:45,390 --> 00:13:48,080 and read it to get the data out of it. 247 00:13:48,080 --> 00:13:49,990 Although we'll see another way of reading it later. 248 00:13:51,020 --> 00:13:53,270 Okay? So that's counting the lines in a file. 249 00:13:55,100 --> 00:13:57,460 It turns out you can also read the entire file. 250 00:13:58,980 --> 00:14:02,100 Now if you read the entire file, it's not broken into lines. 251 00:14:02,100 --> 00:14:04,000 You're getting all the characters punctuated 252 00:14:04,000 --> 00:14:06,320 by newlines and you get everything. 253 00:14:06,320 --> 00:14:09,820 Now you don't want to read this if it's too big, so it's 254 00:14:09,820 --> 00:14:12,610 going to all try to read it into the memory of the computer. 255 00:14:12,610 --> 00:14:16,080 And if the memory is not big enough, you're going to slow down to a crawl. 256 00:14:16,080 --> 00:14:18,905 But if it's a real tiny file, this works just fine. 257 00:14:18,905 --> 00:14:22,120 And so, so we have sort of real, we open 258 00:14:22,120 --> 00:14:26,990 a file and we say fhand.read, this is basically saying, hey, 259 00:14:26,990 --> 00:14:30,840 dear fhand, read it all and return it to me as a string. 260 00:14:31,950 --> 00:14:34,350 So that's a string with all the lines of the file concatenated 261 00:14:34,350 --> 00:14:38,790 together with newlines, which is actually exactly what's in the file. 262 00:14:38,790 --> 00:14:39,800 It's the raw data. 263 00:14:39,800 --> 00:14:42,400 That for loop sort of looks for the newline 264 00:14:42,400 --> 00:14:44,410 and does all of the stuff automatically for us. 265 00:14:44,410 --> 00:14:45,160 It's quite nice. 266 00:14:46,410 --> 00:14:49,670 So then we can, like, because inp is a string at this point, 267 00:14:49,670 --> 00:14:50,590 we can just print the length of it. 268 00:14:50,590 --> 00:14:53,110 And we can say, oh, there's 94,626 269 00:14:53,110 --> 00:14:56,780 characters that came from that file. 270 00:14:56,780 --> 00:15:01,910 It reads the whole thing, whole file, reads the whole file. 271 00:15:01,910 --> 00:15:04,450 We can also do things like, you know, slice it now. 272 00:15:04,450 --> 00:15:10,330 And so this is the first 20 characters, up from zero up to, but not including, 20. 273 00:15:10,330 --> 00:15:12,790 So this, this is our file. Okay? 274 00:15:12,790 --> 00:15:15,640 So that's reading through the whole file. 275 00:15:15,640 --> 00:15:18,390 So, let me go back a little bit, this is the file that we're 276 00:15:18,390 --> 00:15:19,230 going to play with. 277 00:15:20,370 --> 00:15:24,920 This file here that we're going to play with in this class is a mailbox file. 278 00:15:24,920 --> 00:15:27,120 And this is actual real data. And these are real people. 279 00:15:27,120 --> 00:15:28,890 And these are real dates, having to do with 280 00:15:28,890 --> 00:15:31,680 an open source project that I worked on called Sakai. 281 00:15:31,680 --> 00:15:35,990 I actually have a tattoo of Sakai here on my shoulder. 282 00:15:35,990 --> 00:15:38,220 Maybe in an upcoming lecture, I'll have a 283 00:15:38,220 --> 00:15:40,430 short-sleeved shirt, and show you my tattoo. 284 00:15:40,430 --> 00:15:44,500 But for now, I can't because I've got, got clothes on. 285 00:15:44,500 --> 00:15:52,480 So, but this is real data. It's the mbox.txt, mbox.txt file. 286 00:15:52,480 --> 00:15:56,270 So, so that's the file that we're going to use for most of the next few assignments. 287 00:15:56,270 --> 00:15:57,960 It'll be the same file. You'll get tired of it. 288 00:15:57,960 --> 00:16:00,330 And you'll get to know all these people, Stephen, 289 00:16:00,330 --> 00:16:02,110 Chen Wen, and all the people in the file. 290 00:16:05,360 --> 00:16:06,020 Okay, so. 291 00:16:07,440 --> 00:16:10,470 We can search for lines that have a prefix. 292 00:16:10,470 --> 00:16:14,380 This is kind of the find pattern from the looping lecture. 293 00:16:14,380 --> 00:16:17,860 So we're going to go through a list of, of lines in a file, 294 00:16:17,860 --> 00:16:20,910 and we're going to only print out the ones that match a certain thing. 295 00:16:20,910 --> 00:16:22,810 So again, we open the file up. 296 00:16:22,810 --> 00:16:25,410 We're going to write a for loop that's going to say, for each line in the 297 00:16:25,410 --> 00:16:30,420 file, if the line and then we can call a, a utility function 298 00:16:30,420 --> 00:16:32,660 inside of string, because line is a string. 299 00:16:32,660 --> 00:16:35,230 If line startswith From, print it out. 300 00:16:35,230 --> 00:16:37,860 So this means it's going to loop through all of the lines in the 301 00:16:37,860 --> 00:16:43,180 file and it's going to print the ones that start with the string 'From:' 302 00:16:44,530 --> 00:16:45,700 Okay? 303 00:16:45,700 --> 00:16:49,780 Again, four lines, complete Python program to read this 304 00:16:49,780 --> 00:16:52,760 file and print the lines that have a prefix of from. 305 00:16:54,950 --> 00:16:59,020 So, if you run this program, and I suggest that you do, 306 00:17:01,050 --> 00:17:02,710 this is what the output's going to look like. 307 00:17:03,840 --> 00:17:07,160 And it's like, wait a second, I'm seeing the lines, 308 00:17:09,680 --> 00:17:13,990 seeing the lines that have the froms, but then I get these blank lines. 309 00:17:16,530 --> 00:17:18,950 And why is that? Why are these blank lines there? 310 00:17:18,950 --> 00:17:24,390 If I look at the program, I mean, I'm not printing blank lines. 311 00:17:24,390 --> 00:17:26,369 I'm only printing lines that start with from. 312 00:17:26,369 --> 00:17:27,520 I'm not doing that, so why? 313 00:17:30,520 --> 00:17:31,020 What do you think? 314 00:17:31,790 --> 00:17:32,530 I'll give you a second. 315 00:17:34,740 --> 00:17:38,080 I've certainly done enough foreshadowing in this lecture. 316 00:17:38,080 --> 00:17:41,100 Well it turns out these newlines are the problem. 317 00:17:41,100 --> 00:17:43,850 So it turns out that the print, we've been doing this 318 00:17:43,850 --> 00:17:46,580 all along, you just, we didn't make a fuss about it. 319 00:17:46,580 --> 00:17:49,930 The print adds a newline at the end of everything that it prints. 320 00:17:49,930 --> 00:17:53,270 So the yellow newlines are coming from the print statement. 321 00:17:53,270 --> 00:17:57,500 But when we read the file, each line ends in a newline. 322 00:17:57,500 --> 00:18:00,490 So these green newlines are actually from the file. 323 00:18:03,170 --> 00:18:05,960 They're the ones from the file. 324 00:18:05,960 --> 00:18:08,060 So what's happening is we're seeing two 325 00:18:08,060 --> 00:18:10,530 newlines, and so that turns into a blank line. 326 00:18:11,870 --> 00:18:14,140 So, how do we deal with that? 327 00:18:14,140 --> 00:18:19,140 Well, we've got a string function that conveniently solves that problem, okay? 328 00:18:19,140 --> 00:18:21,040 And that is we're going to call rstrip. 329 00:18:21,040 --> 00:18:25,200 If you recall, we had strip, lstrip, and rstrip to strip 330 00:18:25,200 --> 00:18:28,380 white space on one side, on the other side, or on both sides. 331 00:18:28,380 --> 00:18:29,510 So in this one, 332 00:18:29,510 --> 00:18:30,570 we're going to use rstrip. 333 00:18:30,570 --> 00:18:33,130 We're going to say, we're going to read the line, that 334 00:18:33,130 --> 00:18:35,545 this line is going to have a newline in it. 335 00:18:35,545 --> 00:18:40,200 rstrip says pull white space, and the newlines are also counted as white space. 336 00:18:40,200 --> 00:18:42,870 Blanks or newlines are white space. 337 00:18:42,870 --> 00:18:46,610 And then we're going to replace this with no newline in it. 338 00:18:46,610 --> 00:18:50,108 Then we're going to ask if it starts with a from and then we're going to print it 339 00:18:50,108 --> 00:18:51,804 out, and then we go and we're going to 340 00:18:51,804 --> 00:18:55,130 see exactly what we're looking for in this file. 341 00:18:55,130 --> 00:18:56,040 And there's no newlines. 342 00:18:56,040 --> 00:19:01,360 So the newline that's coming out here is the one from the print, not the 343 00:19:01,360 --> 00:19:03,930 one from the file, because the one from 344 00:19:03,930 --> 00:19:06,610 the file got wiped out by that particular line. 345 00:19:07,950 --> 00:19:08,450 Okay? 346 00:19:09,720 --> 00:19:13,360 So another general pattern of these file-based loops 347 00:19:13,360 --> 00:19:17,510 that we have done this, is a skipping pattern. 348 00:19:17,510 --> 00:19:20,490 Now, you can do, the, the non-skipping pattern 349 00:19:20,490 --> 00:19:22,960 is where you're saying, I'm going to look for lines 350 00:19:22,960 --> 00:19:25,640 that start with from and do something to them. 351 00:19:25,640 --> 00:19:30,420 Sometimes you'll want to do something to all, to, to the to, to, you want to say, 352 00:19:30,420 --> 00:19:32,790 here's a bunch of lines I'm going to skip, and then I'm going to do something. 353 00:19:32,790 --> 00:19:36,520 So the skipping pattern uses continue. 354 00:19:36,520 --> 00:19:38,840 And so the first few lines here are the same. 355 00:19:38,840 --> 00:19:41,760 We open a file, we read each line in the file, 356 00:19:41,760 --> 00:19:43,780 but we're going to strip off the white space. 357 00:19:43,780 --> 00:19:45,640 You're going to get tired of typing these three lines, 358 00:19:45,640 --> 00:19:47,280 because you're going to do it a lot. 359 00:19:47,280 --> 00:19:51,890 Open the file, start reading the file, strip the whitespace for each line. 360 00:19:51,890 --> 00:19:57,740 And you can make it so that you can look for some fact. 361 00:19:57,740 --> 00:20:01,260 In this case, I'm going to say, if not line startswith From, this 362 00:20:01,260 --> 00:20:05,220 means this is true for all the lines that don't start with from, 363 00:20:05,220 --> 00:20:08,600 continue. And if you remember, continue goes up. 364 00:20:08,600 --> 00:20:10,960 So the continue says I'm done, it finishes 365 00:20:10,960 --> 00:20:14,230 the iteration, and it doesn't do anything down here. 366 00:20:14,230 --> 00:20:15,130 Okay? 367 00:20:15,130 --> 00:20:18,210 And so it, this is a, and then, we can do something. 368 00:20:18,210 --> 00:20:21,110 So, I've kind of flipped this, where I said, these are the 369 00:20:21,110 --> 00:20:24,830 things I'm interesting, interested in, that's lines that start with from. 370 00:20:24,830 --> 00:20:26,270 So, I'm going to skip the lines that don't. 371 00:20:26,270 --> 00:20:27,880 So I'm going to use continue. 372 00:20:27,880 --> 00:20:32,420 Either way you can do it, depending on the complexity or how much. 373 00:20:32,420 --> 00:20:34,100 Often when you're, this is a good pattern when 374 00:20:34,100 --> 00:20:36,400 you have lots of lines of code down here 375 00:20:36,400 --> 00:20:37,850 that you're going to do a lot of cool stuff with. 376 00:20:39,290 --> 00:20:42,780 You can also use things like in to select lines. 377 00:20:42,780 --> 00:20:43,320 Right? 378 00:20:43,320 --> 00:20:51,199 So I'm going to, I'm going to look for lines that have @uct.ac.za in them. 379 00:20:51,199 --> 00:20:53,073 So again, I'm going to open it up. 380 00:20:53,073 --> 00:20:55,830 I'm going to open these, go through each line in the file. 381 00:20:55,830 --> 00:21:00,510 I'm going to strip the white space out, and [COUGH] 382 00:21:00,510 --> 00:21:03,070 if not u-c-t, 383 00:21:03,070 --> 00:21:07,930 if this string is not in line, then I'm going to continue. 384 00:21:07,930 --> 00:21:12,270 So it's a way for me to skip all of the lines that don't have this string in it. 385 00:21:14,000 --> 00:21:19,260 So these lines do, that one has it too, and then we're going to print it out. 386 00:21:19,260 --> 00:21:23,750 It will print out the ones that make it past here, okay? 387 00:21:23,750 --> 00:21:28,440 So, but in is another way to do searching, right, starts with, 388 00:21:28,440 --> 00:21:28,940 et cetera. 389 00:21:30,640 --> 00:21:37,550 So one more thing that you might want to try is, so we can count, right? 390 00:21:37,550 --> 00:21:40,270 Now, and this is a pattern for prompting for a file name. 391 00:21:41,920 --> 00:21:45,850 And so, so here you, you'll get tired of sort of 392 00:21:45,850 --> 00:21:48,770 changing your code every time you want to open a different file. 393 00:21:48,770 --> 00:21:50,849 because you probably want to run the program 394 00:21:50,849 --> 00:21:53,621 with mbox once and mbox-short because, just so you 395 00:21:53,621 --> 00:21:57,850 can test it with different things of data. So here's just another pattern. 396 00:21:57,850 --> 00:22:01,910 We add this line to say raw_input, enter the file name. 397 00:22:01,910 --> 00:22:04,700 And there you go, we'll type in the file name. 398 00:22:04,700 --> 00:22:08,240 And then the thing that we open is whatever we entered as the file name. 399 00:22:08,240 --> 00:22:11,280 And then the rest of it is pretty much yada yada. 400 00:22:11,280 --> 00:22:14,060 So here I'm, it's reading the whole file. 401 00:22:14,060 --> 00:22:17,230 If the line starts with subject, count equals count plus one. 402 00:22:17,230 --> 00:22:19,340 And then there were 1797 subject 403 00:22:19,340 --> 00:22:21,850 lines in mbox.txt. 404 00:22:21,850 --> 00:22:26,500 There were 27 subject lines in mbox-short.txt, okay? 405 00:22:26,500 --> 00:22:29,020 So that's prompting for the file names. 406 00:22:29,020 --> 00:22:31,310 Now, open. 407 00:22:31,310 --> 00:22:35,450 The open statement fails if the file name doesn't exist. 408 00:22:35,450 --> 00:22:37,290 So, you might want to add a try and 409 00:22:37,290 --> 00:22:39,840 accept around that if you want to, if you're just writing 410 00:22:39,840 --> 00:22:42,530 code for yourself and you assume that everything's okay, 411 00:22:42,530 --> 00:22:44,610 then you don't have to write try accept but if 412 00:22:44,610 --> 00:22:50,610 you want to catch it [SOUND] and catch a bad file name, 413 00:22:50,610 --> 00:22:55,860 then you take the open which, and turn it into these four lines. 414 00:22:55,860 --> 00:22:58,480 So this is the code that we think might blow up, 415 00:22:59,500 --> 00:23:01,330 and it's going to blow up, we know it's going to blow up. 416 00:23:01,330 --> 00:23:03,510 If they enter a bad file name like 417 00:23:03,510 --> 00:23:06,510 na na boo boo, right, this is is going to blow up. 418 00:23:06,510 --> 00:23:08,940 So what do we do? We use try and accept. 419 00:23:08,940 --> 00:23:09,780 We put try 420 00:23:09,780 --> 00:23:10,390 around that. 421 00:23:10,390 --> 00:23:14,210 We're going to take out some insurance on that particular line. 422 00:23:14,210 --> 00:23:16,540 And then, if it fails, we're going to print 423 00:23:16,540 --> 00:23:20,500 this message and then say exit, to get out. 424 00:23:20,500 --> 00:23:22,920 So if you get a good file, 425 00:23:25,500 --> 00:23:27,930 if you get a good file, it works, skips the 426 00:23:27,930 --> 00:23:31,610 except, then runs the thing, prints out the count. 427 00:23:31,610 --> 00:23:35,930 That's what's happening here. If, on the other hand, you get a bad file, 428 00:23:36,990 --> 00:23:41,940 it comes here, open blows up, runs the except, prints this out, and then quits. 429 00:23:43,210 --> 00:23:45,680 So that's how this one works with a bad file. 430 00:23:46,820 --> 00:23:48,538 And now, no traceback, right? 431 00:23:53,538 --> 00:23:55,386 So we are 432 00:23:56,690 --> 00:24:00,270 It's kind of a short lecture. We're done with Chapter Seven. 433 00:24:01,480 --> 00:24:03,940 We open a file. 434 00:24:03,940 --> 00:24:05,670 We read the file. 435 00:24:05,670 --> 00:24:09,380 We take out white space at the end with rstrip. 436 00:24:09,380 --> 00:24:11,670 We had used string functions. 437 00:24:11,670 --> 00:24:14,650 So, this is kind of putting it all together. 438 00:24:14,650 --> 00:24:17,280 And it's kind of short little programs now. 439 00:24:17,280 --> 00:24:22,100 So, it's not. And you know, starting now, 440 00:24:22,100 --> 00:24:25,255 we are going to start putting these things together and start actually doing work. 441 00:24:25,255 --> 00:24:28,100 Because now, we have, from the first few chapters, 442 00:24:28,100 --> 00:24:32,390 we have basic capabilities of Python. Now we have some data to work with. 443 00:24:32,390 --> 00:24:33,180 Now going forward 444 00:24:33,180 --> 00:24:36,570 we are going to do increasingly sophisticated things with that data. 445 00:24:36,570 --> 00:24:38,240 So I can't wait to see in the next lecture.