Welcome to Chapter Seven of Python for Informatics: Exploring Information. I'm Charles Severance. I'm the author of the book and your host. And, as always, this is brought to you by... no, I'm sorry. It's all copyright Creative Commons Attribution: the audio, the video, the slides, and even the book. So, here we go. Frankly, where we've been working all along is, we have been writing code and talking to the CPU. Hang on, let me go get my CPU and stuff. Be right back. [SOUND] Okay. Here we go. Here's all the stuff. Remember the stuff from the first lecture? Remember the motherboard from the first lecture? This is kind of the picture of what's on the screen. On the motherboard, the CPU plugs in here and memory plugs in here. And remember how the CPU is sort of the brains, as much brains as there is, for the operation. The CPU is asking, what next? The instructions come in through these little pins. There's data inside, and the sort of semi-permanent data, the variables, are all stored pretty much here in RAM. And we write our programs, and so your Python programs are sitting here in this RAM, and they're being fed to this CPU through those pins, right? I mean, it doesn't really connect like that. And so, frankly, up to now, everything that we've been doing is just the Python programming language. The only place we've really been operating is here. We have been putting Python into the main memory, and we have been effectively feeding instructions to the CPU, the central processing unit, as it needed them, and then the program would stop. Everything we've done so far is just sort of fiddling around here. We have never escaped it. So now we are finally going to escape from the central processing unit and the memory. We'll still write programs and have variables in here.
But now we're going to use the disk, the secondary storage, the permanent media, right? So if I go grab my Raspberry Pi, alright, that goes right there. Here's my Raspberry Pi, which is the small version, which of course has a CPU, memory, and graphics processor, all in this little chip right here. But the secondary memory for the Raspberry Pi is this little SD card. So the structure of the Raspberry Pi is exactly the same as the structure of any other personal computer, it's just smaller and less expensive. And so if you're programming the Raspberry Pi, you're sort of finally escaping. All your programs were in here. Your CPU is in here, and that's pretty much how far you've got to run. But of course when you save your files, you save them to here. Now we are going to start looking at data on the disk drive, and so it's time to escape to the secondary memory. Okay? And Raspberry Pi, you can go right there. So it's time to find some data to mess with. A lot of what we've been doing so far is just kind of the pre-work to get to the point where we can do this. And in here we're going to have data files. Now, we've been making files all along: every Python program that you write on your computer gets saved as a file. Then Python reads the file and runs it. But now we're actually going to start messing with some data. And so, files are where we're going to be working. One of the things about secondary memory is that it's much larger. The main memory of the computer is pretty large, it's just not large enough to hold everything that the computer is capable of holding. So, the files that we're going to work with: we're not talking about image files or QuickTime movies or things like that. We're going to work with text files, because the theme of this course is digging through text.
Sometimes we'll pull it off the Internet. Sometimes we'll read files. But it's digging through and using all the things that we've learned so far, looping and strings and all those things, to make sense of a sequence of information. Okay? Now, to access file information, we have to do this thing called opening the file. We can't just say, yo, the information is just omnipresent, because there is so much data that you can't have Python sort of know all the data. You literally have hundreds of thousands of files on your computer's hard drive. Which one are you going to read? So there's a step that you have to do: you call this built-in function called open and say, oh, this is the file that I want to work with, of the hundreds of thousands. And then once you do, you've kind of got this little connector into it. Hang on a sec, let's say goodbye to that. The open function is a built-in function in Python, and it takes two parameters. The first parameter is the name of the file, like mbox.txt, and the second is how you're going to use it. Are you going to read it? Are you going to write it? Et cetera. Most of the time we'll be reading our files. So we call the open function and pass in the name of the file we want to open, and then how we want to open it. Now, you can leave this second parameter off, and it assumes that you're going to want to read the file. When the open is successful, it doesn't actually read all of the data, because the memory is small compared to the hard drive. You have to sort of step through the data; you'll tell it when to read. So the act of opening is not actually reading all the data. It is creating kind of like a connection between the memory and the data that's on the hard drive, right? It's connecting between, oh, listen to this. Oh, that's going to fall down. Is it going to stand up that way?
Oh, I should come up with a way to make that stand. So it's a connection. Your program's kind of running in here, and the file handle is just sort of like a phone call between your memory and your disk drive. It's not the actual data. The actual data is still sitting on the disk drive, okay? So, a graphical way to take a look at this is: the file handle is the thing that comes back from the open request. The open goes and finds the file out on the disk drive and yada, yada, yada, and then the handle is something that lives in the memory. It's sort of like the thing that maintains the connection to where all the data is on the disk, or on the SD card that's in it. So the handle is not all the data, but it is a mechanism that you can use to get at the data. If you print it out, it doesn't have all the data from the file. It says, I am a file handle that's opened this file, and we're in read mode. So that doesn't actually have the data, even though this is the data that's in the file. And then we have operations that we do to the handle, like open it, close it, read it, write it. We do things to the handle, and then through the handle it actually changes what's on the disk or reads what's on the disk. So the handle is kind of a thing that's not there. Now, what happens if you attempt to open a file and the name of the file is wrong? The way we're going to do these is, the files need to be in the same folder on your computer as your Python code. There are trickier ways to do it, but we're going to keep it simple. This is the name of a file in the same folder as the Python code that you're running. [SOUND] And if it's not there, then we get, of course, a traceback, and we're used to reading tracebacks by now: no such file or directory, stuff.txt. Oh, of course, I forgot to save it, or I typed it wrong. So, the next thing we have to learn is the notion of the newline character.
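As a rough sketch of what opening a file looks like in code: this is written for Python 3, and the file name sample.txt and its contents are made up for illustration (the lecture's own file is mbox.txt, sitting in the same folder as the program). The sketch writes its own throwaway file to a temporary folder so that it runs anywhere.

```python
import os
import tempfile

# Make a small throwaway file so this sketch runs anywhere.
# 'sample.txt' is a hypothetical stand-in for a file like mbox.txt.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'sample.txt')
with open(path, 'w') as out:
    out.write('first line\nsecond line\n')

# open() returns a handle: a connection to the file, not the file's data.
fhand = open(path, 'r')   # 'r' means read, and is the default if left off
print(fhand)              # prints a description of the handle, not the contents
fhand.close()

# Opening a name that does not exist raises an error
# (which shows up as a traceback if nothing catches it).
missing = os.path.join(folder, 'stuff.txt')
try:
    open(missing)
    opened = True
except FileNotFoundError as err:
    opened = False
    print('Could not open:', err)
```

Printing the handle shows the file name and the mode, which is a quick way to confirm that the handle is only a connection and not the data itself.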
You haven't seen this so far, but there's a special character in files that is used to indicate the end of a line. These text files that we've been writing, including the Python programs that you have, are organized into lines. Each line has variable length, and there is a special non-printing character that you just don't see. Now, you see its effect, because you see multiple lines, but you don't see the character itself. It turns out that this character is very important, because the data is just a stream of characters on disk, punctuated by newlines that tell it when it's time to end the line. If we are building a string, the constant for newline is backslash n. So when we make a string that we want to have a newline in it, we'll say Hello backslash n World. And then if you print it out one way, you actually see the backslash n. But if you use print to print it out, it moves back to the left margin and down. So sometimes you see the backslash n and sometimes it's shown as movement. Right? The other thing that's important is that even though the backslash n is represented as two characters in a string, it's actually one character. So if we print it out, we see X newline Y, and if we ask how many characters are in stuff, which is this string, it says 3. That's important. Okay? There is one, two, three. The newline is a single character. This is just a syntax that we use to sort of encode a newline in a string. Okay? So, even though these files are just a long sequence of characters punctuated by newlines, visually, text editors and operating systems show these files to us as a sequence of lines. And it doesn't take very long to just start thinking about them as a sequence of lines. As a matter of fact, maybe you wish I'd never told you about newlines.
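A minimal sketch of the newline point, assuming Python 3 (where print is a function, and repr gives the "as written in code" view of a string):

```python
stuff = 'X\nY'

print(repr(stuff))  # shows the string as written in code, backslash-n visible
print(stuff)        # shows X and Y on separate lines
print(len(stuff))   # 3: the newline counts as a single character
```

The two characters backslash and n in the source code are only a way of typing the newline; the string itself holds X, one newline, and Y.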
But when we start reading files, we're going to have to deal with these newlines. So the way that we have to mentally visualize what these text files look like is that they have a newline that punctuates the end of each line. Now in reality, if we look at this, this R really comes right after it. Right? This is all a bunch of characters, and the newlines are punctuation, okay? To say this is the first line, second line, third line, and fourth line. So you've got to think that each of these things is here, sitting at the end of the line. And so the number of characters in this line includes that newline. The newline is one character. Okay? So, how do we read these files? Well, we've already talked about doing an open. And this xfile, again, that's just a mnemonic name that I made up. This is a handle. Remember, it's not all the data. But the handle is the way that we can read the data. We can use it as an access point. The coolest way to read a file, if it's a text file with multiple lines, is to use a definite loop, a for loop: for cheese in xfile. Remember, we would put a list of numbers or a string here. Now we've put a file handle here. Python knows automatically that each time we run this loop, it's going to go to the next line of the file. And cheese is just a stupid name that I came up with. It would be better to call it line rather than cheese. But for cheese in xfile, and then it goes dot, dot, dot, dot, through each line of the file, and then it stops when it has read the whole file. So this line will print out every line in the file. That's how you do it. These three lines open a file and read every line in the file, okay? So a file handle itself is a special kind of sequence, much like a list is a sequence of numbers or a string is a sequence of characters. So one of the things we can do, to combine this with one of our counting idioms, is count the number of lines in a file. Okay?
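Here is a small sketch of the for loop over a file handle, assuming Python 3. The file and its three lines are invented for the example, and the loop variable is called line rather than cheese. It prints the repr of each line so that the newline at the end of every line is visible.

```python
import os
import tempfile

# Hypothetical three-line file, written to a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as out:
    out.write('one\ntwo\nthree\n')

xfile = open(path)
lines_seen = []
for line in xfile:                 # each pass through the loop gets the next line
    lines_seen.append(line)
    print(repr(line), len(line))   # the length includes the trailing newline
xfile.close()
```

Each line comes back with its newline still attached, so the first line here has four characters, not three.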
And so how we would do that is we would open the file and set a counter to zero. This time I'll use a mnemonic variable called count. For line in fhand, that says run this indented text once for each line in the file. For each line in the file, count equals count plus 1. When the for loop is done, print the count. Pretty straightforward. Very few other languages are capable of writing that program in as quick and as dense and succinct a way as Python is. Python does a really, really nice job of this. Okay? So that's how you count the lines. Open it, write a for loop, and then add one. Now, what you can't do, and this gives you a sense of what a handle is: you can't say len of fhand. And that's because this isn't really the data. You have to pull on it and read it to get the data out of it. Although we'll see another way of reading it later. Okay? So that's counting the lines in a file. It turns out you can also read the entire file. Now if you read the entire file, it's not broken into lines. You're getting all the characters punctuated by newlines, and you get everything. You don't want to do this if the file is too big, because it's going to try to read it all into the memory of the computer. And if the memory is not big enough, you're going to slow down to a crawl. But if it's a real tiny file, this works just fine. So we open a file and we say fhand.read. This is basically saying, hey, dear fhand, read it all and return it to me as a string. So that's a string with all the lines of the file concatenated together with newlines, which is actually exactly what's in the file. It's the raw data. That for loop sort of looks for the newlines and does all of the stuff automatically for us. It's quite nice. So then, because inp is a string at this point, we can just print the length of it. And we can say, oh, there's 94,626 characters that came from that file.
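The counting loop and the read-everything approach can be sketched like this, assuming Python 3. The file contents and the example.com addresses are made up; with the real mbox.txt the numbers would of course be much larger.

```python
import os
import tempfile

# Hypothetical three-line file standing in for mbox.txt.
text = 'From: a@example.com\nSubject: hello\nFrom: b@example.com\n'
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as out:
    out.write(text)

# Count the lines with a for loop.
fhand = open(path)
count = 0
for line in fhand:
    count = count + 1
print('Line Count:', count)

# A handle is not the data, so it has no length.
try:
    len(fhand)
    has_len = True
except TypeError:
    has_len = False
fhand.close()

# read() pulls the whole file into one string, newlines included.
fhand = open(path)
inp = fhand.read()
fhand.close()
print(len(inp))    # number of characters in the whole file
print(inp[:20])    # the first 20 characters, positions 0 through 19
```

Reading everything at once is fine for a small file like this one; for a large file the for loop is the safer choice because it only holds one line at a time.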
It reads the whole thing, the whole file. We can also do things like slice it now. And so this is the first 20 characters, from zero up to, but not including, 20. So this is our file. Okay? So that's reading through the whole file. So, let me go back a little bit. This file here that we're going to play with in this class is a mailbox file. And this is actual real data. These are real people, and these are real dates, having to do with an open source project that I worked on called Sakai. I actually have a tattoo of Sakai here on my shoulder. Maybe in an upcoming lecture I'll have a short-sleeved shirt and show you my tattoo. But for now, I can't, because I've got clothes on. But this is real data. It's the mbox.txt file. So that's the file that we're going to use for most of the next few assignments. It'll be the same file. You'll get tired of it. And you'll get to know all these people, Stephen, Chen Wen, and all the people in the file. Okay, so. We can search for lines that have a prefix. This is kind of the find pattern from the looping lecture. We're going to go through a list of lines in a file, and we're going to only print out the ones that match a certain thing. So again, we open the file up. We're going to write a for loop that's going to say, for each line in the file, and then we can call a utility function inside of string, because line is a string: if line startswith From:, print it out. So this means it's going to loop through all of the lines in the file and it's going to print the ones that start with the string 'From:'. Okay? Again, four lines, a complete Python program to read this file and print the lines that have a prefix of From:. So, if you run this program, and I suggest that you do, this is what the output's going to look like.
And it's like, wait a second, I'm seeing the lines that have the From:, but then I get these blank lines. And why is that? Why are these blank lines there? If I look at the program, I mean, I'm not printing blank lines. I'm only printing lines that start with From:. So why? What do you think? I'll give you a second. I've certainly done enough foreshadowing in this lecture. Well, it turns out these newlines are the problem. It turns out that the print, and we've been doing this all along, we just didn't make a fuss about it, adds a newline at the end of everything that it prints. So the yellow newlines are coming from the print statement. But when we read the file, each line ends in a newline. So these green newlines are actually from the file. So what's happening is we're seeing two newlines, and that turns into a blank line. So, how do we deal with that? Well, we've got a string function that conveniently solves that problem, okay? And that is, we're going to call rstrip. If you recall, we had strip, lstrip, and rstrip to strip white space on both sides, on the left side, or on the right side. So in this one, we're going to use rstrip. We're going to read the line, and this line is going to have a newline in it. rstrip says pull white space off the right side, and newlines are also counted as white space. Blanks or newlines are white space. And then we're going to replace line with a version that has no newline in it. Then we're going to ask if it starts with From:, and then we're going to print it out, and we're going to see exactly what we're looking for in this file. And there are no extra newlines. So the newline that's coming out here is the one from the print, not the one from the file, because the one from the file got wiped out by that particular line. Okay? So another general pattern in these file-based loops is a skipping pattern.
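The search-with-rstrip loop described above could look roughly like this in Python 3. The sample file and the example.com addresses are invented for the sketch; the lecture runs the same idea against mbox.txt.

```python
import os
import tempfile

# Hypothetical mailbox-style file, written to a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as out:
    out.write('From: alice@example.com\nSubject: first\n\n'
              'From: bob@example.com\nSubject: second\n')

matches = []
fhand = open(path)
for line in fhand:
    line = line.rstrip()            # drop the newline that came from the file
    if line.startswith('From:'):
        matches.append(line)
        print(line)                 # print adds its own single newline
fhand.close()
```

Without the rstrip call, each printed line would carry the file's newline plus the one print adds, which is where the blank lines come from.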
Now, the non-skipping pattern is where you're saying, I'm going to look for lines that start with From: and do something to them. Sometimes you'll want to say, here's a bunch of lines I'm going to skip, and then I'm going to do something with the rest. So the skipping pattern uses continue. The first few lines here are the same. We open a file, we read each line in the file, and we strip off the white space. You're going to get tired of typing these three lines, because you're going to do it a lot: open the file, start reading the file, strip the white space from each line. And then you can look for some fact. In this case, I'm going to say, if not line startswith From:, which is true for all the lines that don't start with From:, continue. And if you remember, continue goes back up. So the continue says I'm done, it finishes the iteration, and it doesn't do anything down here. Okay? And then, for the lines that get past it, we can do something. So I've kind of flipped this. The things I'm interested in are the lines that start with From:, so I'm going to skip the lines that don't, and I'm going to use continue. You can do it either way, depending on the complexity. This is a good pattern when you have lots of lines of code down here that are going to do a lot of cool stuff. You can also use things like in to select lines. Right? So I'm going to look for lines that have @uct.ac.za in them. So again, I'm going to open it up. I'm going to go through each line in the file. I'm going to strip the white space out, and [COUGH] if this string is not in line, then I'm going to continue. So it's a way for me to skip all of the lines that don't have this string in them. So these lines do, that one has it too, and then we're going to print it out.
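A sketch of both skipping loops, assuming Python 3. The sample lines, including the addresses and the X-Note header, are made up for illustration. The second loop is written with not in, which means the same thing as the spoken "if not ... in line".

```python
import os
import tempfile

# Hypothetical file with a mix of lines; the addresses are invented.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as out:
    out.write('From: alice@uct.ac.za\n'
              'Subject: hello\n'
              'From: bob@example.com\n'
              'X-Note: copy sent to carol@uct.ac.za\n')

# Skip every line that does not start with 'From:'.
from_lines = []
fhand = open(path)
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From:'):
        continue                    # go straight to the next line
    from_lines.append(line)         # only the interesting lines get here
    print(line)
fhand.close()

# Skip every line that does not contain '@uct.ac.za' anywhere.
uct_lines = []
fhand = open(path)
for line in fhand:
    line = line.rstrip()
    if '@uct.ac.za' not in line:
        continue
    uct_lines.append(line)
    print(line)
fhand.close()
```

Note that startswith only matches at the beginning of a line, while in matches anywhere, which is why the X-Note line shows up in the second result but not the first.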
It will print out the ones that make it past here, okay? So in is another way to do searching, along with startswith, et cetera. So one more thing that you might want to try, now that we can count, is a pattern for prompting for a file name. You'll get tired of changing your code every time you want to open a different file, because you probably want to run the program with mbox once and mbox-short once, just so you can test it with different sets of data. So here's just another pattern. We add this line to say raw_input, enter the file name. And there you go, we'll type in the file name. And then the thing that we open is whatever we entered as the file name. And then the rest of it is pretty much yada yada. So here it's reading the whole file. If the line starts with Subject:, count equals count plus one. And then there were 1797 subject lines in mbox.txt. There were 27 subject lines in mbox-short.txt, okay? So that's prompting for the file name. Now, open. The open statement fails if the file doesn't exist. If you're just writing code for yourself and you assume that everything's okay, then you don't have to write try and except. But if you want to catch [SOUND] a bad file name, then you take the open and turn it into these four lines. So this is the code that we think might blow up, and we know it's going to blow up. If they enter a bad file name like na na boo boo, right, this is going to blow up. So what do we do? We use try and except. We put try around that. We're going to take out some insurance on that particular line. And then, if it fails, we're going to print this message and then say exit, to get out. So if you get a good file, it works, skips the except, then runs the loop and prints out the count. That's what's happening here.
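The following is a hedged sketch of the same idea in Python 3, with a few deliberate departures from what the lecture describes. The counting code is wrapped in a hypothetical function, count_subject_lines, which returns None on a bad file name instead of calling exit, so that it can be tried without ending the session. It catches OSError rather than using a bare except. And the file name is passed in directly instead of being typed at a prompt (in Python 3 the prompt would use input where the lecture uses raw_input). The sample file is made up.

```python
import os
import tempfile

def count_subject_lines(fname):
    """Count lines starting with 'Subject:'; return None if fname cannot be opened."""
    try:
        fhand = open(fname)          # the one line we are taking insurance out on
    except OSError:
        print('File cannot be opened:', fname)
        return None
    count = 0
    for line in fhand:
        if line.startswith('Subject:'):
            count = count + 1
    fhand.close()
    return count

# Hypothetical sample file, written to a temporary folder.
folder = tempfile.mkdtemp()
good = os.path.join(folder, 'sample.txt')
with open(good, 'w') as out:
    out.write('From: alice@example.com\nSubject: first\n\nSubject: second\n')

# In an interactive program the name would come from the user,
# for example: fname = input('Enter the file name: ')
print(count_subject_lines(good))     # a file that exists
bad = os.path.join(folder, 'na na boo boo')
print(count_subject_lines(bad))      # a file that does not exist
```

Keeping only the open call inside the try block means that an error in the counting loop itself would still show up as a normal traceback rather than being mistaken for a bad file name.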
If, on the other hand, you get a bad file, it comes here, open blows up, it runs the except, prints this out, and then quits. So that's how this one works with a bad file. And now, no traceback, right? So, it's kind of a short lecture. We're done with Chapter Seven. We open a file. We read the file. We take out white space at the end with rstrip. We used string functions. So this is kind of putting it all together. And they're kind of short little programs for now. But starting now, we are going to start putting these things together and start actually doing work. Because now, from the first few chapters, we have the basic capabilities of Python. Now we have some data to work with. And going forward we are going to do increasingly sophisticated things with that data. So I can't wait to see you in the next lecture.