Python for Informatics - Chapter 7 Files

  • 0:00 - 0:02
    Welcome to Chapter Seven.
  • 0:02 - 0:04
    Python for Informatics: Exploring
    Information.
  • 0:04 - 0:05
    I'm Charles Severance.
  • 0:05 - 0:10
    I'm the author of the book and your host.
    And, as always, this is brought to you by.
  • 0:10 - 0:10
    No, I'm sorry.
  • 0:10 - 0:15
    It's all creative copyright, Creative
    Commons Attribution.
  • 0:15 - 0:19
    The audio, the video, the slides, and even
    the book.
  • 0:19 - 0:21
    So, here we go.
  • 0:21 - 0:25
    Oh, and so, frankly, where
    we've been working
  • 0:25 - 0:34
    all along is, we have been writing code
    and talking to the CPU.
  • 0:34 - 0:37
    Hang on, let me, let me go get
    my CPU and stuff.
  • 0:37 - 0:42
    Hang on, be right back.
  • 0:44 - 0:50
    [SOUND]
    Okay.
  • 0:50 - 0:54
    Here we go. Here we go.
  • 0:54 - 0:59
    Here's all the stuff. Remember the stuff
    from the first lecture?
  • 1:01 - 1:02
    There we go with that.
  • 1:03 - 1:06
    Remember the motherboard from the first
    lecture?
  • 1:06 - 1:08
    This is kind of the picture of what's on
    the screen.
  • 1:09 - 1:12
    The motherboard, the CPU plugs in here,
    memory plugs in here.
  • 1:12 - 1:18
    And remember how the CPU is sort of the
    brains, as
  • 1:18 - 1:23
    much brains as there is, for the operation.
    The CPU is asking what next.
  • 1:23 - 1:26
    The instructions come in through these
    little pins.
  • 1:26 - 1:30
    There's data inside, and it stores sort of
    semi-permanent
  • 1:30 - 1:33
    data, variables, are all stored pretty
    much here in RAM.
  • 1:35 - 1:38
    And we write our programs, and so your
    Python programs, they're sitting here
  • 1:38 - 1:44
    in this RAM, and they're being fed to this
    CPU through those chips.
  • 1:44 - 1:45
    Through those pins, right?
  • 1:45 - 1:48
    The pins, I mean it doesn't really connect
    like that.
  • 1:48 - 1:52
    And so, so frankly, up to now, everything
    that we've been doing
  • 1:52 - 1:55
    is just the Python programming language.
  • 1:55 - 1:58
    And so the only place we've really been
    operating is here.
  • 2:00 - 2:03
    We have been putting Python into the main
    memory.
  • 2:03 - 2:06
    And the main memory. And we have
  • 2:06 - 2:10
    been effectively feeding instructions to
    the CPU,
  • 2:10 - 2:14
    the central processing unit, as it needed
    them, and then the program would stop.
  • 2:14 - 2:16
    And everything we've done so far
  • 2:16 - 2:17
    everything
  • 2:17 - 2:22
    is just sort of fiddling around here.
    We have never escaped it.
  • 2:22 - 2:26
    So now we are finally going to escape
  • 2:26 - 2:28
    from the central processing unit and the
    memory.
  • 2:29 - 2:32
    We'll still write programs and have
    variables in here.
  • 2:33 - 2:39
    But now we're going to use the disk,
    the secondary storage, the
  • 2:39 - 2:44
    permanent media, right?
    So if I go grab my Raspberry Pi,
  • 2:44 - 2:46
    alright, that goes right there.
  • 2:46 - 2:51
    Here's my Raspberry Pi, so here we've got
    the Raspberry Pi, which is the small version,
  • 2:51 - 2:56
    which of course has a CPU, memory, and
  • 2:56 - 2:59
    graphics processor, all in this little chip
    right here.
  • 2:59 - 3:03
    But the secondary memory
    is this little
  • 3:03 - 3:06
    SD card that is the secondary memory for
    Raspberry Pi.
  • 3:06 - 3:08
    So the structure of the Raspberry Pi is
  • 3:08 - 3:09
    exactly the same as the structure
    of any other
  • 3:09 - 3:13
    personal computer, it's just smaller and
    less expensive.
  • 3:13 - 3:15
    And so in the Raspberry Pi, if you're
  • 3:15 - 3:18
    programming the Raspberry Pi, you're sort
    of finally escaping.
  • 3:18 - 3:20
    All your programs were in here.
  • 3:20 - 3:24
    Your CPU is in here and that's pretty much
    how, how far you've got to run.
  • 3:24 - 3:29
    But now, of course when you save your files,
    you save them to here.
  • 3:29 - 3:35
    But now we are going to start looking at
    data on the disk drive and so it's time
  • 3:35 - 3:39
    to escape to the secondary memory.
    Okay?
  • 3:39 - 3:41
    Time to escape to the secondary memory.
  • 3:41 - 3:44
    And Raspberry Pi, you can go right there.
    Okay?
  • 3:44 - 3:46
    So it's time to find some data to mess
    with.
  • 3:46 - 3:49
    So a lot of what we've been doing so far
    is just
  • 3:49 - 3:53
    kind of the pre-work to get to the point
    where we can do this.
  • 3:53 - 3:55
    And in here we're going to have data files.
  • 3:55 - 3:56
    Now, we've been making data files.
  • 3:56 - 4:00
    You've been writing, every Python program
    that you write on your computer gets saved
  • 4:00 - 4:03
    as a file. Then Python reads the file and runs it.
  • 4:04 - 4:07
    But now we're actually going to start
    messing with some data.
  • 4:09 - 4:12
    And so, files are where we're going to be
    working.
  • 4:12 - 4:17
    And so, one of things about secondary memory
    is it's much larger.
  • 4:19 - 4:21
    And this is, main memory of the computer
    is pretty large, it's just
  • 4:21 - 4:26
    not large enough to hold everything that
    the computer is capable of holding.
  • 4:26 - 4:28
    So the files that we're going to work with.
  • 4:28 - 4:32
    Now we're not talking about image files or
    Quicktime movies or things like that.
  • 4:32 - 4:34
    We're going to work with text files
    because the
  • 4:34 - 4:38
    theme of this course is digging through
    text.
  • 4:38 - 4:39
    Sometimes we'll pull it off the Internet.
  • 4:39 - 4:42
    Sometimes we'll read files, but it's
    digging through and
  • 4:42 - 4:44
    using all the things that we've learned so
    far,
  • 4:44 - 4:46
    looping and strings, and all those things,
  • 4:46 - 4:49
    to make sense of a sequence of
    information.
  • 4:51 - 4:52
    Okay?
  • 4:52 - 4:58
    Now, to access file information, we have
    to do this thing called opening the file.
  • 4:58 - 5:02
    We can't just say, yo, the information is
    just omnipresent because there is
  • 5:02 - 5:06
    so much data that you can't have Python
    sort of know all the data.
  • 5:06 - 5:09
    You literally have hundreds of thousands
    of files on
  • 5:09 - 5:12
    your computer's hard drive.
    And you,
  • 5:12 - 5:14
    which one are you going to read?
  • 5:14 - 5:16
    So there's a step that you have to do,
  • 5:16 - 5:19
    that you call this built-in function
    called open.
  • 5:19 - 5:22
    And say, oh, this is the file that I
    want to work with,
  • 5:22 - 5:24
    of the hundreds of thousands, and then
    once you do,
  • 5:24 - 5:28
    you've kind of got this little
    connector into it.
  • 5:28 - 5:32
    And the open is a built-in function inside
    Python.
  • 5:32 - 5:34
    Hang on a sec, let's say good bye to that.
    The open
  • 5:34 - 5:40
    function is a built-in function in Python,
    and it takes two parameters.
  • 5:40 - 5:46
    The first parameter is the name of the
    file, like mbox.txt,
  • 5:46 - 5:49
    and then the second is how you're going to
    read it.
  • 5:49 - 5:49
    Are you going to read it?
  • 5:49 - 5:50
    are you going to write it? et cetera.
  • 5:50 - 5:53
    Now most of the time we'll be reading our
    files.
  • 5:53 - 5:56
    So we call the open function and pass it
    in the name of
  • 5:56 - 5:59
    the file we want to open, and then how we
    want to read it.
  • 5:59 - 6:02
    Now you can leave this second parameter
    off and it
  • 6:02 - 6:05
    assumes that you're going to want to read
    the file.
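A minimal sketch of the two `open` parameters described above, written for Python 3 (an assumption; the lecture itself uses Python 2). The file name `sample.txt` and its contents are made up so the example can run without `mbox.txt`.

```python
import os
import tempfile

# Hypothetical sample file in a temporary folder; the lecture uses mbox.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:      # second parameter "w": open for writing
    out.write("Hello\nWorld\n")

fhand = open(path, "r")           # second parameter "r": open for reading
same = open(path)                 # mode left off, so Python assumes reading
modes = (fhand.mode, same.mode)
fhand.close()
same.close()

print(modes)
```

Both handles report the same mode, which is what leaving the second parameter off amounts to.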
  • 6:05 - 6:05
    Now.
  • 6:09 - 6:12
    When the open is successful, it doesn't
    actually read all
  • 6:12 - 6:17
    of the data because the memory is small,
    small compared to
  • 6:17 - 6:19
    the hard drive, and so you have to sort of
  • 6:19 - 6:22
    step through the data, you'll tell it when
    to read it.
  • 6:22 - 6:27
    So the act of opening it is not
    actually reading all data.
  • 6:27 - 6:31
    It is creating kind of like a connection
    between the
  • 6:31 - 6:33
    memory and the data that's on the hard
    drive, right?
  • 6:33 - 6:34
    It's connecting
  • 6:34 - 6:38
    between, oh listen to this.
    Oh that's going to fall down.
  • 6:38 - 6:42
    Is it going to stand up that way?
  • 6:42 - 6:45
    Oh, I should come up with a way to
    make that stand.
  • 6:46 - 6:48
    So it's a connection.
  • 6:48 - 6:50
    So the, your program's kind of running in
    here.
  • 6:50 - 6:54
    And the, and the file handle is just sort
    of it's
  • 6:54 - 6:58
    like a phone call between your memory and
    your disk drive.
  • 6:58 - 7:00
    It's not the actual data.
    The actual data is still
  • 7:00 - 7:06
    sitting on the disk drive, okay?
    So, a graphical way to take a look at this
  • 7:06 - 7:12
    is, the file handle, the thing that comes
    back from the open request.
  • 7:12 - 7:15
    The open goes and finds the file out on
    the disk drive and
  • 7:15 - 7:20
    yada, yada, yada, and then the handle is
    something that lives in the memory
  • 7:20 - 7:22
    that is sort of like the thing that
  • 7:22 - 7:26
    maintains its connection to where all the
    data is
  • 7:26 - 7:29
    on the disk or on the SD card that's in it.
  • 7:29 - 7:31
    So the handle is not all the data, but it is
  • 7:31 - 7:34
    a mechanism that you can use to get at the
    data.
  • 7:34 - 7:38
    So if you print it out, it doesn't have
    all the data from the file,
  • 7:38 - 7:44
    it says, I am a file handle that's opened
    this file and we're in read mode.
  • 7:44 - 7:46
    So, that doesn't actually have the data,
  • 7:46 - 7:48
    even though this is the data that's
    in the file.
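To make the "handle is not the data" point concrete, here is a small Python 3 sketch (the lecture uses Python 2, where the printed text looks different). The file name and contents are invented for illustration.

```python
import os
import tempfile

# Hypothetical one-line file standing in for the lecture's data file.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("first line\n")

fhand = open(path)
description = repr(fhand)   # what print(fhand) would display
fhand.close()

# Shows the file name and mode of the handle, not the contents of the file.
print(description)
```

The printed description names the file and the mode, and the text stored in the file does not appear in it.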
  • 7:48 - 7:51
    And then we have operations that we do to
    the handle like open it,
  • 7:51 - 7:53
    close it, read it, write it.
    So we do things.
  • 7:53 - 7:56
    So, so the handle and then through the
    handle it actually changes
  • 7:56 - 7:59
    what's on the disk or reads
    what's on the disk.
  • 7:59 - 8:02
    So the handle is kind of a thing that's
    not there.
  • 8:03 - 8:06
    If you attempt to open a file and the name
    of the file.
  • 8:06 - 8:09
    Now the way we're going to do these is
    these need to be
  • 8:09 - 8:14
    in the same folder on your computer as
    your Python code.
  • 8:14 - 8:16
    Now, there are trickier ways to do it, but
  • 8:16 - 8:17
    we're going to keep it simple.
  • 8:17 - 8:19
    This is the name of a file in the
  • 8:19 - 8:22
    same folder as the Python code that you're
    running.
  • 8:22 - 8:28
    [SOUND] And if it's not, then we get, of
    course, a traceback and we're
  • 8:28 - 8:32
    used to reading tracebacks by
    now: no such file or directory, stuff.txt.
  • 8:32 - 8:35
    Oh, of course, I forgot to save it or I
    typed it wrong.
  • 8:38 - 8:39
    So.
  • 8:39 - 8:43
    The next thing we have to learn is the
    notion of the newline character.
  • 8:43 - 8:44
    You haven't seen this so far,
  • 8:44 - 8:48
    but there's a special character in files
  • 8:48 - 8:52
    that is used to indicate the end of a line.
  • 8:52 - 8:54
    Because these text files that we've been
    writing,
  • 8:54 - 8:58
    including Python programs that you have,
    are organized into lines.
  • 8:58 - 9:00
    Each line has variable length and there is
  • 9:00 - 9:03
    a special non-printing character that you
    just don't see.
  • 9:03 - 9:06
    Now you see it because you see a line,
  • 9:06 - 9:11
    multiple lines, but you don't see the
    character itself.
  • 9:11 - 9:13
    So it turns out that this character is
    very
  • 9:13 - 9:16
    important because the data is just a
    stream of
  • 9:16 - 9:19
    characters on disk and then it's
    punctuated by newlines
  • 9:19 - 9:22
    that tell it when it's time to end the
    line.
  • 9:22 - 9:29
    So if we are building a string, the
    constant for newline is backslash n.
  • 9:29 - 9:33
    And so, when we make a string that we
    want to
  • 9:33 - 9:38
    have a newline in it, we'll say Hello
    backslash n World.
  • 9:38 - 9:41
    And then if you print it out one way, you
    actually see the backslash n.
  • 9:41 - 9:44
    But then if you use the print to print it
    out, you see sort of
  • 9:44 - 9:50
    like the, it moves back down, you know,
    to the left margin and down.
  • 9:50 - 9:56
    So, so, sometimes you see the slash n
    and sometimes it's shown as movement.
  • 9:56 - 9:57
    Right? You, it moves it.
  • 9:59 - 10:00
    The other thing that's important is even
  • 10:00 - 10:02
    though we represent this as two
    characters,
  • 10:02 - 10:06
    the backslash n is represented as two characters
    in a string, it's actually one character.
  • 10:06 - 10:10
    So if we print it out, we see
    X newline Y
  • 10:10 - 10:13
    and if we ask how many characters are
    in stuff,
  • 10:13 - 10:17
    which is this string, it says 3.
    That's important.
  • 10:17 - 10:18
    Okay?
  • 10:18 - 10:22
    There is one, two, three.
    The newline is a single character.
  • 10:22 - 10:27
    This is just a syntax that we use to
    sort of encode a newline in a string.
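The newline behavior described above can be checked directly. This is a Python 3 sketch (an assumption; the lecture uses the Python 2 print statement), using the same `Hello\nWorld` and `X\nY` strings mentioned in the lecture.

```python
greeting = "Hello\nWorld"
print(repr(greeting))   # one way of printing shows the backslash-n escape
print(greeting)         # print shows the newline as movement to the next line

stuff = "X\nY"
print(stuff)
length = len(stuff)     # X, newline, Y: the newline counts as one character
print(length)
```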
  • 10:28 - 10:28
    Okay?
  • 10:29 - 10:34
    So, even though these are just a
  • 10:34 - 10:37
    long sequence of characters punctuated by
    newlines,
  • 10:37 - 10:41
    visually, text editors and operating
    systems show them, show
  • 10:41 - 10:44
    these files to us as a sequence of lines.
  • 10:44 - 10:46
    And it doesn't take very long to just
    start thinking about them
  • 10:46 - 10:48
    as a sequence of lines.
  • 10:48 - 10:51
    As a matter of fact, maybe you wish
    I'd never told you about newlines.
  • 10:52 - 10:53
    But when we start reading files, we're
  • 10:53 - 10:55
    going to have to deal with these newlines.
  • 10:55 - 10:59
    So the way that we sort of have to
    mentally visualize what these text
  • 10:59 - 11:04
    files look like is they have a newline
    that punctuates the end of the line.
  • 11:04 - 11:09
    Now in reality, if we look at this, this
    R really comes right after it.
  • 11:09 - 11:09
    Right?
  • 11:09 - 11:13
    This is all a bunch of characters and the
    newlines are punctuation, okay?
  • 11:13 - 11:17
    To say this is first line, second line,
    third line, and fourth line.
  • 11:17 - 11:19
    So, you gotta think that each of these
    things
  • 11:19 - 11:22
    is here, sitting at the end of the line.
  • 11:22 - 11:25
    And so the number of characters in this
    line include that newline.
  • 11:25 - 11:27
    Now the newline is one character.
  • 11:27 - 11:32
    Okay? So, how do we read these files?
  • 11:32 - 11:36
    Well, we've already talked about doing an
    open xfile.
  • 11:36 - 11:39
    And I'm just, this xfile, again that's
    just a mnemonic
  • 11:39 - 11:42
    name that I made up. This is a handle.
  • 11:42 - 11:44
    Remember, it's not all the data.
  • 11:44 - 11:46
    But the handle is the way that we can read
    the data.
  • 11:46 - 11:49
    We can use it as an access point.
  • 11:49 - 11:52
    The coolest way to read a file, if it's a
    text file in multiple
  • 11:52 - 11:58
    lines, is to use a definite loop, a
    for loop. for cheese in xfile.
  • 11:58 - 12:03
    So this, remember we would put a list of
    numbers or a string here.
  • 12:03 - 12:04
    Now we've put a file
  • 12:04 - 12:05
    handle here.
  • 12:05 - 12:09
    Python knows automatically that each time
    we are going to run this
  • 12:09 - 12:12
    loop, it's going to go to the next line of
    the file.
  • 12:12 - 12:16
    Automatically. And cheese is just a
    stupid name that I came up with.
  • 12:16 - 12:20
    It would be better to call it line rather than
    cheese, but for cheese in and then it goes
  • 12:20 - 12:23
    dot, dot, dot, dot, dot, dot, dot,
    each file
  • 12:23 - 12:26
    and then it stops when it reads
    the whole file.
  • 12:26 - 12:29
    So this line will print out every line
  • 12:29 - 12:34
    in the file, that's how you do it.
    These three lines open a file,
  • 12:36 - 12:42
    read every line in the file, okay?
    So a file handle itself is a special kind
  • 12:42 - 12:47
    of a sequence, much like a list of numbers
    or a string is a sequence of characters.
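A short sketch of the `for cheese in xfile` loop described above, in Python 3 (the lecture uses Python 2). The three-line file is made up so the example runs on its own; the list `lines_seen` is only there to show what each pass of the loop receives.

```python
import os
import tempfile

# Made-up three-line file standing in for the lecture's mbox.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("first\nsecond\nthird\n")

lines_seen = []
xfile = open(path)
for cheese in xfile:            # each pass hands back the next line of the file
    lines_seen.append(cheese)   # the line still ends with its newline character
xfile.close()

print(lines_seen)
```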
  • 12:47 - 12:49
    So one of the things we can do to combine
    one of
  • 12:49 - 12:52
    our counting idioms is count the number of
    lines in a file.
  • 12:53 - 12:54
    Okay? And so how we
  • 12:54 - 12:57
    would do that is we would open
    the file, set a
  • 12:57 - 13:01
    counter to zero, this time I'll use a
    mnemonic variable called count.
  • 13:01 - 13:03
    For line in fhand, that says run this
  • 13:03 - 13:06
    indented text once for each line in the
    file.
  • 13:06 - 13:08
    For each line in the file, add count equals
    count plus 1.
  • 13:08 - 13:11
    When the for loop is done, print the
    count.
  • 13:13 - 13:14
    Pretty straightforward.
  • 13:14 - 13:18
    Very few other languages are capable of
    writing that program in
  • 13:18 - 13:22
    as quick and as dense and succinct a way as
    Python is.
  • 13:22 - 13:25
    Python does a really, really nice
    job of this.
  • 13:25 - 13:28
    Okay? So that's how you count the lines.
  • 13:28 - 13:31
    Open it, write a for loop, and then add
    one.
  • 13:31 - 13:36
    Now we, we can't just say, so what you
    can't do, and this gives you a sense.
  • 13:36 - 13:37
    You can't say len,
  • 13:37 - 13:40
    fhand.
  • 13:40 - 13:43
    And that's because this isn't really the
    data.
  • 13:43 - 13:45
    That's sort of, you have to like pull the,
    pull it
  • 13:45 - 13:48
    and read it to get the data out of it.
  • 13:48 - 13:50
    Although we'll see another way of reading
    it later.
  • 13:51 - 13:53
    Okay? So that's counting the lines in a
    file.
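The counting loop, and the reason `len` does not work on a handle, can be sketched as follows. This assumes Python 3 and an invented three-line sample file; the variable `len_worked` is a hypothetical name used only to record what happened.

```python
import os
import tempfile

# Made-up sample file; the lecture counts the lines of mbox.txt instead.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("first\nsecond\nthird\n")

fhand = open(path)
count = 0
for line in fhand:       # runs the indented code once per line in the file
    count = count + 1
print("Line Count:", count)

# The handle is a connection to the data, not the data, so it has no length.
try:
    len(fhand)
    len_worked = True
except TypeError:
    len_worked = False
fhand.close()
```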
  • 13:55 - 13:57
    It turns out you can also read the entire
    file.
  • 13:59 - 14:02
    Now if you read the entire file, it's not
    broken into lines.
  • 14:02 - 14:04
    You're getting all the characters
    punctuated
  • 14:04 - 14:06
    by newlines and you get everything.
  • 14:06 - 14:10
    Now you don't want to read this if it's
    too big, so it's
  • 14:10 - 14:13
    going to try to read it all into the
    memory of the computer.
  • 14:13 - 14:16
    And if the memory is not big enough,
    you're going to slow down to a crawl.
  • 14:16 - 14:19
    But if it's a real tiny file, this works
    just fine.
  • 14:19 - 14:22
    And so, so we have sort of real, we open
  • 14:22 - 14:27
    a file and we say fhand.read, this is
    basically saying, hey,
  • 14:27 - 14:31
    dear fhand, read it all and return it to
    me as a string.
  • 14:32 - 14:34
    So that's a string with all the lines of
    the file concatenated
  • 14:34 - 14:39
    together with newlines, which is actually
    exactly what's in the file.
  • 14:39 - 14:40
    It's the raw data.
  • 14:40 - 14:42
    That for loop sort of looks for the newline
  • 14:42 - 14:44
    and does all of the stuff
    automatically for us.
  • 14:44 - 14:45
    It's quite nice.
  • 14:46 - 14:50
    So then we can, like, because inp is a
    string at this point,
  • 14:50 - 14:51
    we can just print the length of it.
  • 14:51 - 14:53
    And we can say, oh, there's 94,626
  • 14:53 - 14:57
    characters that came from that file.
  • 14:57 - 15:02
    It reads the whole thing, whole file,
    reads the whole file.
  • 15:02 - 15:04
    We can also do things like, you know, slice
    it now.
  • 15:04 - 15:10
    And so this is the first 20 characters,
    up from zero up to, but not including, 20.
  • 15:10 - 15:13
    So this, this is our file. Okay?
  • 15:13 - 15:16
    So that's reading through the whole file.
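Reading a whole file into one string might look like the sketch below (Python 3 assumed). The file contents and the addresses in them are made up, so the character count will differ from the 94,626 reported for the lecture's file.

```python
import os
import tempfile

# Invented contents; only the shape (lines ending in newlines) matters here.
text = "From: alice@example.com\nSubject: hello\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write(text)

fhand = open(path)
inp = fhand.read()       # the entire file as one string, newlines included
fhand.close()

print(len(inp))          # number of characters that came from the file
first_20 = inp[:20]      # characters 0 up to, but not including, 20
print(first_20)
```

This is only reasonable for files small enough to fit comfortably in memory, as the lecture points out.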
  • 15:16 - 15:18
    So, let me go back a little bit, this is
    the file that we're
  • 15:18 - 15:19
    going to play with.
  • 15:20 - 15:25
    This file here that we're going to play
    with in this class is a mailbox file.
  • 15:25 - 15:27
    And this is actual real data.
    And these are real people.
  • 15:27 - 15:29
    And these are real dates, having to do
    with
  • 15:29 - 15:32
    an open source project that I worked on
    called Sakai.
  • 15:32 - 15:36
    I actually have a tattoo of Sakai here on
    my shoulder.
  • 15:36 - 15:38
    Maybe in an upcoming lecture, I'll have a
  • 15:38 - 15:40
    short-sleeved shirt, and show you my
    tattoo.
  • 15:40 - 15:44
    But for now, I can't because I've got, got
    clothes on.
  • 15:44 - 15:52
    So, but this is real data.
    It's the mbox.txt, mbox.txt file.
  • 15:52 - 15:56
    So, so that's the file that we're going to
    use for most of the next few assignments.
  • 15:56 - 15:58
    It'll be the same file. You'll get tired of it.
  • 15:58 - 16:00
    And you'll get to know all these people,
    Stephen,
  • 16:00 - 16:02
    Chen Wen, and all the people in the file.
  • 16:05 - 16:06
    Okay, so.
  • 16:07 - 16:10
    We can search for lines that have a
    prefix.
  • 16:10 - 16:14
    This is kind of the find pattern from the
    looping lecture.
  • 16:14 - 16:18
    So we're going to go through a list of, of
    lines in a file,
  • 16:18 - 16:21
    and we're going to only print out the ones
    that match a certain thing.
  • 16:21 - 16:23
    So again, we open the file up.
  • 16:23 - 16:25
    We're going to write a for loop that's
    going to say, for each line in the
  • 16:25 - 16:30
    file, if the line and then we can call a,
    a utility function
  • 16:30 - 16:33
    inside of string, because line is a string.
  • 16:33 - 16:35
    If line startswith From, print it out.
  • 16:35 - 16:38
    So this means it's going to loop through
    all of the lines in the
  • 16:38 - 16:43
    file and it's going to print the ones that
    start with the string 'From:'
  • 16:45 - 16:46
    Okay?
  • 16:46 - 16:50
    Again, four lines, complete Python program
    to read this
  • 16:50 - 16:53
    file and print the lines that have a
    prefix of from.
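The four-line search program can be sketched like this in Python 3 (the lecture uses the Python 2 print statement). The sample lines and addresses are hypothetical, and the `matches` list is an addition used only to show exactly which strings were printed.

```python
import os
import tempfile

# Made-up mailbox-style lines standing in for mbox.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("From: alice@example.com\nSubject: hello\n"
              "From: bob@example.com\nSubject: bye\n")

matches = []
fhand = open(path)
for line in fhand:
    if line.startswith("From:"):   # string method, because line is a string
        print(line)
        matches.append(line)
fhand.close()
```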
  • 16:55 - 16:59
    So, if you run this program, and I suggest
    that you do,
  • 17:01 - 17:03
    this is what the output's going to look like.
  • 17:04 - 17:07
    And it's like, wait a second, I'm seeing
    the lines,
  • 17:10 - 17:14
    seeing the lines that have the froms, but
    then I get these blank lines.
  • 17:17 - 17:19
    And why is that?
    Why are these blank lines there?
  • 17:19 - 17:24
    If I look at the program, I mean, I'm not
    printing blank lines.
  • 17:24 - 17:26
    I'm only printing lines that
    start with from.
  • 17:26 - 17:28
    I'm not doing that, so why?
  • 17:31 - 17:31
    What do you think?
  • 17:32 - 17:33
    I'll give you a second.
  • 17:35 - 17:38
    I've certainly done enough foreshadowing
    in this lecture.
  • 17:38 - 17:41
    Well it turns out these newlines are the
    problem.
  • 17:41 - 17:44
    So it turns out that the print, we've been
    doing this
  • 17:44 - 17:47
    all along, you just, we didn't make a fuss
    about it.
  • 17:47 - 17:50
    The print adds a newline at the end of
    everything that it prints.
  • 17:50 - 17:53
    So the yellow newlines are coming from
    the print statement.
  • 17:53 - 17:58
    But when we read the file, each line ends
    in a newline.
  • 17:58 - 18:00
    So these green newlines are actually from
    the file.
  • 18:03 - 18:06
    They're the ones from the file.
  • 18:06 - 18:08
    So what's happening is we're seeing two
  • 18:08 - 18:11
    newlines, and so that turns into a
    blank line.
  • 18:12 - 18:14
    So, how do we deal with that?
  • 18:14 - 18:19
    Well, we've got a string function that
    conveniently solves that problem, okay?
  • 18:19 - 18:21
    And that is we're going to call rstrip.
  • 18:21 - 18:25
    If you recall, we had strip, lstrip, and
    rstrip to strip
  • 18:25 - 18:28
    white space on one side, on the other
    side, or on both sides.
  • 18:28 - 18:30
    So in this one,
  • 18:30 - 18:31
    we're going to use rstrip.
  • 18:31 - 18:33
    We're going to say, we're going to read
    the line, that
  • 18:33 - 18:36
    this line is going to have a newline in it.
  • 18:36 - 18:40
    rstrip says pull white space, and the
    newlines are also counted as white space.
  • 18:40 - 18:43
    Blanks or newlines are white space.
  • 18:43 - 18:47
    And then we're going to replace this with
    no newline in it.
  • 18:47 - 18:50
    Then we're going to ask if it starts with
    a from and then we're going to print it
  • 18:50 - 18:52
    out, and then we go and we're going to
  • 18:52 - 18:55
    see exactly what we're looking for
    in this file.
  • 18:55 - 18:56
    And there's no newlines.
  • 18:56 - 19:01
    So the newline that's coming out here
    is the one from the print, not the
  • 19:01 - 19:04
    one from the file, because the one from
  • 19:04 - 19:07
    the file got wiped out by that particular
    line.
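A sketch of the `rstrip` fix described above, assuming Python 3 and the same kind of invented sample file. The `cleaned` list is a hypothetical addition that records each line after the trailing newline has been removed.

```python
import os
import tempfile

# Made-up mailbox-style lines standing in for mbox.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("From: alice@example.com\nSubject: hello\n"
              "From: bob@example.com\nSubject: bye\n")

cleaned = []
fhand = open(path)
for line in fhand:
    line = line.rstrip()           # drop whitespace, including the newline, on the right
    if line.startswith("From:"):
        print(line)                # the only newline now comes from print itself
        cleaned.append(line)
fhand.close()
```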
  • 19:08 - 19:08
    Okay?
  • 19:10 - 19:13
    So another general pattern of these
    file-based loops
  • 19:13 - 19:18
    that we have done this, is a skipping
    pattern.
  • 19:18 - 19:20
    Now, you can do, the, the non-skipping
    pattern
  • 19:20 - 19:23
    is where you're saying, I'm going to look
    for lines
  • 19:23 - 19:26
    that start with from and do something to
    them.
  • 19:26 - 19:30
    Sometimes you want to say,
  • 19:30 - 19:33
    here's a bunch of lines I'm going to
    skip, and then I'm going to do something.
  • 19:33 - 19:37
    So the skipping pattern uses continue.
  • 19:37 - 19:39
    And so the first few lines here are the
    same.
  • 19:39 - 19:42
    We open a file, we read each line
    in the file,
  • 19:42 - 19:44
    but we're going to strip off the white
    space.
  • 19:44 - 19:46
    You're going to get tired of typing these
    three lines,
  • 19:46 - 19:47
    because you're going to do it a lot.
  • 19:47 - 19:52
    Open the file, start reading the file,
    strip the whitespace for each line.
  • 19:52 - 19:58
    And you can make it so that you can look
    for some fact.
  • 19:58 - 20:01
    In this case, I'm going to say, if not
    line startswith From, this
  • 20:01 - 20:05
    means this is true for all the lines that
    don't start with from,
  • 20:05 - 20:09
    continue. And if you remember, continue
    goes up.
  • 20:09 - 20:11
    So the continue says I'm done, it
    finishes
  • 20:11 - 20:14
    the iteration, and it doesn't do anything
    down here.
  • 20:14 - 20:15
    Okay?
  • 20:15 - 20:18
    And so it, this is a, and then, we can do
    something.
  • 20:18 - 20:21
    So, I've kind of flipped this, where I
    said, these are the
  • 20:21 - 20:25
    things I'm interested in,
    that's lines that start with from.
  • 20:25 - 20:26
    So, I'm going to skip the lines that
    don't.
  • 20:26 - 20:28
    So I'm going to use continue.
  • 20:28 - 20:32
    Either way you can do it, depending on the
    complexity or how much.
  • 20:32 - 20:34
    Often when you're, this is a good pattern
    when
  • 20:34 - 20:36
    you have lots of lines of code down here
  • 20:36 - 20:38
    that you're going to do a lot of cool
    stuff with.
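The skipping pattern with `continue` could be written roughly as below (Python 3 assumed, sample data invented). The `kept` list stands in for whatever "cool stuff" would be done with the lines that are not skipped.

```python
import os
import tempfile

# Made-up mailbox-style lines standing in for mbox.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("From: alice@example.com\nSubject: hello\n"
              "From: bob@example.com\nSubject: bye\n")

kept = []
fhand = open(path)
for line in fhand:
    line = line.rstrip()
    if not line.startswith("From:"):
        continue                   # skip this line and go up to the next one
    # Only the interesting lines reach this point.
    print(line)
    kept.append(line)
fhand.close()
```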
  • 20:39 - 20:43
    You can also use things like in to select
    lines.
  • 20:43 - 20:43
    Right?
  • 20:43 - 20:51
    So I'm going to, I'm going to look for
    lines that have @uct.ac.za in them.
  • 20:51 - 20:53
    So again, I'm going to open it up.
  • 20:53 - 20:56
    I'm going to open these, go through each
    line in the file.
  • 20:56 - 21:01
    I'm going to strip the white space out,
    and [COUGH]
  • 21:01 - 21:03
    if not u-c-t,
  • 21:03 - 21:08
    if this string is not in line, then I'm
    going to continue.
  • 21:08 - 21:12
    So it's a way for me to skip all of the
    lines that don't have this string in it.
  • 21:14 - 21:19
    So these lines do, that one has it too,
    and then we're going to print it out.
  • 21:19 - 21:24
    It will print out the ones that make it past
    here, okay?
  • 21:24 - 21:28
    So, but in is another way to do searching,
    right, starts with,
  • 21:28 - 21:29
    et cetera.
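Selecting lines with `in` might look like the following Python 3 sketch. The search string `@uct.ac.za` is the one from the lecture; the sample lines, including the `X-Note` header, are invented to show that `in` matches anywhere in the line, not just at the start.

```python
import os
import tempfile

# Made-up lines; only the "@uct.ac.za" search string comes from the lecture.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("From: someone@uct.ac.za\n"
              "From: other@example.com\n"
              "X-Note: relayed via @uct.ac.za host\n")

kept = []
fhand = open(path)
for line in fhand:
    line = line.rstrip()
    if "@uct.ac.za" not in line:   # same test as: not "@uct.ac.za" in line
        continue
    print(line)
    kept.append(line)
fhand.close()
```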
  • 21:31 - 21:38
    So one more thing that you might want to
    try is, so we can count, right?
  • 21:38 - 21:40
    Now, and this is a pattern for prompting
    for a file name.
  • 21:42 - 21:46
    And so, so here you, you'll get tired of
    sort of
  • 21:46 - 21:49
    changing your code every time you want to
    open a different file.
  • 21:49 - 21:51
    because you probably want to run the
    program
  • 21:51 - 21:54
    with mbox once and mbox-short because,
    just so you
  • 21:54 - 21:58
    can test it with different things of data.
    So here's just another pattern.
  • 21:58 - 22:02
    We add this line to say raw_input, enter
    the file name.
  • 22:02 - 22:05
    And there you go, we'll type in the file
    name.
  • 22:05 - 22:08
    And then the thing that we open is
    whatever we entered as the file name.
  • 22:08 - 22:11
    And then the rest of it is pretty much
    yada yada.
  • 22:11 - 22:14
    So here I'm, it's reading the whole file.
  • 22:14 - 22:17
    If the line starts with subject, count
    equals count plus one.
  • 22:17 - 22:19
    And then there were 1797 subject
  • 22:19 - 22:22
    lines in mbox.txt.
  • 22:22 - 22:26
    There were 27 subject lines in
    mbox-short.txt, okay?
  • 22:26 - 22:29
    So that's prompting for the file names.
  • 22:29 - 22:31
    Now, open.
  • 22:31 - 22:35
    The open statement fails if the file name
    doesn't exist.
  • 22:35 - 22:37
    So, you might want to add a try and
  • 22:37 - 22:40
    except around that if you want to, if
    you're just writing
  • 22:40 - 22:43
    code for yourself and you assume that
    everything's okay,
  • 22:43 - 22:45
    then you don't have to write try except
    but if
  • 22:45 - 22:51
    you want to catch it [SOUND]
    and catch a bad file name,
  • 22:51 - 22:56
    then you take the open and turn it
    into these four lines.
  • 22:56 - 22:58
    So this is the code that we think might
    blow up,
  • 23:00 - 23:01
    and it's going to blow up, we know it's
    going to blow up.
  • 23:01 - 23:04
    If they enter a bad file name like
  • 23:04 - 23:07
    na na boo boo, right, this is going to
    blow up.
  • 23:07 - 23:09
    So what do we do?
    We use try and except.
  • 23:09 - 23:10
    We put try
  • 23:10 - 23:10
    around that.
  • 23:10 - 23:14
    We're going to take out some insurance on
    that particular line.
  • 23:14 - 23:17
    And then, if it fails, we're going to
    print
  • 23:17 - 23:20
    this message and then say exit, to get
    out.
  • 23:20 - 23:23
    So if you get a good file,
  • 23:26 - 23:28
    if you get a good file, it works, skips the
  • 23:28 - 23:32
    except, then runs the thing, prints out
    the count.
  • 23:32 - 23:36
    That's what's happening here. If, on the
    other hand, you get a bad file,
  • 23:37 - 23:42
    it comes here, open blows up, runs the
    except, prints this out, and then quits.
  • 23:43 - 23:46
    So that's how this one works with a bad
    file.
  • 23:47 - 23:49
    And now, no traceback, right?
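Prompting for a file name and guarding the `open` with try and except can be sketched together. This is a Python 3 version under a few assumptions: `input` replaces the lecture's `raw_input`, the function name `count_subject_lines` and its `ask` parameter are hypothetical, and the function returns `None` on a bad file name rather than calling `exit()` as the lecture does, so that the sketch can be run and checked without a keyboard.

```python
import os
import tempfile


def count_subject_lines(ask=input):
    """Prompt for a file name and count the lines that start with 'Subject:'.

    Returns None when the file cannot be opened.
    """
    fname = ask("Enter the file name: ")
    try:
        fhand = open(fname)            # the line that might blow up
    except OSError:                    # covers a file that does not exist
        print("File cannot be opened:", fname)
        return None
    count = 0
    for line in fhand:
        if line.startswith("Subject:"):
            count = count + 1
    fhand.close()
    print("There were", count, "subject lines in", fname)
    return count


# Made-up sample file so the sketch can be exercised without typing anything.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.txt")
with open(path, "w") as out:
    out.write("From: alice@example.com\nSubject: hello\nSubject: again\n")

good = count_subject_lines(lambda prompt: path)
bad = count_subject_lines(lambda prompt: os.path.join(folder, "na-na-boo-boo.txt"))
```

Calling `count_subject_lines()` with no argument would prompt at the keyboard. The bad file name produces the message and no traceback.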
  • 23:54 - 23:55
    So we are
  • 23:57 - 24:00
    It's kind of a short lecture.
    We're done with Chapter Seven.
  • 24:01 - 24:04
    We open a file.
  • 24:04 - 24:06
    We read the file.
  • 24:06 - 24:09
    We take out white space at the end with
    rstrip.
  • 24:09 - 24:12
    We had used string functions.
  • 24:12 - 24:15
    So, this is kind of putting it all
    together.
  • 24:15 - 24:17
    And it's kind of short little programs
    now.
  • 24:17 - 24:22
    So, it's not.
    And you know, starting now,
  • 24:22 - 24:25
    we are going to start putting these things
    together and start actually doing work.
  • 24:25 - 24:28
    Because now, we have, from the first few
    chapters,
  • 24:28 - 24:32
    we have basic capabilities of Python.
    Now we have some data to work with.
  • 24:32 - 24:33
    Now going forward
  • 24:33 - 24:37
    we are going to do increasingly
    sophisticated things with that data.
  • 24:37 - 24:38
    So I can't wait to see you in the next
    lecture.
Description:

This is Chapter 7 for Python for Informatics. www.pythonearn.com
All Lectures: http://www.youtube.com/playlist?list=PLlRFEj9H3Oj4JXIwMwN1_ss1Tk8wZShEJ

Video Language:
English
Duration:
24:39
