Welcome to Chapter Seven.
Python for Informatics: Exploring
Information.
I'm Charles Severence.
I'm the author of the book and your host.
And, as always, this is brought to you by.
No, I'm sorry.
It's all creative copyright, Creative
Commons Attribution.
The audio, the video, the slides, and even
the book.
So, here we go.
Oh, and and so, frankly, where
we've been working
all along is, we have been writing code
and talking to the CPU.
Hang on, let me, let me go get
my CPU and stuff.
Hang on, be right back.
[SOUND]
Okay.
Here we go. Here we go.
Here's all the stuff. Remember the stuff
from the first lecture?
There we go with that.
Remember the motherboard from the first
lecture?
This is kind of the picture of what's on
the screen.
The motherboard, the CPU plugs in here,
memory plugs in here.
And remember how the CPU is sort of the
brains, as
much brains as there is, for the operation.
The CPU is asking what next.
The instructions come in through these
little pins.
There's data inside, and it stores sort of
semi-permanent
data, variables, are all stored pretty
much here in RAM.
And we write our programs, and so your
Python programs, they're sitting here
in this RAM, and they're being fed to this
CPU through those chips.
Through those pins, right?
The pins, I mean it doesn't really connect
like that.
And so, so frankly, up to now, everything
that we've been doing
is just the Python programming language.
And so the only place we've really been
operating is here.
We have been putting Python into the main
memory.
And the main memory. And we have
been effectively feeding instructions to
the CPU,
the central processing unit, as it needed
them, and then the program would stop.
And everything we've done so far
everything
is just sort of fiddling around here.
We have never escaped it.
So now we are finally going to escape
from the central processing unit and the
memory.
We'll still write programs and have
variables in here.
But now we're going to use the disk,
the secondary storage, the
permanent media, right?
So if I go grab my Raspberry Pi,
alright, that goes right there.
Here's my Raspberry Pi, so here we've got
the Raspberry Pi, which is the small version,
which of course has a CPU, memory, and
graphics processor, all in this little chip
right here.
But the secondary memory for the,
is this little
SD card that is the secondary memory for
Raspberry Pi.
So the structure of the Raspberry Pi is
exactly the same as the structure
of any other
personal computer, it's just smaller and
less expensive.
And so in the Raspberry Pi, if you're
programming the Raspberry Pi, you're sort
of finally escaping.
All your programs were in here.
Your CPU is in here and that's pretty much
how, how far you've got to run.
But now, of course when you save your files,
you save them to here.
But now we are going to start looking at
data on the disk drive and so it's time
to escape to the secondary memory.
Okay?
Time to escape to the secondary memory.
And Raspberry Pi, you can go right there.
Okay?
So it's time to find some data to mess
with.
So a lot of what we've been doing so far
is just
kind of the pre-work to get to the point
where we can do this.
And in here we're going to have data files.
Now, we've been making data files.
You've been writing, every Python program
that you write on your computer gets saved
as a file. Then Python reads the file and runs it.
But now we're actually going to start
messing with some data.
And so, files are where we're going to be
working.
And so, one of things about secondary memory
is it's much larger.
And this is, main memory of the computer
is pretty large, it's just
not large enough to hold everything that
the computer is capable of holding.
So the files that we're going to work with.
Now we're not talking about image files or
Quicktime movies or things like that.
We're going to work with text files
because the
theme of this course is digging through
text.
Sometimes we'll pull it off the Internet.
Sometimes we'll read files, but it's
digging through and
using all the things that we've learned so
far,
looping and strings, and all those things,
to make sense of a sequence of
information.
Okay?
Now, to access file information, we have
to do this thing called opening the file.
We can't just say, yo, the information is
just omnipresent because there are
so much data that you can't have Python
sort of know all the data.
You literally have hundreds of thousands
of files on
your computer's hard drive.
And you,
which one are you going to read?
So there's a step that you have to do,
that you call this built-in function
called open.
And say, oh, this is the file that I
want to work with,
of the hundreds of thousands, and then
once you do,
you've kind of got this little
connector into it.
And the open is a built-in function inside
Python.
Hang on a sec, let's say good bye to that.
The open
function is a built-in function in Python,
and you, it takes two parameters.
The first parameter is the name of the
file, like mbox.txt,
and then the second is how you're going to
read it.
Are you going to read it?
are you going to write it? et cetera.
Now most of the time we'll be reading our
files.
So we call the open function and pass it
in the name of
the file we want to open, and then how we
want to read it.
Now you can leave this second parameter
off and it
assumes that you're going to want to read
the file.
Now.
When the open is successful, it doesn't
actually read all
of the data because the memory is small,
small compared to
the hard drive, and so you have to sort of
step through the data, you'll tell it when
to read it.
So the act of opening it is not
actually reading all data.
It is creating kind of like a connection
between the
memory and the data that's on the hard
drive, right?
It's connecting
between, oh listen to this.
Oh that's going to fall down.
Is it going to stand up that way?
Oh, I should come up with a way to
make that stand.
So it's a connection.
So the, your program's kind of running in
here.
And the, and the file handle is just sort
of it's
like a phone call between your memory and
your disk drive.
It's not the actual data.
The actual data is still
sitting on the disk drive, okay?
So, a graphical way to take a look at this
is, the file handle, the thing that comes
back from the open request.
The open goes and finds the file out on
the disk drive and
yada, yada, yada, and then the handle is
something that lives in the memory.
that is sort of like the thing that
maintains its connection to where all the
data is
on the disk or on the SD RAM that's in it.
So the handle is not all the data, but it is
a mechanism that you can use to get at the
data.
So if you print it out, it doesn't have
all the data from the file,
it says, I am a file handle that's opened
this file and we're in read mode.
So, that doesn't actually have the data,
even though this is the data that's
in the file.
And then we have operations that we do to
the handle like open it,
close it, read it, write it.
So we do things.
So, so the handle and then through the
handle it actually changes
what's on the disk or reads
what's on the disk.
So the handle is kind of a thing that's
not there.
If you attempt to open a file and the name
of the file.
Now the way we're going to do these is
these need to be
in the same folder on your computer as in,
as your Python code.
Now, there are trickier ways to do it, but
we're going to keep it simple.
This is the name of a file in the
same folder as the Python code that you're
running.
[SOUND] And if it's not, then we get, of
course, a traceback and we're
used to using, reading tracebacks by
now, no such file or directory stuff.txt.
Oh, of course, I forgot to save it or I
typed it wrong.
So.
The next thing we have to learn is the
notion of the newline character.
You haven't seen this so far,
but there's a special character in files
that is used to indicate the end of a line.
Because these text files that we've been
writing,
including Python programs that you have,
are organized into lines.
Each line has variable length and there is
a special non-printing character that you
just don't see.
Now you see it because you see a line,
multiple lines, but you don't see the
character itself.
So it turns out that this character is
very
important because the data is just a
stream of
characters on disk and then it's
punctuated by newlines
that tell it when it's time to end the
line.
So if we are building a string, the
constant for newline is backslash n.
And so, when we make a string that we
want to
have a newline in it, we'll say Hello
backslash n World.
And then if you print it out one way, you
actually see the backslash n.
But then if you use the print to print it
out, you see sort of
like the, it moves back down, you know,
to the left margin and down.
So, so, sometimes you see the slash n
and sometimes it's shown as movement.
Right? You, it moves it.
The other thing that's important is even
though we represent this as two
characters,
the backslash n is represented as two characters
in a string, it's actually one character.
So if we print it out, we see
X newline Y
and if we ask how many characters are
in stuff,
which is this string, it says 3.
That's important.
Okay?
There is one, two, three.
The newline is a single character.
This is a just a syntax that we use to
sort of encode a newline in a string.
Okay?
So, even though these are just a
long sequence of characters punctuated by
newlines,
visually, text editors and operating
systems show them, show
these files to us as a sequence of lines.
And it doesn't take very long to just
start thinking about them
as a sequence of lines.
As a matter of fact, maybe you never, wish
I'd never told you about newlines.
But when we start reading files, we're
going to have to deal with these newlines.
So the way that we sort of have to
mentally visualize of what these text
files look like is they have a newline
that punctuates the end of the line.
Now in reality, if we look at this, this
R really comes right after it.
Right?
This is all a bunch of characters and the
newlines are punctuation, okay?
To say this is first line, second line,
third line, and fourth line.
So, you gotta think that each of these
things
is here, sitting at the end of the line.
And so the number of characters in this
line include that newline.
Now the newline is one character.
Okay? So, how do we read these files?
Well, we've already talked about doing an
open xfile.
And I'm just, this xfile, again that's
just a mneumonic
name that I made up. This is a handle.
Remember, it's not all the data.
But the handle is the way that we can read
the data.
We can use it as a access point.
The coolest way to read a file, if it's a
text file in multiple
lines, is to use a determinant loop, a
for loop. for cheese in xfile.
So this, remember we would put a list of
numbers or a string here.
Now we've put a file
handle here.
Python knows automatically that each time
we are going to run this
loop, it's going to go to the next line of
the file.
Automatically, for, a cheese is just a
stupid name that I came up with it.
I would be better to call line rather than
cheese, but for cheese in and then it goes
dot, dot, dot, dot, dot, dot, dot,
each file
and then it stops when it reads
the whole file.
So this line will print out every line
in the file, that's how you do it.
These three lines open a file,
read every line in the file, okay?
So a file handle itself is a special kind
of a sequence, much like a list of numbers
or a string is a sequence of characters.
So one of the things we can do to combine
one of
our counting idioms is count the number of
lines in a file.
Okay? And so how we
would do that is we would open
the file, set a
counter to zero, this time I'll use a
mnemonic variable called count.
For line in fhand, that says run this
indented text once for each line in the
file.
For each line in the file, add count equals
count plus 1.
When the for loop is done, print the
count.
Pretty straightforward.
Very few other languages are capable of
writing that program in
as quick and as dense and succinct a way as
Python is.
Python does a really, really nice
job of this.
Okay? So that's how you count the lines.
Open it, write a for loop, and then add
one.
Now we, we can't just say, so what you
can't do, and this gives you a sense.
You can't say len,
fhand.
And that's because this isn't really the
data.
That's sort of, you have to like pull the,
pull it
and read it to get the data out of it.
Although we'll see another way of reading
it later.
Okay? So that's counting the lines in a
file.
It turns out you can also read the entire
file.
Now if you read the entire file, it's not
broken into lines.
You're getting all the characters
punctuated
by newlines and you get everything.
Now you don't want to read this if it's
too big, so it's
going to all try to read it into the
memory of the computer.
And if the memory is not big enough,
you're going to slow down to a crawl.
But if it's a real tiny file, this works
just fine.
And so, so we have sort of real, we open
a file and we say fhand.read, this is
basically saying, hey,
dear fhand, read it all and return it to
me as a string.
So that's a string with all the lines of
the file concatenated
together with newlines, which is actually
exactly what's in the file.
It's the raw data.
That for loop sort of looks for the newline
and does all of the stuff
automatically for us.
It's quite nice.
So then we can, like, because inp is a
string at this point,
we can just print the length of it.
And we can say, oh, there's 94,626
characters that came from that file.
It reads the whole thing, whole file,
reads the whole file.
We can also do things like, you know, slice
it now.
And so this is the first 20 characters,
up from zero up to, but not including, 20.
So this, this is our file. Okay?
So that's reading through the whole file.
So, let me go back a little bit, this is
the file that we're
going to play with.
This file here that we're going to play
with in this class is a mailbox file.
And this is actual real data.
And these are real people.
And these are real dates, having to do
with
an open source project that I worked on
called Sakai.
I actually have a tattoo of Sakai here on
my shoulder.
Maybe in an upcoming lecture, I'll have a
short-sleeved shirt, and show you my
tattoo.
But for now, I can't because I've got, got
clothes on.
So, but this is real data.
It's the mbox.txt, mbox.txt file.
So, so that's the file that we're going to
use for most of the next few assignments.
It'll be the same file. You'll get tired of it.
And you'll get to know all these people,
Stephen,
Chen Wen, and all the people in the file.
Okay, so.
We can search for lines that have a
prefix.
This is kind of the find pattern from the
looping lecture.
So we're going to go through a list of, of
lines in a file,
and we're going to only print out the ones
that match a certain thing.
So again, we open the file up.
We're going to write a for loop that's
going to say, for each line in the
file, if the line and then we can call a,
a utility function
inside of string, because line is a string.
If line startswith From, print it out.
So this means it's going to loop through
all of the lines in the
file and it's going to print the ones that
start with the string 'From:'
Okay?
Again, four lines, complete Python program
to read this
file and print the lines that have a
prefix of from.
So, if you run this program, and I suggest
that you do,
this is what the output's going to look like.
And it's like, wait a second, I'm seeing
the lines,
seeing the lines that have the froms, but
then I get these blank lines.
And why is that?
Why are these blank lines there?
If I look at the program, I mean, I'm not
printing blank lines.
I'm only printing lines that
start with from.
I'm not doing that, so why?
What do you think?
I'll give you a second.
I've certainly done enough foreshadowing
in this lecture.
Well it turns out these newlines are the
problem.
So it turns out that the print, we've been
doing this
all along, you just, we didn't make a fuss
about it.
The print adds a newline at the end of
everything that it prints.
So the yellow newlines are coming from
the print statement.
But when we read the file, each line ends
in a newline.
So these green newlines are actually from
the file.
They're the ones from the file.
So what's happening is we're seeing two
newlines, and so that turns into a
blank line.
So, how do we deal with that?
Well, we've got a string function that
conveniently solves that problem, okay?
And that is we're going to call rstrip.
If you recall, we had strip, lstrip, and
rstrip to strip
white space on one side, on the other
side, or on both sides.
So in this one,
we're going to use rstrip.
We're going to say, we're going to read
the line, that
this line is going to have a newline in it.
rstrip says pull white space, and the
newlines are also counted as white space.
Blanks or newlines are white space.
And then we're going to replace this with
no newline in it.
Then we're going to ask if it starts with
a from and then we're going to print it
out, and then we go and we're going to
see exactly what we're looking for
in this file.
And there's no newlines.
So the newline that's coming out here
is the one from the print, not the
one from the file, because the one from
the file got wiped out by that particular
line.
Okay?
So another general pattern of these
file-based loops
that we have done this, is a skipping
pattern.
Now, you can do, the, the non-skipping
pattern
is where you're saying, I'm going to look
for lines
that start with from and do something to
them.
Sometimes you'll want to do something to
all, to, to the to, to, you want to say,
here's a bunch of lines I'm going to
skip, and then I'm going to do something.
So the skipping pattern uses continue.
And so the first few lines here are the
same.
We open a file, we read each line
in the file,
but we're going to strip off the white
space.
You're going to get tired of typing these
three lines,
because you're going to do it a lot.
Open the file, start reading the file,
strip the whitespace for each line.
And you can make it so that you can look
for some fact.
In this case, I'm going to say, if not
line startswith From, this
means this is true for all the lines that
don't start with from,
continue. And if you remember, continue
goes up.
So the continue says I'm done, it
finishes
the iteration, and it doesn't do anything
down here.
Okay?
And so it, this is a, and then, we can do
something.
So, I've kind of flipped this, where I
said, these are the
things I'm interesting, interested in,
that's lines that start with from.
So, I'm going to skip the lines that
don't.
So I'm going to use continue.
Either way you can do it, depending on the
complexity or how much.
Often when you're, this is a good pattern
when
you have lots of lines of code down here
that you're going to do a lot of cool
stuff with.
You can also use things like in to select
lines.
Right?
So I'm going to, I'm going to look for
lines that have @uct.ac.za in them.
So again, I'm going to open it up.
I'm going to open these, go through each
line in the file.
I'm going to strip the white space out,
and [COUGH]
if not u-c-t,
if this string is not in line, then I'm
going to continue.
So it's a way for me to skip all of the
lines that don't have this string in it.
So these lines do, that one has it too,
and then we're going to print it out.
It will print out the ones that make it past
here, okay?
So, but in is another way to do searching,
right, starts with,
et cetera.
So one more thing that you might want to
try is, so we can count, right?
Now, and this is a pattern for prompting
for a file name.
And so, so here you, you'll get tired of
sort of
changing your code every time you want to
open a different file.
because you probably want to run the
program
with mbox once and mbox-short because,
just so you
can test it with different things of data.
So here's just another pattern.
We add this line to say raw_input, enter
the file name.
And there you go, we'll type in the file
name.
And then the thing that we open is
whatever we entered as the file name.
And then the rest of it is pretty much
yada yada.
So here I'm, it's reading the whole file.
If the line starts with subject, count
equals count plus one.
And then there were 1797 subject
lines in mbox.txt.
There were 27 subject lines in
mbox-short.txt, okay?
So that's prompting for the file names.
Now, open.
The open statement fails if the file name
doesn't exist.
So, you might want to add a try and
accept around that if you want to, if
you're just writing
code for yourself and you assume that
everything's okay,
then you don't have to write try accept
but if
you want to catch it [SOUND]
and catch a bad file name,
then you take the open which, and turn it
into these four lines.
So this is the code that we think might
blow up,
and it's going to blow up, we know it's
going to blow up.
If they enter a bad file name like
na na boo boo, right, this is is going to
blow up.
So what do we do?
We use try and accept.
We put try
around that.
We're going to take out some insurance on
that particular line.
And then, if it fails, we're going to
print
this message and then say exit, to get
out.
So if you get a good file,
if you get a good file, it works, skips the
except, then runs the thing, prints out
the count.
That's what's happening here. If, on the
other hand, you get a bad file,
it comes here, open blows up, runs the
except, prints this out, and then quits.
So that's how this one works with a bad
file.
And now, no traceback, right?
So we are
It's kind of a short lecture.
We're done with Chapter Seven.
We open a file.
We read the file.
We take out white space at the end with
rstrip.
We had used string functions.
So, this is kind of putting it all
together.
And it's kind of short little programs
now.
So, it's not.
And you know, starting now,
we are going to start putting these things
together and start actually doing work.
Because now, we have, from the first few
chapters,
we have basic capabilities of Python.
Now we have some data to work with.
Now going forward
we are going to do increasingly
sophisticated things with that data.
So I can't wait to see in the next
lecture.