Hello again, and welcome to Chapter Nine of Python, Dictionaries. As always, this lecture is copyright Creative Commons Attribution. That means the audio, the video, the slides, and even my scribbles. You can use them any way you like, as long as you attribute them. Okay, so this is the second chapter where we're talking about collections, and the collections are like a piece of luggage in that you can put multiple things in them. Variables that we've talked about sort of starting in Chapter Two and Chapter Three were simple variables, scalar. They're just kind of one thing, and as soon as you, like, put another thing in there, it overwrites the first thing. And so if you look at the code, you know, x = 2 and x = 4, the question is, you know, where did the 2 go? Right? The 2 was there, x was there, there was a 2 in there, and then we cross it out and put 4 in there. This is sort of the basic operation, the assignment statement, it's a replacement. But a dictionary allows us to put more than one thing. Not using this syntax, but it allows us to have a variable that's really an aggregate of many values. And the difference between a list and a dictionary is how the values are structured within that single variable. The list is a linear collection, indexed by integers 0, 1, 2, 3. If there's five of them, it's 0 through 4, very much like a Pringle's can here, where they're just stacked nicely on top of each other. Everything's kind of organized. We talked about it in the last, in the last lecture. This lecture we're talking about dictionaries. A dictionary's very powerful. It's, and its power comes from a different way of organizing itself internally. It's a bag of values, like a just sort of, just stuff's in it, it's not in any order. Big stuff, little stuff. Things have labels. You can also think of it like a purse with just things in it that's like, it's not like stacked. 
It's just, stuff moves around as you're going and that's, that's a very good model for dictionaries. And so dictionaries have to have a label because the stuff is not in order. There's no such thing as the third thing. There is the thing with the label perfume. There's the thing with the label candy. There's the thing with the label money. And so there's the value, the thing, the money. And then there's always also the label. We also call these key/value. The key is the label and the value is whatever. And so these pink things are all labels for various things you could put in your purse. So you could say to your purse, "hey purse, give me my tissues." "Hey purse, give me my money." And it, it's in there somewhere and the purse sort of gives you back the tissues or the money. And it's, Python's most powerful data collection is the dictionaries. And it's when you get used to wielding them you'll say, like, whoa, I can do so much with these things. And at the beginning you just sort of learning sort of how to use them without hurting yourself. But they're very powerful. It's like a database. It's, it allows you to store very arbitrary data organized in however you feel like organizing it, in a way that advances the cause of the program that you're writing. And we're still kind of at the very beginning, but as you learn more, these will become a very powerful tool for you. They, dictionaries have different names in different languages. PERL or PHP would call them associative arrays. Java would call them a PropertyMap or a HashMap. And C# might call them a property bag or an attribute bag. And so they're, they're just the same concept. It's keys and values is the concept that's across all these languages. Just are very powerful. And if you look at the Wikipedia entry that I have here you can see that it's just, it's a concept that we give different names in different languages. Same concept, different names. 
So like I said, the difference between a list and a dictionary, they both can store multiple values. The question is how we label them, how we store them, and how we retrieve them. So here's an example use of a dictionary. I'm going to make a thing called purse. And I'm going to store in purse, this is an assignment statement, purse sub money. So this isn't like sub zero. This is sub money. So I'm actually using a string as the place. And, so I'm going to say stick 12 in my purse and stick a Post-it note that says that's my money. Candy is 3. Tissues is 75. And if I look at that, it's not just the numbers 12, 3, and 75 as it would be in a list. It is the connection between money and 12, tissues is 75, candy is 3. And in the key/value, that's the key and that's the value. So candy is the key and 3 is the value. Now I can look things up by their name, print purse sub candy. Well it goes and finds it, asking hey purse, give me back candy, and it goes and finds the value, which is 3, and so out comes a 3. We can also put it on the right-hand side of an assignment statement, so purse sub candy says give me the old version of candy, and then add 2 to it, which gives me 5, and then store it back in that purse under the label candy. So we see candy changing to 5. And so, this is a place, and you could do this with a list except these would be numbers. You could say purse sub two is equal to purse sub two plus two, or whatever. But in dictionaries, there are labels. Now, they're not strings. Strings is a very common label in dictionaries, but it's not always strings, you can use other things. In this chapter we'll pretty much focus on strings. You can even use numbers and then you would get a little confused. But you can. So here's sort of a picture of how this works. So, if we take a look at this line purse sub money equals 12, it's like we were putting a key/value connection, money is the label for 12. And then we sort of move that in. 
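Written out as code, the purse example might look roughly like this. It is a sketch in Python 3 syntax; the name purse and the keys money, candy, and tissues come from the lecture, and the comments are added for illustration.

```python
# Start with an empty dictionary and add key/value pairs one at a time.
purse = dict()
purse['money'] = 12
purse['candy'] = 3
purse['tissues'] = 75
print(purse)

# Look a value up by its key (its label).
print(purse['candy'])

# Read the old value, add 2, and store the result back under the same key.
purse['candy'] = purse['candy'] + 2
print(purse)
```

The same index expression, purse['candy'], works on both sides of the assignment: on the right it retrieves the old value, on the left it says where the new value goes.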
And it's up to the purse to decide where things live. If we look at the next line, we're going to put the value in with a 3 in with the label candy, and we're going to put the value 75 in with the label of tissues. And when we say hey purse, print yourself out, it just goes and pulls these things back out and hands them to us. And what it's really, it's giving us both the label and the value and it's necessary to do that cause they're just like 12, 75, and 3. What exactly is that? And so this syntax with the curly braces is what happens when you print a dictionary out. The same thing happens when we're sort of printing purse sub candy, right? Purse sub candy, it's like dear purse, go and find the candy thing. Look at that one, look at that one. Oh, yep, yep, this is candy. But what we're looking for is the value, and so that's why 3 is coming out here. So go look up under candy, and tell me what's stored under candy. These can be actually more complex things, I'm just keeping it simple for this lecture. And then, when we say purse sub candy equals purse sub candy plus 2, well it pulls the 3 out, looking at the label candy, then adds 3 plus 2 and makes 5, and then it assigns it back in, and then that says, oh, go, go place this number 5 in the purse with the label of candy, which then replaces the 3 with a 5. Okay? And if we print it out, we see that the new variable, or the new candy entry, is now 5. Okay? So if we just sort of put these things side by side, we create them sort of both the same way and we make an empty list, and an empty dictionary, we call the append method because we're sort of just putting these things in order. You gotta put the first one in first. So it's not telling you where. You kind of know that this will be the first one, cause we're starting with an empty one, and this will be the second one. 
We put in the values 21 and 183, and then we print it out, and it's like okay, you gave me the values 21 and 183, I will maintain the order for you, there's no keys other than their position. The position is the key, as it were, so if I want to change the first one to 23, well, I say list sub zero, which is this, and then change that to 23. So this is sort of used as a lookup to find something. It can be used on either the right-hand side or the left-hand side of an assignment statement. Comparing that to dictionaries, I want to put a 21 in there and I want to put it with the label age. I'm going to put 182 in with the label course. So we don't have to, like, make an entry first. If the entry doesn't exist, the assignment creates the age entry and sticks 21 into it, then creates the course entry and sticks 182 into it. We print it out and it says, oh, course is 182 and age is 21. This emphasizes that order is not preserved in dictionaries. I won't go into great detail as to why that is. It turns out that that's a compromise that makes them fast using a technique called hashing. It's how it actually works internally; go look up hashing on Wikipedia and take a look. But the thing that matters to us as programmers primarily is that lists maintain order and dictionaries do not maintain order. Dictionaries give us power that we don't have in lists. I mean, they're very complementary. It's not that one is better than the other. They're very complementary. Different kinds of data are better represented either as a list or as a dictionary, depending on the problem you're trying to solve. And in a moment we'll be writing programs that use both. So if we come down here and I say, okay, stick 23 into, assignment statement, into ddd sub age, well, that will change this 21 to 23, and we see that when we print it out. So this part, where you look something up and change the value, you can do either way. It's just how you do it here is a little bit different, okay?
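Here is the side-by-side comparison as a small sketch. The name ddd and the values come from the lecture; the name lst for the list is an assumption made here so that the built-in name list is not overwritten.

```python
# List: values are stored in order and looked up by position.
lst = list()
lst.append(21)
lst.append(183)
print(lst)
lst[0] = 23          # position 0 acts as the "key"
print(lst)

# Dictionary: values are stored and looked up by a label we choose.
ddd = dict()
ddd['age'] = 21      # assignment creates the entry if it does not exist
ddd['course'] = 182
print(ddd)
ddd['age'] = 23      # assignment replaces the value if the entry exists
print(ddd)
```

One caveat: in Python 3.7 and later a dictionary remembers the order in which its keys were inserted, so on a current interpreter the prints show age before course. In the older versions of Python the lecture describes, the printed order could differ, so it is safest to treat the label, not the position, as the way to find a value.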
So let's look through this code again. And so I like, I like to use the word key and value. Key is the way we look the thing up, and in lists keys are numbers starting at zero and with no gaps. In dictionaries keys are whatever we want them to be, in this case I'm using strings. And then the value is the number we're storing in it. So we create this kind of a list with that kind, those kinds of statements. This statement creates this kind of a thing. Now, if we, if we think of this assignment statement as moving data into a new, into a place, a new item of data into a place. It's looking at this thing right here. Right? It's like, that's where I want to move it. And so it hunts, and says, look the key up. And that's the one that I'm going to change. And then once it knows which it's going to change, then it's going to take the 23, and it's going to put the 23 into that location. And so that's how this changes from that to that. Similarly when we get down to here, we're going to stick 23 somewhere and this is, this expression, this lookup expression, the index expression ddd sub age, is where we're going to put it. So, we're looking here, where is that thing? Well, that thing is this entry in the dictionary. And so now when we're going to store the 23, we know where the 23 is going to go. It's going to overwrite the 21 and so the 21 is going to change to 23, okay? So they're kind of similar. There are things that work similar in them and then there are things that work differently in them. We can make literals, constants, with curly braces. And they look just like the print. That's one nice thing about Python. When you print something out it's showing you how you can make a literal, and basically you just open with a curly brace and say chuck colon 1, fred 42, jan 100. And we're making connections. key/value pair, key/value pair. We print it out and No order. They don't maintain order. Now they might come out in the same order, but that's just lucky. Right? 
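A dictionary literal can be sketched like this. The keys and values are the ones read out in the lecture; the variable names jjj and same_pairs are assumptions for illustration.

```python
# A dictionary literal: curly braces around key: value pairs.
jjj = {'chuck': 1, 'fred': 42, 'jan': 100}
print(jjj)

# The same pairs written in a different order.
same_pairs = {'jan': 100, 'chuck': 1, 'fred': 42}

# Two dictionaries are equal when they hold the same key/value pairs,
# whatever order the pairs were written in.
print(jjj == same_pairs)
```

The equality check is one way to see that what a dictionary really holds is the connection between each key and its value, not a position.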
All the ones I've shown you so far don't come out in the same order, which is good to demonstrate it. If one time it came out in the same order, that wouldn't be broken. It's not like it doesn't want to come out in the same order. It's just that the order is not what's stored internally, and when you add an element it may reorder them. You can make an empty dictionary with just a curly brace, curly brace. So, I'm going to give you another example. I'm going to show you a series of names. And I want you to figure out what the most common name is and how many times each name appears. Now these are real people. They actually work on the Sakai project. Steven, Zhen, and Chen, and me. So these are people that are actually in the data that we use in this course. Okay? And so I think I'll show you about fifteen names, and I'm going to show them to you one at a time, so you need to come up with a way to keep track of these. Okay? So with no further ado I will show you the names. Okay, so that's all the names. Did you get it? You might have to go back and do it again. How did you solve the problem? What kind of a data structure did you build to solve the problem? Or did you just say wow, that's painful, I think I will learn Python instead of solving that problem? Okay? So pause the video if you want, or go back, and write down what you think the most common name is and how many times it appears. Okay. Now I'll show you. So here is the whole list. It's all of them. And now that we see all of them, we use our amazing human mind and we scan around, and look at purpleness and all that stuff. And then we go like, oh, this is a so much easier problem when I'm looking at the whole thing. And I think that the most common person is Zhen, and I think we see Zhen five times. And I think csev is one, two, three and Chen Wen is one, two. And Steve Marquard is one, two, three.
So the question is, what is an effective data structure if you're going to see a million of these? What kind of data structure would you have to produce? Because you can't keep it in your head. Even this number of people, even this amount of data, there's no way you can keep it in your head. You have to come up with some kind of a variable, as it were, just like largest so far was the variable. Some kind of variable that gets you to the point where you understand what's going on. And so this is the most common technique to solve this problem, where you keep a running total of each of the names. And if you see a new name, you add them to the list. So csev and then you give him a one, and then you see Zhen and you give her a one, and then you see Chen and you give her a one. And then you see csev again and you give him a two. And you see a two, and a two, and a one, right? And so then when you're all done you have the mapping, right, of these things and you go oh, okay, let me look through here and find the largest one. That's the largest one and so that must be the person who appears the most. So you need a scratch area, a data structure or a piece of paper as it were, and that's exactly what dictionaries are really good at. You could think of this as like a histogram. You know, it's a bunch of counters, but counters that are indexed by a string. So we use a lot of this. And so this is a pattern of many counters with a dictionary, simultaneous counters. We're looking at a series of things, and we're going to simultaneously keep track of a large number of counters, rather than just one counter. How many names did you see total? Whatever, 12. But how many of each name did you see is a bunch of counters, so it's a bunch of simultaneous counters. So a dictionary is great for this.
When we see somebody for the first time, we can add an entry to the dictionary, which is kind of like going oh, csev one, and then Chen Wen one. Now these don't exist yet. Right? So we've got csev one and Chen Wen one, so that creates an entry and sticks a one in it, and we have the mapping between the key csev and the value one, and the key Chen Wen and the value one, and then we say, hey, what's in there? Oh, we've got csev is one and Chen Wen is one. And then we see Chen Wen a second time, so we'd add another number right there. So this old number is one, we add one to it and we get two, and then we stick that back in and then we do the calculations. We do a dump and say oh, there's two in Chen Wen and one in csev. Okay? So this is a great data structure for simultaneous counters, like what's the most common word, who had the most commits, da, da, da, da, da. Now, with everything we do, we have to figure out when you're going to get in trouble with Python. When Python's going to give you the old thumbs down and say oh, you went too far. So one thing Python does not like is if you reference a key before it exists. We'll talk in a second about how to work around this. But if you simply create a dictionary and say, oh, print out what's under csev, it gives you a traceback. It's like, I'm going to inform you that that's not there. And it says KeyError, csev. Now, the thing that allows us to solve this is the in operator. We've used the in operator to see if a substring was in a string, or if a number was in a list. So this in operator asks a question: is the string csev a current key in the dictionary ccc? And it says False. So now we have something that doesn't give a traceback that can tell us whether or not the key is there. So if you remember the algorithm, the first time you see it, you set them to one, and every other time, you add one to them.
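A small sketch of both behaviors follows. The dictionary name ccc and the key csev come from the lecture; the try/except, and the names found and caught, are assumptions added only so the example runs to the end instead of stopping at the traceback.

```python
ccc = dict()

# The in operator asks whether a key exists, and never raises an error.
found = 'csev' in ccc
print(found)

# Looking up a key that does not exist raises KeyError.
caught = False
try:
    print(ccc['csev'])
except KeyError as err:
    caught = True
    print('KeyError:', err)
```

Without the try/except, the lookup on the empty dictionary would end the program with a traceback, which is why the check with in comes first in the counting pattern.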
So this is how we do that in Python. Here's how we implement that program that I just gave you. So, here's our names. It's shorter so my slide works better. Here's a variable, our iteration variable, it's going to, you know, go through all five of these one at a time. And within the body of the loop we have an if statement. If the name is not currently in the counts dictionary, counts is the name of my dictionary, I say counts sub name equals one. Else, that must mean it's already there, which means it's okay to retrieve it: counts sub name equals counts sub name plus 1. We're going to add a 1 to it and stick it back in, okay? And so when this finishes it's going to add entries and then add one to entries that already exist. And not traceback at all. And when we print it out we're going to see the counts. And literally this could have gone a million times and it would just be fine and it would just keep expanding. Okay? So this pattern of checking to see if a key is in a dictionary, setting it to some number, or adding one to it is a really, really common pattern. It's so common, as a matter of fact, that there is a special thing built into dictionaries that does this for us, okay? There is this method called get. And so, counts is the name of the dictionary, get is a built-in capability of dictionaries. And it takes two parameters. The first parameter is a key name, like a string, like csev or chen wen or marquard. And then the second parameter is a value to give back if this doesn't exist. It's a default value if the key does not exist. And there's no traceback. So this way you can encapsulate, in effect, an if-then-else. If the name parameter is in the counts, print the thing out, otherwise print zero. So this expression will either get the number if it exists or it will give me back a zero if it doesn't exist. So this is really valuable. Right? This is really valuable. That's a really bad smiley face.
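The counting loop and the get method described above can be sketched as follows. The dictionary name counts and the iteration variable name come from the lecture; the five entries in names are illustrative values assumed here, since the transcript does not spell out what is on the slide.

```python
counts = dict()
names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']  # assumed sample data

for name in names:
    if name not in counts:
        counts[name] = 1                  # first time: create the entry
    else:
        counts[name] = counts[name] + 1   # later times: add one to it
print(counts)

# get() returns the stored value when the key exists,
# and the default (second parameter) when it does not.
print(counts.get('csev', 0))
print(counts.get('marquard', 0))
```

Note that get only reads: asking for a missing key gives back the default but does not add that key to the dictionary.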
So this is really valuable because it, once, once we understand the idiom, it really takes four lines of code and turns it into one line of code. Because we're going to be doing this if-then-else all the time. Now, and so we can reconstruct that loop a lot easier and a lot more cleanly using this idiom, right? It's something that looks kind of complex but you'll get used to it really fast, okay? So we have, everything here is the same, we create an empty dictionary, we have five names to go through, we're going to write a for loop and it's going to go through each of those. And then we're going to say counts sub name equals counts dot get the value stored at name, and if you don't find it, give me back a zero. And then whatever comes back, either the old value or the zero, add 1 to that and then take that sum and stick it in counts name. Okay? So this is either going to create, or it's going to update. If there is no entry, it's going to create it and set it to one. If there is an entry it's going to add one to the current entry. Okay? So this is, this line is kind of an idiom. Read about it in the book, figure it out, get used to the notion of what this is doing. Understand what that is doing, okay? Because I'm going to start using it as if you understand it. So, the next problem is a problem of finding the most common word. So, finding the most common, the top five, is often a, a trigger that says, use dictionaries because if you're going to have to count things up, you're going to, you know, you don't know what the most common thing is at the beginning. First you have to count everything up, and dictionaries are a great way to count. So here's a little problem and I would like you to read this text and find me the most common word in the text. And tell me what the most common word is and how many times it occurs. Ready? I'm going to give you a thousandth of a second, just like I would give a computer. 
I would expect it'd be able to do this in a thousandth of a second. [SOUND] There you go. [BLANK_AUDIO] Okay, I gave you five seconds. Time's up. Did you get it? Or did you say to yourself, you know what, I hate that, it's no good, I think I'll write a Python program instead. And he'll probably show me a Python program if I wait long enough. So here's a slightly easier problem from the first lecture. Ready? It's the same problem. Find the most common word and how many times the word occurs. [BLANK AUDIO] [MUSIC] Did you get it? I believe the answer is, and I could look really dumb here, oops, the answer is the, and I think it's seven times. So, that's the right answer. Okay? Again, things humans are not so good at. So, here's a piece of code that's starting to combine some of the things we've been doing in the past few chapters all together. We are going to read a line of text, split it into words, count the occurrence, how many times each word occurs, and then print out a map. So, so here's what we're going to do, we're going to say okay, start a dictionary, an empty dictionary, read the line of input. Then split it, remember, the split takes a string and produces a list. So words is a list, line is a string, and then we'll print that out. Then we're going to write a for loop that's going to go through each of the words, and then create, use this idiom counts sub word equals counts.get word, 0 + 1. So this is going to do exactly what we talked about in the previous couple slides back, either create the entries or add to those entries, okay? And then we're going to print them out. So here's what that program does when it prints out. Now this is actually one long line I'm just cutting it so you can see it. Here's this line we enter, and the words the, there's seven of them. Then it takes this line and splits it into a list, and there is the beginning and end of the list. 
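The program just described might look roughly like this. In the lecture the line is typed in at the keyboard; here it is hard-coded so the sketch runs unattended, and the sentence itself is an assumed stand-in for the text on the slide.

```python
counts = dict()

# Assumed sample input; the lecture reads this line from the user.
line = ('the clown ran after the car and the car ran into the tent '
        'and the tent fell down on the clown and the car')

# split() turns one string into a list of words, dropping the blanks.
words = line.split()
print('Words:', words)

# Create the entry the first time a word is seen, add one every other time.
for word in words:
    counts[word] = counts.get(word, 0) + 1
print('Counts:', counts)
```

With this sample line the entry for the ends up at 7, which matches the answer given in the lecture.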
The list maintains the order, so the list simply breaks all these words into separate words in a list of strings. From one string to many strings. This is many strings. And so the, and the spaces are gone. And so now here's this list. And then what we're going to do is we're going to run through the list. And we're going to keep running totals of each of the words in the list. And then when we're done with the list, we're going to print out the contents of that dictionary. And we can inspect it and go like, let's look for the biggest one, na, na, na, na, na. It's kind of like looking for the largest, like, oh, seven. That's the largest and the largest word is the. Okay? So that's how the program runs, it reads a line, splits it into a list of words, and then accumulates a running total for each word, and then we hand inspect to see what the most common word is. Okay? Oh no, no, I don't want that song again. There we go. And so and so here we have the, in it's kind of a smaller fashion. We make a dictionary. This entering a line of text is here. It's all one line. We do the split and then we print the words out. And so that split creates a list of strings from a single string based on where the blanks are at, chop, chop, chop, chop. And then here at counting, we're going to loop through each of the words one at a time and use this idiom, counts sub word equals counts.get word, 0 + 1, which is going to create and/or update. And then we print the counts out and that comes out there. Okay? So, again, this is the new thing that we've done. Everything else we've kind of seen before. Now we can also loop through dictionaries with for loops. The for loop, we've been, put all kinds of things over here. We've put strings over here, we've put lists of numbers over here. We've put files over here. 
And basically what it really says is, you know, if this is a collection of things, run this little indented code once for each item in the collection, and key then becomes our iteration variable. And key is very mnemonic here. Python doesn't know that they are keys. The important concept here is that dictionaries are key/value pairs and this is only one variable, and so they've decided that it goes through the keys, which is actually quite useful. So key is going to take on the successive values of the labels. Not the successive values of the values stored at the labels. But it's really easy for us to retrieve the contents at that label with counts sub key. So we're going to use the keys 'chuck', 'fred', 'jan' to look up the 1, 42, 100. And so it prints out the key, and then the value at it, the key, and the value at it, and the key, and the value. And so we're able to sort of go through the dictionary and look at all the key/value pairs, which is the common thing that you really want to do. Okay? Now there's some methods inside of dictionaries that allow us to convert dictionaries into lists of things. So here's a little dictionary with three items in it, and we can say list, and then in parentheses give a dictionary name right there, and that converts it into a list. But it's just a list of the keys. We can also say jjj dot keys, which kind of does the same thing. It says give me a list consisting of the keys. And then jjj dot values gives you a list of the values, 1, 42, and 100. Of course they're not in the same order. Now interestingly, as long as you don't modify the dictionary in between, the order of these two things corresponds. So the first, jan, maps to 100, chuck maps to 1, and fred maps to 42.
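Here is a sketch of the loop over keys and the conversions to lists. The names counts, key, and jjj and the three entries come from the lecture. One assumption to flag: in Python 3, keys() and values() return view objects, not lists, so the sketch wraps them in list() to get the lists the lecture shows.

```python
counts = {'chuck': 1, 'fred': 42, 'jan': 100}

# A for loop over a dictionary goes through its keys.
for key in counts:
    print(key, counts[key])

jjj = {'chuck': 1, 'fred': 42, 'jan': 100}
key_list = list(jjj)              # a list of just the keys
print(key_list)
print(list(jjj.keys()))           # the same keys
value_list = list(jjj.values())   # the values, in the matching order
print(value_list)
```

As long as the dictionary is not modified in between, the keys and the values come out in corresponding order, so the first key goes with the first value and so on.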
So the order, you can't predict the order they're going to come out, but these two things will come out in the same order, whatever that order happens to be. Okay, and so there's one more thing. So we've got the keys, we've got the values, and we've got a thing called items. items also returns a list. But it's a list of what Python calls tuples. That's what the next chapter is about. We'll talk more about tuples in the next chapter. Here each tuple is a key/value pair. So this list has three things in it. One, two, three. The first one, jan maps to 100, the second is chuck maps to 1, the third one is fred maps to 42. So, just kind of bear with me for a second. We'll hit this a little harder in the next chapter. But the idiom where this works very beautifully is on a for loop. Now, for those of you who have programmed in other languages, this will be kind of weird because other languages have iteration but they don't have two iteration variables. Python has two iteration variables. It can be used for many things, but one of the things that it's used for that's really quite nice is going through a dictionary. This jjj dot items returns pairs of things, and then aaa and bbb are iteration variables that are synchronized as they move through. So aaa takes on the value of the key. bbb takes on the value of the value. And then the loop runs once. Then aaa is advanced to the next key. And bbb is advanced to the next value simultaneously, synchronized. Then they print that out, then it advances to the next one, and the next one, and they print that out. So they are moving in a synchronized way. Now again, the order jan, chuck, fred is not the same. But the correspondence between jan and 100, chuck and 1, and fred and 42, that's going to work. And so basically, as these things go, they work their way through whatever order they're stored in the dictionary. So this is quite nice.
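The two-iteration-variable loop can be sketched like this. The names jjj, aaa, and bbb come from the lecture; the list pairs is an assumption added here to show the tuples. As with keys() and values(), items() returns a view in Python 3, so it is wrapped in list() to display it.

```python
jjj = {'chuck': 1, 'fred': 42, 'jan': 100}

# items() yields (key, value) tuples.
pairs = list(jjj.items())
print(pairs)

# Two iteration variables: aaa gets the key, bbb gets the matching value.
for aaa, bbb in jjj.items():
    print(aaa, bbb)
```

On each pass through the loop, aaa and bbb are taken from the same tuple, which is why the key and the value always stay matched.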
Two iteration variables going through key/value. Now if I was making these names mnemonic, and they made more sense, I would call this the key variable and that would be the value variable. But for now I'm just using kind of silly names to show you that key and value are not special. They're not Python reserved words in any way. They're just a good way to name these things, key/value pairs. Okay? Okay. Now we're going to circle all the way back to the beginning. All the way back to the first lecture. And I gave you this program, and I said don't worry about it. We'll learn about it later. Well, now later. At this point you should be able to understand every line of this program. This is the program that's going to count the most common word in a file. Okay? So let's walk through what it does and hopefully by now this will make a lot of sense. Okay? So we're going to start out, we're going to ask for a file name, we're going to open that file for read. Then, because we know it's not a very large file, we're going to read it all in one go. So handle dot read says read the whole file, newlines and all, blanks, newlines, whatever, and put it in the variable called text, it's just mnemonic. Remember I'm, in this one I'm using the mnemonic variable names. Then go through that whole string, which was the whole file, go through and split it all. Newlines don't hurt it. Newlines are treated like blanks. And it understands all that. It throws the newlines away and the blanks away and splits it into a beautiful list of just words with no blanks. And the list of the words in that file ends up in the variable words. words is a list, text is a string, words is a list. Then what I do is the pattern of accumulating counters in a dictionary. I make an empty dictionary. 
I have the word variable that goes through all the words and then I just say, counts sub word equals counts dot get(word,0) + 1, and that, like we just got done saying, either creates or increments the value in the dictionary as needed. So at this point in the program, we have a full dictionary of word/count pairs. Okay? And there's many of them. You know, all the words, all the counts. They're not in any particular order. So now what we're going to do is we're going to write a largest loop, find the largest. Which is another thing that we've done. So not only do I need to know what largest count I've seen so far, I need to know what word that is. So I'm going to set the largest count we've seen so far to None, set the largest word we've seen so far to None, and then I'm going to use this two-iteration-variable pattern to say go through the key/value pairs word and count in counts.items. So it's just going to go through all of them. And then I'm going to ask if the largest number I've seen so far is None or the current count that I'm looking at is greater than the largest I've seen so far, and if so, keep them. Take the current word, stick it in biggest word so far, take the current count, stick it in the biggest count so far. So this is going to run through all of the word/count key/value pairs. And then when it comes out, it's going to print out the word that's the most common and how many times. So if we feed in that clown text, it will run all this stuff, and print out oh, the is the most common word, and it appeared seven times. Or if I feed in the stuff that was two slides back, words.txt, from the actual textbook, then it says the word to is the most common and it happened 16 times. So I could easily, you know, throw 10 million words through this thing, and it would just be totally happy. Right?
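Putting the pieces together, the whole program might be sketched as below. Several things here are assumptions: the lecture prompts for a file name and reads the file, whereas this version takes the text as a string so it runs without a file on disk; the function name most_common_word and the variable names bigword and bigcount are chosen for illustration; and the sample sentence stands in for the clown text.

```python
def most_common_word(text):
    """Return (word, count) for the most frequent word, or (None, None)."""
    # Newlines are treated like blanks, so one split handles a whole file.
    words = text.split()

    # Simultaneous counters: one entry per distinct word.
    counts = dict()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

    # Largest-so-far loop over the key/value pairs.
    bigcount = None
    bigword = None
    for word, count in counts.items():
        if bigcount is None or count > bigcount:
            bigword = word
            bigcount = count
    return bigword, bigcount


# In the lecture the text comes from a file, roughly:
#     handle = open(name)
#     text = handle.read()
# A hard-coded sample is assumed here instead.
text = ('the clown ran after the car and the car ran into the tent\n'
        'and the tent fell down on the clown and the car')
bigword, bigcount = most_common_word(text)
print(bigword, bigcount)
```

Because the comparison uses greater than and not greater than or equal, a tie keeps whichever word was seen first in the loop.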
And so, this is not that complex of a problem, but it's using a whole bunch of idioms that we've been using. The splitting of words, the accumulation of multiple counters in a dictionary. And so, it sort of is the beginning of doing some kind of data analysis that's hard for humans to do, and error-prone for humans to do. And so this is, we're reviewing collections. We've introduced dictionaries. We've done the most common word pattern, talked about that. The lack of order, and I did that a bunch of times. And we've looked ahead at tuples, which is the next, the third kind of collection that we're going to talk about. And they're actually in some ways a little simpler than dictionaries. And simpler than lists. So, see you in the next lecture, Chapter 10, tuples.