1 00:00:00,000 --> 00:00:04,847 In this section, we're gonna look at a new form of data, called a table. And once we 2 00:00:04,847 --> 00:00:09,410 look at how tables work, then we're gonna play around with code that manipulates 3 00:00:09,410 --> 00:00:13,972 tables. So it's very similar to the way earlier we did images and then looked at 4 00:00:13,972 --> 00:00:18,934 the code that manipulates images. The code to work with tables will actually in some 5 00:00:18,934 --> 00:00:23,667 ways look, similar to the code that worked on images. So my goal is that the real 6 00:00:23,667 --> 00:00:28,486 patterns that make any sorta code work are gonna start coming through. So, tables are 7 00:00:28,486 --> 00:00:33,268 a really common way to organize data on the computer. So as a running example for 8 00:00:33,268 --> 00:00:37,815 this section, I'm gonna use the social security baby names database. So the 9 00:00:37,815 --> 00:00:42,302 social security administration does retirement benefits and stuff in the US. 10 00:00:42,479 --> 00:00:46,998 But they also happen to track, every year. What names are given to babies born in 11 00:00:46,998 --> 00:00:51,249 that year in the US? And so that's gonna be kinda fun data set that we're gonna 12 00:00:51,249 --> 00:00:55,608 use, So here I've, I've structured this as an example of a table. So, as I was 13 00:00:55,608 --> 00:00:59,806 saying, table's a way of storing data. It's basically, you can think of it as 14 00:00:59,806 --> 00:01:04,218 like a rectangle. So the way the table works is that it is first organized into 15 00:01:04,218 --> 00:01:08,685 fields. So the baby data is organized into four fields and the fields are name, rank, 16 00:01:08,685 --> 00:01:13,205 gender and year, Look at the other fields as basically as the columns that make this 17 00:01:13,205 --> 00:01:18,767 thing up, And then the data is stored in what we'll call rows. So here's the first 18 00:01:18,767 --> 00:01:24,479 row has the data for the name Jacob, so it says the name is Jacob, the rank is one 19 00:01:24,479 --> 00:01:30,402 for that name and what rank one for this data set is that Jacob is the most popular 20 00:01:30,402 --> 00:01:36,076 boy name for babies born in 2010. Then we have gender boys in years 2010. So the 21 00:01:36,076 --> 00:01:40,966 second row has another name. So each name has its own row. So in this case it says 22 00:01:40,966 --> 00:01:46,058 the name is Isabella, the rank is one. So what that means is Isabella was The most 23 00:01:46,058 --> 00:01:51,760 popular girl name for babies born in 2010. So, then we see, Ethan has rank two for 24 00:01:51,760 --> 00:01:57,394 boy names. Sophia has rank two for girls, and so on. So it, the, the table just has 25 00:01:57,394 --> 00:02:03,715 all the names. In this case there, they're shown, sorted by rank. So there's o ne row 26 00:02:03,715 --> 00:02:09,143 per name. In this case it has the 1,000 top boy names and the 1,000 top girl 27 00:02:09,143 --> 00:02:14,516 names. So, it's, there's 2,000 rows overall. So as I was saying, tables are 28 00:02:14,516 --> 00:02:19,277 really common for storing all sorts of data on the computer. You may have heard 29 00:02:19,277 --> 00:02:23,865 the term database. So, a database is a related concept to this, sort of simple, 30 00:02:23,865 --> 00:02:28,568 basic idea of a table. Generally the way this works is that the fields are, are, or 31 00:02:28,568 --> 00:02:33,040 you can think of them as the categories, the number of fields is not very big. 32 00:02:34,400 --> 00:02:39,052 Fields, and there might be eight or ten or something. So they represent kinda the 33 00:02:39,052 --> 00:02:43,368 fixed categories we wanna keep track of. And then the number of rows could be 34 00:02:43,368 --> 00:02:47,684 enormous. It might be millions or maybe even billions of rows. So I'll just, 35 00:02:47,852 --> 00:02:52,552 mention a couple examples. So you could think of your, your email inbox is maybe 36 00:02:52,552 --> 00:02:57,184 stored in a table on the computer. So the way that would work is, well, what would 37 00:02:57,184 --> 00:03:01,468 the fields be? The fields might be something like from, and to, and date, and 38 00:03:01,468 --> 00:03:06,274 subject, and, you know, a few other things that you store, per message. And then one 39 00:03:06,274 --> 00:03:10,848 row is just one message. So each message gets its own row, and then we have this, 40 00:03:11,021 --> 00:03:15,480 fixed number of fields. So then when you go to your inbox, well, there might be. 41 00:03:15,480 --> 00:03:19,430 10,000 rows in there for all your email and maybe when you go to your inbox it 42 00:03:19,430 --> 00:03:23,480 just selects the ten most recent ones and shows you, maybe not all the fields, but 43 00:03:23,480 --> 00:03:28,080 maybe the most important fields from that message. Another example is Craig's List. 44 00:03:28,080 --> 00:03:32,393 Or, you know, any sorta online auction site. Where maybe it's stored, it could be 45 00:03:32,393 --> 00:03:36,927 stored in a table where one row is gonna be one item for sale. And then the fields 46 00:03:36,927 --> 00:03:41,405 would again be sorta the categories that you want for one item. So the categories, 47 00:03:41,405 --> 00:03:45,994 the fields might be the price, the date it was listed. Maybe a short description, and 48 00:03:45,994 --> 00:03:50,694 a long description, and a few things like that. So those are just a couple examples 49 00:03:50,694 --> 00:03:55,172 of how many of the things you deal with day to day often, back on the computer, 50 00:03:55,172 --> 00:03:59,822 that's gonna be stored in some kinda table. Alright, so to make this real, I 51 00:03:59,822 --> 00:04:05,980 wanna look at co de to manipulate, tables. And I'm gonna use the baby name table as 52 00:04:05,980 --> 00:04:11,844 sort of our, our working example for a, a couple sections here. So, in this case, 53 00:04:11,844 --> 00:04:17,511 the baby data for 2010 is stored in, baby-2010.csv. I should just mention, CSV 54 00:04:17,511 --> 00:04:22,992 stands for Comma Separated Values. It's a standard for storing, essentially table 55 00:04:22,992 --> 00:04:28,130 data in a text file, and it's a really simple, fairly old standard. So it's a 56 00:04:28,130 --> 00:04:33,542 pretty, you know, easy way to interchange data from one program to another. So in 57 00:04:33,542 --> 00:04:39,043 terms of the code, I'll make my analogy to images. So for images, we had four pixel 58 00:04:39,043 --> 00:04:44,468 colon images, And that would loop through all the pixels in the image, and for each 59 00:04:44,468 --> 00:04:49,225 pixel. Everyone, whatever this code was inside the colon braces. So, for the table 60 00:04:49,225 --> 00:04:53,938 to be very similar we're going to have four row colon table, And what that's 61 00:04:53,938 --> 00:04:58,711 going to do is it's just going to loop through each row through the table. So, it 62 00:04:58,711 --> 00:05:03,484 just starts from the top and go through each one. And for each row it's going to 63 00:05:03,484 --> 00:05:07,840 run whatever code I put in the colon braces. So, here is our first example. 64 00:05:08,444 --> 00:05:13,172 That is the line, very similar to, loading an image. So that's the line that, grabs 65 00:05:13,172 --> 00:05:17,349 the table and stores it in a variable, which I will inevitably just call the 66 00:05:17,349 --> 00:05:22,021 table, And then here I have the four loop, sorta looking through all the rows. And in 67 00:05:22,021 --> 00:05:26,638 this case, the simplest thing I'm gonna do is I'm just gonna say, print row. So, I'm 68 00:05:26,638 --> 00:05:31,036 just gonna, essentially just, you know, look at a, print each row in the data. So 69 00:05:31,036 --> 00:05:36,778 this is the baby data, so if I run this. There is row one and row two and so on, So 70 00:05:36,778 --> 00:05:42,936 you can see that Jacob, Isabel, Ethan, those fairly popular names. It actually 71 00:05:42,936 --> 00:05:49,094 made my web page quite tall because of course there is two thousand of these 72 00:05:49,094 --> 00:05:55,652 things. So you know there's Courtney with a K The 637 popular girl names. So it runs 73 00:05:55,652 --> 00:06:01,810 all the way down here as I was saying. Oops, to a, to a thousand. So Acre and an 74 00:06:01,810 --> 00:06:06,671 Danea, So That is one thing, so what, I guess what this shows, sort of, a bulk 75 00:06:06,671 --> 00:06:11,605 output thing, but what it shows is, that line ran 2,000 times. Once for each row in 76 00:06:11,605 --> 00:06:16,538 the table. So, just as with the image, the four loop just went through and looked at 77 00:06:16,538 --> 00:06:21,776 each one. Alright so here I'm gonna comma this out and run again just to get rid of 78 00:06:21,776 --> 00:06:26,832 the output so I can have my webpage and I'll be a mile high here. So what are we 79 00:06:26,832 --> 00:06:31,643 gonna do with the table? Just looping through and printing each row, that's like 80 00:06:31,643 --> 00:06:36,595 [laugh], like for Craigslist or for your email. That's never what you want. What we 81 00:06:36,595 --> 00:06:41,840 want is to loop through all the rows and just pick out the six or two of the 2,000 82 00:06:41,840 --> 00:06:47,338 that we want. This is very common thing to do with table [inaudible]. It is sometimes 83 00:06:47,338 --> 00:06:52,457 called in database terminology a quarry. That I'm going to kind of sort of narrow 84 00:06:52,457 --> 00:06:57,196 down to just the rows I want. So, let's talk about the code to do that. So 85 00:06:57,386 --> 00:07:02,821 [inaudible] we're going to do this with an IF statement, Put an IF statement inside 86 00:07:02,821 --> 00:07:08,293 the loop and in the IF task we will write a task to select just some of the rows. So 87 00:07:08,293 --> 00:07:12,772 here's gonna be my first example. So here is the four loop. So that's looping 88 00:07:12,772 --> 00:07:17,546 through all the rows. And then inside the four loop, I've got this if statement. So 89 00:07:17,546 --> 00:07:22,142 what's gonna happen is, this highlighted code is gonna run again and again and 90 00:07:22,142 --> 00:07:27,624 again, once for each row in the thing. And so what I've done. So I've, written a test 91 00:07:27,624 --> 00:07:33,791 here, and my, the goal here is, in this case, is to just pick out the rows where 92 00:07:33,791 --> 00:07:39,849 the rank is six. And so, let me talk about how that works. So what's gonna happen is 93 00:07:39,849 --> 00:07:44,065 that highlighted test, that test is gonna be evaluated once for every row. So in a 94 00:07:44,065 --> 00:07:48,542 sense 2000 times. So, what I'm gonna do is structure the test so it's true for a row 95 00:07:48,542 --> 00:07:52,707 I care about. And then inside of here I'll put a print, so it'll print the ones I 96 00:07:52,707 --> 00:07:56,767 care about. In all the other rows this will be false, and so it won't print the, 97 00:07:56,767 --> 00:08:01,446 won't print those. All right, so how does this work? So just as for the pixel, we 98 00:08:01,446 --> 00:08:06,576 had get red and get green and get blue the row has get field. And so you could, 99 00:08:06,576 --> 00:08:11,645 remember we called it a row because all the way across it has a bunch of different 100 00:08:11,645 --> 00:08:16,408 fields. So you can say, well, which field do you want? The way this works is each 101 00:08:16,408 --> 00:08:21,563 field has a n ame. In this case, the names are name, rank, gender and year. So in 102 00:08:21,563 --> 00:08:25,900 this case, I say get field. And then, within the parentheses, I say in a string, 103 00:08:25,900 --> 00:08:30,351 which field do I want by name? So in this case, I'm, like, oh right. I wanna go to 104 00:08:30,351 --> 00:08:34,625 the row, and I wanna pick out the rank. So this highlighted part that goes to the 105 00:08:34,625 --> 00:08:38,489 row. And that picks out the rank. Just as before we would have a pixel dot get red 106 00:08:38,489 --> 00:08:42,040 and that would pick, that would pull the red just out of the pixel, so this is 107 00:08:42,040 --> 00:08:46,956 analogous but for a table. So now my call here for this example is I wanted to just 108 00:08:46,956 --> 00:08:51,716 show what the rows where the rank [inaudible] required new little bit of 109 00:08:51,716 --> 00:08:56,933 code. So having picked the rank out here, then I says equals, equals, which I think 110 00:08:56,933 --> 00:09:02,150 we already used before, but two equal signs next to each other that compares two 111 00:09:02,150 --> 00:09:07,107 things for equality, it tested they are the same. And so road get field rank 112 00:09:07,107 --> 00:09:11,919 equal, equal six. What that says is, get the rank out, and test if it's six. And if 113 00:09:11,919 --> 00:09:16,732 it's six, we'll say that that's, the test is true. And if it's not, we'll say it's 114 00:09:16,732 --> 00:09:21,699 false. So, let me just try running this. So if I run it, what's happened is, it 115 00:09:21,699 --> 00:09:26,514 went through all 2,000 rows. And for these two rows, that test was true, Because 116 00:09:26,514 --> 00:09:31,329 that's the case where the, the rank was six. And obviously, you know, I could say 117 00:09:31,329 --> 00:09:35,620 it, like, 127 here or whatever. And then we would get the two rows. It just 118 00:09:35,620 --> 00:09:40,271 happens; each rank number has one boy name and one girl name in the Stata set. So, 119 00:09:40,445 --> 00:09:46,108 that's why I keep getting two rows here. So let me try another example. Oh, also I 120 00:09:46,108 --> 00:09:53,780 should mention a, a warning about this. So I'll change this back to six, quick. So 121 00:09:53,780 --> 00:09:58,413 this use of the two equals for equality is a little odd in computer code. I think it 122 00:09:58,413 --> 00:10:02,604 would be very reasonable to think, oh, what, shouldn't there be just one equal 123 00:10:02,604 --> 00:10:06,795 sign? Right? If rank equals six? And unfortunately the single equal sign in 124 00:10:06,795 --> 00:10:10,986 JavaScript already has been used for variable assignment. It's kinda already 125 00:10:10,986 --> 00:10:15,799 dedicated to meaning that. And so they couldn't use it for quality, so that's why 126 00:10:15,799 --> 00:10:20,364 there's this different symbol for equa lity. Now, just for this class. So the, 127 00:10:20,364 --> 00:10:25,166 it's actually a pretty common error coding to sort of accidentally type a single 128 00:10:25,166 --> 00:10:30,388 equal sign, when someone meant two equal signs for comparison. In this case. I've 129 00:10:30,388 --> 00:10:35,353 outfitted the run button with some special checking code, where it notices if in an 130 00:10:35,353 --> 00:10:40,497 if test, it sees a single equal sign, And it gives this error message that basically 131 00:10:40,497 --> 00:10:45,642 says, hey, did, did you maybe mean to use, two equal signs? So, that is an easy error 132 00:10:45,642 --> 00:10:49,295 to make, but. Hit the run button and we'll catch it for you. That, that's something I 133 00:10:49,295 --> 00:10:54,320 just did for this class, Alright so now let me do a now let me do another example. 134 00:10:54,497 --> 00:10:59,646 So the test I did before I tested if rank was six but really any kind of test as we 135 00:10:59,646 --> 00:11:04,596 were doing before with images, will work here. So in this case what I'm going to do 136 00:11:04,596 --> 00:11:09,397 is I want to go through the data set and I want to find the data, let's just say, for 137 00:11:09,397 --> 00:11:14,313 Alice. So as I mentioned before forget field you can just patch in the name for 138 00:11:14,313 --> 00:11:19,000 any field. So, you would need to know what the field names are. For this data set 139 00:11:19,000 --> 00:11:23,687 they are name ranked under here and here. So, here I will go to the row and say, hey 140 00:11:23,687 --> 00:11:28,374 give me the name field. So I'll say, name there. And then I'll, I'll equals, equals, 141 00:11:28,374 --> 00:11:32,726 test if the name is, is the same as Alice. So, if I run that. In effect what this 142 00:11:32,726 --> 00:11:36,819 does is it just pulls out the Alice row. It goes through all the rows, does this 143 00:11:36,819 --> 00:11:41,015 test, and if the name is Alice, let's hear the English translation of this, then it 144 00:11:41,015 --> 00:11:46,444 prints the row out. Alright, so that's the basic pattern. So let me just work a few 145 00:11:46,444 --> 00:11:51,155 examples for this. So, the pattern is gonna be, [inaudible] just as I was doing. 146 00:11:51,155 --> 00:11:55,571 We have a four loop, there's an if statement side of it. And then really, all 147 00:11:55,571 --> 00:12:00,459 of the action is in the parentheses of the test. Where I say row.getfield something, 148 00:12:00,459 --> 00:12:05,111 and I have some test about it. So let's try these. So if I run it this way, we 149 00:12:05,111 --> 00:12:10,057 pull out, it says, if name is equal, equal to Alice, I get the Alice row. If I wanted 150 00:12:10,057 --> 00:12:14,827 to look for something else, pull out some other data, we could say Robert. So Alice 151 00:12:14,827 --> 00:12:25,602 is 172. Ro bert is 54. Let's try Abby. 284. So, what's happening is, this 152 00:12:25,602 --> 00:12:31,070 highlighted test is happening all 2000 times. And it's just a question of which 153 00:12:31,070 --> 00:12:37,091 rows are we, are we picking out there? I did Robert before. I'll show you something 154 00:12:37,091 --> 00:12:43,708 kind of funny. If you do Bob and you run. Nothing appears here. What's going on 155 00:12:43,708 --> 00:12:47,735 there is actually no one names their kid Bob. Apparently, so what's happening is 156 00:12:47,735 --> 00:12:51,966 that we are getting no... Zero printing is happening here. This thing was just never 157 00:12:51,966 --> 00:12:56,198 true. That's sort of the pattern on the form I guess for just as how people name 158 00:12:56,198 --> 00:13:00,378 babies is that they tend of the form... They put a long name, like Robert. So, and 159 00:13:00,378 --> 00:13:04,507 then Bob is like, they don't put on the form. Maybe that's just what they actually 160 00:13:04,507 --> 00:13:09,562 call the kid. Alright, so let me try a different test. Let's say I wanna test if 161 00:13:09,562 --> 00:13:14,631 the rank is one. So I would change get field, and I would type rank here. And 162 00:13:14,631 --> 00:13:20,539 then the equals, equals. I can say one, sure. So that gives me the two rows Jacob 163 00:13:20,775 --> 00:13:26,877 and Isabelle. We saw four, those are rank one. So. [inaudible], what was the other 164 00:13:26,877 --> 00:13:32,469 one we did 1,000. So say rank equals a thousand. And we get crew ending. So the 165 00:13:32,679 --> 00:13:38,341 test we did earlier with images like less than, less than equal to. All that stuff 166 00:13:38,341 --> 00:13:44,143 works too. So let's say I wanna look at, if the rank is less than ten. [inaudible] 167 00:13:44,143 --> 00:13:49,569 say less than ten and when I run that. You can see I get, rank one, rank two, rank 168 00:13:49,569 --> 00:13:54,417 three, rank... All these are rank numbers where the less than ten test is true. 169 00:13:54,606 --> 00:13:59,831 Although you'll notice the last I get is Aiden and Cloe, number nine. The rows 170 00:13:59,831 --> 00:14:05,119 where rank is ten, I don't get. And that's because this form of less than is a strict 171 00:14:05,119 --> 00:14:10,407 less than. So it's true for nine but it's not true for ten. If you want, there's 172 00:14:10,407 --> 00:14:15,254 another form of less than where you're like, where you wanna say less than or 173 00:14:15,254 --> 00:14:19,371 equal to. And, I don't think we did this for the images but it's just, what you do 174 00:14:19,371 --> 00:14:23,238 is you put in an equal sign right after it. That means less than or equal to. So 175 00:14:23,238 --> 00:14:29,730 if I run it now then it goes through ten. So, and that works for, greater than as 176 00:14:29,730 --> 00:14:35,052 well. Alright, so let's try a, let's try a greater than one. So I could say, I would 177 00:14:35,052 --> 00:14:40,598 like to see all the rows where the rank is greater than 990, let's say. And so what 178 00:14:40,598 --> 00:14:44,967 I, so I get 991, 92, da, da, da, da, up through 1000. Okay, let me just try one 179 00:14:44,967 --> 00:14:50,136 more. I, so [inaudible] examples with name and rank. And [inaudible] inevitably, I'm 180 00:14:50,136 --> 00:14:54,752 calling, road.getfield, and just changing what string is there to pull out a 181 00:14:54,752 --> 00:15:00,229 different field. I'll try pulling out the, the gender field. And this case, the way 182 00:15:00,229 --> 00:15:04,721 the data's coded, the gender field is it's, it's, it's just strings. So it's 183 00:15:04,721 --> 00:15:09,398 either the string boy or string girl. So if I were to say, if gender is equal, 184 00:15:09,398 --> 00:15:15,186 equal to girl. Hit one then I get [sound] I mean if you look where it say scroll 185 00:15:15,186 --> 00:15:20,346 here, what's happened is I have just gotten all 1,000 girl bros. And, and none 186 00:15:20,346 --> 00:15:25,860 of the 1000 [inaudible] woops. Alrighty. Sorry, let me get this back. So this is 187 00:15:25,860 --> 00:15:31,030 ju-, just a trick where I comment out print, so it prints nothing, and run it 188 00:15:31,030 --> 00:15:36,443 again. So then, that way, it just, it just blanks out the output here. So. Just to 189 00:15:36,628 --> 00:15:41,066 repeat what the pattern is. So, t, t, these first few lines were always the 190 00:15:41,066 --> 00:15:46,306 same. And I guess I was always [inaudible] the row. So the, that was always the same. 191 00:15:46,306 --> 00:15:51,300 What I change is the if test. And the gist of it, the pattern tended to be I would 192 00:15:51,300 --> 00:15:56,108 say row.getField, whatever field I care about. And then I would write equals, 193 00:15:56,108 --> 00:16:00,978 equals or less than or equal to or something. Let's say on the rank or equal, 194 00:16:00,978 --> 00:16:05,848 equal to the name to, in a sense, pull out the rows. And the rule was, I'm pulling 195 00:16:05,848 --> 00:16:12,676 out a row, if this test is true. And so, with that in mind, well this can be a good 196 00:16:12,676 --> 00:16:15,040 source of some exercises.