WEBVTT 00:00:00.048 --> 00:00:01.062 This is the first of two 00:00:01.062 --> 00:00:04.086 videos where we learn about relational algebra. 00:00:04.086 --> 00:00:07.034 Relational Algebra is a formal language. 00:00:07.034 --> 00:00:09.001 It's an algebra that forms 00:00:09.001 --> 00:00:12.006 the underpinnings of implemented languages like SQL. 00:00:12.006 --> 00:00:14.001 In this video we're going to 00:00:14.001 --> 00:00:15.011 learn the basics of the 00:00:15.011 --> 00:00:19.043 Relational Algebra Query Language and a few of the most popular operators. 00:00:19.043 --> 00:00:20.067 In the second video we'll learn 00:00:20.067 --> 00:00:22.039 some additional operators and some 00:00:22.039 --> 00:00:26.036 alternate notations for relational algebra. 00:00:26.036 --> 00:00:28.013 Now, let's just review first from 00:00:28.013 --> 00:00:29.086 our previous video on relational 00:00:29.086 --> 00:00:31.062 querying that queries over 00:00:31.062 --> 00:00:33.066 relational databases operate on 00:00:33.066 --> 00:00:36.097 relations and they also produce relations as a result. 00:00:36.097 --> 00:00:38.005 So if we write a query 00:00:38.005 --> 00:00:39.008 that operates say on the three 00:00:39.008 --> 00:00:41.071 relations depicted here, the 00:00:41.071 --> 00:00:44.073 result of that query is going to be a new relation. 00:00:44.073 --> 00:00:46.012 And, in fact, we can 00:00:46.012 --> 00:00:47.075 post queries on that 00:00:47.075 --> 00:00:49.036 new relation or combine that 00:00:49.036 --> 00:00:52.035 new relation with our previous relations. 00:00:52.035 --> 00:00:54.051 So let's start out with Relational Algebra. 00:00:54.051 --> 00:00:56.017 For the examples in 00:00:56.017 --> 00:00:57.029 this video we're going to be 00:00:57.029 --> 00:01:00.093 using a simple college admission relations database with three relations. 00:01:00.093 --> 00:01:02.013 The first relation, the college 00:01:02.013 --> 00:01:03.086 relation, contains information about the 00:01:03.086 --> 00:01:06.086 college name, state, and enrollment of the college. 00:01:06.086 --> 00:01:08.002 The second relation, the student 00:01:08.002 --> 00:01:10.005 relation, contains an ID 00:01:10.005 --> 00:01:11.071 for each student, the student's name, 00:01:11.071 --> 00:01:14.053 GPA and the size of the high school they attended. 00:01:14.053 --> 00:01:16.002 And, finally, the third relation contains 00:01:16.002 --> 00:01:18.079 information about students applying to colleges. 00:01:18.079 --> 00:01:20.005 Specifically, the student's ID, the 00:01:20.005 --> 00:01:21.065 college name where they're 00:01:21.065 --> 00:01:22.088 applying, the major they're 00:01:22.088 --> 00:01:25.089 applying for and the decision of that application. 00:01:25.089 --> 00:01:28.069 I've underlined the keys for these three relations. 00:01:28.069 --> 00:01:30.073 As a reminder, a key is 00:01:30.073 --> 00:01:31.079 an attribute or a set of 00:01:31.079 --> 00:01:33.031 attributes whose value is 00:01:33.031 --> 00:01:35.056 guaranteed to be unique. 00:01:35.056 --> 00:01:36.084 So, for example, we're going 00:01:36.084 --> 00:01:38.037 to assume the college names are unique, 00:01:38.037 --> 00:01:40.033 student IDs are unique and that 00:01:40.033 --> 00:01:41.091 students will only apply to 00:01:41.091 --> 00:01:46.022 each college for a particular major one time. 00:01:46.022 --> 00:01:47.028 So, we're going to have a 00:01:47.028 --> 00:01:48.007 picture of these three relations at 00:01:48.007 --> 00:01:52.038 the bottom of the slides throughout the video. 00:01:52.038 --> 00:01:54.017 The simplest query in relational 00:01:54.017 --> 00:01:55.051 algebra is a query 00:01:55.051 --> 00:01:57.062 that is simply the name of a relation. 00:01:57.062 --> 00:01:58.085 So, for example, we can 00:01:58.085 --> 00:02:01.021 write a query, "student" and that's 00:02:01.021 --> 00:02:04.022 a valid expression in relational algebra. 00:02:04.022 --> 00:02:05.036 If we run that query on 00:02:05.036 --> 00:02:07.002 our database we'll get as 00:02:07.002 --> 00:02:09.058 a result a copy of the student relation. 00:02:09.058 --> 00:02:11.003 Pretty straightforward . 00:02:11.003 --> 00:02:12.068 Now what happens next 00:02:12.068 --> 00:02:13.085 is that we're going to use 00:02:13.085 --> 00:02:15.074 operators of the relational algebra 00:02:15.074 --> 00:02:19.092 to filter relations, slice relations, and combine relations. 00:02:19.092 --> 00:02:23.041 So, let's through those operators. 00:02:23.041 --> 00:02:26.006 The first operator is the select operator. 00:02:26.006 --> 00:02:27.077 So, the select operator is used 00:02:27.077 --> 00:02:30.008 to pick certain rows out of a relation. 00:02:30.008 --> 00:02:31.067 The select operator is denoted by 00:02:31.067 --> 00:02:33.067 a Sigma with a subscript--that's 00:02:33.067 --> 00:02:35.012 the condition that's used to 00:02:35.012 --> 00:02:37.004 filter the rows that we extract from the relations. 00:02:37.004 --> 00:02:40.028 So, we're just going through three examples here. 00:02:40.028 --> 00:02:41.005 The first example says that we 00:02:41.005 --> 00:02:42.066 want to find the students 00:02:42.066 --> 00:02:44.019 whose GPA is greater than 3.7. 00:02:44.019 --> 00:02:45.096 So to write that 00:02:45.096 --> 00:02:47.059 expression in relational algebra, we write 00:02:47.059 --> 00:02:48.078 the sigma which is the 00:02:48.078 --> 00:02:50.059 selection operator as a 00:02:50.059 --> 00:02:51.077 subscript the condition that we're 00:02:51.077 --> 00:02:54.053 filtering for--GPA greater than 00:02:54.053 --> 00:02:58.054 3.7--and the relation over which we're finding that selection predicate. 00:02:58.054 --> 00:03:00.069 So, this expression will return 00:03:00.069 --> 00:03:02.001 a subset of the student 00:03:02.001 --> 00:03:04.001 table containing those rows 00:03:04.001 --> 00:03:07.003 where the GPA is greater 3.7. 00:03:07.003 --> 00:03:08.003 If we want to filter for two 00:03:08.003 --> 00:03:09.057 conditions, we just do 00:03:09.057 --> 00:03:12.043 an "and" of the conditions in the subscript of the sigma. 00:03:12.043 --> 00:03:13.098 So if we want, say, students 00:03:13.098 --> 00:03:15.029 whose GPA is greater than 3.7 00:03:15.029 --> 00:03:16.094 and whose high school size 00:03:16.094 --> 00:03:18.051 is less than a thousand, we'll 00:03:18.051 --> 00:03:23.007 write select GPA greater than 3.7. 00:03:23.007 --> 00:03:24.096 We used a logical 00:03:24.096 --> 00:03:27.007 and operator--a caret, high school 00:03:27.007 --> 00:03:28.003 size is less than a 00:03:28.003 --> 00:03:32.021 thousand, and again we'll apply that to the student relation. 00:03:32.021 --> 00:03:33.051 And once again, the result of 00:03:33.051 --> 00:03:34.087 that will be a subset of 00:03:34.087 --> 00:03:37.035 the student relation containing the rows that satisfy the condition. 00:03:37.035 --> 00:03:39.042 If we want to find 00:03:39.042 --> 00:03:41.029 the applications to Stanford for 00:03:41.029 --> 00:03:42.007 a CS major, then we'll be 00:03:42.007 --> 00:03:45.052 applying a selection condition to the apply relation. 00:03:45.052 --> 00:03:46.087 Again, we write the sigma 00:03:46.087 --> 00:03:48.066 and now the subscript is 00:03:48.066 --> 00:03:49.079 going to say that the college 00:03:49.079 --> 00:03:54.077 name is Stanford and the major is CS. 00:03:54.077 --> 00:03:57.006 Again, the and operator, and 00:03:57.006 --> 00:03:59.039 that will be applied 00:03:59.039 --> 00:04:01.033 to the apply relation and 00:04:01.033 --> 00:04:02.081 it will return as a 00:04:02.081 --> 00:04:06.063 result, a subset of the apply relation. 00:04:06.063 --> 00:04:07.081 So the general case of the 00:04:07.081 --> 00:04:10.058 select operator is that we have the sigma. 00:04:10.058 --> 00:04:12.028 We have a condition as a 00:04:12.028 --> 00:04:14.056 subscript and then we have a relation name. 00:04:14.056 --> 00:04:17.019 And we return as a result the subset of the relation. 00:04:17.019 --> 00:04:21.049 Our next operator is the Project Operator. 00:04:21.049 --> 00:04:23.021 So the select operator picks certain 00:04:23.021 --> 00:04:26.071 rows, and the project operator picks certain columns. 00:04:26.071 --> 00:04:28.011 So let's say we're 00:04:28.011 --> 00:04:29.088 interested in the applications, but all 00:04:29.088 --> 00:04:30.084 we wanted to know was the 00:04:30.084 --> 00:04:34.027 list of ID's and the decisions for those applications. 00:04:34.027 --> 00:04:35.072 The project operator is written 00:04:35.072 --> 00:04:37.084 using the Greek pi symbol, 00:04:37.084 --> 00:04:39.042 and now the subscript is 00:04:39.042 --> 00:04:42.015 a list of the column names that we would like to extract. 00:04:42.015 --> 00:04:43.091 So we write ID, sorry, 00:04:43.091 --> 00:04:46.075 student ID and decision, and 00:04:46.075 --> 00:04:49.077 we apply that to the apply relation again. 00:04:49.077 --> 00:04:51.006 And now what we'll get 00:04:51.006 --> 00:04:54.027 back is a relation that has just two rows. 00:04:54.027 --> 00:04:55.041 It's going to have all the 00:04:55.041 --> 00:04:56.067 tuples of apply, but it's 00:04:56.067 --> 00:04:57.098 only going to have the 00:04:57.098 --> 00:05:00.071 student ID and the decision columns. 00:05:00.071 --> 00:05:02.032 So the general case of 00:05:02.032 --> 00:05:04.005 a project operator is the 00:05:04.005 --> 00:05:05.091 projection, and then a 00:05:05.091 --> 00:05:07.061 list of attributes, can be 00:05:07.061 --> 00:05:13.002 any number, and then a relation name. 00:05:13.002 --> 00:05:14.069 Now, what if we're interested in 00:05:14.069 --> 00:05:16.092 picking both rows and columns at the same time. 00:05:16.092 --> 00:05:18.015 So we want only some of 00:05:18.015 --> 00:05:20.095 the rows, and we want only some of the columns. 00:05:20.095 --> 00:05:23.008 Now we're going to compose operators. 00:05:23.008 --> 00:05:26.017 Remember that relational queries produce relations . 00:05:26.017 --> 00:05:27.065 So we can write a 00:05:27.065 --> 00:05:29.031 query, say, with the 00:05:29.031 --> 00:05:31.012 select operator of the 00:05:31.012 --> 00:05:33.011 students whose GPA is greater than 3.7. 00:05:33.011 --> 00:05:35.099 And this is how we do that. 00:05:35.099 --> 00:05:37.044 And now, we can take 00:05:37.044 --> 00:05:39.014 that whole expression which produces a 00:05:39.014 --> 00:05:40.047 relation, and we can 00:05:40.047 --> 00:05:42.066 apply the project operator to that, and 00:05:42.066 --> 00:05:43.096 we can get out the student 00:05:43.096 --> 00:05:49.076 ID and the student name. 00:05:49.076 --> 00:05:50.017 Okay. 00:05:50.017 --> 00:05:51.088 So, what we actually see 00:05:51.088 --> 00:05:54.031 now is that the general 00:05:54.031 --> 00:05:55.075 case of the selection and 00:05:55.075 --> 00:05:58.008 projection operators weren't quite what I told you at first. 00:05:58.008 --> 00:05:59.058 I was deceiving you slightly. 00:05:59.058 --> 00:06:01.021 When we write the select 00:06:01.021 --> 00:06:03.075 operator, it's a select 00:06:03.075 --> 00:06:06.007 with the condition on any 00:06:06.007 --> 00:06:07.056 expression of the relational 00:06:07.056 --> 00:06:08.095 algebra, and if it's 00:06:08.095 --> 00:06:09.093 a big one we might want 00:06:09.093 --> 00:06:11.008 to put parenthesis on it, 00:06:11.008 --> 00:06:13.098 and similarly the project operator is 00:06:13.098 --> 00:06:16.021 a list of attributes from 00:06:16.021 --> 00:06:18.061 any expression of the relational algebra. 00:06:18.061 --> 00:06:20.061 And we can compose these as much as we want. 00:06:20.061 --> 00:06:22.024 We can have select over project, over 00:06:22.024 --> 00:06:24.037 select, select, project, and so on. 00:06:24.037 --> 00:06:27.072 Now let's talk about 00:06:27.072 --> 00:06:31.044 duplicate values in the results of relational algebra queries. 00:06:31.044 --> 00:06:33.007 Let's suppose we ask 00:06:33.007 --> 00:06:34.036 for a list of the majors that 00:06:34.036 --> 00:06:37.007 people have applied for and the decision for those majors. 00:06:37.007 --> 00:06:37.092 So we write that as the 00:06:37.092 --> 00:06:40.063 project of the major 00:06:40.063 --> 00:06:45.058 and the decision on the applied relation. 00:06:45.058 --> 00:06:46.007 You might think that when we get 00:06:46.007 --> 00:06:49.053 the results of this query, we're going to have a lot of duplicate values. 00:06:49.053 --> 00:06:51.082 So we'll have CS yes, CS 00:06:51.082 --> 00:06:54.057 yes, CS no, EE yes, EE no, and so on. 00:06:54.057 --> 00:06:55.072 You can imagine in a 00:06:55.072 --> 00:06:57.056 large realistic database of applications, 00:06:57.056 --> 00:06:59.007 there's going to be hundreds of 00:06:59.007 --> 00:07:02.038 people applying for majors and having a yes or a no decision. 00:07:02.038 --> 00:07:04.003 The semantics of relational 00:07:04.003 --> 00:07:08.033 algebra says that duplicates are always eliminated. 00:07:08.033 --> 00:07:09.046 So if you run a query 00:07:09.046 --> 00:07:10.068 that would logically have a 00:07:10.068 --> 00:07:11.098 lot of duplicate values, you just 00:07:11.098 --> 00:07:14.049 get one value for each result. 00:07:14.049 --> 00:07:16.063 That's actually a bit of a difference with the SQL language. 00:07:16.063 --> 00:07:18.034 So, SQL is based on 00:07:18.034 --> 00:07:20.048 what's known as multi-sets or 00:07:20.048 --> 00:07:21.093 bags and that means 00:07:21.093 --> 00:07:24.003 that we don't eliminate duplicates, whereas 00:07:24.003 --> 00:07:26.017 relational algebra is based 00:07:26.017 --> 00:07:29.019 on sets themselves and duplicates are eliminated. 00:07:29.019 --> 00:07:30.071 There is a multi-set or 00:07:30.071 --> 00:07:32.067 bad relational algebra defined as 00:07:32.067 --> 00:07:33.082 well but we'll be fine by 00:07:33.082 --> 00:07:38.000 just considering the set relational algebra in these videos. 00:07:38.000 --> 00:07:39.086 Our first operator that combines two 00:07:39.086 --> 00:07:41.075 relations is the cross-product operator, 00:07:41.075 --> 00:07:43.098 also known as the Cartesian product. 00:07:43.098 --> 00:07:45.022 What this operator does, is it 00:07:45.022 --> 00:07:46.054 takes two relations and kinda 00:07:46.054 --> 00:07:48.006 glues them together so that 00:07:48.006 --> 00:07:49.082 their schema of the result 00:07:49.082 --> 00:07:50.082 is the union of the 00:07:50.082 --> 00:07:52.065 schemas of the two relations and 00:07:52.065 --> 00:07:54.005 the contents of the result are every 00:07:54.005 --> 00:07:56.045 combination of tuples from those relations. 00:07:56.045 --> 00:07:57.095 This is in fact the normal 00:07:57.095 --> 00:07:59.001 set cross product that you 00:07:59.001 --> 00:08:01.025 might have learned way back in the elementary school. 00:08:01.025 --> 00:08:03.001 So let's talk about, say, 00:08:03.001 --> 00:08:06.086 doing the cross products of students and apply. 00:08:06.086 --> 00:08:08.004 So if we do this 00:08:08.004 --> 00:08:09.008 cross products, just to 00:08:09.008 --> 00:08:11.062 save drawing, I'm just gonna glue 00:08:11.062 --> 00:08:14.004 these two relations together here. 00:08:14.004 --> 00:08:15.000 So if we do the 00:08:15.000 --> 00:08:16.035 cross product we'll get at the 00:08:16.035 --> 00:08:18.038 result a big relation, here, which 00:08:18.038 --> 00:08:20.028 is going to have eight attributes. 00:08:20.028 --> 00:08:21.091 The eight attributes across the student 00:08:21.091 --> 00:08:24.012 and apply now the only 00:08:24.012 --> 00:08:25.046 small little trick is that 00:08:25.046 --> 00:08:27.049 when we glue two relations together 00:08:27.049 --> 00:08:28.007 sometimes they'll have the same 00:08:28.007 --> 00:08:31.012 attribute and we can see we have SID on both sides. 00:08:31.012 --> 00:08:33.049 So just as a notational convention, 00:08:33.049 --> 00:08:35.014 when cross-product is done 00:08:35.014 --> 00:08:36.061 and there's two attributes that are 00:08:36.061 --> 00:08:39.042 named, they're prefaced with the name of the relation they came from. 00:08:39.042 --> 00:08:40.071 So this one would be referred 00:08:40.071 --> 00:08:42.018 to in the cross-product as 00:08:42.018 --> 00:08:44.026 the student dot SID where this 00:08:44.026 --> 00:08:45.052 one over here would be referred 00:08:45.052 --> 00:08:48.045 to as the apply dot SID. 00:08:48.045 --> 00:08:49.092 So, again, we glue together in 00:08:49.092 --> 00:08:51.033 the Cartesian product the two 00:08:51.033 --> 00:08:53.011 relations with four attributes each, 00:08:53.011 --> 00:08:55.056 we get a result with eight attributes. 00:08:55.056 --> 00:08:57.069 Now let's talk about the contents of these. 00:08:57.069 --> 00:09:00.011 So let's suppose that the student 00:09:00.011 --> 00:09:02.037 relation had s-tuples in it 00:09:02.037 --> 00:09:04.017 and that's how many tuples, while 00:09:04.017 --> 00:09:05.008 the apply had 8 tuples in 00:09:05.008 --> 00:09:07.048 it, the result of the 00:09:07.048 --> 00:09:08.079 Cartesian products is gonna 00:09:08.079 --> 00:09:10.071 have S times A tuples, 00:09:10.071 --> 00:09:12.074 is going to have one tuple 00:09:12.074 --> 00:09:14.068 for every combination of tuples 00:09:14.068 --> 00:09:19.004 from the student relation and the apply relation. 00:09:19.004 --> 00:09:20.075 Now, the cross-product seems 00:09:20.075 --> 00:09:21.081 like it might not be that 00:09:21.081 --> 00:09:23.012 helpful, but what is 00:09:23.012 --> 00:09:24.052 interesting is when we use 00:09:24.052 --> 00:09:26.092 the cross-product together with other operators. 00:09:26.092 --> 00:09:28.097 And let's see a big example of that. 00:09:28.097 --> 00:09:30.061 Let's suppose that we want 00:09:30.061 --> 00:09:32.065 to get the names and GPAs of 00:09:32.065 --> 00:09:33.099 students with a high school 00:09:33.099 --> 00:09:35.035 size greater than a thousand who 00:09:35.035 --> 00:09:37.099 applied to CS and were rejected. 00:09:37.099 --> 00:09:40.001 Okay, so let's take a look. 00:09:40.001 --> 00:09:41.062 We're going to have to access 00:09:41.062 --> 00:09:43.015 the students and the apply 00:09:43.015 --> 00:09:45.038 records in order to run this query. 00:09:45.038 --> 00:09:46.006 So what we'll do is we'll 00:09:46.006 --> 00:09:49.098 take student cross apply as our starting point. 00:09:49.098 --> 00:09:51.065 So now we have 00:09:51.065 --> 00:09:53.051 a big relation that contains 00:09:53.051 --> 00:09:57.032 eight attributes and all of those tuples that we described previously. 00:09:57.032 --> 00:09:58.036 But now we're going to 00:09:58.036 --> 00:10:00.005 start making things more interesting, because 00:10:00.005 --> 00:10:01.042 what we're going to do is 00:10:01.042 --> 00:10:03.072 a big selection over this relation. 00:10:03.072 --> 00:10:04.095 And that selection is first 00:10:04.095 --> 00:10:06.005 of all going to make sure 00:10:06.005 --> 00:10:08.000 that it only combines student and 00:10:08.000 --> 00:10:11.003 apply tuples that are referring to the same student. 00:10:11.003 --> 00:10:12.003 So to do that, we write 00:10:12.003 --> 00:10:15.069 student dot SID equals 00:10:15.069 --> 00:10:17.009 apply dot SID. 00:10:17.009 --> 00:10:18.095 So now we've filtered the result 00:10:18.095 --> 00:10:20.009 of that cross-product to only 00:10:20.009 --> 00:10:22.006 include combinations of student 00:10:22.006 --> 00:10:24.054 and apply by couples that make sets. 00:10:24.054 --> 00:10:26.085 Now we have to do a little bit of additional filtering. 00:10:26.085 --> 00:10:28.004 We said that we want the 00:10:28.004 --> 00:10:29.008 high school size to be 00:10:29.008 --> 00:10:31.004 greater than a thousand, so 00:10:31.004 --> 00:10:33.083 we do an "and" operator in the high school. 00:10:33.083 --> 00:10:35.034 We want them to have applied 00:10:35.034 --> 00:10:37.008 to CS so that's and major equals CS. 00:10:37.008 --> 00:10:40.013 We're getting a nice big query here. 00:10:40.013 --> 00:10:41.026 And finally we want them to 00:10:41.026 --> 00:10:42.099 have been rejected, so "and 00:10:42.099 --> 00:10:47.032 decision" equals, we'll just be using R for reject. 00:10:47.032 --> 00:10:50.009 So now, we've got that gigantic query. 00:10:50.009 --> 00:10:51.093 But that gets us exactly what 00:10:51.093 --> 00:10:53.017 we want except for one more thing, 00:10:53.017 --> 00:10:56.015 which is, as I said, all we want is their names and GPAs. 00:10:56.015 --> 00:10:57.008 So finally we take a 00:10:57.008 --> 00:10:59.031 big parentheses around here and 00:10:59.031 --> 00:11:00.009 we apply to that the 00:11:00.009 --> 00:11:02.091 projection operator, getting the 00:11:02.091 --> 00:11:06.033 student name and the GPA. 00:11:06.033 --> 00:11:07.074 And that is the relational 00:11:07.074 --> 00:11:09.073 algebra expression that produces 00:11:09.073 --> 00:11:12.066 the query that we have written in English. 00:11:12.066 --> 00:11:13.049 Now we have seen how the 00:11:13.049 --> 00:11:14.085 cross product allows us to 00:11:14.085 --> 00:11:16.094 combine tuples and then 00:11:16.094 --> 00:11:18.007 apply selection conditions to 00:11:18.007 --> 00:11:21.023 get meaningful combinations of tuples. 00:11:21.023 --> 00:11:22.076 It turns out that relational algebra 00:11:22.076 --> 00:11:24.003 includes an operator called 00:11:24.003 --> 00:11:25.009 the natural join that is 00:11:25.009 --> 00:11:27.076 used pretty much for the exact purpose. 00:11:27.076 --> 00:11:29.036 What the natural join does is 00:11:29.036 --> 00:11:31.001 it performs a cross-product 00:11:31.001 --> 00:11:33.082 but then it enforces equality 00:11:33.082 --> 00:11:35.099 on all of the attributes with the same name. 00:11:35.099 --> 00:11:36.088 So if we set up our 00:11:36.088 --> 00:11:38.077 schema properly, for example, 00:11:38.077 --> 00:11:40.066 we have student ID and student 00:11:40.066 --> 00:11:42.024 ID here, meaning the same 00:11:42.024 --> 00:11:43.005 thing, and when the cross 00:11:43.005 --> 00:11:45.048 product is created, it's only 00:11:45.048 --> 00:11:48.049 going to combine tuples where the student ID is the same. 00:11:48.049 --> 00:11:49.089 And furthermore, if we add 00:11:49.089 --> 00:11:51.021 college in, we can 00:11:51.021 --> 00:11:54.015 see that we have the college name here and the college name here. 00:11:54.015 --> 00:11:56.025 If we combine college and apply 00:11:56.025 --> 00:11:59.096 tuples, we'll only combine tuples that are talking about the same college. 00:11:59.096 --> 00:12:01.032 Now in addition, one more 00:12:01.032 --> 00:12:02.043 thing that it does is it 00:12:02.043 --> 00:12:06.005 gets rid of these pesky attributes that have the same names. 00:12:06.005 --> 00:12:07.091 So since when we combine, 00:12:07.091 --> 00:12:09.034 for example, student and apply 00:12:09.034 --> 00:12:11.034 with the natural join, we're only 00:12:11.034 --> 00:12:13.059 combining tuples where the 00:12:13.059 --> 00:12:16.084 student SID is the same as the apply SID. 00:12:16.084 --> 00:12:17.089 Then we don't need to keep two 00:12:17.089 --> 00:12:19.029 copies of that 00:12:19.029 --> 00:12:23.021 column because the values are always going to be equal. 00:12:23.021 --> 00:12:25.058 So the natural join operator 00:12:25.058 --> 00:12:28.012 is written using a bow 00:12:28.012 --> 00:12:30.066 tie, that's just the convention. 00:12:30.066 --> 00:12:34.004 You will find that in your text editing programs if you look carefully. 00:12:34.004 --> 00:12:37.005 So let's do some examples now. 00:12:37.005 --> 00:12:38.066 Let's go back to our same 00:12:38.066 --> 00:12:39.096 query where we were finding 00:12:39.096 --> 00:12:42.004 the names and GPAs of students 00:12:42.004 --> 00:12:45.039 from large high schools who applied to CS and were rejected. 00:12:45.039 --> 00:12:46.078 So now, instead of using 00:12:46.078 --> 00:12:47.089 the cross-product we're gonna 00:12:47.089 --> 00:12:51.095 use the natural join, which, as I said, was written with a bow tie. 00:12:51.095 --> 00:12:53.038 What that allows us to 00:12:53.038 --> 00:12:54.078 do, once we do that natural 00:12:54.078 --> 00:12:56.001 join, is we don't have 00:12:56.001 --> 00:12:57.064 to write that condition, that enforced 00:12:57.064 --> 00:12:59.023 equality on those two 00:12:59.023 --> 00:13:01.093 attributes, because it's going to do it itself. 00:13:01.093 --> 00:13:03.015 And once we have done that then 00:13:03.015 --> 00:13:04.004 all we need to do is 00:13:04.004 --> 00:13:05.005 apply the rest of our conditions, 00:13:05.005 --> 00:13:06.061 which were that the high school 00:13:06.061 --> 00:13:08.076 is greater than a thousand and the 00:13:08.076 --> 00:13:11.008 major is CS and the decision 00:13:11.008 --> 00:13:14.011 is reject, again we'll 00:13:14.011 --> 00:13:16.017 call that R. And then, 00:13:16.017 --> 00:13:17.036 since we're only getting the names 00:13:17.036 --> 00:13:19.098 and GPAs, we write the 00:13:19.098 --> 00:13:24.000 student name and the GPA. 00:13:24.000 --> 00:13:24.027 Okay. 00:13:24.027 --> 00:13:26.063 And that's the result of the query using a natural join. 00:13:26.063 --> 00:13:27.077 So, as you can see that's a 00:13:27.077 --> 00:13:29.009 little bit simpler than the original 00:13:29.009 --> 00:13:30.065 with the cross-product and by setting 00:13:30.065 --> 00:13:33.099 up schemas correctly, natural join can be very useful. 00:13:33.099 --> 00:13:37.001 Now let's add one more complication to our query. 00:13:37.001 --> 00:13:38.064 Let's suppose that we're only interested 00:13:38.064 --> 00:13:41.077 in applications to colleges where the enrollment is greater than 20,000. 00:13:41.077 --> 00:13:43.048 So, so far in 00:13:43.048 --> 00:13:44.007 our expression we refer to the 00:13:44.007 --> 00:13:46.011 student relation and the apply 00:13:46.011 --> 00:13:48.066 relation, but we haven't used the college relation. 00:13:48.066 --> 00:13:50.034 But if we want to have a 00:13:50.034 --> 00:13:51.061 filter on enrollment, we're going to have 00:13:51.061 --> 00:13:55.016 to bring the college relation into the picture. 00:13:55.016 --> 00:13:57.089 This turns out to perhaps be easier than you think. 00:13:57.089 --> 00:13:59.055 Let's just erase a couple 00:13:59.055 --> 00:14:01.065 of our parentheses here, and what 00:14:01.065 --> 00:14:02.071 we're going to do is we're 00:14:02.071 --> 00:14:04.021 going to join in the 00:14:04.021 --> 00:14:07.047 college relation, with the two relations we have already. 00:14:07.047 --> 00:14:11.049 Now, technically, the natural 00:14:11.049 --> 00:14:13.093 join is the binary operator, people 00:14:13.093 --> 00:14:15.051 often use it without parentheses 00:14:15.051 --> 00:14:16.008 because it's associative, but if 00:14:16.008 --> 00:14:20.012 we get pedantic about it we could add that and then we're in good shape. 00:14:20.012 --> 00:14:22.043 Now we've joined all three relations together. 00:14:22.043 --> 00:14:24.031 And remember, automatically the natural 00:14:24.031 --> 00:14:27.035 join enforces equality on the shared attributes. 00:14:27.035 --> 00:14:29.035 Very specifically, the college 00:14:29.035 --> 00:14:30.049 name here is going to 00:14:30.049 --> 00:14:33.008 be set equal to the apply college name as well. 00:14:33.008 --> 00:14:36.046 Now once we've done that, we've got all the information we need. 00:14:36.046 --> 00:14:37.061 We just need to add one 00:14:37.061 --> 00:14:39.002 more filtering condition, which is 00:14:39.002 --> 00:14:42.029 that the college enrollment is greater than 20,000. 00:14:42.029 --> 00:14:47.059 And with that, we've solved our query. 00:14:47.059 --> 00:14:48.093 So to summarize the 00:14:48.093 --> 00:14:52.025 natural join, the natural join combines relations. 00:14:52.025 --> 00:14:54.078 It automatically sets values equal 00:14:54.078 --> 00:14:55.098 when attribute names are the 00:14:55.098 --> 00:14:58.098 same and then it removes the duplicate columns. 00:14:58.098 --> 00:15:01.011 The natural join actually does not 00:15:01.011 --> 00:15:04.007 add any expressive power to relational algebra. 00:15:04.007 --> 00:15:08.069 We can rewrite the natural join without it using the cross-product. 00:15:08.069 --> 00:15:10.091 So let me just show that rewrite here. 00:15:10.091 --> 00:15:12.025 If we have, and now I'm 00:15:12.025 --> 00:15:14.049 going to use the general case of two expressions. 00:15:14.049 --> 00:15:17.033 One expression, natural join 00:15:17.033 --> 00:15:19.007 with another expression, that is 00:15:19.007 --> 00:15:21.075 actually equivalent to doing 00:15:21.075 --> 00:15:23.009 a projection on the 00:15:23.009 --> 00:15:25.086 schema of the first 00:15:25.086 --> 00:15:27.031 expression - I'll just 00:15:27.031 --> 00:15:29.005 call it E1 now - union 00:15:29.005 --> 00:15:30.092 the schema of the second expression. 00:15:30.092 --> 00:15:32.028 That's a real union, so that 00:15:32.028 --> 00:15:33.053 means if we have two copies we 00:15:33.053 --> 00:15:36.009 just keep one of them. Over 00:15:36.009 --> 00:15:39.004 the selection of. Now we're 00:15:39.004 --> 00:15:40.072 going to set all the shared attributes 00:15:40.072 --> 00:15:42.004 of the first expression to 00:15:42.004 --> 00:15:44.013 be equal to the shared attributes of the second. 00:15:44.013 --> 00:15:45.056 So I'll just write E1, A1 00:15:45.056 --> 00:15:48.028 equals E2, A1 00:15:48.028 --> 00:15:54.072 and E1, A2 equals E2 dot A2. 00:15:54.072 --> 00:15:57.005 Now these are the cases where, 00:15:57.005 --> 00:16:00.075 again, the attributes have the same names, and so on. 00:16:00.075 --> 00:16:02.053 So we're setting all those equal, 00:16:02.053 --> 00:16:04.084 and that is applied over 00:16:04.084 --> 00:16:07.083 expression one cross-product expression two. 00:16:07.083 --> 00:16:10.007 So again, the natural 00:16:10.007 --> 00:16:12.013 join is not giving us 00:16:12.013 --> 00:16:18.098 additional expressive power, but it is very convenient notationally. 00:16:18.098 --> 00:16:20.026 The last operator that I'm 00:16:20.026 --> 00:16:23.002 going to cover in this video is the theta join operator. 00:16:23.002 --> 00:16:24.005 Like natural join, theta join is 00:16:24.005 --> 00:16:28.008 actually an abbreviation that doesn't add expressive power to the language. 00:16:28.008 --> 00:16:28.083 Let me just write it. 00:16:28.083 --> 00:16:30.005 The theta join operator takes 00:16:30.005 --> 00:16:32.067 two expressions and combines them 00:16:32.067 --> 00:16:34.087 with the bow tie looking 00:16:34.087 --> 00:16:37.001 operator, but with a subscript theta. 00:16:37.001 --> 00:16:39.005 That theta is a condition. 00:16:39.005 --> 00:16:40.072 It's a condition in the style 00:16:40.072 --> 00:16:43.096 of the condition in the selection operator. 00:16:43.096 --> 00:16:45.072 And what this actually says 00:16:45.072 --> 00:16:47.031 - it's pretty simple - 00:16:47.031 --> 00:16:49.053 is it's equivalent to applying 00:16:49.053 --> 00:16:51.036 the theta condition to the 00:16:51.036 --> 00:16:53.067 cross-product of the two expressions. 00:16:53.067 --> 00:16:55.000 So you might wonder why 00:16:55.000 --> 00:16:56.033 I even mention the theta 00:16:56.033 --> 00:16:57.094 join operator, and the reason I 00:16:57.094 --> 00:16:59.056 mention it is that most 00:16:59.056 --> 00:17:01.077 database management systems implement the 00:17:01.077 --> 00:17:03.007 theta join as their basic 00:17:03.007 --> 00:17:05.022 operation for combining relations. 00:17:05.022 --> 00:17:07.006 So the basic operation is 00:17:07.006 --> 00:17:08.087 take two relations, combine all tuples, 00:17:08.087 --> 00:17:10.058 but then only keep the combinations 00:17:10.058 --> 00:17:12.057 that pass the theta condition. 00:17:12.057 --> 00:17:13.063 Often when you talk to 00:17:13.063 --> 00:17:15.019 people who build database systems or 00:17:15.019 --> 00:17:16.064 use databases, when they 00:17:16.064 --> 00:17:21.019 use the word join, they really mean the theta join. 00:17:21.019 --> 00:17:25.037 So, in conclusion, relational algebra is a formal language. 00:17:25.037 --> 00:17:27.004 It operates on sets of 00:17:27.004 --> 00:17:29.048 relations and produces relations as a result. 00:17:29.048 --> 00:17:30.093 The simplest query is just 00:17:30.093 --> 00:17:32.017 the name of a relation and 00:17:32.017 --> 00:17:33.067 then operators are used 00:17:33.067 --> 00:17:36.073 to filter relations, slice them, and combine them. 00:17:36.073 --> 00:17:37.098 So far, we've learned the 00:17:37.098 --> 00:17:40.000 select operator for selecting rows; 00:17:40.000 --> 00:17:41.078 the project operator for selecting 00:17:41.078 --> 00:17:43.058 columns; the cross-product 00:17:43.058 --> 00:17:45.084 operator for combining every possible 00:17:45.084 --> 00:17:46.086 pair of tuples from two 00:17:46.086 --> 00:17:48.029 relations; and then two 00:17:48.029 --> 00:17:50.002 abbreviations, the natural join, 00:17:50.002 --> 00:17:51.071 which a very useful way to combine 00:17:51.071 --> 00:17:53.096 relations by enforcing a equality 00:17:53.096 --> 00:17:56.097 on certain columns; and the theta join operator. 00:17:56.097 --> 00:17:58.019 In the next video, we'll learn 00:17:58.019 --> 00:18:00.005 some additional operators of relational 00:18:00.005 --> 00:18:01.058 algebra and also some alternative 00:18:01.058 --> 99:59:59.999 notations for relational algebra expressions.