WEBVTT

00:00:00.738 --> 00:00:02.735
If you remember that first decade of the web,

00:00:02.735 --> 00:00:04.990
it was really a static place.

00:00:04.990 --> 00:00:07.235
You could go online, you could look at pages,

00:00:07.235 --> 00:00:09.748
and they were put up either by organizations

00:00:09.748 --> 00:00:11.269
who had teams to do it

00:00:11.269 --> 00:00:13.498
or by individuals who were really tech-savvy

00:00:13.498 --> 00:00:15.235
for the time.

00:00:15.235 --> 00:00:16.810
And with the rise of social media

00:00:16.810 --> 00:00:19.209
and social networks in the early 2000s,

00:00:19.209 --> 00:00:21.358
the web was completely changed

00:00:21.358 --> 00:00:24.966
to a place where now the vast majority of content

00:00:24.966 --> 00:00:28.278
we interact with is put up by average users,

00:00:28.278 --> 00:00:30.975
either in YouTube videos or blog posts

00:00:30.975 --> 00:00:34.290
or product reviews or social media postings.

00:00:34.290 --> 00:00:36.637
And it's also become a much more interactive place,

00:00:36.637 --> 00:00:39.274
where people are interacting with others,

00:00:39.274 --> 00:00:40.970
they're commenting, they're sharing,

00:00:40.970 --> 00:00:42.584
they're not just reading.

NOTE Paragraph

00:00:42.584 --> 00:00:44.450
So Facebook is not the only place you can do this,

00:00:44.450 --> 00:00:45.548
but it's the biggest,

00:00:45.548 --> 00:00:47.332
and it serves to illustrate the numbers.

00:00:47.332 --> 00:00:50.809
Facebook has 1.2 billion users per month.

00:00:50.809 --> 00:00:52.739
So half the Earth's Internet population

00:00:52.739 --> 00:00:54.392
is using Facebook.

00:00:54.392 --> 00:00:56.324
They are a site, along with others,

00:00:56.324 --> 00:00:59.543
that has allowed people to create an online persona

00:00:59.543 --> 00:01:01.325
with very little technical skill,

00:01:01.325 --> 00:01:03.801
and people responded by putting huge amounts

00:01:03.801 --> 00:01:05.784
of personal data online.

00:01:05.784 --> 00:01:08.327
So the result is that we have behavioral,

00:01:08.327 --> 00:01:10.313
preference, demographic data

00:01:10.313 --> 00:01:12.414
for hundreds of millions of people,

00:01:12.414 --> 00:01:14.440
which is unprecedented in history.

00:01:14.440 --> 00:01:17.000
And as a computer scientist, what this means is that

00:01:17.000 --> 00:01:18.664
I've been able to build models

00:01:18.664 --> 00:01:20.986
that can predict all sorts of hidden attributes

00:01:20.986 --> 00:01:23.270
for all of you that you don't even know

00:01:23.270 --> 00:01:25.472
you're sharing information about.

00:01:25.472 --> 00:01:27.854
As scientists, we use that to help

00:01:27.854 --> 00:01:29.968
the way people interact online,

00:01:29.968 --> 00:01:32.467
but there's less altruistic applications,

00:01:32.467 --> 00:01:34.848
and there's a problem in that users don't really

00:01:34.848 --> 00:01:37.318
understand these techniques and how they work,

00:01:37.318 --> 00:01:40.446
and even if they did, they don't have a lot of control over it.

00:01:40.446 --> 00:01:41.936
So what I want to talk to you about today

00:01:41.936 --> 00:01:44.638
is some of these things that we're able to do,

00:01:44.638 --> 00:01:47.401
and then give us some ideas of how we might go forward

00:01:47.401 --> 00:01:50.170
to move some control back into the hands of users.

NOTE Paragraph

00:01:50.170 --> 00:01:51.756
So this is Target, the company.

00:01:51.756 --> 00:01:53.080
I didn't just put that logo

00:01:53.080 --> 00:01:55.250
on this poor, pregnant woman's belly.

00:01:55.250 --> 00:01:57.090
You may have seen this anecdote that was printed

00:01:57.090 --> 00:01:59.151
in Forbes magazine where Target

00:01:59.151 --> 00:02:01.512
sent a flyer to this 15-year-old girl

00:02:01.512 --> 00:02:03.222
with advertisements and coupons

00:02:03.222 --> 00:02:05.776
for baby bottles and diapers and cribs

00:02:05.776 --> 00:02:07.460
two weeks before she told her parents

00:02:07.460 --> 00:02:09.324
that she was pregnant.

00:02:09.324 --> 00:02:12.028
Yeah, the dad was really upset.

00:02:12.028 --> 00:02:13.744
He said, "How did Target figure out

00:02:13.744 --> 00:02:15.568
that this high school girl was pregnant

00:02:15.568 --> 00:02:17.528
before she told her parents?"

00:02:17.528 --> 00:02:20.149
It turns out that they have the purchase history

00:02:20.149 --> 00:02:22.450
for hundreds of thousands of customers

00:02:22.450 --> 00:02:25.180
and they compute what they call a pregnancy score,

00:02:25.180 --> 00:02:27.512
which is not just whether or not a woman's pregnant,

00:02:27.512 --> 00:02:29.242
but what her due date is.

00:02:29.242 --> 00:02:30.546
And they compute that

00:02:30.546 --> 00:02:32.314
not by looking at the obvious things,

00:02:32.314 --> 00:02:34.826
like, she's buying a crib or baby clothes,

00:02:34.826 --> 00:02:37.769
but things like, she bought more vitamins

00:02:37.769 --> 00:02:39.486
than she normally had,

00:02:39.486 --> 00:02:40.950
or she bought a handbag

00:02:40.950 --> 00:02:42.661
that's big enough to hold diapers.

00:02:42.661 --> 00:02:44.571
And by themselves, those purchases don't seem

00:02:44.571 --> 00:02:47.040
like they might reveal a lot,

00:02:47.040 --> 00:02:49.018
but it's a pattern of behavior that,

00:02:49.018 --> 00:02:52.135
when you take it in the context of thousands of other people,

00:02:52.135 --> 00:02:54.892
starts to actually reveal some insights.

00:02:54.892 --> 00:02:56.685
So that's the kind of thing that we do

00:02:56.685 --> 00:02:59.252
when we're predicting stuff about you on social media.

00:02:59.252 --> 00:03:02.048
We're looking for little patterns of behavior that,

00:03:02.048 --> 00:03:04.730
when you detect them among millions of people,

00:03:04.730 --> 00:03:07.436
lets us find out all kinds of things.

NOTE Paragraph

00:03:07.436 --> 00:03:09.183
So in my lab and with colleagues,

00:03:09.183 --> 00:03:10.960
we've developed mechanisms where we can

00:03:10.960 --> 00:03:12.520
quite accurately predict things

00:03:12.520 --> 00:03:14.245
like your political preference,

00:03:14.245 --> 00:03:17.997
your personality score, gender, sexual orientation,

00:03:17.997 --> 00:03:20.870
religion, age, intelligence,

00:03:20.870 --> 00:03:22.264
along with things like

00:03:22.264 --> 00:03:24.201
how much you trust the people you know

00:03:24.201 --> 00:03:26.005
and how strong those relationships are.

00:03:26.005 --> 00:03:27.790
We can do all of this really well.

00:03:27.790 --> 00:03:29.987
And again, it doesn't come from what you might

00:03:29.987 --> 00:03:32.089
think of as obvious information.

NOTE Paragraph

00:03:32.089 --> 00:03:34.370
So my favorite example is from this study

00:03:34.370 --> 00:03:35.610
that was published this year

00:03:35.610 --> 00:03:37.405
in the Proceedings of the National Academies.

00:03:37.405 --> 00:03:38.690
If you Google this, you'll find it.

00:03:38.690 --> 00:03:40.562
It's four pages, easy to read.

00:03:40.562 --> 00:03:43.565
And they looked at just people's Facebook likes,

00:03:43.565 --> 00:03:45.485
so just the things you like on Facebook,

00:03:45.485 --> 00:03:47.623
and used that to predict all these attributes,

00:03:47.623 --> 00:03:49.268
along with some other ones.

00:03:49.268 --> 00:03:52.229
And in their paper they listed the five likes

00:03:52.229 --> 00:03:55.016
that were most indicative of high intelligence.

00:03:55.016 --> 00:03:57.340
And among those was liking a page

00:03:57.340 --> 00:03:59.245
for curly fries. (Laughter)

00:03:59.245 --> 00:04:01.338
Curly fries are delicious,

00:04:01.338 --> 00:04:03.868
but liking them does not necessarily mean

00:04:03.868 --> 00:04:05.948
that you're smarter than the average person.

00:04:05.948 --> 00:04:09.155
So how is it that one of the strongest indicators

00:04:09.155 --> 00:04:10.725
of your intelligence

00:04:10.725 --> 00:04:12.172
is liking this page

00:04:12.172 --> 00:04:14.424
when the content is totally irrelevant

00:04:14.424 --> 00:04:16.951
to the attribute that's being predicted?

00:04:16.951 --> 00:04:18.535
And it turns out that we have to look at

00:04:18.535 --> 00:04:20.153
a whole bunch of underlying theories

00:04:20.153 --> 00:04:22.722
to see why we're able to do this.

00:04:22.722 --> 00:04:25.635
One of them is a sociological theory called homophily,

00:04:25.635 --> 00:04:28.727
which basically says people are friends with people like them.

00:04:28.727 --> 00:04:30.741
So if you're smart, you tend to be friends with smart people,

00:04:30.741 --> 00:04:33.371
and if you're young, you tend to be friends with young people,

00:04:33.371 --> 00:04:34.998
and this is well established

00:04:34.998 --> 00:04:36.743
for hundreds of years.

00:04:36.743 --> 00:04:37.975
We also know a lot

00:04:37.975 --> 00:04:40.525
about how information spreads through networks.

00:04:40.525 --> 00:04:42.279
It turns out things like viral videos

00:04:42.279 --> 00:04:44.685
or Facebook likes or other information

00:04:44.685 --> 00:04:46.573
spreads in exactly the same way

00:04:46.573 --> 00:04:49.027
that diseases spread through social networks.

00:04:49.027 --> 00:04:50.818
So this is something we've studied for a long time.

00:04:50.818 --> 00:04:52.394
We have good models of it.

00:04:52.394 --> 00:04:54.551
And so you can put those things together

00:04:54.551 --> 00:04:57.639
and start seeing why things like this happen.

00:04:57.639 --> 00:04:59.453
So if I were to give you a hypothesis,

00:04:59.453 --> 00:05:02.680
it would be that a smart guy started this page,

00:05:02.680 --> 00:05:04.619
or maybe one of the first people who liked it

00:05:04.619 --> 00:05:06.355
would have scored high on that test.

00:05:06.355 --> 00:05:08.643
And they liked it, and their friends saw it,

00:05:08.643 --> 00:05:11.765
and by homophily, we know that he probably had smart friends,

00:05:11.765 --> 00:05:14.821
and so it spread to them, and some of them liked it,

00:05:14.821 --> 00:05:16.010
and they had smart friends,

00:05:16.010 --> 00:05:16.817
and so it spread to them,

00:05:16.817 --> 00:05:18.790
and so it propagated through the network

00:05:18.790 --> 00:05:21.359
to a host of smart people,

00:05:21.359 --> 00:05:23.415
so that by the end, the action

00:05:23.415 --> 00:05:25.959
of liking the curly fries page

00:05:25.959 --> 00:05:27.574
is indicative of high intelligence,

00:05:27.574 --> 00:05:29.377
not because of the content,

00:05:29.377 --> 00:05:31.899
but because the actual action of liking

00:05:31.899 --> 00:05:33.799
reflects back the common attributes

00:05:33.799 --> 00:05:36.267
of other people who have done it.

NOTE Paragraph

00:05:36.267 --> 00:05:39.164
So this is pretty complicated stuff, right?

00:05:39.164 --> 00:05:41.363
It's a hard thing to sit down and explain

00:05:41.363 --> 00:05:44.211
to an average user, and even if you do,

00:05:44.211 --> 00:05:46.399
what can the average user do about it?

00:05:46.399 --> 00:05:48.447
How do you know that you've liked something

00:05:48.447 --> 00:05:49.939
that indicates a trait for you

00:05:49.939 --> 00:05:53.484
that's totally irrelevant to the content of what you've liked?

00:05:53.484 --> 00:05:56.030
There's a lot of power that users don't have

00:05:56.030 --> 00:05:58.260
to control how this data is used.

00:05:58.260 --> 00:06:01.372
And I see that as a real problem going forward.

NOTE Paragraph

00:06:01.372 --> 00:06:03.349
So I think there's a couple paths

00:06:03.349 --> 00:06:04.350
that we want to look at

00:06:04.350 --> 00:06:06.260
if we want to give users some control

00:06:06.260 --> 00:06:08.000
over how this data is used,

00:06:08.000 --> 00:06:09.940
because it's not always going to be used

00:06:09.940 --> 00:06:11.321
for their benefit.

00:06:11.321 --> 00:06:12.743
An example I often give is that,

00:06:12.743 --> 00:06:14.389
if I ever get bored being a professor,

00:06:14.389 --> 00:06:16.042
I'm going to go start a company

00:06:16.042 --> 00:06:17.496
that predicts all of these attributes

00:06:17.496 --> 00:06:19.098
and things like how well you work in teams

00:06:19.098 --> 00:06:21.769
and if you're a drug user, if you're an alcoholic.

00:06:21.769 --> 00:06:23.209
We know how to predict all that.

00:06:23.209 --> 00:06:24.970
And I'm going to sell reports

00:06:24.970 --> 00:06:27.070
to H.R. companies and big businesses

00:06:27.070 --> 00:06:29.343
that want to hire you.

00:06:29.343 --> 00:06:30.520
We totally can do that now.

00:06:30.520 --> 00:06:32.308
I could start that business tomorrow,

00:06:32.308 --> 00:06:34.360
and you would have absolutely no control

00:06:34.360 --> 00:06:36.498
over me using your data like that.

00:06:36.498 --> 00:06:38.790
That seems to me to be a problem.

NOTE Paragraph

00:06:38.790 --> 00:06:40.700
So one of the paths we can go down

00:06:40.700 --> 00:06:42.732
is the policy and law path.

00:06:42.732 --> 00:06:45.778
And in some respects, I think that that would be most effective,

00:06:45.778 --> 00:06:48.534
but the problem is we'd actually have to do it.

00:06:48.534 --> 00:06:51.314
Observing our political process in action

00:06:51.314 --> 00:06:53.693
makes me think it's highly unlikely

00:06:53.693 --> 00:06:55.290
that we're going to get a bunch of representatives

00:06:55.290 --> 00:06:57.276
to sit down, learn about this,

00:06:57.276 --> 00:06:59.382
and then enact sweeping changes

00:06:59.382 --> 00:07:01.539
to intellectual property law in the U.S.

00:07:01.539 --> 00:07:04.000
so users control their data.

NOTE Paragraph

00:07:04.000 --> 00:07:05.304
We could go the policy route,

00:07:05.304 --> 00:07:06.783
where social media companies say,

00:07:06.783 --> 00:07:08.185
you know what? You own your data.

00:07:08.185 --> 00:07:10.674
You have total control over how it's used.

00:07:10.674 --> 00:07:12.522
The problem is that the revenue models

00:07:12.522 --> 00:07:14.246
for most social media companies

00:07:14.246 --> 00:07:18.277
rely on sharing or exploiting users' data in some way.

00:07:18.277 --> 00:07:20.110
It's sometimes said of Facebook that the users

00:07:20.110 --> 00:07:22.638
aren't the customer, they're the product.

00:07:22.638 --> 00:07:25.352
And so how do you get a company

00:07:25.352 --> 00:07:27.910
to cede control of their main asset

00:07:27.910 --> 00:07:29.159
back to the users?

00:07:29.159 --> 00:07:30.860
It's possible, but I don't think it's something

00:07:30.860 --> 00:07:33.180
that we're going to see change quickly.

NOTE Paragraph

00:07:33.180 --> 00:07:34.680
So I think the other path

00:07:34.680 --> 00:07:36.968
that we can go down that's going to be more effective

00:07:36.968 --> 00:07:38.476
is one of more science.

00:07:38.476 --> 00:07:40.986
It's doing science that allowed us to develop

00:07:40.986 --> 00:07:42.736
all these mechanisms for computing

00:07:42.736 --> 00:07:44.788
this personal data in the first place.

00:07:44.788 --> 00:07:46.894
And it's actually very similar research

00:07:46.894 --> 00:07:48.332
that we'd have to do

00:07:48.332 --> 00:07:50.718
if we want to develop mechanisms

00:07:50.718 --> 00:07:52.139
that can say to a user,

00:07:52.139 --> 00:07:54.368
"Here's the risk of that action you just took."

00:07:54.368 --> 00:07:56.448
By liking that Facebook page,

00:07:56.448 --> 00:07:58.983
or by sharing this piece of personal information,

00:07:58.983 --> 00:08:00.485
you've now improved my ability

00:08:00.485 --> 00:08:02.571
to predict whether or not you're using drugs

00:08:02.571 --> 00:08:05.433
or whether or not you get along well in the workplace.

00:08:05.433 --> 00:08:07.281
And that, I think, can affect whether or not

00:08:07.281 --> 00:08:08.791
people want to share something,

00:08:08.791 --> 00:08:12.030
keep it private, or just keep it offline altogether.

00:08:12.030 --> 00:08:13.593
We can also look at things like

00:08:13.593 --> 00:08:16.321
allowing people to encrypt data that they upload,

00:08:16.321 --> 00:08:18.176
so it's kind of invisible and worthless

00:08:18.176 --> 00:08:19.607
to sites like Facebook

00:08:19.607 --> 00:08:22.236
or third party services that access it,

00:08:22.236 --> 00:08:25.483
but that select users who the person who posted it

00:08:25.483 --> 00:08:28.153
want to see it have access to see it.

00:08:28.153 --> 00:08:30.319
This is all super exciting research

00:08:30.319 --> 00:08:31.939
from an intellectual perspective,

00:08:31.939 --> 00:08:33.798
and so scientists are going to be willing to do it.

00:08:33.798 --> 00:08:37.408
So that gives us an advantage over the law side.

NOTE Paragraph

00:08:37.408 --> 00:08:39.133
One of the problems that people bring up

00:08:39.133 --> 00:08:40.728
when I talk about this is, they say,

00:08:40.728 --> 00:08:43.374
you know, if people start keeping all this data private,

00:08:43.374 --> 00:08:45.487
all those methods that you've been developing

00:08:45.487 --> 00:08:48.140
to predict their traits are going to fail.

00:08:48.140 --> 00:08:51.660
And I say, absolutely, and for me, that's success,

00:08:51.660 --> 00:08:53.446
because as a scientist,

00:08:53.446 --> 00:08:57.134
my goal is not to infer information about users,

00:08:57.134 --> 00:08:59.901
it's to improve the way people interact online.

00:08:59.901 --> 00:09:03.119
And sometimes that involves inferring things about them,

00:09:03.119 --> 00:09:06.141
but if users don't want me to use that data,

00:09:06.141 --> 00:09:08.179
I think they should have the right to do that.

00:09:08.179 --> 00:09:10.830
I want users to be informed and consenting

00:09:10.830 --> 00:09:12.942
users of the tools that we develop.

NOTE Paragraph

00:09:12.942 --> 00:09:15.894
And so I think encouraging this kind of science

00:09:15.894 --> 00:09:17.240
and supporting researchers

00:09:17.240 --> 00:09:20.263
who want to cede some of that control back to users

00:09:20.263 --> 00:09:22.574
and away from the social media companies

00:09:22.574 --> 00:09:25.245
means that going forward, as these tools evolve

00:09:25.245 --> 00:09:26.721
and advance,

00:09:26.721 --> 00:09:28.135
means that we're going to have an educated

00:09:28.135 --> 00:09:29.829
and empowered user base,

00:09:29.829 --> 00:09:30.929
and I think all of us can agree

00:09:30.929 --> 00:09:33.493
that that's a pretty ideal way to go forward.

NOTE Paragraph

00:09:33.493 --> 00:09:35.677
Thank you.

NOTE Paragraph

00:09:35.677 --> 00:09:38.757
(Applause)