WEBVTT 00:00:00.798 --> 00:00:02.875 If you remember that first decade of the web, 00:00:02.875 --> 00:00:05.076 it was really a static place. 00:00:05.076 --> 00:00:07.435 You could go online, you could look at pages, 00:00:07.435 --> 00:00:09.748 and they were put up either by organizations 00:00:09.748 --> 00:00:11.389 who had teams to do it 00:00:11.389 --> 00:00:13.748 or by individuals who were really tech-savvy 00:00:13.748 --> 00:00:15.235 for the time. 00:00:15.235 --> 00:00:16.810 And with the rise of social media 00:00:16.810 --> 00:00:19.599 and social networks in the early 2000s, 00:00:19.599 --> 00:00:21.358 the web was completely changed 00:00:21.358 --> 00:00:24.966 to a place where now the vast majority of content 00:00:24.966 --> 00:00:28.278 we interact with is put up by average users, 00:00:28.278 --> 00:00:30.855 either in YouTube videos or blog posts 00:00:30.855 --> 00:00:34.290 or product reviews or social media postings. 00:00:34.290 --> 00:00:36.837 And it's also become a much more interactive place, 00:00:36.837 --> 00:00:39.274 where people are interacting with others, 00:00:39.274 --> 00:00:40.970 they're commenting, they're sharing, 00:00:40.970 --> 00:00:42.754 they're not just reading. NOTE Paragraph 00:00:42.754 --> 00:00:44.450 So Facebook is not the only place you can do this, 00:00:44.450 --> 00:00:45.948 but it's the biggest 00:00:45.948 --> 00:00:47.642 and it serves to illustrate the numbers. 00:00:47.642 --> 00:00:50.979 Facebook has 1.2 billion users per month. 00:00:50.979 --> 00:00:52.979 So half the Earth's internet population 00:00:52.979 --> 00:00:54.632 is using Facebook. 00:00:54.632 --> 00:00:56.564 They are a site, along with others, 00:00:56.564 --> 00:00:59.633 that has allowed people to create an online persona 00:00:59.633 --> 00:01:01.385 with very little technical skill, 00:01:01.385 --> 00:01:03.801 and people responded by putting huge amounts 00:01:03.801 --> 00:01:05.784 of personal data online. 
00:01:05.784 --> 00:01:08.417 So the result is that we have behavioral, 00:01:08.417 --> 00:01:10.313 preference, demographic data 00:01:10.313 --> 00:01:12.414 for hundreds of millions of people, 00:01:12.414 --> 00:01:14.440 which is unprecedented in history. 00:01:14.440 --> 00:01:17.088 And as a computer scientist, what this means is that 00:01:17.088 --> 00:01:18.824 I've been able to build models 00:01:18.824 --> 00:01:21.216 that can predict all sorts of hidden attributes 00:01:21.216 --> 00:01:23.097 for all of you that you don't even know 00:01:23.097 --> 00:01:25.472 you're sharing information about. 00:01:25.472 --> 00:01:27.854 As scientists, we use that to help 00:01:27.854 --> 00:01:29.968 the way people interact online, 00:01:29.968 --> 00:01:32.607 but there's less altruistic applications, 00:01:32.607 --> 00:01:34.928 and there's a problem in that users don't really 00:01:34.928 --> 00:01:37.448 understand these techniques and how they work, 00:01:37.448 --> 00:01:40.656 and even if they did, they don't have a lot of control over it. 00:01:40.656 --> 00:01:42.226 So what I want to talk to you about today 00:01:42.226 --> 00:01:44.768 is some of these things that we're able to do, 00:01:44.768 --> 00:01:47.401 and then give us some ideas of how we might go forward 00:01:47.401 --> 00:01:50.280 to move some control back into the hands of users. NOTE Paragraph 00:01:50.280 --> 00:01:51.856 So this is Target, the company. 00:01:51.856 --> 00:01:53.320 I didn't just put that logo 00:01:53.320 --> 00:01:55.040 on this poor, pregnant woman's belly. 
00:01:55.040 --> 00:01:57.018 You may have seen this anecdote that was printed 00:01:57.018 --> 00:01:59.241 in Forbes Magazine where Target 00:01:59.241 --> 00:02:01.512 sent a flyer to this 15-year-old girl 00:02:01.512 --> 00:02:03.222 with advertisements and coupons 00:02:03.222 --> 00:02:05.776 for baby bottles and diapers and cribs 00:02:05.776 --> 00:02:07.580 two weeks before she told her parents 00:02:07.580 --> 00:02:09.464 that she was pregnant. 00:02:09.464 --> 00:02:11.688 Yeah, the dad was really upset. 00:02:11.688 --> 00:02:13.744 He said, "How did Target figure out 00:02:13.744 --> 00:02:15.568 that this high school girl was pregnant 00:02:15.568 --> 00:02:17.528 before she told her parents?" 00:02:17.528 --> 00:02:20.249 It turns out that they have the purchase history 00:02:20.249 --> 00:02:22.560 for hundreds of thousands of customers 00:02:22.560 --> 00:02:25.280 and they compute what they call a pregnancy score, 00:02:25.280 --> 00:02:27.592 which is not just whether or not a woman's pregnant, 00:02:27.592 --> 00:02:29.352 but what her due date is. 00:02:29.352 --> 00:02:30.936 And they compute that 00:02:30.936 --> 00:02:32.584 not by looking at, like, the obvious things, 00:02:32.584 --> 00:02:35.136 like, she's buying a crib or baby clothes, 00:02:35.136 --> 00:02:37.589 but things like, she bought more vitamins 00:02:37.589 --> 00:02:39.656 than she normally had, 00:02:39.656 --> 00:02:41.160 or she bought a handbag 00:02:41.160 --> 00:02:42.901 that's big enough to hold diapers. 00:02:42.901 --> 00:02:44.721 And by themselves, those purchases don't seem 00:02:44.721 --> 00:02:47.016 like they might reveal a lot, 00:02:47.016 --> 00:02:49.128 but it's a pattern of behavior that, 00:02:49.128 --> 00:02:52.275 when you take it in the context of other people, 00:02:52.275 --> 00:02:55.192 starts to actually reveal some insights. 
00:02:55.192 --> 00:02:56.865 So that's the kind of thing that we do 00:02:56.865 --> 00:02:59.672 when we're predicting stuff about you on social media. 00:02:59.672 --> 00:03:02.168 We're looking for little patterns of behavior that, 00:03:02.168 --> 00:03:04.880 when you detect them among millions of people, 00:03:04.880 --> 00:03:07.546 lets us find out all kinds of things. NOTE Paragraph 00:03:07.546 --> 00:03:09.273 So in my lab and with colleagues, 00:03:09.273 --> 00:03:11.048 we've developed mechanisms where we can 00:03:11.048 --> 00:03:12.840 quite accurately predict things 00:03:12.840 --> 00:03:14.585 like your political preference, 00:03:14.585 --> 00:03:18.227 your personality score, gender, sexual orientation, 00:03:18.227 --> 00:03:21.063 religion, age, intelligence, 00:03:21.063 --> 00:03:22.674 along with things like 00:03:22.674 --> 00:03:24.331 how much you trust the people you know 00:03:24.331 --> 00:03:26.195 and how strong those relationships are. 00:03:26.195 --> 00:03:28.001 We can do all of this really well. 00:03:28.001 --> 00:03:30.227 And again, it doesn't come from what you might 00:03:30.227 --> 00:03:32.139 think of as obvious information. NOTE Paragraph 00:03:32.139 --> 00:03:34.150 So my favorite example is from this study 00:03:34.150 --> 00:03:35.810 that was published this year 00:03:35.810 --> 00:03:37.745 in the Proceedings of the National Academies. 00:03:37.745 --> 00:03:39.330 If you Google this, you'll find it. 00:03:39.330 --> 00:03:40.852 It's four pages, easy to read. 00:03:40.852 --> 00:03:43.565 And they looked at just people's Facebook likes, 00:03:43.565 --> 00:03:45.485 so just the things you like on Facebook, 00:03:45.485 --> 00:03:47.623 and used that to predict all these attributes, 00:03:47.623 --> 00:03:49.268 along with some other ones. 00:03:49.268 --> 00:03:51.989 And in their paper they listed the five likes 00:03:51.989 --> 00:03:55.346 that were most indicative of high intelligence. 
00:03:55.346 --> 00:03:57.088 And among those was liking a page 00:03:57.088 --> 00:03:59.345 for curly fries. (Laughter) 00:03:59.345 --> 00:04:01.348 Curly fries are delicious, 00:04:01.348 --> 00:04:03.848 but liking them does not necessarily mean 00:04:03.848 --> 00:04:06.168 that you're smarter than the average person. 00:04:06.168 --> 00:04:09.375 So how is it that one of the strongest indicators 00:04:09.375 --> 00:04:10.905 of your intelligence 00:04:10.905 --> 00:04:12.312 is liking this page 00:04:12.312 --> 00:04:14.384 when the content is totally irrelevant 00:04:14.384 --> 00:04:16.991 to the attribute that's being predicted? 00:04:16.991 --> 00:04:18.735 And it turns out that we have to look at 00:04:18.735 --> 00:04:20.563 a whole bunch of underlying theories 00:04:20.563 --> 00:04:22.552 to see why we're able to do this. 00:04:22.552 --> 00:04:25.815 One of them is a sociological theory called homophily, 00:04:25.815 --> 00:04:28.727 which basically says people are friends with people like them. 00:04:28.727 --> 00:04:30.991 So if you're smart, you tend to be friends with smart people, 00:04:30.991 --> 00:04:33.471 and if you're young, you tend to be friends with young people, 00:04:33.471 --> 00:04:35.168 and this is well-established 00:04:35.168 --> 00:04:37.103 for hundreds of years. 00:04:37.103 --> 00:04:38.505 We also know a lot 00:04:38.505 --> 00:04:40.505 about how information spreads through networks. 00:04:40.505 --> 00:04:42.479 It turns out things like viral videos 00:04:42.479 --> 00:04:44.855 or Facebook likes or other information 00:04:44.855 --> 00:04:46.703 spreads in exactly the same way 00:04:46.703 --> 00:04:48.927 that diseases spread through social networks. 00:04:48.927 --> 00:04:50.918 So this is something we've studied for a long time. 00:04:50.918 --> 00:04:52.694 We have good models of it. 
00:04:52.694 --> 00:04:54.711 And so you can put those things together 00:04:54.711 --> 00:04:57.639 and start seeing why things like this happen. 00:04:57.639 --> 00:04:59.543 So if I were to give you a hypothesis, 00:04:59.543 --> 00:05:03.087 it would be that a smart guy started this page, 00:05:03.087 --> 00:05:04.799 or maybe one of the first people who liked it 00:05:04.799 --> 00:05:06.535 would have scored high on that test. 00:05:06.535 --> 00:05:08.783 And they liked it, and their friends saw it, 00:05:08.783 --> 00:05:11.895 and by homophily, we know that he probably had smart friends, 00:05:11.895 --> 00:05:14.841 and so it spread to them, and some of them liked it, 00:05:14.841 --> 00:05:16.040 and they had smart friends, 00:05:16.040 --> 00:05:17.257 and so it spread to them, 00:05:17.257 --> 00:05:19.017 and so it propagated through the network 00:05:19.017 --> 00:05:21.359 to kind of a host of smart people, 00:05:21.359 --> 00:05:23.415 so that by the end, the action 00:05:23.415 --> 00:05:26.169 of liking the curly fries page 00:05:26.169 --> 00:05:27.934 is indicative of high intelligence, 00:05:27.934 --> 00:05:29.727 not because of the content, 00:05:29.727 --> 00:05:31.959 but because the actual action of liking 00:05:31.959 --> 00:05:33.799 reflects back the common attributes 00:05:33.799 --> 00:05:36.347 of other people who have done it. NOTE Paragraph 00:05:36.347 --> 00:05:39.254 So this is pretty complicated stuff, right? 00:05:39.254 --> 00:05:41.313 It's a hard thing to sit down and explain 00:05:41.313 --> 00:05:43.831 to an average user, and even if you do, 00:05:43.831 --> 00:05:46.459 what can the average user do about it? 00:05:46.459 --> 00:05:48.447 How do you know that you've liked something 00:05:48.447 --> 00:05:50.119 that indicates a trait for you 00:05:50.119 --> 00:05:53.694 that's totally irrelevant to the content of what you've liked? 
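NOTE The hypothesis above (homophily plus disease-like spreading) can be illustrated with a small simulation. This is only a rough sketch under stated assumptions, not the method of the study mentioned in the talk: the function name `simulate_like_spread`, the Gaussian test scores, the friendship window, and the sharing probability are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def simulate_like_spread(n_people=2000, window=20, friends_each=6,
                         share_prob=0.3, rounds=10, seed=42):
    """Toy model: a 'like' spreads through a homophilous friendship network.

    All parameters are illustrative assumptions, not values from any study.
    Returns (mean score of likers, mean score of the whole population).
    """
    rng = random.Random(seed)

    # Hypothetical test scores, sorted so nearby indices have similar scores.
    scores = sorted(rng.gauss(100, 15) for _ in range(n_people))

    # Homophily: each person befriends people whose rank is close to their own.
    friends = [set() for _ in range(n_people)]
    for person in range(n_people):
        lo = max(0, person - window)
        hi = min(n_people - 1, person + window)
        candidates = [other for other in range(lo, hi + 1) if other != person]
        for other in rng.sample(candidates, friends_each):
            friends[person].add(other)
            friends[other].add(person)

    # Contagion: one high scorer likes the page first; in each round, friends
    # of the newest likers like it too with probability share_prob.
    first_liker = int(n_people * 0.95)
    likers = {first_liker}
    newly_liked = {first_liker}
    for _ in range(rounds):
        next_wave = set()
        for person in sorted(newly_liked):
            for friend in sorted(friends[person]):
                if friend not in likers and rng.random() < share_prob:
                    next_wave.add(friend)
        likers |= next_wave
        newly_liked = next_wave

    return mean(scores[p] for p in likers), mean(scores)


if __name__ == "__main__":
    liker_mean, population_mean = simulate_like_spread()
    print(f"mean score of likers:     {liker_mean:.1f}")
    print(f"mean score of population: {population_mean:.1f}")
```

In this toy model the page has no content at all, yet the people who like it score above the population average, simply because the like started with a high scorer and travelled along friendships between similar people.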
00:05:53.694 --> 00:05:56.058 There's a lot of power that users don't have 00:05:56.058 --> 00:05:58.026 to control how this data is used. 00:05:58.026 --> 00:06:01.662 And I see that as a real problem going forward. NOTE Paragraph 00:06:01.662 --> 00:06:03.489 So I think there's a couple of paths 00:06:03.489 --> 00:06:05.010 that we want to look at 00:06:05.010 --> 00:06:06.610 if we want to give users some control 00:06:06.610 --> 00:06:08.370 over how this data is used, 00:06:08.370 --> 00:06:10.010 because it's not always going to be used 00:06:10.010 --> 00:06:11.521 for their benefit. 00:06:11.521 --> 00:06:12.933 An example I often give is that, 00:06:12.933 --> 00:06:14.689 if I ever get bored being a professor, 00:06:14.689 --> 00:06:16.282 I'm going to go start a company 00:06:16.282 --> 00:06:17.826 that predicts all of these attributes 00:06:17.826 --> 00:06:19.498 and things like how well you work in teams 00:06:19.498 --> 00:06:21.769 and if you're a drug user, if you're an alcoholic. 00:06:21.769 --> 00:06:23.489 We know how to predict all that. 00:06:23.489 --> 00:06:25.050 And I'm going to sell reports 00:06:25.050 --> 00:06:27.023 to H.R. companies and big businesses 00:06:27.023 --> 00:06:29.343 that want to hire you. 00:06:29.343 --> 00:06:30.650 We totally can do that now. 00:06:30.650 --> 00:06:32.548 I could start that business tomorrow, 00:06:32.548 --> 00:06:34.450 and you would have absolutely no control 00:06:34.450 --> 00:06:36.778 over me using your data like that. 00:06:36.778 --> 00:06:39.010 That seems to me to be a problem. NOTE Paragraph 00:06:39.010 --> 00:06:40.850 So one of the paths we can go down 00:06:40.850 --> 00:06:42.962 is the policy and law path. 00:06:42.962 --> 00:06:45.778 And in some respects, I think that that would be most effective, 00:06:45.778 --> 00:06:48.314 but the problem is we'd actually have to do it. 
00:06:48.314 --> 00:06:51.314 Observing our political process in action 00:06:51.314 --> 00:06:53.693 makes me think it's highly unlikely 00:06:53.693 --> 00:06:55.290 that we're going to get a bunch of representatives 00:06:55.290 --> 00:06:57.376 to sit down, learn about this, 00:06:57.376 --> 00:06:59.382 and then enact sweeping changes 00:06:59.382 --> 00:07:01.539 to intellectual property law in the U.S. 00:07:01.539 --> 00:07:04.058 so users control their data. NOTE Paragraph 00:07:04.058 --> 00:07:05.594 We could go the policy route, 00:07:05.594 --> 00:07:07.373 where social media companies say, 00:07:07.373 --> 00:07:08.635 you know what? You own your data. 00:07:08.635 --> 00:07:10.674 You have total control over how it's used. 00:07:10.674 --> 00:07:12.522 The problem is that the revenue models 00:07:12.522 --> 00:07:14.246 for most social media companies 00:07:14.246 --> 00:07:18.277 rely on sharing or exploiting users' data in some way. 00:07:18.277 --> 00:07:20.110 It's sometimes said of Facebook that the users 00:07:20.110 --> 00:07:22.798 aren't the customer, they're the product. 00:07:22.798 --> 00:07:25.342 And so how do you get a company 00:07:25.342 --> 00:07:27.730 to cede control of their main asset 00:07:27.730 --> 00:07:29.319 back to the users? 00:07:29.319 --> 00:07:31.014 It's possible, but I don't think it's something 00:07:31.014 --> 00:07:33.390 that we're going to see change quickly. NOTE Paragraph 00:07:33.390 --> 00:07:35.054 So I think the other path 00:07:35.054 --> 00:07:36.918 that we can go down that's going to be more effective 00:07:36.918 --> 00:07:38.766 is one of more science. 00:07:38.766 --> 00:07:40.986 It's doing science that allowed us to develop 00:07:40.986 --> 00:07:42.846 all these mechanisms for computing 00:07:42.846 --> 00:07:44.958 this personal data in the first place. 
00:07:44.958 --> 00:07:46.894 And it's actually very similar research 00:07:46.894 --> 00:07:48.672 that we'd have to do 00:07:48.672 --> 00:07:50.718 if we want to develop mechanisms 00:07:50.718 --> 00:07:52.369 that can say to a user, 00:07:52.369 --> 00:07:54.728 "Here's the risk of that action you just took." 00:07:54.728 --> 00:07:56.668 You know, by liking that Facebook page, 00:07:56.668 --> 00:07:59.113 or by sharing this piece of personal information, 00:07:59.113 --> 00:08:00.665 you've now improved my ability 00:08:00.665 --> 00:08:02.681 to predict whether or not you're using drugs 00:08:02.681 --> 00:08:05.273 or whether or not you get along well in the workplace. 00:08:05.273 --> 00:08:07.241 And that, I think, can affect whether or not 00:08:07.241 --> 00:08:08.921 people want to share something, 00:08:08.921 --> 00:08:12.075 keep it private, or just keep it offline altogether. 00:08:12.075 --> 00:08:13.793 We can also look at things like 00:08:13.793 --> 00:08:16.321 allowing people to encrypt data that they upload, 00:08:16.321 --> 00:08:18.176 so it's kind of invisible and worthless 00:08:18.176 --> 00:08:19.817 to sites like Facebook 00:08:19.817 --> 00:08:22.236 or third-party services that access it, 00:08:22.236 --> 00:08:25.633 but that select users who the person who posted it 00:08:25.633 --> 00:08:28.473 want to see it have access to see it. 00:08:28.473 --> 00:08:30.369 This is all super-exciting research 00:08:30.369 --> 00:08:32.139 from an intellectual perspective, 00:08:32.139 --> 00:08:33.978 and so scientists are going to be willing to do it. 00:08:33.978 --> 00:08:37.668 So that gives us an advantage over the law side. 
NOTE Paragraph 00:08:37.668 --> 00:08:39.363 One of the problems that people bring up 00:08:39.363 --> 00:08:40.978 when I talk about this is, they say, 00:08:40.978 --> 00:08:43.614 you know, if people start keeping all this data private, 00:08:43.614 --> 00:08:45.487 all those methods that you've been developing 00:08:45.487 --> 00:08:48.014 to predict their traits are going to fail. 00:08:48.014 --> 00:08:51.910 And I say, absolutely, and for me, that's success, 00:08:51.910 --> 00:08:53.646 because as a scientist, 00:08:53.646 --> 00:08:56.894 my goal is not to infer information about users, 00:08:56.894 --> 00:08:59.901 it's to improve the way people interact online. 00:08:59.901 --> 00:09:03.119 And sometimes that involves inferring things about them, 00:09:03.119 --> 00:09:06.141 but if users don't want me to use that data, 00:09:06.141 --> 00:09:08.179 I think they should have the right to do that. 00:09:08.179 --> 00:09:10.830 I want users to be informed and consenting 00:09:10.830 --> 00:09:12.942 users of the tools that we develop. NOTE Paragraph 00:09:12.942 --> 00:09:15.894 And so I think encouraging this kind of science 00:09:15.894 --> 00:09:17.950 and supporting researchers 00:09:17.950 --> 00:09:20.263 who want to cede some of that control back to users 00:09:20.263 --> 00:09:22.574 and away from the social media companies 00:09:22.574 --> 00:09:25.245 means that going forward as these tools evolve 00:09:25.245 --> 00:09:26.721 and advance 00:09:26.721 --> 00:09:28.325 means that we're going to have an educated 00:09:28.325 --> 00:09:29.829 and empowered user base, 00:09:29.829 --> 00:09:31.129 and I think all of us can agree 00:09:31.129 --> 00:09:33.493 that that's a pretty ideal way to go forward. NOTE Paragraph 00:09:33.493 --> 00:09:35.677 Thank you. NOTE Paragraph 00:09:35.677 --> 00:09:38.757 (Applause)