If you remember that first decade of the web, it was really a static place. You could go online, you could look at pages, and they were put up either by organizations who had teams to do it or by individuals who were really tech-savvy for the time. With the rise of social media and social networks in the early 2000s, the web was completely changed to a place where now the vast majority of content we interact with is put up by average users, whether in YouTube videos or blog posts or product reviews or social media postings. And it's also become a much more interactive place, where people are interacting with others, commenting and sharing, not just reading.

Facebook is not the only place you can do this, but it's the biggest, and it serves to illustrate the numbers. Facebook has 1.2 billion users per month, so half the Earth's Internet population is using Facebook. They are a site, along with others, that has allowed people to create an online persona with very little technical skill, and people have responded by putting huge amounts of personal data online. The result is that we have behavioral, preference, and demographic data for hundreds of millions of people, which is unprecedented in history.

As a computer scientist, what this means is that I've been able to build models that can predict all sorts of hidden attributes for all of you, attributes you don't even know you're sharing information about. As scientists, we use that to improve the way people interact online, but there are less altruistic applications, and there's a problem in that users don't really understand these techniques and how they work, and even if they did, they don't have a lot of control over them. So what I want to talk to you about today is some of these things that we're able to do, and then give us some ideas of how we might go forward to move some control back into the hands of users.

So this is Target, the company.
I didn't just put that logo on this poor, pregnant woman's belly. You may have seen this anecdote, printed in Forbes magazine, where Target sent a flyer to a 15-year-old girl with advertisements and coupons for baby bottles and diapers and cribs two weeks before she told her parents that she was pregnant. Yeah, the dad was really upset. He said, "How did Target figure out that this high school girl was pregnant before she told her parents?"

It turns out that they have the purchase history for hundreds of thousands of customers, and they compute what they call a pregnancy score, which captures not just whether or not a woman is pregnant, but what her due date is. And they compute that not by looking at the obvious things, like whether she's buying a crib or baby clothes, but at things like whether she bought more vitamins than she normally did, or bought a handbag that's big enough to hold diapers. By themselves, those purchases don't seem like they reveal a lot, but it's a pattern of behavior that, when you take it in the context of thousands of other people, starts to actually reveal some insights.

That's the kind of thing we do when we're predicting stuff about you on social media. We're looking for little patterns of behavior that, when detected among millions of people, let us find out all kinds of things. So in my lab and with colleagues, we've developed mechanisms where we can quite accurately predict things like your political preference, your personality score, gender, sexual orientation, religion, age, and intelligence, along with things like how much you trust the people you know and how strong those relationships are. We can do all of this really well. And again, it doesn't come from what you might think of as obvious information. My favorite example is from a study published this year in the Proceedings of the National Academy of Sciences.
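To make the pattern-based scoring idea concrete, here is a minimal sketch of how a classifier can turn individually innocuous purchases into a telling score. This is not Target's actual model, which is proprietary; the feature names and training data below are invented purely for illustration.

```python
# A toy "pregnancy score": logistic regression over purchase features.
# The features and data are invented for illustration; the real model,
# its features, and its training data are proprietary.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one shopper; columns are individually innocuous signals:
# [unscented_lotion, vitamin_supplements, large_handbag, cotton_balls]
X_train = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
])
y_train = np.array([1, 1, 1, 0, 0, 0])  # 1 = later confirmed pregnant

model = LogisticRegression().fit(X_train, y_train)

# A new shopper who bought extra vitamins and a big handbag:
new_shopper = np.array([[0, 1, 1, 1]])
score = model.predict_proba(new_shopper)[0, 1]
print(f"pregnancy score: {score:.2f}")
```

No single feature proves anything; it's the weighted combination, learned from many shoppers' histories, that makes the weak signals add up.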
If you Google it, you'll find it; it's four pages, easy to read. They looked at just people's Facebook likes, so just the things you like on Facebook, and used that to predict all these attributes, along with some other ones. In their paper they listed the five likes that were most indicative of high intelligence, and among those was liking a page for curly fries. (Laughter)

Curly fries are delicious, but liking them does not necessarily mean that you're smarter than the average person. So how is it that one of the strongest indicators of your intelligence is liking this page, when the content is totally irrelevant to the attribute being predicted? It turns out that we have to look at a whole bunch of underlying theories to see why we're able to do this.

One of them is a sociological theory called homophily, which basically says that people are friends with people like them. So if you're smart, you tend to be friends with smart people, and if you're young, you tend to be friends with young people, and this has been well established for hundreds of years. We also know a lot about how information spreads through networks. It turns out that things like viral videos or Facebook likes or other information spread in exactly the same way that diseases spread through social networks. This is something we've studied for a long time; we have good models of it. So you can put those things together and start seeing why things like this happen.

If I were to give you a hypothesis, it would be that a smart guy started this page, or maybe one of the first people who liked it would have scored high on intelligence tests.
They liked it, and their friends saw it, and by homophily, we know they probably had smart friends, and so it spread to them, and some of them liked it, and they had smart friends, and so it spread to them, and so it propagated through the network to a host of smart people, so that by the end, the action of liking the curly fries page is indicative of high intelligence, not because of the content, but because the actual action of liking reflects back the common attributes of other people who have done it.

So this is pretty complicated stuff, right? It's a hard thing to sit down and explain to an average user, and even if you do, what can the average user do about it? How do you know that you've liked something that indicates a trait for you that's totally irrelevant to the content of what you've liked? Users have very little power to control how this data is used, and I see that as a real problem going forward.

So I think there are a couple of paths we want to look at if we want to give users some control over how this data is used, because it's not always going to be used for their benefit. An example I often give is that, if I ever get bored being a professor, I'm going to go start a company that predicts all of these attributes, and things like how well you work in teams and whether you're a drug user or an alcoholic. We know how to predict all that. And I'm going to sell reports to H.R. companies and big businesses that want to hire you. We totally can do that now. I could start that business tomorrow, and you would have absolutely no control over me using your data like that. That seems to me to be a problem.

So one of the paths we can go down is the policy and law path. In some respects, I think that would be the most effective, but the problem is we'd actually have to do it.
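Before weighing those paths, here is a toy simulation of the curly-fries dynamic described above. It is a minimal sketch over an invented network, with all numbers made up: smart people preferentially befriend smart people, a like starts at a smart seed and spreads friend to friend, and by the end the likers score well above average even though the page's content is irrelevant.

```python
# Toy homophily + diffusion: why likers of an arbitrary page can end up
# smarter than average. All numbers are invented for illustration.
import random

random.seed(0)
N = 10_000
iq = [random.gauss(100, 15) for _ in range(N)]

# Homophilous network: each person befriends people with similar IQ.
people = sorted(range(N), key=lambda i: iq[i])
friends = {p: set() for p in range(N)}
for idx, p in enumerate(people):
    for j in range(max(0, idx - 5), min(N, idx + 6)):  # neighbors in IQ order
        q = people[j]
        if q != p:
            friends[p].add(q)

# Seed the "curly fries" page with one smart person, then let the like
# spread: each friend of a liker likes the page with some probability.
seed = people[-1]           # a high-IQ person starts the page
liked = {seed}
frontier = [seed]
for _ in range(8):          # a few rounds of sharing
    next_frontier = []
    for p in frontier:
        for q in friends[p]:
            if q not in liked and random.random() < 0.5:
                liked.add(q)
                next_frontier.append(q)
    frontier = next_frontier

mean_all = sum(iq) / N
mean_likers = sum(iq[p] for p in liked) / len(liked)
print(f"population mean IQ: {mean_all:.1f}, likers' mean IQ: {mean_likers:.1f}")
```

The like carries information only because of who tends to perform it, which is exactly why the signal survives even when the content is meaningless.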
As for policy and law: observing our political process in action makes me think it's highly unlikely that we're going to get a bunch of representatives to sit down, learn about this, and then enact sweeping changes to intellectual property law in the U.S. so that users control their data.

We could go the policy route, where social media companies say: you know what? You own your data. You have total control over how it's used. The problem is that the revenue models for most social media companies rely on sharing or exploiting users' data in some way. It's sometimes said of Facebook that the users aren't the customer, they're the product. So how do you get a company to cede control of their main asset back to the users? It's possible, but I don't think it's something we're going to see change quickly.

So I think the other path we can go down, one that's going to be more effective, is more science. It's doing science that allowed us to develop all these mechanisms for computing this personal data in the first place. And it's actually very similar research that we'd have to do if we want to develop mechanisms that can say to a user, "Here's the risk of that action you just took." By liking that Facebook page, or by sharing this piece of personal information, you've now improved my ability to predict whether or not you're using drugs, or whether or not you get along well in the workplace. And that, I think, can affect whether people want to share something, keep it private, or just keep it offline altogether.

We can also look at things like allowing people to encrypt the data they upload, so it's invisible and worthless to sites like Facebook or to third-party services that access it, but the select people whom the poster wants to see it can still see it.
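As a sketch of that last idea, here is one way upload-side encryption could work, using the PyNaCl library's sealed boxes. The recipient list and key handling below are assumptions for illustration, not any platform's actual scheme: the site stores only ciphertext it cannot mine, while each chosen friend decrypts with their own private key.

```python
# Encrypt a post so only chosen friends can read it; the platform
# stores ciphertext it cannot use. Sketch only: a real system would
# also need key distribution, revocation, and metadata protection.
from nacl.public import PrivateKey, SealedBox

post = b"My actual personal update"

# Each friend has a keypair; in reality their public keys would be
# fetched from some directory, not generated on the spot like this.
friends = {name: PrivateKey.generate() for name in ["alice", "bob"]}

# Upload one sealed copy per chosen recipient.
uploaded = {
    name: SealedBox(key.public_key).encrypt(post)
    for name, key in friends.items()
}

# The platform (or a third-party app) sees only opaque bytes:
print(uploaded["alice"][:16].hex(), "...")

# A chosen friend decrypts with their own private key:
print(SealedBox(friends["alice"]).decrypt(uploaded["alice"]))
```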
This is all super exciting research from an intellectual perspective, so scientists are going to be willing to do it. That gives us an advantage over the law side.

One of the problems people bring up when I talk about this is, they say: you know, if people start keeping all this data private, all those methods you've been developing to predict their traits are going to fail. And I say, absolutely, and for me, that's success, because as a scientist, my goal is not to infer information about users; it's to improve the way people interact online. Sometimes that involves inferring things about them, but if users don't want me to use that data, I think they should have that right. I want users to be informed and consenting users of the tools that we develop.

And so I think that encouraging this kind of science, and supporting researchers who want to cede some of that control back to users and away from the social media companies, means that going forward, as these tools evolve and advance, we're going to have an educated and empowered user base. And I think all of us can agree that that's a pretty ideal way to go forward.

Thank you.

(Applause)