America's favorite pie is?

Audience: Apple.

Kenneth Cukier: Apple. Of course it is. How do we know it? Because of data.

You look at supermarket sales. You look at supermarket sales of 30-centimeter pies that are frozen, and apple wins, no contest. The majority of the sales are apple.

But then supermarkets started selling smaller, 11-centimeter pies, and suddenly, apple fell to fourth or fifth place. Why? What happened?

Okay, think about it. When you buy a 30-centimeter pie, the whole family has to agree, and apple is everyone's second favorite. (Laughter)

But when you buy an individual 11-centimeter pie, you can buy the one that you want. You can get your first choice. You have more data. You can see something that you couldn't see when you only had smaller amounts of it.

Now, the point here is that more data doesn't just let us see more, more of the same thing we were looking at. More data allows us to see new. It allows us to see better. It allows us to see different. In this case, it allows us to see what America's favorite pie is: not apple.

Now, you probably all have heard the term big data.
In fact, you're probably sick of hearing the term big data. It is true that there is a lot of hype around the term, and that is very unfortunate, because big data is an extremely important tool by which society is going to advance.

In the past, we used to look at small data and think about what it would mean to try to understand the world, and now we have a lot more of it, more than we ever could before. What we find is that when we have a large body of data, we can fundamentally do things that we couldn't do when we only had smaller amounts.

Big data is important, and big data is new, and when you think about it, the only way this planet is going to deal with its global challenges — to feed people, supply them with medical care, supply them with energy, electricity, and to make sure they're not burnt to a crisp because of global warming — is through the effective use of data.

So what is new about big data? What is the big deal? Well, to answer that question, let's think about what information looked like, physically looked like, in the past.

In 1908, on the island of Crete, archaeologists discovered a clay disc. They dated it from 2000 B.C., so it's 4,000 years old.
Now, there are inscriptions on this disc, but we actually don't know what it means. It's a complete mystery, but the point is that this is what information used to look like 4,000 years ago. This is how society stored and transmitted information.

Now, society hasn't advanced all that much. We still store information on discs, but now we can store a lot more information, more than ever before. Searching it is easier. Copying it is easier. Sharing it is easier. Processing it is easier. And what we can do is we can reuse this information for uses that we never even imagined when we first collected the data.

In this respect, the data has gone from a stock to a flow, from something that is stationary and static to something that is fluid and dynamic. There is, if you will, a liquidity to information.

The disc that was discovered off of Crete that's 4,000 years old is heavy, it doesn't store a lot of information, and that information is unchangeable. By contrast, all of the files that Edward Snowden took from the National Security Agency in the United States fit on a memory stick the size of a fingernail, and they can be shared at the speed of light.

More data. More.
Now, one reason why we have so much data in the world today is we are collecting things that we've always collected information on, but another reason why is we're taking things that have always been informational but have never been rendered into a data format, and we are putting it into data.

Think, for example, of the question of location. Take, for example, Martin Luther. If we wanted to know in the 1500s where Martin Luther was, we would have to follow him at all times, maybe with a feathery quill and an inkwell, and record it. But now think about what it looks like today. You know that somewhere, probably in a telecommunications carrier's database, there is a spreadsheet or at least a database entry that records where you've been at all times. If you have a cell phone, and that cell phone has GPS — but even if it doesn't have GPS — it can record your information. In this respect, location has been datafied.

Now think, for example, of the issue of posture, the way that you are all sitting right now, the way that you sit, the way that you sit, the way that you sit.
It's all different, and it's a function of your leg length and your back and the contours of your back, and if I were to put sensors, maybe 100 sensors, into all of your chairs right now, I could create an index that's fairly unique to you, sort of like a fingerprint, but it's not your finger.

So what could we do with this? Researchers in Tokyo are using it as a potential anti-theft device in cars. The idea is that the carjacker sits behind the wheel, tries to speed off, but the car recognizes that a non-approved driver is behind the wheel, and maybe the engine just stops, unless you type in a password into the dashboard to say, "Hey, I have authorization to drive." Great.

What if every single car in Europe had this technology in it? What could we do then? Maybe, if we aggregated the data, maybe we could identify telltale signs that best predict that a car accident is going to take place in the next five seconds. And then what we will have datafied is driver fatigue, and the service would be that when the car senses that the person slumps into that position, it automatically knows, hey, to set an internal alarm that would vibrate the steering wheel, honk inside to say, "Hey, wake up, pay more attention to the road."
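The posture-as-fingerprint idea can be sketched as a nearest-profile lookup: compare a fresh reading from the seat sensors against each enrolled driver's stored profile, and refuse to start the engine if nobody matches closely enough. This is a toy illustration, not the Tokyo researchers' actual system — the sensor count, names, and threshold are all invented:

```python
import math

# Hypothetical enrolled posture profiles. The talk imagines ~100 pressure
# sensors per seat; four readings per driver keep the sketch short.
ENROLLED = {
    "alice": [0.9, 0.2, 0.7, 0.4],
    "bob":   [0.3, 0.8, 0.5, 0.6],
}

THRESHOLD = 0.3  # assumed tolerance for normal sensor noise

def identify_driver(reading):
    """Return the enrolled driver whose posture profile is closest to the
    current seat reading, or None if nobody matches within the threshold."""
    best_name, best_dist = None, float("inf")
    for name, profile in ENROLLED.items():
        dist = math.dist(reading, profile)  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= THRESHOLD else None

def may_start_engine(reading):
    # A carjacker's posture matches no profile, so this returns False.
    return identify_driver(reading) is not None
```

A real system would need many enrollment samples per driver and a learned, per-driver tolerance rather than one global threshold, but the shape of the check is the same.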
These are the sorts of things we can do when we datafy more aspects of our lives.

So what is the value of big data? Well, think about it. You have more information. You can do things that you couldn't do before.

One of the most impressive areas where this concept is taking place is in the area of machine learning. Machine learning is a branch of artificial intelligence, which itself is a branch of computer science. The general idea is that instead of instructing a computer what to do, we are going to simply throw data at the problem and tell the computer to figure it out for itself. And it will help you to understand it by seeing its origins.

In the 1950s, a computer scientist at IBM named Arthur Samuel liked to play checkers, so he wrote a computer program so he could play against the computer. He played. He won. He played. He won. He played. He won, because the computer only knew what a legal move was. Arthur Samuel knew something else. Arthur Samuel knew strategy.
So he wrote a small sub-program alongside it, operating in the background, and all it did was score the probability that a given board configuration would likely lead to a winning board versus a losing board after every move.

He plays the computer. He wins. He plays the computer. He wins. He plays the computer. He wins.

And then Arthur Samuel leaves the computer to play itself. It plays itself. It collects more data. It collects more data. It increases the accuracy of its prediction.

And then Arthur Samuel goes back to the computer and he plays it, and he loses, and he plays it, and he loses, and he plays it, and he loses, and Arthur Samuel has created a machine that surpasses his ability in a task that he taught it.

And this idea of machine learning is going everywhere. How do you think we have self-driving cars? Are we any better off as a society enshrining all the rules of the road into software? No. Memory is cheaper. No. Algorithms are faster. No. Processors are better. No. All of those things matter, but that's not why. It's because we changed the nature of the problem.
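Samuel's checkers program is far too large to reproduce here, but the core trick he used — score each position by how often it has led to a win, then sharpen those scores through self-play — fits in a few lines if we swap checkers for a trivially small game. The sketch below uses a made-up counting game (players alternately add 1 or 2; whoever reaches exactly 10 wins) purely as a stand-in; everything about it is an illustrative assumption, not Samuel's actual code:

```python
import random

random.seed(0)

TARGET = 10
wins = {}    # position -> games won from here by the player to move
plays = {}   # position -> games played through here

def score(pos):
    """Estimated probability that the player to move from `pos` wins,
    based on self-play statistics; 0.5 if the position is unseen."""
    return wins.get(pos, 0) / plays[pos] if pos in plays else 0.5

def choose_move(pos, explore):
    moves = [m for m in (1, 2) if pos + m <= TARGET]
    if explore and random.random() < 0.3:
        return random.choice(moves)  # occasionally try something new
    # Otherwise leave the opponent the worst-scoring position.
    return min(moves, key=lambda m: score(pos + m))

def self_play(games=5000):
    """Play the game against itself, updating the position scores —
    'it collects more data, it increases the accuracy of its prediction.'"""
    for _ in range(games):
        pos, history = 0, []
        while pos < TARGET:
            history.append(pos)
            pos += choose_move(pos, explore=True)
        # The player who made the last move won; walking the history
        # backwards, every second position belongs to the winner.
        for i, p in enumerate(reversed(history)):
            plays[p] = plays.get(p, 0) + 1
            if i % 2 == 0:
                wins[p] = wins.get(p, 0) + 1

self_play()
```

After training, the scores encode strategy nobody typed in: position 9 is a certain win for the player to move (score 1.0), while position 7 — from which every move hands the opponent a winning spot — scores near zero.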
We changed the nature of the problem from one in which we tried to overtly and explicitly explain to the computer how to drive to one in which we say, "Here's a lot of data around the vehicle. You figure it out. You figure it out that that is a traffic light, that that traffic light is red and not green, that that means that you need to stop and not go forward."

Machine learning is at the basis of many of the things that we do online: search engines, Amazon's personalization algorithm, computer translation, voice recognition systems.

Researchers recently have looked at the question of biopsies, cancerous biopsies, and they've asked the computer to determine, by looking at the data and survival rates, whether cells are actually cancerous or not. And sure enough, when you throw the data at it through a machine-learning algorithm, the machine was able to identify the 12 telltale signs that best predict that the cells in a breast cancer biopsy are indeed cancerous.

The problem: the medical literature only knew nine of them. Three of the traits were ones that people didn't know to look for, but that the machine spotted.
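The biopsy result — a machine surfacing telltale signs humans hadn't thought to look for — can be illustrated with a toy sketch. This is not the researchers' method or data: here, synthetic "measurements" are generated so that only three of ten features actually carry signal about the label, and a crude class-separation score (the kind of thing a learning algorithm ranks internally) recovers exactly those three:

```python
import random

random.seed(0)

# Hypothetical stand-in for biopsy data: 10 measurements per sample, but
# only features 2, 5, and 8 actually differ between the two classes.
INFORMATIVE = {2, 5, 8}

def make_sample():
    cancerous = random.random() < 0.5
    x = [random.gauss(0, 1) for _ in range(10)]
    for i in INFORMATIVE:
        x[i] += 1.5 if cancerous else -1.5  # shift the informative features
    return x, cancerous

data = [make_sample() for _ in range(2000)]

def feature_score(i):
    """Absolute difference of the feature's mean between the two classes —
    a crude measure of how 'telltale' feature i is."""
    pos = [x[i] for x, y in data if y]
    neg = [x[i] for x, y in data if not y]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

ranked = sorted(range(10), key=feature_score, reverse=True)
print("Top telltale features:", sorted(ranked[:3]))  # → [2, 5, 8]
```

The point mirrors the talk: the algorithm was never told which features matter, yet the data alone singles them out — including any that the "literature" (here, the code's author) might not have listed.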
Now, there are dark sides to big data as well. It will improve our lives, but there are problems that we need to be conscious of.

The first one is the idea that we may be punished for predictions, that the police may use big data for their purposes, a little bit like "Minority Report." There's a term for it, called predictive policing, or algorithmic criminology, and the idea is that if we take a lot of data, for example where past crimes have been, we know where to send the patrols. That makes sense, but the problem, of course, is that it's not simply going to stop at location data, it's going to go down to the level of the individual.

Why don't we use data about the person's high school transcript? Maybe we should use the fact that they're unemployed or not, their credit score, their web-surfing behavior, whether they're up late at night. Their Fitbit, when it's able to identify biochemistries, will show that they have aggressive thoughts.

We may have algorithms that are likely to predict what we are about to do, and we may be held accountable before we've actually acted. Privacy was the central challenge in a small data era.
In the big data age, the challenge will be safeguarding free will, moral choice, human volition, human agency.

There is another problem: Big data is going to steal our jobs. Big data and algorithms are going to challenge white collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue collar labor in the 20th century.

Think about a lab technician who is looking through a microscope at a cancer biopsy and determining whether it's cancerous or not. The person went to university. The person buys property. He or she votes. He or she is a stakeholder in society. And that person's job, as well as an entire fleet of professionals like that person, is going to find that their jobs are radically changed or actually completely eliminated.

Now, we like to think that technology creates jobs over a period of time after a short, temporary period of dislocation, and that is true for the frame of reference with which we all live, the Industrial Revolution, because that's precisely what happened.
But we forget something in that analysis: There are some categories of jobs that simply get eliminated and never come back. The Industrial Revolution wasn't very good if you were a horse.

So we're going to need to be careful and take big data and adjust it for our needs, our very human needs. We have to be the master of this technology, not its servant. We are just at the outset of the big data era, and honestly, we are not very good at handling all the data that we can now collect. It's not just a problem for the National Security Agency. Businesses collect lots of data, and they misuse it too, and we need to get better at this, and this will take time. It's a little bit like the challenge that was faced by primitive man and fire. This is a tool, but this is a tool that, unless we're careful, will burn us.

Big data is going to transform how we live, how we work and how we think. It is going to help us manage our careers and lead lives of satisfaction and hope and happiness and health, but in the past, we've often looked at information technology and our eyes have only seen the T, the technology, the hardware, because that's what was physical.
We now need to recast our gaze at the I, the information, which is less apparent, but in some ways a lot more important. Humanity can finally learn from the information that it can collect, as part of our timeless quest to understand the world and our place in it, and that's why big data is a big deal.

(Applause)