WEBVTT

00:00:06.636 --> 00:00:09.077
Statistics are persuasive.

00:00:09.077 --> 00:00:12.541
So much so that people, organizations, and whole countries

00:00:12.541 --> 00:00:17.747
base some of their most important decisions on organized data.

00:00:17.747 --> 00:00:19.484
But there's a problem with that.

00:00:19.484 --> 00:00:23.301
Any set of statistics might have something lurking inside it,

00:00:23.301 --> 00:00:27.251
something that can turn the results completely upside down.

00:00:27.251 --> 00:00:30.920
For example, imagine you need to choose between two hospitals

00:00:30.920 --> 00:00:33.737
for an elderly relative's surgery.

00:00:33.737 --> 00:00:36.434
Out of each hospital's last 1000 patients,

00:00:36.434 --> 00:00:39.612
900 survived at Hospital A,

00:00:39.612 --> 00:00:43.021
while only 800 survived at Hospital B.

00:00:43.021 --> 00:00:46.170
So it looks like Hospital A is the better choice.

00:00:46.170 --> 00:00:47.843
But before you make your decision,

00:00:47.843 --> 00:00:51.411
remember that not all patients arrive at the hospital

00:00:51.411 --> 00:00:53.811
with the same level of health.

00:00:53.811 --> 00:00:56.703
And if we divide each hospital's last 1000 patients

00:00:56.703 --> 00:01:01.132
into those who arrived in good health and those who arrived in poor health,

00:01:01.132 --> 00:01:03.772
the picture starts to look very different.

00:01:03.772 --> 00:01:07.849
Hospital A had only 100 patients who arrived in poor health,

00:01:07.849 --> 00:01:10.325
of which 30 survived.

00:01:10.325 --> 00:01:14.852
But Hospital B had 400, and they were able to save 210.

00:01:14.852 --> 00:01:17.169
So Hospital B is the better choice

00:01:17.169 --> 00:01:20.741
for patients who arrive at hospital in poor health,

00:01:20.741 --> 00:01:24.526
with a survival rate of 52.5%.

00:01:24.526 --> 00:01:28.445
And what if your relative's health is good when she arrives at the hospital?

00:01:28.445 --> 00:01:32.271
Strangely enough, Hospital B is still the better choice,

00:01:32.271 --> 00:01:35.676
with a survival rate of over 98%.

00:01:35.676 --> 00:01:38.733
So how can Hospital A have a better overall survival rate

00:01:38.733 --> 00:01:44.830
if Hospital B has better survival rates for patients in each of the two groups?

00:01:44.830 --> 00:01:48.589
What we've stumbled upon is a case of Simpson's paradox,

00:01:48.589 --> 00:01:51.899
where the same set of data can appear to show opposite trends

00:01:51.899 --> 00:01:54.664
depending on how it's grouped.

00:01:54.664 --> 00:01:58.744
This often occurs when aggregated data hides a conditional variable,

00:01:58.744 --> 00:02:01.377
sometimes known as a lurking variable,

00:02:01.377 --> 00:02:06.584
which is a hidden additional factor that significantly influences results.

00:02:06.584 --> 00:02:10.023
Here, the hidden factor is the relative proportion of patients

00:02:10.023 --> 00:02:13.264
who arrive in good or poor health.

00:02:13.264 --> 00:02:16.544
Simpson's paradox isn't just a hypothetical scenario.

00:02:16.544 --> 00:02:18.924
It pops up from time to time in the real world,

00:02:18.924 --> 00:02:22.132
sometimes in important contexts.

00:02:22.132 --> 00:02:24.130
One study in the UK appeared to show

00:02:24.130 --> 00:02:27.600
that smokers had a higher survival rate than nonsmokers

00:02:27.600 --> 00:02:29.846
over a twenty-year time period.

00:02:29.846 --> 00:02:33.307
That is, until dividing the participants by age group

00:02:33.307 --> 00:02:37.823
showed that the nonsmokers were significantly older on average,

00:02:37.823 --> 00:02:40.930
and thus, more likely to die during the trial period,

00:02:40.930 --> 00:02:44.438
precisely because they were living longer in general.

00:02:44.438 --> 00:02:47.286
Here, the age groups are the lurking variable,

00:02:47.286 --> 00:02:50.176
and are vital to correctly interpret the data.

00:02:50.176 --> 00:02:51.559
In another example,

00:02:51.559 --> 00:02:54.281
an analysis of Florida's death penalty cases

00:02:54.281 --> 00:02:58.265
seemed to reveal no racial disparity in sentencing

00:02:58.265 --> 00:03:01.581
between black and white defendants convicted of murder.

00:03:01.581 --> 00:03:06.396
But dividing the cases by the race of the victim told a different story.

00:03:06.396 --> 00:03:07.969
In either situation,

00:03:07.969 --> 00:03:11.091
black defendants were more likely to be sentenced to death.

00:03:11.091 --> 00:03:15.066
The slightly higher overall sentencing rate for white defendants

00:03:15.066 --> 00:03:18.692
was due to the fact that cases with white victims

00:03:18.692 --> 00:03:21.359
were more likely to elicit a death sentence

00:03:21.359 --> 00:03:24.091
than cases where the victim was black,

00:03:24.091 --> 00:03:28.483
and most murders occurred between people of the same race.

00:03:28.483 --> 00:03:31.319
So how do we avoid falling for the paradox?

00:03:31.319 --> 00:03:34.686
Unfortunately, there's no one-size-fits-all answer.

00:03:34.686 --> 00:03:38.504
Data can be grouped and divided in any number of ways,

00:03:38.504 --> 00:03:42.106
and overall numbers may sometimes give a more accurate picture

00:03:42.106 --> 00:03:46.638
than data divided into misleading or arbitrary categories.

00:03:46.638 --> 00:03:52.089
All we can do is carefully study the actual situations the statistics describe

00:03:52.089 --> 00:03:55.977
and consider whether lurking variables may be present.

00:03:55.977 --> 00:03:59.378
Otherwise, we leave ourselves vulnerable to those who would use data

00:03:59.378 --> 00:04:02.649
to manipulate others and promote their own agendas.
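
The hospital example from the start of the transcript can be checked with a little arithmetic. The Python sketch below is only an illustration, not part of the original video. The names `counts`, `survival_rate`, and `overall_rate` are made up for this example, and the good-health figures are an assumption derived by subtracting the quoted poor-health numbers from each hospital's 1000-patient totals.

```python
# (survived, total) per hospital and arrival health, taken from the transcript.
# The "good" entries are derived: overall totals minus the poor-health figures.
counts = {
    "A": {"poor": (30, 100), "good": (900 - 30, 1000 - 100)},
    "B": {"poor": (210, 400), "good": (800 - 210, 1000 - 400)},
}


def survival_rate(survived, total):
    """Return the survival rate as a percentage."""
    return 100.0 * survived / total


def overall_rate(groups):
    """Pool all health groups of one hospital and return its overall rate."""
    survived = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return survival_rate(survived, total)


if __name__ == "__main__":
    for hospital, groups in counts.items():
        print(f"Hospital {hospital}")
        for health, (survived, total) in groups.items():
            rate = survival_rate(survived, total)
            print(f"  {health} health: {survived}/{total} = {rate:.1f}%")
        print(f"  overall: {overall_rate(groups):.1f}%")
```

Run as a script, this prints each hospital's rate per group and overall. Hospital B comes out ahead in both groups (52.5% against 30% for poor health, about 98.3% against about 96.7% for good health), yet Hospital A leads overall (90% against 80%) because 900 of its 1000 patients arrived in good health, compared with 600 at Hospital B.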