1 00:00:06,636 --> 00:00:09,077 Statistics are persuasive. 2 00:00:09,077 --> 00:00:12,541 So much so that people, organizations, and whole countries 3 00:00:12,541 --> 00:00:17,747 base some of their most important decisions on organized data. 4 00:00:17,747 --> 00:00:19,484 But there's a problem with that. 5 00:00:19,484 --> 00:00:23,301 Any set of statistics might have something lurking inside it, 6 00:00:23,301 --> 00:00:27,251 something that can turn the results completely upside down. 7 00:00:27,251 --> 00:00:30,920 For example, imagine you need to choose between two hospitals 8 00:00:30,920 --> 00:00:33,737 for an elderly relative's surgery. 9 00:00:33,737 --> 00:00:36,434 Out of each hospital's last 1000 patients, 10 00:00:36,434 --> 00:00:39,612 900 survived at Hospital A, 11 00:00:39,612 --> 00:00:43,021 while only 800 survived at Hospital B. 12 00:00:43,021 --> 00:00:46,170 So it looks like Hospital A is the better choice. 13 00:00:46,170 --> 00:00:47,843 But before you make your decision, 14 00:00:47,843 --> 00:00:51,411 remember that not all patients arrive at the hospital 15 00:00:51,411 --> 00:00:53,811 with the same level of health. 16 00:00:53,811 --> 00:00:56,703 And if we divide each hospital's last 1000 patients 17 00:00:56,703 --> 00:01:01,132 into those who arrived in good health and those who arrived in poor health, 18 00:01:01,132 --> 00:01:03,772 the picture starts to look very different. 19 00:01:03,772 --> 00:01:07,849 Hospital A had only 100 patients who arrived in poor health, 20 00:01:07,849 --> 00:01:10,325 of which 30 survived. 21 00:01:10,325 --> 00:01:14,852 But Hospital B had 400, and they were able to save 210. 22 00:01:14,852 --> 00:01:17,169 So Hospital B is the better choice 23 00:01:17,169 --> 00:01:20,741 for patients who arrive at hospital in poor health, 24 00:01:20,741 --> 00:01:24,526 with a survival rate of 52.5%. 
25 00:01:24,526 --> 00:01:28,445 And what if your relative's health is good when she arrives at the hospital? 26 00:01:28,445 --> 00:01:32,271 Strangely enough, Hospital B is still the better choice, 27 00:01:32,271 --> 00:01:35,676 with a survival rate of over 98%. 28 00:01:35,676 --> 00:01:38,733 So how can Hospital A have a better overall survival rate 29 00:01:38,733 --> 00:01:44,830 if Hospital B has better survival rates for patients in each of the two groups? 30 00:01:44,830 --> 00:01:48,589 What we've stumbled upon is a case of Simpson's paradox, 31 00:01:48,589 --> 00:01:51,899 where the same set of data can appear to show opposite trends 32 00:01:51,899 --> 00:01:54,664 depending on how it's grouped. 33 00:01:54,664 --> 00:01:58,744 This often occurs when aggregated data hides a conditional variable, 34 00:01:58,744 --> 00:02:01,377 sometimes known as a lurking variable, 35 00:02:01,377 --> 00:02:06,584 which is a hidden additional factor that significantly influences results. 36 00:02:06,584 --> 00:02:10,023 Here, the hidden factor is the relative proportion of patients 37 00:02:10,023 --> 00:02:13,264 who arrive in good or poor health. 38 00:02:13,264 --> 00:02:16,544 Simpson's paradox isn't just a hypothetical scenario. 39 00:02:16,544 --> 00:02:18,924 It pops up from time to time in the real world, 40 00:02:18,924 --> 00:02:22,132 sometimes in important contexts. 41 00:02:22,132 --> 00:02:24,130 One study in the UK appeared to show 42 00:02:24,130 --> 00:02:27,600 that smokers had a higher survival rate than nonsmokers 43 00:02:27,600 --> 00:02:29,846 over a twenty-year time period. 44 00:02:29,846 --> 00:02:33,307 That is, until dividing the participants by age group 45 00:02:33,307 --> 00:02:37,823 showed that the nonsmokers were significantly older on average, 46 00:02:37,823 --> 00:02:40,930 and thus, more likely to die during the trial period, 47 00:02:40,930 --> 00:02:44,438 precisely because they were living longer in general. 
48 00:02:44,438 --> 00:02:47,286 Here, the age groups are the lurking variable, 49 00:02:47,286 --> 00:02:50,176 and are vital to correctly interpret the data. 50 00:02:50,176 --> 00:02:51,559 In another example, 51 00:02:51,559 --> 00:02:54,281 an analysis of Florida's death penalty cases 52 00:02:54,281 --> 00:02:58,265 seemed to reveal no racial disparity in sentencing 53 00:02:58,265 --> 00:03:01,581 between black and white defendants convicted of murder. 54 00:03:01,581 --> 00:03:06,396 But dividing the cases by the race of the victim told a different story. 55 00:03:06,396 --> 00:03:07,969 In either situation, 56 00:03:07,969 --> 00:03:11,091 black defendants were more likely to be sentenced to death. 57 00:03:11,091 --> 00:03:15,066 The slightly higher overall sentencing rate for white defendants 58 00:03:15,066 --> 00:03:18,692 was due to the fact that cases with white victims 59 00:03:18,692 --> 00:03:21,359 were more likely to elicit a death sentence 60 00:03:21,359 --> 00:03:24,091 than cases where the victim was black, 61 00:03:24,091 --> 00:03:28,483 and most murders occurred between people of the same race. 62 00:03:28,483 --> 00:03:31,319 So how do we avoid falling for the paradox? 63 00:03:31,319 --> 00:03:34,686 Unfortunately, there's no one-size-fits-all answer. 64 00:03:34,686 --> 00:03:38,504 Data can be grouped and divided in any number of ways, 65 00:03:38,504 --> 00:03:42,106 and overall numbers may sometimes give a more accurate picture 66 00:03:42,106 --> 00:03:46,638 than data divided into misleading or arbitrary categories. 67 00:03:46,638 --> 00:03:52,089 All we can do is carefully study the actual situations the statistics describe 68 00:03:52,089 --> 00:03:55,977 and consider whether lurking variables may be present. 69 00:03:55,977 --> 00:03:59,378 Otherwise, we leave ourselves vulnerable to those who would use data 70 00:03:59,378 --> 00:04:02,649 to manipulate others and promote their own agendas.
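The hospital example from the start of the transcript can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the original narration: the good-health counts are not stated directly in the transcript and are derived here by subtracting the poor-health figures from each hospital's totals (900 of 1000 survived at Hospital A, 800 of 1000 at Hospital B). The names `hospitals`, `rate`, and `overall` are hypothetical helpers introduced for this example.

```python
from fractions import Fraction

# (survived, total) per hospital and arrival-health group.
# Poor-health figures come from the transcript; good-health figures are
# an assumption derived as "overall minus poor-health".
hospitals = {
    "A": {"poor": (30, 100), "good": (900 - 30, 1000 - 100)},
    "B": {"poor": (210, 400), "good": (800 - 210, 1000 - 400)},
}


def rate(survived, total):
    """Survival rate as an exact fraction."""
    return Fraction(survived, total)


def overall(groups):
    """Survival rate after pooling all health groups of one hospital."""
    survived = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return rate(survived, total)


for name, groups in hospitals.items():
    for health, (survived, total) in groups.items():
        print(f"Hospital {name}, {health} health: {float(rate(survived, total)):.1%}")
    print(f"Hospital {name}, overall: {float(overall(groups)):.1%}")
```

Running this shows Hospital B ahead within each health group while Hospital A is ahead once the groups are pooled. The reversal comes from the mix of patients: Hospital B treats four times as many poor-health arrivals, and that group has a much lower survival rate at both hospitals.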