1
00:00:06,636 --> 00:00:09,077
Statistics are persuasive.

2
00:00:09,077 --> 00:00:12,541
So much so that people, organizations,
and whole countries

3
00:00:12,541 --> 00:00:17,747
base some of their most important 
decisions on organized data.

4
00:00:17,747 --> 00:00:19,484
But there's a problem with that.

5
00:00:19,484 --> 00:00:23,301
Any set of statistics might have something
lurking inside it,

6
00:00:23,301 --> 00:00:27,251
something that can turn the results
completely upside down.

7
00:00:27,251 --> 00:00:30,920
For example, imagine you need to choose
between two hospitals

8
00:00:30,920 --> 00:00:33,737
for an elderly relative's surgery.

9
00:00:33,737 --> 00:00:36,434
Out of each hospital's 
last 1000 patients,

10
00:00:36,434 --> 00:00:39,612
900 survived at Hospital A,

11
00:00:39,612 --> 00:00:43,021
while only 800 survived at Hospital B.

12
00:00:43,021 --> 00:00:46,170
So it looks like Hospital A 
is the better choice.

13
00:00:46,170 --> 00:00:47,843
But before you make your decision,

14
00:00:47,843 --> 00:00:51,411
remember that not all patients
arrive at the hospital

15
00:00:51,411 --> 00:00:53,811
with the same level of health.

16
00:00:53,811 --> 00:00:56,703
And if we divide each hospital's
last 1000 patients

17
00:00:56,703 --> 00:01:01,132
into those who arrived in good health
and those who arrived in poor health,

18
00:01:01,132 --> 00:01:03,772
the picture starts to look very different.

19
00:01:03,772 --> 00:01:07,849
Hospital A had only 100 patients
who arrived in poor health,

20
00:01:07,849 --> 00:01:10,325
of whom 30 survived.

21
00:01:10,325 --> 00:01:14,852
But Hospital B had 400,
and they were able to save 210.

22
00:01:14,852 --> 00:01:17,169
So Hospital B is the better choice

23
00:01:17,169 --> 00:01:20,741
for patients who arrive 
at the hospital in poor health,

24
00:01:20,741 --> 00:01:24,526
with a survival rate of 52.5%.

25
00:01:24,526 --> 00:01:28,445
And what if your relative's health
is good when she arrives at the hospital?

26
00:01:28,445 --> 00:01:32,271
Strangely enough, Hospital B is still
the better choice,

27
00:01:32,271 --> 00:01:35,676
with a survival rate of over 98%.

28
00:01:35,676 --> 00:01:38,733
So how can Hospital A have a better
overall survival rate

29
00:01:38,733 --> 00:01:44,830
if Hospital B has better survival rates
for patients in each of the two groups?

30
00:01:44,830 --> 00:01:48,589
What we've stumbled upon is a case
of Simpson's paradox,

31
00:01:48,589 --> 00:01:51,899
where the same set of data can appear
to show opposite trends

32
00:01:51,899 --> 00:01:54,664
depending on how it's grouped.

33
00:01:54,664 --> 00:01:58,744
This often occurs when aggregated data
hides a conditional variable,

34
00:01:58,744 --> 00:02:01,377
sometimes known as a lurking variable,

35
00:02:01,377 --> 00:02:06,584
which is a hidden additional factor
that significantly influences results.

36
00:02:06,584 --> 00:02:10,023
Here, the hidden factor is the relative
proportion of patients

37
00:02:10,023 --> 00:02:13,264
who arrive in good or poor health.
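
The hospital figures can be checked with a short calculation. This is a minimal sketch, not part of the narration. The poor-health counts are the ones quoted above. The good-health counts (870 of 900 at Hospital A, 590 of 600 at Hospital B) are not stated directly. They are derived by subtracting the poor-health counts from each hospital's totals of 900 and 800 survivors out of 1000. The names `data`, `rate`, and `overall` are illustrative.

```python
# (survivors, patients) for each hospital and arrival-health group.
# Poor-health counts are quoted in the transcript; good-health counts are
# derived by subtraction from each hospital's totals (900/1000 and 800/1000).
data = {
    "A": {"good": (870, 900), "poor": (30, 100)},
    "B": {"good": (590, 600), "poor": (210, 400)},
}


def rate(survived, patients):
    """Survival rate as a percentage."""
    return 100 * survived / patients


def overall(groups):
    """Pooled survival rate across all groups of one hospital."""
    survived = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return rate(survived, patients)


for hospital, groups in data.items():
    for group, (survived, patients) in groups.items():
        print(f"Hospital {hospital}, {group} health: {rate(survived, patients):.1f}%")
    print(f"Hospital {hospital}, overall: {overall(groups):.1f}%")
```

Hospital B comes out ahead in both groups (98.3% against 96.7%, and 52.5% against 30.0%), yet Hospital A leads overall (90.0% against 80.0%). The reason is the patient mix: 90% of Hospital A's patients arrived in good health, compared with 60% of Hospital B's.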

38
00:02:13,264 --> 00:02:16,544
Simpson's paradox isn't just
a hypothetical scenario.

39
00:02:16,544 --> 00:02:18,924
It pops up from time 
to time in the real world,

40
00:02:18,924 --> 00:02:22,132
sometimes in important contexts.

41
00:02:22,132 --> 00:02:24,130
One study in the UK appeared to show

42
00:02:24,130 --> 00:02:27,600
that smokers had a higher survival rate
than nonsmokers

43
00:02:27,600 --> 00:02:29,846
over a twenty-year time period.

44
00:02:29,846 --> 00:02:33,307
That is, until dividing the participants
by age group

45
00:02:33,307 --> 00:02:37,823
showed that the nonsmokers 
were significantly older on average,

46
00:02:37,823 --> 00:02:40,930
and thus, more likely
to die during the trial period,

47
00:02:40,930 --> 00:02:44,438
precisely because they were living longer
in general.

48
00:02:44,438 --> 00:02:47,286
Here, the age groups 
are the lurking variable,

49
00:02:47,286 --> 00:02:50,176
and are vital to correctly 
interpret the data.

50
00:02:50,176 --> 00:02:51,559
In another example,

51
00:02:51,559 --> 00:02:54,281
an analysis of Florida's 
death penalty cases

52
00:02:54,281 --> 00:02:58,265
seemed to reveal 
no racial disparity in sentencing

53
00:02:58,265 --> 00:03:01,581
between black and white defendants
convicted of murder.

54
00:03:01,581 --> 00:03:06,396
But dividing the cases by the race
of the victim told a different story.

55
00:03:06,396 --> 00:03:07,969
In either situation,

56
00:03:07,969 --> 00:03:11,091
black defendants were more likely
to be sentenced to death.

57
00:03:11,091 --> 00:03:15,066
The slightly higher overall sentencing 
rate for white defendants

58
00:03:15,066 --> 00:03:18,692
was due to the fact 
that cases with white victims

59
00:03:18,692 --> 00:03:21,359
were more likely 
to elicit a death sentence

60
00:03:21,359 --> 00:03:24,091
than cases where the victim was black,

61
00:03:24,091 --> 00:03:28,483
and most murders occurred between
people of the same race.

62
00:03:28,483 --> 00:03:31,319
So how do we avoid 
falling for the paradox?

63
00:03:31,319 --> 00:03:34,686
Unfortunately, 
there's no one-size-fits-all answer.

64
00:03:34,686 --> 00:03:38,504
Data can be grouped and divided
in any number of ways,

65
00:03:38,504 --> 00:03:42,106
and overall numbers may sometimes
give a more accurate picture

66
00:03:42,106 --> 00:03:46,638
than data divided into misleading
or arbitrary categories.

67
00:03:46,638 --> 00:03:52,089
All we can do is carefully study the
actual situations the statistics describe

68
00:03:52,089 --> 00:03:55,977
and consider whether lurking variables
may be present.

69
00:03:55,977 --> 00:03:59,378
Otherwise, we leave ourselves
vulnerable to those who would use data

70
00:03:59,378 --> 00:04:02,649
to manipulate others
and promote their own agendas.