WEBVTT 00:00:10.080 --> 00:00:17.985 applause 00:00:17.985 --> 00:00:22.900 Thank you very much, can you… You can hear me? Yes! 00:00:22.900 --> 00:00:27.620 I’ve been at this now 23 years. We worked, with… My colleagues and I, 00:00:27.620 --> 00:00:31.390 we worked in about 30 countries, we’ve advised 9 Truth Commissions, 00:00:31.390 --> 00:00:36.410 official Truth Commissions, 4 UN missions, 00:00:36.410 --> 00:00:40.150 4 international criminal tribunals. We have testified in 4 different cases 00:00:40.150 --> 00:00:44.240 – 2 internationally, 2 domestically – and we’ve advised dozens and dozens 00:00:44.240 --> 00:00:49.120 of non-governmental Human Rights groups around the world. The point of this stuff 00:00:49.120 --> 00:00:54.180 is to figure out how to bring the knowledge of the people who’ve suffered 00:00:54.180 --> 00:00:58.770 human rights violations to bear, on demanding accountability 00:00:58.770 --> 00:01:04.960 from the perpetrators. Our job is to figure out how we can tell the truth. 00:01:04.960 --> 00:01:09.240 It is one of the moral foundations of the international Human Rights movement 00:01:09.240 --> 00:01:14.220 that we speak Truth to Power. We look in the face of the powerful 00:01:14.220 --> 00:01:19.299 and we tell them what we believe they have done that is wrong. 00:01:19.299 --> 00:01:23.639 If that’s gonna work, we have to speak the truth. 00:01:23.639 --> 00:01:29.470 We have to be right, we have to get the analysis on. 00:01:29.470 --> 00:01:33.979 That’s not always easy and to get there, 00:01:33.979 --> 00:01:37.209 there are sort of 3 themes that I wanna try to touch in this talk. 
00:01:37.209 --> 00:01:40.379 Since the talk is pretty short I’m really gonna touch on 2 of them, so 00:01:40.379 --> 00:01:43.619 at the very end of the talk I’ll invite people who’d like to talk more about 00:01:43.619 --> 00:01:49.270 the specifically technical aspects of this work, about classifiers, about clustering, 00:01:49.270 --> 00:01:53.620 about statistical estimation, about database techniques. People who wanna talk 00:01:53.620 --> 00:01:56.990 about that I’d love to gather and we’ll try to find a space. I’ve been fighting 00:01:56.990 --> 00:02:00.460 with the Wiki for 2 days; I think I’m probably not the only one. 00:02:00.460 --> 00:02:04.959 We can gather, we can talk about that stuff more in detail. So today, 00:02:04.959 --> 00:02:09.990 in the next 25 minutes I’m going to focus specifically on 00:02:09.990 --> 00:02:14.520 the trial of General José Efraín Ríos Montt 00:02:14.520 --> 00:02:20.200 who ruled Guatemala from March 1982 until August 1983. 00:02:20.200 --> 00:02:25.180 That’s General Ríos, there in the upper corner in the red tie. 00:02:25.180 --> 00:02:30.600 During the government of General Ríos Montt 00:02:30.600 --> 00:02:35.610 tens of thousands of people were killed by the army of Guatemala. And the question 00:02:35.610 --> 00:02:39.610 that has been facing Guatemalans since that time is: 00:02:39.610 --> 00:02:44.080 “Did the pattern of killing that the army committed 00:02:44.080 --> 00:02:49.690 constitute acts of genocide?”. Now genocide is a very specific crime 00:02:49.690 --> 00:02:54.420 in International Law. It does not mean you killed a lot of people. 00:02:54.420 --> 00:02:58.910 There are other war crimes for mass killing. Genocide specifically means 00:02:58.910 --> 00:03:03.930 that you picked out a particular group; and to the exclusion of other groups 00:03:03.930 --> 00:03:08.460 nearby them you focused on eliminating that group. 
00:03:08.460 --> 00:03:14.240 That’s key because for a statistician that gives us a hypothesis we can test 00:03:14.240 --> 00:03:18.860 which is: “What is the relative risk, what is the differential probability 00:03:18.860 --> 00:03:22.820 of people in the target group being killed relative to their neighbours 00:03:22.820 --> 00:03:28.150 who are not in the target group?” So without further ado, 00:03:28.150 --> 00:03:31.970 let’s look at the relative risk of being killed for indigenous people 00:03:31.970 --> 00:03:36.880 in the 3 rural counties of Chajul, Cotzal and Nebaj 00:03:36.880 --> 00:03:41.400 relative to their non-indigenous neighbours. 00:03:41.400 --> 00:03:45.960 We have – and I’ll talk in a moment about how we have this – we have information, 00:03:45.960 --> 00:03:51.490 and evidence, and estimations of the deaths of about 2150 indigenous people. 00:03:51.490 --> 00:03:58.550 People killed by the army in the period of the government of General Ríos. 00:03:58.550 --> 00:04:02.550 The population, the total number of people alive who were indigenous 00:04:02.550 --> 00:04:07.370 in those counties in the census of 1981 is about 39,000. 00:04:07.370 --> 00:04:14.500 So the approximate crude mortality rate due to homicide by the army 00:04:14.500 --> 00:04:18.710 is 5.5% for indigenous people in that period. Now that’s relative 00:04:18.710 --> 00:04:22.890 to the homicide rate for non-indigenous people in the same place 00:04:22.890 --> 00:04:27.200 of approximately 0.7%. So what we ask is: “What is the ratio 00:04:27.200 --> 00:04:30.530 between those 2 numbers?” And the ratio between those 2 numbers 00:04:30.530 --> 00:04:35.600 is the relative risk. It’s approximately 8. 
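The relative-risk arithmetic just described can be reproduced in a few lines. This is a minimal sketch using only the figures quoted in the talk (about 2150 estimated deaths, a 1981 census population of about 39,000, and a non-indigenous rate of roughly 0.7%); the function and variable names are illustrative assumptions, not the names used in the actual analysis.

```python
def crude_rate(deaths, population):
    """Crude mortality rate: deaths divided by the population alive."""
    return deaths / population

# Figures as quoted in the talk (all approximate).
indigenous_rate = crude_rate(2150, 39000)   # about 5.5%
non_indigenous_rate = 0.007                 # about 0.7%, quoted directly as a rate

# Relative risk is the ratio of the two rates.
relative_risk = indigenous_rate / non_indigenous_rate

print(f"indigenous rate: {indigenous_rate:.1%}")
print(f"relative risk:   {relative_risk:.1f}")
```

The point of dividing by population first is that the comparison is between rates, not raw counts, so the much larger indigenous population does not by itself drive the result.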
We interpret that as: if you were 00:04:35.600 --> 00:04:41.339 an indigenous person alive in one of those 3 counties in 1982, 00:04:41.339 --> 00:04:46.939 your probability of being killed by the army was 8 times greater 00:04:46.939 --> 00:04:51.069 than that of a person also living in those 3 counties 00:04:51.069 --> 00:04:56.179 who was not indigenous. Eight times, 8 times! 00:04:56.179 --> 00:05:00.250 To put that in relative terms: the probability… the relative risk of being 00:05:00.250 --> 00:05:04.720 a Bosniac relative to being Serb in Bosnia during the war in Bosnia 00:05:04.720 --> 00:05:09.800 was a little less than 3. So your relative risk of being indigenous 00:05:09.800 --> 00:05:13.310 was more than twice, nearly 3 times, as much as your relative risk 00:05:13.310 --> 00:05:19.200 of being Bosniac in the Bosnian War. It’s an astonishing level of focus. 00:05:19.200 --> 00:05:23.809 It shows tremendous planning and coherence, I believe. 00:05:23.809 --> 00:05:29.469 So, again coming back to the statistical conclusion, how do we come to that? 00:05:29.469 --> 00:05:32.849 How do we find that information? How do we make that conclusion? First, we’re only 00:05:32.849 --> 00:05:35.470 looking at homicides committed by the army. We’re not looking at homicides 00:05:35.470 --> 00:05:39.409 committed by other parties, by the guerrillas, by private actors. 00:05:39.409 --> 00:05:44.499 We’re not looking at excess mortality, the mortality that we might find 00:05:44.499 --> 00:05:47.709 in conflict that is in excess of normal peacetime mortality. 00:05:47.709 --> 00:05:51.470 We’re not looking at any of that, only homicide. And the percentage 00:05:51.470 --> 00:05:55.330 relates the number of people killed by the army to the population that was alive. 00:05:55.330 --> 00:05:58.650 That’s crucial here.
We’re looking at rates and we’re comparing the rate 00:05:58.650 --> 00:06:02.430 of the indigenous people shown in the blue bar to non-indigenous people 00:06:02.430 --> 00:06:06.869 shown in the green bar. The width of the bars show the relative populations 00:06:06.869 --> 00:06:11.829 in each of those 2 communities. So clearly there are many more indigenous people, 00:06:11.829 --> 00:06:14.980 but a higher fraction of them are also killed. The bars also show something else. 00:06:14.980 --> 00:06:18.049 And that’s what I’ll focus on for the rest of the talk. There are 2 sections 00:06:18.049 --> 00:06:22.159 to each of the 2 bars, a dark section on the bottom, a lighter section on top. 00:06:22.159 --> 00:06:27.779 And what that indicates is what we know in terms of being able to name people 00:06:27.779 --> 00:06:31.249 with their first and last name, their location and dates of death, and 00:06:31.249 --> 00:06:35.560 what we must infer statistically. Now I’m beginning to touch on the second theme 00:06:35.560 --> 00:06:40.949 of my talk: Which is that when we are studying mass violence and war crimes, 00:06:40.949 --> 00:06:48.749 we cannot do statistical or pattern analysis with raw information. 00:06:48.749 --> 00:06:51.950 We must use the tools of mathematical statistics to understand 00:06:51.950 --> 00:06:56.080 what we don’t know! The information which cannot be observed directly. 00:06:56.080 --> 00:07:00.649 We have to estimate that in order to control for the process of the production 00:07:00.649 --> 00:07:04.989 of information. Information doesn’t just fall out of the sky, the way it does 00:07:04.989 --> 00:07:10.359 for industry. If I’m running an ISP I know every packet that runs through my routers. 00:07:10.359 --> 00:07:14.959 That’s not how the social world works. 
In order to find information about killings 00:07:14.959 --> 00:07:17.889 we have to hear about that killing from someone, we have to investigate, 00:07:17.889 --> 00:07:22.119 we have to find the human remains. And if we can’t observe the killing 00:07:22.119 --> 00:07:28.130 we won’t hear about it and many killings are hidden. In my team we have a kind of 00:07:28.130 --> 00:07:33.760 catch phrase: that the world… if a lawyer is killed in a big city at high noon 00:07:33.760 --> 00:07:38.259 the world knows about it before dinner time. Every single time. 00:07:38.259 --> 00:07:41.850 But when a rural peasant is killed 3-days walk from a road in the dead of night, 00:07:41.850 --> 00:07:45.489 we’re unlikely to ever hear. And technology is not changing this. 00:07:45.489 --> 00:07:48.899 I’ll talk later about that technology is actually making the problem worse. 00:07:48.899 --> 00:07:53.470 So, let’s get back to Guatemala and just conclude 00:07:53.470 --> 00:07:57.950 that the little vertical bars, little vertical lines at the top of each bar 00:07:57.950 --> 00:08:03.079 indicate the confidence interval. Which is similar to what lay people sometimes call 00:08:03.079 --> 00:08:07.199 a margin of error. It is our level of uncertainty about each of those estimates 00:08:07.199 --> 00:08:10.960 and you’ll notice that the uncertainty is much, much smaller than 00:08:10.960 --> 00:08:14.509 the difference between the 2 bars. The uncertainty does not affect our ability 00:08:14.509 --> 00:08:17.970 to draw the conclusion that there was a spectacular difference 00:08:17.970 --> 00:08:21.900 in the mortality rates between the people who were the hypothesized 00:08:21.900 --> 00:08:26.630 target of genocide and those who were not. 00:08:26.630 --> 00:08:30.520 Now the data: first we had the census of 1981, 00:08:30.520 --> 00:08:35.339 this was a crucial piece. 
I think there’s very interesting questions to ask 00:08:35.339 --> 00:08:39.609 about why the Government of Guatemala conducted a census on the eve of 00:08:39.609 --> 00:08:44.540 committing a genocide. There is excellent work done by historical demographers 00:08:44.540 --> 00:08:47.950 about the use of censuses in mass violence. It has been common 00:08:47.950 --> 00:08:52.880 throughout history. Similarly, or excuse me, in parallel 00:08:52.880 --> 00:08:57.420 there were 4 very large projects. First, the CIIDH 00:08:57.420 --> 00:09:01.600 – a group of non-Governmental Human Rights groups – 00:09:01.600 --> 00:09:06.610 collected 1240 records of deaths in this three-county region. 00:09:06.610 --> 00:09:11.750 Next, the Catholic Church collected a bit fewer than 800 deaths. 00:09:11.750 --> 00:09:16.539 The truth commission – the Comisión para el Esclarecimiento Histórico (CEH) – 00:09:16.539 --> 00:09:22.000 conducted a really big research project in the late 1990s and 00:09:22.000 --> 00:09:25.810 of that we got information about a little bit more than a thousand deaths. 00:09:25.810 --> 00:09:30.450 And then the National Program for Compensation is very, very large 00:09:30.450 --> 00:09:35.370 and gave us about 4700 records of deaths. 00:09:35.370 --> 00:09:40.659 Now, this is interesting but this is not unique. 00:09:40.659 --> 00:09:45.769 Many of the deaths are reported in common across those data sources and so… 00:09:45.769 --> 00:09:49.490 we think about this in terms of a Venn diagram. We think of: how did these 00:09:49.490 --> 00:09:54.329 different data sets intersect with each other or collide with each other. And 00:09:54.329 --> 00:09:59.130 we can diagram that as in the sense of these 3 white circles intersecting. 00:09:59.130 --> 00:10:05.610 But as I mentioned earlier we’re also interested in what we have not observed. 
00:10:05.610 --> 00:10:09.490 And this is crucial for us because when we’re thinking about 00:10:09.490 --> 00:10:13.420 how much information we have, we have to distinguish between the world on the left, 00:10:13.420 --> 00:10:17.200 in which our intersecting circles cover about a third of the reality, 00:10:17.200 --> 00:10:21.829 versus the world on the right where our intersecting circles cover all of reality. 00:10:21.829 --> 00:10:26.390 These are very different worlds; and the reason they’re so different is not simply 00:10:26.390 --> 00:10:29.710 because we want to know the magnitude, not simply because we want to know 00:10:29.710 --> 00:10:34.490 the total number of killings. That’s important – but even more important: 00:10:34.490 --> 00:10:40.160 we have to know that we’ve covered, we’ve estimated in equal proportions 00:10:40.160 --> 00:10:44.430 the two parties. We have to estimate in equal proportions the number of deaths 00:10:44.430 --> 00:10:48.340 of non-indigenous people and the number of deaths of indigenous people. 00:10:48.340 --> 00:10:51.510 Because if we don’t get those estimates correct our comparison 00:10:51.510 --> 00:10:56.080 of their mortality rates will be biased. Our story will be wrong. We will fail 00:10:56.080 --> 00:11:01.840 to speak Truth to Power. We can’t have that. So what do we do? Algebra! 00:11:01.840 --> 00:11:06.390 Algebra is our friend. So I’m gonna give you just a tiny taste of how we 00:11:06.390 --> 00:11:09.650 solve this problem and I’m going to introduce a series of assumptions. 00:11:09.650 --> 00:11:13.279 Those of you who would like to debate those assumptions: I invite you to join me 00:11:13.279 --> 00:11:18.359 after the talk and we will talk endlessly and tediously about capture heterogeneity. 00:11:18.359 --> 00:11:22.240 But in the short term, 00:11:22.240 --> 00:11:27.940 we have a universe N of total killings in a specific time/space/ethnicity/location. 
00:11:27.940 --> 00:11:30.690 And of that we have 2 projects A and B. 00:11:30.690 --> 00:11:34.619 A captures some number of deaths from the universe N, 00:11:34.619 --> 00:11:40.169 and the probability with which a death is captured by project A from the universe N 00:11:40.169 --> 00:11:44.600 is by elementary probability theory the number of deaths documented by A 00:11:44.600 --> 00:11:48.740 divided by the unknown number of deaths in the population N. 00:11:48.740 --> 00:11:52.969 Similarly, the probability with which a death from N is documented by project B 00:11:52.969 --> 00:11:58.149 is B over N, and this is the cool part: the probability with which a death 00:11:58.149 --> 00:12:01.949 is documented by both A and B is M over N. 00:12:01.949 --> 00:12:05.579 Now we can put the 2 databases together, we can compare them. Let’s talk about 00:12:05.579 --> 00:12:09.370 the use of random forest classifiers and clustering to do that later. 00:12:09.370 --> 00:12:12.489 But we can put the 2 databases together, compare them, determine the deaths 00:12:12.489 --> 00:12:17.429 that are in M – that is, in both A and B – and divide M by N. 00:12:17.429 --> 00:12:23.060 But, also by probability theory, the probability that a death occurs in M 00:12:23.060 --> 00:12:27.740 is equal to the product of the individual probabilities. 00:12:27.740 --> 00:12:31.619 The probability of any compound event, an event made up of two independent events, is 00:12:31.619 --> 00:12:36.410 equal to the product of the probabilities of those two events, so M over N is equal to 00:12:36.410 --> 00:12:41.420 A over N times B over N. Solve for N. 00:12:41.420 --> 00:12:45.140 Multiply it through by N squared, divide by M, and we have an estimate of N 00:12:45.140 --> 00:12:49.360 which is equal to AB over M. Now, the lights in my eyes, I can’t see, but I saw 00:12:49.360 --> 00:12:52.740 a few light bulbs go off over people’s heads.
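The two-list derivation above (M over N equals A over N times B over N, so N is estimated as AB over M) can be sketched in code. This is only an illustration under the same assumptions the talk states, namely that the two projects capture deaths independently. The record identifiers and list sizes below are hypothetical, and the sketch assumes the two lists have already been matched so that the same death carries the same identifier in both, which in practice is the hard record-linkage step.

```python
def two_system_estimate(a, b, m):
    """Estimate the total N from two lists: N = (A * B) / M.

    a: number of deaths documented by project A
    b: number of deaths documented by project B
    m: number of deaths documented by both
    """
    if m <= 0:
        raise ValueError("no overlap between the lists; N cannot be estimated")
    return a * b / m

# Hypothetical, already-matched record identifiers (not real data).
list_a = set(range(0, 300))      # 300 deaths documented by A
list_b = set(range(240, 440))    # 200 deaths documented by B
overlap = list_a & list_b        # deaths documented by both

n_hat = two_system_estimate(len(list_a), len(list_b), len(overlap))
documented = len(list_a | list_b)

print(f"documented by at least one project: {documented}")
print(f"estimated total, including unobserved deaths: {n_hat:.0f}")
```

A smaller overlap for the same list sizes pushes the estimate up, which is the algebraic form of the idea that fewer matches imply a larger unobserved population.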
And when I showed this proof 00:12:52.740 --> 00:12:57.180 to the judge in the trial of General Ríos 00:12:57.180 --> 00:13:01.529 I saw a light bulb go on over her head. 00:13:01.529 --> 00:13:04.379 It’s a beautiful thing, it’s a beautiful thing. 00:13:04.379 --> 00:13:09.509 applause 00:13:09.509 --> 00:13:12.660 So we don’t do it in 2 systems because that takes a lot of assumptions. 00:13:12.660 --> 00:13:16.069 We do it in 4. You will recall that we have 4 data sources. We organize 00:13:16.069 --> 00:13:21.530 the data sources in this format such that we have an inclusion 00:13:21.530 --> 00:13:26.249 and an exclusion pattern in the table on the left, which… for which we can define 00:13:26.249 --> 00:13:29.810 the number of deaths which fall into each of these intersecting patterns. 00:13:29.810 --> 00:13:33.729 And I’ll give you a very quick metaphor here. The metaphor is: 00:13:33.729 --> 00:13:38.239 imagine that you have 2 dark rooms and you want to assess the size of those 2 rooms 00:13:38.239 --> 00:13:42.049 – which room is larger? And the only tool that you have to assess the size 00:13:42.049 --> 00:13:46.359 of those rooms is a handful of little rubber balls. The little rubber balls 00:13:46.359 --> 00:13:50.400 have a property that when they hit each other they make a sound. makes CLICK sound 00:13:50.400 --> 00:13:53.390 So we throw the balls into the first room and we listen, and we hear 00:13:53.390 --> 00:13:57.190 makes several CLICK sounds. We collect the balls, go to the second room, 00:13:57.190 --> 00:14:00.490 throw them with equal force – imagining a spherical cow of uniform density! 00:14:00.490 --> 00:14:03.950 We throw the balls into the second room with equal force and we hear 00:14:03.950 --> 00:14:07.799 makes one CLICK sound So which room is larger? 00:14:07.799 --> 00:14:12.070 The second room, because we hear fewer collisions, right? 
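The inclusion and exclusion table described above, with one cell per combination of the 4 data sources, can be built by counting how many matched deaths fall into each pattern. The following is a minimal sketch: the source labels are shorthand chosen here, the victim identifiers and memberships are invented for illustration, and it assumes record linkage across sources has already been done.

```python
from collections import Counter

# Shorthand labels for the 4 sources named in the talk (labels are an assumption).
SOURCES = ("CIIDH", "Church", "CEH", "Compensation")

def capture_patterns(records):
    """Count deaths per inclusion/exclusion pattern.

    records maps a matched victim id to the set of sources that documented
    the death. Each pattern is a tuple of 0/1 flags in SOURCES order.
    """
    counts = Counter()
    for sources_seen in records.values():
        pattern = tuple(int(name in sources_seen) for name in SOURCES)
        counts[pattern] += 1
    return counts

# Hypothetical matched records, for illustration only.
records = {
    "victim-1": {"CIIDH", "CEH"},
    "victim-2": {"Compensation"},
    "victim-3": {"CIIDH", "CEH"},
    "victim-4": {"CIIDH", "Church", "CEH", "Compensation"},
}

patterns = capture_patterns(records)
for pattern, count in sorted(patterns.items(), reverse=True):
    print(pattern, count)
```

The pattern (0, 0, 0, 0), deaths documented by no source, never appears in such a table; that empty cell is the quantity the estimation is meant to recover.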
Well, the estimation, 00:14:12.070 --> 00:14:15.620 the toy example I gave in the previous slide is the mathematical formalization 00:14:15.620 --> 00:14:20.070 of the intuition that fewer collisions mean a larger space. 00:14:20.070 --> 00:14:23.329 And so what we’re doing here is laying out the pattern of collisions. 00:14:23.329 --> 00:14:26.679 Not just the collisions, the pairwise collisions, but the three-way and 00:14:26.679 --> 00:14:31.409 four-way collisions. And that allows us to make the estimate 00:14:31.409 --> 00:14:37.439 that was shown in the bar graph of the light part of each of the bars. So 00:14:37.439 --> 00:14:41.460 we can come back to our conclusion and put a confidence interval on the estimates. 00:14:41.460 --> 00:14:45.910 And the confidence intervals are shown there. Now I’m gonna move through this 00:14:45.910 --> 00:14:50.850 somewhat more quickly to get to the end of the talk but I wanna put up one more slide 00:14:50.850 --> 00:14:56.240 that was used in the testimony and that is that we divided time 00:14:56.240 --> 00:15:01.220 into 16-month periods and compared the 16-month period of 00:15:01.220 --> 00:15:04.580 General Ríos’s governance – now it’s only 16 months ’cause we went April to July, 00:15:04.580 --> 00:15:07.679 because it’s only a few days in August, a few days in March, so we shaved those off, 00:15:07.679 --> 00:15:12.310 okay… – 16-month period of General Ríos’s Government and compared it 00:15:12.310 --> 00:15:17.110 to several periods before and after. And I think that the key observation here 00:15:17.110 --> 00:15:21.809 is that the rate of killing against indigenous people 00:15:21.809 --> 00:15:26.729 is substantially higher under General Ríos’s Government than under previous 00:15:26.729 --> 00:15:33.280 or succeeding governments.
But more importantly the ratio between the two, 00:15:33.280 --> 00:15:37.950 the relative risk of being killed as an indigenous person, was at its peak 00:15:37.950 --> 00:15:42.639 during the government of General Ríos. 00:15:42.639 --> 00:15:46.709 Have we proven genocide? No. 00:15:46.709 --> 00:15:49.870 This is evidence consistent with the hypothesis that acts of genocide 00:15:49.870 --> 00:15:53.539 were committed. The finding of genocide is a legal finding, not so much 00:15:53.539 --> 00:15:58.580 a scientific one. So as scientists, our job is to provide evidence that 00:15:58.580 --> 00:16:02.870 the finders of fact – the judges in this case – can use in their determination. 00:16:02.870 --> 00:16:05.219 This is evidence consistent with that hypothesis. 00:16:05.219 --> 00:16:08.189 Were this evidence otherwise, as scientists we would say we would 00:16:08.189 --> 00:16:11.480 reject the hypothesis that genocide was committed. However, with this evidence 00:16:11.480 --> 00:16:15.370 we find that the evidence, the data is consistent with 00:16:15.370 --> 00:16:18.080 the prosecution’s hypothesis. 00:16:18.080 --> 00:16:25.320 So, it worked! 00:16:25.320 --> 00:16:29.049 Ríos Montt was convicted on genocide charges. applause 00:16:29.049 --> 00:16:31.359 You can clap! applause 00:16:31.359 --> 00:16:36.359 applause 00:16:36.359 --> 00:16:39.499 For a week! mumbled, surprised laughter 00:16:39.499 --> 00:16:42.279 Then the Constitutional Court intervened, 00:16:42.279 --> 00:16:44.959 there I know a couple of experts on Guatemala here in the audience 00:16:44.959 --> 00:16:47.839 who can tell you more about why that happened and exactly what happened. 00:16:47.839 --> 00:16:52.669 However, the Constitutional Court ordered a new trial, 00:16:52.669 --> 00:16:59.160 which is at this time scheduled for the very beginning of 2015. 
00:16:59.160 --> 00:17:02.970 And I look forward to testifying again, 00:17:02.970 --> 00:17:06.820 and again, and again, and again! 00:17:06.820 --> 00:17:12.680 applause 00:17:12.680 --> 00:17:16.989 Look, but I wanna come back to this point. Because as a bunch of technologists… 00:17:16.989 --> 00:17:21.589 – there is a lot of folks who really like technology here, I really like it too! 00:17:21.589 --> 00:17:25.559 Technology doesn’t get us to science – you have to have science 00:17:25.559 --> 00:17:28.770 to get you to science. Technology helps you organize the data. It helps you do 00:17:28.770 --> 00:17:32.050 all kinds of extremely great and cool things without which we wouldn’t be able 00:17:32.050 --> 00:17:36.480 to even do the science. But you can’t have just technology! 00:17:36.480 --> 00:17:40.970 You can’t just have a bunch of data and make conclusions. That’s naive, 00:17:40.970 --> 00:17:44.529 and you will get the wrong conclusions. ‘The point of rigorous statistics is 00:17:44.529 --> 00:17:48.100 to be right’, and there is a little bit of a caveat there – or to at least know 00:17:48.100 --> 00:17:51.620 how uncertain you are. Statistics is often called the ‘Science of Uncertainty’. 00:17:51.620 --> 00:17:55.960 That is actually my favorite definition of it. So, 00:17:55.960 --> 00:18:01.509 I’m going to assume that we care about getting it right. 00:18:01.509 --> 00:18:05.489 No one laughed, that’s good. 00:18:05.489 --> 00:18:08.890 Not everyone does, to my distress. 00:18:08.890 --> 00:18:11.320 So if you only have some of the data 00:18:11.320 --> 00:18:15.490 – and I will argue that we always only have some of the data – 00:18:15.490 --> 00:18:20.449 you need some kind of model that will tell you the relationship between your data 00:18:20.449 --> 00:18:23.989 and the real world. Statisticians call that an inference. 
00:18:23.989 --> 00:18:26.200 In order to get from here to there you’re gonna need some kind of 00:18:26.200 --> 00:18:30.469 probability model that tells you why your data is like the world, 00:18:30.469 --> 00:18:33.960 or in what sense you have to tweak, twiddle and do algebra with your data 00:18:33.960 --> 00:18:39.309 to get from what you can observe to what is actually true. 00:18:39.309 --> 00:18:42.690 And statistics is about comparisons. Yeah, we get a big number and 00:18:42.690 --> 00:18:46.169 journalists love the big number; but it’s really about these relationships 00:18:46.169 --> 00:18:50.609 and patterns! So to get those relationships and patterns, 00:18:50.609 --> 00:18:53.560 in order for them to be right, in order for our answer to be correct, 00:18:53.560 --> 00:18:57.439 every one of the estimates we make for every point in the pattern 00:18:57.439 --> 00:19:01.700 has to be right. It’s a hard problem. It’s a hard problem. 00:19:01.700 --> 00:19:05.070 And what I worry about is that we have come into this world 00:19:05.070 --> 00:19:09.400 in which people throw the notion of Big Data around as though the data allows us 00:19:09.400 --> 00:19:14.230 to make an end-run around problems of sampling and modeling. It doesn’t. 00:19:14.230 --> 00:19:19.120 So as technologists, the reason I’m, you know, ranting at you guys about it 00:19:19.120 --> 00:19:24.540 is that it’s very tempting to have a lot of data and think you have an answer! 00:19:24.540 --> 00:19:30.580 And it’s even more tempting because in an industry context you might be right. 00:19:30.580 --> 00:19:36.739 Not so much in Human Rights, not so much. Violence is a hidden process. 00:19:36.739 --> 00:19:39.960 The people who commit violence have an enormous commitment to hiding it, 00:19:39.960 --> 00:19:44.420 distorting it, explaining it in different ways.
All of those things dramatically 00:19:44.420 --> 00:19:48.350 affect the information that is produced from the violence that we’re going to use 00:19:48.350 --> 00:19:53.730 to do our analysis. So we usually don’t know what we don’t know 00:19:53.730 --> 00:19:58.220 in Human Rights data collection. And that means that we don’t know 00:19:58.220 --> 00:20:03.829 if what we don’t know is systematically different from what we do know. 00:20:03.829 --> 00:20:06.270 Maybe we know about all the lawyers and we don’t know about the people 00:20:06.270 --> 00:20:10.070 in the countryside. Maybe we know about all the indigenous people 00:20:10.070 --> 00:20:14.130 and not the non-indigenous people. If that were true, the argument 00:20:14.130 --> 00:20:17.980 that I just made would be merely an artifact of the reporting process 00:20:17.980 --> 00:20:21.740 rather than some true analysis. Now, we did the estimations, which is why I believe 00:20:21.740 --> 00:20:25.009 we can reject that critique, but that’s what we have to worry about. 00:20:25.009 --> 00:20:28.860 And let’s go back to the Venn diagram and say: which of these is accurate? 00:20:28.860 --> 00:20:32.840 It’s not just for one of the points in our pattern analysis. 00:20:32.840 --> 00:20:36.500 The problem is that we’re going to compare things. 00:20:36.500 --> 00:20:40.890 As in Peru where we compared killings committed by the Peruvian army against 00:20:40.890 --> 00:20:44.860 killings committed by the Maoist guerrillas of the Sendero Luminoso. And we found 00:20:44.860 --> 00:20:51.460 there that in fact we knew very little about what the Sendero Luminoso had done. 00:20:51.460 --> 00:20:55.779 Whereas we knew almost everything about what the Peruvian army had done. 00:20:55.779 --> 00:20:57.970 This is called the coverage rate. The ratio between what we know and 00:20:57.970 --> 00:21:02.750 what we don’t know. And raw data, however big, 00:21:02.750 --> 00:21:07.510 does not get us to patterns.
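A small numerical sketch can show why unequal coverage makes comparisons of raw counts misleading. Here the coverage rate is taken to be the documented count divided by the estimated total, which is one common way to express it and is an assumption of this sketch. The two groups and all of the numbers are invented for illustration and are not figures from Peru, Guatemala or any other case.

```python
def coverage_rate(documented, estimated_total):
    """Share of the estimated total that was actually documented."""
    return documented / estimated_total

# Hypothetical counts for two perpetrator groups (illustration only).
documented = {"group_x": 900, "group_y": 300}
estimated_total = {"group_x": 1000, "group_y": 1500}

coverage = {
    name: coverage_rate(documented[name], estimated_total[name])
    for name in documented
}

# Comparing raw counts suggests group_x killed three times as many people.
raw_ratio = documented["group_x"] / documented["group_y"]

# Comparing estimated totals reverses the conclusion.
estimated_ratio = estimated_total["group_x"] / estimated_total["group_y"]

print(coverage)
print(f"raw ratio x/y:       {raw_ratio:.2f}")
print(f"estimated ratio x/y: {estimated_ratio:.2f}")
```

In this invented example the well-documented group looks worse in the raw data only because 90% of its killings were recorded against 20% for the other group.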
And here is a bunch of… 00:21:07.510 --> 00:21:11.569 kinds of raw data that I’ve used and that I really enjoy using. 00:21:11.569 --> 00:21:14.270 You know – truth commission testimonies, UN investigations, press articles, 00:21:14.270 --> 00:21:18.309 SMS messages, crowdsourcing, NGO documentation, social media feeds, 00:21:18.309 --> 00:21:21.180 perpetrator records, government archives, state agency registries – I know those 00:21:21.180 --> 00:21:23.570 sound all the same but they actually turn out to be slightly different. 00:21:23.570 --> 00:21:28.340 Happy to talk in tedious detail! Refugee Camp records, any non-random sample. 00:21:28.340 --> 00:21:31.990 All of those are gonna take some kind of probability model 00:21:31.990 --> 00:21:36.070 and we don’t have that many probability models to use. So 00:21:36.070 --> 00:21:40.330 raw data is great for cases – but it doesn’t get you to patterns. 00:21:40.330 --> 00:21:45.120 And patterns – again – patterns are the thing that allow us to do analysis. 00:21:45.120 --> 00:21:49.289 They are the thing… the patterns are what get us to something that we can use 00:21:49.289 --> 00:21:53.629 to help prosecutors, advocates and the… 00:21:53.629 --> 00:21:56.409 and the victims themselves. 00:21:56.409 --> 00:22:00.589 I gave a version of this talk, a much earlier version of this talk 00:22:00.589 --> 00:22:04.630 several years ago in Medellín, Colombia. I’ve worked a lot in Colombia, 00:22:04.630 --> 00:22:07.670 it’s really… it’s a great place to work. There’s really terrific 00:22:07.670 --> 00:22:13.569 Victims Rights groups there. And a woman from a township, 00:22:13.569 --> 00:22:17.310 smaller than a county, near to Medellín came up to me after the talk and she said: 00:22:17.310 --> 00:22:21.150 “You know, a lot of people… you know I’m a Human Rights activist, 00:22:21.150 --> 00:22:25.309 my job is to collect data, I tell stories about people who have suffered.
00:22:25.309 --> 00:22:28.210 But there are people in my village I know who have had 00:22:28.210 --> 00:22:32.910 people in their families disappeared and they’re never gonna talk about it, ever. 00:22:32.910 --> 00:22:38.090 We’re never going to be able to use their names, because they are afraid.” 00:22:38.090 --> 00:22:45.349 We can’t name the victims. At least we’d better count them. 00:22:45.349 --> 00:22:49.520 So about that counting: there are 3 ways to do it right. You can have 00:22:49.520 --> 00:22:54.430 a perfect census – you can have all the data. Yeah it’s nice, good luck with that. 00:22:54.430 --> 00:22:58.910 You can have a random sample of the population – that’s hard! 00:22:58.910 --> 00:23:03.029 Sometimes doable but very hard. In my experience we rarely interview 00:23:03.029 --> 00:23:07.140 victims of homicide, very rarely. Laughing 00:23:07.140 --> 00:23:09.640 And that means there’s a complicated probability relationship between 00:23:09.640 --> 00:23:13.670 the person you sampled, the interview and the death that they talk to you about. 00:23:13.670 --> 00:23:17.300 Or you can do some kind of posterior modeling of the sampling process which is… 00:23:17.300 --> 00:23:21.260 which is in essence what I proposed in the earlier slide. 00:23:21.260 --> 00:23:25.020 So what can we do with raw data, guys? We can collect a bunch of… 00:23:25.020 --> 00:23:28.930 We can say that a case exists. Ok – that’s actually important! We can say: 00:23:28.930 --> 00:23:34.409 “Something happened” with raw data. We can say: “We know something about that case”. 00:23:34.409 --> 00:23:38.250 We can say: “There were 100 victims in that case or at least 100 victims 00:23:38.250 --> 00:23:41.570 in that case”, if we can name 100 people. 00:23:41.570 --> 00:23:46.390 But we can’t do comparisons: “This is the biggest massacre this year”. 00:23:46.390 --> 00:23:48.350 We don’t really know.
Because we don’t know about the massacres 00:23:48.350 --> 00:23:53.910 we don’t know about. No patterns. Don’t talk about the hot spot of violence. 00:23:53.910 --> 00:23:59.420 No, we don’t know that. Happy to talk more about that if we gather after, 00:23:59.420 --> 00:24:06.439 but I wanna come to a close here with the importance of getting it right. 00:24:06.439 --> 00:24:11.380 I’ve talked about one case today. This is another case, the case of this man: 00:24:11.380 --> 00:24:16.320 Edgar Fernando García. Mr. García was a student Labor leader in Guatemala 00:24:16.320 --> 00:24:19.800 early in the 1980s. He left his office in February 1984 00:24:19.800 --> 00:24:24.470 – did not come home. People reported later that they saw someone 00:24:24.470 --> 00:24:28.810 shoving Mr. García into a vehicle and driving away. 00:24:28.810 --> 00:24:33.900 His widow became a very important Human Rights activist in Guatemala 00:24:33.900 --> 00:24:38.570 and now she’s a very important, and in my opinion impressive politician. 00:24:38.570 --> 00:24:42.240 And there’s her infant daughter. She continued to struggle to find out 00:24:42.240 --> 00:24:46.130 what had happened to Mr. García for decades. 00:24:46.130 --> 00:24:50.400 And in 2006 documents came to light in the National Archives of the… 00:24:50.400 --> 00:24:54.429 excuse me, the Historical Archives of the National Police, showing that 00:24:54.429 --> 00:24:59.320 the Police had carried out an operation in the area of Mr. García’s office 00:24:59.320 --> 00:25:01.930 and it was very likely that they had disappeared him. 00:25:01.930 --> 00:25:07.400 These 2 guys up here in the upper right were Police officers in that area; 00:25:07.400 --> 00:25:11.359 they were arrested, charged with the disappearance of Mr. García and 00:25:11.359 --> 00:25:15.620 convicted. Part of the evidence used to convict them was communications metadata 00:25:15.620 --> 00:25:19.510 showing that documents flowed through the archive.
00:25:19.510 --> 00:25:23.699 I mean paper communications! We coded it by hand. We went through and read 00:25:23.699 --> 00:25:28.459 the ‘From’ and ‘To’ lines from every Memo. And 00:25:28.459 --> 00:25:34.229 they were convicted in 2010 and after that conviction 00:25:34.229 --> 00:25:38.699 Mr. García’s infant daughter – now a grown woman – was clearly joyful. 00:25:38.699 --> 00:25:42.730 Justice brings closure to a family that never knows when to start talking 00:25:42.730 --> 00:25:48.059 about someone in the past tense. Perhaps even more powerfully: 00:25:48.059 --> 00:25:52.319 those guys’ grand boss, their boss's boss, Colonel Héctor Bol de la Cruz, 00:25:52.319 --> 00:25:58.439 this man here, was convicted of Mr. García’s disappearance 00:25:58.439 --> 00:26:02.069 in September this year [2013]. applause 00:26:02.069 --> 00:26:07.610 applause 00:26:07.610 --> 00:26:10.789 I don’t know if any of you have ever been dissident students, 00:26:10.789 --> 00:26:15.330 but if you’ve been dissident students demonstrating in the street think about 00:26:15.330 --> 00:26:19.300 how you would feel if your friends and comrades were disappeared, 00:26:19.300 --> 00:26:23.419 and take a long look at Colonel Bol de la Cruz. Here is the rest of the stuff 00:26:23.419 --> 00:26:25.626 that we will talk about if we gather afterwards. Thank you very much 00:26:25.626 --> 00:26:29.086 for your attention. I really have enjoyed CCC. 00:26:29.086 --> 00:26:36.086 applause 00:26:36.086 --> 00:26:47.203 Subtitles created by c3subtitles.de in the year 2016. Join and help us!