WEBVTT
00:00:10.080 --> 00:00:17.985
applause
00:00:17.985 --> 00:00:22.900
Thank you very much, can you…
You can hear me? Yes!
00:00:22.900 --> 00:00:27.620
I’ve been at this now 23 years. We
worked, with… My colleagues and I,
00:00:27.620 --> 00:00:31.390
we worked in about 30 countries,
we’ve advised 9 Truth Commissions,
00:00:31.390 --> 00:00:36.410
official Truth Commissions, 4 UN missions,
00:00:36.410 --> 00:00:40.150
4 international criminal tribunals.
We have testified in 4 different cases
00:00:40.150 --> 00:00:44.240
– 2 internationally, 2 domestically – and
we’ve advised dozens and dozens
00:00:44.240 --> 00:00:49.120
of non-governmental Human Rights groups
around the world. The point of this stuff
00:00:49.120 --> 00:00:54.180
is to figure out how to bring the
knowledge of the people who’ve suffered
00:00:54.180 --> 00:00:58.770
human rights violations to bear,
on demanding accountability
00:00:58.770 --> 00:01:04.960
from the perpetrators. Our job is to
figure out how we can tell the truth.
00:01:04.960 --> 00:01:09.240
It is one of the moral foundations of the
international Human Rights movement
00:01:09.240 --> 00:01:14.220
that we speak Truth to Power. We
look in the face of the powerful
00:01:14.220 --> 00:01:19.299
and we tell them what we believe
they have done that is wrong.
00:01:19.299 --> 00:01:23.639
If that’s gonna work, we
have to speak the truth.
00:01:23.639 --> 00:01:29.470
We have to be right, we
have to get the analysis right.
00:01:29.470 --> 00:01:33.979
That’s not always easy and to get there,
00:01:33.979 --> 00:01:37.209
there are sort of 3 themes that
I wanna try to touch in this talk.
00:01:37.209 --> 00:01:40.379
Since the talk is pretty short I’m
really gonna touch on 2 of them, so
00:01:40.379 --> 00:01:43.619
at the very end of the talk I’ll invite
people who’d like to talk more about
00:01:43.619 --> 00:01:49.270
the specifically technical aspects of this
work, about classifiers, about clustering,
00:01:49.270 --> 00:01:53.620
about statistical estimation, about
database techniques. People who wanna talk
00:01:53.620 --> 00:01:56.990
about that I’d love to gather and we’ll
try to find a space. I’ve been fighting
00:01:56.990 --> 00:02:00.460
with the Wiki for 2 days; I think
I’m probably not the only one.
00:02:00.460 --> 00:02:04.959
We can gather, we can talk about
that stuff more in detail. So today,
00:02:04.959 --> 00:02:09.990
in the next 25 minutes I’m
going to focus specifically on
00:02:09.990 --> 00:02:14.520
the trial of General
José Efraín Ríos Montt
00:02:14.520 --> 00:02:20.200
who ruled Guatemala from
March 1982 until August 1983.
00:02:20.200 --> 00:02:25.180
That’s General Ríos, there in
the upper corner in the red tie.
00:02:25.180 --> 00:02:30.600
During the government
of General Ríos Montt
00:02:30.600 --> 00:02:35.610
tens of thousands of people were killed by
the army of Guatemala. And the question
00:02:35.610 --> 00:02:39.610
that has been facing Guatemalans
since that time is:
00:02:39.610 --> 00:02:44.080
“Did the pattern of killing
that the army committed
00:02:44.080 --> 00:02:49.690
constitute acts of genocide?”. Now
genocide is a very specific crime
00:02:49.690 --> 00:02:54.420
in International Law. It does not
mean you killed a lot of people.
00:02:54.420 --> 00:02:58.910
There are other war crimes for mass
killing. Genocide specifically means
00:02:58.910 --> 00:03:03.930
that you picked out a particular group;
and to the exclusion of other groups
00:03:03.930 --> 00:03:08.460
nearby them you focused
on eliminating that group.
00:03:08.460 --> 00:03:14.240
That’s key because for a statistician
that gives us a hypothesis we can test
00:03:14.240 --> 00:03:18.860
which is: “What is the relative risk,
what is the differential probability
00:03:18.860 --> 00:03:22.820
of people in the target group being
killed relative to their neighbours
00:03:22.820 --> 00:03:28.150
who are not in the target group?”
So without further ado,
00:03:28.150 --> 00:03:31.970
let’s look at the relative risk of
being killed for indigenous people
00:03:31.970 --> 00:03:36.880
in the 3 rural counties of
Chajul, Cotzal and Nebaj
00:03:36.880 --> 00:03:41.400
relative to their
non-indigenous neighbours.
00:03:41.400 --> 00:03:45.960
We have – and I’ll talk in a moment about
how we have this – we have information,
00:03:45.960 --> 00:03:51.490
and evidence, and estimations of the
deaths of about 2150 indigenous people.
00:03:51.490 --> 00:03:58.550
People killed by the army in the period
of the government of General Ríos.
00:03:58.550 --> 00:04:02.550
The population, the total number of
people alive who were indigenous
00:04:02.550 --> 00:04:07.370
in those counties in the census
of 1981 is about 39,000.
00:04:07.370 --> 00:04:14.500
So the approximate crude mortality
rate due to homicide by the army
00:04:14.500 --> 00:04:18.710
is 5.5% for indigenous people in
that period. Now that’s relative
00:04:18.710 --> 00:04:22.890
to the homicide rate for non-indigenous
people in the same place
00:04:22.890 --> 00:04:27.200
of approximately 0.7%. So what
we ask is: “What is the ratio
00:04:27.200 --> 00:04:30.530
between those 2 numbers?” And
the ratio between those 2 numbers
00:04:30.530 --> 00:04:35.600
is the relative risk. It’s approximately
8. We interpret that as: if you were
00:04:35.600 --> 00:04:41.339
an indigenous person alive in
one of those 3 counties in 1982,
00:04:41.339 --> 00:04:46.939
your probability of being killed
by the army was 8 times greater
00:04:46.939 --> 00:04:51.069
than a person also living
in those 3 counties
00:04:51.069 --> 00:04:56.179
who was not indigenous.
Eight times, 8 times!
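
NOTE
Editor's sketch, not part of the talk: a minimal Python check of the arithmetic just described. The death count, census population and the 0.7% rate are the approximate figures quoted by the speaker; the function name crude_rate is an assumption made for illustration.
```python
def crude_rate(deaths: float, population: float) -> float:
    """Return deaths divided by the population at risk."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population
# Figures quoted in the talk: about 2150 estimated indigenous deaths,
# about 39,000 indigenous people in the 1981 census.
indigenous_rate = crude_rate(2150, 39_000)
# The talk gives only the rate for non-indigenous people (about 0.7%),
# not the underlying counts, so the rate is used directly here.
non_indigenous_rate = 0.007
# Relative risk is the ratio of the two rates.
relative_risk = indigenous_rate / non_indigenous_rate
print(f"indigenous rate: {indigenous_rate:.1%}")
print(f"relative risk:   {relative_risk:.1f}")
```
With these rounded inputs the ratio comes out a little under 8, which matches the "approximately 8" quoted above.
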
00:04:56.179 --> 00:05:00.250
To put that in relative terms: the
probability… the relative risk of being
00:05:00.250 --> 00:05:04.720
a Bosniac relative to being Serb
in Bosnia during the war in Bosnia
00:05:04.720 --> 00:05:09.800
was a little less than 3. So your
relative risk of being indigenous
00:05:09.800 --> 00:05:13.310
was more than twice, nearly 3 times,
as much as your relative risk
00:05:13.310 --> 00:05:19.200
of being Bosniac in the Bosnian War.
It’s an astonishing level of focus.
00:05:19.200 --> 00:05:23.809
It shows a tremendous planning
and coherence, I believe.
00:05:23.809 --> 00:05:29.469
So, again coming back to the statistical
conclusion, how do we come to that?
00:05:29.469 --> 00:05:32.849
How do we find that information? How do we
make that conclusion? First, we’re only
00:05:32.849 --> 00:05:35.470
looking at homicides committed by the
army. We’re not looking at homicides
00:05:35.470 --> 00:05:39.409
committed by other parties, by
the guerrillas, by private actors.
00:05:39.409 --> 00:05:44.499
We’re not looking at excess mortality,
the mortality that we might find
00:05:44.499 --> 00:05:47.709
in conflict that is in excess of
normal peacetime mortality.
00:05:47.709 --> 00:05:51.470
We’re not looking at any of that,
only homicide. And the percentage
00:05:51.470 --> 00:05:55.330
relates the number of people killed by the
army with the population that was alive.
00:05:55.330 --> 00:05:58.650
That’s crucial here. We’re looking at
rates and we’re comparing the rate
00:05:58.650 --> 00:06:02.430
of the indigenous people shown in the
blue bar to non-indigenous people
00:06:02.430 --> 00:06:06.869
shown in the green bar. The width of
the bars shows the relative populations
00:06:06.869 --> 00:06:11.829
in each of those 2 communities. So clearly
there are many more indigenous people,
00:06:11.829 --> 00:06:14.980
but a higher fraction of them are also
killed. The bars also show something else.
00:06:14.980 --> 00:06:18.049
And that’s what I’ll focus on for the
rest of the talk. There are 2 sections
00:06:18.049 --> 00:06:22.159
to each of the 2 bars, a dark section
on the bottom, a lighter section on top.
00:06:22.159 --> 00:06:27.779
And what that indicates is what we know
in terms of being able to name people
00:06:27.779 --> 00:06:31.249
with their first and last name, their
location and dates of death, and
00:06:31.249 --> 00:06:35.560
what we must infer statistically. Now I’m
beginning to touch on the second theme
00:06:35.560 --> 00:06:40.949
of my talk: Which is that when we are
studying mass violence and war crimes,
00:06:40.949 --> 00:06:48.749
we cannot do statistical or pattern
analysis with raw information.
00:06:48.749 --> 00:06:51.950
We must use the tools of mathematical
statistics to understand
00:06:51.950 --> 00:06:56.080
what we don’t know! The information
which cannot be observed directly.
00:06:56.080 --> 00:07:00.649
We have to estimate that in order to
control for the process of the production
00:07:00.649 --> 00:07:04.989
of information. Information doesn’t just
fall out of the sky, the way it does
00:07:04.989 --> 00:07:10.359
for industry. If I’m running an ISP I know
every packet that runs through my routers.
00:07:10.359 --> 00:07:14.959
That’s not how the social world works. In
order to find information about killings
00:07:14.959 --> 00:07:17.889
we have to hear about that killing from
someone, we have to investigate,
00:07:17.889 --> 00:07:22.119
we have to find the human remains.
And if we can’t observe the killing
00:07:22.119 --> 00:07:28.130
we won’t hear about it and many killings
are hidden. In my team we have a kind of
00:07:28.130 --> 00:07:33.760
catch phrase: that the world… if a lawyer
is killed in a big city at high noon
00:07:33.760 --> 00:07:38.259
the world knows about it before
dinner time. Every single time.
00:07:38.259 --> 00:07:41.850
But when a rural peasant is killed 3 days’
walk from a road in the dead of night,
00:07:41.850 --> 00:07:45.489
we’re unlikely to ever hear. And
technology is not changing this.
00:07:45.489 --> 00:07:48.899
I’ll talk later about how technology is
actually making the problem worse.
00:07:48.899 --> 00:07:53.470
So, let’s get back to Guatemala
and just conclude
00:07:53.470 --> 00:07:57.950
that the little vertical bars, little
vertical lines at the top of each bar
00:07:57.950 --> 00:08:03.079
indicate the confidence interval. Which is
similar to what lay people sometimes call
00:08:03.079 --> 00:08:07.199
a margin of error. It is our level of
uncertainty about each of those estimates
00:08:07.199 --> 00:08:10.960
and you’ll notice that the uncertainty
is much, much smaller than
00:08:10.960 --> 00:08:14.509
the difference between the 2 bars. The
uncertainty does not affect our ability
00:08:14.509 --> 00:08:17.970
to draw the conclusion that there
was a spectacular difference
00:08:17.970 --> 00:08:21.900
in the mortality rates between the
people who were the hypothesized
00:08:21.900 --> 00:08:26.630
target of genocide and those who were not.
00:08:26.630 --> 00:08:30.520
Now the data: first we
had the census of 1981,
00:08:30.520 --> 00:08:35.339
this was a crucial piece. I think there’s
very interesting questions to ask
00:08:35.339 --> 00:08:39.609
about why the Government of Guatemala
conducted a census on the eve of
00:08:39.609 --> 00:08:44.540
committing a genocide. There is excellent
work done by historical demographers
00:08:44.540 --> 00:08:47.950
about the use of censuses in mass
violence. It has been common
00:08:47.950 --> 00:08:52.880
throughout history. Similarly,
or excuse me, in parallel
00:08:52.880 --> 00:08:57.420
there were 4 very large
projects. First, the CIIDH
00:08:57.420 --> 00:09:01.600
– a group of non-Governmental
Human Rights groups –
00:09:01.600 --> 00:09:06.610
collected 1240 records of deaths
in this three-county region.
00:09:06.610 --> 00:09:11.750
Next, the Catholic Church collected
a bit fewer than 800 deaths.
00:09:11.750 --> 00:09:16.539
The truth commission – the Comisión
para el Esclarecimiento Histórico (CEH) –
00:09:16.539 --> 00:09:22.000
conducted a really big research
project in the late 1990s and
00:09:22.000 --> 00:09:25.810
of that we got information about a little
bit more than a thousand deaths.
00:09:25.810 --> 00:09:30.450
And then the National Program for
Compensation is very, very large
00:09:30.450 --> 00:09:35.370
and gave us about 4700
records of deaths.
00:09:35.370 --> 00:09:40.659
Now, this is interesting
but this is not unique.
00:09:40.659 --> 00:09:45.769
Many of the deaths are reported in common
across those data sources and so…
00:09:45.769 --> 00:09:49.490
we think about this in terms of a Venn
diagram. We think of: how did these
00:09:49.490 --> 00:09:54.329
different data sets intersect with each
other or collide with each other. And
00:09:54.329 --> 00:09:59.130
we can diagram that as in the sense
of these 3 white circles intersecting.
00:09:59.130 --> 00:10:05.610
But as I mentioned earlier we’re also
interested in what we have not observed.
00:10:05.610 --> 00:10:09.490
And this is crucial for us because
when we’re thinking about
00:10:09.490 --> 00:10:13.420
how much information we have, we have to
distinguish between the world on the left,
00:10:13.420 --> 00:10:17.200
in which our intersecting circles
cover about a third of the reality,
00:10:17.200 --> 00:10:21.829
versus the world on the right where our
intersecting circles cover all of reality.
00:10:21.829 --> 00:10:26.390
These are very different worlds; and the
reason they’re so different is not simply
00:10:26.390 --> 00:10:29.710
because we want to know the magnitude,
not simply because we want to know
00:10:29.710 --> 00:10:34.490
the total number of killings. That’s
important – but even more important:
00:10:34.490 --> 00:10:40.160
we have to know that we’ve covered,
we’ve estimated in equal proportions
00:10:40.160 --> 00:10:44.430
the two parties. We have to estimate in
equal proportions the number of deaths
00:10:44.430 --> 00:10:48.340
of non-indigenous people and the
number of deaths of indigenous people.
00:10:48.340 --> 00:10:51.510
Because if we don’t get those
estimates correct our comparison
00:10:51.510 --> 00:10:56.080
of their mortality rates will be biased.
Our story will be wrong. We will fail
00:10:56.080 --> 00:11:01.840
to speak Truth to Power. We can’t have
that. So what do we do? Algebra!
00:11:01.840 --> 00:11:06.390
Algebra is our friend. So I’m gonna
give you just a tiny taste of how we
00:11:06.390 --> 00:11:09.650
solve this problem and I’m going to
introduce a series of assumptions.
00:11:09.650 --> 00:11:13.279
Those of you who would like to debate
those assumptions: I invite you to join me
00:11:13.279 --> 00:11:18.359
after the talk and we will talk endlessly
and tediously about capture heterogeneity.
00:11:18.359 --> 00:11:22.240
But in the short term,
00:11:22.240 --> 00:11:27.940
we have a universe N of total killings in
a specific time/space/ethnicity/location.
00:11:27.940 --> 00:11:30.690
And of that we have 2 projects A and B.
00:11:30.690 --> 00:11:34.619
A captures some number of
deaths from the universe N,
00:11:34.619 --> 00:11:40.169
and the probability with which a death is
captured by project A from the universe N
00:11:40.169 --> 00:11:44.600
is by elementary probability theory the
number of deaths documented by A
00:11:44.600 --> 00:11:48.740
divided by the unknown number
of deaths in the population N.
00:11:48.740 --> 00:11:52.969
Similarly, the probability with which a
death from N is documented by project B
00:11:52.969 --> 00:11:58.149
is B over N, and this is the cool part:
the probability with which a death
00:11:58.149 --> 00:12:01.949
is documented by both A and B is M over N.
00:12:01.949 --> 00:12:05.579
Now we can put the 2 databases together,
we can compare them. Let’s talk about
00:12:05.579 --> 00:12:09.370
the use of random forest classifiers
and clustering to do that later.
00:12:09.370 --> 00:12:12.489
But we can put the 2 databases together,
compare them, determine the deaths
00:12:12.489 --> 00:12:17.429
that are in M – that is, in both
A and B – and divide M by N.
00:12:17.429 --> 00:12:23.060
But, also by probability theory, the
probability that a death occurs in M
00:12:23.060 --> 00:12:27.740
is equal to the product of
the individual probabilities.
00:12:27.740 --> 00:12:31.619
The probability of any compound event, an
event made up of two independent events is
00:12:31.619 --> 00:12:36.410
equal to the product of the probabilities
of those two events, so M over N is equal to
00:12:36.410 --> 00:12:41.420
A over N times B over N. Solve for N.
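
NOTE
Editor's sketch, not part of the talk: the spoken derivation written out as equations. A and B are the numbers of deaths documented by each project, M is the number documented by both, and N is the unknown total. The step from the first line to the second assumes the two projects capture deaths independently. This two-list form is commonly known as the Lincoln-Petersen estimator.
```latex
\[
\Pr(A) = \frac{A}{N}, \qquad \Pr(B) = \frac{B}{N}, \qquad \Pr(A \cap B) = \frac{M}{N}
\]
\[
\frac{M}{N} = \frac{A}{N} \cdot \frac{B}{N}
\;\Longrightarrow\;
M N = A B
\;\Longrightarrow\;
\hat{N} = \frac{A B}{M}
\]
```
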
00:12:41.420 --> 00:12:45.140
Multiply it through by N squared, divide
by M, and we have an estimate of N
00:12:45.140 --> 00:12:49.360
which is equal to AB over M. Now, the
lights in my eyes, I can’t see, but I saw
00:12:49.360 --> 00:12:52.740
a few light bulbs go off over people’s
heads. And when I showed this proof
00:12:52.740 --> 00:12:57.180
to the judge in the trial of General Ríos
00:12:57.180 --> 00:13:01.529
I saw a light bulb go on over her head.
00:13:01.529 --> 00:13:04.379
It’s a beautiful thing,
it’s a beautiful thing.
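
NOTE
Editor's sketch, not part of the talk: the two-list estimate AB over M as a small Python function. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; the function name two_list_estimate is likewise an assumption. It relies on the independence assumption stated in the talk.
```python
def two_list_estimate(a: int, b: int, m: int) -> float:
    """Estimate the total N from two lists as a * b / m.
    a and b are the record counts of lists A and B; m is the overlap.
    Assumes the two lists capture deaths independently of each other.
    """
    if m <= 0:
        raise ValueError("need at least one matched record (m > 0)")
    if m > min(a, b):
        raise ValueError("overlap cannot exceed either list")
    return a * b / m
# Hypothetical counts, not figures from the Guatemala data.
a, b, m = 300, 200, 60
n_hat = two_list_estimate(a, b, m)
# Deaths seen by at least one project, and the estimated unseen remainder.
documented = a + b - m
undocumented = n_hat - documented
print(n_hat, documented, undocumented)
```
The gap between the documented count and the estimate is the part that, in the speaker's words, must be inferred statistically.
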
00:13:04.379 --> 00:13:09.509
applause
00:13:09.509 --> 00:13:12.660
So we don’t do it in 2 systems because
that takes a lot of assumptions.
00:13:12.660 --> 00:13:16.069
We do it in 4. You will recall that we
have 4 data sources. We organize
00:13:16.069 --> 00:13:21.530
the data sources in this format
such that we have an inclusion
00:13:21.530 --> 00:13:26.249
and an exclusion pattern in the table on
the left, which… for which we can define
00:13:26.249 --> 00:13:29.810
the number of deaths which fall into
each of these intersecting patterns.
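
NOTE
Editor's sketch, not part of the talk: one way to tabulate inclusion and exclusion patterns across four sources in Python. The record identifiers and their source memberships are invented, and the short source labels are placeholders for the four projects named earlier; this is not the team's actual code or data.
```python
from collections import Counter
# Placeholder labels for the four data sources named in the talk.
SOURCES = ("CIIDH", "Church", "CEH", "Compensation")
# Hypothetical matched records: each deduplicated death maps to the
# set of sources that reported it.
matched = {
    "death-001": {"CIIDH", "CEH"},
    "death-002": {"Compensation"},
    "death-003": {"CIIDH", "Church", "CEH", "Compensation"},
    "death-004": {"Compensation"},
    "death-005": {"Church", "Compensation"},
    "death-006": {"CIIDH", "CEH"},
}
def capture_pattern(found_in: set) -> tuple:
    """Return a 0/1 flag per source, in SOURCES order."""
    return tuple(int(source in found_in) for source in SOURCES)
# Count how many deaths fall into each inclusion/exclusion pattern.
pattern_counts = Counter(capture_pattern(s) for s in matched.values())
for pattern, count in sorted(pattern_counts.items(), reverse=True):
    print(pattern, count)
```
The all-zero pattern never appears in such a table, because it stands for deaths no source recorded; that empty cell is the quantity the estimation has to fill in.
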
00:13:29.810 --> 00:13:33.729
And I’ll give you a very quick
metaphor here. The metaphor is:
00:13:33.729 --> 00:13:38.239
imagine that you have 2 dark rooms and you
want to assess the size of those 2 rooms
00:13:38.239 --> 00:13:42.049
– which room is larger? And the only
tool that you have to assess the size
00:13:42.049 --> 00:13:46.359
of those rooms is a handful of little
rubber balls. The little rubber balls
00:13:46.359 --> 00:13:50.400
have a property that when they hit each
other they make a sound. [makes CLICK sound]
00:13:50.400 --> 00:13:53.390
So we throw the balls into the first
room and we listen, and we hear
00:13:53.390 --> 00:13:57.190
[makes several CLICK sounds]. We
collect the balls, go to the second room,
00:13:57.190 --> 00:14:00.490
throw them with equal force – imagining
a spherical cow of uniform density!
00:14:00.490 --> 00:14:03.950
We throw the balls into the second
room with equal force and we hear
00:14:03.950 --> 00:14:07.799
[makes one CLICK sound]
So which room is larger?
00:14:07.799 --> 00:14:12.070
The second room, because we hear fewer
collisions, right? Well, the estimation,
00:14:12.070 --> 00:14:15.620
the toy example I gave in the previous
slide is the mathematical formalization
00:14:15.620 --> 00:14:20.070
of the intuition that fewer
collisions mean a larger space.
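
NOTE
Editor's sketch, not part of the talk: a toy Python model of the dark-room metaphor. It assumes each ball lands uniformly at random in one of room_size equally likely spots, so the expected number of colliding pairs is the number of pairs divided by room_size. The ball count and click counts are hypothetical, and this is only an illustration of the intuition, not the estimator used in the testimony.
```python
def expected_collisions(balls: int, room_size: int) -> float:
    """Expected colliding pairs for balls thrown into room_size spots."""
    pairs = balls * (balls - 1) / 2
    return pairs / room_size
def room_size_from_collisions(balls: int, collisions: int) -> float:
    """Invert the expectation: more collisions imply a smaller room."""
    if collisions <= 0:
        raise ValueError("need at least one collision to estimate a size")
    return balls * (balls - 1) / 2 / collisions
balls = 20  # hypothetical handful of rubber balls
small_room = room_size_from_collisions(balls, collisions=5)  # several clicks
large_room = room_size_from_collisions(balls, collisions=1)  # a single click
print(small_room, large_room)
```
Fewer clicks for the same handful of balls gives a larger estimated room, which is the same logic as a small overlap M giving a large estimate of N.
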
00:14:20.070 --> 00:14:23.329
And so what we’re doing here is
laying out the pattern of collisions.
00:14:23.329 --> 00:14:26.679
Not just the collisions, the pairwise
collisions, but the three-way and
00:14:26.679 --> 00:14:31.409
four-way collisions. And that
allows us to make the estimate
00:14:31.409 --> 00:14:37.439
that was shown in the bar graph of
the light part of each of the bars. So
00:14:37.439 --> 00:14:41.460
we can come back to our conclusion and put
a confidence interval on the estimates.
00:14:41.460 --> 00:14:45.910
And the confidence intervals are shown
there. Now I’m gonna move through this
00:14:45.910 --> 00:14:50.850
somewhat more quickly to get to the end of
the talk but I wanna put up one more slide
00:14:50.850 --> 00:14:56.240
that was used in the testimony
and that is that we divided time
00:14:56.240 --> 00:15:01.220
into 16-month periods and
compared the 16-month period of
00:15:01.220 --> 00:15:04.580
General Ríos’s governance – now it’s only
16 months ’cause we went April to July,
00:15:04.580 --> 00:15:07.679
because it’s only a few days in August, a
few days in March, so we shaved those off,
00:15:07.679 --> 00:15:12.310
okay… – 16-month period of General
Ríos’s Government and compared it
00:15:12.310 --> 00:15:17.110
to several periods before and after. And
I think that the key observation here
00:15:17.110 --> 00:15:21.809
is that the rate of killing
against indigenous people
00:15:21.809 --> 00:15:26.729
is substantially higher under General
Ríos’s Government than under previous
00:15:26.729 --> 00:15:33.280
or succeeding governments. But more
importantly the ratio between the two,
00:15:33.280 --> 00:15:37.950
the relative risk of being killed as an
indigenous person, was at its peak
00:15:37.950 --> 00:15:42.639
during the government of General Ríos.
00:15:42.639 --> 00:15:46.709
Have we proven genocide? No.
00:15:46.709 --> 00:15:49.870
This is evidence consistent with the
hypothesis that acts of genocide
00:15:49.870 --> 00:15:53.539
were committed. The finding of genocide
is a legal finding, not so much
00:15:53.539 --> 00:15:58.580
a scientific one. So as scientists,
our job is to provide evidence that
00:15:58.580 --> 00:16:02.870
the finders of fact – the judges in this
case – can use in their determination.
00:16:02.870 --> 00:16:05.219
This is evidence consistent
with that hypothesis.
00:16:05.219 --> 00:16:08.189
Were this evidence otherwise, as
scientists we would say we would
00:16:08.189 --> 00:16:11.480
reject the hypothesis that genocide was
committed. However, with this evidence
00:16:11.480 --> 00:16:15.370
we find that the evidence,
the data is consistent with
00:16:15.370 --> 00:16:18.080
the prosecution’s hypothesis.
00:16:18.080 --> 00:16:25.320
So, it worked!
00:16:25.320 --> 00:16:29.049
Ríos Montt was convicted on
genocide charges. [applause]
00:16:29.049 --> 00:16:31.359
You can clap!
applause
00:16:31.359 --> 00:16:36.359
applause
00:16:36.359 --> 00:16:39.499
For a week!
mumbled, surprised laughter
00:16:39.499 --> 00:16:42.279
Then the Constitutional Court intervened,
00:16:42.279 --> 00:16:44.959
there are, I know, a couple of experts on
Guatemala here in the audience
00:16:44.959 --> 00:16:47.839
who can tell you more about why that
happened and exactly what happened.
00:16:47.839 --> 00:16:52.669
However, the Constitutional
Court ordered a new trial,
00:16:52.669 --> 00:16:59.160
which is at this time scheduled
for the very beginning of 2015.
00:16:59.160 --> 00:17:02.970
And I look forward to testifying again,
00:17:02.970 --> 00:17:06.820
and again, and again, and again!
00:17:06.820 --> 00:17:12.680
applause
00:17:12.680 --> 00:17:16.989
Look, but I wanna come back to this point.
Because as a bunch of technologists…
00:17:16.989 --> 00:17:21.589
– there is a lot of folks who really like
technology here, I really like it too!
00:17:21.589 --> 00:17:25.559
Technology doesn’t get us to science
– you have to have science
00:17:25.559 --> 00:17:28.770
to get you to science. Technology helps
you organize the data. It helps you do
00:17:28.770 --> 00:17:32.050
all kinds of extremely great and cool
things without which we wouldn’t be able
00:17:32.050 --> 00:17:36.480
to even do the science. But you
can’t have just technology!
00:17:36.480 --> 00:17:40.970
You can’t just have a bunch of data
and make conclusions. That’s naive,
00:17:40.970 --> 00:17:44.529
and you will get the wrong conclusions.
‘The point of rigorous statistics is
00:17:44.529 --> 00:17:48.100
to be right’, and there is a little bit of
a caveat there – or to at least know
00:17:48.100 --> 00:17:51.620
how uncertain you are. Statistics is often
called the ‘Science of Uncertainty’.
00:17:51.620 --> 00:17:55.960
That is actually my favorite
definition of it. So,
00:17:55.960 --> 00:18:01.509
I’m going to assume that we
care about getting it right.
00:18:01.509 --> 00:18:05.489
No one laughed, that’s good.
00:18:05.489 --> 00:18:08.890
Not everyone does, to my distress.
00:18:08.890 --> 00:18:11.320
So if you only have some of the data
00:18:11.320 --> 00:18:15.490
– and I will argue that we always
only have some of the data –
00:18:15.490 --> 00:18:20.449
you need some kind of model that will tell
you the relationship between your data
00:18:20.449 --> 00:18:23.989
and the real world.
Statisticians call that an inference.
00:18:23.989 --> 00:18:26.200
In order to get from here to there
you’re gonna need some kind of
00:18:26.200 --> 00:18:30.469
probability model that tells you
why your data is like the world,
00:18:30.469 --> 00:18:33.960
or in what sense you have to tweak,
twiddle and do algebra with your data
00:18:33.960 --> 00:18:39.309
to get from what you can
observe to what is actually true.
00:18:39.309 --> 00:18:42.690
And statistics is about comparisons.
Yeah, we get a big number and
00:18:42.690 --> 00:18:46.169
journalists love the big number; but
it’s really about these relationships
00:18:46.169 --> 00:18:50.609
and patterns! So to get those
relationships and patterns,
00:18:50.609 --> 00:18:53.560
in order for them to be right, in order
for our answer to be correct,
00:18:53.560 --> 00:18:57.439
every one of the estimates we make
for every point in the pattern
00:18:57.439 --> 00:19:01.700
has to be right. It’s a hard
problem. It’s a hard problem.
00:19:01.700 --> 00:19:05.070
And what I worry about is that
we have come into this world
00:19:05.070 --> 00:19:09.400
in which people throw the notion of Big
Data around as though the data allows us
00:19:09.400 --> 00:19:14.230
to make an end-run around problems
of sampling and modeling. It doesn’t.
00:19:14.230 --> 00:19:19.120
So as technologists, the reason I’m,
you know, ranting at you guys about it
00:19:19.120 --> 00:19:24.540
is that it’s very tempting to have a lot
of data and think you have an answer!
00:19:24.540 --> 00:19:30.580
And it’s even more tempting because
in an industry context you might be right.
00:19:30.580 --> 00:19:36.739
Not so much in Human Rights, not so
much. Violence is a hidden process.
00:19:36.739 --> 00:19:39.960
The people who commit violence have
an enormous commitment to hiding it,
00:19:39.960 --> 00:19:44.420
distorting it, explaining it in different
ways. All of those things dramatically
00:19:44.420 --> 00:19:48.350
affect the information that is produced
from the violence that we’re going to use
00:19:48.350 --> 00:19:53.730
to do our analysis. So we usually
don’t know what we don’t know
00:19:53.730 --> 00:19:58.220
in Human Rights data collection.
And that means that we don’t know
00:19:58.220 --> 00:20:03.829
if what we don’t know is systematically
different from what we do know.
00:20:03.829 --> 00:20:06.270
Maybe we know about all the lawyers
and we don’t know about the people
00:20:06.270 --> 00:20:10.070
in the countryside. Maybe we know
about all the indigenous people
00:20:10.070 --> 00:20:14.130
and not the non-indigenous people.
If that were true, the argument
00:20:14.130 --> 00:20:17.980
that I just made would be merely
an artifact of the reporting process
00:20:17.980 --> 00:20:21.740
rather than some true analysis. Now
we did the estimations, which is why I believe
00:20:21.740 --> 00:20:25.009
we can reject that critique, but that’s
what we have to worry about.
00:20:25.009 --> 00:20:28.860
And let’s go back to the Venn diagram
and say: which of these is accurate?
00:20:28.860 --> 00:20:32.840
It’s not just for one of the
points in our pattern analysis.
00:20:32.840 --> 00:20:36.500
The problem is that we’re
going to compare things.
00:20:36.500 --> 00:20:40.890
As in Peru where we compared killings
committed by the Peruvian army against
00:20:40.890 --> 00:20:44.860
killings committed by the Maoist guerrillas
of the Sendero Luminoso. And we found
00:20:44.860 --> 00:20:51.460
there that in fact we knew very little
about what the Sendero Luminoso had done.
00:20:51.460 --> 00:20:55.779
Whereas we knew almost everything
about what the Peruvian army had done.
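
NOTE
Editor's sketch, not part of the talk: a small Python illustration of why unequal coverage distorts a comparison. Coverage is taken here to mean the fraction of true killings that end up documented, which is an assumption about the term. All numbers are invented for illustration and are not findings from Peru.
```python
def observed_count(true_count: float, coverage: float) -> float:
    """Number of killings that reach the database at a given coverage."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    return true_count * coverage
# Hypothetical scenario: two perpetrators with the same true toll,
# but one is documented far more completely than the other.
true_army, true_guerrilla = 1000, 1000
army_coverage, guerrilla_coverage = 0.9, 0.3
obs_army = observed_count(true_army, army_coverage)
obs_guerrilla = observed_count(true_guerrilla, guerrilla_coverage)
# The raw ratio reflects the reporting process, not the violence.
raw_ratio = obs_army / obs_guerrilla
# Dividing each count by its coverage recovers the true comparison.
corrected_ratio = (obs_army / army_coverage) / (obs_guerrilla / guerrilla_coverage)
print(raw_ratio, corrected_ratio)
```
In this invented case the raw data suggest one party killed three times as many people when the true tolls are equal, which is why the coverage of each group has to be estimated before comparing them.
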
00:20:55.779 --> 00:20:57.970
This is called the coverage rate.
The rate between what we know and
00:20:57.970 --> 00:21:02.750
what we don’t know. And
raw data, however big,
00:21:02.750 --> 00:21:07.510
does not get us to patterns.
And here is a bunch of…
00:21:07.510 --> 00:21:11.569
kinds of raw data that I’ve used
and that I really enjoy using.
00:21:11.569 --> 00:21:14.270
You know – truth commission testimonies,
UN investigations, press articles,
00:21:14.270 --> 00:21:18.309
SMS messages, crowdsourcing, NGO
documentation, social media feeds,
00:21:18.309 --> 00:21:21.180
perpetrator records, government archives,
state agency registries – I know those
00:21:21.180 --> 00:21:23.570
sound all the same but they actually
turn out to be slightly different.
00:21:23.570 --> 00:21:28.340
Happy to talk in tedious detail! Refugee
Camp records, any non-random sample.
00:21:28.340 --> 00:21:31.990
All of those are gonna take
some kind of probability model
00:21:31.990 --> 00:21:36.070
and we don’t have that many
probability models to use. So
00:21:36.070 --> 00:21:40.330
raw data is great for cases – but
it doesn’t get you to patterns.
00:21:40.330 --> 00:21:45.120
And patterns – again – patterns are
the thing that allow us to do analysis.
00:21:45.120 --> 00:21:49.289
They are the thing… the patterns are what
get us to something that we can use
00:21:49.289 --> 00:21:53.629
to help prosecutors, advocates and the…
00:21:53.629 --> 00:21:56.409
and the victims themselves.
00:21:56.409 --> 00:22:00.589
I gave a version of this talk, a
much earlier version of this talk
00:22:00.589 --> 00:22:04.630
several years ago in Medellín, Colombia.
I’ve worked a lot in Colombia,
00:22:04.630 --> 00:22:07.670
it’s really… it’s a great place to
work. There’s really terrific
00:22:07.670 --> 00:22:13.569
Victims Rights groups there.
And a woman from a township,
00:22:13.569 --> 00:22:17.310
smaller than a county, near to Medellín
came up to me after the talk and she said:
00:22:17.310 --> 00:22:21.150
“You know, a lot of people… you
know I’m a Human Rights activist,
00:22:21.150 --> 00:22:25.309
my job is to collect data, I tell stories
about people who have suffered.
00:22:25.309 --> 00:22:28.210
But there are people in my
village I know who have had
00:22:28.210 --> 00:22:32.910
people in their families disappeared and
they’re never gonna talk about it, ever.
00:22:32.910 --> 00:22:38.090
We’re never going to be able to use
their names, because they are afraid.”
00:22:38.090 --> 00:22:45.349
We can’t name the victims. At
least we’d better count them.
00:22:45.349 --> 00:22:49.520
So about that counting: there’s
3 ways to do it right. You can have
00:22:49.520 --> 00:22:54.430
a perfect census – you can have all the
data. Yeah it’s nice, good luck with that.
00:22:54.430 --> 00:22:58.910
You can have a random sample
of the population – that’s hard!
00:22:58.910 --> 00:23:03.029
Sometimes doable but very hard.
In my experience we rarely interview
00:23:03.029 --> 00:23:07.140
victims of homicide, very rarely.
Laughing
00:23:07.140 --> 00:23:09.640
And that means there’s a complicated
probability relationship between
00:23:09.640 --> 00:23:13.670
the person you sampled, the interview
and the death that they talk to you about.
00:23:13.670 --> 00:23:17.300
Or you can do some kind of posterior
modeling of the sampling process which is…
00:23:17.300 --> 00:23:21.260
which is in essence what
I proposed in the earlier slide.
00:23:21.260 --> 00:23:25.020
So what can we do with raw data,
guys? We can collect a bunch of…
00:23:25.020 --> 00:23:28.930
We can say that a case exists. Ok
– that’s actually important! We can say:
00:23:28.930 --> 00:23:34.409
“Something happened” with raw data. We can
say: “We know something about that case”.
00:23:34.409 --> 00:23:38.250
We can say: “There were 100 victims
in that case or at least 100 victims
00:23:38.250 --> 00:23:41.570
in that case”, if we can name 100 people.
00:23:41.570 --> 00:23:46.390
But we can’t do comparisons: “This
is the biggest massacre this year”.
00:23:46.390 --> 00:23:48.350
We don’t really know. Because we
don’t know about the massacres
00:23:48.350 --> 00:23:53.910
we don’t know about. No patterns. Don’t
talk about the hot spot of violence.
00:23:53.910 --> 00:23:59.420
No, we don’t know that. Happy to talk
more about that if we gather after,
00:23:59.420 --> 00:24:06.439
but I wanna come to a close here with
the importance of getting it right.
00:24:06.439 --> 00:24:11.380
I’ve talked about one case today. This
is another case, the case of this man:
00:24:11.380 --> 00:24:16.320
Edgar Fernando García. Mr. García was
a student Labor leader in Guatemala
00:24:16.320 --> 00:24:19.800
early in the 1980s. He left
his office in February 1984
00:24:19.800 --> 00:24:24.470
– did not come home. People reported
later that they saw someone
00:24:24.470 --> 00:24:28.810
shoving Mr. García into a
vehicle and driving away.
00:24:28.810 --> 00:24:33.900
His widow became a very important
Human Rights activist in Guatemala
00:24:33.900 --> 00:24:38.570
and now she’s a very important, and
in my opinion impressive politician.
00:24:38.570 --> 00:24:42.240
And there’s her infant daughter. She
continued to struggle to find out
00:24:42.240 --> 00:24:46.130
what had happened to
Mr. García for decades.
00:24:46.130 --> 00:24:50.400
And in 2006 documents came to light
in the National Archives of the…
00:24:50.400 --> 00:24:54.429
excuse me, the Historical Archives
of the national Police, showing that
00:24:54.429 --> 00:24:59.320
the Police had realized an operation
in the area of Mr. García’s office
00:24:59.320 --> 00:25:01.930
and it was very likely that
they had disappeared him.
00:25:01.930 --> 00:25:07.400
These 2 guys up here in the upper
right were Police officers in that area;
00:25:07.400 --> 00:25:11.359
they were arrested, charged with the
disappearance of Mr. García and
00:25:11.359 --> 00:25:15.620
convicted. Part of the evidence used to
convict them was communications metadata
00:25:15.620 --> 00:25:19.510
showing that documents
flowed through the archive.
00:25:19.510 --> 00:25:23.699
I mean paper communications! We coded
it by hand. We went through and read
00:25:23.699 --> 00:25:28.459
the ‘From’ and ‘To’ lines
from every Memo. And
00:25:28.459 --> 00:25:34.229
they were convicted in 2010
and after that conviction
00:25:34.229 --> 00:25:38.699
Mr. García’s infant daughter – now
a grown woman – was clearly joyful.
00:25:38.699 --> 00:25:42.730
Justice brings closure to a family
that never knows when to start talking
00:25:42.730 --> 00:25:48.059
about someone in the past tense.
Perhaps even more powerfully:
00:25:48.059 --> 00:25:52.319
those guys’ grand boss, their boss’s
boss, Colonel Héctor Bol de la Cruz,
00:25:52.319 --> 00:25:58.439
this man here, was convicted
of Mr. García’s disappearance
00:25:58.439 --> 00:26:02.069
in September this year [2013].
applause
00:26:02.069 --> 00:26:07.610
applause
00:26:07.610 --> 00:26:10.789
I don’t know if any of you have
ever been dissident students,
00:26:10.789 --> 00:26:15.330
but if you’ve been dissident students
demonstrating in the street think about
00:26:15.330 --> 00:26:19.300
how you would feel if your friends
and comrades were disappeared,
00:26:19.300 --> 00:26:23.419
and take a long look at Colonel Bol
de la Cruz. Here is the rest of the stuff
00:26:23.419 --> 00:26:25.626
that we will talk about if we gather
afterwards. Thank you very much
00:26:25.626 --> 00:26:29.086
for your attention. I really
have enjoyed CCC.
00:26:29.086 --> 00:26:36.086
applause
00:26:36.086 --> 00:26:47.203
Subtitles created by c3subtitles.de
in the year 2016. Join and help us!