applause
Thank you very much, can you…
You can hear me? Yes!
I’ve been at this now 23 years. We
worked, with… My colleagues and I,
we worked in about 30 countries,
we’ve advised 9 Truth Commissions,
official Truth Commissions, 4 UN missions,
4 international criminal tribunals.
We have testified in 4 different cases
– 2 internationally, 2 domestically – and
we’ve advised dozens and dozens
of non-governmental Human Rights groups
around the world. The point of this stuff
is to figure out how to bring the
knowledge of the people who’ve suffered
human rights violations to bear,
on demanding accountability
from the perpetrators. Our job is to
figure out how we can tell the truth.
It is one of the moral foundations of the
international Human Rights movement
that we speak Truth to Power. We
look in the face of the powerful
and we tell them what we believe
they have done that is wrong.
If that’s gonna work, we
have to speak the truth.
We have to be right, we
have to get the analysis right.
That’s not always easy and to get there,
there are sort of 3 themes that
I wanna try to touch in this talk.
Since the talk is pretty short I’m
really gonna touch on 2 of them, so
at the very end of the talk I’ll invite
people who’d like to talk more about
the specifically technical aspects of this
work, about classifiers, about clustering,
about statistical estimation, about
database techniques. People who wanna talk
about that I’d love to gather and we’ll
try to find a space. I’ve been fighting
with the Wiki for 2 days; I think
I’m probably not the only one.
We can gather, we can talk about
that stuff more in detail. So today,
in the next 25 minutes I’m
going to focus specifically on
the trial of General
José Efraín Ríos Montt
who ruled Guatemala from
March 1982 until August 1983.
That’s General Ríos, there in
the upper corner in the red tie.
During the government
of General Ríos Montt
tens of thousands of people were killed by
the army of Guatemala. And the question
that has been facing Guatemalans
since that time is:
“Did the pattern of killing
that the army committed
constitute acts of genocide?”. Now
genocide is a very specific crime
in International Law. It does not
mean you killed a lot of people.
There are other war crimes for mass
killing. Genocide specifically means
that you picked out a particular group;
and to the exclusion of other groups
nearby them you focused
on eliminating that group.
That’s key because for a statistician
that gives us a hypothesis we can test
which is: “What is the relative risk,
what is the differential probability
of people in the target group being
killed relative to their neighbours
who are not in the target group?”
So without further ado,
let’s look at the relative risk of
being killed for indigenous people
in the 3 rural counties of
Chajul, Cotzal and Nebaj
relative to their
non-indigenous neighbours.
We have – and I’ll talk in a moment about
how we have this – we have information,
and evidence, and estimations of the
deaths of about 2150 indigenous people.
People killed by the army in the period
of the government of General Ríos.
The population, the total number of
people alive who were indigenous
in those counties in the census
of 1981 is about 39,000.
So the approximate crude mortality
rate due to homicide by the army
is 5.5% for indigenous people in
that period. Now that’s relative
to the homicide rate for non-indigenous
people in the same place
of approximately 0.7%. So what
we ask is: “What is the ratio
between those 2 numbers?” And
the ratio between those 2 numbers
is the relative risk. It’s approximately
8. We interpret that as: if you were
an indigenous person alive in
one of those 3 counties in 1982,
your probability of being killed
by the army was 8 times greater
than a person also living
in those 3 counties
who was not indigenous.
Eight times, 8 times!
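The arithmetic behind that figure can be sketched in a few lines. This is only an illustration using the rounded numbers quoted in the talk (about 2,150 estimated deaths, a census population of about 39,000, and a non-indigenous rate of roughly 0.7%); the variable names are made up for the example and are not from the original analysis.

```python
# Rounded figures as quoted in the talk; the names are illustrative only.
indigenous_deaths = 2150        # estimated killings by the army
indigenous_population = 39000   # 1981 census, three counties combined

# Crude mortality rate due to army homicide for indigenous people.
indigenous_rate = indigenous_deaths / indigenous_population

# The talk states the non-indigenous rate directly, at roughly 0.7%.
non_indigenous_rate = 0.007

# The relative risk is simply the ratio of the two rates.
relative_risk = indigenous_rate / non_indigenous_rate

print(f"indigenous rate: {indigenous_rate:.1%}")
print(f"relative risk:   {relative_risk:.1f}")
```

With these rounded inputs the ratio comes out a little under 8, which matches the "approximately 8" in the talk.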
To put that in relative terms: the
probability… the relative risk of being
a Bosniac relative to being Serb
in Bosnia during the war in Bosnia
was a little less than 3. So the
relative risk for indigenous people here
was more than twice, nearly 3 times,
the relative risk for Bosniacs
in the Bosnian War.
It’s an astonishing level of focus.
It shows a tremendous planning
and coherence, I believe.
So, again coming back to the statistical
conclusion, how do we come to that?
How do we find that information? How do we
make that conclusion? First, we’re only
looking at homicides committed by the
army. We’re not looking at homicides
committed by other parties, by
the guerrillas, by private actors.
We’re not looking at excess mortality,
the mortality that we might find
in conflict that is in excess of
normal peacetime mortality.
We’re not looking at any of that,
only homicide. And the percentage
relates the number of people killed by the
army with the population that was alive.
That’s crucial here. We’re looking at
rates and we’re comparing the rate
of the indigenous people shown in the
blue bar to non-indigenous people
shown in the green bar. The width of
the bars shows the relative populations
in each of those 2 communities. So clearly
there are many more indigenous people,
but a higher fraction of them are also
killed. The bars also show something else.
And that’s what I’ll focus on for the
rest of the talk. There are 2 sections
to each of the 2 bars, a dark section
on the bottom, a lighter section on top.
And what that indicates is what we know
in terms of being able to name people
with their first and last name, their
location and dates of death, and
what we must infer statistically. Now I’m
beginning to touch on the second theme
of my talk: Which is that when we are
studying mass violence and war crimes,
we cannot do statistical or pattern
analysis with raw information.
We must use the tools of mathematical
statistics to understand
what we don’t know! The information
which cannot be observed directly.
We have to estimate that in order to
control for the process of the production
of information. Information doesn’t just
fall out of the sky, the way it does
for industry. If I’m running an ISP I know
every packet that runs through my routers.
That’s not how the social world works. In
order to find information about killings
we have to hear about that killing from
someone, we have to investigate,
we have to find the human remains.
And if we can’t observe the killing
we won’t hear about it and many killings
are hidden. In my team we have a kind of
catch phrase: that the world… if a lawyer
is killed in a big city at high noon
the world knows about it before
dinner time. Every single time.
But when a rural peasant is killed 3 days’
walk from a road in the dead of night,
we’re unlikely to ever hear. And
technology is not changing this.
I’ll talk later about how technology is
actually making the problem worse.
So, let’s get back to Guatemala
and just conclude
that the little vertical bars, little
vertical lines at the top of each bar
indicate the confidence interval. Which is
similar to what lay people sometimes call
a margin of error. It is our level of
uncertainty about each of those estimates
and you’ll notice that the uncertainty
is much, much smaller than
the difference between the 2 bars. The
uncertainty does not affect our ability
to draw the conclusion that there
was a spectacular difference
in the mortality rates between the
people who were the hypothesized
target of genocide and those who were not.
Now the data: first we
had the census of 1981,
this was a crucial piece. I think there are
very interesting questions to ask
about why the Government of Guatemala
conducted a census on the eve of
committing a genocide. There is excellent
work done by historical demographers
about the use of censuses in mass
violence. It has been common
throughout history. Similarly,
or excuse me, in parallel
there were 4 very large
projects. First, the CIIDH
– a group of non-Governmental
Human Rights groups –
collected 1240 records of deaths
in this three-county region.
Next, the Catholic Church collected
a bit fewer than 800 deaths.
The truth commission – the Comisión
para el Esclarecimiento Histórico (CEH) –
conducted a really big research
project in the late 1990s and
of that we got information about a little
bit more than a thousand deaths.
And then the National Program for
Compensation is very, very large
and gave us about 4700
records of deaths.
Now, this is interesting
but this is not unique.
Many of the deaths are reported in common
across those data sources and so…
we think about this in terms of a Venn
diagram. We think of: how did these
different data sets intersect with each
other or collide with each other. And
we can diagram that as in the sense
of these 3 white circles intersecting.
But as I mentioned earlier we’re also
interested in what we have not observed.
And this is crucial for us because
when we’re thinking about
how much information we have, we have to
distinguish between the world on the left,
in which our intersecting circles
cover about a third of the reality,
versus the world on the right where our
intersecting circles cover all of reality.
These are very different worlds; and the
reason they’re so different is not simply
because we want to know the magnitude,
not simply because we want to know
the total number of killings. That’s
important – but even more important:
we have to know that we’ve covered,
we’ve estimated in equal proportions
the two parties. We have to estimate in
equal proportions the number of deaths
of non-indigenous people and the
number of deaths of indigenous people.
Because if we don’t get those
estimates correct our comparison
of their mortality rates will be biased.
Our story will be wrong. We will fail
to speak Truth to Power. We can’t have
that. So what do we do? Algebra!
Algebra is our friend. So I’m gonna
give you just a tiny taste of how we
solve this problem and I’m going to
introduce a series of assumptions.
Those of you who would like to debate
those assumptions: I invite you to join me
after the talk and we will talk endlessly
and tediously about capture heterogeneity.
But in the short term,
we have a universe N of total killings in
a specific time/space/ethnicity/location.
And of that we have 2 projects A and B.
A captures some number of
deaths from the universe N,
and the probability with which a death is
captured by project A from the universe N
is by elementary probability theory the
number of deaths documented by A
divided by the unknown number
of deaths in the population N.
Similarly, the probability with which a
death from N is documented by project B
is B over N, and this is the cool part:
the probability with which a death
is documented by both A and B is M over N.
Now we can put the 2 databases together,
we can compare them. Let’s talk about
the use of random forest classifiers
and clustering to do that later.
But we can put the 2 databases together,
compare them, determine the deaths
that are in M – that is, in both
A and B – and divide M by N.
But, also by probability theory, the
probability that a death occurs in M
is equal to the product of
the individual probabilities.
The probability of any compound event, an
event made up of two independent events, is
equal to the product of the probabilities of
those two events, so M over N is equal to
A over N times B over N. Solve for N.
Multiply it through by N squared, divide
by M, and we have an estimate of N
which is equal to AB over M. Now, the
lights in my eyes, I can’t see, but I saw
a few light bulbs go off over people’s
heads. And when I showed this proof
to the judge in the trial of General Ríos
I saw a light bulb go on over her head.
It’s a beautiful thing,
it’s a beautiful thing.
applause
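The two-list estimate just derived (N is approximately A times B over M) can be written as a short function. This is a minimal sketch under the same assumption the talk states, namely that the two projects capture deaths independently; the function name and the toy numbers are hypothetical and are not taken from the Guatemalan data.

```python
def two_list_estimate(a: int, b: int, m: int) -> float:
    """Estimate the total N from two list sizes and their overlap.

    a: number of deaths documented by project A
    b: number of deaths documented by project B
    m: number of deaths documented by both A and B
    Assumes the two projects capture deaths independently.
    """
    if m <= 0:
        raise ValueError("the two lists must share at least one record")
    # From M/N = (A/N) * (B/N), solving for N gives N = A * B / M.
    return a * b / m


# Hypothetical toy numbers, chosen only to make the arithmetic easy to check.
a, b, m = 300, 200, 60
n_hat = two_list_estimate(a, b, m)
print(n_hat)  # 1000.0
```

Here the two lists together name 440 distinct deaths (300 + 200 - 60), while the estimate of the total is 1,000, so more than half of the universe was never observed by either project.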
So we don’t do it in 2 systems because
that takes a lot of assumptions.
We do it in 4. You will recall that we
have 4 data sources. We organize
the data sources in this format
such that we have an inclusion
and an exclusion pattern in the table on
the left, which… for which we can define
the number of deaths which fall into
each of these intersecting patterns.
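As a rough illustration of such an inclusion and exclusion table, the sketch below counts how many deaths fall into each pattern across four lists. The victim IDs are invented, the source labels are shorthand for the four projects named earlier, and the sketch assumes the records have already been matched across sources; it does not show the estimation step itself.

```python
from collections import Counter

# Hypothetical matched victim IDs per source (invented for illustration).
sources = {
    "CIIDH": {1, 2, 3, 4},
    "Church": {2, 3, 5},
    "CEH": {3, 4, 5, 6},
    "Compensation": {1, 3, 6, 7, 8},
}


def capture_patterns(sources):
    """Count deaths by inclusion pattern, e.g. (1, 0, 1, 0) = in 1st and 3rd list only."""
    names = list(sources)
    all_ids = set().union(*sources.values())
    counts = Counter()
    for victim in all_ids:
        pattern = tuple(int(victim in sources[name]) for name in names)
        counts[pattern] += 1
    return names, counts


names, counts = capture_patterns(sources)
print(names)
for pattern, n in sorted(counts.items(), reverse=True):
    print(pattern, n)
```

The pattern (0, 0, 0, 0), deaths that no project documented, never appears in the table. That empty cell is the quantity the estimation has to fill in.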
And I’ll give you a very quick
metaphor here. The metaphor is:
imagine that you have 2 dark rooms and you
want to assess the size of those 2 rooms
– which room is larger? And the only
tool that you have to assess the size
of those rooms is a handful of little
rubber balls. The little rubber balls
have a property that when they hit each
other they make a sound. makes CLICK sound
So we throw the balls into the first
room and we listen, and we hear
makes several CLICK sounds. We
collect the balls, go to the second room,
throw them with equal force – imagining
a spherical cow of uniform density!
We throw the balls into the second
room with equal force and we hear
makes one CLICK sound
So which room is larger?
The second room, because we hear fewer
collisions, right? Well, the estimation,
the toy example I gave in the previous
slide is the mathematical formalization
of the intuition that fewer
collisions mean a larger space.
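The dark-room intuition can be checked with a small simulation. This is a toy sketch, not part of the original analysis: two "projects" each draw a random sample of the same size from a population, and we count how many records they have in common. The population sizes, sample sizes and seed are arbitrary assumptions.

```python
import random


def count_overlap(population_size, list_a_size, list_b_size, rng):
    """Draw two independent random lists and count the records they share."""
    population = range(population_size)
    list_a = set(rng.sample(population, list_a_size))
    list_b = set(rng.sample(population, list_b_size))
    return len(list_a & list_b)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Same "handful of balls" (200 records per list) thrown into two rooms.
small_room = count_overlap(1_000, 200, 200, rng)
large_room = count_overlap(10_000, 200, 200, rng)

print("overlap in small population:", small_room)
print("overlap in large population:", large_room)
```

With equal effort in both rooms, the larger population should produce far fewer collisions: on average about 40 shared records in the small one and about 4 in the large one.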
And so what we’re doing here is
laying out the pattern of collisions.
Not just the collisions, the pairwise
collisions, but the three-way and
four-way collisions. And that
allows us to make the estimate
that was shown in the bar graph of
the light part of each of the bars. So
we can come back to our conclusion and put
a confidence interval on the estimates.
And the confidence intervals are shown
there. Now I’m gonna move through this
somewhat more quickly to get to the end of
the talk but I wanna put up one more slide
that was used in the testimony
and that is that we divided time
into 16-month periods and
compared the 16-month period of
General Ríos’s governance – now it’s only
16 months ’cause we went April to July,
because it’s only a few days in August, a
few days in March, so we shaved those off,
okay… – 16-month period of General
Ríos’s Government and compared it
to several periods before and after. And
I think that the key observation here
is that the rate of killing
against indigenous people
is substantially higher under General
Ríos’s Government than under previous
or succeeding governments. But more
importantly the ratio between the two,
the relative risk of being killed as an
indigenous person, was at its peak
during the government of General Ríos.
Have we proven genocide? No.
This is evidence consistent with the
hypothesis that acts of genocide
were committed. The finding of genocide
is a legal finding, not so much
a scientific one. So as scientists,
our job is to provide evidence that
the finders of fact – the judges in this
case – can use in their determination.
This is evidence consistent
with that hypothesis.
Were this evidence otherwise, as
scientists we would say we would
reject the hypothesis that genocide was
committed. However, with this evidence
we find that the evidence,
the data is consistent with
the prosecution’s hypothesis.
So, it worked!
Ríos Montt was convicted on
genocide charges. applause
You can clap!
applause
For a week!
mumbled, surprised laughter
Then the Constitutional Court intervened.
I know there are a couple of experts on
Guatemala here in the audience
who can tell you more about why that
happened and exactly what happened.
However, the Constitutional
Court ordered a new trial,
which is at this time scheduled
for the very beginning of 2015.
And I look forward to testifying again,
and again, and again, and again!
applause
Look, but I wanna come back to this point.
Because as a bunch of technologists…
– there are a lot of folks who really like
technology here, I really like it too!
Technology doesn’t get us to science
– you have to have science
to get you to science. Technology helps
you organize the data. It helps you do
all kinds of extremely great and cool
things without which we wouldn’t be able
to even do the science. But you
can’t have just technology!
You can’t just have a bunch of data
and make conclusions. That’s naive,
and you will get the wrong conclusions.
‘The point of rigorous statistics is
to be right’, and there is a little bit of
a caveat there – or to at least know
how uncertain you are. Statistics is often
called the ‘Science of Uncertainty’.
That is actually my favorite
definition of it. So,
I’m going to assume that we
care about getting it right.
No one laughed, that’s good.
Not everyone does, to my distress.
So if you only have some of the data
– and I will argue that we always
only have some of the data –
you need some kind of model that will tell
you the relationship between your data
and the real world.
Statisticians call that an inference.
In order to get from here to there
you’re gonna need some kind of
probability model that tells you
why your data is like the world,
or in what sense you have to tweak,
twiddle and do algebra with your data
to get from what you can
observe to what is actually true.
And statistics is about comparisons.
Yeah, we get a big number and
journalists love the big number; but
it’s really about these relationships
and patterns! So to get those
relationships and patterns,
in order for them to be right, in order
for our answer to be correct,
every one of the estimates we make
for every point in the pattern
has to be right. It’s a hard
problem. It’s a hard problem.
And what I worry about is that
we have come into this world
in which people throw the notion of Big
Data around as though the data allows us
to make an end-run around problems
of sampling and modeling. It doesn’t.
So as technologists, the reason I’m,
you know, ranting at you guys about it
is that it’s very tempting to have a lot
of data and think you have an answer!
And it’s even more tempting because
in an industry context you might be right.
Not so much in Human Rights, not so
much. Violence is a hidden process.
The people who commit violence have
an enormous commitment to hiding it,
distorting it, explaining it in different
ways. All of those things dramatically
affect the information that is produced
from the violence that we’re going to use
to do our analysis. So we usually
don’t know what we don’t know
in Human Rights data collection.
And that means that we don’t know
if what we don’t know is systematically
different from what we do know.
Maybe we know about all the lawyers
and we don’t know about the people
in the countryside. Maybe we know
about all the indigenous people
and not the non-indigenous people.
If that were true, the argument
that I just made would be merely
an artifact of the reporting process
rather than some true analysis. Now
we did the estimations, which is why I believe
we can reject that critique, but that’s
what we have to worry about.
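A toy calculation shows how that artifact could arise. In this sketch both groups have exactly the same true mortality rate, but reports reach us for 80% of deaths in one group and only 20% in the other. All numbers and names here are hypothetical and chosen only to make the effect visible.

```python
def relative_risk(deaths_a, pop_a, deaths_b, pop_b):
    """Ratio of the mortality rate in group A to the rate in group B."""
    return (deaths_a / pop_a) / (deaths_b / pop_b)


# Hypothetical world: identical true rates in both groups.
population = {"group_a": 10_000, "group_b": 10_000}
true_deaths = {"group_a": 500, "group_b": 500}

# Hypothetical coverage: the share of deaths that ever gets reported.
coverage = {"group_a": 0.8, "group_b": 0.2}
observed = {g: true_deaths[g] * coverage[g] for g in true_deaths}

true_rr = relative_risk(
    true_deaths["group_a"], population["group_a"],
    true_deaths["group_b"], population["group_b"],
)
observed_rr = relative_risk(
    observed["group_a"], population["group_a"],
    observed["group_b"], population["group_b"],
)

print(f"true relative risk:     {true_rr:.1f}")
print(f"observed relative risk: {observed_rr:.1f}")
```

The raw reports suggest group A was about four times more at risk, although the true relative risk is 1. This is why the unobserved deaths have to be estimated separately for each group before the rates are compared.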
And let’s go back to the Venn diagram
and say: which of these is accurate?
It’s not just for one of the
points in our pattern analysis.
The problem is that we’re
going to compare things.
As in Peru where we compared killings
committed by the Peruvian army against
killings committed by the Maoist guerrillas
of the Sendero Luminoso. And we found
there that in fact we knew very little
about what the Sendero Luminoso had done.
Whereas we knew almost everything
about what the Peruvian army had done.
This is called the coverage rate.
The rate between what we know and
what we don’t know. And
raw data, however big,
does not get us to patterns.
And here is a bunch of…
kinds of raw data that I’ve used
and that I really enjoy using.
You know – truth commission testimonies,
UN investigations, press articles,
SMS messages, crowdsourcing, NGO
documentation, social media feeds,
perpetrator records, government archives,
state agency registries – I know those
sound all the same but they actually
turn out to be slightly different.
Happy to talk in tedious detail! Refugee
Camp records, any non-random sample.
All of those are gonna take
some kind of probability model
and we don’t have that many
probability models to use. So
raw data is great for cases – but
it doesn’t get you to patterns.
And patterns – again – patterns are
the thing that allow us to do analysis.
They are the thing… the patterns are what
get us to something that we can use
to help prosecutors, advocates and the…
and the victims themselves.
I gave a version of this talk, a
much earlier version of this talk
several years ago in Medellín, Colombia.
I’ve worked a lot in Colombia,
it’s really… it’s a great place to
work. There’s really terrific
Victims Rights groups there.
And a woman from a township,
smaller than a county, near to Medellín
came up to me after the talk and she said:
“You know, a lot of people… you
know I’m a Human Rights activist,
my job is to collect data, I tell stories
about people who have suffered.
But there are people in my
village I know who have had
people in their families disappeared and
they’re never gonna talk about it, ever.
We’re never going to be able to use
their names, because they are afraid.”
We can’t name the victims. At
least we’d better count them.
So about that counting: there’s
3 ways to do it right. You can have
a perfect census – you can have all the
data. Yeah it’s nice, good luck with that.
You can have a random sample
of the population - that’s hard!
Sometimes doable but very hard.
In my experience we rarely interview
victims of homicide, very rarely.
Laughing
And that means there’s a complicated
probability relationship between
the person you sampled, the interview
and the death that they talk to you about.
Or you can do some kind of posterior
modeling of the sampling process which is…
which is in essence what
I proposed in the earlier slide.
So what can we do with raw data,
guys? We can collect a bunch of…
We can say that a case exists. Ok
– that’s actually important! We can say:
“Something happened” with raw data. We can
say: “We know something about that case".
We can say: “There were 100 victims
in that case or at least 100 victims
in that case”, if we can name 100 people.
But we can’t do comparisons: “This
is the biggest massacre this year”.
We don’t really know. Because we
don’t know about the massacres
we don’t know about. No patterns. Don’t
talk about the hot spot of violence.
No, we don’t know that. Happy to talk
more about that if we gather after,
but I wanna come to a close here with
the importance of getting it right.
I’ve talked about one case today. This
is another case, the case of this man:
Edgar Fernando García. Mr. García was
a student Labor leader in Guatemala
early in the 1980s. He left
his office in February 1984
– did not come home. People reported
later that they saw someone
shoving Mr. García into a
vehicle and driving away.
His widow became a very important
Human Rights activist in Guatemala
and now she’s a very important, and
in my opinion impressive politician.
And there’s her infant daughter. She
continued to struggle to find out
what had happened to
Mr. García for decades.
And in 2006 documents came to light
in the National Archives of the…
excuse me, the Historical Archives
of the national Police, showing that
the Police had carried out an operation
in the area of Mr. García’s office
and it was very likely that
they had disappeared him.
These 2 guys up here in the upper
right were Police officers in that area;
they were arrested, charged with the
disappearance of Mister García and
convicted. Part of the evidence used to
convict them was communications metadata
showing that documents
flowed through the archive.
I mean paper communications! We coded
it by hand. We went through and read
the ‘From’ and ‘To’ lines
from every Memo. And
they were convicted in 2010
and after that conviction
Mr. García’s infant daughter – now
a grown woman – was clearly joyful.
Justice brings closure to a family
that never knows when to start talking
about someone in the past tense.
Perhaps even more powerfully:
those guys’ grand boss, their boss's
boss, Colonel Héctor Bol de la Cruz,
this man here, was convicted
of Mr. García’s disappearance
in September this year [2013].
applause
I don’t know if any of you have
ever been dissident students,
but if you’ve been dissident students
demonstrating in the street think about
how you would feel if your friends
and comrades were disappeared,
and take a long look at Colonel Bol
de la Cruz. Here is the rest of the stuff
that we will talk about if we gather
afterwards. Thank you very much
for your attention. I really
have enjoyed CCC.
applause
Subtitles created by c3subtitles.de
in the year 2016. Join and help us!