How juries are fooled by statistics
Peter Donnelly • TEDGlobal 2005
Mathematician; statistician
Peter Donnelly is an expert in probability theory who applies statistical methods to genetic data -- spurring advances in disease treatment and insights into our evolution. He's also an expert on DNA analysis, and an advocate for sensible statistical analysis in the courtroom.
00:00
As other speakers have said, it's a rather daunting experience -- a particularly daunting experience -- to be speaking in front of this audience. But unlike the other speakers, I'm not going to tell you about the mysteries of the universe, or the wonders of evolution, or the really clever, innovative ways people are attacking the major inequalities in our world. Or even the challenges of nation-states in the modern global economy. My brief, as you've just heard, is to tell you about statistics -- and, to be more precise, to tell you some exciting things about statistics. And that's -- (Laughter) -- that's rather more challenging than all the speakers before me and all the ones coming after me. (Laughter) One of my senior colleagues told me, when I was a youngster in this profession, rather proudly, that statisticians were people who liked figures but didn't have the personality skills to become accountants. (Laughter) And there's another in-joke among statisticians, and that's, "How do you tell the introverted statistician from the extroverted statistician?" To which the answer is, "The extroverted statistician's the one who looks at the other person's shoes." (Laughter) But I want to tell you something useful -- and here it is, so concentrate now. This evening, there's a reception in the University's Museum of Natural History. And it's a wonderful setting, as I hope you'll find, and a great icon to the best of the Victorian tradition. It's very unlikely -- in this special setting, and this collection of people -- but you might just find yourself talking to someone you'd rather wish that you weren't. So here's what you do. When they say to you, "What do you do?" -- you say, "I'm a statistician." (Laughter) Well, except they've been pre-warned now, and they'll know you're making it up. And then one of two things will happen. They'll either discover their long-lost cousin in the other corner of the room and run over and talk to them. Or they'll suddenly become parched and/or hungry -- and often both -- and sprint off for a drink and some food. And you'll be left in peace to talk to the person you really want to talk to.
01:55
It's one of the challenges in our profession to try and explain what we do. We're not top on people's lists for dinner party guests and conversations and so on. And it's something I've never really found a good way of doing. But my wife -- who was then my girlfriend -- managed it much better than I've ever been able to. Many years ago, when we first started going out, she was working for the BBC in Britain, and I was, at that stage, working in America. I was coming back to visit her. She told this to one of her colleagues, who said, "Well, what does your boyfriend do?" Sarah thought quite hard about the things I'd explained -- and she concentrated, in those days, on listening. (Laughter) Don't tell her I said that. And she was thinking about the work I did developing mathematical models for understanding evolution and modern genetics. So when her colleague said, "What does he do?" She paused and said, "He models things." (Laughter) Well, her colleague suddenly got much more interested than I had any right to expect and went on and said, "What does he model?" Well, Sarah thought a little bit more about my work and said, "Genes." (Laughter) "He models genes."
03:06
That is my first love, and that's what I'll tell you a little bit about. What I want to do more generally is to get you thinking about the place of uncertainty and randomness and chance in our world, and how we react to that, and how well we do or don't think about it. So you've had a pretty easy time up till now -- a few laughs, and all that kind of thing -- in the talks to date. You've got to think, and I'm going to ask you some questions. So here's the scene for the first question I'm going to ask you. Can you imagine tossing a coin successively? And for some reason -- which shall remain rather vague -- we're interested in a particular pattern. Here's one -- a head, followed by a tail, followed by a tail.
03:42
So suppose we toss a coin repeatedly. Then the pattern, head-tail-tail, that we've suddenly become fixated with happens here. And you can count: one, two, three, four, five, six, seven, eight, nine, 10 -- it happens after the 10th toss. So you might think there are more interesting things to do, but humor me for the moment. Imagine this half of the audience each get out coins, and they toss them until they first see the pattern head-tail-tail. The first time they do it, maybe it happens after the 10th toss, as here. The second time, maybe it's after the fourth toss. The next time, after the 15th toss. So you do that lots and lots of times, and you average those numbers. That's what I want this side to think about.
04:18
The other half of the audience doesn't like head-tail-tail -- they think, for deep cultural reasons, that's boring -- and they're much more interested in a different pattern -- head-tail-head. So, on this side, you get out your coins, and you toss and toss and toss. And you count the number of times until the pattern head-tail-head appears and you average them. OK? So on this side, you've got a number -- you've done it lots of times, so you get it accurately -- which is the average number of tosses until head-tail-tail. On this side, you've got a number -- the average number of tosses until head-tail-head.
04:46
So here's a deep mathematical fact -- if you've got two numbers, one of three things must be true. Either they're the same, or this one's bigger than this one, or this one's bigger than that one. So what's going on here? So you've all got to think about this, and you've all got to vote -- and we're not moving on. And I don't want to end up in the two-minute silence to give you more time to think about it, until everyone's expressed a view. OK. So what you want to do is compare the average number of tosses until we first see head-tail-head with the average number of tosses until we first see head-tail-tail.
05:16
Who thinks that A is true -- that, on average, it'll take longer to see head-tail-head than head-tail-tail? Who thinks that B is true -- that on average, they're the same? Who thinks that C is true -- that, on average, it'll take less time to see head-tail-head than head-tail-tail? OK, who hasn't voted yet? Because that's really naughty -- I said you had to. (Laughter) OK. So most people think B is true. And you might be relieved to know even rather distinguished mathematicians think that. It's not. A is true here. It takes longer, on average. In fact, the average number of tosses till head-tail-head is 10 and the average number of tosses until head-tail-tail is eight. How could that be? Anything different about the two patterns? There is. Head-tail-head overlaps itself. If you went head-tail-head-tail-head, you can cunningly get two occurrences of the pattern in only five tosses. You can't do that with head-tail-tail. That turns out to be important.
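If you'd rather check that claim than take it on trust, here is a minimal brute-force simulation in Python. It isn't part of the talk; the function name and trial count are illustrative choices.

```python
import random

def average_wait(pattern: str, trials: int = 100_000) -> float:
    """Estimate the expected number of coin tosses until `pattern`
    (a string of 'H's and 'T's) first appears."""
    total = 0
    for _ in range(trials):
        window = ""  # the most recent len(pattern) tosses
        tosses = 0
        while window != pattern:
            window = (window + random.choice("HT"))[-len(pattern):]
            tosses += 1
        total += tosses
    return total / trials

print(average_wait("HTH"))  # settles near 10
print(average_wait("HTT"))  # settles near 8
```

Run it and the two estimates settle near 10 and 8, matching the averages quoted above.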
06:21
There are two ways of thinking about this. I'll give you one of them. So imagine -- let's suppose we're doing it. On this side -- remember, you're excited about head-tail-tail; you're excited about head-tail-head. We start tossing a coin, and we get a head -- and you start sitting on the edge of your seat because something great and wonderful, or awesome, might be about to happen. The next toss is a tail -- you get really excited. The champagne's on ice just next to you; you've got the glasses chilled to celebrate. You're waiting with bated breath for the final toss. And if it comes down a head, that's great. You're done, and you celebrate. If it's a tail -- well, rather disappointedly, you put the glasses away and put the champagne back. And you keep tossing, to wait for the next head, to get excited.
07:00
On this side, there's a different experience. It's the same for the first two parts of the sequence. You're a little bit excited with the first head -- you get rather more excited with the next tail. Then you toss the coin. If it's a tail, you crack open the champagne. If it's a head you're disappointed, but you're still a third of the way to your pattern again. And that's an informal way of presenting it -- that's why there's a difference. Another way of thinking about it -- if we tossed a coin eight million times, then we'd expect a million head-tail-heads and a million head-tail-tails -- but the head-tail-heads could occur in clumps. So if you want to put a million things down amongst eight million positions and you can have some of them overlapping, the clumps will be further apart. It's another way of getting the intuition.
07:45
What's the point I want to make? It's a very, very simple example, an easily stated question in probability, which every -- you're in good company -- everybody gets wrong. This is my little diversion into my real passion, which is genetics. There's a connection between head-tail-heads and head-tail-tails in genetics, and it's the following. When you toss a coin, you get a sequence of heads and tails. When you look at DNA, there's a sequence of not two things -- heads and tails -- but four letters -- As, Gs, Cs and Ts. And there are little chemical scissors, called restriction enzymes, which cut DNA whenever they see particular patterns. And they're an enormously useful tool in modern molecular biology. And instead of asking the question, "How long until I see a head-tail-head?" -- you can ask, "How big will the chunks be when I use a restriction enzyme which cuts whenever it sees G-A-A-G, for example? How long will those chunks be?"
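The chunk-size question can be explored the same way. Here is a sketch, not from the talk, that treats DNA as uniformly random letters (which ignores the composition of real genomes) and cuts at every occurrence of G-A-A-G; the genome length is an arbitrary choice.

```python
import random

def mean_fragment_length(site: str = "GAAG", length: int = 2_000_000) -> float:
    """Generate a random DNA string, locate every occurrence of `site`,
    and return the mean distance between successive occurrences."""
    dna = "".join(random.choice("ACGT") for _ in range(length))
    cuts = []
    pos = dna.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = dna.find(site, pos + 1)  # allow overlapping occurrences
    gaps = [b - a for a, b in zip(cuts, cuts[1:])]
    return sum(gaps) / len(gaps)

print(mean_fragment_length())  # close to 4**4 = 256
```

The average spacing comes out near 4 to the power 4, or 256 letters, because each position matches a fixed four-letter site with probability one in 256. The overlap effect from the coin example shows up here too: a self-overlapping site like G-A-A-G clumps, which changes the spread of the chunk sizes rather than their average.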
08:35
That's a rather trivial connection between probability and genetics. There's a much deeper connection, which I don't have time to go into, and that is that modern genetics is a really exciting area of science. And we'll hear some talks later in the conference specifically about that. But it turns out that a key part of unlocking the secrets in the information generated by modern experimental technologies has to do with fairly sophisticated -- you'll be relieved to know that I do something useful in my day job, rather more sophisticated than the head-tail-head story -- but quite sophisticated computer modeling, mathematical modeling and modern statistical techniques. And I will give you two little snippets -- two examples -- of projects we're involved in in my group in Oxford, both of which I think are rather exciting. You know about the Human Genome Project. That was a project which aimed to read one copy of the human genome. The natural thing to do after you've done that is to look at where people differ -- and that's what this next project, the International HapMap Project, is doing; it's a collaboration between labs in five or six different countries. Think of the Human Genome Project as learning what we've got in common, and the HapMap Project as trying to understand where there are differences between different people.
09:43
Why do we care about that? Well, there are lots of reasons. The most pressing one is that we want to understand how some differences make some people susceptible to one disease -- type-2 diabetes, for example -- and other differences make people more susceptible to heart disease, or stroke, or autism and so on. That's one big project. There's a second big project, recently funded by the Wellcome Trust in this country, involving very large studies -- thousands of individuals, with each of eight different diseases, common diseases like type-1 and type-2 diabetes, and coronary heart disease, bipolar disease and so on -- to try and understand the genetics. To try and understand what it is about genetic differences that causes the diseases. Why do we want to do that? Because we understand very little about most human diseases. We don't know what causes them. And if we can get in at the bottom and understand the genetics, we'll have a window on the way the disease works, and a whole new way of thinking about disease therapies and preventative treatment and so on. So that's, as I said, the little diversion into my main love.
10:44
Back to some of the more mundane issues of thinking about uncertainty. Here's another quiz for you -- now suppose we've got a test for a disease which isn't infallible, but it's pretty good. It gets it right 99 percent of the time. And I take one of you, or I take someone off the street, and I test them for the disease in question. Let's suppose there's a test for HIV -- the virus that causes AIDS -- and the test says the person has the disease. What's the chance that they do? The test gets it right 99 percent of the time. So a natural answer is 99 percent. Who likes that answer? Come on -- everyone's got to get involved. Don't think you don't trust me anymore. (Laughter) Well, you're right to be a bit skeptical, because that's not the answer. That's what you might think. But it's not the answer, because it's only part of the story. It actually depends on how common or how rare the disease is. So let me try and illustrate that. Here's a little caricature of a million individuals. So let's think about a disease that affects -- it's pretty rare, it affects one person in 10,000. Amongst these million individuals, most of them are healthy and some of them will have the disease. And in fact, if this is the prevalence of the disease, about 100 will have the disease and the rest won't. So now suppose we test them all. What happens? Well, amongst the 100 who do have the disease, the test will get it right 99 percent of the time, and 99 will test positive. Amongst all these other people who don't have the disease, the test will get it right 99 percent of the time. It'll only get it wrong one percent of the time. But there are so many of them that there'll be an enormous number of false positives. Put that another way -- of all of them who test positive -- so here they are, the individuals involved -- less than one in 100 actually have the disease. So even though we think the test is accurate, the important part of the story is there's another bit of information we need.
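Here is the arithmetic of that caricature as a short sketch, using exactly the numbers from the talk: a million people, a prevalence of one in 10,000, and a test that is right 99 percent of the time.

```python
population = 1_000_000
prevalence = 1 / 10_000    # one person in 10,000 has the disease
accuracy = 0.99            # the test is right 99 percent of the time

sick = population * prevalence               # 100 people
healthy = population - sick                  # 999,900 people

true_positives = sick * accuracy             # 99 correct positives
false_positives = healthy * (1 - accuracy)   # 9,999 false positives

# Of everyone who tests positive, what fraction actually has the disease?
print(true_positives / (true_positives + false_positives))  # about 0.0098
```

The answer is roughly 0.0098: fewer than one in 100 of the positives are real, just as the talk says.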
12:39
Here's the key intuition. What we have to do, once we know the test is positive, is to weigh up the plausibility, or the likelihood, of two competing explanations. Each of those explanations has a likely bit and an unlikely bit. One explanation is that the person doesn't have the disease -- that's overwhelmingly likely, if you pick someone at random -- but the test gets it wrong, which is unlikely. The other explanation is that the person does have the disease -- that's unlikely -- but the test gets it right, which is likely. And the number we end up with -- that number which is a little bit less than one in 100 -- is to do with how likely one of those explanations is relative to the other. Each explanation, taken as a whole, is unlikely.
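Written as Bayes' theorem, a standard formulation the talk gestures at without naming, the weighing of the two explanations with the numbers above gives the same figure:

$$
P(\text{disease} \mid +) = \frac{P(\text{disease})\,P(+ \mid \text{disease})}{P(\text{disease})\,P(+ \mid \text{disease}) + P(\text{healthy})\,P(+ \mid \text{healthy})} = \frac{0.0001 \times 0.99}{0.0001 \times 0.99 + 0.9999 \times 0.01} \approx 0.0098
$$

The numerator is the "unlikely disease, likely correct test" explanation; the second term in the denominator is the "likely healthy, unlikely wrong test" explanation, and it dominates.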
13:24
Here's a more topical example of exactly the same thing. Those of you in Britain will know about what's become rather a celebrated case of a woman called Sally Clark, who had two babies who died suddenly. And initially, it was thought that they died of what's known informally as "cot death," and more formally as "Sudden Infant Death Syndrome." For various reasons, she was later charged with murder. And at the trial, her trial, a very distinguished pediatrician gave evidence that the chance of two cot deaths, innocent deaths, in a family like hers -- which was professional and non-smoking -- was one in 73 million. To cut a long story short, she was convicted at the time. Later, and fairly recently, acquitted on appeal -- in fact, on the second appeal. And just to set it in context, you can imagine how awful it is for someone to have lost one child, and then two, if they're innocent, to be convicted of murdering them. To be put through the stress of the trial, convicted of murdering them -- and to spend time in a women's prison, where all the other prisoners think you killed your children -- is a really awful thing to happen to someone. And it happened in large part here because the expert got the statistics horribly wrong, in two different ways.
14:36
So where did he get the one in 73 million number? He looked at some research, which said the chance of one cot death in a family like Sally Clark's is about one in 8,500. So he said, "I'll assume that if you have one cot death in a family, the chance of a second child dying from cot death isn't changed." So that's what statisticians would call an assumption of independence. It's like saying, "If you toss a coin and get a head the first time, that won't affect the chance of getting a head the second time." So if you toss a coin twice, the chance of getting a head twice is a half -- that's the chance the first time -- times a half -- the chance a second time. So he said, "Here, I'll assume that these events are independent. When you multiply 8,500 by itself, you get about 73 million." And none of this was stated to the court as an assumption or presented to the jury that way. Unfortunately here -- and, really, regrettably -- first of all, in a situation like this you'd have to verify it empirically. And secondly, it's palpably false. There are lots and lots of things that we don't know about sudden infant deaths. It might well be that there are environmental factors that we're not aware of, and it's pretty likely to be the case that there are genetic factors we're not aware of. So if a family suffers from one cot death, you'd put them in a high-risk group. They've probably got these environmental risk factors and/or genetic risk factors we don't know about. And to argue, then, that the chance of a second death is as if you didn't know that information is really silly. It's worse than silly -- it's really bad science. Nonetheless, that's how it was presented, and at trial nobody even argued it. That's the first problem. The second problem is, what does the number of one in 73 million mean? So after Sally Clark was convicted -- you can imagine, it made rather a splash in the press -- one of the journalists from one of Britain's more reputable newspapers wrote that what the expert had said was, "The chance that she was innocent was one in 73 million." Now, that's a logical error. It's exactly the same logical error as the logical error of thinking that after the disease test, which is 99 percent accurate, the chance of having the disease is 99 percent. In the disease example, we had to bear in mind two things, one of which was the possibility that the test got it right or not. And the other one was the chance, a priori, that the person had the disease or not. It's exactly the same in this context. There are two things involved -- two parts to the explanation. We want to know how likely, or relatively how likely, two different explanations are. One of them is that Sally Clark was innocent -- which is, a priori, overwhelmingly likely -- most mothers don't kill their children. And the second part of the explanation is that she suffered an incredibly unlikely event. Not as unlikely as one in 73 million, but nonetheless rather unlikely. The other explanation is that she was guilty. Now, we probably think a priori that's unlikely. And we certainly should think in the context of a criminal trial that that's unlikely, because of the presumption of innocence. And then if she were trying to kill the children, she succeeded. So the chance that she's innocent isn't one in 73 million. We don't know what it is. It has to do with weighing up the strength of the other evidence against her and the statistical evidence. We know the children died.
What matters is how likely or unlikely, relative to each other, the two explanations are. And they're both implausible. There's a situation where errors in statistics had really profound and really unfortunate consequences. In fact, there are two other women who were convicted on the basis of the evidence of this pediatrician, who have subsequently been released on appeal. Many cases were reviewed. And it's particularly topical because he's currently facing a disrepute charge at Britain's General Medical Council.
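The expert's arithmetic, and how fragile the independence assumption makes it, in a short sketch. The one-in-8,500 figure is from the talk; the conditional probability for a second death in a high-risk family is purely hypothetical, to show how sensitive the product is.

```python
p_first = 1 / 8_500                # chance of one cot death in such a family
p_both_independent = p_first ** 2  # the expert's independence assumption
print(round(1 / p_both_independent))  # 72,250,000 -- "about 73 million"

# If a first cot death flags unknown genetic or environmental risk,
# a second death becomes conditionally far more likely. The 1/200
# below is a made-up illustration, not a figure from the case.
p_second_given_first = 1 / 200
print(round(1 / (p_first * p_second_given_first)))  # 1,700,000
```

Dropping independence in favor of even a modest conditional risk moves the number by more than an order of magnitude, which is the first of the two errors described above.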
18:28
So just to conclude -- what are the take-home messages from this? Well, we know that randomness and uncertainty and chance are very much a part of our everyday life. It's also true that, although you, as a collective, are very special in many ways, you're completely typical in not getting the examples I gave right. It's very well documented that people get things wrong. They make errors of logic in reasoning with uncertainty. We can cope with the subtleties of language brilliantly -- and there are interesting evolutionary questions about how we got here. We are not good at reasoning with uncertainty. That's an issue in our everyday lives. As you've heard from many of the talks, statistics underpins an enormous amount of research in science -- in social science, in medicine and indeed, quite a lot of industry. All of quality control, which has had a major impact on industrial processing, is underpinned by statistics. It's something we're bad at doing. At the very least, we should recognize that, and we tend not to. To go back to the legal context, at the Sally Clark trial all of the lawyers just accepted what the expert said. So if a pediatrician had come out and said to a jury, "I know how to build bridges. I've built one down the road. Please drive your car home over it," they would have said, "Well, pediatricians don't know how to build bridges. That's what engineers do." On the other hand, he came out and effectively said, or implied, "I know how to reason with uncertainty. I know how to do statistics." And everyone said, "Well, that's fine. He's an expert." So we need to understand where our competence is and isn't. Exactly the same kinds of issues arose in the early days of DNA profiling, when scientists, lawyers and in some cases judges routinely misrepresented evidence. Usually -- one hopes -- innocently, but misrepresented evidence. Forensic scientists said, "The chance that this guy's innocent is one in three million." Even if you believe the number, just like the one in 73 million, that's not what it meant. And there have been celebrated appeal cases in Britain and elsewhere because of that.
20:23
And just to finish in the context of the legal system. It's all very well to say, "Let's do our best to present the evidence." But more and more, in cases of DNA profiling -- this is another one -- we expect juries, who are ordinary people -- and it's documented they're very bad at this -- we expect juries to be able to cope with the sorts of reasoning that goes on. In other spheres of life, if people argued -- well, except possibly for politics -- but in other spheres of life, if people argued illogically, we'd say that's not a good thing. We sort of expect it of politicians and don't hope for much more. In the case of uncertainty, we get it wrong all the time -- and at the very least, we should be aware of that, and ideally, we might try and do something about it. Thanks very much.
https://www.ted.com/talks/peter_donnelly_how_juries_are_fooled_by_statistics/transcript