How to debunk dodgy data
如何看穿骗人的数据【《拆穿胡扯》书评】

ON JULY 2ND the American state of Georgia counted a total of 87,709 cases of covid-19. Fifteen days later the number had risen to 135,183. Yet the state government’s online heat map looked largely the same. There appeared to be no increase in the number of crimson red areas where the outbreak was most severe. How come?

美国佐治亚州到7月2日共有87,709例新冠肺炎病例。15天后,病例数增加到了135,183例。但州政府的线上数据热图看上去却没有什么变化。代表疫情最严重的深红色区块的数量似乎并没有增加。怎么回事?

As it turned out, the threshold for places to turn red had been lifted from 2,961 cases to 3,769. This example of misleading data visualisation was called out by Carl Bergstrom and Jevin West. It joined the ever-growing catalogue of “bullshit”, malign and otherwise, which they debunk for students at the University of Washington.

原来,热图中区块变为深红的阈值从原来的2961例上调到了3769例。卡尔·伯格斯特龙(Carl Bergstrom)和杰文·韦斯特(Jevin West)拆穿了这个可视化数据误导受众的案例。他们在华盛顿大学向学生们揭露日益增多的各种恶意或无意的“胡扯”,这便是其中之一。
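
The effect of moving the red threshold can be sketched in a few lines of Python. Only the two thresholds (2,961 and 3,769) come from the article; the county names and case counts are invented for illustration, and treating the threshold as "at least this many cases" is an assumption.

```python
def count_red(cases_by_county, threshold):
    """Count counties whose case total meets or exceeds the red threshold."""
    return sum(1 for cases in cases_by_county.values() if cases >= threshold)

# Hypothetical county totals, invented for illustration only.
july_2 = {"A": 3100, "B": 2500, "C": 1200, "D": 900}
july_17 = {"A": 4800, "B": 3600, "C": 2000, "D": 1500}

OLD_THRESHOLD = 2961  # from the article
NEW_THRESHOLD = 3769  # from the article

red_before = count_red(july_2, OLD_THRESHOLD)
red_after_same_scale = count_red(july_17, OLD_THRESHOLD)
red_after_moved_scale = count_red(july_17, NEW_THRESHOLD)

print("Red on July 2nd (old threshold): ", red_before)
print("Red on July 17th (old threshold):", red_after_same_scale)
print("Red on July 17th (new threshold):", red_after_moved_scale)
```

In this toy data every county's count rises, yet the number of red counties is the same on both dates once the threshold is raised. Holding the scale fixed would have shown a second county turning red.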

Out of that course they have spun “Calling Bullshit”, a helpful guide to navigating a world full of doubtful claims based on spurious data. Using clever anecdotes, nods to online culture and allusions to ancient philosophy, the book tells ordinary readers how to spot nonsense—even if they are not numerical whizzes. As well as sketching the difference between correlation and causality, the authors outline visualisation techniques and explain machine learning to arm people against assertions that seem, and so probably are, either “too good or too bad to be true”.

他们也把课程内容写成了《拆穿胡扯》(Calling Bullshit)一书。这是一本十分有用的指南,帮助读者分辨生活中充斥的各种基于欺骗性数据的可疑说法。这本书通过讲述趣闻轶事、引用网络文化和古代哲学典故来告诉普通读者如何识别骗人的鬼话——即便他们不是数学奇才。作者概述了相关关系与因果关系的区别,简要介绍了可视化技术,解释了机器学习原理,从而让读者有能力去辨别那些听上去(因此很可能确实也是)“好到或坏到不真实”的断言。

There is, alas, no shortage of material. In one of their examples, a widely shared scholarly article seems to show that musicians from genres such as rap and hip-hop die much younger than those who play blues or jazz. The researchers in question calculate that half of all hip-hop musicians are murdered—a classic case of a claim too bad to be true. Messrs Bergstrom and West show where they went wrong: the raw numbers are not incorrect, but the picture they paint is incomplete, because they discount performers who are still alive. As rap music only began in the 1970s, rappers who have already died tend to have done so younger than those from the more venerable genres cited in the article.

糟糕的是,这样的素材比比皆是。在他们举出的例子中有一篇被广泛传播的学术文章,文章似乎表明说唱和嘻哈等类型歌手的死亡年龄比蓝调或爵士音乐人要年轻得多。进行该项研究的人员计算出,有一半嘻哈歌手是被谋杀的——这种说法就是典型的“坏到不真实”。伯格斯特龙和韦斯特指出了他们的谬误:原始数字没有错,但他们的研究范围并不完整,因为他们忽略了仍然健在的歌手。说唱乐从上世纪70年代才开始兴起,因此与文中提及的那些比较严肃的音乐流派相比,现在已经去世的说唱歌手自然比较年轻。
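
A toy calculation shows why counting only the dead skews the comparison. The sketch below gives both genres exactly the same lifespans and differs only in how long each genre has existed. The birth years, lifespans and observation year are hypothetical values chosen for illustration, not figures from the study or the book.

```python
from statistics import mean

LIFESPANS = [30, 50, 70, 90]  # hypothetical; identical for both genres
OBSERVED_IN = 2015            # hypothetical year in which deaths are tallied

def observed_mean_age_at_death(first_birth_year, last_birth_year):
    """Mean age at death among musicians who have already died.

    One musician with each lifespan is born every year in the range.
    Anyone still alive in OBSERVED_IN is left out, as in the flawed analysis.
    """
    ages = []
    for birth_year in range(first_birth_year, last_birth_year + 1):
        for lifespan in LIFESPANS:
            if birth_year + lifespan <= OBSERVED_IN:
                ages.append(lifespan)
    return mean(ages)

older_genre = observed_mean_age_at_death(1880, 1990)  # long-established genre
newer_genre = observed_mean_age_at_death(1950, 1990)  # genre with a short history
true_mean = mean(LIFESPANS)

print(f"True mean lifespan (both genres): {true_mean:.1f}")
print(f"Older genre, dead only:           {older_genre:.1f}")
print(f"Newer genre, dead only:           {newer_genre:.1f}")
```

Both "dead only" averages fall below the true mean, but the newer genre's falls much further: its long-lived members have simply not had time to die yet, so they never enter the sample.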

The ways of deceit and error with data are many—and the authors point them out ruthlessly. Their fellow scientists, the media, the “TED brand of bullshit”: no one is spared. They describe how the findings of a study can be manipulated to make them seem statistically important even when they are not, and how feeding an algorithm skewed inputs yields unreliable results. For instance, in 2017 two scientists sparked ethical concerns by claiming to have built an algorithm that could guess whether a person was gay or straight on the basis of pictures gleaned from a dating site. The paper, which The Economist covered at the time, failed to mention that their “gaydar” may have been responding to variations in how people choose to present themselves (make-up, poses and so on), rather than to authentic physical differences.

数据的欺骗性和谬误多种多样,作者无情地予以揭露。无论是科学家同行、媒体,还是“打着TED旗号的胡扯”,谁也不能幸免。他们描述了如何操纵研究结果,让它们看起来具备统计上的重要性,尽管实际并非如此;以及如何将经过扭曲的数据输入算法,从而得到不可靠的结果。例如,两位科学家在2017年声称开发出了一种算法,可以根据从约会网站上采集的图片来推测一个人是同性恋还是异性恋,结果引发了伦理担忧。当时本刊也报道了这篇论文,但该论文并没有提到他们的“同性恋雷达”可能只是根据人们不同的自我展现方式(化妆、姿势等)做出判断,而并非根据真实的体貌差异。
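
The review does not say which manipulations the authors describe. One well-known route, assumed here purely for illustration, is to run many tests and report only the one that clears the significance bar. If each of several independent tests has a 5% false-positive rate, the chance that at least one of them comes up "significant" by luck alone grows quickly:

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    is a false positive, when each has false-positive rate alpha."""
    return 1 - (1 - alpha) ** num_tests

for num_tests in (1, 5, 20):
    probability = chance_of_false_positive(num_tests)
    print(f"{num_tests:>2} tests: {probability:.1%} chance of a spurious 'finding'")
```

With 20 independent tests the chance is roughly 64%, so a lone "significant" result picked from many attempts is weak evidence on its own.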

While charts depicting the life expectancy of musicians are hardly lethal themselves, purporting to discern a person’s character from dodgy variables is perilous. Amid the pandemic, misinformation about infection rates and the efficacy of drugs—often bolstered by sneaky graphics, as in Georgia—is a particular concern. Some scientists are bypassing the usual peer-review process. Meanwhile newsrooms are under ever-greater pressure to attract clicks. More and more bullshit is contaminating debate. Mr Bergstrom and Mr West picked a good time to expose it. ■

描绘音乐人预期寿命的图表本身没什么杀伤力,但声称能靠一些不明不白的变量辨别一个人的性格就相当危险了。在这次疫情中,关于感染率和药物疗效的错误信息尤其令人担忧,其背后往往有搞鬼的图表在支撑,正如佐治亚州的例子那样。一些科学家已经绕过了常规的同行评议程序。与此同时,新闻媒体也日益面临吸引点击量的压力。越来越多的胡扯正在混淆视听。伯格斯特龙和韦斯特的揭露正当时。