How to debunk dodgy data

Sums of all fears

ON JULY 2ND the American state of Georgia counted a total of 87,709 cases of covid-19. Fifteen days later the number had risen to 135,183. Yet the state government’s online heat map looked largely the same. There appeared to be no increase in the number of crimson areas where the outbreak was most severe. How come?


As it turned out, the threshold for places to turn red had been lifted from 2,961 cases to 3,769. This example of misleading data visualisation was called out by Carl Bergstrom and Jevin West. It joined the ever-growing catalogue of “bullshit”, malign and otherwise, which they debunk for students at the University of Washington.
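The trick is easy to reproduce. Below is a minimal sketch in Python, using hypothetical county-level case counts (the state published only the map, not this code), of how lifting a colour threshold can hide a growing outbreak:

```python
def map_colour(cases, red_threshold):
    """Colour a county on the heat map: crimson at or above the threshold."""
    return "crimson" if cases >= red_threshold else "pale"

# Hypothetical county whose caseload grows sharply over the fortnight.
cases_july_17th = 3500

# Under the July 2nd threshold (2,961 cases) the growth would have turned
# the county red; under the lifted threshold (3,769) the map looks unchanged.
old = map_colour(cases_july_17th, red_threshold=2961)   # "crimson"
new = map_colour(cases_july_17th, red_threshold=3769)   # "pale"
```

The same data, bucketed against a moving goalpost, tells two different stories.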


Out of that course they have spun “Calling Bullshit”, a helpful guide to navigating a world full of doubtful claims based on spurious data. Using clever anecdotes, nods to online culture and allusions to ancient philosophy, the book tells ordinary readers how to spot nonsense—even if they are not numerical whizzes. As well as sketching the difference between correlation and causality, the authors outline visualisation techniques and explain machine learning to arm people against assertions that seem, and so probably are, either “too good or too bad to be true”.


There is, alas, no shortage of material. In one of their examples, a widely shared scholarly article seems to show that musicians from genres such as rap and hip-hop die much younger than those who play blues or jazz. The researchers in question calculate that half of all hip-hop musicians are murdered—a classic case of a claim too bad to be true. Messrs Bergstrom and West show where they went wrong: the raw numbers are not incorrect, but the picture they paint is incomplete, because they discount performers who are still alive. As rap music only began in the 1970s, rappers who have already died tend to have done so younger than those from the more venerable genres cited in the article.
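The survivorship effect can be seen in a toy simulation (made-up lifespans, not the researchers’ data): give every genre identical mortality, then average the ages at death of the already-dead only, as the study did.

```python
import random

random.seed(0)

def mean_age_of_the_dead(genre_start_year, n=10_000, now=2020):
    """Average age at death among *deceased* performers only, for a genre
    whose members debut (at roughly age 20) from genre_start_year onwards."""
    ages = []
    for _ in range(n):
        debut_year = random.randint(genre_start_year, now)
        birth_year = debut_year - 20
        lifespan = random.gauss(75, 10)      # identical mortality in every genre
        if birth_year + lifespan <= now:     # the living are silently dropped
            ages.append(lifespan)
    return sum(ages) / len(ages)

jazz = mean_age_of_the_dead(1920)  # old genre: most cohorts have had time to die
rap = mean_age_of_the_dead(1975)   # young genre: only early deaths are visible yet
```

Despite identical lifespans by construction, the younger genre’s deceased members die far younger on average, because nobody born after about 1955 has yet had the chance to die old.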


The ways of deceit and error with data are many—and the authors point them out ruthlessly. Their fellow scientists, the media, the “TED brand of bullshit”: no one is spared. They describe how the findings of a study can be manipulated to make them seem statistically significant even when they are not, and how feeding an algorithm skewed inputs yields unreliable results. For instance, in 2017 two scientists sparked ethical concerns by claiming to have built an algorithm that could guess whether a person was gay or straight on the basis of pictures gleaned from a dating site. The paper, which The Economist covered at the time, failed to mention that their “gaydar” may have been responding to variations in how people choose to present themselves (make-up, poses and so on), rather than to authentic physical differences.
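One classic route to a spuriously significant finding is to run many tests and report only the best. A toy sketch (pure Python, with a fair coin so the null hypothesis is true by construction) shows how easily it happens:

```python
import random

random.seed(42)

def looks_significant(n=100):
    """Flip a fair coin n times and apply the usual two-sided 5% cut-off
    (normal approximation: |z| > 1.96). Since the coin is fair, any
    'significant' result here is a false positive."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    z = (heads - n * 0.5) / (n * 0.25) ** 0.5
    return abs(z) > 1.96

# Any single experiment is rarely significant (roughly 5% of the time)...
false_positive_rate = sum(looks_significant() for _ in range(1000)) / 1000

# ...but a researcher who quietly runs 20 such tests and reports only the
# best has better-than-even odds of a publishable fluke.
chance_of_a_fluke = 1 - 0.95 ** 20   # about 0.64
```

The cure the book recommends is not more mathematics but more scepticism: ask how many comparisons were made before the reported one.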


While charts depicting the life expectancy of musicians are hardly lethal themselves, purporting to discern a person’s character from dodgy variables is perilous. Amid the pandemic, misinformation about infection rates and the efficacy of drugs—often bolstered by sneaky graphics, as in Georgia—is a particular concern. Some scientists are bypassing the usual peer-review process. Meanwhile newsrooms are under ever-greater pressure to attract clicks. More and more bullshit is contaminating debate. Messrs Bergstrom and West picked a good time to expose it. ■