There is a well-known saying that Mark Twain popularized in Chapters from My Autobiography, published in the North American Review in 1907. “Figures often beguile me,” he wrote, “particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: ‘There are three kinds of lies: lies, damned lies, and statistics.’”
Apparently Twain’s attribution to Disraeli was incorrect, but the saying was in common use by the 1890s. One example is a letter to the British newspaper National Observer, dated 8 June 1891 and published 13 June 1891, which read: “Sir, —It has been wittily remarked that there are three kinds of falsehood: the first is a ‘fib,’ the second is a downright lie, and the third and most aggravated is statistics.”
My interpretation of these three categories is something like this. The first category consists of simple falsehoods. These can be merely incorrect but spoken in the conviction that they are true. Other times they are spoken knowingly to cover up information that is inconvenient or embarrassing. The second category is spoken knowingly and with intent to deceive or perhaps malign someone. The third category, which I see often in my perusing of various blogs about science and medicine in particular, uses a convenient statistic as a magic talisman to deflect criticism and give the appearance of truth to what is fundamentally a lie.
Many of the statistics I see cited repeatedly are only preliminary estimates based on a limited sample of data. In other cases, the user will find one number in a large report that supports the argument they wish to make while ignoring the rest of the report which disagrees with their argument. In the process of science, researchers will do repeated studies on matters of importance like the effectiveness of various treatments or the safety of vaccines. Thus information which may have been current 20 or 30 years ago will often have been superseded by newer and more accurate results.
In the middle of a global pandemic like COVID-19 which is caused by the SARS-CoV-2 virus, the statistics about it change as often as the weather. And even the definition of what we are counting changes as we become more aware of the nature and extent of this disease. So I would like to discuss the important statistics about this disease and how they compare with more familiar diseases like influenza and measles. Specifically, I want to discuss the uncertainty in some of the key statistics, how some people twist those statistics to mislead us and imply the disease is less harmful or widespread than it actually is, and how even in the face of that uncertainty those statistics can help us learn about this disease and guide us to control it and minimize the harm to ourselves and those we care about.
On December 31, 2019, the Wuhan Municipal Health Committee informed the World Health Organization (WHO) of 27 “cases of pneumonia of unknown etiology (unknown cause) detected in Wuhan.” This was the first public announcement about COVID-19.
On January 7, 2020, scientists of the National Institute of Viral Disease Control and Prevention (IVDC) confirmed the novel coronavirus isolated on 3 January was the pathogenic cause of the viral pneumonia of unknown etiology (VPUE) cluster, and the disease has been designated novel coronavirus-infected pneumonia (NCIP).
On January 8, South Korea announced the first possible case of the virus coming from China.
On January 14, WHO sent a tweet which said “preliminary investigations conducted by the Chinese authorities have found no clear evidence of human-to-human transmission of the novel coronavirus (2019-nCoV) identified in Wuhan, China”. According to Reuters in Geneva, WHO said there may have been limited human-to-human transmission of a new coronavirus in China within families, and it is possible there could be a wider outbreak.
On January 15, the first known travel-related case of 2019 novel coronavirus entered the United States: “The patient from Washington with confirmed 2019-nCoV infection returned to the United States from Wuhan on January 15, 2020. The patient sought care at a medical facility in the state of Washington, where the patient was treated for the illness. Based on the patient’s travel history and symptoms, healthcare professionals suspected this new coronavirus. A clinical specimen was collected and sent to CDC overnight, where laboratory testing yesterday confirmed the diagnosis via CDC’s Real-time Reverse Transcription-Polymerase Chain Reaction (rRT-PCR) test.”
At this point it was clear to anyone concerned with diseases that this was a significant new disease. And doctors and public health people were asking questions like:
How bad is it? / What are the symptoms? / What is the case fatality/death rate?
How contagious is it? / Does it spread from humans to humans like the H1N1 or just from animals to humans like the so-called avian or bird flus? How much does it spread? / What is the reproduction number?
If it spreads from humans to humans, does it spread only after symptoms are evident like the first SARS virus or could it spread before people are aware they are sick? (which has turned out to be the case)
Who is vulnerable? Does it affect mainly the elderly or can anyone get it? Can children spread the disease? What are the long-term effects?
We can get clues to the answers from simple observations. For instance, there were several cases in China of families where one member of the family got sick, then over subsequent days the rest of the family also got sick. That is clear evidence of spreading from human to human, not a common source like exposure to the same animal at a live market.
There was also a case where a woman from China visited someone in Europe. After she returned to China, both people got sick with Covid-19. That was pretty clear evidence of pre-symptomatic spread.
But to really understand those answers and compare a disease like Covid-19 with other diseases like the H1N1 influenza or the more familiar measles, we need measurements in numbers, i.e. statistics. And to use those statistics intelligently, we need an awareness of just how accurate or uncertain they are.
So what are those key numbers?
Johns Hopkins University has put a wonderful website online to gather and share those numbers.
<a href="https://coronavirus.jhu.edu">Coronavirus Resource Center</a>
The top level emphasizes maps and graphs rather than specific numbers, but these <a href="https://coronavirus.jhu.edu/data/new-cases">Critical Trends</a> show the difference between a very slow downward trend in the U.S. overall and how other countries like Italy, Spain and Germany have reduced their overall cases by a factor of 3 or more.
As of May 8, 2020, <blockquote>The first case of COVID-19 in US was reported … on 1/22/2020. Since then, the country has reported 1,228,603 cases, and 73,431 deaths.</blockquote>
For analyzing and comparing different diseases, the key statistics involve <a href="https://www.health.ny.gov/diseases/chronic/basicstat.htm">Incidence, Prevalence, Morbidity, and Mortality</a>.
<blockquote>Incidence is the number of newly diagnosed cases of a disease. An incidence rate is the number of new cases of a disease divided by the number of persons at risk for the disease.</blockquote>
For instance, on May 7, Texas (where I currently reside; population 28,995,881) reported 968 new cases of Covid-19 for a total of 35,390. There were 25 deaths for a total of 973. New Mexico (where I used to live; population 2,096,829) reported 202 new cases for a total of 4493 and 3 deaths for a total of 172. New York (population 19,453,561) reported 2786 new cases for a total of 323,978 and 242 deaths for a total of 19,887. So the incidence rate in Texas that day was 1 in 29,954. New Mexico was 1 in 10,380. New York was 1 in 6983. But on Monday May 4 in New Mexico, 2 counties in the northwest (San Juan and McKinley) with a combined population of 201,864 reported 136 of the 186 new cases in the state. That gives those two counties an incidence rate of 1 in 1484, nearly 5 times as high as the state of New York.
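The “1 in N” figures above are simply the population divided by the day’s new cases. Here is a minimal sketch of that arithmetic in Python, using the numbers quoted above. The function names `one_in_n` and `per_100k` are hypothetical, chosen only for this illustration.

```python
def one_in_n(new_cases: int, population: int) -> int:
    """Express a daily incidence rate as '1 in N' people, rounded to the nearest person."""
    return round(population / new_cases)


def per_100k(new_cases: int, population: int) -> float:
    """Express the same incidence rate as new cases per 100,000 people."""
    return new_cases / population * 100_000


# (new cases, population) as quoted in the paragraph above
regions = {
    "Texas (May 7)": (968, 28_995_881),
    "New Mexico (May 7)": (202, 2_096_829),
    "New York (May 7)": (2_786, 19_453_561),
    "San Juan + McKinley counties, NM (May 4)": (136, 201_864),
}

for name, (cases, population) in regions.items():
    print(f"{name}: 1 in {one_in_n(cases, population):,} "
          f"({per_100k(cases, population):.1f} new cases per 100,000)")
```

Expressing the rate per 100,000 people as well as “1 in N” makes it easier to compare places of very different size, since a larger number then always means a higher rate.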
<blockquote>Prevalence is a measure of disease that allows us to determine a person’s likelihood of having a disease. Therefore, the number of prevalent cases is the total number of cases of disease existing in a population.</blockquote>
Prevalence tends to be used more for persistent diseases like hepatitis or diabetes. For instance, the CDC estimates that nearly 2.4 million Americans – 1 percent of the adult population – were living with hepatitis C from 2013 through 2016. At the same time, 34.2 million people in the U.S. have diabetes (10.5% of the US population). So far, most people who get sick with Covid-19 recover and many of them have symptoms that are barely detectable. But many of the seriously ill suffer from <a href="https://www.sciencemag.org/news/2020/04/survivors-severe-covid-19-beating-virus-just-beginning#">Post-ICU Syndrome</a>. And a few children are turning up with an <a href="https://www.washingtonpost.com/health/2020/05/06/kawasaki-disease-coronavirus/">inflammatory shock syndrome</a>. So, it may be important to track the prevalence of those conditions in the future.
<blockquote>Morbidity is another term for illness. A person can have several co-morbidities simultaneously. So, morbidities can range from Alzheimer’s disease to cancer to traumatic brain injury. Morbidities are NOT deaths. Prevalence is a measure often used to determine the level of morbidity in a population.</blockquote>
A search on morbidity for measles in the U.S. will lead you to the CDC Pinkbook, which tells us
<blockquote>Before 1963, approximately 500,000 cases and 500 deaths were reported annually, with epidemic cycles every 2–3 years. However, the actual number of cases was estimated at 3–4 million annually. More than 50% of persons had measles by age 6, and more than 90% had measles by age 15. The highest incidence was among 5–9-year-olds, who generally accounted for more than 50% of reported cases.</blockquote>
<blockquote>Mortality is another term for death. A mortality rate is the number of deaths due to a disease divided by the total population</blockquote>
So in the 1950s there were about 500 reported deaths per year in a population that was increasing from 150 to 180 million. That works out to about 0.3 deaths per 100,000, or roughly 3 per million. But the case fatality rate was about 1 in 1000 reported cases, and we have seen similar numbers in recent outbreaks in France, Romania and other countries. And there were also 48-53 deaths per 100,000 population from influenza and pneumonia. Since pneumonia is often a secondary infection in patients with measles, some of those deaths may have been a result of a measles infection.
One other important number is called the <a href="https://www.healthline.com/health/r-nought-reproduction-number#how-its-calculated">reproduction number</a>, often designated as R or R0. The distinction is that R is the actual reproduction number in the current population and can be location specific. R0 is the basic reproduction number. R0 tells you the average number of people who will contract a contagious disease from one person with that disease. It specifically applies to a population of people who were previously free of infection and haven’t been vaccinated. The higher the reproduction number, the more people will catch the disease if they are not immune.
R can be lowered from R0 when some of the population become immune to the disease, if people are widely scattered, or if artificial constraints on contact are instituted such as social distancing and shelter-in-place orders. And in some cases R can be higher than R0 when one person is especially contagious or in cramped quarters like a nursing home, military barracks, or prison. For the rest of this article, I have used R0 to refer to the calculated or estimated reproduction number.
For the 1918 flu pandemic, the R0 value was estimated to be between 1.4 and 2.8.
The median estimate of R0 for COVID-19 is 5.7, according to a study published online in Emerging Infectious Diseases. That’s about double an earlier R0 estimate of 2.2 to 2.7. There is a lot of uncertainty about the reproduction number for the <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3323341/">2003 SARS virus</a>, which was relatively well controlled by isolation and quarantine measures.
For the 2009 H1N1 pandemic, <blockquote>Rt declined from around 1.4–1.5 at the start of the local epidemic to around 1.1–1.2 later in the summer, suggesting changes in transmissibility perhaps related to school vacations or seasonality. Estimates of Rt based on hospitalizations of confirmed H1N1 cases closely matched estimates based on case notifications.</blockquote>
The reproduction number is important because it tells us how rapidly a disease can grow and spread if people are not protected against it. And it tells us how much we need to be protected against the disease to keep it under control. If you multiply the reproduction number by the fraction of the population that is not protected (1 minus the fraction that is protected, i.e. immune), you will see whether the disease will grow or decline with each generation. So if the reproduction number is 2 and 55% are protected, 2 x (1 - 0.55) = 0.9. Thus the number of people infected declines by 10% with each cycle and eventually the disease will die out. The best way to achieve that is by vaccination which doesn’t require each person to get the disease first. But we don’t have a vaccine for Covid-19 yet.
And so far we don’t know how well protected the people who do get the disease will be. An early study shows that even people who get a mild form of the disease <a href="https://directorsblog.nih.gov/2020/05/07/study-finds-nearly-everyone-who-recovers-from-covid-19-makes-coronavirus-antibodies/">develop antibodies</a>. But we won’t find out for some time how strong or long-lasting this protection will be. And immunity to coronaviruses doesn’t seem to last very long. Immunity to the first <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2851497/">SARS virus</a> only lasted an average of two years. We can hope for better, but we will have to wait and see.
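To make that multiplication concrete, here is a small illustrative sketch in Python. The function names are hypothetical. The threshold formula 1 - 1/R0 is just the point where the product described above equals exactly 1. It assumes protection is complete and that people mix evenly, which real populations do not, so treat the results as rough guides rather than predictions.

```python
def effective_r(r0: float, protected_fraction: float) -> float:
    """Reproduction number once a given fraction of the population is protected."""
    return r0 * (1.0 - protected_fraction)


def protection_threshold(r0: float) -> float:
    """Protected fraction at which the effective reproduction number is exactly 1.

    Above this fraction each generation of cases is smaller than the last.
    Simplifying assumptions: perfect protection and even mixing.
    """
    return 1.0 - 1.0 / r0


# The worked example from the text: R0 of 2 with 55% protected
print(f"R0 = 2.0 with 55% protected: R = {effective_r(2.0, 0.55):.2f}")

# Thresholds for the COVID-19 estimates quoted earlier in this article
for r0 in (2.2, 2.7, 5.7):
    print(f"R0 = {r0}: more than {protection_threshold(r0):.0%} must be protected")
```

With a reproduction number of 2 the threshold is 50%, which is why 55% protection is enough to push the number below 1. If the higher estimate of 5.7 is closer to the truth, the threshold rises to roughly 82%.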
So now that I’ve laid out a basis for discussion, let’s look at the ways in which people can use statistics to mislead the reader/listener to distort or minimize the danger of this disease. The first is a variation on a classic fallacy called the argument from ignorance. The basic form of this is to say that since you don’t understand how something can happen, then it can’t happen. But what we have seen especially in January and February when we were just learning about this disease was a tendency to assume that the cases we had identified were all there were. This especially showed up in political pronouncements, but it also affected the perception in rural states that hadn’t seen the disease yet while it was building up in Washington, California, New York and Louisiana.
And we still see this thinking in the comments that spring up on news story sites and blogs that cover politics and medicine. I read one commenter arguing yesterday that the SARS-CoV-2 wasn’t a problem because no cases had been diagnosed in his county. And even though Texas continues to report about 1000 new cases a day and deaths have almost doubled from 663 to 1100 in the last two weeks, there are still many counties in western Texas that have yet to see their first case of this disease.
But at the political level, this is turning into willful ignorance. I always thought that knowledge was power, but our president doesn’t want to know how many people are sick because it will make us (or perhaps him?) <a href="https://www.businessinsider.com/trump-says-too-much-coronavirus-testing-makes-us-look-bad-2020-5">look bad</a>. And even though Dallas County tied its single day high yesterday and “Texas ranks last in coronavirus testing in the country”, <a href="https://www.dallasnews.com/news/public-health/2020/05/11/dallas-county-reports-253-new-coronavirus-cases-tying-its-single-day-high/">the federal government plans to pull COVID-19 tests from two Dallas sites</a> which can test up to 500 people a day.
Two other ways to argue against the evidence take advantage of the fact that several key statistics are actually ratios, one number divided by another number.
<blockquote>A mortality rate — often confused with a CFR — is a measure of the number of deaths (in general, or due to a specific cause) in a population scaled to the size of that population per unit of time. A CFR, in contrast, is the number of dead among the number of diagnosed cases. The term infection fatality rate (IFR) also applies to infectious disease outbreaks, and represents the proportion of deaths among all the infected individuals. It is closely related to the CFR, but attempts to additionally account for all asymptomatic and undiagnosed infections.</blockquote>
So, for instance, in Texas with 39,869 cases and 1100 deaths, the CFR is 2.76% but the mortality rate is only 0.00379% since most of the population has still not been exposed to and been infected with this virus. So one argumentative technique is to argue that the disease is not that dangerous because the mortality rate is so low (so far). This is a way of inflating the denominator: the deaths are divided by the whole population rather than by the people who actually caught the disease.
But to see how dangerous Covid-19 really is, we need to compare it with other diseases like influenza or measles. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3809029/">Wong et al</a> did a systematic review of the 2009 H1N1 influenza pandemic and came up with a CFR upper bound of “10 deaths per 100,000 infections” or about 0.01%. The CFR for measles varies widely from <a href="https://www.who.int/immunization/sage/Review%20article%20of%20measles%20CFRs.pdf">country to country</a>. But typical estimates hover around 1 in 1000 or 2000 cases (0.05% to 0.1%), higher than that H1N1 bound. For comparison, death from measles was reported in approximately 0.2% of the cases in the United States from 1985 through 1992. So, Covid-19 is clearly a much more serious disease than influenza or measles. And we can’t yet vaccinate to protect ourselves.
Another common technique is to deprecate the numerator by making it seem smaller than it really is. We can see this in Senator Scott Jensen’s <a href="https://www.usatoday.com/story/news/factcheck/2020/04/24/fact-check-medicare-hospitals-paid-more-covid-19-patients-coronavirus/3000638001/">TV claim</a> that “hospitals get paid more if Medicare patients are listed as having COVID-19 and get three times as much money if they need a ventilator”. The implication, which was widely shared on Facebook, is that the actual number of Covid-19 cases is lower than reported. But Senator Jensen himself said that he thinks the overall number of COVID-19 cases has been undercounted because of limitations in the number of tests available.
And if you look at total deaths <a href="https://www.nytimes.com/interactive/2020/04/21/world/coronavirus-missing-deaths.html">regardless of cause</a>, you can see a huge increase beyond what is accounted for in the official coronavirus death counts. Some of those deaths are undoubtedly from other causes and might have been prevented if hospitals hadn’t been so swamped with Covid-19 cases. But many are likely people who didn’t test positive or were advised to stay at home because their symptoms were mild.
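The gap between those two percentages comes entirely from the choice of denominator. As a rough sketch, here is the same calculation in Python using the Texas figures quoted in this article. The function names are hypothetical. The mortality rate here is cumulative to date rather than per year, which is an assumption of this example.

```python
def case_fatality_rate(deaths: int, confirmed_cases: int) -> float:
    """Deaths divided by diagnosed cases (CFR)."""
    return deaths / confirmed_cases


def mortality_rate(deaths: int, population: int) -> float:
    """Deaths divided by the whole population, infected or not."""
    return deaths / population


TEXAS_POPULATION = 28_995_881  # population figure quoted earlier in this article
deaths, confirmed_cases = 1_100, 39_869

cfr = case_fatality_rate(deaths, confirmed_cases)
mortality = mortality_rate(deaths, TEXAS_POPULATION)

print(f"CFR:            {cfr:.2%}")
print(f"Mortality rate: {mortality:.5%}")
print(f"The CFR is about {cfr / mortality:,.0f} times the mortality rate")
```

The numerator is identical in both ratios. Only the denominator changes, and the mortality rate will keep rising as more of the population is exposed.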
A final way to argue against the data is to quote an outlier or irrelevant statistic. For instance, on one day last week New York City reported that a high percentage of new Covid-19 cases were people who were sheltering at home. The implication was that sheltering at home didn’t help. But the 75% reduction in the city’s daily case rate belies that implication. And many of those cases were in nursing homes or senior living facilities which can easily become hotspots for an outbreak.
So where do we go from here? There is a broad consensus among medical and epidemiology experts that we need to do three things.
First, we should continue shelter-in-place, social distancing and other infection limitation measures until we achieve a steady reduction in cases over a 14 day period. The target level is usually stated at about 1 new case per day per 100,000 population. For Texas, that would be about 300 cases. So at 1000 cases per day which is not going down, we still have a lot of work to do.
Second, we also need to increase the level of testing to locate outbreaks. A prominent <a href="https://www.npr.org/sections/health-shots/2020/05/07/851610771/u-s-coronavirus-testing-still-falls-short-hows-your-state-doing">Harvard study</a> found that we need to increase the level of testing to about 150 tests per day per 100,000 people. This can vary depending on how common and widespread the disease is in a specific state or city. This model shows that Texas needs to do about 27,000 tests a day while it is averaging only about 17,000. New Mexico is doing about 3,000 tests a day and needs to do about 5,000.
At that point, we can control the outbreak and limit its spread with <a href="https://time.com/5825140/what-is-contact-tracing-coronavirus/">contact tracing</a>. This worked to control the 2003 SARS outbreak and the 2014 Ebola virus epidemic. But it is a laborious process of sending trained people out to identify recent contacts of people who test positive and advise them to quarantine themselves. The goal of 1 new case per 100,000 is primarily based on this workload. Even then, managing new outbreaks can be tricky.
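To see what the target of 1 new case per day per 100,000 people means for a particular state, the arithmetic can be sketched as follows. The function name `target_daily_cases` is hypothetical, and the populations are the ones quoted earlier in this article.

```python
def target_daily_cases(population: int, per_100k: float = 1.0) -> float:
    """Daily new-case count that corresponds to a per-100,000 target."""
    return population / 100_000 * per_100k


populations = {
    "Texas": 28_995_881,
    "New Mexico": 2_096_829,
    "New York": 19_453_561,
}

for state, population in populations.items():
    print(f"{state}: about {target_daily_cases(population):.0f} new cases per day")

# Texas was reporting roughly 1000 new cases a day when this was written
texas_gap = 1_000 / target_daily_cases(populations["Texas"])
print(f"Texas is at roughly {texas_gap:.1f} times its target")
```

I have left the testing target out of this sketch on purpose. The study’s state-level numbers depend on how widespread the disease is in each state, so they cannot be reproduced from population alone.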
But right now, the different states in the U.S. and other countries around the world are doing a vast experiment adopting varying strategies in how to limit this new disease, support their population and try to get their economies and life in general back to a semblance of normal or whatever the new normal turns out to be.
Final note: I’m not an expert in these subjects so I try to base my thoughts on those who know and understand better than I do. I welcome comments and critiques. But for now, they go to moderation so I can filter out any profanity or obvious spam.
May 19, 2020
It’s been about a week and I decided to give this a quick read-over and did a little minor editing. I’ll probably revisit the subject in about a month.
One persistent argument that I see online is a version of deprecating the numerator. The idea is that people who die with other co-morbidities should be counted as deaths from those and not from Covid-19. Thus our estimates of the CFR are too high and the SARS-CoV-2 virus is not as dangerous as claimed. But this is pointless and misleading for a couple of reasons.
First, it doesn’t save anyone’s life. It just shifts the blame. And, as the excess death statistics show, a lot more people are dying this year in the U.S. and other countries than died last year. And the one major difference this year is Covid-19. Future scientists will sort out the overall effects of this new virus. But until then, we should track and count all cases that test positive as we are now doing.
And second, if anything, counting only the positive cases understates the devastating effect of this disease. As Dr. Jeremy Samuel Faust <a href="https://news.yahoo.com/flu-deaths-were-counted-covid-053449918.html">points out</a>, the official estimates for influenza cases and deaths include many cases based on symptom-based diagnoses even without viral confirmation. If we used the same method for counting influenza that we are now using for Covid-19, the official influenza counts would be much lower.
<blockquote>The 25,000 to 69,000 numbers that Trump cited do not represent counted flu deaths per year; they are estimates that the CDC produces by multiplying the number of flu death counts reported by various coefficients produced through complicated algorithms. These coefficients are based on assumptions of how many cases, hospitalizations, and deaths they believe went unreported. In the last six flu seasons, the CDC’s reported number of actual confirmed flu deaths — that is, counting flu deaths the way we are currently counting deaths from the coronavirus — has ranged from 3,448 to 15,620. [Jeremy Faust, Scientific American]</blockquote>
Another argument I have seen is the irrelevant comparison, such as the <a href="https://www.cdc.gov/tobacco/data_statistics/fact_sheets/health_effects/tobacco_related_mortality/index.htm">number of people dying each year from tobacco</a>. If cigarette smoking causes 480,000 deaths every year and Covid-19 “only” causes 60,000 or 80,000 or now more than 90,000, why are we getting so concerned about this virus?
Of course we have taken action about the damage from tobacco in the last 30-40 years. During that time, cigarette use has declined substantially. If we hadn’t done those things, there would be a lot more people dying from tobacco. But this argument also ignores the difference in the two conditions. Tobacco deaths are a sort of sunk cost, the result of years of behavior. Even if every smoker in the U.S. stopped tomorrow, many of those deaths would happen anyway because of the damage their bodies have already suffered. Also, most tobacco use is voluntary although some of the damage is from secondary smoke. It’s not easy, but people can choose not to smoke. No one can just choose not to get infected by a virus in the air they breathe. It requires caution and cooperation.
Covid-19, by contrast, is a contagious disease. If we cooperate and work together to limit its spread, we can minimize the harm and greatly reduce the number of people who suffer this disease. That gives us time to find a useful treatment and to develop and produce an effective vaccine to finally control it.
This Week in Virology is a podcast available on Google, Apple, Spotify and other podcast apps. It offers a fascinating discussion of ongoing subjects in virology even though much of it is way above my limited biology training. Episode 607 featured an interview with Dr Jeffrey Shaman of Columbia University https://datascience.columbia.edu/jeffrey-shaman.
<blockquote>to explain why more SARS-CoV-2 testing and contact tracing is needed to stop the pandemic, and provide insights on immunity and reinfection from seasonal CoVs, the problems with antibody tests, and what to expect in the coming months</blockquote>
There was a special interview episode on March 26 featuring Dr Mark Denison of Vanderbilt University who discussed COVID-19 and SARS-CoV-2 with an emphasis on antiviral therapeutics.
Another useful source was the Explore the Space podcast Episode 181. That episode featured Jeremy Konyndyk, who is a Senior Policy Fellow at the Center for Global Development and a recognized expert on global outbreak preparedness. They discussed the unsettling response to the Covid-19 pandemic from the US government thus far, the disruptive impact of magical thinking, and what a fierce sense of urgency looks like.