Whether we like it or not, the past months have given us a crash course in epidemiology. COVID-19 has taken terms like reproduction number, herd immunity, social distancing, or flattening the curve from research literature to daily news and instructions for visitors to California State Parks.

We are in the middle of a pandemic we have partially tamed by putting the economy in a coma. This pandemic has already killed more Americans in two months than the Vietnam war in 20 years and we are facing the unprecedented challenge of restarting factories in this context.

Among the many things to learn in a hurry, are what epidemiologist Adam Kucharski calls the rules of contagion, as they apply to the people who work in a factory and its surrounding community.

Quality control is the closest most of us in Manufacturing ever get to serious statistics/data science. It’s not the same domain as epidemiology, and there is little crossover in tools or methods. This is to share what I have just learned about this topic. I welcome any comment that might correct misconceptions on my part or otherwise enlighten us.

Maps

Statistical epidemiology has a long history, and it started with mapping the geographical locations of cases. It is easier to do now than ever before and still useful in guiding action, whether at the level of a country or a factory.

The Cholera Map of London

In Visual Explanations, Edward Tufte cited the work of John Snow, a doctor who plotted the locations of cholera deaths in London in September 1854 and connected them to the Broad Street water pump. Gary Kapanowski also recommended Steven Johnson’s 2006 book The Ghost Map about the London cholera epidemic.

This was not about models of the dynamics of epidemics but about mapping the geographical locations of infections, visually identifying clusters, and linking them to other features like water pumps.

COVID-19 Maps Today

Mapping is easier today and used to support the same kind of detective work John Snow did in London 166 years ago. The following heat map of COVID-19 in the US was in the New York Times on 5/4.

COVID-19 hot spots in the US on 5/4/2020

If you zoom in on Ohio, you find the highest number of cases per capita in Marion and Pickaway county, on news5cleveland.com. By zooming and investigating further, you find that both counties have large jails where the majority of inmates are infected.

COVID-19 Hot Spots in Ohio on 5/4/2020

Maps of Fever Anomalies

Mapping is also used, for example, for identifying COVID-19 clusters through temperature anomalies detected by connected thermometers. The Kinsa company has sold 2 million connected medical thermometers in the US, and collects 150,000 temperature readings daily. They claim that, after filtering out the noise due to seasonal flu, they can see upcoming COVID-19 clusters before the CDC does, on literal heat maps.

Maps of Factory Cases

If, in a factory, you read every employee’s temperature when they arrive and systematically test any employee with COVID-19 symptoms, you have a list of confirmed cases within the workforce and you can locate the work location of each on a map of the plant.

If this map shows you that a cluster is growing in the welding shop or in a segment of final assembly, you can zoom in on it, test everyone working in it, disinfect it, reinforce the personal protection equipment in it, and reengineer the line…

Contagion over time

In The Rules of Contagion (2020), Adam Kucharski introduces Ronald Ross as a pioneer in mathematical models of contagion. The charts used for COVID-19 in connection with expressions like “flattening the curve” or “herd immunity” are based on refinements of the SIR model developed by Ross disciples Kermack and McKendrick in 1927 in the UK, about the same time Shewhart was developing Control Charts.

For current research on this subject, see, for example, the MIDAS network. MIDAS stands for “Models of Infectious Disease Agent Study” and is a network of academics funded by the US National Institutes of Health.

Exponential Growth At The Outset

In the beginning of an outbreak, the number of infected people grows exponentially, and news reports commonly used this term about COVID-19 in February and March, 2020. It means that the number of infected people grows at a rate that is proportional to the number already infected.

It’s like instantly compounded interest, as each infected person joins the group that causes new infections. In the charts published about COVID-19, the growth has been expressed in terms of the number of days it takes to double, which gives you a straight line in a semi-logarithmic plot.

The Katz/Sanger-Katz Chart

The following chart, by Josh Katz and Margot Sanger-Katz, was used to show deaths rather than infections in multiple countries, with the straight gray lines corresponding to different doubling intervals.

Cumulative COVID-19 death toll in different countries as of 4/13/2020 veers off exponential model

Even in the short period of time shown here, the actuals plots do not stay straight, meaning that they veer away from exponential growth. It is a model for a trend, treating the number of infected people like a continuous variable when, in fact, it is a random integer.

The Charts From “Our World In Data”

Our World In Data publishes many charts and data sets about COVID-19 that are free to download and updated daily. Online, you can select the countries for which you wish to see the plots. They also included the straight lines corresponding to exponential growth at various speeds.

More decisive departure from the exponential model between March and May, 2020

Why plot deaths rather than infections? As COVID-19 deaths are better documented than infections, they can be plotted more accurately. The death toll is a lagging indicator of the infection rate, proportional on a given day to the number of infections roughly two weeks earlier.

While not all COVID-19 deaths are recorded as such and some deaths from other causes may be misclassified as due to COVID-19, deaths are still more accurately recorded, in many different countries.

Many virus carriers don’t know they are infected, don’t experience any symptoms and don’t see a doctor, while they are in fact contaminating others. We don’t know how many there are. See more details below in the discussion of recovery and mortality.

The Exponential Model

In the early stages, between times t and t+dt , the number I\left ( t \right ) of infected people grows by \beta I\left ( t \right )dt and shrinks as infected people recover or die by \gamma I\left ( t \right )dt , so that the net change is dI(t) = \left(\beta – \gamma\right)I(t)dt = \left(R_{0} – 1\right)\gamma I(t)dt , so that

I(t) = I(0) exp\left[\left(R_{0} – 1\right)\gamma t\right]

with R_{0} = \frac{\beta}{\gamma} , pronounced “are nought” and known as the epidemic’s reproduction number.

Beyond the very early stages, the exponential model breaks down because it takes no account of the decrease of the susceptible population with each infection.

R_{0} : An outbreak’s reproduction number

R_{0} is a key parameter of the disease in a specific human and natural environment. If R_{0} > 1 , the introduction of a small number I(0) of infected people starts an outbreak. On the other hand, if R_{0} < 1 , I(t) quickly peters down to 0 and the disease “vanishes.”

Published Estimates

R_{0} s are estimated and published by the US Centers for Disease Control (CDC) for various outbreaks of various diseases. They are a function of both a disease and a population. How a virus to spread and how infected people recover varies with the lifestyle of the population and its living environment.

These differences can be an order of magnitude. For example, US CDC publications rate the MERS coronavirus, a predecessor of COVID-19, with R_{0} = 0.45 in Saudi Arabia and R_{0} = 8.1 in Korea. This is 18 times higher. Statements found in the press, like “The flu has an R0 value of about 1.3…” make no sense unless the context is specified.

COVID-19’s R_{0} in various parts of the US is not yet known. For the Wuhan outbreak, the Chinese government announced it between 2 and 3, already a wide margin. The CDC, however, now estimates it with a median of 5.7 and a 95\% confidence interval of 3.8 to 8.9 .

In the US, we can expect different R_{0} values for a factory, a state, and the whole country, and ranges of values for each. These wide, non-overlapping ranges for COVID-19 as well as for other diseases suggest that R_{0} is not an easy parameter to estimate.

Undisclosed Estimation Methods

Many publications are coy about the way R_{0} is estimated. Healthline, for example, has an article entitled “How is the R0 of a disease calculated?” that doesn’t tell you. In addition, it mistakenly describes R_{0} as a parameter of the disease only. The School of Public Health at the University of Michigandoes not say much more:

…there’s uncertainty about many of the factors that go into estimating R0, such as in estimating the number of cases, especially early on in an outbreak. Based on these current estimates, projections of the future number of cases of coronavirus are fraught with high levels of uncertainty and will likely be somewhat inaccurate.

The difficulties arise for a number of reasons.

First, the basic properties of this viral pathogen – like the infectious period – are as yet unknown.

Second, researchers don’t know how many mild cases or infections that don’t result in symptoms have been missed by surveillance but nevertheless are spreading the disease…

In other words, we don’t have the data needed to estimate R_{0} when an outbreak starts and this information would be the most useful. We can revisit the question after introducing a more elaborate model, that is applicable beyond the initial stage.

Within the workforce of a factory, this information is easier to obtain than in a larger population. We will discuss this below. First, let us introduce the most commonly used model of infectious diseases.

The SIR Model

“SIR” stands for “Susceptible, Infected, and Recovered.” The 1927 SIR model ignores the variations in the overall population N due to births, deaths by causes other than the epidemic, and migrations, and breaks it down into three compartments:

I infected and contagious individuals

S susceptible individuals, who can contract the disease through encounters with infected, contagious individuals

R individuals who have been infected but are no longer contagious as a result of recovery or death and, if alive, are now immune.

Whether the assumption that recovered individuals are immune holds for COVID-19 is currently unknown but it has held for most infectious diseases in the past.

The SIR Model, Explained With High School Math

Like Shewhart with SPC, the epidemiologists of the 1920s had to keep their models simple enough for manual calculations. Today, all sorts of refinements are possible, and the randomness of contagion can be simulated to obtain ranges of outcomes rather than only expected values.

When I << S , we can use the exponential model, as when there are 10 infected individuals in the US population of 330 million. When considering the spread of the disease in a smaller group, like the 3,500 employees of a factory, the dynamics of the three compartments cannot be ignored.

The SIR Model In Equations

Between times t and t + dt , there are I\times S possible encounters between infected and susceptible individuals. A fraction of those possible encounters actually occurs, and a fraction of the actual encounters result in an infection, as shown in the following picture.

Infections are a fraction of possible encounters

With the additional simplifying assumption that all the infected people are equally infectious, the number of susceptibles decreases as follows:

\frac{dS} {dt} = -\alpha I S

Near t = 0, S \approx N and, for consistency with the exponential approximation, we choose to use the per-capita rate \beta :

\alpha = \frac{\beta}{N}

Meanwhile, a fraction of the infected people recovers or dies, therefore:

\frac{dR} {dt} = \gamma I

This is also a simplifying assumption because if, for example, an infection takes 25 days to run its course, then there will be no recovery at all for the first 25 days after the first infection.

As above the number of infected individuals is increased by the number of new infections and decreased by the number of recoveries. Therefore, the rate of net change is

Instead of \gamma, the recovery process is often described more intuitively in terms of the infectious period I_{p} = \frac{1}{\gamma} .

Back-of-the-envelope Conclusions About The SIR Model

The High School math above leaves us with a system of three differential equations:

\frac{dS} {dt} = -R_{0} \frac{S}{N}\gamma I

\frac{dI} {dt} = \left(R_{0}\frac{S}{N} -1\right)\gamma I

\frac{dR} {dt} = \gamma I

with the constraint S(t) + I(t) + R(t) = N for all t .

Even though these equations look simple, the system does not have a solution in the form of a simple formula involving log, exp, polynomials or trig functions. The solutions you see plotted are numerical approximations.

Before giving up on the High School math, however, we can still, on the back of an envelope, figure out the shape of the infection curve and the proportion of the population that is infected by the time the epidemic ends. The following chart shows what we can find this way.

For an epidemic to even start, we must have at least one infected person in the population. Therefore, I\left(0\right)>0 . The population N is large compared to I\left(0\right) , and we can assume that, initially, no one is immune, so that S\left(0\right) \approx N , R\left(0\right) = 0 and

and the number infected I(t) falls off the exponential growth to reach a peak when S\left(t\right) = \frac{N}{R_{0}} . At that point, there are no longer enough Susceptibles available to sustain growth in the infection.

Beyond the peak, I decreases faster than a negative exponential as S decreases. If S_{1} = S\left(t_{1}\right) for any t_{1} past the peak, then, for t > t_{1} ,

If r_{\infty} = \frac{R_{\infty}}{N} is the proportion of recovered individuals at the end of the epidemic, it satisfies the equation

1- r_{\infty} – exp\left[-R_{0}\times r_{\infty}\right] = 0

We can solve this numerically to obtain s_{\infty} = 1- r_{\infty} as a function of R_{0} and plot it:

Since “Recovered” means “previously infected but no longer contagious,” it includes those who die from the disease and those who suffer permanent ill effects. The proportion s_{\infty} of people in the population who live through the epidemic without every being infected is, therefore, the most interesting, and it drops fast at R_{0} increases.

With R_{0} in the range estimated by the CDC for COVID-19 in Wuhan, in the absence of countermeasures, the entire population would have been infected; by the lower estimates of the Chinese government, between 75% and 95% would have been.

We now know a fair amount about the way the epidemic plays out but we still don’t know when the peak happens or how high it is. We’ll need more powerful tools to find out.

Recovery and Mortality

In the SIR model, the Recovered compartment includes those who died from the disease. In everyday speech, death is certainly not a special case of “recovery” and I suspect the term was chosen to fit the mnemonic acronym “S.I.R.”

We can assume that the final number of deaths d_{\infty} of the epidemic proportional r_{\infty} . For COVID-19, we don’t have the data needed to evaluate the ratio, for lack of estimates of the number of infected people.

Mortality Data

In March and April, 2020, COVID-19 killed more Americans than the Vietnam war in 20 years. We can read in the above chart that reducing R_{0} from 3 to 1.5 would reduce r_{\infty} by 29% and therefore d_{\infty} by the same ratio. In other words, “flattening the curve” doesn’t just smooth the workload of hospitals, it also directly saves tens of thousands of lives just in the US.

Most countries, record times and causes of death. This data, however, is a human judgment call, and there are many reports asserting that underreporting of COVID-19 as a cause of death.

Case Fatality Rate Versus Infected Fatality Rate

The most commonly cited metric of mortality from COVID-19 is the Case Fatality Rate (CFR), the probability that a person diagnosed with the disease will die of it. What we would really like to know is the Infected Fatality Rate (IFR), the probability that a person infected with it will die of it.

The health care system knows who was diagnosed with the disease and when. Patients who die have been diagnosed before, and we should be able to know when. With this data, for each day in the past, we should know how many people were diagnosed that day, and how many of them died. From these numbers, we should be able to track the CFR over time without bias.

Infected People

Many infected people don’t know it and recover without showing any symptoms. They transmit the virus but leave no trace in the health care system. Testing random samples in the population would provide the data to estimate their number but this has yet to be done.

With the flu also, many catch the disease and self-medicate without leaving a record, like the Emily character in The Devil Wears Prada:

Based on a 2015 paper, the CDC does not attempt to count the Emilys:

“For the post-pandemic seasons, we lacked data on non-hospitalized illnesses and did not estimate this number of influenza illnesses in the population.”

Using the CFRs of two diseases is an apples-to-apples comparison as long as it is clear that they are not IFRs. As reported in Our World in Data, as estimated by the US CDC, the CFR for seasonal flu is between 0.1\% and 0.2\% ; for COVID-19, 5.84\% as of 5/4/2020 per European CDC.

What creates confusion is mixing CFRs and IFRs. On May 2, on CNN, I heard Dr. Sanjay Gupta misspeak when, after discussing the rate of deaths per diagnosed case of COVID-19, he said “with the flu, if you are infected, you have 0.1% chance of dying.”

Plotting The S,I, and R Curves

Within the SIR model, to see when the peak occurs and how high it is, you need to solve the system of equations that express the dynamics of the epidemic.

The shinySIR package in R is one of many resources that let you plot epidemic profiles for ranges of R_{0} and I_{p}. In the first example, with a low R_{0}, the infection curve could be confused with a classical bell shape. The second example, with a higher R_{0}, shows that it is in fact both pointy and skewed.

As R_{0} is the ratio of two rates, it is independent of whichever unit of time you use. I_{p} , on the other hand, is a period and therefore expressed in units of time. The x-axis in the following charts just says “Time.” We can assume it to be days.

In the first example, the epidemic dies down before all the Susceptibles are Infected; in the second, the infection reaches everyone.

Bracketing The Parameters

The main limitation of the SIR model is that it treats random numbers of people like continuous variables determined by differential equations. In 1927, Kermack and McKendrick couldn’t go beyond this but we can with an ordinary laptop computer.

S, I and R truly are random integers, and the SIR model gives their expected values. Then:

\beta \frac{S}{N}dt is the probability of S dropping by 1 and I rising by 1 between t and t+dt and

\gamma dt , likewise, is the probability that I drop by 1 and R rise by 1 .

By dividing the time in sufficiently tiny \Delta ts we can use this to simulate multiple epidemic trajectories and estimate the distributions of S\left(t\right) , I\left(t\right) and R\left(t\right) around the expected values provided by the model.

It is also possible to fit an SIR model to actual data. The rt.live site provides confidence intervals for the reproduction number R_{t} at time t for all 50 states in the US.

If nothing changes from t= 0 onward, then R_{t} = R_{0}\times \frac{S\left(t\right)}{N} . If, on the other hand, public health countermeasures, or the weather, change the dynamics of the outbreak, this shows in the data on new cases, and rt.live refines the estimates by a method published by Bettencourt and Ribeiro in 2008. It provides both a most likely value for of R_{t} and a confidence interval.

If R_{t} > 1 , I is grows; otherwise, it falls. The rt.live snapshot for 4/28/2020 is as follows:

The best value is for Michigan; the worst, for Nebraska. The time series for R_{t} provided for these states are as follows:

The people behind rt.live are Instagram co-founders Kevin Systrom and Mike Krieger, and Gradient founder Thomas Vladeck. Their raw data and Python code are available to download.

Refining The Model

The IT available to us also allows us to remove the simplifying assumptions of the SIR model. When using simulations, we are free to be more realistic:

We don’t have to assume a constant population N. Mothers can have babies, people can die of other causes, and migrants can join or leave.

People can be less contagious when asymptomatic, more with mild symptoms, and most with acute symptoms.

The immunity of recovered patients can have a time limit.

Transmission may be less likely in hot than in cold weather.

…

Just because we can, however, doesn’t mean we should. It all depends on what is gained as a result of using a more realistic model, in terms of the decisions it supports.

Flattening The Curve

“Flattening the curve” is the stated goal of the countermeasures taken worldwide, in effect to reduce the frequency of encounters between Infected and Susceptible individuals. The US Centers for Disease Control (CDC) published the following picture to explain the rationale:

The shape of these curves is a clear indication that, after 93 years, the SIR model and its subsequent refinements have not lost their relevance.

The primary motivation given for flattening the curve is to avoid overloading the health care system with a sharp peak in infections. In such a peak, more patients die from being denied care for lack of equipment and health care workers.

There are other reasons to flatten the curve that are not commonly mentioned:

It reduces the total number of infections and therefore deaths, as discussed above about s_{\infty} and r_{\infty} .

Those of us who will die from COVID-19 would rather do it later than sooner.

Flattening the curve buys time to find cures, treatments, and vaccines.

SIR Curves And The Spanish Flu of 1918-1919

On 3/28/2020, the National Geographic published a series of charts on Spanish flu death rate in US cities in 1918-1919, highlighting, for each city, the period of time during which they applied “social distancing measures,” although I doubt the term was used at the time. These charts are commonly used as evidence of the effectiveness of today’s social distancing against COVID-19.

The plots are not of infections but of deaths that occurred in excess of what would have been usual numbers at these times and locations. For comparisons between cities of unequal populations, the quantity plotted is a number of deaths per 100,000 inhabitants. Using the same method, Josh Katz and Margot Sanger-Katz have plotted the differences between the actual death counts March 15 through May 5, 2020 and the expected values based on prior years’ counts for the same periods in the same locations. They found, for example, that New York City had about 23,000 excess deaths during that period, compared to the 18,000 officially imputed to COVID-19.

The visual similarities between the Spanish flu charts and the SIR model suggest that, for all its simplifying assumptions, it does capture the fundamental dynamics of contagion within a population. The CDC today still uses R_{0} as a key characteristic of an outbreak.

The worst and the best cities were Philadelphia and St. Louis, as shown in the following picture:

This article strikes me as valid and informative, with a few caveats:

The article does not cite the sources of the data and does not explain what specific social distancing measures each city used.

Since we don’t know the details of what each city did, the difference between the white and brown areas on the chart are between doing nothing and doing something.

The populations of Philadelphia and St-Louis in 1918 were respectively 1.5 million and 687,000 and therefore there is no doubt that the differences in death rates could not have arisen from the same probability of death.

The brown zone in St-Louis did start about 5 weeks after the first death versus 3 weeks for Philadelphia. The brown zone in St-Louis started when the death rate was lower and rising slower. In other words, Philadelphia responded to a blaring red alert; St-Louis, to a blinking amber light. The outcome shows that it was proper to act at that point.

The cities that had second bumps obviously did something wrong but the charts don’t contain enough information to figure out what it was.

Restarting Factories Safely

You mitigate an outbreak by reducing R_{0} , which means that you either reduce the infection rate \beta or increase the recovery rate \gamma . For COVID-19, we currently have neither a vaccine that would bring \beta to 0 nor cures to boost \gamma .

Until we do, all we can do is reduce the number of encounters between Infected and Susceptible individuals. Testing, contact tracing, isolation, quarantines, social distancing, personal protective equipment, etc., are all countermeasures intended to reduce \beta .

The evidence so far is that they work, albeit at the cost of putting the economy in an induced coma, causing great hardship including famines.

The challenge to manufacturers today is to bring the factories back up without losing the gains so painfully achieved. The workforce of a factory is a population, with its Susceptibles, Infected, and Recovered compartments.

With appropriate monitoring, you can track their evolution over time and use the SIR model or its more sophisticated descendants to anticipate the evolution of the epidemic within this population.

A factory is not, however, an isolated population. Family members can infect employees who then bring the disease to work; employees can catch the virus at work and contaminate their families. If you measure the rate of infection in the community at large, then you can compare it with the rate among employees.

Measuring Infection Rates By Random Sample Tests

We predict election results from samples of 1,000 voters and audiences of TV shows from 5,000 viewers. We should be able to know the COVID-19 infection rate of 330 million Americans with a small margin of error from testing a random sample of a few thousand, as recommended by Dartmouth Profs. Daniel Rockmore and Michael Herron. It is difficult to understand why this has not been done yet.

If you are short on test kits, it’s understandable to reserve them for people with symptoms. However, a small number of test kits is all we need to find out what proportion of the population is infected, including contagious people with no symptoms, which is also vital for public health.

This kind of testing will increase the estimates of the number of infected people by an unknown factor, possibly frighteningly large.

In Vò, Italy, they tested the entire population of 3,500, and were able to stop contagion completely. Testing a random sample would not achieve this. All it would do is give us a confidence interval for the rate of infection in the population.

The closest to random sample tests in the US have been two studies conducted respectively out of Stanford University about Santa Clara County and USC about Los Angeles County.

Both studies tested for antibodies, revealing past but not current infections. The Santa Clara County study recruited participants by an ad on Facebook; the Los Angeles study, through a marketing firm. Neither sample was random. Both were self-selected, and the antibody tests had not been vetted by the Food & Drug Administration.

Within a factory, a policy of “cycle testing” of a random sample of employees every day could track the infection level of the entire workforce and trigger alarms, as discussed below.

Cycle Testing

Warehouse managers are familiar with cycle counting as a preferred means of assuring inventory accuracy. If testing for COVID-19 infection were as quick, easy, accurate and cheap as taking temperature, then all employees could be tested every day when arriving at work.

The technology for this may eventually exist but, as of now, each test costs about $100, requires specially trained operators, takes up to a week to get the results, and 15% to 30% of the negative results are false.

The processing time for a test sample is on the order of 6 hours, and the rest of the lead time is in transportation and waiting. This says that, for tests to be of any use in a factory, the samples must be processed in-house, and that the factories must set up their own labs.

Cycle testing is based on cycle counting. You partition the employees into random groups of equal size and test one group every day. For a workforce of 2,000, you might form 20 groups of 100. You would select each group to be a representative random sample. Testing one group every day with same-day results would achieve two purposes:

Monitoring the level of infection within the workforce, including asymptomatic carriers.

Testing each employee every 20 working days.

Once the level of infection is low single-digit percentages, you can increase testing productivity without losing accuracy. There is a method first used by the US Army in World War II to test soldiers for syphilis. It involves mixing the samples from a group and testing it. A negative result clears all the soldiers in the group. A positive one causes recalls for individual retests.

One Comment

Sid JoynsonMay 7, 2020 @ 1:08 am

Michel, fascinating. Should help in the Corona-war.
Some advice from previous war-winning experiences, will help fight the next war, and would have made a massive difference to this one.
“It is better to be prepared for war; than to hope the enemy will not come.” I have seen this attributed to Sun Tzu and Von Clausewitz. It doesn’t matter who said it, but it is important that our leaders and the Lean movement understand it.