The current global emergency due to the spread of the coronavirus across several countries shows how fragile the equilibria of our societies are when facing “inner” threats such as an epidemic. Eventually drugs to relieve symptoms and a vaccine will be found, but in the meantime **humankind seems to be helpless against the virus**. The only effective weapon so far is to stop direct contact between people, so as to isolate the infected.

In some sense this is disheartening for human self-esteem: didn’t we go to the Moon, didn’t we learn to fly, and didn’t we build a network of information exchange which changed our habits and lives forever? As far as the latter achievement is concerned, nowadays we have real-time access to information about the virus and its spread: **is it possible to use this information to somehow fight the epidemic**?

The will to understand and to participate in this “war against the virus” is a natural impulse in us humans, and this is the reason why computer scientists and practitioners, who deal with data every day, have also provided a flood of dashboards, graphs and statements about the epidemic and its possible effects and evolution.

But, of course, dealing with viruses is a job for virologists: in some sense a computer scientist, and in particular a data scientist, studying the coronavirus is like a virologist studying a cryptolocker. The risk is to provide “blind” analyses which perhaps fit the current data but are meaningless from a biological point of view.

In this note I will try to put forth some reflections on these risks, by writing some toy code to play with different models and to face the difficulties of dealing with real data.

## The easiest thing: to get data

In my opinion, a wonderful aspect of our times is **the availability of data and tools** to expand our understanding of the world, and this epidemic is no exception. There are plenty of official repositories from which to get recorded data. To keep things simple (this is not a research paper!) I will use a single repository, where CSV data are maintained by *The New York Times*, based on reports from state and local health agencies (more information at this link).

These data provide, for each date from 21 January 2020 and for each US state, the number of cases and the number of deaths; each row of the CSV has the following form (the *fips* field may be used to cross-reference these data with other per-location data):

```
date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
```

Let us load these data by means of the following Python script and filter them to keep only New York state. We will use them to produce two numerical series, *CASES* and *DEATHS*, and plot them.

```
from git import Repo
import matplotlib.pyplot as plt
import numpy as np
from os.path import join
from pandas import read_csv
from shutil import rmtree
from tempfile import mkdtemp

tempdir = mkdtemp()
repo = Repo.clone_from("https://github.com/nytimes/covid-19-data", tempdir, branch="master", depth=1)
with open(join(tempdir, "us-states.csv")) as cf:
    DATA = read_csv(cf)
try:
    rmtree(tempdir)  # this may fail due to permission limitations...
except:
    pass

NY_DATA = DATA.loc[DATA.state == "New York", ["cases", "deaths"]]
CASES = np.array(NY_DATA.loc[:, "cases"])
DEATHS = np.array(NY_DATA.loc[:, "deaths"])

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].plot(CASES, "r+", label="Cases: data")
axes[1].plot(DEATHS, "k+", label="Deaths: data")
axes[0].legend()
axes[1].legend()
```

These two time series show a trend which, as far as the series of cases is concerned, starts increasing very rapidly from the 18th day on, so that cases increased tenfold in the ten days after that date.
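As a quick back-of-the-envelope check of that figure, we can compute the daily growth factor it implies (a minimal sketch, assuming growth is constant over those ten days):

```
# If cases grow tenfold in ten days at a constant daily rate,
# the daily growth factor g satisfies g**10 == 10.
import math

g = 10 ** (1 / 10)                    # about 1.26, i.e. roughly 26% more cases each day
t_double = math.log(2) / math.log(g)  # doubling time, about 3 days
print(f"daily growth factor: {g:.3f}, doubling time: {t_double:.2f} days")
```

A constant ~26% daily increase is exactly the kind of behavior an exponential model is meant to capture, which motivates trying it first.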

**How can we model such curves?** Let us try some simple models at first, starting from the most popular ones. Indeed, after a rapid Internet search it turns out that many people model population growth by two kinds of epidemic models: the **exponential model** and the **logistic model**. Let us blindly follow this trend.

## Parametric epidemic models: exponential and logistic

### Exponential modelling

An exponential model is, to people used to a minimum of math, just the exponential function with a couple of parameters plugged in so that it can be calibrated. Thus we imagine a law which, given a time *t*, produces a value *C*(*t*) for the cases at time *t* and *D*(*t*) for the deaths at time *t*, both being mathematical functions of the form:

*C(t) = c₁ exp(c₂t)*

*D(t) = d₁ exp(d₂t)*

where *c₁*, *c₂* and *d₁*, *d₂* are “model parameters“. To calibrate the curve on the appropriate data series, we need to find the parameter values which, substituted into the functions, make them approximate the real data as closely as possible. Thus we have an **optimization problem**.

This is interesting since **most artificial intelligence problems, and every machine learning problem, are actually optimization problems!** Thus, in calibrating a parametric model by fitting data **we are already doing machine learning**, in some sense: we are learning the optimal parameters to fit the data. Instead of devising a calibration method ourselves, we may use an off-the-shelf one: this is simple in Python and runs more or less as follows (the calibration function minimizes the quadratic error, since we use the Euclidean norm).

```
from scipy.optimize import fmin

INTERVAL = np.arange(len(NY_DATA))  # t=0 means the first day; time increases daily

def exponential(t, params):
    return params[0] * np.exp(params[1] * t)

def apply_model(model, init_cond, interval, data, disp=True, shadow=False, title=""):
    opt_params = fmin(lambda p: np.linalg.norm(model(interval, p) - data),
                      init_cond, disp=False)
    if disp:
        y = model(interval, opt_params)
        err = np.linalg.norm(y - data)
        print("Optimal parameters:", opt_params)
        print(f"Model error = {err:.2f}")
        plt.figure(figsize=(10, 5))
        if shadow:
            # shade the strip [y - err, y + err] around the model curve
            plt.fill_between(interval, [0 if x < 0 else x for x in y - err], y + err, facecolor="silver")
        plt.plot(interval, data, "r+", label=f"{title}: data")
        plt.plot(interval, y, "b", label=f"{title}: model")
        plt.title(title)
        plt.legend()
    return opt_params

apply_model(exponential, (CASES[0], 1), INTERVAL, CASES, title="Cases")
apply_model(exponential, (DEATHS[0], 1), INTERVAL, DEATHS, title="Deaths")
```

The output looks like the following:

```
Optimal parameters: [3.98032285e+03 9.36244642e-02]
Model error = 77807.25
Optimal parameters: [7.91850070e-11 7.61280159e-01]
Model error = 16721.72
```

This result is just bad (look at the errors: the first thing to check when fitting data!). But indeed, as we have already pointed out, these data seem to display an exponential behavior only from the 18th day on or so. Indeed, if we cut both time series by dropping the initial values, we get a better picture:

```
apply_model(exponential, (CASES[17], 1), INTERVAL[17:], CASES[17:], title="Cases")
apply_model(exponential, (DEATHS[17], 1), INTERVAL[17:], DEATHS[17:], title="Deaths")
```

And the results are:

```
Optimal parameters: [4.94288955e+03 8.81173180e-02]
Model error = 64453.74
Optimal parameters: [32.08823952 0.13596649]
Model error = 2459.21
```

These plots seem to display a much better fit: but is it really so? By plotting just a line we are in some sense cheating, since our model fits the data only up to an error. It is better to plot this “error area” as a strip around the curves, which makes the (poor) effectiveness of the epidemic model clearer:

```
apply_model(exponential, (CASES[17], 1), INTERVAL[17:], CASES[17:], title="Cases", shadow=True)
apply_model(exponential, (DEATHS[17], 1), INTERVAL[17:], DEATHS[17:], title="Deaths", shadow=True)
```

```
Optimal parameters: [4.94288955e+03 8.81173180e-02]
Model error = 64453.74
Optimal parameters: [32.08823952 0.13596649]
Model error = 2459.21
```

Now we can appreciate how coarse this modelling is!

But, even worse, this “model” has a trend which makes no sense, since it **keeps increasing exponentially forever**. That would mean that both cases and deaths would keep increasing until they outnumber the total population of the Earth!

Thus, this “model” is just a bad fit of some points on a curve and not a meaningful predictive model. To see this more clearly, let us split our data into a “training set“, the first 80% of the data, which we use to calibrate the epidemic model, and a “testing set”, the remaining 20%, which we use to test its performance.

```
i0 = (len(INTERVAL) * 4) // 5
opt_param = apply_model(exponential, (CASES[17], 1), INTERVAL[17:i0], CASES[17:i0], disp=False)
y = exponential(INTERVAL[18:], opt_param)
print("Model error =", np.linalg.norm(y - CASES[18:]))
plt.figure(figsize=(10, 5))
plt.plot(INTERVAL[18:i0], CASES[18:i0], "r+", label="Cases: training data")
plt.plot(INTERVAL[i0:], CASES[i0:], "y+", label="Cases: testing data")
plt.plot(INTERVAL[18:], y, "b", label="Cases: model")
plt.legend()
opt_param = apply_model(exponential, (DEATHS[17], 1), INTERVAL[17:i0], DEATHS[17:i0], disp=False)
y = exponential(INTERVAL[18:], opt_param)
print("Model error =", np.linalg.norm(y - DEATHS[18:]))
plt.figure(figsize=(10, 5))
plt.plot(INTERVAL[18:i0], DEATHS[18:i0], "r+", label="Deaths: training data")
plt.plot(INTERVAL[i0:], DEATHS[i0:], "y+", label="Deaths: testing data")
plt.plot(INTERVAL[18:], y, "b", label="Deaths: model")
plt.legend()
```

The result is the following:

```
Model error = 352059.91670035594
Model error = 26296.154926583255
```

Therefore: **this model fits the data (badly) but is totally wrong at prediction**, as far as the cases curve is concerned. On the other hand, the testing data for deaths are more or less fitted by the exponential curve. However, it doesn’t take a Nobel prize winner to understand that there is an upper bound to the number of people in a population, and that the approach toward such a bound is not exponential.

Let us try to understand this with some math, which at some point is always needed when dealing with models. We should use calculus, but let us keep things simple: after all, we are interested in points *C₀*, *C₁*, … along a curve, not in the entire curve *C*(*t*). To recover the exponential curve, note that in a time series of exponential type the next point is the current one *Cₙ* plus an increment proportional to *Cₙ*; the equation expressing this is

*Cₙ₊₁ − Cₙ = c₂Cₙ*

We can use this recurrence to compute the points of the curve *C*: in the exponential case this is unneeded (we know the function!), but it generalizes to other models, for example the **logistic model**.
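As a quick check that the recurrence really reproduces exponential growth, we can iterate it and compare with the closed form (a minimal sketch; the values of *c₁* and *c₂* are illustrative, not fitted):

```
import numpy as np

c1, c2 = 100.0, 0.1  # illustrative initial value and growth rate
n_days = 30

# Iterate C_{n+1} = C_n + c2 * C_n, i.e. C_{n+1} = (1 + c2) * C_n
C = [c1]
for _ in range(n_days):
    C.append(C[-1] + c2 * C[-1])
C = np.array(C)

# The closed form of this recurrence is C_n = c1 * (1 + c2)**n,
# an exponential in n with base (1 + c2)
closed = c1 * (1 + c2) ** np.arange(n_days + 1)
print(np.allclose(C, closed))  # True
```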

### Logistic modelling

According to this model, the growth of a part of a population follows the law:

*Cₙ₊₁ − Cₙ = c₂Cₙ(N − Cₙ)/N*

where *N* is the total population. In this way *Cₙ* will not diverge as in the exponential series: at most it will approach the value *N* (in which case the entire population has been infected…). We can try to use this model to fit our data: more precisely, we will fit a logistic curve to both the cases and the deaths series.
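Before fitting, it is instructive to iterate the logistic recurrence directly and watch the saturation appear (a minimal sketch; the values of *N*, *c₂* and the initial cases are illustrative):

```
import numpy as np

N = 1_000_000  # illustrative total population
c2 = 0.2       # illustrative growth rate
C = [100.0]    # illustrative initial number of cases

# Iterate C_{n+1} = C_n + c2 * C_n * (N - C_n) / N
for _ in range(100):
    Cn = C[-1]
    C.append(Cn + c2 * Cn * (N - Cn) / N)
C = np.array(C)

# Early on the increment is close to c2 * C_n (exponential regime);
# as C_n approaches N the increment vanishes and the curve flattens.
print(C[-1] <= N)  # True: the series never exceeds the population
```

The resulting series traces the familiar S-shaped sigmoid, which is exactly the closed-form curve fitted in the code that follows.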

```
def logistic(t, params):
    # params[0] is the carrying capacity (the asymptotic total), fitted from data
    return params[0] / (1 + np.exp(-params[1] * (t - params[2])))

apply_model(logistic, (2*CASES[-1], 1, (INTERVAL[-1]-INTERVAL[17])/2), INTERVAL[17:], CASES[17:], title="Cases", shadow=True)
apply_model(logistic, (2*DEATHS[-1], 1, (INTERVAL[-1]-INTERVAL[17])/2), INTERVAL[17:], DEATHS[17:], title="Deaths", shadow=True)
```

Here are the results:

```
Optimal parameters: [2.29810781e+05 1.83183796e-01 3.41548660e+01]
Model error = 18468.12
Optimal parameters: [1.31533912e+04 2.41333621e-01 3.82790444e+01]
Model error = 457.69
```

This model seems to fit the curve of cases better, even if the error is still large, as the grey strips show. The curve of deaths seems to be genuinely well captured this way. Note that the total population, the asymptote of the logistic curve, is itself a quantity to determine: we may fit it from the data (as the code does with its first parameter) or fix it a priori as a kind of “hyper-parameter” of the model. Let us split, as before, the dataset into two parts, with the first 80% of the values as training set and the remainder as testing set.

```
i0 = (len(INTERVAL) * 4) // 5
opt_param = apply_model(logistic, (2*CASES[-1], 1, (INTERVAL[-1]-INTERVAL[17])/2), INTERVAL[17:i0], CASES[17:i0], disp=False)
y = logistic(INTERVAL[18:], opt_param)
print("Model error =", np.linalg.norm(y - CASES[18:]))
plt.figure(figsize=(10, 5))
plt.plot(INTERVAL[18:i0], CASES[18:i0], "r+", label="Cases: training data")
plt.plot(INTERVAL[i0:], CASES[i0:], "y+", label="Cases: testing data")
plt.plot(INTERVAL[18:], y, "b", label="Cases: model")
plt.legend()
opt_param = apply_model(logistic, (2*DEATHS[-1], 1, (INTERVAL[-1]-INTERVAL[17])/2), INTERVAL[17:i0], DEATHS[17:i0], disp=False)
y = logistic(INTERVAL[18:], opt_param)
print("Model error =", np.linalg.norm(y - DEATHS[18:]))
plt.figure(figsize=(10, 5))
plt.plot(INTERVAL[18:i0], DEATHS[18:i0], "r+", label="Deaths: training data")
plt.plot(INTERVAL[i0:], DEATHS[i0:], "y+", label="Deaths: testing data")
plt.plot(INTERVAL[18:], y, "b", label="Deaths: model")
plt.legend()
```

And here are the results:

```
Model error = 92466.50981544874
Model error = 5404.289943307971
```

Errors are still huge, but the trend of the actual cases seems qualitatively similar to the model curve (while the modelling of the deaths curve is clearly wrong!).

Indeed, notwithstanding what one reads on the Web, **these models are too simple to have any predictive value**: people using them to model epidemic spread are largely wasting their time.

To get more accurate models, one has to modify the increments appropriately for the different curves. For example, the following two increment equations are inspired by the so-called **SIR model** (the simplest one used in epidemiology):

*Cₙ₊₁ = Cₙ + c₂Cₙ(N − Cₙ − Dₙ)/N − d₂Cₙ*

*Dₙ₊₁ = Dₙ + d₂Cₙ*

Thus we take deaths into account and remove them from the cases: the population is split into three parts, infected (I), dead (D) and all the rest (R), and the equations are not independent but coupled.

In a true SIR model we should track infected people (I), dead or healed people (R, for “removed”), and people susceptible to infection (S), but these data are missing from our dataset (and are usually very difficult to measure). The idea of SIR is that during the virus spread the population S decreases, while both I and R increase: I attains a maximum and then starts decreasing, while R keeps increasing.
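To make the S/I/R dynamics just described concrete, here is a minimal discrete-time SIR sketch (all parameters are illustrative, not calibrated on the NYT data):

```
N = 1_000_000            # illustrative population size
beta, gamma = 0.3, 0.1   # illustrative infection and removal rates
S, I, R = N - 100.0, 100.0, 0.0

history = []
for _ in range(200):
    new_infections = beta * S * I / N  # contacts between S and I
    new_removals = gamma * I           # healed or dead people leave I
    S -= new_infections
    I += new_infections - new_removals
    R += new_removals
    history.append((S, I, R))

# As described above: S only decreases, R only increases, and I rises
# to a peak before falling off.
peak = max(range(len(history)), key=lambda d: history[d][1])
print(f"infection peak around day {peak}")
```

Note that the three equations conserve the total population: whatever leaves S enters I, and whatever leaves I enters R.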

There are more sophisticated epidemic models, such as **SEIR**, **MSEIR**, etc., which divide the population into more compartments and model the growth of each one: these are the standard deterministic models of epidemiology. Anyway, here are the results of a 100-day simulation of our new model.

```
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import fmin

interval = range(len(NY_DATA))
CASES = np.array(NY_DATA.loc[:, "cases"])
DEATHS = np.array(NY_DATA.loc[:, "deaths"])
N = 0.5e6

def IDR(t, c1, c2, d1, d2):
    if not isinstance(t, (float, int)):
        return np.array([IDR(i, c1, c2, d1, d2) for i in t])
    if t == 0:
        return np.array([c1, d1])
    ct_1, dt_1 = IDR(t - 1, c1, c2, d1, d2)
    return np.array([ct_1 + c2*ct_1*(N - ct_1 - dt_1)/N - d2*ct_1, dt_1 + d2*ct_1])

def calibration(interval, C, D):
    CD = np.array([[C[i], D[i]] for i in range(len(C))])
    return fmin(lambda params: np.linalg.norm(IDR(interval, params[0], params[1], params[2], params[3]) - CD),
                (C[0], 1, D[0], 1), disp=False)

big_interval = range(100)
c1, c2, d1, d2 = calibration(interval, CASES, DEATHS)
fit = IDR(big_interval, c1, c2, d1, d2)

fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharey='all')
axes[0].plot(interval, CASES, "r+", label="Cases: data")
axes[0].plot(big_interval, fit[:, 0], "b", label="Cases: C(t)")
axes[0].legend()
axes[1].plot(interval, DEATHS, "r+", label="Deaths: data")
axes[1].plot(big_interval, fit[:, 1], "b", label="Deaths: D(t)")
axes[1].legend()
print("Error for C(t) =", np.linalg.norm(fit[:len(interval), 0] - CASES))
print("Error for D(t) =", np.linalg.norm(fit[:len(interval), 1] - DEATHS))
```

The output is:

```
Error for C(t) = 55277.27002540128
Error for D(t) = 6128.543579522949
```

Let us split the dataset as before, using the first 80% of the values as training set and the remainder as testing set.

```
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
training_interval = interval[:len(interval)*4//5]
testing_first = len(training_interval)
# Calibrate parameters on the training portion of both series
training_cases = CASES[:len(interval)*4//5]
training_deaths = DEATHS[:len(interval)*4//5]
c1, c2, d1, d2 = calibration(training_interval, training_cases, training_deaths)
fit = IDR(interval, c1, c2, d1, d2)
axes[0].plot(training_interval, training_cases, "r+", label="Cases: training data")
axes[0].plot(interval[testing_first:], CASES[testing_first:], "y+", label="Cases: testing data")
axes[0].plot(interval, fit[:,0], "b", label="Cases: model C(t)")
axes[0].legend()
# Plot the deaths data against the same calibrated model
axes[1].plot(training_interval, training_deaths, "r+", label="Deaths: training data")
axes[1].plot(interval[testing_first:], DEATHS[testing_first:], "y+", label="Deaths: testing data")
axes[1].plot(interval, fit[:,1], "b", label="Deaths: model D(t)")
axes[1].legend()
print("Error for C(t) =", np.linalg.norm(fit[:,0] - CASES))
print("Error for D(t) =", np.linalg.norm(fit[:,1] - DEATHS))
```

Again, the result is definitely not satisfactory…

```
Error for C(t) = 184506.22085618484
Error for D(t) = 2511.425471089619
```

As we see, our SIR-inspired toy model overestimates the growth of the cases curve but more or less predicts the trend of the deaths curve.

## Remarks on learning epidemic model parameters

We didn’t succeed in finding a model that predicts our data, but we learned some useful lessons:

- Simple models **may fit** the available data but are **not reliable to predict** anything.
- Models are often chosen according to the available data: we did so and we failed. Honestly, the New York Times data are not suitable for any deep analysis, and we had better look for more complete data, for example data also reporting the number of healed people.
- Training a “parametric model” is easy and is a **simple form of machine learning**: the difference with more advanced techniques is that here we pretend that the model follows a **deterministic rule** (the function we used to fit it: exponential, logistic, our toy model, or more realistic functions such as SIR, SEIR, etc.).
- Models which are not chosen at random but built **relying upon epidemiological knowledge** should be preferred, but finding the data needed to feed them may be challenging.

Actual machine learning models do not pretend to force the data a priori into a deterministic curve: rather, they build a black-box model which, given an input, produces an output, that’s all. **Neural networks** are like that, and in a following article we will touch on some points about applying them to model our data.