In a previous article, we shared some reflections on **machine learning models for data fitting and prediction**, playing with a simple dataset related to the current coronavirus emergency. We had a look at simple models and mentioned some more advanced ones which rely upon deterministic dynamical system theory to be properly understood. However, we limited ourselves to the so-called *parametric models*, which try to fit and predict data by using a known parametrized function with a fixed number of parameters.

## Non parametric models

For example, we considered an exponential function which transforms an input signal *x* into an output signal *a* exp(*bx*), with *a*, *b* parameters to be determined to best fit the training data. The merits of this approach are that it is simple to implement, since we just need an optimizer, and that the parameters are usually very few. Moreover, it is a **white-box model**: we do know the function, since we’ve chosen it, so we have plenty of theoretical tools to justify its use, to assess errors, etc. We can also try to modify it (as we did with the logistic one) more or less knowing what we are doing.

On the contrary, a **non parametric model** is just a **black box** which accepts a signal *x* and emits another signal *y* according to some inner rule which depends on many parameters hidden in the model and whose interpretation is largely unknown. We cannot modify the model, and even its parameters (which are usually far more numerous than in a parametric model) are tuned inside the black box. We have some external parameters, called “hyper-parameters”, which we may control, but whose tuning is totally empirical, since we have no theoretical guidance due to the black-box nature of the model.

Put in this way, it seems that those models are difficult to deal with and frustrating to calibrate. Indeed that’s the naked truth, but they are often very effective, in some cases amazingly effective. Most readers will have guessed it: **artificial neural networks** belong to this family of models.

In the following lines we will play with some neural networks to analyse a simple dataset of coronavirus spread in the state of New York, with the purpose of getting some insight on how these models should be applied and what can be expected from them. The Python snippets which follow are by no means perfect; rather, they mediate between the use of library methods and, whenever possible, disclosure of what happens behind the scenes by doing some computation “by hand”.

## Neural networks come into play

Since we are going to consider neural networks as black boxes, to a certain extent we may ignore the inner structure of such nets and look at them as algorithms. They take a numeric input vector and provide another numeric vector as output. The output may even be a single number: in this case the network defines a function *y* = *f*(*x*_{1},…,*x*_{n}), but notice that this function needs to be built somehow, dwelling upon a **training set**.

A training set contains data of the same format. For a function *y* = *f*(*x*_{1},…,*x*_{n}), the training set is a set of pairs, called *samples*,

(*y*_{1},(*x*_{11},…,*x*_{1n})),…,(*y*_{m},(*x*_{m1},…,*x*_{mn}))

which are used to calibrate the function so that

*y*_{1} = *f*(*x*_{11},…,*x*_{1n}),…,*y*_{m} = *f*(*x*_{m1},…,*x*_{mn}).

Notice that the numbers *x*_{ij} depend on two indexes: when dealing with neural networks, prepare yourself to handle variables depending even on three indexes (improperly called *tensors*, even if we do not mind about their coordinate transformation rules).

A neural network has **hidden parameters** (called **weights**) which are changed each time the evaluation of a record of the training set fails, so as to correct them and make it possible for the next evaluations to provide the desired result. Thus **the net “learns” the function** values on the training set. This is more properly stated by saying that the function has to fit the training set as best it can. And indeed neural networks can do better than most other approximation and interpolation algorithms.
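As a toy illustration of this correction loop, here is a single weight being nudged by gradient descent on a squared error until it fits a tiny training set. Everything in this snippet is made up for the illustration, and it is far simpler than the backpropagation actually used inside a network:

```python
# Toy "learning" loop: one weight w, model y = w * x, squared-error loss.
# Each pass over the training set nudges w against the loss gradient,
# mimicking (in miniature) how a network corrects its weights.
training_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # pairs (x, y) with y = 2x

w = 0.0             # initial weight: the model starts out completely wrong
learning_rate = 0.05
for epoch in range(200):
    for x, y in training_set:
        error = w * x - y                    # signed error of the current guess
        w -= learning_rate * 2 * error * x   # gradient of (w*x - y)^2 w.r.t. w

print(round(w, 3))  # w should approach 2.0
```

After enough epochs the weight settles on the value that fits the samples, which is exactly the sense in which the net "learns" the function on the training set.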

The function that our neural network should learn is the curve of cases and/or the curve of deaths: however, it is not immediately obvious how to do that. Indeed, **neural networks are great at interpolating functions, thus at fitting known data and approximating unknown data**. In our case, known data are detections until today, while unknown data are all future detections. We are not interested in the curve between yesterday and today, thus we are not interested in interpolating the curve but just in extrapolating from it.

Usually the function *y* = *f*(*x*_{1},…,*x*_{n}) which a neural network needs to interpolate depends on data which have no particular order: the *x*_{1},…,*x*_{n} are just seen as coordinates of a point in some several-dimensional space, and proximity relationships (for example their distances) are taken into account. But when **dealing with time series, time ordering is essential**.

A “traditional” **neural network is just a set of parallel layers**, which contain the network nodes (called *neurons*), which in turn contain the network weights. Neurons are connected so that information passes from each layer to the next one, starting from the input layer and ending with the output layer. By comparing the output with the expected output on the training data, the weights inside the nodes are adjusted by a rule called “backpropagation” (or one of its variants). Each time, the weights are changed and their previous values are discarded, so the weights “have no memory”.
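To make the layered picture concrete, here is a minimal feed-forward pass written directly in numpy; the layer sizes and random weights are arbitrary choices for this sketch, not the models used later in the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny network: 4 inputs -> 5 hidden neurons -> 1 output.
# Each layer is just a weight matrix plus a bias vector.
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)

def relu(v):
    return np.maximum(v, 0)

def forward(x):
    # Information flows from the input layer through the hidden
    # layer to the output layer; no feedback, hence "no memory".
    hidden = relu(W1 @ x + b1)
    return W2 @ hidden + b2

y = forward(np.ones(4))
print(y.shape)  # (1,) -- a single-number output
```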

There is a class of neural networks which has been devised (since the 80s) to deal with time series: **recurrent networks**. In a recurrent network, some neurons are connected with themselves, thus their output is one of their inputs, which implies that these special networks “remember” past data passed through the network. This is an oversimplified and qualitative description, but it will suffice for our purposes. If you need to learn more on this, I suggest a good article by Stephen Grossberg on Scholarpedia.

There are several kinds of recurrent networks, but I don’t want to focus on that here; rather, I want to apply them as black boxes to our problem. To do that we have to face some issues related to the available data.

## Setting up a recurrent LSTM network

As in the previous article of this series, I will use the time series maintained by The New York Times, based on reports from state and local health agencies (see this link).

```
from git import Repo
import matplotlib.pyplot as plt
import numpy as np
from os.path import join
from pandas import read_csv
from shutil import rmtree
from tempfile import mkdtemp

# Get data from the github repository
tempdir = mkdtemp()
repo = Repo.clone_from("https://github.com/nytimes/covid-19-data", tempdir, branch="master", depth=1)
with open(join(tempdir, "us-states.csv")) as cf:
    DATA = read_csv(cf)
try:
    rmtree(tempdir)  # this may fail due to permission limitations...
except OSError:
    pass

# Filter the data we are interested in
NY_DATA = DATA.loc[DATA.state == "New York", ["cases", "deaths"]]
CASES = np.array(NY_DATA.loc[:, "cases"])
DEATHS = np.array(NY_DATA.loc[:, "deaths"])

# Plot the data
fig, axes = plt.subplots(2, 2, figsize=(10, 5))
axes[0,0].plot(CASES, "r+", label="Cases: data")
axes[0,1].plot(DEATHS, "k+", label="Deaths: data")
axes[0,0].legend()
axes[0,1].legend()
axes[1,0].hist(CASES)
axes[1,1].hist(DEATHS)
```

There are two time series provided with these data: the time series of COVID-19 cases (the officially detected ones, a number usually far from the real number of cases and difficult to correlate to it) and the time series of deaths whose proven cause is COVID-19.

Since we want to use a neural network, we need a training set. However, at the moment we have just a time series. If the aim is to predict, we may use the time series to provide a training set as follows. Assume we have a time series *s*_{0},…,*s*_{N}, being *s*_{i} the numbers detected at times *i* = 0,1,…,*N*. If we make the assumption that the series is such that the next value depends only on the previous one (Markov hypothesis), then the prediction of *s*_{i+1} is just a function of *s*_{i}, so that we want to learn the function *y* = *f*(*x*) such that *s*_{i+1} = *f*(*s*_{i}) for *i* = 0,…,*N*−1.

Therefore we may set up our training data just as (*s*_{1},*s*_{0}),(*s*_{2},*s*_{1}),…,(*s*_{N},*s*_{N−1}). This can be done even if the output *y* is a vector rather than a single number. Usually, “multivariate” time series prediction is much harder than “univariate” prediction, but that is not the case when using neural networks: there is not much difference, and the methods work in both cases in just the same way. Anyway, the univariate case fits the purpose of this article.

Of course one could model more complex dependencies of the next value on the previous ones. For example suppose we decide that the next value may depend on the previous three. In this case, we are modelling a function of the form *y *= *f*(*x*_{1},*x*_{2},*x*_{3}) and the training set becomes

*(s _{3},(s_{0},s_{1},s_{2})),(s_{4},(s_{1},s_{2},s_{3})),…(s_{N},(s_{N−3},s_{N−2},s_{N−1}))*

The size of the input vector is called the *step* in this context. The previous example shows a three time step input, a one time step output and a one step prediction. Of course one can generalize, but the more steps, the fewer samples we can arrange in the training set.

It is not difficult to write code to produce such samples from a single series. In the first place we will consider a one time step input. The following script prepares training sets as just described, separating input data from output data.

```
# We use this function to produce training data with a given input step (output step = 1)
def extract_sample(series):
    X = []
    Y = []
    for i in range(len(series) - 1):
        X.append(series[i:i+1])
        Y.append(series[i+1])
    return X, Y

# Training sets for both series: we separate input (X) from output (Y) data
X_CASES, Y_CASES = extract_sample(CASES)
X_DEATHS, Y_DEATHS = extract_sample(DEATHS)
```
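For completeness, here is a sketch of how `extract_sample` generalizes to a multi-step input window; the `step` parameter and the function name are my own for this example:

```python
def extract_sample_steps(series, step=1):
    # Build (input window, next value) pairs: for step=3 each input
    # is (s[i], s[i+1], s[i+2]) and the target is s[i+3].
    X, Y = [], []
    for i in range(len(series) - step):
        X.append(series[i:i + step])
        Y.append(series[i + step])
    return X, Y

X, Y = extract_sample_steps(list(range(10)), step=3)
print(len(X))      # 7 samples: more steps, fewer samples
print(X[0], Y[0])  # [0, 1, 2] 3
```

Note how a series of 10 values yields only 7 samples with a three-step window, which is the trade-off mentioned above.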

Now we have a training set, but we still need a model. There are many recurrent networks which may be used; here we’ll use a classical model, the so-called **LSTM** (*Long Short-Term Memory*) network, which dates back to 1997 and is still a good choice in many cases (for example in natural language processing). Its striking feature is that some of the inner nodes of the network (nodes contain one or more weights) are used as a “memory” of the network and not for computational purposes.
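For readers who want a peek inside that memory mechanism, here is one step of the textbook LSTM cell written in plain numpy. This is the standard formulation, not Keras’ actual implementation, and the sizes and random weights are arbitrary choices for the sketch:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h, c, W, U, b):
    # One LSTM step on input x, given the previous hidden state h and
    # memory cell c. The four gates share the stacked weights W, U, b.
    n = h.size
    z = W @ x + U @ h + b          # pre-activations for the four gates
    f = sigmoid(z[0:n])            # forget gate: what to erase from memory
    i = sigmoid(z[n:2*n])          # input gate: what to write to memory
    o = sigmoid(z[2*n:3*n])        # output gate: what to expose
    g = np.tanh(z[3*n:4*n])        # candidate values
    c_new = f * c + i * g          # the "memory" carried across time steps
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Tiny cell: 1 input feature, 2 hidden units; random weights for the sketch.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 1))
U = rng.normal(size=(8, 2))
b = np.zeros(8)
h, c = np.zeros(2), np.zeros(2)
for x in [0.1, 0.2, 0.3]:          # feeding a short series step by step
    h, c = lstm_step(np.array([x]), h, c, W, U, b)
print(h.shape, c.shape)  # (2,) (2,)
```

The cell state `c` is the part that is carried along unchanged unless the gates decide otherwise, which is why the network can “remember” past inputs.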

Again, I won’t indulge in theoretical details; rather, I take the developer’s point of view and notice that **this model is available in the Keras Python library**, so we will just use it. We will see in a moment that taking this approach requires a bit of data preprocessing and the setting of a few parameters.

Let’s start with the simplest case, thus a **one-layer LSTM**. We **design the model using the Keras API**, which is more or less straightforward, setting some external parameters to standard values and setting hyper-parameters by some trial and error. Such parameters are the number of nodes, the number of epochs and the batch size. The model is compiled by means of the Keras backend, either Theano or TensorFlow, and then it can be trained on the data via the `fit` method. Then, the model can be used (and tested) via the `predict` method.

We will use 80% of our samples as training samples, and the remaining ones as test samples. Moreover, we will concentrate on predicting deaths.

```
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

def train_test(neural_network, X, Y, epochs = 5000, training_percent_size = 0.8):
    # This function takes a dataset, splits it into training and testing sets
    # according to training_percent_size, trains the given LSTM model on it,
    # launches predictions on the testing set, and plots the result.
    N = len(X)
    # Extract training and testing samples and convert them into numpy arrays
    n_training = int(N * training_percent_size)
    X_TRAINING, Y_TRAINING = np.array(X[:n_training]), np.array(Y[:n_training])
    X_TESTING, Y_TESTING = np.array(X[n_training:]), np.array(Y[n_training:])
    # neural_network.fit expects a (samples, time steps, features) shaped X array
    X_TRAINING = X_TRAINING.reshape((X_TRAINING.shape[0], X_TRAINING.shape[1], 1))
    X_TESTING = X_TESTING.reshape((X_TESTING.shape[0], X_TESTING.shape[1], 1))
    # Training: shuffle=False prevents data shuffling, order is important!
    for i in range(epochs):
        neural_network.fit(X_TRAINING, Y_TRAINING, epochs = 1, verbose = 0, shuffle = False)
        neural_network.reset_states()
    Y_PREDICTED = []  # we collect predictions in this list
    for x in X_TESTING:
        x = x.reshape((1, 1, 1))
        y = neural_network.predict(x, verbose = 0)
        Y_PREDICTED.append(y[0][0])
    # Plot data in red and predictions in blue
    plt.plot(Y, 'r')
    plt.plot(range(n_training, N), Y_PREDICTED, 'b')
    plt.show()
    # Test loss
    print("Loss =", np.linalg.norm(Y_TESTING - np.array(Y_PREDICTED))**2/len(Y_TESTING))
    # The curve with predictions in place of the testing values is returned
    return np.concatenate((Y_TRAINING, Y_PREDICTED))

# Set up the network with one layer of 10 nodes
nn = Sequential()
nn.add(LSTM(10, activation = "relu", input_shape = (1, 1)))
nn.add(Dense(1))  # the layer has a 1-dimensional output (a number!)
nn.compile(loss="mean_squared_error", optimizer="adam")
prediction = train_test(nn, X_DEATHS, Y_DEATHS)
```

The result is not bad for a simple minded trial. The model seems to get the trend of the deaths curve, to say the least.

To get a better result, a first impulse would be to add one more layer. This is very simple with Keras, since it just takes one more `model.add` instruction:

```
nn2 = Sequential()
nn2.add(LSTM(10, activation = "relu", input_shape = (1, 1), return_sequences=True))
nn2.add(LSTM(10, activation = "relu"))
nn2.add(Dense(1)) # the layer has a 1-dimensional output (a number!)
nn2.compile(loss="mean_squared_error", optimizer="adam")
```

In the previous snippet, the `return_sequences=True` option adjusts the output of the first layer to the input shape expected by the second layer, which we set to the same number of nodes as the first one just to keep things simple. The result becomes as follows:

`prediction = train_test(nn2, X_DEATHS, Y_DEATHS)`

Actually, inserting a new layer didn’t result in any significant improvement.

## A bit of time series lore

Instead of blindly keeping on trying to improve our naive neural network, let’s try something more clever. Indeed, people used to time series fitting have surely been outraged by the poor pre-processing we did. Time series are effectively studied when they are paths of *stationary processes*, which are solutions of stochastic dynamical systems whose behaviour keeps, roughly speaking, mean and variance within a certain range. But look at our data: the trend of the curve clearly depends on time; there is no stationary structure at first sight.

The good news is that, to a certain extent, we may transform a series into a stationary one, in such a way that the transformation is invertible. The simplest way to do that is to look at the series of differences of the given series *s*_{0},*s*_{1},…,*s*_{N}, thus

*d*_{1} = *s*_{1} − *s*_{0}, *d*_{2} = *s*_{2} − *s*_{1}, …, *d*_{N} = *s*_{N} − *s*_{N−1}

The transformation (*s*_{1},…,*s*_{N}) → (*d*_{1},…,*d*_{N}) is invertible (notice that we excluded *s*_{0} from the original series: it is a sort of “initial condition” we keep aside), and the inverse is given by the recursive relations:

*s*_{1} = *s*_{0} + *d*_{1}, *s*_{2} = *s*_{1} + *d*_{2}, …, *s*_{N} = *s*_{N−1} + *d*_{N}

In some sense the original series is the cumulative series of its difference series. Of course we could also define a difference series with a given step *k*:

*d*_{k} = *s*_{k} − *s*_{0}, *d*_{k+1} = *s*_{k+1} − *s*_{1}, …, *d*_{N} = *s*_{N} − *s*_{N−k}

Let us try, with step *k* = 1, on our two coronavirus series:

```
def difference(s):
    return [s[i+1] - s[i] for i in range(len(s)-1)]

def cumulated(s0, d):
    # cumulated(s[0], difference(s)) == s
    s = [s0]
    for i in range(len(d)):
        s.append(s[i] + d[i])
    return s

D_CASES = difference(CASES)
D_DEATHS = difference(DEATHS)
fig, axes = plt.subplots(2, 2, figsize=(10, 5))
axes[0,0].plot(D_CASES, "r+", label="Cases diff")
axes[0,1].plot(D_DEATHS, "k+", label="Deaths diff")
axes[0,0].legend()
axes[0,1].legend()
axes[1,0].hist(D_CASES)
axes[1,1].hist(D_DEATHS)
```

Differencing may be iterated: we may consider the second difference of a series (the difference of its difference series), and so on, and at each step the series becomes more and more similar to a stationary one. Of course, the k-th difference of a given series has k elements fewer than the original one.

For example let us compute the third difference of our series:

```
D3_CASES = difference(difference(difference(CASES)))
D3_DEATHS = difference(difference(difference(DEATHS)))
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0,0].plot(D3_CASES, "r+", label="Cases diff^3")
axes[0,1].plot(D3_DEATHS, "k+", label="Deaths diff^3")
axes[0,0].legend()
axes[0,1].legend()
axes[1,0].hist(D3_CASES)
axes[1,1].hist(D3_DEATHS)
```

Notice that histograms tend to become more and more similar to normal distributions.
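One way to see why iterated differencing drives a series towards stationarity: the k-th difference of a degree-k polynomial trend is constant, so each differencing pass strips one more degree of deterministic trend. A quick check, redefining `difference` so the snippet stands alone:

```python
def difference(s):
    return [s[i+1] - s[i] for i in range(len(s)-1)]

quadratic = [i**2 for i in range(8)]  # a purely deterministic trend
d1 = difference(quadratic)            # [1, 3, 5, ...] still trending
d2 = difference(d1)                   # constant: the trend is fully removed
print(d2)  # [2, 2, 2, 2, 2, 2]
```

Real data also carry a noise component, of course, which is why the histograms above only approach a bell shape instead of collapsing to a single value.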

Let us perform again the **training of the neural network on differenced data**, and convert the results back into cumulated form. The performance improves impressively.

```
plt.figure(0)
DX_DEATHS, DY_DEATHS = extract_sample(D_DEATHS)
d_prediction = train_test(nn2, DX_DEATHS, DY_DEATHS)
# We plot also the actual series
plt.figure(1)
prediction = cumulated(Y_DEATHS[0], d_prediction)
plt.plot(prediction, 'b')
plt.plot(Y_DEATHS, 'r')
print("Loss =", np.linalg.norm(Y_DEATHS - np.array(prediction))**2/len(Y_DEATHS))
```

Another invertible transform which makes sense to apply in our case is the **scaling transformation**, by means of which we squeeze our data into a fixed interval. When using neural networks this is worthwhile, since such networks have nodes which usually output numbers between -1 and 1, unless otherwise stated. For example, we used the “relu” activation function, which just cuts negative values and is popular in deep learning.

We can drop the “relu” specification, but then we have to rescale the data. First let’s show the scaling function, which just takes the minimum and maximum values of the series and linearly transforms the former into -1 and the latter into 1 (and all numbers in between accordingly). We will apply the scaling to the difference series.

```
def scaling(s, a = -1, b = 1):
    m = min(s)
    M = max(s)
    scale = (b - a)/(M - m)
    return a + scale * (np.array(s) - m)  # np.array so that plain lists work too

DS_CASES = scaling(difference(CASES))
DS_DEATHS = scaling(difference(DEATHS))
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(DS_CASES, "r+", label="Cases diff+scaled")
axes[1].plot(DS_DEATHS, "k+", label="Deaths diff+scaled")
axes[0].legend()
axes[1].legend()
```

Finally, let us apply our basic LSTM network to these data, and transform the results back to provide a prediction as we did before. We use the same parameters to train the network, but we let the activation function be the default one, the hyperbolic tangent, a sigmoid-shaped function ranging between -1 and 1.

```
nn3 = Sequential()
nn3.add(LSTM(10, input_shape = (1, 1), return_sequences=True))
nn3.add(LSTM(10))
nn3.add(Dense(1)) # the layer has a 1-dimensional output (a number!)
nn3.compile(loss="mean_squared_error", optimizer="adam")
plt.figure(0)
DSX_DEATHS, DSY_DEATHS = extract_sample(DS_DEATHS)
ds_prediction = train_test(nn3, DSX_DEATHS, DSY_DEATHS)
# We plot also the actual series
plt.figure(1)
d_prediction = scaling(ds_prediction, min(D_DEATHS), max(D_DEATHS))
prediction = cumulated(Y_DEATHS[0], d_prediction)
plt.plot(prediction, 'b')
plt.plot(Y_DEATHS, 'r')
print("Loss =", np.linalg.norm(Y_DEATHS - np.array(prediction))**2/len(Y_DEATHS))
```

Another improvement is readily noticed: our predictions are far from being reliable in the long run but they give quite a reasonable trend of the deaths curve.

## Conclusions

In this article we set up a simple non-parametric model to predict time series, and applied it to some data on coronavirus spread. The techniques we showed can be greatly improved in several respects, for example by using stateful LSTM models, or by using different and more informative datasets.

But our aims were:

- To point out the **differences between parametric models** (more math, specific domain considerations, etc.) **and non parametric models** (framework use, guessing hyper-parameters, etc.).
- To show that **data preparation is a non trivial task** when implementing machine learning algorithms. Indeed, most of the code we wrote was devoted to data handling, while the model was set up in a handful of API calls.
- To point out that merely making a **more complicated neural network does not imply a corresponding gain in performance**. Even if it is very easy to complicate the model in Keras, do it only if actually needed, and experiment by changing hyper-parameters to find a good combination of them rather than adding layers, etc.
- To stress the importance of **knowing what we are doing**. Starting from some elementary techniques for time series, we improved the result just by transforming the input data, without changing the model!

All in all, **human reflection about machine algorithms is always needed**. The relevant choices left to the data scientists who design and implement the simulation remain the key to the understanding of a real phenomenon (such as a virus spread).

If you are interested in these topics, do not miss the opportunity to attend our **Deep Learning online conference**: click here for more information!