How to build a (small) homemade machine learning model to predict Covid indicators for France
Update (13/07/2022): the French data.gouv.fr source has been removed from the project, as its data scheme is unstable and the new format is now unusable.
Looking for data science initiatives estimating French Covid epidemic indicators, I came to the conclusion that there was not much on the Internet, compared to US initiatives for instance (e.g. the Kaggle COVID-19 Open Research Dataset Challenge and COVID19 Global Forecasting). The most advanced use of French data is done by Guillaume Rozier and the CovidTracker team at https://covidtracker.fr, but it did not provide (at the time of writing this article) forecasting tools. Since I wanted to form my own opinion on the situation, I tried to see whether an agnostic machine learning model could do the job: most of the models in scientific publications at the start of the epidemic were specific to the epidemiological research field (like SEIR models), and even though ML approaches have since been proposed for France (e.g. [Mohimont et al. 2020]), I didn't find any online tools or services (or even notebooks) giving this kind of insight for my country. At some point during my study, I finally had enough material to share it in a talk (in French) at Orléans Tech Talks (below) and in an article (the one you're reading).
At the beginning…
…things were simple. I searched for public datasets and opened a Jupyter Notebook in Google Colab to test the classical approach data scientists like myself usually apply to time series data: a bit of cleaning and EDA, feature engineering, ML model(s), performance metrics, and some data viz. I quickly started with two different data sources, one international and one French:
- Our World in Data (about OWID)
Coronavirus Pandemic worldwide datasets
- French government datasets (about data.gouv.fr)
Data relating to virological test results
Data relating to the COVID-19 epidemic in France: overview
COVID-19 epidemic monitoring indicators
I chose different sources for the same indicator to compare values, and also because the same indicator is not always delivered at the same time by every source (sometimes one source provides a few days of more recent data than another).
At first, I focused on forecasting new cases, and I ended up with four indicators to forecast:
- new cases
- new hospitalizations
- new admissions to intensive care services
- new deaths
I applied the same preparation steps to each dataset: typing the date column, resampling (so there is exactly one row per day, which means adding rows when days are missing), and linear interpolation (replacing the missing values introduced by the added days).
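These three preparation steps can be sketched in a few lines of pandas (the input frame and its values here are hypothetical, just to illustrate the operations):

```python
import pandas as pd

# Hypothetical raw export: a string 'date' column and one indicator
# column, with some days missing.
raw = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-05"],
    "new_cases": [100.0, 120.0, 180.0],
})

# 1) Type the date column and use it as the index.
raw["date"] = pd.to_datetime(raw["date"])
df = raw.set_index("date")

# 2) Resample to exactly one row per day (missing days appear as NaN).
df = df.resample("D").asfreq()

# 3) Fill the inserted gaps by linear interpolation.
df = df.interpolate(method="linear")
```

After this, 2021-01-03 and 2021-01-04 exist in the index with interpolated values (140 and 160).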
In the following paragraphs, I focus only on new cases indicator to forecast, but the steps are nearly the same with the other indicators. Everything can be found in the following Colab notebook I used for the Orléans Tech Talks presentation, using a simple tech stack (Fig 1.).
EDA Time! (Exploratory Data Analysis)
Once the preparation steps were done, I first visualized the new tests and new cases, (almost) raw, over time (with a double y-axis) (Fig. 2).
First reaction: what is that sawtooth signal? It's clearly a false seasonality. Let's zoom in a little (see Fig. 3).
Okay, I get it: all the low values correspond to Sundays (or public holidays). We also see that the test results (cases) are shifted in time by one day. This is important information if we want to compute the proportion of cases relative to tests, which everyone should do: the French media only talk about the number of cases in absolute value, which does not make sense without comparing it to the number of tests. Pro data tip: if you don't want cases anymore, don't do tests ;) .
Having this first "dirty" visualization, we can do better by smoothing the data with a 7-day moving average window (Fig. 4).
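The smoothing is a plain rolling mean. The toy series below (invented values) reproduces the "sunday artefact": since every 7-day window contains exactly one Sunday, the rolling mean flattens the sawtooth completely:

```python
import numpy as np
import pandas as pd

# Synthetic daily series: 200 reported cases every day, except a low
# value of 50 on Sundays (dayofweek == 6), like the raw data.
idx = pd.date_range("2021-01-01", periods=28, freq="D")
values = np.where(idx.dayofweek == 6, 50.0, 200.0)
cases = pd.Series(values, index=idx, name="new_cases")

# 7-day moving average: each window holds six "normal" days and one
# Sunday, so the smoothed signal is constant at (6*200 + 50) / 7.
smoothed = cases.rolling(window=7).mean()
```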
We can clearly see the dynamics between tests and cases: high-spread moments (e.g. November) where the cases dynamics is stronger than the tests dynamics, and lower-spread moments (e.g. December) where the tests dynamics dominates. Plotting the proportion of cases relative to tests (without forgetting to shift the tests column by one day) gives a better idea of the real infection dynamics, which, at the time of writing, seems to be slowing down a little (Fig. 5).
To get an idea of the progression of the epidemic, cumulating the proportion of cases relative to tests can be a good indicator: if the slope of the line decreases over time, the epidemic is slowing down; if it increases, the epidemic is progressing (Fig. 6).
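Both derived series, the shifted positivity proportion and its cumulative sum, are one-liners in pandas. The data below is invented for illustration (a constant 10% positivity, so the cumulative line has a constant slope):

```python
import pandas as pd

# Hypothetical smoothed daily data: tests, and their results (cases)
# reported one day after the tests were done.
idx = pd.date_range("2021-01-01", periods=6, freq="D")
df = pd.DataFrame({
    "new_tests": [1000, 1100, 1200, 1300, 1400, 1500],
    "new_cases": [0, 100, 110, 120, 130, 140],
}, index=idx)

# Align cases with the tests they come from: shift tests by one day.
df["positivity"] = df["new_cases"] / df["new_tests"].shift(1)

# Cumulative sum of the proportion: the steeper the line, the faster
# the epidemic is spreading; a flattening line means it is slowing.
df["cum_positivity"] = df["positivity"].cumsum()
```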
At this stage, I had enough information to test a ML approach to predict the new cases values.
Machine Learning part
Disclaimer: even though I have a PhD in Computer Science, I have almost no skills or knowledge in biology or any related field. The approach presented here is, by design, data driven.
The data sources cited above provide several pieces of information:
- new patients in intensive care
- new hospitalizations
- new healed patients
- reproduction rate
- new deaths
to which we add new cases, new tests, and the first "created" feature: the proportion of cases relative to tests. That gives eight features, but they are more targets than features: we would like to predict them. Classically, time series models such as the ARIMA statistical family can be used. The problem with these models is their univariate and parametric nature: they can only use/predict one variable, and they must be heavily tuned to give decent results. For illustration only, I put in my shared Colab notebook a prediction test using the Facebook Prophet package, which provides one of the most advanced models of its kind, but it fails to produce a good prediction (Fig. 7) when asked to forecast new cases starting January 15th, knowing the past (see the "Metrics and Models performance" section for further analysis).
Another approach is to use ensemble or linear models, not designed for time series, and to build time features that allow these models to capture the signal between the features and the target. A well-known way to do so is to create rolling windows with different sizes, different aggregations, and different shifts. The shift values are related to the maximum horizon you want to predict: as I want to predict the next 14 days, I can't use data more recent than 14 days ago. In my case, I'm using my own DS-Box package to accelerate this phase, with the following parameters to build the time features from the raw features (Fig. 8):
- Aggregation operations: mean, median, standard deviation, minimum, maximum
- Rolling windows sizes (in days): 3, 7, 14, 28
- Shifts (in days): 14, 21, 28
I obtain features named as AggOp_WindowSize_FeatureName_Shift.
For instance, mean_3_new_tests_14 = the mean of new tests over 3 days, 14 days ago.
To capture dynamics, I also added features of this kind: diff_mean_7_new_tests_14_21 = the difference between the 7-day mean of new tests 14 days ago and the same mean 21 days ago.
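The actual feature building is done by DS-Box; the hand-rolled pandas sketch below (my own simplified reimplementation, not DS-Box code) shows the same naming scheme on a reduced set of parameters:

```python
import pandas as pd

def build_time_features(df, feature, ops=("mean",), windows=(3, 7), shifts=(14, 21)):
    """Build AggOp_WindowSize_FeatureName_Shift columns, e.g.
    mean_3_new_tests_14 = 3-day mean of new_tests, 14 days ago."""
    out = pd.DataFrame(index=df.index)
    for op in ops:
        for w in windows:
            rolled = df[feature].rolling(window=w).agg(op)
            names = {}
            for s in shifts:
                name = f"{op}_{w}_{feature}_{s}"
                out[name] = rolled.shift(s)
                names[s] = name
            # Differences between consecutive shifts capture dynamics,
            # e.g. diff_mean_7_new_tests_14_21.
            for s1, s2 in zip(shifts, shifts[1:]):
                out[f"diff_{names[s1]}_{s2}"] = out[names[s1]] - out[names[s2]]
    return out

# Toy usage on a linearly increasing series of "new tests".
idx = pd.date_range("2021-01-01", periods=60, freq="D")
data = pd.DataFrame({"new_tests": range(60)}, index=idx)
feats = build_time_features(data, "new_tests")
```

With the full parameter grid of the article (5 aggregations, 4 window sizes, 3 shifts, plus the diff features, over 8 raw features), this kind of cross-product is how the count quickly reaches hundreds of features.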
I obtained a lot of features (too many, maybe): 496. At the time of writing, the dataset has 386 rows… Having more features than examples is a well-known problematic situation in data science, as a model can overfit: it can find combinations of feature values strongly correlated to each target value, and it cannot make a good prediction when a combination of feature values has never been seen in the train dataset. In summary, the model learns "by heart". But in my experiments, well… it was difficult to remove features without losing some performance.
As I wanted fast results, and wanted to avoid "uncommon" libraries (the explanation will come in the deployment part), I took "standard" models available in the Scikit-Learn package, some of them able to limit over-fitting:
- Gradient Boosted Trees (prefer LightGBM package if you want more performance in term of metrics and computation)
- Random Forest (and its alternative version, Extremely Randomized Trees)
- Bayesian Ridge (Ridge model with prior distribution estimation)
- Elastic-Net (combining L1 and L2 norms regularization)
Apart from the Bayesian Ridge model, the other models are able, to different degrees, to reduce over-fitting: GBT can handle it if the ensemble of trees doesn't use all the features, RF does it by design by taking random subsets of the features and/or samples, and Elastic-Net can use the L1 norm to set feature weights to zero, which excludes those features.
For the purpose of this article, I arbitrarily chose the split date of 01/15/2021 to build the train and test datasets (Fig. 9), which gives approximately 190 rows in the train dataset and 30 rows in the test dataset (after removing rows with null target values). I also re-scaled the feature values with a MinMax scaler to get coherent weight values assigned to the features (this mostly concerns the linear models), and tweaked the models' hyper-parameters a little (see the Colab notebook).
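The split-then-scale-then-fit pipeline can be sketched as follows. The feature matrix here is synthetic (random values standing in for the ~496 real time features), the hyper-parameters are placeholders, and only three of the five models are shown:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge, ElasticNet
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real feature matrix and target.
rng = np.random.default_rng(0)
idx = pd.date_range("2020-07-01", periods=220, freq="D")
X = pd.DataFrame(rng.normal(size=(220, 5)),
                 index=idx, columns=[f"f{i}" for i in range(5)])
y = pd.Series(X.sum(axis=1) + rng.normal(scale=0.1, size=220), index=idx)

# Time-based split at a fixed date (no shuffling for time series).
split_date = "2021-01-15"
X_train, X_test = X.loc[:split_date], X.loc[split_date:][1:]
y_train, y_test = y.loc[:split_date], y.loc[split_date:][1:]

# Fit the scaler on the train set only, then apply it to both sets,
# so no information leaks from the test period.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "extra_trees": ExtraTreesRegressor(n_estimators=200, random_state=0),
    "bayesian_ridge": BayesianRidge(),
    "elastic_net": ElasticNet(alpha=0.1),
}
predictions = {name: m.fit(X_train_s, y_train).predict(X_test_s)
               for name, m in models.items()}
```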
Metrics and Models performance
I used two metrics to measure performance: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), using the latter to choose the best model. The results can be found below (Fig. 10), ordered by RMSE (lower is better).
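Both metrics are available in (or one `sqrt` away from) scikit-learn; the values below are invented, just to show the computation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative values only (not the article's real predictions).
y_true = np.array([1000.0, 1200.0, 1100.0, 1300.0])
y_pred = np.array([950.0, 1250.0, 1000.0, 1350.0])

mae = mean_absolute_error(y_true, y_pred)           # 62.5
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # ~66.1
```

RMSE penalizes large errors more than MAE, which is why it is the better criterion here: a model with one abnormal peak gets ranked lower even if its average error is decent.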
For information, the FB Prophet model's RMSE was about 11200. Let's zoom in and have a look (Fig. 11) at the predictions generated by each model:
The GBT and RF models show an abnormal peak, which is probably the cause of their low performance rank, and Bayesian Ridge tends to overfit in this case, while the ET and Elastic-Net models seem, according to their scores and the generated graphs, to generalize the problem well. As the objective is to get a quality trend at the earliest date, I also produced a graph with the cumulated sum of the values, which gives slightly different results if we focus on the last point (Fig. 12).
The ET model is still doing well, but the Bayesian Ridge has corrected its errors over time. Depending on the KPI you want to consider, you should choose a model accordingly. In my case, I stayed with the classic RMSE metric to make my decisions, as I preferred having an average-quality prediction over all points.
These first results were encouraging, and as I'm curious, I wanted to see how the predictions would evolve through time. And then, it happened…
…because in the end, I built an entire workflow
To see which predictions the model(s) would produce over time, I had to build workflows that pull the data sources, do the ML work, and expose the data in a nice dashboard. To do so, I used a simple tech stack on top of the ML one (Fig. 13).
As I wanted a low-budget architecture, I installed Apache Airflow 2 directly on my personal Synology NAS (which is the reason why I didn't use LightGBM). I used BigQuery only for its native integration with Data Studio, and of course a versioning platform (GitHub) to handle the code.
I created three different workflows, scheduled daily, as I usually treat data science as code:
- Sources data import
- ML processing
- Export data to BigQuery and build Data Studio table
I focus here only on the ML processing workflow, which combines every step from the Colab notebook into packaged code (Fig. 14), for the four indicators presented at the beginning (cases, hospitalizations, intensive care admissions, deaths).
The only new steps in the workflow are the model selection and feature selection phases. For feature selection, I use the Permutation Importance technique via the ELI5 package, splitting the historical data in three parts: a training dataset, a validation dataset used by permutation importance, and a test dataset on which the final performance is verified (Fig. 15). If the feature selection gives better performance on the test dataset, the features for this model are updated. For model selection, I test the five models described above; after several experiments, and having too few rows of data, I preferred to use the latest 14 available days to test the models' performance.
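The workflow itself uses ELI5's PermutationImportance; the sketch below illustrates the same train/validation/test idea with scikit-learn's built-in `permutation_importance` instead (a swapped-in equivalent, on synthetic data where only two of four features actually matter):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# Only features 0 and 1 drive the target; 2 and 3 are pure noise.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Three chronological parts: train / validation (for importance) / test.
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:250], y[200:250]
X_test, y_test = X[250:], y[250:]

model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

# Keep features whose permutation clearly degrades the validation score,
# then re-fit and verify performance on the held-out test part.
selected = np.where(result.importances_mean > 0.01)[0]
model_sel = ExtraTreesRegressor(n_estimators=100, random_state=0)
model_sel.fit(X_train[:, selected], y_train)
score = model_sel.score(X_test[:, selected], y_test)
```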
Finally, the results end up in a Data Studio dashboard (in French) with several charts (Fig. 16) showing several indicators:
- predictions values per indicators + evolution of proportion of cases according to tests
- predictions amount evolution per model export date
- models performances monitoring
- features contribution
To get a better idea of the behavior of the predictions, I prefer to aggregate the data view by week (to avoid the "sunday artefact").
Starting from a "simple" notebook approach, this ended up as a nearly production-grade (sort of) data science project. The reason I use the term "small" in the title is that the model part is indeed small, but not the perimeter it lies in. This data-driven approach may be sufficient to estimate future trends; however, to validate this kind of result, it should be reviewed by epidemiological experts, and could be combined with their specialized modeling to get the best results. May this humble project lead the way for other initiatives of this kind. Thanks for reading!
Code source: https://github.com/vlevorato/covid-ml-models