In this tutorial, we will learn the basics of how statistical modeling in R works, and then briefly overview methods of ordination.
To do this, we will use a dataset about covid19 put together by Our World in Data, joined to another dataset obtained from the same place that includes country responses (school closings, etc). The main goal here is to learn about modelling in R, but at the end we also include the steps taken to clean the data and merge the tables as we need them here. You can check that out to learn how to do the same with your own datasets.
By using linear models, we will try to understand which of the responses that countries around the world have taken against covid19 in the last months seem to be important in predicting the current number of daily cases.
In the second part, with ordination, we will visualize which countries are more similar to each other in the interventions they have taken in the last two weeks, and whether this has any relationship to geography.
Linear models
Overview
We use linear models (or linear regressions) to establish the relationship between a response (or dependent) variable and one or more predictive (or independent) variables. The goal can either be to explain the variation in the response variable, or to predict a response given what is known about a series of predictors.
R provides a very flexible syntax for linear models, and many kinds of analyses, such as t-tests, ANOVAs and ANCOVAs, can be treated as specific cases of linear models. If you check the Wikipedia article for t-test, for example, you will find out that it is actually the same as testing the slope of a linear model, which we will do below.
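As a quick illustration (a minimal sketch with simulated data, not part of our covid dataset): a two-sample t-test assuming equal variances gives the same t statistic (up to sign) and p-value as the slope of a linear model with a two-level predictor.
set.seed(42)
group <- rep(c('A', 'B'), each = 20)                 # two groups of 20 observations
y <- rnorm(40, mean = ifelse(group == 'A', 10, 12))  # group B has a higher mean
t.test(y ~ group, var.equal = TRUE)                  # classic two-sample t-test
summary(lm(y ~ group))                               # the 'groupB' slope has the same t statistic and p-value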
To learn how to use linear models in R, we will first load a dataset with information on cases of and responses to covid19 across countries. Then, we will explore this dataset with graphs, using functions that we have seen in the last tutorial. We will start by fitting simple models and then make them more complex by including more variables. Finally, we will test which model fits the data best and make some predictions.
Loading and exploring the data
Let’s start by loading the dataset with read.csv(). Remember to include the path to the folder in which it is located, or use setwd() to set the working directory first. Just in case, we will also clear the memory and erase everything before we start, with rm(list = ls()).
By using the argument row.names = 1, we are telling read.csv() that the first column is not really data, but just row numbers.
covid_data <- read.csv('covid_data_2020-05-16.csv', row.names = 1)
head(covid_data)
## iso_code country_name world_region past_cases
## 1 ABW Aruba Latin America and the Caribbean 1.4
## 2 AFG Afghanistan Asia and the Pacific 52.6
## 3 AGO Angola Africa 0.0
## 4 ALB Albania Europe 15.6
## 5 AND Andorra Europe 14.4
## 6 ARE United Arab Emirates West Asia 401.0
## new_cases new_deaths new_cases_per_million new_deaths_per_million new_tests
## 1 0.0 0.0 0.0000 0.0000 NA
## 2 400.0 9.6 10.2754 0.2464 NA
## 3 0.6 0.0 0.0182 0.0000 NA
## 4 9.6 0.0 3.3360 0.0000 NA
## 5 1.2 0.2 15.5308 2.5884 NA
## 6 726.6 2.4 73.4652 0.2424 NA
## new_tests_per_thousand tests_units population population_density median_age
## 1 NA NA 106766 584.800 41.2
## 2 NA NA 38928341 54.422 18.6
## 3 NA NA 32866268 23.890 16.8
## 4 NA NA 2877800 104.871 38.0
## 5 NA NA 77265 163.755 NA
## 6 NA NA 9890400 112.442 34.0
## aged_65_older aged_70_older gdp_per_capita extreme_poverty cvd_death_rate
## 1 13.085 7.452 35973.781 NA NA
## 2 2.581 1.337 1803.987 NA 597.029
## 3 2.405 1.362 5819.495 NA 276.045
## 4 13.188 8.643 11803.431 1.1 304.195
## 5 NA NA NA NA 109.135
## 6 1.144 0.526 67293.483 NA 317.840
## diabetes_prevalence female_smokers male_smokers handwashing_facilities
## 1 11.62 NA NA NA
## 2 9.59 NA NA 37.746
## 3 3.94 NA NA 26.664
## 4 10.08 7.1 51.2 NA
## 5 7.97 29.0 37.8 NA
## 6 17.26 1.2 37.4 NA
## hospital_beds_per_100k total_cases total_deaths total_cases_per_million
## 1 NA 101 3 945.994
## 2 0.50 6402 168 164.456
## 3 NA 48 2 1.460
## 4 2.89 916 31 318.299
## 5 NA 761 49 9849.220
## 6 1.20 21831 210 2207.292
## total_deaths_per_million total_tests total_tests_per_thousand
## 1 28.099 NA NA
## 2 4.316 NA NA
## 3 0.061 NA NA
## 4 10.772 NA NA
## 5 634.181 NA NA
## 6 21.233 NA NA
## C1_School.closing C1_Flag C2_Workplace.closing C2_Flag
## 1 3 1 3 1
## 2 3 1 3 0
## 3 3 1 2 1
## 4 3 1 2 0
## 5 3 1 2 1
## 6 3 1 2 1
## C3_Cancel.public.events C3_Flag C4_Restrictions.on.gatherings C4_Flag
## 1 2 1 4 1
## 2 2 1 4 1
## 3 2 1 4 1
## 4 2 1 4 1
## 5 1 1 0 NA
## 6 2 1 4 1
## C5_Close.public.transport C5_Flag C6_Stay.at.home.requirements C6_Flag
## 1 0 1 2 1
## 2 2 0 2 0
## 3 1 1 2 1
## 4 2 0 2 1
## 5 1 1 1 1
## 6 1 1 2 1
## C7_Restrictions.on.internal.movement C7_Flag C8_International.travel.controls
## 1 2 1 3
## 2 2 0 1
## 3 2 1 4
## 4 2 1 4
## 5 0 NA 3
## 6 1 1 4
## E1_Income.support E1_Flag E2_Debt.contract.relief E3_Fiscal.measures
## 1 0 NA 2 0
## 2 0 NA 0 NA
## 3 0 NA 2 0
## 4 2 0 1 0
## 5 2 1 0 0
## 6 0 NA 2 0
## E4_International.support H1_Public.information.campaigns H1_Flag
## 1 0 2 1
## 2 0 2 1
## 3 0 1 1
## 4 0 2 1
## 5 NA 2 1
## 6 0 2 1
## H2_Testing.policy H3_Contact.tracing H4_Emergency.investment.in.healthcare
## 1 1 0 0
## 2 1 1 0
## 3 2 0 0
## 4 2 1 0
## 5 3 1 0
## 6 3 2 0
## H5_Investment.in.vaccines M1_Wildcard ConfirmedCases ConfirmedDeaths
## 1 0 NA 100 2
## 2 0 NA 784 30
## 3 0 NA 24 2
## 4 0 NA 494 31
## 5 0 NA 743 40
## 6 0 NA 5825 33
## StringencyIndex StringencyIndexForDisplay LegacyStringencyIndex
## 1 83.46 83.46 80.00
## 2 76.33 76.33 75.48
## 3 86.77 86.77 84.05
## 4 88.36 88.36 85.24
## 5 59.12 59.12 74.05
## 6 86.77 86.77 85.24
## LegacyStringencyIndexForDisplay
## 1 80.00
## 2 75.48
## 3 84.05
## 4 85.24
## 5 74.05
## 6 85.24
At the end of this tutorial we explain how this dataset was made and provide links with the explanation of each column.
This dataset includes 63 columns with information for 160 countries. new_cases (like every column starting with new_) is an average over the past 5 days, and past_cases is the average of 15-20 days ago. The different country responses (all columns after C1_School.closing) correspond to the most frequent response in the past 2 weeks. Therefore, this dataset provides a baseline of how many new cases of covid19 there were 2 weeks ago, the government response during the last 2 weeks, and the current number of new cases.
To make the data more manageable, we will start by reducing the dataset to only the few columns we will use to learn linear models. Let’s retain the country name, world region, number of new cases, number of cases two weeks ago, gdp per capita, workplace closing, income support and the stringency index.
You can find definitions for all variables on the dataset website. The stringency index summarizes the overall stringency of government measures on a scale of 0-100. The other columns are coded as follows:
C1_School closing
Records closings of schools and universities:
0 - no measures
1 - recommend closing
2 - require closing (only some levels or categories, eg just high school, or just public schools)
3 - require closing all levels
C2_Workplace closing
Records closings of workplaces:
0 - no measures
1 - recommend closing (or recommend work from home)
2 - require closing (or work from home) for some sectors or categories of workers
3 - require closing (or work from home) for all-but-essential workplaces (eg grocery stores, doctors)
E1_Income support
Records whether the government is providing direct cash payments to people who lose their jobs or cannot work:
0 - no income support
1 - government is replacing less than 50% of lost salary (or, if a flat sum, it is less than 50% of median salary)
2 - government is replacing more than 50% of lost salary (or, if a flat sum, it is greater than 50% of median salary)
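If you want to check how many countries fall into each coded level, table() gives a quick count; for example, for income support (a small sketch):
table(covid_data$E1_Income.support, useNA = 'ifany')  # counts per level, NAs included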
To see all the column names, we can use the function colnames():
colnames(covid_data)
## [1] "iso_code"
## [2] "country_name"
## [3] "world_region"
## [4] "past_cases"
## [5] "new_cases"
## [6] "new_deaths"
## [7] "new_cases_per_million"
## [8] "new_deaths_per_million"
## [9] "new_tests"
## [10] "new_tests_per_thousand"
## [11] "tests_units"
## [12] "population"
## [13] "population_density"
## [14] "median_age"
## [15] "aged_65_older"
## [16] "aged_70_older"
## [17] "gdp_per_capita"
## [18] "extreme_poverty"
## [19] "cvd_death_rate"
## [20] "diabetes_prevalence"
## [21] "female_smokers"
## [22] "male_smokers"
## [23] "handwashing_facilities"
## [24] "hospital_beds_per_100k"
## [25] "total_cases"
## [26] "total_deaths"
## [27] "total_cases_per_million"
## [28] "total_deaths_per_million"
## [29] "total_tests"
## [30] "total_tests_per_thousand"
## [31] "C1_School.closing"
## [32] "C1_Flag"
## [33] "C2_Workplace.closing"
## [34] "C2_Flag"
## [35] "C3_Cancel.public.events"
## [36] "C3_Flag"
## [37] "C4_Restrictions.on.gatherings"
## [38] "C4_Flag"
## [39] "C5_Close.public.transport"
## [40] "C5_Flag"
## [41] "C6_Stay.at.home.requirements"
## [42] "C6_Flag"
## [43] "C7_Restrictions.on.internal.movement"
## [44] "C7_Flag"
## [45] "C8_International.travel.controls"
## [46] "E1_Income.support"
## [47] "E1_Flag"
## [48] "E2_Debt.contract.relief"
## [49] "E3_Fiscal.measures"
## [50] "E4_International.support"
## [51] "H1_Public.information.campaigns"
## [52] "H1_Flag"
## [53] "H2_Testing.policy"
## [54] "H3_Contact.tracing"
## [55] "H4_Emergency.investment.in.healthcare"
## [56] "H5_Investment.in.vaccines"
## [57] "M1_Wildcard"
## [58] "ConfirmedCases"
## [59] "ConfirmedDeaths"
## [60] "StringencyIndex"
## [61] "StringencyIndexForDisplay"
## [62] "LegacyStringencyIndex"
## [63] "LegacyStringencyIndexForDisplay"
Now let’s make a new data.frame only with the desired columns. Remember that when indexing data.frames or matrices, we can use either column numbers or column names. Here we will use numbers to make it simpler:
covid_small <- covid_data[,c(2,3,5,4,17,33,46,60)]
head(covid_small)
## country_name world_region new_cases past_cases
## 1 Aruba Latin America and the Caribbean 0.0 1.4
## 2 Afghanistan Asia and the Pacific 400.0 52.6
## 3 Angola Africa 0.6 0.0
## 4 Albania Europe 9.6 15.6
## 5 Andorra Europe 1.2 14.4
## 6 United Arab Emirates West Asia 726.6 401.0
## gdp_per_capita C2_Workplace.closing E1_Income.support StringencyIndex
## 1 35973.781 3 0 83.46
## 2 1803.987 3 0 76.33
## 3 5819.495 2 0 86.77
## 4 11803.431 2 2 88.36
## 5 NA 2 2 59.12
## 6 67293.483 2 0 86.77
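Equivalently, we could have selected the same columns by name, which is more robust if the column order of the source file ever changes; a sketch:
covid_small <- covid_data[, c('country_name', 'world_region', 'new_cases',
                              'past_cases', 'gdp_per_capita', 'C2_Workplace.closing',
                              'E1_Income.support', 'StringencyIndex')]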
Now, to explore this dataset, we can use the functions summary() and plot().
summary() will show some statistics for each column.
summary(covid_small)
## country_name world_region new_cases past_cases
## Length:160 Length:160 Min. : 0.0 Min. : 0.00
## Class :character Class :character 1st Qu.: 3.9 1st Qu.: 3.35
## Mode :character Mode :character Median : 22.1 Median : 22.10
## Mean : 537.7 Mean : 464.30
## 3rd Qu.: 209.6 3rd Qu.: 142.95
## Max. :22719.6 Max. :27620.80
##
## gdp_per_capita C2_Workplace.closing E1_Income.support StringencyIndex
## Min. : 661.2 Min. :0.000 Min. :0.0000 Min. : 16.67
## 1st Qu.: 5034.7 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 76.19
## Median : 13367.6 Median :2.000 Median :1.0000 Median : 84.26
## Mean : 20330.7 Mean :2.165 Mean :0.8065 Mean : 81.12
## 3rd Qu.: 30155.2 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.: 91.40
## Max. :116935.6 Max. :3.000 Max. :2.0000 Max. :100.00
## NA's :7 NA's :2 NA's :5 NA's :3
plot() will quickly draw scatterplots of every pair of variables, so we can already get some idea about which variables are related. Let’s plot all columns except for the first two (country name and world region):
plot(covid_small[,-c(1,2)])
We can also use a for loop to plot a histogram for each column. The function ncol() retrieves the number of columns of a data.frame, and we will use it to loop from column 3 to the last one.
for (i in 3:ncol(covid_small)){
hist(covid_small[,i],main = colnames(covid_small)[i])
}
It is hard to visualize the number of cases on a linear scale like this, so let’s try histograms of the logs:
hist(log10(covid_small[,3]), main = colnames(covid_small)[3])
hist(log10(covid_small[,4]), main = colnames(covid_small)[4])
A few observations of this initial exploration:
- It will probably be a good idea to log-transform the number of cases, since they vary over orders of magnitude
- It also seems that the number of new cases is highly correlated with number of cases 2 weeks ago, so we always have to take that into account.
- Some of our possible independent variables seem to be correlated: richer countries seem to have closed schools more often, for example.
It is always good to visualize your dataset before starting to do inferences, so you will know how to best build your models and interpret them!
Before we proceed, let’s prepare the dataset we will use throughout the tutorial for linear models. We need to make sure all data columns have the appropriate type. For linear models, type matters, and specifically there are three important types:
- numeric: columns of this type will be treated as continuous variables. Example: number of cases.
- unordered factor: columns of this type will be treated as categorical variables. Example: world region.
- ordered factor: columns of this type will be treated as ordinal variables. Example: stringency of workplace closing.
Let’s use str() to figure out the current data types of our columns and change them as needed.
str(covid_small)
## 'data.frame': 160 obs. of 8 variables:
## $ country_name : chr "Aruba" "Afghanistan" "Angola" "Albania" ...
## $ world_region : chr "Latin America and the Caribbean" "Asia and the Pacific" "Africa" "Europe" ...
## $ new_cases : num 0 400 0.6 9.6 1.2 ...
## $ past_cases : num 1.4 52.6 0 15.6 14.4 401 91.4 44 162 52.4 ...
## $ gdp_per_capita : num 35974 1804 5819 11803 NA ...
## $ C2_Workplace.closing: int 3 3 2 2 2 2 2 2 2 3 ...
## $ E1_Income.support : int 0 0 0 2 2 0 1 1 2 2 ...
## $ StringencyIndex : num 83.5 76.3 86.8 88.4 59.1 ...
Country name and world region are currently characters. R will automatically convert them to unordered factors when we fit a model, but it is good practice to be explicit. Let’s change them to factors.
covid_small$country_name <- factor(covid_small$country_name)
covid_small$world_region <- factor(covid_small$world_region)
The stringency index, gdp per capita and the numbers of cases are currently numeric, as they should be, so we will leave them as they are.
Workplace closing and income support are currently stored as integers, but they are not really numbers: they are ordinal variables encoded as numbers. Let’s transform them into ordered factors:
covid_small$E1_Income.support <- factor(covid_small$E1_Income.support, ordered = T)
covid_small$C2_Workplace.closing <- factor(covid_small$C2_Workplace.closing, ordered = T)
Let’s look at the data now:
str(covid_small)
## 'data.frame': 160 obs. of 8 variables:
## $ country_name : Factor w/ 160 levels "Afghanistan",..: 7 1 5 2 4 150 6 8 9 10 ...
## $ world_region : Factor w/ 6 levels "Africa","Asia and the Pacific",..: 4 2 1 3 3 6 4 2 3 3 ...
## $ new_cases : num 0 400 0.6 9.6 1.2 ...
## $ past_cases : num 1.4 52.6 0 15.6 14.4 401 91.4 44 162 52.4 ...
## $ gdp_per_capita : num 35974 1804 5819 11803 NA ...
## $ C2_Workplace.closing: Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 4 4 3 3 3 3 3 3 3 4 ...
## $ E1_Income.support : Ord.factor w/ 3 levels "0"<"1"<"2": 1 1 1 3 3 1 2 2 3 3 ...
## $ StringencyIndex : num 83.5 76.3 86.8 88.4 59.1 ...
Seems better. Treating columns as factors also allows us to change the names of their levels to something more informative. Using the key above, let’s do it with income support. Currently, the level names are 0, 1 and 2. Let’s change them to none, less than 50% and more than 50%.
levels(covid_small$E1_Income.support) <- c('none',
'less than 50%',
'more than 50%')
Now let’s check:
str(covid_small)
## 'data.frame': 160 obs. of 8 variables:
## $ country_name : Factor w/ 160 levels "Afghanistan",..: 7 1 5 2 4 150 6 8 9 10 ...
## $ world_region : Factor w/ 6 levels "Africa","Asia and the Pacific",..: 4 2 1 3 3 6 4 2 3 3 ...
## $ new_cases : num 0 400 0.6 9.6 1.2 ...
## $ past_cases : num 1.4 52.6 0 15.6 14.4 401 91.4 44 162 52.4 ...
## $ gdp_per_capita : num 35974 1804 5819 11803 NA ...
## $ C2_Workplace.closing: Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 4 4 3 3 3 3 3 3 3 4 ...
## $ E1_Income.support : Ord.factor w/ 3 levels "none"<"less than 50%"<..: 1 1 1 3 3 1 2 2 3 3 ...
## $ StringencyIndex : num 83.5 76.3 86.8 88.4 59.1 ...
Let’s do the same for workplace closing:
levels(covid_small$C2_Workplace.closing) <- c('none',
'recommend',
'require some',
'require most')
Finally, let’s change the names of some variables to something shorter and easier to type. Current names are:
colnames(covid_small)
## [1] "country_name" "world_region" "new_cases"
## [4] "past_cases" "gdp_per_capita" "C2_Workplace.closing"
## [7] "E1_Income.support" "StringencyIndex"
Let’s change some of them by using their index:
colnames(covid_small)[6] <- 'work_close'
colnames(covid_small)[7] <- 'income_support'
colnames(covid_small)[8] <- 'stringency_idx'
Now let’s look at our dataset:
str(covid_small)
## 'data.frame': 160 obs. of 8 variables:
## $ country_name : Factor w/ 160 levels "Afghanistan",..: 7 1 5 2 4 150 6 8 9 10 ...
## $ world_region : Factor w/ 6 levels "Africa","Asia and the Pacific",..: 4 2 1 3 3 6 4 2 3 3 ...
## $ new_cases : num 0 400 0.6 9.6 1.2 ...
## $ past_cases : num 1.4 52.6 0 15.6 14.4 401 91.4 44 162 52.4 ...
## $ gdp_per_capita: num 35974 1804 5819 11803 NA ...
## $ work_close : Ord.factor w/ 4 levels "none"<"recommend"<..: 4 4 3 3 3 3 3 3 3 4 ...
## $ income_support: Ord.factor w/ 3 levels "none"<"less than 50%"<..: 1 1 1 3 3 1 2 2 3 3 ...
## $ stringency_idx: num 83.5 76.3 86.8 88.4 59.1 ...
Looks good! The only step left is to remove NAs from the data. As we have seen in the second lesson, we could do this later, at the moment we actually use the data, and that might be the best approach when there is a lot of scattered missing data. This is because na.omit() on the whole data.frame will remove all rows with data missing in at least one column. In this dataset only 12 of the 160 countries included have missing data in some variable, so it won’t hurt much to remove these countries now so we do not need to think about it anymore.
covid_small <- na.omit(covid_small)
The data is ready, let’s start modeling.
Fitting a model with categorical variables using formulas
R specifies models using formulas. We have seen formulas before, using them in plots in tutorial 3 and to fit models in tutorial 2. Let’s explain formulas in a bit more detail here. You can get more help on formulas by using the help function:
help(formula)
The basic syntax of a formula is:
response ~ predictor
This means that a variable named response depends (~) on a variable named predictor. We can also use formulas with more than one predictor variable, by using a plus sign:
response ~ predictor1 + predictor2
This means that the variable response depends on predictor1 and predictor2, and that the effects of the two predictors are independent of each other.
If the predictors not only add to each other but also interact to generate the response, we can indicate that with an asterisk:
response ~ predictor1 * predictor2
This means that the two predictors interact with each other and also have individual effects.
It is also possible to indicate in a formula that a variable is nested within another. Imagine that you repeated an experiment in several locations and want to indicate that; a common formula would be:
response ~ predictor %in% location
Finally, you can include mathematical expressions in formulas. Let’s say you want to log-transform your response variable, for example. You can write this as:
log(response) ~ predictor
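One caveat worth knowing: since + and * have special meanings inside formulas, you need to wrap an expression in I() if you want the operators interpreted arithmetically. A small sketch:
# Inside I(), '+' is arithmetic addition, not 'add a predictor':
# this model has a single predictor, the sum of the two variables.
response ~ I(predictor1 + predictor2)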
There is more you can do with formulas, but let’s stop here for now and see them in practice. Let’s start by using a linear model to check whether the number of new cases varies between world regions. First, as we recommended before, let’s visualize whether this seems to be the case by using a boxplot. Both boxplots and linear models take formulas!
boxplot(formula = new_cases ~ world_region, data = covid_small)
It seems it does, but it might be easier to visualize on a log scale. With base R, we use the parameter log='y' to indicate that we want a log scale on the y axis:
boxplot(formula = new_cases ~ world_region, data = covid_small, log='y')
## Warning in plot.window(xlim = xlim, ylim = ylim, log = log, yaxs = pars$yaxs):
## nonfinite axis limits [GScale(-inf,4.3564,2, .); log=1]
Oops, we got a warning and the plot failed! The reason is that some countries have 0 new cases in the past 5 days, and \(log(0)\) is undefined, so the axis limits are not finite. In base R there is no easy way to correct this through the graph scales (we will see below how to do it with ggplot). However, we can add the log transformation to the formula itself and pass boxplot() the transformed data instead of the raw data. A typical log transformation when there are 0 values is \(log_{10}(x+1)\):
boxplot(formula = log10(new_cases+1) ~ world_region, data = covid_small)
It seems there is variation between regions, and North America in particular has many more cases. Is that difference significant? We will use a linear model to test it. The syntax is similar to what we used to build the boxplot, but we will use the function lm():
region_model <- lm(formula = new_cases ~ world_region, data = covid_small)
To check the results of our model fit, we can use the function summary:
summary(region_model)
##
## Call:
## lm(formula = new_cases ~ world_region, data = covid_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10784.4 -536.8 -311.7 -39.6 10784.4
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 62.97 290.34 0.217
## world_regionAsia and the Pacific 327.72 464.14 0.706
## world_regionEurope 501.52 418.42 1.199
## world_regionLatin America and the Caribbean 689.28 459.06 1.501
## world_regionNorth America 11872.23 1361.80 8.718
## world_regionWest Asia 502.81 662.07 0.759
## Pr(>|t|)
## (Intercept) 0.829
## world_regionAsia and the Pacific 0.481
## world_regionEurope 0.233
## world_regionLatin America and the Caribbean 0.135
## world_regionNorth America 6.76e-15 ***
## world_regionWest Asia 0.449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1882 on 142 degrees of freedom
## Multiple R-squared: 0.3502, Adjusted R-squared: 0.3273
## F-statistic: 15.3 on 5 and 142 DF, p-value: 5.082e-12
Summary shows a few different things about the model. For now, let’s pay attention to the coefficients table: the estimates for all components of our model, with the p-value of each coefficient. When the predictor is an unordered factor, R uses the first level of the factor as the intercept. Here that level is Africa, which is considered our baseline, and the other estimated coefficients denote the difference between each region and Africa. It seems that North America is significantly different from Africa, while the other regions are not.
However, before we jump to conclusions, we have to answer two questions first to check if our interpretation is valid: 1. does this model fit our assumptions about the data? 2. do we need world regions at all to explain differences between countries?
To answer the first question, we need to evaluate our model fit, and R has several tools for that.
Evaluating model fit
Let’s call the summary function again to talk about another part of the output:
summary(region_model)
##
## Call:
## lm(formula = new_cases ~ world_region, data = covid_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10784.4 -536.8 -311.7 -39.6 10784.4
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 62.97 290.34 0.217
## world_regionAsia and the Pacific 327.72 464.14 0.706
## world_regionEurope 501.52 418.42 1.199
## world_regionLatin America and the Caribbean 689.28 459.06 1.501
## world_regionNorth America 11872.23 1361.80 8.718
## world_regionWest Asia 502.81 662.07 0.759
## Pr(>|t|)
## (Intercept) 0.829
## world_regionAsia and the Pacific 0.481
## world_regionEurope 0.233
## world_regionLatin America and the Caribbean 0.135
## world_regionNorth America 6.76e-15 ***
## world_regionWest Asia 0.449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1882 on 142 degrees of freedom
## Multiple R-squared: 0.3502, Adjusted R-squared: 0.3273
## F-statistic: 15.3 on 5 and 142 DF, p-value: 5.082e-12
Another piece of information given by summary() is the distribution of residuals. Residuals are the variation that cannot be explained by the model: the difference between what our model predicts and what we actually observe. For example, in the coefficients table we see that the coefficient estimated for Africa (the model intercept) is 62.97. So the model estimates that an African country would have, on average, 62.97 new cases of covid19 now. Let’s look at how many cases these countries actually have by plotting each country as a point.
We will use ggplot this time, since it is more flexible: instead of transforming the data, we can tell it to use a log-plus-one transformation for the y axis. We will also plot a horizontal line at the intercept, the estimate for Africa, so we can compare it with the distribution of points.
Let’s first load the package.
library(ggplot2)
And now we can plot.
Africa_only <- covid_small[covid_small$world_region =='Africa', ]
ggplot(Africa_only) +
geom_point(aes(x = world_region, y = new_cases)) +
scale_y_continuous(trans='log1p', breaks = c(1,10,100)) +
geom_hline(yintercept = 62.97, color = 'red')
Here we can see two things:
- We slightly overestimated the number of cases for Africa.
- There is a large variation within Africa. The differences between an actual African country and our estimate are the residuals.
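As an aside, instead of typing 62.97 by hand into geom_hline(), we could have extracted the intercept from the fitted model; a small sketch:
coef(region_model)['(Intercept)']  # the estimated intercept, i.e. the mean for Africa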
In a typical linear model, we assume that the residuals are normally distributed and follow the same distribution across all the data. In our case, this means that the residuals should look more or less the same for all world regions. Is that the case? R has a simple function to calculate residuals, named resid():
residuals <- resid(region_model)
We saved the result of resid(region_model) in the object residuals, which is a vector of residual values:
str(residuals)
## Named num [1:148] -752.25 9.31 -62.37 -554.89 160.82 ...
## - attr(*, "names")= chr [1:148] "1" "2" "3" "4" ...
The result of this function is a vector that shows the residuals for each one of our data points. Let’s build a data.frame with our data and residuals to visualize if our assumptions about residuals hold:
resid_df <- data.frame(world_region = covid_small$world_region, residual = residuals)
Let’s look at histograms of our residuals for each world region now:
ggplot(resid_df)+
geom_histogram(aes(x=residuals)) +
facet_wrap(~world_region,ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The residuals do not look normally distributed, nor the same across all regions!
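If you want a formal check to complement the visual one, one option (not used in the rest of this tutorial) is a Shapiro-Wilk test on the residuals:
shapiro.test(residuals)  # a small p-value indicates deviation from normality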
We did all this to get an intuition about how we evaluate model fit, but R has a nice function to help us check whether the model assumptions hold. If we give our model as input to the function plot(), it will return a series of diagnostic graphs:
plot(region_model)
The first graph shows our fitted values (i.e., model predictions) against residuals. The dispersion of residuals increases with the fitted values, and the average trend (red line) is decreasing. In a good model, there should be no trend and the variance should be constant.
The second graph is a Q-Q plot, comparing our residuals to what would be expected if they were normally distributed. For a good model, it should be a straight line with a slope of 1. This is far from what we are observing.
The third graph is similar to the first one, but the y axis is the square root of the absolute residuals. Again, it indicates that the variation in residuals increases with higher predicted responses.
The fourth graph helps to identify outliers, which are the labelled points farthest from the dashed region; here they are data points 22, 28 and 153. Which countries are these? The labels are row names, not row positions (na.omit() removed some rows, so the two no longer match), and the vector covid_small$country_name carries no names, so we index the rows of the data.frame with a character vector instead:
covid_small[c('22', '28', '153'), 'country_name']
These three data points are the ones most influencing the model; we might consider removing them if we cannot build a good model otherwise. (Rows 22 and 153 turn out to be Brazil and the United States, as the prediction section below will show.)
But before doing that, it seems we can improve our model by log-transforming the response variable. When variation in residuals increases with fitted values, this can often help.
Let’s try this.
region_model_log <- lm(formula = log(new_cases) ~ world_region, data = covid_small)
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): NA/NaN/Inf in 'y'
Oops, we got an error. This is because some countries have a number of cases of 0, so we cannot calculate the log. As we did with the graph, we can use a slightly different transformation and add 1 to all counts:
region_model_log <- lm(formula = log(new_cases+1) ~ world_region, data = covid_small)
Now let’s plot the model to check the model fit:
plot(region_model_log)
It seems much better! The variation in residuals seems more or less constant across fitted values, the q-q plot is closer to a straight line and there are no points outside the dashed region in the last plot.
Has this log transformation changed anything about our estimates? Let’s use summary to figure out:
summary(region_model_log)
##
## Call:
## lm(formula = log(new_cases + 1) ~ world_region, data = covid_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0715 -2.1173 -0.0149 1.7933 5.8240
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.5873 0.3584 7.219
## world_regionAsia and the Pacific 0.9848 0.5730 1.719
## world_regionEurope 1.6666 0.5165 3.226
## world_regionLatin America and the Caribbean 0.9039 0.5667 1.595
## world_regionNorth America 5.9528 1.6811 3.541
## world_regionWest Asia 2.5359 0.8173 3.103
## Pr(>|t|)
## (Intercept) 2.91e-11 ***
## world_regionAsia and the Pacific 0.08784 .
## world_regionEurope 0.00156 **
## world_regionLatin America and the Caribbean 0.11295
## world_regionNorth America 0.00054 ***
## world_regionWest Asia 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.323 on 142 degrees of freedom
## Multiple R-squared: 0.147, Adjusted R-squared: 0.1169
## F-statistic: 4.892 on 5 and 142 DF, p-value: 0.000368
Now we have much more power to detect differences. It seems most regions are different from Africa, with Latin America and the Caribbean being the most similar. Also notice that the estimates have changed, because we are now modeling the log-transformed covid19 cases. This is not important now, but it will be for prediction, which we will see below.
Before proceeding, let’s note that the choice of intercept for categorical variables is arbitrary: R simply picks one level of the factor as the intercept. To choose the baseline comparison yourself, you can explicitly set the reference level. Let’s set it to Latin America and the Caribbean using the function relevel():
covid_small$world_region <- relevel(covid_small$world_region,
ref = 'Latin America and the Caribbean' )
Now let’s fit a new model and print the summary. Latin America will be the intercept and all comparisons will be done in relation to it. Remember: this makes no difference for model predictions or for the model tests (which we will see below), just for which contrasts are displayed here!
region_model_log <- lm(formula = log(new_cases+1) ~ world_region, data = covid_small)
summary(region_model_log)
##
## Call:
## lm(formula = log(new_cases + 1) ~ world_region, data = covid_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0715 -2.1173 -0.0149 1.7933 5.8240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.49119 0.43898 7.953 5.18e-13 ***
## world_regionAfrica -0.90389 0.56671 -1.595 0.1129
## world_regionAsia and the Pacific 0.08093 0.62653 0.129 0.8974
## world_regionEurope 0.76267 0.57537 1.326 0.1871
## world_regionNorth America 5.04887 1.70014 2.970 0.0035 **
## world_regionWest Asia 1.63200 0.85572 1.907 0.0585 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.323 on 142 degrees of freedom
## Multiple R-squared: 0.147, Adjusted R-squared: 0.1169
## F-statistic: 4.892 on 5 and 142 DF, p-value: 0.000368
While summary() is a good function to display model results in R, it does not produce a result that we can save as a csv and open, for example, in Excel to make a nice table for a paper. The R package broom contains functions to do that. Let’s load this package and use the function tidy() to see the results. Remember to install broom with install.packages('broom') if you do not have it already.
library(broom)
tidy(region_model_log)
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.49 0.439 7.95 5.18e-13
## 2 world_regionAfrica -0.904 0.567 -1.59 1.13e- 1
## 3 world_regionAsia and the Pacific 0.0809 0.627 0.129 8.97e- 1
## 4 world_regionEurope 0.763 0.575 1.33 1.87e- 1
## 5 world_regionNorth America 5.05 1.70 2.97 3.50e- 3
## 6 world_regionWest Asia 1.63 0.856 1.91 5.85e- 2
Much better for saving! Let’s now assign this table to a variable and save it so we can open it in Excel later.
nice_table <- tidy(region_model_log)
write.csv(nice_table, file = 'nice_table.csv')
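broom also provides glance(), which returns model-level statistics (R squared, AIC, log-likelihood, etc.) as a one-row table that can be saved the same way; a quick sketch:
glance(region_model_log)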
Model choice
Now that we have used a transformation and made sure that the model meets the normality assumption, we have one more step: to show that we need to consider region at all. It could be the case that differences between some regions exist only by chance, and that a single normal distribution, without considering world region, could adequately explain our data. To test whether this is the case, we will use the function drop1(), which will try to remove predictors from our model and test whether the result is a better model:
drop1(region_model_log)
## Single term deletions
##
## Model:
## log(new_cases + 1) ~ world_region
## Df Sum of Sq RSS AIC
## <none> 766.17 255.34
## world_region 5 131.99 898.16 268.86
The function drop1() tries to remove each predictor from the model (in our case, there was just one!) and then calculates the Akaike Information Criterion (AIC), a statistic that balances model complexity against explanatory power. The lower the AIC, the better the model. AIC differences of 2 or more are typically considered meaningful. What the table is saying here is that removing no predictor (<none>) gives a lower AIC than removing world_region, so we should keep world_region in the model.
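If you prefer to compute AICs directly, the function AIC() compares fitted models. A minimal sketch, comparing our model to an intercept-only model (note that AIC() and drop1() use different additive constants in their AIC values, but the model rankings agree):
null_model <- lm(log(new_cases + 1) ~ 1, data = covid_small)  # intercept-only model
AIC(null_model, region_model_log)                             # lower AIC is better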
Another way to test whether a predictor is important is with an F test, using the function anova():
anova(region_model_log)
## Analysis of Variance Table
##
## Response: log(new_cases + 1)
## Df Sum Sq Mean Sq F value Pr(>F)
## world_region 5 131.99 26.3971 4.8923 0.000368 ***
## Residuals 142 766.17 5.3956
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here the F-test resulted in a very small p-value for world_region, indicating that this term significantly improves the model.
Usually both methods will have similar results, but they might differ when the p-value is close to 0.05.
We can also do the F-test and calculate AICs at the same time by providing an additional argument to drop1():
drop1(region_model_log, test = 'F')
## Single term deletions
##
## Model:
## log(new_cases + 1) ~ world_region
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 766.17 255.34
## world_region 5 131.99 898.16 268.86 4.8923 0.000368 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ordered categorical variables and continuous variables
We saw above an example using a categorical variable as predictor, but we can also use ordered factors. For example, workplace closing:
str(covid_small$work_close)
## Ord.factor w/ 4 levels "none"<"recommend"<..: 4 4 3 3 3 3 3 3 4 1 ...
Let’s plot the relationship between both, log-transforming the number of cases:
plot(formula = log10(new_cases+1) ~ work_close, data = covid_small)
Maybe it has an effect, but it is not clear; the relationship looks non-linear, increasing up to a point and then decreasing. Let’s fit a model anyway, starting with the untransformed response:
model_work <- lm(formula = new_cases ~ work_close, data = covid_small)
Let’s evaluate the model fit:
plot(model_work)
Model fit does not look good, let’s see if we can improve it with a log-transformation of the response variable:
model_work_log <- lm(formula = log(new_cases+1) ~ work_close, data = covid_small)
plot(model_work_log)
Much better. Let’s look at the summary now:
summary(model_work_log)
##
## Call:
## lm(formula = log(new_cases + 1) ~ work_close, data = covid_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1163 -1.7015 -0.1385 1.7441 6.4731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2259 0.2546 12.670 < 2e-16 ***
## work_close.L 1.4585 0.4983 2.927 0.00398 **
## work_close.Q -1.1923 0.5092 -2.342 0.02058 *
## work_close.C -0.2243 0.5199 -0.431 0.66677
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.388 on 144 degrees of freedom
## Multiple R-squared: 0.08592, Adjusted R-squared: 0.06688
## F-statistic: 4.512 on 3 and 144 DF, p-value: 0.004677
In the case of ordered factors, it is a little less intuitive to understand how R fits them, but basically the levels are treated as equally spaced numbers and R fits what we call a polynomial regression. Instead of fitting one parameter per factor level, it fits one parameter per polynomial degree: work_close.L is the linear effect of workplace closing, whose positive sign indicates that the higher the stringency of workplace closing, the higher the number of new cases. work_close.Q is the quadratic effect, which is negative, indicating a peak. work_close.C is the cubic effect, allowing the relationship between work_close and our response variable to be even more nonlinear.
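To see exactly how R encodes an ordered factor, you can inspect its contrast matrix; for our 4-level factor it has the linear, quadratic and cubic polynomial contrasts as columns:
contrasts(covid_small$work_close)  # columns .L, .Q and .C of the polynomial coding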
The individual parameter estimates are not as important for us now as whether workplace closing is important overall. For that, we use the function drop1().
drop1(model_work_log, test = 'F')
## Single term deletions
##
## Model:
## log(new_cases + 1) ~ work_close
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 820.99 261.57
## work_close 3 77.172 898.16 268.86 4.512 0.004677 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, it seems that a model including workplace closings is much better than one without it, and that closing workplaces is generally associated with more new coronavirus cases.
That seems counter-intuitive: does closing workplaces make the pandemic worse? We will have a closer look in the next section.
Multiple predictors
The last example will be useful for us to introduce fitting models with multiple predictors. We can imagine, for example, that the effect that we found could be spurious, and workplace restrictions are actually associated with another variable that is the true driver of new covid19 cases. For example, it could be that countries decided to close workplaces based on how severe covid19 was a month ago, and that the number of past cases is really the main factor explaining current trends.
We can fit a new regression including past cases as a predictor and test if, considering past cases, workplace closings are still important.
We can use ggplot to visualize the effect of both variables at the same time. The number of past cases will be the X axis, while work closures will be color-coded:
ggplot(covid_small) +
geom_point(aes(x = past_cases, y = new_cases, color = work_close)) +
scale_x_continuous(trans = 'log1p', breaks = 10^(0:5)) +
scale_y_continuous(trans = 'log1p', breaks = 10^(0:5))
With this plot, it becomes clear that the number of new cases is largely determined by the number of past cases; work closures do not seem to have much influence beyond that. More importantly, we should be cautious: all countries with more than 100 past cases seem to have adopted some sort of workplace closure, most of them the strictest controls. That means we will have little power to detect any effect of work closures! If this were an experiment, we would have balanced the treatments better; since these are observations rather than an experiment, we will move on, knowing that this imbalance will limit our ability to detect differences.
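A quick cross-tabulation makes this imbalance concrete (a sketch; the exact counts depend on the data version):
table(covid_small$work_close, covid_small$past_cases > 100)  # closure level vs more than 100 past cases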
Now we can fit the model, evaluate it, and test it with drop1():
model_work_past <- lm(formula = log(new_cases+1) ~ log(past_cases+1) + work_close,
data = covid_small)
plot(model_work_past)
drop1(model_work_past, test = 'F')
## Single term deletions
##
## Model:
## log(new_cases + 1) ~ log(past_cases + 1) + work_close
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 272.19 100.176
## log(past_cases + 1) 1 548.79 820.99 261.567 288.3172 <2e-16 ***
## work_close 3 1.84 274.03 95.173 0.3224 0.8092
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, the lowest AIC belongs to the model without work closures (notice that the table is NOT ordered by AIC). The F test also did not find a significant difference between the model deleting no variable and the model deleting work_close. Since they are not statistically different, we apply the parsimony principle and keep the simpler model.
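Instead of writing the reduced model from scratch, we could also use update() to drop the term from the fitted model; this sketch is equivalent to the explicit lm() call below:
model_pastonly <- update(model_work_past, . ~ . - work_close)  # '.' means 'everything that was there'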
Now we will fit a new model without work_close, and run the test again to see whether deleting past_cases improves the model further.
model_pastonly <- lm(formula = log(new_cases+1) ~ log(past_cases+1), data = covid_small)
drop1(model_pastonly, test = 'F')
## Single term deletions
##
## Model:
## log(new_cases + 1) ~ log(past_cases + 1)
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 274.03 95.173
## log(past_cases + 1) 1 624.13 898.16 268.864 332.52 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now the difference is significant: we cannot improve our model by removing any more terms.
We can now apply summary(), tidy(), plot() and all the functions shown above to our final model:
summary(model_pastonly)
##
## Call:
## lm(formula = log(new_cases + 1) ~ log(past_cases + 1), data = covid_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.83253 -0.97520 0.06853 1.03690 2.98861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.57470 0.20183 2.847 0.00504 **
## log(past_cases + 1) 0.87242 0.04784 18.235 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.37 on 146 degrees of freedom
## Multiple R-squared: 0.6949, Adjusted R-squared: 0.6928
## F-statistic: 332.5 on 1 and 146 DF, p-value: < 2.2e-16
plot(model_pastonly)
tidy(model_pastonly)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.575 0.202 2.85 5.04e- 3
## 2 log(past_cases + 1) 0.872 0.0478 18.2 1.83e-39
tidy(drop1(model_pastonly))
## Warning in tidy.anova(drop1(model_pastonly)): The following column names in
## ANOVA output were not recognized or transformed: AIC
## # A tibble: 2 x 5
## term df sumsq rss AIC
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 <none> NA NA 274. 95.2
## 2 log(past_cases + 1) 1 624. 898. 269.
So far we have learnt to explore our data before fitting models, to fit and evaluate models (making transformations when needed), and to simplify models by dropping unnecessary terms. Now we will learn how to use these models to make predictions, and how to plot them.
Prediction
We will finish our overview of linear models in R by using them for predictions. First, we will fit a new, very complex model in which the current number of covid19 cases depends on all variables that we kept in the dataset. We will also include an interaction between gdp per capita and income support, since the effect of support might be greater in poorer countries, for example. We will then do several rounds of drop1() and refitting simpler models, until we are left only with the variables that are important in explaining the variation in new cases across countries.
covid_full_model <- lm(formula = log(new_cases+1) ~ log(past_cases+1) +
gdp_per_capita*income_support +
stringency_idx +
work_close,
data=covid_small)
With many variables like this, it would be a lot of work to run drop1() several times, fit simpler models, and repeat. Luckily, R has a function that does this automatically, named step(). As the name indicates, step() applies drop1() in steps.
minimal_model <- step(covid_full_model)
## Start: AIC=90.71
## log(new_cases + 1) ~ log(past_cases + 1) + gdp_per_capita * income_support +
## stringency_idx + work_close
##
## Df Sum of Sq RSS AIC
## - work_close 3 4.23 239.66 87.339
## <none> 235.44 90.706
## - stringency_idx 1 4.62 240.06 91.581
## - gdp_per_capita:income_support 2 8.64 244.08 92.041
## - log(past_cases + 1) 1 532.18 767.62 263.620
##
## Step: AIC=87.34
## log(new_cases + 1) ~ log(past_cases + 1) + gdp_per_capita + income_support +
## stringency_idx + gdp_per_capita:income_support
##
## Df Sum of Sq RSS AIC
## - stringency_idx 1 1.55 241.22 86.296
## <none> 239.66 87.339
## - gdp_per_capita:income_support 2 8.91 248.58 88.744
## - log(past_cases + 1) 1 560.37 800.03 263.741
##
## Step: AIC=86.3
## log(new_cases + 1) ~ log(past_cases + 1) + gdp_per_capita + income_support +
## gdp_per_capita:income_support
##
## Df Sum of Sq RSS AIC
## <none> 241.22 86.296
## - gdp_per_capita:income_support 2 8.03 249.24 87.140
## - log(past_cases + 1) 1 576.49 817.71 264.976
Here, it first found that removing workplace closures improved the model, which we knew already. It then removed the stringency index in the next step. In the last step, a model including the number of past cases and the interaction between income support and gdp per capita was better than any model removing these variables, so those terms were kept.
Let’s have a look at the chosen model:
plot(minimal_model)
summary(minimal_model)
##
## Call:
## lm(formula = log(new_cases + 1) ~ log(past_cases + 1) + gdp_per_capita +
## income_support + gdp_per_capita:income_support, data = covid_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3306 -0.8396 -0.0870 0.9145 3.1723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.715e-01 2.486e-01 1.092 0.27664
## log(past_cases + 1) 9.827e-01 5.353e-02 18.357 < 2e-16 ***
## gdp_per_capita -1.475e-05 6.322e-06 -2.333 0.02108 *
## income_support.L -1.106e+00 3.662e-01 -3.019 0.00301 **
## income_support.Q -2.333e-01 2.929e-01 -0.797 0.42695
## gdp_per_capita:income_support.L 2.350e-05 1.085e-05 2.165 0.03208 *
## gdp_per_capita:income_support.Q -1.414e-06 1.012e-05 -0.140 0.88909
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.308 on 141 degrees of freedom
## Multiple R-squared: 0.7314, Adjusted R-squared: 0.72
## F-statistic: 64 on 6 and 141 DF, p-value: < 2.2e-16
The summary indicates that the effect of income support alone is mostly linear: the higher the income support, the lower the number of current cases, after accounting for number of past cases. However, the interaction term indicates that if gdp per capita increases, the effect of income support is smaller: income support has a greater effect on poorer countries! It might be hard to understand the model behavior by looking at this table alone, so we will make it easier: let’s use the model to make predictions and then plot these predictions together with the data.
Let’s start by making predictions to compare with the data used to fit the model. We can achieve this with the function predict(). Let’s use predict() to obtain predictions and their associated prediction intervals:
prediction <- predict(minimal_model, interval = 'prediction')
## Warning in predict.lm(minimal_model, interval = "prediction"): predictions on current data refer to _future_ responses
head(prediction)
## fit lwr upr
## 1 0.6695544 -2.0094681 3.348577
## 2 4.8131704 2.1868262 7.439515
## 3 0.7722633 -1.8490491 3.393576
## 4 2.1703236 -0.5161315 4.856779
## 6 4.7016893 1.7994289 7.603950
## 7 4.6524578 2.0418860 7.263030
Now we need to remember that we log-transformed the response variable (number of new covid19 cases) before fitting the model, so we need to do the inverse operation here so our prediction is in the original scale. The inverse function of \(log(x+1)\) is \(exp(x)-1\).
prediction_df = as.data.frame(exp(prediction)-1)
head(prediction_df)
## fit lwr upr
## 1 0.9533668 -0.8659400 27.46220
## 2 122.1213442 7.9068999 1700.92386
## 3 1.1646600 -0.8426132 28.77222
## 4 7.7611190 -0.4031751 127.60925
## 6 109.1330581 5.0461935 2005.10358
## 7 103.8423464 6.7051274 1425.57182
We can join this data.frame to the data used in the model fit by binding columns with cbind():
data_plus_pred = cbind(prediction_df, covid_small)
head(data_plus_pred)
## fit lwr upr country_name
## 1 0.9533668 -0.8659400 27.46220 Aruba
## 2 122.1213442 7.9068999 1700.92386 Afghanistan
## 3 1.1646600 -0.8426132 28.77222 Angola
## 4 7.7611190 -0.4031751 127.60925 Albania
## 6 109.1330581 5.0461935 2005.10358 United Arab Emirates
## 7 103.8423464 6.7051274 1425.57182 Argentina
## world_region new_cases past_cases gdp_per_capita
## 1 Latin America and the Caribbean 0.0 1.4 35973.781
## 2 Asia and the Pacific 400.0 52.6 1803.987
## 3 Africa 0.6 0.0 5819.495
## 4 Europe 9.6 15.6 11803.431
## 6 West Asia 726.6 401.0 67293.483
## 7 Latin America and the Caribbean 338.0 91.4 18933.907
## work_close income_support stringency_idx
## 1 require most none 83.46
## 2 require most none 76.33
## 3 require some none 86.77
## 4 require some more than 50% 88.36
## 6 require some none 86.77
## 7 require some less than 50% 87.57
How good was our prediction for Panama, Brazil and the USA, for example?
countries = c('Panama', 'Brazil', 'United States')
data_plus_pred[data_plus_pred$country_name %in% countries, c('lwr','fit','upr','new_cases','country_name')]
## lwr fit upr new_cases country_name
## 22 141.15920 1999.9313 28162.677 11104.8 Brazil
## 113 12.25588 182.4582 2538.018 164.0 Panama
## 153 923.24146 13542.7191 198466.971 22719.6 United States
The observed number of cases is within the prediction interval for all three countries, but the point estimate for Brazil is off by almost an order of magnitude.
We can also make predictions based on new data. This is useful for making nice plots that include both the data and the model predictions, for example. Let’s try it out. First, we will figure out the range of the data we want to plot.
summary(covid_small$income_support)
## none less than 50% more than 50%
## 60 56 32
summary(covid_small$past_cases)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 3.9 28.0 501.0 161.6 27620.8
summary(covid_small$gdp_per_capita)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 702.2 5288.6 13567.3 20588.0 31654.4 116935.6
So we need to make predictions across the 0-27620 range of past cases, for the 3 levels of income support. It would be hard to cover the whole range of gdp per capita, so we will make predictions for two kinds of countries: a poor country with a gdp per capita of US$ 5,000 per year and a rich country with a gdp per capita of US$ 50,000 per year.
We need to create a data.frame with all possible combinations of values of these three variables, which we can do with the function expand.grid(). expand.grid() creates a new data.frame with all possible combinations of the vectors provided.
values_past_cases <- seq(from = 0, to=27620, length.out = 1000)
values_income <- levels(covid_small$income_support)
values_gdp <- c(5000, 50000)
new_data <- expand.grid(values_past_cases, values_income, values_gdp)
head(new_data)
## Var1 Var2 Var3
## 1 0.00000 none 5000
## 2 27.64765 none 5000
## 3 55.29530 none 5000
## 4 82.94294 none 5000
## 5 110.59059 none 5000
## 6 138.23824 none 5000
A side effect of expand.grid() is that ordered factors lose their ordering, so we need to use the function factor() to put the correct order back. This is important for plotting, since ggplot will automatically recognize the order and use an appropriate color scale.
new_data$Var2 <- factor(new_data$Var2,
levels = levels(covid_small$income_support),
ordered = T)
Now we have to rename our data.frame columns to match the names of the model variables.
colnames(new_data) <- c('past_cases', 'income_support','gdp_per_capita')
head(new_data)
## past_cases income_support gdp_per_capita
## 1 0.00000 none 5000
## 2 27.64765 none 5000
## 3 55.29530 none 5000
## 4 82.94294 none 5000
## 5 110.59059 none 5000
## 6 138.23824 none 5000
Now that we have the data.frame with new data, we can use it to predict. Let’s make the predictions and add them as new columns to the new_data data.frame, after transforming back to the original scale:
prediction <- predict(minimal_model, newdata = new_data, interval = 'prediction')
prediction <- exp(prediction) - 1
new_data <- cbind(new_data, prediction)
head(new_data)
## past_cases income_support gdp_per_capita fit lwr upr
## 1 0.00000 none 5000 1.222061 -0.8385641 29.58525
## 2 27.64765 none 5000 59.067018 3.3928636 820.34273
## 3 55.29530 none 5000 115.665759 7.4938184 1601.44764
## 4 82.94294 none 5000 171.764132 11.5332004 2380.47037
## 5 110.59059 none 5000 227.537685 15.5303396 3158.61287
## 6 138.23824 none 5000 283.070160 19.4950177 3936.34013
Before plotting, let’s add a new column that says whether a country is rich or poor depending on its gdp per capita. We will call countries with a gdp per capita above 30,000 USD rich and those below it poor, using the function ifelse(). This function works like an if{} else{} statement, but in a much shorter, vectorized form: the call below means that the column country_type will have the value rich if the country has a gdp per capita above 30000, and poor if not.
new_data$country_type <- ifelse(new_data$gdp_per_capita > 30000,
'rich',
'poor')
We will do the same for our real data.
covid_small$country_type = ifelse(covid_small$gdp_per_capita > 30000,
'rich',
'poor')
Now we can use ggplot to plot the data and the predictions. Let’s start with the predictions. We will plot rich and poor countries as facets, income support as colors, past cases on the X axis and new cases on the Y axis, drawn as lines. Since we are using lines and not points, we also need the group aesthetic, to tell ggplot which points belong to the same line:
p <- ggplot(new_data) +
geom_line(aes(x = past_cases,
y = fit,
color = income_support,
group = income_support)) +
facet_wrap(~country_type)
p
Looking at the predictions alone, it seems that for poor countries, the higher the income support, the lower the increase in the number of new cases during the last month. The effect is smaller for rich countries.
Now let’s add the real data. Remember that since we assigned our plot to the variable p, we can add a new plot layer by adding a geom_ to p. Let’s add points, and indicate that the data comes from covid_small.
p <- p +
geom_point(aes(x = past_cases,
y = new_cases,
color = income_support,
group = income_support),
data = covid_small)
p
Oops, it is hard to see the data like this. Let’s transform the axes to a log1p scale for better visualization:
p <- p +
scale_x_continuous(trans = 'log1p', breaks = 10^(0:5)) +
scale_y_continuous(trans = 'log1p', breaks = 10^(0:5))
p
With this plot, a few things stand out:
- There is quite a lot of variation in the degree of income support across countries, which might have played a role in the power that we had to detect a relationship.
- There is some degree of correlation between gdp per capita and income support: a larger share of rich countries seem to have provided it, compared to poor countries. That said, there is variation in both classes.
- We should be careful when interpreting results. They do not mean that workplace closing does not work, for example. Since most countries implemented some degree of closure, it might be that we simply do not have enough variation to test whether it works or not. This is one of the limitations of observational data in comparison to experiments!
Finally, we can embellish the plot with nice labels and a nice theme. We will also add a 1:1 dashed line with geom_abline(), indicating which countries increased and which decreased their cases. Points below the dashed line decreased in cases over the last 30 days, and it is pretty clear that income support brings more points below the line:
p +
theme_minimal() +
labs(y = 'Average new cases in the past 5 days',
x = 'Average new cases a month ago',
title = 'Growth of COVID19 across countries',
subtitle = 'Income support decreases cases of COVID19') +
geom_abline(intercept = 0,slope = 1,linetype='dashed') +
coord_equal()
This concludes our tutorial on linear models. There is a lot more that can be done in R for more complex models, but they all use functions similar to the ones we used here. To learn more, check the following:
- Generalized linear models: function glm() (see the sketch below).
- Time series: several packages.
- Mixed models: several packages.
- Linear models accounting for phylogeny: several packages.
- Bayesian models: rstanarm for an easy interface similar to the one we used here.
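For instance, since daily case counts are non-negative, a generalized linear model with a Poisson family can be more appropriate than a Gaussian linear model. Below is a minimal sketch, not part of the original analysis, assuming the covid_small table from above is still in memory (we round the averaged counts because the Poisson family expects integers):
# a minimal sketch: Poisson GLM relating new cases to past cases and income support
fit_glm <- glm(round(new_cases) ~ log1p(past_cases) + income_support,
               family = poisson,
               data = covid_small)
summary(fit_glm)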
Clustering and ordination
R is also a great tool to work with multivariate data, which we typically find in community ecology, for example. A common analysis with this kind of data is clustering and ordination. In our context here, we can ask ourselves how countries differ overall in the kind of interventions that they have taken against covid19, and whether countries in the same region tend to be more similar. We will use a method known as ‘nonmetric multidimensional scaling’ to visualize that, and an analysis of variance based on permutation (PERMANOVA) to test if kinds of measures differ by world region.
Let’s start by reading a new data.frame including only columns for country interventions against covid19, world regions and country names. This is a smaller version of the full data, in which we removed interventions with too much missing data
covid_interventions <- read.csv('covid_interventions_2020-05-16.csv', row.names = 1)
head(covid_interventions)
## world_region country_name C1_School.closing
## 1 Latin America and the Caribbean Aruba 3
## 2 Asia and the Pacific Afghanistan 3
## 3 Africa Angola 3
## 4 Europe Albania 3
## 5 Europe Andorra 3
## 6 West Asia United Arab Emirates 3
## C1_Flag C2_Workplace.closing C3_Cancel.public.events C3_Flag
## 1 1 3 2 1
## 2 1 3 2 1
## 3 1 2 2 1
## 4 1 2 2 1
## 5 1 2 1 1
## 6 1 2 2 1
## C4_Restrictions.on.gatherings C5_Close.public.transport
## 1 4 0
## 2 4 2
## 3 4 1
## 4 4 2
## 5 0 1
## 6 4 1
## C6_Stay.at.home.requirements C7_Restrictions.on.internal.movement
## 1 2 2
## 2 2 2
## 3 2 2
## 4 2 2
## 5 1 0
## 6 2 1
## C8_International.travel.controls E1_Income.support E2_Debt.contract.relief
## 1 3 0 2
## 2 1 0 0
## 3 4 0 2
## 4 4 2 1
## 5 3 2 0
## 6 4 0 2
## H1_Public.information.campaigns H1_Flag H2_Testing.policy H3_Contact.tracing
## 1 2 1 1 0
## 2 2 1 1 1
## 3 1 1 2 0
## 4 2 1 2 1
## 5 2 1 3 1
## 6 2 1 3 2
## H4_Emergency.investment.in.healthcare H5_Investment.in.vaccines
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
The first step in an ordination is to transform this table with several variables into a summary of how different each country is from every other country overall. In ecology, this would be the \(\beta\) diversity, but here we will simply calculate Euclidean distances. This can be done with the R function dist()
. For ecology, the package ecodist
has functions to calculate several other kinds of distance metrics.
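To build intuition for what dist() computes, here is a toy example (not tutorial data) with three ‘countries’ scored on two ‘interventions’; the Euclidean distance between rows 1 and 2 is sqrt(3^2 + 4^2) = 5:
# toy matrix: 3 'countries' x 2 'interventions'
toy <- matrix(c(0, 0,
                3, 4,
                0, 1),
              nrow = 3, byrow = TRUE)
dist(toy)  # pairwise Euclidean distances: d(1,2) = 5, d(1,3) = 1, d(2,3) ~ 4.24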
The differences between all countries (ignoring columns with country names and regions) are:
intervention_differences <- dist(covid_interventions[,-c(1,2)])
If we look at this object now, we find that it is an object of the class dist
, which records the pairwise distances between all 147 countries (10731 distances in total):
str(intervention_differences)
## 'dist' num [1:10731] 3.61 2.24 3.61 6.08 3.46 ...
## - attr(*, "Size")= int 147
## - attr(*, "Labels")= chr [1:147] "1" "2" "3" "4" ...
## - attr(*, "Diag")= logi FALSE
## - attr(*, "Upper")= logi FALSE
## - attr(*, "method")= chr "euclidean"
## - attr(*, "call")= language dist(x = covid_interventions[, -c(1, 2)])
We will now use nonmetric multidimensional scaling (NMDS) to reduce this \(147 \times 147\) distance matrix to only 2 dimensions that we humans can look at in a graph. NMDS will attempt to represent the distances between all points in these two dimensions as well as possible.
The R package vegan
contains several functions to work with multivariate data, including one to do NMDS. Let’s load the package first, and remember to install it with install.packages('vegan')
in case you do not have it in your library already.
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
The function to do NMDS is metaMDS
:
mds <- metaMDS(intervention_differences)
## Run 0 stress 0.2121924
## Run 1 stress 0.2103559
## ... New best solution
## ... Procrustes: rmse 0.03011934 max resid 0.1736358
## Run 2 stress 0.2109125
## Run 3 stress 0.2127114
## Run 4 stress 0.210966
## Run 5 stress 0.2109142
## Run 6 stress 0.210914
## Run 7 stress 0.2101792
## ... New best solution
## ... Procrustes: rmse 0.01025439 max resid 0.1061251
## Run 8 stress 0.2169023
## Run 9 stress 0.2208988
## Run 10 stress 0.214967
## Run 11 stress 0.2190474
## Run 12 stress 0.2110621
## Run 13 stress 0.2149936
## Run 14 stress 0.2203084
## Run 15 stress 0.2203263
## Run 16 stress 0.2128381
## Run 17 stress 0.2110632
## Run 18 stress 0.210858
## Run 19 stress 0.2115449
## Run 20 stress 0.2209754
## *** No convergence -- monoMDS stopping criteria:
## 20: stress ratio > sratmax
The stress values printed above measure how much the 2-dimensional representation distorts the original distances (lower is better; as a common rule of thumb, stress around 0.2, like ours, is usable but far from perfect). We can use plot
to visualize the results now:
plot(mds)
## species scores not available
If we want to color by world region, we can combine the points from the MDS result with our columns for country names and regions. The object mds
, to which we assigned the MDS results, holds the point coordinates under a slot named points
, so with mds$points
we can retrieve them.
mds_result <- cbind(mds$points, covid_interventions)
head(mds_result)
## MDS1 MDS2 world_region country_name
## 1 -1.2609993 0.7811568 Latin America and the Caribbean Aruba
## 2 -1.2163974 -1.9220767 Asia and the Pacific Afghanistan
## 3 -0.7699213 0.3721168 Africa Angola
## 4 -0.5776075 -0.1048845 Europe Albania
## 5 3.6053812 -0.7335618 Europe Andorra
## 6 -0.5520690 0.6102937 West Asia United Arab Emirates
## C1_School.closing C1_Flag C2_Workplace.closing C3_Cancel.public.events
## 1 3 1 3 2
## 2 3 1 3 2
## 3 3 1 2 2
## 4 3 1 2 2
## 5 3 1 2 1
## 6 3 1 2 2
## C3_Flag C4_Restrictions.on.gatherings C5_Close.public.transport
## 1 1 4 0
## 2 1 4 2
## 3 1 4 1
## 4 1 4 2
## 5 1 0 1
## 6 1 4 1
## C6_Stay.at.home.requirements C7_Restrictions.on.internal.movement
## 1 2 2
## 2 2 2
## 3 2 2
## 4 2 2
## 5 1 0
## 6 2 1
## C8_International.travel.controls E1_Income.support E2_Debt.contract.relief
## 1 3 0 2
## 2 1 0 0
## 3 4 0 2
## 4 4 2 1
## 5 3 2 0
## 6 4 0 2
## H1_Public.information.campaigns H1_Flag H2_Testing.policy H3_Contact.tracing
## 1 2 1 1 0
## 2 2 1 1 1
## 3 1 1 2 0
## 4 2 1 2 1
## 5 2 1 3 1
## 6 2 1 3 2
## H4_Emergency.investment.in.healthcare H5_Investment.in.vaccines
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
Now we can use ggplot to plot the points and color by region:
ggplot(mds_result) +
geom_point(aes(x = MDS1,
y = MDS2,
color = world_region))
With ggplot, we can also add ellipses surrounding the points for each region. We use the group
aesthetic because stat_ellipse()
needs to know how to group the points. We also moved the aesthetics up to the ggplot()
call so they are inherited by both geom_point() and stat_ellipse().
ggplot(mds_result, aes(x = MDS1, y = MDS2, color = world_region, group = world_region)) +
geom_point() +
stat_ellipse() +
scale_color_brewer(type='qual')
## Too few points to calculate an ellipse
## Warning: Removed 1 row(s) containing missing values (geom_path).
We got a warning because North America only has two countries and therefore it is not possible to draw an ellipse.
Maybe there is some difference between countries, but it is not clear. Let’s test whether this is the case with a permutational ANOVA (PERMANOVA). Using the distances between countries as our response and world region as the predictor, we want to know whether world regions differ significantly in the interventions used against covid19. The function for this analysis is adonis2
in the package vegan
, and it uses a formula syntax similar to linear models:
adonis2(intervention_differences ~ world_region, data = covid_interventions)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = intervention_differences ~ world_region, data = covid_interventions)
## Df SumOfSqs R2 F Pr(>F)
## world_region 5 87.68 0.08315 2.5576 0.001 ***
## Residual 141 966.75 0.91685
## Total 146 1054.44 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems there is a difference between regions!
To summarize, we used linear models to find that there is a difference between regions in the number of new covid19 cases in the last few days. With a PERMANOVA, we also found a difference in the interventions that countries in different regions attempted.
With a more complex model including other variables beyond region, we found that the number of new covid19 cases is well explained by the number of past cases, and that governments providing some sort of income support to their population were more successful in reducing the number of new cases. This effect was stronger in poorer countries.
To find out more about multivariate analyses and methods for ordination applied to ecology, check out the page linked here.
To do
As an exercise, try to select different columns from the full dataset
covid_data
and use linear models to find the effect of other variables (for example, population age or density, prevalence of diabetes, etc) on the number of cases. You can also try to change the response variable to the total number of cases instead of the daily cases; a minimal starting point is sketched below.
As a second exercise, try to apply either linear models or ordination to another dataset. If you have ecology data, come to office hours and we can work together on how to use ordination with dissimilarity metrics that are appropriate to ecology.
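Here is a minimal sketch for the first exercise, assuming covid_data is loaded as at the start of the tutorial (the predictor choice is just illustrative):
# hypothetical starter model: does median age add anything beyond past cases?
m_age <- lm(log1p(new_cases) ~ log1p(past_cases) + median_age,
            data = covid_data)
summary(m_age)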
Appendix: Producing the dataset
Here we show how the dataset that we used in this tutorial was produced. It is the result of cleaning and merging three sources: a table including country-level covid19 cases and country demographic data, another table with country responses to covid19, and a table with information about countries, including the continent in which they are located.
After merging this data in a single table, we saved it as the csv file that we used today.
Loading and cleaning datasets
Let’s start by downloading the data. The covid19 data in Our World in Data is updated daily, and we can get the most recent table directly from the link https://covid.ourworldindata.org/data/owid-covid-data.csv
In the previous tutorials, we loaded datasets located on our computer, and we could in theory use a web browser to download the table first and then load it into R. However, R allows us to load a file directly from the web if we provide a URL (i.e. a web address). This table is in the CSV format (i.e. comma-separated values), so we will use the function read.csv()
to read it.
country_covid <- read.csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')
head(country_covid)
## iso_code continent location date total_cases new_cases total_deaths
## 1 ABW North America Aruba 2020-03-13 2 2 0
## 2 ABW North America Aruba 2020-03-20 4 2 0
## 3 ABW North America Aruba 2020-03-24 12 8 0
## 4 ABW North America Aruba 2020-03-25 17 5 0
## 5 ABW North America Aruba 2020-03-26 19 2 0
## 6 ABW North America Aruba 2020-03-27 28 9 0
## new_deaths total_cases_per_million new_cases_per_million
## 1 0 18.733 18.733
## 2 0 37.465 18.733
## 3 0 112.395 74.930
## 4 0 159.227 46.831
## 5 0 177.959 18.733
## 6 0 262.256 84.296
## total_deaths_per_million new_deaths_per_million new_tests total_tests
## 1 0 0 NA NA
## 2 0 0 NA NA
## 3 0 0 NA NA
## 4 0 0 NA NA
## 5 0 0 NA NA
## 6 0 0 NA NA
## total_tests_per_thousand new_tests_per_thousand new_tests_smoothed
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## new_tests_smoothed_per_thousand tests_units stringency_index population
## 1 NA 0.00 106766
## 2 NA 33.33 106766
## 3 NA 44.44 106766
## 4 NA 44.44 106766
## 5 NA 44.44 106766
## 6 NA 44.44 106766
## population_density median_age aged_65_older aged_70_older gdp_per_capita
## 1 584.8 41.2 13.085 7.452 35973.78
## 2 584.8 41.2 13.085 7.452 35973.78
## 3 584.8 41.2 13.085 7.452 35973.78
## 4 584.8 41.2 13.085 7.452 35973.78
## 5 584.8 41.2 13.085 7.452 35973.78
## 6 584.8 41.2 13.085 7.452 35973.78
## extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers
## 1 NA NA 11.62 NA
## 2 NA NA 11.62 NA
## 3 NA NA 11.62 NA
## 4 NA NA 11.62 NA
## 5 NA NA 11.62 NA
## 6 NA NA 11.62 NA
## male_smokers handwashing_facilities hospital_beds_per_thousand
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## life_expectancy
## 1 76.29
## 2 76.29
## 3 76.29
## 4 76.29
## 5 76.29
## 6 76.29
This dataset records, for each country and each date, a series of statistics on covid19, together with several other statistics about the country (such as population, age structure, etc). A full explanation of the dataset can be found in this github repository, and an explanation of the fields included can be found here; it is copied below:
| Column | Description | Source |
|---|---|---|
| iso_code | ISO 3166-1 alpha-3 – three-letter country codes | International Organization for Standardization |
| location | Geographical location | Our World in Data |
| date | Date of observation | Our World in Data |
| total_cases | Total confirmed cases of COVID-19 | European Centre for Disease Prevention and Control |
| new_cases | New confirmed cases of COVID-19 | European Centre for Disease Prevention and Control |
| total_deaths | Total deaths attributed to COVID-19 | European Centre for Disease Prevention and Control |
| new_deaths | New deaths attributed to COVID-19 | European Centre for Disease Prevention and Control |
| total_cases_per_million | Total confirmed cases of COVID-19 per 1,000,000 people | European Centre for Disease Prevention and Control |
| new_cases_per_million | New confirmed cases of COVID-19 per 1,000,000 people | European Centre for Disease Prevention and Control |
| total_deaths_per_million | Total deaths attributed to COVID-19 per 1,000,000 people | European Centre for Disease Prevention and Control |
| new_deaths_per_million | New deaths attributed to COVID-19 per 1,000,000 people | European Centre for Disease Prevention and Control |
| total_tests | Total tests for COVID-19 | European Centre for Disease Prevention and Control |
| new_tests | New tests for COVID-19 | European Centre for Disease Prevention and Control |
| total_tests_per_thousand | Total tests for COVID-19 per 1,000 people | National government reports |
| new_tests_per_thousand | New tests for COVID-19 per 1,000 people | National government reports |
| tests_units | Units used by the location to report its testing data | National government reports |
| population | Population in 2020 | United Nations, Department of Economic and Social Affairs, Population Division, World Population Prospects: The 2019 Revision |
| population_density | Number of people divided by land area, measured in square kilometers, most recent year available | World Bank – World Development Indicators, sourced from Food and Agriculture Organization and World Bank estimates |
| median_age | Median age of the population, UN projection for 2020 | UN Population Division, World Population Prospects, 2017 Revision |
| aged_65_older | Share of the population that is 65 years and older, most recent year available | World Bank – World Development Indicators, based on age/sex distributions of United Nations Population Division’s World Population Prospects: 2017 Revision |
| aged_70_older | Share of the population that is 70 years and older in 2015 | United Nations, Department of Economic and Social Affairs, Population Division (2017), World Population Prospects: The 2017 Revision |
| gdp_per_capita | Gross domestic product at purchasing power parity (constant 2011 international dollars), most recent year available | World Bank – World Development Indicators, sourced from World Bank, International Comparison Program database |
| extreme_poverty | Share of the population living in extreme poverty, most recent year available since 2010 | World Bank – World Development Indicators, sourced from World Bank Development Research Group |
| cvd_death_rate | Death rate from cardiovascular disease in 2017 | Global Burden of Disease Collaborative Network, Global Burden of Disease Study 2017 Results |
| diabetes_prevalence | Diabetes prevalence (% of population aged 20 to 79) in 2017 | World Bank – World Development Indicators, sourced from International Diabetes Federation, Diabetes Atlas |
| female_smokers | Share of women who smoke, most recent year available | World Bank – World Development Indicators, sourced from World Health Organization, Global Health Observatory Data Repository |
| male_smokers | Share of men who smoke, most recent year available | World Bank – World Development Indicators, sourced from World Health Organization, Global Health Observatory Data Repository |
| handwashing_facilities | Share of the population with basic handwashing facilities on premises, most recent year available | United Nations Statistics Division |
| hospital_beds_per_100k | Hospital beds per 100,000 people, most recent year available since 2010 | OECD, Eurostat, World Bank, national government records and other sources |
Since we are interested in understanding whether government responses work, we have to join this dataset to another one including several responses of governments to covid19. The link to this second dataset can also be found on the Our World in Data web page: https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv. Again, we will use read.csv()
to load the dataset into a table:
country_response <- read.csv('https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv')
head(country_response)
## CountryName CountryCode Date C1_School.closing C1_Flag
## 1 Aruba ABW 20200101 0 NA
## 2 Aruba ABW 20200102 0 NA
## 3 Aruba ABW 20200103 0 NA
## 4 Aruba ABW 20200104 0 NA
## 5 Aruba ABW 20200105 0 NA
## 6 Aruba ABW 20200106 0 NA
## C2_Workplace.closing C2_Flag C3_Cancel.public.events C3_Flag
## 1 0 NA 0 NA
## 2 0 NA 0 NA
## 3 0 NA 0 NA
## 4 0 NA 0 NA
## 5 0 NA 0 NA
## 6 0 NA 0 NA
## C4_Restrictions.on.gatherings C4_Flag C5_Close.public.transport C5_Flag
## 1 0 NA 0 NA
## 2 0 NA 0 NA
## 3 0 NA 0 NA
## 4 0 NA 0 NA
## 5 0 NA 0 NA
## 6 0 NA 0 NA
## C6_Stay.at.home.requirements C6_Flag C7_Restrictions.on.internal.movement
## 1 0 NA 0
## 2 0 NA 0
## 3 0 NA 0
## 4 0 NA 0
## 5 0 NA 0
## 6 0 NA 0
## C7_Flag C8_International.travel.controls E1_Income.support E1_Flag
## 1 NA 0 0 NA
## 2 NA 0 0 NA
## 3 NA 0 0 NA
## 4 NA 0 0 NA
## 5 NA 0 0 NA
## 6 NA 0 0 NA
## E2_Debt.contract.relief E3_Fiscal.measures E4_International.support
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## H1_Public.information.campaigns H1_Flag H2_Testing.policy H3_Contact.tracing
## 1 0 NA 0 0
## 2 0 NA 0 0
## 3 0 NA 0 0
## 4 0 NA 0 0
## 5 0 NA 0 0
## 6 0 NA 0 0
## H4_Emergency.investment.in.healthcare H5_Investment.in.vaccines M1_Wildcard
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## ConfirmedCases ConfirmedDeaths StringencyIndex StringencyIndexForDisplay
## 1 NA NA 0 0
## 2 NA NA 0 0
## 3 NA NA 0 0
## 4 NA NA 0 0
## 5 NA NA 0 0
## 6 NA NA 0 0
## StringencyLegacyIndex StringencyLegacyIndexForDisplay GovernmentResponseIndex
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## GovernmentResponseIndexForDisplay ContainmentHealthIndex
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## ContainmentHealthIndexForDisplay EconomicSupportIndex
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## EconomicSupportIndexForDisplay
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
This table is very similar to the other one, with country, date and several columns detailing the response of each country on a given day. The explanation for each variable is available in the dataset’s github repository:
Containment and closure policies
| ID | Name | Description | Measurement | Coding |
|---|---|---|---|---|
| C1 | C1_School closing | Record closings of schools and universities | Ordinal scale | 0 - no measures; 1 - recommend closing; 2 - require closing (only some levels or categories, eg just high school, or just public schools); 3 - require closing all levels; Blank - no data |
| | C1_Flag | Binary flag for geographic scope | | 0 - targeted; 1 - general; Blank - no data |
| C2 | C2_Workplace closing | Record closings of workplaces | Ordinal scale | 0 - no measures; 1 - recommend closing (or recommend work from home); 2 - require closing (or work from home) for some sectors or categories of workers; 3 - require closing (or work from home) for all-but-essential workplaces (eg grocery stores, doctors); Blank - no data |
| | C2_Flag | Binary flag for geographic scope | | 0 - targeted; 1 - general; Blank - no data |
| C3 | C3_Cancel public events | Record cancelling public events | Ordinal scale | 0 - no measures; 1 - recommend cancelling; 2 - require cancelling; Blank - no data |
| | C3_Flag | Binary flag for geographic scope | | 0 - targeted; 1 - general; Blank - no data |
| C4 | C4_Restrictions on gatherings | Record limits on private gatherings | Ordinal scale | 0 - no restrictions; 1 - restrictions on very large gatherings (the limit is above 1000 people); 2 - restrictions on gatherings between 101-1000 people; 3 - restrictions on gatherings between 11-100 people; 4 - restrictions on gatherings of 10 people or less; Blank - no data |
| | C4_Flag | Binary flag for geographic scope | | 0 - targeted; 1 - general; Blank - no data |
| C5 | C5_Close public transport | Record closing of public transport | Ordinal scale | 0 - no measures; 1 - recommend closing (or significantly reduce volume/route/means of transport available); 2 - require closing (or prohibit most citizens from using it); Blank - no data |
| | C5_Flag | Binary flag for geographic scope | | 0 - targeted; 1 - general; Blank - no data |
| C6 | C6_Stay at home requirements | Record orders to “shelter-in-place” and otherwise confine to the home | Ordinal scale | 0 - no measures; 1 - recommend not leaving house; 2 - require not leaving house with exceptions for daily exercise, grocery shopping, and ‘essential’ trips; 3 - require not leaving house with minimal exceptions (eg allowed to leave once a week, or only one person can leave at a time, etc); Blank - no data |
| | C6_Flag | Binary flag for geographic scope | | 0 - targeted; 1 - general; Blank - no data |
| C7 | C7_Restrictions on internal movement | Record restrictions on internal movement between cities/regions | Ordinal scale | 0 - no measures; 1 - recommend not to travel between regions/cities; 2 - internal movement restrictions in place; Blank - no data |
| | C7_Flag | Binary flag for geographic scope | | 0 - targeted; 1 - general; Blank - no data |
| C8 | C8_International travel controls | Record restrictions on international travel. Note: this records policy for foreign travellers, not citizens | Ordinal scale | 0 - no restrictions; 1 - screening arrivals; 2 - quarantine arrivals from some or all regions; 3 - ban arrivals from some regions; 4 - ban on all regions or total border closure; Blank - no data |
Economic policies
| ID | Name | Description | Measurement | Coding |
|---|---|---|---|---|
| E1 | E1_Income support (for households) | Record if the government is providing direct cash payments to people who lose their jobs or cannot work. Note: only includes payments to firms if explicitly linked to payroll/salaries | Ordinal scale | 0 - no income support; 1 - government is replacing less than 50% of lost salary (or if a flat sum, it is less than 50% median salary); 2 - government is replacing more than 50% of lost salary (or if a flat sum, it is greater than 50% median salary); Blank - no data |
| | E1_Flag | Binary flag for sectoral scope | | 0 - formal sector workers only; 1 - transfers to informal sector workers too; Blank - no data |
| E2 | E2_Debt/contract relief (for households) | Record if the government is freezing financial obligations for households (eg stopping loan repayments, preventing services like water from stopping, or banning evictions) | Ordinal scale | 0 - no debt/contract relief; 1 - narrow relief, specific to one kind of contract; 2 - broad debt/contract relief |
| E3 | E3_Fiscal measures | Announced economic stimulus spending. Note: only record amount additional to previously announced spending | USD | Record monetary value in USD of fiscal stimuli, includes any spending or tax cuts NOT included in E4, H4 or H5; 0 - no new spending that day; Blank - no data |
| E4 | E4_International support | Announced offers of Covid-19 related aid spending to other countries. Note: only record amount additional to previously announced spending | USD | Record monetary value in USD; 0 - no new spending that day; Blank - no data |
Health system policies
| ID | Name | Description | Measurement | Coding |
|---|---|---|---|---|
| H1 | H1_Public information campaigns | Record presence of public info campaigns | Ordinal scale | 0 - no Covid-19 public information campaign; 1 - public officials urging caution about Covid-19; 2 - coordinated public information campaign (eg across traditional and social media); Blank - no data |
| | H1_Flag | Binary flag for geographic scope | | 0 - targeted; 1 - general; Blank - no data |
| H2 | H2_Testing policy | Record government policy on who has access to testing. Note: this records policies about testing for current infection (PCR tests) not testing for immunity (antibody test) | Ordinal scale | 0 - no testing policy; 1 - only those who both (a) have symptoms AND (b) meet specific criteria (eg key workers, admitted to hospital, came into contact with a known case, returned from overseas); 2 - testing of anyone showing Covid-19 symptoms; 3 - open public testing (eg “drive through” testing available to asymptomatic people); Blank - no data |
| H3 | H3_Contact tracing | Record government policy on contact tracing after a positive diagnosis. Note: we are looking for policies that would identify all people potentially exposed to Covid-19; voluntary bluetooth apps are unlikely to achieve this | Ordinal scale | 0 - no contact tracing; 1 - limited contact tracing, not done for all cases; 2 - comprehensive contact tracing, done for all identified cases |
| H4 | H4_Emergency investment in healthcare | Announced short term spending on healthcare system, eg hospitals, masks, etc. Note: only record amount additional to previously announced spending | USD | Record monetary value in USD; 0 - no new spending that day; Blank - no data |
| H5 | H5_Investment in vaccines | Announced public spending on Covid-19 vaccine development. Note: only record amount additional to previously announced spending | USD | Record monetary value in USD; 0 - no new spending that day; Blank - no data |
Miscellaneous policies
| ID | Name | Description | Measurement | Coding |
|---|---|---|---|---|
| M1 | M1_Wildcard | Record policy announcements that do not fit anywhere else | Free text notes field | Note unusual or interesting interventions that are worth flagging |
Since we want to relate government actions as predictors to the number of covid19 cases as the response, we now need to join these two datasets, which is something that we often need to do in real life and have not done yet in this course. To join the two datasets, we will use the R function merge()
. When we are joining datasets, we need to specify which columns are the same in the two tables, which here will be CountryName
, CountryCode
and Date
. However, if you look at the data above, you will notice that we have a few problems to solve:
1. The date format is different between the two tables.
2. The datasets include more information than we want.
3. The column names for variables that are the same (country and date) are different.
We will now solve each of these problems.
Formatting dates
Let’s start by solving problem 1: standardizing dates. We know R can store vectors of numbers and vectors of characters, but dates are something different: strictly speaking, they are neither simple numbers nor character strings. Luckily, R also has a variable type for dates, and with the function as.Date()
we can transform a character string into a date.
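As a quick toy example:
# parse an ISO-formatted string into a Date object
d <- as.Date('2020-05-16', format = '%Y-%m-%d')
class(d)
## [1] "Date"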
Let’s do it for the country_covid
table first. Let’s have a look before the transformation:
head(country_covid)
## iso_code continent location date total_cases new_cases total_deaths
## 1 ABW North America Aruba 2020-03-13 2 2 0
## 2 ABW North America Aruba 2020-03-20 4 2 0
## 3 ABW North America Aruba 2020-03-24 12 8 0
## 4 ABW North America Aruba 2020-03-25 17 5 0
## 5 ABW North America Aruba 2020-03-26 19 2 0
## 6 ABW North America Aruba 2020-03-27 28 9 0
## new_deaths total_cases_per_million new_cases_per_million
## 1 0 18.733 18.733
## 2 0 37.465 18.733
## 3 0 112.395 74.930
## 4 0 159.227 46.831
## 5 0 177.959 18.733
## 6 0 262.256 84.296
## total_deaths_per_million new_deaths_per_million new_tests total_tests
## 1 0 0 NA NA
## 2 0 0 NA NA
## 3 0 0 NA NA
## 4 0 0 NA NA
## 5 0 0 NA NA
## 6 0 0 NA NA
## total_tests_per_thousand new_tests_per_thousand new_tests_smoothed
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## new_tests_smoothed_per_thousand tests_units stringency_index population
## 1 NA 0.00 106766
## 2 NA 33.33 106766
## 3 NA 44.44 106766
## 4 NA 44.44 106766
## 5 NA 44.44 106766
## 6 NA 44.44 106766
## population_density median_age aged_65_older aged_70_older gdp_per_capita
## 1 584.8 41.2 13.085 7.452 35973.78
## 2 584.8 41.2 13.085 7.452 35973.78
## 3 584.8 41.2 13.085 7.452 35973.78
## 4 584.8 41.2 13.085 7.452 35973.78
## 5 584.8 41.2 13.085 7.452 35973.78
## 6 584.8 41.2 13.085 7.452 35973.78
## extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers
## 1 NA NA 11.62 NA
## 2 NA NA 11.62 NA
## 3 NA NA 11.62 NA
## 4 NA NA 11.62 NA
## 5 NA NA 11.62 NA
## 6 NA NA 11.62 NA
## male_smokers handwashing_facilities hospital_beds_per_thousand
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## life_expectancy
## 1 76.29
## 2 76.29
## 3 76.29
## 4 76.29
## 5 76.29
## 6 76.29
Date here is stored as a character in the format Year-Month-Day, so we will tell R that this is the format and it will figure the date out automatically. In R’s syntax, we indicate this format as %Y-%m-%d:
country_covid$date <- as.Date(country_covid$date, format='%Y-%m-%d')
Let’s look again. Now the type of the column date is Date
and not character
anymore.
head(country_covid)
## iso_code continent location date total_cases new_cases total_deaths
## 1 ABW North America Aruba 2020-03-13 2 2 0
## 2 ABW North America Aruba 2020-03-20 4 2 0
## 3 ABW North America Aruba 2020-03-24 12 8 0
## 4 ABW North America Aruba 2020-03-25 17 5 0
## 5 ABW North America Aruba 2020-03-26 19 2 0
## 6 ABW North America Aruba 2020-03-27 28 9 0
## new_deaths total_cases_per_million new_cases_per_million
## 1 0 18.733 18.733
## 2 0 37.465 18.733
## 3 0 112.395 74.930
## 4 0 159.227 46.831
## 5 0 177.959 18.733
## 6 0 262.256 84.296
## total_deaths_per_million new_deaths_per_million new_tests total_tests
## 1 0 0 NA NA
## 2 0 0 NA NA
## 3 0 0 NA NA
## 4 0 0 NA NA
## 5 0 0 NA NA
## 6 0 0 NA NA
## total_tests_per_thousand new_tests_per_thousand new_tests_smoothed
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## new_tests_smoothed_per_thousand tests_units stringency_index population
## 1 NA 0.00 106766
## 2 NA 33.33 106766
## 3 NA 44.44 106766
## 4 NA 44.44 106766
## 5 NA 44.44 106766
## 6 NA 44.44 106766
## population_density median_age aged_65_older aged_70_older gdp_per_capita
## 1 584.8 41.2 13.085 7.452 35973.78
## 2 584.8 41.2 13.085 7.452 35973.78
## 3 584.8 41.2 13.085 7.452 35973.78
## 4 584.8 41.2 13.085 7.452 35973.78
## 5 584.8 41.2 13.085 7.452 35973.78
## 6 584.8 41.2 13.085 7.452 35973.78
## extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers
## 1 NA NA 11.62 NA
## 2 NA NA 11.62 NA
## 3 NA NA 11.62 NA
## 4 NA NA 11.62 NA
## 5 NA NA 11.62 NA
## 6 NA NA 11.62 NA
## male_smokers handwashing_facilities hospital_beds_per_thousand
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## life_expectancy
## 1 76.29
## 2 76.29
## 3 76.29
## 4 76.29
## 5 76.29
## 6 76.29
Let’s do the same for the second table. This time, the format is YearMonthDay, without a separator. Because of that, R read it as if it were integer numbers, which is wrong, so we first need to convert the column to character and then to date:
country_response$Date <- as.character(country_response$Date)
country_response$Date <- as.Date(country_response$Date, format='%Y%m%d')
Now let’s look again:
head(country_response)
## CountryName CountryCode Date C1_School.closing C1_Flag
## 1 Aruba ABW 2020-01-01 0 NA
## 2 Aruba ABW 2020-01-02 0 NA
## 3 Aruba ABW 2020-01-03 0 NA
## 4 Aruba ABW 2020-01-04 0 NA
## 5 Aruba ABW 2020-01-05 0 NA
## 6 Aruba ABW 2020-01-06 0 NA
## C2_Workplace.closing C2_Flag C3_Cancel.public.events C3_Flag
## 1 0 NA 0 NA
## 2 0 NA 0 NA
## 3 0 NA 0 NA
## 4 0 NA 0 NA
## 5 0 NA 0 NA
## 6 0 NA 0 NA
## C4_Restrictions.on.gatherings C4_Flag C5_Close.public.transport C5_Flag
## 1 0 NA 0 NA
## 2 0 NA 0 NA
## 3 0 NA 0 NA
## 4 0 NA 0 NA
## 5 0 NA 0 NA
## 6 0 NA 0 NA
## C6_Stay.at.home.requirements C6_Flag C7_Restrictions.on.internal.movement
## 1 0 NA 0
## 2 0 NA 0
## 3 0 NA 0
## 4 0 NA 0
## 5 0 NA 0
## 6 0 NA 0
## C7_Flag C8_International.travel.controls E1_Income.support E1_Flag
## 1 NA 0 0 NA
## 2 NA 0 0 NA
## 3 NA 0 0 NA
## 4 NA 0 0 NA
## 5 NA 0 0 NA
## 6 NA 0 0 NA
## E2_Debt.contract.relief E3_Fiscal.measures E4_International.support
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## H1_Public.information.campaigns H1_Flag H2_Testing.policy H3_Contact.tracing
## 1 0 NA 0 0
## 2 0 NA 0 0
## 3 0 NA 0 0
## 4 0 NA 0 0
## 5 0 NA 0 0
## 6 0 NA 0 0
## H4_Emergency.investment.in.healthcare H5_Investment.in.vaccines M1_Wildcard
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## ConfirmedCases ConfirmedDeaths StringencyIndex StringencyIndexForDisplay
## 1 NA NA 0 0
## 2 NA NA 0 0
## 3 NA NA 0 0
## 4 NA NA 0 0
## 5 NA NA 0 0
## 6 NA NA 0 0
## StringencyLegacyIndex StringencyLegacyIndexForDisplay GovernmentResponseIndex
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## GovernmentResponseIndexForDisplay ContainmentHealthIndex
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## ContainmentHealthIndexForDisplay EconomicSupportIndex
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## EconomicSupportIndexForDisplay
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Summarizing important data
Since covid19 has an incubation time of up to 2 weeks, it is reasonable to assume that the current number of daily cases is a result of interventions implemented in the last few weeks. So, instead of using all data on all interventions ever done, we will record the most common intervention in the last 30 days. For example, if schools have been closed for most days during the last 4 weeks, we will record the country’s state as schools closed.
To do that, we will start by filtering the country_response
table, which holds our interventions, to include only records from the last 30 days. Since we already transformed the Date
column into an actual date, R will allow us to do math with it. To get the current date (2020-07-31), we use the function Sys.Date()
:
Sys.Date()
## [1] "2020-07-31"
So, for example, the difference between today and January first is:
Sys.Date() - as.Date('2020-01-01')
## Time difference of 212 days
We can use that to find rows for which the date is less than or equal to 30 days before today. Let’s do that for the first 10 records:
Sys.Date() - country_response$Date[1:10] <= 30
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Now let’s use this to filter the data.frame:
country_response <- country_response[as.numeric(Sys.Date() - country_response$Date) <= 30, ]
head(country_response)
## CountryName CountryCode Date C1_School.closing C1_Flag
## 183 Aruba ABW 2020-07-01 0 NA
## 184 Aruba ABW 2020-07-02 0 NA
## 185 Aruba ABW 2020-07-03 0 NA
## 186 Aruba ABW 2020-07-04 0 NA
## 187 Aruba ABW 2020-07-05 0 NA
## 188 Aruba ABW 2020-07-06 0 NA
## C2_Workplace.closing C2_Flag C3_Cancel.public.events C3_Flag
## 183 1 1 0 NA
## 184 1 1 0 NA
## 185 1 1 0 NA
## 186 1 1 0 NA
## 187 1 1 0 NA
## 188 1 1 0 NA
## C4_Restrictions.on.gatherings C4_Flag C5_Close.public.transport C5_Flag
## 183 0 NA 0 NA
## 184 0 NA 0 NA
## 185 0 NA 0 NA
## 186 0 NA 0 NA
## 187 0 NA 0 NA
## 188 0 NA 0 NA
## C6_Stay.at.home.requirements C6_Flag C7_Restrictions.on.internal.movement
## 183 1 1 1
## 184 1 1 1
## 185 1 1 1
## 186 1 1 1
## 187 1 1 1
## 188 1 1 1
## C7_Flag C8_International.travel.controls E1_Income.support E1_Flag
## 183 1 3 2 0
## 184 1 3 2 0
## 185 1 3 2 0
## 186 1 3 2 0
## 187 1 3 2 0
## 188 1 3 2 0
## E2_Debt.contract.relief E3_Fiscal.measures E4_International.support
## 183 2 0 0
## 184 2 0 0
## 185 2 0 0
## 186 2 0 0
## 187 2 0 0
## 188 2 0 0
## H1_Public.information.campaigns H1_Flag H2_Testing.policy
## 183 2 1 2
## 184 2 1 2
## 185 2 1 2
## 186 2 1 2
## 187 2 1 2
## 188 2 1 2
## H3_Contact.tracing H4_Emergency.investment.in.healthcare
## 183 1 0
## 184 1 0
## 185 1 0
## 186 1 0
## 187 1 0
## 188 1 0
## H5_Investment.in.vaccines M1_Wildcard ConfirmedCases ConfirmedDeaths
## 183 0 NA 799 10
## 184 0 NA 799 10
## 185 0 NA 800 10
## 186 0 NA 800 10
## 187 0 NA 801 10
## 188 0 NA 801 10
## StringencyIndex StringencyIndexForDisplay StringencyLegacyIndex
## 183 32.41 32.41 41.67
## 184 32.41 32.41 41.67
## 185 32.41 32.41 41.67
## 186 32.41 32.41 41.67
## 187 32.41 32.41 41.67
## 188 32.41 32.41 41.67
## StringencyLegacyIndexForDisplay GovernmentResponseIndex
## 183 41.67 44.87
## 184 41.67 44.87
## 185 41.67 44.87
## 186 41.67 44.87
## 187 41.67 44.87
## 188 41.67 44.87
## GovernmentResponseIndexForDisplay ContainmentHealthIndex
## 183 44.87 37.12
## 184 44.87 37.12
## 185 44.87 37.12
## 186 44.87 37.12
## 187 44.87 37.12
## 188 44.87 37.12
## ContainmentHealthIndexForDisplay EconomicSupportIndex
## 183 37.12 87.5
## 184 37.12 87.5
## 185 37.12 87.5
## 186 37.12 87.5
## 187 37.12 87.5
## 188 37.12 87.5
## EconomicSupportIndexForDisplay
## 183 87.5
## 184 87.5
## 185 87.5
## 186 87.5
## 187 87.5
## 188 87.5
Now that all dates correspond to the last 30 days, we can aggregate all columns except name and date by country code. Since R does not have a built-in function to calculate the mode (i.e. the most frequent value) of a sequence, we will create a new function to do that. This function will use table()
to calculate the frequency of each category and which()
to index the categories whose frequency equals the maximum frequency. If there is a tie, we will pick the intervention that was in place first. If all input values are NA, we return NA:
get_mode <- function(x){
  # if every value is missing, the mode is undefined
  if (all(is.na(x))) {return(NA)}
  # frequency of each category (table() ignores NAs by default)
  freqs <- table(x)
  # all categories tied for the highest frequency
  most_common <- names(freqs)[which(freqs == max(freqs))]
  if (length(most_common) == 1) {
    return(most_common)
  } else {
    # break ties by keeping the category that appeared first in x
    first_occurrence <- match(most_common, x)
    most_common <- most_common[which(first_occurrence == min(first_occurrence))]
    return(most_common)
  }
}
Let’s check if the function works as expected:
get_mode(c(0,0,1,1))
## [1] "0"
get_mode(c(1,1,0,0))
## [1] "1"
get_mode(c('A','A','C','B','E'))
## [1] "A"
get_mode(c(0,0,1,1,NA,NA,NA))
## [1] "0"
get_mode(c(NA,NA,NA))
## [1] NA
It does! Let’s apply it to our data.frame with aggregate()
:
only_responses <- country_response[4:length(country_response)]
common_responses <- aggregate(only_responses, by = country_response['CountryCode'], FUN = get_mode)
head(common_responses)
## CountryCode C1_School.closing C1_Flag C2_Workplace.closing C2_Flag
## 1 ABW 0 <NA> 1 1
## 2 AFG 3 1 3 0
## 3 AGO 3 1 2 0
## 4 AIA 2 1 <NA> <NA>
## 5 ALB 2 1 2 1
## 6 AND 0 <NA> 2 1
## C3_Cancel.public.events C3_Flag C4_Restrictions.on.gatherings C4_Flag
## 1 0 <NA> 0 <NA>
## 2 2 1 4 1
## 3 2 1 4 0
## 4 <NA> <NA> <NA> <NA>
## 5 2 1 4 1
## 6 2 1 3 1
## C5_Close.public.transport C5_Flag C6_Stay.at.home.requirements C6_Flag
## 1 0 <NA> 1 1
## 2 2 0 2 0
## 3 1 0 2 0
## 4 <NA> <NA> <NA> <NA>
## 5 1 1 0 <NA>
## 6 1 1 1 1
## C7_Restrictions.on.internal.movement C7_Flag C8_International.travel.controls
## 1 1 1 3
## 2 2 0 1
## 3 2 0 4
## 4 <NA> <NA> <NA>
## 5 0 <NA> 2
## 6 0 <NA> 2
## E1_Income.support E1_Flag E2_Debt.contract.relief E3_Fiscal.measures
## 1 2 0 2 0
## 2 0 <NA> 0 0
## 3 1 1 0 0
## 4 1 1 1 <NA>
## 5 1 1 2 0
## 6 2 1 2 0
## E4_International.support H1_Public.information.campaigns H1_Flag
## 1 0 2 1
## 2 0 2 1
## 3 0 2 1
## 4 <NA> 2 1
## 5 0 2 1
## 6 0 0 1
## H2_Testing.policy H3_Contact.tracing H4_Emergency.investment.in.healthcare
## 1 2 1 0
## 2 1 1 0
## 3 2 0 0
## 4 <NA> 2 <NA>
## 5 1 2 0
## 6 2 1 0
## H5_Investment.in.vaccines M1_Wildcard ConfirmedCases ConfirmedDeaths
## 1 0 NA 801 10
## 2 0 NA 34451 1010
## 3 0 NA 346 29
## 4 <NA> NA 3 0
## 5 0 NA 2535 83
## 6 0 NA 855 52
## StringencyIndex StringencyIndexForDisplay StringencyLegacyIndex
## 1 32.41 32.41 41.67
## 2 78.7 78.7 76.19
## 3 76.39 76.39 78.57
## 4 <NA> 78.7 <NA>
## 5 59.26 59.26 66.67
## 6 41.67 41.67 48.81
## StringencyLegacyIndexForDisplay GovernmentResponseIndex
## 1 41.67 44.87
## 2 76.19 60.9
## 3 78.57 61.86
## 4 78.57 <NA>
## 5 66.67 62.82
## 6 48.81 53.21
## GovernmentResponseIndexForDisplay ContainmentHealthIndex
## 1 44.87 37.12
## 2 60.9 71.97
## 3 61.86 68.56
## 4 <NA> <NA>
## 5 62.82 60.61
## 6 53.21 44.7
## ContainmentHealthIndexForDisplay EconomicSupportIndex
## 1 37.12 87.5
## 2 71.97 0
## 3 68.56 25
## 4 <NA> 50
## 5 60.61 75
## 6 44.7 100
## EconomicSupportIndexForDisplay
## 1 87.5
## 2 0
## 3 25
## 4 50
## 5 75
## 6 100
Also, the number of new cases on a particular day depends on factors such as how many people were working on tests (fewer on weekends, for example), so instead of relying on a single day, let’s average the daily cases over the last 5 days as our response variable. We will use the same trick as above, but on the country_covid
table. We will also remove the columns about total cases, since averaging those does not make sense.
columns_total <- c('total_cases', 'total_deaths', 'total_cases_per_million', 'total_deaths_per_million', 'total_tests', 'total_tests_per_thousand')
idx <- match(columns_total, colnames(country_covid))
country_covid_5 <- country_covid[Sys.Date() - country_covid$date < 5, -idx ]
only_covid <- country_covid_5[4:length(country_covid_5)]
average_covid <- aggregate(only_covid, by = country_covid_5['iso_code'], FUN = mean, na.rm = T)
head(average_covid)
## iso_code date new_cases new_deaths new_cases_per_million
## 1 ABW 2020-07-29 0.2 0.0 1.8732
## 2 AFG 2020-07-29 77.0 2.4 1.9780
## 3 AGO 2020-07-29 39.6 2.6 1.2048
## 4 AIA 2020-07-29 0.0 0.0 0.0000
## 5 ALB 2020-07-29 112.0 4.0 38.9186
## 6 AND 2020-07-29 5.0 0.0 64.7124
## new_deaths_per_million new_tests new_tests_per_thousand new_tests_smoothed
## 1 0.0000 NaN NaN NaN
## 2 0.0618 NaN NaN NaN
## 3 0.0790 NaN NaN NaN
## 4 0.0000 NaN NaN NaN
## 5 1.3900 NaN NaN NaN
## 6 0.0000 NaN NaN NaN
## new_tests_smoothed_per_thousand tests_units stringency_index population
## 1 NaN NA NaN 106766
## 2 NaN NA NaN 38928341
## 3 NaN NA 76.39 32866268
## 4 NaN NA NaN 15002
## 5 NaN NA 59.26 2877800
## 6 NaN NA 47.22 77265
## population_density median_age aged_65_older aged_70_older gdp_per_capita
## 1 584.800 41.2 13.085 7.452 35973.781
## 2 54.422 18.6 2.581 1.337 1803.987
## 3 23.890 16.8 2.405 1.362 5819.495
## 4 NaN NaN NaN NaN NaN
## 5 104.871 38.0 13.188 8.643 11803.431
## 6 163.755 NaN NaN NaN NaN
## extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers
## 1 NaN NaN 11.62 NaN
## 2 NaN 597.029 9.59 NaN
## 3 NaN 276.045 3.94 NaN
## 4 NaN NaN NaN NaN
## 5 1.1 304.195 10.08 7.1
## 6 NaN 109.135 7.97 29.0
## male_smokers handwashing_facilities hospital_beds_per_thousand
## 1 NaN NaN NaN
## 2 NaN 37.746 0.50
## 3 NaN 26.664 NaN
## 4 NaN NaN NaN
## 5 51.2 NaN 2.89
## 6 37.8 NaN NaN
## life_expectancy
## 1 76.29
## 2 64.83
## 3 61.15
## 4 81.88
## 5 78.57
## 6 83.73
The aggregate function above returns a warning several times: argument is not numeric or logical: returning NA
. This warning is shown every time aggregate
tries to take the average of values that are not numbers. This is the case for the column tests_units
, which is a character with values such as “tests performed”. In this case, taking the mean does not make sense and the whole column is filled with NAs. Since we will not use this column here, we will just ignore the warning.
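You can reproduce this behavior on a toy vector (the exact warning text may vary slightly):
# taking the mean of a character vector warns and returns NA
mean(c('tests performed', 'tests performed'))
## Warning: argument is not numeric or logical: returning NA
## [1] NA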
Now we will record total cases, total deaths, etc for the last day in our dataset. To do that, we will first select the relevant columns, then sort by date, and then use aggregate to keep only the first record for each country:
total_covid <- country_covid[,c('iso_code','date',columns_total)]
head(total_covid)
## iso_code date total_cases total_deaths total_cases_per_million
## 1 ABW 2020-03-13 2 0 18.733
## 2 ABW 2020-03-20 4 0 37.465
## 3 ABW 2020-03-24 12 0 112.395
## 4 ABW 2020-03-25 17 0 159.227
## 5 ABW 2020-03-26 19 0 177.959
## 6 ABW 2020-03-27 28 0 262.256
## total_deaths_per_million total_tests total_tests_per_thousand
## 1 0 NA NA
## 2 0 NA NA
## 3 0 NA NA
## 4 0 NA NA
## 5 0 NA NA
## 6 0 NA NA
Now we can sort by date. Even though there is a function named sort()
, it sorts vectors, not data.frames. To sort a data.frame, we have to use the function order()
to figure out the order in which one or more columns should be sorted, and then use the result to index the whole data.frame by row. We will also use the function with() so we can refer to columns by their names.
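To see what order() returns, here is a toy example:
x <- c(30, 10, 20)
order(x)     # row positions that would sort x
## [1] 2 3 1
x[order(x)]  # x sorted in increasing order
## [1] 10 20 30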
So to sort this data.frame in inverse order, first by country and then by date:
total_covid <- with(total_covid, total_covid[order(iso_code,date,decreasing = T),])
Now we can use aggregate()
to get just the first row for each country. We will exclude columns 1 and 2 (country and date):
get_first <- function(x) x[1]
total_covid <- aggregate(total_covid[, -c(1, 2)], by = total_covid['iso_code'], FUN = get_first)
Lastly, history is probably important, so we also want to record the number of daily cases from about a month ago, before our intervention window starts. We will average the number of daily cases between 30 and 35 days ago. We will also rename the column so it is not confused with the recent new cases.
country_covid_30 <- country_covid[Sys.Date() - country_covid$date < 35 & Sys.Date() - country_covid$date >= 30,]
past_covid <- aggregate(country_covid_30['new_cases'],
by = country_covid_30['iso_code'],
FUN = mean,
na.rm = T)
colnames(past_covid)[2] <- 'past_cases'
head(past_covid)
## iso_code past_cases
## 1 ABW 0.4
## 2 AFG 268.4
## 3 AGO 17.4
## 4 AIA 0.0
## 5 ALB 68.6
## 6 AND 0.0
Joining datasets
Now that we have cleaned and summarized the data, let’s join the datasets. Joining average_covid
and past_covid
is easy because the column for country code has the same name in both. We use the function merge()
and provide the two data.frames as arguments:
joint_covid <- merge(past_covid, average_covid)
We will now do the same with the total number of cases.
joint_covid <- merge(joint_covid, total_covid)
To join the next table, we need to say which column in each table corresponds to the country code, since they have different column names. We do that by using the argument by.x
to provide the column name in the first table and by.y
for the second table. By default, merge
will drop countries that are not present in both tables. If this is not the intended behavior, check the function help for the argument all
; a toy illustration follows the call below.
joint_covid <- merge(joint_covid, common_responses, by.x = 'iso_code', by.y= 'CountryCode', all = FALSE)
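As a toy illustration of by.x, by.y and all (hypothetical tables, not part of the pipeline):
# two tiny tables whose country-code columns have different names
a <- data.frame(iso_code = c('BRA', 'USA'), cases = c(10, 20))
b <- data.frame(CountryCode = c('BRA', 'CHN'), score = c(1, 2))
merge(a, b, by.x = 'iso_code', by.y = 'CountryCode')              # keeps only BRA
merge(a, b, by.x = 'iso_code', by.y = 'CountryCode', all = TRUE)  # keeps all three, filling NAs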
Now let’s have a look at the data.frame
head(joint_covid)
## iso_code past_cases date new_cases new_deaths new_cases_per_million
## 1 ABW 0.4 2020-07-29 0.2 0.0 1.8732
## 2 AFG 268.4 2020-07-29 77.0 2.4 1.9780
## 3 AGO 17.4 2020-07-29 39.6 2.6 1.2048
## 4 AIA 0.0 2020-07-29 0.0 0.0 0.0000
## 5 ALB 68.6 2020-07-29 112.0 4.0 38.9186
## 6 AND 0.0 2020-07-29 5.0 0.0 64.7124
## new_deaths_per_million new_tests new_tests_per_thousand new_tests_smoothed
## 1 0.0000 NaN NaN NaN
## 2 0.0618 NaN NaN NaN
## 3 0.0790 NaN NaN NaN
## 4 0.0000 NaN NaN NaN
## 5 1.3900 NaN NaN NaN
## 6 0.0000 NaN NaN NaN
## new_tests_smoothed_per_thousand tests_units stringency_index population
## 1 NaN NA NaN 106766
## 2 NaN NA NaN 38928341
## 3 NaN NA 76.39 32866268
## 4 NaN NA NaN 15002
## 5 NaN NA 59.26 2877800
## 6 NaN NA 47.22 77265
## population_density median_age aged_65_older aged_70_older gdp_per_capita
## 1 584.800 41.2 13.085 7.452 35973.781
## 2 54.422 18.6 2.581 1.337 1803.987
## 3 23.890 16.8 2.405 1.362 5819.495
## 4 NaN NaN NaN NaN NaN
## 5 104.871 38.0 13.188 8.643 11803.431
## 6 163.755 NaN NaN NaN NaN
## extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers
## 1 NaN NaN 11.62 NaN
## 2 NaN 597.029 9.59 NaN
## 3 NaN 276.045 3.94 NaN
## 4 NaN NaN NaN NaN
## 5 1.1 304.195 10.08 7.1
## 6 NaN 109.135 7.97 29.0
## male_smokers handwashing_facilities hospital_beds_per_thousand
## 1 NaN NaN NaN
## 2 NaN 37.746 0.50
## 3 NaN 26.664 NaN
## 4 NaN NaN NaN
## 5 51.2 NaN 2.89
## 6 37.8 NaN NaN
## life_expectancy total_cases total_deaths total_cases_per_million
## 1 76.29 120 3 1123.953
## 2 64.83 36542 1271 938.699
## 3 61.15 1078 48 32.800
## 4 81.88 3 0 199.973
## 5 78.57 5197 154 1805.893
## 6 83.73 922 52 11932.958
## total_deaths_per_million total_tests total_tests_per_thousand
## 1 28.099 NA NA
## 2 32.650 NA NA
## 3 1.460 NA NA
## 4 0.000 NA NA
## 5 53.513 NA NA
## 6 673.008 NA NA
## C1_School.closing C1_Flag C2_Workplace.closing C2_Flag
## 1 0 <NA> 1 1
## 2 3 1 3 0
## 3 3 1 2 0
## 4 2 1 <NA> <NA>
## 5 2 1 2 1
## 6 0 <NA> 2 1
## C3_Cancel.public.events C3_Flag C4_Restrictions.on.gatherings C4_Flag
## 1 0 <NA> 0 <NA>
## 2 2 1 4 1
## 3 2 1 4 0
## 4 <NA> <NA> <NA> <NA>
## 5 2 1 4 1
## 6 2 1 3 1
## C5_Close.public.transport C5_Flag C6_Stay.at.home.requirements C6_Flag
## 1 0 <NA> 1 1
## 2 2 0 2 0
## 3 1 0 2 0
## 4 <NA> <NA> <NA> <NA>
## 5 1 1 0 <NA>
## 6 1 1 1 1
## C7_Restrictions.on.internal.movement C7_Flag C8_International.travel.controls
## 1 1 1 3
## 2 2 0 1
## 3 2 0 4
## 4 <NA> <NA> <NA>
## 5 0 <NA> 2
## 6 0 <NA> 2
## E1_Income.support E1_Flag E2_Debt.contract.relief E3_Fiscal.measures
## 1 2 0 2 0
## 2 0 <NA> 0 0
## 3 1 1 0 0
## 4 1 1 1 <NA>
## 5 1 1 2 0
## 6 2 1 2 0
## E4_International.support H1_Public.information.campaigns H1_Flag
## 1 0 2 1
## 2 0 2 1
## 3 0 2 1
## 4 <NA> 2 1
## 5 0 2 1
## 6 0 0 1
## H2_Testing.policy H3_Contact.tracing H4_Emergency.investment.in.healthcare
## 1 2 1 0
## 2 1 1 0
## 3 2 0 0
## 4 <NA> 2 <NA>
## 5 1 2 0
## 6 2 1 0
## H5_Investment.in.vaccines M1_Wildcard ConfirmedCases ConfirmedDeaths
## 1 0 NA 801 10
## 2 0 NA 34451 1010
## 3 0 NA 346 29
## 4 <NA> NA 3 0
## 5 0 NA 2535 83
## 6 0 NA 855 52
## StringencyIndex StringencyIndexForDisplay StringencyLegacyIndex
## 1 32.41 32.41 41.67
## 2 78.7 78.7 76.19
## 3 76.39 76.39 78.57
## 4 <NA> 78.7 <NA>
## 5 59.26 59.26 66.67
## 6 41.67 41.67 48.81
## StringencyLegacyIndexForDisplay GovernmentResponseIndex
## 1 41.67 44.87
## 2 76.19 60.9
## 3 78.57 61.86
## 4 78.57 <NA>
## 5 66.67 62.82
## 6 48.81 53.21
## GovernmentResponseIndexForDisplay ContainmentHealthIndex
## 1 44.87 37.12
## 2 60.9 71.97
## 3 61.86 68.56
## 4 <NA> <NA>
## 5 62.82 60.61
## 6 53.21 44.7
## ContainmentHealthIndexForDisplay EconomicSupportIndex
## 1 37.12 87.5
## 2 71.97 0
## 3 68.56 25
## 4 <NA> 50
## 5 60.61 75
## 6 44.7 100
## EconomicSupportIndexForDisplay
## 1 87.5
## 2 0
## 3 25
## 4 50
## 5 75
## 6 100
Adding country names and region data
You might have noticed that in the process of summarizing and merging we only used country codes and lost country names. The reason is that these three-letter codes are standardized across datasets (you can read more on Wikipedia), while country names may vary. For example, one of our tables uses Kyrgyz Republic and another uses Kyrgyzstan for the same country, but both use the same 3-letter code: KGZ. To finalize the preparation of our dataset, we will add back country names and also include information about the continent in which each country is located.
The R package rworldmap
can be used to retrieve this data. If you do not have it installed, you can do so with install.packages('rworldmap')
. One way to access functions in a package is to load it first with library(package_name)
. Another way that we have not seen before is to call a function without loading the package (it still has to be installed), using package_name::function_name
. This makes it clear which package a function comes from, so you remember later when you read your code, and it is convenient when we just want one function, not the whole package. Since we will not plot maps today and just want access to one data table, we will call the function directly without loading the package. The function getMap()
in this package returns a map object that includes a data frame with country data. We would need additional R packages (namely, the package sp
) to actually plot the map, so for now we will simply transform the map into a simple data frame with as.data.frame()
and keep only the data.
all_country_data <- as.data.frame(rworldmap::getMap())
head(all_country_data)
## ScaleRank LabelRank FeatureCla SOVEREIGNT SOV_A3 ADM0_DIF
## 1 1 1 Admin-0 countries Afghanistan AFG 0
## 2 1 1 Admin-0 countries Angola AGO 0
## 3 1 1 Admin-0 countries Albania ALB 0
## 4 1 1 Admin-0 countries United Arab Emirates ARE 0
## 5 1 1 Admin-0 countries Argentina ARG 0
## 6 1 1 Admin-0 countries Armenia ARM 0
## LEVEL TYPE ADMIN ADM0_A3 GEOU_DIF
## 1 2 Sovereign country Afghanistan AFG 0
## 2 2 Sovereign country Angola AGO 0
## 3 2 Sovereign country Albania ALB 0
## 4 2 Sovereign country United Arab Emirates ARE 0
## 5 2 Sovereign country Argentina ARG 0
## 6 2 Sovereign country Armenia ARM 0
## GEOUNIT GU_A3 SU_DIF SUBUNIT SU_A3
## 1 Afghanistan AFG 0 Afghanistan AFG
## 2 Angola AGO 0 Angola AGO
## 3 Albania ALB 0 Albania ALB
## 4 United Arab Emirates ARE 0 United Arab Emirates ARE
## 5 Argentina ARG 0 Argentina ARG
## 6 Armenia ARM 0 Armenia ARM
## NAME ABBREV POSTAL NAME_FORMA TERR_
## 1 Afghanistan Afg. AF Islamic State of Afghanistan <NA>
## 2 Angola Ang. AO Republic of Angola <NA>
## 3 Albania Alb. AL Republic of Albania <NA>
## 4 United Arab Emirates U.A.E. AE <NA> <NA>
## 5 Argentina Arg. AR Argentine Republic <NA>
## 6 Armenia Arm. ARM Republic of Armenia <NA>
## NAME_SORT MAP_COLOR POP_EST GDP_MD_EST FIPS_10_ ISO_A2 ISO_A3
## 1 Afghanistan 7 28400000 22270 0 AF AFG
## 2 Angola 1 12799293 110300 0 AO AGO
## 3 Albania 6 3639453 21810 0 AL ALB
## 4 United Arab Emirates 3 4798491 184300 0 AE ARE
## 5 Argentina 13 40913584 573900 0 AR ARG
## 6 Armenia 10 2967004 18770 0 AM ARM
## ISO_N3 ISO3 LON LAT ISO3.1 ADMIN.1 REGION
## 1 4 AFG 66.08669 33.85640 AFG Afghanistan Asia
## 2 24 AGO 17.50291 -12.29155 AGO Angola Africa
## 3 8 ALB 20.03243 41.14135 ALB Albania Europe
## 4 784 ARE 54.20671 23.86863 ARE United Arab Emirates Asia
## 5 32 ARG -65.14954 -35.22017 ARG Argentina South America
## 6 51 ARM 45.00029 40.21661 ARM Armenia Europe
## continent GEO3major GEO3
## 1 Eurasia Asia and the Pacific South Asia
## 2 Africa Africa Southern Africa
## 3 Eurasia Europe Central Europe
## 4 Eurasia West Asia Arabian Peninsula
## 5 South America Latin America and the Caribbean South America
## 6 Eurasia Europe Eastern Europe
## IMAGE24 GLOCAF Stern SRESmajor
## 1 India+ Rest of South Asia Central Asia ASIA
## 2 Southern Africa Sub-Sarahan Africa South+E Africa ALM
## 3 Central Europe Europe Europe REF
## 4 Middle East Middle East West Asia ALM
## 5 Rest South America Rest of South Am South America ALM
## 6 Russia+ FSU Europe REF
## SRES GBD
## 1 South Asia (SAS) Asia, South
## 2 Sub-Saharan Africa (AFR) Sub-Saharan Africa, Central
## 3 Central and Eastern Europe (EEU) Europe, Central
## 4 Middle East and North Africa (MEA) North Africa/Middle East
## 5 Latin America and the Caribbean (LAM) Latin America, Southern
## 6 Newly Independent States of FSU (FSU) Asia, Central
## AVOIDnumeric AVOIDname LDC SID LLDC
## 1 21 Rest of Central Asia LDC other LLDC
## 2 24 Southern and East Africa LDC other other
## 3 25 Europe other other other
## 4 30 Middle East other other other
## 5 26 South America other other other
## 6 25 Europe other other LLDC
The data frame above includes a lot more information than we need. We only want 3 things: 1. Country code, to match with our dataset 2. Country name 3. Region in which the country is located
There are several options for regions; we will use GEO3major, which splits the world into 7 regions:
levels(all_country_data$GEO3major)
## [1] "Africa" "Asia and the Pacific"
## [3] "Europe" "Latin America and the Caribbean"
## [5] "North America" "Polar"
## [7] "West Asia"
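Note that levels() only works here because this column is stored as a factor. Depending on your R version and the version of rworldmap, the column may come back as plain character instead; in that case, an equivalent way to list the regions is:
sort(unique(as.character(all_country_data$GEO3major)))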
Let’s select these columns:
all_country_data <- all_country_data[c('ISO3','NAME','GEO3major')]
head(all_country_data)
## ISO3 NAME GEO3major
## 1 AFG Afghanistan Asia and the Pacific
## 2 AGO Angola Africa
## 3 ALB Albania Europe
## 4 ARE United Arab Emirates West Asia
## 5 ARG Argentina Latin America and the Caribbean
## 6 ARM Armenia Europe
Let’s also rename the columns to names that are easier to understand.
colnames(all_country_data) <- c('iso_code','country_name','world_region')
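As an aside, if you prefer the tidyverse style, selecting and renaming can be done in a single step; a sketch assuming the dplyr package is installed:
library(dplyr)
all_country_data <- as.data.frame(rworldmap::getMap()) %>%
  select(iso_code = ISO3, country_name = NAME, world_region = GEO3major) # select and rename at once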
Since the column with the country code now has the same name in both data frames, we can call merge() without additional arguments.
joint_covid <- merge(all_country_data, joint_covid)
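By default, merge() joins the two tables on every column name they share and keeps only the rows present in both (an inner join). Assuming iso_code is the only column name our two tables share, the call above is equivalent to this more explicit version:
joint_covid <- merge(all_country_data, joint_covid, by = 'iso_code')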
This is our final dataset:
head(joint_covid)
## iso_code country_name world_region past_cases date
## 1 ABW Aruba Latin America and the Caribbean 0.4 2020-07-29
## 2 AFG Afghanistan Asia and the Pacific 268.4 2020-07-29
## 3 AGO Angola Africa 17.4 2020-07-29
## 4 AIA Anguilla Latin America and the Caribbean 0.0 2020-07-29
## 5 ALB Albania Europe 68.6 2020-07-29
## 6 AND Andorra Europe 0.0 2020-07-29
## new_cases new_deaths new_cases_per_million new_deaths_per_million new_tests
## 1 0.2 0.0 1.8732 0.0000 NaN
## 2 77.0 2.4 1.9780 0.0618 NaN
## 3 39.6 2.6 1.2048 0.0790 NaN
## 4 0.0 0.0 0.0000 0.0000 NaN
## 5 112.0 4.0 38.9186 1.3900 NaN
## 6 5.0 0.0 64.7124 0.0000 NaN
## new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand
## 1 NaN NaN NaN
## 2 NaN NaN NaN
## 3 NaN NaN NaN
## 4 NaN NaN NaN
## 5 NaN NaN NaN
## 6 NaN NaN NaN
## tests_units stringency_index population population_density median_age
## 1 NA NaN 106766 584.800 41.2
## 2 NA NaN 38928341 54.422 18.6
## 3 NA 76.39 32866268 23.890 16.8
## 4 NA NaN 15002 NaN NaN
## 5 NA 59.26 2877800 104.871 38.0
## 6 NA 47.22 77265 163.755 NaN
## aged_65_older aged_70_older gdp_per_capita extreme_poverty
## 1 13.085 7.452 35973.781 NaN
## 2 2.581 1.337 1803.987 NaN
## 3 2.405 1.362 5819.495 NaN
## 4 NaN NaN NaN NaN
## 5 13.188 8.643 11803.431 1.1
## 6 NaN NaN NaN NaN
## cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers
## 1 NaN 11.62 NaN NaN
## 2 597.029 9.59 NaN NaN
## 3 276.045 3.94 NaN NaN
## 4 NaN NaN NaN NaN
## 5 304.195 10.08 7.1 51.2
## 6 109.135 7.97 29.0 37.8
## handwashing_facilities hospital_beds_per_thousand life_expectancy total_cases
## 1 NaN NaN 76.29 120
## 2 37.746 0.50 64.83 36542
## 3 26.664 NaN 61.15 1078
## 4 NaN NaN 81.88 3
## 5 NaN 2.89 78.57 5197
## 6 NaN NaN 83.73 922
## total_deaths total_cases_per_million total_deaths_per_million total_tests
## 1 3 1123.953 28.099 NA
## 2 1271 938.699 32.650 NA
## 3 48 32.800 1.460 NA
## 4 0 199.973 0.000 NA
## 5 154 1805.893 53.513 NA
## 6 52 11932.958 673.008 NA
## total_tests_per_thousand C1_School.closing C1_Flag C2_Workplace.closing
## 1 NA 0 <NA> 1
## 2 NA 3 1 3
## 3 NA 3 1 2
## 4 NA 2 1 <NA>
## 5 NA 2 1 2
## 6 NA 0 <NA> 2
## C2_Flag C3_Cancel.public.events C3_Flag C4_Restrictions.on.gatherings C4_Flag
## 1 1 0 <NA> 0 <NA>
## 2 0 2 1 4 1
## 3 0 2 1 4 0
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 1 2 1 4 1
## 6 1 2 1 3 1
## C5_Close.public.transport C5_Flag C6_Stay.at.home.requirements C6_Flag
## 1 0 <NA> 1 1
## 2 2 0 2 0
## 3 1 0 2 0
## 4 <NA> <NA> <NA> <NA>
## 5 1 1 0 <NA>
## 6 1 1 1 1
## C7_Restrictions.on.internal.movement C7_Flag C8_International.travel.controls
## 1 1 1 3
## 2 2 0 1
## 3 2 0 4
## 4 <NA> <NA> <NA>
## 5 0 <NA> 2
## 6 0 <NA> 2
## E1_Income.support E1_Flag E2_Debt.contract.relief E3_Fiscal.measures
## 1 2 0 2 0
## 2 0 <NA> 0 0
## 3 1 1 0 0
## 4 1 1 1 <NA>
## 5 1 1 2 0
## 6 2 1 2 0
## E4_International.support H1_Public.information.campaigns H1_Flag
## 1 0 2 1
## 2 0 2 1
## 3 0 2 1
## 4 <NA> 2 1
## 5 0 2 1
## 6 0 0 1
## H2_Testing.policy H3_Contact.tracing H4_Emergency.investment.in.healthcare
## 1 2 1 0
## 2 1 1 0
## 3 2 0 0
## 4 <NA> 2 <NA>
## 5 1 2 0
## 6 2 1 0
## H5_Investment.in.vaccines M1_Wildcard ConfirmedCases ConfirmedDeaths
## 1 0 NA 801 10
## 2 0 NA 34451 1010
## 3 0 NA 346 29
## 4 <NA> NA 3 0
## 5 0 NA 2535 83
## 6 0 NA 855 52
## StringencyIndex StringencyIndexForDisplay StringencyLegacyIndex
## 1 32.41 32.41 41.67
## 2 78.7 78.7 76.19
## 3 76.39 76.39 78.57
## 4 <NA> 78.7 <NA>
## 5 59.26 59.26 66.67
## 6 41.67 41.67 48.81
## StringencyLegacyIndexForDisplay GovernmentResponseIndex
## 1 41.67 44.87
## 2 76.19 60.9
## 3 78.57 61.86
## 4 78.57 <NA>
## 5 66.67 62.82
## 6 48.81 53.21
## GovernmentResponseIndexForDisplay ContainmentHealthIndex
## 1 44.87 37.12
## 2 60.9 71.97
## 3 61.86 68.56
## 4 <NA> <NA>
## 5 62.82 60.61
## 6 53.21 44.7
## ContainmentHealthIndexForDisplay EconomicSupportIndex
## 1 37.12 87.5
## 2 71.97 0
## 3 68.56 25
## 4 <NA> 50
## 5 60.61 75
## 6 44.7 100
## EconomicSupportIndexForDisplay
## 1 87.5
## 2 0
## 3 25
## 4 50
## 5 75
## 6 100
We will now save it to use later. Since this code pulls up-to-date data every time it is run, let's include today's date in the file name. We can concatenate character vectors with the function paste0(), get today's date with Sys.Date(), and convert it to a character string with as.character():
filename <- paste0('covid_data_', as.character(Sys.Date()), '.csv')
write.csv(x = joint_covid, file = filename)
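For example, running this on 2020-07-29 creates a file named covid_data_2020-07-29.csv. In a later session, the same file can be loaded back with read.csv(), telling it that the first column holds row numbers:
covid_data <- read.csv('covid_data_2020-07-29.csv', row.names = 1)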