knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE )
This Vignette
presents the methodology that allows to import Epidemium data and merge them. The following Epidemium data will be used
training_IARC.csv
has been preferred to training.txt, the latter exhibting more missing values.
To ensure Vignette
compilation, every dataframe is downloaded from the internet. It might be time-consuming. It is thus recommended to rather manually download required files using a web browser, put them in a subdirectory within the working directory. We called this directory datadir
in this Vignette
:
datadir <- paste0(getwd(),"/inst")
If all the required files can be find in datadir
, you can safely use our functions without the web-import option. All the import_
functions should work when setting fromURL = F
and the speed gain should be around 50\%. The filename
and path
arguments have been filled for that: they are not necessary when importing data from the internet but are necessary if you upload from a computer.
The only required package in this Vignette
is
library(OpenCancer)
The import_*
functions share the same structure. The parameters colstokeep
allow to only keep a few variables. If colstokeep = NULL
, all variables are kept in memory. In this Vignette
, we will only keep a few variables in memory (100 variables) but, if you want the whole dataframes, replicate these examples with colstokeep = NULL
.
FAO data can be imported with import_FAO
df_FAO <- import_FAO(path = datadir, colstokeep = 1:100, fromURL = T) knitr::kable(head(df_FAO[!is.na(df_FAO[,10]),1:10]), caption = "FAO Data")
World Bank data are imported with import_WB
df_WB <- import_WB(path = datadir, colstokeep = 1:100, fromURL = T) knitr::kable(head(df_WB[!is.na(df_WB[,6]),1:6]), caption = "World Bank Data")
ILO data are imported with import_ILO
df_ILO <- import_ILO(path = datadir, colstokeep = 1:100, fromURL = T) knitr::kable(head(df_ILO[!is.na(df_ILO[,6]),1:6]))
import_training
data allows to import both training_IARC.csv and training.txt. For its completeness, the former has been preferred.
df_training <- import_training(path = datadir, colstokeep = NULL, filename = "training_IARC.csv", fromURL = TRUE, url = "http://qa.epidemium.cc/data/shinyapp/training_IARC.csv")
knitr::kable(head(na.omit(df_training)), caption = "Incidence Data")
The following function has been designed to easily import from the Epidemium website the labels of all the variables names.
df_label <- import_coding()
It requires an internet connection since files are imported from Epidemium
website.
To ensure data will be consistently merged, we import a file called creation_zonier.xls
file, stored in Github
. All countries in the FAO, ILO and World Bank database should thus be assigned a unique and consistent key, a necessary step before merging dataframes.
link <- "https://github.com/EpidemiumOpenCancer/OpenCancer/raw/master/vignettes/inst/creation%20zonier.xls" download.file(link,destfile = paste0(datadir,"/creation_zonier.xls"),mode = "wb") codes <- readxl::read_excel(path = paste0(datadir,"/creation_zonier.xls"), sheet = "Transco_Country") knitr::kable(head(codes), caption = 'Coding convention for countries')
Using the codes
dataframe, we use the matchkeys
function to match countries in a dataframe (e.g. df_FAO
) with its identifiers. It only requires to match a dataframe with its source (default is 'FAO')
df_FAO = matchkeys(df_FAO,codes) df_WB = matchkeys(df_WB,codes,origin = "World Bank") df_ILO = matchkeys(df_ILO,codes, origin = "IloStat")
At this stage, all explanatory variables dataframes share consistent keys that allow to safely merge them.
# LEFT JOIN FAO AND WORLD BANK DATA: NO RESTRICTION TO POST 1970 DATA df_covariates <- dplyr::full_join(df_FAO,df_WB, by = c("Country_Transco","year","Zonier")) # JOIN AVEC ILO df_covariates <- dplyr::left_join(df_covariates,df_ILO, by = c("Country_Transco","year","Zonier"))
To ensure consistency with the training data, we control the countries that do not match with the keys list. The number is limited for both training.txt
(8 countries) and training.csv
(3 countries). This can be handled manually
# CHANGING MISMATCHED NAMES WITH TRAINING DATASET oldnames <- unique(df_training$country[!df_training$country %in% intersect(df_covariates$Country_Transco,df_training$country)]) newnames <- c("Netherlands","Slovakia","United States") # IF DATAFRAME IS training.csv UNCOMMENT #newnames <- c("Netherlands","Slovakia","United States","Gambia, The","Australian Capital Territory", # "South Korea","VietNam","Yugoslavia") df_training$country <- plyr::mapvalues(df_training$country, from=oldnames, to=newnames, warn_missing = FALSE)
The final step in the merging process is to join the incidence dataset with other variables
# HARMONIZING COLUMN NAMES df_training <- df_training %>% dplyr::rename(Country_Transco = country) # IF NO ETHNICITY IS GIVEN, ASSUMED ALL POPULATION IS PRESENTED df_training$ethnicity[is.na(df_training$ethnicity)] <- "All population" df_training$region[is.na(df_training$region)] <- "All regions" # JOIN WITH TRAINING DATASET df_full <- dplyr::left_join(df_training,df_covariates)
Finally, we clean the dataframe to get a proper training tables that can be interpolated. We ensure we don't have empty columns with check.emptycolumn
(to ensure we don't have empty lines, we will later use check.emptyline
)
# REMOVE UNNECESSARY VARIABLES df_full <- df_full %>% select_(.dots = paste0("-", c("cancer", "area.x","area.y", "area_code","ref_area", "area","sourceFAO","sourceWB","sourceILO")) ) df_full <- df_full[,!check.emptycolumn(df_full)] knitr::kable(head(df_full[!is.na(df_full[,10]),1:10]), caption = 'Final dataframe before interpolation')
To limit the amount of RAM required by R, we clean the global environment by removing dataframes that are no longer necessary. We also fix as a starting point 1970
, where all covariates dataframes are available.
pryr::mem_change(rm(df_ILO,df_FAO,df_WB,df_training, df_covariates)) df_full <- df_full %>% filter(year>=1970)
The OpenCancer
package provides several functions dedicated to the construction of a clean training table. First of all, it might be a good idea to drop empty variables, if there are some
if (sum(check.emptycolumn(df_full))>0) df_full <- check.emptycolumn(df_full)
More generally, we can inspect the share of missing values in a variable using check.emptycolumn
. It might thus be a good idea to drop variables that have too much missing values since interpolation will show small precision with too much missing values. For instance, if we want to drop variables that have more of 40\% of missing values (r sum(check.emptycolumn(df_full,proportion = T)>0.4)
variables over r ncol(df_full)
), we do
df_full$ethnicity[is.na(df_full$ethnicity)] <- "All population" colstokeep <- !check.emptycolumn(df_full,proportion = T)>0.4 df_full <- df_full[,colstokeep]
Another modification that might be useful is converting some variables into factors
. The check_factor
variable has been designed for that. First, if we want to inspect the cardinal of each potential factor, we use the argument check.levels = T
.
head(check.factor(df_full,check.levels = T,threshold = 12))
Now, let's say we want to convert each variable with cardinal lower than 12 as factor
. In that case we must put check.levels = F
df_full <- check.factor(df_full,check.levels = F,threshold = 12)
It might also be a good idea to exclude constant variables since they do not provide information when building statistical models.
# VARIABLES WITH ZERO VARIANCE: CONSTANT VARIABLES -> DROP THEM if (sum(apply(df_full[,!sapply(df_full, is.character)], 2, var, na.rm=TRUE) == 0)>0){ df_full <- df_full[,-which(apply(df_full[,!sapply(df_full, is.character)], 2, var, na.rm=TRUE) == 0)] } dim(df_full)
At this stage, df_full
has r sum(complete.cases(df_full))
complete rows over r ncol(df_full)
. Interpolating missing values might help to build robust statistical models. A series of function being part of the OpenCancer
package have been specifically built to deal with missing values.
A first useful function has been designed to identify potential miscoding issues. convert_fakezero
has been designed to inspect variables where zeros might rather be NAs
. We encapsulate the convert_fakezero
call within a mapvalues
call to print variables labels rather than codes that are obscure
head(plyr::mapvalues(names(which(convert_fakezero(df_full,convert = F))), from = df_label$code, to = df_label$label, warn_missing = FALSE) )
For consumer price index variables, zero values make no sense. For food production variables, a zero value at national level (observation unit for FAO
data) is not likely and might, once again, reflect coding issues. We thus convert, for all these variables except incidence
, zeros as NAs.
potential.miscoding <- names(which(convert_fakezero(df_full,convert = F))) miscoded <- potential.miscoding[!potential.miscoding %in% "incidence"] is.na(df_full[,miscoded]) <- !df_full[,miscoded]
The interpolate_data
function has been designed to give control when imputing NAs by the means of the zoo::na.approx
function. In further release of the package, advanced options will be added (knn
imputation for instance).
interpolate_data
has been designed to perform imputations by groups. For the imputation to be consistent, the groupingvar
argument must present the set of keys that uniquely identifies an observation, the year vector being apart. For instance, if the identifiers of an observations are variables called country
, region
, subregion
and year
, the individual dimension of our panel will be 3 and groupingvar
argument must be provided the following way: groupingvar = c("country","region","subregion")
. If we also have a sex dimension, groupingvar
will be groupingvar = c("country","region","subregion","sex")
. It thus provides, for hierarchical levels dataframes, flexibility in defining the individual unit of observation (at least one grouping variable must be provided and as many as needed are accepted). The name of the year variable must be provided in the year.var
argument. For each individual unit, a preliminary step checks whether the time dimension is complete, i.e. whether year vector exhibits a complete sequence between its minimum and maximum values. If the time vector is incomplete, intermediate missing years are appended and will also be interpolated thanks to the fill_dataframe
function. For instance, assume you have the following incomplete set of years, fill_dataframe
will produce
df <- data.frame(year = c(1980,1983:1985,1990),y = rnorm(length(c(1980,1983:1985,1990)))) knitr::kable(df,caption = "Initial incomplete dataframes") # FILL DATAFRAME df <- fill_dataframe(df,year.var = 'year',label.var = NULL) knitr::kable(df, caption = "Missing years have been appended")
where the label.var
argument has been set to NULL
so that it is ignored. Note that year.var = 'year'
is the default value and can be avoided. This intermediate function fill_dataframe
is important since it is necessary to have a balanced time vector.
The final argument to control the interpolation behavior of interpolate_data
is threshold
, a parameter that controls which unit of observations, for each variable, will be imputed. It is designed to avoid filling a variable that has too much missing values. More precisely, threshold
defines the proportion of missing values by group that is tolerated. If the proportion of missing values, for a given observation unit, is larger than threshold
, the variable will not be interpolated. If the proportion of missing values is lower than threshold, observations will be interpolated.
interpo <- interpolate_data(df_full[,c(1:15, which(colnames(df_full) == "Zonier"))], groupingvar = c("sex","age","Country_Transco", "region","ethnicity"), year.var = 'year', label.var = 'Zonier', threshold = 0.2)
The output of interpolate_data
is made of two elements:
data
) is the input dataframe with values interpolated. matrixNA
) is a matrix recording the changes induced by the dataframe. It has the following code: NA
if the observation has not been part of any imputation, 0
if the observation exists and is part of a vector that has been interpolated and 1
if the observation has been interpolated (or extrapolated). In our case, r sum(interpo$NAchange[interpo$NAchange == 1],na.rm=T)
have been imputed. Keeping a record of the imputations made might be useful to control rows or columns in order to ensure they do not have too many missing values that have been imputed since it can affect the quality of a statistical model. In further versions of the package, some functions will allow post-imputations control.
Let's call df_full2
the new dataframe
df_full2 <- interpo$data
big.matrix
The dataframe is now ready for statistical modelling. When complete, i.e. when putting colstokeep = NULL
rather than colstokeep = 1:100
as above, it might be preferrable to use big.matrix
object (see bigmemory
documentation) rather than standard dataframes.
A drawback of using objects of class big.matrix
is that it requires all data to be numeric. This means
big.object
To tackle these issues, two functions have been created:
convert.character
creates a correspondance table between a character variable and a numeric variables allowing to change variable type without loosing information (it calls several times OpenCancer::encode.character
)expand.dummies
has been designed to transform a factor variable as a set of dummy (as it is treated by statistical models)conversion <- convert.character(df_full2) names(conversion)
convert.character
returns a list:
data
returns the dataframe with character variables converted as numericcorrespondance
is a list of all correspondance tablesknitr::kable(head(conversion$data[,1:10]), caption = "Dataframe with converted character vectors") knitr::kable(head(conversion$correspondance[[1]]), caption = "An example of a correspondance table") df_full2 <- conversion$data
This step is not necessary but might be valuable since some models implementation do not accept factor variables or silently convert factors as numeric while they require a different treatment. expand.dummies
takes all factor variables - except those provided in the labelvar
argument - of a dataframe and creates as many columns as needed to store all possibilities (with a M
valued factor, M-1
dummies will be created).
df_full2$fakefactor <- factor(sample(1:8, size = nrow(df_full2), replace = T)) knitr::kable(head(expand.dummies(df_full2,labelvar = "sex")))
To replace all factor variables in dataframe by the set of dummies:
factorvar <- which(sapply(df_full2,is.factor)) factorvar <- factorvar[-which(names(factorvar) == 'sex')] df_full2 <- cbind(df[,-factorvar],expand.dummies(df_full2,labelvar = "sex"))
OpenCancer
The OpenCancer
package proposes other interesting function to easily build training tables.
createlag
This function has been designed to easily create lagged variables for several columns. Given a maximum order $k$, each function is lagged at orders $1,...,k$, taking care of the grouping variables (variables defining the individual unit).
df_full2 <- createlag(df_full2[,1:8], groupingvar = c("sex","age","Country_Transco", "region","ethnicity"), k = 2, labelvar = c("year","incidence")) knitr::kable(df_full2[,colnames(df_full2)[8]])
performPreProcess
Epidemium do not share common scales. This might affect feature selection in giving too much weight to a few variables. A possibility to limit this problem is centering and scaling all variables before estimating models (and keeping up in memory initial scales to be able to transform back variables to their original scales if needed).
The caret::preProcess
allows to that on standard dataframes. The OpenCancer::performPreProcess
has
been made to apply this function to pointers (see dedicated Vignettes
). This function returns a preProcess
model that can afterwards be used to transform variables.
X <- df_full[,1:15] X <- convert.character(X)$data Y <- as.big.matrix(data.frame(X)) preprocess.model <- performPreProcess(Y, groupingvar = c("sex","age","Country_Transco", "region","ethnicity"), labelvar = "year")
readr::write_csv(df_full2,path = paste0(datadir,"/exampledf.csv"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.