knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)

Requirement

This Vignette presents the methodology that allows to import Epidemium data and merge them. The following Epidemium data will be used

training_IARC.csv has been preferred to training.txt, the latter exhibting more missing values.

To ensure Vignette compilation, every dataframe is downloaded from the internet. It might be time-consuming. It is thus recommended to rather manually download required files using a web browser, put them in a subdirectory within the working directory. We called this directory datadir in this Vignette:

datadir <- paste0(getwd(),"/inst")

If all the required files can be find in datadir, you can safely use our functions without the web-import option. All the import_ functions should work when setting fromURL = F and the speed gain should be around 50\%. The filename and path arguments have been filled for that: they are not necessary when importing data from the internet but are necessary if you upload from a computer.

The only required package in this Vignette is

library(OpenCancer)

Importing dataframes

Importing covariates

The import_* functions share the same structure. The parameters colstokeep allow to only keep a few variables. If colstokeep = NULL, all variables are kept in memory. In this Vignette, we will only keep a few variables in memory (100 variables) but, if you want the whole dataframes, replicate these examples with colstokeep = NULL.

FAO data can be imported with import_FAO

df_FAO <- import_FAO(path = datadir, colstokeep = 1:100, fromURL = T)
knitr::kable(head(df_FAO[!is.na(df_FAO[,10]),1:10]), caption = "FAO Data")

World Bank data are imported with import_WB

df_WB <- import_WB(path = datadir, colstokeep = 1:100, fromURL = T)
knitr::kable(head(df_WB[!is.na(df_WB[,6]),1:6]), caption = "World Bank Data")

ILO data are imported with import_ILO

df_ILO <- import_ILO(path = datadir, colstokeep = 1:100, fromURL = T)
knitr::kable(head(df_ILO[!is.na(df_ILO[,6]),1:6]))

Importing training data

import_training data allows to import both training_IARC.csv and training.txt. For its completeness, the former has been preferred.

df_training <- import_training(path = datadir, colstokeep = NULL,
                                   filename = "training_IARC.csv",
                                   fromURL = TRUE,
                                   url = "http://qa.epidemium.cc/data/shinyapp/training_IARC.csv")
knitr::kable(head(na.omit(df_training)), caption = "Incidence Data")

Importing variables codes

The following function has been designed to easily import from the Epidemium website the labels of all the variables names.

df_label <- import_coding()

It requires an internet connection since files are imported from Epidemium website.

Importing joining keys

To ensure data will be consistently merged, we import a file called creation_zonier.xls file, stored in Github. All countries in the FAO, ILO and World Bank database should thus be assigned a unique and consistent key, a necessary step before merging dataframes.

link <- "https://github.com/EpidemiumOpenCancer/OpenCancer/raw/master/vignettes/inst/creation%20zonier.xls"
download.file(link,destfile = paste0(datadir,"/creation_zonier.xls"),mode = "wb")
codes <- readxl::read_excel(path = paste0(datadir,"/creation_zonier.xls"),
                            sheet = "Transco_Country")

knitr::kable(head(codes), caption = 'Coding convention for countries')

Merging dataframes

Mapping data to a unique key

Using the codes dataframe, we use the matchkeys function to match countries in a dataframe (e.g. df_FAO) with its identifiers. It only requires to match a dataframe with its source (default is 'FAO')

df_FAO = matchkeys(df_FAO,codes)
df_WB = matchkeys(df_WB,codes,origin = "World Bank")
df_ILO = matchkeys(df_ILO,codes, origin = "IloStat")

Join data

At this stage, all explanatory variables dataframes share consistent keys that allow to safely merge them.

# LEFT JOIN FAO AND WORLD BANK DATA: NO RESTRICTION TO POST 1970 DATA
df_covariates <- dplyr::full_join(df_FAO,df_WB, by = c("Country_Transco","year","Zonier"))

# JOIN AVEC ILO
df_covariates <- dplyr::left_join(df_covariates,df_ILO, by = c("Country_Transco","year","Zonier"))

To ensure consistency with the training data, we control the countries that do not match with the keys list. The number is limited for both training.txt (8 countries) and training.csv (3 countries). This can be handled manually

# CHANGING MISMATCHED NAMES WITH TRAINING DATASET
oldnames <- unique(df_training$country[!df_training$country %in% 
                                         intersect(df_covariates$Country_Transco,df_training$country)])
newnames <- c("Netherlands","Slovakia","United States")

# IF DATAFRAME IS training.csv UNCOMMENT
#newnames <- c("Netherlands","Slovakia","United States","Gambia, The","Australian Capital Territory",
#              "South Korea","VietNam","Yugoslavia")
df_training$country <- plyr::mapvalues(df_training$country, from=oldnames, to=newnames,
                                       warn_missing = FALSE)

The final step in the merging process is to join the incidence dataset with other variables

# HARMONIZING COLUMN NAMES
df_training  <- df_training %>% dplyr::rename(Country_Transco = country)

# IF NO ETHNICITY IS GIVEN, ASSUMED ALL POPULATION IS PRESENTED
df_training$ethnicity[is.na(df_training$ethnicity)] <- "All population" 
df_training$region[is.na(df_training$region)] <- "All regions" 

# JOIN WITH TRAINING DATASET
df_full <- dplyr::left_join(df_training,df_covariates)

Finally, we clean the dataframe to get a proper training tables that can be interpolated. We ensure we don't have empty columns with check.emptycolumn (to ensure we don't have empty lines, we will later use check.emptyline)

# REMOVE UNNECESSARY VARIABLES
df_full <- df_full %>% select_(.dots = paste0("-", c("cancer", "area.x","area.y",
             "area_code","ref_area",
             "area","sourceFAO","sourceWB","sourceILO"))
)

df_full <- df_full[,!check.emptycolumn(df_full)]

knitr::kable(head(df_full[!is.na(df_full[,10]),1:10]), caption = 'Final dataframe before interpolation')

To limit the amount of RAM required by R, we clean the global environment by removing dataframes that are no longer necessary. We also fix as a starting point 1970, where all covariates dataframes are available.

pryr::mem_change(rm(df_ILO,df_FAO,df_WB,df_training,
   df_covariates))

df_full <- df_full %>% filter(year>=1970)

Cleaning dataset

Ensuring variables types are consistent

The OpenCancer package provides several functions dedicated to the construction of a clean training table. First of all, it might be a good idea to drop empty variables, if there are some

if (sum(check.emptycolumn(df_full))>0) df_full <- check.emptycolumn(df_full)

More generally, we can inspect the share of missing values in a variable using check.emptycolumn. It might thus be a good idea to drop variables that have too much missing values since interpolation will show small precision with too much missing values. For instance, if we want to drop variables that have more of 40\% of missing values (r sum(check.emptycolumn(df_full,proportion = T)>0.4) variables over r ncol(df_full)), we do

df_full$ethnicity[is.na(df_full$ethnicity)] <- "All population"

colstokeep <- !check.emptycolumn(df_full,proportion = T)>0.4
df_full <- df_full[,colstokeep]

Another modification that might be useful is converting some variables into factors. The check_factor variable has been designed for that. First, if we want to inspect the cardinal of each potential factor, we use the argument check.levels = T.

head(check.factor(df_full,check.levels = T,threshold = 12))

Now, let's say we want to convert each variable with cardinal lower than 12 as factor. In that case we must put check.levels = F

df_full <- check.factor(df_full,check.levels = F,threshold = 12)

It might also be a good idea to exclude constant variables since they do not provide information when building statistical models.

# VARIABLES WITH ZERO VARIANCE: CONSTANT VARIABLES -> DROP THEM
if (sum(apply(df_full[,!sapply(df_full, is.character)], 2, var, na.rm=TRUE) == 0)>0){
  df_full <- df_full[,-which(apply(df_full[,!sapply(df_full, is.character)], 2, var, na.rm=TRUE) == 0)]
}

dim(df_full)

Interpolation

At this stage, df_full has r sum(complete.cases(df_full)) complete rows over r ncol(df_full). Interpolating missing values might help to build robust statistical models. A series of function being part of the OpenCancer package have been specifically built to deal with missing values.

Ensuring consistency in missing values treatment

A first useful function has been designed to identify potential miscoding issues. convert_fakezero has been designed to inspect variables where zeros might rather be NAs. We encapsulate the convert_fakezero call within a mapvalues call to print variables labels rather than codes that are obscure

head(plyr::mapvalues(names(which(convert_fakezero(df_full,convert = F))),
                from = df_label$code,
                to = df_label$label, warn_missing = FALSE)
)

For consumer price index variables, zero values make no sense. For food production variables, a zero value at national level (observation unit for FAO data) is not likely and might, once again, reflect coding issues. We thus convert, for all these variables except incidence, zeros as NAs.

potential.miscoding <- 
  names(which(convert_fakezero(df_full,convert = F)))
miscoded <- potential.miscoding[!potential.miscoding %in% "incidence"]
is.na(df_full[,miscoded]) <- !df_full[,miscoded]

Interpolation

The interpolate_data function has been designed to give control when imputing NAs by the means of the zoo::na.approx function. In further release of the package, advanced options will be added (knn imputation for instance).

interpolate_data has been designed to perform imputations by groups. For the imputation to be consistent, the groupingvar argument must present the set of keys that uniquely identifies an observation, the year vector being apart. For instance, if the identifiers of an observations are variables called country, region, subregion and year, the individual dimension of our panel will be 3 and groupingvar argument must be provided the following way: groupingvar = c("country","region","subregion"). If we also have a sex dimension, groupingvar will be groupingvar = c("country","region","subregion","sex"). It thus provides, for hierarchical levels dataframes, flexibility in defining the individual unit of observation (at least one grouping variable must be provided and as many as needed are accepted). The name of the year variable must be provided in the year.var argument. For each individual unit, a preliminary step checks whether the time dimension is complete, i.e. whether year vector exhibits a complete sequence between its minimum and maximum values. If the time vector is incomplete, intermediate missing years are appended and will also be interpolated thanks to the fill_dataframe function. For instance, assume you have the following incomplete set of years, fill_dataframe will produce

df <- data.frame(year = c(1980,1983:1985,1990),y = rnorm(length(c(1980,1983:1985,1990))))
knitr::kable(df,caption = "Initial incomplete dataframes")

# FILL DATAFRAME
df <- fill_dataframe(df,year.var = 'year',label.var = NULL)
knitr::kable(df, caption = "Missing years have been appended")

where the label.var argument has been set to NULL so that it is ignored. Note that year.var = 'year' is the default value and can be avoided. This intermediate function fill_dataframe is important since it is necessary to have a balanced time vector.

The final argument to control the interpolation behavior of interpolate_data is threshold, a parameter that controls which unit of observations, for each variable, will be imputed. It is designed to avoid filling a variable that has too much missing values. More precisely, threshold defines the proportion of missing values by group that is tolerated. If the proportion of missing values, for a given observation unit, is larger than threshold, the variable will not be interpolated. If the proportion of missing values is lower than threshold, observations will be interpolated.

interpo <- interpolate_data(df_full[,c(1:15, which(colnames(df_full) == "Zonier"))],
                 groupingvar = c("sex","age","Country_Transco", "region","ethnicity"),
                 year.var = 'year', label.var = 'Zonier',
                 threshold = 0.2)

The output of interpolate_data is made of two elements:

In our case, r sum(interpo$NAchange[interpo$NAchange == 1],na.rm=T) have been imputed. Keeping a record of the imputations made might be useful to control rows or columns in order to ensure they do not have too many missing values that have been imputed since it can affect the quality of a statistical model. In further versions of the package, some functions will allow post-imputations control.

Let's call df_full2 the new dataframe

df_full2 <- interpo$data 

Preparing a big.matrix

The dataframe is now ready for statistical modelling. When complete, i.e. when putting colstokeep = NULL rather than colstokeep = 1:100 as above, it might be preferrable to use big.matrix object (see bigmemory documentation) rather than standard dataframes.

A drawback of using objects of class big.matrix is that it requires all data to be numeric. This means

To tackle these issues, two functions have been created:

Converting character vectors as numeric

conversion <- convert.character(df_full2)
names(conversion)

convert.character returns a list:

knitr::kable(head(conversion$data[,1:10]), caption = "Dataframe with converted character vectors")
knitr::kable(head(conversion$correspondance[[1]]), caption = "An example of a correspondance table")

df_full2 <- conversion$data

Creating set of dummies from factor variables

This step is not necessary but might be valuable since some models implementation do not accept factor variables or silently convert factors as numeric while they require a different treatment. expand.dummies takes all factor variables - except those provided in the labelvar argument - of a dataframe and creates as many columns as needed to store all possibilities (with a M valued factor, M-1 dummies will be created).

df_full2$fakefactor <- factor(sample(1:8, size = nrow(df_full2),
                              replace = T))

knitr::kable(head(expand.dummies(df_full2,labelvar = "sex")))

To replace all factor variables in dataframe by the set of dummies:

factorvar <- which(sapply(df_full2,is.factor))
factorvar <- factorvar[-which(names(factorvar) == 'sex')]

df_full2 <- cbind(df[,-factorvar],expand.dummies(df_full2,labelvar = "sex"))

Other data management functions in OpenCancer

The OpenCancer package proposes other interesting function to easily build training tables.

createlag

This function has been designed to easily create lagged variables for several columns. Given a maximum order $k$, each function is lagged at orders $1,...,k$, taking care of the grouping variables (variables defining the individual unit).

df_full2 <- createlag(df_full2[,1:8], groupingvar = c("sex","age","Country_Transco",
                                    "region","ethnicity"), k = 2,
                      labelvar = c("year","incidence"))

knitr::kable(df_full2[,colnames(df_full2)[8]])

performPreProcess

Epidemium do not share common scales. This might affect feature selection in giving too much weight to a few variables. A possibility to limit this problem is centering and scaling all variables before estimating models (and keeping up in memory initial scales to be able to transform back variables to their original scales if needed).

The caret::preProcess allows to that on standard dataframes. The OpenCancer::performPreProcess has been made to apply this function to pointers (see dedicated Vignettes). This function returns a preProcess model that can afterwards be used to transform variables.

X <- df_full[,1:15]
X <- convert.character(X)$data
Y <- as.big.matrix(data.frame(X))

preprocess.model <- performPreProcess(Y, groupingvar = c("sex","age","Country_Transco",
                                     "region","ethnicity"),
                  labelvar = "year")
readr::write_csv(df_full2,path = paste0(datadir,"/exampledf.csv"))


EpidemiumOpenCancer/OpenCancer documentation built on May 12, 2019, 7:46 a.m.