
Register with GitHub

GitHub is a hosting service for version-controlled code repositories that is used very widely by statisticians for versioning and collaborating on code and data analyses, both publicly and privately. Almost all of our stats group are using it. If you have not done so yet, you need to get an account now to make a submission to this year's Data Challenge.

Installation

To get started, install the required packages in R and then the DataChallenge.2017 package:

install.packages("data.table")
install.packages("devtools")
devtools:::install_github("olli0601/DataChallenge.2017")
require(data.table)
require(DataChallenge.2017)
?ihr
?train

Loading the package makes the various data sets available. They are listed under the functions tab above. Take a look to learn their names and to get a first feel for the data points in them. All data sets have two common columns: country code (ISO) and year of data collection (YEAR). You can also do:

require(data.table)
require(DataChallenge.2017)
data(ihr)
?ihr
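
Once a data set is loaded, it is worth inspecting its structure before doing anything else. A minimal sketch on the ihr data set (the only columns assumed here are the common ISO and YEAR columns):

require(data.table)
require(DataChallenge.2017)
data(ihr)
# overview of the columns and their types
str(ihr)
# first few rows, and the range of years covered
head(ihr)
summary(ihr$YEAR)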

Quick survival guide

The very first step is to combine the different data sets by country code and year of data collection. The data sets are stored as data.table objects. A data.table is very similar to a standard R data.frame, but much faster, and you can do a lot more with it. Have a look at one or two data.table tutorials online.
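
If data.tables are new to you, here is a minimal sketch of the basic DT[i, j, by] pattern on the ihr data set (only the common ISO and YEAR columns are assumed):

require(data.table)
require(DataChallenge.2017)
data(ihr)
class(ihr)                  # a data.table is also a data.frame
ihr[YEAR>=2010]             # i: select rows by a condition on a column
ihr[, length(unique(ISO))]  # j: compute on columns, here the number of countries
ihr[, .N, by='YEAR']        # by: count rows per year of data collection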

First hint: use data.table's merge function by country code and year to combine the data sets. This is much more efficient than for loops or lapply.
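
As a minimal sketch, assuming you want to combine the ihr and pov data sets (which rows to keep, i.e. the all/all.x arguments, is your choice):

require(data.table)
require(DataChallenge.2017)
data(ihr)
data(pov)
# combine two data sets on the common country code and year columns;
# all=TRUE keeps rows that appear in either data set
dc <- merge(ihr, pov, by=c('ISO','YEAR'), all=TRUE)
# further data sets can be added by repeating the merge on dc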

Fallback: you can convert the data to data.frames and then work with those, including with the standard R merge function.

require(data.table)
require(DataChallenge.2017)
data(ihr)
# convert the data.table to a plain data.frame
ihr <- as.data.frame(ihr)
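
With data.frames, the same combination step can be done with the standard R merge function (again taking ihr and pov as an example):

require(data.table)
require(DataChallenge.2017)
data(ihr)
data(pov)
ihr <- as.data.frame(ihr)
pov <- as.data.frame(pov)
# standard R merge on the common country code and year columns
dc <- merge(ihr, pov, by=c('ISO','YEAR'), all=TRUE)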

Second hint: the reason we encourage you to use data.tables is their powerful [i, j, by] syntax. Here is a short example, which calculates the proportion of missing values in each column of the pov data set. Let's start with a for loop:

require(data.table)
require(DataChallenge.2017)
data(pov)
# reshape to long format: one row per country, year, variable and value
df <- melt(as.data.frame(pov), id.vars=c('ISO','YEAR'))
# set up a result matrix with one row per variable
ans <- matrix(  data=NA,
                nrow=length(levels(df$variable)),
                ncol=3,
                dimnames=list(levels(df$variable), c('N','N_NA','P_NA')))
# loop over the variables, counting observations and missing values for each
for(x in levels(df$variable))
{
    tmp <- which(df$variable==x)
    z <- length(df$value[tmp])
    z2 <- length(which(is.na(df$value[tmp])))
    tmp <- which(rownames(ans)==x)
    ans[tmp, 'N'] <- z
    ans[tmp, 'N_NA'] <- z2
    ans[tmp, 'P_NA'] <- z2/z
}

Here is the corresponding data.table code:

# reshape to long format; variable.factor=FALSE keeps the variable names as character
df <- melt(pov, id.vars=c('ISO','YEAR'), variable.factor=FALSE)
# for each variable, count observations and missing values in one grouped expression
df[, {
        z <- length(value)
        z2 <- length(which(is.na(value)))
        list(N=z, N_NA=z2, P_NA=z2/z)
     }, by=c('variable')]

With the for loop, you need to keep track of indices, and I bet you can see this becomes a pain when you have two or more columns to loop over. With data.tables, all you need to do is change by=c('variable') to by=c('variable1','variable2').
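
For example, here is the same missingness calculation stratified by variable and year of data collection, i.e. grouped by two columns:

# same calculation as above, now grouped by variable and year;
# .N is the number of rows per group
df[, list(N=.N, N_NA=sum(is.na(value)), P_NA=mean(is.na(value))), by=c('variable','YEAR')]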

Third hint: there really are not that many data points in the epidemic training data set to begin with, while there are a large number of covariates. The challenge is to make the most of this small n, large p setting. What techniques could be appropriate for predicting the number of 2015 and 2016 outbreaks? Start with some regressions, and google your way from there.

Fourth hint: the epidemiological outbreak data are counts, but you don't have to worry about this too much for this year's Data Challenge. It is fine to make predictions that are real values.
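
To make these two hints concrete, here is a minimal sketch of one standard option for a small n, large p prediction problem: a penalized regression with the glmnet package. It is shown on simulated data only; how you build the covariate matrix and outcome from the merged challenge data, and whether a penalized model is the right choice at all, is up to you.

# install.packages("glmnet") if you do not have it yet
require(glmnet)
# simulated small n, large p example: 60 observations, 100 covariates, count outcome
set.seed(42)
n <- 60
p <- 100
x <- matrix(rnorm(n*p), nrow=n, ncol=p)
y <- rpois(n, lambda=exp(0.5*x[,1] - 0.3*x[,2] + 1))
# lasso-penalized regression, with the penalty chosen by cross-validation
fit <- cv.glmnet(x, y)
# predictions at the cross-validated penalty; these are real-valued, which is fine
predict(fit, newx=x, s="lambda.min")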

Interested in even more data?

You should have plenty of data to get started. You are welcome to search for and add any other data that you think are useful for building the best predictive model. Here are a few suggestions:

  1. INFORM open-source risk assessment for humanitarian crises and disasters.
  2. WHO data repository on health workforce.
  3. WHO data repository on health financing.

