
Regression-based Imputation

Several predictive approaches to imputation, using linear regression and lasso-based methods, for data in the Fragile Families Challenge. For more information on the challenge, see fragilefamilieschallenge.org.

Tl;dr

To get started, in R:

1. Make sure "devtools" is installed and loaded:
   - library(devtools)

See below for options to customize each function.

Final output is a data frame that contains imputed values where an original value is missing and the original values where they exist. Depending on the options you specify, it holds either the constructed variables only or the full data frame. For example, if household income from mom's survey in wave 4 (cm4hhinc) is missing for the first case but not for cases 2 and 3, the function imputes only the first case and returns the original values for cases 2 and 3.
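The "fill only what is missing" behaviour described above can be sketched in base R. This is a toy illustration rather than the package's own code: the data values and the `predictor` column are made up, and a plain `lm` stands in for whatever model the package fits.

```r
# Toy data: household income (cm4hhinc) is missing for case 1 only.
# `predictor` is a made-up covariate used purely for illustration.
toy <- data.frame(
  cm4hhinc  = c(NA, 42000, 58000, 35000, 61000),
  predictor = c(31, 27, 35, 24, 38)
)

# Fit on the observed cases (lm drops rows with NA by default).
fit <- lm(cm4hhinc ~ predictor, data = toy)

# Fill only the rows where the original value is missing.
missing <- is.na(toy$cm4hhinc)
imputed <- toy
imputed$cm4hhinc[missing] <- predict(fit, newdata = toy[missing, , drop = FALSE])

# Cases 2 onward keep their original values; only case 1 is predicted.
print(imputed)
```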

Note: This software is still under development, so it's possible things may not work exactly as they should. If you encounter a problem, please help by submitting an issue on this project page.

Available options

Initialization

This is the initialization function. It imports data from the available background.csv file (see the Fragile Families website for how to obtain the data) and performs some basic preprocessing. By default, the function converts all values below 0 to NA and imputes age for mom and dad using available information across waves. This can be extended to other kinds of logical imputation (PRs welcome!).

initImpute(data='', dropna = 1, ageimpute=1, meanimpute=0)
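As a rough picture of the default "values below 0 become NA" step, here is a minimal base R sketch. It is not the body of initImpute; the column names `id`, `mom_age` and `dad_age` are hypothetical, and the sketch assumes every column is numeric.

```r
# Hypothetical survey extract in which negative codes mark missing answers.
raw <- data.frame(
  id      = 1:4,
  mom_age = c(25, -9, 31, -3),
  dad_age = c(-1, 30, 33, 28)
)

# Replace every negative value with NA, leaving valid values untouched.
cleaned <- raw
cleaned[cleaned < 0] <- NA

# Count how many values were turned into NA in each column.
print(colSums(is.na(cleaned)))
```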

Correlation Matrix

Most of the resource-intensive work in this imputation is computing a correlation matrix. The calculation of the Pearson correlation is reimplemented here to allow for better error handling, and it is now vectorized, which improves performance. In theory, this function needs to be run only once per set of variables of interest, so this part of the process is abstracted into a separate function whose output can easily be stored.

corMatrix(data='', continuous='', categorical='', varpattern='',debug=0, test=0, parallel = 0)
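To show what a vectorized Pearson correlation can look like, here is a small base R sketch. It is an assumption about the general technique, not the implementation inside corMatrix: `pearson_matrix` is a hypothetical helper, it expects complete numeric data, and it does none of the error handling mentioned above.

```r
# Hypothetical helper: Pearson correlation for all column pairs at once.
# Assumes a numeric matrix or data frame with no missing values.
pearson_matrix <- function(x) {
  x <- as.matrix(x)
  centered <- sweep(x, 2, colMeans(x))  # subtract each column's mean
  cross <- crossprod(centered)          # all pairwise sums of products
  norms <- sqrt(diag(cross))            # per-column root sum of squares
  cross / outer(norms, norms)           # scale so the diagonal becomes 1
}

# Compare against base R's cor() on a built-in data set.
m <- mtcars[, c("mpg", "hp", "wt")]
print(all.equal(pearson_matrix(m), cor(m), check.attributes = FALSE))
```

A constant column has zero variance, so its row and column come out as NaN in this sketch; a production version would need to handle that case explicitly.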

Saving and restoring output from corMatrix

corMatrix produces a correlation matrix of all usable columns in a given data frame, optionally filtered by a regular expression. The function is now vectorized and should run orders of magnitude faster than before. Even so, to allow reuse of elements of the pipeline, below are instructions for preserving the output of the function:

To save the resulting object for reuse: saveRDS([output variable from corMatrix], "cormatrix.rds")

To restore the object: restored <- readRDS("cormatrix.rds")
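The save and restore steps above can be checked end to end with a short round trip. In this sketch a matrix from base R's `cor()` stands in for the corMatrix output, and the file is written to a temporary directory; both are assumptions made only so the example runs on its own.

```r
# Stand-in for the corMatrix output: a small correlation matrix.
cm <- cor(mtcars[, c("mpg", "hp", "wt")])

# Save to a temporary location, then read it back.
path <- file.path(tempdir(), "cormatrix.rds")
saveRDS(cm, path)
restored <- readRDS(path)

# The restored object should match the original exactly.
print(identical(cm, restored))
```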

Breaking change: This function used to generate a data frame alongside the correlation matrix, but this is no longer necessary due to improvements in regImpute. Code that depends on the object structure of older versions may need to be revised.

Regression Imputation

Predicts each missing value from the values of other variables in the given data set that are highly correlated with it. Requires a correlation matrix (see corMatrix above). The function can also (optionally) treat continuous and categorical variables separately: categorical covariates are converted to dummies when they are independent variables, and multinomial regression is used when the variable to be imputed is categorical.

regImpute(dataframe='', matrix='', continuous='', categorical='', method='lm', degree=1, parallel=0, threshold=0.4,top_predictors=3, debug=0, test=0)
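The idea of picking highly correlated predictors and regressing on them can be sketched in base R as follows. This is a simplified illustration, not regImpute itself: `impute_one` is a hypothetical helper, it handles a single continuous target with `lm`, and it assumes that `threshold` and `top_predictors` mean "minimum absolute correlation" and "maximum number of predictors", which the source does not spell out.

```r
# Hypothetical helper: impute one continuous column from its most
# correlated columns. Parameter meanings are assumptions (see above).
impute_one <- function(df, cm, target, threshold = 0.4, top_predictors = 3) {
  cors <- abs(cm[target, ])
  cors <- cors[names(cors) != target & !is.na(cors)]
  cors <- cors[cors >= threshold]
  preds <- head(names(sort(cors, decreasing = TRUE)), top_predictors)
  if (length(preds) == 0) return(df)  # nothing correlated enough

  # Regress the target on the selected predictors using observed rows.
  fit <- lm(reformulate(preds, response = target), data = df)

  # Overwrite only the rows where the target is missing.
  missing <- is.na(df[[target]])
  df[[target]][missing] <- predict(fit, newdata = df[missing, , drop = FALSE])
  df
}

# Toy run on a built-in data set with a few values blanked out.
toy <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]
toy$mpg[c(3, 10, 20)] <- NA
cm <- cor(toy, use = "pairwise.complete.obs")
result <- impute_one(toy, cm, target = "mpg")
print(result$mpg[c(3, 10, 20)])
```

If a row is also missing one of the selected predictors, `predict` returns NA for it and the value stays missing in this sketch.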

Todo



annafil/FFCRegressionImputation documentation built on May 12, 2019, 1:59 p.m.