post_imp_diag: Post imputation diagnostics


View source: R/post_imp_diag.R

Description

post_imp_diag provides post-imputation diagnostics. The function compares the original dataset (with missing data) against the imputed dataset and returns statistics and visualizations that help the user judge how closely the two agree.

Usage

post_imp_diag(X_orig, X_imp, scale = TRUE, n.boot = 100)

Arguments

X_orig

Dataframe - the original data that contains missing values.

X_imp

Dataframe - the imputed data with no missing values.

scale

Boolean with default TRUE. When TRUE, all numeric variables in the original dataframe (with missingness) are centered and scaled to mean = 0 and standard deviation = 1. Set this to match the imputed dataframe: TRUE if it holds scaled values, FALSE if it holds unscaled values (which is controlled by the scale argument in impute_data). Factor variables are not scaled.
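To make the effect of scale concrete, here is a minimal base R sketch of centering and scaling only the numeric columns of a dataframe with missing values. The helper name scale_numeric and the toy columns are hypothetical and are not part of missCompare; the package's internal implementation may differ.

```r
# Toy data with missing values; column names are made up for this sketch.
df <- data.frame(
  age   = c(23, 35, NA, 51, 44),
  bmi   = c(21.5, NA, 27.3, 30.1, 24.8),
  group = factor(c("a", "b", "a", NA, "b"))
)

# Hypothetical helper (not a missCompare function): center and scale
# numeric columns, leave factors untouched. scale() ignores NAs when
# computing the mean and standard deviation and keeps them as NA.
scale_numeric <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], function(x) as.numeric(scale(x)))
  df
}

df_scaled <- scale_numeric(df)
```

After this step the observed values of each numeric column have mean 0 and standard deviation 1, so they are comparable with an imputed dataframe that was produced on the scaled data.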

n.boot

Number of bootstrap iterations (default 100) used to estimate mean pairwise Pearson correlation coefficients and 95% confidence intervals for variable pairs in the original and the imputed dataframes.

Details

This function takes the original and the imputed dataframes and produces plots that allow the user to compare the distributions of the original values and the imputed values for each numeric variable. If factors are present in the dataframes, the function recognizes them and creates bar charts for them instead. In addition, the function calculates bootstrapped pairwise Pearson correlation coefficients between numeric variables in the original dataframe (with missingness) and in the imputed dataframe, and plots them against each other so the user can assess whether the imputation distorted the original data structure. The function also visualizes variable clusters in the original dataframe and in the imputed one. If the imputation algorithm performed well, the variable distributions and the variable clusters should be similar in both.

Value

Histograms

List of histograms of all numeric variables. Each histogram shows the original values and the imputed values overlaid for one variable in the dataframe

Boxplots

List of boxplots of all numeric variables. Each boxplot shows the original values and the imputed values for one variable in the dataframe. As usual, the boxplots show the median, the IQR and the range of values

Barcharts

List of bar charts of all categorical (factor) variables. Each bar chart shows the original categories and the imputed categories for one categorical variable in the dataframe. Bar charts are only output if scale is set to FALSE and both the original and imputed data contain the same factor variables

Statistics

List of output statistics, one entry per variable. Each entry is a named vector containing the means and standard deviations of the original and imputed values, the P value from Welch's t test and the D test statistic from a Kolmogorov–Smirnov test comparing the original and the imputed values
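The per-variable statistics described above can be reproduced for a single variable with base R. This is only a sketch under stated assumptions: it treats "imputed values" as the values filled in at originally missing positions, it simulates the imputation with random normal draws, and the element names of the vector are hypothetical rather than those returned by post_imp_diag.

```r
set.seed(42)

# Simulated variable with 20 of 100 values set to missing.
orig <- rnorm(100, mean = 10, sd = 2)
orig[sample(100, 20)] <- NA
miss <- is.na(orig)

# Stand-in imputation for illustration only: draw from a normal
# distribution matched to the observed mean and standard deviation.
imp <- orig
imp[miss] <- rnorm(sum(miss), mean(orig, na.rm = TRUE), sd(orig, na.rm = TRUE))

observed <- orig[!miss]  # original (non-missing) values
filled   <- imp[miss]    # values supplied by the imputation

# Named vector of summary statistics (names are assumptions).
stats <- c(
  mean_orig = mean(observed),
  sd_orig   = sd(observed),
  mean_imp  = mean(filled),
  sd_imp    = sd(filled),
  welch_p   = t.test(observed, filled, var.equal = FALSE)$p.value,
  ks_D      = unname(ks.test(observed, filled)$statistic)
)
stats
```

A small Welch P value or a large D statistic suggests that the imputed values follow a different distribution from the observed ones for that variable.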

Variable_clusters_orig

Variable clusters based on the original dataframe (with missingness). Regardless of the argument scale being set to TRUE or FALSE, the clusters are assessed based on normalized data

Variable_clusters_imp

Variable clusters based on the imputed dataframe. Regardless of the argument scale being set to TRUE or FALSE, the clusters are assessed based on normalized data
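One way to obtain variable clusters on normalized data is hierarchical clustering of the standardized columns. The sketch below is an assumption about the general technique, not the package's confirmed method: the helper cluster_variables, the Euclidean distance and the complete linkage are all choices made for this illustration.

```r
set.seed(7)
n <- 150
a <- rnorm(n)
b <- rnorm(n)

# Two groups of related variables (a1, a2) and (b1, b2); names are made up.
df <- data.frame(
  a1 = a + rnorm(n, sd = 0.3),
  a2 = a + rnorm(n, sd = 0.3),
  b1 = b + rnorm(n, sd = 0.3),
  b2 = b + rnorm(n, sd = 0.3)
)
df$a2[sample(n, 20)] <- NA  # introduce missingness

# Hypothetical helper: standardize the columns, then cluster the
# variables (columns) rather than the observations (rows).
# dist() skips NA entries and rescales the distance accordingly.
cluster_variables <- function(df) {
  z <- scale(df)
  hclust(dist(t(z)), method = "complete")
}

hc <- cluster_variables(df)
groups <- cutree(hc, k = 2)
groups
```

Running the same helper on the imputed dataframe and comparing the two dendrograms (for example with plot(hc)) shows whether the imputation preserved the grouping of variables.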

Correlation_stats

Mean pairwise Pearson's correlation coefficients and 95% confidence intervals from the original dataframe (with missingness) and the imputed dataframe
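The bootstrapped correlation statistics can be sketched in base R as follows. The helper boot_cor is hypothetical, and the use of pairwise-complete observations and percentile intervals is an assumption; post_imp_diag may compute its confidence intervals differently.

```r
set.seed(123)
n <- 200
x <- rnorm(n)

# Toy dataframe with one correlated pair (x, y) and missing values in y.
df <- data.frame(x = x, y = 0.6 * x + rnorm(n, sd = 0.8), z = rnorm(n))
df$y[sample(n, 30)] <- NA

# Hypothetical helper: resample rows with replacement, compute all
# pairwise Pearson correlations, and summarize across resamples.
boot_cor <- function(df, n.boot = 100) {
  pairs <- combn(names(df), 2)  # 2 x n_pairs matrix of variable names
  draws <- lapply(seq_len(n.boot), function(i) {
    resampled <- df[sample(nrow(df), replace = TRUE), , drop = FALSE]
    cm <- cor(resampled, use = "pairwise.complete.obs", method = "pearson")
    cm[t(pairs)]                # one coefficient per variable pair
  })
  draws <- matrix(unlist(draws), nrow = ncol(pairs))  # pairs x resamples
  data.frame(
    var1     = pairs[1, ],
    var2     = pairs[2, ],
    mean_cor = rowMeans(draws),
    lower    = apply(draws, 1, quantile, probs = 0.025, names = FALSE),
    upper    = apply(draws, 1, quantile, probs = 0.975, names = FALSE)
  )
}

cor_orig <- boot_cor(df, n.boot = 100)
cor_orig
```

Applying the same helper to the imputed dataframe and plotting its mean_cor against that of the original gives a scatter plot of the kind described under Correlation_plot; points close to the line with slope 1 and intercept 0 indicate that the correlation structure was preserved.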

Correlation_plot

Scatter plot of mean pairwise Pearson's correlation coefficients from the original dataframe (with missingness) against those from the imputed dataframe. The blue line is the reference line with slope 1 and intercept 0. The red line is a line fitted to the correlation coefficient pairs. The error bars around the points represent the individual 95% confidence intervals obtained by bootstrapping the correlation coefficients

Examples

# diagnostics <- post_imp_diag(X_orig = df_miss, X_imp = df_imputed, scale=TRUE)
# diagnostics$Histograms$variable_X
# diagnostics$Boxplots$variable_Z
# diagnostics$Statistics$variable_Y

missCompare documentation built on Dec. 1, 2020, 9:09 a.m.