Handle All Missing (Values)
Our package explores the pattern of missing values in a user's dataset and imputes the missing values using several methods.
We decided to make this project because we have not found any package that handles both tasks in either R or Python. In R, we found the Amelia and vis_dat packages, which only visualize the missing data. In Python, we found fancyimpute, which deals with missing values but does not have any visualization, and missingno, which visualizes missing data. We thought this would be a better package for users who do not have much experience in data wrangling.
R:
devtools::load_all()
devtools::install_github("UBC-MDS/hamr")
Usage: vis_missing(dfm, missing_val_char = NA)
Input:
dfm
: a data frame or matrix containing missing values
missing_val_char
: the character representing missing values in data frame. One of: c(NA, " ", "", "?")
Output: A visualization of missing data across the data frame. Note: currently colour changes and annotations are not supported. This will be included in later versions.
Example:
df <- data.frame(x = c(1, " ", 3), y = c(1, 8, 9))
vis_missing(df, missing_val_char = " ")
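A visualization like this is driven by a table of which cells count as missing. The base R sketch below shows one way to build such a table; it is not hamr's implementation, and `missing_mask` is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper (not part of hamr): flag which cells count as missing.
missing_mask <- function(dfm, missing_val_char = NA) {
  dfm <- as.data.frame(dfm, stringsAsFactors = FALSE)
  flags <- lapply(dfm, function(column) {
    as_text <- as.character(column)
    # A cell is missing if it is NA or equals the chosen marker.
    is.na(as_text) | as_text %in% as.character(missing_val_char)
  })
  # One logical column per variable, one row per observation.
  matrix(unlist(flags), nrow = nrow(dfm), dimnames = list(NULL, names(dfm)))
}

df <- data.frame(x = c(1, " ", 3), y = c(1, 8, 9))
mask <- missing_mask(df, missing_val_char = " ")
colSums(mask)  # number of missing cells per column

# In an interactive session the mask can be drawn as a simple heat map.
if (interactive()) image(t(mask) * 1, axes = FALSE)
```

Here only the second row of `x` is flagged, so `colSums(mask)` reports one missing cell for `x` and none for `y`.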
--
Usage: impute_missing(dfm, col, method, missing_val_char)
Input:
dfm
: a data frame or a matrix with missing values
col
: a column name (string)
method
: a method name ("CC", "MIP", "DIP")
missing_val_char
: missing value characters (NA, NaN, "", "?")
Output: a data frame with no missing values in the specified column
Example:
> df <- data.frame(exp = c(1, 2, 3), res = c(0, 10, ""))
> impute_missing(df, "res", "MIP", "")
  exp res
1   1   0
2   2  10
3   3   5
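The three method codes (CC for complete case, MIP for mean imputation, DIP for median imputation) can be illustrated with a short base R sketch. This is not hamr's implementation: `impute_sketch` is a hypothetical name, and the sketch assumes the column is numeric apart from its missing-value markers.

```r
# Hypothetical sketch (not hamr's code) of the CC, MIP and DIP strategies.
impute_sketch <- function(dfm, col, method = c("CC", "MIP", "DIP"),
                          missing_val_char = NA) {
  method <- match.arg(method)
  dfm <- as.data.frame(dfm, stringsAsFactors = FALSE)

  # Locate missing cells: NA or the chosen marker.
  as_text <- as.character(dfm[[col]])
  is_missing <- is.na(as_text) | as_text %in% as.character(missing_val_char)

  # Assumption: the remaining values can be read as numbers.
  values <- suppressWarnings(as.numeric(as_text))
  values[is_missing] <- NA

  if (method == "CC") {
    # Complete case: drop the rows where the column is missing.
    dfm[[col]] <- values
    return(dfm[!is_missing, , drop = FALSE])
  }

  # MIP fills with the mean, DIP with the median, of the observed values.
  fill <- if (method == "MIP") mean(values, na.rm = TRUE) else median(values, na.rm = TRUE)
  values[is_missing] <- fill
  dfm[[col]] <- values
  dfm
}

df <- data.frame(exp = c(1, 2, 3), res = c(0, 10, ""))
impute_sketch(df, "res", "MIP", "")  # the blank in res is replaced by 5
impute_sketch(df, "res", "CC", "")   # the third row is dropped
```

With only 0 and 10 observed, the mean and the median are both 5, so MIP and DIP give the same result on this toy data.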
--
Usage: compare_model(df, feature, methods, missing_val_char)
Input:
df (data frame) -- the original dataset with missing values that need to be imputed.
feature (string) -- name of the feature (column) in the original dataset containing missing values that need to be imputed.
methods (string or character vector) -- the methods that users want to compare (default: c("CC", "MIP"))
Supported methods are:
CC - Complete Case
MIP - Imputation with mean value
DIP - Imputation with median value
missing_val_char (string) -- the missing value type.
Supported types are:
NaN - Not a Number
"" - Blank
"?" - Question mark
Output: a summary table comparing the summary statistics of the feature after each method: mean, sd, min, median, max.
Example:
> df <- data.frame(exp = c(1, 2, 3), res = c(0, 10, ""))
> compare_model(df, "res", c("CC","MIP"), "")
         column mean       sd min median max
2  res_after_CC    5 7.071068   0      5  10
3 res_after_MIP    5 5.000000   0      5  10
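The numbers in this table can be reproduced by hand. The sketch below is a base R illustration, not hamr's implementation; `summarise_methods` is a hypothetical name, and it assumes the feature has already been converted to a numeric vector with NA marking the missing entries.

```r
# Hypothetical sketch (not hamr's code): summarise a feature after each method.
summarise_methods <- function(x, feature, methods = c("CC", "MIP")) {
  rows <- lapply(methods, function(m) {
    filled <- switch(m,
      CC  = x[!is.na(x)],                                  # drop missing values
      MIP = replace(x, is.na(x), mean(x, na.rm = TRUE)),   # fill with the mean
      DIP = replace(x, is.na(x), median(x, na.rm = TRUE)), # fill with the median
      stop("unknown method: ", m)
    )
    data.frame(
      column = paste0(feature, "_after_", m),
      mean   = mean(filled),
      sd     = sd(filled),
      min    = min(filled),
      median = median(filled),
      max    = max(filled)
    )
  })
  do.call(rbind, rows)
}

res <- c(0, 10, NA)  # the res column from the example, blank read as NA
summarise_methods(res, "res", c("CC", "MIP"))
```

Complete case keeps only 0 and 10, so the standard deviation is about 7.07. Mean imputation adds a third value at the mean, which leaves the mean at 5 but shrinks the standard deviation to 5. This illustrates why comparing methods side by side is useful.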
--
This package is also available in Python.