knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The goal of mde
is to ease exploration of missingness.
Loading the package
library(mde)
To get a simple missingness report, use na_summary
:
na_summary(airquality)
To sort this summary by a given column :
na_summary(airquality,sort_by = "percent_complete")
If one would like to reset (drop) row names, then one can set row_names
to
TRUE
This may especially be useful in cases where rownames
are simply
numeric and do not have much additional use.
na_summary(airquality,sort_by = "percent_complete", reset_rownames = TRUE)
To sort by percent_missing
instead:
na_summary(airquality, sort_by = "percent_missing")
To sort the above in descending order:
na_summary(airquality, sort_by="percent_missing", descending = TRUE)
To exclude certain columns from the analysis:
na_summary(airquality, exclude_cols = c("Day", "Wind"))
To include or exclude via regex match:
na_summary(airquality, regex_kind = "inclusion",pattern_type = "starts_with", pattern = "O|S")
na_summary(airquality, regex_kind = "exclusion",pattern_type = "regex", pattern = "^[O|S]")
To get this summary by group:
test2 <- data.frame(ID= c("A","A","B","A","B"), Vals = c(rep(NA,4),"No"),ID2 = c("E","E","D","E","D")) na_summary(test2,grouping_cols = c("ID","ID2"))
na_summary(test2, grouping_cols="ID")
get_na_counts
This provides a convenient way to show the number of missing values column-wise. It is relatively fast(tests done on about 400,000 rows, took a few microseconds.)
To get the number of missing values in each column of airquality
, we can use the function as follows:
get_na_counts(airquality)
The above might be less useful if one would like to get the results by group. In that case, one can provide a grouping vector of names in grouping_cols
.
test <- structure(list(Subject = structure(c(1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"), res = c(NA, 1, 2, 3), ID = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "2"), class = "factor")), class = "data.frame", row.names = c(NA, -4L)) get_na_counts(test, grouping_cols = "ID")
percent_missing
This is a very simple to use but quick way to take a look at the percentage of data that is missing column-wise.
percent_missing(airquality)
We can get the results by group by providing an optional grouping_cols
character vector.
percent_missing(test, grouping_cols = "Subject")
To exclude some columns from the above exploration, one can provide an optional character vector in exclude_cols
.
percent_missing(airquality,exclude_cols = c("Day","Temp"))
sort_by_missingness
This provides a very simple but relatively fast way to sort variables by missingness. Unless otherwise stated, this does not currently support arranging grouped percents.
Usage:
sort_by_missingness(airquality, sort_by = "counts")
To sort in descending order:
sort_by_missingness(airquality, sort_by = "counts", descend = TRUE)
To use percentages instead:
sort_by_missingness(airquality, sort_by = "percents")
Please note that the mde
project is released with a
Contributor Code of Conduct.
By contributing to this project, you agree to abide by its terms.
For further exploration, please browseVignettes("mde")
.
To raise an issue, please do so here
Thank you, feedback is always welcome :)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.