knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The goal of mde is to ease exploration of missingness.

Loading the package

library(mde)

To get a simple missingness report, use na_summary:

na_summary(airquality)

To sort this summary by a given column :

na_summary(airquality,sort_by = "percent_complete")

If one would like to reset (drop) row names, then one can set row_names to TRUE This may especially be useful in cases where rownames are simply numeric and do not have much additional use.

na_summary(airquality,sort_by = "percent_complete", reset_rownames = TRUE)

To sort by percent_missing instead:

na_summary(airquality, sort_by = "percent_missing")

To sort the above in descending order:

na_summary(airquality, sort_by="percent_missing", descending = TRUE)

To exclude certain columns from the analysis:

na_summary(airquality, exclude_cols = c("Day", "Wind"))

To include or exclude via regex match:

na_summary(airquality, regex_kind = "inclusion",pattern_type = "starts_with", pattern = "O|S")
na_summary(airquality, regex_kind = "exclusion",pattern_type = "regex", pattern = "^[O|S]")

To get this summary by group:

test2 <- data.frame(ID= c("A","A","B","A","B"), Vals = c(rep(NA,4),"No"),ID2 = c("E","E","D","E","D"))

na_summary(test2,grouping_cols = c("ID","ID2"))
na_summary(test2, grouping_cols="ID")

This provides a convenient way to show the number of missing values column-wise. It is relatively fast(tests done on about 400,000 rows, took a few microseconds.)

To get the number of missing values in each column of airquality, we can use the function as follows:

get_na_counts(airquality)

The above might be less useful if one would like to get the results by group. In that case, one can provide a grouping vector of names in grouping_cols.

test <- structure(list(Subject = structure(c(1L, 1L, 2L, 2L), .Label = c("A", 
"B"), class = "factor"), res = c(NA, 1, 2, 3), ID = structure(c(1L, 
1L, 2L, 2L), .Label = c("1", "2"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

get_na_counts(test, grouping_cols = "ID")

This is a very simple to use but quick way to take a look at the percentage of data that is missing column-wise.

percent_missing(airquality)

We can get the results by group by providing an optional grouping_cols character vector.

percent_missing(test, grouping_cols = "Subject")

To exclude some columns from the above exploration, one can provide an optional character vector in exclude_cols.

percent_missing(airquality,exclude_cols = c("Day","Temp"))

This provides a very simple but relatively fast way to sort variables by missingness. Unless otherwise stated, this does not currently support arranging grouped percents.

Usage:

sort_by_missingness(airquality, sort_by = "counts")

To sort in descending order:

sort_by_missingness(airquality, sort_by = "counts", descend = TRUE)

To use percentages instead:

sort_by_missingness(airquality, sort_by = "percents")

Please note that the mde project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

For further exploration, please browseVignettes("mde").

To raise an issue, please do so here

Thank you, feedback is always welcome :)



Try the mde package in your browser

Any scripts or data that you put into this service are public.

mde documentation built on Feb. 10, 2022, 5:08 p.m.