In alastairrushworth/inspectdf: Inspection, Comparison and Visualisation of Data Frames

Illustrative data: `starwars`

The examples below use the starwars data set from the dplyr package.

library(dplyr)
data(starwars, package = "dplyr")

# print the first few rows
head(starwars)

`inspect_cat()` for a single data frame

inspect_cat() returns a tibble summarising the categorical features in a data frame, combining the functionality of the inspect_imb() and table() functions. The returned tibble contains the following columns:

col_name, the name of each categorical column
cnt, the number of unique levels in the feature
common, the most common level (see also inspect_imb())
common_pcnt, the percentage occurrence of the most common level
levels, a named list of tibbles, each containing a frequency tabulation of all levels in the column

library(inspectdf)

# explore the categorical features 
x <- inspect_cat(starwars)
x

For example, the levels for the hair_color column are

# show frequency tibble for `hair_color` column:
x$levels$hair_color

Note that by default, if missing (NA) values are present, they are counted as a distinct categorical level. A barplot showing the composition of each categorical column can be created using the show_plot() function. Note how missing values are shown as grey bars:

x %>% show_plot()
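As a rough illustration of what the cnt, common and common_pcnt columns summarise, and of how missing values count as their own level, the following base R sketch tabulates a small made-up vector. The vector `hair` and the variable names are hypothetical, and this is not the package's own implementation:

```r
# Toy vector standing in for a categorical column (hypothetical data)
hair <- c("brown", "none", NA, "brown", "black", NA, "brown")

# useNA = "ifany" keeps NA as its own level in the tabulation
tab  <- table(hair, useNA = "ifany")
freq <- sort(tab, decreasing = TRUE)

cnt         <- length(freq)                              # distinct levels, NA included
common      <- names(freq)[1]                            # most common level
common_pcnt <- 100 * as.numeric(freq[1]) / length(hair)  # its share, in percent

freq
```

Here NA is tabulated alongside the observed values, so it contributes to the level count in the same way as any other category.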

The argument high_cardinality in the show_plot() function can be used to bundle together categories that occur only a small number of times. For example, to combine categories only occurring once, use:

x %>% 
  show_plot(high_cardinality = 1)

The resulting bundles are shown in purple.
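The idea of bundling rare categories can be sketched in base R. The example below assumes that a level is bundled when it occurs at most `threshold` times; the vector `species`, the `threshold` variable and the bundle label are all hypothetical, and show_plot() may implement the details differently:

```r
# Toy vector standing in for a categorical column (hypothetical data)
species <- c("Human", "Droid", "Human", "Wookiee", "Gungan", "Droid", "Human")

counts <- table(species)

# Assumption: levels occurring at most `threshold` times are bundled
threshold <- 1
rare <- names(counts)[counts <= threshold]

# Replace each rare level with a single bundle label (the label is arbitrary)
bundled <- ifelse(species %in% rare, "rare levels", species)
bundled_counts <- table(bundled)

bundled_counts
```

With a threshold of 1, the two levels that appear once are merged into a single entry, while the more frequent levels are left unchanged.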

`inspect_cat()` for two data frames

To illustrate the comparison of two data frames, we first create two new data frames by randomly sampling the rows of starwars and also dropping some of the columns. The results are assigned to the objects star_1 and star_2:

# sample 50 rows from `starwars`
star_1 <- starwars %>% sample_n(50)
# sample 50 rows from `starwars` and drop the first two columns
star_2 <- starwars %>% sample_n(50) %>% select(-1, -2)

To compare the categorical columns in a pair of data frames, pass both to inspect_cat():

inspect_cat(star_1, star_2)

The returned tibble has the following columns:

jsd, the Jensen-Shannon divergence: a measure of how different the distributions of levels are between columns with the same name present in both data frames. Values lie between 0 and 1; values closer to 1 indicate larger differences in distribution.
pval, the p-value associated with a modified $\chi^2$ test of the relative frequencies of levels in columns with the same name present in both data frames.
lvls_1 and lvls_2, named list columns containing the frequency tables for each column in the first and second data frames passed to inspect_cat().
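To give a feel for the jsd column, here is a minimal base R sketch of the Jensen-Shannon divergence using base-2 logarithms, which keeps the value between 0 and 1. The function `jsd()` and the frequency vectors are hypothetical, the two vectors are assumed to list the same levels in the same order, and the package's own calculation may differ in detail:

```r
# Jensen-Shannon divergence with base-2 logarithms, so the value lies in [0, 1]
jsd <- function(p, q) {
  p <- p / sum(p)   # convert counts to relative frequencies
  q <- q / sum(q)
  m <- (p + q) / 2  # mixture of the two distributions
  # Kullback-Leibler divergence, treating 0 * log(0) as 0
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

# Hypothetical level counts for the same column in two data frames
freq_1 <- c(brown = 20, black = 15, none = 15)
freq_2 <- c(brown = 10, black = 10, none = 30)

jsd(freq_1, freq_1)                    # identical distributions
jsd(c(a = 1, b = 0), c(a = 0, b = 1))  # distributions with no shared levels
jsd(freq_1, freq_2)                    # somewhere in between
```

Identical distributions sit at the lower end of the range and distributions with no levels in common sit at the upper end, which matches the interpretation of jsd given above.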

alastairrushworth/inspectdf documentation built on Aug. 15, 2022, 1:23 a.m.
