starwars
The examples below make use of the starwars data from the dplyr package.
library(dplyr)
data(starwars, package = "dplyr")
# print the first few rows
head(starwars)
inspect_cat() for a single data frame
inspect_cat() returns a tibble summarising the categorical features in a data frame, combining the functionality of the inspect_imb() and table() functions. The tibble generated contains the columns:
col_name: the name of each categorical column
cnt: the number of unique levels in the feature
common: the most common level (see also inspect_imb())
common_pcnt: the percentage occurrence of the most dominant level
levels: a list of tibbles, each containing frequency tabulations of all levels
library(inspectdf)
# explore the categorical features
x <- inspect_cat(starwars)
x
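To make the meaning of these columns concrete, the sketch below computes analogous quantities by hand for a single toy vector using only base R. The vector `hair` and the variable names are made up for illustration, and this is not how inspectdf computes its summary internally; it only mirrors the column definitions given above, with NA counted as its own level.

```r
# Toy categorical vector (hypothetical data, not taken from starwars)
hair <- c("brown", "blond", NA, "brown", "none", NA, "brown")

# Frequency table that keeps NA as a separate level
tab <- table(hair, useNA = "ifany")

# Analogue of `cnt`: number of distinct levels, NA included
cnt <- length(tab)

# Analogue of `common`: the most frequent level
common <- names(tab)[which.max(tab)]

# Analogue of `common_pcnt`: percentage share of the most frequent level
common_pcnt <- 100 * max(tab) / sum(tab)

print(tab)
print(c(cnt = cnt, common_pcnt = round(common_pcnt, 2)))
print(common)
```

Dropping `useNA = "ifany"` would exclude the missing values from the table, which changes both the level count and the percentages.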
For example, the levels for the hair_color column are:
# show frequency tibble for `hair_color` column:
x$levels$hair_color
Note that by default, if missing (NA) values are present, they are counted as a distinct categorical level. A barplot showing the composition of each categorical column can be created using the show_plot() function. Note how missing values are shown as grey bars:
x %>% show_plot()
The argument high_cardinality in the show_plot() function can be used to bundle together categories that occur only a small number of times. For example, to combine categories occurring only once, use:
x %>% show_plot(high_cardinality = 1)
The resulting bundles are shown in purple.
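The idea of bundling rare categories can be sketched in base R as follows. This is only an illustration of the concept under stated assumptions: the toy vector `species`, the `threshold` variable and the label "High cardinality" are hypothetical, and the text does not say how show_plot() performs the bundling internally.

```r
# Toy categorical vector (hypothetical data)
species <- c("Human", "Droid", "Human", "Wookiee", "Ewok", "Droid", "Human")

# Count how often each level occurs
tab <- table(species)

# Levels occurring at most `threshold` times are treated as rare
threshold <- 1
rare <- names(tab)[tab <= threshold]

# Replace every rare level with a single bundled label (label is an assumption)
bundled <- ifelse(species %in% rare, "High cardinality", species)

print(table(bundled))
```

Raising `threshold` bundles more levels together, which keeps a plot readable when a column has many rarely occurring values.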
inspect_cat() for two data frames
To illustrate the comparison of two data frames, we first create two new data frames by randomly sampling the rows of starwars and also dropping some of the columns. The results are assigned to the objects star_1 and star_2:
# sample 50 rows from `starwars`
star_1 <- starwars %>% sample_n(50)
# sample 50 rows from `starwars` and drop the first two columns
star_2 <- starwars %>% sample_n(50) %>% select(-1, -2)
To compare the character columns in a pair of data frames, pass both to inspect_cat():
inspect_cat(star_1, star_2)
The tibble returned has the following columns:
jsd: the Jensen-Shannon divergence, a measure of how different the distributions of levels are between columns with the same name present in both data frames. Values lie between 0 and 1; values closer to 1 indicate bigger differences in distribution.
pval: p-values associated with a modified $\chi^2$ test of the relative frequencies of levels in columns with the same name present in both data frames.
lvls_1 and lvls_2: named list columns containing the frequency tables for each column in the first and second data frames passed to inspect_cat().
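For intuition about the jsd column, the following base R sketch computes a Jensen-Shannon divergence between the level frequencies of two character vectors. It assumes base-2 logarithms, which gives the 0 to 1 range described above, and it assumes the inputs contain no missing values; the function name `jsd_sketch` is hypothetical, and the text does not confirm that inspectdf uses exactly this formulation.

```r
# Jensen-Shannon divergence between the level frequencies of two vectors.
# Assumes base-2 logs and no NA values in x or y (illustrative sketch only).
jsd_sketch <- function(x, y) {
  # Use a common set of levels so both frequency vectors line up
  lv <- sort(union(x, y))
  p <- as.numeric(table(factor(x, levels = lv))) / length(x)
  q <- as.numeric(table(factor(y, levels = lv))) / length(y)
  m <- (p + q) / 2

  # Kullback-Leibler divergence in bits, treating 0 * log(0) as 0
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))

  (kl(p, m) + kl(q, m)) / 2
}

same    <- jsd_sketch(c("a", "b", "b"), c("a", "b", "b"))  # identical distributions
differ  <- jsd_sketch(c("x", "x"), c("y", "y"))            # no levels in common
partial <- jsd_sketch(c("a", "a", "b"), c("a", "b", "b"))  # overlapping levels

print(c(same = same, differ = differ, partial = partial))
```

Identical distributions give a value of 0 and distributions with no shared levels give 1, which matches the interpretation of values closer to 1 as bigger differences.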