knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
iris <- tibble::tibble(iris)
library(validata)
library(tidyselect)
In data analysis tasks we often have data sets with multiple possible ID columns, but it's not always clear which combination uniquely identifies each row.
sample_data1 has 125 rows, with 3 ID-type columns and 3 value columns.
head(sample_data1)
Let's use confirm_distinct iteratively to find the uniquely identifying columns of sample_data1.
sample_data1 %>% confirm_distinct(ID_COL1)
sample_data1 %>% confirm_distinct(ID_COL1, ID_COL2)
sample_data1 %>% confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
Here we can conclude that the combination of 3 ID columns is the primary key for the data.
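To make the idea of a uniquely identifying combination concrete, here is a minimal base R sketch. It does not use validata and is not how confirm_distinct is implemented; the toy data frame and the is_distinct helper are hypothetical names introduced only for illustration.

```r
# Hypothetical toy data; this is not the package's sample_data1.
toy <- data.frame(
  id1 = c("a", "a", "b", "b"),
  id2 = c(1, 2, 1, 1),
  id3 = c("x", "x", "x", "y"),
  value = c(10, 20, 30, 40)
)

# TRUE when no two rows share the same combination of values in `cols`.
is_distinct <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

is_distinct(toy, "id1")                  # FALSE
is_distinct(toy, c("id1", "id2"))        # FALSE
is_distinct(toy, c("id1", "id2", "id3")) # TRUE
```

Adding columns one at a time, as above, mirrors the iterative check: the combination is a key only once no duplicated rows remain.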
These steps can be automated with the wrapper function determine_distinct.
sample_data1 %>% determine_distinct(matches("ID"))
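One way such a search could be automated is to try every combination of candidate columns, smallest first, and stop at the first size that yields a key. The base R sketch below illustrates that idea under stated assumptions; find_keys and the toy data are hypothetical, and the package's own search may work differently.

```r
# Hypothetical toy data; this is not the package's sample_data1.
toy <- data.frame(
  id1 = c("a", "a", "b", "b"),
  id2 = c(1, 2, 1, 1),
  id3 = c("x", "x", "x", "y"),
  value = c(10, 20, 30, 40)
)

# Return the smallest column combinations that uniquely identify each row.
find_keys <- function(df, cols) {
  for (k in seq_along(cols)) {
    combos <- combn(cols, k, simplify = FALSE)
    hits <- Filter(
      function(cc) anyDuplicated(df[, cc, drop = FALSE]) == 0,
      combos
    )
    if (length(hits) > 0) {
      return(hits)
    }
  }
  list() # no combination of `cols` is a key
}

# Select candidate columns by name pattern, similar in spirit to matches("ID").
id_cols <- grep("^id", names(toy), value = TRUE)
keys <- find_keys(toy, id_cols)
print(keys)
```

The number of combinations grows quickly with the number of candidate columns, so an exhaustive search like this is only practical for a handful of ID columns.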
confirm_mapping tells you the mapping between two columns in a data frame. It also gives the option to view which type of mapping is associated with each individual row.
sample_data1 %>% confirm_mapping(ID_COL1, ID_COL2, view = FALSE)
sample_data1 %>% determine_mapping(everything())
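For intuition about what a mapping between two columns means, here is a small base R sketch that classifies the relationship between two vectors. The mapping_type function and its labels are assumptions made for this illustration; they are not taken from validata and may not match the output of confirm_mapping.

```r
# Classify how values of x relate to values of y (illustrative labels only).
mapping_type <- function(x, y) {
  pairs <- unique(data.frame(x = x, y = y))
  many_y_per_x <- any(table(pairs$x) > 1) # some x pairs with several y
  many_x_per_y <- any(table(pairs$y) > 1) # some y pairs with several x
  if (!many_y_per_x && !many_x_per_y) {
    "1 - 1"
  } else if (many_y_per_x && !many_x_per_y) {
    "1 - many"
  } else if (!many_y_per_x && many_x_per_y) {
    "many - 1"
  } else {
    "many - many"
  }
}

mapping_type(c(1, 2, 3), c("a", "b", "c"))       # "1 - 1"
mapping_type(c(1, 1, 2), c("a", "b", "c"))       # "1 - many"
mapping_type(c(1, 2, 3), c("a", "a", "b"))       # "many - 1"
mapping_type(c(1, 1, 2, 2), c("a", "b", "a", "b")) # "many - many"
```

Missing values are ignored by table() in this sketch, so real data with NAs would need an explicit decision about how to treat them.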
The overlap functions give a Venn-style description of the values in 2 columns. This is especially useful before performing a join, when you want to confirm that the data frames have matching keys.
confirm_overlap is different from the other confirm functions in that it takes 2 vectors as arguments instead of a data frame. This allows the user to test overlap between different data frames, or between arbitrary vectors if necessary.
confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap
confirm_overlap returns a summary data frame invisibly, allowing you to access individual elements using the helper functions.
print(iris_overlap)
Find the elements unique to the first column
iris_overlap %>% co_find_only_in_1() %>% head()
Find the elements unique to the second column
iris_overlap %>% co_find_only_in_2() %>% head()
Find the elements shared by both columns
iris_overlap %>% co_find_in_both() %>% head()
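The three groups above correspond to ordinary set operations. As a rough cross-check that does not depend on validata, the same Venn-style split can be sketched with base R's setdiff and intersect; this shows the underlying idea and is not the package's implementation.

```r
# Use the built-in iris data explicitly so the sketch is self-contained.
x <- datasets::iris$Sepal.Width
y <- datasets::iris$Petal.Length

only_in_x <- setdiff(x, y)   # distinct values that appear only in x
only_in_y <- setdiff(y, x)   # distinct values that appear only in y
in_both   <- intersect(x, y) # distinct values shared by both

# Counts for each region of the Venn diagram.
c(
  only_in_x = length(only_in_x),
  only_in_y = length(only_in_y),
  in_both   = length(in_both)
)
```

These set functions compare values exactly, so floating-point columns that differ by tiny rounding errors would be treated as non-overlapping.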
determine_overlap takes a data frame and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs with matching types are tested.
iris %>% determine_overlap(everything())
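A pairwise overlap summary of this kind can be approximated in base R by enumerating column pairs with combn. The sketch below restricts itself to numeric columns as one simple way of keeping types matched; the summarise_pair helper and the output column names are hypothetical and will not match the tibble that determine_overlap returns.

```r
df <- datasets::iris

# Keep only numeric columns so every tested pair has matching types.
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
col_pairs <- combn(num_cols, 2, simplify = FALSE)

# Count distinct values unique to each column and shared by both.
summarise_pair <- function(p) {
  a <- unique(df[[p[1]]])
  b <- unique(df[[p[2]]])
  data.frame(
    col1 = p[1],
    col2 = p[2],
    only_in_1 = length(setdiff(a, b)),
    only_in_2 = length(setdiff(b, a)),
    in_both = length(intersect(a, b))
  )
}

pairwise <- do.call(rbind, lapply(col_pairs, summarise_pair))
print(pairwise)
```

With the 4 numeric columns of iris this produces one row for each of the 6 possible pairs.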
Note that the overlap functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see Complex Upset Plots.
Get a frequency table of string lengths in a character column. The table is printed, while the original data frame is returned invisibly with an added column indicating the string lengths.
iris %>% confirm_strlen(Species) -> species_len
The output is a data frame.
head(species_len)
A helper function for the output of confirm_strlen that filters the data frame for chosen string lengths.
species_len %>% choose_strlen(len = 6) %>% head()
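For comparison, the same two steps (a frequency table of string lengths, then a filter on a chosen length) can be sketched in base R with nchar and table. This only illustrates the idea; the str_len column name is an assumption and may differ from the column that confirm_strlen adds.

```r
# Species is a factor in the built-in iris data, so convert it to character first.
species <- as.character(datasets::iris$Species)

# Frequency table of string lengths: "setosa" has 6 characters,
# "virginica" has 9, and "versicolor" has 10.
len_table <- table(nchar(species))
print(len_table)

# Add the length as a column (hypothetical name), then keep rows of length 6.
with_len <- transform(datasets::iris, str_len = nchar(as.character(Species)))
len6 <- with_len[with_len$str_len == 6, ]
head(len6)
```

Filtering on a length of 6 keeps only the 50 setosa rows, since that is the only species name with 6 characters.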
A reproduction of diagnose from the dlookr package. It is usually a good first step when analyzing a data set.
iris %>% diagnose()