Package Introduction

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)


iris <- tibble::tibble(iris)
library(validata)
library(tidyselect)

Distinct

Confirm Distinct

In data analysis tasks we often have data sets with multiple possible ID columns, but it's not always clear which combination uniquely identifies each row.

sample_data1 has 125 rows, with 3 ID-type columns and 3 value columns.

head(sample_data1)

Let's use confirm_distinct iteratively to find the uniquely identifying columns of sample_data1.

sample_data1 %>% 
  confirm_distinct(ID_COL1)
sample_data1 %>% 
  confirm_distinct(ID_COL1, ID_COL2)
sample_data1 %>% 
  confirm_distinct(ID_COL1, ID_COL2, ID_COL3)

Here we can conclude that the combination of 3 ID columns is the primary key for the data.
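As a rough illustration of what such a check establishes, the sketch below tests whether a set of columns uniquely identifies each row using only base R. The toy data frame and the helper is_distinct_on are hypothetical names made up for this example; they are not part of validata and do not reproduce the output of confirm_distinct.

```r
# Toy data (hypothetical; not the package's sample_data1).
toy <- data.frame(
  id_a  = c(1, 1, 2, 2),
  id_b  = c("x", "y", "x", "y"),
  value = c(10, 20, 30, 40)
)

# TRUE when no two rows share the same values in `cols`.
# `is_distinct_on` is an illustrative helper, not a validata function.
is_distinct_on <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

id_a_alone <- is_distinct_on(toy, "id_a")             # id_a repeats across rows
id_a_and_b <- is_distinct_on(toy, c("id_a", "id_b"))  # each pair occurs once
```

Adding columns one at a time until the check passes mirrors the iterative use of confirm_distinct shown above.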

Determine Distinct

These steps can be automated with the wrapper function determine_distinct.

sample_data1 %>% 
  determine_distinct(matches("ID"))
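One way to picture this kind of automated search is to try every combination of candidate columns, smallest first, and stop at the first size where some combination is unique. The sketch below does that in base R. It is an assumption about the general idea, not the package's actual implementation, and find_keys and the toy data are hypothetical.

```r
toy <- data.frame(
  id_a = c(1, 1, 2, 2),
  id_b = c("x", "y", "x", "y"),
  id_c = c("p", "p", "p", "q")
)

# Return the smallest column combinations that uniquely identify rows.
# `find_keys` is a hypothetical helper for illustration only.
find_keys <- function(df, cols = names(df)) {
  for (k in seq_along(cols)) {
    combos <- combn(cols, k, simplify = FALSE)
    is_key <- vapply(
      combos,
      function(cc) anyDuplicated(df[, cc, drop = FALSE]) == 0,
      logical(1)
    )
    if (any(is_key)) {
      return(combos[is_key])
    }
  }
  list()  # no combination is unique (some rows are exact duplicates)
}

keys <- find_keys(toy)
```

In this toy data no single column is unique, and the pair id_a and id_b is the only two-column combination without duplicates. The number of combinations grows quickly with the number of candidate columns, so an exhaustive search like this only suits a handful of ID columns.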

Mapping

confirm_mapping tells you the type of mapping between two columns in a data frame: one-to-one, one-to-many, many-to-one, or many-to-many.

Confirm Mapping

confirm_mapping gives the option to view which type of mapping is associated with each individual row.

sample_data1 %>% 
  confirm_mapping(ID_COL1, ID_COL2, view = FALSE)
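To make the idea of a column mapping concrete, here is a minimal base R sketch that classifies the relationship between two equal-length vectors. The helper mapping_type and its labels are hypothetical and chosen for illustration; the labels and output format of confirm_mapping may differ.

```r
# Classify the relationship between two equal-length vectors.
# `mapping_type` is a hypothetical helper, not part of validata.
mapping_type <- function(x, y) {
  # Keep each observed (x, y) combination once.
  pairs <- unique(data.frame(x = x, y = y))
  # An x value repeated among the distinct pairs maps to several y values.
  x_to_many <- any(duplicated(pairs$x))
  # A y value repeated among the distinct pairs is reached by several x values.
  y_to_many <- any(duplicated(pairs$y))
  if (x_to_many && y_to_many) {
    "many-to-many"
  } else if (x_to_many) {
    "one-to-many"
  } else if (y_to_many) {
    "many-to-one"
  } else {
    "one-to-one"
  }
}

m1 <- mapping_type(c(1, 2, 3), c("a", "b", "c"))
m2 <- mapping_type(c(1, 1, 2), c("a", "b", "c"))
m3 <- mapping_type(c(1, 2, 3), c("a", "a", "b"))
m4 <- mapping_type(c(1, 1, 2), c("a", "b", "a"))
```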

Determine Mapping

sample_data1 %>% 
  determine_mapping(everything())

Overlap

The overlap functions give a Venn-style description of the values in 2 columns. This is especially useful before performing a join, when you want to confirm that the data frames have matching keys.

Confirm Overlap

confirm_overlap differs from the other confirm functions in that it takes 2 vectors as arguments instead of a data frame. This allows the user to test overlap between columns of different data frames, or between arbitrary vectors if necessary.

confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap

confirm_overlap invisibly returns a summary data frame, whose individual elements you can access using the helper functions.

print(iris_overlap)

Find the elements unique to the first column

iris_overlap %>% 
  co_find_only_in_1() %>% 
  head()

Find the elements unique to the second column

iris_overlap %>% 
  co_find_only_in_2() %>% 
  head()

Find the elements shared by both columns

iris_overlap %>% 
  co_find_in_both() %>% 
  head()
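For comparison, the same three groups can be computed with base R set operations. This is a sketch on made-up vectors, not a description of how the co_find_* helpers work internally.

```r
# Hypothetical key vectors from two tables about to be joined.
a <- c(1, 2, 3, 4)
b <- c(3, 4, 5)

only_in_a <- setdiff(a, b)    # values present in a but not in b
only_in_b <- setdiff(b, a)    # values present in b but not in a
in_both   <- intersect(a, b)  # values present in both vectors
```

A non-empty only_in_a or only_in_b before a join signals keys that will find no match on the other side.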

Determine Overlap

determine_overlap takes a dataframe and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs with matching types are tested.

iris %>% 
  determine_overlap(everything())
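The pairwise idea can be sketched in base R as follows: enumerate every pair of columns, keep the pairs whose classes match, and count the distinct values on each side. The toy data, the summary column names, and the use of class() to decide whether types match are all assumptions made for this example; the tibble returned by determine_overlap may be laid out differently.

```r
toy <- data.frame(
  a = c(1, 2, 3),
  b = c(2, 3, 4),
  s = c("u", "v", "w"),
  stringsAsFactors = FALSE
)

# All unordered pairs of column names.
col_pairs <- combn(names(toy), 2, simplify = FALSE)

# Keep only pairs whose columns share the same class.
same_type <- Filter(
  function(p) identical(class(toy[[p[1]]]), class(toy[[p[2]]])),
  col_pairs
)

# One summary row per remaining pair, counting distinct values.
overlap_summary <- do.call(rbind, lapply(same_type, function(p) {
  x <- unique(toy[[p[1]]])
  y <- unique(toy[[p[2]]])
  data.frame(
    col1      = p[1],
    col2      = p[2],
    only_in_1 = length(setdiff(x, y)),
    only_in_2 = length(setdiff(y, x)),
    in_both   = length(intersect(x, y))
  )
}))
```

Here only the pair a and b survives the type filter, because s is a character column while a and b are numeric.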

Note that the overlap functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see Complex Upset Plots.

String Length

Confirm String Length

confirm_strlen gives a frequency table of string lengths in a character column. The table is printed, while the original data frame is returned invisibly with an added column indicating the string lengths.

iris %>% 
  confirm_strlen(Species) -> species_len

The output is a data frame.

head(species_len)

Choose String Length

choose_strlen is a helper function for the output of confirm_strlen that filters the data frame to the chosen string lengths.

species_len %>% 
  choose_strlen(len = 6) %>% 
  head()
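The two steps above, tabulating string lengths and then filtering on a chosen length, can be approximated in base R with nchar and table. This sketch uses the built-in iris data and is not the package's implementation. Because Species is a factor in the built-in data and nchar requires a character vector, the column is converted first.

```r
# Species is a factor in the built-in data, so convert before measuring.
species <- as.character(datasets::iris$Species)

lens <- nchar(species)   # string length of each value
len_freq <- table(lens)  # frequency of each string length

# Keep only the values whose length is 6.
six_chars <- species[lens == 6]
```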

diagnose

A reproduction of diagnose from the dlookr package. It is usually a good first step when analyzing a new data set.

iris %>% 
  diagnose()



validata documentation built on Oct. 5, 2021, 9:08 a.m.