knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(scrutiny)
You can use scrutiny to detect duplicate values in any dataset. Detecting duplicates can go a long way toward assessing the reliability of published research.
This vignette walks you through scrutiny's tools for detecting, counting, and summarizing duplicates. It uses the pigs4 dataset as a simple example:
pigs4
duplicate_count()
A good first step is to create a frequency table. To do so, use duplicate_count():
pigs4 %>% duplicate_count()
It returns a tibble (data frame) that lists each unique value. The tibble is ordered by the frequency of values in the input data frame, so the values that appear most often are at the top. The locations are the names of all the columns in which a given value appears, and locations_n counts those columns.
For example, 5.17 is the most frequent value in pigs4. It appears 3 times (frequency), namely in the snout, tail, and wings columns, so locations_n is also 3. The next most frequent value is 4.22, which appears twice, but both of these instances are in the snout column, so locations_n is 1.
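To make the meaning of frequency and locations_n concrete, here is a minimal base R sketch that builds a similar frequency table by hand. It is not scrutiny's implementation. The data frame toy is a hypothetical stand-in whose values were chosen to match the ones described in this vignette, and it is not guaranteed to be identical to pigs4. The names toy and freq_table are made up for illustration.

```r
# Hypothetical stand-in data (assumption: not necessarily identical to pigs4)
toy <- data.frame(
  snout = c("4.73", "8.13", "4.22", "4.22", "5.17"),
  tail  = c("6.88", "7.33", "5.17", "7.57", "8.13"),
  wings = c("6.09", "8.27", "4.40", "5.92", "5.17"),
  stringsAsFactors = FALSE
)

# Flatten all cells into one vector and remember each cell's column
values  <- unlist(toy, use.names = FALSE)
columns <- rep(names(toy), each = nrow(toy))

# How often does each distinct value occur in the whole data frame?
frequency <- table(values)

# In how many distinct columns does each value occur?
locations_n <- tapply(columns, values, function(x) length(unique(x)))

freq_table <- data.frame(
  value       = names(frequency),
  frequency   = as.integer(frequency),
  locations_n = as.integer(locations_n[names(frequency)]),
  stringsAsFactors = FALSE
)

# Put the most frequent values first
freq_table <- freq_table[order(-freq_table$frequency), ]
head(freq_table, 3)
```

The sketch omits the locations column itself. Adding it would only require pasting together the unique column names per value instead of counting them.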
Run audit() after duplicate_count() to get summary statistics for the two numeric columns, frequency and locations_n:
pigs4 %>% duplicate_count() %>% audit()
duplicate_count_colpair()
Sometimes, a sequence of data may be repeated in multiple columns. duplicate_count_colpair() helps find such cases:
pigs4 %>% duplicate_count_colpair()
x and y represent all pairwise combinations of columns in pigs4. The count is the number of values that appear in both of the respective columns. total_x and total_y are the numbers of non-missing values in the original columns listed under x and y. Similarly, rate_x is the rate of x values that also appear in y, and rate_y is the rate of y values that also appear in x. If there are no missing values, total_x is the same as total_y, and rate_x is the same as rate_y.
Here, snout and tail are the column pair with the most overlap: 2 out of 5 values are the same, a duplication rate of 0.4.
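The arithmetic behind that rate can be sketched in base R. This is an illustration under stated assumptions, not scrutiny's implementation: count is taken here to be the number of distinct values shared by two columns, which is one plausible reading of the description above. scrutiny's own counting rule may differ in edge cases, such as values repeated within a single column. The data frame toy is a hypothetical stand-in consistent with the values described in this vignette, and overlap is a made-up name.

```r
# Hypothetical stand-in data (assumption: not necessarily identical to pigs4)
toy <- data.frame(
  snout = c("4.73", "8.13", "4.22", "4.22", "5.17"),
  tail  = c("6.88", "7.33", "5.17", "7.57", "8.13"),
  wings = c("6.09", "8.27", "4.40", "5.92", "5.17"),
  stringsAsFactors = FALSE
)

# All unordered pairs of column names, one pair per row
pairs <- t(combn(names(toy), 2))
overlap <- data.frame(x = pairs[, 1], y = pairs[, 2], stringsAsFactors = FALSE)

# Assumption: count = number of distinct values present in both columns
overlap$count <- mapply(
  function(x, y) length(intersect(toy[[x]], toy[[y]])),
  overlap$x, overlap$y,
  USE.NAMES = FALSE
)

# Non-missing values in the column listed under x
overlap$total_x <- vapply(
  overlap$x,
  function(x) sum(!is.na(toy[[x]])),
  integer(1),
  USE.NAMES = FALSE
)

# Share of x values that also appear in y, under the assumption above
overlap$rate_x <- overlap$count / overlap$total_x
overlap
```

For the snout and tail pair, the shared values are 5.17 and 8.13, so the sketch yields a count of 2 and a rate of 2 / 5 = 0.4.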
Again, you can call audit() for summary statistics:
pigs4 %>% duplicate_count_colpair() %>% audit()
duplicate_tally()
Unlike the other two functions, duplicate_tally() largely preserves the structure of the original data frame. It only adds a column ending in _n next to each original column. The new columns count how often the values to their left appear in the data frame as a whole:
pigs4 %>% duplicate_tally()
In snout, for example, 4.22 appears twice, so its entries in snout_n are 2. Likewise, 8.13 appears once in snout and once in tail, so both observations are marked 2 in the respective _n columns.
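The counting rule described here can be reproduced in a few lines of base R. This is a sketch of the idea, not scrutiny's implementation. The data frame toy is a hypothetical stand-in consistent with the values described in this vignette, and tally is a made-up name.

```r
# Hypothetical stand-in data (assumption: not necessarily identical to pigs4)
toy <- data.frame(
  snout = c("4.73", "8.13", "4.22", "4.22", "5.17"),
  tail  = c("6.88", "7.33", "5.17", "7.57", "8.13"),
  wings = c("6.09", "8.27", "4.40", "5.92", "5.17"),
  stringsAsFactors = FALSE
)

# Count every value across the data frame as a whole
counts <- table(unlist(toy, use.names = FALSE))

# For each original column, look up the overall count of each of its values
tally <- toy
for (col in names(toy)) {
  tally[[paste0(col, "_n")]] <- as.integer(counts[toy[[col]]])
}

# Place each _n column directly to the right of its original column
tally <- tally[, as.vector(rbind(names(toy), paste0(names(toy), "_n")))]
tally
```

Because the lookup uses counts from all columns combined, a value that occurs only once in its own column can still get a tally above 1.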
Following up duplicate_tally() with audit() shows summary statistics for each _n column. The last row summarizes all of these columns together:
pigs4 %>% duplicate_tally() %>% audit()