duplicate_count: Count duplicate values
In scrutiny: Error Detection in Science

Description

duplicate_count() returns a frequency table of the values in x. If x is a data frame, each frequency count pools the values from all of its columns.

This function is a blunt tool designed for initial data checking. It is not very informative if many values have only a few characters each, because such short values are likely to repeat by chance.

For summary statistics, call audit() on the results.

Usage

duplicate_count(x, ignore = NULL, locations_type = c("character", "list"))

Arguments

`x`	Vector or data frame.
`ignore`	Optionally, a vector of values that should not be counted.
`locations_type`	String. One of `"character"` or `"list"`. With `"list"`, each `locations` value is a vector of column names, which is better for further programming. By default (`"character"`), the column names are pasted into a string, which is more readable.
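
The arguments above can be tried out as follows. This is a minimal sketch that assumes the scrutiny package is installed; the data frame `df` is made up for illustration and does not come from the package.

```r
library(scrutiny)

# Made-up example data: "x" and "y" each occur in both columns.
df <- data.frame(
  a = c("x", "y", "z"),
  b = c("y", "x", "w"),
  stringsAsFactors = FALSE
)

# Default (`locations_type = "character"`): the column names in
# `locations` are pasted into a single string per value.
duplicate_count(df)

# With "list", each `locations` entry is a vector of column names,
# which is easier to work with programmatically:
out <- duplicate_count(df, locations_type = "list")
out$locations

# Leave the value "w" out of the count:
duplicate_count(df, ignore = "w")
```

The list form is the one to prefer when the column names feed into further code, for example when subsetting `df` by the columns in which a value appears.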

Value

A tibble. If x is a data frame or another named vector, it has four columns. If x isn't named, only the first two columns appear:

value: All the values from x.
frequency: Absolute frequency of each value in x, in descending order.
locations: Names of all columns from x in which value appears.
locations_n: Number of columns named in locations.

The tibble has the scr_dup_count class, which is recognized by the audit() generic.
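
To make the meaning of the four columns concrete, here is a rough base R sketch that builds a similar table by hand. It is only an illustration under stated assumptions, not the package's implementation: the data frame `df`, the object name `sketch`, and the `", "` separator are all made up, and the result is a plain data frame without the scr_dup_count class.

```r
# Made-up example data; not from the package.
df <- data.frame(
  a = c("x", "y", "z"),
  b = c("y", "x", "w"),
  stringsAsFactors = FALSE
)

# Pool the values of all columns and remember each value's column.
values  <- unlist(lapply(df, as.character), use.names = FALSE)
columns <- rep(names(df), each = nrow(df))

# Absolute frequency of each distinct value, most frequent first.
freq <- sort(table(values), decreasing = TRUE)

# For each distinct value, the columns in which it appears.
locs <- lapply(names(freq), function(v) unique(columns[values == v]))

sketch <- data.frame(
  value       = names(freq),
  frequency   = as.integer(freq),
  locations   = vapply(locs, paste, character(1), collapse = ", "),
  locations_n = lengths(locs),
  stringsAsFactors = FALSE
)
sketch
```

Here "x" and "y" each get a frequency of 2 because the count pools both columns, and their `locations_n` is 2 because each appears in `a` as well as `b`.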

Summaries with `audit()`

The output of duplicate_count() has an S3 method for the audit() generic, so you can call audit() on it. This returns a tibble with summary statistics for the two numeric columns, frequency and locations_n (or, if x isn't named, only for frequency).

Examples

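# Attach scrutiny for duplicate_count() and audit(),
# and magrittr for the %>% pipe used below:
library(scrutiny)
library(magrittr)

# Count duplicate values...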
iris %>%
  duplicate_count()

# ...and compute summaries:
iris %>%
  duplicate_count() %>%
  audit()

# Exclude specific values from the count:
iris %>%
  duplicate_count(ignore = c("setosa", "versicolor", "virginica"))

scrutiny documentation built on Sept. 22, 2024, 9:06 a.m.