duplicate_rows: Find duplicate rows


duplicate_rows    R Documentation

Find duplicate rows

Description

Find duplicate rows in a data frame, optionally restricted to a chosen set of variables.

Usage

duplicate_rows(
  data,
  ...,
  .keep_all = FALSE,
  .both_ways = FALSE,
  .add_count = FALSE,
  .drop_empty = FALSE,
  sort = FALSE,
  .by = NULL,
  .cols = NULL
)

Arguments

data

A data frame.

...

Variables used to find duplicate rows.

.keep_all

If TRUE then all columns of the data frame are kept. The default is FALSE.

.both_ways

If TRUE then both the duplicate rows and their (non-duplicate) first instances are retained. The default is FALSE, which returns only the duplicate rows.
Setting this to TRUE is particularly useful when examining the differences between duplicate rows.

.add_count

If TRUE then a count column is added giving the number of rows in each group of duplicates (including the first, non-duplicate instance). The name of this column follows the convention of dplyr::add_count().

.drop_empty

If TRUE then empty rows, meaning rows in which every value is NA, are removed. The default is FALSE.

sort

Should the result be sorted? If FALSE (the default), rows are returned in the same order as they appear in the data. If TRUE, the duplicate rows are sorted.

.by

(Optional) A selection of columns to group by for this operation. Columns are specified using tidy-select.

.cols

(Optional) Alternative to ... that accepts a named character vector or a numeric vector. This is the recommended interface when speed matters.
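
The semantics of .both_ways, .drop_empty and .add_count described above can be illustrated with base R and dplyr alone. The following is a minimal sketch of the described behaviour, not the implementation used by duplicate_rows(); the data frame df and all object names are hypothetical, and the rows returned by duplicate_rows() itself may differ in detail such as ordering.

```r
library(dplyr)

# Hypothetical data: two duplicated groups plus two all-NA rows
df <- tibble(
  id    = c(1, 1, 2, 3, 3, 3, NA, NA),
  value = c("a", "a", "b", "c", "c", "c", NA, NA)
)

# Analogue of .both_ways = FALSE: only the repeats, first instances excluded
is_repeat <- duplicated(df)
repeats_only <- df[is_repeat, ]

# Analogue of .both_ways = TRUE: repeats plus the first instance of each group
in_dup_group <- is_repeat | duplicated(df, fromLast = TRUE)
with_first <- df[in_dup_group, ]

# Analogue of .drop_empty = TRUE: remove rows in which every column is NA.
# Without this step, all-NA rows count as duplicates of one another.
non_empty <- df %>%
  filter(!if_all(everything(), is.na))

# Naming convention of dplyr::add_count(): the count column is called "n",
# or "nn" (with a message) when a column named "n" already exists
counted <- df %>%
  add_count(id, value)
counted_nn <- df %>%
  rename(n = id) %>%
  add_count(n, value)
```

Comparing repeats_only with with_first shows why .both_ways = TRUE helps when inspecting differences: each repeated row appears next to the first instance it duplicates.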

Details

This function works like dplyr::distinct() in its handling of arguments and data-masking, but returns duplicate rows. In certain situations it can be much faster than data %>% group_by() %>% filter(n() > 1), particularly when there are many groups. fduplicates2() returns the same output but uses a different method, which utilises joins and is written almost entirely using dplyr.
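
For reference, the dplyr baseline mentioned above can be written out in full. This is a sketch on a small hypothetical data frame (df and the result names are assumptions for illustration). Note that group_by() followed by filter(n() > 1) keeps the first instance of each duplicated group as well, so it corresponds to the .both_ways = TRUE behaviour rather than the default.

```r
library(dplyr)

# Hypothetical data frame
df <- tibble(
  x = c(1, 1, 2, 2, 3),
  y = c("a", "a", "b", "c", "d")
)

# Baseline across all columns: every row whose full combination
# of values occurs more than once
dups_all_cols <- df %>%
  group_by(across(everything())) %>%
  filter(n() > 1) %>%
  ungroup()

# Baseline restricted to one variable: rows sharing a value of x
dups_by_x <- df %>%
  group_by(x) %>%
  filter(n() > 1) %>%
  ungroup()
```

This approach evaluates the filter once per group, which is the cost that grows with the number of groups.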

Value

A data.frame of duplicate rows.

See Also

fcount, group_collapse, fdistinct

Examples

library(dplyr)
library(timeplyr)
library(ggplot2)

# Duplicates across all columns
diamonds %>%
  duplicate_rows()
# Alternatively with row ids
diamonds %>%
  filter(frowid(.) > 1)
# Diamonds with the same dimensions
diamonds %>%
  duplicate_rows(x, y, z)
# Can use tidyverse select notation
diamonds %>%
  duplicate_rows(across(where(is.factor)), .keep_all = FALSE)
# Similar to janitor::get_dupes()
diamonds %>%
  duplicate_rows(.add_count = TRUE)
# Keep the first instance of each duplicate row
diamonds %>%
  duplicate_rows(.both_ways = TRUE)
# Same as the following
diamonds %>%
  fadd_count(across(everything())) %>%
  filter(n > 1)


timeplyr documentation built on Sept. 12, 2024, 7:37 a.m.