duplicate_rows: Find duplicate rows
In timeplyr: Fast Tidy Tools for Date and Date-Time Manipulation

Description

Find duplicate rows in a data frame, based either on all columns or on a selected set of variables.

Usage

duplicate_rows(
  data,
  ...,
  .keep_all = FALSE,
  .both_ways = FALSE,
  .add_count = FALSE,
  .drop_empty = FALSE,
  sort = FALSE,
  .by = NULL,
  .cols = NULL
)

Arguments

`data`	A data frame.
`...`	Variables used to find duplicate rows.
`.keep_all`	If `TRUE` then all columns of the data frame are kept. The default is `FALSE`.
`.both_ways`	If `TRUE` then duplicates and non-duplicate first instances are retained. The default is `FALSE` which returns only duplicate rows. Setting this to `TRUE` can be particularly useful when examining the differences between duplicate rows.
`.add_count`	If `TRUE` then a count column is added to denote the number of duplicates (including first non-duplicate instance). The naming convention of this column follows `dplyr::add_count()`.
`.drop_empty`	If `TRUE` then empty rows with all `NA` values are removed. The default is `FALSE`.
`sort`	Should the result be sorted? If `FALSE` (the default), rows are returned in the same order as they appear in the data. If `TRUE`, the duplicate rows are sorted.
`.by`	(Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
`.cols`	(Optional) alternative to `...` that accepts a named character vector or numeric vector. Recommended when speed is a priority.

Details

This function works like `dplyr::distinct()` in its handling of arguments and data-masking, but returns duplicate rows instead of distinct ones. When there are many groups, it can be much faster than `data %>% group_by() %>% filter(n() > 1)`. `fduplicates2()` returns the same output but uses a different method, which utilises joins and is written almost entirely using dplyr.
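
For comparison, the grouped-filter approach mentioned above can be sketched in plain dplyr. This is only an illustration on a made-up toy data frame (`df`, `id` and `value` are hypothetical names), not the method `duplicate_rows()` uses internally. The correspondence to `.both_ways` is inferred from the argument description above.

```r
library(dplyr)

# Hypothetical toy data: rows 1, 3 and 4 share the same (id, value) pair
df <- tibble(
  id    = c(1, 2, 1, 1, 3),
  value = c("a", "b", "a", "a", "c")
)

# Every row that belongs to a group with more than one member,
# i.e. duplicates plus their first instances
# (comparable to `.both_ways = TRUE`)
both_ways <- df %>%
  group_by(id, value) %>%
  filter(n() > 1) %>%
  ungroup()

# Only the repeated rows, dropping the first instance of each group
# (comparable to the default `.both_ways = FALSE`)
dupes_only <- df %>%
  group_by(id, value) %>%
  filter(row_number() > 1) %>%
  ungroup()
```

Both pipelines group on every column of interest, so their cost grows with the number of groups; this is the case where the Details text suggests `duplicate_rows()` may be faster.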

Value

A data.frame of duplicate rows.

Examples

library(dplyr)
library(timeplyr)
library(ggplot2)

# Duplicates across all columns
diamonds %>%
  duplicate_rows()
# Alternatively with row ids
diamonds %>%
  filter(frowid(.) > 1)
# Diamonds with the same dimensions
diamonds %>%
  duplicate_rows(x, y, z)
# Can use tidyverse select notation
diamonds %>%
  duplicate_rows(across(where(is.factor)), .keep_all = FALSE)
# Similar to janitor::get_dupes()
diamonds %>%
  duplicate_rows(.add_count = TRUE)
# Keep the first instance of each duplicate row
diamonds %>%
  duplicate_rows(.both_ways = TRUE)
# Same as the below
diamonds %>%
  fadd_count(across(everything())) %>%
  filter(n > 1)

timeplyr documentation built on Sept. 12, 2024, 7:37 a.m.