find_duplicates: Identify and return duplicated rows in a data frame or...
In cleanepi: Clean and Standardize Epidemiological Data

View source: R/find_and_remove_duplicates.R

find_duplicates

R Documentation

Identify and return duplicated rows in a data frame or linelist.

Description

Identify and return duplicated rows in a data frame or linelist.

Usage

find_duplicates(data, target_columns = NULL)

Arguments

`data`	The input `<data.frame>` or `<linelist>`.
`target_columns`	A `<vector>` of column names or indices to consider when looking for duplicates. When the input data is a `<linelist>` object, this parameter can be set to `linelist_tags` to look for duplicates across the tagged columns only. Its default value is `NULL`, which considers duplicates across all columns.

Value

A `<data.frame>` or `<linelist>` containing all duplicated rows, with the following two additional columns:

row_id: the indices of the duplicated rows in the input data. Users can use these indices to choose which row they consider redundant in each group of duplicates.
group_id: a unique identifier associated with each group of duplicates.
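
To make the meaning of the two extra columns concrete, here is a minimal base R sketch that approximates the documented return value. It is not the package's implementation: `sketch_find_duplicates()` and the `toy` data frame are hypothetical, introduced only for illustration, and the row ordering and group numbering of the real function may differ.

```r
# Hypothetical helper, NOT part of cleanepi: a base R approximation of the
# return value described above (duplicated rows plus row_id and group_id).
sketch_find_duplicates <- function(data, target_columns = NULL) {
  # NULL means "compare rows across all columns", as in the documented default.
  if (is.null(target_columns)) {
    target_columns <- names(data)
  }
  # Build one comparison key per row from the target columns.
  key <- do.call(paste, c(unname(as.list(data[target_columns])), sep = "\r"))
  # Flag every member of a duplicated group, not only the later occurrences.
  is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
  out <- data[is_dup, , drop = FALSE]
  # row_id: position of each duplicated row in the input data.
  out$row_id <- which(is_dup)
  # group_id: groups numbered in order of first appearance (an assumption).
  out$group_id <- match(key[is_dup], unique(key[is_dup]))
  out
}

# Toy data (made up for this sketch): rows 1 and 3 match on sex and outcome,
# and so do rows 2 and 4; row 5 is unique.
toy <- data.frame(
  id = 1:5,
  sex = c("F", "M", "F", "M", "F"),
  outcome = c("died", "recovered", "died", "recovered", "recovered")
)

dups <- sketch_find_duplicates(toy, target_columns = c("sex", "outcome"))
print(dups)

# One possible choice of redundant rows: keep the first row of each group
# and drop the later ones, using row_id to index back into the input.
redundant <- dups$row_id[duplicated(dups$group_id)]
deduplicated <- toy[-redundant, , drop = FALSE]
print(deduplicated)
```

Keeping the first row of each group is only one possible policy. Because `row_id` points back into the input data, any other rule (for example, keeping the most complete record) can be applied in the same way.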

Examples

dups <- find_duplicates(
  data = readRDS(
    system.file("extdata", "test_linelist.RDS", package = "cleanepi")
  ),
  target_columns = c("dt_onset", "dt_report", "sex", "outcome")
)

cleanepi documentation built on April 4, 2025, 5:12 a.m.