This tutorial gives an overview of the tag_duplicates function in the mStats package, with detailed usage examples. The function identifies and tags duplicate observations based on the variables you specify, mimicking Stata's duplicates command in R. It prints a report of duplicates and creates a tibble with three columns: .n_, .N_, and .dup_.
```r
library(mStats)
library(dplyr)

# Example with a custom dataset
data <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 4, 5),
  y = letters[1:8]
)

# Identify and tag duplicates based on the x variable
data %>% mutate(tag_duplicates(x))

# Identify and tag duplicates based on multiple variables
data %>% mutate(tag_duplicates(x, y))

# Identify and tag duplicates based on all variables
data %>% mutate(tag_duplicates(everything()))
```
In the examples above, we use the tag_duplicates function to identify and tag duplicates in the data dataframe. We can check for duplicates on one variable or on several. With everything(), duplicates are checked across all variables in the dataframe.
The function prints a concise report showing the number of duplicates and surplus observations for the chosen variables. The .dup_ column indicates whether an observation is a duplicate.
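To make the effect of the chosen variables concrete, here is a minimal base R sketch that does not use mStats. It counts the rows that repeat an earlier row, first on x alone and then on x and y together, using the same toy data. The names df, n_dup_x, and n_dup_xy are illustrative assumptions and not part of the package.

```r
# Same toy data as in the example above
df <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 4, 5),
  y = letters[1:8]
)

# duplicated() marks every row that repeats an earlier row
n_dup_x  <- sum(duplicated(df["x"]))          # key: x only
n_dup_xy <- sum(duplicated(df[c("x", "y")]))  # key: x and y together

n_dup_x   # rows repeating an earlier value of x
n_dup_xy  # rows repeating an earlier (x, y) pair
```

Because y is different in every row, adding it to the key removes all duplicates, which is why the choice of variables matters.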
The report of duplicates summarizes the duplicate observations in a dataset. It consists of three components: copies, observations, and surplus.
Copies: The copies column gives the number of times an observation occurs in the dataset for the chosen variables, that is, the number of identical rows in each duplicated set.
Observations: The observations column gives the total number of rows that have that number of copies, counting both the original and the duplicated rows.
Surplus: The surplus column gives the number of additional observations beyond the first occurrence of each duplicated set, that is, the duplicate rows excluding the originals.
To interpret the duplicates report produced by the tag_duplicates function, consider the following:
Copies: If the copies value is 1, the rows counted on that line have no duplicates for the chosen variables. If it is greater than 1, each of those rows occurs that many times.
Observations: The observations count includes every row with that number of copies, originals and duplicates alike. Subtracting surplus from observations gives the number of distinct observations.
Surplus: The surplus value is the number of duplicate rows beyond the first occurrence of each set. These surplus rows need to be examined further or potentially removed.
By analyzing the duplicates report, you can identify which variables or combinations of variables are causing duplicates in the dataset. This information can guide further investigation or data cleaning steps to handle duplicate observations appropriately.
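The three components can be reproduced by hand on a small vector. The following base R sketch assumes the report follows the layout of Stata's duplicates report (one line per number of copies). It is not the mStats implementation, and the names group_size, n_groups, and report are illustrative.

```r
x <- c(1, 1, 2, 2, 3, 4, 4, 5)

# Number of rows for each distinct value of x
group_size <- as.vector(table(x))

# How many distinct values occur once, twice, and so on
groups_per_size <- table(group_size)
copies   <- as.integer(names(groups_per_size))
n_groups <- as.integer(groups_per_size)

report <- data.frame(
  copies       = copies,
  observations = copies * n_groups,            # all rows with that many copies
  surplus      = copies * n_groups - n_groups  # rows beyond the first of each set
)
report
```

Here the values 3 and 5 occur once, so the copies = 1 line has 2 observations and no surplus. The values 1, 2, and 4 each occur twice, so the copies = 2 line has 6 observations, of which 3 are surplus.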
.n_, .N_, and .dup_
The tag_duplicates function returns three columns: .n_, .N_, and .dup_. Here is how to interpret them:
.n_
The .n_ column represents the count of each observation in the dataset, including duplicates. It indicates the number of times each row appears, whether the row is duplicated or unique. For unique observations the value is 1. If an observation is duplicated, the value is greater than 1.
Interpretation of .n_:
If .n_ equals 1, the observation is unique and has no duplicates in the dataset.
If .n_ is greater than 1, the observation is duplicated and appears multiple times in the dataset.
.N_
The .N_ column holds the number of occurrences of each observation. Every row carries the count for the set of identical observations it belongs to, whether that set contains duplicates or a single unique row.
Interpretation of .N_:
The .N_ column helps identify the total number of occurrences of each observation in the dataset.
If .N_ equals 1, the observation is unique and occurs only once in the dataset.
If .N_ is greater than 1, the observation has duplicates and appears multiple times in the dataset.
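The column names echo Stata's _n and _N system variables, which under by-group processing hold the running observation number within a group and the group total. As a rough sketch under that assumption, similar counts can be built in base R with ave(). The column names running_index and group_total are hypothetical, and this is not the mStats implementation, so check the function's own output on a small dataset before relying on either column.

```r
df <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 4, 5),
  y = letters[1:8]
)

# Position of each row within its group of identical x values (1, 2, ...)
df$running_index <- ave(seq_along(df$x), df$x, FUN = seq_along)

# Total number of rows sharing the same x value, repeated on every row
df$group_total <- ave(seq_along(df$x), df$x, FUN = length)

df
```

Under these conventions, group_total is 1 only for unique rows, while running_index is 1 for unique rows and also for the first row of every duplicated set.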
.dup_
The .dup_ column is a logical indicator that flags whether an observation is a duplicate. Its value is TRUE if the observation is a duplicate and FALSE if it is unique.
Interpretation of .dup_:
If .dup_ is TRUE, the observation is a duplicate and appears multiple times in the dataset.
If .dup_ is FALSE, the observation is unique and has no duplicates.
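A logical duplicate flag can mark every member of a duplicated set or only the occurrences after the first. The base R sketch below shows both variants on a small vector, without claiming which one .dup_ follows. The names later_only and all_members are illustrative.

```r
x <- c(1, 1, 2, 2, 3, 4, 4, 5)

# TRUE only for rows that repeat an earlier row
later_only <- duplicated(x)

# TRUE for every row that belongs to a set with more than one member
all_members <- duplicated(x) | duplicated(x, fromLast = TRUE)

data.frame(x, later_only, all_members)
```

Filtering on later_only keeps just the surplus rows, while filtering on all_members keeps each duplicated set in full, which is usually what you want when inspecting duplicates before deciding which rows to drop.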
By examining these three columns together, you can determine which observations in your dataset are duplicates, how many times they occur, and whether an observation is unique or duplicated. This information can be valuable for further data analysis, quality control, or data cleaning processes.
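The following example uses the hsb2 dataset from the UCLA tutorial on detecting duplicate observations. Read more about the tutorial here: https://stats.oarc.ucla.edu/stata/faq/how-can-i-detect-duplicate-observations-3/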
```r
hsb2 <- haven::read_dta("https://stats.idre.ucla.edu/stat/stata/notes/hsb2.dta")

hsb2_mod <- hsb2 |>
  dplyr::select(id, female, ses, read, write, math) |>
  dplyr::arrange(id)

hsb2_mod2 <- hsb2_mod |>
  dplyr::filter(dplyr::row_number() <= 5) |>
  dplyr::bind_rows(hsb2_mod) |>
  dplyr::arrange(id) |>
  dplyr::mutate(math = ifelse(dplyr::row_number() == 1, 84, math))

hsb2_mod2 |> dplyr::mutate(tag_duplicates(everything()))

hsb2_mod2 |> dplyr::mutate(tag_duplicates(id))

hsb2_mod2 |>
  dplyr::mutate(tag_duplicates(id, .add_tags = TRUE)) |>
  # filter duplicate observations
  dplyr::filter(.dup_)
```
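The example above changes math in the first row, so that row no longer matches its copy on every variable even though the id is repeated. The base R sketch below reproduces that situation on a small made-up stand-in for hsb2 (the scores data frame and its values are hypothetical), so it runs without a network connection or mStats.

```r
# Hypothetical stand-in: ids 1 and 2 are entered twice,
# but math differs between the two rows for id 1
scores <- data.frame(
  id   = c(1, 1, 2, 2, 3),
  read = c(34, 34, 39, 39, 63),
  math = c(84, 40, 33, 33, 48)
)

# Rows that are identical on every variable
full_row_dup <- duplicated(scores) | duplicated(scores, fromLast = TRUE)

# Rows that share an id with another row
id_dup <- duplicated(scores$id) | duplicated(scores$id, fromLast = TRUE)

# Repeated ids whose other values disagree
conflicting <- which(id_dup & !full_row_dup)
scores[conflicting, ]
```

Checking on all variables misses the two rows for id 1, while checking on id alone catches them. Comparing the two checks is a quick way to find records that share an identifier but disagree elsewhere.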
Reference
StataCorp. (2021). Stata Base Reference Manual: Duplicates. Retrieved from Stata Press: https://www.stata.com/manuals/rbase/duplicates.pdf
For detailed information about the duplicates command, see the Stata Base Reference Manual, available through the Stata Press website.