knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)
library(tidyverse)

Introduction

Working with pedigree data often involves dealing with inconsistencies, missing information, and errors. The BGmisc package provides tools to identify and, where possible, repair these issues automatically. This vignette demonstrates how to validate and clean pedigree data using BGmisc's validation functions.

Identifying and Repairing ID Issues

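The checkIDs() function detects two common types of ID error in pedigree data:

- Between-row duplicates: two or more individuals share the same ID
- Within-row duplicates: an individual's own ID is listed as the ID of one of their parents

These problems are especially common when merging family records or processing historical data. Let's explore how they show up and what they imply for downstream structure.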
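To make the two error types concrete, here is a minimal base-R sketch on a toy pedigree. The data frame `toy` and its IDs are hypothetical, and this is not how checkIDs() is implemented; it only illustrates what each kind of duplicate looks like in the raw data.

```r
# Hypothetical toy pedigree (IDs invented for illustration)
toy <- data.frame(
  personID = c(1, 2, 3, 3, 5),
  momID    = c(NA, NA, 2, 2, 5),
  dadID    = c(NA, NA, 1, 1, 1)
)

# Between-row duplicates: the same personID appears in more than one row
between_ids <- unique(toy$personID[duplicated(toy$personID)])

# Within-row duplicates: a person is listed as their own mother or father
own_parent <- (!is.na(toy$momID) & toy$momID == toy$personID) |
  (!is.na(toy$dadID) & toy$dadID == toy$personID)
within_ids <- toy$personID[own_parent]

between_ids
within_ids
```

In this toy data, ID 3 is shared by two rows and ID 5 is listed as its own mother.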

A Clean Dataset

We'll begin with the Potter family dataset, cleaned and reformatted with ped2fam():

library(BGmisc)

# Load our example dataset
df <- ped2fam(potter, famID = "newFamID", personID = "personID")

# Check for ID issues
checkIDs(df, repair = FALSE)

There are no duplicated or self-referential IDs here. But things rarely stay that simple.

What checkIDs() Reports

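The checkIDs() function checks whether every ID is unique and whether any individual is listed as their own mother or father. The result fields used later in this vignette include total_non_unique_ids, non_unique_ids, and is_own_mother_ids.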

If you set repair = TRUE, the function will attempt to fix the issues it finds. We'll explore this later.

A Tale of Two Duplicates

To understand how these tools work in practice, let's create a dataset with two common real-world problems. First, we'll accidentally give Vernon Dursley the same ID as his sister Marjorie (a common issue when merging family records). Then, we'll add a complete duplicate of Dudley Dursley (as might happen during data entry).

# Create our problematic dataset
df_duplicates <- df
# Sibling ID conflict
df_duplicates$personID[df_duplicates$name == "Vernon Dursley"] <-
  df_duplicates$personID[df_duplicates$name == "Marjorie Dursley"]
# Duplicate entry
df_duplicates <- rbind(
  df_duplicates,
  df_duplicates[df_duplicates$name == "Dudley Dursley", ]
)

If we look at the data using standard tools, the problems aren't immediately obvious:

library(tidyverse)

summarizeFamilies(df_duplicates,
  famID = "newFamID",
  personID = "personID"
)$family_summary %>%
  glimpse()

But checkIDs() detects the problems clearly:

# Identify duplicates
result <- checkIDs(df_duplicates)
print(result)

As this output shows, the dataset contains non-unique IDs: their count is stored in result$total_non_unique_ids and the IDs themselves in result$non_unique_ids. Let's take a peek at the duplicates:

# Let's examine the problematic entries
df_duplicates %>%
  filter(personID %in% result$non_unique_ids) %>%
  arrange(personID)

Yep, these are definitely the duplicates.

Repairing Between-Row Duplicates

Some ID issues can be fixed automatically. Let's try the repair option:

# Repair the dataset that contains the duplicates, not the clean df
df_repair <- checkIDs(df_duplicates, repair = TRUE)

df_repair %>%
  filter(ID %in% result$non_unique_ids) %>%
  arrange(ID)

result <- checkIDs(df_repair)

print(result)

Great! Notice what happened here: the function repaired the full duplicate without any manual intervention. The sibling ID conflict remains, because resolving it requires knowing which of the two IDs is wrong. We'll leave that for now.
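The difference between the two cases can be sketched in base R. A fully duplicated row is identical in every column, so dropping it loses no information, whereas two different people sharing an ID cannot be separated without outside knowledge. The toy data below is hypothetical, and this illustrates the idea rather than the package's repair logic.

```r
# Hypothetical toy data: "D" is entered twice, "B" and "C" share an ID
toy <- data.frame(
  personID = c(10, 11, 11, 12, 12),
  name     = c("A", "B", "C", "D", "D"),
  stringsAsFactors = FALSE
)

# Rows identical in every column are safe to drop
full_dup <- duplicated(toy)
deduped <- toy[!full_dup, ]

# IDs still shared by rows that differ need manual review
conflict_ids <- unique(deduped$personID[duplicated(deduped$personID)])
conflict_ids
```

After dropping the exact duplicate, only ID 11 is still shared, and it belongs to two different people.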

So far we’ve only checked for violations of uniqueness. But do these errors also affect the graph structure? Let's find out.

Oedipus ID

Just as Oedipus discovered that his true relationships were not what the records suggested, our data can reveal its own confused parentage. Sometimes an individual's own ID is incorrectly entered as the ID of one of their parents, producing a within-row duplicate. The checkIDs function can also identify these errors:

# Create a sample dataset with within-person duplicate parent IDs

df_within <- ped2fam(potter, famID = "newFamID", personID = "personID")

df_within$momID[df_within$name == "Vernon Dursley"] <- df_within$personID[df_within$name == "Vernon Dursley"]

# Check for within-row duplicates
result <- checkIDs(df_within, repair = FALSE)
print(result)

In this example, we have created a within-row duplicate by setting the momID of Vernon Dursley to his own ID. The checkIDs function correctly identifies that this error is present.

Automatic repair of within-row duplicates through repair = TRUE is still under development and is planned for a future version of the package. In the meantime, you can inspect and correct these errors manually.

# Find the problematic entry

df_within[df_within$momID %in% result$is_own_mother_ids, ]

There are several ways to correct this issue, depending on the specifics of your dataset. Here, you could replace Vernon Dursley's momID with the correct value, for example by assuming that he and his sister Marjorie share the same mother.
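One way to apply such a correction is sketched below on a hypothetical toy data frame. The names, IDs, and the assumption that the two siblings share a mother are illustrative only and are not taken from the potter data.

```r
# Hypothetical toy pedigree: Sibling A is wrongly listed as their own mother
toy <- data.frame(
  personID = c(1, 2, 3, 4),
  name     = c("Mother", "Father", "Sibling A", "Sibling B"),
  momID    = c(NA, NA, 3, 1),
  dadID    = c(NA, NA, 2, 2),
  stringsAsFactors = FALSE
)

# Flag rows where momID equals the person's own ID
bad <- !is.na(toy$momID) & toy$momID == toy$personID

# Assumption: Sibling A shares a mother with Sibling B
toy$momID[bad] <- toy$momID[toy$name == "Sibling B"]

# Confirm that no one is listed as their own mother any more
still_bad <- any(!is.na(toy$momID) & toy$momID == toy$personID)
still_bad
```

If the true mother cannot be established, setting the value to NA is a more conservative alternative than guessing.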

Identifying and Repairing Sex Coding Issues

Another critical aspect of pedigree validation is ensuring the consistency of sex coding. This brings us to an important distinction in genetic studies between biological sex (genotype) and gender identity (phenotype).

The checkSex function focuses on the consistency of biological sex coding, looking in particular for:

- Mismatches between parent roles and recorded sex
- Individuals listed as both parent and child
- Inconsistent sex coding across the dataset

Let's examine how it works:

# Validate sex coding

results <- checkSex(potter,
  code_male = 1,
  code_female = 0,
  verbose = TRUE, repair = FALSE
)
print(results)
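A mismatch between parent role and recorded sex can be sketched in base R as follows. The toy data and column names are hypothetical, the coding follows the call above (1 = male, 0 = female), and this is not checkSex()'s implementation.

```r
# Hypothetical toy pedigree: ID 3 is listed as a mother but coded male
toy <- data.frame(
  ID    = c(1, 2, 3, 4),
  sex   = c(0, 1, 1, 1),   # 0 = female, 1 = male
  momID = c(NA, NA, 1, 3),
  dadID = c(NA, NA, 2, 2)
)

# Everyone who appears in a parent column
mom_ids <- unique(na.omit(toy$momID))
dad_ids <- unique(na.omit(toy$dadID))

# Mothers not coded female, fathers not coded male
bad_moms <- toy$ID[toy$ID %in% mom_ids & toy$sex != 0]
bad_dads <- toy$ID[toy$ID %in% dad_ids & toy$sex != 1]

bad_moms
bad_dads
```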

When inconsistencies are found, you can attempt automatic repair:

# Repair sex coding
df_fix <- checkSex(potter,
  code_male = 1,
  code_female = 0,
  verbose = TRUE, repair = TRUE
)
print(df_fix)

When the repair argument is set to TRUE, the repair process follows several rules:

- Parents listed as mothers must be female
- Parents listed as fathers must be male
- Sex codes are standardized to the specified code_male and code_female values
- If no sex codes are provided, the function attempts to infer how male and female are coded, using the sex most frequently assigned to mothers and to fathers as the standard

Note that automatic repairs should be carefully reviewed, as they may not always reflect the correct biological relationships. In cases where the sex coding is ambiguous or conflicts with known relationships, manual inspection and domain knowledge may be required.
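The inference rule in the repair list above, taking the most frequent sex among mothers and among fathers, might look like the following base-R sketch. The toy data and the helper `most_common()` are hypothetical, and the package's actual inference may differ.

```r
# Hypothetical toy pedigree with character sex codes
toy <- data.frame(
  ID    = 1:6,
  sex   = c("F", "M", "F", "M", "M", "F"),
  momID = c(NA, NA, 1, 1, 3, 3),
  dadID = c(NA, NA, 2, 2, 4, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most frequent value in a vector
most_common <- function(x) names(which.max(table(x)))

# Recorded sex of everyone who appears as a mother or as a father
mothers <- toy$sex[toy$ID %in% toy$momID]
fathers <- toy$sex[toy$ID %in% toy$dadID]

inferred_female <- most_common(mothers)
inferred_male <- most_common(fathers)
```

A majority rule like this is only as good as the parent columns it reads from, which is one reason inferred codes deserve a manual check.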

Best Practices for Pedigree Validation

Through extensive work with pedigree data, we've learned several key principles:

By following these best practices, and leveraging functions like checkIDs, checkSex, and recodeSex, you can ensure the integrity of your pedigree data, facilitating accurate analysis and research.



R-Computing-Lab/BGMisc documentation built on April 3, 2025, 3:12 p.m.