PedCompare Example (Griffin Pedigree)

Example: PedCompare

knitr::opts_chunk$set(echo = TRUE, eval=TRUE,
                      fig.height=4, fig.width=6, fig.pos="ht!") 
library(sequoia)

Why compare pedigrees

It is often worthwhile to compare a newly inferred genetic pedigree to an older (field-)pedigree, as no dataset is flawless, nor is genetic pedigree reconstruction infallible.

Discrepancies between the pedigrees may bring to light mislabeled samples, incorrect birth or death years, or incorrectly inferred pedigree links. For example, if an individual's genetically inferred mother differs from its field-observed mother, this may be due to a mislabeled DNA sample, a pedigree inference error, an error in the field records, or egg dumping/adoption. Eliminating possibilities and tracking down the most likely scenario can be rather time-consuming, but will increase the overall quality of the dataset and pedigree.

In addition, the pedigree comparison may be able to match dummy parents in the genetic pedigree to real, non-genotyped parents in the field pedigree. This allows pedigree records (offspring number, mates, etc.) to be combined with phenotypic records (age, size, etc.) of those non-genotyped individuals.

Here, the process is illustrated by comparing genetically assigned and field observed mothers in a fictional population of griffins.

Study population

Nests have been monitored in a small, closed population of griffins, where each year exactly 20 baby griffins hatch. The mother of most individuals is known from field observations (example data FieldMums_griffin included in the package). From 2001 to 2010, most hatchlings are tagged and sampled for SNP genotyping. From this, a genetic pedigree has been reconstructed (example data SeqOUT_griffin). Breeding females without tags are given a two-colour code [^x]. This may include females who have lost their tags.

[^x]: with animal-friendly spray paint

Comparing pedigrees

library(sequoia)
data(SeqOUT_griffin, FieldMums_griffin, package="sequoia")

PCG  <- PedCompare(Ped1 = cbind(FieldMums_griffin,
                                sire = NA),
                   Ped2 = SeqOUT_griffin$Pedigree,
                   SNPd = SeqOUT_griffin$PedigreePar$id,
                   Symmetrical = TRUE, Plot=FALSE)

In essence, PedCompare() lays the two pedigrees side-by-side and classifies each individual's parent as being the same in the two pedigrees (Match) or not (Mismatch), or as being present in only one of the pedigrees (P1only or P2only).

While Match versus Mismatch is straightforward when all individuals are genotyped, and all IDs are (thus) consistent between the two pedigrees, it becomes more complicated when non-genotyped individuals and dummy IDs are involved. PedCompare() tries to match each sibship-dummy-parent in Pedigree2 (F0001, F0002, etc.) to a non-genotyped parent in Pedigree1 (here these are the IDs consisting of 2 colours: PinkBlue, BlueRed, etc.).
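For the simple case in which all IDs are directly comparable, the four classes can be reproduced in a few lines of base R. This is only a sketch on made-up data, not how PedCompare() is implemented: the helper classify_parent() and the label "none" are hypothetical, and the matching of dummy IDs is ignored entirely.

```r
# Toy sketch (not sequoia's implementation): classify each individual's dam
# when IDs are consistent between the two pedigrees.
classify_parent <- function(par1, par2) {
  out <- rep("none", length(par1))          # hypothetical label: no parent in either
  out[!is.na(par1) &  is.na(par2)] <- "P1only"
  out[ is.na(par1) & !is.na(par2)] <- "P2only"
  both <- !is.na(par1) & !is.na(par2)
  out[both & par1 == par2] <- "Match"
  out[both & par1 != par2] <- "Mismatch"
  out
}

# made-up side-by-side pedigree
toy <- data.frame(id    = c("a", "b", "c", "d", "e"),
                  dam.1 = c("m1", "m1", "m2", NA, NA),
                  dam.2 = c("m1", "m3", NA, "m2", NA),
                  stringsAsFactors = FALSE)
toy$dam.class <- classify_parent(toy$dam.1, toy$dam.2)
toy
```

With dummy parents the comparison `par1 == par2` is no longer sufficient, which is why PedCompare() first has to work out which dummy corresponds to which non-genotyped field ID.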

$MergedPed

The side-by-side pedigrees are given in the list element $MergedPed of the output:

PCG$MergedPed[c(127:133), c("id", "dam.1", "dam.2", "dam.class")]

This part-pedigree for example shows that individuals i165_2009_F and i166_2009_F have the non-genotyped female 'PinkBlue' as field observed mother (dam.1 = dam in Pedigree1). The pair are also genetically assigned as maternal siblings, with 'F0002' as dummy ID for the mother (dam.2). And, while the ID of the mother differs between the two pedigrees, it is nonetheless considered a Match (dam.class, short for classification).

id columns

A point of clarification on the id columns:

For example, 'PinkBlue' does not have a mother in Pedigree1 (first row in MergedPed below: id.r=PinkBlue, dam.1=NA). Pedigree2 tells us that 'F0002' has genetic mother 'F0007', which is not hugely informative in itself (MergedPed: id=F0002, dam.2=F0007). But PedCompare() then tells us that this means that 'PinkBlue' (id.r) has 'YellowPink' as mother (dam.r), which has a lot more meaning (to someone working in the field).

# subset some individuals:
these <- c("i177_2009_M", "i179_2009_M", "i165_2009_F", "i166_2009_F", "F0002",
           "F0007", "YellowPink", "PinkBlue")
knitr::kable(list(Ped1 = FieldMums_griffin[FieldMums_griffin$id %in% these, ], 
                  Ped2 = SeqOUT_griffin$Pedigree[SeqOUT_griffin$Pedigree$id %in% these, 1:3]),
             caption = "Subsets of Pedigree1 (left) and Pedigree2 (right)")
PCG$MergedPed[PCG$MergedPed$id %in% these, 
              c("id", "id.r", "dam.1", "dam.2", "dam.r")]

$DummyMatch

The output list element $DummyMatch summarises the matches:

head(PCG$DummyMatch[, -c(3:5)], n=6)

For the match between 'F0002' in Pedigree2 and 'PinkBlue' in Pedigree1 (bottom row), there are 3 matching offspring, and no offspring by 'PinkBlue' that are genetically assigned to a different sibship cluster or mother (off.Mismatch) or have no genetic mother assigned (off.P1only), nor are there any members of the genetic sibship F0002 with no mother in Pedigree1 (off.P2only). Thus, a perfect one-to-one match.

For genetic sibship 'F0001', however, the situation is more complicated, and this will be worked through in detail below.

'nomatch' in the id.1 column indicates either that none of the individuals in this genetic sibship had a field-observed mother, or that the field-observed mother is matched to a different genetic sibship, with which there was a larger overlap.
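The notion of 'largest overlap' can be illustrated by cross-tabulating field mothers against genetic sibships. The sketch below uses invented IDs and a simple "most shared offspring wins" rule; it is an assumption for illustration, not the matching procedure used inside PedCompare().

```r
# Made-up offspring with a field mother (dam.1) and a dummy mother (dam.2)
toy <- data.frame(
  id    = paste0("o", 1:7),
  dam.1 = c("RedRed", "RedRed", "RedRed", "BlueBlue", "BlueBlue", NA, "BlueBlue"),
  dam.2 = c("F0001", "F0001", "F0001", "F0001", "F0002", "F0002", "F0002"),
  stringsAsFactors = FALSE
)

# rows: field mothers, columns: genetic sibships (NA field mothers are dropped)
overlap <- table(field = toy$dam.1, genetic = toy$dam.2)
overlap

# for each genetic sibship, the field mother sharing the most offspring
best <- rownames(overlap)[apply(overlap, 2, which.max)]
names(best) <- colnames(overlap)
best
```

In this toy example 'BlueBlue' has one offspring in sibship F0001, but is matched to F0002 because the overlap there is larger; the stray offspring would show up as a mismatch.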

$Counts

The total number of matches and mismatches is summarised in the 3D array $Counts, which can be visualised with PlotPedComp():

PlotPedComp(PCG$Counts)

Since there are no sires in our field pedigree, we will have a look at the dam slice only:

PCG$Counts[,,"dam"]

The counts for each classification are subdivided into various categories (rows), based on whether the focal individual (first letter) and parent (second letter) are Genotyped or a Dummy individual, as well as the Totals. The total counts include individuals in Pedigree1 who are neither genotyped nor 'dummifiable', such as a non-genotyped parent with a single offspring, and therefore exceed the sum of G and D.

We will go through each of the three classes of discrepancies in turn -- Mismatch, P1only (only field mum), and P2only (only genetic mum).

Mismatch

To get more detail on the 11 mismatches, we head back to $MergedPed. For brevity, we display only the columns in which we're currently interested:

PCG$MergedPed[which(PCG$MergedPed$dam.class == "Mismatch"), c("id", "dam.1", "dam.2", "id.dam.cat")]

Thus while there are 11 individuals with mismatching mothers, there are only 4 unique mothers involved, which we will look at in turn.

Mismatch Issue1: GreenBlue

PedM <- PCG$MergedPed[, c("id", "dam.1", "dam.2")]   # short-hand to minimise typing

# does the mismatch affect all of GreenBlue's offspring?
PedM[which(PedM$dam.1 == "GreenBlue"), ]
# > yes, these 4 are all of her known offspring

# does genetic mother i081_2005_F have any field-observed offspring?
PedM[which(PedM$dam.1 == "i081_2005_F"), ]
# no.

# does i081_2005_F have any other genetic offspring?
PedM[which(PedM$dam.2 == "i081_2005_F"), ]
# no.

It seems 'GreenBlue' is a perfect match with 'i081_2005_F': all four of GreenBlue's observed offspring are genetically assigned i081_2005_F as mother, and all four of i081_2005_F's genetic offspring have GreenBlue as observed mother. Perhaps this female lost her tag, was therefore not recognised, and received a new field ID. Field records may be able to back this theory up, or disprove it: Did GreenBlue look about 1 year old when first recorded in 2007? Was i081_2005_F ever seen in or after the 2006 breeding season? Does i081_2005_F have a known death date? If so, was there a post-mortem, or was she presumed dead because she had not been seen for several months/years?
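The three manual look-ups above amount to checking that two sets of offspring are identical. As a sketch, this can be wrapped in a small function; is_reciprocal() is a hypothetical helper, not part of sequoia, and the data below are made up.

```r
# Hypothetical helper: TRUE if the offspring of `field_id` (column dam.1) are
# exactly the offspring of `genetic_id` (column dam.2), and there is at least one.
is_reciprocal <- function(ped, field_id, genetic_id) {
  off_field   <- ped$id[which(ped$dam.1 == field_id)]
  off_genetic <- ped$id[which(ped$dam.2 == genetic_id)]
  length(off_field) > 0 && setequal(off_field, off_genetic)
}

# made-up side-by-side pedigree
toy <- data.frame(id    = c("o1", "o2", "o3", "o4"),
                  dam.1 = c("GreenGreen", "GreenGreen", NA, "RedRed"),
                  dam.2 = c("g01", "g01", "g02", "g02"),
                  stringsAsFactors = FALSE)

is_reciprocal(toy, "GreenGreen", "g01")   # TRUE: same two offspring on both sides
is_reciprocal(toy, "RedRed", "g02")       # FALSE: g02 has an extra offspring (o3)
```

Applied to the griffin data, the same function could be called on PedM with "GreenBlue" and "i081_2005_F", assuming the columns are named as in the code above.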

An alternative explanation might be a pedigree inference error, but this is highly unlikely when offspring and parent are both genotyped (id.dam.cat = 'GG'), and only plausible with a limited number of SNPs with low call rate and high genotyping error rate. The four assignments are independent if they were done during parentage assignment, but not necessarily so during full pedigree reconstruction: they may first have been clustered as maternal siblings, and subsequently the dummy mother may have been replaced by i081_2005_F.

Dummy individuals in Pedigree2 are never matched to genotyped individuals in Pedigree1, even if there is a perfect match such as here -- there is almost always something odd going on that requires user inspection.

Resolution: Merge IDs 'i081_2005_F' and 'GreenBlue'

Mismatch Issue2: OrangeGreen

# why is this flagged as a mismatch?
PedM[which(PedM$dam.1 == "OrangeGreen"), ]
# all of OrangeGreen's offspring are in sibship F0001

PedM[which(PedM$dam.2 == "F0001"), c("id", "dam.1", "dam.2")]
# but sibship F0001 is split across two field mothers

So, the genetic sibship with 'F0001' as dummy mother includes individuals with two different field-observed mothers, 'OrangeGreen' and 'BlueRed'. OrangeGreen's offspring hatched in 2007, and BlueRed's in 2008 and 2009. Perhaps this female was not regularly monitored, was not recognised at the start of the 2008 breeding season, and got a new ID?
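Whether two field IDs could belong to the same female can be checked quickly by comparing the hatch years of their offspring. The sketch below assumes IDs follow the pattern seen in this example (i, a number, the hatch year, the sex); the individuals themselves are made up.

```r
# Made-up members of one genetic sibship, with their field-observed mothers
sibship <- data.frame(
  id    = c("i901_2007_F", "i902_2007_M", "i903_2008_F", "i904_2009_M"),
  dam.1 = c("OrangeGreen", "OrangeGreen", "BlueRed", "BlueRed"),
  stringsAsFactors = FALSE
)

# extract the 4-digit hatch year from the ID (assumed ID format, see above)
sibship$hatch_year <- as.integer(
  sub("^i[0-9]+_([0-9]{4})_[FM]$", "\\1", sibship$id)
)

# hatch years of the offspring, per field mother
years_by_dam <- split(sibship$hatch_year, sibship$dam.1)
years_by_dam

# non-overlapping breeding years are compatible with one female under two IDs
no_overlap <- max(years_by_dam$OrangeGreen) < min(years_by_dam$BlueRed)
no_overlap
```

Non-overlapping years do not prove that the two IDs are the same female; they only fail to rule it out.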

As for the previous case, field records that could disprove this theory are census records of BlueRed in or before 2007, or of OrangeGreen in or after 2008 (or if they were seen together!). Records backing up this theory are notes like 'this could be...' or 'looks similar to ...' when BlueRed was first described. This cautious approach of assigning a new ID is advisable, as it is often much easier to combine the records of two IDs later on than to try to tease apart records under a single ID into their respective IDs.

An alternative explanation is again a pedigree inference error, namely that sequoia has erroneously merged two actual sibships. This may sporadically happen when the two mothers are closely related. When the mothers are related by $>0.5$ (closer than regular full siblings, due to inbreeding), such erroneous merging can happen even with large powerful SNP sets.

Resolution: Probably merge IDs 'OrangeGreen' and 'BlueRed'

Mismatch Issue3: YellowBlue

# as before
PedM[which(PedM$dam.1 == "YellowBlue"), ]
# something odd going on involving sibships F0003 & F0005

PedM[which(PedM$dam.2 %in% c("F0003", "F0005")), ]

so:

A likely explanation is that the samples of these individuals were accidentally swapped around in the lab during DNA extraction, i.e. that genotype 'i147_2008_F' belongs to field id 'i148_2008_F', and vice versa. Lab notes might shed light on this theory, e.g. was the DNA extracted in the same batch? Were the samples adjacent on the same 96-well plate?

An alternative explanation is sloppy handwriting or a typing error. Whether in this case genotype 'i147_2008_F' belongs to field id 'i147_2008_F' or 'i148_2008_F' depends on when this mix-up happened, i.e. whether the field data are swapped between the individuals too.

A pedigree inference error seems unlikely in this particular case -- individuals may occasionally get assigned to the wrong sibship, due to genotyping errors, a low call rate sample, or an uncommon draw in the Mendelian inheritance lottery, but it seems highly unlikely that this would result in two individuals swapping places.

Resolution: Swap IDs 'i147_2008_F' and 'i148_2008_F' in genetic data (and/or in field data)
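Swapping two IDs is easy to get wrong when done in two consecutive replacements, because the second replacement then also hits the labels changed by the first. A minimal sketch that avoids this by looking up both positions in the original vector; swap_ids() is a hypothetical helper, and the two neighbouring IDs are made up.

```r
# Hypothetical helper: swap two labels in a character vector of IDs.
# Both look-ups use the original `ids`, so the swaps do not interfere.
swap_ids <- function(ids, a, b) {
  out <- ids
  out[which(ids == a)] <- b
  out[which(ids == b)] <- a
  out
}

# e.g. the row names of a genotype matrix (first and last ID are invented)
geno_ids <- c("i146_2008_M", "i147_2008_F", "i148_2008_F", "i149_2008_M")
swapped  <- swap_ids(geno_ids, "i147_2008_F", "i148_2008_F")
swapped
```

Keeping the swap in a script, rather than editing the data file by hand, leaves a record of the correction that can be undone if the lab notes later contradict it.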

Mismatch Issue4: GreenYellow

This turned out to be the same as Issue3!

Pedigree1-only

General

The second class of discrepancies consists of individuals who have a parent assigned in Pedigree1 (the field-observation-based pedigree), but not in Pedigree2 (the genetically inferred pedigree), i.e. there is no genetic confirmation of the field parent. It is important here to distinguish between:

  1. Parent and/or siblings in Pedigree1 are definitely not closely genetically related to the focal individual
  2. Parent and/or siblings in Pedigree1 are a genetic match, but not assigned due to
    • lacking birth year information (who is the parent and who is the offspring?),
    • ambiguity about what kind of second degree relative the sibling in Pedigree 1 is, or
    • the likelihood ratio falling short of the assignment threshold.
  3. Parent and siblings are not genotyped, or have such a low call rate that they are automatically excluded; i.e. there is no evidence one way or another.

A range of tools is available to help distinguish between these three alternatives:

Griffins

PCG$MergedPed[which(PCG$MergedPed$dam.class == "P1only"), 
              c("id", "id.r", "dam.1", "dam.2", "id.dam.cat")]

The id.dam.cat = 'XX' indicates that neither the focal individual nor their dam in Pedigree1 is genotyped or 'dummifiable' (has at the very minimum one genotyped offspring, see ?getAssignCat). Thus there is in this case simply no way of genetically testing whether or not these two are indeed maternal siblings, as the field data indicates.

Genetic data could become available in the future, due to DNA sampling of i053_2003_M's offspring, or of himself (e.g. post-mortem). This is one of the reasons why it is recommended to rerun pedigree analysis with all individuals when additional individuals have been genotyped.

Pedigree2-only (newly assigned mum)

General

Analogous to the Pedigree1-only classification, it is here useful to differentiate between

  1. The new genetic assignment is impossible or extremely unlikely based on field data

    • The assigned parent died before the birth year of the putative offspring
    • The putative offspring's parents are known with great certainty
    • The spatial locations of the assigned parent-pair were too far apart around the time of conception for them to have mated
    • The putative offspring is a dummy individual, but the assigned mother's offspring are all accounted for:
      • SNP genotyped and different ID, or
      • different sex, or
      • definite death date before breeding age
      • birth year incompatible with dummy individual's offspring
  2. The assignment is quite likely based on field data

    • The assigned mother was not seen in the breeding season her new offspring was born (e.g. bred outside the study area)
    • The offspring and newly assigned mother are part of the same social group
  3. There is no field data to suggest one way or the other
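Some of the checks listed under point 1 can be automated once field records are in tabular form. As a sketch, the code below flags newly assigned mothers who were recorded dead before their putative offspring hatched. The data and the column names (birth_year, death_year) are made up for illustration and do not correspond to any sequoia data format.

```r
# Made-up field records; NA death year = not known to have died
life <- data.frame(id         = c("m1", "m2", "o1", "o2"),
                   birth_year = c(2001L, 2002L, 2005L, 2006L),
                   death_year = c(2004L, NA, NA, NA),
                   stringsAsFactors = FALSE)

# Made-up newly assigned (Pedigree2-only) mothers
new_links <- data.frame(id  = c("o1", "o2"),
                        dam = c("m1", "m2"),
                        stringsAsFactors = FALSE)

# look up the relevant years for each offspring-dam pair
dam_death <- life$death_year[match(new_links$dam, life$id)]
off_birth <- life$birth_year[match(new_links$id, life$id)]

# TRUE only if the dam has a recorded death year before the offspring's hatch year
new_links$impossible <- !is.na(dam_death) & dam_death < off_birth
new_links
```

A flagged link is not automatically a pedigree error: the death record or the birth year may be the incorrect part, so each case still needs inspection.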

When the field data indicate that a newly assigned parent is highly implausible, there are various possible explanations, such as:

Griffins

As example, let's take the assignment of 'YellowPink' (aka F0007) as mother of 'PinkBlue' (aka F0002):

SeqOUT_griffin$DummyIDs[c(6,7), c("id", "dam", "BY.est", "NumOff", "O1", "O2", "O3", "O4")]

Dummy-dummy pedigree links such as these tend to have a higher error rate than genotyped-genotyped pedigree links (see also EstConf()), especially when the sibships are small, as all genetic information is 'second hand'. They are therefore worth checking if independent data is available.

The question here is whether it is plausible that YellowPink had a non-genotyped daughter in 2008 that survived to breeding age. She had one genotyped offspring in 2008 ('i141_2008_M'), so the nest was monitored that year. Field and lab records will indicate whether there were any additional hatchlings that were not sampled, or not successfully genotyped (+ tag loss of PinkBlue). If YellowPink had only one female, non-genotyped, potentially-surviving offspring that year, then that hatchling's data can be combined with PinkBlue's breeding records. If she definitely had no offspring that are unaccounted for, the pedigree is most likely incorrect and the parent-offspring link would best be removed.

Warning re dummy numbering

The dummy number corresponding to each non-genotyped individual is not consistent between runs. It simply reflects the order in which sibships were found, which may change when individuals are added or removed from the genotype data, or even when the order of individuals or the presumed genotyping error rate is changed [^2]. Therefore, a script like this is NOT robust:

[^2]: Note that the dummy numbering in this griffin example has also changed between version 2.3.5 and 2.4

PedX <- SeqOUT_griffin$Pedigree
# NOT LIKE THIS:
PedX$id[ PedX$id == 'F0001'] <- 'OrangeGreen'
PedX$dam[ PedX$dam == 'F0001'] <- 'OrangeGreen'
PedX$id[ PedX$id == 'F0002'] <- 'PinkBlue'
PedX$dam[ PedX$dam == 'F0002'] <- 'PinkBlue'

Instead, match by an individual within the sibship (the dummy's offspring), which should be consistent between runs:

PedX <- SeqOUT_griffin$Pedigree
# INSTEAD, SOMETHING LIKE THIS:
Name_match <- matrix(c(PedX$dam[PedX$id=='i123_2007_F'], 'OrangeGreen',  
                       PedX$dam[PedX$id=='i165_2009_F'], 'PinkBlue',  
                       PedX$dam[PedX$id=='i121_2007_M'], 'GreenYellow'),
                     ncol = 2, byrow=TRUE)
Name_match

for (i in seq_len(nrow(Name_match))) {
  PedX$id[ PedX$id == Name_match[i,1] ] <- Name_match[i,2] 
  PedX$dam[ PedX$dam == Name_match[i,1] ] <- Name_match[i,2] 
}

The offspring IDs can for example be found in SeqOUT_griffin$DummyIDs.

head(SeqOUT_griffin$DummyIDs[, c('id','dam','sire','NumOff','O1','O2','O3')])



sequoia documentation built on Sept. 8, 2023, 5:29 p.m.