Sequoia User Guide"

library(knitr)
library(sequoia)
knitr::opts_chunk$set(echo = TRUE, eval=TRUE,
                      auto_pdf = TRUE,
                      fig.height=4, fig.width=6, fig.pos="htb!")

\newpage

Quick-start examples

Simulated Data

An example pedigree and associated life history data are provided with the package, which can be used to try out some of the functions. This fictional pedigree consists of 5 generations with interconnected half-sib clusters (Pedigree II in @huisman17).

# install.packages("sequoia")  # only required first time
library(sequoia)             # load the package
#
# get the example pedigree and life history data
data(Ped_HSg5, LH_HSg5, package="sequoia")
tail(Ped_HSg5)   # have a look at the data
#
# simulate genotype data for 200 SNPs; 40% parents non-genotyped
Geno <- SimGeno(Ped = Ped_HSg5, nSnp = 200, SnpError=0.001, ParMis=0.4)
#
# run sequoia; duplicate check & parentage assignment only at first
ParOUT <- sequoia(GenoM = Geno,
                  LifeHistData = LH_HSg5,
                  Module = "par")

names(ParOUT)  # output: a list with the following elements
# [1] "Specs"  "AgePriors"  "LifeHist"  "PedigreePar"  "MaybeParent"
# "TotLikParents"
#
# run sibship clustering & grandparent assignment
# optional: use parents assigned above (in 'ParOUT$PedigreePar')
SeqOUT <- sequoia(GenoM = Geno,
                  SeqList = ParOUT,
                  Module = "ped")     
#
# compare the assigned real and dummy parents to the true pedigree
chk <- PedCompare(Ped1 = Ped_HSg5, Ped2 = SeqOUT$Pedigree)
chk$Counts["TT",,]
#
# save results: single compressed .RData file, and/or folder with text files
save(SeqOUT, Geno, OtherStuff, file="Sequoia_output_date.RData")
writeSeq(SeqList = SeqOUT, GenoM = Geno, folder = "Sequoia-OUT")

For a detailed walk-through of this example, see here

Real Data

First get a subset of SNPs which are as informative (high MAF, high call rate), reliable (low error rate), and independent (low LD) as possible. Ideally around 400 -- 700 for full pedigree reconstruction; fewer are necessary for only parentage assignment, while more may be necessary with very high levels of polygamy or inbreeding. In addition you need a dataframe with the sex and birth/hatching year of as many individuals as possible, in arbitrary order.

It is often prudent to first do a quick partial run to check for duplicates and genetic mismatches between known parent-offspring pairs, before running full pedigree reconstruction.

# read in genotype data if already coded as 0/1/2, with missing=-9:
Geno <- as.matrix(read.csv("mydata.csv", header=FALSE, row.names=1))
CheckGeno(Geno)
# or read in from several other input formats (not .vcf (yet)):
Geno <- GenoConvert(InFile = "mydata.ped", InFormat="ped")
#
# read in lifehistory data: ID-Sex-birthyear, column names ignored
LH <- read.table("LifeHistoryData.txt", header=T)
#
# duplicate check & parentage assignment (takes few minutes)
ParOUT <- sequoia(GenoM = Geno,  LifeHistData = LH_HSg5,  Err=0.005,
                  Module="par", quiet = FALSE, Plot = TRUE)
#
ParOUT$DupGenotype  # inspect duplicates (intentional or accidental)
#
# compare assigned parents to field pedigree (check column order!)
FieldPed <- read.table("FieldPed.txt", header=T)
PC.par <- PedCompare(Ped1 = FieldPed[, c("id", "dam", "sire")],
                     Ped2 = ParOUT$PedigreePar)
PC.par$Counts["TT",,]
#
# ..........................................................
# polish dataset: 
# remove one indiv. from each duplicate pair
Geno2 <- Geno[!rownames(Geno) %in% ParOUT$DupGenotype$ID2,] 
# & drop low call rate samples
# & drop SNPs with high error rate and/or low MAF
stats <- SnpStats(Geno, ParOUT$PedigreePar)
MAF <- ifelse(stats[,"AF"] <= 0.5, stats[,"AF"], 1-stats[,"AF"])
Geno2 <- Geno2[, -which(stats[,"Err.hat"]>0.05 | MAF < 0.1)]
#
Indiv.Mis <- apply(Geno2, 1, function(x) sum(x == -9)) / ncol(Geno2)
Geno2 <- Geno2[Indiv.Mis < 0.2, ]
# check histograms for sensible thresholds, iterate if necessary
#
# ..........................................................
# run full pedigree reconstruction (may take up to a few hours)
SeqOUT <- sequoia(GenoM = Geno2,
                  LifeHistData = LH_HSg5,
                  Module = "ped", 
                  Err = 0.001)
#
SummarySeq(SeqOUT)  # inspect assigned parents, prop. dummy parents, etc.
#
# check full sibs, half sibs etc.
Pairwise <- ComparePairs(SeqOUT$Pedigree, patmat = TRUE)
#
# save results: single compressed .RData file, and/or folder with text files
save(SeqOUT, Geno, OtherStuff, file="Sequoia_output_date.RData")
writeSeq(SeqList = SeqOUT, GenoM = Geno, folder = "Sequoia-OUT")

Background {#sec:Background}

The core of Sequoia is to

Sequoia provides a conservative hill-climbing algorithm to construct a high-likelihood pedigree from data on hundreds of single nucleotide polymorphisms (SNPs), described in @huisman17. Explicit consideration of the likelihoods of alternative relationships (parent-offspring, full siblings, grandparent--grand-offspring, ...) before making an assignment reduces the number of false positives, compared to parentage assignment methods that rely on the likelihood ratio between parent-offspring versus unrelated only [@thompson87]. The heuristic, sequential approach used is considerably quicker than most alternative approaches such as MCMC, and when genetic information is abundant there is little to no loss in accuracy. Typical computation times are a few minutes for parentage assignment, and a few hours for full pedigree reconstruction when not all individuals are genotyped.

A word of caution: the most likely relationship is not necessarily the true relationship between a pair, due to the random nature of Mendelian segregation, and possible genotyping errors. In addition, the most likely relationship for a pair will not necessarily result in the highest global likelihood, and may therefore not have been assigned.

Key Points

Mendelian Errors {#sec:MendelErrors}

Parent and offspring will never be opposing homozygotes (one $aa$, other $AA$) at a locus, except due to genotyping errors. This metric is easy and quick to calculate even in very large datasets, and provides a good way to subset candidate parents (done by sequoia internally), or to check if an existing pedigree is consistent with new genetic data (function CalcOHLLR(, CalcLLR=FALSE)).

The count of opposing homozygous (OH) SNPs is not always a reliable indicator to distinguish parents from not-parents. When the genotyping error rate is high, true parent--offspring pairs may have a high OH count, while some full siblings and other close relatives may have a very low OH count. Therefore, sequoia() uses OH only as a filter to reduce the number of candidate parents for each individual, and uses likelihood ratios for actual assignments.

Parent-pairs

The number of Mendelian errors in an offspring-mother-father trio includes the OH counts between offspring and each parent, as well as cases where the offspring should have been heterozygous, but isn't (mother $aa$, father $AA$, offspring $aa$ or $AA$), and cases where it is heterozygous, but shouldn't be (when mother and father are identical homozyotes). This metric is used to subset 'compatible' parent-pairs when there are candidate parents from both sexes, or with unknown sex.

Maximum mismatches

The maximum number of Mendelian mismatches, and mismatches between potential duplicates, is from version 2.0 onward calculated by CalcMaxMismatch(), based on the (average) genotyping error rate. This parameter can no longer be altered directly, but can be changed via parameters Err and ErrFlavour.

Likelihoods

Assignments are based on the likelihods for all different possible relationships between a pair, calculated conditional on all assignments made up onto that point. If the focal relationship is more likely than all alternatives (by a margin Tassign), the assignment is made.

The relationships are condensed into seven categories (Table \@ref(tab:rel7)). Under the default settings (Complx='full'), these also include their combinations (see section Unusual relationships](#sec:WeirdRels)), e.g. a pair may be both paternal half-siblings and maternal aunt/niece if they were produced by a mother-daughter pair mating with the same male.

knitr::kable(cbind(" "= c("PO", "FS", "HS", "GP", "FA", "HA", "U"),
                   Relationship = c("Parent - offspring",
                     "Full siblings",
                     "Half siblings",
                     "Grandparental",
                     "Full avuncular (aunt/uncle)",
                     "Half avuncular (+ other 3rd degree)",
                     "Unrelated")),
             caption = "Main relationships considered", escape=FALSE,
             booktabs=TRUE, position="!th")

To reduce computational time, these likelihoods are only calculated after filtering pairs based on the number of Mendelian errors, their age difference, and a quick calculation of LLR(focal / unrelated) without conditioning on any earlier assigned relatives (must pass Tfilter).

For any pair of genotyped individuals, these seven likelihoods can be calculated with CalcPairLL().

Genotyping errors {#sec:GenoErrors}

Effect

A higher rate of genotyping errors leads to more false negatives and more false positives during pedigree reconstruction, as illustrated in Figure \@ref(fig:ErrSimSeq). The magnitude of the effect varies: One wrong assignment due to a few genotyping errors in a key individual may have numerous knock-on effects, while dozens of genotyping errors in 'dead end' individuals may have no consequences at all.

The simulations (SimGeno()) assume amongst others that genotyping errors are independent from each other and from a sample's call rate, and likely give a more optimistic picture than real data.

Assumed genotyping error rate Err

The reduced performance at high genotyping error rates can be somewhat mitigated by running sequoia() with a roughly correct presumed genotyping error rate Err (Figure \@ref(fig:ErrSimSeq): compare red vs yellow symbols at an error rate of 1-2%).

Among others, when Err is much lower than the actual genotyping error rate, some true parent-offspring pairs will be assigned as full siblings. On the other hand, when Err is much higher than the actual genotyping error rate some true full siblings may be classified as parent-offspring, potentially leading to various 'secondary' errors. This trade-off can be explored using CalcPairLL() if some known parent-offspring and full sibling pairs are available.

(ref:caption-err) High simulated genotyping error rate (x-axes) increases both false negatives (top) and false positives (bottom). A roughly correct presumed error rate (colours) improves performance. Results from pedigree Ped_HSg5 with 200 SNPs, call rate 0.99, 40% of parents non-genotyped. Diamonds: means across 10 replicates.

knitr::include_graphics("Errs-effect.png")

NOTE that the default assumed genotyping error rate of 0.01% is based on data from a SNP array after stringent quality control; in many cases the genotyping error rate will be (much) higher [^a]. It is typically worthwhile to explore if and how results change when you vary this parameter, possibly in combination with non-default values for the thresholds Tfilter and Tassign. Do keep in mind the relative consequences of potential misassignments (false positives) versus non-assignments (false negatives) during later use of the pedigree.

[^a]: but changing the default value is likely to cause problems for people re-using old code.

Performance with high genotyping error rates

sequoia does not cope well with high rates of genotyping errors, as the sequential approach (see Background) can lead to a cascading avalanche of assignment errors. If the rate is around 1\%, be cautious when interpreting the results and please use EstConf() to estimate the assignment error. With genotyping error rates above 5\% I would strongly suggest to use MCMC-based approaches, or other approaches more robust to poor genotyping quality.

knitr::include_graphics("ARER-cats.png")

Assignment of genotyped parents to genotyped offspring is fairly robust to low SNP number and high genotyping errors (left-most column in Figure \@ref(fig:ARER-cats)), while pedigree links involving a dummy individual are more prone to mis-assignment and non-assignment (columns 2-4 in Figure \@ref(fig:ARER-cats)). In these simulations, all SNPs are completely independent, call rate is high, and founders are unrelated; in real data the overall performance is probably lower but the patterns will be similar.

(ref:caption-ARERcats) Non-assignment (top) and assignment errors (bottom) for a range of SNP numbers (x-axes) and simulated genotyping errors (colours). sequoia presumed error rates ('Err') set equal to simulated error rates. Results for a deer pedigree with overlapping generations and inbreeding; 40% of parents were presumed non-genotyped, call rate=0.99 for remainder.

Effect on runtime

Computational time tends to increases with both simulated and assumed genotyping error rate, as the time-saving filtering steps become less effective (Figure \@ref(fig:Errs-time)).

knitr::include_graphics(c("Errs-time-HSg5.png", "Errs-time-deer.png"))

Error matrix

Different genotyping methods are likely to have different error structures (Table \@ref(tab:ErrM)). Therefore, from version 2.0 sequoia allows fine-scale control over how potential genotyping errors are accounted for. An error matrix with the probability to observe genotype $G$ (columns), conditional on actual genotype $g$ (rows), can now be specified via parameter ErrFlavour. The current default (version2.0) is given in Table \@ref(tab:ErrM), and previous defaults are shown in the help file of ErrToM(). The slight differences between versions only have any consequences when the error rate is high ($>1%$).

ErrM <- matrix(c("$(1-E/2)^2$", "$E\\times(1-E/2)$", "$(E/2)^2$",
                 "$E/2$", "$1-E$", "$E/2$",
                 "$(E/2)^2$", "$E\\times(1-E/2)$", "$(1-E/2)^2$"),
               3,3, byrow=TRUE,
               dimnames=rep(list(c("aa","Aa","AA")), 2))
knitr::kable(ErrM,
             caption = "Probability of observed genotype (columns) conditional on actual genotype (rows) and per-locus error rate E", escape=FALSE, booktabs=TRUE,
             position="!hb")

From version 2.7 onwards, it is also possible to specify the genotyping error structure as a length 3 vector, with probabilities (between brackets = current defaults):

During pedigree inference, the error rate is presumed equal across all SNPs. It is also assumed that none of the errors (or missingness) are due to heritable mutations.

Estimate

Function SnpStats() can estimate the genotyping error rate per SNP, conditional on a provided pedigree and error structure (ErrFlavour). These estimated error rates will not be as accurate as from duplicate samples, since a single error in an individual with many offspring will be counted many times, while errors in individuals without parents or offspring will not be counted at all.

Birth years \& Ageprior {#sec:BY-AP}

Providing individual birth years will typically greatly improve the completeness and speed of pedigree reconstruction. Age data is used to subset the potential relationships between each pair, and is required to orientate parent--offspring pairs the right way around. Birth year information is not strictly required for every individual: parent-parent-offspring trios can be assigned even in absence of any age data, as long as the sex of either parent is known.

'Birth year' should be interpreted loosely, as the time unit (week/month/year/decade/...) of birth/hatching/germination/... .

For individuals (or 'mystery samples') with unknown birth years, the probability that they were born in year $y$ can be estimated using CalcBYprobs(). This relies on the (estimated) birth years of their parents and offspring, and the age distribution of parent-offspring pairs as specified in the age prior.

Definition \& interpretation

The 'agepriors' as used by sequoia() is not an official term, nor a proper prior, but a set of age-difference based probability ratios which can be interpreted as

If I were to pick two individuals with an age difference $A$, and two individuals at random, how much more likely are the first pair to have relationship $R$, compared to the second pair?

The value may vary from $0$ (impossible), to $1$ (just as likely), and typically not far beyond $10$ (10x as likely). This quantification allows age differences to be used jointly with genetic information, which is mostly done in a conservative fashion (except when UseAge='extra'): only when the focal relationship is the most likely both when considering only genetic data, and when considering genetic + age data, is the assignment made. 'Most likely' applies to from the set of possible relationships, i.e. those with an age prior value $>0$.

For example, in datasets with only one or two ageclasses, it is worth contemplating whether or not avuncular relationships should be considered as alternatives to siblings. The behaviour can be changed by explicitly declaring generations as Discrete (avuncular cannot have age difference of 0), and/or by setting MaxAgeParent.

You can find extensive details about the ageprior in this separate vignette, including worked examples, mathematical framework, and full range of options.

Implementation

Parentage assignment is run with an ageprior only specifying that parents and offspring cannot be the same age. The resulting 'scaffold pedigree' is then used by MakeAgeprior() to estimate the dataset-specific age-difference distributions for each relationship (mother, father, full sibling, maternal and paternal half sibling), which are standardised by the age-difference distribution among all pairs. The resulting ageprior probability ratios are then used during full pedigree reconstruction (Figure \@ref(fig:AP-pipeline)).

knitr::include_graphics("agepriors-pipeline.png")

When running sequoia(), arguments can be passed to MakeAgePrior() via args.AP.

Smooth & Flatten

Often not all biologically possible age-differences are present in the scaffold pedigree. Therefore MakeAgeprior() by default applies a correction to allow for relatives to differ 1-2 years more in age than observed in its input pedigree, and to smooth out any dips or gaps in the age distributions. However, when such abrupt terminations or gaps are biologically meaningful, these corrections can and should be turned off (Smooth=FALSE and/or Flatten=FALSE). This is done automatically in case discrete generations are declared (Discrete=TRUE) or detected in the scaffold pedigree.

Age-based non-assignment

Any genetically identified parent-offspring pairs and other relatives which are deemed impossible based on their age difference and the age prior matrix, and thus not assigned, can be found by GetMaybeRel(). You may then correct or discard their birth years and/or update the ageprior to allow for their age difference (see Customising the ageprior), and re-run pedigree reconstruction.

Parent LLR

When parentage assignment or pedigree reconstruction is completed (when the total likelihood has asymptoted), a log10-likelihood ratio (LLR) is calculated for each assigned parent-offspring pair. This is

The ratio between the likelihood of being parent and offspring, versus the next-most-likely relationship between them, conditional on all other relationships in the pedigree.

This differs from for example Cervus [@marshall98], which returns the ratio between the likelihood that the assigned parent is the parent, versus that the next most likely candidate is the parent.

The LLRs for the dam and sire separately are calculated by temporarily dropping the co-parent (if any), and calculating the likelihoods assuming the co-parent is a random draw from the population. The LLR for the parent-pair is the minimum of the dam and sire LLR, each conditional on the co-parent. To retrieve the likelihoods underlying each ratio, use CalcPairLL(); specify column dropPar1 for single-parent likelihoods.

Since this step can be quite time-consuming, sequoia() can be run with CalcLLR=FALSE, and CalcOHLLR() can be run separately later. That function can also calculate parental LLR's for any existing pedigree, including for any non-genotyped individuals with at least one genotyped offspring.

Confidence {#sec:EstConf}

The parental likelihood ratio does not necessarily indicate how likely it is that the assignment is correct. Pedigree-wide confidence probabilities can, among others, be estimated as follows

When repeated at least 10--20 times, the proportion of assignments that is correct then provides an estimate of the confidence probability. This will be an optimistic estimate, as SimGeno() assumes all SNPs are independent, that parental genotypes are missing at random, and makes various other simplifying assumptions.

This process is conveniently wrapped in the function EstConf(), which calculates separate confidence levels for dams and sires (which may have very different sampling proportions), for genotyped and dummy parents, and for single parents and members of parent-pairs. The process can be rather time consuming, but from version 2.1 onwards it is possible to run on multiple computer cores in parallel.

Low assignment rate {#sec:lowAR}

Pedigree reconstruction with sequoia() regularly suffers from a low assignment rate, especially when only a small proportion of parents have been genotyped. This is fundamental to the method: when in doubt, MCMC approaches will return an assignment with a low posterior probability, but sequoia() will return no assignment at all.

In sequoia(), an incorrect assignment early on may cause a cascade of other incorrect assignments later in the run, and great care is taken to prevent this. On the flip side, when there is limited genetic or birth year information and/or very few parents have been genotyped, the cascade of correct assignments, where maternal assignments fascilitate paternal assignments and v.v. (Figure \@ref(fig:ClusterMatPat)), may never take off. For such situations, different programs with MCMC based approaches are more suitable.

knitr::include_graphics("clustering_mat_helps_pat.png")

Causes

There are four main causes a pair of true relatives are not assigned, which are not mutually exclusive:

  1. There is more than one type of relationship possible between the pair, which cannot be distinguished with certainty (details below);
  2. The type of relationship is clear (e.g. half siblings), but it is unclear whether they are maternal or paternal relatives;
  3. It is clear the pair are mother and daughter (or grandmother -- grand-granddaughter), but it is unclear which of the two is the (grand)mother, due to unknown birth year(s) and absence of a 'complementary' candidate father (and analogous for father-son);
  4. There is more than one plausible candidate (dummy) (grand)parents of one sex, all of which are more likely to be a parent than otherwise related, even when considered jointly. This includes the situation where true offspring are among the candidate parents due to unknown birth years.

For more details, see FAQ.

Cause (1) has two common underlying causes, which again are not mutually exclusive:

1a. Half-siblings, grandparent--grand-offspring and full avuncular pairs are genetically indistinguishable, even with perfect genetic data. Assignment of half-sibling pairs is only possible when grand-parental and avuncular relationships can be excluded based on the age-difference of the pair, or when both individuals already have a parent assigned (in which case the likelihoods do differ). 1b. The SNP data is not informative enough to clearly distinguishing between parents and full siblings, between full and half siblings, between half siblings and cousins, etc.. The problem may be specific to a few pairs and due to the randomness of Mendelian segregation, or be widespread due to a low number of SNPs, low MAF, and/or high error rate.

Remedies

Sometimes it is possible to reduce the number of non-assigned parents in the pedigree by providing more life history data, removing SNPs with the highest error rates, or lowering the thresholds Tfilter and/or Tassign. The latter will reduce the probability of false negatives, but at the same time increase the probability for false positives. If inbreeding and double relationships are rare, using Complx='simp' may also improve assignment (see here).

Small adjustments may 'get the snowball rolling' (Figure \@ref(fig:ClusterMatPat)) and greatly increase assignments. Once a sibling pair is assigned, adding additional siblings to the cluster becomes progressively easier as it becomes more and more obvious whether an individual is related to all individuals in the sibship, or to none.

There are functions in the package that may help identify whether there are un-assigned likely relatives in the dataset, and/or the probable cause of non-assignment of presumed relatives:

\clearpage

Function overview

tbl_core_fun <- rbind(
c("sequoia","Y","","","+","+","Pedigree, SeqList","SeqListSummary"),
c("GetMaybeRel","Y","","","+","+","Pairs","PlotRelPairs"),
c("CalcOHLLR","Y","Y","","+","+","","SeqListSummary"),
c("CalcPairLL","Y","+","Y","+","+","","PlotPairLL "),
c("MakeAgePrior","","+","","","","AgePrior","PlotAgePrior"),
c("SimGeno","","Y","","","","GenoM","SnpStats"),
c("PedCompare","","YY","","","","","PlotPedComp"),
c("ComparePairs","","Y+","+","","","Pairs","PlotRelPairs"))
colnames(tbl_core_fun) <- 
 c("Function","GenoM","Pedigree","Pairs","AgePrior","SeqList","Reusable output","Plot function")

knitr::kable(tbl_core_fun, align='lcccccll',
             caption = "Core functions with selected subset of their required (Y) and optional (+) input, and their output that can be re-used as input in other functions.", escape=FALSE, booktabs=TRUE)

Input

Simulate

Ageprior

Pedigree reconstruction

Pedigree check

These functions can be applied to any pedigree, not just pedigrees reconstructed by sequoia. Required input between brackets

Pairwise relationships

[^d]: The matrix returned by DyadCompare \emph{(Deprecated)} is a subset of the matrix returned here using default settings.

Miscellaneous

sequoia() input

Genotypes {#sec:GenoFormat}

The SNP data should be provided as a numeric matrix GenoM with one line per individual, and one column per SNP, with each SNP is coded as 0, 1, 2 copies of the reference allele, or missing (-9). The rownames should be the individual IDs, and column names are ignored.

GenoM <- as.matrix(read.table("MyGenoData.txt", row.names=1, header=FALSE))

When the genotype data is in another format, such as Colony input files or PLINK's .ped or .raw files, GenoConvert() can be used.

Subset SNPs

Using tens of thousands of SNP markers for pedigree reconstruction is unnecessary, will slow down computation, and may even hamper inferences by their non-independence. Rather, a subset of SNPs with a decent genotyping call rate (e.g. $>0.9$), in low linkage disequilibrium (LD) with each other, and with high minor allele frequencies (e.g. MAF $> 0.3$) ought to be selected first if more than a few hundred SNPs are available. The calculations assume independence of markers, and while low (background) levels of LD are unlikely to interfere with pedigree reconstruction, high levels may give spurious results. The recommended maximum number of SNPs therefore depends on genome size and chromosome number. Markers with a high MAF provide the most information, as although rare allele provide strong evidence when they are inherited, this does not balance out the rarity of such events.

It is advised to 'tweak' the filtering thresholds until a set with a few hundred SNPs (300--700) is created. To assist with this, the function SnpStats() gives for each SNP both the allele frequency and the missingness. In addition, when a pedigree is provided (e.g. an existing one, or from a preliminary parentage-only run), the number of Mendelian errors per SNP is calculated and used to estimate the genotyping error rate.

PLINK

Since the full dataset may be too large to easily load into R, creating a subset of SNPs can for example be done using PLINK:

plink --file mydata --geno 0.1 --maf 0.3 --indep 50 5 2

which on a windows machine is equivalent to running inside R

system("cmd", input = "plink --file mydata --maf 0.3 --indep 50 5 2")

This will create a list of SNPs with a missingness below 0.1, a minor allele frequency of at least 0.3, and which in a window of 50 SNPs, sliding by 5 SNPs per step, have a VIF of maximum 2. VIF, or variance inflation factor, is $1/(1-r^2)$. For further details, see the plink website.

The resulting list (plink.prune.in) can be used to create the genotype file used as input for Sequoia, with SNPs codes as 0, 1, 2, or NA, with the command

plink --file mydata --extract plink.prune.in --recodeA --out inputfile_for_sequoia

This will create a file with the extension .RAW, which can be converted to the required input format using

GenoM <- GenoConvert(InFile = "inputfile_for_sequoia.raw", InFormat="raw")

This function can also convert from two-columns-per-SNP format, as e.g. used by Colony.

Family IDs

By default, the 'Family ID' (1st) column in the PLINK file is ignored, and IDs are extracted from the second column only. If the family IDs are essential to distinguish between individuals, use GenoConvert() with the flag UseFID = TRUE which will combine individual IDs and family IDs as FID__IID. Ensure the IDs in the life history file are in the same format, for example by using LHConvert(). The FID and IID can be split again in the resulting pedigree using PedStripFID().

Low call rate

Samples with a very low genotyping success rate (call rate) can sometimes wrongly be assigned as parents to unrelated individuals, as sequoia() does not (yet) deal perfectly with these cases. In addition, at least in my experience with SNP arrays, a low sample call rate is often indicative of poor sample quality or a poor genotyping run, and associated with a high sample error rate. Samples with a call rate below 5% are automatically excluded, but it is strongly advised to create a subset where all individuals are genotyped for at least 50% of SNPs, and preferably for at least 80%.

In addition, SNPs with a call rate below 10% are excluded, as these contribute almost no information. Again, a stricter threshold is advised of at least 50% to minimise the risk of spurious results.

Checks for individuals and SNPs with low call rate, and monomorphic SNPs, are done automatically when calling sequoia() and various other functions, and can be done separately by calling CheckGeno().

Very large datasets

When the number of individuals is very large, it may be impossible to load the genotype data into R as it will exceed R's memory limit. A stand-alone version of the algorithm underlying this R package does not suffer from this limitation, and is available as Fortran source code from github, where you can also find its manual. The standalone is not faster than the R package, as the bulk of the computations are done in Fortran regardless.

Birth years \& sex

The relative age of individuals can greatly improve pedigree reconstruction (see Birth years & Ageprior), and sex information is necessary to determine whether a candidate parent is the mother or father.

LifeHistData

The life history data (LifeHistData) should be a dataframe with three or five columns (column names are ignored, order is important!):

Ideally this basic life history information is provided for all genotyped individuals, but this is not strictly necessary. This dataframe may be in a different order than the genotype data, and may include many more individuals (which are ignored).

Unknown birth years

The year of birth may be unknown for some individuals, and for those sequoia() cannot determine whether they are the parent or the offspring if a genetic parent-offspring pair is found (unless a 'complementary' co-parent is identified). Especially in wild populations this information can be unknown for a substantial part of the sample, but often a minimum and/or maximum possible birth year (BY.min and BY.max) can be determined. For example, BY.max is at the latest the first year in which an individual was observed.

LH <- data.frame(ID = c("Alpha", "Beta", "Gamma"), 
                 Sex = c(2,2,1),
                 BirthYear = c(NA, 2012, NA),
                 BY.min = c(NA, NA, 2008),
                 BY.max = c(2010, NA, 2009))
knitr::kable(LH,
             caption = "Example LifeHistData with the optional columns for minimum + maximum birth year", booktabs=TRUE, position="b")

Even very wide ranges may help in pedigree reconstruction: for example, Alpha and Beta are genetically identified to be parent and offspring, Alpha was first seen in 2011 as an adult, and Beta was known to be born 2012 (Table \@ref(tab:BY-example)). Then Alpha is certainly the parent of Beta. Before version 2.0 there was no method to inform sequoia that Alpha could not have been born after 2012, and thus could not be an offspring of Beta. It was (and still is) possible to 'guestimate' birth years, but this can be time consuming and is potentially prone to errors. For example, if Alpha also forms a genetic parent-offspring pair with Gamma, then setting Alpha's birth year to 2010 will result in it being assigned as Gamma's offspring, while it may just as well be Gamma's parent.

To minimise such pedigree errors where parent and offspring are flipped the wrong way around, and the 'secondary' errors derived from these cases, it is recommended to set the possible birth year range as wide as possible, at least during an initial (preliminary) round of parentage assignment. To see the pairs that are genetically parent and offspring, but for which it cannot be determined which of the two is the parent, use function GetMaybeRel().

args.AP (Customising the ageprior) {#sec:custom-AP}

When running sequoia(), you can pass arguments to MakeAgePrior() via args.AP. For example, when you are sure generations do not overlap, you can specify this via MakeAgePrior()'s argument Discrete:

SeqOUT <- sequoia(GenoM = SimGeno_example, LifeHistData = LH_HSg5, Module='par',
                  args.AP = list(Discrete = TRUE))

Or you can specify that the maximum age of (non-genotyped) parents exceeds the observed age range among SNP-genotyped parent-offspring pairs in PedigreePar:

SeqOUT <- sequoia(GenoM = Geno, LifeHistData = LH,
                  args.AP = list(MaxAgeParent = c(11, 9)),  # dams, sires
                  SeqList = ParOUT[c("Specs", "PedigreePar")])

Alternatively, if you have a field-pedigree or microsatellite-based pedigree that contains many more individuals than have been SNP-genotyped, an ageprior estimated from that old pedigree will be more informative than the one estimated from a limited number of SNP-genotyped parent-offspring pairs. This can be done as follows:

APfromOld <- MakeAgePrior(Pedigree = MyOldPedigree,
                          LifeHistData = LH,
                          Smooth = TRUE)
SeqOUT <- sequoia(GenoM = Geno,
                  LifeHistData = LH,
                  SeqList = list(AgePriors = APfromOld))

When argument SeqList contains an element AgePriors, it is used during both parentage assignment and sibship clustering. Thus, if you wish to use different agepriors during those different phases, you need to run them separately (see reuse).

For further information on the ageprior, please see the separate vignette on this topic. If you wish to manually edit the AgePriors matrix before using it as input in sequoia(), or have done so in the past, it is advised to look at the latest version of this document, as the implementation has changed in package version 2.0, and the documentation was clarified in version 2.1.

Tfilter & Tassign

These are threshold values for log10-likelihood ratios (LLRs).

Tfilter is the threshold LLR between a proposed relationship versus unrelated, to select candidate relatives. It is typically negative, and a more negative value may prevent filtering out of true relatives, but will increase computational time.

Tassign is the threshold used for acceptance of a proposed relationship, and is relative to next most likely relationship. It must be positive, with higher values resulting in more conservative assignments. Counter-intuitively a high Tassign can sometimes increase the number of wrong assignments, as a correct low-threshold assignment may prevent wrong assignments among close relatives (a.o. by revealing genotyping errors and changing the most-likely true genotype of an individual).

The default values of Tfilter and Tassign seem to work well in a wide range of datasets, but their interaction with genotyping error rate and other dataset characteristics has not been thoroughly explored.

Complex {#sec:Complx}

The complexity of the mating system considered. One of:

Unusual relationships {#sec:WeirdRels}

Pedigree inference is often applied in small, (semi-)closed populations, and is regularly done to check for inbreeding. In such cases, pairs of individuals may be related via more than one route. For example, maternal half-siblings may also be niece and aunt via the paternal side, and be mistaken for full-siblings. Or a pair may be both paternal half-siblings, and maternal full cousins. A range of such double relationships is considered explicitly (Table \@ref(tab:DoubleRels)) to minimise such mistakes. If such a type is common in your population but not yet considered by sequoia, and seems to be causing problems, please send me an email as adding additional relationships is relatively straightforward.

DR <- rbind(PO = c("--", "--", "Y", "Y","","Y","","","", "Y"),
            FS = c("--", "--", "--", "--", "--", "Y","", "--", "Y", "Y"),
            HS = c("Y", "--", "(FS)", "Y", "Y ^2^", "Y", "Y ^2^","","", "Y"),
            GP = c("Y", "--", "Y", "^1^","","","","","", "Y"),
            FA = c("", "--","","","", "Y","","","", "Y"),
            HA = c("", "Y", "Y ^2^","","","","","","", "Y"),
            GGG = c("", "--","","","","", "^3^","","", "Y"),
            F1C = c("","","","","","","","","", "Y"))
colnames(DR) <- c(rownames(DR), "H1C", "U")
knitr::kable(DR,
             caption = "Double relationships between pairs of individuals. Abbreviations as before, and GGG=great-grandparent, F1C=full first cousins, H1C=half first cousins (parents are HS).",
             format = "pipe", booktabs=TRUE)  # forces ^1^ etc. through pandoc first 

--: impossible
Y: explicitly considered
empty: not (yet) explicitly considered (but may be inferred as 'side effect')
^1^: Can not be considered explicitly, as likelihood identical to PO
^2^: Including the special case were one is inbred
^3^: Can not be considered explicitly, as likelihood identical to GP

Herm

Hermaphrodites, one of: no Dioecious (separate sexes) species A Heed dam versus sire role of individuals, and assign a parent only if it is clear that it was the female or male parent of an individual. Requires a pedigree prior (e.g. the plant from which the seed was collected), or age difference between maternal versus paternal role (see Hermaphrodites. * B No distinction is made between dam versus sire role; any parent is assigned in the first available 'slot', and no conclusions can be drawn from whether individuals are assigned as maternal, paternal or 'cross' half-siblings.

If any individuals in the life history data have sex=4, Herm='no' will automatically be changed into Herm='A'. Hermaphrodites may be mixed with known-sex (or known sex role) individuals. Herm='A' or 'B' can also be selected if none of the genotyped individuals are hermaphrodites, but non-genotyped (dummy) parents could be.

UseAge

The strength of reliance on birth year information during full pedigree reconstruction.

In addition to providing no birth year information at all, there are three possible levels of reliance on birth years and ageprior during full pedigree reconstruction:

UA <- cbind("no" = c("Y", "Y", ""),
            "yes" = c("Y", "Y", "Y"),
            "extra" = c("Y", "", "Y"))
rownames(UA) <- c("$LR_{age} > 0$", "$LLR_{SNP} > T_{assign}$", 
                  "$LLR_{SNP} + LLR_{age} > T_{assign}$")

knitr::kable(UA, caption = "Parameter 'UseAge' (see text for definitions)", align='ccc', escape=FALSE, booktabs=TRUE, position="!h")

Below an example of how to calculate $LLR_{SNP}$ and $LLR_{age}$, for a grandparent -- grand-offspring pair from the griffin example pedigree, with an age difference of 4 years. This pair would not be assigned with UseAge = 'yes', as purely genetically the likelihoods for the pair being HS, GP or FA are identical (unless both individuals already have at least 1 parent assigned), but would get assigned with UseAge = 'extra', as in this fictional population 4-year-older paternal grandmothers are about 10x as common as 4-year-older paternal aunts (difference of 1 unit on log10-scale).

Such assignments which rely strongly on the age difference of a pair, in combination with the estimated age distribution of relative pairs, are often not desirable, and are therefore not made by default (UseAge = 'yes').

data(Ped_griffin, SeqOUT_griffin, package="sequoia")
GenoX <- SimGeno(Ped_griffin, nSnp = 400, ParMis=0)
LLR_SNP <- CalcPairLL(Pairs = data.frame(ID1="i122_2007_M", ID2="i042_2003_F"), 
                      GenoM = GenoX, Plot=FALSE)
LLR_Age <- GetLLRAge(SeqOUT_griffin$AgePriorExtra, agedif=4, patmat=2)
knitr::kable(rbind(SNP = LLR_SNP[,colnames(LLR_Age)],
                   Age = LLR_Age,
                   "SNP + Age" = LLR_SNP[,colnames(LLR_Age)] +
                     LLR_Age),
             digits=1, booktabs=TRUE, 
             caption = "LLR example for a grandparent - grand-offspring pair")

SeqList {#sec:reuse}

Parameter settings and input data from one sequoia() run can be re-used in a subsequent sequoia() run, and as input by various other functions (Table \@ref(tab:CoreFun)). This makes it easier to have consistent parameter values and input data across different pedigree inference runs and different functions.

The parameter values used as arguments when calling sequoia() are returned in the list element Specs. This 1-row dataframe may be edited (with caution!) before re-use. Other elements of the sequoia() output list (SeqList) that are used as input by one or more functions are LifeHist, AgePriors, and PedigreePar/Pedigree. Details on which elements are used is provided in the help file of each function. SeqList[["Specs"]] nearly always take precedent over similarly-named input parameters, with exception of e.g. sequoia()'s Module argument. The other SeqList elements generally also take precedent, but not always with a warning. If you wish to use a mixture of SeqList elements and 'new' data, it is safest to provide only the minimal subset of SeqList, and specify the other data as separate input (or as 'fake' SeqList elements).

For example, if you wish to change the maximum sibship size, but do not want to re-run parentage assignment:

ParOUT$Specs$MaxSibshipSize <- 150
SeqOUTX <- sequoia(GenoM = Geno,
                  SeqList = ParOUT[c("Specs", "PedigreePar", "LifeHist")],
                  Module = "ped")

Running sequoia()

Module (MaxSibIter)

Most datasets initially contain imperfections. Pedigree reconstruction can be a useful tool in quality control, to find outliers with many Mendelian errors among e.g. known mother--offspring pairs (sample mislabeling?) or among SNPs (poor quality SNPs to be excluded).

To speed up iterative quality control, it is possible to run sequoia() only up to a certain point, use the output thus far to, as necessary, exclude SNPs, exclude/relabel samples, or adjust input parameters, and re-run again from the start or resume from the last point.

There are three possible intermediate stopping points, plus the final end point, each named after the last 'Module' that is run:

  1. 'pre': Input check
    check that the genotype data, life history data and input parameters are in a valid format, create 'Specs' element of output list
  2. 'dup': check for duplicates
    Check for (nearly) identical genotypes, and for duplicated IDs in the genotype and life history data
  3. 'par': parentage assignment
    Assign genotyped parents to genotyped individuals. Includes call to MakeAgePrior() to estimate AgePriors based on the just-assigned parents.
  4. 'ped': full pedigree reconstruction
    Cluster half- and full-siblings and assign each cluster a dummy-parent; assign grandparents to sibships and singletons.

MaxSibIter {#sec:MaxSibIter}

Up to sequoia version 2.0, MaxSibIter was used to switch between these options, with numeric values

The maximum number of iterations was always only intended as a safety net in case the total likelihood did not stabilise, but this has not been an issue since sequoia version 1.0 (as far as I am aware). Even for large, complex datasets (several thousand individuals with considerable close inbreeding) the total likelihood plateaus in 10--15 iterations. From version 2.1 onward, MaxSibIter$=42$; if this number of iterations is reached it is a strong sign of problems with the data, e.g. that the SNPs are not informative enough to reconstruct the pedigree.

Input check

SNP datasets are typically too large to easily spot any problems by simply looking at the data in a spreadsheet. There are many tools available to deal with genotype data and do proper quality control; the function CheckGeno() merely checks that the data are in the correct format for sequoia, and that there are no SNPs or individuals with excessively many missing values.

The function SnpStats() can be used to count the number of Mendelian errors per SNP. It is not automatically called by sequoia(), but excluding SNPs with exceptionally high genotyping error rates is recommended to improve accuracy of the inferred pedigree.

Check for duplicates

The data may contain positive controls, as well as other intentional and unintentional duplicated samples, with or without life-history information. sequoia() searches the data for (near) identical genotypes, allowing for MaxMismatchDUP mismatches between the genotypes, which may or may not have the same individual ID. The threshold value depends on the presumed genotyping error rate, allele frequencies and number of SNPs, and is calculated by CalcMaxMismatch().

Not all pairs flagged as potential duplicate genotypes are necessarily actual duplicates: inbred individuals may be nearly indistinguishable from their parent(s), especially when the number of SNPs is limited.

This check will also return a vector of individuals included in the genotype data, but not in the life history data (NoLH). This is merely a service to the user; individuals without life history information can often be successfully included in the pedigree.

Parentage assignment

The number of pairs to be checked if they are parent and offspring is very large for even moderate numbers of individuals, e.g. 5,000 pairs for 100 individuals, and 2 million for 2,000 individuals. To speed up computation, three 'sieves' are applied sequentially to find candidate parent-offspring pairs, with decreasing 'mesh size'

The older of the pair is assigned as parent of the younger. If it is unclear which is the older, or if it is unclear whether the parent is the mother or the father, no assignment is made, but these pairs can be found with GetMaybeRel() (see section Likely relatives). If there are multiple candidate parents of one or both (or unknown) sex(es), the parent pair or single parent resulting in the highest likelihood is assigned.

The heuristic sequential filtering approach makes parentage assignment quick, and usually takes only a few minutes, especially when setting CalcLLR=FALSE (do not re-calculate LLR parent-offspring vs next-most-likely relationship for all assigned parents, based on the final pedigree).

AgePrior

The assigned parents are used to update the ageprior, before returning the results to the user of continuing with full pedigree reconstruction. Any arguments to MakeAgePrior() can be passed via args.AP. For example, you can specify the maximum age of dams and sires:

SeqOUT.B <- sequoia(GenoM = SimGeno_example, Err = 0.005,
                    LifeHistData = LH_HSg5, Module="par", Plot=FALSE, quiet=TRUE,
                    args.AP = list(MaxAgeParent = c(3,2), Smooth=FALSE))
PlotAgePrior(SeqOUT.B$AgePriors)

This way, even if sampled parents are all age 1, siblings sharing an unsampled parent may still have an age difference of 1 or 2 years, and grandparent -- grand-offspring pairs may have an age difference of 2-6 years.

Full reconstruction

During full pedigree reconstruction, assignment of all first and second degree links between all individuals is attempted, using the following steps within each iteration

The order of these steps, and the skipping of some steps in the first iteration(s), has been established by trial & error to minimise false positive assignments while maximising correct assignments across a wide range of datasets. When running sequoia(), you can keep track of the progress through the various steps by setting quiet = 'verbose'.

Full pedigree reconstruction may take from a few seconds to several hours, depending on the number of individuals without an already assigned parent, the proportion of individuals with unknown sex or birth year, the number of sibships that is being clustered and their degree of interconnection, and the number of SNPs and their error rate. Generally computation is faster if the data is more complete and more accurate.

Dummy Individuals

The 'dummy' parent assigned to each cluster of half-siblings is denoted by a 4-digit number with prefix F for females (F0001, F0002, ...) and M for males (M0001, M0002, ...). In the output, dummy individuals are appended at the bottom of the pedigree with their assigned parents, i.e. the sibship's grandparents. In addition, all information for each dummy individual is given in dataframe DummyIDs, including its parents (again), its offspring, and the estimated birth year, as a point estimate (BY.est) and lower and upper bound of the 95% probability interval. This information is intended to make it easier to match dummy IDs to real IDs of observed but non-genotyped individuals (see also Compare pedigrees).

Total likelihood

The total likelihood provides a measure of how well the observed genotype data is explained by the currently inferred pedigree, the allele frequencies of the SNPs, the presumed genotyping error rate, and the assumption that founder genotypes were drawn from a genepool in Hardy-Weinberg equilibrium. It is always a negative number, and closer to zero indicates a better fit. When an asymptote is reached, typically in 5--10 iterations, either the algorithm is concluded (the default) or dependency on the age prior is increased (if UseAge = 'extra') and the algorithm continues until a new asymptote is reached.

Save output

There are various ways in which the output can be stored. This includes saving the seqoia list object, and optionally various other objects, in an .RData compressed file

save(SeqList, LHdata, Geno, file="Sequoia_output_date.RData")

which can be read back into R at a later point

load("Sequoia_output_date.RData")
# 'SeqList' and 'LHdata' will appear in R environment

The advantage is that all data is stored and can easily be manipulated when recalled. The disadvantage is that the file is not human-readable, and (to my knowledge) can only be opened by R.

Alternatively, the various dataframes and list elements can each be written to a text file in a designated folder. This can be done using write.table or write.csv, or using the wrapper writeSeq():

writeSeq(SeqList, GenoM = Geno, folder=paste("Sequoia_OUT", Sys.Date()))

The same function can also write the dataframes and list elements to an excel file (.xls or .xlsx), each to a separate sheet, using package xlsx:

writeSeq(SeqList, OutFormat="xls", file="Sequoia_OUT.xlsx")

Note that 'GenoM' is ignored, as a very large genotype matrix may result in a file that is too large for excel to open. If you have a genotype matrix of modest size, you can add it to the same excel file; see the example in ?writeSeq.

library(xlsx)
write.xlsx(Geno, file = "Sequoia_OUT.xlsx", sheetName="Genotypes",
      col.names=FALSE, row.names=TRUE, append=TRUE, showNA=FALSE)

The option append=TRUE ensures that the sheet is appended to the file, rather than the file entirely overwritten.

Output check

Pedigree stats \& plots

Various summary statistics can be calculated and displayed using SummarySeq(), such as the number and kind of assigned parents (Figure \@ref(fig:SumSeq-parents)), family sizes (Figure \@ref(fig:SumSeq-sibsize)), and the distributions of Mendelian errors. Its output also includes a table analogous to pedigreeStats by R package pedantics (which was archived on CRAN in December 2019).

knitr::include_graphics("SummarySeq_deer.png")
knitr::include_graphics("SummarySeq_deer_sibships.png")

There is a range of software available to plot pedigrees in various styles, such as R package kinship2 or the program PedigreeViewer. Be aware that many use a different column order (id - sire - dam, rather than id - dam - sire), use $0$ to denote missing parents rather than NA, and/or do not allow individuals to have a single parent (they must have 0 or 2 parents). The last two issues can be resolved with Sequoia's function PedPolish().

Negative LLR's

The parent LLR's are calculated based on genetic data only, without any consideration of age. Consequently, they are often close to zero for dummy -- dummy parent-offspring pairs, as genetically the parent may just as well be the offspring. Such pairs will not have been assigned unless there is a compatible co-parent (LLRpair should always be positive) or if there is overwhelming age-based evidence to decide which one is the parent.

When a parental LLR is negative without an accompanying co-parent, or when LLRpair is negative, this is reason to cautious about the assignment and for example compare it to a previous pedigree or inspect the genomic relatedness. In contrast to MCMC type approaches, sequoia's algorithm has limited scope to undo previously made assignments, and may get stuck at suboptimal solutions.

Extremely negative LLRs ($< -100$) are typically caused by one or more of the alternative relationships considered, now conditional on the rest of the pedigree, being very convoluted and neither being caught as 'highly improbable and not implemented' nor being implemented properly (see Table weird relationships for double relationships that are implemented). In the unlikely event that the parent-offspring relationship itself is not implemented properly you will see an error value of 444 or 777.

Dyad plot

Per popular request, from version 2.1 sequoia includes PlotRelPairs() to produce pairwise plots, similar to Colony's plot of half- and full-sibling dyads. It plots the output from GetRelM(), where you can specify if you want to trace each individual's pedigree 1 generation back (as in Figure \@ref(fig:Relplot)) or 2 (include grandparents & aunts/uncles), and whether or not you want to differentiate between paternal and maternal relatives. It includes some basic functionality to subset individuals.

# Rel.griffin <- GetRelM(PedPolish(Ped_griffin[Ped_griffin$birthyear <2004, ]), 
#                        patmat=TRUE, GenBack=1)
# PlotRelPairs(Rel.griffin)   # poor figure quality
knitr::include_graphics("RelPlot-griffin.png")

Compare pedigrees {#sec:compPed}

Often a (part) pedigree is already available to which one wants to compare the newly inferred pedigree, for example to field-observed mothers or a microsatellite-based pedigree. PedCompare() compares for each individual the assigned parents between the two pedigrees (match/mismatch), while ComparePairs() compares for each pair their relationship between the two pedigrees (e.g. full sibs vs half sibs). Both pedigrees may have genotyped as well as non-genotyped individuals.

Most of the terminology in this vignette and the helpfiles is based on Ped1 being the 'true' pedigree (the one from which data is simulated) and Ped2 is the inferred pedigree (from the simulated data), but both functions work for any two pedigrees. From version 2.0 onwards you can indicate in PedCompare() that both pedigrees may be imperfect (Symmetrical = TRUE).

Dummy matching

Comparing two pedigrees becomes complicated when dummy IDs are involved -- When are non-identical parental IDs between the two pedigrees actually referring to the same individual? For example, in Table \@ref(tab:PedComp-ex1) the mother of Blossom and Bob is Beta in Pedigree1, and dummy female F0007 in Pedigree 2. It seems fair to assume that F0007 thus refers to the non-genotyped female Beta. And since Alpha was assigned as maternal grandmother to Blossom and Bob, we can also conclude that Alpha must be the mother of Beta!

Ped1 <- data.frame(id = c("Alpha", "Beta", "Abigail", "Aster", "Blossom", "Bob", "-"),
                    dam = c("","", "Alpha", "Alpha", "Beta", "Beta", ""),
                    sire = c("","", "Zorro", "Yann", "", "Zorro", ""))
Ped2 <- data.frame(id = c("Alpha", "-", "Abigail", "Aster", "Blossom", "Bob", "F0007"),
                   dam = c("", "", "Alpha", "Alpha", "F0007", "F0007", "Alpha"),
                   sire = c("", "", "Zorro", "M0004", "", "Zorro", ""))
knitr::kable(list(Ped1, Ped2), 
             caption = "Example comparison between Pedigree1 (left) and Pedigree2 (right)", booktabs=TRUE)

Matching and replacing dummy IDs this way extends the number of individuals with both pedigree and phenotypic data, which will among others increase power of any downstream quantitative genetic analyses.

More specifically, PedCompare() considers it a 'match' if the inferred sibship (in Pedigree2) which contains the most offspring of a non-genotyped parent (in Pedigree1), consists for more than half of this individual's offspring. If for example a cluster of $5$ siblings in Pedigree1 is split into two clusters in Pedigree2, the larger sibship (of say 3 individuals) is considered a match, and the smaller a mismatch (thus 2 mismatches) --- even though it could be argued the inferred Pedigree2 does not contain any incorrect links. For this same example, ComparePairs() would tell that the $(5*4)/2=10$ halfsib pairs in Pedigree1 are $3+1$ halfsib pairs in Pedigree2, and $6$ unrelated pairs. ComparePairs() does not attempt any matching of dummy individuals.

Example

PCG  <- PedCompare(Ped1 = cbind(FieldMums_griffin,
                                sire = NA),
                   Ped2 = SeqOUT_griffin$Pedigree)

# pedigrees side-by-side (subset of columns because no field-observed sires here)
PCG$MergedPed[127:133, c("id", "dam.1", "dam.2", "dam.r", "id.dam.cat", "dam.class")]
#  Non-genotyped field mums have a two-colour code (e.g. 'BlueRed')

PCG$MergedPed[c(137,138, 6, 128,129,7), c("id", "id.r", "dam.1", "dam.2", "dam.r")]
# column 'id': ids common to both pedigrees, plus those only in Pedigree2 
# column 'id.r': 'consensus' ids, plus those only occurring in Pedigree1  

# dummy individuals from Ped2 with their best-matching non-genotyped individual in Ped1
head(PCG$DummyMatch[, -c(3:5)])
# 'nomatch' in the `id.1` column: none of the siblings in Ped2 had a field-observed
# mother, or there is a mismatch.

# Total number of matches & mismatches:
PCG$Counts
# dim1: category: genotyped/dummy/total
# dim2: classification: total (could-have-had-parent)/ match/ mismatch/ 
#       P1only (no parent in Ped2)/ P2only
# dim3: dam/sire

See here for a detailed walk-through of this example, and suggestions on how to trace the underlying cause of discrepancies between the newly-inferred genetic pedigree and an existing field pedigree.

In short, the main aim of checking the discrepancies is to figure out the underlying cause of each case:

so that the records and/or pedigree can be corrected. PedCompare() only points you to where the discrepancies are; field records, lab records and common sense are needed to determine the most likely cause (or determine that the cause is unknowable).

Fairly common (but still rare) pedigree reconstruction errors are

# ~~ Mismatches ~~
PCG$MergedPed[which(PCG$MergedPed$dam.class == "Mismatch"), 
               c("id", "dam.1", "dam.2", "id.dam.cat")]

PedM <- PCG$MergedPed[, c("id", "dam.1", "dam.2")]   # short-hand to minimise typing

# who are the mismatching individual's siblings according to Ped1 & Ped2? 
PedM[which(PedM$dam.1 == "GreenBlue"), ]
PedM[which(PedM$dam.2 == "i081_2005_F"), ]

# who are their maternal grandparents?
PedM[which(PedM$id.1 == "GreenBlue"), ]
PedM[which(PedM$id.2 == "i081_2005_F"), ]

Colony

To compare Colony output with an existing pedigree, use:

BestConfig <- read.table("Colony/file/file.BestConfig",
                         header=T, sep="", comment.char="")
PC <- PedCompare(Ped1 = ExistingPedigree,
                 Ped2 = BestConfig)

Relatedness {#sec:compR}

Pedigree relatedness {#sec:Rped}

Another way to compare two pedigrees is to compare the pedigree-based relatednesses calculated from each. In R, pedigree relatedness can for example be calculated by the R package kinship2 -- relatedness is defined as twice the kinship coefficient.

kinship2 requires a pedigree with columns id - dadid - momid - sex, and requires that individuals either have two parents, or zero. The latter can be achieved with sequoia::PedPolish(, FillParents=TRUE). From sequoia version 2.1, CalcRPed() will reformat the pedigree, call kinship2::kinship(), and return relatedness coefficients as a matrix or dataframe.

The same can be done for a second pedigree, and the two dataframes merged. As the number of pairs $p$ becomes very large even for moderate numbers of individuals $n$ ($p$ = $n \times (n-1)/2$, so e.g. 2 million pairs for 2 thousand individuals), it is a good idea to use package data.table to assist with merging (merge in seconds vs. many minutes or not at all):

Rped.old <- CalcRped(Ped_griffin, OUT="DF")
Rped.new <- CalcRped(SeqOUT_griffin$Pedigree, OUT="DF")
library(data.table)
Rped.both <- merge(data.table(Rped.old, key=c("IID1", "IID2")),
                   data.table(Rped.new, key=c("IID1", "IID2")), 
                   all=TRUE, suffixes=c(".old", ".new"))

cor(Rped.both$R.ped.new, Rped.both$R.ped.old, use="pairwise.complete")
plot(Rped.both$R.ped.old, Rped.both$R.ped.new, pch=16, cex=0.3,
     xlab = 'R (old ped)', ylab = 'R (new ped)')

Genomic relatedness

In absence of a previous pedigree, or when it is not obvious whether the previous or newly inferred pedigree is correct, one could compare the pairwise relatedness estimated from the pedigree(s) to a measure of genomic relatedness, estimated directly from the complete SNP data -- which may be many more SNPs than used for pedigree reconstruction. Genomic relatedness can be estimated for example using GCTA. Genomic relatedness will vary around the pedigree-based relatedness even for a perfect pedigree due to Mendelian variance, but outliers suggest pedigree errors.

For example:

# read in output from GCTA
Rel.snp <- read.table("GT.grm.gz")   
Rel.id <- read.table("GT.grm.id", stringsAsFactors=FALSE)
Rel.snp[,1] <- as.character(factor(Rel.snp[,1], labels=Rel.id[,2]))
Rel.snp[,2] <- as.character(factor(Rel.snp[,2], labels=Rel.id[,2]))
names(Rel.snp) <- c("IID1", "IID2", "nSNPS", "R.GRM")
Rel.snp <- Rel.snp[Rel.snp$IID1 != Rel.snp$IID2,]  # between-indiv only
#
# combine with pedigree relatedness
Rel.both <- merge(data.table(Rel.snp[,c(1,2,4)], key=c("IID1", "IID2")),
                  data.table(Rped.both, key=c("IID1", "IID2")), all.x=TRUE)
Rel.both <- as.data.frame(Rel.both)  # turn back into regular dataframe
rm(Rel.snp, Rped.both, Rped.new, Rped.old)   # clean up: remove large dataframes
#
round(cor(Rel.both[, c("R.GRM","R.ped.new", "R.ped.old")], 
          use="pairwise.complete"), 3)
#
# scatterplot doesn't work well with many thousand points
# >> use heatmap-like alternative, e.g. hexbinplot
hexbin::hexbinplot(Rel.both$R.GRM ~ Rel.both$R.ped.new, 
                   xbins=100, aspect=1,
                   xlim=c(-.05,.9), ylim=c(-.2, .9),
                   xlab="Pedigree relatedness", ylab="Genomic relatedness",
                   trans=log10, inv=function(x) 10^x, 
                   colorcut=seq(0,1,length=14), maxcnt=10^6.5,
                   colramp = function(n) {grDevices::hcl.colors(n, palette='Berlin')})
#
# if you want to add e.g. a diagonal line to that plot:
hb <- hexbin::hexbin(Rel.both$R.GRM ~ Rel.both$R.ped.new, 
                     xbins=100, xbnds=c(-.05, .9), ybnds=c(-.2, .9),
             xlab="Pedigree relatedness", ylab="Genomic relatedness")
hbp <- hexbin::plot(hb,
                    trans=log10, inv=function(x) 10^x,
                    colorcut=seq(0,1,length=14), maxcnt=10^6.5,
                    colramp = function(n) {grDevices::hcl.colors(n, palette='Berlin')}
)
hexbin::hexVP.abline(hbp$plot.vp, a=0, b=1)
knitr::include_graphics("hexbin.png")

Cluster families

Certain analyses, such as the Mendelian error check in PLINK, are done on a family-by-family basis. FindFamilies() takes a pedigree as input and clusters the individuals in as few families as possible, by repeatedly searching all ancestors and all descendants of each individual and ensuring those all have the same family ID.

This function does not take separate FID and IID columns in the input pedigree, rather these need to be joined together before running FindFamilies() (e.g. with paste()), and then split afterwards using PedStripFID().

Hermaphrodites

Hermaphrodites have been re-implemented in versions 2.2 and 2.3, and now also full pedigree reconstruction should work as expected. Selfing is fully supported, also when a selfing parent or its selfed offspring are non-genotyped, and the aim is to infer their genotyped relatives.

With hermaphrodites it is not self-evident whether an identified parent should be assigned as dam or as sire. Sex-linked markers are not implemented in sequoia, and sequential hermaphroditism (male until age $t$, and female after, or v.v.) is not the norm.

Two alternative approaches are available, by specifying parameter Herm='A' or Herm='B'.

Herm 'A': Pedigree prior

A genetically identified parent will be assigned when it matches a candidate parent, or when it forms a complementary parent-pair with such an assigned parent. The candidate dam may e.g. be the plant from which the seed was collected. Candidate parent--offspring pairs that are not a genetic match will never be assigned, non-genotyped candidate parents are ignored.

In absence of a pedigree prior, only instances of selfing will be assigned.

The pedigree 'prior' can only be used during parentage assignment, by specifying SeqList[["PedigreePar"]] as input and setting Module='par':

cand.dams <- read.table("Candidate_dams.txt", header=TRUE,
                        stringsAsFactors=FALSE)
# cand.dam has columns 'id' and 'dam', and does not need entries for all ids
cand.par <- cbind(cand.dams, sire=NA)

par.herm <- sequoia(GenoM = Geno,
                    LifeHistData = LH,
                    SeqList = list(PedigreePar = cand.par),
                    Module='par')

# re-use all settings (including the newly assigned parents):
seq.herm <- sequoia(GenoM = Geno,
                    LifeHistData = LH,
                    SeqList = par.herm,
                    Module='ped')

Herm 'B': Ignore sex role

Often it is irrelevant whether the parent was the maternal or paternal parent, and with this setting single parents are assigned as dams, and in parent-pairs dam vs sire is determined by their order in the genotype file.

Hermaphrodite dummies

Whether half-siblings share a dummy dam or a dummy sire is arbitrary. If one or more individuals in a sibship are the product of selfing, the maternal and paternal halves will be linked, and additional half-siblings will be arbitrarily assigned to either half. Dummy individuals which occur as both dam and sire get dummy prefix 'H', an alternative 3rd DummyPrefix can be specified in (most?) relevant functions.

In SeqList[["DummyIDs"]], the offspring-as-dam and offspring-as-sire are shown separately, with the same id ('H0xxx') and Sex=1 vs Sex=2 respectively. Selfed offspring will occur in both rows. Estimated birth years and other aspects may differ somewhat between the maternal and paternal halves, as during pedigree reconstruction they are (currently?) only linked in some aspects but not others.

Assignment errors

PedCompare() is not useful in combination with Herm='B', as any dam assigned as sire and v.v. will show up as error. Use instead ComparePairs(, patmat=FALSE), which will also pick up half-siblings for which the mother of $i$ is the father of $j$.

One type of assignment error is unavoidable: if the true parent is unsampled and the product of selfing, a sampled grandparent will be indistinguishable from a parent. Similarly, when the true parent is sampled, but both the parent and the grandparent are the product of selfing, they are genetically nearly identical and often erroneously interchanged when birth year information is absent.

knitr::include_graphics("selfed_parent.png")

FAQ

Error about/due to input data

For smaller datasets (say up to 1000 individuals) it may be useful to read in the data with read.table() (using header=TRUE or FALSE, as appropriate, and typically row.names=1), and inspect the data with table(as.matrix(mydata)) for odd entries that weren't caught by CheckGeno() to give an informative error message. For very large datasets, specialist tools may be required to get the data in a standardised format.

Low assignment rate {#sec:FAQlowAR}

See also key points.

In many cases, assignment rate can be boosted without increasing the number of SNPs: by assuming a different genotyping error rate, providing sex or birth year information on more individuals, or fine-tuning the ageprior. Sometimes reducing the number of SNPs may increase assignment rate, by removing those with the highest error rate and/or SNPs in close LD.

Not enough SNPs

As opposed to most other pedigree assignment programs, Sequoia does not rely on MCMC to explore many different pedigree possibilities, but instead sequentially assigns highly likely relationships, and expands the pedigree step by step. For a relationship to be highly likely, a substantial number of SNPs is necessary, larger than for MCMC methods: at least 100-200 for parentage assignment, or full sibling clustering in a monogamous population, and about 400-500 otherwise. More is not always better: with tens of thousands of SNPs many will unavoidably be in high LD (unless you work on a species with a very large genome and very high $N_e$), and the signal will get lost in the noise. Using more than 1000 SNPs is never necessary, and for a typical vertebrate genome there is little benefit of using more than 500 good SNPs.

This unfortunately also means that sequoia is unsuitable for species with a very small genome.

Genotyping errors

A few SNPs with a high (apparent) error rate may throw off pedigree reconstruction if their erroneous signal is not sufficiently offset by a large number of more accurate SNPs. Such SNPs can be identified with SnpStats(), using an existing pedigree or one from an preliminary round of parentage assignment. If there are clear outliers with regards to estimated error rate, it may be worth exploring pedigree reconstruction without these SNPs.

Lacking sex information

Currently Sequoia cannot handle sex-linked markers, and therefore cannot distinguish between maternal versus paternal relatives. Sometimes the sex of an individual can be inferred, if it forms a complementary parent-pair with an individual of known sex. This is also the manner in which the sex of dummy parents is determined; half-sibships sharing a parent of unknown sex are not currently implemented.

Pairs of relatives for which it is unclear whether they are maternal or paternal relatives, can be identified with GetMaybeRel().

Insufficient age information

A parent will only be assigned if it is known to be older than the individual with which it genetically forms a parent--offspring pair, or if a complementary co-parent is identified. Purely genetically, it is impossible to tell who is the parent in a parent--offspring pair. But once an individual has been assigned parents, any remaining individuals with which it forms a parent-offspring pair must be its offspring, and will be assigned as such. Thus, a high proportion of sampled parents may somewhat compensate for unknown birth years.

Providing minimum and maximum possible birth years can substantially increase assignment rate. The risk is that when intervals are taken too narrowly, parent--offspring pairs are flipped the wrong way around, which in turn may lead to other wrong assignments. This may especially be a problem in long-lived species which start breeding at an early age.

Age information will also help to distinguish between the three different types of second degree relatives (half siblings, grandparent -- grand-offspring and full avuncular (aunt/uncle -- niece/nephew). These are genetically indistinguishable, unless both individuals of the pair already have a parent assigned. In most species, there is limited overlap between the age-difference of half siblings versus grandparents, and only partial overlap of either with the age distribution of avuncular relationships.

Uninformative age prior

The ageprior used for full pedigree reconstruction is estimated from the birth year differences of parent-offspring pairs assigned during the initial parentage assignment. When only few parents are assigned, or when many birth years are unknown, this ageprior may not be very informative. If a large pedigree is available from the same or a similar population (e.g. based on observations and microsatellite paternity assignments), it can be useful to estimate the agepriors from that pedigree. For details, see the age vignette

Mating system

When the population has a complex mating system, with overlapping generations and many double relatives, a large number of SNPs is needed to distinguish between various plausible alternatives. When the power of the SNP panel is insufficient to make the distinction, no assignment will be made.

When inbreeding and complex relationships (e.g. paternal half-sibling as well as maternal half-aunt) are rare, ignoring these typically increases assignment rate (Complex="simp"). Similarly, when polygamy is rare and not of particular interest, assuming a monogamous mating system typically increases assignment rate (Complex="mono"). However, these choices will risk erroneous assignments when complex relationships or polygamy, respectively, do occur.

Death year

The year of death forms an upper limit to when an individual could have reproduced, and can from version 2.5 onwards be used as input via the Year.last column in LifeHistData. Note that for many species, the last possible year in which offspring can be born may be after the year of death (e.g. males in many mammal species), or the year preceding the year of death (if a individual died between January 1st and the breeding season).

If it is unclear whether an individual died or emigrated, please do not provide a Year.last -- emigrants can still be candidate parents, as they may reside just outside the study area boundaries, or return briefly and unseen during the breeding season. is unknown.

Accuracy

The real, true pedigree is normally unknown, but there are a few ways to infer the accuracy of specific pedigree links, or to estimate the average accuracy of assignments.

Field pedigree

If the genetically assigned mother matches the mother caring for the individual, or the plant from which the seed was collected, there is little reason to doubt the assignment. Similarly, when the genetically assigned father matches (one of) the observed mates of the mother, the assignment is most likely correct.

In rare other cases, the genetically assigned parent is impossible, for example because the assigned parent was not alive at the time of birth (for mothers) or conception (for fathers). The blame for such an erroneous assignment may be the pedigree reconstruction software (due to genotyping errors, or a bug in the code), but may also be due to sample mislabelling in the lab, or a case of mistaken identity in the field.

Tracking down the underlying cause of discrepancies between the field and genetic pedigree will be time consuming, but can be a valuable part of data quality control.

Genomic relatedness

When many thousands of SNPs are typed, it is possible to calculate the genomic relatedness ($R_{grm}$) between all pairs of individuals (see Relatedness). Due to the random nature of Mendelian inheritance there is always considerable scatter of genomic relatedness around pedigree relatedness, but when the pedigree relatedness is considerably higher (say $R_{ped} - R_{grm} >0.2$), this is often indicative of a pedigree error. Note however that most estimators of $R_{grm}$ assume a large, panmictic, non-inbred population, and deviations from these assumptions may contribute to differences between $R_{ped}$ and $R_{grm}$.

Data simulation

Simulations as performed by 'EstConf()' do not tell which assignments may be incorrect, but do give an estimate of the overall number of incorrect assignments. The simulations are done presuming the inferred pedigree (or an existing pedigree) is the true pedigree, i.e. for a pedigree that is (hopefully) very close to the actual true pedigree. It relies on several simplifying assumptions, and will therefore always be an optimistic estimate of the actual accuracy.

Add new individuals

When new individuals have been genotyped, such as a new cohort of offspring, it is best to re-run the pedigree reconstruction for all genotyped individuals. This ensures that older siblings of the new offspring are identified, as well as any grandparents. A possible exception is when all candidate parents have been genotyped, although even then inclusion of their parents (the new individuals' grandparents) may correct for any genotyping errors they may have.

Results differ from COLONY

Sequoia will typically produce a more conservative pedigree than COLONY, for various reasons:

You can get sequoia's 'opinion' on a COLONY reconstructed pedigree (or any other pedigree) with CalcOHLLR(), which for every assigned parent gives the log-likelihood ratio between being parent-offspring with the focal individual, versus otherwise related. You may also use CalcPairLL() for a specific set of pairs of individuals to calculate their likelihoods to be parent-offspring, full siblings, ... , and investigate how these likelihoods change when changing the ageprior or with Complex='simp'.

References



Try the sequoia package in your browser

Any scripts or data that you put into this service are public.

sequoia documentation built on Sept. 8, 2023, 5:29 p.m.