This tutorial briefly introduces the functions provided by the GESE package, using the example data included in the package. The paper describing this method is "Qiao, D. Lange, C., Laird, N.M., Won, S., Hersh, C.P., et al. 2017. Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319". We have also implemented a simple pipeline to compute the GESE p-values using this R library, which is described in detail at http://scholar.harvard.edu/dqiao/gese.
We can load the library using:
library(GESE)
A randomly simulated example data is included in the package. This data includes 198 sequenced subjects in 50 families. Only 2 genes with 10 variants each are included for these subjects. With real sequencing data, the first step is to filter down to the rare and functionally important variants since we hypothesize that there are rare causal variants of large effects for Mendelian diseases, or Mendelian-subtypes of complex diseases. One way is to filter the study data and the reference data using MAF < 0.1% and CADD score^[Kircher, M, Witten, DM, Jain, P, O’Roak, BJ, Cooper, GM, and Shendure, J. A general framework for estimating the relative pathogenicity of human genetic variants. Nature genetics 2014; 46(3):310–315.] > 15 on LoF (and missense) variants. Other annotation tools such as SIFT and Polyphen2 could also be helpful in filtering down to the functional variants.
After the filtering step, we can load the variant data of the sequenced subjects. This data frame is in the PLINK raw format (with the default header generated by PLINK). We need to make sure that the genotype is the number of the minor alleles in the corresponding population.
data(dataRaw) dim(dataRaw) dataRaw[1:10,] length(unique(dataRaw$FID))
The corresponding map file with variant information can be loaded next. The order of the variants needs to match the order in the dataRaw file above.
data(mapInfo) mapInfo[1:3, ] dim(mapInfo)
The complete pedigree information for the 50 families can be loaded next:
data(pednew) dim(pednew) pednew[1:5,]
The complete variant information for the corresponding population can be loaded next. This reference database is used to compute the background variation in the corresponding population.
data(database) dim(database) database[1:5,]
The main function for obtaining gene-based segregation information and (unweighted and weighted) segregation tests is GESE
. Other functions that may be helpful in analyzing family-based sequencing data will be discussed too.
One major function in this package for computing segregation information for different mode of inheritance is getSegInfo
. This function returns the variant-based and gene-based segregation information for each family and the total number of segregating families for three different mode of inheritance: dominant, recessive, and compound heterozygous.
For example, segregation in the family with dominant mode of inheritance means the variant is present in all the cases in the family, and absent in all the controls in the family. Therefore variants that are segregating in multiple families are more likely to be causal.
Since it is likely that different rare variants are influencing the disease susceptibility in different families, collapsing the variants into genes may give us more power to detect the causal genes. Therefore we also need to compute the total number of families that are segregating in each gene.
For recessive mode of inheritance, segregation means two copies of the alternative alleles are present in all the cases in the family, and less than two copies in all the controls of the family (varSeg
). This information can also be collapsed for each gene (geneSeg
).
For compound heterozygous (CH) mode of inheritance, segregation at two variants means the alternative alleles are present at both loci for all the cases in the family, and absent in at least one locus in all the controls in the family (varSeg
). This information can be collapsed for each gene, where only pairs of variants in the same gene are considred (geneSeg
). This can also be collapsed for each pair of genes, where pairs of variants from each of the two genes are considered (genePairSeg
).
The data frame varSeg
is a matrix containing logical value (TRUE or FALSE) for each variant (each row) and each family (each column). The first column is the variant ID. The last column numSegFam
is the total number of families the variant is segregating in. The data frame geneSeg
is also a matrix containing logical values, for each gene and each family. TRUE value means that at least one variant in this gene is segregating in the family.
We demonstrate the use of this function below.
To compute segregation information for domiant mode of inheritance:
results <- getSegInfo(pednew, dataRaw, mapInfo, mode="dominant") results
results
To compute segregation information for recessive mode of inheritance:
results <- getSegInfo(pednew, dataRaw, mapInfo) results
results
To compute segregation information for compound heterozygous mode of inheritance:
results <- getSegInfo(pednew, dataRaw, mapInfo, mode="CH") results
results
The geneSeg
or genePairSeg
return NA values, because there is no pair of variant in any gene, or gene pair that is segregating in any family, with compound heterozygous mode of inheritance.
Alternatively, we can compute segregation with dominant mode of inheritance wthout computing any probabilities using the GESE
function and specify onlySeg
to be TRUE:
results <- GESE(pednew, database, 1000000, dataRaw, mapInfo, threshold=1e-5, onlySeg=TRUE) results
results
Similarly, the segregation
value returned is a data matrix, where a logical value (TRUE or FALSE) is returned for each gene(each row) and each family (column names). The last column numSegFam
sums each row, which gives the total number of pedigrees each gene segregates in. The families were first trimmed to satisfy the assumption of GESE, so that only one founder case is present for each pedigree.
The varSeg
value returned is a similar data matrix, but returns a logical value for each variant (each row) and each family (column names).
The gene-based segregation information may be helpful, however, it does not take into account the different family structure among the families, nor the different genetic background of different genes. The gene-based segregation test combines this information by approximating the marginal probability of segregation events among the families.
To compute the p-value for this test, there are a few steps in the process.
To ensure that only one founder case is present for each family, we will trim the pedigrees to keep only the founder case that is related to most other non-founder cases if necessary.
Second, we need to compute the conditional probabilities that a variant segregates in the family conditional on that variant in present in one of the founders.
Third, we need to compute the marginal probability that at least one variant in the gene segregates in the family.
Fourth, we need to compute the final p-value using the marginal segregation probabilities computed above.
These steps can be done using the GESE function in one call:
results2 <- GESE(pednew, database, 1000000, dataRaw, mapInfo, threshold=1e-5) results2
results2
This call computes the GESE p-value using resampling with 1e5 times of simulations (so the smallest p-value is 1e-5). There are several other matrices returned by this call, such as the conditional segregation probability (condSegProb
) - the conditional probability of segregation event in the family conditioning on that one of the (most recent common) founders introduced the variant to the family; the gene-based segregation probability (segProbGene
) - the marginal probability that any variant in the gene is segregating in the family; and the variant-based and gene-based segregation matrices (varSeg
and segregation
).
The most useful results containing the GESE p-value here is:
results2$results
We can also incorporate additional information, such as disease severity of the cases in the families, in the weighting of the families. We can also compute the p-value for such weighted test, using resampling.
Suppose we have a disease severity measure of the cases in the families, and we are using such weighting of families for all the genes.
### creating weights for the families (same weights for all genes) fams <- unique(dataRaw$FID) nfam = length(fams) temp = runif(nfam) famWeight <- temp/sum(temp) weightFam = data.frame(FID=fams, weight=famWeight) results3 <- GESE(pednew, database, 1000000, dataRaw, mapInfo, threshold=1e-5, familyWeight=weightFam) results3$results
results3$results
There are a few public methods that were used in the GESE test and may also be useful in other contexts.
This method trim_oneLineage
accepts the complete pedigree information and selects only one lineage per pedigree. The lineage is selected such that the maximum possible number of sequenced subects (cases) are included in the selected lineage. Other family-based methods, such as PVAAST, also requires one lineage in each pedigree only.
The method trim_unrelated
deals with families with multiple founder cases, which violates our assumption that only one founder introduced the causal variant. It removes the minimal number of founder cases so that the pedigree does not violate the assumption of the method.
The method condSegProbF
computes the probability that the variant is segregating in the family given that it is introducted by the most recent common ancestors of the cases in the family.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.