RV_FamSq | R Documentation |
A regional association analysis of rare variants and quantitative traits in family data RV_FamSq <- function(ped_pheno, ped_geno, maf_data, maf_cutoff,parafile,covar_col, trait_col, pop_col=c(), out,kin, estimateAF,start_par)
RV_FamSq(ped_pheno, ped_geno, maf_data, maf_cutoff, covar_col, trait_col, pop_col = c(), out, kin, estimateAF, start_par)
ped_pheno |
a dataframe of phenotype. The first four columns should be family IDs, individual IDs, father IDs and mother IDs of each sample. The ped_pheno can also include columns of covariates and populations IDs. If more than one population groups are included in the analysis, the column of population IDs should indicate which population specific MAF will be used. |
ped_geno |
a dataframe of genotype. The first two columns are family IDs and individual IDs.
The rest columns of the ped_geno are genotype score at each sites. The value of genotype score is 0, 1 or 2,
corresponding to number of rare variants observed at each sites. Number of sites included in ped_geno should be equal to the number of rows
of |
maf_data |
a dataframe describing the information of rare variants included in ped_geno. The first three columns correspond to the name of the gene, chromosome number, and position of variants. The rest columns of maf_data are MAFs of variants in each specific population group obtained from database, e.g., Genome Aggregation Database (gnomAD). |
maf_cutoff |
the cutoff of MAF for rare variants |
covar_col |
a integer or a vector of integers indicating which columns of |
trait_col |
a integer indicating which columns of |
pop_col |
a integer indicating which columns of |
out |
the directory that save the results. RVFamSq outputs a dataframe results including the name of the gene, score, p-value of the gene, number of sample size and number of families |
kin |
a kinship matrix calculated either based on family strtuctures or genomic informations. |
estimateAF |
a dataframe of estimated MAFs that do not oberserved in public datasets. The headers decribing the name
of the sub-population should be consistent with the headers in |
start_par |
an optional argument that defines the starting values of the parameters fitting. User can define specific values to start the parameters fitting, otherwise, the starting values will generate randomly. |
parafile |
the path to the file saving parameters estimated under the null model.
If the file is not existed, the package will estimate the parameters
using the available data and create a parafile in the path. Otherwise, the package will utilize the parameters in |
RVFamSq performed association analysis by estimating the parameters under the null model and calculating the statistical score using these parameters. The parameters are estimated by maximizing the multivariate normal likelihood:
L=∏_{i}(2π)^{-n_i / 2}≤ft |Ω _i\right |^{-1 /2}e^{≤ft [ y_i - E(y_i)\right ]'Ω _i^{-1}≤ft [ y_i - E(y_i)\right ]}
where n_i is the number of individuals in family i and ≤ft |Ω _i\right | is the determinant of matrix Ω _i. E(y_i) in the above function is a vector of expected phenotype of all individuals within each family and is calculated by:
E(y_i)=μ + β _x x_i
where x_i is a n_i \times q matrix of q covariates included in the association analysis. The parameters μ and β _x are population mean and a vector of covariate effects. The two parameters are estimated by maximizing the above likelihood function. The matrix Ω_i in the likelihood function is calculated within each family i and defined as:
Ω _{ijk}≤ft\{\begin{matrix}σ _g^2 + σ_e^2\ \ \ \ \ \ \ \ \ \ \ \ \ if\ j=k\\ 2\varphi _{ijk}σ _g^2\ \ \ \ \ \ \ \ \ \ \ \ if\ i\neq k \end{matrix}\right.
Here, the parameter σ_g^2 andσ_e^2 are variance components that are account for background polygenic effects and environmental effects, respectively. Additionally, Ω _{ijk} denote the kinship coefficient between individuals j and k. The values of parameters σ_g^2, σ_e^2, β_x, and μ are estimated by maximizing the above likelihood. With the values of these parameters, we can obtain the variance-covariance matrix Ω _i^{base} and vector E(y_i)^{base} based on the above equations for each family. Using these two quantities, the statistic score is defined by:
T^{SCORE}=\frac{≤ft \{ ∑ _i ≤ft [G_i -E(G_i)\right ]'[Ω _i^{(base)}]^{-1}[y_i-E(y_i)^{(base)}] \right \}^2}{∑ _i[G_i-E(G_i)]'[Ω_i^{(base)}]^{-1}[G_i-E(G_i)^{(base)}] }
where E(G_i) is defined as:
E(G_i)=∑_{m=1}^{M}2p_m
where p_m is the minor allele frequency (MAF) which is obtain from the dataframe of maf_data
.
results |
a data frame results containing the name of the gene,
the value of the score and p-value write into the file named by the name of
the gene under the directory defined by the argument |
paras |
a list of parameters estimated by maximizing the likelihood.
The values of the paras save to the file defined by the argument
|
Example data is generated for 1,000 extended families with 10 members in each family. The information of RVs within functional region of the gene AAAS with MAF<2 Quantitative traits are generated randomly based on the disribution of N(0,1) and genotypes of samples are generated based on MAF of the gene AAAS.
ped_pheno.txt A phenotype file of 1,000 extended families (10,000 individuals in total). Each column of the file represents family IDs, individual IDs, father IDs, mother IDs, sex, traits, and population IDs, respectively. RVFamSq is capable to analyze multiple covariates and extra covariants should be also included in this file after the fourth column.
ped_geno.txt A genotype file of samples included in the ped_pheno.txt file. The first two columns of the file should be family IDs and individual IDs that have the same format as in ped_pheno.txt. The rest columns of the file are genotypes of each individual and codes as 0, 1, and 2.
AAAS.sfs A file with descriptive information of RVs within the functional region of the gene AAAS. The file contains four columns representing the name of the interested gene, chromosome, position, and MAF of the variant, respectively. If more than one sub-populations are included in the samples, multiple MAFs correspond to each specific population should be included in this file and denoted by headers of population specific symbols
Zhihui Zhang and Suzanne Leal
Chen, W.-M., and Abecasis, G.R. (2007). Family-Based Association
Tests for Genomewide Association Scans. Am. J. Hum. Genet. 81, 913–926.
library (kinship2) library (RVFamSq) ## load example data data_dir<-system.file("extdata", package = "RVFamSq") ped_pheno<-read.table(phenofile<-paste0(data_dir,"/ped_pheno.txt")) ped_geno<-read.table(paste0(data_dir, "/ped_geno.txt")) maf_data<-read.table(paste0(data_dir,"/AAAS.sfs"), header = TRUE) ## Define the files that save the parameters estimated under the null model. ## If the file is not existed, the package will estimate the parameters based on the available data. ## Define output directory that save the results out<-"results" ## Load phenotype data and calculate the kinship matrix of the pedigree. In this example, we utilize the R package to estimate the kinship of samples based on family structure, but the kinship can also be estimated by genetic variants. kin_pre<-data.frame(id=ped_pheno[,2], mom=ped_pheno[,4], dad=ped_pheno[,3], sex=ped_pheno[,5]) tped <- with(kin_pre, pedigree(id, dad, mom, sex, famid=ped_pheno[,1])) kin <- kinship(tped) ## Run RVFamsSq package and calculate the statistical score of the interested gene. If \code{parafile} is not existed, the below command will estimate the parameters using the available data. ## Once you obtained the values of parameters, you can ## Results can be found under `results/` folder. RV_FamSq(ped_pheno=ped_pheno, ped_geno=ped_geno, maf_data=maf_data, maf_cutoff =0.02, covar_col=c(5), trait_col=c(6), pop_col=c(7), out=out, kin=kin)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.