RV_FamSq: Rare Variant-Family-Based Score Test for Quantitative Traits

View source: R/RVFamSq.R

RV_FamSqR Documentation

Rare Variant-Family-Based Score Test for Quantitative Traits

Description

A regional association analysis of rare variants and quantitative traits in family data RV_FamSq <- function(ped_pheno, ped_geno, maf_data, maf_cutoff,parafile,covar_col, trait_col, pop_col=c(), out,kin, estimateAF,start_par)

Usage

RV_FamSq(ped_pheno, ped_geno, maf_data, maf_cutoff, covar_col, trait_col,
  pop_col = c(), out, kin, estimateAF, start_par)

Arguments

ped_pheno

a dataframe of phenotype. The first four columns should be family IDs, individual IDs, father IDs and mother IDs of each sample. The ped_pheno can also include columns of covariates and populations IDs. If more than one population groups are included in the analysis, the column of population IDs should indicate which population specific MAF will be used.

ped_geno

a dataframe of genotype. The first two columns are family IDs and individual IDs. The rest columns of the ped_geno are genotype score at each sites. The value of genotype score is 0, 1 or 2, corresponding to number of rare variants observed at each sites. Number of sites included in ped_geno should be equal to the number of rows of maf_data.

maf_data

a dataframe describing the information of rare variants included in ped_geno. The first three columns correspond to the name of the gene, chromosome number, and position of variants. The rest columns of maf_data are MAFs of variants in each specific population group obtained from database, e.g., Genome Aggregation Database (gnomAD).

maf_cutoff

the cutoff of MAF for rare variants

covar_col

a integer or a vector of integers indicating which columns of ped_pheno will be used as covariants of the analysis. Note that multiple covariants are allowed in the analysis.

trait_col

a integer indicating which columns of ped_pheno will be used as traits.

pop_col

a integer indicating which columns of ped_pheno represent the population IDs of the sample.

out

the directory that save the results. RVFamSq outputs a dataframe results including the name of the gene, score, p-value of the gene, number of sample size and number of families

kin

a kinship matrix calculated either based on family strtuctures or genomic informations.

estimateAF

a dataframe of estimated MAFs that do not oberserved in public datasets. The headers decribing the name of the sub-population should be consistent with the headers in maf_data. If this argument is not defined, RVfamSq will estimate the missing MAF based on the dosage of founder.

start_par

an optional argument that defines the starting values of the parameters fitting. User can define specific values to start the parameters fitting, otherwise, the starting values will generate randomly.

parafile

the path to the file saving parameters estimated under the null model. If the file is not existed, the package will estimate the parameters using the available data and create a parafile in the path. Otherwise, the package will utilize the parameters in parafile to calculate the statistical score.

Details

RVFamSq performed association analysis by estimating the parameters under the null model and calculating the statistical score using these parameters. The parameters are estimated by maximizing the multivariate normal likelihood:

L=∏_{i}(2π)^{-n_i / 2}≤ft |Ω _i\right |^{-1 /2}e^{≤ft [ y_i - E(y_i)\right ]'Ω _i^{-1}≤ft [ y_i - E(y_i)\right ]}

where n_i is the number of individuals in family i and ≤ft |Ω _i\right | is the determinant of matrix Ω _i. E(y_i) in the above function is a vector of expected phenotype of all individuals within each family and is calculated by:

E(y_i)=μ + β _x x_i

where x_i is a n_i \times q matrix of q covariates included in the association analysis. The parameters μ and β _x are population mean and a vector of covariate effects. The two parameters are estimated by maximizing the above likelihood function. The matrix Ω_i in the likelihood function is calculated within each family i and defined as:

Ω _{ijk}≤ft\{\begin{matrix}σ _g^2 + σ_e^2\ \ \ \ \ \ \ \ \ \ \ \ \ if\ j=k\\ 2\varphi _{ijk}σ _g^2\ \ \ \ \ \ \ \ \ \ \ \ if\ i\neq k \end{matrix}\right.

Here, the parameter σ_g^2 andσ_e^2 are variance components that are account for background polygenic effects and environmental effects, respectively. Additionally, Ω _{ijk} denote the kinship coefficient between individuals j and k. The values of parameters σ_g^2, σ_e^2, β_x, and μ are estimated by maximizing the above likelihood. With the values of these parameters, we can obtain the variance-covariance matrix Ω _i^{base} and vector E(y_i)^{base} based on the above equations for each family. Using these two quantities, the statistic score is defined by:

T^{SCORE}=\frac{≤ft \{ ∑ _i ≤ft [G_i -E(G_i)\right ]'[Ω _i^{(base)}]^{-1}[y_i-E(y_i)^{(base)}] \right \}^2}{∑ _i[G_i-E(G_i)]'[Ω_i^{(base)}]^{-1}[G_i-E(G_i)^{(base)}] }

where E(G_i) is defined as:

E(G_i)=∑_{m=1}^{M}2p_m

where p_m is the minor allele frequency (MAF) which is obtain from the dataframe of maf_data.

Value

results

a data frame results containing the name of the gene, the value of the score and p-value write into the file named by the name of the gene under the directory defined by the argument out.

paras

a list of parameters estimated by maximizing the likelihood. The values of the paras save to the file defined by the argument parafile. If the parafile is existed, the values of the paras load from the file parafile.

Data format

Example data is generated for 1,000 extended families with 10 members in each family. The information of RVs within functional region of the gene AAAS with MAF<2 Quantitative traits are generated randomly based on the disribution of N(0,1) and genotypes of samples are generated based on MAF of the gene AAAS.

  • ped_pheno.txt A phenotype file of 1,000 extended families (10,000 individuals in total). Each column of the file represents family IDs, individual IDs, father IDs, mother IDs, sex, traits, and population IDs, respectively. RVFamSq is capable to analyze multiple covariates and extra covariants should be also included in this file after the fourth column.

  • ped_geno.txt A genotype file of samples included in the ped_pheno.txt file. The first two columns of the file should be family IDs and individual IDs that have the same format as in ped_pheno.txt. The rest columns of the file are genotypes of each individual and codes as 0, 1, and 2.

  • AAAS.sfs A file with descriptive information of RVs within the functional region of the gene AAAS. The file contains four columns representing the name of the interested gene, chromosome, position, and MAF of the variant, respectively. If more than one sub-populations are included in the samples, multiple MAFs correspond to each specific population should be included in this file and denoted by headers of population specific symbols

Author(s)

Zhihui Zhang and Suzanne Leal

References

Chen, W.-M., and Abecasis, G.R. (2007). Family-Based Association Tests for Genomewide Association Scans. Am. J. Hum. Genet. 81, 913–926.

Examples


library (kinship2)
library (RVFamSq)

##  load example data
data_dir<-system.file("extdata", package = "RVFamSq")
ped_pheno<-read.table(phenofile<-paste0(data_dir,"/ped_pheno.txt"))
ped_geno<-read.table(paste0(data_dir, "/ped_geno.txt"))
maf_data<-read.table(paste0(data_dir,"/AAAS.sfs"), header = TRUE)

## Define the files that save the parameters estimated under the null model.
## If the file is not existed, the package will estimate the parameters based on the available data.

##  Define output directory that save the results
out<-"results"

## Load phenotype data and calculate the kinship matrix of the pedigree. In this example, we utilize the R package to estimate the kinship of samples based on family structure, but the kinship can also be estimated by genetic variants.
kin_pre<-data.frame(id=ped_pheno[,2], mom=ped_pheno[,4], dad=ped_pheno[,3], sex=ped_pheno[,5])
tped <- with(kin_pre, pedigree(id, dad, mom, sex, famid=ped_pheno[,1]))
kin <- kinship(tped)

## Run RVFamsSq package and calculate the statistical score of the interested gene. If \code{parafile} is not existed, the below command will estimate the parameters using the available data.
## Once you obtained the values of parameters, you can 
## Results can be found under `results/` folder.
RV_FamSq(ped_pheno=ped_pheno, ped_geno=ped_geno, maf_data=maf_data, maf_cutoff =0.02, covar_col=c(5), trait_col=c(6), pop_col=c(7), out=out, kin=kin)


statgenetics/RVFamSq documentation built on Sept. 11, 2022, 4:39 a.m.