title: 'MAGPA: A R package for multivariate analysis of genotype–phenotype association and visualization of 3D image' tags: - R - GWAS - Multivariate correlation analysis - Visualization authors: - name: Yin Huang orcid: 0000-0003-1055-2602 affiliation: 1 affiliations: - name: CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences index: 1

date: 13 January 2020 bibliography: paper.bib

The **MAGPA** (multivariate analysis of genotype–phenotype association) is a package of multivariate correlation analysis and an interactive visualization tool for 3D image. This package was implemented for genetic association analysis of facial phenotypes and visualization related results. In addition, it can also be used in genome-wide association analysis of other multivariate phenotypes, especially three-dimensional image data. It can not only receive the prepared features, but also preprocess the features with principal component analysis and automatically select the number of variables. The genotype should be a SnpMatrix, which is a special object holding large arrays of single nucleotide polymorphism (SNP). Then，canonical correlation analysis (CCA)[@Härdle2007] is used to extract the linear combination of variables to maximize the correlation with each SNP. For the interactive visualization, the function *visual3d* is required to provide at least a reference of 3D image object and a vector, such as the phenotypic changes under different genotypes. It can draw a 3D object with different style and gradient colors.

CCA is a multivariate statistical method that reflects the overall correlation between two groups of variables by using the correlation between the pairs of comprehensive indicators, which is implemented in PLINK and has demonstrated its advantages in multivariable analysis of genotype-phenotype association. Here, briefly, X is the sample phenotypic matrix, the principle components of facial variations from each segment, and Y is the genotype of the sample. 𝜌(X,Y) is the canonical correlation (Formula 1).

(1)

The target function of CCA is to maximize the 𝜌(X′,Y′) by optimizing the corresponding projection vector 𝑎 and 𝑏, called the canonical correlation coefficients between X and Y, respectively (Formula 2).

, (2)

In this example, the genotype data and phenotype data are used to demonstrate how to use the function *magpa* and show the one of the possible input to the function. The variable `geno`

is a list with a SnpMatrix `genotypes`

(2000 rows, 50 columns) and a data frame `map`

. The variable `pheno`

is a matrix (2000 rows, 300 columns) expanded by 2000 samples with 100 three-dimensional coordinate in each sample.

If your phenotype is pre-prepared, you can follow the step 3. Otherwise, you can follow the step 1, which set `pca`

argument to `TRUE`

. Or you can follow the step 2 to select features, and then follow the step 3.

The function *magpa* is performed multivariate analysis of genotype–phenotype association based on CCA, which calls the function “cca” to carry out the canonical correlation analysis and the function “F.test.cca” to test the statistical significance by employing Rao's statistic from the R package **yacca**[@Butts2009]. The first argument of *magpa* can be a file prefix name of plink[@Purcell2007] output (.bed, .bim, .fam), an object read by *read.plink* of the **snpStats**[@Sole2006] package, or a SnpMatrix. The second argument is the phenotypic matrix (rows are the number of samples, and columns are the number of features).

```
data(geno);data(pheno)
gpa <- magpa(geno,pheno,pca = TRUE)
head(gpa)
```

The function *pcapheno* is implemented to automatically extract the principal components from high-dimension data, which calls **paran** to performs Horn's parallel analysis[@Dinno2009] for evaluating the components retained in a principle component analysis.

```
paral<- pcapheno(pheno)
new_pheno<- paral$pheno
head(new_pheno)
```

The result in a matrix included SNP, CHR, position, MAF, canonical correlation, chisq, and pvalue. It can be used to draw QQ plot and Manhattan plot.

```
gpa<- magpa(geno,new_pheno)
head(gpa)
```

The function *visual3d* is an interactive graphing function based on the **rgl**[@Adler2003] package. This function is very flexible. The first argument is a reference of 3D image object, whose file (with `.obj`

suffix) can be read by the *readobj* function. The second argument can be a vector or list. And the thrid argument is a vector of colors or a color palette, if it is not given, the default color palette will be used.
Two examples of using *visual3d* are as follows.You can see more in the help documentation.

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.