import.hapmap: A function to import the hapmap formatted SNP data and the...


View source: R/import.hapmap.R

Description

Input: Hapmap-formatted SNP data, phenotype data

Output: Matched data files (genotype, numerical, SNP information, QC information, and phenotype) with QC and/or imputation.

Usage

import.hapmap(
  genotype = NULL,
  phenotype = NULL,
  input.type = c("object", "path"),
  save.path,
  y.col = NULL,
  y.id.col = 2,
  family = "gaussian",
  normalization = TRUE,
  remove.missingY = TRUE,
  imputation = FALSE,
  impute.type = c("distribution", "mode"),
  QC = TRUE,
  callrate.range = c(0, 1),
  maf.range = c(0, 1),
  HWE.range = c(0, 1),
  heterozygosity.range = c(0, 1)
)

Arguments

genotype

Either an R object or a file path. The genotype data must be a matrix (not a data.frame) of dimension p by (n+11) in hapmap format: columns 1-4 hold (rs, allele, chr, pos), columns 5-11 hold (strand, assembly, center, protLSID, assayLSID, panel, Qcode), and the remaining n columns hold the genotypes of the samples. If NULL, the user can choose a file interactively.
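To make the layout concrete, here is a minimal sketch of a toy character matrix with 11 hapmap metadata columns followed by sample columns. The SNP names, sample IDs, and the use of NA for a missing call are all hypothetical, and whether the package expects the header as column names or as the first row of the matrix is not stated here; this sketch uses column names.

```r
# Hypothetical toy data: 3 SNPs (rows) by 4 samples, in hapmap column order.
meta_cols <- c("rs", "allele", "chr", "pos", "strand", "assembly",
               "center", "protLSID", "assayLSID", "panel", "Qcode")
sample_ids <- paste0("S", 1:4)   # assumed sample names

# Columns 5-11 are left as NA here; every entry is stored as character.
geno <- rbind(
  c("rs1", "A/G", "1", "1001", rep(NA, 7), "AA", "AG", "GG", "AA"),
  c("rs2", "C/T", "1", "2050", rep(NA, 7), "CC", "CT", NA,   "TT"),
  c("rs3", "G/T", "2", "3300", rep(NA, 7), "GT", "GG", "GG", "TT")
)
colnames(geno) <- c(meta_cols, sample_ids)

dim(geno)            # p by (n + 11), here 3 by 15
is.character(geno)   # the matrix holds character data throughout
```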

phenotype

Either an R object or a file path. The phenotype data is a matrix with n rows (one per sample). Because the leading columns may hold sample attributes rather than phenotypes, specify y.col (the columns of the phenotypes to analyze) and y.id.col (the column of sample IDs). If NULL, the user can choose a file interactively.

input.type

Default is "object". If input.type is "object", genotype and phenotype are passed as R objects; if "path", they are passed as file paths. When passing an object, make sure that every column of the genotype data is of class "character".

save.path

The directory in which all output files are stored. If save.path already exists, sp.gwas checks whether it contains an output file. Note that if an output RData file is already present in save.path, sp.gwas simply loads it and does not produce results for the new "genotype" and "phenotype".

y.col

The columns of the phenotypes to analyze. At most 4 phenotypes can be considered, so that they can be plotted together legibly. Default is NULL.

y.id.col

The column of sample IDs in the phenotype data file. Default is 2.

family

The family of the response variable (phenotype): "gaussian" for a continuous response, "binomial" for binary, "poisson" for count, etc. Currently, all phenotypes must share the same family. For more details, see stats::glm. Default is "gaussian".

normalization

If TRUE, phenotypes are transformed toward a normal shape using the Box-Cox transformation, provided that all phenotype values are positive.
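As an illustration of what a Box-Cox normalization does, the following base-R sketch picks the exponent lambda by maximizing the profile log-likelihood over a grid and then transforms a positive, right-skewed toy phenotype. The helper names (box_cox, box_cox_loglik), the grid, and the simulated data are assumptions for illustration; how the package itself selects lambda is not documented here.

```r
# Box-Cox transform of a strictly positive vector y for a given lambda.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda under a normal model with constant mean
# (additive constants dropped).
box_cox_loglik <- function(y, lambda) {
  z <- box_cox(y, lambda)
  n <- length(y)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}

set.seed(1)
y <- rlnorm(200)                      # hypothetical right-skewed phenotype
grid <- seq(-2, 2, by = 0.05)         # assumed search grid for lambda
ll <- vapply(grid, function(l) box_cox_loglik(y, l), numeric(1))
lambda_hat <- grid[which.max(ll)]     # grid value with the highest likelihood
y_norm <- box_cox(y, lambda_hat)      # transformed phenotype
```

Because the transformation requires positive values, a phenotype containing zeros or negative values would be rejected by the stopifnot() check in this sketch.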

remove.missingY

If TRUE, the samples with missing values in phenotype data are removed. Accordingly, the corresponding genotype samples are also filtered out. Default is TRUE.
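The filtering and matching described here can be sketched in base R as follows. The toy phenotype table, the sample IDs, and the variable names are hypothetical; the sketch only illustrates the idea of dropping samples with missing phenotypes and keeping the IDs shared with the genotype data, not the package's internal code.

```r
# Hypothetical phenotype table: ID in column 1, phenotypes in columns 2-3.
pheno <- data.frame(
  ID     = c("S1", "S2", "S3", "S5"),
  height = c(1.2, NA, 3.1, 2.2),
  yield  = c(10, 12, 9, 11),
  stringsAsFactors = FALSE
)
geno_samples <- c("S1", "S2", "S3", "S4")   # assumed genotype sample IDs

y.id.col <- 1
y.col <- 2:3

# Drop samples with a missing value in any analyzed phenotype.
complete <- stats::complete.cases(pheno[, y.col, drop = FALSE])
pheno_kept <- pheno[complete, ]

# Keep only the samples present in both data sets.
shared <- intersect(geno_samples, pheno_kept[[y.id.col]])
pheno_matched <- pheno_kept[match(shared, pheno_kept[[y.id.col]]), ]
```

Here S2 is dropped for its missing phenotype, while S4 and S5 are dropped because they appear in only one of the two data sets.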

imputation

TRUE or FALSE for whether imputation will be conducted.

impute.type

Two imputation methods are supported; they apply only when imputation = TRUE. The default, "distribution", imputes a genotype from the allele distribution. The other, "mode", imputes the most frequent genotype.
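A minimal sketch of mode imputation for a single SNP is shown below. The function name impute_mode and the use of NA to mark missing calls are assumptions made for illustration.

```r
# Replace missing calls with the most frequent observed genotype.
impute_mode <- function(calls) {
  counts <- table(calls)                 # table() ignores NA by default
  if (length(counts) == 0) return(calls) # nothing observed, nothing to impute
  calls[is.na(calls)] <- names(counts)[which.max(counts)]
  calls
}

calls <- c("AA", "AG", NA, "AA", "GG", NA)   # hypothetical genotype calls
filled_mode <- impute_mode(calls)
```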

QC

TRUE or FALSE for whether QC for SNPs will be conducted.

callrate.range

A numeric vector indicating the range of call rate (the proportion of non-missing genotypes) to be used. Default is c(0, 1).

maf.range

A numeric vector indicating the range of minor allele frequency (MAF) to be used. Default is c(0, 1).

HWE.range

A numeric vector indicating the range of p-values from the Hardy-Weinberg equilibrium test to be used. Default is c(0, 1).

heterozygosity.range

A numeric vector indicating the range of heterozygosity values to be used. In some cases, higher-than-expected heterozygosity indicates low-quality variants or sample contamination. Default is c(0, 1).
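The range arguments can be read as filters on per-SNP summary statistics. The sketch below computes call rate, minor allele frequency, and heterozygosity for one biallelic SNP and checks them against example ranges. The helper names (snp_qc, in_range), the NA coding of missing calls, the inclusive bounds, and the per-SNP definition of heterozygosity are assumptions; the HWE p-value is omitted because it relies on the "genetics" package.

```r
# Summary statistics for one biallelic SNP given two-letter genotype calls.
snp_qc <- function(calls) {
  observed <- calls[!is.na(calls)]
  call_rate <- length(observed) / length(calls)

  # Allele frequencies from the individual letters of each call.
  alleles <- unlist(strsplit(observed, ""))
  freq <- table(alleles) / length(alleles)
  maf <- if (length(freq) < 2) 0 else as.numeric(min(freq))

  # Proportion of observed calls whose two alleles differ.
  het <- mean(substr(observed, 1, 1) != substr(observed, 2, 2))

  c(call_rate = call_rate, maf = maf, heterozygosity = het)
}

# TRUE when x lies inside the closed interval given by range.
in_range <- function(x, range) x >= range[1] & x <= range[2]

calls <- c("AA", "AG", "GG", "AG", NA, "AA", "AA", "AG")  # hypothetical calls
qc <- snp_qc(calls)

keep <- unname(
  in_range(qc["call_rate"], c(0.8, 1)) &
  in_range(qc["maf"], c(0.05, 1)) &
  in_range(qc["heterozygosity"], c(0, 0.5))
)
```

With the default ranges of c(0, 1), every SNP passes these three checks, so the filters only take effect once a range is narrowed.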

Details

The Hardy-Weinberg equilibrium test is taken from the "genetics" package. In the imputation process, we first calculate the empirical allele frequencies. If a beta distribution is used as the prior for the allele frequency, then the posterior distribution of the allele frequency is also a beta distribution. Accordingly, we impute the missing values with samples from the posterior distribution.
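The beta-binomial reasoning above can be sketched for a single biallelic SNP as follows. With a Beta(a, b) prior and observed allele counts n1 and n2, the posterior of the first allele's frequency is Beta(a + n1, b + n2). The function name impute_snp, the Beta(1, 1) prior, the NA coding of missing calls, and the step that draws each missing genotype as two independent alleles are assumptions made for illustration, not the package's confirmed implementation.

```r
# Impute missing calls for one SNP by sampling from the beta posterior
# of the allele frequency (sketch under the assumptions stated above).
impute_snp <- function(calls, alleles, prior = c(1, 1)) {
  missing <- is.na(calls)
  if (!any(missing)) return(calls)

  # Empirical allele counts among the observed calls.
  observed <- unlist(strsplit(calls[!missing], ""))
  n1 <- sum(observed == alleles[1])
  n2 <- sum(observed == alleles[2])

  # One posterior draw of the allele-1 frequency per missing genotype.
  p <- rbeta(sum(missing), prior[1] + n1, prior[2] + n2)

  # Number of copies of allele 1 in each imputed genotype (0, 1, or 2).
  copies <- rbinom(sum(missing), size = 2, prob = p)
  genotypes <- c(paste0(alleles[2], alleles[2]),
                 paste0(alleles[1], alleles[2]),
                 paste0(alleles[1], alleles[1]))
  calls[missing] <- genotypes[copies + 1]
  calls
}

set.seed(42)
calls <- c("AA", "AG", NA, "GG", "AA", NA, "AG", "AA")  # hypothetical calls
filled <- impute_snp(calls, alleles = c("A", "G"))
```

Observed calls are left unchanged; only the NA entries are replaced, and repeated runs with different seeds give different imputed values because the method samples rather than plugging in a point estimate.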

Value

A folder containing a genomic data set in which the genotype and phenotype samples are matched and, optionally, quality control steps have been applied to the genotype data.

Author(s)

Kipoong Kim <kkp7700@gmail.com>

Examples

genotype <- sp.gwas::genotype # load("genotype.rda")
phenotype <- sp.gwas::phenotype # load("phenotype.rda")

# object
import.hapmap(genotype = genotype, 
              phenotype = phenotype, 
              input.type = c("object", "path")[1], 
              imputation = FALSE, 
              
              # if TRUE, the following QC steps (callrate, maf, HWE, heterozygosity) are conducted.
              QC = TRUE,  
              
              callrate.range = c(0.95, 1),
              maf.range = c(1e-3, 1),
              HWE.range = c(0, 1),
              heterozygosity.range = c(0, 1),
              
              # if TRUE, the samples with any missing phenotypes are filtered out in all data.
              remove.missingY = TRUE,
              
              save.path = "./EXAMPLE_obj",
              y.id.col = 1, 
              y.col = 2:4, 
              
              #if family is not "gaussian", i.e. not continuous variable, normalization should be FALSE
              normalization = FALSE,
              family="gaussian")



# path

write.table( x = sp.gwas::genotype, file = "./genotype.csv", row.names = FALSE, col.names = FALSE, sep=",")
write.table( x = sp.gwas::phenotype, file = "./phenotype.csv", row.names = FALSE, sep="," )

import.hapmap(genotype = "./genotype.csv", 
              phenotype = "./phenotype.csv", 
              input.type = c("object", "path")[2], 
              QC = TRUE,  # if TRUE, the following QC steps (callrate, maf, HWE, heterozygosity) are conducted.
              callrate.range = c(0.95, 1),
              maf.range = c(1e-3, 1),
              HWE.range = c(0, 1),
              heterozygosity.range = c(0, 0.5),
              remove.missingY = TRUE,   # if TRUE, the samples with any missing phenotypes are filtered out in all data.
              save.path = "./EXAMPLE_path",
              y.id.col = 1, 
              y.col = 2:4, 
              normalization = FALSE, #if family is not "gaussian", i.e. not continuous variable, normalization should be FALSE
              family="gaussian")

statpng/sp.gwas documentation built on Dec. 17, 2020, 5:55 a.m.