bed_projectPCA: Projecting PCA
In bigsnpr: Analysis of Massive SNP Arrays

Description

Computes a PCA of a reference dataset and projects a target dataset onto it.

Usage

bed_projectPCA(
  obj.bed.ref,
  obj.bed.new,
  k = 10,
  ind.row.new = rows_along(obj.bed.new),
  ind.row.ref = rows_along(obj.bed.ref),
  ind.col.ref = cols_along(obj.bed.ref),
  strand_flip = TRUE,
  join_by_pos = TRUE,
  match.min.prop = 0.5,
  build.new = "hg19",
  build.ref = "hg19",
  liftOver = NULL,
  ...,
  verbose = TRUE,
  ncores = 1
)

Arguments

`obj.bed.ref`	Object of type bed, which is the mapping of the bed file of the reference data. Use `obj.bed <- bed(bedfile)` to get this object.
`obj.bed.new`	Object of type bed, which is the mapping of the bed file of the target data. Use `obj.bed <- bed(bedfile)` to get this object.
`k`	Number of principal components to compute and project.
`ind.row.new`	Rows to be used in the target data. Default uses them all.
`ind.row.ref`	Rows to be used in the reference data. Default uses them all.
`ind.col.ref`	Columns to be potentially used in the reference data. Default uses all the ones in common with target data.
`strand_flip`	Whether to try to flip strands (default is `TRUE`). If so, ambiguous alleles A/T and C/G are removed.
`join_by_pos`	Whether to join by chromosome and position (default), or instead by rsid.
`match.min.prop`	Minimum proportion of variants in the smaller dataset that must be matched; otherwise the function stops with an error. Default is `0.5` (50%), as shown in Usage.
`build.new`	Genome build of the target data. Default is `hg19`.
`build.ref`	Genome build of the reference data. Default is `hg19`.
`liftOver`	Path to liftOver executable. Binaries can be downloaded at https://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/liftOver for Mac and at https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver for Linux.
`...`	Arguments passed on to `bed_autoSVD`:
  `fun.scaling`	A function with parameters `X` (or `obj.bed`), `ind.row` and `ind.col` that returns a data.frame with `$center` and `$scale` for the columns corresponding to `ind.col`, used to scale each element as follows: `\frac{X_{i,j} - center_j}{scale_j}`. Default uses binomial scaling. You can also provide your own `center` and `scale` by using `bigstatsr::as_scaling_fun()`.
  `roll.size`	Radius of rolling windows to smooth log-p-values. Default is `50`.
  `int.min.size`	Minimum number of consecutive outlier variants in order to be reported as long-range LD region. Default is `20`.
  `thr.r2`	Threshold over the squared correlation between two variants. Default is `0.2`. Use `NA` if you want to skip the clumping step.
  `alpha.tukey`	Type-I error rate in outlier detection (further corrected for multiple testing). Default is `0.1`.
  `min.mac`	Minimum minor allele count (MAC) for variants to be included. Default is `10`. Can actually be higher because of `min.maf`.
  `min.maf`	Minimum minor allele frequency (MAF) for variants to be included. Default is `0.02`. Can actually be higher because of `min.mac`.
  `max.iter`	Maximum number of iterations of outlier detection. Default is `5`.
  `size`	For one SNP, window size around this SNP to compute correlations. Default is `100 / thr.r2` for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If not providing `infos.pos` (`NULL`, the default), this is a window in number of SNPs, otherwise it is a window in kb (genetic distance). I recommend that you provide the positions if available.
`verbose`	Output some information on the iterations? Default is `TRUE`.
`ncores`	Number of cores used. Default doesn't use parallelism. You may use `bigstatsr::nb_cores()`.
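
The scaling formula given for `fun.scaling` can be illustrated in base R. This is a minimal sketch on a toy genotype matrix. It assumes that "binomial scaling" means centering each variant by twice its allele frequency `p` and dividing by `sqrt(2 * p * (1 - p))`. The names `X`, `p`, `center_j`, `scale_j` and `X_scaled` are illustrative only, and this is not bigsnpr's implementation.

```r
# Toy genotype matrix (4 individuals x 3 variants), coded as 0/1/2 allele counts
X <- matrix(c(0, 1, 2, 1,
              2, 2, 1, 1,
              0, 0, 1, 1), nrow = 4)

# Assumed binomial scaling: allele frequency p_j per variant,
# center_j = 2 * p_j and scale_j = sqrt(2 * p_j * (1 - p_j))
p        <- colMeans(X) / 2
center_j <- 2 * p
scale_j  <- sqrt(2 * p * (1 - p))

# Apply (X[i, j] - center_j) / scale_j to every element, column by column
X_scaled <- sweep(sweep(X, 2, center_j, "-"), 2, scale_j, "/")

# Each scaled column is now centered at zero
colMeans(X_scaled)
```

A data.frame with these `center` and `scale` columns is the kind of object the argument description says a scaling function should return.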

Value

A list of 3 elements:

`$obj.svd.ref`: big_SVD object computed from reference data.
`$simple_proj`: simple projection of new data into space of reference PCA.
`$OADP_proj`: Online Augmentation, Decomposition, and Procrustes (OADP) projection of new data into space of reference PCA.
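
As a conceptual illustration of what a simple projection is, the sketch below computes a PCA on a small simulated reference matrix with base R's `svd()` and projects new samples onto the resulting axes, reusing the centers and scales estimated from the reference data. It is not bigsnpr's implementation, which works on bed files and matches and filters variants first. All names and the simulated data are hypothetical, and the OADP projection is not sketched here.

```r
set.seed(1)
n_ref <- 50; n_new <- 5; m <- 200; k <- 3

# Simulated 0/1/2 genotypes for reference and new samples (same variants)
freq  <- runif(m, 0.2, 0.8)
G_ref <- sapply(freq, function(f) rbinom(n_ref, 2, f))
G_new <- sapply(freq, function(f) rbinom(n_new, 2, f))

# Keep only variants that vary in the reference data
keep  <- apply(G_ref, 2, sd) > 0
G_ref <- G_ref[, keep, drop = FALSE]
G_new <- G_new[, keep, drop = FALSE]

# Centers and scales are estimated from the reference data only
center_j <- colMeans(G_ref)
scale_j  <- apply(G_ref, 2, sd)
scale_ref <- function(G) sweep(sweep(G, 2, center_j, "-"), 2, scale_j, "/")

# PCA of the reference data: X_ref = U D V^T, so the PC scores are U D
sv <- svd(scale_ref(G_ref), nu = k, nv = k)
scores_ref <- sv$u %*% diag(sv$d[1:k])

# Simple projection: scaled new data multiplied by the reference loadings V
scores_new <- scale_ref(G_new) %*% sv$v

dim(scores_new)  # n_new rows, k columns
```

Projecting the reference data itself with the same loadings reproduces `scores_ref`, which is a quick sanity check on the construction.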