prepare_CG_subsets: prepare_CG_subsets


View source: R/CpG_subsets.R

Description

This routine selects a subset of CpG sites to be used for MeDeCom analysis. Several selection methods are supported.

Usage

prepare_CG_subsets(meth.data = NULL, rnb.set = NULL, MARKER_SELECTION,
  N_MARKERS = 5000, REMOVE_CORRELATED = FALSE,
  COR_THRESHOLD = "quantile", WRITE_FILES = FALSE, WD = NA,
  REF_DATA_SET = NULL, REF_PHENO_COLUMN = NULL, N_PRIN_COMP = 10,
  RANGE_DIFF = 0.05, CUSTOM_MARKER_FILE = "", store.heatmaps = F,
  heatmap.sample.col = NULL, K.prior = NULL)

Arguments

meth.data

A matrix or data.frame containing methylation information. If NULL, methylation information needs to be provided through rnb.set

rnb.set

An object of type RnBSet-class containing methylation, sample and optional coverage information.

MARKER_SELECTION

A vector of strings representing marker selection methods. Available methods are:

  • "all" Using all sites available in the input.

  • "pheno" Selects the top N_MARKERS sites that differ between the phenotypic groups defined in data preparation or by rnb.sample.groups. The sites are selected by applying limma to the methylation matrix.

  • "houseman2012" The 50k sites reported as cell-type specific in Houseman's reference-based deconvolution. See Houseman et al. 2012.

  • "houseman2014" Selects the sites reported to be linked to cell-type composition by RefFreeEWAS, which is similar to surrogate variable analysis. See Houseman et al. 2014.

  • "jaffe2014" The sites reported as related to cell-type composition by Jaffe et al. 2014.

  • "rowFstat" Selects the sites found to be associated with the reference cell types using F-statistics. If this option is selected, REF_DATA_SET and REF_PHENO_COLUMN need to be specified.

  • "random" Sites are randomly selected.

  • "pca" Sites are selected as those with most influence on the principal components.

  • "var" Selects the most variable sites.

  • "hybrid" Selects (N_MARKERS/2) most variable and (N_MARKERS/2) random sites.

  • "range" Selects the sites with the largest difference between minimum and maximum across samples.

  • "pcadapt" Uses principal component analysis as implemented in the "bigstatsr" R package to determine sites that are significantly linked to the potential cell types. This requires specifying K a priori (argument K.prior). We thank Florian Prive and Sophie Achard for providing the idea and parts of the code.

  • "edec_stage0" Employs EDec's stage 0 to infer cell-type specific markers. By default, EDec's example reference data is used. To use a specific reference data set instead, provide it through REF_DATA_SET.

  • "custom" Uses the indices specified in a custom file (see CUSTOM_MARKER_FILE).
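As a rough illustration of what the simpler data-driven options compute, the following base R sketch mimics "var", "random", "hybrid" and "range" on a simulated matrix. It is not the package's implementation: the matrix meth, the variable names, drawing the random half of "hybrid" from the remaining sites, and treating "range" as a threshold on max minus min are all assumptions made for the example.

```r
# Illustrative sketch only; not DecompPipeline's internal code.
set.seed(42)

# Simulated methylation matrix: 1000 sites (rows) x 20 samples (columns)
meth <- matrix(runif(1000 * 20), nrow = 1000, ncol = 20)
n_markers <- 100

# "var": indices of the n_markers sites with the highest variance across samples
site_var <- apply(meth, 1, var)
sel_var <- order(site_var, decreasing = TRUE)[seq_len(n_markers)]

# "random": n_markers sites drawn uniformly without replacement
sel_random <- sample.int(nrow(meth), n_markers)

# "hybrid": half most variable sites, half random sites (assumed here to be
# drawn from the sites not already chosen by variance)
half <- n_markers %/% 2
top_half <- order(site_var, decreasing = TRUE)[seq_len(half)]
rest <- setdiff(seq_len(nrow(meth)), top_half)
sel_hybrid <- c(top_half, sample(rest, half))

# "range": sites whose (max - min) across samples reaches range_diff
# (assumed threshold semantics, mirroring the RANGE_DIFF argument)
range_diff <- 0.05
site_range <- apply(meth, 1, max) - apply(meth, 1, min)
sel_range <- which(site_range >= range_diff)
```

Each result is a vector of row indices into meth, which matches the kind of output described under Value.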

N_MARKERS

The number of sites to be selected. Defaults to 5000.

REMOVE_CORRELATED

Flag indicating if highly correlated sites are to be removed.

COR_THRESHOLD

Numeric threshold on the correlation between sites; sites correlated above this value are excluded from the feature selection. If set to "quantile", sites with a correlation above the 95th percentile are removed.
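One way to picture the "quantile" setting is sketched below in base R. This is a hypothetical reading, not the package's code: it assumes absolute Pearson correlation between sites, takes the 95th percentile of all pairwise values as the threshold, and uses a simple greedy pass to decide which site of a correlated pair is dropped.

```r
# Illustrative sketch only; the package may filter correlated sites differently.
set.seed(1)

# Simulated methylation matrix: 200 sites (rows) x 15 samples (columns)
meth <- matrix(runif(200 * 15), nrow = 200, ncol = 15)
# Make site 2 a near-copy of site 1 so that a highly correlated pair exists
meth[2, ] <- meth[1, ] + rnorm(15, sd = 0.001)

# Pairwise absolute correlation between sites; cor() works on columns,
# so the matrix is transposed first
site_cor <- abs(cor(t(meth)))
diag(site_cor) <- 0

# Threshold: 95th percentile of the pairwise correlations
threshold <- quantile(site_cor[upper.tri(site_cor)], probs = 0.95)

# Greedy pass: drop site i if it exceeds the threshold with any
# earlier site that is still kept
n_sites <- nrow(meth)
keep <- rep(TRUE, n_sites)
for (i in seq_len(n_sites)) {
  earlier_kept <- keep & (seq_len(n_sites) < i)
  if (any(site_cor[i, earlier_kept] > threshold)) {
    keep[i] <- FALSE
  }
}
kept_sites <- which(keep)
```

After the pass, no two sites in kept_sites are correlated above the threshold. A fixed numeric COR_THRESHOLD would correspond to replacing the quantile() call with that number.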

WRITE_FILES

Flag indicating if the selected sites are to be stored on disk.

WD

Path to the working directory used for analysis or data preparation.

REF_DATA_SET

An object of type RnBSet-class, or a path to such an object stored on disk. Required if "rowFstat" is selected.

REF_PHENO_COLUMN

Optional argument stating the column name of the phenotypic table of REF_DATA_SET with the reference cell type.

N_PRIN_COMP

Optional argument determining the number of principal components used for selecting the most important sites.

RANGE_DIFF

Optional argument specifying the difference between maximum and minimum required.

CUSTOM_MARKER_FILE

Optional argument containing an absolute path to a file that specifies the indices used for employing MeDeCom. The file can be either an RDS file containing a vector of indices to select, or a txt, csv, or tsv file containing one index per row.
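The two file layouts described above can be produced with base R as in the following sketch. The index values and file names are hypothetical, and how the function itself parses the files is not shown in this page, so the read-back step only checks that both layouts encode the same vector.

```r
# Illustrative sketch only; index values and file names are made up.
marker_idx <- c(12L, 305L, 4711L, 9001L)

# Option 1: RDS file containing a vector of indices
rds_file <- file.path(tempdir(), "custom_markers.rds")
saveRDS(marker_idx, rds_file)

# Option 2: plain text file with one index per row
txt_file <- file.path(tempdir(), "custom_markers.txt")
writeLines(as.character(marker_idx), txt_file)

# Reading both files back yields the same integer vector
from_rds <- readRDS(rds_file)
from_txt <- as.integer(readLines(txt_file))
```

Since the argument expects an absolute path, normalizePath(rds_file) can be used to resolve a relative path before passing it on.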

store.heatmaps

Flag indicating if a heatmap of the selected input sites is to be created from the input methylation matrix. The files are then stored in the 'heatmaps' folder in WD.

heatmap.sample.col

Column name in the phenotypic table of rnb.set, used for creating a color scheme in the heatmap.

K.prior

K determined from visual inspection. Only has an effect if MARKER_SELECTION="pcadapt".

Details

For methods "houseman2012" and "jaffe2014", a predefined set of markers is used. Since those correspond to absolute indices on the chip, the provided rnb.set must not be preprocessed and must therefore still contain all sites. For the other methods, you may use prepare_data to filter sites for quality and context.

Value

List of indices, one entry for each marker selection method specified by MARKER_SELECTION. The indices correspond to the sites that should be used in rnb.set.
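To show how such a list of indices might be consumed, the sketch below subsets a methylation matrix once per selection method. The list cg_subsets is a hand-built stand-in for the return value, and naming its entries after the methods is an assumption; a plain matrix is used instead of an RnBSet so the example runs with base R alone.

```r
# Illustrative sketch only; cg_subsets stands in for the function's return value.
set.seed(7)

# Simulated methylation matrix: 500 sites (rows) x 10 samples (columns)
meth <- matrix(runif(500 * 10), nrow = 500, ncol = 10)

# One index vector per selection method (entry names are assumed)
cg_subsets <- list(
  var = order(apply(meth, 1, var), decreasing = TRUE)[1:50],
  random = sample.int(nrow(meth), 50)
)

# Subset the matrix once per method; drop = FALSE keeps the matrix shape
meth_subsets <- lapply(cg_subsets, function(idx) meth[idx, , drop = FALSE])

# Dimensions of each subset: rows = selected sites, columns = samples
subset_dims <- sapply(meth_subsets, dim)
```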

References


lutsik/DecompPipeline documentation built on Oct. 13, 2019, 1:51 a.m.