prepare.CG.subsets: prepare.CG.subsets

View source: R/CpG_subsets.R

prepare.CG.subsets R Documentation

prepare.CG.subsets

Description

This routine selects a subset of CpG sites to be used for MeDeCom analysis. Several selection methods are supported.

Usage

prepare.CG.subsets(
  meth.data = NULL,
  rnb.set = NULL,
  marker.selection,
  n.markers = 5000,
  remove.correlated = FALSE,
  cor.threshold = "quantile",
  write.files = FALSE,
  out.dir = NA,
  ref.rnb.set = NULL,
  ref.pheno.column = NULL,
  n.prin.comp = 10,
  range.diff = 0.05,
  custom.marker.file = "",
  store.heatmaps = F,
  heatmap.sample.col = NULL,
  K.prior = NULL
)

Arguments

meth.data

A matrix or data.frame containing methylation information. If NULL, methylation information needs to be provided through rnb.set.

rnb.set

An object of type RnBSet-class containing methylation, sample and optional coverage information.

marker.selection

A vector of strings representing marker selection methods. Available methods are:

  • "all" Using all sites available in the input.

  • "pheno" Selects the top n.markers sites that differ between the phenotypic groups defined in data preparation or by rnb.sample.groups. These sites are selected by applying limma to the methylation matrix.

  • "houseman2012" 50k sites determined to be cell-type specific for blood cell types using Houseman's reference-based deconvolution and the Reinius et al. reference data set. See Houseman et al. 2012 and Reinius et al. 2012. NOTE: This option should only be used for whole blood data generated using the 450k array.

  • "houseman2014" Selects the sites that RefFreeEWAS, which is similar to surrogate variable analysis, links to cell-type composition. See Houseman et al. 2014.

  • "jaffe2014" The 600 sites reported as related to cell-type composition by Jaffe et al. 2014. NOTE: This option should only be used for whole blood data generated using the 450k array.

  • "rowFstat" Selects the sites found to be associated with the reference cell types by F-statistics. If this option is selected, ref.rnb.set and ref.pheno.column need to be specified.

  • "random" Sites are randomly selected.

  • "pca" Sites are selected as those with most influence on the principal components.

  • "var" Selects the most variable sites.

  • "hybrid" Selects (n.markers/2) most variable and (n.markers/2) random sites.

  • "range" Selects the sites with the largest difference between minimum and maximum across samples.

  • "pcadapt" Uses principal component analysis as implemented in the "bigstatsr" R package to determine sites that are significantly linked to the potential cell types. This requires specifying K a priori (argument K.prior). We thank Florian Prive and Sophie Achard for providing the idea and parts of the code.

  • "edec_stage0" Employs EDec's stage 0 to infer cell-type specific markers. By default, EDec's example reference data is used. A specific reference data set can be supplied through ref.rnb.set.

  • "custom" Uses the indices listed in a custom file, specified through custom.marker.file.
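The simpler data-driven methods above ("var", "range", "random", "hybrid") can be illustrated on a toy matrix. The following is a minimal base R sketch, not the package's implementation: the simulated matrix `meth`, all variable names, and the choice to draw the random half of "hybrid" from the sites not already chosen by variance are assumptions made for illustration.

```r
# Toy methylation matrix: 200 sites (rows) x 12 samples (columns), values in [0, 1].
set.seed(42)
meth <- matrix(runif(200 * 12), nrow = 200, ncol = 12)
n.markers <- 20

# "var": indices of the n.markers sites with the highest variance across samples.
site.var <- apply(meth, 1, var)
var.idx <- order(site.var, decreasing = TRUE)[seq_len(n.markers)]

# "range": sites whose (max - min) across samples exceeds range.diff.
range.diff <- 0.05
site.range <- apply(meth, 1, max) - apply(meth, 1, min)
range.idx <- which(site.range > range.diff)

# "random": n.markers sites drawn uniformly without replacement.
random.idx <- sort(sample(nrow(meth), n.markers))

# "hybrid": n.markers/2 most variable sites plus n.markers/2 random sites.
# Assumption: the random half is drawn from the remaining sites to avoid duplicates.
half <- n.markers / 2
hybrid.var <- order(site.var, decreasing = TRUE)[seq_len(half)]
hybrid.rand <- sample(setdiff(seq_len(nrow(meth)), hybrid.var), half)
hybrid.idx <- sort(c(hybrid.var, hybrid.rand))
```

Each result is a vector of row indices into the methylation matrix, which matches the kind of value the function is documented to return for each method.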

n.markers

The number of sites to be selected. Defaults to 5000.

remove.correlated

Flag indicating if highly correlated sites are to be removed.

cor.threshold

Numeric indicating a threshold above which sites are not to be considered in the feature selection. If "quantile", sites correlated higher than the 95th quantile are removed.
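To make the "quantile" setting concrete, here is a hedged base R sketch of one way such a filter could work. The documentation does not state how the package resolves correlated pairs, so the greedy rule below (keep the first site, drop later sites correlated with it above the threshold), the use of signed Pearson correlation, and all variable names are assumptions.

```r
# Toy data: 50 sites x 10 samples; site 2 is made an exact copy of site 1.
set.seed(1)
meth <- matrix(runif(50 * 10), nrow = 50)
meth[2, ] <- meth[1, ]

# Pairwise Pearson correlation between sites (rows).
cor.mat <- cor(t(meth))

# "quantile" setting: threshold at the 95th quantile of all pairwise correlations.
threshold <- unname(quantile(cor.mat[upper.tri(cor.mat)], 0.95))

# Greedy filter (assumed rule): walk through sites in order and drop every
# later site that correlates above the threshold with a site that is kept.
keep <- rep(TRUE, nrow(meth))
for (i in seq_len(nrow(meth))) {
  if (keep[i]) {
    drop <- which(cor.mat[i, ] > threshold & keep)
    drop <- drop[drop > i]
    keep[drop] <- FALSE
  }
}
kept.idx <- which(keep)
```

With a numeric cor.threshold, the same loop would apply with `threshold` set directly to that number.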

write.files

Flag indicating if the selected sites are to be stored on disk.

out.dir

Path to the working directory used for analysis or data preparation.

ref.rnb.set

An object of type RnBSet-class or a path to such an object stored on disk, if rowFstat is selected.

ref.pheno.column

Optional argument stating the column name of the phenotypic table of ref.rnb.set with the reference cell type.

n.prin.comp

Optional argument determining the number of principal components used for selecting the most important sites.
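As an illustration of how n.prin.comp could enter a "pca"-style selection, the sketch below ranks sites by their largest absolute loading on the first n.prin.comp components, using `prcomp` from base R. How the package actually scores the influence of a site is not stated here, so the scoring rule and the variable names are assumptions.

```r
# Toy data: 300 sites x 15 samples.
set.seed(3)
meth <- matrix(runif(300 * 15), nrow = 300)
n.prin.comp <- 10
n.markers <- 30

# PCA with samples as observations and sites as variables.
pca <- prcomp(t(meth), center = TRUE, scale. = FALSE)

# Absolute loadings of each site on the first n.prin.comp components.
loadings <- abs(pca$rotation[, seq_len(n.prin.comp), drop = FALSE])

# Assumed scoring rule: a site's influence is its largest absolute loading.
score <- apply(loadings, 1, max)
pca.idx <- order(score, decreasing = TRUE)[seq_len(n.markers)]
```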

range.diff

Optional argument specifying the difference between maximum and minimum required.

custom.marker.file

Optional argument containing an absolute path to a file that specifies the indices used for employing MeDeCom. Can be provided either as an RDS file containing a vector of indices to select or as a txt, csv, tsv file containing each index to be selected as a single row.
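A custom marker file in either accepted form can be produced with base R. This is a minimal sketch: the index values and the file names are made up for illustration, and `tempdir()` is used only to obtain an absolute path.

```r
# Hypothetical indices of the sites to select.
marker.idx <- c(5L, 17L, 42L, 108L)

# Option 1: an RDS file containing a vector of indices.
rds.file <- file.path(tempdir(), "custom_markers.rds")
saveRDS(marker.idx, rds.file)

# Option 2: a text file with one index per row.
txt.file <- file.path(tempdir(), "custom_markers.txt")
writeLines(as.character(marker.idx), txt.file)

# Reading both files back to check that they hold the same indices.
from.rds <- readRDS(rds.file)
from.txt <- as.integer(readLines(txt.file))
```

Either path could then be passed as custom.marker.file together with marker.selection = "custom".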

store.heatmaps

Flag indicating if a heatmap of the selected input sites is to be created from the input methylation matrix. The files are then stored in the 'heatmaps' folder in out.dir.

heatmap.sample.col

Column name in the phenotypic table of rnb.set, used for creating a color scheme in the heatmap.

K.prior

K determined from visual inspection. Only has an effect if marker.selection="pcadapt".

Details

For methods "houseman2012" and "jaffe2014", a predefined set of markers is used. Since those correspond to absolute indices on the chip, the provided rnb.set must not be preprocessed and must therefore still contain all sites. For the other methods, you may use prepare.data to filter sites for quality and context.

Value

List of indices, one entry for each marker selection method specified by marker.selection. The indices correspond to the sites that should be used in rnb.set.
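The returned list can be used to subset a methylation matrix, one subset per selection method. The sketch below uses a hand-written stand-in for the return value; the list names, the indices, and the toy matrix are assumptions, since the element names of the actual return value are not documented here.

```r
# Toy methylation matrix: 100 sites x 6 samples.
set.seed(7)
meth.data <- matrix(
  runif(100 * 6), nrow = 100,
  dimnames = list(paste0("cg", 1:100), paste0("S", 1:6))
)

# Hypothetical stand-in for the value returned by prepare.CG.subsets():
# one vector of site indices per marker selection method.
cg.subsets <- list(
  var = c(3L, 10L, 57L),
  random = c(1L, 44L, 99L, 100L)
)

# Subset the matrix once per method; drop = FALSE keeps the matrix shape
# even if a method returns a single index.
subset.data <- lapply(cg.subsets, function(idx) meth.data[idx, , drop = FALSE])
n.sites <- sapply(subset.data, nrow)
```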

Author(s)

Michael Scherer, Pavlo Lutsik

References

  • 1. Houseman, E. A., Accomando, W. P., Koestler, D. C., Christensen, B. C., Marsit, C. J., Nelson, H. H., ..., Kelsey, K. T. (2012). DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics, 13.

  • 2. Reinius, L. E., Acevedo, N., Joerink, M., Pershagen, G., Dahlen, S. E., Greco, D., ..., Kere, J. (2012). Differential DNA methylation in purified human blood cells: Implications for cell lineage and studies on disease susceptibility. PLoS ONE, 7(7). https://doi.org/10.1371/journal.pone.0041361

  • 3. Houseman, E. A., Molitor, J., & Marsit, C. J. (2014). Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics, 30(10), 1431-1439. https://doi.org/10.1093/bioinformatics/btu029

  • 4. Jaffe, A. E., & Irizarry, R. A. (2014). Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biology, 15(2), R31. https://doi.org/10.1186/gb-2014-15-2-r31


CompEpigen/DecompPipeline documentation built on Nov. 3, 2023, 5:35 p.m.