#Load library
library(scMethrix)

Download the data

The example data analysis presented here uses the following publicly available whole genome bisulfite sequencing data from GEO: GSE97179

Reference: Luo C, Keown CL, Kurihara L, Zhou J et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 2017 Aug 11;357(6351):600-604. PMID: 28798132

The data can be directly downloaded from GEO, using the GEOquery package. The files will be downloaded into the working directory.

if(!requireNamespace("GEOquery")) {
  BiocManager::install("GEOquery")
}
library(GEOquery)  
data_dir = paste0(tempdir(),"/GSE131723/")
dir.create(path = data_dir, showWarnings = FALSE, recursive = TRUE)
lapply(c("GSM3814736","GSM3814737","GSM3814738","GSM4830506","GSM4830507","GSM4830508"), function(gsm){
  GEOquery::getGEOSuppFiles(GEO = gsm, baseDir = data_dir, makeDirectory = FALSE)
  # bed_dir <- paste0(data_dir,gsm)
  # bed_file <- list.files(bed_dir,full.names=T)
  # R.utils::gunzip(bed_file,destname=paste0(data_dir,gsm,".bed"))
  # unlink(bed_dir,recursive=TRUE,force=TRUE)
})
bed_files <- list.files(
  path = data_dir,
  pattern = "*bedgraph\\.gz$",
  #pattern = "*.bed$",
  full.names = TRUE, 
  recursive = TRUE
)

print(basename(bed_files))

Processing

CpG annotation

As a first step, we need a list of CpG sites in the respective genome. The CpG sites are listed using the respective BSgenome annotation package. The read_beds function is also able to extract CpG sites on it's own, however, it might be beneficial to do it separately.

#Genome of your preference to work with
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
library(BiocManager)
if(!requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
}
library(BSgenome.Hsapiens.UCSC.hg19) 
mm19_cpgs <- suppressWarnings(extract_CPGs(ref_genome = "BSgenome.Hsapiens.UCSC.hg19"))

Sample annotation

An annotation table is also necessary to perform some analyses. The data will be added to the scMethrix object, as a colData slot. Each row of the table represents a single sample. Sample names are taken from the input file names, and stored as row.names. Subsequent information about each sample (e.g., cell type, alternate IDs, sample source, etc) are stored as columns within this data.frame. To access this data, see [Saving experiment metadata].

sample_anno <- data.frame(
  row.names = gsub(
    pattern = paste0("(GSM[[:digit:]]*)_epi-gSCAR_(Kasumi-1|OCI-AML3)_SC[[:digit:]]_meth.bedgraph.gz"),
    replacement = "\\1",
    x = basename(bed_files)
    ),
  Group = gsub(
    pattern = paste0("(GSM[[:digit:]]*)_epi-gSCAR_(Kasumi-1|OCI-AML3)_SC[[:digit:]]_meth.bedgraph.gz"),
    replacement = "\\2",
    x = basename(bed_files)
    ),
  stringsAsFactors = FALSE
)
knitr::kable(sample_anno)

Reading bedGraph files

read_beds function is a versatile bedgraph reader intended to import bedgraph files generated virtually by any sort of methylation calling program. It requires user to provide indices for chromosome names, start position and other required fields. There are also presets available to import bedgraphs from most common programs such as Bismark (.cov format), MethylDackel, and MethylcTools. In this case, there is no need to define e.g. chr_idx start_idx arguments, the function will automatically assign them. To know the exact parameters, it might worth to take a look at one of the files, to see the column order.

res <- fread(bed_files[1])
colnames(res) <- c("chr","start","end","beta")
head(res)

The read_beds function adds CpGs missing from the reference set, and creates a methylation/coverage matrices. Once the process is complete - it returns an object of class scMethrix which in turn inherits SingleCellExperiment class. scMethrix object contains ‘methylation’ and (optionally) ‘coverage’ matrices (either in-memory or as on-disk HDF5 arrays) along with pheno-data and other basic info. This object can be passed to all downstream functions for various analysis. For further details on the data structure, see the SingleCellExperiment package: SingleCellExperiment

A more detailed description of the read_beds function is here: read_beds)

meth <- read_beds(
  files = bed_files,
  ref_cpgs = mm19_cpgs,
  chr_idx = 1,
  start_idx = 2,
  strand_idx = 3,
  cov_idx = 4,
  M_idx = 5,
  stranded = FALSE,
  zero_based = TRUE, 
  #collapse_strands = FALSE, 
  colData = sample_anno
)

Certain operations will cause metadata to be destroyed. For range metadata, bin_scMethrix() will discard row metadata, as will collapse_samples for column (sample) metadata. It is advised to extract this data from mcols() and colData(), respectively, and store it within metadata() instead.

Potential pitfalls: Too many missing CpGs, if not expected, might indicate that something went wrong.

Extracting assay matricies

After the experiment is created, an assay matrix can be generated. Genomic loci can also be added (add_loci), be formatted as a GenomicRanges object if desired, and ordered by variance (order_by_sd). Additionally, the methylation and coverage matrices can be directly obtained with the functions score() and counts().

mtx <- get_matrix(meth, assay="score")
head(mtx)
# Add genomic loci, put in a GRanges object, and order by SD
mtx <- get_matrix(meth, assay="score", add_loci = TRUE, in_granges=TRUE, order_by_sd = TRUE)
head(mtx)

Saving experiment metadata

scMethrix can store experiment metadata. Depending on the type of data, it is stored in different places:

  1. Experiment-wide metadata: General information is stored in metadata() as a named list. This can be information like genome accession, GEO IDs, authors, or any other information that needs just a single variable.
metadata(scm)
  1. Range metadata: Information about ranges is stored as a data.frame in mcols(). This is stored within the rowRanges() sub-object inside
mcols(scm)
  1. Sample metadata: Information about samples is stored as a data.frame in colData(). This can include alternate samples IDs, run IDs, tissue types, or anything associated with individual samples.
colData(scm)


CompEpigen/scMethrix documentation built on Nov. 6, 2021, 3:09 p.m.