Initial QC

Once the experiment is generated, we can create an interactive html report containing basic summary statistics of the scMethrix object with scMethrix_report function.

The report can be accessed here: Initial reports

methrix::methrix_report(meth = meth, output_dir = getwd(), prefix = "initial")

In the report, we can check genome-wide and chromosome based statistics on coverage and methylation, in order to identify potential quality issues, for example:

Filtering

Masking low/high coverage CpGs

To ensure the high quality of our dataset, the sites with very low coverage (untrustworthy methylation level) or high average coverage (technical problem) should not be used. We can mask these CpG sites. Please note, that the DSS we were using for Differential Methylation Region (DMR) calling, doesn't need the removal of the lowly covered sites, because it takes it into account by analysis. However, using a not too restrictive threshold won't interfere with DMR calling.

# Masking CpG sites with a coverage < 2 and an average coverage > 2
meth <- scMethrix::mask_by_coverage(meth, low_threshold = 2, avg_threshold = 2) 

Masking low sample count

If the coverage matrix is absent, filtering can be done via sample count instead. Similar to mask_by_coverage, low_threshold masks be low sample count, whereas prop_threshold will mask sites that appear in less than a certain proportion of all input samples.

# Masking CpG sites present in fewer than 2 samples
meth <- scMethrix::mask_by_sample(meth, low_threshold = 2) 
# Masking CpG sites present in fewer than 5% of samples
meth <- scMethrix::mask_by_sample(meth, prop_threshold = 0.05) 

Masking low variance CpGs

CpG sites with a low inter-individual methylation variance can increase the risk of false discoveries and reduce statistical significance. We can mask the sites to prevent this, as well as decreasing computation time and data size. The function mask_by_variance will calculate the variance of each CpG site and mask it if below a specified threshold.

# Masking sites with a variance < 0.05
meth <- scMethrix::mask_by_variance(meth, low_threshold = 0.05)

SNP filtering

With SNPs, the C > T mutations can disrupt methylation calling. Therefore, it is essential to remove CpG sites that overlap with common variants, if we have variation in our study population. For example working with human, unpaired data, e.g. treated vs. untreated groups.

During filtering, we can select the population of interest and the minimum minor allele frequency (MAF) of the SNPs.

SNP filtering is currently implemented for hg19 and hg38 assemblies. For mm10 assemblies, if the same strain of mice is used, SNP filtering is not generally necessary due to high homozygosity between individuals. SNP data for mice can be downloaded from here: https://www.sanger.ac.uk/science/data/mouse-genomes-project.

if(!requireNamespace("GenomicScores")) {
  BiocManager::install("GenomicScores")

}
library(GenomicScores) 
meth <- scMethrix::remove_snps(meth, keep = TRUE)
if (!requireNamespace("pheatmap", quietly = TRUE))
    install.packages("pheatmap")
snps <- meth[[2]]
meth <- meth[[1]]
pheatmap::pheatmap(get_matrix(order_by_sd(snps)[1:min(5000, length(snps)),]))

Filtering masked sites

After masking is completed, we can remove those sites that are not covered by any of the samples.

meth <- scMethrix::remove_uncovered(meth) 

Querying

Often, whole genome analysis using all samples is not necessary. This package includes multiple convenient options for querying - regions, contigs, or samples - and can processed as either inclusive or exclusive. Multiple query types can be given in each command. Subset operations in scMethrix make use of the fast binary search data.table that is several orders faster than bsseq or other similar packages.

# Subsetting to include only two samples
meth <- scMethrix::subset_methrix(meth, contigs = c("GSM2553093", "GSM2553095"), by="include")
colnames(inc)
# Subsetting to exclude two samples
meth <- scMethrix::subset_methrix(meth, contigs = c("GSM2553093", "GSM2553095"), by="exclude")
colnames(meth)
# Subsetting to include only chr1 and chr2
meth <- scMethrix::subset_methrix(meth, contigs = c("chr1", "chr2"))
meth
GenomicRanges::seqinfo(SummarizedExperiment::rowRanges(meth))
# Subsetting to include only bases 1:10M in chr1
regions <- GenomicRanges::GRanges(seqnames = "chr1", ranges = IRanges(1,10000000)) 
meth <- scMethrix::subset_methrix(meth, regions = regions)
meth
GenomicRanges::seqlengths(SummarizedExperiment::rowRanges(meth))

Visualization and QC after filtering

scMethrix report

After filtering, it worth running a report again, to see if any of the samples were so lowly covered that it warrants an action (e.g. removal of the sample)

Important: Don't forget to set the recal_stats to TRUE, since the object changed since reading in. The output directory has to be different from last time, to avoid using the intermediate files calculated during the last run.

methrix::methrix_report(meth = meth, output_dir = getwd(), recal_stats = TRUE, prefix="processed")

The report can be found here: processed_methrix_reports.html

There are additional possibilities to visualize the study. We can look at the density of coverage both sample- and group-wise:

scMethrix::plot_coverage(meth, type = "dens")
scMethrix::plot_coverage(meth, type = "dens", pheno = "Day", perGroup = TRUE)

We can visualize te beta value distribution as a violin plot or density plot. These plots (as well as the coverage plot) use the 25000 most variable CpG sites to ensure fast processing. plot_density and plot_violin has the option to use data only from a restricted region.

scMethrix::plot_density(meth)
scMethrix::plot_density(meth, ranges = GRanges("chr1:1-10000000"))
scMethrix::plot_violin(meth)

Dimensionality reduction

scMethrix offers principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (tSNE) to conduct and visualize dimensionality reduction. The resulting reduction is stored in the scMethrix object, and can be accessed via reducedDim(). First, the model is calculated, then a number of sites are selected for plotting, either random or based on variance (var option). This ensures that the calculations remain feasible.

meth <- scMethrix::dim_red_scMethrix(meth,type="PCA",top_var = 10000, n_components = 20)
plot_dim_red(meth,type="PCA")
meth <- scMethrix::dim_red_scMethrix(meth,type="UMAP",top_var = 10000, n_neighbors = 3)
plot_dim_red(meth,type="UMAP")
meth <- scMethrix::dim_red_scMethrix(meth,type="tSNE",top_var = 10000, n_components = 20)
plot_dim_red(meth,type="tSNE")

At visualization, we can provide the methrix object to allow color or shape annotation of groups or samples.

scMethrix::plot_pca(meth, col_anno = "Day", shape_anno = "Replicate")

scMethrix offers the possibility of region based filtering. With this option, selected regions (e.g. promoters) can be visualized. We will use the AnnotationHub package to assess basic annotation categories.

if(!requireNamespace("AnnotationHub")) 
  BiocManager::install("AnnotationHub", update = F)
library("AnnotationHub")
ah = AnnotationHub()
qhs = query(ah, c("RefSeq", "Mus musculus", "mm10"))
genes = qhs[[1]]
promotors = promoters(genes)
meth.subset <- subset_methrix(meth,regions = promotors)
meth <- scMethrix::dim_red_scMethrix(meth.subset,type="PCA",top_var = 10000, n_components = 20)
plot_dim_red(meth,type="PCA")


CompEpigen/scMethrix documentation built on Nov. 6, 2021, 3:09 p.m.