knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(intratumormeth)
# paths
outputDir <- "~/sfb824/packagepdgfra_output"
inputDir <- "~/sfb824/packagepdgfra_input"
samplesheet <- read.csv(file.path(inputDir, "samplesheet.csv"))

This document walks through quality control steps that should be performed before analyzing methylation microarray data.

Detection p-values

The EPIC microarray includes a set of control probes that can be evaluated to quantify the confidence in the methylation level obtained for each individual probe. These confidence values are called detection p-values.

Samples with a high average detection p-value should be excluded from the study, as well as probes that fail in multiple samples.

# outputDir contains data previously generated by preprocess_methylation_data()
detP <- detection_p_analysis(outputDir, 
                             pThreshold = 0.01)

This returns a list with four members: poorSamplePlot, meanDetP, failingProbes and failingProbesPlot. The first two summarize quality per sample:

detP$poorSamplePlot
detP$meanDetP %>% 
  dplyr::filter(mean_detp > 0.01)

These samples have an average detection p-value higher than 0.01 and should therefore be excluded from further analysis.
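
To make the per-sample criterion concrete, here is a minimal sketch on a made-up matrix of detection p-values (probes in rows, samples in columns). The object toySampleDetP and all of its values are invented for illustration. How detection_p_analysis() computes its summary internally is not shown in this document, so the sketch only assumes that mean_detp is the column-wise mean of the detection p-values.

```r
# Toy detection p-value matrix: 5 probes (rows) x 3 samples (columns).
# All values are invented for illustration.
toySampleDetP <- matrix(
  c(0.001, 0.002, 0.050,
    0.000, 0.001, 0.020,
    0.003, 0.000, 0.100,
    0.000, 0.004, 0.001,
    0.002, 0.000, 0.030),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("cg", 1:5), c("s1", "s2", "s3"))
)

pThreshold <- 0.01

# Average detection p-value per sample (assumed definition of mean_detp)
toyMeanDetP <- data.frame(
  sample_id = colnames(toySampleDetP),
  mean_detp = colMeans(toySampleDetP),
  stringsAsFactors = FALSE
)

# Samples whose average exceeds the threshold
poorSamples <- toyMeanDetP$sample_id[toyMeanDetP$mean_detp > pThreshold]
print(poorSamples)
```

In this toy data only s3 exceeds the threshold, because most of its probes have high detection p-values; a single poor probe in an otherwise good sample would barely move the mean.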

Next, the number of probes failing within each sample is investigated. The function output shows that the dataset contains many beta values with a poor detection p-value. These are spread across many probes, so excluding every affected probe from all samples is not an option: in our case, this would remove almost 500,000 probes.

A list of probes that fail for one or more samples is returned:

detP$failingProbes %>% head()
nrow(detP$failingProbes)

There are almost 500,000 probes with a poor detection p-value in at least one sample. The majority of these probes fail in just one or two samples, but some fail consistently:

detP$failingProbesPlot
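
The per-probe counts can be illustrated with a small base R sketch. The matrix toyProbeDetP is invented, and the column names probe_id and failed_in_n_samples simply mirror those used in this document. That a probe "fails" in a sample when its detection p-value exceeds pThreshold is an assumption about how detection_p_analysis() defines failure.

```r
# Toy detection p-value matrix: 4 probes (rows) x 5 samples (columns).
# All values are invented for illustration.
toyProbeDetP <- matrix(
  c(0.001, 0.002, 0.000, 0.003, 0.001,
    0.050, 0.000, 0.000, 0.000, 0.001,
    0.000, 0.000, 0.020, 0.000, 0.000,
    0.200, 0.300, 0.001, 0.400, 0.500),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("cgA", "cgB", "cgC", "cgD"), paste0("s", 1:5))
)

pThreshold <- 0.01

# Number of samples in which each probe exceeds the threshold
nFailed <- rowSums(toyProbeDetP > pThreshold)

# Keep only probes that fail in at least one sample
toyFailingProbes <- data.frame(
  probe_id = names(nFailed)[nFailed > 0],
  failed_in_n_samples = unname(nFailed[nFailed > 0]),
  stringsAsFactors = FALSE
)

# How many probes fail in exactly 1, 2, ... samples
failureDistribution <- table(toyFailingProbes$failed_in_n_samples)
print(toyFailingProbes)
print(failureDistribution)
```

Here cgB and cgC each fail in a single sample, while cgD fails in four of five. This is the same pattern described above: many sporadic failures and a few consistently unreliable probes.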

For this study, probes that fail in more than three samples are excluded, as are samples with an average detection p-value greater than 0.01. The corresponding sample and probe IDs are saved to disk so that the same filters can be applied later when loading the methylation beta values.

probeFilter <- detP$failingProbes$probe_id[detP$failingProbes$failed_in_n_samples > 3]
length(probeFilter)
sampleFilter <- detP$meanDetP[["sample_id"]][detP$meanDetP$mean_detp > 0.01]

# save these filters for later use:
# saveRDS(probeFilter, file.path(inputDir, "probe_filter.Rdata"))
# saveRDS(sampleFilter, file.path(inputDir, "sample_filter.Rdata"))
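
How such filters might be applied later can be sketched as follows. The beta matrix, the filter contents and the temporary file are all invented for illustration; the package's own loading functions may apply the filters differently.

```r
# Toy beta matrix: 4 probes (rows) x 3 samples (columns), invented values
toyBetas <- matrix(
  seq(0.1, 0.9, length.out = 12), nrow = 4,
  dimnames = list(c("cg1", "cg2", "cg3", "cg4"), c("s1", "s2", "s3"))
)

# Hypothetical filters: IDs of probes and samples to exclude
toyProbeFilter <- c("cg2")
toySampleFilter <- c("s3")

# Round trip through disk, using a temporary file instead of inputDir
filterFile <- tempfile(fileext = ".rds")
saveRDS(toyProbeFilter, filterFile)
restoredProbeFilter <- readRDS(filterFile)

# Drop filtered probes (rows) and samples (columns)
filteredBetas <- toyBetas[
  !rownames(toyBetas) %in% restoredProbeFilter,
  !colnames(toyBetas) %in% toySampleFilter,
  drop = FALSE
]
print(dim(filteredBetas))
```

Negating %in% keeps all rows and columns when a filter is empty, whereas negative indexing with an empty index would return nothing. Note also that readRDS() does not depend on the file extension, although .rds is the more common choice for files written with saveRDS().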

Sex prediction

Comparing the signal intensities of probes targeting the X and Y chromosomes reveals clusters of male and female samples. This feature can be used to predict the sex of each sample.

predictedSex <- predict_sex(outputDir, normalization = "swan", predictionCutoff = -2)

The returned list contains three ggplot objects (chrX, chrY and cutoff) and a tibble with the predicted sex (predicted_sex).

predictedSex$chrX
predictedSex$chrY
predictedSex$cutoff
predictedSex$predicted_sex %>% 
  head()
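
The idea behind the cutoff can be sketched without the package. The example below uses invented median log2 intensities for chrX and chrY probes and calls a sample female when the difference (chrY minus chrX) falls below the cutoff of -2. This follows the general approach of intensity-based sex prediction; whether predict_sex() computes exactly this quantity is an assumption, and all values and column names other than predicted_sex are made up.

```r
# Invented per-sample median log2 intensities on the sex chromosomes
toyIntensities <- data.frame(
  sample_id   = c("s1", "s2", "s3", "s4"),
  median_chrX = c(13.5, 13.4, 12.9, 13.0),
  median_chrY = c(9.8, 10.1, 12.6, 12.8),
  stringsAsFactors = FALSE
)

predictionCutoff <- -2

# Female samples lack a Y chromosome, so their chrY signal is close to
# background and the difference chrY - chrX becomes strongly negative.
toyIntensities$diff_y_x <- toyIntensities$median_chrY - toyIntensities$median_chrX
toyIntensities$predicted_sex <- ifelse(
  toyIntensities$diff_y_x < predictionCutoff, "F", "M"
)
print(toyIntensities)
```

Samples whose difference lies close to the cutoff are the ones worth inspecting by eye in the plots, since a small shift in the cutoff would change their predicted sex.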

Compare the predicted sex with the actual sex and filter for those cases where the predicted sex does not match the actual sex:

patientsheet <- read.csv(file.path(inputDir, "patientsheet.csv"))
samplesheet %>% dplyr::left_join(patientsheet, by = "patient_id") %>%
  dplyr::left_join(predictedSex$predicted_sex, by = "sample_id") %>% 
  dplyr::filter(predicted_sex != sex)

In sample sfb18_1, the predicted and the recorded sex differ, which may indicate, for example, poor sample preparation or sample mislabeling.
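
One detail of this comparison is worth illustrating: dplyr::filter() drops rows for which the condition evaluates to NA, so samples without a recorded or predicted sex silently disappear from the mismatch table. The base R sketch below uses invented data frames (only the column names follow this document) and separates true mismatches from samples that could not be checked.

```r
# Invented sample, patient and prediction tables
toySamples <- data.frame(
  sample_id  = c("a_1", "a_2", "b_1", "c_1"),
  patient_id = c("a", "a", "b", "c"),
  stringsAsFactors = FALSE
)
toyPatients <- data.frame(
  patient_id = c("a", "b", "c"),
  sex        = c("F", "M", NA),  # recorded sex is missing for patient c
  stringsAsFactors = FALSE
)
toyPredicted <- data.frame(
  sample_id     = c("a_1", "a_2", "b_1", "c_1"),
  predicted_sex = c("F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Left joins, analogous to the dplyr::left_join() calls above
merged <- merge(toySamples, toyPatients, by = "patient_id", all.x = TRUE)
merged <- merge(merged, toyPredicted, by = "sample_id", all.x = TRUE)

bothKnown <- !is.na(merged$sex) & !is.na(merged$predicted_sex)

# Samples where both values are known and disagree
mismatch <- merged[bothKnown & merged$sex != merged$predicted_sex, ]

# Samples that cannot be checked because one value is missing
unchecked <- merged[!bothKnown, ]

print(mismatch)
print(unchecked)
```

The comparison also assumes that both columns use the same coding (for example "F"/"M" in both); if the patient sheet used different labels, every sample would be reported as a mismatch.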

rm(list = ls())
gc()

fynnwi/intratumormeth documentation built on March 29, 2022, 12:06 a.m.