DNA methylation at CpG dinucleotides is one of the most extensively studied epigenetic marks. With technological advancements, geneticists can profile DNA methylation with multiple reliable approaches. However, different profiling platforms can differ substantially in the CpGs they assess, consequently hindering integrated analysis across platforms. Here, we present CpG impUtation Ensemble (CUE), which leverages multiple classical statistical and modern machine learning methods, to impute from the Illumina HumanMethylation450 (HM450) BeadChip to the Illumina HumanMethylationEPIC (HM850) BeadChip.
CUE is maintained by Gang Li [franklee@live.unc.edu] and Laura Raffield [laura_raffield@unc.edu].
From this study, we provide two sets of imputation models: one for whole blood and the other for placenta. Investigators can therefore complete their own imputation of placental or whole blood HM850 CpG sites using their own HM450 data, without access to their own reference panel. Our method is also applicable to imputation in other tissues, provided the user can provide a reference dataset assayed on both HM450 and HM850.
CUE package can be directly installed as following:
git clone https://github.com/GangLiTarheel/CUE.git
cd CUE
Please install the following version of R and R packages before performing imputation.
R version 3.6.0 (2019-04-26)
class_7.3-15
Note: to install an R packages, simply run install.packages("Name_of_package"). For example,
install.packages("xgboost")
Note: you can also install the version of xgboost we used from the source code: https://cran.r-project.org/src/contrib/Archive/xgboost/xgboost_0.82.1.tar.gz
```{r init, message=TRUE} library(randomForest) # RF library(xgboost) # XGBoost library("class") # KNN source("R/refund_lib.R")
library(parallel) # Use this package if you need parallel computing to accelerate computation speed
## Download the pretrained imputation models
Note: The compressed pre-trained models need around ~ 58 GB (ELGAN ~36 GB; and PTSD ~21 GB) memory of storage.
Please save the above pretrained models (tar.gz files) in the same directory of your CUE packages.
To extract the pre-trained models:
Make two directories for two sets of pre-trained models:
pwd # this should be your root directory of CUE mkdir PTSD mkdir ELGAN
Download the pre-trained models amd move the PTSD and ELGAN pretrained models to the corresponding folder.
### Whole blood (PTSD):
ftp://yunlianon:anon@rc-ns-ftp.its.unc.edu/CUE/PTSD_model.tar.gz
### Placenta (ELGAN):
ftp://yunlianon:anon@rc-ns-ftp.its.unc.edu/CUE/ELGAN_model.tar.gz
### then decompress.
cd PTSD tar -xf PTSD_model.tar.gz cd ../
cd ELGAN tar -xf ELGAN_model.tar.gz cd ../
## CUE imputation
Here we show the imputation for a toy dataset with three samples, 248K HM450K probes.
The reason we use only 248K probes is that when we trained models, we only retain the 248,421 HM450 CpG sites (sites overlapping between ELGAN and PTSD, and without missingness in our samples) to train since we don't want the incompleteness or the pre-imputation on HM450 affects our evaluations on HM850 imputation.
The full list of 248K HM450K probes can be found in Probes.RData.
### The input dataset X should have row as probes, columns as samples.
```{r perform imputation}
## Set the working directory
setwd(“/YOUR_DIRECTORY/CUE”)
# Replace YOUR_DIRECTORY with your directory where you downloaded CUE.
root_dir = getwd()
## load sample dataset
load("PTSD/Sample_Dataset.RData")
X<-sample_data # Input HM450
m<-dim(X)[2] # number of samples # Number of smaples
## load required datasets
load("PTSD/Annotations.RData") # load Annotation data
load("Data/PTSD.neighbors.RData") # load All neighbors
load("Data/Probes.RData") # load All probes' names
## load required datasets for PTSD imputation
load("Data/PTSD_CpG_Best_method_list.RData") # load the best method list
load("PTSD/RF/best_RF.RData") # load neighbors for RF
load("PTSD/XGB/best_XGB.RData") # load neighbors for XGB
load("PTSD/KNN/best_KNN.RData") # load neighbors for KNN
load("PTSD/TCR/best_TCR.RData") # load neighbors for PFR
## load all the required packages
library(randomForest)
library(xgboost)
library(xgboost)
source("R/refund_lib.R")
## load the CUE imputation function
source("R/impute.R")
## Check the input
CUE_check(X) # check if the inuput X match the requirements
## TCR required input
temp <- TCR.input(X)
X_logit = temp[[1]] # Logit transformation make it better follows Gaussian distribution
test.funcs <- temp[[2]] # Pre-calculated sample-specific density function
## Imputation (Single CPU might takes 4-5 days to imputation for a middle sized methylation dataset (n=~100 samples).)
m.imputed<-CUE.impute(X,m,"PTSD")
## Save the imputed probes as y_impute.RData
setwd(root_dir)
save(m.imputed,file="y_impute.RData")
Note: (a) we impute all 339K HM850 specific probes which had complete data in our reference whole blood and placenta datasets. Users of CUE must use the following quality control steps to retain the well imputed probes only for use in subsequent analysis (such as epigenome wide association studies).
(b) we provided the corresponding annotation files within PTSD datasets ("PTSD/Annotations.RData") for building funtional predicitors for penalized function regression models. The annntation files can also be downloaded from the Illumina website.
File link: HM850
We provdies two sets of QC+ probes list for two datsets with the following thresholds:
QC for ELGAN: RMSE<0.1 and Accruracy > 90%;
QC for PTSD: RMSE<0.05 and Accruracy > 95%. ```{r subset} QC_probes<-read.csv("Placenta (ELGAN)QC_probe_list.csv") # for Placenta
load("y_impute.RData") QC_probes<-unlist(QC_probes) m.QC <- m.imputed[,paste(QC_probes)] save(m.QC, file="m.imputed.QC.RData")
(Optional): We also provide an shiny app, CUE_QC.R, an visualization tool to select the QC+ probe list.
```{r CUE_QC}
runApp('CUE_QC.R')
The shiny app will save the list of well imputed probes. Users can use the previous code to subset the output probes.
The final QC+ imputed HM850 probes will be saved as "m.imputed.QC.RData". See the output:
```{r output from CUE}
dim(m.QC)
head(m.QC[,1:5])
```
Li G, Raffield L, Logue M, Miller MW, Santos HP Jr, O'Shea TM, Fry RC, Li Y. CUE: CpG impUtation ensemble for DNA methylation levels across the human methylation450 (HM450) and EPIC (HM850) BeadChip platforms. Epigenetics. 2021 Aug;16(8):851-861. doi: 10.1080/15592294.2020.1827716. Epub 2020 Oct 4. PMID: 33016200; PMCID: PMC8330997.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.