README.md

SELINA

SELINA is a deep learning-based framework for single cell assignment with multiple references. The algorithm consists of three main steps: cell type balancing, pre-training and fine-tuning. The rare cell types in reference image.pngdata are first oversampled using SMOTE(Synthetic Minority Oversampling Technique), and then the reference data is trained with a supervised deep learning framework using MADA(Multi-Adversarial Domain Adaptation). An autoencoder is subsquently used to fine-tune the parameters of the pre-trained model. Finally, the labels from reference data are transferred to the query data based on the fully-trained model. Along with the annotation algorithm, we also collect 136 datasets which were uniformly processed and curated to provide users with comprehensive pre-trained models.

Installation

If you have gpu on your device and want to use it, you should install cudatoolkit and cudnn based on your system version before SELINA installation.

We recommend you install SELINA with devtools::install_github() from R:

if (!require("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("SELINA-team/SELINA.R")

Usage

Preprocess of query data

You could preprocess query data with steps in documentation.

Pre-training of the reference data

Train model with train_model. You will get a list, which includes a training model and it's meta information. Files used in here are included in folder demos. You can check parameter details with command ?train_model.

NOTE: Please put expression and meta files in one folder, meta file should include Celltype and Platform columns for normal datasets, and an additional Disease column for disease datasets.

library(SELINA)

model <- train_model(path_in="demos/normal_data/reference_data",
                     disease=FALSE)

Save the model with save_model.

save_model(model, path_out, prefix)

In this step, two output files will be generated in the path_out folder. 1. pre-trained_params.pt : a file containing all parameters of the trained model. 2. pre-trained_meta.rds : a file containing the cell types and genes of the reference data.

Prediction

Annotate query data with query_predict. You will get a list, which includes prediction results and corresponding probability for query data.

Files used in here are included in folder demos. You can check parameter details with command ?query_predict.

SELINA has trained models for 35 kinds of normal tissues and 3 kinds of disease tissues, you can load them with command ?load_selina_model. All the tissue names are showed in the toggle list below.

Tissue models (Click Me)

1.Normal Adrenal-Gland Airway-Epithelium Artery Bladder Blood Bone-Marrow Brain Breast Choroid Decidua Esophagus Eye Fallopian-Tube Gall-Bladder Heart Intestine Kidney Liver Lung Muscle Nose Ovary Pancreas Peritoneum Placenta Pleura Prostate Skin Spleen Stomach Testis Thyroid Ureter Uterus * Visceral-Adipose

2.Disease AD (type II diabetes) T2D (non-small-cell lung carcinoma) * NSCLC (Alzheimer’s disease)

Load the MADA model.

library(SELINA)

## If you predict directly after training, then can skip the next load model step.
# If you want to use models trained by yourself:
model <- read_model(path_model)

# If you want to load model SELINA prepared (Please make sure the input tissue name is included in our documentation, eg: Pancreas):
model <- load_selina_model(tissue)

Predict with SELINA.

queryObj <- readRDS(path_query)
query_result <- query_predict(query_expr = queryObj,
                              model = model,
                              path_out = path_out,
                              outprefix = 'query', 
                              disease = TRUE, 
                              cell_cutoff = 5,
                              prob_cutoff = 0.9)

This step will output eight files in the path_out folder.

For disease mode, an extra file will be generated. - 5. query_cellsources.txt: predicted cell source for each cell in the query data

Change Log

v0.1



SELINA-team/SELINA.R documentation built on Sept. 17, 2022, 8:45 a.m.