Imaging tissue contains genes spatially organised by the extracellular matrix, individual cell types and groups of cell compartments. Using linear algebra and vectors, the expected patterns of the histology images are learned (example patterns of (1): extracellular matrix stained in eosin, and (2): nuclei of cells stained in hematoxylin). Assigning each sample image to a learned pattern (by the Generative Encoder algorithm) is similar to assigning an image weight to a pattern vector. Accordingly, genes are also assigned weights to the same set of expected image patterns. This encodes the information into the same space.
The library that wraps around the Generative Encoder algorithm (gcode) is called GeneCodeR and provides a more comprehensive interface to load data from path, run gcode, extract relevant transformations, and validate the learned patterns.
# Main libraries for analysis library(GeneCodeR) library(gcode) # Main libraries for plotting library(ggplot2) library(grid) library(gridExtra)
Please replace these paths
main_path <- "~/Documents/main_files/AskExplain/Q4_2022/gcode/" # Please replace these paths path_to_save <- paste(main_path,"./temp_save_dir/",sep="") path_to_work <- paste(main_path,"./temp_work_dir/",sep="")
Download data and format
dir.create(path_to_save) dir.create(path_to_work) setwd(path_to_work) curl::curl_download(url = "https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/29ntw7sh4r-5.zip", destfile = paste(path_to_work,"he_et_al.zip",sep="")) utils::unzip(zipfile = paste(path_to_work,"he_et_al.zip",sep=""), exdir = path_to_work) path_to_work <- paste(path_to_work,"29ntw7sh4r-5",sep="") setwd(path_to_work) main_files <- list.files(path_to_work, full.names = T) new_files <- gsub(pattern = "BC", replacement = "BT", x = main_files) file.rename(from = main_files, to = new_files)
GeneCodeR takes in three input files from a generic Spatial Transcriptomic dataset, per tissue slide:
1 - coordinates 2 - gene expression 3 - images
# Input list of files file_path_list <- GeneCodeR::extract_path_framework(F) file_path_list$meta$path <- list.files(path_to_work,pattern = "*Coords.tsv.gz",full.names = T) file_path_list$coord$path <- list.files(path_to_work,pattern = "*csv.gz",full.names = T) file_path_list$gex$path <- list.files(path_to_work,pattern = "*stdata.tsv.gz",full.names = T) file_path_list$pixel$path <- list.files(path_to_work,pattern = "*jpg",full.names = T)
set.seed(1) train_ids <- sample(c(1:length(file_path_list$coord$path)),round(length(file_path_list$coord$path)*0.9,0)) test_ids <- c(1:length(file_path_list$coord$path))[-train_ids] train_file_path_list <- lapply(file_path_list,function(X){ list(path=X$path[train_ids]) }) test_file_path_list <- lapply(file_path_list,function(X){ list(path=X$path[-train_ids]) })
The GeneCodeR configuration file takes in several parameters.
The first set of parameters is information to extract spots - such as spot size, and any augmentations such as displacement of the coordinates, or rotation of the image.
The second set of parameters is for after the gcode model is learned (via GeneCodeR wrapper). This is to transfrom from one modality, to another modality (for example, from image to gene expression).
# Set up genecoder configuration parameters genecoder.config <- GeneCodeR::extract_config_framework(F)
The meta information file takes in several parameters per coordinate, gene expression and image.
The first set of parameters is to read in the file. Here, example parameters such as the file type (for example, is it a .csv, .tsv, .jpg, .png, market matrix file (.mtx)?), and other important parameters based on file type (for example, does it contain a header, what is the separator ...).
Importantly, another set of parameters is the factor identifier in the coordinate file - which columns contain the x-coordinates, y-coordinates and the label identifier for the spots? For example, the label identifiers, are given by the rownames indicated by a 0.
# Set up meta information meta_info_list <- GeneCodeR::extract_meta_framework(F) meta_info_list$meta$read_file$file_type <- "tsv" meta_info_list$coord$read_file$file_type <- "csv" meta_info_list$gex$read_file$file_type <- "tsv" meta_info_list$pixel$read_file$file_type <- "image" meta_info_list$meta$read_file$meta$sep <- "\t" meta_info_list$coord$read_file$meta$header <- meta_info_list$gex$read_file$meta$header <- meta_info_list$meta$read_file$meta$header <- T meta_info_list$coord$read_file$meta$sep <- "," meta_info_list$gex$read_file$meta$sep <- "\t" meta_info_list$meta$read_file$meta$row.names <- 1 meta_info_list$coord$read_file$meta$quote <- meta_info_list$gex$read_file$meta$quote <- meta_info_list$meta$read_file$meta$quote <- "" meta_info_list$coord$read_file$meta$row.names <- 1 meta_info_list$gex$read_file$meta$row.names <- 1 meta_info_list$coord$factor_id$labels <- 0 meta_info_list$coord$factor_id$coord_x <- 1 meta_info_list$coord$factor_id$coord_y <- 2
To run the analysis appropriately, a set of common genes are required for all spots.
common_genes <- lapply(c(1:length(file_path_list$gex$path)),function(X){ print(X) colnames(GeneCodeR::read_file(path = file_path_list$gex$path[X], meta_info = meta_info_list$gex$read_file)[[1]]) }) meta_info_list$gex$factor$common_genes <- Reduce("intersect",common_genes)
Coordinate labels also need to be extracted.
meta_info_list$coord$factor$labels <- lapply(c(1:length(train_file_path_list$coord$path)),function(X){ row.names(GeneCodeR::read_file(path = train_file_path_list$coord$path[X], meta_info = meta_info_list$coord$read_file)[[1]]) })
Gene expression data is extracted using the path list, meta information, and configuration parameters.
# Extract gex data gex_data <- GeneCodeR::prepare_gex(file_path_list = train_file_path_list,meta_info_list = meta_info_list,config = genecoder.config)
Importantly, labels for the meta information from the gene expression data extraction is required for extracting relevant spots that contain gene expression greater than background.
# Store meta data meta_info_list$gex$factor$labels <- gex_data$labels # Extract spot data genecoder.config$extract_spots$window_size <- 30 spot_data <- GeneCodeR::prepare_spot(file_path_list = train_file_path_list,meta_info_list = meta_info_list,config = genecoder.config, gex_data = gex_data$gex)
Relevant frameworks are set up to begin learning the Generative Encoder model.
Firstly, set up the data list, which stores the gene expression and spot data.
# Set up data data_list <- list(spot_data$spot,as.matrix(spot_data$gex))
The join framework is set up, creating the model structure that learns (1): a similar function for "all" datasets, or (2): separate functions across gene expression and image pixel spots.
Here, only the incode and the beta parameters are set up to learn the individual patterns between genes and images. Given the samples are extracted from the same tissue site (image and genes at the same spot), the sample patterns are shared. The feature patterns and the sample patterns (see first paragraph - genes assigned to expected image patterns of hematoxylin and eosin stains) are mapped to a shared latent space via the alpha and beta codes. For example, the use of image patterns is analogous to pattern matching to the example image spot, the latent codes un-rotate any example image spots until it best fits the image pattern.
# Set up join information join <- gcode::extract_join_framework(F) join$complete$data_list <- c(1,2) join$complete$alpha_sample <- c(1,1) join$complete$beta_sample <- c(1,2) join$complete$alpha_signal <- c(1,1) join$complete$beta_signal <- c(1,2) join$complete$code <- c(1,1) join$covariance <- c(0,0) reference <- gcode::extract_references_framework(F) reference$data_list <- c(1)
Set up configuration information - the dimensions of the encoding space for the samples (i_dim), and the features (j_dim), as well as the dimensions of the latent code (k_dim) are set up. The initialisation for these parameters are set up via the irlba library using partial Singular Value Decomposition.
# Set up gcode config gcode.config <- gcode::extract_config(F) gcode.config$init <- list(alpha_sample="irlba",beta_sample="irlba") gcode.config$i_dim <- 500 gcode.config$j_dim <- 500 gcode.config$dimension_reduction <- F gcode.config$max_iter <- 100 gcode.config$seed <- 1 gcode.config$tol <- 1 gcode.config$n.cores <- 5 gcode.config$learn_rate <- 1e-1 gcode.config$batch_size <- 500
Run the Generative Encoder model
# Set up gcode model genecoder.model <- GeneCodeR::learn_model(data_list = data_list, config = gcode.config, join = join, reference = reference)
Now begins the testing validation phase...
First extract the test set of the coordinate factor labels.
# Evaluate on test set meta_info_list$coord$factor$labels <- lapply(c(1:length(test_file_path_list$coord$path)),function(X){ row.names(GeneCodeR::read_file(path = test_file_path_list$coord$path[X], meta_info = meta_info_list$coord$read_file)[[1]]) })
Next extract the test set of the gene expression data.
# Extract test gex data test_gex_data <- GeneCodeR::prepare_gex(file_path_list = test_file_path_list,meta_info_list = meta_info_list,config = genecoder.config) meta_info_list$gex$factor$labels <- test_gex_data$labels
Save relevant information, including the model, the test meta information, the test file list and the test gene expression data. This is to analyse results later.
base_test_spot_data <- GeneCodeR::prepare_spot(file_path_list = test_file_path_list,meta_info_list = meta_info_list,config = genecoder.config, gex_data = test_gex_data$gex) non_zero_markers <- base_test_spot_data$gex>0 save.image(file = paste(sep="",path_to_save,"all_genecoder.RData"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.