knitr::opts_chunk$set( collapse=TRUE, comment="#>", fig.width=8, fig.height=5 ) options(rmarkdown.html_vignette.check_title=F)
knitr::include_graphics(c("img/dir_structure.png", "img/metadata_file.png"))
The package spatialGE
provides a collection of tools for the visualization of gene expression from spatially-resolved transcriptomic experiments. The data input methods have been designed so that any data can be analyzed as long as it contains gene expression counts per spot or cell, and the spatial coordinates of those spots or cells. Specialized data input options are available to allow interoperability with workflows such as Space Ranger.
The spatialGE
repository is available at GitHub and can be installed via devtools
. To install devtools
(in case it is not already installed in your R), please run the following code:
if("devtools" %in% rownames(installed.packages()) == FALSE){ install.packages("devtools") }
After making sure devtools
is installed, proceed to install spatialGE
:
#devtools::install_github("fridleylab/spatialGE")
To use spatialGE
, load the package.
library('spatialGE')
To show the utility of some of the functions in spatialGE
, we use the spatial transcriptomics data set generated with the platform Visium by @bassiouni_2023. This data set includes triple negative breast cancer biopsies from 22 patients, with two tissue slices per patient. The Visium platform allows gene expression quantitation in approximately 5000 capture areas (i.e., "spots") separated each other by 100μM. The spots are 55μM in diameter, which corresponds to 1-10 cells approximately according to the manufacturer.
Data for all the tissue slices are available at the Gene Expression Omnibus (GEO). For the purpose of this tutorial, we will use eight samples from four patients.
The GEO repositories can be accessed using the following links:
For convenience, the data is also available at our spatialGE_Data GitHub repository. During this tutorial, we will pull the data from this repository. Nonetheless, users are eoncouraged to download the data and explore its contents in order to familiarize with the input formats accepted by spatialGE
. For each sample directory, the file filtered_feature_bc_matrix.h5
contains the gene expression counts. The spatial
sub-directory contains the tissue_positions_list.csv
, tissue_hires_image.png
, and scalefactors_json.json
files. This file structure, corresponds roughly to the file structure in outputs generated by Space Ranger, the software provided by 10X Genomics to process raw Visium data.
In this tutorial, data will be deposited in a temporary directory. However, the download path can be changed within the following code block:
visium_tmp = tempdir() unlink(visium_tmp, recursive=TRUE) dir.create(visium_tmp) download.file('https://github.com/FridleyLab/spatialGE_Data/raw/refs/heads/main/tnbc_bassiouni.zip?download=', destfile=paste0(visium_tmp, '/', 'tnbc_bassiouni.zip'), mode='wb') zip_tmp = list.files(visium_tmp, pattern='tnbc_bassiouni.zip$', full.names=TRUE) unzip(zipfile=zip_tmp, exdir=visium_tmp)
In spatialGE
, raw and processed data are stored in an STlist (S4 class object). As previously mentioned, an STlist can be created with the function STlist
, from a variety of data formats (see here for more info or type ?STlist
in the R console). In this tutorial we will provide the file paths to the folders downloaded in the previous step.
Additionally, we will input meta data associated with each sample. The meta data is provided in the form of a comma-delimited file. We have extracted some of the clinical meta data for the eight samples from the original publication and saved it as part of the spatialGE
package. The user is encouraged to look at the structure of this file by downloading it from the spatialGE_Data GitHub repository. The most important aspect when constructing this file is that the sample names are in the first column, and they match the names of the folders containing the data:
From the temporary directory, we can use R to generate the file paths to be passed to the STlist
function:
visium_folders <- list.dirs(paste0(visium_tmp, '/tnbc_bassiouni'), full.names=TRUE, recursive=FALSE)
The meta data can be accessed directly from the spatialGE
package installed in the computer like so:
clin_file <- list.files(paste0(visium_tmp, '/tnbc_bassiouni'), full.names=TRUE, recursive=FALSE, pattern='clinical')
We can load the files into an STlist using this command:
tnbc <- STlist(rnacounts=visium_folders, samples=clin_file, cores=2)
The tnbc
object is an STlist that contains the count data, spot coordinates, and clinical meta data.
tnbc
As observed, by calling the tnbc
object, information on the number of spots and genes per sample is displayed. For count statistics, the summarize_STlist
function can be used:
summarize_STlist(tnbc)
The minimum number of counts per spot is 42, which seems low. We can look at the distribution of counts and genes per spot using the distribution_plots
function:
cp <- distribution_plots(tnbc, plot_type='violin', plot_meta='total_counts') cp[['total_counts']]
Now, let us remove spots with low counts by keeping only those spots with at least 5000 counts. We will also restrict the data set to spots expressing at least 1000 genes. We also will remove a few spots that have abnormally large number of counts, as in sample_094c. These criteria are not a rule, and samples in each study have to be carefully examined. For example, this criteria may be not enough to reduce the differences in counts, especially for sample_120d and samples_120e.
We run this filter with the filter_data
function:
tnbc <- filter_data(tnbc, spot_minreads=5000, spot_mingenes=1000, spot_maxreads=150000) cp2 <- distribution_plots(tnbc, plot_type='violin', plot_meta='total_counts') cp2[['total_counts']]
The functions pseudobulk_samples
and pseudobulk_pca_plot
are the initial steps to obtain a quick snapshot of the variation in gene expression among samples. The function pseudobulk_samples
creates (pseudo) "bulk" RNAseq data sets by combining all counts from each sample. Then, it log transforms the "pseudo bulk" RNAseq counts and runs a Principal Component Analysis (PCA). Note that the spatial coordinate information is not considered here, which is intended only as an exploratory analysis analysis. The max_var_genes
argument is used to specify the maximum number of genes used for computation of PCA. The genes are selected based on their standard deviation across samples.
tnbc <- pseudobulk_samples(tnbc, max_var_genes=5000)
In this case, we apply the function to look for agreement between samples from the same patient: It is expected that tissue slices from the same patient are more similar among them than tissue slices from other patients. The pseudobulk_pca
allows to map a sample-level variable to the points in the PCA by including the name of the column from the sample metadata (patient_id
in this example).
pseudobulk_dim_plot(tnbc, plot_meta='patient_id')
Users can also generate a heatmap from pseudobulk counts by calling the pseudobulk_heatmap
function, which also requires prior use of the pseudobulk_samples
function. The number of variable genes to show can be controlled via the hm_display_genes
argument.
hm_p <- pseudobulk_heatmap(tnbc, plot_meta='patient_id', hm_display_genes=30)
Many transformation methods are available for RNAseq count data. In spatialGE
, the function transform_data
applies log-transformation to the counts, after library size normalization performed on each sample separately. Similar to Seurat, it applies a scaling factor (scale_f=10000
by default).
tnbc <- transform_data(tnbc, method='log', cores=2)
Users also can apply variance-stabilizing transformation (SCT; @hafemeister_2019), which is another method increasingly used in single-cell and spatial transcriptomics studies. See here for details.
After data transformation, expression of specific genes can be visualized using "quilt" plots. The function STplot
shows the transformed expression of one or several genes. We have adopted the color palettes from the packages khroma
and RColorBrewer
. The name of a color palette can be passed using the argument color_pal
. The default behavior of the function produces plots for all samples within the STlist, but we can pass specific samples to be plotted using the argument samples
.
Let's produce a quilt plot for the genes NDRG1 and IGKC (hypoxia and B-cell markers, respectively), for sample number 1 of patient 14 (samples=sample_094c
).
quilts1 <- STplot(tnbc, genes=c('NDRG1', 'IGKC'), samples='sample_094c', color_pal='BuRd', ptsize=0.8)
Because spatialGE
functions output lists of ggplot objects, we can plot the results side-by-side using functions such as ggarrange()
:
ggpubr::ggarrange(plotlist=quilts1, nrow=1, ncol=2, legend='bottom')
We can see that gene expression patterns of both genes are non-overlapping: IGKC is expressed in the upper right portion of the tissue, whereas NDRG1 is expressed in the right bottom portion (although relatively lower compared to IGKC as indicated by the white-colored spots). The location of gene expression of those two genes may be indicative of an immune-infiltrated area and a tumor area. With the help of spatial interpolation, visualization of these regions can be easier as will be showed next.
We can predict a smooth gene expression surface for each sample. In spatialGE
, this prediction is achieved by using a spatial interpolation method very popular in spatial statistics. The method known as 'kriging' allows the estimation of gene expression values in the un-sampled areas between spots, or cells/spots that were filtered during data quality control. Estimating a transcriptomic surface via kriging assumes that gene expression of two given points is correlated to the spatial distance between them.
The function gene_interpolation
performs kriging of gene expression via the fields
package. We specify that kriging will be performed for two of the spatial samples (samples=c('sample_094c', 'sample_117e')
):
tnbc <- gene_interpolation(tnbc, genes=c('NDRG1', 'IGKC'), samples=c('sample_094c', 'sample_117e'), cores=2)
Generating gene expression surfaces can be time consuming. The finer resolution to which the surface is to be predicted (ngrid
argument), the longer the time it takes. The execution time also depends on the number of spots/cells. The surfaces can be visualized using the STplot_interpolation()
function:
kriges1 <- STplot_interpolation(tnbc, genes=c('NDRG1', 'IGKC'), samples='sample_094c') #ggpubr::ggarrange(plotlist=kriges1, nrow=1, ncol=2, common.legend=TRUE, legend='bottom')
By looking at the transcriptomic surfaces of the two genes, it is easier to detect where "pockets" of high and low expression are located within the tissue. It is now more evident that expression of NDRG1 is higher in the lower right region of the tumor slice, as well in a smaller area to the left (potentially another hypoxic region).
If tissue images were uploaded to the STlist (tissue_hires_image.png.gz
), it can be displayed for comparison using the plot_image
function:
ip = plot_image(tnbc, samples='sample_094c') ip[['image_sample_094c']]
Detecting tissue compartments or niches is an important part of the study of the tissue architecture. We can do this by applying STclust
, a spatially-informed clustering method implemented in spatialGE
. The STclust
method uses weighted average matrices to capture the transcriptomic differences among the cells/spots. As a first step in STclust
, top variable genes are identified via Seurat's FindVariableFeatures
, and transcriptomic scaled distances are calculated using only those genes. Next, scaled euclidean distances are computed from the spatial coordinates of the spots/cells. The user defines a weight (ws
) from 0 to 1, to apply to the physical distances. The higher the weight, the less biologically meaningful the clustering solution is, given that the clusters would only reflect the physical distances between the spots/cells and less information on the transcriptomic profiles will be used. After many tests, we have found that weights between 0.025 - 0.25 are enough to capture the tissue heterogeneity. By default, STclust
uses dynamic tree cuts [@langfelder_2008] to define the number of clusters. But users can also test a series of k values (ks
). For a more detailed description of the method, please refer to the paper describing spatialGE
and STclust
[@ospina_2022].
We'll try several weights to see it's effect on the cluster assignments:
tnbc <- STclust(tnbc, ks='dtc', ws=c(0, 0.025, 0.05, 0.2), cores=2)
Results of clustering can be plotted via the STplot
function:
cluster_p <- STplot(tnbc, samples='sample_094c', ws=c(0, 0.025, 0.05, 0.2), color_pal='highcontrast') ggpubr::ggarrange(plotlist=cluster_p, nrow=2, ncol=2, legend='right')
We can see that from w=0
and w=0.05
, we can only detect two tissue niches. At w=0.025
, we gain higher resolution as one of the clusters is split and a third ('yellow') cluster appears, potentially indicating that an different transcriptional profile is present there. At w=0.2
, the clusters seem too compact, indicating that the weight of spatial information is probably too high. We have used here the dynamic tree cuts (dtc
) to automatically select the number of clusters, resulting in very coarse resolution tissue niches, however, users can define their own range of k to be evaluated, allowing further detection of tissue compartments.
To explore the relationship between a clinical (sample-level) variable of interest and the level of gene expression spatial uniformity within a sample, we can use the SThet()
function:
tnbc <- SThet(tnbc, genes=c('NDRG1', 'IGKC'), method='moran', cores=2)
The SThet
function calculates the Moran's I statsitic (or Geary's C) to measure the level of spatial heterogeneity in the expression of the genes ( NDRG1, IGKC). The estimates can be compared across samples using the function compare_SThet()
p <- compare_SThet(tnbc, samplemeta='overall_survival_days', color_by='patient_id', gene=c('NDRG1', 'IGKC'), color_pal='muted', ptsize=3) p
The calculation of spatial statistics with SThet
and and multi-sample comparison with compare_SThet
provides and easy way to identify samples and genes exhibiting spatial patterns. The previous figure shows that expression of NDRG1 is more spatially uniform (lower Moran's I) across the tissues in samples from patients 2 and 8 compared to patents 9 and 14. The samples with higher spatial uniformity in the expression of NDRG1 also tended to have higher overall survival. Trends are less clear for IGKC, however, it looks like samples where expression of IGKC was spatially aggregated in "pockets" (higher Moran's I) tended to have lower survival (but note that patient 9 does not follow this trend). As studies using spatial transcriptomics become larger, more samples will provide more insightful patterns into the association of gene expression spatial distribution and non-spatial traits associated with the tissues.
The computed statistics are stored in the STlist for additional analysis/plotting that the user may want to complete. The statistics value can be accessed as a data frame using the get_gene_meta
function:
get_gene_meta(tnbc, sthet_only=TRUE)
SThet
can be interpreted?
See the table below for a simplistic interpretation of the spatial autocorrelation statistics calculated in spatialGE
:
library('magrittr') # Use of pipe operator sphet_info <- tibble::tibble( "Statistic"=c('Moran’s I', 'Geary’s C'), "Clustered expression"=c('Closer to 1', 'Closer to 0'), "No expression pattern"=c('Closer to 0', 'Closer to 1'), "Uniform expression"=c('Closer to -1', 'Closer to 2') ) kbl_t = kableExtra::kbl(sphet_info, align='c', centering=TRUE) %>% kableExtra::kable_styling(position="center", full_width=FALSE) kableExtra::add_footnote(kbl_t, c("Note: The boundaries indicated for each statistic are reached when the number of spots/cells is very large."), notation="none")
To better understand how the Moran's I and Geary's C statistics quantify spatial heterogeneity, tissue can be simulated using the scSpatialSIM
and spatstat
packages. Also tidyverse
and janitor
for some data manipulation.
library('rpart') library('spatstat') library('scSpatialSIM') library('tidyverse') library('janitor')
The sc.SpatialSIM
package uses spatial point processes to simulate the locations of spots/cells within a tissue. To facilitate interpretability of the Moran's I and Geary's C statistics, first tissue will be simulated, and then gene expression values will be simulated.
The first step is to create a spatstat
observation window:
wdw <- owin(xrange=c(0, 3), yrange=c(0, 3)) sim_visium <- CreateSimulationObject(sims=1, cell_types=1, window=wdw)
Next, the sc.SpatialSIM
is used to generate the spatial point process (positions of the spots) within the observation window. Then, assignments of spots to tissue domains are simulated and visualized:
# Generate point process # Then, simulate tissue compartments set.seed(12345) sim_visium <- GenerateSpatialPattern(sim_visium, gridded=TRUE, overwrite=TRUE) %>% GenerateTissue(k=1, xmin=1, xmax=2, ymin=1, ymax=2, sdmin=1, sdmax=2, overwrite=TRUE) PlotSimulation(sim_visium, which=1, what="tissue points") + scale_shape_manual(values=c(19, 1))
Now, the simulated tissue domain assignments are extracted from the SpatSimObj
object. Gene counts will be simulated in such a way that:
# Extract tissue assignments from the `SpatSimObj` object # Simulate expression of 'gene_1' sim_visium_df <- sim_visium@`Spatial Files`[[1]] %>% clean_names() %>% mutate(gene_1=case_when(tissue_assignment == 'Tissue 1' ~ 1, TRUE ~ 0.1)) # Generate expression patter of "gene_2" for(i in 1:nrow(sim_visium_df)){ if(i%%2 == 0){ sim_visium_df[i, 'gene_2'] = 1 } else{ sim_visium_df[i, 'gene_2'] = 0.1 } } # Generate expression of "gene_3" # Set seed for resproducibility set.seed(12345) sim_visium_df[['gene_3']] <- sample(sim_visium_df[['gene_2']])
To visualize the simulated expression and run SThet
, an STlist is created:
# Extract simulated expression data sim_expr <- sim_visium_df %>% add_column(libname=paste0('spot', seq(1:nrow(.)))) %>% select(c('libname', 'gene_1', 'gene_2', 'gene_3')) %>% column_to_rownames('libname') %>% t() %>% as.data.frame() %>% rownames_to_column('genename') # Extract simulated spot locations sim_xy <- sim_visium_df %>% add_column(libname=paste0('spot', seq(1:nrow(.)))) %>% select(c('libname', 'y', 'x')) # The `STlist` function can take a list of data frames simulated <- STlist(rnacounts=list(sim_visium=sim_expr), spotcoords=list(sim_visium=sim_xy), cores=2) # Plot expression ps <- STplot(simulated, genes=c('gene_1', 'gene_2', 'gene_3'), data_type='raw', color_pal='sunset', ptsize=1) ggpubr::ggarrange(plotlist=ps, ncol=3)
As expected, simulated expression of "gene_1" is aggregated (i.e., clustered, red spots concentrated towards the center of the tissue). This result was intently obtained by setting the simulation parameters to xmin=2
, xmax=3
, ymin=2
, ymax=3
in sc.SpatialSIM
.
# The `SThet` function requires normalized data simulated <- transform_data(simulated, cores=2) # Run `SThet` simulated <- SThet(simulated, genes=c('gene_1', 'gene_2', 'gene_3'), method=c('moran', 'geary')) # Extract results get_gene_meta(simulated, sthet_only=TRUE)
The results for each of the metrics are as expected: "gene_1" shows aggregation/clustering, indicative of "hot-spot" expression. "gene_2" and "gene_3" show uniform and random expression respectively. Notice that none of these values are very far from the "random" expectation (Moran's I = 0 and Geary's C = 1). One of reasons for this result is the effect of library-size normalization. In addition, obtaining extreme values of I and C require extreme spatial patterns, which are unlikely to be observed even in the way data has been simulated here.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.