knitr::opts_chunk$set(echo = TRUE)
Here is a link to the text of these exercises.
A second 10X Genomics single cell RNA-seq dataset is included in the pbda package. This dataset is derived from a human solid tumor. Get its path using the code below:
tumor_10x <- system.file("extdata", "tumor_gene_bc_matrices", "hg19", package = "pbda")
Using the new data, create a new Seurat object named tumor. Next, extract the list of mitochondrial genes in the matrix, calculate the percentage of reads derived from the mitochondrial genome in each cell, and add this to a metadata slot called "percent.mito" in the tumor object.
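The mitochondrial calculation described above can be sketched in base R on a toy matrix. The `counts` matrix and its values are made up for illustration, and the sketch assumes mitochondrial genes are the ones whose symbol starts with `MT-` (the usual convention for human gene symbols). It returns a fraction per cell; multiply by 100 for a percentage.

```r
# Toy genes-by-cells count matrix (hypothetical values, not from the tumor dataset)
counts <- matrix(
  c(5,  0, 3,
    1,  2, 0,
    4,  8, 7,
    0, 10, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("MT-CO1", "MT-ND1", "ACTB", "GAPDH"),
                  c("cell1", "cell2", "cell3"))
)

# Assumption: mitochondrial genes are the rows whose symbol starts with "MT-"
mito_genes <- grep("^MT-", rownames(counts), value = TRUE)

# Fraction of each cell's counts that come from mitochondrial genes
percent_mito <- colSums(counts[mito_genes, , drop = FALSE]) / colSums(counts)
print(round(percent_mito, 3))
```

With the real data, the same per-cell vector would be computed from the raw count matrix of the tumor object and then attached as metadata, for example with AddMetaData().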
Now, using the GenePlot() function, plot nUMI vs. nGene and nUMI vs. percent.mito. Based on these plots, filter the data to remove dead cells and likely doublets (using FilterCells()).
Now that you've filtered your cells, normalize the data (NormalizeData()), then identify variable genes using the code below (the cutoffs have been adjusted for this dataset; don't worry about interpreting the VMR plot):
tumor <- FindVariableGenes(tumor, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.025, x.high.cutoff = 3, y.cutoff = 1)
Next, scale the data and run PCA on your tumor object using the variable genes you've just identified.
Finally, use an elbow plot and/or PC heatmaps to select the principal components to use for tSNE and clustering. Justify your choice of PCs (do they capture the full variation present in the data?).
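To make the elbow idea concrete, here is a rough sketch using base R's prcomp() on simulated data rather than Seurat's own plotting functions. The `toy` matrix is hypothetical; the point is only to show how the standard deviation per PC and the cumulative variance explained can be inspected when deciding where the curve flattens.

```r
set.seed(7)

# Hypothetical cells-by-genes matrix: 100 cells, 20 genes
toy <- matrix(rnorm(100 * 20), nrow = 100)
# Shift half of the cells in three genes so there is real structure to find
toy[1:50, 1:3] <- toy[1:50, 1:3] + 4

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Elbow plot: standard deviation of each PC against its rank
plot(pca$sdev, type = "b", xlab = "PC", ylab = "Standard deviation")

# Cumulative share of the total variance explained by the first PCs
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
print(round(var_explained[1:5], 2))
```

PCs after the point where the curve levels off add little beyond noise in this toy case; on real data the cutoff is a judgment call, which is why the exercise asks for a justification.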
Using the PCs selected in Question 2, run the tSNE algorithm on tumor and identify clusters using FindClusters(). Plot the resulting tSNE. How many clusters are identified? Does this appear reasonable based on the structure present in the tSNE projection?
Included in the package is a matrix called tx_rates containing transcription rates for ~7000 genes. Transcription rates were measured every 15 minutes for 3 hours after exposure of dendritic cells to LPS, which activates innate immunity responses. The processed data was obtained from GEO.
For this exercise, perform the following steps:
1) Exclude genes that have rates of 0 across all samples.
2) Log transform the data (add a pseudocount of 1 to avoid infinite values).
3) Normalize the data to the 0 hour time point (i.e. subtract the "rate at 0" from all of the columns).
4) Run kmeans clustering on the log-transformed and normalized matrix. Choose a value of k between 5 and 10 clusters.
5) Make a heatmap with rows split based on the clusters identified in step 4. This may take a few minutes to plot, depending on your computer's resources. Use the code below to generate a color palette. Discuss any interesting patterns in your clusters in the interpretation.
cols_to_use <- circlize::colorRamp2(c(-4, 0, 4), c("blue", "white", "red"))
# pseudocode
# Heatmap(mat, col = cols_to_use, ...
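Steps 1 to 4 can be sketched in base R as follows. This is an illustration only: `toy_rates` is a simulated stand-in for tx_rates, the 0 time point is assumed to be the first column, log2 is an assumed choice of base (the exercise does not specify one), and k = 6 is an arbitrary pick from the suggested range.

```r
set.seed(42)  # kmeans uses random starts, so fix the seed for reproducibility

# Hypothetical stand-in for tx_rates: 60 genes x 5 time points
toy_rates <- matrix(rexp(300, rate = 0.5), nrow = 60,
                    dimnames = list(paste0("gene", 1:60),
                                    c("0", "15", "30", "45", "60")))
toy_rates[1:5, ] <- 0  # a few genes with no signal in any sample

# 1) Drop genes whose rate is 0 in every sample
kept <- toy_rates[rowSums(toy_rates != 0) > 0, , drop = FALSE]

# 2) Log transform with a pseudocount of 1 (log2 is an assumption)
logged <- log2(kept + 1)

# 3) Subtract the 0 time point (assumed to be column 1) from every column;
#    the length-nrow vector is recycled down each column
normalized <- logged - logged[, 1]

# 4) k-means clustering with k chosen between 5 and 10
km <- kmeans(normalized, centers = 6, nstart = 10)
print(table(km$cluster))
```

The named vector `km$cluster` holds one cluster label per gene, which is what the heatmap in step 5 needs in order to split its rows.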
Using your tidyverse skills, plot the average transcription rate for each cluster identified in Q4 using ggplot. Facet by cluster. (Hint: gather and tidy the matrix, then join with the cluster assignments.)
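One possible shape for the gather, join, and facet workflow from the hint is sketched below. The matrix `mat`, the `clusters` table, and the column names (time points in minutes) are hypothetical stand-ins for the Q4 results, not the real data.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

# Hypothetical normalized matrix and cluster assignments (stand-ins for Q4 output)
set.seed(1)
mat <- matrix(rnorm(40), nrow = 8,
              dimnames = list(paste0("gene", 1:8),
                              c("0", "15", "30", "45", "60")))
clusters <- tibble(gene = rownames(mat), cluster = rep(1:2, each = 4))

# Tidy the matrix: one row per gene and time point
tidy_rates <- as.data.frame(mat) %>%
  rownames_to_column("gene") %>%
  gather(key = "time", value = "rate", -gene) %>%
  mutate(time = as.numeric(time))

# Join the cluster assignments, then average within each cluster and time point
cluster_means <- tidy_rates %>%
  inner_join(clusters, by = "gene") %>%
  group_by(cluster, time) %>%
  summarize(mean_rate = mean(rate)) %>%
  ungroup()

# One panel per cluster
p <- ggplot(cluster_means, aes(x = time, y = mean_rate)) +
  geom_line() +
  geom_point() +
  facet_wrap(~cluster) +
  labs(x = "Time (min)", y = "Mean rate")
print(p)
```

With real kmeans output, the `clusters` table could be built from the names and values of the cluster vector before joining.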