knitr::opts_chunk$set( message = FALSE, warning = FALSE, collapse = TRUE, fig.width = 6, fig.height = 4.5, comment = "#>" )
This guide will illustrate how you can process your own RNAseq data and utilize it in conjunction with our Eye in a Disk (EiaD) database to conduct differential gene expression analysis.
We will begin by downloading and loading necessary packages for use throughout this tutorial:
Comment out the install.packages
if you get errors on the library
loads below
# # The following packages will be used throughout this tutorial for data manipulation # install.packages("tidyverse") # install.packages("tibble") # install.packages("data.table") # install.packages("tximport") # Load packages library(tidyverse) library(tibble) library(data.table) library(tximport)
In order to combine your personal RNA-seq dataset with our EiaD database for use in gene expression analysis, it must first be processed using an RNAseq processing tool(kallisto/STAR/etc.). In this tutorial we will be using Salmon. For a a full explanation of how to use Salmon, please visit the latest salmon documentation here{.uri}.
Importing the EiaD gene counts will take some time, as this file is over 175 MB. If you would like to, you can also swap out gene counts for transcript counts, which can be found here{.uri}. If you utilize the EiaD transcript counts, your local counts will need to be aggregated to the gene level.
# Retrieve metadata and gene counts for EiaD metadata <- data.table::fread("https://hpc.nih.gov/~mcgaugheyd/eyeIntegration/2023/eyeIntegration23_meta_2023_09_01.built.csv.gz") gene_counts <- data.table::fread("https://hpc.nih.gov/~mcgaugheyd/eyeIntegration/2023/gene_counts.csv.gz")
We will compare adult human retina tissue against iPSC neuron tissue. The table below shows that ERP126780
(study_accession column) is a possibility for the retina sample. Note how we also save exactly which sample_accessions
meet the criteria for our desired sample.
metadata %>% filter(Tissue == 'Retina', Source == 'Native', Age == 'Adult') EiaD_retina_samples <- metadata %>% filter(Tissue == 'Retina', Source == 'Native', Age == 'Adult') %>% filter(study_accession == 'ERP126780') %>% pull(sample_accession)
After quantifying our RNAseq dataset using Salmon, we must import our transcript-level estimates and aggregate them to the gene level for gene-level differential expression analysis using tximport. A full guide for the R package can be found here{.uri}.
You can use the following steps to process and load your personal data:
SRP096788
(non-EiaD iPSC neuron data) for thisexampleSalmon version 1.10.2 was used throughout this guide
The chunk must be run on the (bash) command line (like in "Terminal" on a Mac) NOT in R.
```{bash download, eval = FALSE}
fasterq-dump SRR5177944 fasterq-dump SRR5177946 fasterq-dump SRR5177947
## Get pre-made reference This is not the one used in our pipeline, but it is substantially smaller (about 2GB instead of 17GB). If you still want ours, it is [here](https://hpc.nih.gov/~mcgaugheyd/eyeIntegration/2023/salmon_index_gencode.v44_salmon.tar) ```{bash refgenie, eval = FALSE} wget -O index.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/salmon_partial_sa_index?tag=default tar -xzvf index.tgz rm index.tgz
```{bash, eval = FALSE} salmon quant --index ../default --libType A --seqBias --gcBias --validateMappings \ -p 8 \ SRR5177944.fastq \ -o SRR5177944
salmon quant --index ../default --libType A --seqBias --gcBias --validateMappings \ -p 8 \ SRR5177946.fastq \ -o SRR5177946
salmon quant --index ../default --libType A --seqBias --gcBias --validateMappings \ -p 8 \ SRR5177947.fastq \ -o SRR5177947
## Load annotation data and create transcript-to-gene mapping Since we have counts at the transcript level, we will create a mapping between each transcript and its associated gene so that your locally processed data can be mapped to the gene level: ```r exon_anno <- rtracklayer::readGFF('https://hpc.nih.gov/~mcgaugheyd/eyeIntegration/2023/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.gtf.gz') tx2gene <- exon_anno %>% filter(!is.na(transcript_id)) %>% mutate(tx = paste0(transcript_id, '.', transcript_version), gene = paste0(gene_id, '.', gene_version)) %>% dplyr::select(tx, gene) %>% unique()
Now, we can use this mapping to load in our Salmon-quantified transcripts:
# import with tximport txi <- tximport(c('SRR5177944/quant.sf', 'SRR5177946/quant.sf', 'SRR5177947/quant.sf'), type = 'salmon', countsFromAbundance = 'no', txOut = FALSE, tx2gene = tx2gene)
DESeq2 requires metadata for each sample. Some metadata fields we use in EiaD that you can replicate to make your life easier are: - Tissue - Sub_Tissue - Age - Perturbation
# Create metadata for your locally processed samples local_metadata <- data.frame(sample_accession = c("SRR5177944", "SRR5177946", "SRR5177947"), Tissue = c("Neuron", "Neuron", "Neuron"), Sub_Tissue = c(NA, NA, NA), Age = c(NA, NA, NA), Source = c("iPSC", "iPSC", "iPSC"), Perturbation = c("ALS", NA, NA))
The example above illustrates how to process your RNAseq data using Salmon and load it into R. If your RNAseq data was processed using a tool other than Salmon, please refer to the DESeq2 guide to learn more about how you can load your data into R for use with DESeq2.
txi
object# match up row names EiaD_data <- gene_counts[, ..EiaD_retina_samples] %>% as.matrix() row.names(EiaD_data) <- gene_counts$Gene # rows to keep EiaD_keep <- row.names(EiaD_data) %in% row.names(txi$counts) EiaD_data_subset <- EiaD_data[EiaD_keep, EiaD_retina_samples] # update txi$counts colnames to something informative txi$counts <- cbind(txi$counts[EiaD_data_subset %>% row.names,], EiaD_data_subset) %>% as.matrix() colnames(txi$counts) <- c('SRR5177944', 'SRR5177946', 'SRR5177947', EiaD_retina_samples) # copy length data of first to match number of samples you add txi$length <- cbind(txi$length, replicate(ncol(EiaD_data_subset), txi$length[,1])) # confirm it works by looking at a few rows (if you see NA a problem has happened) txi$counts %>% as_tibble(rownames = 'ID') %>% sample_n(10) # Build combined DESeq counts matrix for EiaD and locally processed samples DESeq_counts <- txi$counts
# Pull EiaD metadata EiaD_metadata <- metadata %>% filter(sample_accession %in% c(EiaD_retina_samples)) # Combine metadata with the metadata we created above for our locally processed iPSC neuron samples DESeq_meta <- bind_rows(local_metadata, EiaD_metadata)
# install.packages("DESeq2") library(DESeq2) # build dds dds_EiaD_iPSC_neuron <- DESeqDataSetFromMatrix(countData = round(DESeq_counts), colData = DESeq_meta, design = ~ Tissue) # build deseq2 table DESeq2Table <- DESeq(dds_EiaD_iPSC_neuron)
top100 <- results(DESeq2Table) %>% as_tibble(rownames = 'ensembl_gene_id') %>% arrange(padj) %>% head(100) top100
We see how some of the top positive and negative log2FoldChange genes are associated with Retina and iPSC Motor Neurons, respectively.
You can click on the column fields to sort by that column.
# library(org.Hs.eg.db) conv_table <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db, keys=gsub('\\.\\d+','',top100$ensembl_gene_id), columns=c("ENSEMBL", "SYMBOL", "MAP","GENENAME"), keytype="ENSEMBL") top100 %>% # ugly code to remove the .digit ending to get aligment of ethe ensembl id mutate(ENSEMBL = gsub('\\.\\d+','',ensembl_gene_id)) %>% left_join(conv_table, by = 'ENSEMBL') %>% relocate(SYMBOL, GENENAME) %>% arrange(-log2FoldChange) %>% DT::datatable()
results(DESeq2Table) %>% as_tibble(rownames = 'ENSEMBL') %>% arrange(padj) %>% head(10) %>% mutate(ENSEMBL = gsub('\\.\\d+','',ENSEMBL)) %>% left_join(conv_table, by = 'ENSEMBL') %>% relocate(SYMBOL, GENENAME) %>% arrange(log2FoldChange) %>% DT::datatable()
We see that PC1 separates retina and iPSC neuron tissue.
vst <- varianceStabilizingTransformation(DESeq2Table) library(matrixStats) ntop = 1000 Pvars <- rowVars(assay(vst)) select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))] PCA <- prcomp(t(assay(vst)[select, ]), scale = F) percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1) dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2], PC3 = PCA$x[,3], PC4 = PCA$x[,4], Tissue = colData(vst)$Tissue, Source=colData(vst)$Source) ggplot(dataGG, aes(PC1, PC2, color=Tissue, shape=Source)) + geom_point(size=3) + xlab(paste0("PC1: ",percentVar[1],"% variance")) + ylab(paste0("PC2: ",percentVar[2],"% variance")) + cowplot::theme_cowplot()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.