knitr::opts_chunk$set( message = FALSE, warning = FALSE, collapse = TRUE, fig.width = 5, fig.height = 2.5, comment = "#>" )
Here we go through a process where we pull the raw gene expression counts and sample metadata, inspect the metata, do some custom filtering by samples and gene, plot the data, aggregate the data and export to a csv file.
The counts
import will take some time, as this file is ~200mb
library(tidyverse) library(edgeR) # if the above library(tidyverse) step throws an error, you need to install tidyverse as follows: # install.packages('tidyverse') # BiocManager::install('edgeR') # https://eyeintegration.nei.nih.gov -> Data -> Data Download for links # you may want to just paste these links into your browser and download the files # you then would have to update the path to find them on your local computer # this will save downloading the 200mb file each time you run this metadata <- data.table::fread("https://hpc.nih.gov/~mcgaugheyd/eyeIntegration/2023/eyeIntegration23_meta_2023_09_01.built.csv.gz") gene_counts <- data.table::fread("https://hpc.nih.gov/~mcgaugheyd/eyeIntegration/2023/gene_counts.csv.gz")
metadata %>% sample_n(10) gene_counts[1:10,c(1,100:110)]
metadata %>% colnames() %>% sort() metadata %>% group_by(Tissue, Sub_Tissue, Source, Age, Perturbation, Sex_ML) %>% summarise(Count = n()) %>% DT::datatable()
These are raw counts. So a sample "A" with twice as many reads as sample "B" will have each gene appear to have about twice the expression. So we will use CPM normalization to normalize the total counts.
gene_counts_mat <- gene_counts[,2:ncol(gene_counts)] gene_cpm_mat <- edgeR::cpm(gene_counts_mat) # quick check on first sample gene_cpm_mat[,1] %>% sum() # add back gene names row.names(gene_cpm_mat) <- gene_counts$Gene
gene_tx_table <- data.table::fread("https://hpc.nih.gov/~mcgaugheyd/eyeIntegration/2023/gene_anno.csv.gz") gene_tx_table %>% select(-Name) %>% sample_n(10)
pivot_longer
This is to facilitate adding metadata to the counts
gene_counts_long <- gene_cpm_mat %>% as_tibble(rownames = 'Gene') %>% pivot_longer(cols=2:ncol(gene_counts), names_to = 'sample_accession', values_to = 'CPM') gene_counts_long %>% sample_n(10)
left_join
to label each sampleWe chose to label with the following:
gene_counts_long <- left_join(gene_counts_long, metadata %>% select(sample_accession, Tissue, Sub_Tissue, Source, Age, Sex_ML, study_title) %>% unique(), by = 'sample_accession') gene_counts_long %>% sample_n(10)
The eyeIntegration data is presented by the Tissue
- Sub_Tissue
- Source
- Age
- Perturbation
grouping
As we are working in R we can easily do some custom filtering and grouping.
Let's do this:
1. Cornea
and Skin
only
2. Filter to retain only Primary culture
or Native
sources of tissue
3. Filter Skin
to remove Transformed fibroblasts
3. Group by Tissue
- Source
- Sex_ML
Plot LUM, MITF, and DCN expression. We use a fancier "geom_quasirandom" to give the points some shaping. We don't use this on the web app because it is computationally expensive.
plot_data <- gene_counts_long %>% filter(Tissue %in% c("Cornea","Skin"), !Sub_Tissue %in% 'Transformed fibroblasts', Source %in% c("Native","Primary Culture")) %>% mutate(ID = paste(Tissue, Source, Age, Sex_ML, sep = ' - ')) %>% left_join(gene_tx_table %>% select(gene_id, gene_name) %>% unique(), by = c("Gene" = "gene_id")) %>% filter(gene_name %in% c("LUM","MITF","DCN")) plot_data %>% ggplot(aes(y=ID,x=log2(CPM+1), color = study_title)) + geom_boxplot(color = 'black') + ggbeeswarm::geom_quasirandom() + facet_wrap(~gene_name) + scale_color_manual(values = pals::alphabet2() %>% unname())
A sort of fancy set of steps where we:
study_title
level (i.e. if a study has eight samples, we take the mean of those eight samples)table <- plot_data %>% group_by(gene_name, study_title, ID) %>% summarise(meanCPM = log2(mean(CPM)+1)) %>% pivot_wider(values_from = meanCPM, names_from = c(ID,study_title)) table
table %>% write_csv(file = 'my_table.eyeIntegration.csv')
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.