knitr::opts_chunk$set(
  message = FALSE,  warning = FALSE,
  collapse = TRUE,
  fig.width = 5, fig.height = 2.5,
  comment = "#>"
)

Intro

Here we go through a process where we pull the raw gene expression counts and sample metadata, inspect the metata, do some custom filtering by samples and gene, plot the data, aggregate the data and export to a csv file.

Load libraries, import count data and metadata

The counts import will take some time, as this file is ~200mb

library(tidyverse)
library(edgeR)
# if the above library(tidyverse) step throws an error, you need to install tidyverse as follows:
# install.packages('tidyverse')
# BiocManager::install('edgeR')

# https://eyeintegration.nei.nih.gov -> Data -> Data Download for links

# you may want to just paste these links into your browser and download the files
# you then would have to update the path to find them on your local computer
# this will save downloading the 200mb file each time you run this
metadata <- data.table::fread("https://hpc.nih.gov/~mcgaugheyd/eyeIntegration/2023/eyeIntegration23_meta_2023_09_01.built.csv.gz")
gene_counts <- data.table::fread("https://hpc.nih.gov/~mcgaugheyd/eyeIntegration/2023/gene_counts.csv.gz")

Let's peek at the two data frames we loaded into R

metadata %>% sample_n(10)

gene_counts[1:10,c(1,100:110)]

Quickly inspect what kind of metadata and sample counts we have

metadata %>% colnames() %>% sort()
metadata %>% group_by(Tissue, Sub_Tissue, Source, Age, Perturbation, Sex_ML) %>% summarise(Count = n()) %>% DT::datatable()

We can aggregate data to the sub_tissue level in a few steps.

First convert to CPM

These are raw counts. So a sample "A" with twice as many reads as sample "B" will have each gene appear to have about twice the expression. So we will use CPM normalization to normalize the total counts.

gene_counts_mat <- gene_counts[,2:ncol(gene_counts)]
gene_cpm_mat <- edgeR::cpm(gene_counts_mat)
# quick check on first sample
gene_cpm_mat[,1] %>% sum()

# add back gene names
row.names(gene_cpm_mat) <- gene_counts$Gene

Grab human readable gene names conversion table

gene_tx_table <- data.table::fread("https://hpc.nih.gov/~mcgaugheyd/eyeIntegration/2023/gene_anno.csv.gz")
gene_tx_table %>% select(-Name) %>% sample_n(10)

Make the data long with pivot_longer

This is to facilitate adding metadata to the counts

gene_counts_long <- gene_cpm_mat %>% 
  as_tibble(rownames = 'Gene') %>% 
  pivot_longer(cols=2:ncol(gene_counts), names_to = 'sample_accession', values_to = 'CPM')

gene_counts_long %>% sample_n(10)

Now left_join to label each sample

We chose to label with the following:

gene_counts_long <- left_join(gene_counts_long, 
                           metadata %>% 
                             select(sample_accession, Tissue, Sub_Tissue, 
                                    Source, Age, Sex_ML, study_title) %>% 
                             unique(), 
                           by = 'sample_accession')

gene_counts_long %>% sample_n(10)

Now we can create a custom expression plot

The eyeIntegration data is presented by the Tissue - Sub_Tissue - Source - Age - Perturbation grouping

As we are working in R we can easily do some custom filtering and grouping.

Let's do this: 1. Cornea and Skin only 2. Filter to retain only Primary culture or Native sources of tissue 3. Filter Skin to remove Transformed fibroblasts 3. Group by Tissue - Source - Sex_ML

Plot LUM, MITF, and DCN expression. We use a fancier "geom_quasirandom" to give the points some shaping. We don't use this on the web app because it is computationally expensive.

plot_data <- gene_counts_long %>% 
  filter(Tissue %in% c("Cornea","Skin"),
         !Sub_Tissue %in% 'Transformed fibroblasts',
         Source %in% c("Native","Primary Culture")) %>% 
  mutate(ID = paste(Tissue, Source, Age, Sex_ML, sep = ' - ')) %>% 
  left_join(gene_tx_table %>% select(gene_id, gene_name) %>% unique(), by = c("Gene" = "gene_id")) %>% 
  filter(gene_name %in% c("LUM","MITF","DCN")) 

plot_data %>% 
  ggplot(aes(y=ID,x=log2(CPM+1), color = study_title)) +
  geom_boxplot(color = 'black') +
  ggbeeswarm::geom_quasirandom() + 
  facet_wrap(~gene_name) +
  scale_color_manual(values = pals::alphabet2() %>% unname())

Data in table form

A sort of fancy set of steps where we:

  1. Aggregate expression to the study_title level (i.e. if a study has eight samples, we take the mean of those eight samples)
  2. "Widen" the data from its long form to make it fit more tightly in an output file (and in a form that most people are familiar with in Excel)
table <- plot_data %>% 
  group_by(gene_name, study_title, ID) %>% 
  summarise(meanCPM = log2(mean(CPM)+1)) %>% 
  pivot_wider(values_from = meanCPM, names_from = c(ID,study_title))

table

Output table to csv for excel viewing

table %>% write_csv(file = 'my_table.eyeIntegration.csv')
sessionInfo()


davemcg/eyeIntegration_app documentation built on May 18, 2024, 1:37 p.m.