library(crossr)

Introduction

Crossr contains a set of functions and a workflow that I use to compare gene expression levels between species.

To achieve this, crossr relies extensively on exploratory data analysis methods and not on strict statistical tests. Designing a statistical test for differential expression among species is very difficult because of the complexity of the underlying data and because of the low replicate number that we generally use in transcriptomics experiments.

Providing a full statistical test for differential expression among species is beyond the scope of this package and beyond my capability. What crossr provides instead is a simple explorative workflow that will allow you to make useful observation of how groups of homologous genes behave in different species.

Crossr builds on the output of :

Here I will illustrate the workflow of crossr on the GSE69077 dataset, which was published by Rawat et al. in 2015. This dataset contains matched RNA-seq samples of heat-shocked and recovered Arabidopsis thaliana and Arabidopsis lyrata plants. This experimental setup perfectly matches what crossr is designed for.

In order to make this vignette lighter, I will use only a random subset of 500 orthogroups from original dataset. The simplified dataset is available in the exdata folder.

Map RNA-seq Reads and Quantify Transcripts

Mapping of RNA-seq reads is not directly a part of crossr's workflow. Crossr requires only two normalized gene expression matrices (one for each species taken into consideration) as starting point.

The RNA-seq reads can be mapped with Salmon. Salmon is fast, precise and easy to use, moreover it directly quantifies gene/transcript expression and provides this as an output. To parse and load the output from Salmon into R we provide the function make_TPM_df().

These are the bash commands that we use to quantify RNA-seq reads with Salmon (at this moment we have tested Salmon version 0.8.2):

```{bash, eval = FALSE}

make index

./Salmon-0.8.2_linux_x86_64/bin/salmon\ index -t path/to/genome.fasta\ -i th_transcripts_index\ --type quasi -k 31

Quantify and save mapping to SAM files for diagnostics

for f in reads/*.fastq; do

base=${f%.fastq} echo $base base=${base#reads/} echo $base outsam=path/to/sam/folder/$base.sam echo $outsam

./Salmon-0.8.2_linux_x86_64/bin/salmon quant \ -i path/to/index -l U \ -r $f -o $base\ -p 8\ --writeMappings=$outsam done

# Import transcript quantification and metadata in R

You will have to import in R transcript quantifications and sample metadata.

## Import transcript quantification from Salmon

You can import the output of Salmon into R with the function `make_TPM_df()`. This function takes as input the **path** to the folder that the quantification output from Salmon.

In detail, the function `make_TPM_df()`:

1. extracts the TPM values from all the `quant.sf` files in every subfolder of the input path,
2. organizes those values in a `data.frame` using the **subfolder names** as column ids and the feature names as row ids. This kind of `data.frame` is named **expression matrix** or **expression set**. 

If the subfolders contain additional files `make_TPM_df()` will ignore them.

In our example, the folder that contains the quantification for *A. thaliana* has this structure:

bash$ tree -A inst/exdata/thaliana

thaliana ├── SRR2033948 │ └── quant.sf ├── SRR2033949 │ └── quant.sf ├── SRR2033950 │ └── quant.sf ├── SRR2033951 │ └── quant.sf ├── SRR2033952 │ └── quant.sf └── SRR2033953 └── quant.sf

With `make_TPM_df` we can extract the TPM values from all the `quant.sf` files and build the expression set.

```r
##  get the path to the thaliana folder
thaliana_path <- system.file("exdata/thaliana", package = "crossr") 

## extract TPMs and build a dataframe
thaliana <- make_TPM_df(thaliana_path)

## this is how the top rows of the dataframe should look like
head(thaliana)

And the same for A. lyrata

lyrata_path <- system.file("exdata/lyrata", package = "crossr")
lyrata <- make_TPM_df(lyrata_path)
str(lyrata)

Import quantifications from other mapping algorithms

While we suggest Salmon, you can use crossr downstream of any mapping algorithm but you will have to import the expression set in R by yourself. The package tximport can help.

Anyway, we suggest to use expression values that:

for example transcript per million (TPM) is a good option.

Import sample metadata

Now you might want to Import in R also a table of metadata that describes the experimental condition under which every RNAseq sample was harvested.

Sample metadata must be available if you are downloading RNAseq data from a public repository. If you are using your own data you will have to write the sample metadata yourself.

In R, the data.frame that stores sample metadata is conventionally named coldata. Every row of coldata matches one RNAseq sample (and has to match one column of the expression matrix) and every column of `coldata stores information on an experimental variable.

In this example we use this sample metadata set. We do not provide a function to load metadata into R, you can use the base function read.table or any similar function. Another good choice is the the modern read_csv() function from the package readr.

sample_info_path <- system.file("exdata",
                                "SRP058527_metadata.txt",
                                package = "crossr")

sample_info <-  read.table(sample_info_path,
                           header = TRUE,
                           sep = "\t")  

str(sample_info)

As seen above, the sample metadata contain extensive and redundant information.

Please, apply anything you know about tidying datasets to make coldata more direct and easier to use.

This is how we have proceeded on this dataset:

reduce_info <- function(info) {
    if(info == "heat stress 0 h") "0h"
    else if(info == "heat stress 6 h 37\302\260C") "heat_6h"
    else "recovery"
}
sample_info$etime <- vapply(sample_info$exposure_time_s,
                            reduce_info,
                            character(1))
sample_info$etime <- paste(sample_info$etime,
                           c(1, 2),
                           sep = "_")
sample_info$etime
switch_name <- function(name) {
    if(name %in% sample_info$Run_s) {
        new_name <- sample_info[sample_info$Run_s == name, "etime", drop = TRUE]
        return(new_name)
    } else return(name)
}
colnames(thaliana) <- vapply(colnames(thaliana), switch_name, character(1))
colnames(thaliana) <- paste("thal",
                            colnames(thaliana),
                            sep = "_")
str(thaliana)

colnames(lyrata) <- vapply(colnames(lyrata), switch_name, character(1))
colnames(lyrata) <- paste("lyr",
                          colnames(lyrata),
                          sep = "_")
str(lyrata)
coldata_tidy <- sample_info[, c("Run_s", "Organism_s", "etime")]
knitr::kable(coldata_tidy)

Homology groups (orthogroups)

When we analyze an RNA-seq experiment that was performed on samples originating from one single species, we generally:

  1. Quantify the expression of every gene in the genome across different condition or in different tissues,
  2. Compare the expression of every gene with itself across the different condition.

We rarely compare expression of one gene with another, mainly because we do not know if our quantification method is sequence biased.

Adapting this protocol to multispecies experiments presents a very important issue: different species have different set of genes with (almost always) different sequences.

This is part of a bigger issue: when we compare gene expression between different species, which gene from one species should be compared with which gene from the other?

To tackle this problem we first define groups of genes that descend from one single gene in the last common ancestor. These are known as "homology groups" or "orthogroups. Many computational packages infer this relationship. We generally use Orthofinder.

Orthofinder infers orthogroups from the proteomes, and indeed defines an orthogroup as:

The set of genes that are descended from a single gene in the last common ancestor of all the species being considered.

We can start solving the issues of multispecies transcriptomic comparison from here: genes in the same orthogroup might have the same function.

Ortholog inference

We use Orthofinder for ortholog inference. Discussing Orthofinder is beyond the scope of this document; please, refer to its documentation for details.

Orthofinder requires the proteomes of the species under study in order to infer orthogroups. In this example we downloaded the proteomes sequences from NCBI ftp site using the following version:

The samples are in the exdata folder.

We use this bash lines in order to produce orthogroups:

```{bash, eval = FALSE}

blast everything

python OrthoFinder-master/orthofinder.py -f path/to/proteomes -t 3

make orthogroups

python OrthoFinder-master/orthofinder.py -b path/to/results/WorkingDirectory

## Import orthogroups in R

Orthofinder outputs two equivalent orthogroup files in two different formats: `OrthologousGroups.csv` and `OrthologousGroups.txt`.

In order to parse the `OrthologousGroups.csv` file, we provide the function `parse_orthogroups()`. This function takes as input the path to the `OrthologousGroups.csv` file. Please note that `parse_orthogroups()` is designed for parsing **exclusively the *.csv* file** and not the *.txt* one.

```r
(ogroups_path <- system.file("exdata",
                            "ogroups_sample.csv",
                            package = "crossr"))

ogroups <- parse_orthogroups(ogroups_path)

The number of genes for orthogroup can be explored with the function explore_ogroups().

explore_ogroups(ogroups)

Convert protein ID in transcript ID

Since orthogroups are inferred on protein sequences but short reads are mapped on transcript sequences you might have to convert the protein ID in the orthogroups into transcript ID.

The function switch_ids() performs this task. It requires an external file that maps the protein ID to transcript ID, which must be loaded in R.

We have downloaded the feature tables that containsmappings between transcript ID and protein ID from NCBI from the same folder as the proteomes. Also this file is stored in the exdata folder.

Importing the feature tables from refseq into R can be challenging and the issues encountered are often table specific. In our experience, issues can arise from special characters in gene names and gene functional annotation within the table, such as hashes ( # ) and quotes ( ' ). Functional annotations often contain quotes because this symbol is used in the 5' and 3' notation.

This code loads the A. thaliana feature table.

thaliana_id_path <- system.file("exdata",
                                "thaliana_features_sample.txt",
                                package = "crossr")


thaliana_ids <- read.table(file = thaliana_id_path,
                           sep = "\t", header = FALSE,
                           stringsAsFactors = FALSE,
                           quote = "")

## Not sure why the header is commented with "#"
colnames(thaliana_ids) <-  scan(file = thaliana_id_path,
                                what = "",
                                nlines = 1)[-1]  

And this loads the A. lyrata one.

lyrata_id_path <- system.file("exdata",
                              "lyrata_features_sample.txt",
                              package = "crossr")

lyrata_ids <- read.table(file = lyrata_id_path,
                         sep = "\t", header = FALSE,
                         stringsAsFactors = FALSE,
                         quote = "",
                         skip = 1,
                         comment.char = "") # One gene name contains "#"

## Not sure why the header is commented with "#"
colnames(lyrata_ids) <-  scan(file = lyrata_id_path,
                                what = "",
                                nlines = 1)[-1]  

Then we can use switch_ids() to convert the IDS from A. thaliana first,

ogroups <- switch_ids(ogroups = ogroups,
                      ids_table = thaliana_ids,
                      px_id = "product_accession",
                      tx_id = "related_accession",
                      mc.cores = 1)

and then the ones from A. lyrata.

ogroups <- switch_ids(ogroups = ogroups,
                      ids_table = lyrata_ids,
                      px_id = "product_accession",
                      tx_id = "related_accession",
                      mc.cores = 1)

Note that the orthogroups contain IDS from both species, therefore switch_ids() must be called twice.

Implement a S4 object

Now that we have loaded the expression data and the orthogroup data, we can start implementing an S4 element to organize and link together the expression sets, annotations and statistics that we have loaded and/or that we are going to produce.

We suggest this chapter from Hadley Wickham's Advanced R book for an introduction to S4 objects in R.

The ogset class

The crossr package defines the S4 class ogset which can be created with the make_ogset constructor function.

og_set <-  make_ogset(og = ogroups,
                       spec1_exp = thaliana,
                       spec2_exp = lyrata)

The ogset class is designed to store orthogroup expression data and metadata; it has slots for orthogroup expression data, list of genes in orthogroups, single species expression data, design and annotation data. None of these elements is necessary when initializing the object with make_ogset() and also an empty set can be initialized.

The ogset class is modeled on Bioconductor Summarized experiment class.

When the ogset element is initialized, make_ogset() calls the function check_ogset() in order to check that the elements supplied match with each others. In details, check_ogset() checks that:

check_ogset() can be called again at any time later. For example, after adding new elements to an ogset object, one can use it to check that the object is still valid. In case check_ogset() will return TRUE.

check_ogset(og_set)

Store sample information (colData) in the ogset class element

The sample metadata for the two species can be saved in the spec1_colData and spec2_colData slots of the ogset element.

In this case we have already expressed simplified information on the samples in the colnames of the two datasets. We can use this to specify very basic colData. Anyway, any kind of detailed sample metadata can be provided as data.frame.

colnames(og_set@spec1_exp)
colnames(og_set@spec2_exp)

coldata_spec1 <- sapply(colnames(og_set@spec1_exp), function(i) strsplit(i, split = "_")[[1]][2])
coldata_spec1 <- data.frame(treat = coldata_spec1, stringsAsFactors = FALSE)
coldata_spec1

coldata_spec2 <- sapply(colnames(og_set@spec2_exp), function(i) strsplit(i, split = "_")[[1]][2])
coldata_spec2 <- data.frame(treat = coldata_spec2, stringsAsFactors = FALSE)
coldata_spec2

og_set@spec1_colData <- coldata_spec1
og_set@spec2_colData <- coldata_spec2

Experimental factor

We have to provide information on which column of the colData datasets contains the main information on the experimental conditions under which we are comparing the two species (it could be different growth stages, treatments etc.).

The name of the column that encodes for the experimental factor in colData can be stored in the exp_cond slot.

This column must contain the same categorical variables in both the spec1_colData and the spec2_colData datasets; indeed, we are testing for the response to the same experimental condition in the two species.

og_set@exp_cond <- "treat"

Afterwards, we can check that the ogset element is still valid.

check_ogset(og_set)

Collapse orthologues

As seen in Figure 1 most of the orthogroup contain only two genes, generally one from A. lyrata and one from A. thaliana.

Many other orthogroups though contain different (any) number of genes from one species and from the other. A "quick and dirty"" solution to restore a one to one relationship among features is to collapse the orthologues by adding up the expression values of genes within the same orthogroup.

The basic assumption of this method is that genes in the same orthogroup have the same function.

The function collapse_orthologs() performs this task on an ogset class element, and returns the same element with the collapsed orthogroup expression set in the designated slot.

og_set <- collapse_orthologs(og_set, mc.cores = 1)  ## This could take a while.

Not all orthogroups contain genes from both species, the orthogroups that contain genes and hence expression data from both species are stored in the slot og_exp for further analysis,

str(og_set@og_exp)

the ones that does not are stored in the slot og_nomatch

str(og_set@og_nomatch)

After we have collapsed orthologues we can check for consistency of the ogset object with the function check_ogset().

check_ogset(og_set)

Functional annotation for orthogroups

In order to get rudimentary functional annotations for the orthogroups we simply associate to an orthogroup all the functional annotations of the genes composing it.

In this example we use functional annotations from A. thaliana exclusively, because they are more reliable than the ones from A. lyrata.

Functional annotations in practice

The features names from the feature table of A. thaliana (the one that we have imported in Section 5 in the object thaliana_ids) contain enough functional information for our purpose.

The thaliana_ids object stores both features names and transcript ids, thus we can use this table to associate functional annotations to the transcripts contained in the orthogroups.

These transcript ids are stored the product_accession column of the thaliana_ids dataframe,

head(thaliana_ids$product_accession)

while the features names are in the name column.

head(thaliana_ids$name, 10)

With this code snippet we can:

  1. collect functional annotations for each orthogroup
  2. store them in the rowData slot.
get_annos <- function(ogroup) 
{
    annos <- thaliana_ids[thaliana_ids$product_accession %in% ogroup, "name"]
    annos <- paste(unique(annos), collapse = " -- ")
    return(annos)
}

og_set@og_annos <- data.frame(func_annos = vapply(og_set@og,
                                                  get_annos,
                                                  character(1)),
                              stringsAsFactors = FALSE)

head(og_set@og_annos)

Afterwards, as always, we can check for consistency of the og_set element.

check_ogset(og_set)

Select groups that behave differently between species

After moving to orthogroup-wise expression, we can search for orthogroups that behave differently in the two species.

Experimental design with interaction factor

Since genes from the two species within the same orthogroup have different sequences, we do not perform direct comparison of orthogroup expression level between the two species. Instead, we use a linear model in order to detect relative differences of expression of each orthogroup between the two species across the different experimental condition.

We achieve this by setting a design formula with one term for the species (spc in this case) one term for the treatment / experimental factor (treat in this case) and then an interaction factor (spc:treat in this case). So in this case the design formula will be ~ spc + treat + spc:treat. Than we use the F-value of the interaction term as ranking feature.

Also, since the data have gone through many transformation it is hard to make assumption on their distribution, therefore we do not extract a p-value from the linear model. We use the F-value on the interaction term as a ranking feature, in order to select the genes that seem to behave differently among the species.

Orthogroup ranking in practice

The function that fits the linear model and performs ANOVA is called add_fit()

In order to call it:

  1. first we have load information on the column of the orthogroup expression set og_exp into the ogset element,
  2. then we have to specify an experimental design.

Specify colData for orthogroup-wise expression

The column data were previously encoded in the colnames of the two expression sets from the two different species. Those names are preserved in the colnames of the ortholog expression set.

colnames(og_set@og_exp)

So I organize those info in a dataframe:

coldata <- data.frame(spc = sapply(colnames(og_set@og_exp),
                                   function(i) strsplit(i, split = "_")[[1]][1]),
                      treat = sapply(colnames(og_set@og_exp),
                                     function(i) strsplit(i, split = "_")[[1]][2]),
                      stringsAsFactors = FALSE)

coldata

And provide those information to the ogset element in the colData slot.

og_set@colData <- coldata
check_ogset(og_set)

Specify design

The experimental design can be specified in the design slot:

og_set@design <- ~ 0 + spc + treat + spc:treat
check_ogset(og_set)

Estimate F-values and rank

Now we can calculate the F-values with add_fit().

We would suggest calculate the F-value on log scaled value, this is more appropriate for studying relative changes. This could be done with the argument log_scale = TRUE to the add_fit(). In this way, the add_fit() function:

  1. add 1 to the every observation in the orthogroup expression dataset,
  2. log transforms the data, 3 estimates the F-values,
  3. writes the F-values in the stats slot.

So, in this way the F-value are estimated on log(data + 1).

og_set <- add_fit(ogset = og_set, log_scale = TRUE)
str(og_set@stats)

The info on the log_scale parameter are stored in the metadata slot.

og_set@metadata

Explore results

add_fit() saves an the F-statistic for all the terms in the design formula for every gene

head(og_set@stats)

As mentioned above, the F-statistic associated to the interaction term spc:treat can be used as a ranking feature. The orthogroups with the highest F-value are the one that display a more divergent behavior in the two species.

The function get_top_tags() extracts the ID of the orthogroups with the highest F-value.

tt <- get_top_tags(ogset = og_set,
                   rank_stat = "spc:treat",
                   n =  10)
tt

get_top_tags() returns a named numeric vector containing both the id of the top orthogroups and their associated statistics.

We can extract the functional information for the top tags with:

og_set@og_annos[names(tt), ]

Plot orthogroup expression

The function plot_all_stages is a wrapper fot ggplot2 that plots a stripchart of the expression of the supplied orthogroup.

It requires as input:

The code below selects the highest differentially expressed group and plots it.

top_group <- names(get_top_tags(og_set, "spc:treat", 1))
ggplot_all_stages(orthogroups = top_group,
                  condition_var = "treat",
                  species_var = "spc",
                  ogset = og_set)

If multiple orhogroups are supplied, ggplot_all_stages plots them all separately using facetting. This code selects the top 12 differentially expressed orthogroups and plots their expression.

tt <- names(get_top_tags(og_set, "spc:treat", 12))
ggplot_all_stages(orthogroups = tt,
                  condition_var = "treat",
                  species_var = "spc",
                  ogset = og_set)

You can use the use_annos parameter to display functional annotation in the facet titles.

top_group <- names(get_top_tags(og_set, "spc:treat", 1))
ggplot_all_stages(orthogroups = top_group,
                  condition_var = "treat",
                  species_var = "spc",
                  ogset = og_set,
                  use_annos = TRUE)

Plot all the genes composing a specific orthogroup

The workflow of crossr relies on the assumption that all the genes composing one orthogroup have the same function.

If this useful working assumption was always verified, the expression values of those genes could be added without generating artifacts. Unfortunately this is not the case.

Therefore we suggest that, before drawing any conclusion, the user should explore by plotting also the expression pattern of the single genes contained in the orthogroups that are differentially expressed.

This can be achieved with the function plot_og_genes, that prints the number and the ID of the genes composing the user supplied orthogroup and plot their expression.

ggplot_ogroup_genes(orthogroup = top_group, 
                    ogset = og_set)

The orthogroup r top_group contains only 2 genes, but we can expect orthogroups of different size and different ratios of genes from one species and from the other.

Example, orthogroup OG0002779

example_group <- "OG0002779"
# also OG0000214 makes a good example

Orthogroup r example_group contains contains 4 genes, 2 from each species.

The F-value associated to this group is r og_set@stats[example_group, "spc:treat"], and it ranks r which(names(get_top_tags(og_set, "spc:treat", Inf)) == example_group) out of 500 groups.

The associated annotation is: r og_set@og_annos[example_group, "func_annos"](which is not incredibly helpful).

Looks like it is behaving differently in response to heat in the 2 species;

ggplot_all_stages(orthogroups = example_group,
                  condition_var = "treat",
                  species_var = "spc",
                  ogset = og_set)

and it contains 4 genes of which one is behaving peculiarly differently than the others.

ggplot_ogroup_genes(orthogroup = example_group, 
                    ogset = og_set)

Clusters

We rank orthogroups by F-value in order to extract orthogroups that behave differently in the two species. But in this way, we extract orthogroups that behave differently in any kind of way.

After extracting orthogroups that behave differently, we cluster their expression values in order to group the ones that behave differently in a similar way.

We can use the function get_top_tags() in order to apply an arbitrary cutoff and to extract the top n differentially expressed orthogroups, in this case 50.

n <- 50
tt <- get_top_tags(ogset = og_set,
                   rank = "spc:treat",
                   n = n)

And then we cluster the the orthogroups with the pheatmap() function from the pheatmap package (but feel free to substitute it with your favourite clustering function if you prefer).

pheatmap::pheatmap(t(scale(t(og_set@og_exp[names(tt), ]))),
                   cluster_cols = FALSE,
                   labels_row = paste(names(tt), og_set@og_annos[names(tt), "func_annos"], sep = ", "),
                   cellheight = 10, cellwidth = 25)

Known issues and limitations

Applying the crossr workflow helped me a lot when I had to analyze cross-species RNA-seq datasets and when I had to extract biological observations from those data.

Nevertheless crossr has know issues and limitations.

Limitations

While writing crossr I favoured simplicity over statistical precision, therefore crossr relies on explorative data analysis methods and does not perform rigorous testing.

Making reliable statistical tests on cross-species transcriptomic data is a complicated and way beyond my expertise. Please, do not take results on a low number of replicates (2, 3 or 5) as definitive. Consider validating your observations with more measurements.

Issues

I have identified three issues with crossr workflow. I am currently learning how to deal with these issues.

Session info

Here is the output of devtools::session_info() on the system on which this document was compiled:

devtools::session_info()


OthoMantegazza/crossr documentation built on Feb. 1, 2023, 4:15 a.m.