require(png)
require(grid)
require(gridExtra)

Content

0.0 Introduction

In the last years a class of ncRNAs were described named lincRNAs (large intergenic noncoding RNAs) as untranslated transcripts with a length of over 200 bp. [1, 2] A number of papers focused on their tissue-specific upregulation in cancer cells. [1, 3, 4] The annotation and functional classification of lincRNAs (and ncRNAs) remain challenging. Given a RNAseq or microarray experiment, one possible approach to identify candidate ncRNAs poses ``guilty by association'', a principle stating that coding and noncoding genes showing a comparable expression pattern are likely to share functionality. [5] This idea can now be easily applied on arbitrary expression matrices using this R package LINC. The basic section will introduce the function justlinc() which will immediately give results. In the sections focusing on the advanced methods more complex functions are described providing control over most thresholds used in the computations. This package is intended to describe lincRNAs and their interaction to other lincRNAs plus protein-coding genes based on gene expression data.

1.0 Basic: Computation in one step!

The function justlinc is a wrapper. It requires a gene expression matrix with columns corresponding to samples and rows to genes. A preprocessing of the matrix is not mandatory. Gene ids need to be given preferably as Ensembl gene ids. Importanly, this function uses predefined thresholds, so it will not give a result in every case! On the other hand it does not need additional information.
The following commands show how justlinc is called. (Please note, that the large gene expression matrix named GTEX_LIVER_CRUDE is not provided in this Version of the package.)

suppressWarnings(suppressMessages(require(LINC)))
data(LIVER_EXPR)

# a gene expression matrix with > 50,000 genes
str(GTEX_LIVER_CRUDE)

# 'justlinc' will search for the 10 best candidates
justlinc(GTEX_LIVER_CRUDE)

The optional argument targetGenes can be either a gene 'biotype' (the default value is 'lincRNA'), a single candidate gene or a vector of genes.

# 'justlinc' called with queries
my_lincRNAs <- c("ENSG00000224153", "ENSG00000197813",
                 "ENSG00000179136", "ENSG00000259439",
                 "ENSG00000267462")

res <- justlinc(GTEX_LIVER_CRUDE, targetGenes = my_lincRNAs)

For multiple candidate genes it is assumed that they have a connection, for instance comparable co-expression pattern. This relationship is represented by the dendrogram (marked in dark red). As distance between two lincRNAs the Czekanovski dice distance [6] is considered, a measure for the number of shared interaction partners. The pathways found for the co-expressed genes are shown below the dendrogram.

2.0 Advanced 1: correlation matrix and statistics

The main function of this package termed linc() computes a correlation matrix and perfoms statistical correction. Spearman's rank correlations is the default method. It provides options for the removal of principle components and gene outliers. These methods can indicate whether correlation values are influenced by confounding factors. Negative correlation can be computed by providing a user-defined correlation function (example C). Protein-coding genes and queries (lincRNAs and other ncRNAs) are separated by the argument codingGenes, an assignment of protein-coding genes given as a logical vector or a vector of gene names. The output of this function is a LINCmatrix instance. These objects can be plotted applying the plotlinc() function. The resulting plot illustrates the statistics of the input matrix. It is useful to subset the input expression matrix such that only lincRNAs of interest as well as variance- and/or expression-selected protein-coding genes are considered!

suppressWarnings(suppressMessages(require(LINC)))
data(BRAIN_EXPR)

# (A) call 'linc' with no further arguments
crbl_matrix <- linc(cerebellum, codingGenes = pcgenes_crbl, verbose = FALSE)
# 'cerebellum' is a matrix of expression values; rows correspond to genes
# 'pcgenes_crbl' is a TRUE/FALSE vector; TRUE indicates a protein-coding gene 

# (B) remove first seven principle components
crbl_matrix_pc <- linc(cerebellum, codingGenes = pcgenes_crbl, rmPC = c(1:7), verbose = FALSE)

# (C) negative correlation by using 'userFun'
crbl_matrix_ncor <- linc(cerebellum, codingGenes = pcgenes_crbl,
                         userFun = function(x,y){ -cor(x,y) }, verbose = FALSE)

# (D) remove outliers using the ESD method
crbl_matrix_esd <- linc(cerebellum, codingGenes = pcgenes_crbl, outlier = "esd", verbose = FALSE)

# (E) plot this object
plotlinc(crbl_matrix_esd)

2.1 Advanced 2: cluster analysis or single query

Starting from the correlation matrix, the LINCmatrix object, there are two functions to compute either (I) a cluster with clusterlinc() or (II) an analysis for a single query with singlelinc(). The clusterlinc() functions perfroms a correlation test (Spearman's rho by default) and clusters the candidate lincRNAs based on their interaction partners or their correlations.

# call 'clusterlinc' with no further arguments
crbl_cluster <- clusterlinc(crbl_matrix, verbose = FALSE)
# apply the distance method "correlation instead of "dicedist"
crbl_cluster_cor <- clusterlinc(crbl_matrix, distMethod = "correlation", verbose = FALSE)
# comparing two distance methods
plotlinc(crbl_cluster)
plotlinc(crbl_cluster_cor)

Another function querycluster() enables clustering not only across different lincRNAs in one data set, but also clustering of one lincRNA in multiple data sets. In order to distinguish between different expression matrices the function feature can be applied in order to add user-defined names (customID) and colors (customCol).

# add custom names and colors
gbm_cluster <- gbm_cluster + feature(customID = "CANCER_GBM", customCol = "red")
ctx_cluster <- ctx_cluster + feature(customID = "HEALTHY_CTX", customCol = "blue")
hpc_cluster <- hpc_cluster + feature(customID = "HEALTHY_HPC", customCol = "blue")
crbl_cluster <- crbl_cluster + feature(customID = "HEALTHY_CRBL", customCol = "blue")

# plot the dendrogram
querycluster('647979', queryTitle = 'NORAD',
             gbm_cluster,  # Glioblastoma
             ctx_cluster,  # Cortex
             hpc_cluster,  # Hippocampus
             crbl_cluster) # Cerebellum

The enriched biological terms from gene annotation resources like Reactome PA and Gene Ontology (GO) for the cluster (multiple candidate lincRNA genes) can be identified applying getbio() which will call supported functions from the clusterProfiler package. [7 - 9] This analysis will reveal which functions, pathways or compartments are associated with the lincRNA-co-expressed genes.

# Function call for 'getbio()' and plotting:
crbl_cc <- getbio(crbl_cluster, translate = 'none', ont = "CC")
plotlinc(crbl_cc)

It can be interesting to compare two objects in terms of intersecting biological terms for shared queries. In the following example the terms for "cellular compartment" from GO are compared between two data sets, one representing the tissue Cortex and the other Cerebellum.

# compare two brain tissues using 'overlaylinc'
ctx_cc <- ctx_cc + feature(customID = "Cortex vs. Cerebellum")
overlaylinc(ctx_cc, crbl_cc)

# alternatively, compute the intersection of both objects
brain_cc <- ctx_cc + crbl_cc
plotlinc(brain_cc)

For a single query gene they function singlelinc() comes with many options in terms of co-expression selection. One can select for absolute values, negative correlations and a maximum for co-expressed genes. Here, the query is 'MEG3 - maternally expressed 3' (its Entrez gene id is '55384'). In contrast to clusterlinc(), singlelinc() will directly call a gene annotation resource for the co-expressed genes.

meg3 <- singlelinc(crbl_matrix, query = "55384", threshold = 0.00005, ont = 'BP', verbose = FALSE)
plotlinc(meg3)

2.2 Advanced 3: Workflow and more about the LINC class

The figure below depicts the workflow for the adavanced mehods in the LINC package. For multiple queries the workflow as sequence of function calls would be linc() -> clusterlinc() -> getbio() -> plotlinc() ... and linc() -> singlelinc() -> plotlinc() for a single query. If there are more than > 100 queries (lincRNA genes) in the matrix it will be beneficial to subset the expression matrix.

  overview_img <- readPNG(system.file("extdata", "overview_img.png",
                                     package ="LINC"))
    overview_plot <- rasterGrob(overview_img, interpolate = TRUE)
    grid.arrange(overview_plot)

As seen in the previous sections the function linc() produces a LINCmatrix instance, the function clusterlinc() a LINCcluster instance and so on. Two functions, termed feature() and getlinc(), provide quick access to the important data structures stored in these objects. Converting a LINCcluster into a LINCmatrix for instance could be useful in case one wants to plot the distribution of correlation values instead of the cluster applying plotlinc(). The function call getlinc(..., subject = "queries") will return the possible queries in the object. Calling plotlinc(..., showCor =) will output scatterplots of correlations between a query and up to five subjects. getcoexpr() returns a vector of co-expressed genes.

data(BRAIN_EXPR)
class(crbl_matrix)

# feature() can be used to convert objects
crbl_matrix <- crbl_cluster + feature(setLevel = "LINCmatrix", showLevels = TRUE)

# getlinc() is used to accesss information
getlinc(crbl_cluster, subject = "geneSystem")
getlinc(crbl_cluster, subject = "queries")
# Get the co-expressed genes for the gene with the entrez id "441951" (ZFAS1)
zfas1 <- getcoexpr(crbl_cluster, query = "441951" ); str(zfas1)
str(crbl_matrix@assignment) # protein-coding genes are stored in the 'assignment' slot

plotlinc(crbl_cluster, showCor = c("647979", "6726", "3337", "3304" ,"3320"))

It is possible to write the ids of co-expressed genes of a LINCcluster object and the biological terms of a LINCbio object into a table. The size of the table is restricted to 500 observations for each query, respectively.

linctable(file_name = "crbl_co_expr", input = crbl_cluster)

3.0 List of Functions

# WRAPPER
justlinc()         # gene selection, co-expression

# MAIN FUNCTIONS
linc()             # cor. matrix and statistics
clusterlinc()      # cluster and cor. test 
singlelinc()       # single query co-expression

# PLOTTING FUNCTIONS
plotlinc()         # main plotting function
overlaylinc()      # intersection of biological terms
querycluster()     # one query in multiple data sets

# HELPING FUNCTIONS
getbio()           # enriched terms for a cluster 
object + feature() # level and data labeling
getlinc()          # subsetting of 'LINC' objects
linctable()        # write to table

4.0 Acknowledgment

I want to thank my supervisor Dr. Carl Herrmann who supported me during the development of this package. In addition, I would like to mention the members of the eilslabs and the DKFZ (Deutsches Krebsforschungszentrum) at the University of Heidelberg. In particular, Prof. Dr. Benedikt Brors, Mattia Falcone, Dr. Michael Flechter, Calvin Chan, Sebastian Steinhauser and Qi Wang.

5.0 References

[1] Cheetham SW, Gruhl F, Mattick JS and Dinger ME (2013) Long noncoding RNAs and the genetics of cancer. British Journal of Cancer 108, 2419 - 2425.

[2] Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A and Rinn JL (2011) Integrative an-notation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes and Development doi: 10.1101/gad.17446611.

[3] Garzon R, Volinia S, Papaioannou D, Nicolet D, Kohlschmidt J, Yan PS, Mrózek K, Bucci D, Carroll AJ, Baer MR, Wetzler M, Carter TH, Powell BL, Kolitz JE, Moore JO, Eisfeld AK, Blachly JS, Blum W, Caligiuri MA, Stone RM, Marcucci G, Croce CM, Byrd JC and Bloomfield CD (2014) Expression and prognostic impact of lncRNAs in acute myeloid leukemia. PNAS 111 :18679 - 18684.

[4] Zhang J, Zhu N and Chen X (2015) A novel long noncoding RNA LINC01133 is upregulated in lung squamous cell cancer and predicts survival. Tumor Biology doi 10.1007/s13277-015-3460-9.

[5] Gillis J and Pavlidis P (2012) “Guilt by Association” Is the Exception Rather Than the Rule in Gene Networks. PLoS Computational Biology 8: e1002444.

[6] Christine Brun, Francois Chevenet, David Martin, Jerome Wojcik, Alain Guenoche and Bernard Jacq" Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network" (2003) Genome Biology, 5:R6.

[7] Yu G, Wang LG, Han Y, He QY (2015) clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284 - 287.

[8] Yu G, Wang L, Yan G and He Q (2015). DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 31: 608 - 609.

[9] Yu G and He QY (2016). ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 12: 477 - 479.



ManuelGoepferich/LINC_before_DEVEL documentation built on May 7, 2019, 2:46 p.m.