In priesgo/TCGAome: MultiOmics (just RNAseq and RPPA...) Data Analyzer on TCGA data

library(data.table)
library(microbenchmark)
library(fastmatch)
library(ggplot2)
devtools::load_all(".")

Motivation

We will test the performance of the common operations that we require to map terms to genes and vice versa: basic operations on data tables and specially set operations are critical (i.e.: union, intersection, etc.). For this purpose we will compare two object types the native data.frame and the data.frame subclass data.table https://cran.r-project.org/web/packages/data.table/index.html. We will also compare the library fastmatch https://cran.r-project.org/web/packages/fastmatch/index.html and its native counterpart match. Also, we will evaluate the improvement obtained by using functions compiled in byte code.

Critical components that are executed several thousand times, like the functions that compute similarity between terms, have been implemented as functions avoiding S4 method dispatching as it adds an undesired overhead.

No effort has been done on parallelization or implementation in C/C++.

The following benchmarks have been performed on fixed and relatively small datasets, we did not analyse how these techniques behave when data scales up.

data.table

As stated in its CRAN site "Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread).". It is very likely that the datasets under analysis are not big enough to observe the advantages of using data.table. Furthermore, as we will see using data.table adds an overhead when dealing with the dataset under analysis.

fastmatch

fastmatch relies on keeping the hash table in memory, thus critical improvement in performance will only be observed when accessing repeteadly the same data. This justifies the dispersed distributions on the execution time that we will observe as compared to native match. For this reason the median will be used to compare native match and fastmatch performance.

Test data

The test data is based on the Gene Ontology annotations for biological process. We will compute our tests on the data structure that associates terms to genes (i.e.: term2gene).

## Loads raw annotations
raw_annotations <- TCGAome::load_goa(ontology = "BP")@raw_annotations

# Creates the data.frame
term2gene_df <- aggregate(data = raw_annotations,
                       Gene ~ Term, c)
rownames(term2gene_df) <- term2gene_df$Term


# Creates the data.table and sets the key of the table
term2gene_dt <- data.table::data.table(
                  aggregate(data = raw_annotations,
                            Gene ~ Term, c))
data.table::setkey(term2gene_dt, Term)
rownames(term2gene_dt) <- term2gene_dt$Term

# Function to select random terms
get_random_term <- function(raw_annotations) {
    random_term <- raw_annotations$Term[runif(
        1,
        max=length(raw_annotations$Term))]
    return(random_term)
}

Additionally, we will use two sets of 1000 elements for the set operations.

first_set = sample(10000, size = 1000, replace = FALSE)
second_set = sample(10000, size = 1000, replace = FALSE)

Single column selection

Evaluate the performance of different methods for column selection.

Select a column using the $ operator on a data.frame
Select a column using the $ operator on a data.table
Select a column using the column name on a data.frame
Select a column using the column name on a data.table (returns a data.table instead of a vector)
Select a column using the data.table column variables
Select a column using native match on the column name on a data.frame
Select a column using fastmatch on the column name on a data.frame
Select a column using native match on the column name on a data.frame (considering call to names())
Select a column using fastmatch on the column name on a data.frame (considering call to names())
Select a column using fastmatch on the column name on a data.frame (considering call to names() and package resolution)
Select a column using native match on the column name on a data.table
Select a column using fastmatch on the column name on a data.table
Select a column using native match on the column name on a data.table (considering call to names())
Select a column using fastmatch on the column name on a data.table (considering call to names())

column_names <- names(term2gene_df)
res <- microbenchmark(
    "1" = term2gene_df$Term,
    "2" = term2gene_dt$Term,
    "3" = term2gene_df[, c("Term")],
    "4" = term2gene_dt[, c("Term"), with = FALSE],
    "5" = term2gene_dt[, Term],
    "6" = term2gene_df[, match("Term", column_names)],
    "7" = term2gene_df[, fmatch("Term", column_names)],
    "8" = term2gene_df[, match("Term", names(term2gene_df))],
    "9" = term2gene_df[, fmatch("Term", names(term2gene_df))],
    "10" = term2gene_df[, fastmatch::fmatch("Term", names(term2gene_df))],
    "11" = term2gene_dt[, match("Term", column_names), with = FALSE],
    "12" = term2gene_dt[, fmatch("Term", column_names), with = FALSE],
    "13" = term2gene_dt[, match("Term", names(term2gene_df)), with = FALSE],
    "14" = term2gene_dt[, fmatch("Term", names(term2gene_df)), with = FALSE]
)

autoplot(res) +
    ggplot2::scale_x_discrete(
        name = NULL,
        labels = rev(c("1. $ operator DF",
                   "2. $ operator DT",
                   "3. columnn name DF",
                   "4. columnn name DT",
                   "5. variable name DT",
                   "6. match DF",
                   "7. fastmatch DF",
                   "8. match DF with names()",
                   "9. fastmatch DF with names()",
                   "10. fastmatch DF with names() and dispatch",
                   "11. match DT",
                   "12. fastmatch DT",
                   "13. match DT with names()",
                   "14. fastmatch DT with names()")),
        limits = rev(levels(res$expr)))
print(res)

The conclusion is that when using a data.frame or a data.table the fastest method is the $ operator. Beware that package resolution is a big burden on performance.

Row selection

Evaluate the performance of different methods for row selection. Usually number of rows >> number of columns and thus the performance when selecting rows is a stronger limitation. In this case we will use as before two objects data.frame and data.table, but also combined with the library fastmatch.

Cell value comparison in a data.frame
Cell value comparison in a data.table
Row name match in a data.frame
Row name match in a data.table
Cell value native match in a data.frame
Cell value fastmatch in a data.frame
Cell value native match in a data.table
Cell value fastmatch in a data.table

random_term <- get_random_term(raw_annotations)
column_values <- term2gene_df$Term
res <- microbenchmark(
    "1" = term2gene_df[term2gene_df$Term == random_term, ],
    "2" = term2gene_dt[term2gene_dt$Term == random_term, ],
    "3" = term2gene_df[random_term, ],
    "4" = term2gene_dt[random_term, ],
    "5" = term2gene_df[match(random_term, column_values), ],
    "6" = term2gene_df[fmatch(random_term, column_values), ],
    "7" = term2gene_dt[match(random_term, column_values), ],
    "8" = term2gene_dt[fmatch(random_term, column_values), ]
    )

autoplot(res) +
    ggplot2::scale_x_discrete(
        name = NULL,
        labels = rev(c("1. Cell value comparison DF",
                   "2. Cell value comparison DT",
                   "3. Row name match DF",
                   "4. Row name match DT",
                   "5. match DF",
                   "6. fastmatch DF",
                   "7. match DT",
                   "8. fastmatch DT")),
        limits = rev(levels(res$expr)))
print(res)

There are two alternatives with very close times, only when looking at the median we can conclude that the fastest row selection is by row name matching on a data.frame. The implementation using fastmatch on a data.frame gets almost the same average execution time.

Intersection

Evaluate the performance of intersection between sets in a character vector.

Intersect using the %in% operator
Intersect using the intersect() function
Intersect using native match
Intersect using fastmatch

res <- microbenchmark(
    "1" = first_set[first_set %in% second_set],
    "2" = intersect(first_set, second_set),
    "3" = first_set[!is.na(match(first_set, second_set))],
    "4" = first_set[!is.na(fmatch(first_set, second_set))]
)

autoplot(res) +
    ggplot2::scale_x_discrete(
        name = NULL,
        labels = rev(c("1. %in% operator",
                   "2. intersect() function",
                   "3. Native match",
                   "4. fastmatch"
                   )),
        limits = rev(levels(res$expr)))
print(res)

The conclusion is that the fastest intersection is provided by fastmatch.

Union

Union using the union() function
Union using the unique() function and concatenation
Union using native match
Union using fastmatch

res <- microbenchmark(
    "1" = union(first_set, second_set),
    "2" = unique(c(first_set, second_set)),
    "3" = c(first_set, second_set[is.na(match(second_set, first_set))]),
    "4" = c(first_set, second_set[is.na(fmatch(second_set, first_set))])
)

autoplot(res) +
    ggplot2::scale_x_discrete(
        name = NULL,
        labels = rev(c("1. %in% operator",
                   "2. unique() function",
                   "3. Native match",
                   "4. fastmatch"
                   )),
        limits = rev(levels(res$expr)))
print(res)

The conclusion is that the fastest union is provided by fastmatch.

Difference

Difference using the %in% operator
Difference using the setdiff() function
Difference using native match
Difference using fastmatch

res <- microbenchmark(
    "1" = first_set [! first_set %in% second_set],
    "2" = setdiff(first_set, second_set),
    "3" = first_set [is.na(match(first_set, second_set))],
    "4" = first_set [is.na(fmatch(first_set, second_set))]
)

autoplot(res) +
    ggplot2::scale_x_discrete(
        name = NULL,
        labels = rev(c("1. %in% operator",
                   "2. setdiff() function",
                   "3. Native match",
                   "4. fastmatch"
                   )),
        limits = rev(levels(res$expr)))
print(res)

The conclusion is that the fastest difference is provided by fastmatch.

XOR

XOR using the %in% operator (assuming intersection and union is already computed)
XOR using native match (assuming intersection and union is already computed)
XOR using native match alternative implementation
XOR using fastmatch (assuming intersection and union is already computed)
XOR using fastmatch alternative implementation

union_set = c(first_set, second_set[is.na(fmatch(second_set, first_set))])
intersection_set = first_set[!is.na(fmatch(first_set, second_set))]
res <- microbenchmark(
    "1" = union_set[!union_set %in% intersection_set],
    "2" = union_set[is.na(match(union_set, intersection_set))],
    "3" = c(first_set[is.na(match(first_set, second_set))],
                        second_set[is.na(match(second_set, first_set))]),
    "4" = union_set[is.na(fmatch(union_set, intersection_set))],
    "5" = c(first_set[is.na(fmatch(first_set, second_set))],
                        second_set[is.na(fmatch(second_set, first_set))])
)

autoplot(res) +
    ggplot2::scale_x_discrete(
        name = NULL,
        labels = rev(c("1. %in% operator",
                   "2. Native match 1",
                   "3. Native match 2",
                   "4. fastmatch 1",
                   "5. fastmatch 2"
                   )),
        limits = rev(levels(res$expr)))
print(res)

The conclusion is that the fastest union is provided by fastmatch. If intersection and union are already computed use the first version, otherwise use the alternative implementation.

Compiling function to byte code

The just-in-time compilation allows that the code of our R code is translated into byte code and save some interpretation time at execution time. Our implementation is based on the compiler package included in base R.

get_ui_similarity_bc <- compiler::cmpfun(get_ui_similarity)

When benchmarking the execution time of several JIT compiled functions with their plain R counterparts we observe a slight increase in performance.

random_term_1 <- get_random_term(raw_annotations)
random_term_2 <- get_random_term(raw_annotations)
kegg <- TCGAome::load_kegg()
res <- microbenchmark(
    "1" = get_cosine_similarity(kegg, random_term_1, random_term_2),
    "2" = get_cosine_similarity_bc(kegg, random_term_1, random_term_2),
    "3" = get_binary_similarity(kegg, random_term_1, random_term_2),
    "4" = get_binary_similarity_bc(kegg, random_term_1, random_term_2),
    "5" = get_bc_similarity(kegg, random_term_1, random_term_2),
    "6" = get_bc_similarity_bc(kegg, random_term_1, random_term_2),
    "7" = get_ui_similarity(kegg, random_term_1, random_term_2),
    "8" = get_ui_similarity_bc(kegg, random_term_1, random_term_2)
)

autoplot(res) +
    ggplot2::scale_x_discrete(
        name = NULL,
        labels = rev(c("1. Cosine similarity",
                   "2. Cosine similarity byte code",
                   "3. Binary similarity",
                   "4. Binary similarity byte code",
                   "5. Bray-Curtis similarity",
                   "6. Bray-Curtis similarity byte code",
                   "7. Union-intersection similarity",
                   "8. Union-intersection similarity byte code"
                   )),
        limits = rev(levels(res$expr)))
print(res)

Session info

sessionInfo()

priesgo/TCGAome documentation built on May 25, 2019, 11:26 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

priesgo/TCGAome
MultiOmics (just RNAseq and RPPA...) Data Analyzer on TCGA data

In priesgo/TCGAome: MultiOmics (just RNAseq and RPPA...) Data Analyzer on TCGA data

Motivation

data.table

fastmatch

Test data

Single column selection

Row selection

Intersection

Union

Difference

XOR

Compiling function to byte code

Session info

R Package Documentation

Browse R Packages

We want your feedback!

priesgo/TCGAome MultiOmics (just RNAseq and RPPA...) Data Analyzer on TCGA data

In priesgo/TCGAome: MultiOmics (just RNAseq and RPPA...) Data Analyzer on TCGA data

Motivation

data.table

fastmatch

Test data

Single column selection

Row selection

Intersection

Union

Difference

XOR

Compiling function to byte code

Session info

R Package Documentation

Browse R Packages

We want your feedback!

priesgo/TCGAome
MultiOmics (just RNAseq and RPPA...) Data Analyzer on TCGA data