library(data.table) library(microbenchmark) library(fastmatch) library(ggplot2) devtools::load_all(".")
We will test the performance of the common operations that we require to map terms to genes and vice versa: basic operations on data tables and specially set operations are critical (i.e.: union, intersection, etc.). For this purpose we will compare two object types the native data.frame and the data.frame subclass data.table https://cran.r-project.org/web/packages/data.table/index.html. We will also compare the library fastmatch https://cran.r-project.org/web/packages/fastmatch/index.html and its native counterpart match. Also, we will evaluate the improvement obtained by using functions compiled in byte code.
Critical components that are executed several thousand times, like the functions that compute similarity between terms, have been implemented as functions avoiding S4 method dispatching as it adds an undesired overhead.
No effort has been done on parallelization or implementation in C/C++.
The following benchmarks have been performed on fixed and relatively small datasets, we did not analyse how these techniques behave when data scales up.
As stated in its CRAN site "Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread).". It is very likely that the datasets under analysis are not big enough to observe the advantages of using data.table. Furthermore, as we will see using data.table adds an overhead when dealing with the dataset under analysis.
fastmatch relies on keeping the hash table in memory, thus critical improvement in performance will only be observed when accessing repeteadly the same data. This justifies the dispersed distributions on the execution time that we will observe as compared to native match. For this reason the median will be used to compare native match and fastmatch performance.
The test data is based on the Gene Ontology annotations for biological process. We will compute our tests on the data structure that associates terms to genes (i.e.: term2gene).
## Loads raw annotations raw_annotations <- TCGAome::load_goa(ontology = "BP")@raw_annotations # Creates the data.frame term2gene_df <- aggregate(data = raw_annotations, Gene ~ Term, c) rownames(term2gene_df) <- term2gene_df$Term # Creates the data.table and sets the key of the table term2gene_dt <- data.table::data.table( aggregate(data = raw_annotations, Gene ~ Term, c)) data.table::setkey(term2gene_dt, Term) rownames(term2gene_dt) <- term2gene_dt$Term # Function to select random terms get_random_term <- function(raw_annotations) { random_term <- raw_annotations$Term[runif( 1, max=length(raw_annotations$Term))] return(random_term) }
Additionally, we will use two sets of 1000 elements for the set operations.
first_set = sample(10000, size = 1000, replace = FALSE) second_set = sample(10000, size = 1000, replace = FALSE)
Evaluate the performance of different methods for column selection.
column_names <- names(term2gene_df) res <- microbenchmark( "1" = term2gene_df$Term, "2" = term2gene_dt$Term, "3" = term2gene_df[, c("Term")], "4" = term2gene_dt[, c("Term"), with = FALSE], "5" = term2gene_dt[, Term], "6" = term2gene_df[, match("Term", column_names)], "7" = term2gene_df[, fmatch("Term", column_names)], "8" = term2gene_df[, match("Term", names(term2gene_df))], "9" = term2gene_df[, fmatch("Term", names(term2gene_df))], "10" = term2gene_df[, fastmatch::fmatch("Term", names(term2gene_df))], "11" = term2gene_dt[, match("Term", column_names), with = FALSE], "12" = term2gene_dt[, fmatch("Term", column_names), with = FALSE], "13" = term2gene_dt[, match("Term", names(term2gene_df)), with = FALSE], "14" = term2gene_dt[, fmatch("Term", names(term2gene_df)), with = FALSE] )
autoplot(res) + ggplot2::scale_x_discrete( name = NULL, labels = rev(c("1. $ operator DF", "2. $ operator DT", "3. columnn name DF", "4. columnn name DT", "5. variable name DT", "6. match DF", "7. fastmatch DF", "8. match DF with names()", "9. fastmatch DF with names()", "10. fastmatch DF with names() and dispatch", "11. match DT", "12. fastmatch DT", "13. match DT with names()", "14. fastmatch DT with names()")), limits = rev(levels(res$expr))) print(res)
The conclusion is that when using a data.frame or a data.table the fastest method is the $ operator. Beware that package resolution is a big burden on performance.
Evaluate the performance of different methods for row selection. Usually number of rows >> number of columns and thus the performance when selecting rows is a stronger limitation. In this case we will use as before two objects data.frame and data.table, but also combined with the library fastmatch.
random_term <- get_random_term(raw_annotations) column_values <- term2gene_df$Term res <- microbenchmark( "1" = term2gene_df[term2gene_df$Term == random_term, ], "2" = term2gene_dt[term2gene_dt$Term == random_term, ], "3" = term2gene_df[random_term, ], "4" = term2gene_dt[random_term, ], "5" = term2gene_df[match(random_term, column_values), ], "6" = term2gene_df[fmatch(random_term, column_values), ], "7" = term2gene_dt[match(random_term, column_values), ], "8" = term2gene_dt[fmatch(random_term, column_values), ] )
autoplot(res) + ggplot2::scale_x_discrete( name = NULL, labels = rev(c("1. Cell value comparison DF", "2. Cell value comparison DT", "3. Row name match DF", "4. Row name match DT", "5. match DF", "6. fastmatch DF", "7. match DT", "8. fastmatch DT")), limits = rev(levels(res$expr))) print(res)
There are two alternatives with very close times, only when looking at the median we can conclude that the fastest row selection is by row name matching on a data.frame. The implementation using fastmatch on a data.frame gets almost the same average execution time.
Evaluate the performance of intersection between sets in a character vector.
res <- microbenchmark( "1" = first_set[first_set %in% second_set], "2" = intersect(first_set, second_set), "3" = first_set[!is.na(match(first_set, second_set))], "4" = first_set[!is.na(fmatch(first_set, second_set))] )
autoplot(res) + ggplot2::scale_x_discrete( name = NULL, labels = rev(c("1. %in% operator", "2. intersect() function", "3. Native match", "4. fastmatch" )), limits = rev(levels(res$expr))) print(res)
The conclusion is that the fastest intersection is provided by fastmatch.
res <- microbenchmark( "1" = union(first_set, second_set), "2" = unique(c(first_set, second_set)), "3" = c(first_set, second_set[is.na(match(second_set, first_set))]), "4" = c(first_set, second_set[is.na(fmatch(second_set, first_set))]) )
autoplot(res) + ggplot2::scale_x_discrete( name = NULL, labels = rev(c("1. %in% operator", "2. unique() function", "3. Native match", "4. fastmatch" )), limits = rev(levels(res$expr))) print(res)
The conclusion is that the fastest union is provided by fastmatch.
res <- microbenchmark( "1" = first_set [! first_set %in% second_set], "2" = setdiff(first_set, second_set), "3" = first_set [is.na(match(first_set, second_set))], "4" = first_set [is.na(fmatch(first_set, second_set))] )
autoplot(res) + ggplot2::scale_x_discrete( name = NULL, labels = rev(c("1. %in% operator", "2. setdiff() function", "3. Native match", "4. fastmatch" )), limits = rev(levels(res$expr))) print(res)
The conclusion is that the fastest difference is provided by fastmatch.
union_set = c(first_set, second_set[is.na(fmatch(second_set, first_set))]) intersection_set = first_set[!is.na(fmatch(first_set, second_set))] res <- microbenchmark( "1" = union_set[!union_set %in% intersection_set], "2" = union_set[is.na(match(union_set, intersection_set))], "3" = c(first_set[is.na(match(first_set, second_set))], second_set[is.na(match(second_set, first_set))]), "4" = union_set[is.na(fmatch(union_set, intersection_set))], "5" = c(first_set[is.na(fmatch(first_set, second_set))], second_set[is.na(fmatch(second_set, first_set))]) )
autoplot(res) + ggplot2::scale_x_discrete( name = NULL, labels = rev(c("1. %in% operator", "2. Native match 1", "3. Native match 2", "4. fastmatch 1", "5. fastmatch 2" )), limits = rev(levels(res$expr))) print(res)
The conclusion is that the fastest union is provided by fastmatch. If intersection and union are already computed use the first version, otherwise use the alternative implementation.
The just-in-time compilation allows that the code of our R code is translated into byte code and save some interpretation time at execution time. Our implementation is based on the compiler package included in base R.
get_ui_similarity_bc <- compiler::cmpfun(get_ui_similarity)
When benchmarking the execution time of several JIT compiled functions with their plain R counterparts we observe a slight increase in performance.
random_term_1 <- get_random_term(raw_annotations) random_term_2 <- get_random_term(raw_annotations) kegg <- TCGAome::load_kegg() res <- microbenchmark( "1" = get_cosine_similarity(kegg, random_term_1, random_term_2), "2" = get_cosine_similarity_bc(kegg, random_term_1, random_term_2), "3" = get_binary_similarity(kegg, random_term_1, random_term_2), "4" = get_binary_similarity_bc(kegg, random_term_1, random_term_2), "5" = get_bc_similarity(kegg, random_term_1, random_term_2), "6" = get_bc_similarity_bc(kegg, random_term_1, random_term_2), "7" = get_ui_similarity(kegg, random_term_1, random_term_2), "8" = get_ui_similarity_bc(kegg, random_term_1, random_term_2) )
autoplot(res) + ggplot2::scale_x_discrete( name = NULL, labels = rev(c("1. Cosine similarity", "2. Cosine similarity byte code", "3. Binary similarity", "4. Binary similarity byte code", "5. Bray-Curtis similarity", "6. Bray-Curtis similarity byte code", "7. Union-intersection similarity", "8. Union-intersection similarity byte code" )), limits = rev(levels(res$expr))) print(res)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.