Marcelo Tonon 16/10/2020
As the title spells out, here will be presented how the innermap
package was developed, showing how the different versions of the
functions presented in the package perfom. Until now, there were three
versions of the package, being that the second one was not published, as
we will explain futher. The third version is the current one, and it
uses only the purrr
package. In the path of the development, I own a
lot to Giuliano Sposito, and also to
my great sock friend Pedro Cava, for
giving me advice in how to develop it better, and also for pointing to
the possibility of the better perfomance of a seqApply
[1].
Obviously, all eventual mistakes are my own!
Packages needed
library(purrr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(microbenchmark)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(rlang)
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
## flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
## splice
The first version of innermap
was structured in the following fashion:
innermap_V1 <- function(input, .f0, distance = 1){
.output <- purrr::map2(c(1:(length(input)-distance)), c((1+distance):length(input)),
function(x1,x2) rlang::exec(.f0, input[[x1]], input[[x2]])
)
return(.output)
}
All the innermap
function family was structured like this. As one can
notice, this version used the rlang
package.
# Changing due to feedbacks!
The early version of the innermap
package was nothing more then an
more abstracted form of the way I always used map2
when working with
Input-Output Tables and Structural Decomposition Analysis. As I never
had received any feedback or criticism for that, I did not though of any
other way for doing this. When I published the package, I was then
presented, by pedrocava and giulsposito to the dplyr::lead
function,
and so I created a new version of innermap
:
innermap_V2 <- function(input, .f0, distance = 1){
.output <- purrr::map2(input,
dplyr::lead(input, distance),
.f0
)[c(1:(length(input)-distance))]
return(.output)
}
This version did not used rlang::exec
, but used dplyr::lead
.
As said in README, the main motivation to create innermap
was for
dealing with lists of matrices and data.frames. But for the sake of
simplicity[2], the first example I gave in the first publication of
the package was of innermap
dealing with atomic vectors. Giuliano
Sposito then graciously
appeared
and using benchmark said that for atomic vectors there was a better way:
leadApply <- function(input, .f0, distance =1){
.f0(input,
dplyr::lead(input, distance)
)[1:(length(input)-distance)]
}
This is certainly true. Let’s compare how the three structures behave in comparison:
inputVec <- 1:100000
innermap_dbl_V1 <- function(input, .f0, distance = 1){
.output <- purrr::map2_dbl(c(1:(length(input)-distance)), c((1+distance):length(input)),
function(x1,x2) rlang::exec(.f0, input[[x1]], input[[x2]])
)
return(.output)
}
innermap_dbl_V2 <- function(input, .f0, distance = 1){
.output <- purrr::map2_dbl(input,
dplyr::lead(input, distance),
.f0
)[c(1:(length(input)-distance))]
return(.output)
}
microbenchmark(
"innermap_dbl_V1" = {innermap_dbl_V1(inputVec, `+`, 5)},
"innermap_dbl_V2" = {innermap_dbl_V2(inputVec, `+`, 5)},
"leadApply" = {leadApply(inputVec, `+`, 5)},
times =30
)
## Unit: microseconds
## expr min lq mean median uq
## innermap_dbl_V1 1423252.601 1538556.102 1927850.434 2049706.101 2170268.101
## innermap_dbl_V2 335935.702 354889.900 435371.324 428690.051 505641.601
## leadApply 798.001 1021.401 1688.581 1321.101 1888.601
## max neval
## 2420695.800 30
## 621687.900 30
## 5011.501 30
The desavantage of leadApply
is that it does not support vectors or
formula in .f0
as innermap
does. But it does support anonymous
function. It must be said that it will return an error if the function
choosed returns an vector with length higher then 1.
inputVec2 <- 1:100
microbenchmark(
"innermap_V1" = {innermap_V1(inputVec2, `+`, 20)},
"innermap_V2" = {innermap_V2(inputVec2, `+`, 20)},
times =30
)
## Unit: microseconds
## expr min lq mean median uq max neval
## innermap_V1 1301.301 1346.902 1904.161 1487.101 1720.901 12217.801 30
## innermap_V2 626.001 815.701 1313.841 1068.701 1272.801 8529.701 30
As we can see giulsposito was right, and leadApply
clearly perfoms way
better than innermap_dbl_V1
or innermap_dbl_V2
. We can see also that
innermap_dbl_V2
perfomed significantly better than innermap_dbl_V1
.
We can also see that all of them deliver the same result:
all.equal(innermap_dbl_V1(inputVec, `+`, 5),
innermap_dbl_V2(inputVec, `+`, 5))
## [1] TRUE
all.equal(innermap_dbl_V1(inputVec, `+`, 5),
leadApply(inputVec, `+`, 5))
## [1] TRUE
This holds for any suffix of innermap
that takes an atomic vector
as
input. But as we are going to see, this does not holds for a list of
matrices.
The V2 of this problem was never published as the current one. This was because of their worse perfomance relative to the their previous version when operating a matrix.
Examples:
MatrixList <- ozone %>% plyr::alply(3)
microbenchmark("V1" = {innermap_V1(MatrixList, `/`, 3)},
"V2" = {innermap_V2(MatrixList, `/`, 3)},
times = 30)
## Unit: milliseconds
## expr min lq mean median uq max neval
## V1 1.306701 1.528801 2.177694 1.765651 2.462201 9.387701 30
## V2 2.556301 3.186501 4.025328 3.783601 4.941301 6.716202 30
Cell by cell multiplication:
microbenchmark("V1" = {innermap_V1(MatrixList, `*`, 3)},
"V2" = {innermap_V2(MatrixList, `*`, 3)},
times = 30)
## Unit: milliseconds
## expr min lq mean median uq max neval
## V1 1.583701 1.787001 2.282301 2.075801 2.633101 4.468701 30
## V2 2.627301 3.121400 3.969358 3.516901 4.645402 10.486501 30
Standard Matrix Multiplications
microbenchmark("V1" = {innermap_V1(MatrixList, crossprod, 3)},
"V2" = {innermap_V2(MatrixList, crossprod, 3)},
times = 30)
## Unit: milliseconds
## expr min lq mean median uq max neval
## V1 1.923300 2.063600 3.029758 2.397900 3.239201 9.229301 30
## V2 2.650401 3.035702 3.706631 3.366301 3.962801 6.575902 30
microbenchmark("V1" = {innermap_V1(MatrixList, `+`, 3)},
"V2" = {innermap_V2(MatrixList, `+`, 3)},
times = 30)
## Unit: milliseconds
## expr min lq mean median uq max neval
## V1 1.402201 1.556802 4.362661 1.684951 2.013701 77.030301 30
## V2 2.622301 3.027902 3.686074 3.509601 4.472901 6.304702 30
innermap_lgl_V1 <- function(input, .f0, distance = 1){
.output <- purrr::map2_lgl(c(1:(length(input)-distance)), c((1+distance):length(input)),
function(x1,x2) rlang::exec(.f0, input[[x1]], input[[x2]])
)
return(.output)
}
innermap_lgl_V2 <- function(input, .f0, distance = 1){
.output <- purrr::map2_lgl(input,
dplyr::lead(input, distance),
.f0
)[c(1:(length(input)-distance))]
return(.output)
}
microbenchmark("V1" = {innermap_lgl_V1(MatrixList, identical, 3)},
"V2" = {innermap_lgl_V2(MatrixList, identical, 3)},
times = 30)
## Unit: milliseconds
## expr min lq mean median uq max neval
## V1 1.368701 1.765301 2.398498 1.904201 2.215502 12.996101 30
## V2 2.177601 2.536201 3.245624 2.867951 3.163701 9.914901 30
In this version I took away lead
in favour of a base-r solution for
the subsetting of the lists. This was also another one of the Giuliano
Sposito’s
recomendations.
This structure was then like this:
innermap_V3 <- function(input, .f0, distance = 1){
.output <- purrr::map2(input[c(1:(length(input)-distance))],
input[c((1+distance):length(input))],
.f0)
return(.output)
}
The perfomance of this structure compared to the formers:
microbenchmark("V1" = {innermap_V1(MatrixList, `/`, 3)},
"V2" = {innermap_V2(MatrixList, `/`, 3)},
"V3" = {innermap_V3(MatrixList, `/`, 3)},
times = 30)
## Unit: microseconds
## expr min lq mean median uq max neval
## V1 1507.501 1751.800 2403.181 2021.001 2650.501 7588.002 30
## V2 2585.701 3234.100 3739.831 3575.151 4362.301 5374.301 30
## V3 959.501 1190.701 1789.254 1384.251 1594.701 10095.301 30
microbenchmark("V1" = {innermap_V1(MatrixList, crossprod, 3)},
"V2" = {innermap_V2(MatrixList, crossprod, 3)},
"V3" = {innermap_V3(MatrixList, crossprod, 3)},
times = 30)
## Unit: milliseconds
## expr min lq mean median uq max neval
## V1 2.050002 2.371101 2.878371 2.626851 2.995701 5.346401 30
## V2 2.683101 3.031101 3.945448 3.518251 4.185100 11.647701 30
## V3 1.019101 1.258400 1.383681 1.336951 1.504101 2.075501 30
microbenchmark("V1" = {innermap_V1(MatrixList, `+`)},
"V2" = {innermap_V2(MatrixList, `+`)},
"V3" = {innermap_V3(MatrixList, `+`)},
times = 30)
## Unit: microseconds
## expr min lq mean median uq max neval
## V1 1348.301 1523.801 1827.788 1737.202 1885.602 3069.801 30
## V2 2408.702 2896.701 3295.241 3285.651 3550.601 4451.801 30
## V3 789.301 1052.502 1323.238 1198.901 1392.501 4427.201 30
innermap_lgl_V3 <- function(input, .f0, distance = 1){
.output <- purrr::map2_lgl(input[c(1:(length(input)-distance))],
input[c((1+distance):length(input))],
.f0)
return(.output)
}
microbenchmark("V1" = {innermap_lgl_V1(MatrixList, identical, 1)},
"V2" = {innermap_lgl_V2(MatrixList, identical, 1)},
"V3" = {innermap_lgl_V3(MatrixList, identical, 1)},
times = 30)
## Unit: microseconds
## expr min lq mean median uq max neval
## V1 1456.701 1618.501 1874.4676 1737.101 2031.301 3222.601 30
## V2 2240.200 2433.701 3187.8710 2781.301 3549.900 6859.901 30
## V3 478.601 550.600 939.4709 642.551 736.101 9130.701 30
These improvements happened because of three things:
. The structure of V1
used the lengths of the input
as .x
e .y
in map2()
and subsetted input
inside the rlang::exec
. Now is the
input itself that is used. In the end, V3
has a similar structure of
V2
, but; . V2
uses dplyr::lead
instead of baser
, that is
what V3
uses. The difference in perfomance is apparent when we deal
with lists:
exampleDist <- 2
microbenchmark(
"lead" = ({lead(MatrixList, exampleDist)}),
"baseR"= MatrixList[c((1+exampleDist):length(MatrixList))]
)
## Unit: microseconds
## expr min lq mean median uq max neval
## lead 1340.601 1623.2015 2042.02910 1875.051 2220.0505 6411.701 100
## baseR 17.500 22.1505 35.59499 31.701 39.4015 124.401 100
. innermap_V2(input, .f0, distance)
runs map2
a total of
length(input)
times and then subset it with
[c(1:(length(input)-distance))]
. The total of times
innermap_V3(input, .f0, distance)
runs map2
equals length(input) -
distance
. This happens because dplyr::lead(input)
keep the same
length of input
while innermap_V3
modifies the lenght of input
as
it serves as .x
and .y
parameter for map2
.
We have retired leadApply
and transformed it in `innerApply``,
since the latter perfoms way better.
innerApply <- function(input, .f0, distance =1){
.f0(input[c(1:(length(input)-distance))],
input[c((1+distance):length(input))]
)
}
microbenchmark(
"leadApply" = {leadApply(inputVec, `+`, 5)},
"innerApply" = {innerApply(inputVec, `+`, 5)},
times =30
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## leadApply 1.017001 1.573701 2.220898 1.766201 2.292801 11.0912 30
## innerApply 3.072501 3.613001 4.836614 4.208151 5.132701 12.7491 30
Well, for now not that much. Test in the future use as_mapper
directly
and see if we gain more perfomance in this matter. Writing C/C++ is not
in my plans. I also want to focus in other packages that would help
Input Output Analysis (Structural Decomposition in special) in R.
The function Giuliano Sposito proposed was called leadApply
since
it used the function dplyr::lead
, but as the base R function
perfom better, made no sense to called leadApply
. Important to
say that he also said that moving to a base-R structure would have a
better perfomance.
Also, no idea for an simple matrix example came to mind at that moment.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.