The aim of intubate
(logo <||>
) is to offer a painless way to
add R functions that are not pipe-aware
to data science pipelines implemented by magrittr
with the
operator %>%
, without having to rely on workarounds of
varying complexity. It also
implements three extensions called intubOrders
, intuEnv
, and intuBags
.
## From CRAN
install.packages("intubate")
## Or from GitHub
# install.packages("devtools")
devtools::install_github("rbertolusso/intubate")
If you like magrittr
pipelines (%>%) and you are looking
for an alternative to performing a statistical analysis
in the following way:
fit <- lm(sr ~ pop15, LifeCycleSavings)
summary(fit)
intubate
lets you do it in these other ways:
library(intubate)
library(magrittr)
ntbt_lm
is the interface provided to lm
, and one of the over 450
interfaces intubate
currently implements
(for the list of 88 packages currently containing interfaces see below).
LifeCycleSavings %>%
  ntbt_lm(sr ~ pop15) %>%    ## ntbt_lm is the interface to lm provided by intubate
  summary()
Calling non-pipe-aware functions directly with ntbt
You do not need to use interfaces. You can call non-pipe-aware functions
directly using ntbt
(even those that currently do not have an interface
provided by intubate
).
LifeCycleSavings %>%
  ntbt(lm, sr ~ pop15) %>%   ## ntbt calls lm without needing to use an interface
  summary()
The help for each interface contains examples of use.
intubate
allows you to create your own interfaces "on demand",
right now, giving you full power of decision regarding which functions
to interface.
The ability to amplify the scope of intubate
may prove particularly welcome if you work in a particular
field that may, in the long run, continue to lack interfaces due to my
unforgivable, but unavoidable, ignorance.
As an example of creating an interface "on demand", suppose the interface to
cor.test
was lacking in the current version of intubate
and suppose
(at least for a moment) that you want to
create yours because you are searching for a pipeline-aware alternative to
any of the following styles of coding (results not shown):
data(USJudgeRatings)

## 1)
cor.test(USJudgeRatings$CONT, USJudgeRatings$INTG)

## 2)
attach(USJudgeRatings)
cor.test(CONT, INTG)
detach()

## 3)
with(USJudgeRatings, cor.test(CONT, INTG))

## 4)
USJudgeRatings %>%
  with(cor.test(CONT, INTG))
To be able to create an interface to cor.test
"on demand", the only thing you
need to do is to add the following line of code somewhere before its use
in your pipeline:
ntbt_cor.test <- intubate ## intubate is the helper function
Please note the lack of parentheses.
Nothing else is required.
The only thing you need to remember is that the name of an interface
must start with ntbt_
followed by the name of the interfaced function
(cor.test
in this particular case), no matter which function you want to
interface.
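How can one helper serve every interface? One plausible mechanism, sketched below in base R, is for the function to look at the name under which it was called and strip the prefix to find the function to run. This is only an illustration under that assumption, not intubate's actual implementation: `name_driven_helper` and the `wrap_` prefix are hypothetical names chosen so they do not clash with the package.

```r
## Hypothetical sketch, NOT intubate's code: a helper that discovers the
## function to call from the name it was invoked under.
name_driven_helper <- function(data, ...) {
  called_as <- as.character(sys.call()[[1]])      # e.g. "wrap_cor.test"
  fn_name   <- sub("^wrap_", "", called_as)       # e.g. "cor.test"
  args      <- as.list(substitute(list(...)))[-1] # arguments, left unevaluated
  call      <- as.call(c(list(as.name(fn_name)), args))
  ## Evaluate the rebuilt call with the columns of `data` visible,
  ## falling back on the caller's environment for everything else.
  eval(call, data, parent.frame())
}

## "Creating an interface" is then a plain assignment (no parentheses).
wrap_cor.test <- name_driven_helper

res <- wrap_cor.test(USJudgeRatings, CONT, INTG)
res$estimate
```

In this sketch the target function is recovered from the call itself, so the wrapper has to be called by its name; that is why a fixed prefix followed by the exact function name is required here.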
Now you can use your "just baked" interface in any pipeline. A pipeline alternative to the above code may look like this:
USJudgeRatings %>% ntbt_cor.test(CONT, INTG) ## Use it right away
Calling non-pipe-aware functions directly with ntbt
Of course, as already stated, you do not have to create an interface if you do not
want to. You can call the non-pipe-aware function directly with ntbt
,
in the following way:
USJudgeRatings %>% ntbt(cor.test, CONT, INTG)
You can potentially use ntbt
with any function, also the ones without an interface
provided by intubate
. In principle,
the functions you would like to call are the ones you cannot use directly in
a pipeline (because data
is not in first place in the definition of the function).
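To see concretely why such functions cannot be dropped into a pipeline, the base R sketch below imitates what %>% does (it passes the data as the first argument), using lm and LifeCycleSavings from the examples above. The variable names are illustrative only.

```r
## A pipeline such as `LifeCycleSavings %>% lm(sr ~ pop15)` amounts to
## lm(LifeCycleSavings, sr ~ pop15): the data lands in the `formula` slot
## and the formula lands in the `data` slot.
piped_style <- tryCatch(
  lm(LifeCycleSavings, sr ~ pop15),
  error = function(e) e
)
inherits(piped_style, "error")   # TRUE: the positional call fails

## One common workaround is to name the `data` argument explicitly,
## which is what `%>% lm(sr ~ pop15, data = .)` does with magrittr.
fit <- lm(sr ~ pop15, data = LifeCycleSavings)
coef(fit)
```

Workarounds like this one have to be adapted to each function, which is the kind of effort the interfaces and ntbt are meant to spare you.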
The link below is to Dr. Sheather's website, from which the code was extracted. The site also has information about the book. This code could be used to produce the plots in Figure 3.1 on page 46. Different strategies are illustrated.
http://www.stat.tamu.edu/~sheather/book/
attach(anscombe)
plot(x1, y1, xlim = c(4, 20), ylim = c(3, 14), main = "Data Set 1")
abline(lsfit(x1, y1))
detach()
You needed to attach
so variables are visible locally.
Otherwise, you would have had to use anscombe$x1
and anscombe$y1
.
You could also have used with
.
Spaces were added for clarity and better comparison with code below.
Using magrittr pipes (%>%) and intubate (1: provided interface and 2: ntbt):
anscombe %>%
  ntbt_plot(x2, y2, xlim = c(4, 20), ylim = c(3, 14), main = "Data Set 2") %>%
  ntbt(lsfit, x2, y2) %>%    # Call non-pipe-aware function directly with `ntbt`
  abline()                   # No need to interface 'abline'.
ntbt_plot
is the interface to plot
provided by intubate
.
As plot
returns NULL
, intubate
forwards (invisibly) its input
automatically without having to use %T>%
, so lsfit
gets the
original data (what it needs) and everything is done in one pipeline.
ntbt
lets you call the non-pipe-aware function lsfit
directly.
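The forwarding behavior described above (hand the input back when the called function returns NULL) can be sketched in a few lines of base R. The names `forward_if_null` and `report_rows` are hypothetical, and this is an illustration of the idea rather than intubate's own code.

```r
## Hypothetical helper: call `fn` on the data and, if `fn` returns NULL,
## return the data invisibly so the next pipeline step can still use it.
forward_if_null <- function(data, fn, ...) {
  result <- fn(data, ...)
  if (is.null(result)) invisible(data) else result
}

## A side-effect-only function that returns NULL (standing in for plot()).
report_rows <- function(d) {
  message("rows: ", nrow(d))
  NULL
}

out <- forward_if_null(anscombe, report_rows)   # the side effect happens...
identical(out, anscombe)                        # ...and the data comes back

## Functions that do return something are passed through untouched.
fit <- forward_if_null(anscombe, function(d) lsfit(d$x1, d$y1))
names(fit)
```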
You can use ntbt
always (you do not need to use ntbt_
interfaces
if you do not want to), but ntbt
is particularly useful to interface
directly a non-pipe-aware function for which intubate
does not provide
an interface (as currently happens with lsfit
).
If intubate
does not provide an interface to a given function
and you prefer to use interfaces instead of ntbt
, you can create your own
interface "on demand" and use it right away in your pipeline.
To create an interface, the following line of code, placed before its use, suffices:
ntbt_lsfit <- intubate # NOTE: we are *not* including parentheses.
That's it, you have created your interface. Just remember that:
intubate
interfaces must start with ntbt_
followed by the
name of the function to interface (lsfit
in this case).
You can now use ntbt_lsfit
in your pipeline as any other interfaced function:
anscombe %>%
  ntbt_plot(x3, y3, xlim = c(4, 20), ylim = c(3, 14), main = "Data Set 3") %>%
  ntbt_lsfit(x3, y3) %>%     # Using just created "on demand" interface
  abline()
Instead of the X
Y
approach, you can also use the formula variant.
In this case, we will have to use lm
as lsfit
does not implement
formulas.
anscombe %>%
  ntbt_plot(y4 ~ x4, xlim = c(4, 20), ylim = c(3, 14), main = "Data Set 4") %>%
  ntbt_lm(y4 ~ x4) %>%       # We use 'ntbt_lm' instead of 'ntbt_lsfit'
  abline()
intubate
intubate
implements three extensions:
intubOrders
, intuEnv
, and intuBags
.
These experimental features are functional and ready for you to try. However, unless you do not mind potentially having to change your code while the architecture solidifies, they are not (yet) recommended for production code.
intubOrders
intubOrders
allow you, among other things, to:
run, in place, functions on the input (data
)
to the interfaced function, such as head
, tail
, dim
, str
, View
, ...
run, in place, functions that use the result generated by the interfaced
function, such as print
, summary
, anova
, plot
, ...
forward the input to the interfaced function without using %T>%
signal other modifications to the behavior of the interface
intubOrders
are implemented by an intuBorder
<||>
(from where the
logo of intubate
originates).
The intuBorder
contains 5 zones (intuZones
?, maybe too much...):
zone 1 < zone 2 | zone 3 | zone 4 > zone 5
zone 1 and zone 5 will be explained later
zone 2 is used to indicate the functions that are to be applied to the input to the interfaced function
zone 3 is used to modify the behavior of the interface
zone 4 is used to indicate the functions that are to be applied to the result of the interfaced function
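To make the five-zone layout concrete, here is a minimal base R sketch that splits such a string into its zones with a regular expression. The function `split_intuBorder` is a hypothetical name and the parsing rule is an assumption made for illustration; intubate's own parser may well work differently.

```r
## Hypothetical sketch: split "zone 1 < zone 2 | zone 3 | zone 4 > zone 5"
## into its five parts. It assumes the characters <, | and > appear only
## as zone delimiters.
split_intuBorder <- function(order) {
  pattern <- "^(.*)<(.*)\\|(.*)\\|(.*)>(.*)$"
  m <- regmatches(order, regexec(pattern, order))[[1]]
  if (length(m) == 0) stop("no intuBorder found in: ", order)
  zones <- trimws(m[-1])                 # drop the full match, keep 5 groups
  names(zones) <- paste0("zone", 1:5)
  zones
}

z <- split_intuBorder("CO3 <|f| print > lmfit")
z[["zone1"]]   # name of the input to use
z[["zone3"]]   # behavior modifiers
z[["zone4"]]   # functions to apply to the result
z[["zone5"]]   # name under which to save the result
```

A string with empty outer zones, such as "< head; dim |i| print >", splits the same way, with zone 1 and zone 5 coming back as empty strings.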
For example, instead of running the following sequence of function calls (only plot shown):
head(LifeCycleSavings)
tail(LifeCycleSavings, n = 3)
dim(LifeCycleSavings)
str(LifeCycleSavings)
summary(LifeCycleSavings)

result <- lm(sr ~ pop15 + pop75 + dpi + ddpi, LifeCycleSavings)
print(result)
summary(result)
anova(result)
plot(result, which = 1)
you could have run, using an intubOrder
:
LifeCycleSavings %>%
  ntbt_lm(sr ~ pop15 + pop75 + dpi + ddpi,
          "< head; tail(#, n = 3); dim; str; summary |i| print; summary; anova; plot(#, which = 1) >")
Note:
i is used to force an invisible result
# is used as a placeholder either for the input or for the result, in cases where the call requires extra parameters
intubOrders
may prove to be of interest to non-pipeline oriented people too
(results not shown):
ntbt_lm(LifeCycleSavings, sr ~ pop15 + pop75 + dpi + ddpi,
        "< head; tail(#, n = 3); dim; str; summary |i| print; summary; anova; plot(#, which = 1) >")
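Since # starts a comment in R, a step such as tail(#, n = 3) cannot be parsed as it stands; the placeholder has to be replaced textually first. The base R sketch below shows one way the steps of a zone could be expanded and run. `run_steps` is a hypothetical helper written for this illustration, and it assumes steps are separated by semicolons that do not occur inside a step.

```r
## Hypothetical sketch: run each ";"-separated step of a zone on `x`.
## A bare function name becomes name(x); a step containing "#" has the
## placeholder replaced by x before parsing.
run_steps <- function(zone, x) {
  steps <- trimws(strsplit(zone, ";", fixed = TRUE)[[1]])
  lapply(setNames(steps, steps), function(step) {
    code <- if (grepl("#", step, fixed = TRUE)) {
      gsub("#", "x", step, fixed = TRUE)   # tail(#, n = 3) -> tail(x, n = 3)
    } else {
      paste0(step, "(x)")                  # dim -> dim(x)
    }
    eval(parse(text = code), list(x = x))
  })
}

res <- run_steps("dim; tail(#, n = 3)", LifeCycleSavings)
res[["dim"]]                      # dimensions of the input
nrow(res[["tail(#, n = 3)"]])     # three rows, as requested
```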
intubOrders
with collections of inputs
When using pipelines, the receiving function has to deal with the whole object that it receives as its input. Then, it produces a result that, again, needs to be consumed as a whole by the following function.
intubOrders
allow you to work with a collection of objects of any kind in one pipeline, selecting at each step which input to use.
As an example suppose you want to perform the following statistical procedures in one pipeline (results not shown).
CO2 %>%
  ntbt_lm(conc ~ uptake)

USJudgeRatings %>%
  ntbt_cor.test(CONT, INTG)

sleep %>%
  ntbt_t.test(extra ~ group)
We will first create a collection (a list
in this case, but it could also be
intuEnv
or an intuBag
, explained later) containing the three dataframes:
coll <- list(CO3 = CO2,
             USJudgeRatings1 = USJudgeRatings,
             sleep1 = sleep)
names(coll)
(We have changed the names to show we are not cheating...)
We will now use as source the whole collection.
The intubOrder
will need the following info:
zone 1, in each case, indicates which is the data.frame (or any other object) that we want to use as input in this particular function
zone 3 needs to include f to forward the input (if you want the next function to receive the whole collection, and not the result of this step)
zone 4 (optional) may contain a print (or summary) if you want something to be displayed
coll %>%
  ntbt_lm(conc ~ uptake, "CO3 <|f| print >") %>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <|f| print >") %>%
  ntbt_t.test(extra ~ group, "sleep1 <|f| print >") %>%
  names()
names()
was added at the end to show that we have forwarded
the original collection to the end of the pipeline.
What happens if you would like to save the results of the function calls (or intermediate results of data manipulations)?
intuEnv
and intuBags
intuEnv
and intuBags
allow you to save intermediate results without
leaving the pipeline. They can also be used to contain the collections
of objects.
Let us first consider
intuEnv
When intubate
is loaded, it creates intuEnv
, an empty environment that can
be populated with results that you want to use later.
You can access the intuEnv
as follows:
intuEnv()  ## intuEnv() returns invisibly, so nothing is output
You can verify that, initially, it is empty:
ls(intuEnv())
How can intuEnv
be used?
Suppose that, instead of displaying the results of interfaced functions, we want
to save the objects returned by them. One strategy (the other is using intuBags
)
is to save the results to intuEnv
.
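The idea of storing a result under a name while passing the input along can be sketched in base R with an ordinary environment. Everything named here (`results_env`, `save_and_forward`, the `save_as` argument) is hypothetical and only stands in for what intuEnv and zone 5 provide.

```r
## Hypothetical stand-in for intuEnv: a plain environment.
results_env <- new.env()

## Call `fn` with the data in its `data` argument, store the result
## under `save_as`, and return the input so a pipeline could continue.
save_and_forward <- function(data, fn, ..., save_as) {
  assign(save_as, fn(..., data = data), envir = results_env)
  invisible(data)
}

out  <- save_and_forward(sleep, t.test, extra ~ group, save_as = "ttres")
out2 <- save_and_forward(out, lm, extra ~ group, save_as = "lmfit")

ls(results_env)           # both results were collected
identical(out2, sleep)    # and the input travelled through unchanged
```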
How to save to intuEnv?
The intubOrder
will need the following info:
zone 3 needs to include f to forward the input (if you want the next function to receive the whole collection, and not its result)
zone 5, in each case, indicates the name that the result will have in the intuEnv
coll %>%
  ntbt_lm(conc ~ uptake, "CO3 <|f|> lmfit") %>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <|f|> ctres") %>%
  ntbt_t.test(extra ~ group, "sleep1 <|f|> ttres") %>%
  names()
As you can see, the collection stays unchanged, but look
inside intuEnv
ls(intuEnv())
intuEnv
has collected the results, which are now ready for use.
Four strategies of using one of the collected results are shown below (output not shown):
intuEnv()$lmfit %>% summary()
attach(intuEnv())
lmfit %>%
  summary()
detach()
intuEnv() %>% ntbt(summary, "lmfit <||>")
intuEnv() %>% ntbt(I, "lmfit <|i| summary >")
clear_intuEnv
can be used to empty the contents of intuEnv
.
clear_intuEnv()
ls(intuEnv())
intuEnv
with the Global Environment
If you want your results to be saved to the Global environment (it could be
any environment), you can associate intuEnv
to it, so you can have your
results available as any other saved object.
First let's display the contents of the Global environment:
ls()
set_intuEnv
lets you associate intuEnv
to an environment. It takes
an environment as parameter, and returns the current intuEnv
, in case you
want to save it to reinstate it later. If not, I think it will be just
garbage collected (I may be wrong).
Let's associate intuEnv
to the global environment (saving the current
intuEnv
):
saved_intuEnv <- set_intuEnv(globalenv())
Now, we re-run the pipeline:
coll %>%
  ntbt_lm(conc ~ uptake, "CO3 <|f|> lmfit") %>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <|f|> ctres") %>%
  ntbt_t.test(extra ~ group, "sleep1 <|f|> ttres") %>%
  names()
Before forgetting, let's reinstate the original intuEnv
:
set_intuEnv(saved_intuEnv)
And now, let's see if the results were saved to the global environment:
ls()
They were.
Now the results are at your disposal to use as any other variable (result not shown):
lmfit %>% summary()
intuEnv
as source of the pipeline
You can use intuEnv
(or any other environment) as the input
of your pipeline.
We already cleared the contents of intuEnv
, but let's do it
again to get used to how to do it:
clear_intuEnv()
ls(intuEnv())
Let's populate intuEnv
with the same objects as before:
intuEnv(CO3 = CO2,
        USJudgeRatings1 = USJudgeRatings,
        sleep1 = sleep)
ls(intuEnv())
When using an environment, such as intuEnv
, as the source of your pipeline,
there is no need to specify f
in zone 3
, as the environment is always forwarded
(the same happens when the source is an intuBag
).
Keep in mind that, if you are saving results and your source is an environment
other than intuEnv
, the results will be saved to intuEnv
, and not to the source
environment. If the source is an intuBag
, the results will be saved to the
intuBag
, and not to intuEnv
.
We will run the same pipeline as before, but this time we will add subset
and summary
(called directly with ntbt
) to illustrate how we can use a previously
generated result (such as from data transformations) in the same pipeline in which
it was generated. We will use intuEnv
as the source of the pipeline.
intuEnv() %>%
  ntbt(subset, Treatment == "nonchilled", "CO3 <||> CO3nc") %>%
  ntbt_lm(conc ~ uptake, "CO3nc <||> lmfit") %>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <||> ctres") %>%
  ntbt_t.test(extra ~ group, "sleep1 <||> ttres") %>%
  ntbt(summary, "lmfit <||> lmsfit") %>%
  names()
As subset
is already pipe-aware (data
is its first parameter),
you have two ways of proceeding. One is the one illustrated
above (same strategy used on non-pipe-aware functions). The other, which
works only when using pipe-aware functions, is:
intuEnv() %>%
  ntbt(subset, CO3, Treatment == "nonchilled", "<||> CO3nc")
intuBags
intuBags
differ from intuEnv
in that they are based on lists instead of
on environments. Even if (with a little care) you could keep track of several
intuEnvs
, it seems natural (to me) to deal with only one, while several intuBags
(for example one for each database, or collection of objects) seem natural (to me).
Other than that, using an intuEnv
or an intuBag
is a matter of
personal taste.
What you can do with one you can do with the other.
iBag <- intuBag(CO3 = CO2, USJudgeRatings1 = USJudgeRatings, sleep1 = sleep)
iBag %>%
  ntbt(subset, Treatment == "nonchilled", "CO3 <||> CO3nc") %>%
  ntbt_lm(conc ~ uptake, "CO3nc <||> lmfit") %>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <||> ctres") %>%
  ntbt_t.test(extra ~ group, "sleep1 <||> ttres") %>%
  ntbt(summary, "lmfit <||> lmsfit") %>%
  names()
When using intuBags
, it is possible to
use %<>%
if you want to save your results to the intuBag
.
This way, instead of a long pipeline, you could run several
short ones.
iBag <- intuBag(CO3 = CO2,
                USJudgeRatings1 = USJudgeRatings,
                sleep1 = sleep)

iBag %<>%
  ntbt(subset, CO3, Treatment == "nonchilled", "<||> CO3nc") %>%
  ntbt_lm(conc ~ uptake, "CO3nc <||> lmfit")

iBag %<>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <||> ctres")

iBag %<>%
  ntbt_t.test(extra ~ group, "sleep1 <||> ttres") %>%
  ntbt(summary, "lmfit <||> lmsfit")

names(iBag)
The intuBag
will keep all your results, in any way you prefer to use it.
The same happens with intuEnv
. Just remember that %<>%
should not
be used with intuEnv
(you should always use %>%
).
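A likely reason for this difference is ordinary R semantics: environments are modified in place, while lists are copied, so a change made to a list inside a function is lost unless the returned list is assigned back (which is what %<>% does). The base R sketch below shows just that behavior; `store_in_env` and `store_in_list` are hypothetical helpers and do not reproduce how intuEnv or intuBags are implemented.

```r
## Hypothetical helpers that each add one element called "lmfit".
store_in_env  <- function(e) { assign("lmfit", "a result", envir = e); invisible(e) }
store_in_list <- function(l) { l$lmfit <- "a result"; l }

e <- new.env()
store_in_env(e)                          # return value not needed
in_env <- exists("lmfit", envir = e, inherits = FALSE)
in_env                                   # the environment was changed in place

bag <- list()
ignored <- store_in_list(bag)            # result not assigned back to `bag`
kept_without_assign <- "lmfit" %in% names(bag)
kept_without_assign                      # the original list is untouched

bag <- store_in_list(bag)                # assigning back keeps the result
kept_with_assign <- "lmfit" %in% names(bag)
kept_with_assign
```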
Suppose you have a database consisting of the following two tables:
iBag <- intuBag(members = data.frame(name = c("John", "Paul", "George", "Ringo", "Brian", NA),
                                     band = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", NA)),
                what_played = data.frame(name = c("John", "Paul", "Ringo", "George", "Stuart", "Pete"),
                                         instrument = c("guitar", "bass", "drums", "guitar", "bass", "drums")))
print(iBag)
and you want to perform an inner join. In these cases, the functions
should receive the whole intuBag
(or intuEnv
, or collection), so
zone 1
should be empty, and the names of the tables should be specified
directly, in the function call, in their corresponding order
(or by stating their parameter names).
iBag %>% ntbt(merge, members, what_played, by = "name", "<|| print >")
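For comparison, this is the same inner join written as a plain base R call on two stand-alone data frames, without an intuBag or a pipeline. The data are the ones shown above; only the variable name `joined` is new.

```r
members <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Brian", NA),
  band = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", NA)
)
what_played <- data.frame(
  name = c("John", "Paul", "Ringo", "George", "Stuart", "Pete"),
  instrument = c("guitar", "bass", "drums", "guitar", "bass", "drums")
)

## merge() performs an inner join by default: only names present in
## both tables are kept.
joined <- merge(members, what_played, by = "name")
joined
```

In the pipeline version, the two table names are resolved inside the intuBag instead of in the calling environment, which is why zone 1 can stay empty.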
intuBag
acting as a database
The following code has been extracted from chapter 13 of "R for Data Science", by Garrett Grolemund and Hadley Wickham (http://r4ds.had.co.nz/relational-data.html)
Original code (output not shown):
library(dplyr)
library(nycflights13)

flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2

flights2 %>%
  select(-origin, -dest) %>%
  left_join(airlines, by = "carrier")

## 13.4.5 Defining the key columns
flights2 %>% left_join(weather)
flights2 %>% left_join(planes, by = "tailnum")
flights2 %>% left_join(airports, c("dest" = "faa"))
flights2 %>% left_join(airports, c("origin" = "faa"))
nycflights13
is a database. As such, we can deal with it using intuBags.
The following code illustrates how all the above can be performed using
an intuBag
(or intuEnv
) and one pipeline:
iBag <- intuBag(flightsIB = flights,
                airlinesIB = airlines,
                weatherIB = weather,
                planesIB = planes,
                airportsIB = airports)
## Note we are changing the names, to make sure we are not cheating
## (by reading from globalenv()).

iBag %<>%
  ntbt(select, flightsIB, year:day, hour, origin, dest, tailnum, carrier, "<|| head > flights2") %>%
  ntbt(select, flights2, -origin, -dest, "<|| print > flights3") %>%
  ntbt(left_join, flights3, airlinesIB, by = "carrier", "<|| print >") %>%
  ntbt(left_join, flights2, weatherIB, "<|| print >") %>%
  ntbt(left_join, flights2, planesIB, by = "tailnum", "<|| print >") %>%
  ntbt(left_join, flights2, airportsIB, c("dest" = "faa"), "<|| print >") %>%
  ntbt(left_join, flights2, airportsIB, c("origin" = "faa"), "<|| print >")

names(iBag)
Note: the results were copied from previously run code to avoid adding dependencies.
##
## ntbt(data = ., fti = select, flightsIB, year:day, hour, origin,
##     dest, tailnum, carrier)
##
## * head <||> result *
## # A tibble: 6 x 8
##    year month   day  hour origin  dest tailnum carrier
##   <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>
## 1  2013     1     1     5    EWR   IAH  N14228      UA
## 2  2013     1     1     5    LGA   IAH  N24211      UA
## 3  2013     1     1     5    JFK   MIA  N619AA      AA
## 4  2013     1     1     5    JFK   BQN  N804JB      B6
## 5  2013     1     1     6    LGA   ATL  N668DN      DL
## 6  2013     1     1     5    EWR   ORD  N39463      UA
##
## ntbt(data = ., fti = select, flights2, -origin, -dest)
##
## * print <||> result *
## # A tibble: 336,776 x 6
##     year month   day  hour tailnum carrier
##    <int> <int> <int> <dbl>   <chr>   <chr>
## 1   2013     1     1     5  N14228      UA
## 2   2013     1     1     5  N24211      UA
## 3   2013     1     1     5  N619AA      AA
## 4   2013     1     1     5  N804JB      B6
## 5   2013     1     1     6  N668DN      DL
## 6   2013     1     1     5  N39463      UA
## 7   2013     1     1     6  N516JB      B6
## 8   2013     1     1     6  N829AS      EV
## 9   2013     1     1     6  N593JB      B6
## 10  2013     1     1     6  N3ALAA      AA
## # ... with 336,766 more rows
##
## ntbt(data = ., fti = left_join, flights3, airlinesIB, by = "carrier")
##
## * print <||> result *
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier                     name
##    <int> <int> <int> <dbl>   <chr>   <chr>                    <chr>
## 1   2013     1     1     5  N14228      UA    United Air Lines Inc.
## 2   2013     1     1     5  N24211      UA    United Air Lines Inc.
## 3   2013     1     1     5  N619AA      AA   American Airlines Inc.
## 4   2013     1     1     5  N804JB      B6          JetBlue Airways
## 5   2013     1     1     6  N668DN      DL     Delta Air Lines Inc.
## 6   2013     1     1     5  N39463      UA    United Air Lines Inc.
## 7   2013     1     1     6  N516JB      B6          JetBlue Airways
## 8   2013     1     1     6  N829AS      EV ExpressJet Airlines Inc.
## 9   2013     1     1     6  N593JB      B6          JetBlue Airways
## 10  2013     1     1     6  N3ALAA      AA   American Airlines Inc.
## # ... with 336,766 more rows
##
## ntbt(data = ., fti = left_join, flights2, weatherIB)
## Joining, by = c("year", "month", "day", "hour", "origin")
##
## * print <||> result *
## # A tibble: 336,776 x 18
##     year month   day  hour origin  dest tailnum carrier  temp  dewp humid
##    <dbl> <dbl> <int> <dbl>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
## 1   2013     1     1     5    EWR   IAH  N14228      UA    NA    NA    NA
## 2   2013     1     1     5    LGA   IAH  N24211      UA    NA    NA    NA
## 3   2013     1     1     5    JFK   MIA  N619AA      AA    NA    NA    NA
## 4   2013     1     1     5    JFK   BQN  N804JB      B6    NA    NA    NA
## 5   2013     1     1     6    LGA   ATL  N668DN      DL 39.92 26.06 57.33
## 6   2013     1     1     5    EWR   ORD  N39463      UA    NA    NA    NA
## 7   2013     1     1     6    EWR   FLL  N516JB      B6 39.02 26.06 59.37
## 8   2013     1     1     6    LGA   IAD  N829AS      EV 39.92 26.06 57.33
## 9   2013     1     1     6    JFK   MCO  N593JB      B6 39.02 26.06 59.37
## 10  2013     1     1     6    LGA   ORD  N3ALAA      AA 39.92 26.06 57.33
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <time>
##
## ntbt(data = ., fti = left_join, flights2, planesIB, by = "tailnum")
##
## * print <||> result *
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin  dest tailnum carrier year.y
##     <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>  <int>
## 1    2013     1     1     5    EWR   IAH  N14228      UA   1999
## 2    2013     1     1     5    LGA   IAH  N24211      UA   1998
## 3    2013     1     1     5    JFK   MIA  N619AA      AA   1990
## 4    2013     1     1     5    JFK   BQN  N804JB      B6   2012
## 5    2013     1     1     6    LGA   ATL  N668DN      DL   1991
## 6    2013     1     1     5    EWR   ORD  N39463      UA   2012
## 7    2013     1     1     6    EWR   FLL  N516JB      B6   2000
## 8    2013     1     1     6    LGA   IAD  N829AS      EV   1998
## 9    2013     1     1     6    JFK   MCO  N593JB      B6   2004
## 10   2013     1     1     6    LGA   ORD  N3ALAA      AA     NA
## # ... with 336,766 more rows, and 7 more variables: type <chr>,
## #   manufacturer <chr>, model <chr>, engines <int>, seats <int>,
## #   speed <int>, engine <chr>
##
## ntbt(data = ., fti = left_join, flights2, airportsIB, c(dest = "faa"))
##
## * print <||> result *
## # A tibble: 336,776 x 14
##     year month   day  hour origin  dest tailnum carrier
##    <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>
## 1   2013     1     1     5    EWR   IAH  N14228      UA
## 2   2013     1     1     5    LGA   IAH  N24211      UA
## 3   2013     1     1     5    JFK   MIA  N619AA      AA
## 4   2013     1     1     5    JFK   BQN  N804JB      B6
## 5   2013     1     1     6    LGA   ATL  N668DN      DL
## 6   2013     1     1     5    EWR   ORD  N39463      UA
## 7   2013     1     1     6    EWR   FLL  N516JB      B6
## 8   2013     1     1     6    LGA   IAD  N829AS      EV
## 9   2013     1     1     6    JFK   MCO  N593JB      B6
## 10  2013     1     1     6    LGA   ORD  N3ALAA      AA
## # ... with 336,766 more rows, and 6 more variables: name <chr>, lat <dbl>,
## #   lon <dbl>, alt <int>, tz <dbl>, dst <chr>
##
## ntbt(data = ., fti = left_join, flights2, airportsIB, c(origin = "faa"))
##
## * print <||> result *
## # A tibble: 336,776 x 14
##     year month   day  hour origin  dest tailnum carrier
##    <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>
## 1   2013     1     1     5    EWR   IAH  N14228      UA
## 2   2013     1     1     5    LGA   IAH  N24211      UA
## 3   2013     1     1     5    JFK   MIA  N619AA      AA
## 4   2013     1     1     5    JFK   BQN  N804JB      B6
## 5   2013     1     1     6    LGA   ATL  N668DN      DL
## 6   2013     1     1     5    EWR   ORD  N39463      UA
## 7   2013     1     1     6    EWR   FLL  N516JB      B6
## 8   2013     1     1     6    LGA   IAD  N829AS      EV
## 9   2013     1     1     6    JFK   MCO  N593JB      B6
## 10  2013     1     1     6    LGA   ORD  N3ALAA      AA
## # ... with 336,766 more rows, and 6 more variables: name <chr>, lat <dbl>,
## #   lon <dbl>, alt <int>, tz <dbl>, dst <chr>
names(iBag)
## [1] "flightsIB"  "airlinesIB" "weatherIB"  "planesIB"   "airportsIB"
## [6] "flights2"   "flights3"
The same, using intuEnv
(output not shown):
clear_intuEnv()

intuEnv(flightsIB = flights,
        airlinesIB = airlines,
        weatherIB = weather,
        planesIB = planes,
        airportsIB = airports) %>%
  ntbt(select, flightsIB, year:day, hour, origin, dest, tailnum, carrier, "<|D| head > flights2") %>%
  ntbt(select, flights2, -origin, -dest, "<|| print > flights3") %>%
  ntbt(left_join, flights3, airlinesIB, by = "carrier", "<|| print >") %>%
  ntbt(left_join, flights2, weatherIB, "<|| print >") %>%
  ntbt(left_join, flights2, planesIB, by = "tailnum", "<|| print >") %>%
  ntbt(left_join, flights2, airportsIB, c("dest" = "faa"), "<|| print >") %>%
  ntbt(left_join, flights2, airportsIB, c("origin" = "faa"), "<|| print >")

ls(intuEnv())
The 88 R packages that have interfaces implemented so far are:
adabag: Multiclass AdaBoost.M1, SAMME and Bagging
AER: Applied Econometrics with R
aod: Analysis of Overdispersed Data
ape: Analyses of Phylogenetics and Evolution
arm: Data Analysis Using Regression and Multilevel/Hierarchical Models
betareg: Beta Regression
brglm: Bias reduction in binomial-response generalized linear models
caper: Comparative Analyses of Phylogenetics and Evolution in R
car: Companion to Applied Regression
caret: Classification and Regression Training
coin: Conditional Inference Procedures in a Permutation Test Framework
CORElearn: Classification, Regression and Feature Evaluation
drc: Analysis of Dose-Response Curves
e1071: Support Vector Machines
earth: Multivariate Adaptive Regression Splines
EnvStats: Environmental Statistics, Including US EPA Guidance
fGarch: Rmetrics - Autoregressive Conditional Heteroskedastic Modelling
flexmix: Flexible Mixture Modeling
forecast: Forecasting Functions for Time Series and Linear Models
frontier: Stochastic Frontier Analysis
gam: Generalized Additive Models
gbm: Generalized Boosted Regression Models
gee: Generalized Estimation Equation Solver
glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models
glmx: Generalized Linear Models Extended
gmnl: Multinomial Logit Models with Random Parameters
gplots: Various R Programming Tools for Plotting Data
gss: General Smoothing Splines
graphics: The R Graphics Package
hdm: High-Dimensional Metrics
Hmisc: Harrell Miscellaneous
ipred: Improved Predictors
iRegression: Regression Methods for Interval-Valued Variables
ivfixed: Instrumental fixed effect panel data model
kernlab: Kernel-Based Machine Learning Lab
kknn: Weighted k-Nearest Neighbors
klaR: Classification and Visualization
lars: Least Angle Regression, Lasso and Forward Stagewise
lattice: Trellis Graphics for R
latticeExtra: Extra Graphical Utilities Based on Lattice
leaps: Regression Subset Selection
lfe: Linear Group Fixed Effects
lme4: Linear Mixed-Effects Models using 'Eigen' and S4
lmtest: Testing Linear Regression Models
MASS: Robust Regression, Linear Discriminant Analysis, Ridge Regression, Probit Regression, ...
MCMCglmm: MCMC Generalised Linear Mixed Models
mda: Mixture and Flexible Discriminant Analysis
metafor: Meta-Analysis Package for R
mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation
minpack.lm: R Interface to the Levenberg-Marquardt Nonlinear Least-Squares Algorithm Found in MINPACK, Plus Support for Bounds
mhurdle: Multiple Hurdle Tobit Models
mlogit: Multinomial logit model
mnlogit: Multinomial Logit Model
modeltools: Tools and Classes for Statistical Models
nlme: Linear and Nonlinear Mixed Effects Models
nlreg: Higher Order Inference for Nonlinear Heteroscedastic Models
nnet: Feed-Forward Neural Networks and Multinomial Log-Linear Models
ordinal: Regression Models for Ordinal Data
party: A Laboratory for Recursive Partytioning
partykit: A Toolkit for Recursive Partytioning
plotrix: Various Plotting Functions
pls: Partial Least Squares and Principal Component Regression
pROC: Display and Analyze ROC Curves
pscl: Political Science Computational Laboratory, Stanford University
psychomix: Psychometric Mixture Models
psychotools: Infrastructure for Psychometric Modeling
psychotree: Recursive Partitioning Based on Psychometric Models
quantreg: Quantile Regression
randomForest: Random Forests for Classification and Regression
Rchoice: Discrete Choice (Binary, Poisson and Ordered) Models with Random Parameters
rminer: Data Mining Classification and Regression Methods
rms: Regression Modeling Strategies
robustbase: Basic Robust Statistics
rpart: Recursive Partitioning and Regression Trees
RRF: Regularized Random Forest
RWeka: R/Weka Interface
sampleSelection: Sample Selection Models
sem: Structural Equation Models
spBayes: Univariate and Multivariate Spatial-temporal Modeling
stats: The R Stats Package (glm, lm, loess, lqs, nls, ...)
strucchange: Testing, Monitoring, and Dating Structural Changes
survey: Analysis of Complex Survey Samples
survival: Survival Analysis
SwarmSVM: Ensemble Learning Algorithms Based on Support Vector Machines
systemfit: Estimating Systems of Simultaneous Equations
tree: Classification and Regression Trees
vcd: Visualizing Categorical Data
vegan: Community Ecology Package
The aim is to continue adding interfaces to most methodologies used in data science or other disciplines.
For now the main focus is on interfacing non-pipe-aware functions having "formula" and
"data" (in that order), but the non-formula variants should also work (even cases
currently lacking interfaces). As a proof of concept, two libraries that contain
non-formula variants only (glmnet
and lars
) have also been interfaced.
Also, only packages on CRAN (in addition to the ones provided in the base installation of R) have been interfaced. Packages from, for example, Bioconductor could also be easily added, but I would need some help from the maintainers of those packages.
intubate
core depends only on base
, stats
, and utils
libraries.
To keep it as lean as possible, and to be able to continue to include more interfaces without bloating your machine, starting from version 1.0.0 intubate
will not
install the packages that contain the functions that are interfaced.
You will need to install them yourself,
and load the corresponding libraries before using them in your pipelines. This
also applies to magrittr
(in case you want to use intubate
without pipelines).
Then, if you are only interested in a given field, say: bio-statistics, bio-informatics, environmetrics, econometrics, finance, machine learning, meta-analysis, pharmacokinetics, phylogenetics, psychometrics, social sciences, surveys, survival analysis, ..., you will not have to install all the packages for which interfaces are provided if you intend to use only a subset of them. You only need to install the subset of packages you intend to use (which are probably already installed in your machine).
Moreover, there are cases where some packages are in conflict if loaded simultaneously, leading
to a segmentation fault. For example, kernlab functions fail when testing the whole set of
examples provided with intubate
, but not when testing kernlab-only examples
in a clean environment. I do not know which other package(s) conflict with it.
The only thing I know is that the package name comes alphabetically before kernlab.
I make no personal judgment (mostly due to personal ignorance)
about the merit of any interfaced function.
I have used only a subset of what is provided, and I am happy to include others,
that I am currently unaware of, down the line. In principle I plan on including
packages that are listed as reverse depends, imports, or suggests of package Formula
(I am still missing some of them). Adding interfaces is easy (and can be
boring...) so I will appreciate it if you want to contribute (and you will be credited
in the help of the interfaced package). Improvements to the
provided examples are also welcome (such as making sure the data used is correct for the statistical
technique used).
I do not claim to be a data scientist (I am barely a statistician and I still have almost no clue of what a data scientist is or is not, and my confusion about the subject only increases with time), nor someone entitled to tell you what to use or not.
As such, I am not capable of engaging in disputes of what is relevant or not, or, if there are competing packages, which to use. I will leave that to you to decide.
Please keep in mind that intubate will not install any of the packages corresponding
to the interfaces that are provided. You can install only those that you need
(or like) and disregard the rest. Also, please remember that you can create your
own interfaces (using the helper function intubate),
or call non-pipe-aware functions directly (using ntbt).
The original aim of intubate was to be able to include functions that
have formula and data (in that order) in a magrittr pipeline using %>%.
As such, my search so far has concentrated on packages containing formulas and
misplaced (from the pipe's point of view) data (with the exception of a couple of packages
with non-formula variants interfaced as proofs of concept).
For example, this was the first implementation of ntbt_lm:
ntbt_lm <- function(data, formula, ...) lm(formula, data, ...)
This approach was supposed to be repeated for each interface.
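As a minimal sketch of what that first approach looks like in use: ntbt_lm below is the one-liner shown above, while the ntbt_glm one-liner is written here in the same style purely to illustrate the repetition (it is not how the released package defines its interfaces). The pipeline part assumes magrittr is installed and is skipped otherwise.

```r
## First-generation wrapper pattern: take data first, then forward to lm().
ntbt_lm <- function(data, formula, ...) lm(formula, data, ...)

## The same one-liner would have to be written again for every function.
## Illustrative only: a hand-written wrapper for glm() in the same style.
ntbt_glm <- function(data, formula, ...) glm(formula, data = data, ...)

## Called directly, data now comes first, which is what a pipe needs.
fit_wrapped <- ntbt_lm(LifeCycleSavings, sr ~ pop15)
fit_direct  <- lm(sr ~ pop15, LifeCycleSavings)

## Both routes estimate the same coefficients.
print(all.equal(coef(fit_wrapped), coef(fit_direct)))

## The same wrapper inside a magrittr pipeline (only if magrittr is available).
if (requireNamespace("magrittr", quietly = TRUE)) {
  library(magrittr)
  LifeCycleSavings %>%
    ntbt_lm(sr ~ pop15) %>%
    summary() %>%
    print()
}
```

Note that defining these names in your workspace masks the interfaces of the same name if intubate is also loaded.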
Soon after, I realized that intubate could have just a few helper functions (that was
version 0.99.2), later that only one helper function was enough (intubate), and
later still that you could call non-pipe-aware functions directly without defining
interfaces (ntbt), and that the interfaces and ntbt could also be used successfully
in cases where non-formula variants are implemented.
However, the starting point inevitably led the way. I did not see the big picture
(well, what today I think the big picture is...),
so the current version only addresses packages containing functions that use the formula
variant, even if in those cases you can also use the non-formula variants. (See the
examples corresponding to pROC, where both the formula and non-formula cases
are demonstrated. You should be able to use that technique for the rest of
the packages as well.)
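To make the distinction between the two variants concrete, here is a small base-R sketch using cor.test, which offers both. It only illustrates what "formula" and "non-formula" mean; the use of with() is just a stand-in for supplying the data in the non-formula case, and is not a description of how the interfaces or ntbt work internally.

```r
## Formula variant: variables are named in a formula and looked up in `data`.
res_formula <- cor.test(~ CONT + INTG, data = USJudgeRatings)

## Non-formula variant: the vectors themselves are passed as x and y.
## with() lets the column names be written bare.
res_default <- with(USJudgeRatings, cor.test(CONT, INTG))

## Both variants produce the same correlation estimate.
print(all.equal(unname(res_formula$estimate), unname(res_default$estimate)))
```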
I am brewing some ideas about a general approach to packages that
do not use a formula interface, but I leave that for a future iteration of intubate.
This means that there are three possible explanations for your favorite package not being included for the time being:
Also, please keep in mind that you can always create your own interfaces
(with the helper function intubate), or call the non-pipe-aware functions
directly (with ntbt).
The robustness and generality of the interfacing machinery still need to be further verified (and very likely improved), as there are thousands of potential functions to interface and some are certainly bound to fail when interfaced. Some failures have already been addressed while implementing the provided interfaces (as their examples failed).
The goal is to make intubate progressively more robust by
addressing the peculiarities of newly discovered failing functions.
For the time being, only cases where the interfaces provided with intubate
fail will be considered bugs.
Cases where user-defined interfaces fail, or where ntbt fails when directly calling
functions that do not have interfaces provided with released versions of intubate,
will be considered feature requests.
Of course, if you have some coding skills and can follow the code of the interface, it will be greatly appreciated if you provide a proposed solution (one that does not break anything else) together with the feature request.
intubate
The logo of intubate is: <||>. It corresponds to an intuBorder. I did not
find it in a Google search as of 2016/08/08. I intend to use it as a visual
identification of intubate. If you know of it having been in use before this date
in any software-related project, please let me know, and I will change it.
intuBorder(s) and intubOrder(s), as of 2016/08/08, have only been found, on Google,
in a snippet of code as the name of a variable (intUBorder)
(http://www.office-loesung.de/ftopic246897_0_0_asc.php) that would mean something
like an "integer upper border". There is also an intLBorder for the lower border.
intuBag(s), as of 2016/08/08, seems to be used for a small bag for bikes (InTuBag,
meaning Inner Tube Bag)
(https://felvarrom.com/products/intubag-bike-tube-bag-medium-blue-inside?variant=18439367751),
but not for anything software related. If intubate succeeds, they may end up selling
more InTuBags!
intubate, as of 2016/08/08, seems to be used in relation to the medical procedure, and perhaps also by the oil pipeline industry (at least "entubar" in Spanish is more general than the medical procedure), but not for software-related projects.
intuEnv, as of 2016/08/18, was found only in some Latin text.
I intend to use "intubate", "<||>", "intuBorder", "intubOrder(s)", "intuBag(s)", "intuEnv(s)" and other derivations starting with "intu" in relation to the use and promotion of "intubate" for software-related activities.
At some point I intend to register the names and logo as trademarks.