intubate <||> 1.0.0
In intubate: Interface to Popular R Functions for Data Science Pipelines

License: GPL >= 2

The aim of intubate (logo <||>) is to offer a painless way to add R functions that are not pipe-aware to data science pipelines implemented by magrittr with the operator %>%, without having to rely on workarounds of varying complexity. It also implements three extensions called intubOrders, intuEnv, and intuBags.

Installation

Install the latest released version from CRAN (1.0.0) with

install.packages("intubate")

Install the latest development version from GitHub with

# install.packages("devtools")
devtools::install_github("rbertolusso/intubate")

In a nutshell

If you like magrittr pipelines (%>%) and you are looking for an alternative to performing a statistical analysis in the following way:

fit <- lm(sr ~ pop15, LifeCycleSavings)
summary(fit)

intubate lets you do it in these other ways:

library(intubate)
library(magrittr)

1) Using interface (provided by intubate or user defined)

ntbt_lm is the interface to lm, one of the more than 450 interfaces that intubate currently implements (see below for the list of the 88 packages that currently have interfaces).

LifeCycleSavings %>%
  ntbt_lm(sr ~ pop15) %>%    ## ntbt_lm is the interface to lm provided by intubate
  summary()

2) Calling the non-pipe-aware function directly with `ntbt`

You do not need to use interfaces. You can call non-pipe-aware functions directly using ntbt (even those that currently do not have an interface provided by intubate).

LifeCycleSavings %>%
  ntbt(lm, sr ~ pop15) %>%   ## ntbt calls lm without needing to use an interface
  summary()

The help for each interface contains examples of use.

Interfaces "on demand"

intubate allows you to create your own interfaces "on demand", right now, giving you full power of decision regarding which functions to interface.

The ability to extend the scope of intubate may prove particularly welcome if you work in a field that may, in the long run, continue to lack interfaces due to my unforgivable, but unavoidable, ignorance.

As an example of creating an interface "on demand", suppose the interface to cor.test was lacking in the current version of intubate and suppose (at least for a moment) that you want to create yours because you are searching for a pipeline-aware alternative to any of the following styles of coding (results not shown):

data(USJudgeRatings)

## 1)
cor.test(USJudgeRatings$CONT, USJudgeRatings$INTG)

## 2)
attach(USJudgeRatings)
cor.test(CONT, INTG)
detach()

## 3)
with(USJudgeRatings, cor.test(CONT, INTG))

## 4)
USJudgeRatings %>%
   with(cor.test(CONT, INTG))

To be able to create an interface to cor.test "on demand", the only thing you need to do is to add the following line of code somewhere before its use in your pipeline:

ntbt_cor.test <- intubate          ## intubate is the helper function

Please note the lack of parentheses.

Nothing else is required.

The only thing you need to remember is that the name of an interface must start with ntbt_ followed by the name of the interfaced function (cor.test in this particular case), no matter which function you want to interface.

Now you can use your "just baked" interface in any pipeline. A pipeline alternative to the above code may look like this:

USJudgeRatings %>%
  ntbt_cor.test(CONT, INTG)           ## Use it right away
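
Why the name matters, and why the assignment has no parentheses, can be illustrated with a toy sketch in base R. This is not intubate's source code: toy_interface and the toy_ prefix are hypothetical names, used only to show how a function that is copied under a new name can recover, from that name, which function it should call.

```r
# Toy illustration only; this is not intubate's source code.
toy_interface <- function(data, ...) {
  # Name under which the function was called, e.g. "toy_cor.test".
  alias <- as.character(sys.call()[[1]])
  # Drop the hypothetical "toy_" prefix to recover the target function name.
  target <- sub("^toy_", "", alias)
  # Rebuild target(<arguments>) from the unevaluated arguments ...
  args <- as.list(substitute(list(...)))[-1]
  cl <- as.call(c(list(as.name(target)), args))
  # ... and evaluate it with the columns of `data` in scope.
  eval(cl, data, parent.frame())
}

toy_cor.test <- toy_interface    # no parentheses: the function itself is copied

res <- toy_cor.test(USJudgeRatings, CONT, INTG)
res$estimate
```

Writing toy_interface() with parentheses would call the function instead of copying it, which is why the definition of an interface omits them.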

Calling non-pipe-aware functions directly with `ntbt`

Of course, as already stated, you do not have to create an interface if you do not want to. You can call the non-pipe-aware function directly with ntbt, in the following way:

USJudgeRatings %>%
  ntbt(cor.test, CONT, INTG)

You can potentially use ntbt with any function, including those without an interface provided by intubate. In principle, the functions you would want to call this way are the ones you cannot use directly in a pipeline (because data is not the first parameter in the definition of the function).
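
What "cannot be used directly in a pipeline" means can be seen in base R alone. A pipeline places its left-hand side in the first argument of the next function, so the sketch below makes that call by hand. It assumes that no objects named CONT or INTG exist in your workspace.

```r
# What a pipeline effectively calls: the data frame lands in the first
# argument, and CONT and INTG are looked up outside the data.
attempt <- tryCatch(cor.test(USJudgeRatings, CONT, INTG),
                    error = function(e) e)
inherits(attempt, "error")

# Evaluating the call inside the data, as with() does, works.
ok <- with(USJudgeRatings, cor.test(CONT, INTG))
class(ok)
```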

Example showing different techniques

The link below is to Dr. Sheather's website, from which the code was extracted. The site also has information about the book. This code can be used to produce the plots in Figure 3.1 on page 46. Different strategies are illustrated.

http://www.stat.tamu.edu/~sheather/book/

1) As in the book (without using pipes and attaching data):

attach(anscombe)
plot(x1, y1, xlim = c(4, 20), ylim = c(3, 14), main = "Data Set 1")
abline(lsfit(x1, y1))
detach()

You need to attach the data so that the variables are visible locally. Otherwise, you would have to use anscombe$x1 and anscombe$y1. You could also use with. Spaces were added for clarity and for easier comparison with the code below.

2) Using magrittr pipes (`%>%`) and `intubate` (1: provided interface and 2: `ntbt`):

anscombe %>%
  ntbt_plot(x2, y2, xlim = c(4, 20), ylim = c(3, 14), main = "Data Set 2") %>%
  ntbt(lsfit, x2, y2) %>%   # Call non-pipe-aware function directly with `ntbt`
  abline()                  # No need to interface 'abline'.

ntbt_plot is the interface to plot provided by intubate. As plot returns NULL, intubate forwards (invisibly) its input automatically without having to use %T>%, so lsfit gets the original data (what it needs) and everything is done in one pipeline.
ntbt lets you call the non-pipe-aware function lsfit directly. You can always use ntbt (you do not need to use ntbt_ interfaces if you do not want to), but ntbt is particularly useful for calling a non-pipe-aware function for which intubate does not provide an interface (as is currently the case with lsfit).
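
The forwarding behavior described above can be pictured with a small base R sketch. It is only an illustration of the idea, not intubate's implementation; forward_if_null and returns_null are hypothetical names.

```r
# Toy illustration only; this is not intubate's source code.
forward_if_null <- function(data, fun, ...) {
  result <- fun(data, ...)
  # A NULL result carries nothing useful down the pipeline, so pass the
  # input along (invisibly) instead.
  if (is.null(result)) invisible(data) else result
}

returns_null <- function(d) NULL      # stands in for a plotting function

out1 <- forward_if_null(anscombe, returns_null)  # the input comes back
out2 <- forward_if_null(anscombe, nrow)          # a real result is kept
```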

3) Defining interface "on demand"

If intubate does not provide an interface to a given function and you prefer to use interfaces instead of ntbt, you can create your own interface "on demand" and use it right away in your pipeline. To create an interface, the following line of code, placed before its use, suffices:

ntbt_lsfit <- intubate      # NOTE: we are *not* including parentheses.

That's it, you have created your interface. Just remember that:

intubate interfaces must start with ntbt_ followed by the name of the function to interface (lsfit in this case).
Parentheses are not used in the definition of the interface.

You can now use ntbt_lsfit in your pipeline as any other interfaced function:

anscombe %>%
  ntbt_plot(x3, y3, xlim = c(4, 20), ylim = c(3, 14), main = "Data Set 3") %>%
  ntbt_lsfit(x3, y3) %>%    # Using just created "on demand" interface
  abline()

4) Using the formula variants:

Instead of the X Y approach, you can also use the formula variant. In this case, we will have to use lm, as lsfit does not implement formulas.

anscombe %>%
  ntbt_plot(y4 ~ x4, xlim = c(4, 20), ylim = c(3, 14), main = "Data Set 4") %>%
  ntbt_lm(y4 ~ x4) %>%      # We use 'ntbt_lm' instead of 'ntbt_lsfit'
  abline()
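
As a quick check, outside any pipeline and in base R only, the X Y variant (lsfit) and the formula variant (lm) fit the same line to the fourth Anscombe data set. The object names fit_xy and fit_formula are only for this illustration.

```r
fit_xy      <- lsfit(anscombe$x4, anscombe$y4)   # X Y variant
fit_formula <- lm(y4 ~ x4, data = anscombe)      # formula variant

# Compare intercept and slope, ignoring the different coefficient names.
same_line <- all.equal(unname(fit_xy$coefficients), unname(coef(fit_formula)))
same_line
```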

Extensions for pipelines provided by `intubate`

intubate implements three extensions:

intubOrders,
intuEnv, and
intuBags.

These experimental features are functional and ready for you to try. However, because you may need to change your code while the architecture solidifies, they are not (yet) recommended for production code.

`intubOrders`

intubOrders allow you, among other things, to:

run, in place, functions on the input (data) to the interfaced function, such as head, tail, dim, str, View, ...
run, in place, functions that use the result generated by the interfaced function, such as print, summary, anova, plot, ...
forward the input to the interfaced function without using %T>%
signal other modifications to the behavior of the interface

intubOrders are implemented by an intuBorder <||> (from where the logo of intubate originates).

The intuBorder contains 5 zones (intuZones?, maybe too much...):

zone 1 < zone 2 | zone 3 | zone 4 > zone 5

zone 1 and zone 5 will be explained later
zone 2 is used to indicate the functions that are to be applied to the input to the interfaced function
zone 3 to modify the behavior of the interface
zone 4 to indicate the functions that are to be applied to the result of the interfaced function
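
To make the five zones concrete, here is a rough base R sketch that splits an intuBorder string at its <, |, | and > delimiters. The function split_intuborder is hypothetical and is not how intubate itself parses an intubOrder; it also assumes that none of the zones contains those delimiter characters.

```r
# Toy parser, for illustration only (not part of intubate).
split_intuborder <- function(order) {
  # (?s) lets "." match newlines, so multi-line intubOrders also work.
  pattern <- "(?s)^(.*?)<(.*?)\\|(.*?)\\|(.*?)>(.*)$"
  pieces <- regmatches(order, regexec(pattern, order, perl = TRUE))[[1]]
  zones <- trimws(pieces[-1])          # drop the full match, keep 5 groups
  names(zones) <- paste0("zone", 1:5)
  zones
}

z <- split_intuborder("CO3 <|f| print > lmfit")
z

# Functions listed in a zone are separated by semicolons.
z2 <- split_intuborder("< head; dim |i| summary >")
strsplit(z2[["zone2"]], "\\s*;\\s*")[[1]]
```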

For example, instead of running the following sequence of function calls (only plot shown):

head(LifeCycleSavings)
tail(LifeCycleSavings, n = 3)
dim(LifeCycleSavings)
str(LifeCycleSavings)
summary(LifeCycleSavings)
result <- lm(sr ~ pop15 + pop75 + dpi + ddpi, LifeCycleSavings)
print(result)
summary(result)
anova(result)
plot(result, which = 1)

you could have run, using an intubOrder:

LifeCycleSavings %>%
  ntbt_lm(sr ~ pop15 + pop75 + dpi + ddpi,
          "< head; tail(#, n = 3); dim; str; summary
             |i|
             print; summary; anova; plot(#, which = 1) >")

Note:
- i is used to force an invisible result
- # is used as a placeholder for either the input or the result in cases where the call requires extra parameters.

intubOrders may prove to be of interest to non-pipeline oriented people too (results not shown):

ntbt_lm(LifeCycleSavings, sr ~ pop15 + pop75 + dpi + ddpi,
        "< head; tail(#, n = 3); dim; str; summary
           |i|
           print; summary; anova; plot(#, which = 1) >")

`intubOrders` with collections of inputs

When using pipelines, the receiving function has to deal with the whole object that it receives as its input. Then, it produces a result that, again, needs to be consumed as a whole by the following function.

intubOrders allow you to work with a collection of objects of any kind in one pipeline, selecting at each step which input to use.

As an example suppose you want to perform the following statistical procedures in one pipeline (results not shown).

CO2 %>%
  ntbt_lm(conc ~ uptake)

USJudgeRatings %>%
  ntbt_cor.test(CONT, INTG)

sleep %>%
  ntbt_t.test(extra ~ group)

We will first create a collection (a list in this case, but it could also be an intuEnv or an intuBag, explained later) containing the three data frames:

coll <- list(CO3 = CO2,
             USJudgeRatings1 = USJudgeRatings,
             sleep1 = sleep)
names(coll)

(We have changed the names to show we are not cheating...)

Note: the objects of the collection must be named.

We will now use as source the whole collection.

The intubOrder will need the following info:

zone 1, in each case, indicates which is the data.frame (or any other object) that we want to use as input in this particular function
zone 3 needs to include f to forward the input (if you want the next function to receive the whole collection, and not the result of this step)
zone 4 (optional) may contain a print (or summary) if you want something to be displayed

coll %>%
  ntbt_lm(conc ~ uptake, "CO3 <|f| print >") %>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <|f| print >") %>%
  ntbt_t.test(extra ~ group, "sleep1 <|f| print >") %>%
  names()

Note: names() was added at the end to show that we have forwarded the original collection to the end of the pipeline.

What happens if you would like to save the results of the function calls (or intermediate results of data manipulations)?

`intuEnv` and `intuBags`

intuEnv and intuBags allow you to save intermediate results without leaving the pipeline. They can also be used to hold collections of objects.

Let us first consider

`intuEnv`

When intubate is loaded, it creates intuEnv, an empty environment that can be populated with results that you want to use later.

You can access the intuEnv as follows:

intuEnv()  ## intuEnv() returns invisibly, so nothing is output

You can verify that, initially, it is empty:

ls(intuEnv())
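
If the idea of an accessor function in front of a private environment is unfamiliar, the following base R sketch may help. It only imitates the behavior described here; toy_env is a hypothetical name and nothing below is intubate's own code.

```r
# Toy illustration only; `toy_env` is a hypothetical name.
toy_env <- local({
  store <- new.env()                   # private environment, initially empty
  function(...) {
    objs <- list(...)
    # Save any named arguments into the private environment.
    for (nm in names(objs)) assign(nm, objs[[nm]], envir = store)
    invisible(store)                   # return it invisibly
  }
})

empty_at_start <- ls(toy_env())        # nothing stored yet
toy_env(lmfit = lm(sr ~ pop15, LifeCycleSavings))
ls(toy_env())                          # now lists the saved object
```

Because environments are modified in place, anything saved this way remains available later through toy_env()$lmfit.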

How can intuEnv be used?

Suppose that, instead of displaying the results of interfaced functions, we want to save the objects returned by them. One strategy (the other is using intuBags) is to save the results to intuEnv.

How to save to `intuEnv`?

The intubOrder will need the following info:

zone 3 needs to include f to forward the input (if you want the next function to receive the whole collection, and not its result)
zone 5, in each case, indicates the name that the result will have in the intuEnv

coll %>%
  ntbt_lm(conc ~ uptake, "CO3 <|f|> lmfit") %>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <|f|> ctres") %>%
  ntbt_t.test(extra ~ group, "sleep1 <|f|> ttres") %>%
  names()

As you can see, the collection stays unchanged, but look inside intuEnv

ls(intuEnv())

intuEnv has collected the results, which are ready for use.

Four strategies for using one of the collected results are shown below (output not shown):

Strategy 1

intuEnv()$lmfit %>%
  summary()

Strategy 2

attach(intuEnv())
lmfit %>%
  summary()
detach()

Strategy 3

intuEnv() %>%
  ntbt(summary, "lmfit <||>")

Strategy 4

intuEnv() %>%
  ntbt(I, "lmfit <|i| summary >")

clear_intuEnv can be used to empty the contents of intuEnv.

clear_intuEnv()

ls(intuEnv())

Associating `intuEnv` with the Global Environment

If you want your results to be saved to the Global environment (it could be any environment), you can associate intuEnv to it, so you can have your results available as any other saved object.

First let's display the contents of the Global environment:

ls()

set_intuEnv lets you associate intuEnv with an environment. It takes an environment as a parameter and returns the current intuEnv, in case you want to save it to reinstate it later. If not, I think it will just be garbage collected (I may be wrong).

Let's associate intuEnv to the global environment (saving the current intuEnv):

saved_intuEnv <- set_intuEnv(globalenv())

Now, we re-run the pipeline:

coll %>%
  ntbt_lm(conc ~ uptake, "CO3 <|f|> lmfit") %>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <|f|> ctres") %>%
  ntbt_t.test(extra ~ group, "sleep1 <|f|> ttres") %>%
  names()

Before forgetting, let's reinstate the original intuEnv:

set_intuEnv(saved_intuEnv)

And now, let's see if the results were saved to the global environment:

ls()

They were.

Now the results are at your disposal to use as any other variable (result not shown):

lmfit %>%
  summary()

Using `intuEnv` as source of the pipeline

You can use intuEnv (or any other environment) as the input of your pipeline.

We already cleared the contents of intuEnv, but let's do it again to get used to how to do it:

clear_intuEnv()

ls(intuEnv())

Let's populate intuEnv with the same objects as before:

intuEnv(CO3 = CO2,
        USJudgeRatings1 = USJudgeRatings,
        sleep1 = sleep)

ls(intuEnv())

When using an environment, such as intuEnv, as the source of your pipeline, there is no need to specify f in zone 3, as the environment is always forwarded (the same happens when the source is an intuBag).

Keep in mind that, if you are saving results and your source is an environment other than intuEnv, the results will be saved to intuEnv, and not to the source environment. If the source is an intuBag, the results will be saved to the intuBag, and not to intuEnv.

We will run the same pipeline as before, but this time we will add subset and summary (called directly with ntbt) to illustrate how we can use a previously generated result (such as from data transformations) in the same pipeline in which it was generated. We will use intuEnv as the source of the pipeline.

intuEnv() %>%
  ntbt(subset, Treatment == "nonchilled", "CO3 <||> CO3nc") %>%
  ntbt_lm(conc ~ uptake, "CO3nc <||> lmfit") %>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <||> ctres") %>%
  ntbt_t.test(extra ~ group, "sleep1 <||> ttres") %>%
  ntbt(summary, "lmfit <||> lmsfit") %>%
  names()

Note that, as subset is already pipe-aware (data is its first parameter), you have two ways of proceeding. One is the one illustrated above (same strategy used on non-pipe-aware functions). The other, that works only when using pipe-aware functions, is:

intuEnv() %>%
  ntbt(subset, CO3, Treatment == "nonchilled", "<||> CO3nc")

`intuBags`

intuBags differ from intuEnv in that they are based on lists rather than on environments. Even if (with a little care) you could keep track of several intuEnvs, it seems natural (to me) to deal with only one, while having several intuBags (for example one for each database, or collection of objects) seems natural (to me).

Other than that, using an intuEnv or an intuBag is a matter of personal taste.

What you can do with one you can do with the other.

iBag <- intuBag(CO3 = CO2,
                USJudgeRatings1 = USJudgeRatings,
                sleep1 = sleep)

iBag %>%
  ntbt(subset, Treatment == "nonchilled", "CO3 <||> CO3nc") %>%
  ntbt_lm(conc ~ uptake, "CO3nc <||> lmfit") %>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <||> ctres") %>%
  ntbt_t.test(extra ~ group, "sleep1 <||> ttres") %>%
  ntbt(summary, "lmfit <||> lmsfit") %>%
  names()

When using intuBags, it is possible to use %<>% if you want to save your results to the intuBag. This way, instead of a long pipeline, you could run several short ones.

iBag <- intuBag(CO3 = CO2,
                USJudgeRatings1 = USJudgeRatings,
                sleep1 = sleep)

iBag %<>%
  ntbt(subset, CO3, Treatment == "nonchilled", "<||> CO3nc") %>%
  ntbt_lm(conc ~ uptake, "CO3nc <||> lmfit")

iBag %<>%
  ntbt_cor.test(CONT, INTG, "USJudgeRatings1 <||> ctres")

iBag %<>%
  ntbt_t.test(extra ~ group, "sleep1 <||> ttres") %>%
  ntbt(summary, "lmfit <||> lmsfit")

names(iBag)

The intuBag will keep all your results, in any way you prefer to use it.

The same happens with intuEnv. Just remember that %<>% should not be used with intuEnv (you should always use %>%).
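
The different treatment of %<>% follows from how R handles lists and environments in general: a list is copied when modified inside a function, so the updated copy has to be assigned back, whereas an environment is modified in place. A base R sketch, with the hypothetical helper add_result standing in for a pipeline step that saves a result:

```r
# Hypothetical helper, for illustration only.
add_result <- function(container, name, value) {
  container[[name]] <- value           # works for lists and environments
  invisible(container)
}

bag <- list(x = 1)                     # list-based, like an intuBag
env <- new.env(); env$x <- 1           # environment-based, like intuEnv

add_result(bag, "y", 2)                # modifies a copy; `bag` is unchanged
add_result(env, "y", 2)                # modifies `env` in place

before <- names(bag)                   # still only "x"
ls(env)                                # "x" "y"

bag <- add_result(bag, "y", 2)         # assigning back keeps the result
names(bag)                             # "x" "y"
```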

Using more than one source

Suppose you have a database consisting of the following two tables:

iBag <- intuBag(members = data.frame(name=c("John", "Paul", "George",
                                            "Ringo", "Brian", NA),
                band=c("TRUE",  "TRUE", "TRUE", "TRUE", "FALSE", NA)),
           what_played = data.frame(name=c("John", "Paul", "Ringo",
                                           "George", "Stuart", "Pete"),
                instrument=c("guitar", "bass", "drums", "guitar", "bass", "drums")))
print(iBag)

and you want to perform an inner join. In these cases, the functions should receive the whole intuBag (or intuEnv, or collection), so zone 1 should be empty, and the names of the tables should be specified directly, in the function call, in their corresponding order (or by stating their parameter names).

iBag %>%
  ntbt(merge, members, what_played, by = "name", "<|| print >")
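
For comparison, this is the same inner join written in base R without a pipeline, using two ordinary data frames with the same contents as the tables above. stringsAsFactors = FALSE is set explicitly so that the name columns are character vectors in every version of R.

```r
members <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Brian", NA),
  band = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", NA),
  stringsAsFactors = FALSE)

what_played <- data.frame(
  name = c("John", "Paul", "Ringo", "George", "Stuart", "Pete"),
  instrument = c("guitar", "bass", "drums", "guitar", "bass", "drums"),
  stringsAsFactors = FALSE)

# merge() keeps only the rows whose `name` appears in both tables.
joined <- merge(members, what_played, by = "name")
joined
```

Only John, Paul, George and Ringo appear in both tables, so the result has four rows.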

Example of an `intuBag` acting as a database

The following code has been extracted from chapter 13 of "R for data science", by Garrett Grolemund and Hadley Wickham (http://r4ds.had.co.nz/relational-data.html)

Original code (output not shown):

library(dplyr)
library(nycflights13)

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2

flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier")

## 13.4.5 Defining the key columns

flights2 %>%
  left_join(weather)

flights2 %>%
  left_join(planes, by = "tailnum")

flights2 %>%
  left_join(airports, c("dest" = "faa"))

flights2 %>%
  left_join(airports, c("origin" = "faa"))

nycflights13 is a database. As such, we can deal with it using intuBags. The following code illustrates how all the above can be performed using an intuBag (or intuEnv) and one pipeline:

iBag <- intuBag(flightsIB = flights,
                airlinesIB = airlines,
                weatherIB = weather,
                planesIB = planes,
                airportsIB = airports)
## Note we are changing the names, to make sure we are not cheating
## (by reading from globalenv()).

iBag %<>%
  ntbt(select, flightsIB, year:day, hour, origin, dest, tailnum, carrier, "<|| head > flights2") %>%
  ntbt(select, flights2, -origin, -dest, "<|| print > flights3") %>% 
  ntbt(left_join, flights3, airlinesIB, by = "carrier", "<|| print >") %>%
  ntbt(left_join, flights2, weatherIB, "<|| print >") %>%
  ntbt(left_join, flights2, planesIB, by = "tailnum", "<|| print >") %>%
  ntbt(left_join, flights2, airportsIB, c("dest" = "faa"), "<|| print >") %>%
  ntbt(left_join, flights2, airportsIB, c("origin" = "faa"), "<|| print >")

names(iBag)

Note: the results were copied from previously run code to avoid adding dependencies.

## 
## ntbt(data = ., fti = select, flightsIB, year:day, hour, origin, 
##     dest, tailnum, carrier)
## 
## * head <||> result *
## # A tibble: 6 x 8
##    year month   day  hour origin  dest tailnum carrier
##   <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>
## 1  2013     1     1     5    EWR   IAH  N14228      UA
## 2  2013     1     1     5    LGA   IAH  N24211      UA
## 3  2013     1     1     5    JFK   MIA  N619AA      AA
## 4  2013     1     1     5    JFK   BQN  N804JB      B6
## 5  2013     1     1     6    LGA   ATL  N668DN      DL
## 6  2013     1     1     5    EWR   ORD  N39463      UA
## 
## ntbt(data = ., fti = select, flights2, -origin, -dest)
## 
## * print <||> result *
## # A tibble: 336,776 x 6
##     year month   day  hour tailnum carrier
##    <int> <int> <int> <dbl>   <chr>   <chr>
## 1   2013     1     1     5  N14228      UA
## 2   2013     1     1     5  N24211      UA
## 3   2013     1     1     5  N619AA      AA
## 4   2013     1     1     5  N804JB      B6
## 5   2013     1     1     6  N668DN      DL
## 6   2013     1     1     5  N39463      UA
## 7   2013     1     1     6  N516JB      B6
## 8   2013     1     1     6  N829AS      EV
## 9   2013     1     1     6  N593JB      B6
## 10  2013     1     1     6  N3ALAA      AA
## # ... with 336,766 more rows
## 
## ntbt(data = ., fti = left_join, flights3, airlinesIB, by = "carrier")
## 
## * print <||> result *
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier                     name
##    <int> <int> <int> <dbl>   <chr>   <chr>                    <chr>
## 1   2013     1     1     5  N14228      UA    United Air Lines Inc.
## 2   2013     1     1     5  N24211      UA    United Air Lines Inc.
## 3   2013     1     1     5  N619AA      AA   American Airlines Inc.
## 4   2013     1     1     5  N804JB      B6          JetBlue Airways
## 5   2013     1     1     6  N668DN      DL     Delta Air Lines Inc.
## 6   2013     1     1     5  N39463      UA    United Air Lines Inc.
## 7   2013     1     1     6  N516JB      B6          JetBlue Airways
## 8   2013     1     1     6  N829AS      EV ExpressJet Airlines Inc.
## 9   2013     1     1     6  N593JB      B6          JetBlue Airways
## 10  2013     1     1     6  N3ALAA      AA   American Airlines Inc.
## # ... with 336,766 more rows
## 
## ntbt(data = ., fti = left_join, flights2, weatherIB)
## Joining, by = c("year", "month", "day", "hour", "origin")
## 
## * print <||> result *
## # A tibble: 336,776 x 18
##     year month   day  hour origin  dest tailnum carrier  temp  dewp humid
##    <dbl> <dbl> <int> <dbl>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
## 1   2013     1     1     5    EWR   IAH  N14228      UA    NA    NA    NA
## 2   2013     1     1     5    LGA   IAH  N24211      UA    NA    NA    NA
## 3   2013     1     1     5    JFK   MIA  N619AA      AA    NA    NA    NA
## 4   2013     1     1     5    JFK   BQN  N804JB      B6    NA    NA    NA
## 5   2013     1     1     6    LGA   ATL  N668DN      DL 39.92 26.06 57.33
## 6   2013     1     1     5    EWR   ORD  N39463      UA    NA    NA    NA
## 7   2013     1     1     6    EWR   FLL  N516JB      B6 39.02 26.06 59.37
## 8   2013     1     1     6    LGA   IAD  N829AS      EV 39.92 26.06 57.33
## 9   2013     1     1     6    JFK   MCO  N593JB      B6 39.02 26.06 59.37
## 10  2013     1     1     6    LGA   ORD  N3ALAA      AA 39.92 26.06 57.33
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <time>
## 
## ntbt(data = ., fti = left_join, flights2, planesIB, by = "tailnum")
## 
## * print <||> result *
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin  dest tailnum carrier year.y
##     <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>  <int>
## 1    2013     1     1     5    EWR   IAH  N14228      UA   1999
## 2    2013     1     1     5    LGA   IAH  N24211      UA   1998
## 3    2013     1     1     5    JFK   MIA  N619AA      AA   1990
## 4    2013     1     1     5    JFK   BQN  N804JB      B6   2012
## 5    2013     1     1     6    LGA   ATL  N668DN      DL   1991
## 6    2013     1     1     5    EWR   ORD  N39463      UA   2012
## 7    2013     1     1     6    EWR   FLL  N516JB      B6   2000
## 8    2013     1     1     6    LGA   IAD  N829AS      EV   1998
## 9    2013     1     1     6    JFK   MCO  N593JB      B6   2004
## 10   2013     1     1     6    LGA   ORD  N3ALAA      AA     NA
## # ... with 336,766 more rows, and 7 more variables: type <chr>,
## #   manufacturer <chr>, model <chr>, engines <int>, seats <int>,
## #   speed <int>, engine <chr>
## 
## ntbt(data = ., fti = left_join, flights2, airportsIB, c(dest = "faa"))
## 
## * print <||> result *
## # A tibble: 336,776 x 14
##     year month   day  hour origin  dest tailnum carrier
##    <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>
## 1   2013     1     1     5    EWR   IAH  N14228      UA
## 2   2013     1     1     5    LGA   IAH  N24211      UA
## 3   2013     1     1     5    JFK   MIA  N619AA      AA
## 4   2013     1     1     5    JFK   BQN  N804JB      B6
## 5   2013     1     1     6    LGA   ATL  N668DN      DL
## 6   2013     1     1     5    EWR   ORD  N39463      UA
## 7   2013     1     1     6    EWR   FLL  N516JB      B6
## 8   2013     1     1     6    LGA   IAD  N829AS      EV
## 9   2013     1     1     6    JFK   MCO  N593JB      B6
## 10  2013     1     1     6    LGA   ORD  N3ALAA      AA
## # ... with 336,766 more rows, and 6 more variables: name <chr>, lat <dbl>,
## #   lon <dbl>, alt <int>, tz <dbl>, dst <chr>
## 
## ntbt(data = ., fti = left_join, flights2, airportsIB, c(origin = "faa"))
## 
## * print <||> result *
## # A tibble: 336,776 x 14
##     year month   day  hour origin  dest tailnum carrier
##    <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>
## 1   2013     1     1     5    EWR   IAH  N14228      UA
## 2   2013     1     1     5    LGA   IAH  N24211      UA
## 3   2013     1     1     5    JFK   MIA  N619AA      AA
## 4   2013     1     1     5    JFK   BQN  N804JB      B6
## 5   2013     1     1     6    LGA   ATL  N668DN      DL
## 6   2013     1     1     5    EWR   ORD  N39463      UA
## 7   2013     1     1     6    EWR   FLL  N516JB      B6
## 8   2013     1     1     6    LGA   IAD  N829AS      EV
## 9   2013     1     1     6    JFK   MCO  N593JB      B6
## 10  2013     1     1     6    LGA   ORD  N3ALAA      AA
## # ... with 336,766 more rows, and 6 more variables: name <chr>, lat <dbl>,
## #   lon <dbl>, alt <int>, tz <dbl>, dst <chr>

names(iBag)

## [1] "flightsIB"  "airlinesIB" "weatherIB"  "planesIB"   "airportsIB"
## [6] "flights2"   "flights3"

The same, using intuEnv (output not shown):

clear_intuEnv()

intuEnv(flightsIB = flights,
        airlinesIB = airlines,
        weatherIB = weather,
        planesIB = planes,
        airportsIB = airports) %>%
  ntbt(select, flightsIB, year:day, hour, origin, dest, tailnum, carrier,
       "<|D| head > flights2") %>%
  ntbt(select, flights2, -origin, -dest, "<|| print > flights3") %>% 
  ntbt(left_join, flights3, airlinesIB, by = "carrier", "<|| print >") %>%
  ntbt(left_join, flights2, weatherIB, "<|| print >") %>%
  ntbt(left_join, flights2, planesIB, by = "tailnum", "<|| print >") %>%
  ntbt(left_join, flights2, airportsIB, c("dest" = "faa"), "<|| print >") %>%
  ntbt(left_join, flights2, airportsIB, c("origin" = "faa"), "<|| print >")

ls(intuEnv())

Note: the book is still not published (as of 8/27/16), so the examples in the chapter may have changed by the time you are reading this.

Packages containing interfaces

The 88 R packages that have interfaces implemented so far are:

adabag: Multiclass AdaBoost.M1, SAMME and Bagging
AER: Applied Econometrics with R
aod: Analysis of Overdispersed Data
ape: Analyses of Phylogenetics and Evolution
arm: Data Analysis Using Regression and Multilevel/Hierarchical Models
betareg: Beta Regression
brglm: Bias reduction in binomial-response generalized linear models
caper: Comparative Analyses of Phylogenetics and Evolution in R
car: Companion to Applied Regression
caret: Classification and Regression Training
coin: Conditional Inference Procedures in a Permutation Test Framework
CORElearn: Classification, Regression and Feature Evaluation
drc: Analysis of Dose-Response Curves
e1071: Support Vector Machines
earth: Multivariate Adaptive Regression Splines
EnvStats: Environmental Statistics, Including US EPA Guidance
fGarch: Rmetrics - Autoregressive Conditional Heteroskedastic Modelling
flexmix: Flexible Mixture Modeling
forecast: Forecasting Functions for Time Series and Linear Models
frontier: Stochastic Frontier Analysis
gam: Generalized Additive Models
gbm: Generalized Boosted Regression Models
gee: Generalized Estimation Equation Solver
glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models
glmx: Generalized Linear Models Extended
gmnl: Multinomial Logit Models with Random Parameters
gplots: Various R Programming Tools for Plotting Data
gss: General Smoothing Splines
graphics: The R Graphics Package
hdm: High-Dimensional Metrics
Hmisc: Harrell Miscellaneous
ipred: Improved Predictors
iRegression: Regression Methods for Interval-Valued Variables
ivfixed: Instrumental fixed effect panel data model
kernlab: Kernel-Based Machine Learning Lab
kknn: Weighted k-Nearest Neighbors
klaR: Classification and Visualization
lars: Least Angle Regression, Lasso and Forward Stagewise
lattice: Trellis Graphics for R
latticeExtra: Extra Graphical Utilities Based on Lattice
leaps: Regression Subset Selection
lfe: Linear Group Fixed Effects
lme4: Linear Mixed-Effects Models using 'Eigen' and S4
lmtest: Testing Linear Regression Models
MASS: Robust Regression, Linear Discriminant Analysis, Ridge Regression, Probit Regression, ...
MCMCglmm: MCMC Generalised Linear Mixed Models
mda: Mixture and Flexible Discriminant Analysis
metafor: Meta-Analysis Package for R
mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation
minpack.lm: R Interface to the Levenberg-Marquardt Nonlinear Least-Squares Algorithm Found in MINPACK, Plus Support for Bounds
mhurdle: Multiple Hurdle Tobit Models
mlogit: Multinomial logit model
mnlogit: Multinomial Logit Model
modeltools: Tools and Classes for Statistical Models
nlme: Linear and Nonlinear Mixed Effects Models
nlreg: Higher Order Inference for Nonlinear Heteroscedastic Models
nnet: Feed-Forward Neural Networks and Multinomial Log-Linear Models
ordinal: Regression Models for Ordinal Data
party: A Laboratory for Recursive Partytioning
partykit: A Toolkit for Recursive Partytioning
plotrix: Various Plotting Functions
pls: Partial Least Squares and Principal Component Regression
pROC: Display and Analyze ROC Curves
pscl: Political Science Computational Laboratory, Stanford University
psychomix: Psychometric Mixture Models
psychotools: Infrastructure for Psychometric Modeling
psychotree: Recursive Partitioning Based on Psychometric Models
quantreg: Quantile Regression
randomForest: Random Forests for Classification and Regression
Rchoice: Discrete Choice (Binary, Poisson and Ordered) Models with Random Parameters
rminer: Data Mining Classification and Regression Methods
rms: Regression Modeling Strategies
robustbase: Basic Robust Statistics
rpart: Recursive Partitioning and Regression Trees
RRF: Regularized Random Forest
RWeka: R/Weka Interface
sampleSelection: Sample Selection Models
sem: Structural Equation Models
spBayes: Univariate and Multivariate Spatial-temporal Modeling
stats: The R Stats Package (glm, lm, loess, lqs, nls, ...)
strucchange: Testing, Monitoring, and Dating Structural Changes
survey: Analysis of Complex Survey Samples
survival: Survival Analysis
SwarmSVM: Ensemble Learning Algorithms Based on Support Vector Machines
systemfit: Estimating Systems of Simultaneous Equations
tree: Classification and Regression Trees
vcd: Visualizing Categorical Data
vegan: Community Ecology Package

The aim is to continue adding interfaces to most methodologies used in data science or other disciplines.

For now the main focus is on interfacing non-pipe-aware functions having "formula" and "data" (in that order), but the non-formula variants should also work (even cases currently lacking interfaces). As a proof of concept, two libraries that contain non-formula variants only (glmnet and lars) have also been interfaced.
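To make the distinction between the two variants concrete, here is a minimal sketch using only base R (no intubate involved). It assumes the USJudgeRatings data set that ships with R and uses cor.test, which offers both a formula and a non-formula variant; the object names by_formula and by_vectors are made up for illustration.

```r
## Formula variant: the variables are named in a formula and looked up in `data`.
by_formula <- cor.test(~ CONT + INTG, data = USJudgeRatings)

## Non-formula variant: the vectors are passed directly, so there is no
## `data` argument into which a pipeline could feed a data frame.
by_vectors <- with(USJudgeRatings, cor.test(CONT, INTG))

## Both variants estimate the same correlation.
all.equal(by_formula$estimate, by_vectors$estimate)
```

The non-formula call is the kind of case that is awkward to place in a pipeline without some help, because the data frame is not an argument of the function at all.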

Also, only packages on CRAN (in addition to the ones provided in the base installation of R) have been interfaced. Packages from, for example, Bioconductor could also be easily added, but I would need some help from the maintainers of those packages.

intubate core depends only on base, stats, and utils libraries. To keep it as lean as possible, and to be able to continue to include more interfaces without bloating your machine, starting from version 1.0.0 intubate will not install the packages that contain the functions that are interfaced. You will need to install them yourself, and load the corresponding libraries before using them in your pipelines. This also applies to magrittr (in case you want to use intubate without pipelines).
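As a rough sketch of what "install them yourself, and load the corresponding libraries" can look like in a script, the following checks whether a package is available before attaching it. The helper load_or_explain is hypothetical (it is not part of intubate); it only uses base R functions.

```r
## Hypothetical helper, not provided by intubate: attach a package if it is
## installed, otherwise say how to install it and carry on.
load_or_explain <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed; run install.packages(\"",
            pkg, "\") before using its interfaces.")
    return(invisible(FALSE))
  }
  ## character.only = TRUE is needed because pkg is a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

ok <- vapply(c("intubate", "magrittr"), load_or_explain, logical(1))

## Only run the pipeline when both packages could be attached.
if (all(ok)) {
  LifeCycleSavings %>%
    ntbt_lm(sr ~ pop15) %>%
    summary() %>%
    print()
}
```

The same pattern applies to any interfaced package: attach it (and magrittr, if you want pipelines) before calling the corresponding interface.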

Then, if you are only interested in a given field, say: bio-statistics, bio-informatics, environmetrics, econometrics, finance, machine learning, meta-analysis, pharmacokinetics, phylogenetics, psychometrics, social sciences, surveys, survival analysis, ..., you will not have to install every package for which interfaces are provided. You only need to install the subset of packages you intend to use (which are probably already installed on your machine).

Moreover, there are cases where some packages conflict if loaded simultaneously, leading to a segmentation fault. For example, kernlab functions fail when testing the whole set of examples provided with intubate, but not when testing only the kernlab examples in a clean environment. I do not know which other package or packages conflict with it. The only thing I know is that the package name comes alphabetically before kernlab.

I make no personal judgment (mostly due to personal ignorance) about the merit of any interfaced function. I have used only a subset of what is provided, and I am happy to include others, that I am currently unaware of, down the line. In principle I plan on including packages that list package Formula under reverse depends, imports, or suggests (I am still missing some of them). Adding interfaces is easy (and can be boring...) so I will appreciate it if you want to contribute (and you will be credited in the help of the interfaced package). Improvements to the provided examples are also welcome (such as making sure the data used is appropriate for the statistical technique used).

I do not claim to be a data scientist (I am barely a statistician and I still have almost no clue of what a data scientist is or is not, and my confusion about the subject only increases with time), nor someone entitled to tell you what to use or not.

As such, I am not capable of engaging in disputes about what is relevant or not, or, if there are competing packages, which to use. I will leave that to you to decide.

Please keep in mind that intubate will not install any packages corresponding to the interfaces that are provided. You can install only those that you need (or like) and disregard the rest. Also please remember that you can create your own interfaces (using helper function intubate), or call non-pipe-aware functions directly (using ntbt).

The original aim of intubate was to be able to include functions that have formula and data (in that order) in a magrittr pipeline using %>%. As such, my search so far has concentrated on packages containing formulas and misplaced (from the point of view of pipes) data (with the exception of a couple of packages with non-formula variants interfaced as proofs of concept).

For example, this was the first implementation of ntbt_lm:

ntbt_lm <- function(data, formula, ...)
  lm(formula, data, ...)

This approach was supposed to be repeated for each interface.
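To see why such a one-line wrapper is enough to make lm usable in a pipeline, here is a small sketch of that original approach. The name my_ntbt_lm is hypothetical and chosen only so that it does not mask the ntbt_lm shipped with intubate; the pipeline part runs only if magrittr is installed.

```r
## Hand-written interface in the style of the first implementation:
## `data` comes first, so the left-hand side of %>% lands in the right place.
my_ntbt_lm <- function(data, formula, ...)
  lm(formula, data, ...)

fit_plain <- lm(sr ~ pop15, LifeCycleSavings)
fit_iface <- my_ntbt_lm(LifeCycleSavings, sr ~ pop15)

## Same fitted coefficients either way.
all.equal(coef(fit_plain), coef(fit_iface))

## One visible shortcoming of the naive wrapper: the call stored in the fit
## refers to the wrapper's argument names instead of the original expressions.
fit_iface$call

if (requireNamespace("magrittr", quietly = TRUE)) {
  library(magrittr)
  LifeCycleSavings %>%
    my_ntbt_lm(sr ~ pop15) %>%
    coef() %>%
    print()
}
```

Writing one such wrapper per function is straightforward but repetitive, which is the motivation for the helper functions described next.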

Soon after, I realized that intubate could have just a few helper functions (that was version 0.99.2), later that a single helper function was enough (intubate), and later still that you could call non-pipe-aware functions directly without defining interfaces (ntbt), and that both the interfaces and ntbt could also be used successfully in cases where non-formula variants are implemented.

However, the starting point inevitably led the way. I did not see the big picture (well, what today I think the big picture is...), so the current version only addresses packages containing functions that use the formula variant, even though in those cases you can also use the non-formula variants (see the examples corresponding to pROC, where both the formula and non-formula cases are demonstrated; you should be able to use that technique for the rest of the packages as well).

I am brewing some ideas about a general approach to packages that do not use a formula interface, but I leave it for a future iteration of intubate.

This means that there are three possible reasons why your favorite package may not be included for the time being:

1. The package only uses matrices or x- y- like notation (and not formulas).
2. (more likely reason) I should know better, but I missed it (the truth is that by implementing the supplied interfaces I realized how little I knew, and still know, about a field in which I am supposed to be an expert), and I apologize for that.
3. I got to the point where I need to take a rest (this reason is competing with 2. with increasing strength as time passes by).

Also, please keep in mind you can always create your own interfaces (with the helper function intubate), or call the non-pipe-aware functions directly (with ntbt).
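As a reminder of what those two options look like, here is a hedged sketch. It assumes that an interface is created "on demand" by assigning the helper intubate to a name of the form ntbt_<function>, as I understand the "Interfaces on demand" section to describe, and it uses cor.test with USJudgeRatings purely for illustration. The objects res_iface and res_ntbt are made-up names, and the whole block is skipped if intubate or magrittr is not installed.

```r
res_iface <- NULL
res_ntbt  <- NULL

if (requireNamespace("intubate", quietly = TRUE) &&
    requireNamespace("magrittr", quietly = TRUE)) {
  library(intubate)
  library(magrittr)

  ## Assumption: the name after "ntbt_" tells the helper which function
  ## to interface (here, cor.test).
  ntbt_cor.test <- intubate

  ## Option 1: use the interface defined above.
  res_iface <- USJudgeRatings %>%
    ntbt_cor.test(~ CONT + INTG)

  ## Option 2: no interface at all, call cor.test through ntbt.
  res_ntbt <- USJudgeRatings %>%
    ntbt(cor.test, ~ CONT + INTG)

  print(res_iface)
  print(res_ntbt)
}
```

Either route should give the same test result as calling cor.test(~ CONT + INTG, data = USJudgeRatings) directly.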

Bugs and Feature requests

The robustness and generality of the interfacing machinery still need to be further verified (and very likely improved), as there are thousands of potential functions to interface and certainly some are bound to fail when interfaced. Some failures have already been addressed while implementing the provided interfaces (as their examples failed).

The goal is to make intubate increasingly robust by addressing the peculiarities of newly discovered failing functions.

For the time being, only cases where the interfaces provided with intubate fail will be considered as bugs.

Failures of user defined interfaces, or of ntbt when calling directly functions that do not have interfaces provided with released versions of intubate, will be considered feature requests.

Of course, if you have some coding skills and can follow the code of the interface, it will be greatly appreciated if you provide a proposed solution (one that does not break anything else) together with the feature request.

Logo of `intubate`

The logo of intubate is: <||>. It corresponds to an intuBorder. I have not found it in a Google search as of 2016/08/08. I intend to use it as a visual identification of intubate. If you know of it having been in use before this date in any software related project, please let me know, and I will change it.

Names used

intuBorder(s) and intubOrder(s), as of 2016/08/08, have only been found, on Google, in a snippet of code as the name of a variable (intUBorder) (http://www.office-loesung.de/ftopic246897_0_0_asc.php) that would mean something like an "integer upper border". There is also an intLBorder for the lower border.

intuBag(s), as of 2016/08/08, seems to be used for a small bag for bikes (InTuBag, meaning Inner Tub Bag) (https://felvarrom.com/products/intubag-bike-tube-bag-medium-blue-inside?variant=18439367751), but not for anything software related. If intubate succeeds, they may end up selling more InTuBags!

intubate, as of 2016/08/08, seems to be used in relation to the medical procedure, and perhaps also by the oil pipeline industry (at least "entubar" in Spanish is more general than the medical procedure), but not for software related projects.

intuEnv, as of 2016/08/18, was found only in some Latin text.

I intend to use "intubate", "<||>", "intuBorder", "intubOrder(s)", "intuBag(s)", "intuEnv(s)" and other derivations starting with "intu", in relation to the use and promotion of "intubate" for software related activities.

At some point I intend to register the names and logo as trademarks.


intubate documentation built on May 2, 2019, 2:46 p.m.
