knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/"
)

datools

The goal of datools is to cover a lot of convenient tools useful for machine learning consulting using R

Build status

Build Status Coverage Status wercker status lifecycle docs

Installation

You can install datools from github with:

# install.packages("devtools")
devtools::install_github("DoktorMike/datools")

Optionally you can also install the Rgraphviz package which is required for the graph learning visualization.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rgraphviz")

Example

Say you have a vector of weekdays and you would really like to have that one hot encoded for use in your algorithms then oneHotEncoder comes to the rescue!

## basic example code
library(datools)
library(lubridate)
oneHotEncoder(x=wday(seq(as.Date("2017-10-07"), by ="days", length.out = 10), 
                     label = TRUE))

PCA fun

If you ever need to illustrate for your peers what sort of direction your original data is pointing to in the PCA space this function comes very much in handy.

library(datools)
data(iris)
plotPCAComponent(iris[,-5], iris$Species) + theme_minimal()

Indices

Splitting up a data.frame or a tibble into N buckets of size K is sometimes a hassle. The rangeToBuckets come to the rescue! In this example we'll split up the mtcars dataset and perform a simple regression on each subset of the data and show the results.

library(datools)
indsList <- rangeToBuckets(1:nrow(mtcars), 10)
sapply(indsList, function(x) coef(lm(mpg~disp, data=mtcars[x,])))

Of course we can make this nices by running more splits and making all of it in one go

library(datools)
library(dplyr)
sapply(rangeToBuckets(1:nrow(mtcars), 4), 
       function(x) coef(lm(mpg~disp, data=mtcars[x,]))) %>% 
  t() %>% knitr::kable()

Discovering relationships in your dataset

Here we'll look at some ways to detect dependencies and hierarchies between your variables in a given dataset. As per usual we'll use a simple dataset that is available in R. The mtcars tadaaa! Let's pretend you were given this dataset and have no clue how to best go about things and you feel like exploring. So let's start by looking at the data.

data(mtcars)
sapply(mtcars, summary) %>% t()

So far so good. Now how do these guys relate to each other? Well we could go about this by fitting every single linear model we could given all variables. That still wouldn't give us the hierarchy between all variables though. So can we do better? Why yes, yes we can.

library(datools)
library(Rgraphviz)
library(bnlearn)
data(mtcars)
myfit<-discover_hierarchy_and_fit(mtcars)
graphviz.plot(myfit)

So from this graph we can see that qsec is actually the last node in the hierarchy. Regression wise this node is affected by a lot of other variables but does not affect them in return. Notice here that we're only measuring correlation and graph factorization here. This is not a proper causality claim, but it might be indicative of it.

We could have a look at a textual representation of this graph as well if we're not into visualizations.

arcs(myfit)

Say now that we want to know from this fit which variables carb is affected by and by how much. In this case we simply look at

myfit$carb$coefficients

This can also be confirmed by running

coef(lm(carb~mpg+cyl+gear, data=mtcars))

in which you can see that the edges between each node is fitted with a maximum likelihood estimation. This is not the model you would have gotten even if you decided to model carb in a flat structure as evident from:

coef(lm(carb~., data=mtcars))

You can also get more information about a particular part of the graph by looking at the local model inside the graph.

myfit$qsec

Code of Conduct

Please note that the datools project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.



DoktorMike/datools documentation built on Feb. 28, 2021, 8:39 a.m.