knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/" )
The goal of datools is to collect convenient tools for machine learning consulting in R.
You can install datools from GitHub with:
# install.packages("devtools")
devtools::install_github("DoktorMike/datools")
Optionally, you can also install the Rgraphviz package, which is required for the graph learning visualization.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Rgraphviz")
Say you have a vector of weekdays that you would like one-hot encoded for use in your algorithms. Then oneHotEncoder comes to the rescue!
## basic example code
library(datools)
library(lubridate)
oneHotEncoder(x = wday(seq(as.Date("2017-10-07"), by = "days", length.out = 10), label = TRUE))
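For readers who want to see what one-hot encoding produces without installing anything, here is a minimal base R sketch. It does not use oneHotEncoder and is not meant to mirror its implementation or output format; the day_names vector and the use of model.matrix are choices made for this illustration only.

```r
# Base R sketch of one-hot encoding weekdays (illustration only, not datools code).
dates <- seq(as.Date("2017-10-07"), by = "days", length.out = 10)

# as.POSIXlt()$wday is locale independent: 0 = Sunday, ..., 6 = Saturday.
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
wd <- factor(day_names[as.POSIXlt(dates)$wday + 1], levels = day_names)

# Dropping the intercept (- 1) gives one indicator column per factor level.
one_hot <- model.matrix(~ wd - 1)
colnames(one_hot) <- levels(wd)

print(one_hot)
```

Each row contains exactly one 1, in the column of that date's weekday.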
If you ever need to show your peers which direction your original variables point in PCA space, this function comes in very handy.
library(datools)
library(ggplot2) # provides theme_minimal()
data(iris)
plotPCAComponent(iris[, -5], iris$Species) + theme_minimal()
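The directions in question are the PCA loadings, which can also be inspected numerically. The sketch below uses base R's prcomp and assumes centered, scaled inputs; whether plotPCAComponent scales the data the same way is an assumption, not something this README states.

```r
# Base R sketch: loadings of the original iris variables on the first two PCs.
# Centering and scaling are assumptions made for this illustration.
pca <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)

# Each row shows how one original variable points along PC1 and PC2.
loadings <- pca$rotation[, 1:2]
print(round(loadings, 3))
```

Variables with similar loadings point in roughly the same direction in the PC1/PC2 plane.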
Splitting up a data.frame or a tibble into N buckets of size K is sometimes a hassle. rangeToBuckets comes to the rescue! In this example we'll split up the mtcars dataset, perform a simple regression on each subset of the data, and show the results.
library(datools)
indsList <- rangeToBuckets(1:nrow(mtcars), 10)
sapply(indsList, function(x) coef(lm(mpg ~ disp, data = mtcars[x, ])))
Of course we can make this nicer by running more splits and doing it all in one go:
library(datools)
library(dplyr)
sapply(rangeToBuckets(1:nrow(mtcars), 4), function(x) coef(lm(mpg ~ disp, data = mtcars[x, ]))) %>%
  t() %>%
  knitr::kable()
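The idea of bucketing row indices can be sketched in base R as well. bucket_indices below is a hypothetical helper written for this illustration; it is not the rangeToBuckets implementation, and it assumes the second argument is the bucket size, which may differ from how rangeToBuckets interprets it.

```r
# Hypothetical helper (not part of datools): split 1..n into consecutive
# buckets of at most `size` indices each.
bucket_indices <- function(n, size) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / size)))
}

buckets <- bucket_indices(nrow(mtcars), 10)

# Fit mpg ~ disp on each bucket; sapply returns one column per bucket.
coefs <- sapply(buckets, function(i) coef(lm(mpg ~ disp, data = mtcars[i, ])))
print(t(coefs))
```

Note that the last bucket may be smaller than the others, so its regression rests on fewer rows.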
Here we'll look at some ways to detect dependencies and hierarchies between the variables in a given dataset. As usual we'll use a simple dataset that is available in R: mtcars. Tadaaa! Let's pretend you were given this dataset, have no clue how best to go about things, and feel like exploring. So let's start by looking at the data.
data(mtcars)
sapply(mtcars, summary) %>% t()
So far so good. Now how do these guys relate to each other? Well, we could fit every possible linear model over all the variables. That still wouldn't give us the hierarchy between the variables though. So can we do better? Why yes, yes we can.
library(datools)
library(Rgraphviz)
library(bnlearn)
data(mtcars)
myfit <- discover_hierarchy_and_fit(mtcars)
graphviz.plot(myfit)
So from this graph we can see that qsec is actually the last node in the hierarchy. Regression-wise, this node is affected by a lot of other variables but does not affect them in return. Note that we're only measuring correlation and graph factorization here. This is not a proper causality claim, but it might be indicative of one.
We could have a look at a textual representation of this graph as well if we're not into visualizations.
arcs(myfit)
Say now that we want to know from this fit which variables carb is affected by, and by how much. In this case we simply look at
myfit$carb$coefficients
This can also be confirmed by running
coef(lm(carb~mpg+cyl+gear, data=mtcars))
in which you can see that the edges between nodes are fitted with maximum likelihood estimation. This is not the model you would have gotten had you decided to model carb in a flat structure, as is evident from:
coef(lm(carb~., data=mtcars))
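To make the difference easier to see, the two sets of coefficients can be placed side by side. This is a minimal base R sketch that assumes, as in the lm call above, that the parents of carb in the learned graph are mpg, cyl and gear; the names local_fit, flat_fit and comparison are introduced here for illustration.

```r
# Local model: carb regressed only on its assumed parents in the graph.
local_fit <- coef(lm(carb ~ mpg + cyl + gear, data = mtcars))

# Flat model: carb regressed on every other variable in mtcars.
flat_fit <- coef(lm(carb ~ ., data = mtcars))

# Align the flat coefficients with the terms of the local model.
comparison <- cbind(local = local_fit, flat = flat_fit[names(local_fit)])
print(round(comparison, 3))
```

The shared terms get different estimates because the flat model also conditions on the remaining variables.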
You can also get more information about a particular part of the graph by looking at the local model inside the graph.
myfit$qsec
Please note that the datools project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.