Modeling-Oriented Exploratory Data Analysis (MOEDA) is a novel approach to EDA that aims to show you quickly how useful your data is for modelling a target variable. This gives any data analyst or scientist an immediate feel for what can be expected from the data when it is taken through ML or statistical modelling initiatives. It addresses the main shortcoming of traditional EDA, which tells you what the data is (how many variables of each type, how many nulls, etc.) instead of how useful it is. MOEDA also challenges “automatic” EDA approaches, which only automate the creation of dozens of charts without immediately surfacing the pros and cons of your explanatory variables.
MOEDA is still in its infancy, so if you want to jump on board, prepare for a bumpy road ahead. Although the idea for this method had been brewing in my head for a while and I use the package daily, I currently consider it usable only in a personal context. The ambition is, of course, to move it to CRAN when it matures. If you are fine with all of that, fire away:
``` {r, eval = FALSE}
devtools::install_github("jarekkupisz/MOEDA")
```

## Usage example

For a detailed explanation of the method, please continue reading to the next paragraph. The package exports a single function. To use it, you typically only need to provide the target variable name, either as a string or unquoted (especially useful in a tidyverse piping context). You don’t need to worry about the rest.

```r
library(dplyr)
library(MOEDA)

iris %>% moeda(Species)
```
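How the package resolves the target internally is not shown in this README. As a rough illustration of how a function can accept the target both as a bare name and as a string, here is a minimal base R sketch. The helper `resolve_target()` is hypothetical and not part of MOEDA.

```r
# Hypothetical helper, not part of MOEDA: turn a target given either as a
# bare name or as a string into a column name of `df`.
resolve_target <- function(df, target) {
  expr <- substitute(target)  # capture the argument without evaluating it
  name <- if (is.character(expr)) expr else deparse(expr)
  if (!name %in% names(df)) {
    stop("Column '", name, "' not found in the data frame.")
  }
  name
}

resolve_target(iris, Species)    # bare name
resolve_target(iris, "Species")  # string
```

Both calls resolve to the same column. A limitation of this simple approach is that a variable holding the column name (for example `x <- "Species"; resolve_target(iris, x)`) is read as the name `x` itself, not its value.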
In my daily work as a data scientist and/or analyst, I was tired of the same scenario repeating over and over. You get some data and try to do some EDA on it, typically with multiple tools and packages. You diligently plot your distributions, correlations and other charts, trying to take in as much information as possible. After some time you discover that these plots were mostly useless, as only a few variables in the dataset matter for modelling your target variable. I’ve done a thorough review of EDA tools and approaches, and I noticed several main problems:
The vision for MOEDA is to produce a single chart with a single function call that will tell you how useful your data is in ML. The idea behind the method is to select the variables that have the most influence on the dependent variable and visualize their groupings using an UpSet-style visualization.
The selection of variables is done by measuring the random forest permutation variable importance. This method ensures that the selected variables truly hold predictive power, and it is reasonably fast. Also, it is criminally underused at a time when, for some reason, many think that algorithm selection and tuning give you the most bang for your buck.
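To make the selection step concrete, here is a minimal sketch of ranking variables by random forest permutation importance. It assumes the `randomForest` package, which may not be the forest implementation MOEDA uses internally, and the choice of keeping two variables is arbitrary.

```r
library(randomForest)  # assumption: MOEDA may rely on a different forest package

set.seed(42)  # both the forest and the permutations are random

# importance = TRUE asks the forest to compute permutation importance
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# type = 1 selects the permutation-based measure (mean decrease in accuracy)
imp <- importance(rf, type = 1)

# Rank variables from most to least important and keep the top two
ranked <- sort(imp[, 1], decreasing = TRUE)
top_vars <- names(ranked)[seq_len(2)]
top_vars
```

Permutation importance measures how much out-of-bag accuracy drops when a variable's values are shuffled, so a variable only ranks high if the model actually relies on it for prediction.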
The visualization part is not yet finalized, as I am still figuring out the best output. Currently, if you run `moeda()`, the following things happen:

- The top variables (set with the `n_top_vars` argument) are discretized using equal-width discretization via `base::cut()`. You can select the number of cuts with the `cuts` argument.
- A `GGally::ggpairs()` plot of the top variables is printed.
- The function returns `df` and joins to it the top feature columns that were cut, together with the resulting intersections. These additional columns have `moedized` in their name.

The function uses NSE, so you can provide your target variable without quotation marks. It supports both regression and classification. Usage examples:
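```r
library(dplyr)  # provides the %>% pipe and the storms dataset
library(MOEDA)

moeda(mtcars, mpg)               # regression: numeric target
dplyr::storms %>% moeda(status)  # classification: categorical target
```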
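The discretization and grouping steps described above can be sketched in base R. This is only an illustration under assumptions: the two variables, the number of cuts and the `_bin` column suffix are chosen here for the example and do not mirror the package's internals or its `moedized` column naming.

```r
# Assumed for illustration: two top variables and three cuts
top_vars <- c("wt", "hp")
cuts <- 3

binned <- mtcars[, c("mpg", top_vars)]
for (v in top_vars) {
  # A single number as `breaks` makes cut() split the range into equal-width bins
  binned[[paste0(v, "_bin")]] <- cut(binned[[v]], breaks = cuts)
}

# Each observed combination of bins is one grouping; summarise the target in it
groups <- aggregate(mpg ~ wt_bin + hp_bin, data = binned, FUN = mean)
groups
```

Equal-width bins are simple to read off a chart, but they can leave some combinations empty or sparsely populated when a variable is skewed, which is why `aggregate()` only reports combinations that actually occur in the data.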