knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code! There are three ways to use the package: explore data interactively, generate an automated report, or use low-code functions for manual exploration (`explore()`, `describe()`, `explain_*()`, `report()`, `abtest()`, ...).
explore package on GitHub: https://github.com/rolkra/explore
As the explore functions fit well into the tidyverse, we load the dplyr package as well.
library(dplyr)
library(explore)
Explore your data set (in this case the iris data set) in one line of code:
```{R eval=FALSE, echo=TRUE}
explore(iris)
```

A shiny app is launched. You can inspect individual variables, explore their relation to a target (binary / categorical / numerical), grow a decision tree or create a fully automated report of all variables with a few "mouse clicks".

![](../man/figures/explore-shiny-iris-target-species.png){width=600px}

You can choose any variable as a target, as long as it is binary (0/1, FALSE/TRUE or "no"/"yes"), categorical or numeric.

### Report variables

Create a rich HTML report of all variables with one line of code:

```{R eval=FALSE, echo=TRUE}
# report of all variables
iris %>% report(output_file = "report.html", output_dir = tempdir())
```
Or you can simply add a target and create the report. In this case we use a binary target, but a categorical or numerical target would work as well.
```{R eval=FALSE, echo=TRUE}
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% report(output_file = "report.html", output_dir = tempdir(), target = is_versicolor)
```

If you use a binary target, the parameter ***split = FALSE*** (or `targetpct = TRUE`) will give you a different view of the data.

![](../man/figures/report-target.png){width=600px}

### Grow a decision tree

Grow a decision tree with one line of code:

```r
iris %>% explain_tree(target = Species)
```
You can grow a decision tree with a binary target too.
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% select(-Species) %>% explain_tree(target = is_versicolor)
Or using a numerical target. The syntax stays the same.
iris %>% explain_tree(target = Sepal.Length)
You can control the growth of the tree using the parameters `maxdepth`, `minsplit` and `cp`.
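The parameter names match those of `rpart::rpart.control()`. Assuming they carry the same meaning here (an assumption, not confirmed by this text), the following sketch uses rpart directly to show their effect. The values 2, 20 and 0.01 are arbitrary illustrations.

```r
library(rpart)

# Fit a classification tree on iris with explicit growth limits.
# Assumption: maxdepth, minsplit and cp behave as in rpart.control().
fit <- rpart(
  Species ~ .,
  data = iris,
  method = "class",
  control = rpart.control(
    maxdepth = 2,   # at most 2 levels of splits below the root
    minsplit = 20,  # a node needs at least 20 observations to be split
    cp = 0.01       # a split must improve the fit by at least this factor
  )
)

# Node numbers encode depth: node k sits at depth floor(log2(k))
node_depth <- floor(log2(as.integer(rownames(fit$frame))))
max(node_depth)
```

A smaller `maxdepth` or a larger `minsplit` or `cp` gives a smaller tree that is easier to read.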
Explore your table with one line of code to see which type of variables it contains.
iris %>% explore_tbl()
You can also use describe_tbl() if you just need the main facts without visualisation.
iris %>% describe_tbl()
Explore a variable with one line of code. You don't have to care if a variable is numerical or categorical.
iris %>% explore(Species)
iris %>% explore(Sepal.Length)
Explore a variable and its relationship with a binary target with one line of code. You don't have to care if a variable is numerical or categorical.
iris %>% explore(Sepal.Length, target = is_versicolor)
Using split = FALSE will change the plot to %target:
iris %>% explore(Sepal.Length, target = is_versicolor, split = FALSE)
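One way to read the %target view: for each range of the variable, it shows the share of observations where the target equals 1. The base R sketch below computes such shares for Sepal.Length. The bin boundaries are arbitrary assumptions and will likely not match the bins that explore() chooses.

```r
# Binary target: 1 for versicolor, 0 otherwise
is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)

# Cut Sepal.Length into bins (boundaries chosen for illustration only)
bins <- cut(iris$Sepal.Length, breaks = c(4, 5, 6, 7, 8))

# Share of target == 1 per bin, in percent
pct_target <- tapply(is_versicolor, bins, mean) * 100
round(pct_target, 1)
```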
The target can have more than two levels:
iris %>% explore(Sepal.Length, target = Species)
Or the target can even be numeric:
iris %>% explore(Sepal.Length, target = Petal.Length)
iris %>% select(Sepal.Length, Sepal.Width) %>% explore_all()
iris %>% select(Sepal.Length, Sepal.Width, is_versicolor) %>% explore_all(target = is_versicolor)
iris %>% select(Sepal.Length, Sepal.Width, is_versicolor) %>% explore_all(target = is_versicolor, split = FALSE)
iris %>% select(Sepal.Length, Sepal.Width, Species) %>% explore_all(target = Species)
iris %>% select(Sepal.Length, Sepal.Width, Petal.Length) %>% explore_all(target = Petal.Length)
data(iris)
To use a high number of variables with explore_all() in an R Markdown file, you need to set a meaningful fig.width and fig.height in the chunk options. The function total_fig_height() helps to set fig.height automatically: fig.height=total_fig_height(iris)
iris %>% explore_all()
If you use a target: fig.height=total_fig_height(iris, var_name_target = "Species")
iris %>% explore_all(target = Species)
You can control total_fig_height() with the parameters ncols (number of plot columns) and size (height of one plot).
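To get a feeling for how these two parameters interact, here is a minimal sketch of a plausible calculation: the number of plot rows times the height of one plot. The helper `sketch_fig_height()` and its default values are hypothetical and not part of the explore package; the actual total_fig_height() may compute the height differently.

```r
# Hypothetical helper, not part of the explore package:
# rows of plots needed, times the height of one plot
sketch_fig_height <- function(n_vars, ncols = 2, size = 3) {
  ceiling(n_vars / ncols) * size
}

# 5 variables arranged in 2 columns need 3 rows of plots
sketch_fig_height(5)

# More columns and smaller plots reduce the total height
sketch_fig_height(5, ncols = 3, size = 2)
```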
Explore correlation between two variables with one line of code:
iris %>% explore(Sepal.Length, Petal.Length)
You can add a target too:
iris %>% explore(Sepal.Length, Petal.Length, target = Species)
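If you want a number to go with the plot, base R can compute the correlation directly. This sketch is independent of the explore package and uses Pearson correlation, the default of `cor()`. It also shows the correlation within each level of the target.

```r
# Pearson correlation between the two variables
r <- cor(iris$Sepal.Length, iris$Petal.Length)
round(r, 2)

# Correlation within each species (the target)
by_species <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Petal.Length)
)
round(by_species, 2)
```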
If you use explore() to explore a variable and want to set lower and upper limits for its values, you can use the `min_val` and `max_val` parameters. All values below min_val will be set to min_val. All values above max_val will be set to max_val.
iris %>% explore(Sepal.Length, min_val = 4.5, max_val = 7)
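The clamping rule described above can be reproduced in base R with `pmax()` and `pmin()`. This is only a sketch of the rule as stated, not the package's internal code.

```r
x <- iris$Sepal.Length

# Clamp: raise values below 4.5 to 4.5, lower values above 7 to 7
clamped <- pmin(pmax(x, 4.5), 7)

range(x)           # original range
range(clamped)     # range after clamping
sum(x != clamped)  # number of values that were changed
```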
`explore()` uses auto-scale by default. To deactivate it, use the parameter `auto_scale = FALSE`:
iris %>% explore(Sepal.Length, auto_scale = FALSE)
Describe your data in one line of code:
iris %>% describe()
The result is a data frame, where each row is a variable of your data. You can use `filter()` from dplyr for quick checks:
# show all variables that contain fewer than 5 unique values
iris %>% describe() %>% filter(unique < 5)
# show all variables that contain NA values
iris %>% describe() %>% filter(na > 0)
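To make the idea of "one row per variable" concrete, the sketch below builds a similar overview in base R with just the two columns used in the filters above (`na` and `unique`). It is an illustration only; the output of describe() has more columns and may count values differently.

```r
# One row per variable, with counts of missing and distinct values
overview <- data.frame(
  variable = names(iris),
  na       = vapply(iris, function(x) sum(is.na(x)), integer(1)),
  unique   = vapply(iris, function(x) length(unique(x)), integer(1)),
  row.names = NULL
)

# Variables with fewer than 5 distinct values
overview[overview$unique < 5, ]
```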
You can use `describe()` for describing variables too. You don't need to care if a variable is numerical or categorical. The output is text.
# describe a categorical variable
iris %>% describe(Species)
# describe a numerical variable
iris %>% describe(Sepal.Length)
Use one of the prepared datasets to explore:
# create dataset and describe it
data <- create_data_app(obs = 100)
describe(data)
# create dataset and describe it
data <- create_data_random(obs = 100, vars = 5)
describe(data)
You can build your own random dataset by using the `create_data_empty()` and `add_var_random_*()` functions:
# create dataset and describe it
data <- create_data_empty(obs = 1000) %>%
  add_var_random_01("target") %>%
  add_var_random_dbl("age", min_val = 18, max_val = 80) %>%
  add_var_random_cat("gender", cat = c("male", "female", "other"), prob = c(0.4, 0.4, 0.2)) %>%
  add_var_random_starsign() %>%
  add_var_random_moon()
describe(data)
data %>% select(random_starsign, random_moon) %>% explore_all()
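The `cat` and `prob` arguments above describe a weighted random draw. As a rough base R illustration of that idea (not the package's implementation, which may differ), `sample()` can draw categories with given probabilities. The seed value is an arbitrary choice to make the draw reproducible.

```r
set.seed(42)  # make the random draw reproducible

obs <- 1000
gender <- sample(
  c("male", "female", "other"),
  size = obs,
  replace = TRUE,
  prob = c(0.4, 0.4, 0.2)
)

# Observed shares should be close to the requested probabilities
round(prop.table(table(gender)), 2)
```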
To clean a variable you can use `clean_var()`. With one line of code you can rename a variable, replace NA values and set a minimum and maximum for the value.
iris %>% clean_var(Sepal.Length, min_val = 4.5, max_val = 7.0, na = 5.8, name = "sepal_length") %>% describe()
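For comparison, here are the same three steps written out in base R on a small made-up data frame. The order of the steps (replace NA first, then apply the limits) is an assumption for this sketch; the text does not say in which order clean_var() applies them.

```r
# Small made-up example with a missing value and two outliers
df <- data.frame(Sepal.Length = c(4.3, 5.1, NA, 7.9))

x <- df$Sepal.Length
x[is.na(x)] <- 5.8            # replace NA values
x <- pmin(pmax(x, 4.5), 7.0)  # enforce minimum and maximum

# Store under the new name and drop the old column
df$sepal_length <- x
df$Sepal.Length <- NULL
df
```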
Create an R Markdown template to explore your own data. Set output_dir (an existing file may be overwritten):
create_notebook_explore( output_dir = tempdir(), output_file = "notebook-explore.Rmd")