The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code!

explore package on GitHub:

Because the explore functions fit well into the tidyverse, we load the dplyr package as well.
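A minimal setup chunk, assuming both packages are already installed:

```r
# Load dplyr first for the pipe and select()/filter(), then explore itself
library(dplyr)
library(explore)
```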


### Interactive data exploration

Explore your data set (in this case the iris data set) in one line of code:

```{R eval=FALSE, echo=TRUE}
explore(iris)
```

A shiny app is launched. You can inspect individual variables, explore their relationship to a target (binary / categorical / numerical), grow a decision tree or create a fully automated report of all variables with a few mouse clicks.

You can choose any variable as the target, as long as it is binary (0/1, FALSE/TRUE or "no"/"yes"), categorical or numeric.
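As a small illustration of the binary encodings listed above, the same target can be written in any of the three forms using base R only. The variable names here are hypothetical and chosen for this sketch.

```r
# Three encodings of the same binary target, built with base R
is_versicolor_num <- ifelse(iris$Species == "versicolor", 1, 0)          # 0/1
is_versicolor_lgl <- iris$Species == "versicolor"                        # FALSE/TRUE
is_versicolor_chr <- ifelse(iris$Species == "versicolor", "yes", "no")   # "no"/"yes"

# All three mark the same 50 observations
table(is_versicolor_num)
```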

### Report variables

Create a rich HTML report of all variables with one line of code:

```{R eval=FALSE, echo=TRUE}
# report of all variables
iris %>% report(output_file = "report.html", output_dir = tempdir())
```


Or you can simply add a target and create the report. In this case we use a binary target, but a categorical or numerical target would work as well.

```{R eval=FALSE, echo=TRUE}
# report of all variables and their relationship with a binary target
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% 
  report(output_file = "report.html", 
         output_dir = tempdir(),
         target = is_versicolor)
```

If you use a binary target, the parameter `split = FALSE` (or `targetpct = TRUE`) will give you a different view of the data.


### Grow a decision tree

Grow a decision tree with one line of code:

```{r}
iris %>% explain_tree(target = Species)
```

You can grow a decision tree with a binary target too.

```{r}
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% select(-Species) %>% explain_tree(target = is_versicolor)
```

Or using a numerical target. The syntax stays the same.

```{r}
iris %>% explain_tree(target = Sepal.Length)
```

You can control the growth of the tree using the parameters maxdepth, minsplit and cp.
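A minimal sketch of passing these three parameters. The values are illustrative only, and the comments assume the parameters carry the same meaning as the rpart control options of the same name, which this document does not state.

```r
library(dplyr)
library(explore)

# Illustrative values, not recommendations
iris %>% 
  explain_tree(target = Species,
               maxdepth = 2,    # assumed: maximum depth of the tree
               minsplit = 30,   # assumed: minimum observations in a node to attempt a split
               cp = 0.01)       # assumed: minimum complexity improvement required for a split
```

Smaller `maxdepth` or larger `minsplit` and `cp` values should give a smaller tree that is easier to read.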

To create other types of models use explain_forest(), explain_xgboost() and explain_logreg().

### Explore dataset

Explore your table with one line of code to see which type of variables it contains.

```{r}
iris %>% explore_tbl()
```

You can also use describe_tbl() if you just need the main facts without visualization.

```{r}
iris %>% describe_tbl()
```

### Explore variables

Explore a variable with one line of code. You don't have to care if a variable is numerical or categorical.

```{r}
iris %>% explore(Species)
iris %>% explore(Sepal.Length)
```

### Explore variables with a target

Explore a variable and its relationship with a binary target with one line of code. You don't have to care if a variable is numerical or categorical.

```{r}
iris %>% explore(Sepal.Length, target = is_versicolor)
```

Using split = FALSE will change the plot to %target:

```{r}
iris %>% explore(Sepal.Length, target = is_versicolor, split = FALSE)
```
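To make the idea of a %target view concrete, here is a rough sketch in plain dplyr that computes the share of `is_versicolor == 1` within intervals of `Sepal.Length`. It is not the package's internal code. The bin boundaries are hypothetical, and `explore()` chooses its own.

```r
library(dplyr)

# Binary target as used throughout this document
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)

# Percentage of target == 1 per interval of Sepal.Length
# (hypothetical breaks; all iris values fall between 4 and 8)
target_pct <- iris %>%
  mutate(bin = cut(Sepal.Length, breaks = c(4, 5, 6, 7, 8))) %>%
  group_by(bin) %>%
  summarise(n = n(), pct_target = 100 * mean(is_versicolor))

target_pct
```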

The target can have more than two levels:

```{r}
iris %>% explore(Sepal.Length, target = Species)
```

Or the target can even be numeric:

```{r}
iris %>% explore(Sepal.Length, target = Petal.Length)
```

### Explore multiple variables

```{r}
# two variables without a target
iris %>% 
  select(Sepal.Length, Sepal.Width) %>% 
  explore_all()

# binary target
iris %>% 
  select(Sepal.Length, Sepal.Width, is_versicolor) %>% 
  explore_all(target = is_versicolor)

# binary target, shown as %target
iris %>% 
  select(Sepal.Length, Sepal.Width, is_versicolor) %>% 
  explore_all(target = is_versicolor, split = FALSE)

# categorical target
iris %>% 
  select(Sepal.Length, Sepal.Width, Species) %>% 
  explore_all(target = Species)

# numerical target
iris %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length) %>% 
  explore_all(target = Petal.Length)
```

To use a large number of variables with explore_all() in an R Markdown file, you need to set a meaningful fig.width and fig.height in the chunk header. The function total_fig_height() helps to set fig.height automatically: fig.height=total_fig_height(iris)

```{r}
iris %>% 
  explore_all()
```

If you use a target: fig.height=total_fig_height(iris, var_name_target = "Species")

```{r}
iris %>% explore_all(target = Species)
```

You can control total_fig_height() with the parameters ncols (number of plot columns) and size (height of one plot).
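As a rough sketch, the value can also be computed in the console to see what a chunk header would receive. The argument values below are illustrative, not recommended settings.

```r
library(explore)

# Height that would be passed to fig.height for all variables in iris
h_all <- total_fig_height(iris, ncols = 2, size = 3)

# Same, when Species is used as the target
h_target <- total_fig_height(iris, var_name_target = "Species", ncols = 2, size = 3)

c(h_all, h_target)
```

If you change `ncols` here, the plot layout produced by explore_all() needs to use the same number of columns for the height to fit.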

### Explore correlation between two variables

Explore correlation between two variables with one line of code:

```{r}
iris %>% explore(Sepal.Length, Petal.Length)
```

You can add a target too:

```{r}
iris %>% explore(Sepal.Length, Petal.Length, target = Species)
```

### Explore options

If you use explore to explore a variable and want to set lower and upper limits for values, you can use the min_val and max_val parameters. All values below min_val will be set to min_val. All values above max_val will be set to max_val.

```{r}
iris %>% explore(Sepal.Length, min_val = 4.5, max_val = 7)
```
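The clamping rule described above can be sketched in base R. This only illustrates the stated behavior and is not the package's internal code.

```r
# Values below min_val are raised to min_val, values above max_val are lowered to max_val
min_val <- 4.5
max_val <- 7
clamped <- pmin(pmax(iris$Sepal.Length, min_val), max_val)

range(iris$Sepal.Length)  # 4.3 7.9
range(clamped)            # 4.5 7.0
```

The number of observations does not change. Only the extreme values are moved to the limits.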

explore uses auto-scale by default. To deactivate it, use the parameter auto_scale = FALSE.

```{r}
iris %>% explore(Sepal.Length, auto_scale = FALSE)
```

### Describing data

Describe your data in one line of code:

```{r}
iris %>% describe()
```

The result is a data frame in which each row is a variable of your data. You can use filter() from dplyr for quick checks:

```{r}
# show all variables that contain fewer than 5 unique values
iris %>% describe() %>% filter(unique < 5)
# show all variables that contain NA values
iris %>% describe() %>% filter(na > 0)
```

You can use describe() to describe individual variables too. You don't need to care whether a variable is numerical or categorical. The output is text.

```{r}
# describe a categorical variable
iris %>% describe(Species)
# describe a numerical variable
iris %>% describe(Sepal.Length)
```

### Use data

Use one of the prepared datasets to explore:

```{r}
use_data_beer() %>% describe()
```

### Create data

Create an artificial dataset with one of the create_data_*() functions:

```{r}
# create dataset and describe it
data <- create_data_app(obs = 100)
data %>% describe()
# create random dataset and describe it
data <- create_data_random(obs = 100, vars = 5)
data %>% describe()
```

You can build your own random dataset by using create_data_empty() and the add_var_random_*() functions:

```{r}
# create dataset and describe it
data <- create_data_empty(obs = 1000) %>% 
  add_var_random_01("target") %>% 
  add_var_random_dbl("age", min_val = 18, max_val = 80) %>% 
  add_var_random_cat("gender",   # variable name "gender" is assumed
                     cat = c("male", "female", "other"), 
                     prob = c(0.4, 0.4, 0.2)) %>% 
  add_var_random_starsign() %>%
  add_var_random_moon()
data %>% describe()
data %>% select(random_starsign, random_moon) %>% explore_all()
```

### Basic data cleaning

To clean a variable you can use clean_var. With one line of code you can rename a variable, replace NA-values and set a minimum and maximum for the value.

```{r}
iris %>% 
  clean_var(Sepal.Length, 
            min_val = 4.5, 
            max_val = 7.0, 
            na = 5.8, 
            name = "sepal_length") %>% 
  describe()
```
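For comparison, the three operations named above (replace NA, set minimum and maximum, rename) can be sketched in plain dplyr. This is an illustration of the described behavior on a small hypothetical data frame, not the implementation of clean_var.

```r
library(dplyr)

# Small hypothetical data frame with one missing value
df <- data.frame(Sepal.Length = c(4.3, 5.1, NA, 7.9))

cleaned <- df %>%
  mutate(Sepal.Length = coalesce(Sepal.Length, 5.8),              # replace NA with 5.8
         Sepal.Length = pmin(pmax(Sepal.Length, 4.5), 7.0)) %>%   # clamp to [4.5, 7.0]
  rename(sepal_length = Sepal.Length)                             # rename the variable

cleaned$sepal_length  # 4.5 5.1 5.8 7.0
```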

To drop variables or observations you can use drop_var_*() and drop_obs_*() functions.

```{r}
use_data_penguins() %>% describe()
use_data_penguins() %>%
  drop_obs_with_na() %>%
  describe()
```

### Create notebook

Create an R Markdown template to explore your own data. Set output_dir to the target folder (an existing file may be overwritten).

```{R eval=FALSE, echo=TRUE}
create_notebook_explore(output_dir = tempdir(),
                        output_file = "notebook-explore.Rmd")
```


