knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code!
The Titanic dataset is available in base R. Converted to a data frame, it has 5 variables and only 32 rows. A row does NOT represent a single observation. The data is not tidy; instead, each row holds a frequency. And it is not a data frame, so we need to convert it first.
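To make the starting point concrete, here is a minimal base R sketch of that conversion. It uses the built-in `Titanic` table and `as.data.frame()`; the name `titanic_counts` is only illustrative, and the count column is called `Freq` here rather than `n`.

```r
# Titanic ships with base R (package "datasets") as a 4-dimensional table
class(Titanic)
dim(Titanic)

# as.data.frame() yields one row per combination of the four factors,
# plus a Freq column holding the number of people in that combination
titanic_counts <- as.data.frame(Titanic)
str(titanic_counts)
```

The result has 32 rows because there are 4 x 2 x 2 x 2 combinations of Class, Sex, Age and Survived.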
library(dplyr)
library(explore)
titanic <- use_data_titanic(count = TRUE)
titanic %>% describe_tbl(n = n)
titanic %>% describe()
All variables are categorical except n, which holds the number of observations for each combination of the other variables.
The data look like this:
titanic %>% head(10)
As the normal explore() function of the {explore} package expects a tidy dataset (each row is an observation), we need to add the parameter n (number of observations).
titanic %>% explore(Class, n = n)
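For comparison, the tidy form that a count column stands in for can be sketched in base R by repeating each row as often as its count says. This is only an illustration built on `as.data.frame(Titanic)` (count column `Freq`); the names `titanic_counts` and `titanic_tidy` are assumptions, and nothing here claims to be what explore() does internally.

```r
titanic_counts <- as.data.frame(Titanic)

# Repeat each row index Freq times (rows with Freq == 0 disappear)
idx <- rep(seq_len(nrow(titanic_counts)), times = titanic_counts$Freq)

# One row per person, without the count column
titanic_tidy <- titanic_counts[idx, c("Class", "Sex", "Age", "Survived")]
rownames(titanic_tidy) <- NULL

nrow(titanic_tidy)
table(titanic_tidy$Class)
```

Passing a count column avoids this expansion, which matters more as counts grow.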
We get the exact numbers by using describe() together with the parameter n (used as a weight).
titanic %>% describe(Class, n = n)
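As a cross-check, the same kind of weighted totals can be computed in base R with `xtabs()`, which sums the count column per category. The sketch assumes the base R `Titanic` table with its `Freq` column; `class_totals` is an illustrative name.

```r
titanic_counts <- as.data.frame(Titanic)

# Sum the counts per Class instead of counting rows
class_totals <- xtabs(Freq ~ Class, data = titanic_counts)
class_totals

# The same totals as percentages of all people on board
round(100 * prop.table(class_totals), 1)
```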
To explore all variables, we can simply use explore_all(). You can automatically fit the height of the plot by setting fig.height=total_fig_height(titanic, var_name_n = "n") in the code chunk header.
titanic %>% explore_all(n = n)
Now we want to check the relation between variables and Survived. We can use the explore() function with Survived as target.
titanic %>% explore(Class, target = Survived, n = n, split = FALSE)
To get a better feeling for the relationship between Class and Survived, we switch to percentages and split the target into separate bars. We can do that by using split = TRUE (which is the default).
titanic %>% explore(Class, target = Survived, n = n, split = TRUE)
Now we get a plot where each color sums to 100%. A big difference in bar length therefore indicates an important relationship between the two variables. In this case, passengers in 1st Class had the highest probability of surviving.
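The two kinds of percentages involved can be sketched with base R. Reading "each color sums to 100%" as percentages within each value of Survived is an interpretation, not a statement about how explore() computes its plot; the data again comes from `as.data.frame(Titanic)` with the count column `Freq`, and the variable names are illustrative.

```r
titanic_counts <- as.data.frame(Titanic)
tab <- xtabs(Freq ~ Class + Survived, data = titanic_counts)

# Column percentages: within each Survived group, the Class shares sum to 100
pct_within_target <- 100 * prop.table(tab, margin = 2)
round(pct_within_target, 1)

# Row percentages: share of survivors within each Class
pct_within_class <- 100 * prop.table(tab, margin = 1)
round(pct_within_class, 1)
```

The row percentages are the ones that answer "how likely was a passenger of a given Class to survive".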
titanic %>% explore(Sex, target = Survived, n = n)
Female passengers were much more likely to survive!
titanic %>% explore(Age, target = Survived, n = n)
Children had a survival advantage.
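To put numbers on these two observations, a small helper can compute the weighted survival rate per category. `survival_rate()` is a hypothetical helper written for this sketch, not part of {explore}; it works on `as.data.frame(Titanic)` with the count column `Freq`.

```r
titanic_counts <- as.data.frame(Titanic)

# Hypothetical helper: weighted share of survivors per level of `var`
survival_rate <- function(var) {
  f <- reformulate(c(var, "Survived"), response = "Freq")  # Freq ~ var + Survived
  tab <- xtabs(f, data = titanic_counts)
  prop.table(tab, margin = 1)[, "Yes"]
}

rate_by_sex <- survival_rate("Sex")
rate_by_age <- survival_rate("Age")

round(100 * rate_by_sex, 1)
round(100 * rate_by_age, 1)
```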
Now we can create a simple decision tree. As we have count data, we need to pass the parameter n.
titanic %>% explain_tree(target = Survived, n = n)
We see that Sex and Class give a good explanation of who was more likely to survive.
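For readers who want to see how counts can enter a tree model as case weights, here is a sketch using the rpart package as a stand-in. It assumes rpart is installed (it is a recommended package in standard R distributions) and is not presented as what explain_tree() does internally; the control settings are choices made for this example.

```r
library(rpart)

titanic_counts <- as.data.frame(Titanic)
# Drop empty combinations so every row carries a positive weight
titanic_counts <- titanic_counts[titanic_counts$Freq > 0, ]

# Each row is weighted by its count. rpart's default minsplit (20) is large
# relative to the small number of count rows, so it is loosened here.
fit <- rpart(
  Survived ~ Class + Sex + Age,
  data = titanic_counts,
  weights = Freq,
  method = "class",
  control = rpart.control(minsplit = 2, minbucket = 1)
)

print(fit)
```

In this sketch the first split is on Sex, in line with the plots above.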
titanic %>% explore(Age, target = Class, n = n)
Children are rare in 1st Class! And all Crew members are adults, as expected.
titanic %>% explore(Sex, target = Class, n = n)
There are almost no female Crew members! Female passengers tend to travel in a better Class.
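The counts behind these last two plots can be checked with weighted cross-tabulations. This is a base R sketch on `as.data.frame(Titanic)` (count column `Freq`); the object names are illustrative.

```r
titanic_counts <- as.data.frame(Titanic)

# Number of children and adults per Class
age_by_class <- xtabs(Freq ~ Age + Class, data = titanic_counts)
age_by_class

# Share of male and female persons within each Class, in percent
sex_by_class <- xtabs(Freq ~ Sex + Class, data = titanic_counts)
round(100 * prop.table(sex_by_class, margin = 2), 1)
```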