#| label = "setup",
#| include = FALSE
source("../setup.R")
#| label = "suggested_pkgs",
#| include = FALSE
pkgs <- "gapminder"

successfully_loaded <- purrr::map_lgl(pkgs, requireNamespace, quietly = TRUE)
can_evaluate <- all(successfully_loaded)

if (can_evaluate) {
  purrr::walk(pkgs, library, character.only = TRUE)
} else {
  knitr::opts_chunk$set(eval = FALSE)
}
You can cite this package/vignette as:
#| label = "citation",
#| echo = FALSE,
#| comment = ""
citation("ggstatsplot")
The function ggcorrmat()
provides a quick way to produce a publication-ready correlation matrix plot (aka correlogram). The function can also be used
for quick data exploration. In addition to the plot, it can also be used to
get a correlation coefficient matrix or the associated p-value matrix.
This function is a convenient wrapper around the ggcorrplot::ggcorrplot()
function with some additional functionality.
We will see examples of how to use this function in this vignette with the
gapminder
and diamonds
datasets.
To begin with, here are some instances where you would want to use
ggcorrmat
-
{ggplot2}
ggcorrmat()
For the first example, we will use the gapminder
dataset (available in the
eponymous package on CRAN), which
provides values for life expectancy, Gross Domestic Product (GDP) per capita,
and population, every five years, from 1952 to 2007, for each of 142 countries.
It was collected by the Gapminder Foundation. Let's have a look at the data-
#| label = "gapminder"
library(gapminder)
library(dplyr)

dplyr::glimpse(gapminder)
Let's say we are interested in studying the correlation between a country's population, average life expectancy, and GDP per capita, across countries, for the year 2007 only.
The simplest way to get a correlation matrix is to stick to the defaults-
#| label = "ggcorrmat1",
#| fig.height = 6,
#| fig.width = 6
## select data only from the year 2007
gapminder_2007 <- dplyr::filter(gapminder::gapminder, year == 2007)

## producing the correlation matrix
ggcorrmat(
  data = gapminder_2007, ## data from which variable is to be taken
  cor.vars = lifeExp:gdpPercap ## specifying correlation matrix variables
)
This plot can be further modified with additional arguments-
#| label = "ggcorrmat2",
#| fig.height = 6,
#| fig.width = 6
ggcorrmat(
  data = gapminder_2007, ## data from which variable is to be taken
  cor.vars = lifeExp:gdpPercap, ## specifying correlation matrix variables
  cor.vars.names = c(
    "Life Expectancy",
    "population",
    "GDP (per capita)"
  ),
  type = "np", ## which correlation coefficient is to be computed
  lab.col = "red", ## label color
  ggtheme = ggplot2::theme_light(), ## selected ggplot2 theme
  matrix.type = "lower", ## correlation matrix structure
  colors = NULL, ## turning off manual specification of colors
  palette = "category10_d3", ## choosing a color palette
  package = "ggsci", ## package to which color palette belongs
  title = "Gapminder correlation matrix", ## custom title
  subtitle = "Source: Gapminder Foundation" ## custom subtitle
)
As seen from this correlation matrix, although there is no relationship between population and life expectancy worldwide, at least in 2007, there is a strong positive relationship between life expectancy and GDP per capita, a well-established indicator of a country's economic performance.
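To read these relationships as numbers rather than colors, the same coefficients can be computed without plotting. The following is a minimal sketch in base R, assuming that `type = "np"` in the plot above corresponds to Spearman's rho; the object name `rho` is only illustrative.

```r
library(gapminder)

## same subset as in the plots above: only the year 2007
gapminder_2007 <- subset(gapminder::gapminder, year == 2007)

## Spearman's rho for the three numeric variables (base R, no plotting)
rho <- cor(
  gapminder_2007[, c("lifeExp", "pop", "gdpPercap")],
  method = "spearman"
)
round(rho, 2)

## p-value for a single pair, here population vs. life expectancy
## (exact = FALSE avoids a warning about ties)
cor.test(
  gapminder_2007$pop,
  gapminder_2007$lifeExp,
  method = "spearman",
  exact = FALSE
)$p.value
```

The coefficient for life expectancy and GDP per capita should be far larger than the one for life expectancy and population, in line with the plot.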
Given that there were only three variables, this doesn't look that impressive.
So let's work with another example from the {ggplot2}
package: the diamonds
dataset. This dataset
contains the prices and other attributes of almost 54,000 diamonds.
Let's have a look at the data-
#| label = "diamonds"
library(ggplot2)

dplyr::glimpse(ggplot2::diamonds)
Let's see the correlation matrix between different attributes of the diamond and the price.
#| label = "ggcorrmat3",
#| fig.height = 7,
#| fig.width = 7
## let's use just 5% of the data to speed it up
ggcorrmat(
  data = dplyr::sample_frac(ggplot2::diamonds, size = 0.05),
  cor.vars = c(carat, depth:z), ## note how the variables are getting selected
  cor.vars.names = c(
    "carat",
    "total depth",
    "table",
    "price",
    "length (in mm)",
    "width (in mm)",
    "depth (in mm)"
  ),
  ggcorrplot.args = list(outline.color = "black", hc.order = TRUE)
)
We can make a number of changes to this basic correlation matrix. For example,
since we are interested in the relationship between price and other attributes,
let's make price
the first column.
#| label = "ggcorrmat4",
#| fig.height = 7,
#| fig.width = 7
## let's use just 5% of the data to speed it up
ggcorrmat(
  data = dplyr::sample_frac(ggplot2::diamonds, size = 0.05),
  cor.vars = c(price, carat, depth:table, x:z), ## note how the variables are getting selected
  cor.vars.names = c(
    "price",
    "carat",
    "total depth",
    "table",
    "length (in mm)",
    "width (in mm)",
    "depth (in mm)"
  ),
  type = "np",
  title = "Relationship between diamond attributes and price",
  subtitle = "Dataset: Diamonds from ggplot2 package",
  colors = c("#0072B2", "#D55E00", "#CC79A7"),
  pch = "square cross",
  ## additional aesthetic arguments passed to `ggcorrplot::ggcorrplot()`
  ggcorrplot.args = list(
    lab_col = "yellow",
    lab_size = 6,
    tl.srt = 90,
    pch.col = "white",
    pch.cex = 14
  )
) +
  ## modification outside `{ggstatsplot}` using `{ggplot2}` functions
  ggplot2::theme(
    axis.text.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 0.15, r = 0.15, b = 0.15, l = 0.15, unit = "cm")
    )
  )
As seen here, and unsurprisingly, the strongest predictor of the diamond price is its carat value (a carat is a unit of mass equal to 200 mg). In other words, the heavier the diamond, the more expensive it is going to be.
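The two diamonds examples above call dplyr::sample_frac() without a seed, so each run draws a different 5% of the rows and the plotted coefficients shift slightly between runs. A minimal sketch of one way to make the subsample reproducible is shown below; the seed value and the object names are arbitrary choices and not part of the original examples.

```r
library(ggplot2) # provides the diamonds dataset

## fix the random seed so the same rows are drawn on every run
## (the value 123 is arbitrary)
set.seed(123)
diamonds_small <- dplyr::sample_frac(ggplot2::diamonds, size = 0.05)

## drawing again with the same seed returns the identical subsample
set.seed(123)
diamonds_again <- dplyr::sample_frac(ggplot2::diamonds, size = 0.05)

identical(diamonds_small, diamonds_again)
```

The stored subsample can then be passed as `data = diamonds_small` to ggcorrmat() in place of the inline call.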
grouped_ggcorrmat
What if we want to do the same analysis separately for each quality of the
diamond cut
(Fair, Good, Very Good, Premium, Ideal)?
{ggstatsplot}
provides a special helper function for such instances:
grouped_ggcorrmat(). This is merely a wrapper function around
combine_plots(). It applies ggcorrmat()
across all levels of
a specified grouping variable and then combines the list of individual plots
into a single plot.
#| label = "ggcorrmat7",
#| fig.height = 16,
#| fig.width = 10
grouped_ggcorrmat(
  ## arguments relevant for `ggcorrmat()`
  data = ggplot2::diamonds,
  cor.vars = c(price, carat, depth),
  grouping.var = cut,
  ## arguments relevant for `combine_plots()`
  plotgrid.args = list(nrow = 3),
  annotation.args = list(
    tag_levels = "a",
    title = "Relationship between diamond attributes and price across cut",
    caption = "Dataset: Diamonds from ggplot2 package"
  )
)
Note that this function also makes it easy to run the same correlation matrix across different levels of a factor/grouping variable.
If you want a data frame containing the (grouped) correlation matrix, use
correlation::correlation()
instead. It can also do grouped analysis when
used with output from dplyr::group_by().
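As a rough illustration of the grouped data frame route, the sketch below passes a grouped subset of the diamonds data to correlation::correlation(). The choice of variables mirrors the grouped plot above; the object name `grouped_cors` is only illustrative, and the default settings of correlation() (such as its p-value adjustment) are left untouched.

```r
library(dplyr)

## correlations among price, carat, and depth, computed within each cut
grouped_cors <- ggplot2::diamonds %>%
  dplyr::select(cut, price, carat, depth) %>%
  dplyr::group_by(cut) %>%
  correlation::correlation()

grouped_cors
```

The result is a long-format data frame with one row per variable pair within each level of the grouping variable, which is convenient for further filtering or reporting.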
ggcorrmat()
+ {purrr}
Although the grouped_
function is good for quickly exploring the data, it reduces
the flexibility with which this function can be used. This is because the
common parameters used are applied to plots corresponding to all levels of the
grouping variable and there is no way to customize the arguments for different
levels of the grouping variable. We will see how this can be done using the
{purrr}
package.
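A minimal sketch of the idea, under the assumption that a per-level title and correlation type are the settings to vary: split the data by the grouping variable, iterate over the pieces with purrr::pmap(), and combine the resulting plots with combine_plots(). The titles, the alternation of correlation types, and the object names are illustrative choices only.

```r
library(ggstatsplot)

## one data frame per level of `cut`
diamonds_by_cut <- split(ggplot2::diamonds, ggplot2::diamonds$cut)

## per-level settings (illustrative; each element has one entry per level)
plot_args <- list(
  data = diamonds_by_cut,
  title = paste("Cut:", names(diamonds_by_cut)),
  type = c(
    "parametric", "nonparametric", "parametric",
    "nonparametric", "parametric"
  )
)

## call `ggcorrmat()` once per level, each time with its own arguments;
## `cor.vars` is written inside the function so that the bare column
## names are handled by `ggcorrmat()` itself
plot_list <- purrr::pmap(
  .l = plot_args,
  .f = function(data, title, type) {
    ggcorrmat(
      data = data,
      cor.vars = c(price, carat, depth),
      type = type,
      title = title
    )
  }
)

## combine the individual plots into a single plot
combine_plots(plotlist = plot_list, plotgrid.args = list(nrow = 3))
```

Because every element of `plot_args` is indexed by level, any other ggcorrmat() argument can be customized per level in the same way.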
See the associated vignette here: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/purrr_examples.html
Details about underlying functions used to create graphics and statistical tests carried out can be found in the function documentation: https://indrajeetpatil.github.io/ggstatsplot/reference/gghistostats.html
If you find any bugs or have any suggestions/remarks, please file an issue on
GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues