Scatter, Box, and Violin Plots

knitr::opts_chunk$set(fig.width=4.5, fig.height=4)
suppressPackageStartupMessages(library("lessR"))

lessR provides many versions of a scatter plot with its Plot() function for one or two variables with an option to provide a separate scatterplot for each level of one or two categorical variables. Access all scatterplots with the same simple syntax. The first variable listed without a parameter name, the x parameter, is plotted along the x-axis. Any second variable listed without a parameter name, the y parameter, is plotted along the y-axis. Each parameter may be represented by a continuous or categorical variable, a single variable or a vector of variables.

The Data

Illustrate with the Employee data included as part of lessR.

d <- Read("Employee")

As an option, lessR also supports variable labels. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need to be entered into the table. The table can be stored as either a csv file or an Excel file.
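As a rough sketch of the two-column layout described above, the following base R code writes a small label table to a temporary csv file and reads it back. The file name my_labels.csv, the label wording, and the choice to omit a header row are assumptions made for illustration, not requirements confirmed here.

```r
# Hypothetical labels for two of the Employee variables (wording is illustrative)
labels <- data.frame(
  name  = c("Years", "Salary"),
  label = c("Years employed (illustrative label)",
            "Annual salary (illustrative label)"),
  stringsAsFactors = FALSE
)

# Write a two-column csv file: variable name first, then its label
# (assumption: no header row)
lbl_file <- file.path(tempdir(), "my_labels.csv")
write.table(labels, lbl_file, sep = ",", row.names = FALSE, col.names = FALSE)

# Read the file back to confirm the two-column layout
check <- read.csv(lbl_file, header = FALSE, stringsAsFactors = FALSE)
print(check)
```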

Read the variable label file into the l data frame, currently the only permissible name for the label file.

l <- rd("Employee_lbl")

Display the available labels.

l

Continuous Variables

Two Variables

A typical scatterplot visualizes the relationship of two continuous variables, here Years worked at a company, and annual Salary. Following is the function call to Plot() for the default visualization.

Because d is the default name of the data frame that contains the variables for analysis, the data parameter that names the input data frame need not be specified. That is, no need to specify data=d, though this parameter can be explicitly included in the function call if desired.

Plot(Years, Salary)

Enhance the default scatterplot with parameter enhance. The visualization includes the mean of each variable indicated by the respective line through the scatterplot, the 95% confidence ellipse, labeled outliers, least-squares regression line with 95% confidence interval, and the corresponding regression line with the outliers removed.

Plot(Years, Salary, enhance=TRUE)

A variety of fit lines can be plotted with the fit parameter. The available values are "loess" for a general non-linear fit, "lm" for linear least squares, "null" for the null (flat line) model, "exp" for exponential growth and decay, "quad" for the quadratic model, and "power" for a general power beyond 2. Setting fit to TRUE plots the "loess" line. With the value of "power", specify the value of the root with parameter fit_power.

Here, plot the general non-linear fit. For emphasis set plot_errors to TRUE to plot the residuals from the line. The sum of the squared errors is displayed to facilitate the comparison of different models.

Plot(Years, Salary, fit="loess", plot_errors=TRUE)

Next, plot the exponential fit and show the residuals from the exponential curve. These data are approximately linear so the exponential curve does not vary far from a straight line. The function displays the corresponding sum of squared errors to assist in comparing various models to each other.

Plot(Years, Salary, fit="exp", plot_errors=TRUE)
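To make the comparison of sums of squared errors concrete, here is a minimal base R sketch that computes the quantity for a linear and an exponential fit. It is not lessR's internal code: the simulated yrs and sal values are hypothetical stand-ins for Years and Salary, and fitting the exponential curve by regressing log(y) on x is an assumption.

```r
# Simulated, roughly linear data (hypothetical stand-ins for Years and Salary)
set.seed(1)
yrs <- runif(40, min = 1, max = 20)
sal <- 40000 + 2500 * yrs + rnorm(40, mean = 0, sd = 8000)

# Sum of squared errors between observed and fitted values
sse <- function(observed, fitted_values) sum((observed - fitted_values)^2)

# Linear least-squares fit
fit_lin <- lm(sal ~ yrs)
sse_lin <- sse(sal, fitted(fit_lin))

# Exponential fit (assumption: linear regression of log(y) on x,
# then back-transformed to the original scale)
fit_exp <- lm(log(sal) ~ yrs)
sse_exp <- sse(sal, exp(fitted(fit_exp)))

# Smaller values indicate a closer fit to these data
print(c(linear = sse_lin, exponential = sse_exp))
```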

The power setting for the fit transforms the y variable to the specified power, from the default of 1, before doing the regression analysis. The availability of this parameter provides for a wide range of modifications to the underlying functional form of the fit curve.

Three Variables

Map a continuous variable, such as Pre, to the plotted points with the size parameter, a bubble plot.

Plot(Years, Salary, size=Pre)

Indicate multiple variables to plot along either axis with a vector defined according to the base R function c(). Plot the linear model for each variable according to the fit parameter set to "lm". By default, when multiple lines are plotted on the same panel, the confidence interval is turned off by internally setting the parameter fit_se to 0. Explicitly override this parameter value as needed.

Plot(c(Pre, Post), Salary, fit="lm", fit_se=0)

Scatterplot Matrix

Multiple variables for the first parameter value, x, and no values for y, plot as a scatterplot matrix. Pass a single vector, such as defined by c(). Request the non-linear fit line and corresponding confidence interval by specifying TRUE or "loess" for the fit parameter. Request a linear fit line with the value of "lm".

Plot(c(Salary, Years, Pre, Post), fit="lm")

Smoothed and Binned Scatterplots

Smoothing and binning are two procedures for visualizing a relationship with many data values.

To obtain a larger data set for this example, generate random data with base R rnorm(), then plot. Plot() first checks for the specified variables in the global environment (workspace). If they are not there, it looks for them in a data frame, for which the default name is d. Here, randomly generate values from normal populations for x and y in the workspace.

set.seed(13)
x <- rnorm(4000)
y <- 8*x + rnorm(4000, 1, 30)
Plot(x, y)

With large data sets, even for continuous variables there can be much over-plotting of points. One strategy to address this issue smooths the scatterplot by turning on the smooth parameter. The individual points superimposed on the smoothed plot are potential outliers. The default number of plotted outliers is 100. Turn off the plotting of outliers completely by setting parameter smooth_points to 0. Show the linear trend with fit set to "lm".

Plot(x, y, smooth=TRUE, fit="lm")

Another strategy for alleviating over-plotting makes the fill color mostly transparent with the transparency parameter, or turns the fill off completely by setting fill to "off". The closer the value of transparency is to 1, the more transparent the fill.

Plot(x, y, transparency=0.95)
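For readers who think in terms of alpha (opacity), the following base R sketch shows one way to read a transparency value. Treating alpha as 1 minus transparency is an assumption that is consistent with the description above, not a statement about lessR's internals, and the color "steelblue" is an arbitrary choice.

```r
# A transparency of 0.95 read as an opacity (alpha) of 0.05
# (assumption: alpha = 1 - transparency)
transparency <- 0.95
alpha <- 1 - transparency

# Base R color with the alpha channel applied, returned as "#RRGGBBAA"
fill_col <- adjustcolor("steelblue", alpha.f = alpha)
print(fill_col)
```

A color built this way can be passed as the bg argument of the base R points() function with pch=21 to draw mostly transparent filled circles.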

Another way to visualize a relationship when there are many data points is to bin the x-axis. Specify the number of bins with parameter n_bins. Plot() then computes the mean of y for each bin and connects the means by line segments. This procedure plots the conditional means by default without any assumption of form such as linearity. Specify the stat parameter for median to compute the median of y for each bin. The standard Plot() parameters fill, color, size and segments also apply.

Plot(x, y, n_bins=5)
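The binning idea can be sketched in base R as follows, using the same simulated x and y. This is an illustration of conditional means and medians, not the code that Plot() runs. Equal-width bins from cut() are an assumption, and the name bin_summary is introduced only for this example.

```r
# Same simulated data as above
set.seed(13)
x <- rnorm(4000)
y <- 8 * x + rnorm(4000, 1, 30)

# Divide the range of x into 5 bins (assumption: equal-width bins)
bins <- cut(x, breaks = 5)

# Summarize y within each bin: the conditional mean and median
bin_summary <- data.frame(
  n        = as.vector(table(bins)),
  x_mean   = as.vector(tapply(x, bins, mean)),
  y_mean   = as.vector(tapply(y, bins, mean)),
  y_median = as.vector(tapply(y, bins, median)),
  row.names = levels(bins)
)
print(bin_summary)
```

Connecting the y_mean values in order of x_mean gives the kind of line-segment display described above, with no linearity assumed.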

One Variable

The default plot for a single continuous variable includes not only the scatterplot, but also the superimposed violin plot and box plot, with outliers identified. Call this plot the VBS plot.

Plot(Salary)

Control the choice of the three superimposed plots -- violin, box, and scatter -- with the vbs_plot parameter. The default setting is "vbs" for all three plots. Here, for example, obtain just the box plot. Or, use the alias BoxPlot() in place of Plot().

Plot(Salary, vbs_plot="b")
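As a small illustration of what a box plot summarizes, the base R function boxplot.stats() returns the five summary values and any flagged outliers. The salary values below are hypothetical, and the 1.5 times IQR rule is base R's default, which may differ from the outlier rule that Plot() applies.

```r
# Hypothetical salaries in thousands, with one extreme value
salary <- c(40, 45, 50, 52, 55, 58, 60, 150)

bs <- boxplot.stats(salary)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# Values beyond 1.5 times the IQR from the hinges (base R default rule)
print(bs$out)
```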

Cleveland Dot Plot

Create a Cleveland dot plot when one of the variables has unique (ID) values. In this example, for a single variable, row names are on the y-axis. The default plot sorts by the value plotted, with the default value of parameter sort_yx of "+" for an ascending plot. Set to "-" for a descending plot and "0" for no sorting.

Plot(Salary, row_names)

The standard scatterplot version of a Cleveland dot plot follows, with no sorting and no line segments.

Plot(Salary, row_names, sort_yx="0", segments_y=FALSE)

This Cleveland dot plot has two x-variables, indicated as a standard R vector with the c() function. In this situation, the two points on each row are connected with a line segment. By default the rows are sorted by distance between the successive points.

Plot(c(Pre, Post), row_names)

Categorical and Continuous Variables

A mixture of categorical and continuous variables can be plotted a variety of ways, as illustrated below.

Two Continuous, One Categorical

Plot a scatterplot of two continuous variables for each level of a categorical variable on the same panel with the by parameter. Here, plot Years and Salary each for the two levels of Gender in the data. Colors and geometric plot shapes can distinguish between the plots. For all variables except an ordered factor, the default plots according to the default qualitative color palette, "hues", with the geometric shape of a point.

Plot(Years, Salary, by=Gender)

Change the plot colors with the fill (interior) and color (exterior or edge) parameters. Because there are two levels of the by variable, specify two fill colors and two edge colors each with an R vector defined by the c() function. Also, include the regression line for each group with the fit parameter and increase the size of the plotted points with the size parameter.

Plot(Years, Salary, by=Gender, size=2, fit="lm",
     fill=c("olivedrab3", "gold1"), 
     color=c("darkgreen", "gold4")
)

Change the plotted shapes with the shape parameter. The default value is "circle" with both an exterior color and filled interior, specified with "color" and "fill". The possible values with fillable interiors are "circle", "square", "diamond", "triup" (triangle up), and "tridown" (triangle down). Other possible values include all uppercase and lowercase letters, all digits, and most punctuation characters. The numbers 0 through 25 defined by the R points() function also apply. If plotting levels according to by, then list one shape for each level to be plotted.
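A possible call that assigns one shape to each of the two levels of Gender is sketched below. The name group_shapes is introduced only for this example, and the requireNamespace() guard is there only so that the block does nothing where lessR is not installed.

```r
# One shape per level of the by variable (Gender has two levels)
group_shapes <- c("circle", "diamond")

# Guarded so the sketch runs only where lessR is available
if (requireNamespace("lessR", quietly = TRUE)) {
  library("lessR")
  d <- Read("Employee", quiet = TRUE)
  Plot(Years, Salary, by = Gender, shape = group_shapes)
}
```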

A Trellis (facet) plot creates a separate panel for the plot of each level of the categorical variable. Generate Trellis plots with the by1 parameter. In this example, plot the best-fit linear model for the data in each panel according to the fit parameter. By default, the 95% confidence interval for each line is also displayed.

Plot(Years, Salary, by1=Gender, fit="lm")

Turn off the confidence interval by setting the parameter fit_se to 0 for the value of the confidence level.
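A sketch of the same Trellis plot with the confidence interval turned off follows. The name ci_level is introduced only for this example, and the requireNamespace() guard is there only so that the block does nothing where lessR is not installed.

```r
# A confidence level of 0 turns off the interval around each fit line
ci_level <- 0

# Guarded so the sketch runs only where lessR is available
if (requireNamespace("lessR", quietly = TRUE)) {
  library("lessR")
  d <- Read("Employee", quiet = TRUE)
  Plot(Years, Salary, by1 = Gender, fit = "lm", fit_se = ci_level)
}
```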

One Continuous, One Categorical

A categorical variable plotted with a continuous variable results in a traditional scatterplot though, of course, the scatter is confined to the straight lines that represent the levels of the categorical variable, its values.

The first two parameters of Plot() are x and y. In this example, the categorical variable, Dept, listed second, specifies the y variable, as in y=Dept. The function call is the same for two continuous variables as for one continuous and one categorical variable. The Plot() function evaluates each variable for continuity and responds appropriately.

Plot(Salary, Dept)

To avoid point overlap, if there is at least one duplicated value of continuous y for any level of categorical x, by default some horizontal jitter is added to each plotted point, which was not needed in this example. Manually adjust the jitter with either parameter jitter_x or, if x is continuous and y categorical, the jitter_y parameter. In addition, if the categorical variable is an R factor or a variable of type character, by default the mean of the continuous variable is displayed at each level of the categorical variable, as well as in the text output. If the categorical variable is numeric, it is better to convert the variable to a factor to have just the categories on the axis and not a continuous scale. For example, d$Gender <- factor(d$Gender).
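The jitter rule can be illustrated with a short base R sketch: check each level for duplicated values and, if any are found, shift the positions on the categorical axis by a small random amount. The data values and the jitter amount of 0.1 are hypothetical, and this is not the code that Plot() runs.

```r
# Hypothetical data: two identical salaries within the ACCT level
dept   <- c("ACCT", "ACCT", "SALE", "SALE", "SALE")
salary <- c(50, 50, 60, 61, 62)

# Is at least one value duplicated within any level?
has_dup <- any(tapply(salary, dept, function(v) any(duplicated(v))))

# Position of each point on the categorical axis: 1, 2, ... by level
pos <- as.numeric(factor(dept))

# Add a small random shift only when duplicates are present
set.seed(1)
pos_plot <- if (has_dup) jitter(pos, amount = 0.1) else pos
print(data.frame(dept, salary, pos, pos_plot))
```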

Another helpful technique for large data sets is to add some fill transparency with the transparency parameter, with values such as 0.8 and 0.9. The combination of jitter and transparency allows for plotting many thousands of points.

An alternative display of the distribution of a continuous variable and a categorical variable is a superimposed violin, box, and scatter plot, a VBS plot. To plot the points in different colors according to the level of the categorical variable, invoke the by parameter. Here, plot Salary across the levels of Gender on the same panel.

Plot(Salary, by=Gender)

Or, create a Trellis plot that consists of a VBS plot on a separate panel for each level of the categorical variable. Accomplish the Trellis plot with the by1 parameter. Here, plot Salary across the levels of Gender. Again, specify one, two, or, by default, all three superimposed plots: violin, box, and scatter. In this example, the categorical variable, Gender, specifies the by1 variable.

Plot(Salary, by1=Gender)

The default coloring of the boxes for variables other than an ordered factor follows the default qualitative palette, "hues". For an ordered factor, the fill color follows the default sequential palette of the corresponding theme, such as "blues". Customize colors with the box_fill parameter.

Show just the box plots with the vbs_plot parameter, which has a default setting of "vbs" for the superimposed violin, box, and scatter plots. Set vbs_plot to "b". Or, use the alias BoxPlot(). Change the fill color of each box with the box_fill parameter. In addition to the traditional median for a box plot, show the mean as well with the vbs_mean parameter. If specifying just one fill color, then all boxes are filled with that color.

Or, drop the box plot and only plot the violins and the scatter plots. Without the boxes, the violins take on the default colors. Specify a value of "vs" for the vbs_plot parameter. If only plotting the violins, the alias ViolinPlot() can also be used. Change the fill color of the violins with parameter violin_fill.

Plot(Salary, by1=Gender, vbs_plot="vs")

The following plot uses the alias BoxPlot() to generate a Trellis plot of boxplots only across the levels of Gender.

BoxPlot(Salary, by1=Gender, vbs_mean=TRUE, box_fill="lightgoldenrod")

Show the different distributions of the continuous variable across the levels of the categorical variable with a scatterplot. Here, show the distribution of Salary for Males and Females across the various departments.

Plot(Salary, Dept, by=Gender)

One Continuous, Two Categorical

To do a Trellis plot with two categorical variables, invoke the by2 parameter in addition to the by1 parameter. By default, the box fill colors are unique for each level of the by1 variable, and then the colors cycle through all the values of the by2 variable. With so many panels to plot, explicitly set them in a single column by setting parameter n_col to 1. The corresponding row parameter is n_row.

Plot(Salary, by1=Gender, by2=Dept, n_col=1)

To specify custom colors with the box_fill parameter, specify the number of colors according to the number of levels of the by1 variable. The colors for by1 then cycle over the by2 values.

Alternatively, invoke the by parameter and the by1 parameter. The values of the by1 variable plot as separate panels, the Trellis part, and the levels of the by variable plot within each panel.

Plot(Salary, by1=Dept, by=Gender, n_col=1)

Two Continuous, Three Categorical

Plot() can display the relationships for up to five variables. The two primary variables, x and y, that form the basis of the scatter plot, are continuous. Usually these two variables are listed first in the function call and so do not need their parameter names specified. Indicate two categorical variables that form the Trellis panels with parameters by1 and by2. Call these two variables the Trellis variables, which define a Trellis panel for each combination of their values. Finally, there can be a categorical grouping variable, the by variable, which plots different groups within each Trellis panel.

To illustrate, first, the data. Use the Cars93 data set that is installed with lessR, which describes characteristics of 1993 cars.

d <- Read("Cars93")

The categorical variables are integer coded 0 and 1, so recode to R factors to obtain more descriptive labels.

d$Transmission <- factor(d$Manual, levels=0:1, labels=c("Auto", "Manual"))
d$Source <- factor(d$Source, levels=0:1, labels=c("Foreign", "Domestic"))

Plot MPGcity according to Weight. Specify the number of Cylinders and Manual transmission or not as Trellis conditioning variables to form the Trellis plot. Specify the Source of the vehicle, Foreign or Domestic as a grouping variable to plot with separate colors on each panel.

Plot(Weight, MPGcity, by1=Cylinders, by2=Transmission, by=Source, n_col=1)

From the visualization the patterns emerge. As Weight increases city MPG decreases. Domestic cars tend to weigh more. Foreign cars tend to have fewer cylinders, which also leads to better fuel mileage.

Categorical Variables

To avoid over-plotting, the plot of two categorical variables results in a bubble plot of their joint frequencies.

d <- Read("Employee", quiet=TRUE)
Plot(Dept, Gender)
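The joint frequencies that the bubbles represent can be tabulated directly in base R. The small dept and gender vectors below are hypothetical values, used only to keep the sketch self-contained.

```r
# Hypothetical categorical data
dept   <- c("ACCT", "ACCT", "SALE", "SALE", "SALE", "MKTG")
gender <- c("M", "W", "W", "W", "M", "M")

# Cross-tabulate: one count for each combination of levels
joint <- table(dept, gender)
print(joint)

# The same counts in long form, one row per combination
print(as.data.frame(joint))
```

Each cell count corresponds to one bubble, with larger counts drawn as larger bubbles.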

The parameter radius scales the size of the bubbles according to the size of the largest displayed bubble in inches. The power parameter sets the relative size of the bubbles. The default power value of 0.5 scales the bubbles so that the area of each bubble is proportional to the value of the corresponding sizing variable. A value of 1 scales so that the radius of each bubble is proportional to the value of the sizing variable, increasing the discrepancy in size among the bubbles.

In this example, increase the absolute size of the bubbles as well as the relative discrepancy in their sizes. If the bubbles become too large, so that the largest bubbles become truncated, increase the spacing of the respective axes with the pad_x and/or pad_y parameters.

Plot(Dept, Gender, radius=.6, power=0.8, pad_x=0.05, pad_y=0.05)
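The effect of the power value on relative bubble size can be worked through numerically. The helper bubble_radius() below is hypothetical, written only to illustrate the scaling described above. Its formula and its default radius of 0.25 are assumptions, not lessR's implementation.

```r
# Hypothetical helper: radius of each bubble relative to the largest one
# (assumption: radius * (value / largest value)^power)
bubble_radius <- function(counts, radius = 0.25, power = 0.5) {
  radius * (counts / max(counts))^power
}

counts <- c(2, 8, 18)  # hypothetical joint frequencies

# power = 0.5: bubble area is proportional to the count
r_area <- bubble_radius(counts, power = 0.5)
area_ratio <- (r_area / max(r_area))^2

# power = 1: bubble radius is proportional to the count
r_lin <- bubble_radius(counts, power = 1)

print(rbind(power_0.5 = r_area, power_1 = r_lin))
```

With power set to 1 the smallest bubble shrinks much more relative to the largest, which is the increased discrepancy in size that the power parameter controls.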

Alternatively, plot two categorical variables with a Trellis (facet) chart by invoking the by1 parameter. If the first listed variable in the function call, the x parameter, is categorical, the result is a dot chart for each level of the by1 variable.

Plot(Dept, by1=Gender)

Plotting a single categorical variable yields the corresponding bubble plot of frequencies.

Plot(Dept)

Interactive Plots

An interactive visualization lets the user in real time change parameter values to change characteristics of the visualization. To create an interactive two-variable scatterplot of continuous variables with the employee data that displays the corresponding parameters, run the function interact() with "ScatterPlot" specified.

interact("ScatterPlot")

To create an interactive Trellis plot as a combined violin, box, and scatter plot with the five values of Dept from the Employee data set that displays the corresponding parameters, run the function interact() with "Trellis" specified.

interact("Trellis")

The functions are not run here because interactivity requires running them directly from the R console.

Full Manual

Use the base R help() function to view the full manual for Plot(). Simply enter a question mark followed by the name of the function.

?Plot

More

More on Scatterplots, Time Series plots, and other visualizations from lessR and other packages such as ggplot2 at:

Gerbing, D., R Visualizations: Derive Meaning from Data, CRC Press, May, 2020, ISBN 978-1138599635.




lessR documentation built on Nov. 12, 2023, 1:08 a.m.