plot_r: Draw different scatterplots corresponding to a sample correlation coefficient

View source: R/plot_r.R

plot_r    R Documentation

Draw different scatterplots corresponding to a sample correlation coefficient

Description

This function illustrates the importance of looking at your data before moving to numeric methods. More specifically, it illustrates that a single correlation coefficient can correspond to wildly different patterns in the data. In all, 16 scatterplots with the same sample Pearson correlation coefficient are drawn.

Usage

plot_r(r = 0.6, n = 50, showdata = FALSE, plot = TRUE)

Arguments

r

The desired sample correlation coefficient. This must be a number in the interval [-1, 1), i.e., at least -1 and strictly smaller than 1.

n

The number of data points per scatterplot.

showdata

Do you want to output a data frame containing the plotted data (TRUE) or not (FALSE, default)? You can also output the data for a specific scatterplot by setting this parameter to the corresponding panel number (see the examples).

plot

Do you want to draw the scatterplots (TRUE, default) or not (FALSE)?

Details

(1) The x/y data are drawn from a bivariate normal distribution. Pearson's correlation coefficient can be useful in this situation.

(2) The x data are uniformly distributed between 0 and 1; the y data are scattered normally about the regression line. This isn't too problematic either.

(3) The x data are drawn from a right-skewed distribution; the y data are scattered normally about the regression line. While all y-data follow from the same data-generating mechanism (i.e., outliers reflect the same process as ordinary data points and aren't the result of, say, transcription errors), outlying data points may exert a large effect on the correlation coefficient. (Try exporting the dataset for this plot and recomputing the correlation coefficient without the largest x-value!)
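A minimal sketch of this exercise (the r and n values are arbitrary, and it assumes that the function returns a data frame with columns named x and y when showdata is set to 3; check names() on your own output):

plot_dat <- plot_r(r = 0.6, n = 50, showdata = 3, plot = FALSE)
cor(plot_dat$x, plot_dat$y)                           # correlation based on all points
trimmed <- plot_dat[plot_dat$x != max(plot_dat$x), ]  # drop the largest x-value
cor(trimmed$x, trimmed$y)                             # often differs noticeably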

(4) Same as (3), but this time the x data are drawn from a left-skewed distribution.

(5) The x data are drawn from a normal distribution, but the y values are right-skewed about the regression line. Every now and again, one or a couple of data points will exert a large pull on the correlation coefficient.

(6) Same as (5), except the y values are left-skewed about the regression line.

(7) For increasing x values, the y values are ever more scattered about the regression line. Conceptually, Pearson's correlation coefficient underestimates how well you can anticipate a y-value for small x-values and overestimates how well you can do so for large x-values.
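A toy simulation (not the package's internal code) of this pattern: the residual spread grows with x, so a single correlation coefficient glosses over how predictability differs across the range of x.

set.seed(1)
x <- runif(200)
y <- 2 * x + rnorm(200, sd = 0.2 + 1.5 * x)   # noise increases with x
cor(x, y)
fit <- lm(y ~ x)
# Mean absolute residual for the lower vs. the upper half of the x-range:
tapply(abs(resid(fit)), x > median(x), mean)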

(8) Same as (7), except the scatter about the regression line becomes smaller for larger x-values.

(9) Pearson's correlation coefficient summarises the linear trend in the data; the trend in this panel, however, is quadratic. As a result, Pearson's correlation coefficient will underestimate the strength of the relationship between the two variables.
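A toy illustration (not the package's internal code) of how a strong but nonlinear relationship translates into a modest Pearson correlation:

set.seed(1)
x <- runif(200, -1, 1)
y <- x^2 + rnorm(200, sd = 0.05)
cor(x, y)      # close to zero despite a near-deterministic relationship
cor(x^2, y)    # large once the quadratic form is taken into account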

(10) Similar to (9), but the trend is sinusoidal.

(11) There is a single positive outlier that exerts a pull on the correlation coefficient. In contrast to (3) through (6), this outlier is not caused by the same data-generating mechanism as the rest of the data, so it contaminates the estimated correlation coefficient. The regression line for the data without this outlier is plotted as a dashed red line; you'll often find that the true trend in the data runs counter to the trend suggested when the outlier is included.
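A toy illustration (not the package's internal code) of how a single contaminating point can distort, or even flip the sign of, the correlation:

set.seed(1)
x <- rnorm(50)
y <- -0.3 * x + rnorm(50)
cor(x, y)             # correlation for the ordinary points
x_c <- c(x, 10)
y_c <- c(y, 10)       # add one extreme outlier at (10, 10)
cor(x_c, y_c)         # strongly positive once the outlier is included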

(12) Similar to (11), except the outlier is negative (i.e., it pulls the correlation coefficient down).

(13) The y data are distributed bimodally about the regression line. This suggests that an important categorical predictor wasn't taken into account in the analysis.

(14) A more worrisome version of (13). There are actually two groups in the data, and this wasn't taken into account. Within each of these groups, the trend may often (though not always) run counter to the overall trend. So, for instance, Pearson's correlation may suggest the presence of a positive correlation, but the true trend may actually be negative once the group factor is taken into account. This is known as Simpson's paradox.
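A toy illustration (not the package's internal code) of Simpson's paradox: the overall correlation is positive even though the trend within each group is negative.

set.seed(1)
group <- rep(c(0, 1), each = 100)
x <- rnorm(200) + 3 * group               # group 1 has higher x on average
y <- -0.5 * x + 4 * group + rnorm(200)    # within-group trend is negative
cor(x, y)                                 # overall: positive
by(data.frame(x, y), group, function(d) cor(d$x, d$y))  # within each group: negative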

(15) The data in this panel were sampled from a bivariate normal distribution but the middle part of the distribution was removed. This may be a sensible sampling strategy when you want to investigate the relationship between two variables, one of which is expensive to measure and the other cheap: Screen a large number of observations on the cheap variable, and only collect the expensive variable for the extreme observations. In a regression analysis, this sampling strategy can be highly effective in terms of power and precision. However, it artificially inflates correlation coefficients.
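A toy illustration (not the package's internal code) of how discarding the middle of the x-distribution inflates the correlation:

set.seed(1)
x <- rnorm(1000)
y <- 0.4 * x + rnorm(1000, sd = sqrt(1 - 0.4^2))  # population correlation of about 0.4
cor(x, y)                 # full sample
keep <- abs(x) > 1        # keep only the extreme x-values
cor(x[keep], y[keep])     # noticeably larger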

(16) This is what I suspect lots of datasets actually look like. The x and y data are both categorical. This needn't be too problematic; it's just that when someone mentions "r = 0.4", you probably picture panel (1) rather than panel (16).
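A toy illustration (not the package's internal code): binning both variables into a handful of ordered categories (e.g., Likert-type data) can leave the Pearson correlation largely intact, even though the scatterplot looks nothing like panel (1).

set.seed(1)
x <- rnorm(500)
y <- 0.4 * x + rnorm(500, sd = sqrt(1 - 0.4^2))
cor(x, y)
x_cat <- as.numeric(cut(x, 5))   # five ordered categories
y_cat <- as.numeric(cut(y, 5))
cor(x_cat, y_cat)                # typically similar, if slightly attenuated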

Examples

plot_r(r = -0.30, n = 25)
plot_r(r = -0.30, n = 25, showdata = TRUE)

# Only show the data for the 12th plot
plot_r(r = 0.5, n = 250, showdata = 12)

# Generate data for the 14th plot, don't draw scatterplots
plot_r(r = 0.3, n = 12, showdata = 14, plot = FALSE)
