Regression Analysis

library("lessR")

The Regression() function performs multiple aspects of a complete regression analysis. Abbreviate with reg(). To illustrate, first read the Employee data included as part of lessR. Read into the default lessR data frame d.

d <- Read("Employee")

As an option, also read the table of variable labels. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need to be entered into the table. The table can be a csv file or an Excel file.

Read the label file into the l data frame, currently the only permitted name. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label, as shown in the display of the label file.

l <- rd("Employee_lbl")
l

Default Analysis

Brief Output

The brief version provides just the basic analysis, what Excel provides, plus a scatterplot with the regression line, which becomes a scatterplot matrix with multiple regression. Because d is the default name of the data frame that contains the variables for analysis, the data parameter that names the input data frame need not be specified. Here, specify Salary as the target or response variable with features, or predictor variables, Years and Pre.

reg_brief(Salary ~ Years + Pre)

Full Output

The full output is extensive: Summary of the analysis, estimated model, fit indices, ANOVA, correlation matrix, collinearity analysis, best subset regression, residuals and influence statistics, and prediction intervals. The motivation is to provide virtually all of the information needed for a proper regression analysis.

reg(Salary ~ Years + Pre)

Standardize the Variables

Request a briefer output with the reg_brief() version of the function. Standardize the predictor variables in the model by setting the new_scale parameter to "z". Plot the residuals as a line connecting each data point to the corresponding point on the regression line as specified with the plot_errors parameter. To also standardize the response variable, set parameter scale_response to TRUE.

reg_brief(Salary ~ Years, new_scale="z", plot_errors=TRUE)

k-fold Cross-validation

Specify a cross-validation with the kfold parameter. Here, specify three folds. The function automatically creates the training and testing data sets.

reg(Salary ~ Years, kfold=3)

The standard output also includes $R^2_{press}$, the value of $R^2$ when applied to new, previously unseen data, a value comparable to the average $R^2$ on test data.

Output as a Stored Object

The output of Regression() can be stored into an R object, here named r. The output object consists of various components that together define the output of a comprehensive regression analysis. R refers to the resulting output structure as a list object.

r <- reg(Salary ~ Years + Pre)

Entering the name of the object displays the full output, the default output when the output is directed to the R console instead of saving into an R object.

r

Or, work with the components individually. Use the base R names() function to identify all of the output components. Component names that begin with out_ are part of the standard output. Other components include just data and statistics designed to be input in additional procedures, including R markdown documents.

names(r)

Here, only display the estimates and their inferential analysis as part of the standard text output.

r$out_estimates

Here, display the numeric values of the coefficients.

r$coefficients

An analysis of hundreds or thousands of rows of data can make it difficult to locate a specific prediction interval of interest. To initiate a search for a specific row, first do the regression and request all prediction intervals with parameter pred_rows. Then convert that output to a data frame named dp with base R read.table(). As a data frame, do a standard search for an individual row for a specific prediction interval (see the Subset a Data Frame vignette for directions to subset).

This particular conversion to a data frame requires one more step. One or more spaces in the out_predict output delimit adjacent columns, but the names in this data set are formatted with a comma followed by a space. Use base R sub() to remove the space after the comma before converting to a data frame.

r <- reg(Salary ~ Years, pred_rows="all", graphics=FALSE)
r$out_predict = sub(", ", ",", r$out_predict, fixed=TRUE)
dp <- read.table(text=r$out_predict)
dp[.(row.names(dp) == "Pham,Scott"),]

Contrasts

Because reg() accomplishes its computations with base R function lm(), lm() parameters can be passed to reg(), which then passes the values to lm() to define the corresponding indicator variables. Here, first use base R function contr.sum() to calculate an effect coding contrast matrix for a categorical variable with three levels, such as the variable Plan in the Employee data set.

cnt <- contr.sum(n=3)
cnt

Now use the lm() parameter contrasts to define the effect coding for JobSat, passed to reg_brief(). Contrasts only apply to factors, so convert JobSat to an R factor before the regression analysis, a task that should generally be done for all categorical variables in an analysis. Here, designate the order of the levels on output displays such as a bar graph.

d$JobSat <- factor(d$JobSat, levels=c("low", "med", "high"))
reg_brief(Salary ~ JobSat, contrasts=list(JobSat=cnt))

Null Model

The $R^2$ fit statistic compares the sum of the squared errors of the model with the X predictor variables to the sum of squared errors of the null model. The baseline of comparison, the null model, is a model with no X variables such that the fitted value for each set of X values is the mean of response variable $y$. The corresponding slope intercept is the mean of $y$, and the standard deviation of the residuals is the standard deviation of $y$.

The following submits the null model for Salary, and plots the errors. Compare the variability of the residuals to a regression model of Salary with one or more predictor variables. To the extent that the inclusion of one or more predictor variables in the model reduces the variability of the data about the regression line compared to the null model, the model fits the data.

reg_brief(Salary ~ 1, plot_errors=TRUE)

Can also get the null model plot from the lessR function Plot() with the fit parameter set to "null".

Likert Type Data

The scatterplot is displayed as a bubble plot when both variables consist of less than 10 unique integer values. With the bubble plot, there is no overprinting of the same point so that the number of values that represent a point is displayed.

dd <- Read("Mach4")
reg_brief(m10 ~ m02, data=dd)

Analysis of Covariance

Obtain an ANCOVA by entering categorical and continuous variables as predictor variables. For a single categorical variable and a single continuous variable, Regression() displays the regression line for each level of the categorical variable.

The ANCOVA assumes that the slopes for the different levels of the categorical variable are the same for the pairing of the continuous predictor variable and continuous response variable. Visually evaluate this assumption by plotting each separate slope and scatterplot.

Plot(Salary, Years, by=Dept, fit="lm")

Then, if the slopes are not too dissimilar, run the ANCOVA. The categorical variable must be interpretable as a categorical variable, either as an R variable type factor or as a non-numerical type character string. If the categorical variable is coded numerically, convert to a factor, such as d$CatVar <- factor(d$CatVar) which retains the original numerical values as the value labels.

The ANCOVA displays the appropriate Type II Sum of Squares in its ANOVA table for properly evaluating the group effect that corresponds to the entered categorical variable. Note that this SS is only displayed for an ANOVA with a single categorical variable and a single covariate

reg_brief(Salary ~ Dept + Years)

Moderation Analysis

To do a moderation analysis, specify one of (only) two predictor variables with the parameter mod.

reg_brief(Salary ~ Years + Pre, mod=Pre)

In this analysis, Pre is not a moderator of the impact of Years on Salary. There is a tendency expressed by the non-parallel lines in the visualization, and an almost significant interaction, but the interaction was not detected at the $\alpha=0.5$ level.

Logistic Regression

For a model with a binary response variable, $y$, specify multiple logistic regression with the usual R formula syntax applied to the lessR function Logit(). The output includes the confusion matrix and various classification fit indices.

Default Analysis

d <- Read("BodyMeas")
Logit(Gender ~ Hand)

Change Classification Threshold

Specify additional probability thresholds for classification beyond just the default 0.5 with the prob_cut parameter.

Logit(Gender ~ Hand, prob_cut=c(.3, .5, .7))

Plot Conditional Means across Bins

Categorize Hand size into six bins. Compute the conditional mean of Gender, scored as 0 and 1, at each level of Hand size. Both variables must be numeric. The visualization approximates the form of the sigmoid function from logistic regression. The point (bubble) size depends on the sample size for the corresponding bin.

d$Gender <- ifelse (d$Gender == "M", 1, 0)
Plot(Hand, Gender, n_bins=6)

Interpreted Output

The parameter Rmd creates an R markdown file that is automatically generated and then the corresponding html document from knitting the various output components together with full interpretation. A new, much more complete form of computer output.

Not run here.

reg(Salary ~ Years + Pre, Rmd="eg")

Full Manual

Use the base R help() function to view the full manual for Regression(). Simply enter a question mark followed by the name of the function, or its abbreviation.

?reg


Try the lessR package in your browser

Any scripts or data that you put into this service are public.

lessR documentation built on Nov. 12, 2023, 1:08 a.m.