library("lessR")
The Regression()
function performs multiple aspects of a complete regression analysis. Abbreviate with reg()
. To illustrate, first read the Employee data included as part of lessR. Read into the default lessR data frame d.
d <- Read("Employee")
As an option, also read the table of variable labels. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need to be entered into the table. The table can be a csv
file or an Excel file.
Read the label file into the l data frame, currently the only permitted name. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label, as shown in the display of the label file.
l <- rd("Employee_lbl") l
The brief version provides just the basic analysis, what Excel provides, plus a scatterplot with the regression line, which becomes a scatterplot matrix with multiple regression. Because d is the default name of the data frame that contains the variables for analysis, the data
parameter that names the input data frame need not be specified. Here, specify Salary as the target or response variable with features, or predictor variables, Years and Pre.
reg_brief(Salary ~ Years + Pre)
The full output is extensive: Summary of the analysis, estimated model, fit indices, ANOVA, correlation matrix, collinearity analysis, best subset regression, residuals and influence statistics, and prediction intervals. The motivation is to provide virtually all of the information needed for a proper regression analysis.
reg(Salary ~ Years + Pre)
Request a briefer output with the reg_brief()
version of the function. Standardize the predictor variables in the model by setting the new_scale
parameter to "z"
. Plot the residuals as a line connecting each data point to the corresponding point on the regression line as specified with the plot_errors
parameter. To also standardize the response variable, set parameter scale_response
to TRUE
.
reg_brief(Salary ~ Years, new_scale="z", plot_errors=TRUE)
Specify a cross-validation with the kfold
parameter. Here, specify three folds. The function automatically creates the training and testing data sets.
reg(Salary ~ Years, kfold=3)
The standard output also includes $R^2_{press}$, the value of $R^2$ when applied to new, previously unseen data, a value comparable to the average $R^2$ on test data.
The output of Regression()
can be stored into an R object, here named r. The output object consists of various components that together define the output of a comprehensive regression analysis. R refers to the resulting output structure as a list object.
r <- reg(Salary ~ Years + Pre)
Entering the name of the object displays the full output, the default output when the output is directed to the R console instead of saving into an R object.
r
Or, work with the components individually. Use the base R names()
function to identify all of the output components. Component names that begin with out_
are part of the standard output. Other components include just data and statistics designed to be input in additional procedures, including R markdown documents.
names(r)
Here, only display the estimates and their inferential analysis as part of the standard text output.
r$out_estimates
Here, display the numeric values of the coefficients.
r$coefficients
An analysis of hundreds or thousands of rows of data can make it difficult to locate a specific prediction interval of interest. To initiate a search for a specific row, first do the regression and request all prediction intervals with parameter pred_rows
. Then convert that output to a data frame named dp with base R read.table()
. As a data frame, do a standard search for an individual row for a specific prediction interval (see the Subset a Data Frame vignette for directions to subset).
This particular conversion to a data frame requires one more step. One or more spaces in the out_predict
output delimit adjacent columns, but the names in this data set are formatted with a comma followed by a space. Use base R sub()
to remove the space after the comma before converting to a data frame.
r <- reg(Salary ~ Years, pred_rows="all", graphics=FALSE) r$out_predict = sub(", ", ",", r$out_predict, fixed=TRUE) dp <- read.table(text=r$out_predict) dp[.(row.names(dp) == "Pham,Scott"),]
Because reg()
accomplishes its computations with base R function lm()
, lm()
parameters can be passed to reg()
, which then passes the values to lm()
to define the corresponding indicator variables. Here, first use base R function contr.sum()
to calculate an effect coding contrast matrix for a categorical variable with three levels, such as the variable Plan in the Employee data set.
cnt <- contr.sum(n=3) cnt
Now use the lm()
parameter contrasts
to define the effect coding for JobSat, passed to reg_brief()
. Contrasts only apply to factors, so convert JobSat to an R factor before the regression analysis, a task that should generally be done for all categorical variables in an analysis. Here, designate the order of the levels on output displays such as a bar graph.
d$JobSat <- factor(d$JobSat, levels=c("low", "med", "high")) reg_brief(Salary ~ JobSat, contrasts=list(JobSat=cnt))
The $R^2$ fit statistic compares the sum of the squared errors of the model with the X predictor variables to the sum of squared errors of the null model. The baseline of comparison, the null model, is a model with no X variables such that the fitted value for each set of X values is the mean of response variable $y$. The corresponding slope intercept is the mean of $y$, and the standard deviation of the residuals is the standard deviation of $y$.
The following submits the null model for Salary, and plots the errors. Compare the variability of the residuals to a regression model of Salary with one or more predictor variables. To the extent that the inclusion of one or more predictor variables in the model reduces the variability of the data about the regression line compared to the null model, the model fits the data.
reg_brief(Salary ~ 1, plot_errors=TRUE)
Can also get the null model plot from the lessR function Plot()
with the fit
parameter set to "null"
.
The scatterplot is displayed as a bubble plot when both variables consist of less than 10 unique integer values. With the bubble plot, there is no overprinting of the same point so that the number of values that represent a point is displayed.
dd <- Read("Mach4") reg_brief(m10 ~ m02, data=dd)
Obtain an ANCOVA by entering categorical and continuous variables as predictor variables. For a single categorical variable and a single continuous variable, Regression()
displays the regression line for each level of the categorical variable.
The ANCOVA assumes that the slopes for the different levels of the categorical variable are the same for the pairing of the continuous predictor variable and continuous response variable. Visually evaluate this assumption by plotting each separate slope and scatterplot.
Plot(Salary, Years, by=Dept, fit="lm")
Then, if the slopes are not too dissimilar, run the ANCOVA. The categorical variable must be interpretable as a categorical variable, either as an R variable type factor
or as a non-numerical type character
string. If the categorical variable is coded numerically, convert to a factor
, such as d$CatVar <- factor(d$CatVar)
which retains the original numerical values as the value labels.
The ANCOVA displays the appropriate Type II Sum of Squares in its ANOVA table for properly evaluating the group effect that corresponds to the entered categorical variable. Note that this SS is only displayed for an ANOVA with a single categorical variable and a single covariate
reg_brief(Salary ~ Dept + Years)
To do a moderation analysis, specify one of (only) two predictor variables with the parameter mod
.
reg_brief(Salary ~ Years + Pre, mod=Pre)
In this analysis, Pre is not a moderator of the impact of Years on Salary. There is a tendency expressed by the non-parallel lines in the visualization, and an almost significant interaction, but the interaction was not detected at the $\alpha=0.5$ level.
For a model with a binary response variable, $y$, specify multiple logistic regression with the usual R formula syntax applied to the lessR function Logit()
. The output includes the confusion matrix and various classification fit indices.
d <- Read("BodyMeas") Logit(Gender ~ Hand)
Specify additional probability thresholds for classification beyond just the default 0.5 with the prob_cut
parameter.
Logit(Gender ~ Hand, prob_cut=c(.3, .5, .7))
Categorize Hand size into six bins. Compute the conditional mean of Gender, scored as 0 and 1, at each level of Hand size. Both variables must be numeric. The visualization approximates the form of the sigmoid function from logistic regression. The point (bubble) size depends on the sample size for the corresponding bin.
d$Gender <- ifelse (d$Gender == "M", 1, 0) Plot(Hand, Gender, n_bins=6)
The parameter Rmd
creates an R markdown file that is automatically generated and then the corresponding html document from knitting the various output components together with full interpretation. A new, much more complete form of computer output.
Not run here.
reg(Salary ~ Years + Pre, Rmd="eg")
Use the base R help()
function to view the full manual for Regression()
. Simply enter a question mark followed by the name of the function, or its abbreviation.
?reg
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.