Regression | R Documentation |
Computes output for seven regression types: linear, binary logistic, ordered logistic, binomial, Poisson, quasi-Poisson and multinomial. Output includes general coefficient estimates and importance analysis estimates, with options for handling missing data and interaction terms.
Regression(
formula = NULL,
data = NULL,
subset = NULL,
weights = NULL,
missing = "Exclude cases with missing data",
type = "Linear",
robust.se = FALSE,
method = "default",
output = "Coefficients",
detail = FALSE,
m = 10,
seed = 12321,
statistical.assumptions,
auxiliary.data = NULL,
show.labels = FALSE,
internal = FALSE,
contrasts = c("contr.treatment", "contr.treatment"),
relative.importance = FALSE,
importance.absolute = FALSE,
interaction = NULL,
correction = "None",
interaction.formula = NULL,
recursive.call = FALSE,
effects.format = list(max.label = 10),
outlier.prop.to.remove = NULL,
stacked.data.check = FALSE,
unstacked.data = NULL,
...
)
formula |
An object of class |
data |
A |
subset |
An optional vector specifying a subset of observations to be
used in the fitting process, or the name of a variable in |
weights |
An optional vector of sampling weights, or the
name of a variable in |
missing |
How missing data is to be treated in the regression. Supplied parameter needs to be one of the following strings:
|
type |
Defaults to |
robust.se |
If |
method |
The method to be used for fitting. This only has an effect if method = "model.frame", which returns the model frame. |
output |
|
detail |
This argument is deprecated. If |
m |
The number of imputed samples, if using multiple imputation. |
seed |
The random number seed used in imputation and residual computations. |
statistical.assumptions |
A Statistical Assumptions object. |
auxiliary.data |
A |
show.labels |
Shows the variable labels, as opposed to the names, in the outputs, where a variable's label is an attribute (e.g., attr(foo, "label")). |
internal |
If |
contrasts |
A vector of the contrasts to be used for |
relative.importance |
Deprecated. To run Relative Importance Analysis, use the output variable. |
importance.absolute |
Whether the absolute value of the relative importance should be shown. |
interaction |
Optional variable to test for interaction with other variables in the model. Output will be a crosstab showing coefficients from both models. |
correction |
Method to correct for multiple comparisons. Can be one of |
interaction.formula |
Used internally for multiple imputation. |
recursive.call |
Used internally to indicate if call is a result of recursion (e.g., multiple imputation). |
effects.format |
A list of items |
outlier.prop.to.remove |
A single numeric value that determines the proportion of data points to remove from the analysis. The data points removed are those with the largest residuals. A value of 0 or NULL means that no points are removed. A value x with 0 < x < 0.5 (exclusive) means that a proportion x of the data points, i.e. between 0% and 50%, is removed. |
stacked.data.check |
Logical value to determine if the Regression should be the data and formula based off
the |
unstacked.data |
A list with two elements that provide the outcome and predictor variables respectively for data that needs to be stacked. See details section for more information. |
... |
Additional arguments to be passed to |
In the case of Ordered Logistic regression, this function computes a proportional odds model using the cumulative link (logistic). In the case of no weights, the polr function is used. In the case of a weighted regression, the svyolr function is used.
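The unweighted case described above can be sketched with a direct call to polr from the MASS package. This is only an illustration on simulated data: the data frame d, its columns and the cut-points are made up for the example, and the arguments Regression actually passes to polr are not documented here. The weighted svyolr path is not shown.

```r
library(MASS)  # provides polr()

# Simulated data (hypothetical): one numeric predictor and a 3-level ordered outcome
set.seed(12321)
n <- 200
x <- rnorm(n)
latent <- 1.5 * x + rlogis(n)  # latent variable with logistic errors
d <- data.frame(
  x = x,
  y = cut(latent, breaks = c(-Inf, -1, 1, Inf),
          labels = c("Low", "Medium", "High"), ordered_result = TRUE)
)

# Proportional odds model with the cumulative logistic link
fit <- polr(y ~ x, data = d, method = "logistic", Hess = TRUE)
print(coef(fit))  # slope on the latent logistic scale
print(fit$zeta)   # estimated cut-points between adjacent categories
```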
"Imputation (replace missing values with estimates)". All selected
outcome and predictor variables are included in the imputation, along with all auxiliary.data, excluding cases that are excluded via subset or have invalid weights, but including cases with missing values of the outcome variable. Then, cases with missing values in the outcome variable are excluded from the analysis (von Hippel 2007). See Imputation.
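The case selection described above can be sketched in base R. The data frame and column names are hypothetical, and treating missing or non-positive weights as "invalid" is an assumption made for the illustration; the imputation step itself is not shown.

```r
# Hypothetical data: outcome y, predictor x, weights w and a subset filter keep
d <- data.frame(
  y    = c(1, NA, 3, 4, NA),
  x    = c(NA, 2, 3, NA, 5),
  w    = c(1, 1, 0, 1, 1),
  keep = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)

# Assumption: a weight is "invalid" if it is missing or not positive
valid.weight <- !is.na(d$w) & d$w > 0

# Cases used for imputation: pass the subset and have valid weights,
# even when the outcome itself is missing
in.imputation <- d$keep & valid.weight

# Cases used for the analysis: imputed cases whose outcome was observed
in.analysis <- in.imputation & !is.na(d$y)

print(which(in.imputation))
print(which(in.analysis))
```

Here case 2 contributes to the imputation model but is dropped before fitting because its outcome is missing.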
Outlier removal is performed by computing residuals for the regression model and removing the cases with the largest residuals from the dataset. The model is then refit on the reduced dataset.
The residuals used in this process depend on the regression type. For an unweighted regression with a numeric response (type is "Linear", "Poisson", "Quasi-Poisson" or "NBD"), the studentised deviance residuals are used (see Davison and Snell (1991) and rstudent). In the weighted case with a numeric response, the Pearson residuals are used (see Davison and Snell (1991) and residuals.glm). For Binary and Ordinal data, in both the unweighted and weighted cases, the surrogate residuals (SURE) are used, via the implementation in the sure R package of Greenwell, McCarthy and Boehmke (2017), which is based on the theoretical paper of Liu and Zhang (2018).
Automated outlier removal is currently unsupported for "Multinomial Logit". A surrogate residual may be used in a future version.
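For the unweighted linear case, the procedure above can be sketched in base R. The simulated data are hypothetical, and two details are assumptions rather than documented behaviour: residuals are ranked by absolute value, and the number of cases removed is rounded down.

```r
# Simulated data (hypothetical) with five contaminated observations
set.seed(12321)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[1:5] <- y[1:5] + 15
d <- data.frame(x = x, y = y)

outlier.prop.to.remove <- 0.05

# Step 1: fit the model and compute studentised residuals
fit <- lm(y ~ x, data = d)
res <- abs(rstudent(fit))  # assumption: "largest" means largest in absolute value

# Step 2: flag the requested proportion of cases with the largest residuals
n.remove <- floor(outlier.prop.to.remove * n)  # assumption: round down
keep <- rank(-res, ties.method = "first") > n.remove

# Step 3: refit on the reduced dataset
refit <- lm(y ~ x, data = d[keep, ])
print(coef(fit))
print(coef(refit))
```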
When stacking with the unstacked.data argument, the function is designed to work best with input created in Q or Displayr, which contains data.frames with a particular structure. If the Q/Displayr data.frames are not available, simple data.frames can be provided. In particular, a list with two elements is required:
Y
A data.frame with m columns that represent the m variables to be stacked.
X
A data.frame where each column represents a column of a design matrix relevant to one of the m variables given in element Y above. So if the overall regression model has p predictors, this data.frame should contain m * p columns. In the absence of Q/Displayr metadata, each column name is a comma-separated pair of the form 'predictor, outcome', where 'predictor' denotes the predictor name in the regression design matrix and 'outcome' denotes the name of the variable in element Y. This format is required to ensure that the columns are appropriately matched and stacked. The function also accepts column names in the reverse order, 'outcome, predictor', so long as there is no ambiguity. In the absence of Q/Displayr metadata, the split is attempted using an assumed single comma separator.
Also, when using Q/Displayr, some columns in the data.frame for the unstacked.data argument will contain data reductions or NETs based on the code frame and the codes assigned to each NET. During the stacking process a NET is removed from analysis unless it is entirely comprised of codes that are not observed elsewhere in the data.frame.
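The 'predictor, outcome' naming convention can be illustrated with a small base R sketch that stacks plain data.frames by hand. The variable names are hypothetical, and the row order and column layout of the stacked result are assumptions; the stacking performed inside Regression may differ.

```r
# Hypothetical input: m = 2 outcome variables and p = 2 predictors
Y <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6))
X <- data.frame(
  `price, A` = c(10, 11, 12), `price, B` = c(20, 21, 22),
  `size, A`  = c(1, 0, 1),    `size, B`  = c(0, 1, 0),
  check.names = FALSE  # keep the comma-separated column names intact
)

# Split each column name on its single comma into predictor and outcome parts
parts <- strsplit(names(X), ",\\s*")
predictor <- vapply(parts, `[`, character(1), 1)
outcome <- vapply(parts, `[`, character(1), 2)

# Stack: one block of rows per outcome variable, one column per predictor
stacked <- do.call(rbind, lapply(names(Y), function(nm) {
  cols <- X[, outcome == nm, drop = FALSE]
  names(cols) <- predictor[outcome == nm]
  cbind(outcome = Y[[nm]], cols)
}))
print(stacked)
```

With m = 2 and p = 2, X has m * p = 4 columns and the stacked data has one outcome column and p predictor columns.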
Generally, a list of class Regression. The exception is when method = 'model.frame' is specified as an input argument. In that case, a data.frame is returned that contains only the data element of the Regression return list.
The Regression return list contains the fitted regression object and other statistical outputs.
These include the elements
robust.se
A logical specifying if robust standard error calculations were performed.
type
A character string specifying the Regression type (matches the input argument)
weights
A numeric vector of weights applied in the regression
output
A character vector specifying the output type specified in the input argument.
Could be the table type or a separate analysis (see input argument for more details).
outlier.prop.to.remove
A numeric value specifying the proportion of outliers removed in the analysis.
show.labels
A logical value specifying if variable labels were used (TRUE) or variable names (FALSE)
test.interaction
A logical value specifying if an interaction test was assessed in the output.
effects.format
A list containing input for the relevant X and Y values for an effects plot output.
original
The initial standard R regression output (possibly refitted with outliers removed)
sample.description
A character string describing the regression and its inputs and outputs for
printing in a footer of the output table.
summary
A summary of the original regression object above, tidied up.
design
The survey design object accompanying any survey weighted regression (computed using the input weights)
subset
The logical vector specifying which observations were filtered into the regression.
n.predictors
An integer specifying the number of predictors (not including the intercept) in the regression.
n.observations
An integer specifying the number of observations used in the regression after outliers are removed.
estimation.data
A data.frame containing the regression design matrix.
This design matrix takes into account the subset and missing data options.
correction
A character string specifying the multiple comparisons correction used (see input arguments)
formula
A formula object for the regression model (before interaction term added)
model
A single data.frame of the input data with both predictors and outcome variable, possibly stacked and including imputed values or interaction term if applicable.
outcome.name
A character string of the outcome variable name as used in the formula.
outcome.label
A character string of the outcome variable label, or the name with any backticks removed if labels are not requested.
terms
A terms object from the original output element.
coef
The computed coefficients from the original output element.
r.squared
The R squared (or equivalent) of the original output element
z.statistics
Computed z-statistics from the set of coefficients in a Multinomial Logit model
p.values
Computed p-values for the z-statistics above
importance
A list of output relevant when the selected output is either "Relative Importance Analysis", "Shapley Regression", "Jaccard Coefficient" or "Correlation". This list has elements
raw.importance
The raw importance scores (regression coefficients, jaccard coefficients or correlations)
importance
The raw importance scores scaled to 0-100
standard.errors
The computed standard errors for the raw importance scores.
statistics
The computed standardised statistics of the raw importance scores
statistic.name
Character showing either the t or z statistic being used.
p.values
The vector of p-values for the relevant statistics computed above.
importance.type
Character string specifying the type of Importance analysis requested
importance.names
Character vector of the names of the predictors in the importance analysis
importance.labels
Character vector of the labels of the predictors in the importance analysis
relative.importance
A copy of the importance output, kept for legacy purposes.
interaction
A list containing the regression analysis with an interaction term. The list has elements
label
Character string of the variable label of the interaction variable.
split.size
Numeric vector of counts of each level of the interaction variable and a total NET count.
pvalue
p-value of overall test of significance of the Regression using a call to stats::anova
original.r2
Either the R squared for linear regression or the proportion of deviance in the model without interaction
full.r2
Either the R squared for linear regression or the proportion of deviance in the model with interaction
fit
Regression model with interaction
net.coef
Vector of regression coefficients or importance.scores
importance
The importance list of the Regression without interaction if applicable, NULL otherwise.
anova.output
The anova output for the Regression output without interaction.
anova.test
Character string of the overall test of significance used (F or Chi-square)
coef.pvalues
Matrix of pvalues for the coefficients or importance scores used at each interaction level
coef.tstat
Matrix of statistics for the coefficients or importance scores used at each interaction level
coefficients
Matrix of coefficients or raw.importance scores used at each interaction level
anova
Essentially the return output of Anova, with relevant metadata added.
This element is only added when the input argument output = 'ANOVA' or 'Effects plot'.
footer
Character string of the footer to appear in the output table
importance.footer
Character string of the footer to appear in the output table of an importance analysis
stacked
Logical element to specify if the data was stacked (TRUE) or not (FALSE)
The Regression list also has a 'ChartData' attribute that is used when exporting to XLS files.
The content of this attribute is a data.frame that gives the equivalent information and structure of the formattable table output htmlwidget.
Davison, A. C. and Snell, E. J. (1991) Residuals and diagnostics. In: Statistical Theory and Modelling. In Honour of Sir David Cox, FRS, eds. Hinkley, D. V., Reid, N. and Snell, E. J., Chapman & Hall.
Greenwell, B., McCarthy, A. and Boehmke, B. (2017). sure: Surrogate Residuals for Ordinal and General Regression Models. R package version 0.2.0. https://CRAN.R-project.org/package=sure
Grömping, U. (2007). "Estimators of Relative Importance in Linear Regression Based on Variance Decomposition", The American Statistician, 61, 139-147.
von Hippel, Paul T. 2007. "Regression With Missing Y's: An Improved Strategy for Analyzing Multiply Imputed Data." Sociological Methodology 37:83-117.
Johnson, J.W. (2000). "A Heuristic Method for Estimating the Relative Weight", Multivariate Behavioral Research, 35:1-19.
Long, J. S. and Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54(3): 217-224.
Liu, D. and Zhang, H. (2018). Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach. Journal of the American Statistical Association, 113:522, 845-854.
Lumley, T. (2004) Analysis of complex survey samples. Journal of Statistical Software 9(1): 1-19
White, H. (1980), A heteroskedastic-consistent covariance matrix estimator and a direct test of heteroskedasticity. Econometrica, 48, 817-838.
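A minimal usage sketch, assuming the function is provided by an installed package named flipRegression (the package name is not stated on this page) and that a plain data.frame without Q/Displayr metadata is accepted. The argument values are the defaults from the usage above, and the built-in mtcars data is chosen only for illustration. The call is skipped when the package is unavailable.

```r
# Sketch only: the package name flipRegression is an assumption
fit <- NULL
if (requireNamespace("flipRegression", quietly = TRUE)) {
  fit <- flipRegression::Regression(
    formula = mpg ~ wt + hp,
    data = mtcars,  # built-in dataset, used for illustration
    type = "Linear",
    missing = "Exclude cases with missing data",
    output = "Coefficients"
  )
  print(fit$coef)            # coefficients from the underlying fitted model
  print(fit$n.observations)  # number of observations used in the regression
  chart.data <- attr(fit, "ChartData")  # data.frame used for XLS export
}
```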