coding: Coding Categorical Variables

coding R Documentation

Description

This function creates k - 1 coded variables for a categorical variable with k distinct levels. The coding systems available in this function are dummy coding, simple coding, unweighted effect coding, weighted effect coding, repeated coding, forward Helmert coding, reverse Helmert coding, and orthogonal polynomial coding.

Usage

coding(..., data = NULL,
       type = c("dummy", "simple", "effect", "weffect", "repeat",
                "fhelm", "rhelm", "poly"), base = NULL,
       name = c("dum.", "sim.", "eff.", "weff.", "rep.", "fhelm.", "rhelm.", "poly."),
       append = TRUE, as.na = NULL, check = TRUE)

Arguments

...

a numeric vector with integer values, character vector, or factor. Alternatively, an expression indicating the variable name in data. Note that the function can only deal with one categorical variable.

data

a data frame when specifying a variable in the argument .... Note that the argument is NULL when specifying a numeric vector with integer values, character vector, or factor for the argument ....

type

a character string indicating the type of coding, i.e., dummy (default) for dummy coding, simple for simple coding, effect for unweighted effect coding, weffect for weighted effect coding, repeat for repeated coding, fhelm for forward Helmert coding, rhelm for reverse Helmert coding, and poly for orthogonal polynomial coding (see 'Details').

base

a numeric value or character string indicating the baseline group for dummy and simple coding and the omitted group for effect coding. By default, the first group or factor level is selected as the baseline or omitted group.

name

a character string or character vector indicating the names of the coded variables. By default, variables are named "dum.", "sim.", "eff.", "weff.", "rep.", "fhelm.", "rhelm.", or "poly." depending on the type of coding, followed by the category used in the comparison (e.g., "dum.2" and "dum.3"). Variable names can be specified using a character string (e.g., name = "dummy_" leads to dummy_2 and dummy_3) or a character vector whose length matches the number of coded variables, i.e., the number of unique categories minus one (e.g., name = c("x1_2", "x1_3")).

append

logical: if TRUE (default), coded variables are appended to the data frame specified in the argument data.

as.na

a numeric vector indicating user-defined missing values, i.e. these values are converted to NA before conducting the analysis.

check

logical: if TRUE (default), argument specification is checked.

Details

Dummy Coding

Dummy or treatment coding compares the mean of each level of the categorical variable to the mean of a baseline group. By default, the first group or factor level is selected as the baseline group. The intercept in the regression model represents the mean of the baseline group. For example, dummy coding based on a categorical variable with four groups A, B, C, D makes the following comparisons: B vs A, C vs A, and D vs A, with A being the baseline group.
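The interpretation of the intercept and slopes under dummy coding can be checked with a minimal base R sketch. It does not call coding(); instead it assigns a treatment contrast matrix to mtcars$gear directly. The object names dat, cm, fit, and means are illustrative assumptions.

```r
# Illustrative sketch in base R; coding() itself is not called here.
dat <- data.frame(mpg = mtcars$mpg, gear = factor(mtcars$gear))

# Treatment (dummy) contrasts for k = 3 levels, first level as baseline
cm <- contr.treatment(nlevels(dat$gear), base = 1)
contrasts(dat$gear) <- cm

fit <- lm(mpg ~ gear, data = dat)
means <- vapply(split(dat$mpg, dat$gear), mean, numeric(1))

# Intercept = mean of the baseline group (gear = 3)
all.equal(unname(coef(fit))[1], unname(means)[1])

# Slopes = differences from the baseline mean (4 vs 3, 5 vs 3)
all.equal(unname(coef(fit))[-1], unname(means[-1] - means[1]))
```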

Simple Coding

Simple coding compares the mean of each level of the categorical variable to the mean of a baseline level. By default, the first group or factor level is selected as the baseline group. The intercept in the regression model represents the unweighted grand mean, i.e., the mean of the group means. For example, simple coding based on a categorical variable with four groups A, B, C, D makes the following comparisons: B vs A, C vs A, and D vs A, with A being the baseline group.
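One common way to obtain simple coding, sketched here in base R under the assumption that it matches the scheme described above, is to subtract 1/k from every entry of the treatment contrast matrix. The slopes stay the same as in dummy coding, while the intercept moves to the mean of the group means. The object names are illustrative.

```r
# Illustrative sketch in base R; coding() itself is not called here.
dat <- data.frame(mpg = mtcars$mpg, gear = factor(mtcars$gear))
k <- nlevels(dat$gear)

# Simple coding: treatment contrasts shifted by 1/k (columns sum to zero)
cm <- contr.treatment(k, base = 1) - 1 / k
contrasts(dat$gear) <- cm

fit <- lm(mpg ~ gear, data = dat)
means <- vapply(split(dat$mpg, dat$gear), mean, numeric(1))

# Intercept = unweighted grand mean (mean of the group means)
all.equal(unname(coef(fit))[1], mean(means))

# Slopes = differences from the baseline mean (4 vs 3, 5 vs 3)
all.equal(unname(coef(fit))[-1], unname(means[-1] - means[1]))
```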

Unweighted Effect Coding

Unweighted effect or sum coding compares the mean of a given level to the unweighted grand mean, i.e., the mean of the group means. By default, the first group or factor level is selected as the omitted group. For example, effect coding based on a categorical variable with four groups A, B, C, D makes the following comparisons: B vs (A, B, C, D), C vs (A, B, C, D), and D vs (A, B, C, D), with A being the omitted group.
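A minimal base R sketch of unweighted effect coding with the first level omitted follows. The contrast matrix is built by hand (a row of -1 for the omitted level, an identity matrix for the others) because stats::contr.sum() omits the last level rather than the first; the construction and object names are illustrative assumptions, not the internals of coding().

```r
# Illustrative sketch in base R; coding() itself is not called here.
dat <- data.frame(mpg = mtcars$mpg, gear = factor(mtcars$gear))
k <- nlevels(dat$gear)

# Omitted (first) level gets -1 in every column; other levels get an indicator
cm <- rbind(-1, diag(k - 1))
contrasts(dat$gear) <- cm

fit <- lm(mpg ~ gear, data = dat)
means <- vapply(split(dat$mpg, dat$gear), mean, numeric(1))

# Intercept = unweighted grand mean (mean of the group means)
all.equal(unname(coef(fit))[1], mean(means))

# Slopes = deviation of levels 4 and 5 from the unweighted grand mean
all.equal(unname(coef(fit))[-1], unname(means[-1] - mean(means)))
```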

Weighted Effect Coding

Weighted effect or sum coding compares the mean of a given level to the weighted grand mean, i.e., the sample mean. By default, the first group or factor level is selected as the omitted group. For example, weighted effect coding based on a categorical variable with four groups A, B, C, D makes the following comparisons: B vs (A, B, C, D), C vs (A, B, C, D), and D vs (A, B, C, D), with A being the omitted group.
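The difference from unweighted effect coding lies in the row for the omitted level, which is weighted by the group sizes. The base R sketch below builds such a matrix by hand, assuming the usual construction in which the omitted level receives -n_j / n_omitted in column j; object names are illustrative and coding() is not called.

```r
# Illustrative sketch in base R; coding() itself is not called here.
dat <- data.frame(mpg = mtcars$mpg, gear = factor(mtcars$gear))
k <- nlevels(dat$gear)
n <- as.numeric(table(dat$gear))  # group sizes

# Omitted (first) level: -n_j / n_1 in column j; other levels get an indicator
cm <- rbind(-n[-1] / n[1], diag(k - 1))
contrasts(dat$gear) <- cm

fit <- lm(mpg ~ gear, data = dat)
means <- vapply(split(dat$mpg, dat$gear), mean, numeric(1))

# Intercept = weighted grand mean, i.e., the sample mean of mpg
all.equal(unname(coef(fit))[1], mean(dat$mpg))

# Slopes = deviation of levels 4 and 5 from the sample mean
all.equal(unname(coef(fit))[-1], unname(means[-1] - mean(dat$mpg)))
```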

Repeated Coding

Repeated or difference coding compares the mean of each level of the categorical variable to the mean of the previous adjacent level. For example, repeated coding based on a categorical variable with four groups A, B, C, D makes the following comparisons: B vs A, C vs B, and D vs C.
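The following base R sketch illustrates successive-difference contrasts. The matrix is constructed by hand so that the block depends only on base R; the construction and the object names are illustrative assumptions rather than the code used inside coding().

```r
# Illustrative sketch in base R; coding() itself is not called here.
dat <- data.frame(mpg = mtcars$mpg, gear = factor(mtcars$gear))
k <- nlevels(dat$gear)

# Column j contrasts level j + 1 with level j:
# rows above the step get (j - k) / k, rows below it get j / k
cm <- matrix(0, nrow = k, ncol = k - 1)
for (j in seq_len(k - 1)) {
  cm[, j] <- ifelse(seq_len(k) > j, j / k, (j - k) / k)
}
contrasts(dat$gear) <- cm

fit <- lm(mpg ~ gear, data = dat)
means <- vapply(split(dat$mpg, dat$gear), mean, numeric(1))

# Slopes = differences between adjacent levels (4 vs 3, 5 vs 4)
all.equal(unname(coef(fit))[-1], unname(diff(means)))

# Intercept = mean of the group means, because every column sums to zero
all.equal(unname(coef(fit))[1], mean(means))
```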

Forward Helmert Coding

Forward Helmert coding compares the mean of each level of the categorical variable to the unweighted mean of all subsequent level(s) of the categorical variable. For example, forward Helmert coding based on a categorical variable with four groups A, B, C, D makes the following comparisons: (B, C, D) vs A, (C, D) vs B, and D vs C.
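A forward Helmert contrast matrix can be sketched in base R by rescaling stats::contr.helmert() and reversing its rows and columns. The sign convention chosen here (level mean minus the mean of all later levels) is an assumption and may differ from the one used by coding(); object names are illustrative.

```r
# Illustrative sketch in base R; coding() itself is not called here.
dat <- data.frame(mpg = mtcars$mpg, gear = factor(mtcars$gear))
k <- nlevels(dat$gear)

# contr.helmert() compares each level with the earlier ones; divide column j
# by j + 1 so that coefficients are plain mean differences, then reverse
# rows and columns to compare each level with the later ones instead
rev_cm <- sweep(contr.helmert(k), 2, 2:k, "/")
cm <- rev_cm[k:1, (k - 1):1, drop = FALSE]
dimnames(cm) <- NULL
contrasts(dat$gear) <- cm

fit <- lm(mpg ~ gear, data = dat)
means <- unname(vapply(split(dat$mpg, dat$gear), mean, numeric(1)))

# Expected slopes: 3 vs mean(4, 5), then 4 vs 5
expected <- c(means[1] - mean(means[2:3]), means[2] - means[3])
all.equal(unname(coef(fit))[-1], expected)
```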

Reverse Helmert Coding

Reverse Helmert coding compares the mean of each level of the categorical variable to the unweighted mean of all prior level(s) of the categorical variable. For example, reverse Helmert coding based on a categorical variable with four groups A, B, C, D makes the following comparisons: B vs A, C vs (A, B), and D vs (A, B, C).
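As a sketch in base R, stats::contr.helmert() already contrasts each level with the preceding ones, but its coefficients are scaled. Dividing column j by j + 1 gives coefficients that equal the mean differences described above. Whether coding() applies the same scaling is an assumption; object names are illustrative.

```r
# Illustrative sketch in base R; coding() itself is not called here.
dat <- data.frame(mpg = mtcars$mpg, gear = factor(mtcars$gear))
k <- nlevels(dat$gear)

# Rescale contr.helmert() so that slopes are plain mean differences
cm <- sweep(contr.helmert(k), 2, 2:k, "/")
dimnames(cm) <- NULL
contrasts(dat$gear) <- cm

fit <- lm(mpg ~ gear, data = dat)
means <- unname(vapply(split(dat$mpg, dat$gear), mean, numeric(1)))

# Expected slopes: 4 vs 3, then 5 vs mean(3, 4)
expected <- c(means[2] - means[1], means[3] - mean(means[1:2]))
all.equal(unname(coef(fit))[-1], expected)
```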

Orthogonal Polynomial Coding

Orthogonal polynomial coding is a form of trend analysis based on polynomials up to order k - 1, where k is the number of levels of the categorical variable. This coding scheme assumes an ordered-categorical variable with equally spaced levels. For example, orthogonal polynomial coding based on a categorical variable with four groups A, B, C, D investigates linear, quadratic, and cubic trends in the categorical variable.
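For the four-group case, the contrast matrix returned by stats::contr.poly() can be inspected directly. The short sketch below only examines the matrix; it does not call coding(), and the object name cm is illustrative.

```r
# Orthogonal polynomial contrasts for k = 4 equally spaced levels
cm <- contr.poly(4)
print(round(cm, 3))  # columns .L, .Q, .C: linear, quadratic, cubic trend

# Columns are orthonormal: the cross-product is the 3 x 3 identity matrix
all.equal(crossprod(cm), diag(3), check.attributes = FALSE)

# The linear column is proportional to -3, -1, 1, 3
all.equal(unname(cm[, ".L"]), c(-3, -1, 1, 3) / sqrt(20))
```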

Value

Returns a data frame containing the k - 1 coded variables, with the same number of rows as the variable specified in .... If append = TRUE, the coded variables are appended to the data frame specified in data.

Note

This function uses the contr.treatment function from the stats package for dummy coding and simple coding, a modified copy of the contr.sum function from the stats package for effect coding, a modified copy of the contr.wec function from the wec package for weighted effect coding, a modified copy of the contr.sdif function from the MASS package for repeated coding, a modified copy of the code_helmert_forward function from the codingMatrices package for forward Helmert coding, a modified copy of the contr_code_helmert function from the faux package for reverse Helmert coding, and the contr.poly function from the stats package for orthogonal polynomial coding.

Author(s)

Takuya Yanagida takuya.yanagida@univie.ac.at

See Also

rec, item.reverse

Examples

# Example 1a: Dummy coding for 'gear', baseline group = 3
coding(gear, data = mtcars)

# Example 1b: Alternative specification without using the 'data' argument
coding(mtcars$gear)

# Example 2: Dummy coding for 'gear', baseline group = 4
coding(gear, data = mtcars, base = 4)

# Example 3a: Effect coding for 'gear', omitted group = 3
coding(gear, data = mtcars, type = "effect")

# Example 3b: Effect coding for 'gear', omitted group = 4
coding(gear, data = mtcars, type = "effect", base = 4)

# Example 4a: Dummy-coded variable names with prefix "gear3."
coding(gear, data = mtcars, name = "gear3.")

# Example 4b: Dummy-coded variables named "gear_4vs3" and "gear_5vs3"
coding(gear, data = mtcars, name = c("gear_4vs3", "gear_5vs3"))

misty documentation built on June 29, 2024, 9:07 a.m.