knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",  
  out.extra = 'style="border:0; margin: auto"', 
fig.width=5,
fig.height=5,
out.width="400px",  
out.height="400px"  
)

Primary functions

The ordPens R package offers selection, and/or smoothing/fusion of ordinally scaled independent variables using a group lasso or generalized ridge penalty. In addition, nonlinear principal components analysis for ordinal variables is provided, using a second-order difference penalty.

Smoothing and selection of ordinal predictors is done by the function ordSelect(); smoothing only, by ordSmooth(); fusion and selection of ordinal predictors by ordFusion(). For ANOVA with ordinal factors, use ordAOV(). To test for genes that are differentially expressed between ordinal levels, use ordGene(). For Nonlinear PCA, performance evaluation and selection of an optimal penalty parameter, use ordPCA() and see also vignette("ordPCA", package = "ordPens").

library(ordPens) 

Fusion, smoothing and selection

These functions fit penalized dummy coefficients of ordinally scaled independent variables. The covariates are assumed to take values 1,2,...,max, where max denotes the (columnwise) highest level observed in the data. The ordSmooth() function penalizes the sum of squared differences of adjacent dummy coefficients, while ordSelect() uses a group lasso penalty on differences of adjacent dummies and ordFusion() proceeds with a fused lasso penalty on differences of adjacent dummy coefficients using the glmpath algorithm. For details on the different penalties used, please see also Tutz and Gertheiss (2014, 2016).

ordPens extends penalization to the wider framework of generalized linear models, thus allowing for binary, ordinal or Poisson distributed responses as well as the classical Gaussian model.
The response vector y can be either of numeric scale (leading to the default model = "linear"), 0/1 coded (model = "logit"), ordinal (model = "cumulative") or contain count data (i.e. model = "poisson").

Smoothing

For smoothing only, the generalized linear model is fitted via penalized maximum likelihood. In particular, the logit or poisson model is fitted by penalized Fisher scoring. For stopping the iterations the criterion $\sqrt(\sum((b.new-b.old)^2)/\sum(b.old^2)) < \delta$ is used.

The tuning parameter $\lambda$ controls the overall strength of the penalty. If $\lambda = 0$ one obtains the usual un-penalized coefficient estimates. If $\lambda \to \infty$ coefficients are enforced to shrink towards 0. For $0 < \lambda < \infty$ parameters corresponding to adjacent categories are enforced to obtain similar values, as it is sensible to assume that coefficients vary smoothly over ordinal categories.

It should be highlighted, that ordSmooth() is intended for use with high-dimensional ordinal predictors; more precisely, if the number of ordinal predictors is large. Package ordPens, however, also includes auxiliary functions such that mgcv::gam() (Wood, 2011 and 2017) can be used for fitting generalized linear and additive models with first- and second-order ordinal smoothing penalty as well as built-in smoothing parameter selection. In addition, mgcv tools for further statistical inference can be used. Note, however, significance of smooth (ordinal) terms is only reliable in case of the second-order penalty. Also note, if using gam(), dummy coefficients/fitted functions are centered over the data observed. For details, see also Gertheiss et al. (2021).

Selection

When many predictors are available, selecting the, possibly few, relevant ones is of particular interest. Joint selection and smoothing is achieved by a ordinal group lasso penalty, where the $L_2$-norm is modified by squared differences. While the ridge penalty shrinks the coefficients of correlated predictors towards each other, the (ordinal) group lasso additionally tends to select one or some of them and discard the others.
For more information on the original group lasso (for nominal predictors and grouped variables in general), cf. Meier et. al (2008).

Fusion

If a variable is selected by the group lasso, all coefficients belonging to that selected covariate are estimated as differing (at least slightly). However, it might be useful to collapse certain categories. Clustering is done by a fused lasso type penalty using the $L_1$-norm on adjacent categories. That is, adjacent categories get the same coefficient values and one obtains clusters of categories. The fused lasso also enforces selection of predictors: since the parameters of the reference categories are set to zero, a predictor is excluded if all its categories are combined into one cluster.

Simulated data example

Now, let's demonstrate and compare the three methods for the default Gaussian model.
Simulating data with ordinal predictors:

set.seed(123)

# generate (ordinal) predictors
x1 <- sample(1:8, 100, replace = TRUE)
x2 <- sample(1:6, 100, replace = TRUE)
x3 <- sample(1:7, 100, replace = TRUE)

# the response
y <- -1 + log(x1) + sin(3*(x2-1)/pi) + rnorm(100)

# x matrix
x <- cbind(x1,x2,x3)

# lambda values
lambda <- c(1000, 500, 200, 100, 70, 50, 30, 20, 10, 1) 

The syntax of the three functions is very similar, type also ?ordSmooth or one of the other functions. The fitted ordPen object contains the fitted coefficients and can be visualized by the generic plot function:

osm1 <- ordSmooth(x = x, y = y, lambda = lambda)
osl <- ordSelect(x = x, y = y, lambda = lambda)
ofu <- ordFusion(x = x, y = y, lambda = lambda) 

par(mar = c(4.1, 4.1, 3.1, 1.1)) 
plot(osm1)
plot(osl, main = "")
plot(ofu, main = "")

The vector of penalty parameters lambda is sorted decreasingly and each curve corresponds to a $\lambda$ value, with larger values being associated with the darker colors. The top row corresponds to smoothing, the middle row shows selection and the bottom row demonstrates variable fusion.

It is seen that for smaller $\lambda$, the coefficients are more wiggly, while differences of adjacent categories are more and more shrunk when $\lambda$ increases, which yields smoother coefficients.

Variable 3 is excluded from the model for $\lambda \geq 10$ with both ordSelect() and ordFusion(). In general, ordFusion() yields estimates that tend to be flat over adjacent categories, which represents a specific form of smoothness. With the fused lasso, all categories are fused and excluded at some point, which can be viewed from the bottom row or from the matrix of estimated coefficients (not printed here):

round(osm1$coefficients, digits = 3)
round(osl$coefficients, digits = 3)
round(ofu$coefficients, digits = 3)

Now, let’s plot the path of coefficients against the log-lambda value:

matplot(log(lambda), t(osm1$coefficients), type = "l", xlab = expression(log(lambda)),
        ylab = "Coefficients", col = c(1, rep(2,8), rep(3,6), rep(4,7)), cex.main = 1,
        main = "Smoothing", xlim = c(max(log(lambda)), min(log(lambda))))
axis(4, at = osm1$coefficients[,ncol(osm1$coefficients)], label = rownames(osm1$coefficients), 
     line = -.5, las = 1, tick = FALSE, cex.axis = 0.75)

matplot(log(lambda), t(osl$coefficients), type = "l", xlab = expression(log(lambda)), 
        ylab = "Coefficients", col = c(1, rep(2,8), rep(3,6), rep(4,7)), cex.main = 1, 
        main = "Selection",  xlim = c(max(log(lambda)), min(log(lambda))))
axis(4, at = osl$coefficients[,ncol(osl$coefficients)], label = rownames(osl$coefficients), 
     line = -.5, las = 1, tick = FALSE, cex.axis = 0.75)

matplot(log(lambda), t(ofu$coefficients), type = "l", xlab = expression(log(lambda)), 
        ylab = "Coefficients", col = c(1, rep(2,8), rep(3,6), rep(4,7)), cex.main = 1, 
        main = "Fusion", xlim = c(max(log(lambda)), min(log(lambda))))
axis(4, at = ofu$coefficients[,ncol(ofu$coefficients)], label = rownames(ofu$coefficients), 
     line = -.5, las = 1, tick = FALSE, cex.axis = 0.75)

Each curve corresponds to a dummy coefficient and each color corresponds to a variable (intercept not penalized here). The effect of the different penalties can be seen very clearly.

Using mgcv::gam() with ordPens and comparison to ordSmooth()

As mentioned before, the package also builds a bridge to mgcv::gam() by providing a new type of spline basis for ordered factors s(..., bs = "ordinal"), such that smooth terms in the GAM formula can be used. In addition, generic functions for prediction and plotting are provided; and existing mgcv tools for further statistical inference can be used this way, which enables a very flexible analysis.

Before estimation, we need to modify the ordinal predictors to the class of ordered factors. We also add a nominal covariate to control for.

x1 <- as.ordered(x1)
x2 <- as.ordered(x2)
x3 <- as.ordered(x3)

u1 <- sample(1:8, 100, replace = TRUE)
u <- cbind(u1)
osm2 <- ordSmooth(x = x, y = y, u = u, lambda = lambda)

Now, let us use gam() from mgcv for model fitting. Estimation with first-order penalty and smoothing parameter selection by REML:

gom1 <- gam(y ~ s(x1, bs = "ordinal", m = 1) + s(x2, bs = "ordinal", m = 1) + 
              s(x3, bs = "ordinal", m = 1) + factor(u1), method = "REML")

Using second-order penalty instead:

gom2 <- gam(y ~ s(x1, bs = "ordinal", m = 2) + s(x2, bs = "ordinal", m = 2) + 
              s(x3, bs = "ordinal", m = 2) + factor(u1), method = "REML")

The summary command including significance of smooth terms can be used in the usual way. Please note, the latter is only reliable for m = 2, as mentioned above.

summary(gom2)

We can also visualize the coefficients including confidence intervals by executing the plot function:

par(mar = c(4.1, 4.1, 3.1, 1.1)) 
plot(osm2)
plot(gom1)
plot(gom2)

Predicting from an ordPen object

The package also includes a generic function for prediction. It takes a fitted ordPen object and a matrix/data frame of new observations of the considered ordinal predictors (denoted by newx). If the fitted object contains also additional nominal or metric predictors, further matrices of new observations corresponding to those predictors need to be added (denoted by newu or newz).

x1 <- sample(1:8, 10, replace = TRUE)
x2 <- sample(1:6, 10, replace = TRUE)
x3 <- sample(1:7, 10, replace = TRUE)
newx <- cbind(x1, x2, x3)

The type of prediction can be chosen by the type option: "link" is on the scale of linear predictors, "response" is on the scale of the response variable (inverse link function).

Predict from an ordPen object:

round(predict(osm1, newx), digits = 3)
round(predict(osl, newx), digits = 3)
round(predict(ofu, newx), digits = 3)

ICF core set for chronic widespread Pain

Lastly, we demonstrate the three methods on a publicly available dataset suiting the high-dimensional setting they're designed for. The ICF core set for chronic widespread pain (CWP) consists of 67 rating scales - called ICF categories - and a physical health component summary measure (phcs).

Each ICF factor is associated with one of the following four types: 'body functions', 'body structures', 'activities and participation', and 'environmental factors'. The latter are measured on a nine-point Likert scale ranging from −4 'complete barrier' to +4 'complete facilitator'. All remaining factors are evaluated on a five-point Likert scale ranging from 0 'no problem' to 4 'complete problem'. Type ?ICFCoreSetCWP and see also references therein for further details.

data(ICFCoreSetCWP)
head(ICFCoreSetCWP)

Before estimation, we make sure that the ordinal design matrix is coded adequately:

y <- ICFCoreSetCWP$phcs
x <- ICFCoreSetCWP[, 1:67] + matrix(c(rep(1, 50), rep(5, 16), 1),
                                    nrow(ICFCoreSetCWP), 67,
                                    byrow = TRUE)
xnames <- names(x)
head(x)

We can check whether all categories are observed at least once as follows:

rbind(apply(x, 2, min), apply(x, 2, max))

As for some covariates not all possible levels are observed at least once, the easiest way is to add a corresponding row to x with corresponding y value being NA to ensure a corresponding coefficient is to be fitted.

x <- rbind(x, rep(1,67))
x <- rbind(x, c(rep(5, 50), rep(9,16), 5))
y <- c(y, NA, NA)

osm_icf <- ordSmooth(x = x, y = y, lambda = lambda)
osl_icf <- ordSelect(x = x, y = y, lambda = lambda)
ofu_icf <- ordFusion(x = x, y = y, lambda = lambda) 

Let us illustrate the fitted coefficients for some covariates for different values of $\lambda$ (larger values being associated with the darker curves). We can use axis() to replace the integers 1,2,... used for fitting with the original ICF levels:

wx <- which(xnames=="b1602"|xnames=="d230"|xnames=="d430"|xnames=="d455"|xnames=="e1101")
xmain <- c()
xmain[wx] <- c("Content of thought",
               "Carrying out daily routine",
               "Lifting and carrying objects",
               "Moving around",
               "Drugs")

par(mar = c(4.1, 4.1, 3.1, 1.1))  
for(i in wx){
  plot(osm_icf, whx = i, main = "", xaxt = "n")
  axis(1, at = 1:length(osm_icf$xlevels), 
       labels = ((1:length(osm_icf$xlevels)) - c(rep(1,50), rep(5,16), 1)[i]))   

  plot(osl_icf, whx = i, main = xmain[i], xaxt = "n")
  axis(1, at = 1:length(osm_icf$xlevels), 
       labels = ((1:length(osm_icf$xlevels)) - c(rep(1,50), rep(5,16), 1)[i]))   

  plot(ofu_icf, whx = i, main = "", xaxt = "n")
  axis(1, at = 1:length(osm_icf$xlevels), 
       labels = ((1:length(osm_icf$xlevels)) - c(rep(1,50), rep(5,16), 1)[i]))   
}

It is seen, for example, that variables b1602, d230, d430 (rows 1-3) are excluded by ordSelect() for $\lambda \geq 100,60,30$, respectively, whereas ordFusion() yields fused dummy estimates for b1602, clustering ICF levels 2-4 for $10 \leq \lambda < 100$ and excluding the variable for $\lambda \geq 100$:

xgrp <- rep(1:67, apply(x, 2, max))

osm_coefs <- osm_icf$coef[2:(length(xgrp) + 1),, drop = FALSE]
osl_coefs <- osl_icf$coef[2:(length(xgrp) + 1),, drop = FALSE]
ofu_coefs <- ofu_icf$coef[2:(length(xgrp) + 1),, drop = FALSE]

round(osm_coefs[xgrp == wx[1],, drop = FALSE] ,3)
round(osl_coefs[xgrp == wx[1],, drop = FALSE] ,3)
round(ofu_coefs[xgrp == wx[1],, drop = FALSE] ,3)

Note that a penalty using first-order differences, as applied here for smoothing/selection/fusion, always fuses the outer levels if observations are missing. This can be viewed for variable b1602 (where observations for level 4 are missing) as well as for variable e1101 (where level -4 is not observed once).

ANOVA for factors with ordered levels: ordAOV()

In the setting of a continuous response and a discrete covariate (or multiple, group-defining factors), testing for differences in the means might be of interest, known as the classical analysis of variance (ANOVA). ordAOV() carries out ANOVA for factors with ordered levels, which often outperforms the standard F-test.

The new test uses a mixed effects formulation of the usual one- or multi-factorial ANOVA model (currently with main effects only) while penalizing (squared) differences of adjacent means. Testing for equal means across factor levels is done by (restricted) likelihood ratio testing for a zero variance component in a linear mixed model, cf. Gertheiss and Oehrlein (2011) and Gertheiss (2014) or type ?ordAOV for further details on the testing procedure.
For one-factorial ANOVA, the (exact) finite sample distribution is derived by Crainiceanu and Ruppert (2004). For simulating values from the finite sample null distribution of the (restricted) likelihood ratio statistic, the algorithms implemented in Package RLRsim (Scheipl et al., 2008) are used. See ?LRTSim and ?RLRTSim for further information.

To illustrate the method, let's reanalyze the ICF data for CWP. Our aim is to look for categories that show significant association with the response phcs.

y <- ICFCoreSetCWP$phcs

one-factorial ANOVA

For one-factorial ordinal ANOVA, we consider the ICF category 'Moving around':

x <- ICFCoreSetCWP[, which(xnames == "d455")]

The factor argument x is assumed to be a vector (one-factorial) or matrix (multi-factorial) taking values 1,2,...max, where max denotes the highest level of the respective factor observed in the data. Since every level between 1 and max has to be observed at least once, we need to modify our covariate adequately:

x <- as.integer(x - min(x) + 1)

Let us visualize boxplots of the observed physical health component scores for the different levels of ICF category ‘Moving around’ and add the empirical means (red line):

boxplot(y~factor(x, levels = 1:5, labels = 0:4), varwidth = TRUE, col = "white", 
        xlab = "level", ylab = "physical health summary")

x.mean <- tapply(as.numeric(y)[order(x)], as.factor(x[order(x)]), mean)
lines(x.mean, type = "b", col = 2, pch = 17)

The widths of the boxes are proportional to the square-roots of the number of observations in each group. It is seen that the the patient's physical health condition worsens as the ability to move around becomes more and more a problem.

Now we can perform ordinal ANOVA for testing whether there are significant differences between the levels. The type argument allows to choose the type of test to carry out: likelihood ratio ("LRT") or restricted likelihood ratio ("RLRT", recommended).

For deriving p-values, we simulate 1,000,000 values from the null distribution of RLRT:

ordAOV(x, y, type = "RLRT", nsim = 1000000)

For comparison:

ordAOV(x, y, type = "LRT", nsim = 1000000)
anova(lm(y ~factor(x)))

multi-factorial ANOVA

For multi-factorial ordinal ANOVA, simply adjust x to a matrix with each column corresponding to one ordinal factor. Note that only main effects are considered here and that for the finite sample null distribution of the (R)LRT statistic, the approximation proposed by Greven et al. (2008) is used. The outcome is a list (of lists) with the jth component giving the results above when testing the main effect of factor j.

Testing for differentially expressed genes: ordGene()

A situation similar to those of the ICF above is found with microarrays of gene expression data when looking for differentially expressed genes between ordinal phenotypes.

ordGene() is a wrapper function and operates as follows: For each gene in the dataset, (R)LRT is applied separately using the ordAOV() function to test for differences between levels given in lvs and p-values are stored.
The null distribution of (R)LRT is obtained by simulation (one million values by default).

Let's simulate some (toy) gene expression data... We assume that xpr contains the gene expression data with genes in the rows and samples in the columns:

set.seed(321) 
ni <- 5
n <- sum(5*ni)
xpr <- matrix(NA, ncol = n, nrow = 100)
mu_lin <- 3:7  
mu_sq2 <- (-2:2)^2 * 0.5 + 3   
a <- seq(0.75, 1.25, length.out = 10)

for(i in 1:10){ 
  xpr[i,] <- a[i] * rep(mu_lin, each = ni) + rnorm(n)
  xpr[i+10,] <- a[i] * rep(mu_sq2, each = ni) + rnorm(n) 
} 
for(i in 21:100) xpr[i,] <- 3 + rnorm(n)

The generated dataset contains 25 samples of (fictive) dose-response microarray data of 100 genes at five doses.

dose contains the corresponding (fictive) doses being as follows:

dose <- rep(c(0, 0.01, 0.05, 0.2, 1.5), each = ni)

Now we can plot the data for some exemplary genes and add the true mean gene expression values (black lines):

plot(dose, xpr[4,], col = as.factor(dose), lwd = 2, ylab = "expression", main = "gene 4") 
lines(sort(unique(dose)), mu_lin * a[4], lty = 1, col = 1) 

plot(dose, xpr[14,], col = as.factor(dose), lwd = 2, ylab = "expression", main = "gene 14") 
lines(sort(unique(dose)), mu_sq2 * a[4], lty = 1, col = 1) 

Alternatively, we can plot dose levels (on ordinal scale) against expression values:

plot(1:length(sort(unique(dose))), ylim = range(xpr[4,]), pch = "", ylab = "expression", 
     main = "gene 4", xlab = "levels", xaxt = "n")
axis(1, at = 1:length(sort(unique(dose))) ) 
points(as.factor(dose), xpr[4,], col = as.factor(dose), lwd = 2) 
lines(1:length(sort(unique(dose))), mu_lin * a[4], lty = 1)

plot(1:length(sort(unique(dose))), ylim = range(xpr[14,]), pch = "", ylab = "expression", 
     main = "gene 14", xlab = "levels", xaxt = "n")
axis(1, at = 1:length(sort(unique(dose))) ) 
points(as.factor(dose), xpr[14,], col = as.factor(dose), lwd = 2) 
lines(1:length(sort(unique(dose))), mu_sq2 * a[4], lty = 1)

ordGene() takes the expression data, being a matrix or data frame, and the vector of doses and stores p-values according to the (R)LRT test statistic.

pvals <- ordGene(xpr = xpr, lvs = dose, nsim = 1e6)

In addition to (R)LRT, results of usual one-way ANOVA (not taking the factor's ordinal scale level into account) and a t-test assuming a linearity across factor levels (not the doses such as 0, 0.01, ...) are reported.

For illustration, we can compare the empirical cumulative distribution of the smallest p-values ($\leq 0.05$) produced from each of the tests (ANOVA, RLRT, t-test):

plot(ecdf(pvals[,1]), xlim = c(0,0.05), ylim = c(0, 0.25),
     main = "", xlab = "p-value", ylab = "F(p-value)")
plot(ecdf(pvals[,2]), xlim = c(0, 0.05), add = TRUE, col = 2)
plot(ecdf(pvals[,3]), xlim = c(0, 0.05), add = TRUE, col = 3)
legend('topleft', colnames(pvals), col = 1:3, lwd = 2, lty = 1) 

See also Sweeney et al. (2015) for a more detailed analysis on several publicly available datasets.

References



ahoshiyar/ordPens documentation built on May 7, 2024, 5:43 a.m.