options(continue=" ", width=70, prompt="> ") options(contrasts=c("contr.treatment", "contr.poly")) #ensure default options(show.significant.stars = FALSE) #statistical intelligence knitr::opts_chunk$set( collapse = TRUE, warning=FALSE, error=FALSE, tidy=FALSE, fig.asp=.75, fig.align="center", comment = "#>", highlight=TRUE, echo=TRUE, prompt=FALSE, fig.width=7, fig.height=5.5 ) library(survival)
The ridge()
function was included in the survival package as a test of the
code for penalized models. As the author of the package, I never actually
use the function myself, and as a consequence it has never received any
polish. But it does work. This neglect is simply due to the types of data
that I analyse.
This vignette was created to address a common issue with using the function, one that arises with some frequency on internet message boards. Hopefully this note will provide context and a reference for a solution. I will create a dummy dataset with 20 random 0/1 variables as a running example.
set.seed(1954) # force reproducability library(survival) n <- nrow(lung) snp <- matrix(rbinom(20*n, 1, p=.1), nrow=n) snpdata <- cbind(lung, data.frame(snp)) dim(snpdata)
The ridge function makes the most sense when there are a large number of variables, for instance if one had 100 SNPs. Say that you write the call in the obvious way:
cfit1 <- coxph(Surv(time, status) ~ age + sex + ridge(X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, theta=.1), data=snpdata)
The problem with this, of course, is that typing this statement is cumbersome
for 20 SNPs and practically impossible if there were 500.
One can use .
in R formulas to
stand for "all the other variables", but this does not work inside a function.
This leads to the first work-around, which is to create the formula
programatically.
xlist <- paste0("X", 1:20) myform <- paste("Surv(time, status) ~ age + sex + ridge(", paste(xlist, collapse= ", "), ", theta=.1)") cfit2 <- coxph(formula(myform), data=snpdata) all.equal(cfit1$loglik, cfit2$loglik)
This approach works well up to a point and then fails, once the model statement grows too long for the internal buffer of the R parser. The solution is to call the ridge function with a single matrix argument.
cfit3 <- coxph(Surv(time, status) ~ age + sex + ridge(snp, theta=.1), data = snpdata)
This fit has completely ignored the variables X1, X2, ..., X50 found in the dataframe. If the SNP data had come to us as a dataframe, then we would create a temporary matrix using the as.matrix command applied to the appropriate columns of said data.
The problem with this approach comes if we want to do predictions using a new set of data.
newsnp <- matrix(rbinom(20*4, 1, p=.12), nrow=4) prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1), newsnp) predict(cfit3, newdata=prdata)
We get a failure message that "variable lengths differ". The reason is that R
fitting functions look for their variables first in the data=
argument, and
then look in the primary working data for any that were not found. The call to
cfit3 has 3 variables of age, sex, and snp; while the new dataframe prdata has
variables of age, sex, X1, X2, ..., X20. The predict function pulls 4 rows from
prdata (for age and sex), and 50 rows from the global variable snp. This clearly
does not work.
What if we put those variables in the the main data, instead of a dataframe?
prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1), newsnp) snpsave <- snp snp <- newsnp age <- prdata$age sex <- prdata$sex new <- predict(cfit3) length(new) snp <- snpsave # restore the status quo rm(age, sex)
This time we have been tripped up by the predict.coxph function
. If there is
no newdata argument, the function assumes the user is asking for predictions
using the original data, and that is what is produced. There is also the need
to restore datasets that we overwrote.
The most direct way around this is to do the prediction "by hand". Since this dataset has no factor variables, splines, or other terms that lead to special coding, it is fairly easy to do the computation.
prmat <- cbind(age= c(50, 65, 48, 70), sex=c(1,1,2,1), newsnp) drop(prmat %*% coef(cfit3)) # simplify results to vector #alternate (center) drop(prmat %*% coef(cfit3)) - sum(coef(cfit3)* cfit3$mean)
The predict.coxph
function, by default, gives predictions for centered
covariates. This has its roots in the internals of the coxph
code, where
centering is used to avoid overflow of the exp function. This is entirely
optional when doing the prediction ourselves, unless one wants to match
predict.coxph
. (If I could go back in time, centering the predictions would
be undone; it's effect has been mostly to cause confusion. Que sera sera.)
Suppose that one still wanted to use the standard predict function, for instance
if the model did include a factor variable, or simply to fit in to the usual R
pattern? What is needed is a way to store a matrix directly into a dataframe.
This is done regularly by the model.frame
function; the code below causes
data.frame to use that behavior for our matrices of SNP values.
# new data type ridgemat <- function(x) { class(x) <- c("ridgemat", class(x)) x } as.data.frame.ridgemat <- function(x, ...) as.data.frame.model.matrix(as.matrix(x), ...) snpdata2 <- cbind(lung, snp= ridgemat(snp)) names(snpdata2) cfit4 <- coxph(Surv(time, status) ~ age + sex + ridge(snp, theta=.1), data= snpdata2) prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1), snp= ridgemat(newsnp)) predict(cfit4, newdata=prdata)
Success!
The downside to this is that the modified dataframe object is known to work with modeling functions, but is not guaranteed to be comparable with standard manipulations for dataframes, such as merge, subset, etc., or tidyverse concepts such as a tibble.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.