# pelora: Supervised Grouping of Predictor Variables

In supclust: Supervised Clustering of Predictor Variables such as Genes

## Description

Performs selection and supervised grouping of predictor variables in large datasets (for example, microarray gene expression data), with an option for simultaneous classification. The algorithm follows a greedy forward strategy and optimizes the binomial log-likelihood, which is computed from conditional class probabilities estimated by penalized logistic regression.
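
As an illustration of the criterion described above, the following base-R sketch scores single-variable candidates by the binomial log-likelihood of a ridge-penalized logistic fit and picks the best one as a first greedy step. The helpers `ridge_logit` and `binom_loglik`, the simulated data, and the plain ridge penalty are assumptions made for this sketch; the package's internal routines, its rescaled penalty, and its forward/backward cycles are not reproduced here.

```r
## Hypothetical helper: ridge-penalized logistic regression via Newton-Raphson.
## The intercept is left unpenalized; `lambda` is a plain ridge penalty here,
## which is not necessarily the rescaled penalty used inside pelora.
ridge_logit <- function(z, y, lambda, iter = 25) {
  X <- cbind(1, z)
  P <- diag(c(0, rep(lambda, ncol(X) - 1)))
  beta <- rep(0, ncol(X))
  for (i in seq_len(iter)) {
    p <- as.vector(plogis(X %*% beta))
    W <- p * (1 - p)
    grad <- crossprod(X, y - p) - P %*% beta   # gradient of penalized log-likelihood
    hess <- crossprod(X, X * W) + P            # negative Hessian
    beta <- beta + as.vector(solve(hess, grad))
  }
  list(coef = beta, prob = as.vector(plogis(X %*% beta)))
}

## Hypothetical helper: binomial log-likelihood of 0/1 labels given probabilities.
binom_loglik <- function(y, p) {
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)         # guard against log(0)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

## Simulated data (assumption): 60 cases, 20 variables, variable 3 carries signal.
set.seed(1)
n <- 60; p <- 20
y <- rep(0:1, each = n / 2)
x <- matrix(rnorm(n * p), n, p)
x[, 3] <- x[, 3] + 3 * y
x <- scale(x)

## One greedy forward step: try each variable as the first group member and
## keep the one whose group centroid gives the largest log-likelihood.
group <- integer(0)
score <- sapply(seq_len(p), function(j) {
  centroid <- rowMeans(x[, c(group, j), drop = FALSE])
  binom_loglik(y, ridge_logit(centroid, y, lambda = 1)$prob)
})
best <- which.max(score)
group <- c(group, best)
```

Repeating the scoring step with the updated `group` would grow the group one variable at a time; a full implementation would also need a stopping rule, which this sketch leaves out.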

## Usage

    pelora(x, y, u = NULL, noc = 10, lambda = 1/32, flip = "pm",
           standardize = TRUE, trace = 1)

## Arguments

| Argument | Description |
|---|---|
| `x` | Numeric matrix of explanatory variables (p variables in columns, n cases in rows). For example, these can be microarray gene expression data that should be grouped. |
| `y` | Numeric vector of length n containing the class labels of the individuals. These labels must be coded as 0 and 1. |
| `u` | Numeric matrix of additional (clinical) explanatory variables (m variables in columns, n cases in rows) that are used in the penalized logistic regression prediction model but are neither grouped nor averaged. For example, these can be 'traditional' clinical variables. |
| `noc` | Integer, the number of clusters to search for in the data. |
| `lambda` | Real, defaults to 1/32. Rescaled penalty parameter, which should lie in [0,1]. |
| `flip` | Character string specifying how the x (gene expression) matrix is sign-flipped. Possible values are "pm" (the default), where the sign of each variable is determined when it enters a group; "cor", where the sign of each variable is determined a priori as the sign of its empirical correlation with the y-vector; and "none", where no sign-flipping is carried out. |
| `standardize` | Logical, defaults to TRUE. Indicates whether the predictor variables (genes) should be standardized to zero mean and unit variance. |
| `trace` | Integer >= 0. When positive, output from the internal loops is printed; trace >= 2 also prints output from the internal C routines. |
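
To make the `standardize` and `flip = "cor"` options more concrete, here is a minimal base-R sketch of what standardizing the columns and choosing signs from the empirical correlation with `y` could look like. The simulated data and the variable names `xs`, `signs` and `x_flipped` are assumptions for illustration; the sketch does not call pelora and may differ from what the package does internally.

```r
## Simulated data (assumption): 30 cases, 5 variables, variable 2 is
## negatively associated with class 1.
set.seed(42)
n <- 30; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- rep(0:1, length.out = n)
x[, 2] <- x[, 2] - 3 * y

## Standardize every column to zero mean and unit variance.
xs <- scale(x)

## Sign of the empirical correlation of each variable with y (+1 or -1).
signs <- as.vector(sign(cor(xs, y)))
signs[signs == 0] <- 1            # treat an exact zero correlation as "no flip"

## Multiply each column by its sign so that no variable is negatively
## correlated with y any more.
x_flipped <- sweep(xs, 2, signs, "*")
```

With `flip = "pm"` the sign is instead decided when a variable enters a group, so it cannot be precomputed in this way.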

## Value

`pelora` returns an object of class "pelora". The functions `print` and `summary` give an overview of the variables (genes) that have been selected and the groups that have been formed. The function `plot` yields a two-dimensional projection into the space of the first two group centroids that `pelora` found. The generic function `fitted` returns the fitted values, which are the cluster representatives. `coef` returns the penalized logistic regression coefficients θ_j for each of the predictors. Finally, `predict` classifies test data with Pelora's internal penalized logistic regression classifier on the basis of the (gene) groups that have been found.

An object of class "pelora" is a list containing:

| Component | Description |
|---|---|
| `genes` | A list of length `noc`, containing integer vectors with the indices (column numbers) of the variables (genes) that have been clustered. |
| `values` | A numerical matrix with dimension n × `noc`, containing the fitted values, i.e. the group centroids $\tilde{x}_j$. |
| `y` | Numeric vector of length n containing the class labels of the individuals. These labels are coded as 0 and 1. |
| `steps` | Numerical vector of length `noc`, showing the number of forward/backward cycles in the fitting process of each cluster. |
| `lambda` | The rescaled penalty parameter. |
| `noc` | The number of clusters that were searched for in the data. |
| `px` | The number of columns (genes) in the x-matrix. |
| `flip` | The method that was chosen for sign-flipping the x-matrix. |
| `var.type` | A factor with `noc` entries, describing whether the jth predictor is a group of predictors (genes) or a single (clinical) predictor variable. |
| `crit` | A list of length `noc`, containing numerical vectors that record the development of the grouping criterion during the clustering. |
| `signs` | Numerical vector of length p, indicating whether the ith variable (gene) should be sign-flipped (-1) or not (+1). |
| `samp.names` | The names of the samples (rows) in the x-matrix. |
| `gene.names` | The names of the variables (columns) in the x-matrix. |
| `call` | The function call. |
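
The relationship between `genes`, `signs` and the group centroids in `values` can be sketched in base R as follows. The object `mock` is a hand-built stand-in with made-up indices and signs, not the output of a pelora call, and averaging sign-flipped standardized columns is an assumption about how a centroid is formed; clinical variables passed via `u` are not averaged and are left out here.

```r
## Mock inputs (assumption): 6 cases, 8 standardized variables.
set.seed(7)
x <- scale(matrix(rnorm(6 * 8), nrow = 6, ncol = 8))

## Hand-built stand-in for the `genes` and `signs` components.
mock <- list(
  genes = list(c(2L, 5L), c(1L, 4L, 7L)),   # column indices per group
  signs = c(1, -1, 1, 1, 1, 1, -1, 1)       # one sign per column of x
)

## One centroid per group: flip the member columns, then average them per case.
centroids <- sapply(mock$genes, function(idx) {
  flipped <- sweep(x[, idx, drop = FALSE], 2, mock$signs[idx], "*")
  rowMeans(flipped)
})
dim(centroids)   # n rows, one column per group
```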

## Author(s)

Marcel Dettling

## References

Marcel Dettling (2003). Finding Predictive Gene Groups from Microarray Data. See http://stat.ethz.ch/~dettling/supervised.html

Marcel Dettling and Peter Bühlmann (2002). Supervised Clustering of Genes. Genome Biology, 3(12): research0069.1-0069.15.

Marcel Dettling and Peter Bühlmann (2004). Finding Predictive Gene Groups from Microarray Data. Journal of Multivariate Analysis 90, 106–131.

## See Also

wilma for another supervised clustering technique.

## Examples

    library(supclust)

    ## Working with a "real" microarray dataset
    data(leukemia, package = "supclust")

    ## Generating random test data: 3 observations and 250 variables (genes)
    set.seed(724)
    xN <- matrix(rnorm(750), nrow = 3, ncol = 250)

    ## Fitting Pelora
    fit <- pelora(leukemia.x, leukemia.y, noc = 3)

    ## Working with the output
    fit
    summary(fit)
    plot(fit)
    fitted(fit)
    coef(fit)

    ## Class labels and class probabilities for the training data
    predict(fit, type = "cla")
    predict(fit, type = "prob")

    ## Predicting fitted values, class labels and class probabilities
    ## for the random test data
    predict(fit, newdata = xN)
    predict(fit, newdata = xN, type = "cla", noc = c(1, 2, 3))
    predict(fit, newdata = xN, type = "pro", noc = c(1, 3))

    ## Fitting Pelora such that the first 70 variables (genes) are not grouped
    fit <- pelora(leukemia.x[, -(1:70)], leukemia.y, leukemia.x[, 1:70])

    ## Working with the output
    fit
    summary(fit)
    plot(fit)
    fitted(fit)
    coef(fit)

    ## Class labels and class probabilities for the training data
    predict(fit, type = "cla")
    predict(fit, type = "prob")

    ## Predicting fitted values, class labels and class probabilities
    ## for the random test data
    predict(fit, newdata = xN[, -(1:70)], newclin = xN[, 1:70])
    predict(fit, newdata = xN[, -(1:70)], newclin = xN[, 1:70], type = "cla", noc = 1:10)
    predict(fit, newdata = xN[, -(1:70)], newclin = xN[, 1:70], type = "pro")