Classify: Classify observations using a Poisson model.
In PoiClaClu: Classification and Clustering of Sequencing Data Based on a Poisson Model



Classify observations using a simple Poisson model. This function implements the "sparse Poisson linear discriminant analysis classifier", which is similar to linear discriminant analysis but assumes a Poisson model rather than a Gaussian model for the data. The classifier soft-thresholds the estimated effect of each feature in order to achieve sparsity.
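The soft-thresholding step described above can be illustrated with a minimal base-R sketch. It assumes the per-feature effects are multiplicative, with 1 meaning "no effect", and that shrinkage is toward 1. The helper `soft`, the vector `d_raw`, and the value of `rho` are hypothetical and chosen only for illustration; the package may scale the threshold differently.

```r
# Soft-thresholding operator: move each value toward zero by `a`,
# setting it to exactly zero if its magnitude is at most `a`.
soft <- function(x, a) sign(x) * pmax(abs(x) - a, 0)

# Hypothetical estimated effects of five features in one class
# (1 means the feature is neither over- nor under-expressed).
d_raw <- c(0.6, 0.95, 1.0, 1.08, 1.7)
rho <- 0.1  # hypothetical tuning parameter

# Shrink the deviation from 1; small deviations become exactly 1.
d_sparse <- 1 + soft(d_raw - 1, rho)
n_nonzero <- sum(d_sparse != 1)  # features still carrying an effect

print(d_sparse)
print(n_nonzero)
```

Features whose shrunken effect equals 1 no longer distinguish the classes, which is how a larger threshold yields a sparser classifier.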

Classify(x, y, xte = NULL, rho = 0, beta = 1, rhos = NULL, type = c("mle", "deseq", "quantile"), prior = NULL, transform = TRUE, alpha = NULL)

`x`	An n-by-p training data matrix; n observations and p features. Used to train the classifier.
`y`	A numeric vector of class labels of length n: 1, 2, ..., K if there are K classes. Each element of y corresponds to a row of x; i.e. these are the class labels for the observations in x.
`xte`	An m-by-p data matrix: m test observations and p features. The classifier fit on the training data set x will be tested on this data set. If NULL, then testing will be performed on the training set.
`rho`	Tuning parameter controlling the amount of soft-thresholding performed, i.e. the level of sparsity (the number of nonzero features in the classifier). rho=0 means that there is no soft-thresholding, so all features are used in the classifier. Larger values of rho mean that fewer features are used.
`beta`	A smoothing term. A Gamma(beta,beta) prior is used to fit the Poisson model. The recommendation is to leave it at the default value of 1.
`rhos`	A vector of tuning parameters that control the amount of soft-thresholding performed. If "rhos" is provided, then one model is fit for each element of "rhos", and one set of predicted class labels is output for each element of "rhos".
`type`	How should the observations be normalized within the Poisson model, i.e. how should the size factors be estimated? Options are "quantile" or "deseq" (more robust) or "mle" (less robust). In greater detail: "quantile" is the quantile normalization approach of Bullard et al. (2010), BMC Bioinformatics; "deseq" is the median of the ratio of an observation to a pseudoreference obtained by taking the geometric mean, described in Anders and Huber (2010), Genome Biology, and implemented in the Bioconductor package "DESeq"; and "mle" is the sum of counts for each sample, which is the maximum likelihood estimate under a simple Poisson model.
`prior`	Vector of length equal to the number of classes, representing prior probabilities for each class. If NULL then uniform priors are used (i.e. each class is equally likely).
`transform`	Should the data matrices x and xte first be power transformed so that they more closely fit the Poisson model? TRUE or FALSE. Power transformation is especially useful if the data are overdispersed relative to the Poisson model.
`alpha`	If transform=TRUE, this determines the power to which the data matrices x and xte are transformed. If alpha=NULL, then the transformation that makes the Poisson model best fit the data matrix x is computed. (Note that alpha is computed based on x, not based on xte.) Alternatively, the user can supply a value of alpha with 0<alpha<=1.
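To make the "mle" and "deseq" options of `type` more concrete, here is a rough base-R sketch of the two size-factor estimates as they are described above. The function `size_factors` is a hypothetical helper, not part of PoiClaClu. Restricting the "deseq" calculation to features with no zero counts and rescaling the result to sum to 1 are assumptions made for this illustration.

```r
# Hypothetical helper: estimate one size factor per observation (row of x).
size_factors <- function(x, type = c("mle", "deseq")) {
  type <- match.arg(type)
  if (type == "mle") {
    # Total count of each sample.
    s <- rowSums(x)
  } else {
    # Pseudoreference: per-feature geometric mean across samples,
    # computed on the log scale using features with no zero counts (assumption).
    keep <- colSums(x == 0) == 0
    logx <- log(x[, keep, drop = FALSE])
    logref <- colMeans(logx)
    # Median ratio of each sample to the pseudoreference.
    s <- exp(apply(sweep(logx, 2, logref), 1, median))
  }
  s / sum(s)  # rescale so the factors sum to 1 (assumption)
}

# Toy data: the second sample is sequenced exactly twice as deeply as the first.
x_toy <- matrix(c(1, 2, 3,
                  2, 4, 6), nrow = 2, byrow = TRUE)
sf_mle <- size_factors(x_toy, "mle")
sf_deseq <- size_factors(x_toy, "deseq")
print(sf_mle)
print(sf_deseq)
```

On this toy matrix both estimates agree; they diverge when a few highly expressed features dominate the totals, which is the situation where the median-based estimate is more robust.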

`ytehat`	The predicted class labels for each of the test observations (rows of xte).
`discriminant`	An m-by-K matrix, where K is the number of classes. The (i,k) element is large if the ith observation (row) of xte belongs to class k.
`ds`	A K-by-p matrix indicating the extent to which each feature is under- or over-expressed in each class. The (k,j) element is >1 if feature j is over-expressed in class k, and is <1 if feature j is under-expressed in class k. When rho is large, many of the elements of this matrix are shrunken towards 1 (no over- or under-expression).
`alpha`	Power transformation used (if transform=TRUE).
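The relationship between `discriminant` and `ytehat` can be sketched as follows, assuming that each test observation is assigned to the class with the largest discriminant score in its row. The matrix below is made up for illustration and is not output from the package.

```r
# Hypothetical 4-by-3 discriminant matrix: 4 test observations, K = 3 classes.
discriminant <- matrix(c(-10.2, -12.5, -11.0,
                          -9.1,  -8.7,  -9.9,
                          -7.4,  -7.9,  -6.8,
                          -5.0,  -6.1,  -5.5),
                       nrow = 4, byrow = TRUE)

# Assign each observation to the class with the largest score in its row.
ytehat <- apply(discriminant, 1, which.max)
print(ytehat)
```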

Author: Daniela Witten

Reference: D Witten (2011) Classification and clustering of sequencing data using a Poisson model. To appear in Annals of Applied Statistics.

See also: Classify.cv

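library(PoiClaClu)
set.seed(1)
# Simulate count data with n = 40 observations, p = 500 features, K = 3 classes
dat <- CountDataSet(n = 40, p = 500, sdsignal = 0.1, K = 3, param = 10)
# Choose rho by cross-validation on the training set
cv.out <- Classify.cv(dat$x, dat$y)
print(cv.out)
# Fit on the training set with the selected rho and predict the test set
out <- Classify(dat$x, dat$y, dat$xte, rho = cv.out$bestrho)
print(out)
cat("Confusion matrix for predicted and true test class labels:", fill = TRUE)
print(table(out$ytehat, dat$yte))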

Value of alpha used to transform data:  0.757551
Rho values considered:  0 0.116 0.233 0.349 0.466 0.582 0.699 0.815 0.932 1.048 
1.164 1.281 1.397 1.514 1.63 1.747 1.863 1.979 2.096 2.212 2.329 2.445 2.562 
2.678 2.795 2.911 3.027 3.144 3.26 3.377
Number of CV folds performed:  5
Type of normalization performed:  mle

CV results:
Rho	Errors	Num. Nonzero Features
0	1.6	500
0.116	1.6	494.8
0.233	1.4	478.4
0.349	1.4	451
0.466	1.4	415.8
0.582	1.4	373.2
0.699	1.2	332
0.815	1	289.6
0.932	0.6	248.8
1.048	0.6	209.4
1.164	0.6	178.8
1.281	0.8	148.8
1.397	0.8	123.8
1.514	1	101.6
1.63	1.2	86.8
1.747	1.2	70.8
1.863	1.4	59.6
1.979	1.4	50.2
2.096	1.6	42.6
2.212	1.6	33.8
2.329	1.8	27.2
2.445	1.6	22.2
2.562	1.8	19.2
2.678	2	16.2
2.795	2	13.8
2.911	2	11.8
3.027	2	10.6
3.144	2	10
3.26	2	8.4
3.377	2	7.4
Value of alpha used to transform data:  0.757551
Type of normalization performed:  mle
Number of training observations:  40
Number of test observations:  40
Number of features:  500
Value of rho used:  1.164403
Number of features used in classifier:  171
Confusion matrix for predicted and true test class labels:
   
     1  2  3
  1 14  0  1
  2  1  9  1
  3  1  1 12
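
The confusion matrix printed above can be condensed into a single test error rate. The sketch below re-enters the printed counts by hand (rows are predicted labels, columns are true labels) rather than calling the package; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above.
conf <- matrix(c(14, 0,  1,
                  1, 9,  1,
                  1, 1, 12),
               nrow = 3, byrow = TRUE,
               dimnames = list(predicted = as.character(1:3),
                               true = as.character(1:3)))

n_correct <- sum(diag(conf))             # correctly classified: 35
error_rate <- 1 - n_correct / sum(conf)  # 5 of 40 misclassified
print(error_rate)  # 0.125
```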