naiveReplicateRUV: Remove unwanted variation from a gene expression matrix using...
In RUVnormalize: RUV for normalization of expression array data

Description Usage Arguments Details Value Examples

View source: R/naiveReplicateRUV.R

The function takes as input a gene expression matrix as well as the index of negative control genes and replicate samples. It estimates and remove unwanted variation from the gene expression.

1	naiveReplicateRUV(Y, cIdx, scIdx, k, rrem=NULL, p=NULL, tol=1e-6)

`Y`	Expression matrix where the rows are the samples and the columns are the genes.
`cIdx`	Column index of the negative control genes in Y, for estimation of unwanted variation.
`scIdx`	Matrix giving the set of replicates. Each row is a set of arrays corresponding to replicates of the same sample. The number of columns is the size of the largest set of replicates, and the smaller sets are padded with -1 values. For example if the sets of replicates are (1,11,21), (2,3), (4,5), (6,7,8), the scIdx should be 1 11 21 2 3 -1 4 5 -1 6 7 8
`k`	Desired rank for the estimated unwanted variation term. The returned rank may be lower if the replicate arrays and control genes did not contain a signal of rank k.
`rrem`	Optional, indicates which arrays should be removed from the returned result. Useful if the replicate arrays were not actual samples but mixtures of RNA which are only useful to estimate UV but which should not be included in the analysis.
`p`	Optional. If given, the function returns an estimate of the term of interest using rank-p restriction of the SVD of the corrected matrix.
`tol`	Directions of variance lower than this value in the replicate samples are dropped (which may result in an estimated unwanted variation term of rank smaller than k).

In terms of model, the rank k can be thought of as the number of independent sources of unwanted variation in the data (i.e., if one source is a linear combination of other sources, it does not increase the rank).

In practice, even if the real number of independent sources of unwanted variation is known, using a smaller k (resp., larger ridge) could yield better corrections because one may not have enough samples to effectively estimate all the effects.

A list containing the following terms:

`X, b`	if p is not NULL, contains an estimate of the factor of interest (X) and its effect (beta) obtained using rank-p restriction of the SVD of Y - W alpha.
`W, a`	Estimates of the unwanted variation factors (W) and their effect (alpha).
`cY`	The corrected expression matrix Y - W alpha
`Yctls`	The differences of replicate arrays which were used to estimate W and alpha.

    if(require('RUVnormalizeData')){
     
         ## Load the data
         data('gender', package='RUVnormalizeData')
     
         Y <- t(exprs(gender))
         X <- as.numeric(phenoData(gender)$gender == 'M')
         X <- X - mean(X)
         X <- cbind(X/(sqrt(sum(X^2))))
         chip <- annotation(gender)
             
         ## Extract regions and labs for plotting purposes
         lregions <- sapply(rownames(Y),FUN=function(s) strsplit(s,'_')[[1]][2])
         llabs <- sapply(rownames(Y),FUN=function(s) strsplit(s,'_')[[1]][3])
     
         ## Dimension of the factors
         m <- nrow(Y)
         n <- ncol(Y)
         p <- ncol(X)
     
         Y <- scale(Y, scale=FALSE) # Center gene expressions
     
         cIdx <- which(featureData(gender)$isNegativeControl) # Negative control genes
     
         ## Prepare plots
         annot <- cbind(as.character(sign(X)))
         colnames(annot) <- 'gender'
         plAnnots <- list('gender'='categorical')
         lab.and.region <- apply(rbind(lregions, llabs),2,FUN=function(v) paste(v,collapse='_'))
         gender.col <- c('-1' = "deeppink3", '1' = "blue")
     
         ## Remove platform effect by centering.
     
         Y[chip=='hgu95a.db',] <- scale(Y[chip=='hgu95a.db',], scale=FALSE)
         Y[chip=='hgu95av2.db',] <- scale(Y[chip=='hgu95av2.db',], scale=FALSE)
     
         ## Prepare control samples
     
         scIdx <- matrix(-1,84,3)
         rny <- rownames(Y)
         added <- c()
         c <- 0
     
         ## Replicates by lab
         for(r in 1:(length(rny) - 1)){
             if(r %in% added)
                 next
             c <- c+1
             scIdx[c,1] <- r
             cc <- 2
             for(rr in seq(along=rny[(r+1):length(rny)])){
                 if(all(strsplit(rny[r],'_')[[1]][-3] ==  strsplit(rny[r+rr],'_')[[1]][-3])){
                     scIdx[c,cc] <- r+rr
                     cc <- cc+1
                     added <- c(added,r+rr)
                 }
             }   
        }
       scIdxLab <- scIdx
     
         scIdx <- matrix(-1,84,3)
         rny <- rownames(Y)
         added <- c()
         c <- 0
     
         ## Replicates by region
         for(r in 1:(length(rny) - 1)){
             if(r %in% added)
                 next
             c <- c+1
             scIdx[c,1] <- r
             cc <- 2
             for(rr in seq(along=rny[(r+1):length(rny)])){
                 if(all(strsplit(rny[r],'_')[[1]][-2] ==  strsplit(rny[r+rr],'_')[[1]][-2])){
                     scIdx[c,cc] <- r+rr
                     cc <- cc+1
                     added <- c(added,r+rr)
                 }
             }
         }
         scIdx <- rbind(scIdxLab,scIdx)
     
         ## Number of genes kept for clustering, based on their variance
         nKeep <- 1260
     
         ## Prepare plots
         annot <- cbind(as.character(sign(X)))
         colnames(annot) <- 'gender'
         plAnnots <- list('gender'='categorical')
         lab.and.region <- apply(rbind(lregions, llabs),2,FUN=function(v) paste(v,collapse='_'))
         gender.col <- c('-1' = "deeppink3", '1' = "blue")
     
         ## Remove platform effect by centering.
     
         ## Correction
         sRes <- naiveReplicateRUV(Y, cIdx, scIdx, k=20)
     
         ## Clustering on the corrected data
         sdY <- apply(sRes$cY, 2, sd)
         ssd <- sort(sdY,decreasing=TRUE,index.return=TRUE)$ix
         kmresRep <- kmeans(sRes$cY[,ssd[1:nKeep],drop=FALSE],centers=2,nstart=200)
         vclustRep <- kmresRep$cluster
         RepScore <- clScore(vclustRep,X)
     
         ## Plot of the corrected data
         svdResRep <- NULL
         svdResRep <- svdPlot(sRes$cY[, ssd[1:nKeep], drop=FALSE],
                              annot=annot,
                              labels=lab.and.region,
                              svdRes=svdResRep,
                              plAnnots=plAnnots,                    
                              kColors=gender.col, file=NULL)   
     }