suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(singleCellFeatures))
options(scipen=1, digits=2, singleCellFeatures.progressBars="none")

The following document investigates different data standardization techniques. Previously it was shown that some sort of normalization is required for fitting sensible models, especially for intensity features.

wells     <- findWells(plates="J107-2C", well.names=c("H2", "H6"))
data      <- unlist(getSingleCellData(wells), recursive=FALSE)
cleaned   <- lapply(data, cleanData, "lower")
melted    <- lapply(cleaned, meltData)
data.mint <- prepareDataforGlm(melted[[2]]$mat$Cells, 
                               melted[[1]]$mat$Cells, test=NULL)$train

Two wells, MTOR and SCRAMBLED or plate r getBarcode(data[[1]]) are loaded, cleaned up, melted and prepared for glm fitting.

model.mint <- glm("Response ~ .", binomial, data.mint)
coeff.mint <- model.mint$coefficients
coeff.mint <- coeff.mint[order(abs(coeff.mint), decreasing=TRUE)]
head(coeff.mint, n=20)

The top 20 coefficients are printed. All are intensity features and coefficient values are extremely high.

data.drop  <- prepareDataforGlm(melted[[2]]$mat$Cells, 
                                melted[[1]]$mat$Cells, drop.sep=TRUE,
                                test=NULL)$train

model.drop <- glm("Response ~ .", binomial, data.drop)
coeff.drop <- model.drop$coefficients
coeff.drop <- coeff.drop[order(abs(coeff.drop), decreasing=TRUE)]
head(coeff.drop, n=20)

All features that separate the data by themselfes are discarded. Only quasi-separation occurs, which mos probably is due to the situation that for one class, all values are zero and for the other class, they range from zero to some nonzero value.

The top coefficients of the sbsequent fit, as well as their magnitudes, however, remains the same.

data.scal  <- normalizeData(data.drop, features="intensity",
                            method="unitInterval")
model.scal <- glm("Response ~ .", binomial, data.scal)
coeff.scal <- model.scal$coefficients
coeff.scal <- coeff.scal[order(abs(coeff.scal), decreasing=TRUE)]
head(coeff.scal, n=20)

Next, all intensity features are scaled to the unit interval. This has a considerable effect on the top 20 coefficients ot the fit. The top values are tenfold smaller and some non-intensity features are among the top 20. Complete separation within the data still occurs however.

data.rnge  <- normalizeData(data.drop, features="intensity",
                            method="scaleRange")
model.rnge <- glm("Response ~ .", binomial, data.rnge)
coeff.rnge <- model.rnge$coefficients
coeff.rnge <- coeff.rnge[order(abs(coeff.rnge), decreasing=TRUE)]
head(coeff.rnge, n=20)

Instead of svaling to the unit interval, intensity features are scaled to their respective range. The effect is very similar to the previous scaling.

data.cntr  <- normalizeData(data.drop, features="intensity",
                            method="center")
model.cntr <- glm("Response ~ .", binomial, data.cntr)
coeff.cntr <- model.cntr$coefficients
coeff.cntr <- coeff.cntr[order(abs(coeff.cntr), decreasing=TRUE)]
head(coeff.cntr, n=20)

Instead of scaling intensity features, they are centered. The resulting fit takes up the undesirable aspects of the first two approaches.

data.uvar  <- normalizeData(data.drop, features="intensity",
                            method="unitVar")
model.uvar <- glm("Response ~ .", binomial, data.uvar)
coeff.uvar <- model.uvar$coefficients
coeff.uvar <- coeff.uvar[order(abs(coeff.uvar), decreasing=TRUE)]
head(coeff.uvar, n=20)

Scaling all intensity features to have unit variance drives the coefficients to new lows. Additionally, no intensity features are among the top 20 candidates.

data.zsco  <- normalizeData(data.drop, features="intensity",
                            method="zScore")
model.zsco <- glm("Response ~ .", binomial, data.zsco)
coeff.zsco <- model.zsco$coefficients
coeff.zsco <- coeff.zsco[order(abs(coeff.zsco), decreasing=TRUE)]
head(coeff.zsco, n=20)

Z-scoring (unit variance and centering) makes little difference to only scaling to unit variance. The result again is low coefficients and no intensity features.



nbenn/singleCellFeatures documentation built on May 23, 2019, 12:24 p.m.