getRepPS: Select a representative probe set for a gene based on e.g....
In jotsetung/unsoRted: Unsorted R functions and utilities


View source: R/probesetSelection.R

This function selects a representative probe set for a gene based on a criterion such as the highest average expression across samples, the highest variance, the lowest p-value, or any other score that can be calculated on, e.g., expression values. Additional information can be incorporated into the selection, namely whether the transcript detected by the probe set is protein coding or (for Affymetrix ST microarrays) the number of probes on the microarray that detect that transcript.

getRepPS( x, annot, FUN=function( z ){ mean( z ) * sd( z ) },
          order.decreasing=TRUE, v=TRUE,
          gene.id.col="gene_id", probeset.id.col="probeset_id",
          transcript.biotype.col="transcript_biotype",
          prefer.protein.coding=TRUE,
          probe.count.col="probe_count", probe.count.cut=c( 9, 7, 5 ),
          mc.cores=getOption( "mc.cores", 2L ) )

`x`	Either a matrix with numeric values (rownames being probe set ids) or an “ExpressionSet”.
`annot`	A data.frame with annotations for the probe set ids. Has to provide at least two columns named according to the parameters “probeset.id.col” and “gene.id.col” specifying the probe set and gene ids, respectively.
`FUN`	The function to be applied to the numerical values for each row in x to calculate the score based on which the representative probe set will be selected.
`order.decreasing`	Whether the scores calculated by “FUN” should be sorted in decreasing (TRUE) or increasing (FALSE) order. The representative probe set is the first one in the ordered list.
`v`	Verbosity.
`gene.id.col`	The name of the column in “annot” that contains the gene ids (can be non-unique).
`probeset.id.col`	The name of the column in “annot” that contains the probe set ids (have to be unique).
`transcript.biotype.col`	Optional; the name of the column in “annot” that specifies whether the transcript detected by the probe set is protein coding (internally the function greps for “protein” in the respective column).
`prefer.protein.coding`	Boolean specifying whether protein coding transcripts should be preferred over non-coding transcripts.
`probe.count.col`	The name of the column in “annot” that provides the number of probes for each probe set.
`probe.count.cut`	A numerical vector specifying potential cut-off values for the probe counts. With the default setting, the function first checks whether a gene has any probe sets with 9 or more probes and, if so, excludes all probe sets with fewer. The function iterates through the numerical vector and applies this search for each cut-off value. This selects more stable probe sets for each gene, i.e. probe sets with more probes, as the variance of a probe set across technical replicates decreases with the number of probes (unpublished observation).
`mc.cores`	The number of cores that should be used to process the code.
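The cut-off search described for “probe.count.cut” can be illustrated with a small base R sketch. This is not the package's code: the helper name filterByProbeCount and the example probe counts are hypothetical, and the sketch assumes that a cut-off is only applied when at least one probe set of the gene reaches it.

```r
## Hypothetical helper (not part of the package): apply each cut-off in turn
## to the probe counts of the probe sets of ONE gene.
filterByProbeCount <- function(counts, cuts = c(9, 7, 5)) {
    keep <- rep(TRUE, length(counts))
    for (cut in cuts) {
        ## probe sets still under consideration that reach this cut-off
        above <- keep & counts >= cut
        ## only restrict the selection if something would remain
        if (any(above)) keep <- above
    }
    keep
}

## three probe sets of one gene with 4, 8 and 7 probes
counts <- c(ps1 = 4, ps2 = 8, ps3 = 7)
filterByProbeCount(counts)
```

In this sketch no probe set reaches 9 probes, so the first cut-off changes nothing; the cut-off of 7 then removes ps1. If no probe set reached even the smallest cut-off, all probe sets would be kept.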

This function is not limited to the use case described above. More generally, it can be used to select one entry among many for the same entity, based on provided or calculated values (see examples below). In detail, the function uses the entries in column “gene.id.col” to split the data (using split) and performs the representative probe set search using mclapply. This approach to selecting a representative probe set for a given gene was developed and used in the publications listed in the reference section.
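The split-and-select idea can be re-stated in a few lines of base R. The sketch below is an illustration only and not the function's actual implementation: the toy data and the names score, ps.by.gene and rep.ps are made up, the protein coding and probe count filters are left out, and vapply stands in for mclapply.

```r
## Illustration only: "split by gene, score each probe set, order, take the
## first". Toy data; not the package's code.
annot <- data.frame(gene_id = c("g1", "g1", "g2"),
                    probeset_id = c("p1", "p2", "p3"),
                    stringsAsFactors = FALSE)
x <- matrix(c(1, 3,
              5, 9,
              2, 2), ncol = 2, byrow = TRUE,
            dimnames = list(annot$probeset_id, NULL))

## same form as the default FUN
score <- function(z) mean(z) * sd(z)

## probe set ids grouped by gene id
ps.by.gene <- split(annot$probeset_id, annot$gene_id)

## vapply stands in for parallel::mclapply in this sketch
rep.ps <- vapply(ps.by.gene, function(ids) {
    s <- apply(x[ids, , drop = FALSE], 1, score)
    ids[order(s, decreasing = TRUE)[1]]
}, character(1))
rep.ps
```

The result is a character vector of probe set ids named by gene id, which is the shape of return value described in the Value section.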

A character vector with the probe set ids of the representative probe set for each gene, with the gene ids used as names of the character vector.

Johannes Rainer

Rainer J, Lelong J, Bindreither D, Mantinger C, Ploner C, Geley S, Kofler R. (2012) Research resource: transcriptional response to glucocorticoids in childhood acute lymphoblastic leukemia. Mol Endocrinol. 26:178–93.

Aneichyk T, Bindreither D, Mantinger C, Grazio D, Goetsch K, Kofler R and Rainer J (2013) Translational profiling in childhood acute lymphoblastic leukemia: no evidence for glucocorticoid regulation of mRNA translation. BMC Genomics 14, 844.

http://bioinfo.i-med.ac.at

annotation <- data.frame( gene_id=c( "a", "a", "b", "b", "a" ), probeset_id=c( "a1", "a2", "b1", "b2", "a3" ), stringsAsFactors=FALSE )

data <- matrix( c( 4, 5, 2, 1, 3, 4, 4, 7, 3, 2 ) , ncol=2, byrow=TRUE )
rownames( data ) <- annotation$probeset_id

data

## return for each gene the probe set with the highest average expression
best.ps <- getRepPS( x=data, annot=annotation, FUN=function( z ){ mean( z ) } )

data[ best.ps,  ]
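The expected outcome of this example can be cross-checked without the package. The following base R sketch rebuilds the same data and picks, for each gene, the probe set with the largest row mean; the names avg and expected are introduced here for illustration only. Assuming getRepPS behaves as documented, best.ps should hold the same probe set ids.

```r
## Same toy data as in the example above
annotation <- data.frame(gene_id = c("a", "a", "b", "b", "a"),
                         probeset_id = c("a1", "a2", "b1", "b2", "a3"),
                         stringsAsFactors = FALSE)
data <- matrix(c(4, 5, 2, 1, 3, 4, 4, 7, 3, 2), ncol = 2, byrow = TRUE)
rownames(data) <- annotation$probeset_id

## average expression per probe set (named by probe set id)
avg <- rowMeans(data)

## for each gene, the id of the probe set with the largest average
expected <- vapply(split(avg, annotation$gene_id),
                   function(m) names(m)[which.max(m)],
                   character(1))
expected
```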