# A package for clinical proteomics profiling data analysis

### Description

This package is still under development but it is intended to provide
a range of tools for analysing clinical genomics,
methylation and proteomics, data with the non-standard repeated
expression measurments
arising from technical replicates.
Most of these studies are observational
case-control by design and the results of analyses must be
appropriately adjusted for confounding factors and
imbalances in the data.
This regression-type problem is different from the
regression problem in
`limma`

, in which all the `covariates`

are some kind
of `contrasts`

and are
therefore important. Our method is specifically suitable for
analysing single-channel microarrays and proteomics data with repeated
probe, or peak
measurements, especially
in the case where
there is no one-to-one correspondence between cases and controls
and the data cannot be analysed as log-ratios.
In the current version (version 0.1.0),
we are more concerned with the problem of sample size calculations for these data sets.
But some tools for pre-processing of the repeated peaks data,
including tools for checking for the consistency in the
number
of replicates across samples, the consistency of
the peak information
between replicate
spectra and tools for data formatting
and averaging, are included.
`clippda`

also implements
a routine
for evaluating differential-expression between cases and controls, especially for data
in which each sample is assayed more than once, and are obtained from
studies which are observational, or
those for which the data are heterogeneous
(e.g. data for cancer studies in which controls are not directly sampled,
but are obtained
from samples from suspected cases that turn out to be benign disease,
after an operation, for example.
In this case there could be serious
imbalances in demographics between the cases and controls).
The test statistics considered are derived from the methods developed by Nyangoma
et al. (2009). These new methods for evaluating differential-expression
are compared with the empirical Bayes method in the
`limma`

package.
To limit the number of false positive discoveries, we control
the tail probability of the proportion of false positives, (TPPFP).
Further details can be found in the package `vignette`

.

### Details

Package: | anRpackage |

Type: | Package |

Version: | 0.1.0 |

Date: | 2009-03-25 |

License: | GPL (>=2) |

LazyLoad: | yes |

This package provides a method for calculating the sample size required when planning
proteomic profiling studies using repeated peak measurements. At the planning stage,
an experimenter typically does not yet have information on the heterogeneity of the data expected.
We provide a method which makes it possible to input, and adjust for the effect of,
the expected
heterogeneity in the sample size
calculations. The code for calculating sample size is
the `sampleSize`

function. It requires the computation of the
between- and within-sample variations, the differences in mean between
cases and controls, the intraclass correlations
between duplicate
peak data, and the heterogeneity correction factor. These can be computed
from pilot data using the functions:
`betweensampleVariance`

, `withinsampleVariance`

,
`replicateCorrelations`

and `FisherInformation`

, respectively.
Before the data can be analysed using these functions, it must be adequately
pre-processed,
and this package provides a number of tools for doing this.
We provide a grid of the clinically important differences versus
protein variances (with superimposed sample size contours). On this grid, we
have plotted sample sizes computed using parameters from several real-life
proteomic data
from a range of cancer-types, fluid-types, cancer stages and experimental
protocols of SELDI and MALDI. These values provide sample size
ranges which may be used to estimate the
number of samples required. The example below takes you through some of the processes
of sample size calculations using this package.

### Author(s)

Stephen Nyangoma

Maintainer: S Nyangoma s.o.nyangoma@bham.ac.uk

### References

Birkner, et al. Issues of processing and multiple testing of SELDI-TOF MS Proteomic Data. Stat Appl Genet Mol Biol 2006, 5.1

Nyangoma, Stephen O.; Collins, Stuart I.; Altman, Douglas G.; Johnson, Philip; and Billingham, Lucinda J. (2012) Sample Size Calculations for Designing Clinical Proteomic Profiling Studies Using Mass Spectrometry, Statistical Applications in Genetics and Molecular Biology: Vol. 11: Iss. 3, Article 2.

Nyangoma SO, et al. Multiple Testing Issues in Discriminating Compound-Related Peaks and Chromatograms from High Frequency Noise, Spikes and Solvent-Based Noise in LC - MS Data Sets. Stat Appl Genet Mol Biol 2007, 6, 1, Article 23

Smyth GK, et al.: Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics 2005, 21, 2067 - 75

Smyth GK: Linear models and emperical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3, 1, Article 3

### Examples

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 | ```
#########################################################################################
# The routine for calculating sample size required when planning a clinical proteomic
# profiling study is provided in the sampleSize function. First, this functions performs
# computations for sample size parameters, that include: the biological variance, the
# tecnical variance, the differences to be estimated, the intraclass correlation
# (if unknown). These computations are done ase follows:
#########################################################################################
#########################################################################################
# biological variation, difference to be estimated, and the p-values for differential-
# expression are computed using the generic function: betweensampleVariance
# It requires data of a aclinicalProteomicsData class, as input
########################################################################
###################################################
# Creating data of a aclinicalProteomicsData class
###################################################
data(liverdata)
data(liver_pheno)
OBJECT=new("aclinicalProteomicsData")
OBJECT@rawSELDIdata=as.matrix(liverdata)
OBJECT@covariates=c("tumor" , "sex")
OBJECT@phenotypicData=as.matrix(liver_pheno)
OBJECT@variableClass=c('numeric','factor','factor')
OBJECT@no.peaks=53
Data=OBJECT
#################################################################################
# Data manipulation carried out internally by the betweensampleVariance function
#################################################################################
rawData <- proteomicsExprsData(Data)
no.peaks <- Data@no.peaks
JUNK_DATA <- sampleClusteredData(rawData,no.peaks)
JUNK_DATA=negativeIntensitiesCorrection(JUNK_DATA)
# we use the log base 2 expression values
LOG_DATA <- log2(JUNK_DATA)
#######################################################################################
# compute biological variation, difference to be estimated, and the p-values
#######################################################################################
BiovarDiffSig <- betweensampleVariance(OBJECT)
BiovarDiffSig
#####################
# technical variance
#####################
sample_technicalVariance(Data)
#######################################################################################
# heterogeneity correction-factor is the second diagonal element of the output
# matrix from the fisherInformation function, i.e. from the expected Fisher Information
########################################################################################
Z <- as.vector(fisherInformation(Data)[2,2])/2
Z
###################################################################################
# The outputs of these functions are converted into statistics used in
# the sample size calculations using a wraper function sampleSizeParameters
# it gives the consensus parameter values. You must specify the p-value and the
# intraclass correation, cutoff. The description of how these parameters
# are chosen is given in Nyangoma, et al. (2009).
###################################################################################
intraclasscorr <- 0.60 #cut-off for intraclass correlation
signifcut <- 0.05 #significance cut-off
sampleparameters=sampleSizeParameters(Data,intraclasscorr,signifcut)
#######################################################################################
# SAMPLE SIZE CALCULATIONS
#The function sampleSize calculates the protein variance, difference to be estimated,
# the technical varaince. These parameters are computed from statistics of peaks with
# medium biological variation.
# It also gives sample sizes for beta=c(0.90,0.80,0.70) and alpha = c(0.001, 0.01,0.05)
#######################################################################################
samplesize <- sampleSize(OBJECT,intraclasscorr,signifcut)
samplesize
``` |