bootMedians: Bootstrap two gene sets and compare their values
In steveped/funsForLu: Extra functions for the specific analysis


A function for comparing two sets of genes without relying on any distributional assumptions.

bootMedians(data, testIds, refIds, idCol = 1L, binCol = "lengthBin",
  valCols = "TPM", nGenes = 1000L, nBoot = 100L, minGenes = 200L, ...,
  na.rm = TRUE, replace = TRUE, maxP)

`data`	A data frame containing all the data required
`testIds`	A `character vector` with the test set of Ids
`refIds`	A `character vector` with the reference set of Ids
`idCol`	The column in `data` containing the Ids in the vectors `testIds` and `refIds`. Can be specified as an integer position or as a character (regular expression).
`binCol`	The column in `data` containing the bin allocations for each gene. Can also be specified as an integer or by name.
`valCols`	Regular expression or integers defining the columns in `data` containing the values of interest.
`nGenes`	`integer`. The number of genes to sample at each iteration. Values greater than the number of `testIds` will automatically be capped at the number of `testIds`.
`nBoot`	`integer`. The number of bootstrap iterations to be performed
`minGenes`	`integer`. The minimum number of IDs required to conduct a bootstrap procedure with any meaning.
`...`	Passed to the function `mean` internally
`na.rm`	`logical`. Also passed internally to the function `mean`
`replace`	`logical`. Should the bootstrap use sampling with replacement (`replace = TRUE`) or without (`replace = FALSE`)?
`maxP`	The maximum probability (weight) allowed for an individual gene in the reference set. Defaults to `1/nGenes`

This function breaks the supplied data.frame into two sets of test IDs & reference IDs. The data.frame must contain a column (binCol) which classifies each ID into a bin. The probabilities of bin membership in the test IDs are then used for sampling during the bootstrap procedure.

The values to be bootstrapped must be specified in the argument valCols, and this can be a regular expression used to extract a set of columns, or integers specifying exact columns in the supplied data.frame.

The function itself will sample the same number of IDs (nGenes) from each dataset, based on the probabilities of bin membership in the test dataset. At each bootstrap iteration, the median values for each column specified will be returned from both datasets, with the reference values then subtracted from the tested values. This allows direct comparison of these values as they will be drawn from similar distributions based on the binning variable used.
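The sampling scheme described above can be sketched in base R roughly as follows. This is a minimal illustration, not the package's implementation: the toy data, the variable names (`binProbs`, `refWeights`, `diffs`) and the choice to sample the test set uniformly are all assumptions. Only the default column names `lengthBin` and `TPM` come from the usage above.

```r
set.seed(1)

# Hypothetical toy data; column names mirror the defaults binCol = "lengthBin", valCols = "TPM"
data <- data.frame(
  id = paste0("g", 1:600),
  lengthBin = rep(c("short", "medium", "long"), each = 200),
  TPM = rexp(600),
  stringsAsFactors = FALSE
)
testIds <- data$id[c(1:20, 201:260, 401:420)]
refIds <- setdiff(data$id, testIds)

test <- data[data$id %in% testIds, ]
ref <- data[data$id %in% refIds, ]

# Probability of bin membership in the test set
testTab <- table(test$lengthBin)
binProbs <- setNames(as.numeric(testTab) / sum(testTab), names(testTab))

# Per-gene sampling weight in the reference set (assumed scheme):
# each bin's test-set probability is shared equally among the reference genes in that bin
refTab <- table(ref$lengthBin)
refBinSizes <- setNames(as.numeric(refTab), names(refTab))
refWeights <- unname(binProbs[ref$lengthBin] / refBinSizes[ref$lengthBin])

nGenes <- 100L
nBoot <- 200L

# At each iteration, take the median of each sample and subtract reference from test
diffs <- replicate(nBoot, {
  testSample <- sample(test$TPM, nGenes, replace = TRUE)
  refSample <- sample(ref$TPM, nGenes, replace = TRUE, prob = refWeights)
  median(testSample) - median(refSample)
})

# Proportion of sampled differences which are > 0
p <- mean(diffs > 0)
```

Weighting the reference genes this way means the reference sample is expected to have the same bin composition as the test set, which is what makes the two medians comparable.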

If any genes have a probability of being resampled > maxP, they may exert undue influence on the results. If any are found, the process will stop to allow removal of this grouping. Alternatively, the value for maxP can be raised, up to a maximum of 1, which would represent maximum permissibility.
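As an illustration of the kind of check this describes, the sketch below flags bins whose per-gene sampling probability exceeds `maxP`. The summary table `refBins` and its columns (`nRef`, `pTest`, `pGene`) are hypothetical and the numbers are made up; the real function stops rather than simply returning the offending bins.

```r
nGenes <- 1000L
maxP <- 1 / nGenes  # the documented default

# Hypothetical per-bin summary: number of reference genes and test-set bin probability
refBins <- data.frame(
  bin = c("short", "medium", "long"),
  nRef = c(600, 500, 2),
  pTest = c(0.50, 0.45, 0.05),
  stringsAsFactors = FALSE
)

# Assumed per-gene probability: the bin probability shared equally within the bin
refBins$pGene <- refBins$pTest / refBins$nRef

# Bins whose genes would be resampled with probability > maxP
overWeighted <- refBins[refBins$pGene > maxP, ]
```

Here the sparsely populated `long` bin is the only one flagged, since its two reference genes would each carry a weight of 0.025.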

A list with components:

$samples The sampled differences in the median values
$p The proportion of sampled differences which are > 0
$nGenes The number of genes sampled at each bootstrap iteration
$nBoot The number of bootstrap iterations
$sampleSizes A named numeric vector with the sample sizes for each dataset
$testBins The distributions of genes amongst the binning variable in the set of test IDs
$refBins The distributions of genes amongst the binning variable in the set of reference IDs. The final column represents the sampling probability for each individual gene in the corresponding bin
$missingBins The bins which are not represented in both the test and reference sets. If any are found, a non-fatal warning message will be printed while the process runs.