generateData: Generate a list of dataframes with TPM data by Gene

View source: R/utility_functions.R

generateData    R Documentation

Generate a list of dataframes with TPM data by Gene

Description

generateData generates a list of data by gene that will be needed for downstream DTU compositional analysis

Usage

generateData(
  x,
  dat,
  nsamp,
  abundance,
  abData,
  abCompDatasets = NULL,
  useExistingOtherGroups,
  useOtherGroups = FALSE,
  useExistingMajorTrans = TRUE,
  infReps = "none",
  ninfreps = NA,
  samps = NULL,
  CompMI = FALSE
)

Arguments

x

is the name of a gene of interest. Genes that have <2 transcripts or a total expression of 0 across all samples are filtered out before calling this function, since they can never be used for any kind of DTU analysis.

dat

is the observed count or TPM data. Usually this is the output from prepareData that has additionally been filtered by excluding genes with 1 transcript or those that have a total expression level of 0.

nsamp

is the number of biological samples/replicates

abundance

is TRUE/FALSE and indicates whether the data is abundance (TPM) or not. If abundance=F, the length information will also be output for each gene.

abData

is the abundance data as would be used in dat. This argument is only needed if abundance=F and the results are being run on the count data. This is used to generate the “offset”, which is not currently used.

abCompDatasets

is the list of dataframes output by running generateData on the abundance data. This argument is only non-NULL when abundance=F (i.e., when running on count data) and is only needed to ensure that the calculated “Other” transcripts and major transcripts for each gene are the same when count data is input as when TPM data is input. Other Groups are currently not used.

useExistingOtherGroups

is a TRUE/FALSE indicator. If TRUE, the function uses the existing other groups from abCompDatasets, regardless of the RTA for that particular dataset. This keeps the other groups the same so that results can be compared more easily. It is used for the power analysis, when the other transcript groups should be the same regardless of the current data.

useOtherGroups

is a TRUE/FALSE indicator of whether other groups should be used or not. Default is FALSE.

useExistingMajorTrans

is a TRUE/FALSE indicator of whether to load MajorTrans information from the existing input file. Useful for the power analyses from the paper or when generating the files corresponding to Bootstrap samples.

infReps

is a character variable indicating what kind of inferential replicates (if any) are to be analyzed by the current function call. The value should be one of "none", "Boot", or "Gibbs". Default is "none".

ninfreps

is the number of inferential replicates being used by the current call. Default is NA, corresponding to none.

samps

is an optional vector containing the sample names. This must be specified unless the sample names are exactly paste0("Sample", 1:nsamp), with none missing.

CompMI

is a TRUE/FALSE indicator of whether datasets for the multiple imputation based analysis are being used. If TRUE, this will add columns for transcripts that may be missing in the inferential replicates that weren't missing in the non-inferential replicate data. Default is FALSE.
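
Two of the conventions described in the arguments above can be illustrated in base R: the pre-filtering of genes expected before generateData is called, and the check for whether samps needs to be supplied. This is only a sketch on toy data. The column names gene_id and tx_id and the one-row-per-transcript layout are assumptions made for illustration, not the documented output format of prepareData.

```r
# Hypothetical toy input: one row per transcript, one column per sample.
# The gene_id / tx_id column names are assumptions for this sketch only.
dat <- data.frame(
  gene_id = c("geneA", "geneA", "geneB", "geneC", "geneC"),
  tx_id   = c("txA1", "txA2", "txB1", "txC1", "txC2"),
  Sample1 = c(10, 5, 7, 0, 0),
  Sample2 = c(12, 3, 9, 0, 0),
  stringsAsFactors = FALSE
)
samp_cols <- c("Sample1", "Sample2")

# Number of transcripts and total expression per gene.
n_tx       <- tapply(dat$tx_id, dat$gene_id, length)
gene_total <- tapply(rowSums(dat[, samp_cols]), dat$gene_id, sum)

# Keep genes with at least 2 transcripts and non-zero total expression.
keep       <- as.vector(n_tx >= 2 & gene_total > 0)
keep_genes <- names(n_tx)[keep]
dat_filtered <- dat[dat$gene_id %in% keep_genes, ]

# samps only needs to be supplied when the sample names differ from the
# default pattern paste0("Sample", 1:nsamp).
nsamp          <- 4
default_names  <- paste0("Sample", 1:nsamp)
observed_names <- c("Sample1", "Sample2", "Sample4", "Sample5")  # Sample3 missing
needs_samps    <- !identical(observed_names, default_names)
samps          <- if (needs_samps) observed_names else NULL
```

In this toy data geneB is removed because it has a single transcript and geneC because its total expression is 0, so only geneA would be passed on as x.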

Details

generateData exports a gene-wise list of all data that will be needed for downstream compositional analysis. This can be done using DRIMSeq's filters or with an approach we considered based on "OtherGroups". The OtherGroups approach combines any transcripts that have <5% RTA across all samples into an “Other” category to ensure proper computation can be done downstream. Note that if exactly one transcript has <5% RTA, it is dropped, since there is no other low-RTA transcript to combine it with and we would not want to combine it with a high-RTA transcript. generateData also computes the MajorTranscript for each gene, which is the transcript with the highest expression level across all samples, and stores it as an attribute. The MajorTranscript is always computed based on the abundance data. For this reason, generateData must be run on the abundance data first, and those results then used for the count data.
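
The grouping rule described above can be sketched in base R for a single gene. This is not the package's implementation. It assumes that RTA is each transcript's share of the gene's total TPM summed over all samples, which is one possible reading of "<5% RTA across all samples". The transcript names and values are made up.

```r
# Toy TPM data for one gene: one row per sample, one column per transcript.
tpm <- data.frame(
  tx1 = c(60, 55, 70),
  tx2 = c(35, 40, 25),
  tx3 = c(3, 2, 3),
  tx4 = c(2, 3, 2),
  row.names = paste0("Sample", 1:3)
)

# Assumed RTA definition: share of the gene's total expression over all samples.
tx_totals <- colSums(tpm)
rta       <- tx_totals / sum(tx_totals)

low  <- names(rta)[rta < 0.05]        # candidates for the "Other" group
keep <- setdiff(names(rta), low)

out <- tpm[, keep, drop = FALSE]
if (length(low) >= 2) {
  # Two or more low-RTA transcripts are summed into a single Other column.
  out$Other <- rowSums(tpm[, low, drop = FALSE])
}
# With exactly one low-RTA transcript, it is simply dropped (no Other column).

# Major transcript: highest total expression across all samples.
major <- names(which.max(tx_totals))
```

Here tx3 and tx4 each fall below 5% and are combined, while tx1 is the major transcript. Had only tx4 been below the threshold, it would have been dropped instead of forming an Other column on its own.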

Value

A list with one element per gene containing TPM or count information for each transcript and the other transcript group. Each element is a dataframe with one row per sample and one column per transcript that is not combined into Other (if the OtherGroups are used). If the data is at the TPM level, each element of the list holds the TPM values for that gene, broken down by transcript and "Other". If the data is at the count level, each element of the list has two elements corresponding to Counts and Lengths. The list of transcripts that make up the Other category can be viewed in the attribute "OtherTrans"; the full list of transcripts for that gene is in "FullTrans", and the list of transcripts that did not contribute to the Other category is in "NotOtherTrans".
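
As a sketch of how such a result might be inspected, the following builds a mock list element by hand and reads the three attributes named above with base R's attr(). The mock object is an assumption about the shape of the output, namely that the attributes are attached to each per-gene dataframe. It is not produced by generateData itself.

```r
# Mock TPM-level element for one gene (hand-built for illustration only).
gene_tpm <- data.frame(
  tx1   = c(60, 55),
  tx2   = c(35, 40),
  Other = c(5, 5),
  row.names = c("Sample1", "Sample2")
)
attr(gene_tpm, "FullTrans")     <- c("tx1", "tx2", "tx3", "tx4")
attr(gene_tpm, "OtherTrans")    <- c("tx3", "tx4")
attr(gene_tpm, "NotOtherTrans") <- c("tx1", "tx2")

res <- list(geneA = gene_tpm)

# Read the transcript bookkeeping back from the list element.
other_tx     <- attr(res[["geneA"]], "OtherTrans")
not_other_tx <- attr(res[["geneA"]], "NotOtherTrans")
full_tx      <- attr(res[["geneA"]], "FullTrans")

# In this mock, the two groups together make up the full transcript list.
covers_all <- setequal(union(other_tx, not_other_tx), full_tx)
```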


skvanburen/CompDTUReg documentation built on Jan. 23, 2025, 9:01 a.m.