gtxpipe: Pipeline for routine genetic association analysis and...


View source: R/pipeline.R

Description

An implementation of a pipeline that simplifies and standardizes the analysis and report generation for routine genetic association projects.

Usage

gtxpipe(gtxpipe.models = getOption("gtxpipe.models"),
        gtxpipe.groups = getOption("gtxpipe.groups", data.frame(group = 'ITT', deps = 'pop.PNITT', fun = 'pop.PNITT', stringsAsFactors = FALSE)),
        gtxpipe.derivations = getOption("gtxpipe.derivations", {data(derivations.standard.IDSL); derivations.standard.IDSL}),
        gtxpipe.transformations = getOption("gtxpipe.transformations", data.frame(NULL)),
        gtxpipe.eigenvec,
        stop.before.make = FALSE)

Arguments

gtxpipe.models

A data frame defining association models to be fitted

gtxpipe.groups

A data frame defining (sub)groups of individuals in which to fit the models

gtxpipe.derivations

A data frame defining methods to derive analysis variables from underlying clinical data

gtxpipe.transformations

A data frame defining transformations required for the analysis variables

gtxpipe.eigenvec

A filename containing eigenvectors or other covariates to adjust for in all association models

stop.before.make

Logical whether to stop before actually fitting association models

Details

The pipeline implemented by gtxpipe takes as input some clinical data and some genotype data, performs association analyses, and outputs tables, figures, and documents summarising the results.

The association analyses that are conducted are controlled by function arguments passed to gtxpipe, and options with names that begin ‘gtx.’ or ‘gtxpipe.’. The intention is that settings that are likely to vary on a project by project basis are controlled by function arguments, and settings that are likely to be constant over multiple projects are controlled by options.

The input clinical data must be a set of plain text files inside a single directory, which is specified by options(gtxpipe.clinical) (default a subdirectory ‘clinical’ of the current working directory). [In future the intent is to support multiple directories for multiple clinical studies to be analysed together.] These are expected to have .txt extensions and to be SAS datasets exported in plain text format using SAS proc export, but other plain text files in the same format may work. The files are read using gtx::clinical.import.

The input genotype data must be a set of minimac dosage and information files, inside a single directory specified by options(gtxpipe.genotypes) (default a subdirectory ‘genotypes’ of the current working directory). [In future the intent is to support other genotype data formats.]

Analyses are conducted inside a directory specified by options(gtxpipe.analyses) (default a subdirectory ‘analyses’ of the current working directory), and the pipeline outputs are written inside a directory specified by options(gtxpipe.outputs) (default a subdirectory ‘outputs’ of the current working directory). Both directories are created if they do not already exist.
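As a minimal sketch, the four directory options named above can be set explicitly with base R's options() before calling gtxpipe. The values below simply restate the documented defaults; whether the package expects absolute or relative paths is not stated here, so the use of file.path(getwd(), ...) is an assumption.

```r
# Set the four directory options named in the text to their documented
# default locations (subdirectories of the current working directory).
# Assumption: absolute paths are acceptable values for these options.
options(
  gtxpipe.clinical  = file.path(getwd(), "clinical"),
  gtxpipe.genotypes = file.path(getwd(), "genotypes"),
  gtxpipe.analyses  = file.path(getwd(), "analyses"),
  gtxpipe.outputs   = file.path(getwd(), "outputs")
)

# Read the settings back to confirm what a later call would see
dirs <- c(
  clinical  = getOption("gtxpipe.clinical"),
  genotypes = getOption("gtxpipe.genotypes"),
  analyses  = getOption("gtxpipe.analyses"),
  outputs   = getOption("gtxpipe.outputs")
)
print(dirs)
```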

Models are defined (for the purposes of gtxpipe) as regression models on derived and transformed clinical data (derivations and transformations are defined below). Genetic variables are not explicitly included in the model definition; these are automatically added by gtxpipe. The gtxpipe.models argument must be a dataframe with the following columns (with the following information in each row): “model” (a name used for organising and reporting results); “deps” (a string with space delimited names of variables the model depends on); “fun” (a string with an R language statement of the model); “groups” (a string with space delimited subject groups [as defined in Groups below] in which to evaluate the model); “contrasts” (a string with space delimited group contrasts of interest); “cvlist” (a string with space delimited identifiers for candidate genetic variants of interest). [In future the intent is to provide an interface so that the information in “deps” is determined automatically from a specification of “fun”.]
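The column layout described above might be filled in as follows. This is a sketch only: the variable names SRVMO and SRVCFLCD are borrowed from the survival example elsewhere on this page, the model name "OS" is hypothetical, and the exact form expected for “fun” (here a model statement with no genetic term) and for empty “contrasts” and “cvlist” entries (here empty strings) are assumptions, not confirmed by this documentation.

```r
# One row per association model, using the six documented columns.
# Assumptions: the content of 'fun' and the use of "" for "no contrasts"
# and "no candidate variants" are illustrative guesses.
gtxpipe.models <- data.frame(
  model     = "OS",                                # hypothetical model name
  deps      = "SRVMO SRVCFLCD",                    # space delimited variable names
  fun       = "coxph(Surv(SRVMO, SRVCFLCD) ~ 1)",  # model statement, no genetic term
  groups    = "ITT",                               # space delimited group names
  contrasts = "",
  cvlist    = "",
  stringsAsFactors = FALSE
)

# The space delimited 'deps' string splits into individual variable names
model_deps <- strsplit(gtxpipe.models$deps, " ", fixed = TRUE)[[1]]
print(model_deps)
```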

Derivations are defined (for the purposes of gtxpipe) as methods that: (i) convert clinical data (which may not be one-row-per-subject) to analysis variables (which must be one-row-per-subject); and (ii) can be applied independently over subjects. That is, a derived analysis variable for the i-th subject must be a scalar that depends only on the clinical data for the i-th subject. An example of a derivation is to compute the highest grade of adverse event experienced by each subject, using the clinical data for all adverse events of a given type. The gtxpipe.derivations argument must be a dataframe suitable for passing as the derivations argument to the function clinical.derive. Note that all analysis variables must be derived (even if the derivation is a simple extraction from a clinical dataset). See the examples for how to write simple derivations. For efficiency, it is possible to specify the derivation of multiple variables in a single row of gtxpipe.derivations, as long as the “deps” and “data” part of the derivation are the same for all the variables.
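To illustrate the concept only (not the gtxpipe.derivations format, which is defined by clinical.derive), the adverse event example can be sketched in base R. The dataset and the column names USUBJID, AEGRADE and AEMAXGRADE are hypothetical.

```r
# Hypothetical adverse event data: several rows per subject
ae <- data.frame(
  USUBJID = c("S1", "S1", "S2", "S3", "S3", "S3"),
  AEGRADE = c(1, 3, 2, 1, 2, 2)
)

# Derivation: highest grade per subject. Each subject's value depends only
# on that subject's own rows, so the result is one row per subject.
max_grade <- aggregate(AEGRADE ~ USUBJID, data = ae, FUN = max)
names(max_grade)[2] <- "AEMAXGRADE"
print(max_grade)
```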

Transformations are defined (for the purposes of gtxpipe) as methods that convert one or more analysis variables (which must be one-row-per-subject) to another analysis variable (also one-row-per-subject). Transformations may act independently over subjects (e.g. a log transform), but also may act such that the transformed value for the i-th subject depends on the values for other subjects (e.g. a rank or quantile transform). Thus in general the action of the transformation may depend on which set or subset of subjects it is applied to. Transformations may also be used to convert data types, e.g. to create a variable of the Surv class from a time variable and a censoring indicator variable. The gtxpipe.transformations argument must be a dataframe with the following columns (with the following information in each row): “targets” (a name [or space delimited names] for the transformed variable[s]); “deps” (a string with space delimited names of variables the transformation depends on); “fun” (a string with an R language statement of the transformation).
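The difference between the two kinds of transformation can be seen in a small base R sketch. The subject identifiers and values are hypothetical; the point is only that a log transform gives the same per-subject values whichever subset it is applied to, whereas a rank transform does not.

```r
# Hypothetical one-row-per-subject analysis variable
x   <- c(S1 = 40, S2 = 10, S3 = 25, S4 = 80)
sub <- c("S1", "S3", "S4")  # a subgroup of subjects

# Log transform acts independently over subjects:
# transforming then subsetting equals subsetting then transforming
log_same <- all(log(x)[sub] == log(x[sub]))

# Rank transform depends on which subjects are included
rank_all_then_subset <- rank(x)[sub]
rank_within_subset   <- rank(x[sub])

print(log_same)
print(rank_all_then_subset)
print(rank_within_subset)
```

This is why the same transformation may need to be re-evaluated for each group of subjects rather than computed once for everyone.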

Groups are defined using R language statements that depend on derived variables. FIXME: Can they depend on transformations? The gtxpipe.groups argument must be a dataframe with the following columns (with the following information in each row): “group” (a name for the group, as referred to in Model groups and contrasts); “deps” (a string with space delimited names of variables the group definition depends on); “fun” (a string with an R language statement that evaluates to a logical for group membership). Group contrasts are specified in Models using the names of two groups, separated by a ‘/’. For this reason, and because group names are used as directory names, all group names must be simple alphanumeric.
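A groups table following this layout might look like the sketch below. The first row is the default shown in Usage; the “Active” and “Placebo” groups and the TRTGRP variable are hypothetical. The two checks (alphanumeric names, and contrasts that refer to defined groups) are illustrative and are not taken from the package.

```r
# First row is the documented default; other rows are hypothetical
gtxpipe.groups <- data.frame(
  group = c("ITT", "Active", "Placebo"),
  deps  = c("pop.PNITT", "pop.PNITT TRTGRP", "pop.PNITT TRTGRP"),
  fun   = c("pop.PNITT",
            "pop.PNITT & TRTGRP == 'Active'",
            "pop.PNITT & TRTGRP == 'Placebo'"),
  stringsAsFactors = FALSE
)

# Group names must be simple alphanumeric (they are used as directory
# names, and '/' is reserved as the contrast separator)
names_ok <- grepl("^[A-Za-z0-9]+$", gtxpipe.groups$group)

# A contrast is two group names separated by '/'
contrast        <- "Active/Placebo"
contrast_groups <- strsplit(contrast, "/", fixed = TRUE)[[1]]
contrast_ok     <- length(contrast_groups) == 2 &&
  all(contrast_groups %in% gtxpipe.groups$group)

print(names_ok)
print(contrast_groups)
```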

Descriptors is a name for the concept of storing a table of long descriptive names for individual variables and substituting those names in output displays. FIXME Document more.

The make command needs to be set. FIXME: there should be a sensible default.

FIXME list: Candidate variants should be analysed even if failing MAF or Rsq filters. Better interface for specifying the arguments (helper functions to build models, derivations etc. in the above format; currently deps have to be specified manually but this should be automatic for most cases).

FIXME: Analysis datasets are stored as csv files with two special comment rows. Hence R classes such as Surv and ordered factors are not preserved. (This is a problem.) Can write models like coxph(Surv(SRVMO,SRVCFLCD)~...) and clm(factor(BR,c("CR","PR","SD","PD"))~...) but this is tedious and inefficient. Consider applying transformations within slave process?
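The loss of class information can be reproduced with a plain CSV round trip in base R. This sketch does not model the two special comment rows used by the pipeline; the variable BR and its levels are taken from the clm example above.

```r
# An ordered factor for best response, with a meaningful level order
d <- data.frame(
  BR = factor(c("PR", "CR", "PD", "SD"),
              levels = c("CR", "PR", "SD", "PD"), ordered = TRUE)
)

# Write to CSV and read back: only the labels survive, not the class
f <- tempfile(fileext = ".csv")
write.csv(d, f, row.names = FALSE)
back <- read.csv(f)

is.ordered(d$BR)     # TRUE
is.ordered(back$BR)  # FALSE

# The level order has to be restated after reading, which is the tedious
# step the text refers to
restored <- factor(back$BR, levels = c("CR", "PR", "SD", "PD"), ordered = TRUE)
unlink(f)
```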

Note regarding deps: The deps have two purposes, firstly to make data loading and derivations more efficient by only loading/deriving the data needed, and secondly to compute analysis datasets and subsets of subjects with nonmissing data for each analysis.
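The second purpose can be sketched in base R: given a space delimited “deps” string, keep only the subjects with nonmissing values for every named variable. The dataset is hypothetical, and this illustrates the idea rather than the package's internal code.

```r
# Hypothetical one-row-per-subject analysis dataset
analysis <- data.frame(
  USUBJID  = c("S1", "S2", "S3", "S4"),
  SRVMO    = c(12.1, NA, 8.4, 20.0),
  SRVCFLCD = c(1, 0, NA, 0),
  AGE      = c(NA, 61, 58, 49)
)

# Split the deps string into variable names
deps <- strsplit("SRVMO SRVCFLCD", " ", fixed = TRUE)[[1]]

# Keep subjects with complete data for the deps only; missing AGE does not
# exclude S1 because AGE is not a dependency of this model
keep     <- complete.cases(analysis[, deps, drop = FALSE])
subjects <- analysis$USUBJID[keep]
print(subjects)
```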

WARNING: The master/slave arrangement means BAD THINGS MAY HAPPEN if you upgrade the gtx package while gtxpipe() is running.

Author(s)

Toby Johnson Toby.x.Johnson@gsk.com


tobyjohnson/gtx documentation built on Aug. 30, 2019, 8:07 p.m.