concensusDataSetFromFile: Set up analysis container for ConcensusGLM


Description

Set up analysis container for ConcensusGLM using path to data as input. Performs validity checks on the input. Optionally annotates input with experimental metadata from annotation_filename.

Usage

concensusDataSetFromFile(data_filename, annotation_filename = NULL,
  output_path = ".", controls = NULL, rename = NULL, test = FALSE,
  checkpoint = FALSE, threshold = 100, spike_in = "^intcon",
  pseudostrains = TRUE, ...)

Arguments

data_filename

Character. Path to a table of counts.

annotation_filename

Character. Optional. Path to experimental annotations. These will be joined to the input data and used for batch correction.

output_path

Character. Path to directory where you want the analysis output. This is also where checkpoints and logs from cluster execution are stored. Default is current working directory.

controls

Named list of Characters with elements positive and negative. Adds logical columns called positive_control and negative_control to the data, based on regular expression matches.

test

Logical. Run in test mode? If so, only reads the first 5 million lines of data_filename.

checkpoint

Logical. Save intermediate results as checkpoints?

threshold

Numeric. Strains below this total count threshold will be discarded; plates below 1000 x threshold will be discarded. Default 100. Set to 0 to skip this filtering.

spike_in

Character. A regular expression to match spike-in controls.

pseudostrains

Logical. Make pseudostrains such as "total", which is the sum of all non-spike-ins.

Details

This creates the concensusDataSet object from data_filename on which downstream analysis is carried out.
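
For example, a typical call might look like the following sketch; the file names, compound names and regular expressions are hypothetical and should be replaced with your own:

library(concensusGLM)

## Hypothetical inputs: a counts CSV and a plate annotation CSV
cds <- concensusDataSetFromFile(
  data_filename       = "counts.csv",
  annotation_filename = "plate-annotations.csv",
  output_path         = "analysis",
  controls            = list(negative = "^DMSO$", positive = "^rifampicin$"),
  checkpoint          = TRUE
)
class(cds)   # should include "concensusDataSet"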

The input CSV should be found at data_filename. It must have at least the headers "id", "compound", "concentration", "strain", "plate_name", "count" and "well", with one row per strain-compound-concentration-plate_name-well combination. Together, "id", "plate_name" and "well" define unique experimental samples: "id" refers to sequencing (technical) replicates, if any, while "plate_name" and "well" identify biological replicates (recommended). Any given condition should have at least 2 replicates of some kind. If "row" and "column" are present but "well" is not, "well" is constructed by concatenating "row" and "column".
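
As an illustration, a valid input has (at least) the columns below; the values are invented, and the commented line shows how "well" would be derived from "row" and "column" if needed:

## Toy example of the expected layout (all values invented)
counts <- data.frame(
  id            = "run1",
  compound      = "rifampicin",
  concentration = 0.5,
  strain        = c("strainA", "strainB"),
  plate_name    = "plate01",
  well          = "A1",
  count         = c(1520, 310),
  stringsAsFactors = FALSE
)

## If the file has "row" and "column" instead of "well":
## counts$well <- paste0(counts$row, counts$column)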

First, this function loads the CSV at data_filename; this may take some time for a large file. It then checks that the minimum set of headers is present.
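
Conceptually, the load-and-check step amounts to something like this sketch (not the package's literal code; file name hypothetical):

counts   <- read.csv("counts.csv", stringsAsFactors = FALSE)
required <- c("id", "compound", "concentration", "strain",
              "plate_name", "count", "well")
missing  <- setdiff(required, names(counts))
if (length(missing) > 0) {
  stop("Missing required columns: ", paste(missing, collapse = ", "))
}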

It then checks for either a negative_control column or a list of control compounds supplied to the controls argument.
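
That check can be thought of roughly as follows, where counts stands for the loaded data and controls for the argument described above (a sketch, not the implementation):

has_negatives <- "negative_control" %in% names(counts) ||
  (!is.null(controls) && !is.null(controls$negative))
if (!has_negatives) {
  stop("Provide a negative_control column or a 'negative' element in controls")
}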

Under-represented (assumed to be spurious) strains and plates (as defined by threshold) are removed.
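
A rough base-R sketch of this filtering (the package's internals may differ):

threshold <- 100   # the default

## Drop strains whose total count falls below threshold
strain_totals <- tapply(counts$count, counts$strain, sum)
counts <- counts[counts$strain %in% names(strain_totals)[strain_totals >= threshold], ]

## Drop plates whose total count falls below 1000 * threshold
plate_totals <- tapply(counts$count, counts$plate_name, sum)
counts <- counts[counts$plate_name %in% names(plate_totals)[plate_totals >= 1000 * threshold], ]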

Then, pseudo-strains are built if requested (the default). So far, the only pseudo-strain defined is "total", which is the total count per well of non-spike-in strains.
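
A sketch of how the "total" pseudo-strain could be derived, assuming the data frame has exactly the minimal columns listed above (the aggregate() call is illustrative only):

spike_in  <- "^intcon"   # the default spike-in pattern
non_spike <- counts[!grepl(spike_in, counts$strain), ]
totals <- aggregate(count ~ id + compound + concentration + plate_name + well,
                    data = non_spike, FUN = sum)
totals$strain <- "total"
counts <- rbind(counts, totals[, names(counts)])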

If defined, annotations (such as experimental metadata) are loaded from annotation_filename. This CSV file needs at least one column name in common with data_filename (case-sensitive), since the next step is a join on the shared columns. Also, every observation in the input data that you want to keep must be annotated.
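
In effect this is a merge on the shared column names, along these lines (file name hypothetical):

annotations <- read.csv("plate-annotations.csv", stringsAsFactors = FALSE)
shared <- intersect(names(counts), names(annotations))
counts <- merge(counts, annotations, by = shared)   # rows without a matching annotation are dropped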

Finally, it adds a negative_control column if necessary and a positive_control column if defined, and checks that there are at least 2 negative control observations.
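
A hedged sketch of that final step, assuming the regular expressions in controls are matched against the compound column:

if (!"negative_control" %in% names(counts)) {
  counts$negative_control <- grepl(controls$negative, counts$compound)
}
if (!is.null(controls$positive)) {
  counts$positive_control <- grepl(controls$positive, counts$compound)
}
stopifnot(sum(counts$negative_control) >= 2)   # at least 2 negative-control observations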

Value

list of class "concensusDataSet"

See Also

newWorkflow, pipeline, concensusDataSetFromFile

