Organizes data for BaySIC functions


Creates a list object from mutation and reference data for use with BaySIC fitting and testing functions


1, ref.dat, plot = FALSE, N = NULL, silent = TRUE)



matrix; mutation input data. BaySIC requires a specific format similar to the MUT file format: an M × 7 matrix with column headings "chr", "start", "end", "id", "type", "gene", "context", where each row details an individual mutation.


a data frame or list of data frames; ref.dat represents the sequence content of each gene of interest across 32 unique trinucleotide sequence contexts, yielding a G × 34 matrix, where G is the total number of genes. If ref.dat is a single matrix, all subjects are assumed to share the same reference data. Reference data may vary from subject to subject because of different platforms or coverage. In that case, ref.dat can instead be a list of N reference data matrices, where N is the number of subjects. The names of the list elements should correspond to the ids used in dat.


logical; if TRUE, a plot summarizing the mutation data on an overall and per-subject basis is generated. Defaults to FALSE.


an integer (optional); the number of subjects represented in dat. If N=NULL and is.list(ref.dat)==FALSE, N is assumed to be the number of unique subject ids in dat. If is.list(ref.dat)==TRUE, then N=length(ref.dat).


logical; if FALSE, mutations defined as 'Synonymous' or 'Silent' will be removed from the dataset and subsequent analyses. Defaults to TRUE.
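The defaulting rule for N described above can be sketched roughly as follows. `infer_N` is a hypothetical helper written for illustration and is not part of the package. Note one assumption: in R, `is.list()` is also TRUE for a data frame, so this sketch treats a single data frame as shared reference data rather than as a list of per-subject references.

```r
# Hypothetical helper illustrating how N could be resolved; not the package's code.
infer_N <- function(dat, ref.dat, N = NULL) {
  # Assumption: a plain list (not a data frame) holds one reference per subject.
  if (is.list(ref.dat) && !is.data.frame(ref.dat)) {
    return(length(ref.dat))
  }
  # Otherwise fall back to the number of unique subject ids in dat.
  if (is.null(N)) {
    return(length(unique(dat[, "id"])))
  }
  N
}

# Made-up inputs: only the "id" column matters for this sketch.
dat <- cbind(id = c("S1", "S2", "S1"), gene = c("GENE1", "GENE1", "GENE2"))
shared.ref <- matrix(0, nrow = 2, ncol = 3)
per.subject.ref <- list(S1 = shared.ref, S2 = shared.ref, S3 = shared.ref)

n.shared <- infer_N(dat, shared.ref)
n.listed <- infer_N(dat, per.subject.ref)
n.given  <- infer_N(dat, shared.ref, N = 5)
```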


The mutation data dat is a 7-column matrix similar in style to other popular mutation file formats. The first three columns ("chr", "start", "end") give the positional information of the somatic mutation. The "id" column holds the subject id for each documented mutation. The "type" column gives the type of mutation for each entry. This is relatively flexible for point mutations, and only requires some form of "silent" or "synonymous" for such mutations if silent=FALSE, but insertion/deletion events should be designated as "INDEL". The "gene" column gives the name of the gene the mutation corresponds to, and must match the gene names used in ref.dat. The "context" entries give the trinucleotide sequence context of each point mutation (NA for INDELs).
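As an illustration of the layout described above, the following sketch builds a small mutation matrix by hand. All coordinates, subject ids, and gene names are made up. The final step shows one way silent mutations might be identified; the anchored pattern is an assumption chosen so that a label such as "Nonsynonymous" is not dropped by accident, and it is not taken from the package.

```r
# Hypothetical mutation data: every value below is invented for illustration.
dat <- cbind(
  chr     = c("1", "1", "17", "17"),
  start   = c("1000", "2050", "7577120", "7578400"),
  end     = c("1000", "2050", "7577120", "7578402"),
  id      = c("S1", "S2", "S1", "S2"),
  type    = c("Missense", "Silent", "Nonsense", "INDEL"),
  gene    = c("GENE1", "GENE1", "GENE2", "GENE2"),
  context = c("ACA", "TTG", "GCG", NA)  # NA context for the indel
)

# Assumption: a type counts as silent when it starts with "silent" or "synonymous".
is.silent <- grepl("^(silent|synonymous)", dat[, "type"], ignore.case = TRUE)
dat.nonsilent <- dat[!is.silent, , drop = FALSE]
```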

The first two columns of the data matrix (or matrices) in ref.dat should contain the gene name and the corresponding chromosome, and the column names of the remaining 32 columns should be the trinucleotide motifs (e.g. "ACA"). The sequence content entries should be integer values giving the number of nucleotides in the coding content of a given gene that satisfy the trinucleotide motif (central base with flanking 5' and 3' bases). Each base should be uniquely represented, such that the sum of all 32 counts equals the base-pair length of the total coding sequence for a given gene.
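A minimal sketch of a reference table with this shape is given below. The gene names, chromosomes, and counts are placeholders, and the motif columns are generated under the assumption that the 32 contexts are all combinations of a 5' base, a central "C" or "T", and a 3' base.

```r
# Build the 32 motif names: 4 flanking 5' bases x 2 central bases x 4 flanking 3' bases.
bases  <- c("A", "C", "G", "T")
grid   <- expand.grid(five = bases, mid = c("C", "T"), three = bases,
                      stringsAsFactors = FALSE)
motifs <- paste0(grid$five, grid$mid, grid$three)

# Placeholder integer counts for two hypothetical genes.
counts <- matrix(seq_len(2 * 32), nrow = 2, dimnames = list(NULL, motifs))

ref.dat <- data.frame(gene = c("GENE1", "GENE2"),
                      chr  = c("1", "17"),
                      counts,
                      stringsAsFactors = FALSE, check.names = FALSE)

# With each base counted exactly once, the row sums give the coding length per gene.
coding.length <- rowSums(ref.dat[, motifs])
```

Comparing `coding.length` with the known coding sequence length of each gene is a quick sanity check before passing the table on.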

The function has its own trinucleotide naming convention, in that all motifs are in all caps and have either "T" or "C" as the central base. Column names of ref.dat and "context" entries in dat will be adjusted to accommodate this convention if they deviate from it.
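One plausible way to bring motifs into this convention is sketched below: upper-case the motif and, when the central base is "A" or "G", replace it with its reverse complement. `normalize_context` is a hypothetical helper, and the use of the reverse complement is an assumption about how the adjustment works rather than a description of the package internals.

```r
# Hypothetical helper: map any trinucleotide to the C/T-centred, upper-case form.
normalize_context <- function(x) {
  ok <- !is.na(x)
  x[ok] <- toupper(x[ok])

  # Reverse complement of each string in a character vector.
  revcomp <- function(s) {
    vapply(strsplit(s, ""), function(ch) {
      paste(rev(chartr("ACGT", "TGCA", ch)), collapse = "")
    }, character(1))
  }

  # Only motifs with a purine (A or G) in the centre need to be flipped.
  flip <- ok & substr(x, 2, 2) %in% c("A", "G")
  x[flip] <- revcomp(x[flip])
  x
}

normalized <- normalize_context(c("aca", "TGT", "CAG", NA))
```

Missing contexts, such as those of indels, are passed through unchanged.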


Returns a list data structure with the following components:


Original mutation data object dat


Original reference data object ref.dat


Number of subjects with observed data


Vector of length G of gene names included in analysis, where G is the total number of genes. Derived from ref.dat


A G × 32 matrix of total number of SNV mutations per sequence context and gene


Vector of length G of total number of indel mutations per gene
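To make the last two components concrete, the sketch below tabulates a small made-up mutation matrix into a gene-by-context SNV count matrix and a per-gene indel count vector. The object names (`snv.mat`, `indel.counts`) are illustrative, not the names used in the returned list, and the tabulation is only one way such counts could be derived.

```r
# Motif and gene labels (hypothetical genes; 32 C/T-centred motifs).
bases  <- c("A", "C", "G", "T")
grid   <- expand.grid(five = bases, mid = c("C", "T"), three = bases,
                      stringsAsFactors = FALSE)
motifs <- paste0(grid$five, grid$mid, grid$three)
genes  <- c("GENE1", "GENE2")

# Made-up mutations; only the columns needed for counting are included.
dat <- cbind(
  id      = c("S1", "S2", "S1", "S2"),
  type    = c("Missense", "Silent", "Nonsense", "INDEL"),
  gene    = c("GENE1", "GENE1", "GENE2", "GENE2"),
  context = c("ACA", "TTG", "GCG", NA)
)

is.indel <- dat[, "type"] == "INDEL"

# G x 32 matrix of SNV counts; factor levels keep empty genes and contexts as zeros.
snv <- dat[!is.indel, , drop = FALSE]
snv.tab <- table(factor(snv[, "gene"], levels = genes),
                 factor(snv[, "context"], levels = motifs))
snv.mat <- matrix(as.integer(snv.tab), nrow = length(genes),
                  dimnames = list(genes, motifs))

# Length-G vector of indel counts per gene.
indel.tab <- table(factor(dat[is.indel, "gene"], levels = genes))
indel.counts <- setNames(as.integer(indel.tab), genes)
```

Here the silent mutation is still counted, which corresponds to keeping silent mutations in the data; filtering them out first would lower the GENE1 total by one.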


Nicholas B. Larson

See Also: baysic.test


## Not run: 

## End(Not run)