docs/reference.md

# Active Bindings

Most of the variables used during the analysis can be accessed through the FishHook object with the \$ operator, much as a column of a data.frame or data.table is accessed with dataframe\$column_name. For example, to read the hypotheses stored in the FishHook object, use the following syntax:

x = FishHook\$hypotheses

You can also assign variables through the active bindings, like so:

FishHook\$hypotheses = x
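An active binding looks like a field but is backed by a function: reading the field calls the function with no argument, and assigning calls it with the new value. The sketch below mimics this with base R's makeActiveBinding(), applying the mc.cores rules documented further down (numeric, greater than 0, non-integers floored; NULL handling is left out for brevity). The names `obj` and `store` are illustrative assumptions, and this is not fishHook's own implementation.

```r
# Toy object with one validated active binding; `obj` and `store` are
# hypothetical names, not part of fishHook.
obj <- new.env()
store <- new.env()            # private storage behind the binding
store$mc.cores <- 1           # documented default

makeActiveBinding("mc.cores", function(value) {
  if (missing(value)) {
    return(store$mc.cores)    # read access: obj$mc.cores
  }
  # write access: obj$mc.cores <- value
  if (!is.numeric(value) || length(value) != 1 || is.na(value) || value <= 0) {
    stop("mc.cores must be a single numeric value > 0")
  }
  store$mc.cores <- floor(value)   # non-integers are floored
  invisible(store$mc.cores)
}, obj)

obj$mc.cores <- 3.41232
obj$mc.cores                  # 3
```

Because validation lives in the setter, an invalid assignment such as `obj$mc.cores <- -1` raises an error and leaves the stored value unchanged.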

## FishHook

### hypotheses

**Description:** A GRanges that defines the hypotheses (regions of the genome) to test.

**Return:** GRanges

**Settable:** Yes

**Set Conditions:** The value must be of class GRanges or character and cannot be NULL.

**Set Results:** The FishHook object is reset to the initialized state and all annotations/scores are deleted. If the value is of class character, fishHook will try to load the file at that path using rtracklayer::import().

**Default:** None; must be set at initialization.

### events

**Description:** A GRanges that defines the events (mutations) used in the fishHook analysis.

**Return:** GRanges

**Settable:** Yes

**Set Conditions:** The value must be of class GRanges and cannot be NULL.

**Set Results:** The FishHook object is reset to the initialized state and all annotations/scores are deleted.

**Default:** None; must be set at initialization.

### covariates

**Description:** A Covariate object that stores all of the covariates used during the analysis.

**Return:** Covariate

**Settable:** Yes

**Set Conditions:** The value must be of class Covariate or be NULL.

**Set Results:** The FishHook object is reset to the initialized state and all annotations/scores are deleted.

**Default:** NULL

### eligible

**Description:** A GRanges that defines the regions eligible for the fishHook analysis.

**Return:** GRanges

**Settable:** Yes

**Set Conditions:** The value must be of class GRanges or be NULL.

**Set Results:** The FishHook object is reset to the initialized state and all annotations/scores are deleted.

**Default:** NULL

### data

**Description:** A GRanges object containing the data on which regression will be performed.

**Return:** GRanges

**Settable:** Yes, but you should not set it unless you know what you're doing.

**Set Conditions:** The value must be of class GRanges or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** NULL

### res

**Description:** A data.table containing the analysis results generated by FishHook\$score().

**Return:** data.table

**Settable:** Yes, but you should not set it unless you know what you're doing.

**Set Conditions:** The value must be of class data.table or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** NULL

### all

**Description:** Returns a data.table that contains the original hypotheses and their associated metadata, annotated with the output of FishHook\$score().

**Return:** data.table

**Settable:** No

**Set Conditions:** NA

**Set Results:** NA

**Default:** NULL

### state

**Description:** Returns a character that indicates the internal state of the FishHook object. Upon initialization the state is set to 'Initialized'. Once the object is annotated, the state is set to 'Annotated'. Once the object is scored, the state is set to 'Scored'.

**Return:** character

**Settable:** No

**Set Conditions:** NA

**Set Results:** NA

**Default:** 'Initialized'

### mc.cores

**Description:** A numeric variable that indicates the number of cores to use when annotating the data in FishHook\$annotate().

**Return:** numeric

**Settable:** Yes

**Set Conditions:** The value must be of class numeric and greater than 0, or be NULL. Non-integer values are floored, e.g. FishHook\$mc.cores = 3.41232 sets mc.cores to 3.

**Set Results:** The variable is set to the value provided.

**Default:** 1

### max.slice

**Description:** A parameter used when annotating covariates; it is passed as the max.slice parameter of gr.val and sets the maximum number of ranges (covariate rows) to process at a time. Memory usage grows with this value: a high max.slice will yield faster running times but will require more memory.

**Return:** numeric

**Settable:** Yes

**Set Conditions:** The value must be of class numeric or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** 1e3 (1,000)
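The memory/speed trade-off comes from processing ranges in batches rather than all at once. A minimal base-R sketch of that batching idea follows; it is not gr.val's actual code, and `chunk_indices` is a hypothetical helper.

```r
# Split row indices 1..n into consecutive batches of at most `max.slice` rows.
chunk_indices <- function(n, max.slice) {
  idx <- seq_len(n)
  split(idx, ceiling(idx / max.slice))
}

chunks <- chunk_indices(n = 2500, max.slice = 1e3)
lengths(chunks)   # batch sizes: 1000, 1000, 500
```

A larger max.slice means fewer, bigger batches: less looping overhead, but more rows held in memory at once.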

### ff.chunk

**Description:** For use with ffTrack covariates. Indicates the maximum interval length to load in from an ffTrack. Larger values will result in faster run times but will increase memory usage.

**Return:** numeric

**Settable:** Yes

**Set Conditions:** The value must be of class numeric or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** 1e6 (1,000,000)

### max.chunk

**Description:** Used when finding the overlap between events and hypotheses. This parameter is passed into gr.findoverlaps and indicates the total number of ranges (events) to consider at a given time. Larger values will result in faster run times but will increase memory usage.

**Return:** numeric

**Settable:** Yes

**Set Conditions:** The value must be of class numeric or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** 1e11

### pad

**Description:** A numeric variable indicating how far each covariate range should be extended in both directions, e.g. if a covariate has the range [10,20] and pad = 5, the covariate range will be set to [5,25]. This value is only used for covariates whose own Covariate\$pad is NA.

**Return:** numeric

**Settable:** Yes

**Set Conditions:** The value must be of class numeric or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** 0
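The two rules above (pad both ends, and use the FishHook-level pad only where a covariate's own pad is NA) can be sketched in base R. The data frame below is a made-up stand-in with one range per covariate; real covariates would be GRanges, and `global_pad` is an illustrative name for the value of FishHook\$pad.

```r
# Hypothetical covariate ranges and per-covariate pads (NA = not set).
ranges <- data.frame(start = c(10, 100, 40), end = c(20, 150, 60))
cov_pad <- c(NA, 10, NA)
global_pad <- 5            # stands in for FishHook$pad

# The FishHook-level pad only fills in covariates whose own pad is NA.
effective_pad <- ifelse(is.na(cov_pad), global_pad, cov_pad)

# Extend each range by its pad on both sides.
padded <- data.frame(start = ranges$start - effective_pad,
                     end   = ranges$end + effective_pad)
padded   # first row: [10,20] with pad 5 becomes [5,25]
```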

### verbose

**Description:** A logical variable indicating whether or not to print additional analysis details to the output.

**Return:** logical

**Settable:** Yes

**Set Conditions:** The value must be of class logical or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** TRUE

### out.path

**Description:** The path to which the score.hypotheses output is written.

**Return:** character

**Settable:** Yes

**Set Conditions:** The value must be of class character or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** NULL

### model

**Description:** The model used by fishHook to calculate p-values for the analysis. It is generated by FishHook\$score().

**Return:** glm

**Settable:** Yes, but you should not set it unless you know what you're doing.

**Set Conditions:** None

**Set Results:** The variable is set to the value provided.

**Default:** NULL

### na.rm

**Description:** A logical variable that indicates whether to remove NA values during the analysis.

**Return:** logical

**Settable:** Yes

**Set Conditions:** The value must be of class logical or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** TRUE

### idcol

**Description:** Used when you want to limit the number of events that any given patient can contribute. This parameter is a character that indicates the column name of the 'events' variable that contains the patient IDs. It should be used in conjunction with the idcap parameter.

**Return:** character

**Settable:** Yes

**Set Conditions:** The value must be of class character or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** NULL

### idcap

**Description:** Used when you want to limit the number of events that any given patient can contribute. This parameter is a numeric that indicates the maximum number of events any given patient can contribute to any given target. It should be used in conjunction with the idcol parameter.

**Return:** numeric

**Settable:** Yes

**Set Conditions:** The value must be of class numeric or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** Inf
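To make the interaction of idcol and idcap concrete, the following base-R sketch caps each patient's contribution per target before counting. The `events` table and its column names are invented for illustration, and the counting logic is an assumption about the intended behavior rather than fishHook's internal code.

```r
# Hypothetical events already matched to targets; `patient` plays the role of
# the column named by idcol.
events <- data.frame(
  patient = c("p1", "p1", "p1", "p2", "p3"),
  target  = c("geneA", "geneA", "geneA", "geneA", "geneB")
)
idcap <- 1

# Events per (target, patient) pair, then capped at idcap.
per_pair <- unclass(table(events$target, events$patient))
per_pair[per_pair > idcap] <- idcap

uncapped <- table(events$target)   # geneA = 4, geneB = 1
capped <- rowSums(per_pair)        # geneA = 2, geneB = 1
```

With idcap = 1, patient p1's three events in geneA count once, so a single heavily mutated patient cannot drive the signal for that target.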

### weightEvents

**Description:** A logical that indicates whether an event's contribution should be weighted by its overlap with the hypotheses. This can be used for copy number data but violates the Poisson assumption that the response consists of discrete counts. For example, if only 10% of an event (e.g. a large copy number variation) overlapped a target, that event would contribute 0.1 to the total count of that target. With this parameter, an event may therefore contribute between 0 and 1 to the total target count.

**Return:** logical

**Settable:** Yes

**Set Conditions:** The value must be of class logical or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** FALSE
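The 10% example can be written out as a small calculation. This sketch assumes half-open [start, end) coordinates and that the weight is the overlap divided by the event's own width; `overlap_weight` is a hypothetical helper, not a fishHook function.

```r
# Fraction of an event that overlaps a target, using half-open [start, end)
# coordinates. Measuring overlap relative to the event's width is an assumption.
overlap_weight <- function(ev_start, ev_end, tg_start, tg_end) {
  overlap <- max(0, min(ev_end, tg_end) - max(ev_start, tg_start))
  overlap / (ev_end - ev_start)
}

overlap_weight(0, 1000, 900, 5000)    # 0.1: only the last 10% of the event overlaps
overlap_weight(0, 1000, 0, 5000)      # 1: event lies entirely inside the target
overlap_weight(0, 1000, 2000, 5000)   # 0: no overlap
```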

### nb

**Description:** A logical that indicates which model to use. If TRUE, a negative binomial model is used; if FALSE, a Poisson model is used.

**Return:** logical

**Settable:** Yes

**Set Conditions:** The value must be of class logical or be NULL.

**Set Results:** The variable is set to the value provided.

**Default:** TRUE
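For intuition about what the Poisson option (nb = FALSE) implies, here is a sketch of a count regression with base R's glm() on made-up data. The variable names, the use of coverage as an offset, and the upper-tail p-value are assumptions for illustration and are not taken from fishHook's source. A negative binomial fit would need a function such as MASS::glm.nb in place of glm().

```r
# Deterministic toy data: 20 hypotheses with equal coverage and one covariate.
x <- seq(-1, 1, length.out = 20)
coverage <- rep(1000, 20)
counts <- round(10 * exp(0.5 * x))   # background that tracks the covariate
counts[7] <- counts[7] + 30          # hypothesis 7 carries excess events

# Poisson regression of counts on the covariate, with coverage as an offset.
fit <- glm(counts ~ x + offset(log(coverage)), family = poisson())
expected <- fitted(fit)

# Upper-tail p-value: probability of seeing at least the observed count.
p <- ppois(counts - 1, lambda = expected, lower.tail = FALSE)
which.min(p)   # 7
```

The covariate absorbs the smooth trend in the background rate, so only the hypothesis with far more events than the model expects stands out.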

## Covariate

### data

**Description:** A list of covariates for use in a FishHook analysis. Each covariate can be of type 'GRanges', 'ffTrack', or 'RleList'; 'GRanges' is the best supported type.

**Return:** list

**Settable:** Yes

**Set Conditions:** The value must be of type list and contain only covariates of types 'GRanges', 'ffTrack', or 'RleList'.

**Set Results:** The variable is set to the value provided. Note that by changing the covariates you may introduce discrepancies between the covariates and other parameters such as type.

**Default:** None; must be set at initialization.

### names

**Description:** A character vector containing all of the names for the covariates.

**Return:** character vector

**Settable:** Yes

**Set Conditions:** The vector must be of class character and either have length equal to length(cvs) or satisfy the condition length(cvs) %% length(names) == 0.

**Set Results:** The variable is set to the value provided. If the length of names is less than that of cvs and the Set Conditions are satisfied, names will be repeated such that its length is equal to that of cvs.

**Default:** NA
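The same recycling rule applies to the other per-covariate vectors in this section (type, pad, field, na.rm, grep). A minimal sketch of the rule, where `recycle_setting` is a hypothetical helper and the names are made up:

```r
# Recycle a per-covariate setting to one value per covariate, following the
# documented condition length(cvs) %% length(x) == 0. `recycle_setting` is a
# hypothetical helper, not a fishHook function.
recycle_setting <- function(x, n_cvs) {
  if (length(x) == 0 || n_cvs %% length(x) != 0) {
    stop("length(cvs) must be a multiple of the setting's length")
  }
  rep(x, length.out = n_cvs)
}

recycle_setting(c("replication", "chromatin"), n_cvs = 4)
# "replication" "chromatin" "replication" "chromatin"
```

A vector of length 3 for 4 covariates fails the condition (4 %% 3 is 1), so the helper stops with an error instead of recycling unevenly.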

### type

**Description:** A character vector indicating the type of each covariate. Each type must be one of 'numeric', 'sequence', or 'interval'.

**Return:** character vector

**Settable:** Yes

**Set Conditions:** The vector must be of class character and either have length equal to length(cvs) or satisfy the condition length(cvs) %% length(type) == 0.

**Set Results:** The variable is set to the value provided. If the length of type is less than that of cvs and the Set Conditions are satisfied, type will be repeated such that its length is equal to that of cvs.

**Default:** NA

### pad

**Description:** A numeric vector indicating how far each covariate range should be extended in both directions, e.g. if a covariate has the range [10,20] and pad = 5, the covariate range will be set to [5,25].

**Return:** numeric vector

**Settable:** Yes

**Set Conditions:** The vector must be of class numeric and either have length equal to length(cvs) or satisfy the condition length(cvs) %% length(pad) == 0.

**Set Results:** The variable is set to the value provided. If the length of pad is less than that of cvs and the Set Conditions are satisfied, pad will be repeated such that its length is equal to that of cvs.

**Default:** 0

### field

**Description:** A character vector that should be specified for numeric covariates; all other types of covariates should have this value set to NA. The value indicates the column name in which to find the score of the numeric covariate, i.e. the numeric value associated with that covariate.

**Return:** character vector

**Settable:** Yes

**Set Conditions:** The vector must be of class character and either have length equal to length(cvs) or satisfy the condition length(cvs) %% length(field) == 0.

**Set Results:** The variable is set to the value provided. If the length of field is less than that of cvs and the Set Conditions are satisfied, field will be repeated such that its length is equal to that of cvs.

**Default:** NA

### signature

**Description:** For use with ffTrack. A list of named lists that specifies what is to be tallied. Each signature (list element) consists of an arbitrary-length character vector specifying strings to match if grep == FALSE. A signature can also be a length 1 character vector to pass to grepl (if grep == TRUE), or a length 1 or 2 numeric vector specifying an exact value or an interval to match (for numeric data).

**Return:** list

**Settable:** Yes

**Set Conditions:** None

**Set Results:** The variable is set to the value provided.

**Default:** NA

### na.rm

**Description:** A logical vector that indicates whether to remove NA values for a given covariate.

**Return:** logical vector

**Settable:** Yes

**Set Conditions:** The vector must be of class logical and either have length equal to length(cvs) or satisfy the condition length(cvs) %% length(na.rm) == 0.

**Set Results:** The variable is set to the value provided. If the length of na.rm is less than that of cvs and the Set Conditions are satisfied, na.rm will be repeated such that its length is equal to that of cvs.

**Default:** NA

### grep

**Description:** A logical vector for use with ffTrack covariates. It specifies what form of signature to use. See the signature parameter for more information.

**Return:** logical vector

**Settable:** Yes

**Set Conditions:** The vector must be of class logical and either have length equal to length(cvs) or satisfy the condition length(cvs) %% length(grep) == 0.

**Set Results:** The variable is set to the value provided. If the length of grep is less than that of cvs and the Set Conditions are satisfied, grep will be repeated such that its length is equal to that of cvs.

**Default:** NA

# Relevant Functions
-----------

These are functions stored in the R6 objects, Covariate and FishHook. They can be accessed by:

FishHook\$Function()

Covariate\$Function()

## FishHook
-----------

### initialize()

**Description:** Initializes the FishHook object. Can be called with: x = FishHook\$new(...)

**Params:**

1. **hypotheses:** Examples of hypotheses are genes, enhancers, or even 1kb tiles of the genome that we can then convert into a rolling/tiled window. This param must be of class "GRanges".
2. **events:** Events are the given mutational regions and must be of class "GRanges". Examples of events are SNVs (e.g. C->G), somatic copy number alterations (SCNAs), fusion events, etc.
3. **eligible:** Eligible regions are the regions of the genome that have enough statistical power to score. For example, in the case of exome sequencing, where not all regions are equally represented, eligible can be a set of regions that meet an arbitrary exome coverage threshold. Another example of when to use eligibility is in the case of whole genomes, where your hypotheses are 1kb tiles. Regions of the genome you would want to exclude in this case are highly repetitive regions such as centromeres, telomeres, and satellite repeats. This param must be of class "GRanges".
4. **covariates:** Covariates are genomic features that you believe influence the rate of your given type of event (mutations, CNVs, fusions, case control samples) but that are not linked to the process you are investigating (e.g. cancer drivers). In the case of cancer drivers, we are looking for regions that are mutated as part of cancer progression. As such, regions that are more susceptible to random mutagenesis, such as late replicating or non-expressed regions (transcription coupled repair), could become false positives. Including covariates for these biological processes will reduce their visible effect in the final data. This param must be of type "Covariate".
5. **out.path:** A character that indicates the system path in which to save the results of the analysis.
6. **use_local_mut_density:** A logical that, when TRUE, creates a covariate representing the mutational density in the genome, whose bin size is determined by local_mut_density_bin. This covariate can be used when you have no other covariates as a way to correct for variations in mutational rates along the genome, under the assumption that driving mutations will cluster in local regions as opposed to global regions. This is similar to saying: in the town of foo, there is a crime rate of X that we will assume to be the local crime rate. If a region in foo has a crime rate Y such that Y is much greater than X, we can say that this region has a higher crime rate than we would expect.
7. **local_mut_density_bin:** A numeric value that indicates the size of the genomic bins to use if use_local_mut_density = TRUE. Note that this parameter should be a few orders of magnitude greater than the size of your hypotheses, e.g. if your hypotheses are 1e5 bps long, you may want a local_mut_density_bin of 1e7 or even 1e8.
8. **genome:** A character value indicating which build of the human genome to use; by default set to hg19.
9. **mc.cores:** A numeric value that indicates the number of computing cores to use when running fishHook. This is mainly used during the annotation step of the analysis, or during initial instantiation of the object if use_local_mut_density = TRUE.
10. **na.rm:** A logical indicating how NAs in your data are handled, mainly used in fftab and gr.val; see the documentation of these functions for more information.
11. **pad:** A numeric indicating how far each covariate range should be extended; see Covariate for more information. Note that this will only be used if at least one of the Covariates has pad = NA.
12. **verbose:** A logical indicating whether or not to print information to the console when running FishHook.
13. **max.slice:** A parameter used when annotating covariates; it is passed as the max.slice parameter of gr.val and sets the maximum number of ranges (covariate rows) to process at a time. Memory usage grows with this value: a high max.slice will yield faster running times but will require more memory.
14. **ff.chunk:** For use with ffTrack covariates. Indicates the maximum interval length to load in from an ffTrack. Larger values will result in faster run times but will increase memory usage.
15. **max.chunk:** Used when finding the overlap between events and hypotheses. This parameter is passed into gr.findoverlaps and indicates the total number of ranges (events) to consider at a given time. Larger values will result in faster run times but will increase memory usage.
16. **idcol:** Used when you want to limit the number of events that any given patient can contribute. This parameter is a character that indicates the column name of the 'events' variable that contains the patient IDs. It should be used in conjunction with the idcap parameter.
17. **maxptpergene:** Used when you want to limit the number of events that any given patient can contribute. This parameter is a numeric that indicates the maximum number of events any given patient can contribute to any given target. It should be used in conjunction with the idcol parameter.
18. **weightEvents:** A logical that indicates if the events should be weighted by their overlap with the hypotheses. e.g. if we have a SCNA spanning 0:1000 and a target spanning 500:10000, the overlap

**Return:** FishHook object ready for annotation/scoring

**UI:** None

### print()

**Description:** Prints out a summary of the FishHook object. Can be used by invoking the variable name.

**Params:** No parameters required. Provided parameters will be ignored.

**Return:** None

**UI:** Prints information about the FishHook object to the console, including total events, total hypotheses, whether eligible regions will be used, and covariates/number of covariates.

### aggregate()

**Description:** Aggregates hypotheses into groups for aggregate scoring, e.g. aggregate genes in a pathway or tiles of a genome.

**Params:**

1. **by:** A character vector with which to split into meta-territories (default = NULL)
2. **fields:** A character vector indicating which columns are to be used in aggregation; by default, all metadata fields of hypotheses EXCEPT the reserved field names \$coverage, \$counts, \$query.id (default = NULL)
3. **rolling:** A positive numeric (integer) specifying how many adjacent hypotheses (in genome coordinate order) to aggregate in a rolling fashion. A rolling sum / weighted average is performed WITHIN chromosomes over windows of "rolling" ranges, yielding a GRanges. For example, if we cut a chromosome into 5 pieces (1,2,3,4,5) and set rolling = 3, we will get an aggregated dataset (123,234,345) as the internal value. This is mainly for use with whole genome analysis in order to speed up the annotation step (default = NULL)
4. **disjoint:** A logical; if TRUE, only take disjoint bins of the input (default = TRUE)
5. **na.rm:** A logical, only applicable for sample-wise aggregation (i.e. if by = NULL) (default = FALSE)
6. **FUN:** A list that only applies (for now) if by = NULL. This is a named list of functions, where each item named "nm" corresponds to an optional function specifying how to alternatively aggregate field "nm" per sample, for alternative aggregation of coverage and count. This function is applied at every iteration of loading a new sample and adding it to the existing set. It is normally sum [for coverage and count] and coverage weighted mean [for all other covariates]. Alternative coverage / count aggregation functions should have two arguments (val1, val2), and all other alternative covariate aggregation functions should have four arguments (val1, cov1, val2, cov2), where val1 is the accumulating vector and val2 is the new vector of values.
7. **verbose:** A logical verbose flag (default = TRUE)

**Return:** None, but sets the internal state of the object to 'Aggregated'.

**UI:** None

### score()

**Description:** If the FishHook object is in the Annotated state, this function will fit a regression model (negative binomial/Poisson) to the hypotheses and assign significance. If the FishHook object is in the Initialized state, this function will first annotate the FishHook object and then score it.

**Params:**

1. **verbose:** A logical verbose flag (default = TRUE)
2. **iter:** The maximum number of iterations to use when fitting the model (only for negative binomial)
3. **subsample:** The number of hypotheses to use when fitting the model (selected randomly).
4. **seed:** A numeric (integer) indicating the random number seed to be used (default = 42)
5. **nb:** A logical; if TRUE, uses negative binomial; if FALSE, uses Poisson

**Return:** None, but sets the internal state of the object to 'Scored'. You can access the scored data with FishHook\$all

**UI:** None

### qqp()

**Description:** Creates a Q-Q plot (either base R or plotly) to visualize target significance and how well the model fits a given dataset.

**Params:**

1. **plotly:** A logical value indicating whether the function should return a plotly (TRUE) or base R (FALSE) plot object.
2. **columns:** A character vector that indicates the names of the columns from the FishHook\$all output to use in annotating hover text on plotly plots. This will only be used if plotly = TRUE.
3. **annotations:** A named list of character vectors. Each vector must have the same length as the number of rows in the FishHook\$all data.table. These character vectors will be used to annotate hover text on the plotly plots in the same order as the hypotheses. This will only be used if plotly = TRUE.
4. **key:** A character that is passed to the plotly function and links each point to a given value. For example, if key is set to gene_name, the plotted points are referred to by the value in the column gene_name. This is useful when integrating with shiny or any other tool that can integrate with plotly plots.

**Return:** Either a base R or a plotly plot, depending on the 'plotly' parameter.

**UI:** None

## Covariate
-----------

### initialize()

**Description:** Initializes the Covariate object. Can be called with: x = Covariate\$new(...)

**Params:**

1. **name:** A character vector containing all of the names for the covariates.
2. **pad:** A numeric vector indicating how far each covariate range should be extended in both directions, e.g. if a covariate has the range [10,20] and pad = 5, the covariate range will be set to [5,25].
3. **type:** A character vector that contains the type of each covariate (numeric, interval, sequence).
4. **signature:** For use with ffTrack. A list of named lists that specifies what is to be tallied. Each signature (list element) consists of an arbitrary-length character vector specifying strings to match if grep == FALSE. A signature can also be a length 1 character vector to pass to grepl (if grep == TRUE), or a length 1 or 2 numeric vector specifying an exact value or an interval to match (for numeric data).
5. **field:** A character vector for use with numeric covariates (NA otherwise) that indicates the column containing the values of that covariate. For example, if you have a covariate for replication timing and the values are in the column 'value', the parameter field should be set to the character 'value'.
6. **na.rm:** A logical vector that indicates whether or not to remove NAs in the covariates.
7. **grep:** A logical vector for use with ffTrack covariates. It specifies what form of signature to use. See the signature parameter for more information.
8. **cvs:** A list of covariates that can include any of the covariate classes (GRanges, ffTrack, RleList, character).

**Return:** Returns an object of type Covariate that can be passed directly to the FishHook class.

**UI:** None

### chr()

**Description:** Returns a logical vector where each element corresponds to a covariate and where TRUE indicates chr-based seqlevels, e.g. chr14 will return TRUE.

**Params:** No parameters required. Provided parameters will be ignored.

**Return:** logical vector

**UI:** None

### seqlevels()

**Description:** Returns a list of character vectors. If the respective covariate is of class GRanges, the vector will contain all of the chromosome names. If a covariate is not of class GRanges, the corresponding element will be NA.

**Params:** No parameters required. Provided parameters will be ignored.

**Return:** list of character vectors

**UI:** None

### toList()

**Description:** Returns a list of lists where each internal list corresponds to a covariate. It is for internal use in the annotate.hypotheses function. The list representation of the covariate will contain the following variables: type, signature, pad, na.rm, field, grep

**Params:** No parameters required. Provided parameters will be ignored.

**Return:** list of lists

**UI:** None

### print()

**Description:** Prints out a summary of the Covariate. Can be used by invoking the variable name.

**Params:** No parameters required. Provided parameters will be ignored.

**Return:** None

**UI:** Prints information about the Covariate to the console, with all of the covariates printed in order and their variables printed alongside each covariate.
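The rolling example given for aggregate() (five pieces, rolling = 3, giving windows 123, 234, 345) can be sketched in base R. The per-tile values below are made up, `roll_windows` is a hypothetical helper, and the aggregation rules follow the description of FUN (sum for counts and coverage, coverage-weighted mean for other covariates) rather than fishHook's actual code.

```r
# Index windows for rolling aggregation over adjacent tiles on one chromosome.
roll_windows <- function(n, rolling) {
  if (rolling > n) return(list())
  lapply(seq_len(n - rolling + 1), function(i) i:(i + rolling - 1))
}

windows <- roll_windows(n = 5, rolling = 3)   # (1,2,3), (2,3,4), (3,4,5)

counts   <- c(2, 0, 1, 4, 3)             # events per tile (made-up values)
coverage <- c(100, 50, 100, 100, 50)     # eligible bases per tile
covar    <- c(0.2, 0.4, 0.1, 0.3, 0.5)   # a numeric covariate per tile

# Counts and coverage are summed; other covariates use a coverage-weighted mean.
agg_counts   <- vapply(windows, function(w) sum(counts[w]), numeric(1))
agg_coverage <- vapply(windows, function(w) sum(coverage[w]), numeric(1))
agg_covar    <- vapply(windows,
                       function(w) weighted.mean(covar[w], coverage[w]),
                       numeric(1))

agg_counts   # 3 5 8
```

Each window reuses tiles from its neighbors, so five tiles with rolling = 3 yield three overlapping aggregated hypotheses.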


mskilab/fishHook documentation built on Jan. 10, 2023, 8:20 p.m.