R/deprecated.R

Defines functions lolliplot_transcriptID2codingSeq lolliplot_reduceLolli lolliplot_qual lolliplot_mutationObs lolliplot_fetchDomain lolliplot_dodgeCoordY lolliplot_dodgeCoordX lolliplot_DNAconv lolliplot_constructGene lolliplot_Codon2AA lolliplot_buildMain lolliplot_AA2sidechain lolliplot TvTi_rmMnuc TvTi_rmIndel TvTi_qual TvTi_convMaf TvTi_calcTransTranvFreq TvTi_buildMain TvTi_annoTransTranv TvTi_alignPlot TvTi waterfall_select_palette waterfall_sampSort waterfall_sampAlt waterfall_rmvSilent waterfall_qual waterfall_palette_names waterfall_hierarchyTRV waterfall_geneSort waterfall_geneRecurCutoff waterfall_geneAlt waterfall_calcMutFreq waterfall_build_proportions waterfall_buildMutBurden_B waterfall_buildMutBurden_A waterfall_buildMain waterfall_buildGenePrevelance waterfall_align waterfall_NA2gene waterfall_MGI2anno waterfall_MAF2anno waterfall_Custom2anno waterfall

Documented in lolliplot lolliplot_AA2sidechain lolliplot_buildMain TvTi waterfall

#' Construct a waterfall plot
#'
#' Given a data frame construct a water fall plot showing the mutation burden
#' and mutation type on a gene and sample level.
#' @name waterfall
#' @param x Object of class data frame representing annotated mutations. The
#' data frame supplied must have one of the following sets of column names
#' ("Tumor_Sample_Barcode", "Hugo_Symbol", "Variant_Classification") for
#' fileType="MAF", ("sample","gene_name","trv_type") for fileType="MGI" or
#' ("sample", "gene", "variant_class") for fileType="Custom". This columns
#' should represent samples in a cohort, gene with mutation, and the mutation
#' type respectively.
#' @param mainRecurCutoff Numeric value between 0 and 1 specifying a
#' mutation recurrence cutoff. Genes which do not have mutations in the
#' proportion os samples defined are removed.
#' @param mainGrid Boolean specifying if a grid should be overlayed on the main
#' plot. Not recommended if the number of genes or samples to be plotted
#' is large.
#' @param mainXlabel Boolean specifying whether to label the x-axis with sample
#' names. Not recommended if the number of samples to be plotted is large.
#' @param main_geneLabSize Intenger specifying the size of gene names displayed
#' on the y-axis.
#' @param mainLabelCol Character string specifying a column name from the
#' argument supplied to parameter `x` from which to derive cell labels from
#' (see details and vignette).
#' @param mainLabelSize Integer specifying the size of text labels for cells
#' in the main plot. Valid only if argument is supplied to the parameter
#' `mainLabelCol`.
#' @param mainLabelAngle Integer specifying the degree of rotation for
#' text labels. Valid only if argument is supplied to the parameter
#' `mainLabelCol`.
#' @param mainDropMut Boolean specifying whether to drop unused
#' "mutation type" levels from the legend.
#' @param mainPalette Character vector specifying colours for mutation types
#' plotted in the main plot, must specify a colour for each mutation type
#' plotted.
#' @param mainLayer Valid ggplot2 layer to be added to the main plot.
#' @param mutBurden Object of class data frame containing columns "sample",
#' "mut_burden" with sample levels matching those supplied in x.
#' @param plotMutBurden Boolean specify if the mutation burden sub-plot should
#' be displayed.
#' @param coverageSpace Integer specifying the size in bp of the genome
#' covered by sequence data from which mutations could be called
#' (see details and vignette).
#' @param mutBurdenLayer Valid ggplot2 layer to be added to the top sub-plot.
#' @param clinData Object of class data frame with rows representing clinical
#' data. The data frame should be in "long format" and columns must be names as
#' "sample", "variable", and "value" (optional see details and vignette).
#' @param clinLegCol Integer specifying the number of columns in the legend for
#' the clinical data, only valid if argument is supplied to parameter clinData.
#' @param clinVarOrder Character vector specifying the order in which to plot
#' variables in the variable column of the argument given to the parameter
#' clinData. The argument supplied to this parameter should have the same unique
#' length and values as in the variable column of the argument supplied to
#' parameter clinData (see vignette).
#' @param clinVarCol Named character vector specifying the mapping of colours
#' to variables in the variable column of the data frame supplied to clinData
#' (ex. "variable"="colour").
#' @param clinLayer Valid ggplot2 layer to be added to the clinical sub-plot.
#' @param sampRecurLayer Valid ggplot2 layer to be added to the left sub-plot.
#' @param plotGenes Character vector specifying genes to plot. If not null genes
#' not specified within this character vector are removed.
#' @param geneOrder Character vector specifying the order in which to plot
#' genes.
#' @param plotSamples Character vector specifying samples to plot. If not null
#' all other samples not specified within this parameter are removed.
#' @param sampOrder Character vector specifying the order of the samples to
#' plot.
#' @param maxGenes Integer specifying the maximum number of genes to be plotted.
#' Genes kept will be choosen based on the recurrence of mutations in samples.
#' @param rmvSilent Boolean specifying if silent mutations should be removed
#' from the plot.
#' @param fileType Character string specifying the file format of the data
#' frame specified to parameter `x`, one of "MGI", "MAF", "Custom"
#' (see details and vignette).
#' @param variant_class_order Character vector specifying the hierarchical order
#' of mutation types to plot, required if file_type == "Custom"
#' (see details and vignette).
#' @param out Character vector specifying the the object to output, one of
#' "data", "grob", or "plot", defaults to "plot" (see returns).
#' @param plot_proportions Plot mutational profile layer?
#' @param proportions_layer ggplot2 layer(s) to be added to the mutational 
#'      profile plot
#' @param proportions_type Which type of proportions plot to use? Can be 
#'  "trv_type" or "TvTi" currently
#' @param section_heights Heights of each section. Must be the same length as 
#'  the number of vertical section
#' @details waterfall is a function designed to visualize the mutations seen in
#' a cohort. The function takes a data frame with appropriate column names (see
#' fileType parameter) and plots the mutations within. In cases where multiple
#' mutations occur in the same cell the most deleterious mutation is given
#' priority (see vignette for default priority). If the fileType parameter is
#' set to "Custom" the user most supply this priority via the
#' `variant_class_order` parameter with the highest priorities occuring first.
#' Additionally this parameter will override the default orders of MGI and MAF
#' file types.
#'
#' Various data subsets are allowed via the waterfall function (see above), all
#' of these subsets will occur independently of the mutation burden calculation.
#' To clarify the removal of genes and mutations will only occur after the
#' mutation burden is calculated. The mutation burden calculation is only meant
#' to provide a rough estimate and assumes that the coverage breadth within the
#' cohort is aproximately equal. For more accurate calculations it is
#' recommended to supply this information via the mutBurden parameter which.
#' Note that the mutation burden calculation relies on the `coverageSpace`
#' parameter (see vignette).
#'
#' It is possible to display additional information within the plot via cell
#' labels. The `mainLabelCol` parameter will look for an additional column in
#' the data frame and plot text within cells based on those values
#' (see vignette).
#' @examples
#' # Plot the data
#' waterfall(brcaMAF, plotGenes=c("PIK3CA", "TP53", "USH2A", "MLL3", "BRCA1"))
#' @return One of the following, a list of dataframes containing data to be
#' plotted, a grob object, or a plot.
#' @importFrom utils tail
#' @importFrom grid nullGrob
#' @export

waterfall <- function(x, mainRecurCutoff=0, mainGrid=TRUE, mainXlabel=FALSE, 
                      main_geneLabSize=8, mainLabelCol=NULL, mainLabelSize=4,
                      mainLabelAngle=0, mainDropMut=FALSE, mainPalette=NULL, 
                      mainLayer=NULL, mutBurden=NULL, plotMutBurden=TRUE, 
                      coverageSpace=44100000, mutBurdenLayer=NULL,
                      clinData=NULL, clinLegCol=1, clinVarOrder=NULL,
                      clinVarCol=NULL, clinLayer=NULL, sampRecurLayer=NULL, 
                      plotGenes=NULL, geneOrder=NULL, plotSamples=NULL,
                      sampOrder=NULL, maxGenes=NULL, rmvSilent=FALSE,
                      fileType='MAF', variant_class_order=NULL, out="plot",
                      plot_proportions = FALSE,
                      proportions_layer = NULL, proportions_type = "TRV_TYPE",
                      section_heights)
{
    
    # Perform data quality checks and conversions
    inputDat <- waterfall_qual(x, clinData, mutBurden, file_type=fileType,
                               label_col=mainLabelCol)
    data_frame <- inputDat[[1]]
    clinData <- inputDat[[2]]
    mutBurden <- inputDat[[3]]
    
    # Set flag if it is desirable to plot cell text
    if(!is.null(mainLabelCol))
    {
        main.plot_label_flag <- TRUE
    } else {
        main.plot_label_flag <- FALSE
    }
    
    # If it is requested subset the input data on a sample list
    if(!is.null(plotSamples))
    {
        data_frame <- waterfall_sampAlt(data_frame, plotSamples)
    }
    
    # add in a count of mutations at the sample level before anything is
    # stripped out and save for mutation recurrence plot
    data_frame2 <- waterfall_calcMutFreq(data_frame[,c('sample', 'gene',
                                                       'trv_type')])
    
    # Subset the data to remove silent mutations if specified
    if(rmvSilent==TRUE)
    {
        data_frame <- waterfall_rmvSilent(data_frame)
    }
    
    # Subset the data based on a vector of genes if supplied
    if(!is.null(plotGenes))
    {
        data_frame <- waterfall_geneAlt(data_frame, plotGenes)
    }
    
    # Remove trv_type that are not the most deleterious for a given gene/sample
    data_frame <- waterfall_hierarchyTRV(data_frame, fileType,
                                         variant_class_order)
    
    # Subset the data based on the recurrence of mutations at the gene level
    data_frame <- waterfall_geneRecurCutoff(data_frame, mainRecurCutoff)
    
    # Use the max genes parameter to limit the number of genes plotted
    # and then reorder genes based on the frequency of mutations
    gene_sorted <- waterfall_geneSort(data_frame, geneOrder)
    data_frame$gene <- factor(data_frame$gene, levels=gene_sorted)
    if(!is.null(maxGenes))
    {
        max_gene_list <- utils::tail(gene_sorted, maxGenes)
        data_frame <- data_frame[data_frame$gene %in% max_gene_list,]
    }
    
    # reorder the samples based on hiearchial sort on ordered gene list
    sample_order <- waterfall_sampSort(data_frame, sampOrder)
    data_frame$sample <- factor(data_frame$sample, levels=sample_order)
    
    if (plot_proportions) {
        prop_x_label <- is.null(clinData)
        if (toupper(proportions_type) == "TRV_TYPE") {
            prop_dat <- inputDat[[1]]
            prop_dat[["sample"]] <- factor(prop_dat[["sample"]], 
                                           levels = sample_order)
            proportions_plot <- waterfall_build_proportions(
                data_frame = prop_dat, 
                plot_palette = mainPalette,
                file_type = fileType,
                layers = proportions_layer,
                x_label = prop_x_label
            )
        } else if (toupper(proportions_type) == "TVTI") {
            if (is.null(proportions_layer)) {
                proportions_layer <- list(
                    theme(
                        axis.ticks.x = element_blank(),
                        axis.text.x = element_blank(),
                        axis.ticks.y = element_blank(),
                        axis.text.y = element_blank(),
                        panel.background = element_blank(),
                        panel.border = element_blank(),
                        panel.grid.minor = element_blank(),
                        plot.background = element_blank()
                    ),
                    scale_y_continuous(
                        name = "Proportion", 
                        labels = scales::percent_format()
                    ),
                    guides(fill = guide_legend(ncol = 2))
                )
            }
            proportions_plot <- TvTi(
                x,
                fileType = fileType,
                sort = "custom",
                sample_order_input = sample_order,
                layers = proportions_layer,
                return_plot = TRUE
            )
            if (prop_x_label) proportions_plot <- proportions_plot + xlab(paste0('Sample (n=', nlevels(data_frame$sample), ')'))
        }
    } else proportions_plot <- NULL
    
    # Reorder the sample levels in data_frame2 to match the main plot's levels,
    # and then plot the top margin plot
    if(isTRUE(plotMutBurden))
    {
        if(!is.null(mutBurden))
        {
            if(!setequal(sample_order, mutBurden$sample))
            {
                stop("levels in the sample column of mutBurden does not match
                     either: the samples given in x, or plotSamples")
            }
            
            mutBurden$sample <- factor(mutBurden$sample, levels=sample_order)
            burden_plot <- waterfall_buildMutBurden_B(mutBurden, layers=mutBurdenLayer)
            } else {
                data_frame2$sample <- factor(data_frame2$sample,
                                             levels=sample_order)
                burden_plot <- waterfall_buildMutBurden_A(data_frame2, coverageSpace,
                                                          layers=mutBurdenLayer)
            }
    } else {
        # create a blank ggplot object
        burden_plot <- grid::nullGrob()
    }
    
    # Plot the Left Bar Chart
    gene_plot <- waterfall_buildGenePrevelance(data_frame, layers=sampRecurLayer,
                                               gene_label_size=main_geneLabSize)
    
    # if there are any NA values in the data frame for a gene, give these NA
    # values a gene name so they are plotted properly
    data_frame <- waterfall_NA2gene(data_frame)
    
    # Plot the Heatmap
    plot_x_title <- !(!is.null(clinData) || plot_proportions)
    
    heatmap <- waterfall_buildMain(data_frame, grid=mainGrid,
                                   label_x=mainXlabel,
                                   file_type=fileType,
                                   drop_mutation=mainDropMut,
                                   plot_x_title=plot_x_title,
                                   plot_label=main.plot_label_flag,
                                   plot_label_size=mainLabelSize,
                                   plot_palette=mainPalette, layers=mainLayer,
                                   plot_label_angle=mainLabelAngle)
    
    
    # Plot any clinical data if it is specified
    if(!is.null(clinData))
    {
        # match the levels of sample in y to conform to the main plot
        clinData$sample <- factor(clinData$sample, levels=sample_order)
        if(any(is.na(clinData$sample)))
        {
            clinData <- clinData[-which(is.na(clinData$sample)),]
        }
        
        # plot the clinical data
        clinical_plot <- multi_buildClin(clinData, clin.legend.col=clinLegCol,
                                         clin.var.colour=clinVarCol,
                                         clin.var.order=clinVarOrder,
                                         clin.layers=clinLayer)
        
        # Align all plots and return as 1 plot
        pA <- waterfall_align(genes = gene_plot, heatmap = heatmap, burden = burden_plot, 
                              clinical = clinical_plot, proportion = proportions_plot, 
                              section_heights = section_heights)
        #return(grid::grid.draw(pA))
    } else 
    {
        pA <- waterfall_align(genes = gene_plot, heatmap = heatmap, burden = burden_plot, 
                              proportion = proportions_plot, section_heights = section_heights)
    }
    dataOut <- list("main"=data_frame,
                    "mutation_count"=data_frame2,
                    "clinical"=clinData)
    output <- multi_selectOut(data=dataOut, plot=pA, draw=TRUE, out=out)
    return(output)
}
#' Convert Custom File
#'
#' Convert columns of a Custom annotation file into a format
#' recognizable by internal functions
#' @name waterfall_Custom2anno
#' @param x a data frame with columns having values for sample, gene, mutation
#' type
#' @param label_col Character string specifying the column name of a
#' label column (optional)
#' @noRd
#' @return a data frame coerced from custom to annotation format

waterfall_Custom2anno <- function(x, label_col)
{
    # message statement
    memo <- paste0("Detected \"Custom\" file_type flag, ",
                   "looking for correct column names...")
    message(memo)
    
    # define expected columns
    expec_col <- c("sample", "gene", "variant_class")
    if(!is.null(label_col))
    {
        expec_col <- c(expec_col, label_col)
    }
    
    # check expected columns are present
    if(!all(expec_col %in% colnames(x)))
    {
        memo <- paste0("Did not detect correct column names, column names
                       should be: ", toString(expec_col))
        stop(memo)
    }
    
    x <- x[,c('sample', 'gene', 'variant_class', label_col)]
    
    if(!is.null(label_col))
    {
        colnames(x) <- c('sample', 'gene', 'trv_type', 'label')
    } else {
        colnames(x) <- c('sample', 'gene', 'trv_type')
    }
    
    # if no silent mutations are present warn the user
    if(all(!toupper(x$trv_type) %in% toupper("silent")))
    {
        warning("Did not detect silent mutations in input, is this expected?")
    }
    return(x)
}
#' Convert MAF File
#'
#' Convert columns of a mutation annotation file "MAF" into a format
#' recognizable by internal functions
#' @name waterfall_MAF2anno
#' @param x a data frame in MAF format
#' @param label_col Character string specifying the column name of a
#' label column
#' @noRd
#' @return a data frame coerced from MAF to TGI format

waterfall_MAF2anno <- function(x, label_col)
{
    # Check that correct column names are present and convert to internal format
    expec_col <- c('Tumor_Sample_Barcode', 'Hugo_Symbol',
                   'Variant_Classification')
    
    if(!is.null(label_col))
    {
        expec_col <- c(expec_col, label_col)
    }
    
    if(!all(expec_col %in% colnames(x)))
    {
        memo <- paste0("Did not detect correct column names, column names
                       should be: ", toString(expec_col))
        stop(memo)
    }
    
    x <- x[,c('Tumor_Sample_Barcode', 'Hugo_Symbol', 'Variant_Classification',
              label_col)]
    
    if(!is.null(label_col))
    {
        colnames(x) <- c('sample', 'gene', 'trv_type', 'label')
    } else {
        colnames(x) <- c('sample', 'gene', 'trv_type')
    }
    return(x)
}
#' Convert MGI File
#'
#' Convert columns of a mutation annotation file "MGI" into a format
#' recognizable by internal functions
#' @name waterfall_MGI2anno
#' @param x a data frame in MGI internal format
#' @param label_col Character string specifying the column name of a label
#' column
#' @noRd
#' @return a data frame coerced from MGI to internal annotation format

waterfall_MGI2anno <- function(x, label_col)
{
    # Check that correct column names are present and convert to internal format
    expec_col <- c('sample', 'gene_name', 'trv_type')
    if(!is.null(label_col))
    {
        expec_col <- c(expec_col, label_col)
    }
    
    if(!all(expec_col %in% colnames(x)))
    {
        memo <- paste0("Did not detect correct column names, column names
                       should be: ", toString(expec_col))
        stop(memo)
    }
    
    x <- x[,c('sample', 'gene_name', 'trv_type', label_col)]
    if(!is.null(label_col))
    {
        colnames(x) <- c('sample', 'gene', 'trv_type', 'label')
    } else {
        colnames(x) <- c('sample', 'gene', 'trv_type')
    }
    
    return(x)
}
#' Assign NA samples a gene
#'
#' Replace NA values in a gene column with the top gene name
#' @name waterfall_NA2gene
#' @param x a data frame in anno format
#' @return a data frame with NA values in a gene column coerced to the top gene
#' name
#' @noRd
#' @importFrom stats na.omit

waterfall_NA2gene <- function(x)
{
    # Get The gene with the most Mutations and add the NA samples to that gene
    # (ensures that the NAs are added in as gene with most mutations will always
    # be plotted) i.e. makes sure that samples are plotted, happens with rmvSilent param
    
    # find top gene
    top_gene <- stats::na.omit(rev(x$gene))[1]
    
    # set the trv_type to NA if gene is NA (makes sure that wile a sample is plotted the cell is empty)
    x$trv_type <- replace(x$trv_type, is.na(x$gene), NA)
    x$gene <- replace(x$gene, is.na(x$gene), top_gene)
    
    return(x)
}
#' align plots
#'
#' align mutation landscape, mutation burden on sample, and mutation burden on
#' gene plots
#' @name waterfall_align
#' @param heatmap ggplot object displaying a mutation landscape
#' @param genes ggplot object displaying mutation burden on gene
#' @param burden ggplot object displaying mutation burden on sample
#' @param clinical ggplot object displaying clinical information "optional"
#' @param proportion ggplot object displaying proportion of mutation types "optional"
#' @param section_heights Heights of each section (should sum to one)
#' @return a grob object
#' @noRd
#' @importFrom gridExtra arrangeGrob
#' @importFrom grid nullGrob

waterfall_align <- function(genes, heatmap, burden, clinical, proportion, 
                            section_heights) {
    if (missing(section_heights)) {    
        if (missing(clinical)) {
            if (is.null(proportion)) {
                section_heights <- c(1, 4)
            } else {
                section_heights <- c(1, 4, 0.5)
            }
        } else {
            if (is.null(proportion)) {
                section_heights <- c(1, 4, 0.5)
            } else {
                section_heights <- c(1, 4, 1.2, 0.5)
            }
        }
    }
    
    
    # define the ggplot's as grobs and create a blank plot
    genes_grob <- suppressWarnings(ggplot2::ggplotGrob(genes))
    
    heatmap_grob <- ggplot2::ggplotGrob(heatmap)
    
    ## Strip out legends and plot separately
    ## https://github.com/baptiste/gridextra/wiki/arranging-ggplot#legends
    ind_legend <- grep("guide", heatmap_grob$layout$name)
    heatmap_legend <- heatmap_grob[["grobs"]][[ind_legend]]
    heatmap_width <- sum(heatmap_legend$width)
    heatmap_grob <- ggplot2::ggplotGrob(heatmap + theme(legend.position="none"))
    
    if(grid::is.grob(burden)){
        burden_width <- NULL
        burden_grob <- burden
        blankPanel <- grid::grid.rect(gp=grid::gpar(col="white"))
        burden_legend <- grid::nullGrob()
    } else {
        burden_grob <- ggplot2::ggplotGrob(burden)
        ## Strip out legends and plot separately
        ind_legend <- grep("guide", burden_grob$layout$name)
        burden_legend <- burden_grob[["grobs"]][[ind_legend]]
        burden_width <- sum(burden_legend$width)
        burden_grob <- ggplot2::ggplotGrob(burden + theme(legend.position="none"))
        
        blankPanel <- grid::grid.rect(gp=grid::gpar(col="white"))
    }
    
    if (!is.null(proportion)) {
        prop_grob <- ggplot2::ggplotGrob(proportion)
        ind_legend <- grep("guide", prop_grob$layout$name)
        prop_legend <- prop_grob[["grobs"]][[ind_legend]]
        prop_width <- sum(prop_legend$width)
        prop_grob <- ggplot2::ggplotGrob(proportion + theme(legend.position="none"))
        
    } else prop_width <- NULL
    
    if(!missing(clinical)) {
        clin_grob <- ggplot2::ggplotGrob(clinical)
        ind_legend <- grep("guide", clin_grob$layout$name)
        clin_legend <- clin_grob[["grobs"]][[ind_legend]]
        clin_width <- sum(clin_legend$width)
        clin_grob <- ggplot2::ggplotGrob(clinical + theme(legend.position="none"))
        
    } else clin_width <- NULL
    
    ## https://github.com/baptiste/gridextra/wiki/arranging-ggplot#legends
    legend_width <- unit.pmax(heatmap_width, burden_width, clin_width, prop_width)
    width_left <- unit(1, "npc") - max(legend_width)
    widths <- grid::unit.c(width_left * 0.15, width_left * 0.85, legend_width)
    
    # Adjust the grob widths so heatmap and burden plots line up
    if(!missing(clinical)) {
        if (!is.null(proportion)) {
            maxwidth <- grid::unit.pmax(heatmap_grob$widths,
                                        burden_grob$widths,
                                        clin_grob$widths,
                                        prop_grob$widths)
            prop_grob$widths <- as.list(maxwidth)
        } else {
            maxwidth <- grid::unit.pmax(heatmap_grob$widths,
                                        burden_grob$widths,
                                        clin_grob$widths
            )
        }
        burden_grob$widths <- as.list(maxwidth)
        heatmap_grob$widths <- as.list(maxwidth)
        clin_grob$widths <- as.list(maxwidth)
    } else {
        if (!is.null(proportion)) {
            maxwidth <- grid::unit.pmax(heatmap_grob$widths,
                                        burden_grob$widths,
                                        prop_grob$widths)
            prop_grob$widths <- as.list(maxwidth)
        } else {   
            maxwidth <- grid::unit.pmax(heatmap_grob$widths,
                                        burden_grob$widths
            )
        }
        burden_grob$widths <- as.list(maxwidth)
        heatmap_grob$widths <- as.list(maxwidth)
    }
    
    # Adjust the grob heights so heatmap, and genes plots line up
    maxheight <- grid::unit.pmax(genes_grob$heights, heatmap_grob$heights)
    genes_grob$heights <- as.list(maxheight)
    heatmap_grob$heights <- as.list(maxheight)
    
    # plot the grobs with grid.arrange
    if(!missing(clinical)) {
        if (!is.null(proportion)) {
            nrows <- 4
            grobs <- list(
                blankPanel, burden_grob, burden_legend, 
                genes_grob, heatmap_grob, heatmap_legend, 
                blankPanel, prop_grob, prop_legend, 
                blankPanel, clin_grob, clin_legend)
        } else {
            nrows <- 3
            grobs <- list(
                blankPanel, burden_grob, burden_legend, 
                genes_grob, heatmap_grob, heatmap_legend,
                blankPanel, clin_grob, clin_legend)
        }
        heatmap <- gridExtra::arrangeGrob(grobs = grobs,
                                          ncol=3, nrow=nrows, 
                                          widths=widths,
                                          heights=section_heights)
        
        
        
    } else {
        if (!is.null(proportion)) {
            nrows <- 3
            grobs <- list(
                blankPanel, burden_grob, burden_legend,
                genes_grob, heatmap_grob, heatmap_legend, 
                blankPanel, prop_grob, prop_legend)
        } else {
            grobs <- list(
                blankPanel, burden_grob, burden_legend,
                genes_grob, heatmap_grob, heatmap_legend
            )
            nrows <- 2
        }
        heatmap <- gridExtra::arrangeGrob(grobs = grobs,
                                          ncol=3, nrow=nrows,
                                          widths=widths, heights=section_heights)
    }
    
    return(heatmap)
}
#' plot mutation recurrence in genes
#'
#' plot a bar graph displaying the percentage of samples with a mutation
#' @name waterfall_buildGenePrevelance
#' @param data_frame a data frame in MAF format
#' @param gene_label_size numeric value indicating the size of the gene labels
#'      on the y-axis
#' @param layers additional ggplot2 layers
#' @noRd
#' @return a ggplot object
#' @importFrom plyr count
#' @importFrom stats na.omit

waterfall_buildGenePrevelance <- function(data_frame, gene_label_size=8, layers=NULL)
{
    # Convert all silent mutations to Synonymous, and all else to non-synonymous
    data_frame$trv_type <- as.character(data_frame$trv_type)
    data_frame$trv_type[toupper(data_frame$trv_type) != toupper('silent')] <-
        'Non Synonymous'
    data_frame$trv_type[toupper(data_frame$trv_type) == toupper('silent')] <-
        'Synonymous'
    data_frame$trv_type <- factor(data_frame$trv_type,
                                  levels=c('Synonymous', 'Non Synonymous'))
    
    # Define the number of samples for the Percentage calculation
    # (Note: to pass a variable outside of aes into aes it needs to be
    # defined again)
    total_number_sample <- nlevels(data_frame$sample)
    
    # Make a count column
    data_frame <- plyr::count(data_frame, c("gene", "trv_type"))
    data_frame$prop <- data_frame$freq/total_number_sample * 100
    
    # Define Theme and various other layers to be passed to ggplot
    theme <- theme(axis.text.y=element_text(size=gene_label_size,
                                            colour='black', face='italic'),
                   axis.title.y=element_blank(),
                   legend.position=('none'))
    y_limits <- ylim(100, 0)
    y_label <- ylab('% Mutant')
    legend <- scale_fill_manual(name="Translational Effect",
                                values=c("Non Synonymous"="blue", "Synonymous"="red"),
                                breaks=c('Synonymous', 'Non Synonymous'),
                                drop=FALSE)
    
    if(!is.null(layers))
    {
        layers <- layers
    } else {
        layers <- geom_blank()
    }
    
    # Plotting call
    p1 <- ggplot(stats::na.omit(data_frame),
                 aes_string(x='gene', y='prop', fill='trv_type')) +
        geom_bar(position='stack', alpha=.75, width=1, stat='identity') +
        theme_bw() + coord_flip() + theme + y_label + scale_y_reverse() +
        legend + layers
    
    return(p1)
}
#' Plot a mutation heatmap
#'
#' Plot a Mutation Landscape with variables sample, gene, mutation
#' @name waterfall_buildMain
#' @param data_frame a data frame in MAF format
#' @param grid boolean value whether to overlay a grid on the plot
#' @param label_x boolean value whether to label the x axis
#' @param file_type character string specifying the file type, one of 'MAF' or
#' 'MGI'
#' @param drop_mutation Boolean specifying whether to drop unused
#' "mutation type" levels from the legend
#' @param plot_x_title Boolean specifying whether to plot the x_axis title
#' @param plot_label Boolean specifying whether to plot text inside each cell
#' @param plot_label_size Integer specifying text size of cell labels
#' @param plot_palette Character vector specifying colors to fill on mutation
#' type
#' @param layers additional ggplot2 layers to plot
#' @param plot_label_angle angle at which to plot label text if plot_label is
#' true
#' @noRd
#' @return a ggplot2 object
#' @import ggplot2

waterfall_buildMain <- function(data_frame, grid=TRUE, label_x=FALSE,
                                file_type='MGI',
                                drop_mutation=FALSE, plot_x_title=TRUE,
                                plot_label=FALSE, plot_label_size=4,
                                plot_palette=NULL, layers=NULL,
                                plot_label_angle=0)
{
    # NOTE: scale_x_discret(drop=FALSE), added to ggplot2 call to ensure samples
    # remove by 'mutation_recurrence_subset' are still plotted as empty tiles
    
    # Define layers
    
    # grid overlay
    vertical_grid <- geom_vline(xintercept = seq(.5, nlevels(data_frame$sample),
                                                 by=1),
                                linetype='solid', colour='grey80', size=.01)
    
    if(length(unique(data_frame$gene)) == 1)
    {
        horizontal_grid <- geom_blank()
    } else {
        horizontal_grid <- geom_hline(yintercept = seq(1.5, length(unique(data_frame$gene)),
                                                       by=1),
                                      linetype='solid', colour='grey80',
                                      size=.01)
    }
    
    # Declare the appropriate palette
    palette <- waterfall_select_palette(file_type = file_type, 
                                        custom_palette = plot_palette)
    breaks_labels <- waterfall_palette_names(palette,
                                             file_type, data_frame)
    breaks <- breaks_labels[["breaks"]]
    labels <- breaks_labels[["labels"]]
    
    if(drop_mutation == TRUE)
    {
        # Create Legend
        legend <- scale_fill_manual(name="Mutation Type", values=palette,
                                    breaks=breaks, labels=labels, drop=TRUE)
    } else if (drop_mutation == FALSE) {
        # Create Legend
        legend <- scale_fill_manual(name="Mutation Type", values=palette,
                                    breaks=breaks, labels=labels, drop=FALSE)
    }
    
    # X Label
    x_label <- xlab(paste0('Sample (n=', nlevels(data_frame$sample), ')'))
    
    if(plot_label == TRUE)
    {
        label <- geom_text(data=data_frame,
                           mapping=aes_string(x='sample', y='gene',
                                              label='label'),
                           size=plot_label_size, colour='white',
                           angle=plot_label_angle)
    } else {
        label <- geom_blank()
    }
    
    # Theme, Boolean, if specified to plot x labels, define theme such that
    # labels are plotted
    default_theme <- theme(axis.ticks=element_blank(),
                           panel.grid.major=element_blank(),
                           panel.grid.minor=element_blank(),
                           axis.title.y=element_blank(),
                           panel.background=element_rect(fill='white',
                                                         colour='white'),
                           axis.text.y=element_blank(),
                           plot.title=element_blank())
    if(label_x & plot_x_title)
    {
        theme <-  theme(axis.text.x=element_text(angle=50, hjust=1),
                        axis.title.x=element_text(size=10),
                        legend.title=element_text(size=14)
        )
    } else if(!label_x & plot_x_title) {
        theme <-  theme(axis.text.x=element_blank(),                        
                        axis.title.x=element_text(size=20),
                        legend.title=element_text(size=14),
                        panel.border=element_rect(colour='grey80', fill=NA,
                                                  size=.1),
                        legend.position=("right"))
    } else if(label_x & !plot_x_title) {
        theme <-  theme(axis.text.x=element_text(angle=50, hjust=1),
                        axis.title.x=element_blank(),
                        legend.title=element_text(size=14))
    } else if(!label_x & !plot_x_title) {
        theme <-  theme(
            axis.text.x=element_blank(),
            axis.title.x=element_blank(),
            legend.title=element_text(size=14),
            panel.border=element_rect(colour='grey80', fill=NA,
                                      size=.1),
            legend.position=("right"))
    }
    
    # additional parameters
    if(!is.null(layers))
    {
        layers <- layers
    } else {
        layers <- geom_blank()
    }
    
    # ggplot call
    p1 <- ggplot(data_frame, aes_string('sample', 'gene')) +
        geom_tile(aes_string(fill='trv_type'), position="identity")
    if(grid == TRUE) {
        p1 <- p1 + vertical_grid + horizontal_grid 
    }
    p1 <- p1 +  default_theme + theme + legend + x_label +
        scale_x_discrete(drop=FALSE) + label + layers
    
    return(p1)
}
#' plot mutation burden
#'
#' plot a barchart showing mutations per MB
#' @name waterfall_buildMutBurden_A
#' @param x a data frame in MAF format
#' @param coverage_space an integer specifying the coverage space in base pairs
#' from which a mutation could occur
#' @param layers Additional ggplot2 layers to plot
#' @noRd
#' @return a ggplot object

waterfall_buildMutBurden_A <- function(x, coverage_space, layers=NULL)
{
    # Add in mutations per MB calculation
    x$mutation_per_MB <-
        x$mutation_total/coverage_space * 1000000
    
    # Alter GGplot2 Theme
    theme <- theme(axis.ticks.x = element_blank(),
                   axis.text.x = element_blank(),
                   axis.title.x = element_blank(),
                   legend.title = element_text(size=14),
                   axis.text.y = element_text(colour = "black"),
                   axis.title.y = element_text(colour = "black"),
                   panel.background = element_blank(),
                   # panel.grid.minor.y = element_line(colour = "black"),
                   panel.grid.major.y = element_line(colour = "grey80"),
                   panel.grid.minor.x = element_blank(),
                   panel.grid.major.x = element_blank(),
                   panel.border = element_rect(fill = NA)
    )
    
    # Add Legend
    legend <- scale_fill_manual(name="Translational Effect",
                                values=c("Synonymous"="red", "Non Synonymous"="blue"),
                                breaks=c("Synonymous", "Non Synonymous"),
                                drop=FALSE)
    
    # add y label
    y_label <- ylab('Mutations per MB')
    
    # ggplot2 call
    p1 <- ggplot(x, aes_string(x='sample', y='mutation_per_MB',
                               fill='trv_type')) +
        geom_bar(stat='identity', alpha=.75, width=1) +
        theme + y_label + legend + layers
    
    return(p1)
}
#' plot mutation burden
#'
#' plot a barchart showing mutation burden given by data frame
#' @name waterfall_buildMutBurden_B
#' @param x a data frame containing columns sample, mut_burden
#' @param layers additional ggplot2 layers to plot
#' @return a ggplot object
#' @noRd
#' @import ggplot2

waterfall_buildMutBurden_B <- function(x, layers=NULL)
{
    # add in fake column for legend
    # (necessary to have legend for proper plot alignment)
    x$Type <- c("Undefined")
    x$Type <- factor(x$Type,
                     levels=c("Synonymous", "Non Synonymous", "Undefined"))
    
    # Define Theme
    theme <- theme(axis.ticks.x = element_blank(),
                   axis.text.x = element_blank(),
                   axis.title.x = element_blank(),
                   legend.title = element_text(size=14),
                   axis.text.y = element_text(colour = "black"),
                   axis.title.y = element_text(colour = "black"),
                   panel.background = element_blank(),
                   # panel.grid.minor.y = element_line(colour = "black"),
                   panel.grid.major.y = element_line(colour = "grey80"),
                   panel.grid.minor.x = element_blank(),
                   panel.grid.major.x = element_blank(),
                   panel.border = element_rect(fill = NA)
    )
    
    # Define additional parameters
    y_label <- ylab("Mutation Burden")
    legend <- scale_fill_manual(name="Translational Effect",
                                values=c("Non Synonymous"="blue",
                                         "Synonymous"="red",
                                         "Undefined"="black"),
                                breaks=c("Synonymous", "Non Synonymous", "Undefined"),
                                drop=FALSE)
    
    
    bar <- geom_bar(stat='identity', alpha=.75, width=1)
    
    # ggplot2 call
    p1 <- ggplot(x, aes_string(x='sample', y='mut_burden', fill='Type')) + bar +
        theme + y_label + legend + layers
    
    return(p1)
}
#' @title Build mutational profile plot
#' 
#' @param data_frame input data.frame
#' @param plot_palette Color palette to use
#' @param file_type MAF, etc
#' @param layers layer(s) to add to this plot object
#' @param x_label label this plot?
#' @description Builds a ggplot object showing individuals' mutational profile
#' @noRd
#' @return a ggplot object
waterfall_build_proportions <- function(data_frame, plot_palette, 
                                        file_type, layers, x_label) {
    
    # Declare the appropriate palette
    palette <- waterfall_select_palette(file_type, custom_palette = plot_palette)
    breaks_labels <- waterfall_palette_names(palette, file_type, data_frame)
    breaks <- breaks_labels[["breaks"]]
    labels <- breaks_labels[["labels"]]
    
    if (x_label) {
        x_label_obj <- xlab(paste0('Sample (n=', nlevels(data_frame$sample), ')'))
    } else {
        x_label_obj <- geom_blank()
    }
    
    p5 <- ggplot(data_frame, aes_string(x = 'sample', fill = 'trv_type')) + 
        geom_bar(position = "fill", width = 0.95) + 
        scale_y_continuous(
            name = "Proportion", 
            labels = scales::percent_format()) + 
        scale_fill_manual(name="Mutation Type", 
                          values=palette, 
                          breaks = breaks,
                          labels = labels
        ) + 
        guides(fill = guide_legend(title = "Mutation Type", ncol = 2)) +
        theme(
            axis.ticks.x = element_blank(),
            axis.text.x =  element_blank(),
            axis.title.x = if (x_label) element_text() else element_blank(),
            axis.text.y = element_text(colour = "black"),
            axis.title.y = element_text(colour = "black", hjust = 0),
            panel.background = element_blank(),
            panel.border = element_blank(),
            panel.grid.minor = element_blank(),
            plot.background = element_blank()
        ) + layers + x_label_obj
    p5
}
#' Calculate Synonymous/Nonsynonymous mutation frequency
#'
#' Creates a data frame giving synonymous/nonsynonymous counts on a sample level
#' @name waterfall_calcMutFreq
#' @param x data frame in long format with columns sample, trv_type
#' @importFrom data.table melt
#' @noRd
#' @return a data frame with synonymous/nonsynonymous counts appended

waterfall_calcMutFreq <- function(x)
{
    message("Calculating frequency of mutations...")
    # Change trv_type calls to either synonymous or non synonymous,
    # for use in the mutation per Mb plot
    x$trv_type <- as.character(x$trv_type)
    x$trv_type[toupper(x$trv_type) != toupper('silent')] <- 'Non Synonymous'
    x$trv_type[toupper(x$trv_type) == toupper('silent')] <- 'Synonymous'
    x$trv_type <- factor(x$trv_type, levels=c('Synonymous', 'Non Synonymous'))
    
    # Obtain a data frame of mutation counts on the sample level
    mutation_counts <- table(x[,c('sample', 'trv_type')])
    mutation_counts <- as.data.frame(data.table::melt(mutation_counts))
    colnames(mutation_counts) <- c('sample', 'trv_type', 'mutation_total')
    
    return(mutation_counts)
}
#' mutation sample cutoff gene based
#'
#' Subset a internal mutSpec file keeping only samples within the specified gene
#'  list
#' @name waterfall_geneAlt
#' @param x a data frame in long format with columns 'gene', 'trv_type'
#' @param genes character vector listing genes to plot
#' @noRd
#' @return a subset data frame

waterfall_geneAlt <- function(x, genes)
{
    message("Removing genes not in: ", toString(genes))
    # Perform quality checks
    if(!is(genes, 'character'))
    {
        memo <- paste0("argument supplied to plotGenes is not a character ",
                       "vector, attempting to coerce")
        warning(memo)
        genes <- as.character(genes)
    }
    
    if(!all(toupper(genes) %in% toupper(x$gene)))
    {
        memo <- paste0("genes supplied in plotGenes contains an element not ",
                       "found in x or it's subsequent subsets")
        warning(memo)
    }
    genes <- c(genes, NA)
    x <- x[(toupper(x$gene) %in% toupper(genes)), ]
    
    return(x)
}
#' Mutation Recurrence Cutoff
#'
#' Subset a MAF file keeping only samples that meet a mutation recurrence cutoff
#' @name waterfall_geneRecurCutoff
#' @param x data frame in long format with columns 'gene', 'trv_type', 'sample'
#' @param recurrence_cutoff integer specifying removal of entries not seen
#' in at least "x" percent of samples
#' @return a subset data frame
#' @noRd
#' @importFrom plyr count
#' @importFrom stats na.omit

waterfall_geneRecurCutoff <- function(x, recurrence_cutoff)
{
    if(recurrence_cutoff != 0)
    {
        message("Performing recurrence cutoff...")
    }
    mutRecur <- plyr::count(unique(x), vars=c("gene"))
    mutRecur <- stats::na.omit(mutRecur)
    mutRecur$prop <- mutRecur$freq/nlevels(x$sample)
    
    # If recurrence cutoff specified exceeds upper limit such that no plot
    # useful would be generated, reset recurrence cutoff
    maxRecur <- max(mutRecur$prop)
    if(maxRecur < recurrence_cutoff)
    {
        memo <- paste0("The recurrence cutoff specified exceeds the recurrence",
                       " seen in the data, resetting this value to equal max ",
                       "recurrence:", maxRecur)
        warning(memo)
        recurrence_cutoff <- maxRecur
    }
    
    gene_above_recur <- mutRecur[mutRecur$prop >= recurrence_cutoff,]$gene
    # add NA to the end of 'gene_above_recurrence' vector, allowing for all
    # samples having NA as a gene name to be retained in the subset below
    gene_above_recur <- c(as.character(gene_above_recur), NA)
    
    # subset the original data frame based on the following: keep gene if it is
    # in the gene vector in "mutation_recurrence_subset"
    x <- x[(x$gene %in% gene_above_recur), ]
    
    return(x)
}
#' sort waterfall file by gene
#'
#' order a waterfall file ranking genes with more mutations higher if a gene
#' order is unspecified.
#' @name waterfall_geneSort
#' @param x Data frame with columns names "gene", "trv_type".
#' @param geneOrder Character vector specifying the order in which to plot
#' genes.
#' @noRd
#' @return Character vector of ordered genes

waterfall_geneSort <- function(x, geneOrder=NULL)
{
    if(!is.null(geneOrder))
    {
        geneOrder <- as.character(unique(geneOrder))
        # if there are any genes in geneOrder not in x, remove those
        gene_to_rmv <- geneOrder[!geneOrder %in% unique(x$gene)]
        if(length(gene_to_rmv) == 0 & length(geneOrder) >= 1)
        {
            return(rev(geneOrder))
        } else if(length(gene_to_rmv) == length(geneOrder)) {
            memo <- paste0("Did not find any genes supplied to the parameter",
                           " geneOrder in the input supplied to x, perhaps ",
                           " there is a case issue?")
            warning(memo)
        } else if(length(gene_to_rmv) != length(geneOrder)) {
            memo <- paste0("The following genes were not found in the input", 
                           " supplied to parameter x: ", toString(gene_to_rmv),
                           ", removing these from geneOrder!")
            warning(memo)
            gene_order <- geneOrder[geneOrder %in% unique(x$gene)]
            return(rev(gene_order))
        }
    }
    
    # order based on mutation frequency
    gene_mutation_table <- table(x[,c('gene', 'trv_type')])
    gene_order <- names(sort(rowSums(gene_mutation_table)))
    return(gene_order)
}
#' Hiearchical removal of MAF entries
#'
#' Remove MAF entries with the same gene/sample in an ordered fashion such that
#' the most deleterious are retained
#' @name waterfall_hierarchyTRV
#' @param x a data frame in long format with columns sample, gene,
#' trv_type
#' @param file_type The type of file to act on one of 'MAF", "MGI", "Custom"
#' @param variant_class_order character vector giving the hierarchical order of
#' mutation types to plot
#' @noRd
#' @return a data frame with multiple mutations in the same sample/gene
#' collapsed on the most deleterious

waterfall_hierarchyTRV <- function(x, file_type, variant_class_order)
{
    message("setting mutation hierarchy...")
    # if variant_class_order is null use predefined values
    if(is.null(variant_class_order))
    {
        if(toupper(file_type) == toupper('MGI'))
        {
            mutation_order <- c("nonsense", "frame_shift_del",
                                "frame_shift_ins", "splice_site_del",
                                "splice_site_ins", "splice_site",
                                "nonstop", "in_frame_del", "in_frame_ins",
                                "missense", "splice_region_del",
                                "splice_region_ins", "splice_region",
                                "5_prime_flanking_region",
                                "3_prime_flanking_region",
                                "3_prime_untranslated_region",
                                "5_prime_untranslated_region", "rna",
                                "intronic", "silent", NA)
        } else if(toupper(file_type) == toupper('MAF')) {
            mutation_order <- c("Nonsense_Mutation", "Frame_Shift_Ins",
                                "Frame_Shift_Del", "Translation_Start_Site",
                                "Splice_Site", "Nonstop_Mutation",
                                "In_Frame_Ins", "In_Frame_Del",
                                "Missense_Mutation", "5\'Flank",
                                "3\'Flank", "5\'UTR", "3\'UTR", "RNA", "Intron",
                                "IGR", "Silent", "Targeted_Region", NA)
        } else if(toupper(file_type) == toupper('Custom')) {
            memo <- paste0("Detected NULL in variant_class_order, ",
                           "this parameter is required if file_type is set ",
                           "to \"Custom\"")
            stop(memo)
        }
    } else {
        mutation_order <- unique(c(variant_class_order, NA))
    }
    
    # Check that elements in trv_type are in the mutation order
    if(any(!x$trv_type %in% mutation_order))
    {
        memo <- paste0("Detected an invalid mutation type, valid values for ",
                       file_type, " are: ", toString(mutation_order))
        stop(memo)
    }
    # refactor the data frame
    x$trv_type <- factor(x$trv_type, levels=mutation_order)
    
    # sort the data frame so that the duplicated call will remove the
    # proper trv_type
    x <- x[order(x$sample, x$gene, x$trv_type),]
    
    # collapse the data on sample/gene
    x <- x[!duplicated(x[, c("sample", "gene")]), ]
    
    return(x)
}
#'  waterfall_palette_names
#' 
#' @title waterfall_palette_names
#' 
#' @description Make labels and breaks for palettes
#' 
#' @param palette Named colour vector as input 
#' @param file_type Which file type is involved?
#' @param data_frame Only used if file_type is "custom"
#' @noRd
#' @return a named list of "breaks" and "labels"
waterfall_palette_names <- function(palette, file_type, data_frame) {
    # Create breaks specific and labels for specified file type
    ## Create labels and breaks from 
    ## names of palette to avoid ordering issues       
    if(toupper(file_type) == toupper('MGI'))
    {
        breaks <- names(palette)
        
        labels <- gsub("(\\d)_prime", "\\1'", breaks)
        labels <- gsub("untranslated_region", "UTR", labels)
        labels <- gsub("flanking_region", "Flank", labels)
        labels <- gsub("_", " ", labels)
        labels <- gsub("del$", "deletion", labels)
        labels <- gsub("ins$", "insertion", labels)
        labels <- gsub("prime ", "", labels)
        labels <- gsub("(rna)", "\\U\\1", labels, perl = TRUE)
        labels <- gsub("\\b([a-z])", "\\U\\1", labels, perl = TRUE)
    } else if(toupper(file_type) == toupper('MAF')) {
        breaks <- names(palette)
        labels <- gsub("_", " ", breaks)
        labels <- gsub("'", "' ", labels)
    } else if(toupper(file_type) == toupper('Custom')) {
        breaks <- levels(data_frame[["trv_type"]])
        labels <- breaks
    }
    list("breaks" = breaks,
         "labels" = labels)
}

#' Check input to mutSpec
#'
#' Perform a data quality check on input to mutSpec
#' @name waterfall_qual
#' @param x a data frame in annotation format
#' @param y a data frame containing clinical data or a null object
#' @param z a data frame containing mutation burden information or a null object
#' @param file_type Character string specifying the input format to expect in x
#' @param label_col Character string specifying the column name of a label
#' column
#' @noRd
#' @return a list of data frames passing quality checks

waterfall_qual <- function(x, y, z, file_type, label_col)
{
    # print message statement
    message("Checking if input is properly formatted...")
    
    # Check input data to x
    if(!is.data.frame(x))
    {
        stop("Did not detect a data frame for input to x")
    }
    
    # Convert file type to internal format
    if(toupper(file_type) == toupper("MAF"))
    {
        x <- waterfall_MAF2anno(x, label_col)
    } else if(toupper(file_type) == toupper("MGI")) {
        x <- waterfall_MGI2anno(x, label_col)
    } else if(toupper(file_type) == toupper("Custom")) {
        x <- waterfall_Custom2anno(x, label_col)
    } else {
        stop("Unrecognized file_type: ", file_type)
    }
    
    # drop unused levels in x
    x$sample <- as.factor(x$sample)
    x$gene <- as.factor(x$gene)
    x$trv_type <- as.factor(x$trv_type)
    x <- droplevels(x)
    
    # Check input data to clinDat
    if(!is.null(y))
    {
        if(!is.data.frame(y))
        {
            stop("Did not detect a data frame for input to clinDat")
        }
        y <- droplevels(y)
        
        if(!all(c('sample', 'variable', 'value') %in% colnames(y)))
        {
            stop("Did not detect correct sample names in clinDat")
        }
        
        # make sure clinical data columns are of expected class
        y$sample <- as.factor(y$sample)
        y$variable <- as.factor(y$variable)
        y$value <- as.character(y$value)
    }
    
    # check input data to mutBurden
    if(!is.null(z))
    {
        if(!is.data.frame(z))
        {
            stop("Did not detect a data frame for input to mutBurden")
        }
        z <- droplevels(z)
        
        if(!all(c('sample', 'mut_burden') %in% colnames(z)))
        {
            stop("Did not detect correct sample names in mutBurden")
        }
        
        # Make sure mutation burden columns are of proper class
        z$sample <- as.factor(z$sample)
        z$mut_burden <- as.numeric(as.character(z$mut_burden))
    }
    
    return(list(x, y, z))
}
#' Silent Mutation Removal
#'
#' Subset a MAF file setting keeping only sample information if a mutation
#' is silent
#' @name waterfall_rmvSilent
#' @param x a data frame with columns 'sample', 'gene', 'trv_type'
#' @noRd
#' @return a subset data frame

waterfall_rmvSilent <- function(x)
{
    message("Removing silent mutations...")
    
    # Index and remove those rows which contain silent mutations
    x[which(toupper(x$trv_type) == toupper('silent')), c('gene')] <- NA
    if("label" %in% colnames(x)){
        x[which(toupper(x$trv_type) == toupper('silent')), c('label')] <- NA
    }  
    x[which(toupper(x$trv_type) == toupper('silent')), c('trv_type')] <- NA
    
    return(x)
}
#' mutation sample subset sample based
#'
#' Alter a mutSpec input file keeping/adding entries in a selection of samples
#' @name waterfall_sampAlt
#' @param x a data frame in long format with columns 'sample', 'trv_type'
#' @param samples character vector giving samples to plot
#' @return a subset data frame
#' @noRd
#' @importFrom plyr rbind.fill

waterfall_sampAlt <- function(x, samples)
{
    message("Retrieving requested samples from supplied data...")
    # Perform quality check
    if(!is(samples, 'character'))
    {
        memo <- paste0("argument supplied to main.samples is not a ",
                       "character vector, attempting to coerce")
        warning(memo)
        samples <- as.character(samples)
    }
    
    # add in samples not in the original data frame x
    if(!all(toupper(samples) %in% toupper(x$sample)))
    {
        memo <- paste0("Detected one or more samples supplied to main.samples ",
                       "not found in x ... adding samples to plot")
        message(memo)
        
        samp_not_in_x <- as.data.frame(samples[which(!(samples %in% x$sample))])
        colnames(samp_not_in_x) <- "sample"
        x <- plyr::rbind.fill(samp_not_in_x, x)
    }
    
    # Subset the data based on the arguments supplied in main.samples,
    # then relevel the data frame
    x <- x[(toupper(x$sample) %in% toupper(samples)), ]
    x <- droplevels(x)
    x$sample <- factor(x$sample, levels=samples)
    
    return(x)
}
#' sort samples in an internal waterfall file.
#'
#' perform a hiearchial sort on samples based on the presence of mutations in
#' an ordered list of genes if a sample order is unspecified.
#' @name waterfall_sampSort
#' @param x a data frame in long format with column names "sample",
#' "gene", "trv_type"
#' @param sampOrder Character vector specifying the order of samples to plot.
#' @return a vector of samples in a sorted order
#' @noRd
#' @importFrom data.table dcast

waterfall_sampSort <- function(x, sampOrder=NULL)
{
    
    # if a sample order is already defined plot that instead
    if(!is.null(sampOrder))
    {
        sampOrder <- as.character(unique(sampOrder))
        
        # determine if there are any new samples in sampOrder not in the data
        # and remove if true
        new_samples <- sampOrder[!sampOrder %in% unique(x$sample)]
        if(length(new_samples) != 0)
        {
            memo <- paste0("The following samples were not detected in the ",
                           "original data: ", toString(new_samples),
                           ", removing these samples from sampOrder!")
            warning(memo)
            sampOrder <- sampOrder[!sampOrder %in% new_samples]
        }
        
        # return what was given originally
        return(sampOrder)
    }
    
    # recast the data going from long format to wide format, values in this data
    # are counts of a mutation call
    wide_data <- data.table::dcast(x, sample ~ gene, fun.aggregate = length,
                                 value.var="trv_type")
    
    # apply a boolean function to convert the data frame values to 1's and 0's
    values <- wide_data[,-1, drop=FALSE]
    sample <- wide_data[,1]
    values <- data.frame(apply(values, 2,
                               function(x) as.numeric(as.logical(x))))
    wide_boolean <- cbind(sample, values)
    
    # reverse the columns so that genes with highest mutation's are listed first
    # (assumes gene_sort has been run on the data frame)
    wide_boolean <- wide_boolean[,c(1, rev(2:ncol(wide_boolean)))]
    
    # if there are any NA values present in a sample at the gene put that
    # remove that sample and save (at this stage it should be samples)
    if(any(grepl("^NA.$", colnames(wide_boolean))))
    {
        # Find which column has the NA header
        NA_index <- which(grepl("^NA.$", colnames(wide_boolean)))
        
        # Append NA column to end of data frame
        NA_gene <- wide_boolean[,NA_index]
        wide_boolean <- wide_boolean[,-NA_index]
        
        # Save copy and remove samples with no mutations,
        # these will be added to the end
        samp_no_mut <- wide_boolean[rowSums(wide_boolean[2:ncol(wide_boolean)]) == 0,]$sample
        samp_no_mut <- as.character(samp_no_mut)
        wide_boolean <- wide_boolean[!wide_boolean$sample %in% samp_no_mut,]
    } else {
        samp_no_mut <- NULL
    }
    
    # hiearchial sort on all column's (i.e. genes) such that samples are
    # rearranged if there is a mutation in that gene
    sample_order <- wide_boolean[do.call(order, as.list(-wide_boolean[2:ncol(wide_boolean)])),]$sample
    
    # Put those samples not in sample order in from the original levels of the
    # data (these are samples with no mutations)
    not_in <- as.character(levels(sample_order)[which(!levels(sample_order) %in% sample_order)])
    not_in <- not_in[!not_in %in% samp_no_mut]
    sample_order <- c(as.character(sample_order), as.character(not_in))
    
    # Put those samples with no mutations back in
    if(!is.null(samp_no_mut))
    {
        sample_order <- c(sample_order, samp_no_mut)
    }
    
    # see issue #325, make sure sample_order values are unique
    sample_order <- unique(sample_order)
    
    return(sample_order)
}
#' waterfall_select_palette
#' 
#' @title Helper function to select a colour palette
#' 
#' @param file_type Which file tyoe is used?
#' @param custom_palette Nullable custom colour palette
#' @noRd
#' @return A vector of colours to be used as palette
waterfall_select_palette <- function(file_type, custom_palette = NULL) {
    if (is.null(custom_palette)) {
        if (toupper(file_type) == toupper('MGI')) {
            palette <- c("nonsense"='#4f00A8', "frame_shift_del"='#A80100',
                         "frame_shift_ins"='#CF5A59', "splice_site_del"='#A80079',
                         "splice_site_ins"='#BC2D94', "splice_site"='#CF59AE',
                         "nonstop"='#000000', "in_frame_del"='#006666',
                         "in_frame_ins"='#00A8A8', "missense"='#009933',
                         "splice_region_del"='#ace7b9', "splice_region_ins"='#cdf0d5',
                         "splice_region"='#59CF74', "5_prime_flanking_region"='#002AA8',
                         "3_prime_flanking_region"='#5977CF',
                         "3_prime_untranslated_region"='#F37812',
                         "5_prime_untranslated_region"='#F2B079', "rna"='#888811',
                         "intronic"='#FDF31C', "silent"='#8C8C8C')
        } else if (toupper(file_type) == toupper('MAF')) {
            palette <- c("Nonsense_Mutation"="grey", "Frame_Shift_Ins"='#A80100',
                         "Frame_Shift_Del"='#CF5A59', "In_Frame_Ins"='#A80079',
                         "In_Frame_Del"='#CF59AE', "Nonstop_Mutation"='#000000',
                         "Translation_Start_Site"='#9159CF', "Splice_Site"='#4f00A8',
                         "Missense_Mutation"='#59CF74', "5\'Flank"='#00A8A8',
                         "3\'Flank"='#79F2F2', "5\'UTR"='#006666',
                         "3\'UTR"='#002AA8', "RNA"='#5977CF', "Intron"='#F37812',
                         "IGR"='#F2B079', "Silent"='#888811',
                         "Targeted_Region"='#FDF31C')
        } else if (toupper(file_type) == toupper('Custom')) {
            memo <- paste0("Defining a palette in mainPalette is recommended ",
                           "when file_type is set to \"Custom\", defaulting to ",
                           "a predefined palette with 20 levels")
            warning(memo)
            palette <- c('#4f00A8', '#A80100', '#CF5A59', '#A80079', '#BC2D94',
                         '#CF59AE', '#000000', '#006666', '#00A8A8', '#009933',
                         '#ace7b9', '#cdf0d5', '#59CF74', '#002AA8', '#5977CF',
                         '#F37812', '#F2B079', '#888811', '#FDF31C', '#8C8C8C')
        }
    }
    else {
        if(toupper(file_type) == "CUSTOM") {
            palette <- custom_palette
        } else {
            if(toupper(file_type) == "MGI") {
                breaks <- c("nonsense", "frame_shift_del",
                            "frame_shift_ins", "splice_site_del",
                            "splice_site_ins", "splice_site",
                            "nonstop", "in_frame_del",
                            "in_frame_ins", "missense",
                            "splice_region_del", "splice_region_ins",
                            "splice_region", "5_prime_flanking_region",
                            "3_prime_flanking_region",
                            "3_prime_untranslated_region",
                            "5_prime_untranslated_region", "rna",
                            "intronic", "silent")
            } else if (toupper(file_type) == "MAF") {
                breaks <- c("Nonsense_Mutation", "Frame_Shift_Ins",
                            "Frame_Shift_Del", "In_Frame_Ins",
                            "In_Frame_Del", "Nonstop_Mutation",
                            "Translation_Start_Site", 
                            "Splice_Site", "Missense_Mutation", 
                            "5\'Flank", "3\'Flank", "5\'UTR",
                            "3\'UTR", "RNA", "Intron",
                            "IGR", "Silent", "Targeted_Region")
            }
            ## Ensure custom palette is correct by
            ## Labelling correctly and making correct length
            palette <- setNames(
                grDevices::colorRampPalette(custom_palette)(length(breaks)), 
                breaks) 
        }
    }
    palette
}

#' Construct transition-transversion plot
#'
#' Given a data frame construct a plot displaying the proportion or frequency of
#' transition and transversion types observed in a cohort.
#' @name TvTi
#' @param x Object of class data frame with rows representing transitions and
#' transversions. The data frame must contain the following columns 'sample',
#' reference' and 'variant' or alternatively "Tumor_Sample_Barcode",
#' "Reference_Allele", "Tumor_Seq_Allele1", "Tumor_Seq_Allele2" depending on the
#' argument supplied to the fileType parameter. (required)
#' @param y Named vector or data frame representing the expected transition and 
#' transversion rates. Either option must name transition and transverions as
#' follows: "A->C or T->G (TV)", "A->G or T->C (TI)", "A->T or T->A (TV)",
#'  "G->A or C->T (TI)", "G->C or C->G (TV)", "G->T or C->A (TV)". If specifying
#'  a data frame, the data frame must contain the following columns names
#'  "Prop", "trans_tranv" (optional see vignette).
#' @param clinData Object of class data frame with rows representing clinical
#' data. The data frame should be in "long format" and columns must be names as
#' "sample", "variable", and "value" (optional see details and vignette).
#' @param type Character string specifying if the plot should display the
#' Proportion or Frequency of transitions/transversions observed. One of
#' "Proportion" or "Frequency", defaults to "Proportion".
#' @param lab_Xaxis Boolean specifying whether to label the x-axis in the plot.
#' @param lab_txtAngle Integer specifying the angle of labels on the x-axis of
#' the plot.
#' @param palette Character vector of length 6 specifying colours for each
#' of the six possible transition transversion types.
#' @param fileType Character string specifying the format the input given to
#' parameter x is in, one of 'MAF', 'MGI'. The former option requires the data
#' frame given to x to contain the following column names
#' "Tumor_Sample_Barcode", "Reference_Allele", "Tumor_Seq_Allele1",
#' "Tumor_Seq_Allele2" the later option requires the data frame givin to x to
#' contain the following column names "reference", "variant" and "sample".
#' (required)
#' @param tvtiLayer Valid ggplot2 layer to be added to the main plot.
#' @param expecLayer Valid ggplot2 layer to be added to the expected sub-plot.
#' @param sort Character string specifying the sort order of the sample
#' variables in the plot. Arguments to this parameter should be "sample",
#' "tvti", or "none" to sort the x-axis by sample name, transition transversion
#' frequency, or no sort respectively.
#' @param out Character vector specifying the the object to output, one of
#' "data", "grob", or "plot", defaults to "plot" (see returns).
#' @param clinLegCol Integer specifying the number of columns in the legend for
#' the clinical data, only valid if argument is supplied to parameter clinData.
#' @param clinVarCol Named character vector specifying the mapping of colours
#' to variables in the variable column of the data frame supplied to clinData
#' (ex. "variable"="colour").
#' @param clinVarOrder Character vector specifying the order in which to plot
#' variables in the variable column of the argument given to the parameter
#' clinData. The argument supplied to this parameter should have the same unique
#' length and values as in the variable column of the argument supplied to 
#' parameter clinData (see vignette).
#' @param clinLayer Valid ggplot2 layer to be added to the clinical sub-plot.
#' @param progress Boolean specifying if progress bar should be displayed for
#' the function.
#' @param sample_order_input Sample orders to be used
#' @param layers ggplot object to be added to proportions plot
#' @param return_plot Return as ggplot object? Only returns main plot
#' @details TvTi is a function designed to display proportion or frequency
#' of transitions and transversion seen in a data frame supplied to parameter x.
#' @examples
#' TvTi(brcaMAF, type='Frequency',
#' palette=c("#77C55D", "#A461B4", "#C1524B", "#93B5BB", "#4F433F", "#BFA753"),
#' lab_txtAngle=60, fileType="MAF")
#' @return One of the following, a list of dataframes containing data to be
#' plotted, a grob object, or a plot.
#' @importFrom plyr adply
#' @importFrom gtools mixedsort
#' @export

TvTi <- function(x, fileType=NULL, y=NULL, clinData=NULL, type='Proportion',
                 lab_Xaxis=TRUE, lab_txtAngle=45,
                 palette=c('#D53E4F', '#FC8D59', '#FEE08B', '#E6F598',
                           '#99D594', '#3288BD'),
                 tvtiLayer=NULL, expecLayer=NULL,
                 sort='none', clinLegCol=NULL, clinVarCol=NULL,
                 clinVarOrder=NULL, clinLayer=NULL, progress=TRUE, out="plot",
                 sample_order_input, layers = NULL, return_plot = FALSE)
{ 
    # Perform quality checks
    output <- TvTi_qual(x, y, clinData, file_type=fileType)
    x <- output$input1
    y <- output$input2
    clinData <- output$input3
    
    # add transition/transversion info
    if(isTRUE(progress))
    {
        message("annotating transitions and transversions")
        x <- plyr::adply(x, 1, TvTi_annoTransTranv, .progress='text')
    } else {
        x <- plyr::adply(x, 1, TvTi_annoTransTranv)
    }
    
    
    # Calculate the proportion of transitions/transversions
    x <- TvTi_calcTransTranvFreq(x)
    
    # re-level based on proportion values or via a smart sort or not at all
    if(toupper(sort) == toupper('sample'))
    {
        sample_order <- as.vector(unique(x$sample))
        sample_order <- gtools::mixedsort(sample_order)
        x$sample <- factor(x$sample, levels=sample_order)
    } else if(toupper(sort) == toupper('tvti')) {
        sample_order <- x[order(x$trans_tranv, -x$Prop),]
        sample_order <- sample_order[sample_order$Prop != 0,]
        sample_order <- unique(sample_order$sample)
        x$sample <- factor(x$sample, levels=sample_order)
    } else if(toupper(sort) == toupper("custom")) {
        sample_order <- sample_order_input
        x$sample <- factor(x$sample, levels = sample_order)
    } else if(toupper(sort) == toupper('none')){
        sample_order <- levels(x$sample)
    } else {
        memo <-paste0(sort, " is not a valid parameter for sort, please",
                      " specify one of \"sample\", \"tvti\", \"custom\", \"none\"")
        stop(memo)
    }
    
    # Perform a quality control on y to ensure fill levels match x
    if(!is.null(y))
    {
        y$trans_tranv <- factor(y$trans_tranv, levels=levels(x$trans_tranv))
    }
    
    # Build the Transition/Transversion Plot if clinical data does not exist
    if(is.null(clinData))
    {
        p1 <- TvTi_buildMain(x, y, type=type,
                             x_axis_text_angle=lab_txtAngle,
                             palette=palette, label_x_axis=lab_Xaxis,
                             tvti.layers=tvtiLayer, expec.layers=NULL,
                             title_x_axis = TRUE)
    }
    
    
    # Plot Clinical Data if Speccified and build modified TvTi main plot
    if(!is.null(clinData))
    {
        clinData$sample <- factor(clinData$sample, levels=sample_order)
        p3 <- multi_buildClin(clinData, clin.legend.col=clinLegCol, 
                              clin.var.colour=clinVarCol, 
                              clin.var.order=clinVarOrder,
                              clin.layers=clinLayer)
        
        # Build the Transition/Transversion Plot
        p1 <- TvTi_buildMain(x, y, type=type,
                             x_axis_text_angle=lab_txtAngle,
                             palette=palette, label_x_axis=lab_Xaxis,
                             tvti.layers=tvtiLayer, expec.layers=NULL,
                             title_x_axis=FALSE)
    } else {
        p3 <- NULL
    }
    
    if(!is.null(y))
    {
        # If y is input plot the expected values
        p2 <- TvTi_buildMain(y, y, type=type,
                             x_axis_text_angle=lab_txtAngle,
                             palette=palette, label_x_axis=lab_Xaxis,
                             plot_expected=TRUE, tvti.layers=NULL,
                             expec.layers=expecLayer)
    } else {
        p2 <- NULL
    }
    p1 <- p1 + layers
    # Decide what to output
    if (return_plot) {
        return(p1)
    } else {
        finalPlot <- TvTi_alignPlot(p1=p1, p2=p2, p3=p3)
        dataOut <- list("main"=x, "expect"=y)
        output <- multi_selectOut(data=dataOut, plot=finalPlot, draw=TRUE, out=out)
        return(output)
    }
}

#' align TvTi plots on y axis
#'
#' align transition/transversion plots
#' @name TvTi_alignPlot
#' @param p1 main plot
#' @param p2 left expected value subplot
#' @param p3 bottom clinical subplot
#' @noRd
#' @return ggplot object

TvTi_alignPlot <- function(p1=NULL, p2=NULL, p3=NULL)
{
    # define the ggplot's as grobs and create a blank plot
    gA <- ggplot2::ggplotGrob(p1)
    
    # Adjust the grob heights so p1, and p2 plots line up if p2 exists
    if(!is.null(p2))
    {
        # convert expected plot to grob
        gB <- ggplot2::ggplotGrob(p2)
        
        maxheight = grid::unit.pmax(gA$heights, gB$heights)
        gA$heights <- as.list(maxheight)
        gB$heights <- as.list(maxheight)
    }
    
    # adjust the grob widths so p1 and p3 line up if p3 exists
    if(!is.null(p3))
    {
        # Convert clinical plot to grob
        gC <- ggplot2::ggplotGrob(p3)
        gD <- grid::grid.rect(gp=grid::gpar(col="white"))
        
        maxwidth = grid::unit.pmax(gA$widths, gC$widths)
        gA$widths <- as.list(maxwidth)
        gC$widths <- as.list(maxwidth)        
    }
    
    # Build the final plot
    if(is.null(p3) & is.null(p2))
    {
        finalPlot <- gridExtra::arrangeGrob(gA, ncol=1, nrow=1)
    } else if(is.null(p3) & !is.null(p2)) {
        finalPlot <- gridExtra::arrangeGrob(gB, gA, ncol=2, nrow=1, widths=c(1,6))
    } else if(!is.null(p3) & is.null(p2)) {
        finalPlot <- gridExtra::arrangeGrob(gA, gC, ncol=1, nrow=2, heights=c(6,1))
    } else if(!is.null(p3) & !is.null(p2)) {
        finalPlot <- gridExtra::arrangeGrob(gB, gA, gD, gC, ncol=2, nrow=2, heights=c(6,1), widths=c(1,6))
    }
    
    return(finalPlot)
}

#' Annotate Transitions and Transversions
#'
#' Given a data frame with columns reference and variant annotate the base
#' change occurring
#' @name TvTi_annoTransTranv
#' @param x Object of class data frame containing columns 'reference', 'variant'
#' @noRd
#' @return Object of class data frame with transition/transversion annotations
#' appended

TvTi_annoTransTranv <- function(x)
{
    # add an extra column with the reference to variant
    x$base_change <- paste0(toupper(x$reference), "2", toupper(x$variant))
    
    # annotate the grouping of the base change
    x$trans_tranv <- switch(x$base_change, A2C="A->C or T->G (TV)",
                            T2G="A->C or T->G (TV)", A2G="A->G or T->C (TI)",
                            T2C="A->G or T->C (TI)", A2T="A->T or T->A (TV)",
                            T2A="A->T or T->A (TV)", G2A="G->A or C->T (TI)",
                            C2T="G->A or C->T (TI)", G2C="G->C or C->G (TV)",
                            C2G="G->C or C->G (TV)", G2T="G->T or C->A (TV)",
                            C2A="G->T or C->A (TV)")
    
    # remove the temp base change column
    x$base_change <- NULL
    return(x)
}

#' build transitions/transversions
#'
#' Given a data frame with columns 'trans_tranv', 'sample', 'Freq', and 'Prop',
#' build a transition/transversion plot
#' @name TvTi_buildMain
#' @param x Object of class data frame containing columns 'trans_tranv',
#' 'sample', 'Freq', and 'Prop'
#' @param y Object of class data frame containing columns 'Prop', 'trans_tranv'
#' for display of expected results
#' @param type Object of class character specifying whether to plot the
#' Proportion or Frequency, one of "Prop"
#' @param label_x_axis boolean specifying wheter to label x axis
#' @param x_axis_text_angle Integer specifying the angle to labels on x_axis
#' @param palette Character vector of length 6 specifying colors for
#' trans/tranv type
#' @param plot_expected Boolean specifying if this is the main TvTi plot or a
#' sub plot for expected values
#' @param tvti.layers Additional ggplot2 layers for the main plot
#' @param expec.layers Additional ggplot2 layers for the expected values plot
#' @param title_x_axis boolean specifying whether to display an x axis title
#' @noRd
#' @return GGplot Object
#' @import ggplot2

TvTi_buildMain <- function(x, y=NULL, type='Proportion', label_x_axis=TRUE,
                           x_axis_text_angle=45,
                           palette=c('#D53E4F', '#FC8D59', '#FEE08B', '#E6F598',
                                     '#99D594', '#3288BD'),
                           plot_expected=FALSE, tvti.layers=NULL,
                           expec.layers=NULL, title_x_axis=TRUE)
{
    
    if(!is.null(y))
    {
        # cumulativley sum the expected values and plot
        y$cumsum <- cumsum(y$Prop)
        expected <- geom_hline(data=y, mapping=aes_string(yintercept='cumsum'),
                               linetype="longdash", size=.5)
    } else {
        expected <- geom_blank()
    }
    
    # Define various parameters of plot
    if(plot_expected == TRUE)
    {
        bar <- geom_bar(data=x, mapping=aes_string(x=shQuote('Expected'),
                                                   y='Prop',
                                                   fill='trans_tranv'),
                        stat='identity', width=1)
    } else if(toupper(type) == 'PROPORTION') {
        bar <- geom_bar(data=x,
                        mapping=aes_string(x='sample', y='Prop',
                                           fill='trans_tranv'),
                        stat='identity', width=1)
    } else if(toupper(type) == 'FREQUENCY') {
        bar <- geom_bar(data=x,
                        mapping=aes_string(x='sample', y='Freq',
                                           fill='trans_tranv'),
                        stat='identity', width=1)
    }
    
    ylabel <- ylab(type)
    
    if(title_x_axis == TRUE)
    {
        xlabel <- xlab(paste0("Sample: n=", length(unique(x$sample))))
    } else {
        xlabel <- xlab('')
    }
    
    fill_palette <- scale_fill_manual(name='Transition/Transversion',
                                      values=palette)
    
    # additional layers to plot?
    if(!is.null(tvti.layers))
    {
        layers <- tvti.layers
    } else {
        layers <- geom_blank()
    }
    
    # Define theme of plot
    if(plot_expected == TRUE)
    {
        theme <- theme(axis.title.y=element_blank(),
                       axis.text.y=element_blank(),
                       axis.ticks.y=element_blank(),
                       legend.position='none',
                       axis.title.x=element_blank(),
                       axis.text.x=element_text(angle=x_axis_text_angle,
                                                hjust=1, vjust=1))
        if(!is.null(expec.layers))
        {
            layers <- expec.layers
        } else {
            layers <- geom_blank()
        }
        
    } else if(label_x_axis == TRUE) {
        theme <- theme(axis.text.x=element_text(angle=x_axis_text_angle,
                                                hjust=1, vjust=1))
    } else {
        theme <- theme(axis.text.x=element_blank(),
                       axis.ticks.x=element_blank())
    }
    
    # Define plot
    tmp <- data.frame(x=0, y=0)
    p1 <- ggplot(data=tmp, aes(y=0)) + bar + xlabel + ylabel +
        theme_bw() + theme + fill_palette + expected +
        guides(fill=guide_legend(reverse=TRUE)) + layers
    
    return(p1)
}

#' Calculate Transition/Transversion Frequency
#'
#' Given a data frame with columns reference, variant, sample, and trans/tranv
#' calculate the frequencies of transitions and transversion occuring.
#' @name TvTi_calcTransTranvFreq
#' @param x Object of class data frame containing columns 'reference',
#' 'variant', 'sample', 'trans_tranv'
#' @noRd
#' @return Object of class data frame with Frequency and Proportion of
#' Transistions/Transversions appended on a sample level

TvTi_calcTransTranvFreq <- function(x)
{
    # Ensure all possible combinations of trans/tranv are represented
    trans_tranv <- c("A->C or T->G (TV)", "A->G or T->C (TI)",
                     "A->T or T->A (TV)", "G->A or C->T (TI)",
                     "G->C or C->G (TV)", "G->T or C->A (TV)")
    sample <- c('dummy_sample')
    reference <- c('A')
    variant <- c('T')
    dummy_data <- data.frame(reference, variant, sample, trans_tranv)
    x <- rbind(dummy_data, x)
    
    # calculate the frequency of transitions/transversions on a sample basis
    x_freq <-  table(x$trans_tranv, x$sample)
    
    # calculate the proportion of transitions/transversions on a sample basis
    x_prop <-  prop.table(x_freq, 2)
    
    # format and remove the dummy data introduced above
    x_freq <- as.data.frame(x_freq)
    x_prop <- as.data.frame(x_prop)
    x <- cbind(x_freq, x_prop$Freq)
    colnames(x) <- c('trans_tranv', 'sample', 'Freq', 'Prop')
    x <- x[which(x$sample != "dummy_sample"),]
    
    return(x)
}

#' Convert .maf format to internal format
#'
#' Convert data frame in .maf format to an internally recogized format
#' @name TvTi_convMAF
#' @param x Object of class data frame containing columns
#' 'Tumor_Sample_Barcode', 'Reference_Allele' 'Tumor_Seq_Allele1',
#' 'Tumor_Seq_Allele2'
#' @noRd
#' @return a data frame, with column names 'sample', 'reference', 'variant'

TvTi_convMaf <- function(x)
{
    # Take out the appropriate columns and format for each allele
    x <- x[,c('Tumor_Sample_Barcode', 'Reference_Allele', 'Tumor_Seq_Allele1',
              'Tumor_Seq_Allele2')]
    
    allele1 <- x[,c('Tumor_Sample_Barcode', 'Reference_Allele',
                    'Tumor_Seq_Allele1')]
    colnames(allele1) <- c('sample', 'reference', 'variant')
    allele2 <- x[,c('Tumor_Sample_Barcode', 'Reference_Allele',
                    'Tumor_Seq_Allele2')]
    colnames(allele2) <- c('sample', 'reference', 'variant')
    
    #!!!Developer note: The if's are here because subsetting when there is
    #!!! nothing to subset (integer0) causes problems
    
    # if the tumor allele 1 matchest tumor allele2 remove that information from
    # one of the alleles
    if(any(as.character(allele1$variant) == as.character(allele2$variant)))
    {
        allele1 <- allele1[-which(as.character(allele1$variant) == as.character(allele2$variant)),]
    }
    
    # if the allele matches the reference remove it from the data
    if(any(as.character(allele1$reference) == as.character(allele1$variant)))
    {
        allele1 <- allele1[-which(as.character(allele1$reference) == as.character(allele1$variant)),]
    }
    if(any(as.character(allele2$reference) == as.character(allele2$variant)))
    {
        allele2 <- allele2[-which(as.character(allele2$reference) == as.character(allele2$variant)),]
    }
    
    # bind the data from both alleles together
    x <- rbind(allele1, allele2)
    
    return(x)
}

#' Check input to TvTi
#'
#' Perform quality check for input to function TvTi
#' @name TvTi_qual
#' @param x Object of class data frame containing columns 'sample', reference',
#' 'variant' for 'MGI' file or 'Tumor_Sample_Barcode', 'Reference_Allele',
#' 'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2' for 'MAF' file
#' @param y Object of class data frame containing columns "Prop", "trans_tranv"
#' @param z Object of class data frame containing columns "sample", "variable",
#' "value" denoting clinical information
#' @param file_type Character string spedifying th input file type expected
#' @noRd
#' @return a data frame, or list of data frames passing quality checks

TvTi_qual <- function(x, y=NULL, z=NULL, file_type='MAF')
{
    # Check file type is valid
    if(!grepl("MAF|MGI", file_type))
    {
        memo <- paste0("Did not recognize input to paramter fileType as a",
                       " valid argument... Please specify one of \"MGI\"",
                       " or \"MAF\"")
        stop(memo)
    }
    
    # Check if x input is a data frame
    if(!is.data.frame(x))
    {
        memo <- paste0("argument supplied to x is not an object of class",
                       " data frame, attempting to coerce")
        warning(memo)
        x <- as.data.frame(x)
    }
    
    # check for duplicate elements in x
    if(nrow(unique(x)) != nrow(x))
    {
        warning("Detected duplicate rows in x, was this expected?")
    }
    
    # Check if y input is a data frame
    if(!is.null(y))
    {
        # Check y input if data frame
        if(is.data.frame(y))
        {
            if(!all(colnames(y) %in% c('Prop', 'trans_tranv')))
            {
                memo <- paste0("Did not detect correct column names in",
                               "input to y, missing one of \"Prop\",",
                               "\"trans_tranv\"")
                stop(memo)
            }
            
            y$Prop <- as.numeric(as.character(y$Prop))
        }
        
        # Check y input if vector
        if(is.vector(y))
        {
            y <- as.data.frame(y)
            y$trans_tranv <- rownames(y)
            colnames(y) <- c('Prop', 'trans_tranv')
            
            if(typeof(y$Prop) != "double" & typeof(y$Prop) != "numeric")
            {
                memo <- paste0("Values found in y are not of type double",
                               " or numeric!")
                stop(memo)
            }
        }
        
        if(!is.data.frame(y))
        {
            memo <- paste0("input to y is not an object of class data frame",
                           " or named vector")
            stop(memo)
        }
    }
    
    # Check columns of x input and change to internal format
    if(file_type == 'MGI')
    {
        # Check that columns are named appropriatley, if not print error
        proper_names <- c("reference", "variant", "sample")
        if(all(proper_names %in% colnames(x)))
        {
            message("Found appropriate columns")
        } else {
            memo <- paste0("Could not find all columns requested, missing ",
                           "one of \"reference\", \"variant\", \"sample\"")
            stop(memo)
        }
        
        x <- x[,c('reference', 'variant', 'sample')]
        
    } else if(file_type == 'MAF') {
        proper_names <- c("Tumor_Sample_Barcode", "Reference_Allele",
                          "Tumor_Seq_Allele1", "Tumor_Seq_Allele2")
        if(all(proper_names %in% colnames(x)))
        {
            message("Found appropriate columns")
        } else {
            memo <- paste0("Could not find all columns requested, missing one ",
                           "of \"Tumor_Sample_Barcode\", \"Reference_Allele\",",
                           " \"Tumor_Seq_Allele1\", \"Tumor_Seq_Allele2\"")
            stop(memo)
        }
        # Convert MAF file to internal format
        x <- TvTi_convMaf(x)
    } else {
        memo <- paste0("TvTi requires a fileType specification, please",
                       "specify one of \"MAF\" or \"MGI\" based on the",
                       "argument supplied to parameter x. See docs for help.")
        stop(memo)
    }
    
    # Remove any indels present in the data
    x <- TvTi_rmIndel(x)
    # Warn about multi nucleotide codes
    x <- TvTi_rmMnuc(x)
    
    # Check that reference and variant columns only contain the proper codes
    ref_codes <- c('A', 'C', 'G', 'T', '-', 0)
    if(!all(toupper(x$reference) %in% toupper(ref_codes)))
    {
        memo <- paste0("Unrecognized Base Detected in reference column, ",
                       "expected values are: ", toString(ref_codes))
        stop(memo)
    } else if(!all(toupper(x$variant) %in% toupper(ref_codes))) {
        memo <- paste0("Unrecognized Base Detected in reference column, ",
                       "expected values are: ", toString(ref_codes))
        stop(memo)
    }
    
    # check y input for proper row names
    if(!is.null(y))
    {
        trans_tranv_names <- c("A->C or T->G (TV)", "A->G or T->C (TI)",
                               "A->T or T->A (TV)", "G->A or C->T (TI)",
                               "G->C or C->G (TV)", "G->T or C->A (TV)")
        if(!all(rownames(y) %in% trans_tranv_names))
        {
            memo <- paste0("Did not detect a value for all combinations of ",
                           "transitions/transversions, please specify input ",
                           "for each of the following levels: ",
                           toString(trans_tranv_names))
            stop(memo)
        }
        
        # check that y sums to 1 (i.e. its a true proportion among all elements)
        if(round(sum(y$Prop), digits=2) != 1)
        {
            stop("The sum of elements in y should equal 1")
        }
    }
    
    # Check input data to clinDat
    if(!is.null(z))
    {
        if(!is.data.frame(z))
        {
            stop("Did not detect a data frame for input to clinDat")
        }
        z <- droplevels(z)
        
        if(!all(c('sample', 'variable', 'value') %in% colnames(z)))
        {
            stop("Did not detect correct sample names in clinDat")
        }
        
        if(!all(unique(x$sample) %in% unique(z$sample)))
        {
            memo <- paste0("Found a sample supplied to clinData not found",
                           " in the data frame supplied to x")
            warning(memo)
        }
    }
    
    return(list('input1'=x, 'input2'=y, 'input3'=z))
}

#' Remove indels
#'
#' Given a data frame with columns reference and variants remove all indels
#' from data
#' @name TvTi_rmIndel
#' @param x Object of class data frame containing columns 'reference', 'variant'
#' @noRd
#' @return Object of class data frame with indels removed

TvTi_rmIndel <- function(x)
{
    original_size <- nrow(x)
    
    # Find and remove insertions and deletions
    x <- x[grep('-|0', x$reference, perl=TRUE, invert=TRUE),]
    x <- x[grep('-|0', x$variant, perl=TRUE, invert=TRUE),]
    
    new_size <- nrow(x)
    
    # Print message if indels have been removed
    if(new_size != original_size)
    {
        message("Removed ", original_size - new_size, " indels present in data")
    }
    
    return(x)
}

#' Remove multinucleotide codes
#'
#' Given a data frame with columns reference and variants remove all
#' multinucleotides from data
#' @name TvTi_rmMnuc
#' @param x Object of class data frame containing columns 'reference', 'variant'
#' @noRd
#' @return Object of class data frame with multi nucleotide codes removed

TvTi_rmMnuc <- function(x)
{
    original_size <- nrow(x)
    
    # Find and remove multi nucleotide codes
    x <- x[grep('[ACGT]{2,}', x$reference, perl=TRUE, invert=TRUE, ignore.case=TRUE),]
    x <- x[grep('[ACGT]{2,}', x$variant, perl=TRUE, invert=TRUE, ignore.case=TRUE),]
    
    new_size <- nrow(x)
    
    if(new_size != original_size)
    {
        memo <- paste0("Multi Nucleotide codes are not currently supported, ", 
                       "removed: ", original_size - new_size,
                       " multi-nucleotides present in the data")
        warning(memo) 
    }
    
    return(x)
}

#' Construct a lolliplot
#'
#' Given a data frame construct a plot displaying mutations on a transcript
#' framework.
#' @name lolliplot
#' @param x Object of class data frame with rows representing mutations. The
#' data frame must contain columns with the following names "transcript_name",
#' "gene", and "amino_acid_change". Values in the "transcript_name" column must
#' represent an ensembl transcript id and values in the "amino_acid_change"
#' column must be in p.notation (see details).
#' @param y Object of class data frame with rows representing mutations. The
#' data frame must contain columns with the following names "transcript_name"
#' and "amino_acid_change". Values in the "transcript_name" column must
#' represent an ensembl transcript id and values in the "amino_acid_change"
#' column must be in p. notation (optional, see details).
#' @param z Object of class data frame with rows representing regions of
#' interest. The data frame must contain columns with the following names
#' "description", "start", "stop" (optional see details).
#' @param fillCol Character string specifying the column name of the argument
#' supplied to parameter x on which to colour the lollis representing mutations
#' (see details).
#' @param labelCol Character string specifying the column name of the argument
#' supplied to parameter x from which to extract and display text corresponding
#' to mutations (see details).
#' @param txtAngle Integer specifying the angle of label text to be plotted if
#' an argument is supplied to the labelCol parameter.
#' @param txtSize Integer specifying the size of label text to be plotted if an
#' argument is supplied to the labelCol parameter.
#' @param pntSize Integer specifying the size of lolli points representing
#' mutations.
#' @param proteinColour Character string specifying the background colour of the
#' protein.
#' @param obsA.rep.fact Numeric value representing the repulsive factor for the
#' lollis plotted, which were derived from the argument supplied to parameter x
#' (see details and vignette).
#' @param obsA.rep.dist.lmt Numberic value representing the repulsive distance
#' limit for the lollis plotted, which were derived from the argument supplied
#' to parameter x (see details and vignette).
#' @param obsA.attr.fact Numeric value representing the attraction factor for
#' the lollis plotted, which were derived from the argument supplied to
#' parameter x (see details and vignette).
#' @param obsA.adj.max Numeric value representing the max position adjustment
#' for the lollis plotted, which were derived from the argument supplied to
#' parameter x (see details and vignette).
#' @param obsA.adj.lmt Numeric value representing the adjustment limit for the
#' lollis plotted, which were derived from the argument supplied to parameter x
#' (see details and vignette).
#' @param obsA.iter.max Integer representing the number of iterations of
#' position adjustments for the lollis plotted, which were derived from the 
#' argument supplied to parameter x (see details and vignette).
#' @param obsB.rep.fact Numeric value representing the repulsive factor for the
#' lollis plotted, which were derived from the argument supplied to parameter y
#' (see details and vignette).
#' @param obsB.rep.dist.lmt Numberic value representing the repulsive distance
#' limit for the lollis plotted, which were derived from the argument supplied
#' to parameter y (see details and vignette).
#' @param obsB.attr.fact Numeric value representing the attraction factor for
#' the lollis plotted, which were derived from the argument supplied to
#' parameter y (see details and vignette).
#' @param obsB.adj.max Numeric value representing the max position adjustment
#' for the lollis plotted, which were derived from the argument supplied to
#' parameter y (see details and vignette).
#' @param obsB.adj.lmt Numeric value representing the adjustment limit for the
#' lollis plotted, which were derived from the argument supplied to parameter y
#' (see details and vignette).
#' @param obsB.iter.max Integer representing the number of iterations of
#' position adjustments for the lollis plotted, which were derived from the 
#' argument supplied to parameter y (see details and vignette).
#' @param sideChain Boolean specifying if amino acid sidechain data should be
#' plotted in lieu of protein domains (see details).
#' @param species A valid species from which to retrieve protein domain and
#' sequence data for a given transcript (see details).
#' @param maxLolliStack Integer specifying the cutoff for the maximum number of
#' lollis allowed to be stacked at a single position.
#' @param plotLayer Valid ggplot2 layer to be added to the plot.
#' @param paletteA Character vector specifying colours for protein domains,
#' valid only if sideChain==FALSE.
#' @param paletteB Character vector specifying colours for lollis representing
#' mutations, valid only if argument is supplied to fillCol.
#' @param host Host to connect to for biomaRt queries (see details).
#' @param out Character vector specifying the the object to output, one of
#' "data", "grob", or "plot", defaults to "plot" (see returns).
#' @details lolliplot is a function designed to display mutation information in
#' the context of a protein identified by an ensembl transcript id. The
#' lolliplot function will query ensembl via biomart to retrieve sequence and
#' domain information in order to construct a representation of a protein and
#' therefore requires an internet connection. A value must be supplied to the
#' species parameter (defaults to hsapiens) in order for a successful biomart
#' query. Valid arguments to this field are those species with datasts available
#' via ensembl. please specify species in lowercase without a period
#' (i.e. hsapiens instead of H.sapiens), lolliplot will inform the user of
#' available species if input to the species parameter is not recognized.
#' Further lolliplot will build a protein framework based on sequence data
#' obtained from biomaRt, by default this will default to the latest ensembl
#' version. In order for the most accurate representation the annotation version
#' of the mutations given to lolliplot should match the annotation version used 
#' by biomaRt. The annotation version used by biomaRt can be changed via the 
#' host paramter (see vignette for more details).
#' 
#' lolliplot is capable of plotting two seperate sets of data on the protein
#' representation specified by parameters `x` and `y`, the data supplied to
#' these parameters will be plotted on the top and bottom of the protein
#' respectively. Note that input to these parameters is expected to correspond
#' to a single ensembl transcript and that values in the "amino_acid_change"
#' columns are required to be in p. notation (i.e. p.V600E). Further lolliplot
#' is able to plot custom domain annotation if supplied via the parameter `z`,
#' this will override domain information obtained from biomart.
#' 
#' lolliplot uses a forcefield model from the package FField to attract and 
#' repulse lollis. The parameters for this force field model are set to
#' reasonable defaults however may be adjusted via the obsA... and obsB...
#' family of parameters. Please see the package FField available on cran for
#' a description of these parameters. Note that the time to construct the
#' lolliplot will in large part depend on the number of mutations and the values
#' supplied to the forcefield parameters.
#' 
#' @examples
#' # Create input data
#' data <- brcaMAF[brcaMAF$Hugo_Symbol == 'TP53',c('Hugo_Symbol', 'amino_acid_change_WU')]
#' data <- as.data.frame(cbind(data, 'ENST00000269305'))
#' colnames(data) <- c('gene', 'amino_acid_change', 'transcript_name')
#'
#' # Call lolliplot
#' lolliplot(data)
#' @return One of the following, a list of dataframes containing data to be
#' plotted, a grob object, or a plot.
#' @export

lolliplot <- function(x, y=NULL, z=NULL, fillCol=NULL, labelCol=NULL,
                      txtAngle=45, txtSize=5, pntSize=4,
                      proteinColour='#999999', obsA.rep.fact=5000,
                      obsA.rep.dist.lmt=500, obsA.attr.fact=.1, obsA.adj.max=.1,
                      obsA.adj.lmt=.5, obsA.iter.max=50000, obsB.rep.fact=5000,
                      obsB.rep.dist.lmt=500, obsB.attr.fact=.1, obsB.adj.max=.1,
                      obsB.adj.lmt=.5, obsB.iter.max=50000,
                      sideChain=FALSE, species="hsapiens",
                      maxLolliStack=NULL, plotLayer=NULL, paletteA=NULL,
                      paletteB=NULL, host="www.ensembl.org", out="plot")
{
    # Perform quality check
    input <- lolliplot_qual(x, y, z)
    x <- input[[1]]
    y <- input[[2]]
    z <- input[[3]]
   
    # extract transcript id and subset data y on that id if it exists
    transcriptID <- as.character(x$transcript_name[1])
    if(!is.null(y))
    {
        y <- y[y$transcript_name == transcriptID,]
    }
    
    # extract HUGO gene name
    gene <- as.character(x$gene[1])
    
    # Obtain length of protein
    result <- lolliplot_transcriptID2codingSeq(transcriptID,
                                               species=species,
                                               host=host)
    codingSeq <- result$coding
    cdsLen <- result$cds_length
    
    
    # Get the sequence length in AA, perform quality checks along the way
    residueSeq <- lolliplot_DNAconv(codingSeq, to="residue")    
    # If it is requested grab the sidechain information and bind to residues
    if(sideChain==TRUE)
    {
        sidechain <- lolliplot_DNAconv(codingSeq, to="sidechain")
        AAsequence <- cbind(sidechain, residueSeq)
        AAsequence <- as.data.frame(AAsequence)
        AAsequence$coord <- seq(from=1, to=nrow(AAsequence))
    } else {
        AAsequence <- NULL
    }
    
    # if there are any stop codons remove them as they are not considered part
    # of the protein
    if(any(residueSeq %in% c("OPAL", "OCHRE", "AMBER")))
    {
        stopRes <- c("OPAL", "OCHRE", "AMBER")
        residueSeq <- residueSeq[-which(residueSeq %in% stopRes)]
        if(!is.null(AAsequence))
        {
            AAsequence <- AAsequence[-which(AAsequence$residueSeq %in% stopRes),]
        }
    }
    
    # grab the length of the protein in Amino Acids
    proteinLength <- length(residueSeq)    
    
    # if z is specified plot that instead of fetching the domain information
    if(!is.null(z))
    {
        geneData <- lolliplot_constructGene(gene, z, proteinLength)
    } else {
        # extract protein domain data
        protein_domain <- lolliplot_fetchDomain(transcriptID,
                                                species=species,
                                                host=host)
        
        # construct gene from data collected
        geneData <- lolliplot_constructGene(gene, protein_domain, proteinLength)
    }
    
    # construct data frame of observed mutations for top track
    observed_mutation <- lolliplot_mutationObs(x, 'top', fillCol, labelCol,
                                               obsA.rep.fact, obsA.rep.dist.lmt,
                                               obsA.attr.fact, obsA.adj.max,
                                               obsA.adj.lmt, obsA.iter.max)
    observed_mutation <- lolliplot_reduceLolli(observed_mutation,
                                               maxLolliStack)
    
    # construct data frame of observed mutations for bottom track
    if(!is.null(y))
    {
        observed_mutation2 <- lolliplot_mutationObs(y, 'bottom', fillCol,
                                                    labelCol, obsB.rep.fact,
                                                    obsB.rep.dist.lmt,
                                                    obsB.attr.fact,
                                                    obsB.adj.max, obsB.adj.lmt,
                                                    obsB.iter.max)
        observed_mutation2 <- lolliplot_reduceLolli(observed_mutation2,
                                                    maxLolliStack)
    } else {
        observed_mutation2 <- NULL
    }
    
    # construct the lolliplot
    plot <- lolliplot_buildMain(geneData, length, observed_mutation,
                                observed_mutation2,fillCol, labelCol,
                                txtAngle, txtSize, pntSize,
                                proteinColour, AAsequence,
                                plot_sidechain=sideChain, layers=plotLayer,
                                paletteA=paletteA, paletteB=paletteB)
    
    # Decide what to output
    dataOut <- list("gene"=geneData,
                    "observed_mutation"=observed_mutation,
                    "observed_mutation2"=observed_mutation2)
    output <- multi_selectOut(data=dataOut, plot=plot, out=out)
    return(output)
}

#' Convert AA to side chain classification
#' 
#' Given the 1 letter code an amino acid, return the side chain classification
#' @name lolliplot_AA2sidechain
#' @param x Character of length 1 giving the 1 letter amino acid code
#' @return Object of class character

lolliplot_AA2sidechain <- function(x)
{
    # Coerce all AA changes to uppercase and then apply switch statement
    x <- toupper(x)
    x <- switch(EXPR=x, "F"="Nonpolar", "L"="Nonpolar", "S"="Polar",
                "Y"="Polar", "C"="Polar", "W"="Nonpolar", "L"="Nonpolar",
                "P"="Nonpolar", "H"="Basic", "Q"="Polar", "R"="Basic",
                "I"="Nonpolar", "M"="Nonpolar", "T"="Polar", "N"="Polar",
                "K"="Basic", "S"="Polar", "R"= "Basic", "V"="Nonpolar",
                "A"="Nonpolar", "D"= "Acidic", "E"="Acidic", "G"="Polar")
    
    return(x)
}

#' Construct Lolliplot
#'
#' Construct Lolliplot given gene and mutation data
#' @name lolliplot_buildMain
#' @param gene_data object of class dataframe giving protein domain and gene
#' information
#' @param length integer specifying the length of the protein in amino acids
#' @param mutation_observed object of class data frame specifying mutations
#' observed in input file
#' @param mutation_observed2 optional object of class data frame specifying
#' additional mutations for bottom track
#' @param fill_value character string specifying the column on which to colour
#' mutation points
#' @param label_column character string specifying the column containing the
#' labels to attach to mutation points
#' @param plot_text_angle numeric value specifying the angle of text to be
#' plotted
#' @param plot_text_size numeric value specifying the size of text to be plotted
#' @param point_size numeric value specigying the size of mutation points
#' @param gene_colour color to shade plotted gene
#' @param sequence_data object of class dataframe giving AA sequence, sidechain,
#' and coord required if plot_sidechain is true
#' @param plot_sidechain boolean specifying whether to plot the AA sidechain
#' instead of domain information
#' @param layers additional ggplot2 layers to plot
#' @param paletteA Character vector specifying colours for gene features
#' @param paletteB Character vector specifying colours for lolli features
#' @return a ggplot2 object
#' @import ggplot2

lolliplot_buildMain <- function(gene_data, length, mutation_observed,
                                mutation_observed2, fill_value, label_column,
                                plot_text_angle, plot_text_size, point_size,
                                gene_colour, sequence_data,
                                plot_sidechain=FALSE,layers=NULL,
                                paletteA=NULL, paletteB=NULL)
{
    # build the various features of the lolliplot
    
    # Build gene base either using domain information
    # or AA sidechain information
    if(plot_sidechain == TRUE)
    {
        sequence_data$coord_start <-
            as.numeric(as.character(sequence_data$coord)) - 1
        sequence_data$coord_end <- as.numeric(as.character(sequence_data$coord))
        gene_plot <- geom_rect(data=sequence_data,
                               mapping=aes_string(xmin='coord_start',
                                                  xmax='coord_end', ymin=-.1,
                                                  ymax=.1, fill='sidechain'))
        domain_plot <- geom_blank()
    } else {
        gene_plot <- geom_rect(data=gene_data[1,],
                               mapping=aes_string(xmin='pos_from',
                                                  xmax='pos_to',
                                                  ymin='height_min',
                                                  ymax='height_max'),
                               fill='#999999', colour='#000000')
        # Take into account there might not be any domains
        if(nrow(gene_data) == 1)
        {
            domain_plot <- geom_blank()
        } else {
            domain_plot <- geom_rect(data=gene_data[-1,],
                                     mapping=aes_string(xmin='pos_from',
                                                        xmax='pos_to',
                                                        ymin='height_min',
                                                        ymax='height_max',
                                                        fill='Domain'),
                                     alpha=1, colour='black')
        }
    }
    
    # Build the Observed track
    observed_plot <- geom_point(data=mutation_observed,
                                mapping=aes_string(x='coord_x_dodge',
                                                   y='coord_y_dodge',
                                                   colour=fill_value),
                                size=point_size)
    observed_line <- geom_segment(data=mutation_observed,
                                  mapping=aes_string(x='mutation_coord',
                                                     y=.1, xend='coord_x_dodge',
                                                     yend=.3))
    observed_line_2 <- geom_segment(data=mutation_observed,
                                    mapping=aes_string(x='coord_x_dodge', y=.3,
                                                       xend='coord_x_dodge',
                                                       yend='coord_y_dodge'))
    
    # Miscelaneous features
    title <- ggtitle(gene_data[1,1])
    x_label <- xlab('Amino Acid Position')
    
    # add a theme and guide to the plot
    theme <- theme(legend.position='bottom',
                   legend.direction='vertical',
                   legend.box='horizontal',
                   axis.text.y=element_blank(),
                   axis.ticks.y=element_blank(),
                   axis.title.y=element_blank())
    guide <- guides(colour=guide_legend(ncol=2), fill=guide_legend(ncol=2))
    
    # set colours manually if these are specified
    if(!is.null(paletteA))
    {
        gene_features_fill <- scale_fill_manual(values=paletteA)
    } else {
        gene_features_fill <- geom_blank()
    }
    
    if(!is.null(paletteB))
    {
        lolli_features_fill <- scale_colour_manual(values=paletteB)
    } else {
        lolli_features_fill <- geom_blank()
    }
    
    # construct the plot with or without 2nd observed track
    if(is.null(mutation_observed2))
    {
        y_limits <- ylim(c(-.1, max(mutation_observed$coord_y_dodge) + .1))
        tmp <- data.frame(xmin=0, xmax=0, ymin=0, ymax=0, x=0, y=0, xend=0, yend=0)
        p1 <- ggplot(data=tmp, aes(xmin=0, xmax=0, ymin=0, ymax=0, x=0, y=0, xend=0, yend=0)) +
            gene_plot + domain_plot + observed_line_2 + observed_line +
            observed_plot + x_label + title + y_limits + theme_bw() + theme +
            guide + layers + gene_features_fill + lolli_features_fill
    } else {
        y_limits <- ylim(c(min(mutation_observed2$coord_y_dodge) - .1,
                           max(mutation_observed$coord_y_dodge) + .1))
        if(any(colnames(mutation_observed2) %in% fill_value))
        {
            observed2_plot <- geom_point(data=mutation_observed2,
                                         mapping=aes_string(x='coord_x_dodge',
                                                            y='coord_y_dodge',
                                                            colour=fill_value),
                                         size=point_size)
        } else {
            observed2_plot <- geom_point(data=mutation_observed2,
                                         mapping=aes_string(x='coord_x_dodge',
                                                            y='coord_y_dodge'),
                                         size=point_size)
        }
        observed2_line <- geom_segment(data=mutation_observed2,
                                       mapping=aes_string(x='mutation_coord',
                                                          y=-.1,
                                                          xend='coord_x_dodge',
                                                          yend=-.3))
        observed2_line_2 <- geom_segment(data=mutation_observed2,
                                         mapping=aes_string(x='coord_x_dodge',
                                                            y=-.3,
                                                            xend='coord_x_dodge',
                                                            yend='coord_y_dodge'))
        
        tmp <- data.frame(xmin=0, xmax=0, ymin=0, ymax=0, x=0, y=0, xend=0, yend=0)
        p1 <- ggplot(tmp, aes(xmin=0, xmax=0, ymin=0, ymax=0, x=0, y=0, xend=0, yend=0)) +
            gene_plot + domain_plot + observed_line + observed_line_2 +
            observed_plot + observed2_line + observed2_line_2 + observed2_plot +
            x_label + title + y_limits + theme_bw() + theme + guide + layers +
            gene_features_fill + lolli_features_fill
    }
    
    # If a label column is specified plot labels
    if(any(colnames(mutation_observed) %in% "labels"))
    {
        mutation_observed$y_label_offset <-
            mutation_observed$coord_y_dodge + .01
        p1 <- p1 + geom_text(data=mutation_observed,
                             mapping=aes_string(x='coord_x_dodge',
                                                y='y_label_offset',
                                                label='labels'),
                             angle=plot_text_angle, size=plot_text_size,
                             vjust=1, hjust=0)
    }
    if(any(colnames(mutation_observed2) %in% "labels"))
    {
        mutation_observed2$y_label_offset <-
            mutation_observed2$coord_y_dodge - .01
        p1 <- p1 + geom_text(data=mutation_observed2,
                             mapping=aes_string(x='coord_x_dodge',
                                                y='y_label_offset',
                                                label='labels'),
                             angle=plot_text_angle, size=plot_text_size,
                             vjust=0, hjust=1)
    }
    
    return(p1)
}

#' Convert Codon to AA
#' 
#' Convert a Codon to the appropriate amino acid
#' @name lolliplot_Codon2AA
#' @param x Character string of length 1 giving the DNA codon to convert
#' @return Character corresponding to the residue for the given codon
#' @noRd

lolliplot_Codon2AA <- function(x)
{
    # Convert codons to single AA code
    x <- toupper(x)
    x <- switch(x, "TTT"="F", "TTC"="F", "TTA"="L", "TTG"="L", "CTT"="L",
                "CTC"="L", "CTA"="L", "CTG"="L", "ATT"="I", "ATC"="I",
                "ATA"="I", "ATG"="M", "GTT"="V", "GTC"="V", "GTA"="V",
                "GTG"="V", "TCT"="S", "TCC"="S", "TCA"="S", "TCG"="S",
                "CCT"="P", "CCC"="P", "CCA"="P", "CCG"="P", "ACT"="T",
                "ACC"="T", "ACA"="T", "ACG"="T", "GCT"="A", "GCC"="A",
                "GCA"="A", "GCG"="A", "TAT"="Y", "TAC"="Y", "TAA"="OCHRE",
                "TAG"="AMBER", "CAT"="H", "CAC"="H", "CAA"="Q", "CAG"="Q",
                "AAT"="N", "AAC"="N", "AAA"="K", "AAG"="K", "GAT"="D",
                "GAC"="D", "GAA"="E", "GAG"="E", "TGT"="C", "TGC"="C",
                "TGA"="OPAL", "TGG"="W", "CGT"="R", "CGC"="R", "CGA"="R",
                "CGG"="R", "AGT"="S", "AGC"="S", "AGA"="R", "AGG"="R",
                "GGT"="G", "GGC"="G", "GGA"="G", "GGG"="G")
    
    return(x)
}

#' Construct gene information
#' 
#' Build gene for input into lolliplot_buildMain
#' @name lolliplot_constructGene
#' @param gene character string specifying gene name
#' @param domain_data object of class data frame specifying protein domain
#' information, obtained from lolliplot_fetchDomain, should contain columns
#' giving "description", "start", "end"
#' @param length integer specifying length of transcript in amino acids
#' @return object of class data frame giving gene and domain information
#' @noRd

lolliplot_constructGene <- function(gene, domain_data, length)
{
    # message
    message("Constructing gene track")
    
    # Construct basic gene information, if there are no domains return the gene
    gene <- data.frame(Domain=gene, pos_from=1, pos_to=length, nest=1)
    if(nrow(na.omit(domain_data)) == 0)
    {
        gene$height_min <- .1/(as.numeric(gene$nest))
        gene$height_max <- -.1/(as.numeric(gene$nest))
        gene$pos_from <- as.numeric(gene$pos_from)
        gene$pos_to <- as.numeric(gene$pos_to)
        return(gene)
    }
    
    # rename columns for domain_data and make sure description column is
    # not a factor
    colnames(domain_data) <- c("description", "start", "end")
    domain_data$description <- as.character(domain_data$description)
    
    # quality check of domain data
    if(max(domain_data$end) > length)
    {
        memo <- paste0("The end position of a domain: ",  max(domain_data$end),
                       " is exceeding the length of the protein:", length)
        warning(memo)
    } else if(min(domain_data$start) < 1) {
        memo <- paste0("The start position of a domain:",
                       min(domain_data$start),
                       "is less than the start of the protein", 1)
        warning(memo)
    }
    
    # Check that start coordinates are always less than the end coordinates
    if(any(domain_data$start >= domain_data$end))
    {
        memo <- paste0("Found a start position greater than an end position",
                       " in the protein features track. Check input to Z or",
                       "results of the biomaRt query using dataOut==TRUE.")
        warning(memo)
    }
    
    # determine which regions are overlapping and annotate which nest domain is
    # sort on start
    domain_data$start <- as.numeric(domain_data$start)
    domain_data$end <- as.numeric(domain_data$end)
    domain_data <- domain_data[order(domain_data$start),]
    
    # annotate nests
    nest <- vector('numeric')
    end <- vector('numeric')
    for(i in 1:nrow(domain_data))
    {
        # Remove from end any values <= gene$start[i]
        idx <- domain_data$start[i] < end
        end <- end[idx]
        
        nest <- c(nest, length(end))
        end <- c(end, domain_data$end[i])
    }
    
    # add this nest information to the data frame
    domain_data$nest <- nest + 1
    colnames(domain_data) <- c("Domain", "pos_from", "pos_to", "nest")
    
    # combine gene and domain information
    gene <- rbind(gene, domain_data)
    
    # annotate display heights based on nesting and make sure coord are numeric
    gene$height_min <- .1/(as.numeric(gene$nest))
    gene$height_max <- -.1/(as.numeric(gene$nest))
    gene$pos_from <- as.numeric(gene$pos_from)
    gene$pos_to <- as.numeric(gene$pos_to)
    
    return(gene)
}

#' Convert DNA character string
#' 
#' Convert a character string of nucleotides to amino acids or side chain class
#' @name lolliplot_DNAconv
#' @param x Character string of nucleotides to convert
#' @param to Character string specifying conversion to do, one of "codon", 
#' "residue", "sidechain"
#' @return Converted string of nucleotides as character vector
#' @noRd

lolliplot_DNAconv <- function(x, to="residue")
{
    # check if given character string is a multiple of 3
    if(nchar(x)%%3 != 0)
    {
        memo <- paste0("Coding sequence retrieved for given ensembl transctipt",
                       ", is not a multiple of three. output may not be,",
                       " accurate!")
        warning(memo)
    }
    
    # split the character string into codons
    codon <- substring(x, seq(1,nchar(x), 3), seq(3, nchar(x), 3))
    if(toupper(to)=="CODON")
    {
        return(as.character(codon))
    }
    
    # convert the codons into amino acid residues
    residue <- sapply(codon, lolliplot_Codon2AA)
    if(toupper(to)=="RESIDUE")
    {
        return(as.character(residue))
    }
    
    # convert the residues into sidechain classifications
    sidechain <- sapply(residue, lolliplot_AA2sidechain)
    if(toupper(to)=="SIDECHAIN")
    {
        return(as.character(sidechain))
    }
    
    # return a warning if code gets this far
    memo <- paste0("did not recognize input to variable \"to\",",
                   " returning residue data")
    warning(memo)
    return(as.character(residue))
}

#' dodge coordinates
#'
#' given amino acid position dodge on x axis
#' @name lolliplot_dodgeCoordX
#' @param x numeric vector of position coordinates on x axis
#' @param rep.fact repulsive factor for plotted mutations observed track
#' @param rep.dist.lmt repulsive distance limit for plotted mutations observed
#' track
#' @param attr.fact attraction factor for plotted mutations observed track
#' @param adj.max maximum position change for each iteration observed track
#' @param adj.lmt position adjustment limit which simulation stops observed
#' track
#' @param iter.max maximum iterations beyond which to stop the simulation
#' observed track
#' @return numeric vector of dodged position coordinates on x axis
#' @importFrom FField FFieldPtRep
#' @noRd

lolliplot_dodgeCoordX <- function(x, rep.fact=5000, rep.dist.lmt=500,
                                  attr.fact=.1, adj.max=.1, adj.lmt=.5,
                                  iter.max=50000)
{
    # Format into data frame with columns as x and y
    x <- as.data.frame(cbind(x, 0))
    colnames(x) <- c('x', 'y')
    
    # Forcefield does not work with only 1 point, check for this first
    if(nrow(x) < 2)
    {
        return(x$x)
    }
    
    # take the data frame and apply a repulsive force to coordinates
    x <- FField::FFieldPtRep(x, rep.fact=rep.fact, rep.dist.lmt=rep.dist.lmt,
                             attr.fact=attr.fact, adj.max=adj.max,
                             adj.lmt=adj.lmt, iter.max=iter.max)
    
    return(x$x)
}

#' dodge coordinates
#'
#' given a data frame, dodge x coordinates ontop of each other
#' @name lolliplot_dodgeCoordY
#' @param x data frame containing columns coord_x_dodge
#' @param track character vector, one of "top", "bottom" specifying whether to
#' dodge in a positive or negative fashion
#' @return numeric vector of dodged position coordinates on y axis
#' @noRd

lolliplot_dodgeCoordY <- function(x, track='top')
{
    for(i in 1:length(x$coord_x_dodge))
    {
        if(track == 'top' & i == 1)
        {
            pos <- .3
            orig_pos <- .3
            pos_change <- .1
        } else if(track == 'bottom' & i == 1) {
            pos <- -.3
            orig_pos <- -.3
            pos_change <- -.1
        }
        
        if(i == 1)
        {
            y_axis_vec <- c(pos)
            next
        } else {
            x_coord_a <- x$coord_x_dodge[i-1]
            x_coord_b <- x$coord_x_dodge[i]
        }
        
        if(x_coord_b == x_coord_a)
        {
            new_pos <- pos + pos_change
            y_axis_pos <- new_pos
            y_axis_vec <- c(y_axis_vec, y_axis_pos)
            pos <- new_pos
            next
        } else {
            pos <- orig_pos
            y_axis_vec <- c(y_axis_vec, pos)
        }
    }
    
    return(y_axis_vec)
}

#' fetch protein domains
#' 
#' Retrieve protein domains given ensembl transcript ID
#' @name lolliplot_fetchDomain
#' @param transcriptID String specifying ensembl transcript id
#' @param species character string to use when searching for ensemblMart dataset
#' @param host Host to connect to.
#' @return data frame of protein domains and start/stop coordinates
#' @importFrom biomaRt useMart
#' @importFrom biomaRt listDatasets
#' @importFrom biomaRt useDataset
#' @importFrom biomaRt getBM
#' @noRd
lolliplot_fetchDomain <- function(transcriptID,
                                  species="hsapiens",
                                  host="www.ensembl.org")
{
    # display message
    message("Querying biomaRt for protein domains")

    # Load in mart
    ensembl_mart <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL",
                                     host="www.ensembl.org")
    
    # select proper data set given regexp print warnings if unexpected out occur
    dataset <- biomaRt::listDatasets(ensembl_mart)$dataset
    index <- which(grepl(species, dataset))
    if(length(index)>1)
    {
        memo <- paste0(species, " Matches more than one dataset for the",
                       " ensembl mart, please specify a species in the, ",
                       "following format: hsapiens")
        stop(memo)
    } else if(length(index)==0) {
        memo <- paste0(species, " does not appear to be supported by biomaRt",
                       "if you beleive this to be in error please modify", 
                       "you're input to to conform to this format: hsapiens")
        stop(memo)
    }
    ensembl_mart <- biomaRt::useDataset(as.character(dataset[index]),
                                        mart=ensembl_mart)
    
    # Apply various filters using vector of values
    filters <- c("ensembl_transcript_id")
    values <- as.character(transcriptID)
    
    # Select attributes to retrieve (protein domain, start, stop)
    attributes <- c("interpro_description",
                    "interpro_start",
                    "interpro_end")
    
    # Retrieve data
    result <- biomaRt::getBM(attributes=attributes, filters=filters,
                             values=values, mart=ensembl_mart)
    
    return(result)
}

#' format mutation observations
#'
#' Create a data frame of mutation observations
#' @name lolliplot_mutationObs
#' @param x object of class data frame with columns trv_type and amino acid
#' change
#' @param track character string specifying one to 'top', 'bottom' to specify
#' proper track
#' @param fill_value character string giving the name of the column to shade
#' variants on
#' @param label_column character string specifying column containing text
#' information to be plotted
#' @param rep.fact repulsive factor for plotted mutations observed track
#' @param rep.dist.lmt repulsive distance limit for plotted mutations observed
#' track
#' @param attr.fact attraction factor for plotted mutations observed track
#' @param adj.max maximum position change for each iteration observed track
#' @param adj.lmt position adjustment limit which simulation stops observed
#' track
#' @param iter.max maximum iterations beyond which to stop the simulation
#' observed track
#' @return object of class data frame giving mutation observations
#' @noRd

lolliplot_mutationObs <- function(x, track, fill_value, label_column,
                                  rep.fact, rep.dist.lmt, attr.fact, adj.max,
                                  adj.lmt, iter.max)
{
    # Remove variants within an intronic or splice site region
    if(any(grepl("^e", x$amino_acid_change, ignore.case=TRUE, perl=TRUE)))
    {   
        # save original data frame size before subset for message
        origDim <- nrow(x)
        
        # remove regions with AA change starting with e (i.e. intronic/splice)
        x <- x[-which(grepl("^e", x$amino_acid_change,
                            ignore.case=TRUE, perl=TRUE)),]
        
        newDim <- nrow(x)
        
        # print update message
        memo <- paste0("Removed ", origDim - newDim,
                       " variants not within a residue")
        message(memo)
        
        # if the removal has removed all rows print an error
        if(newDim == 0)
        {
            memo <- paste0("Did not detect any residues, please check input", 
                           " lolliplot must have at least one valid residue",
                           " present!")
            stop(memo)
        }
    }
    
    # extract the mutation types and set a flag specifying they are present
    if(any(colnames(x) %in% fill_value))
    {
        fill_value_flag <- TRUE
        fill <- as.character(x[,eval(fill_value)])
    } else {
        fill_value_flag <- FALSE
    }
    
    # extract the mutation coordinates
    mutation_coord <- x$amino_acid_change
    if(all(grepl("p\\.", mutation_coord)))
    {
        message("Detected p. notation for amino_acid_change")
        mutation_coord <- as.numeric(gsub("p\\.[*a-zA-z]*(\\d+).*?$", "\\1",
                                          mutation_coord, perl=TRUE))
    } else if(all(grepl("c\\.", mutation_coord))) {
        memo <- paste0("c. notation is not currently supported",
                       " please specify amino acid change in p. notation")
        stop(memo)
    } else {
        memo <- paste0("Could not determine notation type for ",
                       "column \"amino_acid_change\", please check input.", 
                       "Expecting p. notation: ex. p.R383A")
        stop(memo)
    }
    
    # combine mutation type and mutation coord into a data frame
    if(fill_value_flag)
    {
        mutation_data <- as.data.frame(cbind(mutation_coord, fill))
        colnames(mutation_data) <- c('mutation_coord', eval(fill_value))
    } else {
        mutation_data <- as.data.frame(mutation_coord)
        colnames(mutation_data) <- c('mutation_coord')
    }
    mutation_data$mutation_coord <-
        as.numeric(as.character(mutation_data$mutation_coord))
    
    # add extra column giving height of Y axis for points to be plotted
    if(track == 'top')
    {
        mutation_data$height_max <- .3
    } else if (track == 'bottom') {
        mutation_data$height_min <- -.3
    } else {
        stop("Fatal error: incorrect track type specified")
    }
    
    # extract the mutation types and set a flag specifying they are present
    if(any(colnames(x) %in% label_column))
    {
        label_column_flag <- TRUE
        mutation_data$labels <- as.character(x[,eval(label_column)])
    } else {
        label_column_flag <- FALSE
    }
    
    # Dodge mutation coordinates on the x axis
    if(track == 'top')
    {
        memo <- paste0("applying force field to observed mutations for",
                       " top track. This will take time if n is large",
                       ", see vignette for tips")
        message(memo)
    } else if (track == 'bottom') {
        memo <- paste0("applying force field to observed mutations for",
                       " bottom track. This will take time if n is large",
                       ", see vignette for tips")        
        message(memo)
    }
    mutation_data <- mutation_data[order(mutation_coord),] 
    mutation_data$coord_x_dodge <- 
        lolliplot_dodgeCoordX(as.vector(mutation_data$mutation_coord),
                              rep.fact=rep.fact,
                              rep.dist.lmt=rep.dist.lmt,
                              attr.fact=attr.fact,
                              adj.max=adj.max,
                              adj.lmt=adj.lmt,
                              iter.max=iter.max)
    
    # Dodge y coordinates
    if(track == 'top')
    {
        mutation_data$coord_y_dodge <- lolliplot_dodgeCoordY(mutation_data,
                                                             track='top')
    } else if(track == 'bottom') {
        mutation_data$coord_y_dodge <- lolliplot_dodgeCoordY(mutation_data,
                                                             track='bottom')
    }
    
    return(mutation_data)
}

#' Check input to lolliplot
#'
#' Perform Basic quality checks for lolliplot input
#' @name lolliplot_qual
#' @param x object of class data frame containing columns transcript_name, gene,
#' and amino_acid_change and rows denoting mutations
#' @param y object of class data frame containing columns transcript_name, and
#' amino_acid_change and rows denoting mutations
#' @param z Object of class data frame containing columns "description", "start",
#'  "stop" specifying gene regions to highlight
#' @return objects passing basic quality checks
#' @noRd

lolliplot_qual <- function(x, y, z)
{
    # Check input to x
    if(!is.data.frame(x))
    {
        message("Input to x is not a data frame, attempting to coerce")
        x <- as.data.frame(x)
        x <- droplevels(x)
    }
    
    # Check for correct columns in x
    if(!all(c('transcript_name', 'gene', 'amino_acid_change') %in% colnames(x)))
    {
        stop("Did not detect correct columns in x,
             missing one of transcript_name, gene, amino_acid_change")
    }
    
    # Make sure columns in x are of the proper class
    x$transcript_name <- as.factor(x$transcript_name)
    x$gene <- as.factor(x$gene)
    x$amino_acid_change <- as.factor(x$amino_acid_change)
    
    # Check that "transcript_name" in x contains only 1 transcript
    if(length(unique(x$transcript_name)) != 1)
    {
        stop("Detected more than 1 transcript in input to x")
    }
    
    # Check input to y
    if(!is.null(y))
    {
        # is y a data frame?
        if(!is.data.frame(y))
        {
            message(y, "is not a data frame, attempting to coerce")
            y <- as.data.frame(y)
            y <- droplevels(y)
        }
        
        # does y have correct columns?
        if(!all(c('transcript_name', 'amino_acid_change') %in% colnames(y)))
        {
            stop("Did not detect correct columns in y, missing one of
                 transcript_name, amino_acid_change")
        }
        
        # make sure columns in y are of proper class
        y$transcript_name <- as.factor(y$transcript_name)
        y$amino_acid_change <- as.factor(y$amino_acid_change)
        }
    
    # Check input to z
    if(!is.null(z))
    {
        # is z a data frame?
        if(!is.data.frame(z))
        {
            memo <- paste0("Input to z is not a data frame",
                           ", attempting to coerce")
            warning(memo)
            z <- as.data.frame(z)
            z <- droplevels(z)
        }
        
        # does z contain correct columns
        if(!all(c("description", "start", "stop") %in% colnames(z)))
        {
            memo <- paste0("Did not detect correct columns in input to z, ",
                           "missing one of \"description\", \"start\",",
                           " \"stop\"")
            stop(memo)
        }
        
        # make sure column class of z are of proper type
        z$description <- as.factor(z$description)
        z$start <- as.numeric(as.character(z$start))
        z$stop <- as.numeric(as.character(z$stop))
    }
    
    return(list(x, y, z))
    }

#' Reduce Lolli
#' 
#' Reduce lollis stacked ontop of each other to the amount specified
#' @name lolliplot_reduceLolli
#' @param x Data frame with column name mutation_coord to reduce lollis on
#' @param max Integer specifying the maximum number of lollis to allow
#' @return Object of class data frame taking the reduced form of x
#' @importFrom plyr count
#' @noRd

lolliplot_reduceLolli <- function(x, max=NULL)
{
    # if max is null no reduction is required
    if(is.null(max))
    {
        return(x)
    }
    
    # Get a frequency of counts from x
    coordFreq <- plyr::count(x$mutation_coord)
    colnames(coordFreq) <- c("x", "freq")
    
    keep <- vector('numeric')
    # Loop through mutations keeping only what is under max
    for(i in 1:nrow(coordFreq))
    {
        index <- which(x$mutation_coord == coordFreq$x[i])
        index <- index[1:max]
        keep <- c(keep, index)
    }
    
    # subset the input on what we want to keep
    x <- x[keep,]
    
    return(x)
}

#' fetch protein length
#' 
#' Retrieve protein length from ensembl database given enseml transcript id
#' @name lolliplot_transcriptID2codingSeq
#' @param transcriptID character string giving ensembl transcript id
#' @param species character string to use when searching for ensemblMart dataset
#' @param host Host to connect to.
#' @return length in residues of ensembl transcript id
#' @importFrom biomaRt useMart
#' @importFrom biomaRt listDatasets
#' @importFrom biomaRt useDataset
#' @importFrom biomaRt getBM
#' @noRd

lolliplot_transcriptID2codingSeq <- function(transcriptID,
                                             species="hsapiens",
                                             host="www.ensembl.org")
{
    # display mesage
    memo <- paste0("Using the following host: ", host, " for biomaRt queries",
                   " to change the ensembl annotation version alter this",
                   " parameter!")
    message(memo)
    message("Querying biomaRt for transcript sequence")

    # Load in mart
    ensembl_mart <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL",
                                     host=host)
    
    # select proper data set given regexp print warnings if unexpected out occur
    dataset <- biomaRt::listDatasets(ensembl_mart)$dataset
    index <- which(grepl(species, dataset))
    if(length(index)>1)
    {
        memo <- paste0(species, " Matches more than one dataset for the",
                       " ensembl mart, please specify a species in the, ",
                       "following format: hsapiens")
        stop(memo)
    } else if(length(index)==0) {
        valid_species <- toString(gsub("_gene_ensembl",
                                       "",
                                       dataset))
        
        memo <- paste0(species, " does not appear to be supported by biomaRt",
                       " please specify one of the following species:",
                       valid_species)
        stop(memo)
    }
    ensembl_mart <- biomaRt::useDataset(as.character(dataset[index]),
                                        mart=ensembl_mart)
    
    # Apply various filters using vector of values
    filters <- c("ensembl_transcript_id")
    ensg_id <- as.character(transcriptID)
    
    # Select attributes to retrieve coding dna sequence
    attributes <- c("coding","cds_length")
    
    # Retrieve data
    result <- biomaRt::getBM(attributes=attributes, filters=filters,
                             values=ensg_id, mart=ensembl_mart)
    
    return(as.list(result))
}

Try the GenVisR package in your browser

Any scripts or data that you put into this service are public.

GenVisR documentation built on Dec. 28, 2020, 2 a.m.