gGnome: reference based assembly graph for analyzing rearranged genomes

Documented in gGnome.js pgv

#' Generate a PGV instance
#' @name pgv
#' @description
#'
#' Takes a table with paths to gGraphs, coverage files, and gWalks (optional), and 
#' generates an instance of a PGV directory that is ready to visualize using PGV
#' 
#' @param data either a path to a TSV/CSV or a data.table
#' @param name.col column name in the input data table containing the sample names (default: "sample")
#' @param patient.id column name in the input data table containing the patient ID/names (default: 'participant'). 
#' If your table includes more than one dataset (e.g. samples from multiple patients), then you can specify 
#' the column from which to read the dataset names. This column is used to group together samples that belong to the 
#' same dataset. If set to NA, the values in name.col are used as the patient IDs.
#' @param outdir the path where to save the files. This path should not exist, 
#' unless you want to add more files to an existing directory in which case 
#' you must use append = TRUE
#' @param cov.col column name in the input data table containing the paths to 
#' coverage files
#' @param gg.col column name in the input data table containing the paths to 
#' RDS files containing the gGnome objects
#' @param gw.col column name in the input data table containing the paths to 
#' RDS files containing the gWalk objects (optional)
#' @param descriptors list of columns in data table that provides description tags to our
#' patient IDs. Here we are looking for IDs and tags that can be used in PGV to subset
#' our data. Expects a list of character column names. (default: NA)
#' @param append if set to FALSE then the directory is expected not to exist 
#' yet (default: TRUE). By default, samples are appended to a PGV instance 
#' if the directory already exists
#' @param cov.field the name of the field in the coverage GRanges that should be used (default: "ratio", use "foreground" if dryclean output)
#' @param cov.field.col column name in the input data table containing the name 
#' of the field in the coverage GRanges that should be used. If this is supplied 
#' then it overrides the value in "cov.field". Use this if some of your coverage 
#' files differ in the field used.
#' @param cov.bin.width bin width to use when rebinning the coverage data (default: 1e4). 
#' If you don't want rebinning to be performed then set to NA.
#' @param cov.color.field field in the coverage GRanges to use in order to set 
#' the color of coverage data points. If nothing is supplied then default colors 
#' are used for each seqname (namely chromosome) by reading the colors that are 
#' defined in the settings.json file for the specific reference that is being used 
#' for this dataset.
#' @param ref the genome reference name used for this dataset. This reference name 
#' must be defined in the settings.json file. For PGV, set_reference_files() 
#' currently accepts one of the following: hg19, hg19_chr, hg38, hg38_chr, and 
#' stops on any other value.
#' @param overwrite by default only files that are missing will be created. If set 
#' to TRUE then existing coverage arrow files and gGraph JSON files will be overwritten
#' @param annotation which node/edge annotation fields to add to the gGraph JSON 
#' file. By default we assume that gGnome::events has been executed and we add 
#' the following SV annotations: 'simple', 'bfb', 'chromoplexy', 'chromothripsis', 
#' 'del', 'dm', 'dup', 'pyrgo', 'qrdel', 'qrdup', 'qrp', 'rigma', 'tic', 'tyfonas'
#' @param tree path to newick file containing a tree to incorporate with the dataset. 
#' If provided then the tree is added to datafiles.json and will be visualized by 
#' PGV. If the names of leaves of the tree match the names defined in the name.col 
#' then PGV will automatically associate these leaves with the samples, so that 
#' clicking a leaf of the tree scrolls the browser down to the corresponding 
#' genome graph track
#' @param cid.field field in the graph edges that should be used for setting the 
#' cid values in the JSON (default: 'sedge.id'). This is useful for cases in which 
#' there is some unique identifier used across samples to identify identical junctions 
#' (for example "merged.ix" field, which is generated by merge.Junction())
#' @param connections.associations (FALSE) produce a connections.associations table.
#' @param kag.col (default: 'kag') name of column in input table that includes the 
#' paths to JaBbA karyographs
#' @param ncn.gr GRanges object or path to GRanges object containing normal copy 
#' number (ncn) values. The ncn values must be contained in a field named "ncn"
#' @param mc.cores how many cores to use
#' 
#' @return a list with one element per dataset (patient), each holding the path to the datafiles.json written for the PGV instance.
#' 
#' @export
pgv = function(data,
               name.col = 'sample',
               patient.id = 'participant',
               outdir = './pgv',
               cov.col = 'coverage',
               gg.col = 'graph',
               gw.col = 'walks',
               descriptors = NA,
               append = TRUE,
               cov.field = 'ratio',
               cov.field.col = NA,
               cov.bin.width = 1e4,
               cov.color.field = NULL,
               ref = NA,
               overwrite = FALSE,
               annotation = c('simple', 'bfb', 'chromoplexy',
                              'chromothripsis', 'del', 'dm', 'dup',
                              'pyrgo', 'qrdel', 'qrdup', 'qrp', 'rigma',
                              'tic', 'tyfonas'),
               tree = NA,
               cid.field = 'sedge.id',
               connections.associations = FALSE,
               kag.col = 'kag',
               ncn.gr = NA,
               mc.cores = 1
){
    data = read.js.input.data(data, name.col = name.col)
    if (is.na(patient.id)){
        warning('no patient.id provided, using name.col...' )
        if (!(name.col %in% names(data))){
            stop('name.col column not found in data input')
        }
        datasets = unique(data[, get(name.col)])
        patient.id = name.col
    } else {
        if (!(patient.id %in% names(data))){
            stop('patient.id column not found in data input')
        }
        datasets = unique(data[, get(patient.id)])
    }
    # default to no descriptors; this also covers the case in which none of the
    # requested columns exist in the input table
    desc = NA
    if (any(!is.na(descriptors))){
        # keep only the descriptor columns that are present in the data.table
        found = descriptors %in% names(data)
        if (any(!found)){
            warning('dropping ', paste(descriptors[!found], collapse = ', '),
                    ' columns due to not being found in data.table')
        }
        if (any(found)){
            desc = descriptors[found]
        }
    }
    message("using descriptor columns: ", paste(desc, collapse = ', '))
    out = lapply(datasets, function(dname){
        # print(data[get(patient.id) == dname,])
        return(gen_js_instance(data = data[get(patient.id) == dname,],
                               name.col = name.col,
                               outdir = outdir,
                               cov.col = cov.col,
                               gg.col = gg.col,
                               gw.col = gw.col,
                               descriptors = desc,
                               append = append,
                               js.type = 'PGV',
                               cov.field = cov.field,
                               cov.field.col = cov.field.col,
                               cov.bin.width = cov.bin.width,
                               cov.color.field = cov.color.field,
                               patient.id = dname,
                               ref = ref,
                               overwrite = overwrite,
                               annotation = annotation,
                               tree = tree,
                               cid.field = cid.field,
                               connections.associations = connections.associations,
                               kag.col = kag.col,
                               ncn.gr = ncn.gr,
                               mc.cores = mc.cores))
    })
    return(out)
}
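
# The sketch below is not part of the package API. It illustrates the shape of
# the input table that pgv() reads with its default column names ('sample',
# 'participant', 'coverage', 'graph', 'walks'). The helper name
# example_pgv_input() and all sample names and file paths are hypothetical
# placeholders.
example_pgv_input = function(){
    data.table::data.table(
        sample      = c('P1_primary', 'P1_met', 'P2_primary'),
        participant = c('P1', 'P1', 'P2'),
        coverage    = c('cov/P1_primary.rds', 'cov/P1_met.rds', 'cov/P2_primary.rds'),
        graph       = c('gg/P1_primary.rds', 'gg/P1_met.rds', 'gg/P2_primary.rds'),
        walks       = c('gw/P1_primary.rds', 'gw/P1_met.rds', 'gw/P2_primary.rds')
    )
}
# With real files in place, a call could look like the line below; samples are
# grouped into one dataset per value of the 'participant' column:
# pgv(example_pgv_input(), outdir = './pgv', ref = 'hg19')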

#' Generate a gGnome.js instance
#' @name gGnome.js
#' @description
#'
#'
#' Takes a table with paths to gGraphs and coverage files (optional) and 
#' generates an instance of a gGnome.js directory that is ready to visualize using gGnome.js
#' 
#' @param data either a path to a TSV/CSV or a data.table
#' @param name.col column name in the input data table containing the sample names (default: "sample")
#' @param outdir the path where to save the files. This path should not exist, unless you want to add more files to an existing directory in which case you must use append = TRUE
#' @param reference name of the reference genome used. You can use one of the built-in references (hg19, hg38), or provide a path to a folder with properly formatted genes.json and metadata.json files
#' @param cov.col column name in the input data table containing the paths to coverage files
#' @param gg.col column name in the input data table containing the paths to RDS files containing the gGnome objects
#' @param append if set to FALSE then the directory is expected not to exist yet (default: FALSE). If set to TRUE, samples are appended to a gGnome.js instance if the directory already exists
#' @param cov.field the name of the field in the coverage GRanges that should be used (default: "ratio")
#' @param cov.field.col column name in the input data table containing the name of the field in the coverage GRanges that should be used. If this is supplied then it overrides the value in "cov.field". Use this if some of your coverage files differ in the field used.
#' @param cov.bin.width bin width to use when rebinning the coverage data (default: 1e4). If you don't want rebinning to be performed then set to NA.
#' @param kag.col (default: 'kag') name of column in input table that includes the paths to JaBbA karyographs
#' @param ncn.gr GRanges object or path to GRanges object containing normal copy number (ncn) values. The ncn values must be contained in a field named "ncn"
#' @param overwrite by default only files that are missing will be created. If set to TRUE then existing coverage arrow files and gGraph JSON files will be overwritten
#' @param annotation which node/edge annotation fields to add to the gGraph JSON file. By default we assume that gGnome::events has been executed and we add the following SV annotations: 'simple', 'bfb', 'chromoplexy', 'chromothripsis', 'del', 'dm', 'dup', 'pyrgo', 'qrdel', 'qrdup', 'qrp', 'rigma', 'tic', 'tyfonas'
#' @param mc.cores how many cores to use
#' 
#' @export
gGnome.js = function(data,
                           name.col = 'sample',
                           outdir = './gGnome.js',
                           reference = 'hg19',
                           cov.col = 'coverage',
                           gg.col = 'graph',
                           append = FALSE,
                           cov.field = 'ratio',
                           cov.field.col = NA,
                           cov.bin.width = 1e4,
                           overwrite = FALSE,
                           annotation = c('simple', 'bfb', 'chromoplexy',
                                       'chromothripsis', 'del', 'dm', 'dup',
                                       'pyrgo', 'qrdel', 'qrdup', 'qrp', 'rigma',
                                       'tic', 'tyfonas'),
                           kag.col = 'kag',
                           ncn.gr = NA,
                           mc.cores = 1
                     ){
    return(gen_js_instance(data = data,
                           name.col = name.col,
                           outdir = outdir,
                           cov.col = cov.col,
                           ref = reference,
                           gg.col = gg.col,
                           append = append,
                           js.type = 'gGnome.js',
                           cov.field = cov.field,
                           cov.field.col = cov.field.col,
                           cov.bin.width = cov.bin.width,
                           overwrite = overwrite,
                           annotation = annotation,
                           kag.col = kag.col,
                           ncn.gr = ncn.gr,
                           mc.cores = mc.cores))
}

#' Generate a PGV or gGnome.js instance
#' @name gen_js_instance
#' @description
#'
#' Takes a table with paths to gGraphs and coverage files (optional) and 
#' generates an instance of a gGnome.js directory that is ready to visualize using gGnome.js
#' 
#' @param data either a path to a TSV/CSV or a data.table
#' @param name.col column name in the input data table containing the sample names (default: "sample")
#' @param outdir the path where to save the files. This path should not exist, unless you want to add more files to an existing directory in which case you must use append = TRUE
#' @param cov.col column name in the input data table containing the paths to coverage files
#' @param gg.col column name in the input data table containing the paths to RDS files containing the gGnome objects
#' @param gw.col column name in the input data table containing the paths to RDS files containing the gWalk objects (optional)
#' @param patient.id ID of the patient (dataset) that the samples in data belong to (default: 'participant'). 
#' Note that this is the ID itself rather than a column name: pgv() splits the input table by its 
#' patient.id column and calls this function once per dataset, passing the dataset name. The ID is used as 
#' the key of the dataset entry in datafiles.json.
#' @param append if set to FALSE then the directory is expected not to exist yet (default: FALSE). If set to TRUE, samples are appended to the instance if the directory already exists (and if there is no directory then a clone would first be generated from github)
#' @param js.type either "PGV" or "gGnome.js"
#' @param cov.field the name of the field in the coverage GRanges that should be used (default: "ratio")
#' @param cov.field.col column name in the input data table containing the name of the field in the coverage GRanges that should be used. If this is supplied then it overrides the value in "cov.field". Use this if some of your coverage files differ in the field used.
#' @param cov.bin.width bin width to use when rebinning the coverage data (default: 1e4). If you don't want rebinning to be performed then set to NA.
#' @param cov.color.field field in the coverage GRanges to use in order to set the color of coverage data points. If nothing is supplied then default colors are used for each seqname (namely chromosome) by reading the colors that are defined in the settings.json file for the specific reference that is being used for this dataset.
#' @param ref the genome reference name used for this dataset. For specific behaviour refer to the PGV/gGnome.js wrappers
#' @param overwrite by default only files that are missing will be created. If set to TRUE then existing coverage arrow files and gGraph JSON files will be overwritten
#' @param annotation which node/edge annotation fields to add to the gGraph JSON file. By default we assume that gGnome::events has been executed and we add the following SV annotations: 'simple', 'bfb', 'chromoplexy', 'chromothripsis', 'del', 'dm', 'dup', 'pyrgo', 'qrdel', 'qrdup', 'qrp', 'rigma', 'tic', 'tyfonas'
#' @param tree path to newick file containing a tree to incorporate with the dataset. Only relevant for PGV. If provided then the tree is added to datafiles.json and will be visualized by PGV. If the names of leaves of the tree match the names defined in the name.col then PGV will automatically associate these leaves with the samples, so that clicking a leaf of the tree scrolls the browser down to the corresponding genome graph track
#' @param mc.cores how many cores to use
#' 
#' @keywords internal 
gen_js_instance = function(data,
                           name.col = 'sample',
                           patient.id = 'participant',
                           outdir = './gGnome.js',
                           cov.col = 'coverage',
                           gg.col = 'graph',
                           gw.col = NA,
                           descriptors = NA,
                           append = FALSE,
                           js.type = 'PGV',
                           cov.field = 'ratio',
                           cov.field.col = NA,
                           cov.bin.width = 1e4,
                           cov.color.field = NULL,
                           ref = NA,
                           overwrite = FALSE,
                           annotation = c('simple', 'bfb', 'chromoplexy',
                                       'chromothripsis', 'del', 'dm', 'dup',
                                       'pyrgo', 'qrdel', 'qrdup', 'qrp', 'rigma',
                                       'tic', 'tyfonas'),
                           kag.col = 'kag',
                           ncn.gr = NA,
                           tree = NA,
                           cid.field = NULL,
                           connections.associations = FALSE,
                           mc.cores = 1
                     ){
    # check the path and make a clone of the github repo if needed
    outdir = js_path(outdir, js.type = js.type, append = append)

    set_reference_files(outdir, js.type = js.type, ref = ref)

    # get the path to the metadata file
    meta.js = get_path_to_meta_js(outdir, js.type = js.type)
    # read and check the input data
    data = read.js.input.data(data, name.col = name.col)
    # generate coverage files
    if (!is.na(cov.col)){
        message('Generating coverage files')
        coverage_files = gen_js_coverage_files(data, outdir, name.col = name.col, overwrite = overwrite, cov.col = cov.col,
                          js.type = js.type, cov.field = cov.field,
                          cov.field.col = cov.field.col, gg.col = gg.col,
                          bin.width = cov.bin.width, patient.id = patient.id,
                          ref = ref, cov.color.field = cov.color.field,
                          meta.js = meta.js, kag.col = kag.col, ncn.gr = ncn.gr, mc.cores = mc.cores)

        data$coverage = coverage_files
    }else{data$coverage = NA}
    if (!is.na(gg.col)){
        message('Generating gGraph json files')
        gg.js.files = gen_gg_json_files(data, outdir, meta.js = meta.js, name.col = name.col, gg.col = gg.col,
                                    js.type = js.type, patient.id = patient.id, ref = ref,
                                    overwrite = overwrite, annotation = annotation, cid.field = cid.field,
                                    connections.associations = connections.associations)
        data$gg.js = gg.js.files
    } else{data$gg.js = NA} 
    if (!is.na(gw.col)){
        message('Generating gWalk json files')
        gw.js.files = gen_gw_json_files(data, outdir, meta.js = meta.js, name.col = name.col, gw.col = gw.col,
                                    js.type = js.type, patient.id = patient.id, ref = ref,
                                    overwrite = overwrite, annotation = annotation)
        data$gw.js = gw.js.files
    }else{data$gw.js = NA}
    # generate the datafiles    
    # pass descriptors into meta_col
    dfile = gen_js_datafiles(data, outdir, js.type, 
                             name.col = name.col, ref = ref, 
                             patient.id = patient.id, tree = tree, 
                             meta_col = descriptors)
    # return the path to the datafiles so that callers (e.g. pgv) can collect it
    return(dfile)
}

#' @name set_reference_files
#' @description internal
#'
#' Set up the reference files (gene annotations and, where needed, metadata) for a PGV or gGnome.js instance
#'
#' @param outdir the path to the PGV/gGnome.js repository clone
#' @param js.type either "PGV" or "gGnome.js"
#' @param ref the genome reference used for this dataset.
set_reference_files = function(outdir, js.type, ref){
    if (js.type == 'gGnome.js'){
        pub_dir = paste0(outdir, '/public')
        built_in_refs = dir(pub_dir)
        built_in_refs = c('hg19', built_in_refs[dir.exists(paste0(pub_dir, '/', built_in_refs))])
        if (ref %in% built_in_refs){
            message('Built-in reference requested: "', ref, '"')
            ref_dir = paste0(pub_dir, '/', ref)
            message('Downloading genes.json file.')
            genes = paste0(pub_dir, '/', 'genes.json')
            # build the URL for downloading genes.json from AWS
            url = paste0('https://mskilab.s3.amazonaws.com/pgv/',
                         ref, '.genes.json')
            system(paste0('wget -O ',
                          genes,
                          ' ',
                          url))
        } else {
            # if it is not a built-in reference then we assume it is a path to a directory
            ref_dir = ref
            if (!dir.exists(ref_dir)){
                stop('Invalid reference provided: "', ref, 
                     '". Please provide either the name of one of the following built-in references: ',
                     paste(built_in_refs, collapse = ', '), 
                     ', or a path to a directory containing properly formatted "genes.json" and "metadata.json".')
            }
            genes = paste0(ref_dir, '/genes.json')
            if (!file.exists(genes)){
                stop('Invalid reference folder: ', ref_dir, 
                     '. The reference folder must contain a file named "genes.json".')
            }
            message('Copying reference JSON file to: ', pub_dir)
            system(paste0('cp ', genes, ' ', pub_dir))
        }

        if (ref != 'hg19'){
            metadata = paste0(ref_dir, 
                              '/metadata.json')
            if (!file.exists(metadata)){
                stop('Invalid reference folder: ', ref_dir, 
                     '. The reference folder must contain a file named "metadata.json".')
            }
            message('Copying reference metadata file to: ', 
                    pub_dir)
            system(paste0('cp ', metadata, 
                          ' ', pub_dir))
        }
    } else {
        # set up reference files for PGV
        built_in_refs = c('hg19', 'hg19_chr', 'hg38', 'hg38_chr')
        if (!(ref %in% built_in_refs)){
            # TODO: in the future we should probably allow ref to be a path to a directory with all required reference files
            # this would require automatically updating the settings.json file
            #    by adding the reference metadata.json info to the "sets" in settings.json
            stop('Invalid reference: ', ref, '. Please choose one of the following: ', paste(built_in_refs, collapse = ', '))
        } else {
            gdir = paste0(outdir, '/public/genes')
            dir.create(gdir, recursive = TRUE, showWarnings = FALSE)
            gpath = paste0(gdir, '/', ref, '.arrow')
            url = paste0('https://mskilab.s3.amazonaws.com/pgv/',
                         ref, '.arrow')
            if (!file.exists(gpath)){
                system(paste0('wget -O ',
                              gpath,
                              ' ',
                              url))
            } else {
                message('Found genes file at: ', gpath)
            }
        }
    }

}
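
# A hedged alternative to the system('wget ...') calls above, using only base
# R. The URL pattern is the one used for PGV in set_reference_files(); the
# helper names pgv_genes_url() and fetch_pgv_genes() are hypothetical and are
# not called anywhere in the package.
pgv_genes_url = function(ref){
    paste0('https://mskilab.s3.amazonaws.com/pgv/', ref, '.arrow')
}

fetch_pgv_genes = function(outdir, ref){
    gdir = file.path(outdir, 'public', 'genes')
    dir.create(gdir, recursive = TRUE, showWarnings = FALSE)
    gpath = file.path(gdir, paste0(ref, '.arrow'))
    if (!file.exists(gpath)){
        # mode = 'wb' keeps the binary arrow file intact on all platforms
        utils::download.file(pgv_genes_url(ref), destfile = gpath, mode = 'wb')
    }
    return(gpath)
}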

#' @name gen_js_datafiles
#' @description internal
#'
#' Generate the datafiles object for a PGV or gGnome.js instance
#'
#' @param data either a path to a TSV/CSV or a data.table
#' @param outdir the path to the PGV/gGnome.js repository clone
#' @param js.type either "PGV" or "gGnome.js"
#' @param name.col column name in the input data table containing the sample names (default: "sample")
#' @param meta_col a list of columns in the input data table containing the 
#' description of each sample. 
#' A single string is expected in which each description term is separated by a semicolon and space ("; "). For example: "ATCC; 2014; Luciferase; PTEN-; ESR1-"
#' @param ref the genome reference name used for this dataset. Only relevant for PGV
#' @param patient.id ID of the patient (dataset) that the samples in data belong to. Required for PGV, 
#' where it is used as the key of the dataset entry in datafiles.json and as the name of the 
#' dataset folder under public/data.
#' @param tree path to a newick file containing a tree to incorporate with the dataset. Only relevant for PGV.
#' 
#' @export
gen_js_datafiles = function(data, outdir, js.type, name.col = NA, 
                            meta_col = NA, ref = NA, patient.id = NA, tree = NA){
    
    dfile = get_js_datafiles_path(outdir, js.type)
    message(paste0('Writing description file to: ', dfile))
    if (js.type == 'gGnome.js'){
        if (file.exists(dfile)){
            datafiles = fread(dfile)
            # if some of the samples that we are adding are already in the datafiles then we want to override these
            jsons = data[, paste0(get(name.col), '.json')]
            datafiles_trimmed = datafiles[!(datafile %in% jsons)]
            datafiles = rbind(datafiles_trimmed, data[, .(datafile = paste0(get(name.col), '.json'), description)])
        } else {
            datafiles = data[, .(datafile = paste0(get(name.col), '.json'), description)]
        }
        fwrite(datafiles, dfile)
    }
    if (js.type == 'PGV'){
        if (is.na(patient.id)){
            stop('patient.id must be provided for PGV.')
        }
        if (is.na(ref)){
            stop('ref must be provided for PGV.')
        }

        if (!('visible' %in% names(data))){
            data$visible = TRUE
        }

        if (file.exists(dfile)){
            # there is already a file and we want to extend/update it
            datafiles = jsonlite::read_json(dfile)
            if (patient.id %in% names(datafiles)){
                warning('Notice that an entry for "', patient.id, '" previously existed in your datafiles.json and will now be overridden.')
            }
        } else {
            datafiles = list()
        }

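        # by default keep the order of samples in the data file; this is also the
        # fallback when the tree cannot be validated or shares no labels with the samples
        sample_order = seq_len(data[, .N])

        if (!is.na(tree)){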
            #TODO: check that this is a valid newick
            if (!file.exists(tree)){
                warning('The provided tree was not found: ', tree)
                tree = NA
            } else {
                if (requireNamespace("ape", quietly = TRUE)) {
                    tree_ = ape::read.tree(tree)
                    if (is.null(tree_)){
                        warning('The provided tree: "', tree, '" is not in newick format and so will be ignored.')
                        tree = NA
                    } else {
                        # check overlap between sample names and tree nodes
                        tree.labels = tree_$tip.label
                        common = intersect(tree.labels, data[, get(name.col)])
                        if (length(common) == 0){
                            warning('There is no overlap between labels in your tree and sample names in your data. The tree will still show in PGV but all interaction between the tree and graphs will be disabled.')
                        } else {
                            missing.names = setdiff(data[, get(name.col)], tree.labels)
                            if (length(missing.names) > 0){
                                warning('The following samples are missing from your tree: ', paste(missing.names, collapse = ', '))
                            }
                            extra.names = setdiff(tree.labels, data[, get(name.col)])
                            if (length(extra.names) > 0){
                                warning('The following labels appear in provided tree, but do not correspond to any samples in your data: ', paste(extra.names, collapse = ', '))
                            }
                            # remember the original order for samples not in the tree
                            l1 = 1:data[, .N] 
                            names(l1) = data[, get(name.col)]

                            l2 = seq_along(tree.labels)
                            names(l2) = tree.labels

                            l3 = c(l2[common], l1[missing.names])
                            sample_names = names(sort(l3))
                            sample_order = unname(l1[sample_names])
                        }
                    }
                } else {
                    warning('Package "ape" is not installed so skipping validation of tree newick format. If things do not work later then it might be worth checking if the provided tree is in valid newick format.')
                }
            }
        }

        if (!is.na(tree)){
            tree.new.path = paste0(outdir, "/public/data/", patient.id, "/", patient.id, ".newick")
            message('Copying input newick file to: ', tree.new.path)
            file.copy(tree, tree.new.path)
            tree_plot = list("sample" = NA_character_,
                             "type" = "phylogeny",
                             "source" = paste0(patient.id, ".newick"),
                             "title" = paste0("Phylogenetic Information for ", patient.id),
                             "visible" = TRUE)
        } else {
            # we use the order of samples in the data file since we don't have a tree
            sample_order = 1:data[,.N]
        }
        plots = lapply(sample_order, function(idx){
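                     # fall back to NA when a column is absent so that the checks below do not fail
                     gg.js = if ("gg.js" %in% colnames(data)) data[["gg.js"]][idx] else NA
                     gw.js = if ("gw.js" %in% colnames(data)) data[["gw.js"]][idx] else NA
                     cov.fn = if ("coverage" %in% colnames(data)) data[["coverage"]][idx] else NA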
                     gg.track = NULL
                     gw.track = NULL
                     cov.track = NULL
                     nm = data[idx, get(name.col)]
                     if (!is.na(gg.js)){
                         if (file.exists(gg.js)){
                             gg.track = list('sample' = nm,
                                             'type' = 'genome',
                                             'source' = paste0(nm, '.json'),
                                             'title' = nm,
                                             'visible' = ifelse(data[idx, visible] == TRUE, TRUE, FALSE))
                         }
                     }
                     if (!is.na(gw.js)){
                         if (file.exists(gw.js)){
                             gw.track = list('sample' = nm,
                                             'type' = 'walk',
                                             'source' = paste0(nm, '.walks.json'),
                                             'title' = paste0(nm, ' Walks'),
                                             'visible' = ifelse(data[idx, visible] == TRUE, TRUE, FALSE))
                         }
                     }
                     if (!is.na(cov.fn)){
                         if (file.exists(cov.fn)){
                             cov.track = list('sample' = nm,
                                              'type' = 'scatterplot',
                                              'source' = paste0(nm, '-coverage.arrow'),
                                              'title' = paste0(nm, ' Coverage Distribution'),
                                              'visible' = FALSE) # coverage tracks will always be set to not visible on load
                         }
                     }
                     tracks = list(gg.track, gw.track, cov.track)
                     return(tracks)
        })

        plots = do.call(c, plots)

        item = list()
        # TODO: we need to figure out the purpose of the description in PGV and update this accordingly
        if (all(is.na(meta_col))){
            # if nothing is provided we just push patient.id into our description
            item$description = list(paste0('patientid=', patient.id))    
        }else{
            list_desc=sapply(meta_col,function(x){
                out= data %>% select(all_of(x)) %>% .[1]
                paste0(x, "=",out)
            })
            item$description = list_desc
        }
        
        item$reference = ref
        item$plots = plots

        if (!is.na(tree)){
            item$plots = c(list(tree_plot), item$plots)
        }


        datafiles[[patient.id]] = item
        jsonlite::write_json(datafiles, dfile,
                             pretty=TRUE, auto_unbox=TRUE, digits=4)
    }
    return(dfile)
}

#' @name get_js_datafiles_path
#' @description
#'
#' Get the path to the datafiles (CSV for gGnome.js, JSON for PGV) inside the clone of the repository
#'
#' @param outdir the path to the PGV/gGnome.js repository clone
#' @param js.type either "PGV" or "gGnome.js"
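#' @examples
#' \dontrun{
#' # A minimal sketch: '/tmp/pgv' and '/tmp/ggjs' are hypothetical clone locations,
#' # and ':::' is used on the assumption that this helper is not exported.
#' gGnome:::get_js_datafiles_path('/tmp/pgv', js.type = 'PGV')
#' # "/tmp/pgv/public/datafiles.json"
#' gGnome:::get_js_datafiles_path('/tmp/ggjs', js.type = 'gGnome.js')
#' # "/tmp/ggjs/datafiles.csv"
#' }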
get_js_datafiles_path = function(outdir, js.type){
    is.acceptable.js.type(js.type)
    if (js.type == 'gGnome.js'){
        dfile = paste0(outdir, '/datafiles.csv')
    }
    if (js.type == 'PGV'){
        dfile = paste0(outdir, '/public/datafiles.json')
    }
    return(dfile)
}

#' @name gen_gg_json_files
#' @description internal
#'
#' Generate the json files that will represent your gGraphs
#'
#' @param data either a path to a TSV/CSV or a data.table
#' @param outdir the path to the PGV/gGnome.js repository clone
#' @param meta.js path to JSON file with metadata (for PGV should be located in "public/settings.json" inside the repository and for gGnome.js should be in public/genes/metadata.json)
#' @param name.col column name in the input data table containing the sample names (default: "sample")
#' @param gg.col column name in the input data table containing the paths to RDS files containing the gGnome objects
#' @param js.type either "PGV" or "gGnome.js"
#' @param patient.id column name in the input data table containing the patient ID/names (default: NA).
#' If your table includes more than one dataset (e.g. samples from multiple patients), then you can specify 
#' the column from which to read the dataset names. This column would be used to group together samples that belong to the 
#' same dataset. If no values are passed, we take the pair name as patientID.
#' @param ref the genome reference name used for this dataset. For specific behaviour refer to the PGV/gGnome.js wrappers
#' @param overwrite by default only files that are missing will be created. If set to TRUE then existing coverage arrow files and gGraph JSON files will be overwritten
#' @param annotation which node/edge annotation fields to add to the gGraph JSON file. By default we assume that gGnome::events has been executed and we add the following SV annotations: 'simple', 'bfb', 'chromoplexy', 'chromothripsis', 'del', 'dm', 'dup', 'pyrgo', 'qrdel', 'qrdup', 'qrp', 'rigma', 'tic', 'tyfonas'
#' @param cid.field field name passed on to the gGraph json method and to get_cids (default: NULL)
#' @param connections.associations if TRUE then a "connections.associations.json" file listing the connection IDs of each sample is also written to the JSON directory (default: FALSE)
gen_gg_json_files = function(data, outdir, meta.js, name.col = 'sample', 
                             gg.col = 'graph', js.type = 'gGnome.js',
                             patient.id = NA, ref = NULL, overwrite = FALSE, 
                             annotation = NULL, cid.field = NULL, 
                             connections.associations = FALSE){
    json_dir = get_gg_json_dir_path(outdir, js.type, patient.id)
    json_files = lapply(1:data[, .N], function(idx){
        gg.js = file.path(json_dir, paste0(data[idx, get(name.col)], ".json"))
        if (!file.exists(gg.js) || overwrite){
            message("reading in ", data[idx, get(gg.col)])
            # TODO: at some point we need to do a sanity check to see that a valid rds of gGraph was provided
            gg = readRDS(data[idx, get(gg.col)])
            sl = parse.js.seqlengths(meta.js, js.type = js.type, ref = ref)
            # check for overlap in sequence names
            gg.reduced = gg[seqnames %in% names(sl)]
            if (length(gg.reduced) == 0){
                stop(sprintf('There is no overlap between the sequence names in the reference used by %s and the sequences in your gGraph. Here is an example sequence from your gGraph: "%s". And here is an example sequence from the reference used by %s: "%s"', js.type, seqlevels(gg$nodes$gr)[1], js.type, names(sl)[1]))
            }
            # reuse the subset computed above instead of subsetting the graph a second time
            gg.js = refresh(gg.reduced)$json(filename = gg.js,
                        verbose = TRUE,
                        annotation = annotation,
                        cid.field = cid.field)
        } else {
            message(gg.js, ' found. Will not overwrite it.')
        }
        return(normalizePath(gg.js))
    })
    if (connections.associations){
        ca.fn = paste0(json_dir, '/connections.associations.json')
        if (!file.exists(ca.fn) || overwrite){
            message('Generating connections.associations file')
            cid_lists = lapply(1:data[, .N], function(idx){
                gg = readRDS(data[idx, get(gg.col)])
                nm = data[idx, get(name.col)]
                cids = get_cids(gg, cid.field)
                return(list(sample = nm, connections = cids))
            })
            jsonlite::write_json(cid_lists, ca.fn,
                                 pretty=TRUE, auto_unbox=TRUE, digits=4)
        } else {
            message('Found existing connections.associations file at: ', ca.fn, '. Will not overwrite it.')
        }
    }
    return(unlist(json_files))
}

#' @name is.dir.a.PGV.instance
#' @description internal
#'
#' Check if a path matches something that looks like a clone of the PGV github repository
#'
#' This is done by checking if the folder contains the subdirectory "public" and within it "settings.json".
#' If the file is not found then an error is raised
#' 
#' @param outdir path to directory
is.dir.a.PGV.instance = function(outdir){
    if (!file.exists(paste0(outdir, '/public/settings.json'))){
        stop(outdir, ' does not seem to be a proper clone of the PGV github repository.')
    }
}

#' @name is.dir.a.gGnome.js.instance
#' @description internal
#'
#' Check if a path matches something that looks like a clone of the gGnome.js github repository
#'
#' This is done by checking if the folder contains the subdirectory "public" and within it "metadata.json".
#' If the file is not found then an error is raised
#' 
#' @param outdir path to directory
is.dir.a.gGnome.js.instance = function(outdir){
    if (!file.exists(paste0(outdir, '/public/metadata.json'))){
        stop(outdir, ' does not seem to be a proper clone of the gGnome.js github repository. Delete the directory.')
    }
}

#' @name is.dir.a.js.instance
#' @description internal
#'
#' Check if a path matches something that looks like a clone of the PGV or gGnome.js github repositories
#'
#' This is done by checking if the folder contains the subdirectory "public" and within it
#' "settings.json" (for PGV) or "metadata.json" (for gGnome.js).
#' If the file is not found then an error is raised
#' 
#' @param outdir path to directory
#' @param js.type either "PGV" (current) or "gGnome.js" (legacy)
is.dir.a.js.instance = function(outdir, js.type){
    if (js.type == 'gGnome.js'){
        is.dir.a.gGnome.js.instance(outdir)
    }
    if (js.type == 'PGV'){
        is.dir.a.PGV.instance(outdir)
    }
}

#' @name js_path
#' @description internal
#'
#' Takes a path and checks if it is a valid path to a gGnome.js/PGV directory. 
#'
#' If the directory does not exist then a clone from github is generated.
#' 
#' @param outdir path to directory
#' @param append if set to FALSE then the directory is expected to not already exist
#' @param js.type either "PGV" or "gGnome.js"
#' 
#' @keywords internal
js_path = function(outdir, append = FALSE, js.type = 'PGV'){

    is.acceptable.js.type(js.type)
    outdir = suppressWarnings(normalizePath(outdir))

    if (file.exists(outdir) && !dir.exists(outdir)){
        stop('The output directory must be a valid path for a directory, but you provided a path of a file that already exists: ', outdir)
    }

    if (dir.exists(outdir)){
        if (!append){
            # if the folder exists and there is no append flag then throw error
            stop('The output directory already exists. If you wish to generate a new ', js.type, ' instance, please provide a path for a new directory. If you wish to add more files to an existing instance of ', js.type, ' then use "append = TRUE".')
        } else {
            is.dir.a.js.instance(outdir, js.type)
        }
    } else {
        # clone the repository from github
        message('Cloning the ', js.type, ' repository from github.')
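        # quote the target path so that directories containing spaces are handled correctly
        if (js.type == 'gGnome.js'){
            system(paste0('git clone https://github.com/mskilab/gGnome.js.git ', shQuote(outdir)))
        } else {
            system(paste0('git clone https://github.com/mskilab/PGV.git ', shQuote(outdir)))
        }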
    }

    # normalize path one more time to make sure we return the absolute path
    outdir = suppressWarnings(normalizePath(outdir))
    is.dir.a.js.instance(outdir, js.type)
    return(outdir)
}

acceptable.js.types = c('PGV', 'gGnome.js')

#' @name is.acceptable.js.type
#' @description internal
#'
#' Checks that the provided js.type is valid
#' 
#' @param js.type either "PGV" or "gGnome.js"
is.acceptable.js.type = function(js.type){
    if (!(js.type %in% acceptable.js.types)){
        stop('Invalid js.type. The only js types familiar to us are: ', paste(acceptable.js.types, collapse = ', '))
    }
}

#' @name gen_js_coverage_files
#' @description internal
#'
#' Generate arrow (for PGV) or the CSV (for gGnome.js, legacy) coverage files
#'
#' accepts any data type that is acceptable for #' 
#' @param data either a path to a TSV/CSV or a data.table
#' @param outdir the path to the PGV/gGnome.js repository clone
#' @param name.col column name in the input data table containing the sample names (default: "sample")
#' @param overwrite by default only files that are missing will be created. If set to TRUE then existing coverage arrow files and gGraph JSON files will be overwritten
#' @param cov.col column name in the input data table containing the paths to coverage files
#' @param js.type either "PGV" or "gGnome.js"
#' @param cov.field the name of the field in the coverage GRanges that should be used (default: "ratio")
#' @param cov.field.col column name in the input data table containing the name of the field in the coverage GRanges that should be used. If this is supplied then it overrides the value in "cov.field". Use this if some of your coverage files differ in the field used.
#' @param bin.width bin width to use when rebinning the coverage data (default: 1e4). If you don't want rebinning to be performed then set to NA.
#' @param gg.col column name in the input data table containing the paths to RDS files containing the gGnome objects (default: "graph")
#' @param patient.id column name in the input data table containing the patient ID/names (default: NA).
#' If your table includes more than one dataset (e.g. samples from multiple patients), then you can specify 
#' the column from which to read the dataset names. This column would be used to group together samples that belong to the 
#' same dataset. If no values are passed, we take the pair name as patientID.
#' @param ref the genome reference name used for this dataset. For specific behaviour refer to the PGV/gGnome.js wrappers
#' @param cov.color.field field in the coverage GRanges to use in order to set the color of coverage data points. If nothing is supplied then default colors are used for each seqname (namely chromosome) by reading the colors that are defined in the settings.json file for the specific reference that is being used for this dataset.
#' @param meta.js path to JSON file with metadata (for PGV should be located in "public/settings.json" inside the repository and for gGnome.js should be in public/genes/metadata.json)
#' @param kag.col (default: 'kag') name of column in input table that includes the paths to JaBbA karyographs
#' @param ncn.gr GRanges object or path to GRanges object containing normal copy number (ncn) values. The ncn values must be contained in a field named "ncn"
#' @param mc.cores how many cores to use
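#' @examples
#' \dontrun{
#' # A hedged usage sketch, not a tested recipe. Every path below is hypothetical,
#' # 'DEMO' is a made-up patient ID, and a PGV clone is assumed to exist at '/tmp/pgv'.
#' dt = data.table::data.table(sample = 's1',
#'                             coverage = '/path/to/s1.cov.rds',
#'                             graph = '/path/to/s1.gg.rds')
#' arrow.files = gen_js_coverage_files(dt, outdir = '/tmp/pgv',
#'                                     name.col = 'sample', cov.col = 'coverage',
#'                                     gg.col = 'graph', js.type = 'PGV',
#'                                     cov.field = 'ratio', bin.width = 1e4,
#'                                     patient.id = 'DEMO', ref = 'hg19',
#'                                     meta.js = '/tmp/pgv/public/settings.json',
#'                                     kag.col = 'kag')
#' # arrow.files holds the normalized path of each coverage file;
#' # samples for which coverage generation was skipped come back as NA
#' }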
#' 
#' @export
gen_js_coverage_files = function(data, outdir, name.col = 'sample', overwrite = FALSE, cov.col = 'coverage',
                                 js.type = 'PGV', cov.field = 'ratio', cov.field.col = NA,
                                 bin.width = 1e4, patient.id = NA, ref = 'hg19', gg.col = 'graph',
                                 cov.color.field = NULL, meta.js = NULL, kag.col = 'kag',
                                 ncn.gr = NA, mc.cores = 1){
    if (!is.na(cov.field.col)){
        if (!(cov.field.col %in% names(data))){
            stop(paste0('You provided the following invalid column name for cov.field.col: ', cov.field.col))
        }
        message('cov.field.col provided: "', cov.field.col, '". Will be reading coverage field name from this column.')
    }

    if (!(cov.col %in% names(data))){
        stop('Invalid cov.col. There is no column "', cov.col, '" in your data.')
    }

    cov_dir = get_js_cov_dir_path(outdir, js.type, patient.id)
    cov_files = mclapply(1:data[, .N], function(idx){
        skip_cov = FALSE
        covfn = get_js_cov_path(data[idx, get(name.col)], cov_dir, js.type)
        if (!is.na(cov.field.col)){
            cov.field = data[idx, get(cov.field.col)]
            message('cov.field: ', cov.field)
        }
        if (!file.exists(covfn) || overwrite){
            if (is.na(cov.field)){
                warning(paste0('No coverage field was provided for ', data[idx, get(name.col)], ' so no coverage will be generated.'))
                skip_cov = TRUE
            } else {
                if (is.na(cov.col)){
                    warning(paste0('No coverage data was provided for ', data[idx, get(name.col)], ' so no coverage will be generated.'))
                    skip_cov = TRUE
                } else {
                    cov_input_file = data[idx, get(cov.col)]
                    if (is.na(cov_input_file)){
                        warning(paste0('No coverage file was provided for ', data[idx, get(name.col)], ' so no coverage will be generated.'))
                        skip_cov = TRUE
                    } else {
                        if (!file.exists(cov_input_file)){
                            warning(paste0('The coverage file provided for ', data[idx, get(name.col)], ' does not exist (', cov_input_file, ') so no coverage will be generated.'))
                            skip_cov = TRUE
                        }
                    }
                }
            }
            if (skip_cov){
                return(NA)
            } else {
                # let's check kag file and ncn.gr
                if (!all(is.na(ncn.gr), na.rm = TRUE)){
                    if (is.character(ncn.gr)){
                        message('Loading normal copy number values from: ', ncn.gr)
                        ncn.gr = readRDS(ncn.gr)
                    }
                    if (!inherits(ncn.gr, 'GRanges')){
                        warning('Invalid ncn.gr value of class: "', class(ncn.gr), '" will be ignored. Expected GRanges.')
                        # actually ignore the invalid value, as the warning states
                        ncn.gr = NA
                    }
                } else {
                    if (kag.col %in% names(data)){
                        kag.fn = data[idx, get(kag.col)]
                        if (file.exists(kag.fn)){
                            kag = readRDS(kag.fn)
                            tryCatch({
                                ncn.gr = kag$segstats[, 'ncn']
                            }, error = function(e){
                                warning('Failed to load normal copy number (NCN) values from karyograph. Either bad path was provided or karyograph does not include ncn values. This is the error that was encountered: ', conditionMessage(e))
                            })
                            if (is.null(ncn.gr)){
                                warning('Looks like the provided karyograph does not contain normal copy number (ncn) values. Proceeding without.')
                                ncn.gr = NA
                            }
                        }
                    }
                }

                if (all(is.na(ncn.gr), na.rm = TRUE)){
                    message('No values provided for normal copy number. Proceeding without.')
                }

                # load gGraph
                gg = readRDS(data[idx, get(gg.col)])

                if (js.type == 'gGnome.js'){
                    cov2csv(cov_input_file, field = cov.field,
                            output_file = covfn, meta.js = meta.js, ncn.gr = ncn.gr, gg = gg)
                } else {
                    cov2arrow(cov_input_file, field = cov.field,
                              output_file = covfn, ref = ref,
                              cov.color.field = cov.color.field, overwrite = overwrite,
                              meta.js = meta.js, bin.width = bin.width, ncn.gr = ncn.gr, gg = gg)
                }
            }
        } else {
            message(covfn, ' found. Will not overwrite it.')
        }
        return(normalizePath(covfn))
    }, mc.cores = mc.cores)
    return(unlist(cov_files))
}

#' @name gen_gw_json_files
#' @description internal
#'
#' Generate json files that will represent your gWalk objects
#'
#' @param data either a path to a TSV/CSV or a data.table
#' @param outdir the path to the PGV (or gGnome.js, legacy) repository clone
#' @param meta.js path to JSON file with metadata (for PGV should be located in "public/settings.json" inside the repository and for gGnome.js should be in public/genes/metadata.json)
#' @param name.col column name in the input data table containing the sample names (default: "sample")
#' @param gw.col column name in the input data table containing the paths to RDS files containing the gWalk objects
#' @param js.type either "PGV" (current viz tool) or "gGnome.js" (legacy)
#' @param patient.id column name in the input data table containing the patient ID/names (default: 'participant').
#' If your table includes more than one dataset (e.g. samples from multiple patients), then you can specify 
#' the column from which to read the dataset names. This column would be used to group together samples that belong to the 
#' same dataset. If no values are passed, we take the pair name as patientID.
#' @param ref the genome reference name used for this dataset. For specific behaviour refer to the PGV/gGnome.js wrappers
#' @param overwrite by default only files that are missing will be created. If set to TRUE then existing gWalk JSON files will be overwritten
#' @param annotation which node/edge annotation fields to add to the gWalk JSON file. 
gen_gw_json_files = function(data, outdir, meta.js, name.col = 'sample', 
                            gw.col = 'walks', js.type = 'PGV',
                            patient.id = 'participant', ref = NULL, overwrite = FALSE, 
                            annotation = NULL){
    json_dir = get_gg_json_dir_path(outdir, js.type, patient.id)
    json_files = lapply(1:data[, .N], function(idx){
        gw.js = file.path(json_dir, paste0(data[idx, get(name.col)], ".walks.json"))
        if (!file.exists(gw.js) || overwrite){
            message("reading in ", data[idx, get(gw.col)])
            # TODO: at some point we need to do a sanity check to see that a valid rds of gWalk was provided
            gw = readRDS(data[idx, get(gw.col)]) %>% refresh
            if (gw$length == 0) {
                warning(sprintf("Zero walks in gWalk .rds file provided for sample %s! No walks json will be produced!", data[idx, get(name.col)]))
                return(NA)
            }
            sn.ref = parse.js.seqlengths(meta.js, js.type = js.type, ref = ref) %>% names
            sn.walks = seqlevels(gw)
            sn.walks.only = sn.walks[!sn.walks %in% sn.ref]
            gw.reduced = gw %&% sn.ref
            if (length(sn.walks.only) > 0) gw.reduced = gw.reduced[gw.reduced %^% sn.walks.only == FALSE]
            if (gw.reduced$length == 0){
                warning(sprintf('Provided gWalk .rds for sample %s contained walks, but they all involved sequences not contained in the chosen reference genome, so no walks json will be produced! Here is an example sequence name from your gWalks: "%s". And here is an example sequence from the reference used by "%s": "%s"', 
                                data[idx, get(name.col)], sn.walks.only[1], js.type, sn.ref[1]))
                return(NA)
            }
            if (length(sn.walks.only) > 0) {
                gw.excluded = gw %&% sn.walks.only
            } else {
                gw.excluded = gW()
            }
            if (gw.excluded$length > 0) {
                warning(sprintf('%i walks excluded because they (fully or partially) fell outside of reference ranges.', 
                                gw.excluded$length))
            }
            # PGV keeps the graph in its own JSON file; gGnome.js expects it inside the walks JSON
            also.print.graph.to.json = js.type != "PGV"
            gw.js = gw.reduced$json(filename = gw.js, verbose = TRUE, annotation = annotation,
                                    include.graph = also.print.graph.to.json)
        } else {
            message(gw.js, ' found. Will not overwrite it.')
        }
        return(normalizePath(gw.js))
    })
    return(unlist(json_files))
}

#' @name read.js.input.data
#' @description internal
#'
#' Accepts a TSV/CSV or a data.table and checks that it is properly formatted
#'
#' @param data either a path to a TSV/CSV or a data.table
#' @param name.col column name in the input data table containing the sample names (default: "sample")
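#' @examples
#' \dontrun{
#' # A minimal sketch with made-up sample names and hypothetical file paths;
#' # ':::' is used on the assumption that this helper is not exported.
#' dt = data.table::data.table(sample = c('s1', 's2'),
#'                             graph = c('s1.gg.rds', 's2.gg.rds'))
#' gGnome:::read.js.input.data(dt, name.col = 'sample')   # returns dt unchanged
#' # a table in which a sample name is repeated is rejected with an error
#' try(gGnome:::read.js.input.data(rbind(dt, dt), name.col = 'sample'))
#' }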
read.js.input.data = function(data, name.col = 'sample'){
    if (!inherits(data, 'data.table')){
        if (!is.character(data)){
            stop('Invalid input data of class: "', class(data), '". Expected data.table or path to CSV/TSV file.')
        }
        if (!file.exists(data)){
            stop('Invalid input data. The input data must be a data.table or a path to a CSV/TSV file')
        }
        data = fread(data)
    }
    if (!(name.col %in% names(data))){
        stop('Invalid name.col provided: "', name.col, '".')
    }
    if (any(duplicated(data[, get(name.col)]))){
        stop('The name.col must hold non-redundant values, but the name.col you provided has duplicates. Here is an example for a value with duplicates: ', data[duplicated(get(name.col)), get(name.col)][1])
    }
    return(data)
}

#' @name get_gg_json_dir_path
#' @description internal
#'
#' get the path to the gGraph JSON directory inside a js repository clone
#'
#' @param outdir the path to the PGV/gGnome.js repository clone
#' @param js.type either "PGV" or "gGnome.js"
#' @param patient.id column name in the input data table containing the patient ID/names (default: NA).
#' If your table includes more than one dataset (e.g. samples from multiple patients), then you can specify 
#' the column from which to read the dataset names. This column would be used to group together samples that belong to the 
#' same dataset. If no values are passed, we take the pair name as patientID.
get_gg_json_dir_path = function(outdir, js.type, patient.id = NA){
    if (js.type == 'gGnome.js'){
        gg_json_dir = paste0(outdir, '/json')
    } else {
        if (js.type == 'PGV'){
            gg_json_dir = get_pgv_data_dir(outdir, patient.id = patient.id)
        }
    }
    return(gg_json_dir)
}

#' @name get_js_cov_path
#' @description internal
#'
#' get the path to the coverage file (CSV for gGnome.js, arrow for PGV) of a sample inside the coverage directory
#'
#' @param nm name of the sample
#' @param cov_dir the path to the directory holding the coverage files
#' @param js.type either "PGV" or "gGnome.js"
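#' @examples
#' \dontrun{
#' # Illustrative only: 's1' and '/tmp/cov' are made-up values, and ':::' is
#' # used on the assumption that this helper is not exported.
#' gGnome:::get_js_cov_path('s1', '/tmp/cov', js.type = 'PGV')
#' # "/tmp/cov/s1-coverage.arrow"
#' gGnome:::get_js_cov_path('s1', '/tmp/cov', js.type = 'gGnome.js')
#' # "/tmp/cov/s1.csv"
#' }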
get_js_cov_path = function(nm, cov_dir, js.type){
    if (js.type == 'gGnome.js'){
        covfn = paste0(cov_dir, "/", nm, ".csv")
        return(covfn)
    } else {
        if (js.type == 'PGV'){
            covfn = paste0(cov_dir, '/', nm, '-coverage.arrow')
        }
    }
    return(covfn)
}

#' @name get_js_cov_dir_path
#' @description internal
#'
#' get the path to the directory holding the coverage files inside a PGV/gGnome.js repository clone
#'
#' @param outdir the path to the PGV/gGnome.js repository clone
#' @param js.type either "PGV" or "gGnome.js"
#' @param patient.id column name in the input data table containing the patient ID/names (default: NA).
#' If your table includes more than one dataset (e.g. samples from multiple patients), then you can specify 
#' the column from which to read the dataset names. This column would be used to group together samples that belong to the 
#' same dataset. If no values are passed, we take the pair name as patientID.
get_js_cov_dir_path = function(outdir, js.type, patient.id = NA){
    if (js.type == 'gGnome.js'){
        cov_dir = paste0(outdir, '/scatterPlot')
    } else {
        if (js.type == 'PGV'){
            cov_dir = get_pgv_data_dir(outdir, patient.id)
        }
    }
    return(cov_dir)
}

#' @name get_pgv_data_dir
#' @description internal
#'
#' get the path to the dataset's data dir inside the PGV directory
#'
#' @param outdir the path to the PGV/gGnome.js repository clone
#' @param patient.id column name in the input data table containing the patient ID/names (default: NA, in which case an error is raised).
#' If your table includes more than one dataset (e.g. samples from multiple patients), then you can specify 
#' the column from which to read the dataset names. This column would be used to group together samples that belong to the 
#' same dataset. If no values are passed, we take the pair name as patientID.
get_pgv_data_dir = function(outdir, patient.id = NA){
    if (is.na(patient.id)){
        stop('patient.id must be provided for PGV.')
    }
    # check whether outdir itself already points at a "data" directory
    outpath = basename(outdir)
    if (outpath != "data"){
        data_dir = paste0(outdir, '/public/data/', 
                          patient.id, '/')
    } else {
        # assume we are in data folder in outdir
        data_dir = paste0(outdir, "/", 
                          patient.id, '/')
    }
    # make sure the directory exists
    if (!dir.exists(data_dir)){
        message('Creating a directory for the PGV data files here: ', data_dir)
        dir.create(data_dir, recursive = TRUE)
    }
    return(data_dir)
}

#' @name cov2cov.js
#' @description
#'
#' Takes a GRanges with coverage data and converts it to a data.table with the info needed for gGnome.js and PGV
#'
#' if bin.width is specified then coverage data will also be rebinned
#' if convert.to.cn == TRUE then rel2abs will be applied
#'
#' @param cov coverage GRanges or path to file with coverage data (see gGnome::readCov for details)
#' @param meta.js path to JSON file with metadata (for PGV should be located in "public/settings.json" inside the repository and for gGnome.js should be in public/genes/metadata.json)
#' @param js.type either "PGV" or "gGnome.js"
#' @param field the name of the field in the coverage GRanges that should be used (default: "ratio")
#' @param bin.width bin width to use when rebinning the coverage data (default: 1e4). If you don't want rebinning to be performed then set to NA.
#' @param ref the genome reference name used for this dataset. For specific behaviour refer to the PGV/gGnome.js wrappers
#' @param cov.color.field field in the coverage GRanges to use in order to set the color of coverage data points. If nothing is supplied then default colors are used for each seqname (namely chromosome) by reading the colors that are defined in the settings.json file for the specific reference that is being used for this dataset.
#' @param convert.to.cn (TRUE) convert coverage depth values to copy number using purity and ploidy values.
#' @param ncn.gr GRanges object containing normal copy number (ncn) values. The ncn values must be contained in a field named "ncn"
#' @param gg gGraph object whose metadata (gg$meta) provides the purity and ploidy values used when converting coverage to copy number (default: NA, in which case no conversion is performed)
#' @return data.table containing the coverage data formatted according to the format expected by PGV and gGnome.js
#' @export
cov2cov.js = function(cov, meta.js = NULL, js.type = 'gGnome.js', field = 'ratio',
                      bin.width = NA, ref = NULL, cov.color.field = NULL,
                      convert.to.cn = TRUE, ncn.gr = NA, gg = NA){
    message('Reading in coverage data')
    x = readCov(cov)
    overlap.seqnames = seqlevels(x)

    ## respect the seqlengths in meta.js
    if (is.character(meta.js) && file.exists(meta.js)){
        sl = parse.js.seqlengths(meta.js, js.type = js.type, ref = ref)

        overlap.seqnames = intersect(seqlevels(x), names(sl))
        if (length(overlap.seqnames) == 0){
            stop('The names of sequences in the input coverage and in the reference don\'t match. This is an example seqname from the ref: "',
                 names(sl)[1],
            '". And here is an example from the coverage file: "', seqnames(x)[1], '".')
        }
        # make sure that seqnames overlap between coverage and reference
        invalid.seqnames = setdiff(seqnames(x), names(sl))
        if (length(invalid.seqnames) > 0){
            warning(sprintf('The coverage input includes sequence names that are not in the specified reference, and hence these sequence names will be excluded. These are the excluded sequences: %s', paste(as.character(invalid.seqnames), collapse = ', ')))
        }
    }

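    # only look in this function's own environment, so a stray global "sl" is not picked up
    if (!exists("sl", inherits = FALSE)){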
        sl = seqlengths(x)
    }
    if (all(is.na(sl))){
        stop("No seqlengths in the input.")
    }

    if (!is.element(field, names(mcols(x)))){
        stop("The provided field '", field, "' is not in the input coverage data")
    }

    fields = field

    if (!is.null(cov.color.field)){
        if (!is.element(cov.color.field, names(mcols(x)))){
            stop("The cov.color.field '", cov.color.field, "' is not in the input data")
        }
        fields = c(field, cov.color.field)
    }

	
    is_bin_width_numeric = is.numeric(bin.width)
    is_bin_width_null = is.null(bin.width)
    if (!is_bin_width_numeric && !is_bin_width_null) bin.width = as.numeric(bin.width)
    is_bin_width_numeric = is.numeric(bin.width)
    is_bin_width_len_one = NROW(bin.width) == 1
    is_bin_width_na = is_bin_width_len_one && (is.na(bin.width) || bin.width %in% c("NA"))
    # set.seed(42)
    # sample_len = ceiling(0.1*NROW(x))
    # widths = width(sample(x, sample_len, replace = TRUE))
    widths = width(x)
    is_rebinning_sensible = (
        !is_bin_width_null
        && is_bin_width_len_one && !is_bin_width_na
        && !(round(median(widths, na.rm = TRUE)) == bin.width)
    )

    if (is_rebinning_sensible){
        message('Rebinning coverage with bin.width=', bin.width)
        if (!is_bin_width_numeric){
            stop('bin.width must be numeric')
        }
        x.rebin = rebin(x, bin.width, field, FUN = median)
        if (!is.null(cov.color.field)){
            # we take the median value for numeric fields and a single value (by majority vote) for any other type
            # this is intended so that we can keep the colors if they existed
            my_cool_fn = function(value, width, na.rm){
                if (is.numeric(value)) median(value) else names(sort(table(value), decreasing = TRUE))[1]
            }
            x = gr.val(x.rebin, x,
                       val = fields, FUN = my_cool_fn)
        } else {
            x = x.rebin
        }
        message('Done rebinning')
    }

    cn.converted = FALSE
    # is.na() is not meaningful for a gGraph object, so test explicitly for the NA default
    if (!identical(gg, NA) && isTRUE(convert.to.cn)){
        tryCatch({
            purity = gg$meta$purity
            ploidy = gg$meta$ploidy
            if (is.numeric(purity) && !is.na(purity) && is.numeric(ploidy) && !is.na(ploidy)){
                if (!all(is.na(ncn.gr), na.rm = TRUE)){
                    message('Adding ncn values to coverage GRanges.')
                    x = x %$% ncn.gr
                    message('Done adding ncn values.')
                }
                x$cn = gGnome::rel2abs(x, purity = purity, ploidy = ploidy,
                                       field = field, field.ncn = 'ncn')
                cn.converted = TRUE
            } else {
                warning('No purity / ploidy values found. Coverage data will not be converted to copy number.')
            }
        },
        error = function(e){
            warning('Encountered error while converting coverage data to copy number values. Will skip and use coverage depth values as-is. Here is the error that was encountered: ', conditionMessage(e))
        })
    }


    ## build the cumulative coordinates
    dt = data.table(seqlevels = names(sl),
                    seqlengths = as.double(sl),
                    cstart = c(1, 1 + cumsum(as.double(sl))[-length(sl)]))

    ## build the data.table
    dat = as.data.table(merge(gr2dt(x), dt,
                by.x = "seqnames",
                by.y = "seqlevels",
                all.x = TRUE))
    dat = dat[seqnames %in% overlap.seqnames] # only keep seqnames that are in the reference
    dat[, new.start := start + cstart - 1]
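    ## Worked example (toy values, not real chromosome lengths): with
    ##   sl = c(chr1 = 100, chr2 = 50, chr3 = 30)
    ## the per-chromosome offsets are
    ##   c(1, 1 + cumsum(as.double(sl))[-length(sl)])   # 1, 101, 151
    ## so a bin that starts at position 5 on chr2 gets
    ##   new.start = 5 + 101 - 1 = 105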

    if (cn.converted){
        # if we converted to CN then we override the coverage depth values with CN values
        message('Coverage data converted to copy number values')
        dat[, (field) := cn]
    }
    # convert Inf to NA
    dat[get(field) == Inf, (field) := NA]

    return(dat)
}

#' @name cov2csv
#' @description
#'
#' Takes a GRanges with coverage data, converts it to the format needed for gGnome.js and PGV, and writes the result to a CSV file
#'
#' if bin.width is specified then coverage data will also be rebinned
#' if convert.to.cn == TRUE then rel2abs will be applied
#'
#' @param cov coverage GRanges or path to file with coverage data
#' @param meta.js path to JSON file with metadata (for PGV should be located in "public/settings.json" inside the repository and for gGnome.js should be in public/genes/metadata.json)
#' @param js.type either "PGV" or "gGnome.js"
#' @param field the name of the field in the coverage GRanges that should be used (default: "ratio")
#' @param output_file path of the CSV file to write (default: "coverage.csv")
#' @param bin.width bin width to use when rebinning the coverage data (default: 1e4). If you don't want rebinning to be performed then set to NA.
#' @param ref the genome reference name used for this dataset. For specific behaviour refer to the PGV/gGnome.js wrappers
#' @param cov.color.field field in the coverage GRanges to use in order to set the color of coverage data points. If nothing is supplied then default colors are used for each seqname (namely chromosome) by reading the colors that are defined in the settings.json file for the specific reference that is being used for this dataset.
#' @return path to the CSV file containing the coverage data formatted according to the format expected by PGV and gGnome.js
#' @export
cov2csv = function(cov,
        field = "ratio",
        output_file = "coverage.csv",
        ...)
{

    dat = cov2cov.js(cov, field = field, ...)

    outdir = dirname(output_file)
    ## make sure the path to the file exists
    if (!dir.exists(outdir)){
        dir.create(outdir, recursive = TRUE)
    }

    cov.out = dat[, .(x = start, y = get(field), chromosome = as.character(seqnames))][!is.na(y)]
    message('Writing coverage data to file: ', output_file)
    fwrite(cov.out, output_file,  sep = ",")
    return(output_file)
}


#' @name cov2arrow
#' @description
#'
#' Prepares a scatter plot arrow file with coverage info for PGV (https://github.com/mskilab/pgv)
#'
#' @param cov input coverage data (GRanges)
#' @param field which field of the input data to use for the Y axis
#' @param output_file output file path.
#' @param ref the name of the reference to use. If not provided, then the default reference that is defined in the meta.js file will be loaded.
#' @param cov.color.field a field in the input GRanges object to use to determine the color of each point
#' @param overwrite (logical) by default, if the output path already exists, it will not be overwritten.
#' @param meta.js path to JSON file with metadata for PGV (should be located in "public/settings.json" inside the repository)
#' @param bin.width (integer) bin width for rebinning the coverage (default: 1e4)
#' @author Alon Shaiber
#' @export
cov2arrow = function(cov,
        field = "ratio",
        output_file = 'coverage.arrow',
        ref = 'hg19',
        cov.color.field = NULL,
        overwrite = FALSE,
        meta.js = NULL,
        ...){

    outdir = dirname(output_file)
    dir.create(outdir, showWarnings = FALSE, recursive = TRUE)

    if (!file.exists(output_file) | overwrite){
        if (!requireNamespace("arrow", quietly = TRUE)) {
            stop('You must have the package "arrow" installed in order for this function to work. Please install it.')
        }

        message('Converting coverage format')
        dat = cov2cov.js(cov, meta.js = meta.js, js.type = 'PGV', field = field,
                         ref = ref, cov.color.field = cov.color.field, ...)
        message('Done converting coverage format')

        if (!is.null(cov.color.field)){
            dat[, color := color2numeric(get(cov.color.field))]
        } else {
            if (!is.null(meta.js)){
                ref_meta = get_ref_metadata_from_PGV_json(meta.js, ref)
                setkey(ref_meta, 'chromosome')
                dat$color = color2numeric(ref_meta[dat$seqnames]$color)
            } else {
                # no cov.color.field and no meta.js so set all colors to black
                dat$color = 0
            }
        }

        outdt = dat[, .(x = new.start, y = get(field), color)]

        # if there are any NAs for colors then set those to black
        outdt[is.na(color), color := 0]

        # remove NAs
        outdt = outdt[!is.na(y)]

        # sort according to x values (that is what PGV expects)
        outdt = outdt[order(x)]

        message('Writing arrow file (using write_feather)')
        arrow_table = arrow::Table$create(outdt, schema = arrow::schema(x = arrow::float32(), y = arrow::float32(), color = arrow::float32()))
        arrow::write_feather(arrow_table, output_file, compression="uncompressed")
    } else {
        message('arrow file, "', output_file, '" already exists.')
    }
    return(output_file)
}

#' @name color2hex
#' @description
#'
#' Takes a vector of colors and returns the hex color code for the colors
#'
#' Any color that could be parsed by col2rgb is acceptable. Missing or invalid values are assigned a default color. The color names could be a mix of hex color codes and names (e.g. "black")
#'
#' @param x vector of color names
#' @param default_color the color to default to for NAs and invalid values
#' @return vector of hex color codes
#' @author Alon Shaiber
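#' @examples
#' ## A minimal sketch of the per-color conversion, using only base R
#' ## (grDevices). `to_hex` is a hypothetical stand-in written for this
#' ## illustration; it is not the package function itself.
#' to_hex = function(col, default = '#000000'){
#'     tryCatch({
#'         v = grDevices::col2rgb(col)
#'         grDevices::rgb(v[1], v[2], v[3], maxColorValue = 255)
#'     }, error = function(e) default)
#' }
#' ## names and hex codes can be mixed; a name that col2rgb cannot parse
#' ## falls back to the default color
#' sapply(c('black', 'red', '#00FF00', 'not-a-color'), to_hex)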
color2hex = function(x, default_color = '#000000'){
    cols = lapply(x, function(y){
       tryCatch(col2rgb(y),
                error = function(e) 'default')
    })
    out = sapply(cols, function(y){
       tryCatch(rgb(y[1], y[2], y[3], maxColorValue = 255),
                error = function(e) 'default')
    })

    default_pos = sum(out == 'default')
    if (default_pos > 0){
        warning(sprintf('There were %s entries with missing or invalid colors and these were set to the default color: %s', default_pos, default_color))
        out[which(out == 'default')] = default_color
    }
    return(out)
}

#' @name colorhex2numeric
#' @description
#'
#' Takes a vector of color hex codes and returns a vector with the integer corresponding to each hex number
#'
#' @param x vector of color hex codes
#' @return numeric vector
#' @author Alon Shaiber
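#' @examples
#' ## A sketch of the hex-to-integer mapping in base R. With its default base,
#' ## strtoi() reads a "0x" prefix as hexadecimal, which is equivalent to
#' ## parsing the six digits in base 16.
#' hex = c('#000000', '#0000FF', '#FF0000')
#' strtoi(sub('#', '0x', hex, fixed = TRUE))        # 0 255 16711680
#' strtoi(sub('#', '', hex, fixed = TRUE), base = 16L)  # same values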
colorhex2numeric = function(x){
     return(strtoi(gsub('\\#', '0x', x)))
}

#' @name color2numeric
#' @description
#'
#' Takes a vector of colors and returns a numeric vector as expected by PGV
#'
#' Any color that could be parsed by col2rgb is acceptable. Missing or invalid values are assigned a default color. The color names could be a mix of hex color codes and names (e.g. "black")
#'
#' @param x vector of color names
#' @param default_color the color to default to for NAs and invalid values
#' @return numeric vector
#' @author Alon Shaiber
color2numeric = function(x, default_color = '#000000'){
     return(colorhex2numeric(color2hex(x, default_color = default_color)))
}

#' @name brewer.master
#' @title brewer.master
#'
#' @description
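#' Makes a lot of brewer colors using an "inexhaustible" brewer palette, i.e. it will not complain if the number of colors requested is too high.
#'
#' Yes - this technically violates the "grammar of graphics", but it is meant for quick and dirty use.
#'
#' @param n number of colors to generate, or a character/factor vector, in which case one color is generated per unique value and the output is named by those values
#' @param palette character specifying the palette to start with (options are: Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd, BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral, Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3)
#' @param wes (logical) if TRUE, use the wesanderson palettes instead of the RColorBrewer palettes (default: FALSE)
#' @param list (logical) if TRUE, return the available palettes and their sizes instead of colors (default: FALSE)
#' @return character vector of n colors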
#' @author Marcin Imielinski
#' @export
brewer.master = function(n, palette = NULL, wes = FALSE,  list = FALSE)
{
    if (wes)
    {
      palettes = c("Royal2"=5, "Chevalier1"=4, "Darjeeling1"=5, "IsleofDogs1"=6, "Darjeeling2"=5, "Moonrise1"=4, "BottleRocket1"=7, "Rushmore"=5, "Moonrise3"=5, "Cavalcanti1"=5, "Rushmore1"=5, "FantasticFox1"=5, "BottleRocket2"=5, "Royal1"=4, "IsleofDogs2"=5, "Moonrise2"=4, "GrandBudapest1"=4, "GrandBudapest2"=4, "Zissou1"=5)
    }
    else
    {
    palettes = list(
      sequential = c('Blues'=9,'BuGn'=9, 'BuPu'=9, 'GnBu'=9, 'Greens'=9, 'Greys'=9, 'Oranges'=9, 'OrRd'=9, 'PuBu'=9, 'PuBuGn'=9, 'PuRd'=9, 'Purples'=9, 'RdPu'=9, 'Reds'=9, 'YlGn'=9, 'YlGnBu'=9, 'YlOrBr'=9, 'YlOrRd'=9),
      diverging = c('BrBG'=11, 'PiYG'=11, 'PRGn'=11, 'PuOr'=11, 'RdBu'=11, 'RdGy'=11, 'RdYlBu'=11, 'RdYlGn'=11, 'Spectral'=11),
          qualitative = c('Accent'=8, 'Dark2'=8, 'Paired'=12, 'Pastel1'=8, 'Pastel2'=8, 'Set1'=9, 'Set2'=8, 'Set3'=12)
        );
      }

  palettes = unlist(palettes);
  if (list)
    return(palettes)


  if (is.null(palette))
    palette = names(palettes)[1]

  nms = NULL
    if (is.character(n) | is.factor(n))
    {
        nms = unique(n)
        n = length(nms)
    }
  
    names(palettes) = gsub('\\w+\\.', '', names(palettes))

    if (palette %in% names(palettes))
      i = match(palette, names(palettes))
    else
      i = ((max(c(1, suppressWarnings(as.integer(palette))), na.rm = T)-1) %% length(palettes))+1

    col = c();
    col.remain = n;

    while (col.remain > 0)
    {
      if (col.remain > palettes[i])
      {
        next.n = palettes[i]
        col.remain = col.remain-next.n;
      }
      else
      {
        next.n = col.remain
        col.remain = 0;
      }

      if (!wes)
        {
          col = c(col, RColorBrewer::brewer.pal(max(next.n, 3), names(palettes[i])))
        }
      else
      {
        col = c(col, wesanderson::wes_palettes[[names(palettes[i])]])
      }

      i = ((i) %% length(palettes))+1
    }

    col = col[1:n]
    names(col) = nms
    return(col)
}


#' @name gtf2json
#' @description Turning a GTF format gene annotation into JSON
#'
#' @param gtf path to GTF input file.
#' @param gtf.rds path to rds file which includes a data.table holding the GTF information.
#' @param gtf.gr.rds path to rds file which includes a GRanges holding the GTF information.
#' @param gr GRanges object with the GTF information
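#' @param metadata.filename metadata JSON output file name (default: ./metadata.json).
#' @param genes.filename genes JSON output file name (default: ./genes.json).
#' @param genes character vector of gene names. If provided, only these genes are included in the output.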
#' @param gene_weights table with weights for genes. The first column is the gene name and the second column is the numeric weight. Either a data.frame, data.table, or a path to a file need to be provided. Genes with weights above 10 would be prioritized to show when zoomed out in gGnome.js.
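#' @param grep regular expression. Only genes whose name matches are kept (ignored if genes is provided).
#' @param grepe regular expression. Genes whose name matches are excluded (ignored if genes is provided).
#' @param chrom.sizes path to a chromosome sizes file. If not provided then the default hg19 chromosome lengths will be used.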
#' @param include.chr chromosomes to include in the output. If not provided then all chromosomes in the reference are included.
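#' @param gene.collapse (logical) if TRUE, intervals are collapsed to the gene level, otherwise to the transcript level (default: TRUE)
#' @param verbose (logical) print progress messages (default: TRUE)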
#' @author Xiaotong Yao, Alon Shaiber
#' @return file_list list containing the paths of the metadata and genes JSON-formatted output files.
#' @export
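#' @examples
#' \dontrun{
#' ## A usage sketch, not run. The GTF path is a placeholder for a local copy
#' ## of an annotation file, and the gene names and weights are arbitrary
#' ## values chosen for illustration.
#' weights = data.frame(gene = c('TP53', 'MYC'), weight = c(20, 15))
#' out = gtf2json(gtf = 'gencode.v19.annotation.gtf.gz',
#'                gene_weights = weights,
#'                metadata.filename = file.path(tempdir(), 'metadata.json'),
#'                genes.filename = file.path(tempdir(), 'genes.json'))
#' out$genes.filename
#' }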
gtf2json = function(gtf=NULL,
                    gtf.rds=NULL,
                    gtf.gr.rds=NULL,
                    gr=NULL,
                    metadata.filename="./metadata.json",
                    genes.filename="./genes.json",
                    genes=NULL,
                    gene_weights=NULL,
                    grep=NULL,
                    grepe=NULL,
                    chrom.sizes=NULL,
                    include.chr=NULL,
                    gene.collapse=TRUE,
                    verbose = TRUE){

    if (!is.null(gtf.gr.rds)){
        message("Using GRanges from rds file.")
        infile = gtf.gr.rds
        gr = readRDS(gtf.gr.rds)
        dt = gr2dt(gr)
    } else if (!is.null(gtf.rds)){
        message("Using GTF data.table from rds file.")
        infile = gtf.rds
        dt = as.data.table(readRDS(gtf.rds))
    } else if (!is.null(gr)){
        message("Using input GRanges.")
        infile = 'Input GRanges'
        dt = gr2dt(gr)
    } else if (!is.null(gtf)){
        message("Using raw GTF file.")
        infile = gtf

        gr = rtracklayer::import.gff(gtf)
        dt = gr2dt(gr)

    } else {
        stop("No input gene annotation. Please provide one. If you wish to download the Human GENCODE v19 release you can do so using the following command: 'wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz'")
    }

    if (verbose){
        message("Finished reading raw data, start processing.")
    }

    ## get seqlengths
    if (is.null(chrom.sizes)){
        message("No ref genome seqlengths given, use default.")
        chrom.sizes = system.file("extdata", "human_g1k_v37.regular.chrom.sizes", package="gGnome")
    }
    if (!is.character(chrom.sizes)){
        stop('Invalid path provided for chrom.sizes.')
    }
    if (!file.exists(chrom.sizes) | file.size(chrom.sizes) == 0){
        stop('Invalid path provided for chrom.sizes.')
    }

    Sys.setenv(DEFAULT_GENOME=chrom.sizes)
    sl = tryCatch(hg_seqlengths(include.junk=TRUE),
                  error = function(x){
                      stop('Something seems to be wrong with the file you provided for chrom.sizes. This is what we know: "', conditionMessage(x), '". Here is an example for the format we expect: https://github.com/mskilab/gUtils/blob/master/inst/extdata/hg19.regularChr.chrom.sizes .')
                  })

    if (!is.null(include.chr)){
        sl = sl[include.chr]
    }
    chrs = data.table(seqnames = names(sl), seqlengths=sl)

    ## meta data field
    qw = function(x) paste0('"', x, '"') ## quote

    meta.json =paste0(paste0("{", qw("metadata"),': [\n'),
                     chrs[, paste("\t\t{",
                                  qw("chromosome"),": ", qw(seqnames),
                                  ", ", qw("startPoint"),": ", 1,
                                  ", ", qw("endPoint"), ": ", seqlengths,
                                  ", ", qw("color"),
                                  ": ", qw(substr(tolower(brewer.master( max(.I), 'BrBG' )), 1, 7)), " }",
                                  collapse=",\n",
                                  sep="")],
                     '\n  ],\n',
                     paste(
                     paste0(qw("sequences"), ": {", qw("T"), ": ", qw("#E6E431"), ", ", qw("A"), ": ", qw("#5157FB"), ", ", qw("G"), ": ", qw("#1DBE21"), ", ", qw("C"), ": ",qw("#DE0A17"), ", ", qw("backbone"), ": ", qw("#AD26FA"), "}"),
                     paste0(qw("coveragePointsThreshold"), ":  30000"),
                     paste0(qw("scatterPlot"), ": {", qw("title"), ": ", qw("Coverage"), "}"),
                     paste0(qw("barPlot"), ":  {", qw("title"), ":  ", qw("RPKM"), "}"),
                     paste0(qw("intervalsPanelHeightRatio"), ": 0.6"),
                     sep = ",\n")
                    )


    if (verbose){
        message("Metadata fields done.")
    }

    ## reduce columns: seqnames, start, end, strand, type, gene_id, gene_name, gene_type, transcript_id
    ## reduce rows: gene_status, "KNOWN"; gene_type, not "pseudo", not "processed transcript"
    dtr = dt
    if ('gene_status' %in% names(dt)){
        dtr = dt[gene_status=="KNOWN"]
    }
    dtr = dtr[!grepl("pseudo", gene_type) &
             gene_type != "processed_transcript",
             .(chromosome=seqnames, startPoint=start, endPoint=end, strand,
               title = gene_name, gene_name, type, gene_id, gene_type,
               transcript_id, transcript_name)]

    if(!is.null(include.chr)){
           dtr = dtr[chromosome %in% include.chr]
    }
    if (!is.null(genes)){
        dtr = dtr[title %in% genes]
    } else {
        if (!is.null(grep) | !is.null(grepe)) {
            if (!is.null(grep)){
                dtr = dtr[grepl(grep, title)]
            }
            if (!is.null(grepe)){
                dtr = dtr[!grepl(grepe, title)]
            }
        }
    }

    if (nrow(dtr)==0){
        stop("Error: No more data to present.")
    }

    if (gene.collapse){
        ## collapse by gene
        dtr[, hasCds := is.element("CDS", type), by=gene_id]
        dtr = rbind(dtr[hasCds==TRUE][type %in% c("CDS","UTR","gene")],
                    dtr[hasCds==FALSE][type %in% c("exon", "gene")])
        ## dedup
        dtr = dtr[!duplicated(paste(chromosome, startPoint, endPoint, gene_id))]
        dtr[, title := gene_name]
        dtr = dtr[type != "transcript"]

        ## group id
        dtr[, gid := as.numeric(as.factor(gene_id))]
        if (verbose){
            message("Intervals collapsed to gene level.")
        }
    } else {
        ## collapse by transcript
        dtr[, hasCds := is.element("CDS", type), by=transcript_id]
        dtr = rbind(dtr[hasCds==TRUE][type %in% c("CDS","UTR","transcript")],
                    dtr[hasCds==FALSE][type %in% c("exon","transcript")])
        ## dedup
        dtr = dtr[!duplicated(paste(chromosome, startPoint, endPoint, transcript_id))]
        dtr[, title := transcript_name]
        dtr = dtr[type != "gene"]

        ## group id
        dtr[, gid := as.numeric(as.factor(transcript_id))]
        if (verbose){
            message("Intervals collapsed to transcript level.")
        }
    }

    dtr[, iid := 1:nrow(dtr)]

    ## incorporate gene_weights
    if (!is.null(gene_weights)){
        # check that this is a data.table
        if (inherits(gene_weights, 'data.frame')){
            gene_weights = gene_weights %>% as.data.table
        } else {
            if (!file.exists(gene_weights)){
                stop('Gene weights must be provided either as a dataframe or as a path to a file with a tabular text format (with no header).')
            }
            gene_weights = fread(gene_weights, header = FALSE)
        }

        if (dim(gene_weights)[2] != 2){
            stop('gene_weights must be a table with just two columns.')
        }
        setnames(gene_weights, names(gene_weights), c('gene_name', 'weight'))

        # make sure that weights are numeric
        gene_weights[, weight := as.numeric(weight)]

        if (gene_weights[is.na(weight), .N] > 0){
            warning('Some weights provided in gene_weights are either invalid or missing and will be set to the default value (1).')
        }

        # check names of genes
        genes_in_gene_weights_but_not_in_dtr = setdiff(gene_weights$gene_name, dtr$gene_name)
        if (length(genes_in_gene_weights_but_not_in_dtr) > 0){
            warning(sprintf('The following gene names appear in the provided gene_weights, but do not match any of the genes in the reference genome (and hence will be ignored): %s', paste(genes_in_gene_weights_but_not_in_dtr, collapse = ', ')))
        }
        dtr = merge(dtr, gene_weights, by = 'gene_name', all.x = TRUE)

        ## set all missing weights to 1
        dtr[is.na(weight), weight := 1]
    } else {
        dtr[, weight := 1]
    }

    ## processing genes
    genes.json = dtr[, paste0(
        c(paste0('{', qw("genes"),": ["),
          paste(
              "\t{",
              qw("iid"), ": ", iid,
              ", ", qw("chromosome"), ": ", qw(chromosome),
              ", ", qw("startPoint"), ": ", startPoint,
              ", ", qw("endPoint"), ": ", endPoint,
              ", ", qw("y"), ": ", 0,
              ", ", qw("title"), ": ", qw(title),
              ", ", qw("group_id"), ": ", qw(gid),
              ", ", qw("type"), ": ", qw(type),
              ", ", qw("strand"), ": ", qw(strand),
              ", ", qw("weight"), ": ", weight,
              "}",
              sep = "",
              collapse = ',\n'),
          "]"),
        collapse = '\n')
        ]


    ## assembling the JSON
    out_meta = paste(c(meta.json, "}"),
                     sep = "")

    writeLines(out_meta, metadata.filename)
    message(sprintf('Wrote JSON metadata file of %s to %s', infile, metadata.filename))


    out_genes = paste(c(genes.json, "}"),
                     sep = "")
    writeLines(out_genes, genes.filename)
    message(sprintf('Wrote JSON genes file of %s to %s', infile, genes.filename))

    return(list(metadata.filename = metadata.filename, genes.filename = genes.filename))
}


#' @name jab2json
#' @description a wrapper function to dump JaBbA results run with Flow to gGnome.js viz
#' @export
jab2json = function(fn = "./jabba.simple.rds",
                    gGnome.js.dir = "~/git/gGnome.js"){

}


#' @name parse.js.seqlengths
#' @description
#' Takes a settings JSON file from either gGnome.js or PGV and parses it into a data.table
#' @param meta.js path to JSON file with metadata (for PGV should be located in "public/settings.json" inside the repository and for gGnome.js should be in public/genes/metadata.json)
#' @param js.type either 'gGnome.js' or 'PGV' to determine the format of the JSON file
#' @param ref the name of the reference to load (only relevant for PGV). If not provided, then the default reference (which is set in the settings.json file) will be loaded.
#' @author Alon Shaiber
#' @export
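#' @examples
#' ## A minimal sketch using toy files written to a temporary location. The
#' ## chromosome names, lengths, colors, and the reference name "toy" are made
#' ## up for illustration; the layout follows the fields that this function and
#' ## get_ref_metadata_from_PGV_json read.
#' gg.json = tempfile(fileext = '.json')
#' writeLines('{"metadata": [
#'     {"chromosome": "1", "startPoint": 1, "endPoint": 1000, "color": "#543005"},
#'     {"chromosome": "2", "startPoint": 1, "endPoint": 500, "color": "#8c510a"}
#' ]}', gg.json)
#' parse.js.seqlengths(gg.json, js.type = 'gGnome.js')
#'
#' pgv.json = tempfile(fileext = '.json')
#' writeLines('{"coordinates": {"default": "toy", "sets": {"toy": [
#'     {"chromosome": "1", "startPoint": 1, "endPoint": 1000, "color": "#543005"},
#'     {"chromosome": "2", "startPoint": 1, "endPoint": 500, "color": "#8c510a"}
#' ]}}}', pgv.json)
#' ## ref is left as NULL, so the "default" entry ("toy") is used
#' parse.js.seqlengths(pgv.json, js.type = 'PGV')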
parse.js.seqlengths = function(meta.js, js.type = 'gGnome.js', ref = NULL){
    if (!(js.type %in% c('gGnome.js', 'PGV'))){
        stop('js.type must be either gGnome.js or PGV')
    }
     if (js.type == 'gGnome.js'){
        message('Getting seqlengths for gGnome.js')
        settings = jsonlite::read_json(meta.js)
        if (!('metadata' %in% names(settings))){
            stop('Input JSON file is not a valid gGnome.js settings JSON. Please check the required format.')
        }
        meta = rbindlist(settings$metadata)
        sl = meta[, setNames(endPoint, chromosome)]
     } else {
        message('Getting seqlengths for PGV')
        ref_meta = get_ref_metadata_from_PGV_json(meta.js, ref)
        sl = ref_meta[, setNames(endPoint, chromosome)]
    }
    return(sl)
}

#' @name get_ref_metadata_from_PGV_json
#' @description internal
#' get a data.table with the metadata for the reference (columns: chromosome, startPoint, endPoint, color)
#' @param meta.js path to JSON file with metadata (for PGV should be located in "public/settings.json" inside the repository and for gGnome.js should be in public/genes/metadata.json)
#' @param ref the name of the reference to load (only relevant for PGV). If not provided, then the default reference (which is set in the settings.json file) will be loaded.
get_ref_metadata_from_PGV_json = function(meta.js, ref = NULL){
    meta = jsonlite::read_json(meta.js)
    if (!('coordinates' %in% names(meta))){
        stop('Input meta file is not a proper PGV settings.json format.')
    }
    coord = meta$coordinates
    if (is.null(ref)){
        # if no ref was provided then take the default
        if (!('default' %in% names(coord))){
            stop('Invalid meta file. The meta file, ', meta.js, ', is missing a default coordinates value.')
        }
        ref = coord$default
        message('No reference name provided so using default: "', ref, '".')
    }
    if (!('sets' %in% names(coord))){
        stop('Input meta file is not a proper PGV settings.json format.')
    }
    sets = coord$sets
    if (!(ref %in% names(sets))){
        stop('Invalid ref: ', ref, '. The ref does not appear to be described in ', meta.js)
    }
    seq.info = rbindlist(sets[[ref]])
    return(seq.info)
}

#' @name get_path_to_meta_js
#' @description internal
#' Get the path to the meta.js file
#' @param outdir the path to the PGV/gGnome.js folder.
#' @param js.type either "PGV" or "gGnome.js"
get_path_to_meta_js = function(outdir, js.type){
    is.acceptable.js.type(js.type)
    if (!dir.exists(outdir)){
        stop('No such directory: ', outdir)
    }
    if (js.type == 'gGnome.js'){
        meta.js = suppressWarnings(normalizePath(paste0(outdir, '/public/metadata.json')))
    }
    if (js.type == 'PGV'){
        meta.js = suppressWarnings(normalizePath(paste0(outdir, '/public/settings.json')))
    }
    if (!file.exists(meta.js)){
        stop('We could not find a metadata file where we expected it to be ("',
             meta.js,
             '"). Something must have gone wrong, please check that this is a valid clone of the ',
             js.type, ' github repository.')
    }
    return(meta.js)
}

get_cids = function(gg, cid.field){
    if (!(cid.field %in% names(gg$edges$dt))){
        warning('Invalid cid.field: "', cid.field, '"')
        return(NA)
    }
    if (length(gg$edges[type == 'ALT']) == 0){
        return(NA)
    }
    if (any(is.na(gg$edges[type == 'ALT']$dt[, get(cid.field)]))){
        warning('cid.field: "', cid.field, '" contains NAs and so will be ignored')
        return(NA)
    }
    if (!all(is.numeric(gg$edges[type == 'ALT']$dt[, get(cid.field)]))){
        warning('cid.field: "', cid.field, '" must contain numeric values only')
        return(NA)
    }
    return(gg$edges[type == 'ALT']$dt[, get(cid.field)])
}

#' @name is_cmd_available
#' @description internal
#'
#' Check if a certain command is available on the terminal
#'
#' @param cmd name of the command to look for
#' @param raise by default if cmd is not available then an error will be raised. Set raise = FALSE if you don't want an error to occur but just want to know if the command is available
is_cmd_available = function(cmd, raise = TRUE){
    conn = pipe(paste0('command -v ', cmd))
    available = length(readLines(conn)) > 0
    close(conn)
    if (!available){
        if (raise){
            stop(cmd, ' is not installed, please install.')
        }
        return(FALSE)
    }
    return(TRUE)
}
mskilab/gGnome documentation built on June 10, 2025, 11:29 p.m.