extractGeneModels: Extract gene models from the GAF

Description

This extracts a tab-delimited file of all canonical exon gene models from the TCGA gene annotation file (GAF). The canonical model is intended to be, at the least, a common starting point for a simplified definition of "the gene called X", based on the union of all transcripts. By default each model (row) in the output file is uniquely identified by gene name.

Usage

extractGeneModels(gaf, outFile = paste0(gaf, ".geneModels"), force = FALSE,
  uniqueGene = TRUE, skipUnknownGene = TRUE)

Arguments

gaf

The full-path name of the GAF [REQ]

outFile

The output filename. By default this is the GAF filename with ".geneModels" appended (hence it will be created in the same directory as the GAF). An existing file will not be overwritten unless force = TRUE.

force

If the output file exists, setting this TRUE will allow overwriting it. Doing so generates a warning.

uniqueGene

By default this is TRUE and only one copy of every gene will be kept. This makes the gene name a unique key. The GAF contains additional versions of some genes which are skipped. See the GAF gene names section below for more information.

skipUnknownGene

By default this is TRUE and the unknown genes (those whose names in the GAF are "?") are dropped. See the GAF gene names section below for more information. Note that this setting is overridden to TRUE, with a warning, if uniqueGene = TRUE, as all these genes have the same gene name ("?").

Details

This function is implemented using a unix system command and requires the "grep" program, so it only works on linux/mac systems. [TODO - reimplement in pure R.] The GAF version this works with is the one used for the TCGA RNAseq expression data files. It is available for download from the NCI either uncompressed (TCGA.hg19.June2011.gaf) or gzipped (TCGA.hg19.June2011.gaf.gz).
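Since the function depends on an external "grep" program, it can be useful to check for it before calling. The following is only a minimal sketch of such a check; canRunExtract is a hypothetical helper and is not part of this package.

```r
# Hypothetical helper (not part of the package): report whether the
# prerequisites described above appear to be met on this machine.
canRunExtract <- function() {
   isUnix  <- .Platform$OS.type == "unix"        # TRUE on linux and mac
   hasGrep <- nzchar(Sys.which("grep"))          # "" when grep is not on the PATH
   isUnix && unname(hasGrep)
}

if (!canRunExtract()) {
   message("extractGeneModels() needs a unix-like system with grep installed.")
}
```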

Value

The main output is the GAF gene model extract file, which by default contains just the genes with unique names. See GAF_v2_file_description.docx. In addition, this function returns a list with some summary info about the GAF and the generated geneModels file:

$gaf The GAF filename used as a parameter (relative filename).
$gaf_real The absolute full path filename to the input GAF.
$gaf_md5 The md5 checksum of the GAF, as a string.
$uniqueGene The uniqueGene parameter setting used.
$skipUnknownGene The skipUnknownGene parameter setting used.
$gaf_lines The number of lines in the GAF.
$gaf_models The number of models in the GAF. [Currently this does not include unknown genes if skipUnknownGene= TRUE].
$gaf_models_unique The number of unique models in the GAF. Will be NA if uniqueGene= FALSE.
$gaf_extract The output filename, based on the input GAF by default.
$gaf_extract_real The absolute full path filename of the output file.
$gaf_extract_md5 The md5 checksum of the output file.
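The recorded checksums make it possible to confirm later that a file has not changed. The sketch below assumes the md5 strings in the returned list are directly comparable to the output of tools::md5sum(); md5Matches is a hypothetical helper, and a temporary file stands in for a real extract so that the example runs on its own.

```r
# Hypothetical helper: does a file still match a recorded md5 string?
md5Matches <- function(file, expected) {
   actual <- unname(tools::md5sum(file))   # NA if the file cannot be read
   !is.na(actual) && identical(actual, expected)
}

# Stand-in for a real output file, only so this sketch is self-contained.
tmp <- tempfile(fileext = ".geneModels")
writeLines("placeholder content", tmp)
recorded <- unname(tools::md5sum(tmp))

md5Matches(tmp, recorded)

# With a real result from extractGeneModels(), the equivalent check would be:
# md5Matches(stats$gaf_extract_real, stats$gaf_extract_md5)
```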

GAF gene names

Some genes in the GAF have multiple variants with the same name. These are annotated like "1ofN", "2ofN", etc. By default only the "1ofN" variant is kept. Setting uniqueGene = FALSE will keep all variants. The gene name is then no longer a unique key.

Some genes in the GAF have no known name and are annotated as "?". Some of these also have multiple versions. There are 32 such genes in the GAF. By default none of these are kept. If uniqueGene = FALSE is set, these will still be skipped unless skipUnknownGene = FALSE is also set. There is no way to keep these genes while skipping the variants of the named genes.

The gene "SLC35E2" is present twice, but without a numbered annotation. By default, only the larger (encompassing) version of "SLC35E2" is kept. If uniqueGene= FALSE is set, then both versions of this gene will be kept, regardless of the skipUnknownGene setting.
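The selection rules for numbered variants and unknown genes can be illustrated in plain R. This is only a sketch: the gene names are invented, the exact layout of the gene field in the GAF is an assumption, and the special handling of "SLC35E2" is not reproduced. The function itself uses grep on the file rather than this code.

```r
# Illustrative names only; real GAF gene fields may be laid out differently.
genes <- c("GENEA", "GENEB|1of2", "GENEB|2of2", "?", "?|1of3", "GENEC")

isVariant <- grepl("[0-9]+of[0-9]+$", genes)        # any "KofN" annotation
isFirst   <- grepl("(^|[^0-9])1of[0-9]+$", genes)   # the "1ofN" copy only
isUnknown <- grepl("^\\?", genes)                   # name is "?"

# uniqueGene = TRUE: keep un-annotated genes and "1ofN" copies, drop unknowns
keptUnique <- genes[(!isVariant | isFirst) & !isUnknown]

# uniqueGene = FALSE, skipUnknownGene = TRUE: keep all variants, drop unknowns
keptNamed <- genes[!isUnknown]

# uniqueGene = FALSE, skipUnknownGene = FALSE: keep everything
keptAll <- genes
```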

Errors

These errors are fatal and will terminate processing.

Unsafe character in GAF filename!

An invalid character was passed as part of the GAF filename. This is important as the filename is used in a system command as a parameter and could be used for command injection.

Can't find the specified GAF: "file"

The specified GAF doesn't seem to exist on the file system. You probably have the name wrong or are using a relative name from the wrong directory, but it could also be that permissions are hiding it.

Unsafe character in output geneModel filename!

An invalid character was passed as part of the output filename. This is important as the filename is used in a system command as a parameter and could be used for command injection.

Output file already exists; use force= TRUE to overwrite: "file"

The specified GAF gene extract output file already exists. You probably don't want to overwrite it. However, you can set force= TRUE to allow this. It will still generate a warning.
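To see why the unsafe character errors matter, consider a filename that contains shell syntax. A whitelist test is one common way to reject such names. The sketch below is not the package's actual check: isSafeFilename is a hypothetical helper, and the set of permitted characters is an assumption.

```r
# Hypothetical whitelist check. The characters this package really accepts
# are not documented here, so this pattern is an assumption.
isSafeFilename <- function(path) {
   grepl("^[A-Za-z0-9_./-]+$", path)
}

okName  <- isSafeFilename("data/TCGA.hg19.June2011.gaf")
badName <- isSafeFilename("my.gaf; rm -rf ~")   # contains ';' and spaces
```

Here okName is TRUE and badName is FALSE. Quoting arguments with the base R function shQuote() is another common defense when building system commands.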

Warnings

Forcing overwrite of output file: "file"

Just letting you know an existing file is actually being overwritten. This won't happen unless explicitly allowed by setting force = TRUE. Having a warning allows distinguishing between the cases where an overwrite occurred vs those where one was allowed but did not occur.

uniqueGene=TRUE sets skipUnknownGene=TRUE

You can't have unique gene names if you keep the genes without names. I could just blow up, but I'm just going to assume that since you asked for unique gene names, that's what you really want. That means I have to ignore your request to keep the unknown genes. Not what you wanted? That's why I'm warning you.

Various warnings from failed system commands

System commands are used for several things in this function. If they fail, error messages are returned as warnings.
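The interaction between the arguments, the overwrite error, and the first two warnings can be sketched as follows. This is a hypothetical re-creation based only on the behavior documented above, not the package's code; checkOptions is an invented name, and it assumes the uniqueGene warning is raised only when skipUnknownGene = FALSE was requested.

```r
# Hypothetical re-creation of the documented checks; not the package's code.
checkOptions <- function(outFile, force = FALSE,
                         uniqueGene = TRUE, skipUnknownGene = TRUE) {
   if (file.exists(outFile)) {
      if (!force) {
         stop("Output file already exists; use force= TRUE to overwrite: \"",
              outFile, "\"")
      }
      warning("Forcing overwrite of output file: \"", outFile, "\"")
   }
   if (uniqueGene && !skipUnknownGene) {
      # Assumption: warn only when the caller asked to keep unknown genes.
      warning("uniqueGene=TRUE sets skipUnknownGene=TRUE")
      skipUnknownGene <- TRUE
   }
   list(uniqueGene = uniqueGene, skipUnknownGene = skipUnknownGene)
}

# Collect warnings instead of printing them, so a script can react to them.
caught <- character(0)
opts <- withCallingHandlers(
   checkOptions(tempfile(), uniqueGene = TRUE, skipUnknownGene = FALSE),
   warning = function(w) {
      caught <<- c(caught, conditionMessage(w))
      invokeRestart("muffleWarning")
   }
)
```

The same withCallingHandlers() pattern can be wrapped around a real call to extractGeneModels() to tell apart runs where an overwrite actually happened from runs where it was merely allowed.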

Todo

Examples

## Not run: 

# Extract gene models to the default output file
stats <- extractGeneModels( 'path/to/GAF' )

# Same, with all defaults made explicit
stats <- extractGeneModels(
   gaf= 'path/to/GAF', outFile= 'path/to/GAF.geneModels', force= FALSE,
   uniqueGene= TRUE, skipUnknownGene= TRUE
)

# Extract gene models to gaf.genes in run directory
stats <- extractGeneModels( 'path/to/GAF', outFile= 'gaf.genes' )

# Overwrite outFile if it exists (here using the default name)
stats <- extractGeneModels( 'path/to/GAF', force= TRUE )

# Extract all gene models, including duplicates and unknowns.
stats <- extractGeneModels(
   gaf= 'path/to/GAF', uniqueGene= FALSE, skipUnknownGene= FALSE
)

# Extract all gene models except unknown (includes duplicates)
stats <- extractGeneModels( gaf= 'path/to/GAF', uniqueGene= FALSE )

## End(Not run)

jefferys/fusionExpressionPlot documentation built on May 19, 2019, 3:59 a.m.