Description Usage Arguments Details Value GAF gene names Errors Warnings Todo Examples
This extracts a tab-delimited file of all canonical exon gene models from the TCGA gene annotation file (GAF). The canonical model is intended to be at least a common starting point for a simplified definition of "the gene called X", based on the union of all transcripts. By default each model (row) in the output file is uniquely identified by gene name.
gaf |
The full-path name of the GAF [REQ] |
outFile |
The output filename. By default this is the GAF filename with ".geneModels" appended, so the file is created in the same directory as the GAF. An existing file will not be overwritten unless force= TRUE is set. |
force |
If the output file exists, setting this |
uniqueGene |
By default this is |
skipUnknownGene |
By default this is |
This function is implemented using a Unix system command and requires the "grep" program, so it only works on Linux/Mac systems. [TODO - reimplement in pure R.] It works with the GAF version used for the TCGA RNAseq expression data files. That version is available for download at the NCI, uncompressed (TCGA.hg19.June2011.gaf) or gzipped (TCGA.hg19.June2011.gaf.gz).
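The platform requirement above can be checked up front. The following is a minimal sketch in base R; `canRunExtract` is a hypothetical helper name and is not part of this package.

```r
# canRunExtract() is a hypothetical helper, not part of the package.
# It checks the two prerequisites described above: a unix-like OS and grep.
canRunExtract <- function() {
  isUnix  <- .Platform$OS.type == "unix"      # TRUE on Linux and macOS
  hasGrep <- nzchar(Sys.which("grep"))        # Sys.which() gives "" if not found
  isUnix && unname(hasGrep)
}

if (!canRunExtract()) {
  message("extractGeneModels() needs a unix-like system with grep on the PATH.")
}
```

Running such a check first gives a clearer message than a failed system command would.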
The main output is the GAF gene model extract file, which by default contains just the genes with unique names. See GAF_v2_file_description.docx. In addition, this function returns a list with some summary info about the GAF and the generated geneModels file:
$gaf | The GAF filename used as a parameter (relative filename). |
$gaf_real | The absolute full path filename to the input GAF. |
$gaf_md5 | The md5 checksum of the GAF, as a string. |
$uniqueGene | The uniqueGene parameter setting used. |
$skipUnknownGene | The skipUnknownGene parameter setting used. |
$gaf_lines | The number of lines in the GAF. |
$gaf_models | The number of models in the GAF. [Currently this does not include unknown genes if skipUnknownGene= TRUE.] |
$gaf_models_unique | The number of unique models in the GAF. Will be NA if uniqueGene= FALSE. |
$gaf_extract | The output filename, based on the input GAF by default. |
$gaf_extract_real | The absolute full path filename of the output file. |
$gaf_extract_md5 | The md5 checksum of the output file. |
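One use of the recorded checksums is to confirm later that the output file has not changed. The sketch below assumes the checksum is stored as a plain character string, as described above; `checkExtract` is a hypothetical helper, and the `stats` list is built by hand as a stand-in for a real return value so that the example runs without a GAF.

```r
# checkExtract() is a hypothetical helper: it recomputes the md5 checksum of
# the output file and compares it with the value stored in the returned list.
checkExtract <- function(stats) {
  current <- unname(tools::md5sum(stats$gaf_extract_real))
  identical(current, stats$gaf_extract_md5)
}

# Hand-built stand-in for the list returned by extractGeneModels().
tmp <- tempfile(fileext = ".geneModels")
writeLines(c("geneA\tplaceholder", "geneB\tplaceholder"), tmp)
stats <- list(
  gaf_extract_real = normalizePath(tmp),
  gaf_extract_md5  = unname(tools::md5sum(tmp))
)

unchanged <- checkExtract(stats)
```

The same comparison could be made for the input file using `$gaf_real` and `$gaf_md5`.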
Some genes in the GAF have multiple variants with the same name. These are annotated like "1ofN", "2ofN", etc. By default only the "1of..." variant is kept. Setting uniqueGene= FALSE will keep all variants, in which case gene name will no longer be a unique key.
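The default variant filter described above can be illustrated on a small vector of labels. This is only a sketch: the gene names are made up, and the "|" separator and exact position of the "NofM" annotation are assumptions rather than the documented GAF layout.

```r
# Hypothetical labels; the "|" separator and label layout are assumptions.
labels <- c("GENEA", "GENEB|1of2", "GENEB|2of2",
            "GENEC|1of3", "GENEC|2of3", "GENEC|3of3")

# TRUE for labels that carry an "NofM" annotation at the end.
isVariant <- grepl("[0-9]+of[0-9]+$", labels)
# TRUE only for the first variant ("1of..."), not for e.g. "11of12".
isFirst <- grepl("(^|[^0-9])1of[0-9]+$", labels)

# Default behaviour as described: keep unannotated names and "1of" variants.
kept <- labels[!isVariant | isFirst]

# With the annotation stripped (for illustration only), names are unique.
baseNames <- sub("\\|?[0-9]+of[0-9]+$", "", kept)
```

Keeping every element of `labels` instead corresponds to uniqueGene= FALSE, where the stripped names repeat.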
Some genes in the GAF have no known name and are annotated as "?". Some of these have multiple versions also. There are 32 such genes in the GAF. By default none of these are kept. If uniqueGene= FALSE is set, these will still be skipped unless skipUnknownGene= FALSE is also set. There is no way to keep these genes while skipping the variants of the named genes.
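The interaction between the two settings can be summarized as a small decision function. This is a sketch of the documented rules only, not the function's actual code; `resolveFlags` and the names in its result are hypothetical. The warning text matches the `uniqueGene=TRUE sets skipUnknownGene=TRUE` warning documented for this function.

```r
# resolveFlags() is a hypothetical helper mirroring the documented rules.
resolveFlags <- function(uniqueGene = TRUE, skipUnknownGene = TRUE) {
  if (uniqueGene && !skipUnknownGene) {
    # Unique gene names are impossible if the unnamed "?" genes are kept.
    warning("uniqueGene=TRUE sets skipUnknownGene=TRUE")
    skipUnknownGene <- TRUE
  }
  list(
    keepVariants = !uniqueGene,       # keep "2ofN", "3ofN", ... models
    keepUnknown  = !skipUnknownGene   # keep models named "?"
  )
}

defaults <- resolveFlags()
everything <- resolveFlags(uniqueGene = FALSE, skipUnknownGene = FALSE)
```

No combination of arguments yields `keepUnknown = TRUE` together with `keepVariants = FALSE`, which is the limitation noted above.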
The gene "SLC35E2" is present twice, but without a numbered annotation. By default, only the larger (encompassing) version of "SLC35E2" is kept. If uniqueGene= FALSE is set, then both versions of this gene will be kept, regardless of the skipUnknownGene setting.
These errors are fatal and will terminate processing.
Unsafe character in GAF filename!
An invalid character was passed as part of the GAF filename. This is important as the filename is used in a system command as a parameter and could be used for command injection.
Can't find the specified GAF: "file"
The specified GAF doesn't seem to exist on the file system. You probably have the name wrong or are using a relative name from the wrong directory, but it could also be that permissions are hiding it.
Unsafe character in output geneModel filename!
An invalid character was passed as part of the output filename. This is important as the filename is used in a system command as a parameter and could be used for command injection.
Output file already exists; use force= TRUE to overwrite: "file"
The specified GAF gene extract output file already exists. You probably don't want to overwrite it. However, you can set force= TRUE to allow this. It will still generate a warning.
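The unsafe-character errors above guard against command injection through filenames. A whitelist check is one common way to do this. The sketch below is an assumption about how such a check might look: `isSafeFilename` is a hypothetical helper, and the allowed character set is illustrative, not the set this function actually enforces.

```r
# isSafeFilename() is a hypothetical check; the whitelist below is an
# assumption, not the character set enforced by extractGeneModels().
isSafeFilename <- function(path) {
  is.character(path) && length(path) == 1 && !is.na(path) &&
    grepl("^[A-Za-z0-9._/-]+$", path)   # letters, digits, ".", "_", "/", "-"
}

okName  <- isSafeFilename("data/TCGA.hg19.June2011.gaf")
badName <- isSafeFilename("gaf; rm -rf ~")
```

Rejecting anything outside a known-safe set is stricter than trying to list dangerous characters. Quoting arguments with base R's `shQuote()` before passing them to a system command is a complementary safeguard.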
Forcing overwrite of output file: "file"
Just letting you know an existing file is actually being overwritten. This won't happen unless explicitly allowed (by setting force= TRUE). Having a warning allows distinguishing between the cases where an overwrite occurred and those where one was allowed but did not occur.
uniqueGene=TRUE sets skipUnknownGene=TRUE
You can't have unique gene names if you keep the genes without names. I could just blow up, but I'm just going to assume that since you asked for unique gene names, that's what you really want. That means I have to ignore your request to keep the unknown genes. Not what you wanted? That's why I'm warning you.
System commands are used for several things in this function. If they fail, error messages are returned as warnings.
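Because these conditions are reported as warnings, a caller can record them and act on them afterwards, for example to tell whether an overwrite actually occurred. The sketch below uses base R's `withCallingHandlers()`. `collectWarnings` is a hypothetical helper, and `fakeExtract` is a stand-in for a real call such as `extractGeneModels( 'path/to/GAF', force= TRUE )`.

```r
# collectWarnings() is a hypothetical helper: it evaluates an expression,
# records each warning message, and lets evaluation continue.
collectWarnings <- function(expr) {
  msgs <- character(0)
  value <- withCallingHandlers(
    expr,
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")   # continue instead of printing
    }
  )
  list(value = value, warnings = msgs)
}

# Stand-in for the real function, emitting the overwrite warning text.
fakeExtract <- function() {
  warning('Forcing overwrite of output file: "file"')
  "done"
}

res <- collectWarnings(fakeExtract())
overwritten <- any(grepl("^Forcing overwrite", res$warnings))
```

Matching on the message text is fragile if the wording changes, so this is best treated as a convenience rather than a contract.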
Add test for corner case - one exon gene
## Not run:
# Extract gene models to the default output file
stats <- extractGeneModels( 'path/to/GAF' )
# Same, with all defaults made explicit
stats <- extractGeneModels(
    gaf= 'path/to/GAF', outFile= 'path/to/GAF.geneModels', force= FALSE,
    uniqueGene= TRUE, skipUnknownGene= TRUE
)
# Extract gene models to gaf.genes in run directory
stats <- extractGeneModels( 'path/to/GAF', outFile= 'gaf.genes' )
# Overwrite outFile if it exists (here using the default name)
stats <- extractGeneModels( 'path/to/GAF', force= TRUE )
# Extract all gene models, including duplicates and unknowns.
stats <- extractGeneModels(
    gaf= 'path/to/GAF', uniqueGene= FALSE, skipUnknownGene= FALSE
)
# Extract all gene models except unknown (includes duplicates)
stats <- extractGeneModels( gaf= 'path/to/GAF', uniqueGene= FALSE )
## End(Not run)