Filter a gencode file to include only genes on chromosomes 1-22,X,Y,M and reformat it before
returning it as a data.table
process_gencodefile(gencode_path)
gencode_path: file path of the downloaded .gtf.gz file
Gencode files are downloaded from http://www.gencodegenes.org/releases/grch37_mapped_releases.html
Assumed input format .gtf.gz:
1) chromosome name chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M} or GRC accession
2) annotation source {ENSEMBL,HAVANA}
3) feature type {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
4) genomic start location integer-value (1-based)
5) genomic end location integer-value
6) score (not used)
7) genomic strand {+,-}
8) genomic phase (for CDS features) {0,1,2,.}
9) additional information as key-value pairs
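The nine-column layout above can be read into a data.table roughly as follows. This is only a sketch, not the code used by process_gencodefile: the two records are made up, the column names in gtf_cols are assumptions chosen to match the names used in this page, and a real .gtf.gz file would first be read from disk (for example with readLines(gzfile(gencode_path))).

```r
library(data.table)

# Two made-up GTF records in the nine-column, tab-separated layout,
# preceded by a header line of the kind that starts with '#'.
gtf_lines <- c(
  "##description: hypothetical header line",
  paste("chr1", "HAVANA", "gene", "11869", "14412", ".", "+", ".",
        "gene_id \"ENSG00000000001.1\"; gene_type \"pseudogene\"; gene_status \"KNOWN\"; gene_name \"GENEA\";",
        sep = "\t"),
  paste("chrX", "ENSEMBL", "transcript", "100", "200", ".", "-", ".",
        "gene_id \"ENSG00000000002.3\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"GENEB\";",
        sep = "\t")
)

# Drop header/comment lines before parsing.
records <- gtf_lines[!startsWith(gtf_lines, "#")]

# Assumed column names for the nine GTF fields.
gtf_cols <- c("chromosome", "source", "feature_type", "start", "end",
              "score", "strand", "phase", "info")

# quote = "" keeps the double quotes inside the info column untouched.
gencode_raw <- fread(text = records, sep = "\t", header = FALSE,
                     quote = "", col.names = gtf_cols)
```

Here start and end are parsed as integers, while score and phase stay character because they contain '.'.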
First, the entries are filtered on feature_type == 'gene' and status == 'KNOWN'.
This mostly excludes transcripts. The 'chr' prefix is removed from chromosome values and any
chromosomes other than 1-22,X,Y,M are removed. The resulting chromosome values are cast into
an ordered factor (ordering: 1-22,X,Y,M). Then additional columns are extracted from the
key-value pairs in the info column. Any genes with gene_types in c('misc_RNA','snoRNA','snRNA')
are removed. Finally, the redundant columns score, phase, and info are removed and a
new column ensembl_gene_id is created from gene_id without the subnumbering
(i.e. the id is x instead of x.y). The resulting table still contains duplicate gene names,
but these will be removed after the merge with the canonical hgnc data.
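The steps described above can be sketched on a small hand-made table. This is an illustration under assumptions, not the package's implementation: the rows and gene ids are invented, extract_key is a hypothetical helper, the key names (gene_id, gene_type, gene_status, gene_name) follow the usual GTF attribute names, and the status is pulled out of the info column before filtering for simplicity.

```r
library(data.table)

# Hypothetical rows standing in for a parsed GTF file.
mk_info <- function(id, type, status, name) {
  sprintf('gene_id "%s"; gene_type "%s"; gene_status "%s"; gene_name "%s";',
          id, type, status, name)
}
gencode <- data.table(
  chromosome   = c("chr1", "chrX", "chr1", "GL000192.1", "chr2", "chrM"),
  feature_type = c("gene", "gene", "transcript", "gene", "gene", "gene"),
  score = ".", phase = ".",
  info = mk_info(
    paste0("ENSG0000000000", 1:6, ".", c(2, 1, 2, 1, 3, 1)),
    c("protein_coding", "snoRNA", "protein_coding", "protein_coding",
      "protein_coding", "Mt_rRNA"),
    c("KNOWN", "KNOWN", "KNOWN", "KNOWN", "NOVEL", "KNOWN"),
    paste0("GENE", LETTERS[1:6])
  )
)

# Hypothetical helper: return the quoted value stored under one key.
extract_key <- function(info, key) {
  pattern <- paste0('.*\\b', key, ' "([^"]*)".*')
  ifelse(grepl(pattern, info, perl = TRUE),
         sub(pattern, "\\1", info, perl = TRUE),
         NA_character_)
}

# 1) Keep gene-level records with status KNOWN.
gencode[, gene_status := extract_key(info, "gene_status")]
genes <- gencode[feature_type == "gene" & gene_status == "KNOWN"]

# 2) Strip the 'chr' prefix, keep 1-22,X,Y,M, and order the chromosomes.
chrom_levels <- c(1:22, "X", "Y", "M")
genes[, chromosome := sub("^chr", "", chromosome)]
genes <- genes[chromosome %in% chrom_levels]
genes[, chromosome := factor(chromosome, levels = chrom_levels, ordered = TRUE)]

# 3) Extract further keys, then drop the unwanted gene types.
for (k in c("gene_id", "gene_type", "gene_name")) {
  set(genes, j = k, value = extract_key(genes$info, k))
}
genes <- genes[!gene_type %in% c("misc_RNA", "snoRNA", "snRNA")]

# 4) Drop redundant columns and add an id without the subnumbering.
genes[, c("score", "phase", "info") := NULL]
genes[, ensembl_gene_id := sub("\\..*$", "", gene_id)]
```

In this toy input only the chr1 and chrM genes survive: the other rows are dropped for being a transcript, having a non-KNOWN status, sitting on an unplaced contig, or being a snoRNA.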
The processed gencode table, returned as a data.table
## Not run:
gencode_data <- process_gencodefile(gencode_path)
## End(Not run)