process_gencodefile: Preprocess gencode file


Description

Filter a GENCODE file to include only genes on chromosomes 1-22, X, Y, and M, then reformat it and return it as a data.table.

Usage

process_gencodefile(gencode_path)

Arguments

gencode_path

File path of the downloaded .gtf.gz file.

Details

Gencode files are downloaded from http://www.gencodegenes.org/releases/grch37_mapped_releases.html

The input is assumed to be in .gtf.gz format.

First, the entries are filtered on feature_type == 'gene' and status == 'KNOWN'. This mostly excludes transcripts. The 'chr' prefix is removed from the chromosome values, and entries on any chromosome other than 1-22, X, Y, or M are removed. The resulting chromosome values are cast into an ordered factor (ordering: 1-22, X, Y, M). Additional columns are then extracted from the key-value pairs in the info column. Any genes with a gene_type in c('misc_RNA','snoRNA','snRNA') are removed. Finally, the redundant columns score, phase, and info are dropped, and a new column ensembl_gene_id is created from gene_id with the version suffix stripped (i.e. the id is x instead of x.y). The resulting table still contains duplicate gene names, but these are removed after the merge with the canonical hgnc data.
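The steps above can be sketched in base R on a small in-memory table. This is an illustration only, not the package's implementation: the toy rows, the made-up gene IDs, the column names (chr, feature_type, info), the helper extract_key, and the assumption that the status is stored as a gene_status key inside the info column are all hypothetical. The real function reads a .gtf.gz file and returns a data.table rather than a data.frame.

```r
# Toy stand-in for a parsed GTF table (all values are invented).
gtf <- data.frame(
  chr = c("chr1", "chr1", "chrX", "chrM", "chrGL000192.1", "chr2", "chr22"),
  feature_type = c("gene", "transcript", "gene", "gene", "gene", "gene", "gene"),
  score = ".",
  phase = ".",
  info = c(
    'gene_id "ENSG00000000001.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AAA";',
    'gene_id "ENSG00000000001.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AAA";',
    'gene_id "ENSG00000000003.1"; gene_type "snoRNA"; gene_status "KNOWN"; gene_name "BBB";',
    'gene_id "ENSG00000000004.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CCC";',
    'gene_id "ENSG00000000005.7"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "DDD";',
    'gene_id "ENSG00000000006.3"; gene_type "protein_coding"; gene_status "NOVEL"; gene_name "EEE";',
    'gene_id "ENSG00000000007.10_3"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "FFF";'
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull the quoted value of one key out of the info column.
extract_key <- function(info, key) {
  has_key <- grepl(paste0(key, ' "'), info, fixed = TRUE)
  value <- sub(paste0('.*', key, ' "([^"]*)".*'), "\\1", info)
  ifelse(has_key, value, NA_character_)
}

# 1. Keep gene-level entries with status KNOWN.
keep <- gtf$feature_type == "gene" &
  grepl('gene_status "KNOWN"', gtf$info, fixed = TRUE)
genes <- gtf[keep, ]

# 2. Strip the 'chr' prefix, keep chromosomes 1-22, X, Y, M, and order them.
chr_levels <- c(as.character(1:22), "X", "Y", "M")
genes$chr <- sub("^chr", "", genes$chr)
genes <- genes[genes$chr %in% chr_levels, ]
genes$chr <- factor(genes$chr, levels = chr_levels, ordered = TRUE)

# 3. Extract additional columns from the key-value pairs in info.
genes$gene_id <- extract_key(genes$info, "gene_id")
genes$gene_type <- extract_key(genes$info, "gene_type")
genes$gene_name <- extract_key(genes$info, "gene_name")

# 4. Drop the small RNA gene types.
genes <- genes[!genes$gene_type %in% c("misc_RNA", "snoRNA", "snRNA"), ]

# 5. Drop redundant columns and strip the version suffix from gene_id.
genes <- genes[, setdiff(names(genes), c("score", "phase", "info")), drop = FALSE]
genes$ensembl_gene_id <- sub("\\..*$", "", genes$gene_id)
rownames(genes) <- NULL
```

Because chr is an ordered factor, genes[order(genes$chr), ] sorts the rows in chromosome order (1-22, X, Y, M) rather than alphabetically.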

Value

The processed GENCODE table, as a data.table.

Examples

## Not run: 
gencode_data <- process_gencodefile(gencode_path)

## End(Not run)

svenstringer/genematrix documentation built on May 30, 2019, 8:48 p.m.