GetGeneList: A Function to Filter and Save Genomic Features from NCBI (all...



Description

GetGeneList lets the user access the NCBI database for the specified species through the secure FTP site, download feature information, and filter and save that information for future use. This update uses the rentrez package to let users specify whether the latest assembly build should be used. Once the GetGeneList function has completed, no further access to NCBI or the internet is required. The function requires user input to determine the feature and class types that are retained during filtering. Note: The requirements for this function have changed slightly due to changes in the organization of the NCBI FTP site.

Usage

GetGeneList(Species, latest = TRUE, savefiles = TRUE, destfile)

Arguments

Species

This term designates the species to be used in the function, identified by its scientific name. It must be given in quotation marks, with the genus and species separated by a space (e.g., "Bos taurus").

latest

Default is TRUE. This term indicates whether the most recent (latest) assembly build for the species should be used to retrieve genomic features. If set to FALSE, the user will be prompted to identify the assembly to use. In some species, the same assembly may be listed more than once (e.g., GCF_000003055.6_Bos_taurus_UMD_3.1.1 vs. GCF_000003055.5_Bos_taurus_UMD_3.1.1). In that case, the listings differ in their file number (e.g., "3055.6" vs. "3055.5" for Bos taurus 3.1). Always start with the higher file number for that build, as it likely contains the feature table. If this fails, try the other version. The assembly build should always match the build of the marker map file.
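As an illustration of the "higher file number" rule above, the following minimal sketch picks the listing with the higher accession version from the two example names. The helper `assembly_version` is hypothetical and is not part of Map2NCBI; it assumes names of the form `GCF_<digits>.<version>_<label>`.

```r
# Hypothetical helper (not part of Map2NCBI): extract the accession version,
# i.e. the digits after the first "." in a name like "GCF_000003055.6_...".
assembly_version <- function(x) {
  as.integer(sub("^GC[AF]_[0-9]+\\.([0-9]+)_.*$", "\\1", x))
}

# The two listings quoted in the argument description.
candidates <- c("GCF_000003055.5_Bos_taurus_UMD_3.1.1",
                "GCF_000003055.6_Bos_taurus_UMD_3.1.1")

versions <- assembly_version(candidates)

# Start with the listing that has the higher version number.
preferred <- candidates[which.max(versions)]
print(preferred)
```

GetGeneList itself asks the user to choose the assembly interactively when latest = FALSE; the sketch only shows how the two names can be compared by eye or in code.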

savefiles

Default is TRUE. If TRUE, both the original feature list downloaded from the NCBI database and the filtered feature list produced by the function are saved as text files. Options: Must be either TRUE or FALSE.

destfile

This is the path to the location on the computer where files will be saved. It must be a character string, either given in quotation marks or returned by a function such as getwd().

Details

While this function runs, the user will be prompted for input after the file downloads. If multiple types are present, the function asks for 1) the primary feature type and 2) the primary class type to prioritize when filtering the dataset. In each case, the user can opt to keep all feature and class types, which means that duplicate information will be present per gene ID. If the data are filtered, all unique gene IDs are returned, with preference given to the feature and class types specified. Gene IDs without the preferred feature and class types are queried for their available information and added, while duplicates are still removed. The file returned contains 20 columns based on the current NCBI file structure. The column headings and descriptions are provided below in the Value section.
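The filtering behaviour described above can be sketched on a toy table. This is only an illustration under stated assumptions, not the package's actual implementation: the helper `filter_features` and the toy values are hypothetical, and only three of the 20 columns are used.

```r
# Toy feature table (hypothetical values) with three of the returned columns.
features <- data.frame(
  feature = c("CDS", "gene", "mRNA", "gene", "tRNA"),
  class   = c("with_protein", "protein_coding", "", "pseudogene", ""),
  GeneID  = c(101L, 101L, 101L, 202L, 303L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one row per GeneID, preferring rows that match
# the chosen feature and class types and falling back to whatever is available.
filter_features <- function(df, feature_type, class_type) {
  preferred <- df$feature == feature_type & df$class == class_type
  # Put preferred rows first; order() keeps ties in their original order.
  ordered <- df[order(!preferred), ]
  # Keep the first row encountered for each GeneID, dropping duplicates.
  ordered <- ordered[!duplicated(ordered$GeneID), ]
  ordered[order(ordered$GeneID), ]
}

filtered <- filter_features(features, "gene", "protein_coding")
print(filtered)
```

In this sketch, GeneID 101 keeps its preferred gene/protein_coding row, while GeneIDs 202 and 303 have no preferred row and are retained with the information they do have.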

Note: If the user presses "Enter" prematurely while waiting for the function to run, the function will not run correctly and will have to be started over. Please read the instructions carefully.

If savefiles = TRUE, then both the original file from NCBI and the filtered file the user specified will be saved in the destfile location. Once the function has run, the user can choose to either use the information at that time or call it later using the saved file. In either case, the output from the filtered file can be used with marker data to run the MapMarkers function that is also a part of this package.

Value

Column headings and descriptions returned to the user from the GetGeneList function.

feature

The type of feature based on INSDC, which can include GENE, RNA (various types), and CDS.

class

Gene features are subdivided into classes according to the gene biotype. ncRNA features are subdivided according to the ncRNA_class. CDS features are subdivided into with_protein and without_protein, depending on whether the CDS feature has a protein accession assigned or not. CDS features marked as without_protein include CDS features for C regions and V/D/J segments of immunoglobulin and similar genes that undergo genomic rearrangement, and pseudogenes.

assembly

Accession.version of the assembly.

assembly_unit

The name of the assembly unit, such as "Primary Assembly", "ALT_REF_LOCI_1", or "non-nuclear".

seq_type

The type of sequence the feature is from. Typically includes chromosome, mitochondrion, plasmid, or unplaced scaffold.

chromosome

The chromosome the feature is located on, which can include mitochondrial DNA or unknown (blank) if applicable.

genomic_accession

The accession.version of the genomic sequence the feature is found on.

start

The start position of the feature on the chromosome.

end

The end position of the feature on the chromosome.

strand

The orientation of the feature on the chromosome (can be + or -).

product_accession

The accession.version of the product referenced by this feature, if it exists.

non-redundant_refseq

For bacteria and archaea assemblies, this column contains the non-redundant WP_ protein accession corresponding to the CDS feature. This may be the same as the previous column for RefSeq genomes annotated directly with WP_ RefSeq proteins, or may be different for genomes annotated with genome-specific protein accessions (e.g. NP_ or YP_ RefSeq proteins) that reference a WP_ RefSeq accession.

related_accession

For eukaryotic RefSeq annotations, this is the RefSeq protein accession corresponding to the transcript feature, or the RefSeq transcript accession corresponding to the protein feature.

name

For genes, this is the gene description or full name. For RNA, CDS, and some other features, this is the product name.

symbol

The gene symbol.

GeneID

The gene ID in the NCBI database that corresponds to the feature.

locus_tag

No description available from NCBI. Typically a blank column.

feature_interval_length

This is the sum of the lengths of all intervals for the feature (i.e. the length without introns for a joined feature).

product_length

This is the length of the product corresponding to the accession.version in the "product_accession" column. Protein product lengths are in amino acid units and do not include the stop codon, which is included in the "feature_interval_length" column. Additionally, product_length may differ from feature_interval_length if the product contains sequence differences vs. the genome, as found for some RefSeq transcript and protein products based on mRNA sequences and also for INSDC proteins that are submitted to correct genome discrepancies.
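As a worked example of how the two length columns relate for a protein-coding CDS, the sketch below assumes a complete CDS with no sequence differences vs. the genome. The numeric value is hypothetical; the relationship follows from the units described above (amino acids without the stop codon vs. nucleotides with it).

```r
# Hypothetical protein length in amino acids (stop codon excluded).
product_length <- 100L

# Each amino acid is encoded by 3 nucleotides, plus 3 for the stop codon,
# which feature_interval_length includes.
expected_interval_length <- 3L * (product_length + 1L)
print(expected_interval_length)
```

A row where feature_interval_length does not equal 3 * (product_length + 1) may be a partial feature or one of the discrepancy cases mentioned above, so this check is a heuristic rather than a validation rule.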

attributes

A semi-colon delimited list of a controlled set of qualifiers, if available. The list currently includes: partial, pseudo, pseudogene, ribosomal_slippage, trans_splicing, anticodon=NNN (for tRNAs), old_locus_tag=XXX.
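Because the attributes column is a semi-colon delimited string, it usually needs to be split before use. The following is a minimal base R sketch; the helper `parse_attributes` is hypothetical and not part of Map2NCBI, and the input string is assembled from the qualifiers listed above.

```r
# Hypothetical helper: split one attributes string into a named character
# vector. Flag-style qualifiers (e.g. "partial") get NA as their value.
parse_attributes <- function(x) {
  parts <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]                 # drop empty pieces
  has_value <- grepl("=", parts, fixed = TRUE)  # key=value vs. bare flag
  keys <- sub("=.*$", "", parts)
  values <- ifelse(has_value, sub("^[^=]*=", "", parts), NA_character_)
  setNames(values, keys)
}

parsed <- parse_attributes("partial;pseudo;old_locus_tag=XXX")
print(parsed)

# Check whether a given qualifier is present for this feature.
is_pseudo <- "pseudo" %in% names(parsed)
```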

Note

For issues or problems with this function, please contact Lauren Hanna at Lauren.Hanna@ndsu.edu.

Author(s)

Lauren L. Hulsman Hanna and David G. Riley

References

Hulsman Hanna, L. L., and D. G. Riley. 2014. Mapping genomic markers to closest feature using the R package Map2NCBI. Livest. Sci. 162:59-65. doi:10.1016/j.livsci.2014.01.019

National Center for Biotechnology Information. 2018. Latest assembly version 'README' file, last updated 26 February 2018. Available at: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/README.txt (Accessed 23 Jan 2020).

See Also

Function: MapMarkers, Package: rentrez

Examples

#Example 1: Run the following example and, when prompted,
#choose [n],[1],[n],[3] to filter the build and feature
#information. This example is interactive and requires
#user input. Please note that pressing "Enter" prematurely
#can cause the function to not run properly.
## Not run: 
GeneList = GetGeneList("Bos taurus",destfile=getwd())

## End(Not run)

Map2NCBI documentation built on March 26, 2020, 6:23 p.m.