library(ungeneanno)
UNGeneAnno is written to enable the rapid collation of gene details from publicly available databases, initially those being the NCBI gene and Uniprot databases.
Further to the original aim, the package also includes the getPublicationList function, which returns a vector of objects detailing the results of a journal search of the NCBI PubMed database.
Nota Bene: The package was originally written to collate gene information from Uniprot and NIH/NCBI databases, thus alleviating repetitive searches. The aim was both to speed up access to this data and to minimise the number of database calls, network load and server traffic. This vignette outlines how this is achieved.
A typical workflow begins with a matrix in which the first column holds a group identifier, numeric or character, and the second column holds gene names. The gene names may include descriptive elements.
Nota Bene: Where the term 'gene names' is used throughout this vignette, the functions allow the use of gene or protein names, nucleotide or protein accession numbers, or Ensembl identifiers.
mat <- matrix(c("1","BRAF.exp","1","BRCA2.mut","2","BRAF.cnv","2","AURKB.mut","2","PTEN.exp"), ncol = 2, byrow = TRUE)
mat
The getUniqueGeneList function parses the second column of the input matrix into a vector of character strings containing only the initial alphanumeric characters, so the above is treated as:
1 BRAF
1 BRCA2
2 BRAF
2 AURKB
2 PTEN
Once the matrix is passed to getUniqueGeneList, a geneanno object is populated with unique lists of both the group identifiers and gene names:
geneanno <- getUniqueGeneList(geneanno(), mat)
slot(geneanno,"genelist")
slot(geneanno,"groupnos")
Nota Bene: The getUniqueGeneList function assumes that all supplied gene names are alphanumeric and anything following is descriptive; therefore the function truncates the contents of the second column to the initial alphanumeric characters only.
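The truncation rule can be illustrated with base R alone. The following is a minimal sketch of the behaviour described in this vignette, not the package's own implementation; the regular expression and the variable names are assumptions made for illustration.

```r
# Illustrative only: mimic the described truncation with base R.
mat <- matrix(c("1", "BRAF.exp", "1", "BRCA2.mut", "2", "BRAF.cnv",
                "2", "AURKB.mut", "2", "PTEN.exp"),
              ncol = 2, byrow = TRUE)

# Keep the leading run of alphanumeric characters and drop everything after it.
truncated <- sub("^([[:alnum:]]+).*$", "\\1", mat[, 2])

# Unique values, in order of first appearance.
unique_genes  <- unique(truncated)
unique_groups <- unique(mat[, 1])

unique_genes
unique_groups
```

Under a rule of this kind, a name containing punctuation, such as a hyphen, would be cut at that character, so names of that form may be worth checking before they are passed in.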
Once these lists have been populated, the summary information can be sourced from the databases, returning a vector of gene objects containing the downloaded details for each gene.
genesummaries <- getGeneSummary(geneanno)
Once the details have been downloaded, the gene object is saved to a subdirectory, which defaults to "genes" and is created in the working directory. However, the main directory and subdirectory can be amended using slot(geneanno,"fileroot") <- "/path/to/directory" and slot(geneanno,"genefilestem") <- "directory_name".
Prior to downloading, the method will check the subdirectory for a saved gene object younger than seven days and preferentially use any saved objects it finds.
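The seven-day check can be pictured as a comparison between a file's modification time and the current time. The sketch below uses only base R; the helper name is_fresh, its max_age_days parameter and the use of an .rds file are hypothetical and are not taken from the package, whose storage format is not described here.

```r
# Hypothetical helper: TRUE if `path` exists and was modified less than
# `max_age_days` days ago, FALSE otherwise.
is_fresh <- function(path, max_age_days = 7) {
  if (!file.exists(path)) {
    return(FALSE)
  }
  age_days <- as.numeric(difftime(Sys.time(), file.info(path)$mtime,
                                  units = "days"))
  age_days < max_age_days
}

# A temporary file stands in for a saved gene object (format is an assumption).
cached <- tempfile(fileext = ".rds")
saveRDS(list(name = "BRAF"), cached)
is_fresh(cached)

# Backdate the file by eight days; it should now be treated as stale.
Sys.setFileTime(cached, Sys.time() - 8 * 24 * 60 * 60)
is_fresh(cached)
```

A check of this kind is what allows a repeated run to skip the download for any gene retrieved within the last week.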
Nota Bene: Whilst the functions accept Ensembl identifiers, gene names cannot currently be identified for those which are archived. These will be flagged and
The final task is to produce output files for each group identifier listing details of the genes related to it in the original input matrix.
groupgenelist <- getGroupGeneList(geneanno, mat)
produceOutputFiles(geneanno, groupgenelist, genesummaries)
Optionally, the original matrix can be used to search the NCBI PubMed database and the returned article information incorporated into the output files. The author advises against using the searchPublications function unless the group identifiers are meaningful, for example if they are drug names. Further details can be found in the "UNGeneAnno: PubMed Journal Query example" vignette.
publicationmatrix <- searchPublications(mat)
produceOutputFiles(geneanno, groupgenelist, genesummaries, publicationmatrix)
Output files are saved in a subdirectory of slot(geneanno,"fileroot"), which defaults to "gene_annotations" unless an alternative directory has been provided by slot(geneanno,"outputstem") <- "directory_name".
Currently, the files are created as plain text files, named after the group identifiers, containing the appropriate details stored in the gene objects.
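To make the one-file-per-group layout concrete, the sketch below groups gene names by identifier and writes one plain text file per group using base R only. It is not how produceOutputFiles is implemented: the ".txt" extension, the demo directory name and the file contents (gene names rather than full gene details) are assumptions made for illustration.

```r
# Illustrative only: one plain text file per group identifier.
groups <- c("1", "1", "2", "2", "2")
genes  <- c("BRAF", "BRCA2", "BRAF", "AURKB", "PTEN")

# Collect the genes belonging to each group.
genes_by_group <- split(genes, groups)

# Write to a temporary directory so the working directory is left untouched.
out_dir <- file.path(tempdir(), "gene_annotations_demo")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

for (group in names(genes_by_group)) {
  # The ".txt" extension is an assumption; the vignette only says "plain text".
  writeLines(genes_by_group[[group]],
             file.path(out_dir, paste0(group, ".txt")))
}

list.files(out_dir)
```

This also shows why meaningful group identifiers are helpful: they become the file names a reader sees first.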