library(ungeneanno)
UNGeneAnno is written to enable the rapid collation of gene details from publicly available databases, initially those being the NCBI gene and Uniprot databases.
Further to the original aim, the package also includes the getPublicationList
function which will returns a vector of objects detailing the results of a journal search of the NCBI PubMed database.
Nota Bene: The package was originally written to collate gene information from Uniprot and NIH/NCBI databases, thus alleviating repetitive searches. The aim was both to speed up "accessing" this data and to minimise the number of database calls, network load and server traffic. This vignette outlines how this is acheived.
A typical workflow begins with a matrix, wherein the first column represents a group identifier, numeric
or character
, and the second, gene names. The gene names, may include descriptive elements
f <- matrix(c("1","BRAF.exp","1","BRCA2.mut","2","BRAF.cnv","2","AURKB.mut","2","PTEN.exp") ,ncol=2,byrow=TRUE) f
The getuniquegenelist
function parses second column of the input matrix into a vector of character strings containing only the initial alphanumeric characters, so the above is treat as:
1 BRAF 1 BRCA2 2 BRAF 2 AURKB 2 PTEN
once the matrix is passed to getuniquegenelist
, A geneanno object is populated with unique lists of both the group identifiers and gene names:
geneanno <- getUniqueGeneList(geneanno(),f) geneanno@genelist geneanno@groupnos
Nota Bene The getuniquegenelist
function assumes that all supplied gene names are alphanumeric and anything following is descriptive; therefore the function truncates contents of the second column to the initial alphanumeric characters only.
Once these lists have been populated, the summary information can be sourced from the databases; returning a vector of gene
objects containing the downloaded details for each gene.
genesummaries <- getGeneSummary(geneanno)
Once the details have been downloaded, the gene object is saved to a subdirectory which defaults to "genes" and is created in the working directory; However the main directory and subdirectory can be amended using geneanno@fileroot <- "/path/to/directory"
and geneanno@genefilestem <- "directory_name"
.
Prior to downloading, the method will check the subdirectory for a saved gene object younger than seven days and preferentially use any saved objects it finds.
The final task is to produce output files for each group identifier listing details of the genes related to it in the original input matrix.
groupgenelist <- getGroupGeneList(geneanno,matrix) produceOutputFiles(geneanno, groupgenelist, genesummaries)
Optionally, the original matrix can be used to search the NCBI PubMed database and the returned article information incorporated into the output files. Theauthor would like to advise against using the searchPublications
function unless the group identifiers are meaningful, for example if they are drug names. Further details can be found in the "UNGeneAnno: PubMed Journal Query example" vignette.
publicationmatrix <- searchPublications(matrix) produceOutputFiles(geneanno, groupgenelist, genesummaries, publicationmatrix)
Output files are saved in a subdirectory, which defaults to "gene_annotations" of the geneanno@fileroot
unless an alternative directory has been provided by geneanno@outputstem <- "directory_name"
.
Currently, the files are created as plain text files, named as the group identifiers, containing the appropriate details stored in the gene
objects.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.