library(ungeneanno)

UNGeneAnno is written to enable the rapid collation of gene details from publicly available databases, initially those being the NCBI gene and Uniprot databases.

Further to the original aim, the package also includes the getPublicationList function which will returns a vector of objects detailing the results of a journal search of the NCBI PubMed database.

Nota Bene: The package was originally written to collate gene information from Uniprot and NIH/NCBI databases, thus alleviating repetitive searches. The aim was both to speed up "accessing" this data and to minimise the number of database calls, network load and server traffic. This vignette outlines how this is acheived.

Initial Matrix Input

A typical workflow begins with a matrix, wherein the first column represents a group identifier, numeric or character, and the second, gene names. The gene names, may include descriptive elements Nota Bene where the term 'gene names' is used throughout this vignette, the functions allow the use of gene or protein names, nucleotide or protein accession numbers, or Ensembl identifiers.

    mat <- matrix(c("1","BRAF.exp","1","BRCA2.mut","2","BRAF.cnv","2","AURKB.mut","2","PTEN.exp")
        ,ncol = 2,byrow = TRUE)
    mat

The getuniquegenelist function parses second column of the input matrix into a vector of character strings containing only the initial alphanumeric characters, so the above is treat as:

1   BRAF
1   BRCA2
2   BRAF
2   AURKB
2   PTEN

once the matrix is passed to getuniquegenelist, A geneanno object is populated with unique lists of both the group identifiers and gene names:

    geneanno <- getUniqueGeneList(geneanno(),mat)
    slot(geneanno,"genelist")
    slot(geneanno,"groupnos")

Nota Bene The getuniquegenelist function assumes that all supplied gene names are alphanumeric and anything following is descriptive; therefore the function truncates contents of the second column to the initial alphanumeric characters only.

Getting Gene Objects

Once these lists have been populated, the summary information can be sourced from the databases; returning a vector of gene objects containing the downloaded details for each gene.

genesummaries <- getGeneSummary(geneanno)

Once the details have been downloaded, the gene object is saved to a subdirectory which defaults to "genes" and is created in the working directory; However the main directory and subdirectory can be amended using slot(geneanno,"fileroot") <- "/path/to/directory" and slot(geneanno,"genefilestem") <- "directory_name". Prior to downloading, the method will check the subdirectory for a saved gene object younger than seven days and preferentially use any saved objects it finds.

Note Bene Whilst the functions accept Ensembl identifiers, gene names cannot currently be identified for those which are archived. These will be flagged and

Producing Output Files

The final task is to produce output files for each group identifier listing details of the genes related to it in the original input matrix.

groupgenelist <- getGroupGeneList(geneanno,mat)
produceOutputFiles(geneanno, groupgenelist, genesummaries)

Optionally, the original matrix can be used to search the NCBI PubMed database and the returned article information incorporated into the output files. The author would like to advise against using the searchPublications function unless the group identifiers are meaningful, for example if they are drug names. Further details can be found in the "UNGeneAnno: PubMed Journal Query example" vignette.

publicationmatrix <- searchPublications(mat)
produceOutputFiles(geneanno, groupgenelist, genesummaries, publicationmatrix)

Output files are saved in a subdirectory, which defaults to "gene_annotations" of the slot(geneanno,"fileroot") unless an alternative directory has been provided by slot(geneanno,"outputstem") <- "directory_name". Currently, the files are created as plain text files, named as the group identifiers, containing the appropriate details stored in the gene objects.



biscuit13161/UNGeneAnno documentation built on May 26, 2019, 2:33 a.m.