10-kgml-util: Utility Functions to Parse KGML Files

kgml-utility R Documentation

Utility Functions to Parse KGML Files

Description

Extract entities of different types from KGML files in order to convert the pathway to a mathematical graph that we can compute on.

Usage

collectEntries(xmldoc, anno = c("all", "one", "batch"))
collectRelations(xmldoc)
collectReactions(xmldoc)

Arguments

xmldoc

Either the name of an XML file meeting the specifications of the KEGG Genomic Markup Language (KGML), or an object of class XMLInternalDocument obtained by running such a file through the xmlParseDoc function of the XML package. (All of the functions described here will call xmlParseDoc if it hasn't already been used.)

anno

The method used to annotate KEGG compounds and glycans: one of "all", "one", or "batch". See Details.

Details

These functions are primarily intended as utility functions that implement processes required by the main function in the package, KGMLtoIgraph. They have been made accessible to the end user for use in debugging problematic KGML files or to reuse the KGML files in contexts other than the one we focus on in this package.

We have implemented three different methods for annotating KEGG compounds and glycans in their reaction entities. These are recorded in the KGML pathway files as "C-numbers" (e.g., C12345) or "G-numbers" (e.g., G12345). These serve as identifiers into their local databases, and we want to convert them (usually) to IUPAC names to display on nodes in the final graph. Method "one" makes a separate call to keggGet from the KEGGREST package for each identifier. Method "batch" makes calls in batches of ten identifiers, using the fact that keggGet enforces that limit. Method "all" makes a single call using keggLink to download the entire database. Note that all three methods cache their results in a package-local environment to avoid repeating the same call. In a profiling test of one moderately sized pathway, a single invocation of collectEntries took 54 seconds for method "one", 53 seconds for method "batch", and 47 seconds for method "all". If you are processing multiple pathways in one session, we expect that the advantage of the "all" method would be even greater since the results are cached.
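The batching and caching behavior described above can be sketched in base R. This is only a minimal illustration under stated assumptions, not the package's implementation: fetchBatch is a hypothetical stand-in for a remote lookup such as one keggGet call, and annoCache, nCalls, and lookupNames are names invented here.

```r
# Hypothetical cache; the package is described as using a package-local environment.
annoCache <- new.env(parent = emptyenv())

# Hypothetical stand-in for a remote lookup (e.g., one keggGet call per batch).
# It counts calls so that the effect of batching and caching is visible.
nCalls <- 0
fetchBatch <- function(ids) {
  stopifnot(length(ids) <= 10)          # mimic the ten-identifier limit
  nCalls <<- nCalls + 1
  stats::setNames(paste("name of", ids), ids)
}

# Look up names for identifiers, fetching only those not yet cached.
lookupNames <- function(ids, batchSize = 10) {
  ids <- unique(ids)
  known <- vapply(ids, exists, logical(1), envir = annoCache, inherits = FALSE)
  todo <- ids[!known]
  # Split the uncached identifiers into groups of at most batchSize.
  batches <- split(todo, ceiling(seq_along(todo) / batchSize))
  for (b in batches) {
    res <- fetchBatch(b)
    for (id in names(res)) assign(id, res[[id]], envir = annoCache)
  }
  unlist(mget(ids, envir = annoCache))
}

ids <- sprintf("C%05d", 1:23)
first  <- lookupNames(ids)   # 23 identifiers are fetched in 3 batches
second <- lookupNames(ids)   # everything is cached, so no further calls
```

In this sketch the second lookup is served entirely from the cache, which is the same effect that makes repeated pathway processing in one session cheaper.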

Value

The collectReactions and collectRelations functions each return a data frame with three columns (Source, Target, and MIM), where each row describes one edge of the pathway/graph. KEGG distinguishes between relations (which usually connect genes) and reactions (which connect chemical compounds). The Source and Target columns are the alphanumeric identifiers of items describing nodes. The MIM column is the edge type in KGML.

The collectEntries function returns a data frame with three columns (GraphId, label, and Type), where each row describes one node or vertex of the pathway/graph. The GraphId column is a unique alphanumeric identifier. The label column is a human-readable name for the node, often the official gene symbol. When creating an igraph object from a pathway, the first column is used as an identifier to define the node. Also, the plot method for igraph objects recognizes a column named label as defining the text that should be displayed in a node.
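To illustrate how the two kinds of return value fit together, the sketch below builds small data frames with the column layouts described above and combines them into a graph. The rows are invented for illustration and do not come from any real pathway. Using igraph::graph_from_data_frame is an assumption about one reasonable way to combine the tables, not a description of what KGMLtoIgraph does internally.

```r
# Invented node table with the columns described for collectEntries.
nodes <- data.frame(
  GraphId = c("n1", "n2", "n3"),
  label   = c("GeneA", "GeneB", "CompoundX"),
  Type    = c("gene", "gene", "compound"),
  stringsAsFactors = FALSE
)

# Invented edge table with the columns described for
# collectRelations and collectReactions.
edges <- data.frame(
  Source = c("n1", "n2"),
  Target = c("n2", "n3"),
  MIM    = c("activation", "inhibition"),
  stringsAsFactors = FALSE
)

# Every edge endpoint should refer to a known node identifier.
endpointsOK <- all(c(edges$Source, edges$Target) %in% nodes$GraphId)

# Build the graph only if the igraph package is installed.
if (endpointsOK && requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(edges, directed = TRUE, vertices = nodes)
  # The first column of 'nodes' becomes the vertex name; 'label' is kept
  # as a vertex attribute that plot() can use for the node text.
  print(igraph::V(g)$label)
}
```

Checking the endpoints before building the graph is a cheap way to spot a problematic KGML file, since graph_from_data_frame stops with an error when an edge refers to a vertex that is missing from the vertex table.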

Author(s)

Kevin R. Coombes krc@silicovore.com, Polina Bombina pbombina@augusta.edu

Examples

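# Locate a sample KGML file shipped with the WayFindR package.
xmlfile <- system.file("pathways/WP3850.kgml", package = "WayFindR")
# Parse it once into an XMLInternalDocument.
xmldoc <- XML::xmlParseDoc(xmlfile)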

WayFindR documentation built on June 30, 2024, 3 a.m.