Extracts a (sparse) design matrix from a collection of XML trees, with variables for linguistic surface features (word n-grams) and/or structural features (syntactic trails) of textual objects. Objects are specified with one or more observations (rows) in an annotations table. Such objects can, for instance, be text segments or single sentences, provided that they can be extracted from some XML document using an XPath expression. The structural feature representation developed in this package, the linguistic network, is a graph union of those XML trees. The unions are multidigraphs whose arcs are labeled with attribute (name, value) pairs of elements. For example, given a dependency tree encoded as XML, arcs can be considered as labeled with the linguistic dependency relations between words in the syntax tree. Trails in these linguistic networks are randomly sampled and then treated as sequences of arcs from parent to child elements, from which n-grams can be extracted as well.
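The idea of a linguistic network can be illustrated with a minimal sketch. It is written in Python with the standard library only and does not use or reproduce this package's API. The toy documents, the element name `w`, the attributes `lemma` and `rel`, and all function names are assumptions made for illustration. Nodes are merged across trees by their `lemma` value, parallel arcs are kept (hence a multidigraph), and a trail is sampled as a random walk that never reuses an arc.

```python
import random
import xml.etree.ElementTree as ET
from collections import Counter

# Two toy dependency trees. Element and attribute names are hypothetical.
DOCS = [
    '<s><w lemma="eat" rel="root"><w lemma="cat" rel="nsubj"/>'
    '<w lemma="fish" rel="obj"/></w></s>',
    '<s><w lemma="eat" rel="root"><w lemma="dog" rel="nsubj"/></w></s>',
]

def arcs(root, node_key="lemma", label_key="rel"):
    """Yield (parent, child, label) for every parent-to-child element pair.

    The label is an attribute (name, value) pair taken from the child.
    Elements without the node_key attribute are identified by their tag.
    """
    for parent in root.iter():
        for child in parent:
            yield (
                parent.get(node_key, parent.tag),
                child.get(node_key, child.tag),
                (label_key, child.get(label_key)),
            )

def graph_union(xml_strings):
    """Merge all trees into one multidigraph stored as adjacency lists.

    Parallel arcs between the same two nodes are kept, not deduplicated.
    """
    adj = {}
    for text in xml_strings:
        for parent, child, label in arcs(ET.fromstring(text)):
            adj.setdefault(parent, []).append((child, label))
    return adj

def sample_trail(adj, start, max_len, rng):
    """Randomly walk from start, never reusing an arc; return the arc labels."""
    used, labels, node = set(), [], start
    while len(labels) < max_len:
        free = [(i, arc) for i, arc in enumerate(adj.get(node, []))
                if (node, i) not in used]
        if not free:
            break
        i, (child, label) = rng.choice(free)
        used.add((node, i))
        labels.append(label)
        node = child
    return labels

def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

if __name__ == "__main__":
    network = graph_union(DOCS)
    rng = random.Random(0)
    trails = [sample_trail(network, "s", 2, rng) for _ in range(5)]
    # Counting arc-label bigrams gives one row of a design matrix.
    print(Counter(g for t in trails for g in ngrams(t, 2)))
```

Counting the sampled n-grams per textual object, as in the last lines, would yield one sparse row per object. How the package itself samples trails and merges nodes may differ from this sketch.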