sanmai-NL/feat: Extract (Linguistic) Structural and Surface Features from XML Documents

Extracts a (sparse) design matrix from a collection of XML trees, with variables for linguistic surface features (word n-grams) and/or structural features (syntactic trails) of textual objects. Objects are specified by one or more observations (rows) in an annotations table. Such objects can, for instance, be text segments or single sentences, provided that they can be extracted from an XML document using an XPath expression. The structural feature representation developed in this package, the linguistic network, is a graph union of those XML trees. These unions are multidigraphs whose arcs are labeled with attribute (name, value) pairs of elements. For example, given a dependency tree encoded as XML, arcs can be considered labeled with the linguistic dependency relations between words in the syntax tree. Trails in these linguistic networks are randomly sampled and then treated as sequences of arcs from parent to child elements, from which n-grams can be extracted as well.
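
To make the idea of a linguistic network more concrete, here is a minimal Python sketch of the steps described above: uniting XML trees into a multidigraph, sampling a trail at random, and extracting n-grams from its arc labels. It is an illustration only and does not use or reproduce this package's R API. The function names (`add_tree`, `sample_trail`, `ngrams`), the `word` attribute used to identify nodes, and the `rel` attribute used as the arc label are all assumptions made for the example.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict


def add_tree(network, root, key="word", label="rel"):
    """Add every parent -> child arc of one XML tree to the network.

    Assumption: nodes are identified by the hypothetical `key` attribute,
    and each arc is labeled with the child's (name, value) pair for `label`.
    Calling this for several trees yields their graph union.
    """
    for parent in root.iter():
        for child in parent:
            arc = (label, child.get(label))
            network[parent.get(key)].append((arc, child.get(key)))


def sample_trail(network, start, rng, max_len=10):
    """Random walk from `start` that never reuses an arc (a trail).

    Returns the sequence of arc labels along the walk.
    """
    used, labels, node = set(), [], start
    while len(labels) < max_len:
        choices = [
            (i, a) for i, a in enumerate(network.get(node, []))
            if (node, i) not in used
        ]
        if not choices:
            break
        i, (arc, target) = rng.choice(choices)
        used.add((node, i))
        labels.append(arc)
        node = target
    return labels


def ngrams(seq, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Two toy dependency trees (hypothetical markup, not the package's format).
docs = [
    '<node word="saw" rel="root"><node word="she" rel="subj"/>'
    '<node word="dog" rel="obj"><node word="the" rel="det"/></node></node>',
    '<node word="saw" rel="root"><node word="they" rel="subj"/></node>',
]
network = defaultdict(list)
for doc in docs:
    add_tree(network, ET.fromstring(doc))

trail = sample_trail(network, "saw", random.Random(0))
bigrams = ngrams(trail, 2)
```

In this sketch the arc n-grams play the same role for structure that word n-grams play for surface text: each distinct n-gram could become one column of the design matrix, with counts per object as the cell values.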

Package details

Package repository: sanmai-NL/feat on GitHub
Installation

Install the latest version of this package by entering the following in R:
sanmai-NL/feat documentation built on May 26, 2017, 12:29 a.m.