iparse: Parse text


View source: R/scripts.R

Description

Parse a vector of texts by making calls to a lookup parsing service

Usage

iparse(text, split = " ",
  URLPrefix = "http://www.perseus.tufts.edu/hopper/xmlmorph?lang=la&lookup=",
  lineIDs = NULL)

Arguments

text

a vector of character strings (“documents”), each of which may contain one or several words. Each element will be split into words. It is expected that the text has already been stripped of punctuation and special characters.

split

the character to be used as the split marker in splitting the elements of text into words.

URLPrefix

the URL of the XML parsing results for each word is constructed by prefixing URLPrefix to the word. The default value queries the Perseus Latin word tool database.

lineIDs

if a vector of the same length as text, it will be used as the names of the output list.
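
As a rough illustration of how these arguments fit together, the sketch below splits each document into words and builds one lookup URL per word using base R only. It does not call iparse and is not taken from the package source. Whether iparse treats split as a literal string or as a regular expression is not stated here, so the literal matching is an assumption, and the line IDs are made-up labels.

```r
# Default prefix copied from the Usage section.
URLPrefix <- "http://www.perseus.tufts.edu/hopper/xmlmorph?lang=la&lookup="

# Two short "documents", already stripped of punctuation.
text <- c("arma virumque cano", "gallia est omnis divisa")

# Split each document into words on the split marker.
# Assumption: the marker is matched literally (fixed = TRUE).
words <- strsplit(text, split = " ", fixed = TRUE)

# Build one lookup URL per word by prefixing URLPrefix to the word.
urls <- lapply(words, function(w) paste0(URLPrefix, w))

# Hypothetical line IDs, one per document, used as names of the list.
lineIDs <- c("line_a", "line_b")
names(urls) <- lineIDs

print(urls[["line_a"]])
```

Each element of urls holds the query URLs for one document, in word order.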

Details

If a word form is not recognized, its lemma is UNKNOWN. If the query fails for some reason, its lemma is ERROR. Numbers, whether written in modern digits or in Roman numerals, are given a lemma that is the number itself converted to character (e.g. "2018"), with pos=numeral_digits. After cycling through all the words in the text, the function retries every case with lemma ERROR by calling iremoveLookupErrors(), and prints how many errors there were and how many were fixed. In the unlikely case that the same errors persist, call this function again.
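
To check how many words ended up with a placeholder lemma, something along the following lines may help. It is a minimal sketch run on a hand-built mock that follows the structure described under Value, not on real iparse output. The helper names has_lemma and count_lemma are hypothetical and are not part of the package.

```r
# Hand-built mock: documents contain words, words contain analyses,
# analyses contain fields. The field values are invented for illustration.
mock <- list(
  doc1 = list(
    arma   = list(list(form = "arma", lemma = "arma", pos = "noun")),
    xyzzy  = list(list(form = "xyzzy", lemma = "UNKNOWN")),
    "2018" = list(list(form = "2018", lemma = "2018", pos = "numeral_digits"))
  ),
  doc2 = list(
    cano = list(list(form = "cano", lemma = "ERROR"))
  )
)

# TRUE if any analysis of the word carries the given lemma.
has_lemma <- function(word, value) {
  any(vapply(word, function(a) identical(a[["lemma"]], value), logical(1)))
}

# Number of words, over all documents, that carry the given lemma.
count_lemma <- function(result, value) {
  sum(vapply(result, function(doc) {
    sum(vapply(doc, has_lemma, logical(1), value = value))
  }, integer(1)))
}

n_error   <- count_lemma(mock, "ERROR")
n_unknown <- count_lemma(mock, "UNKNOWN")
print(c(ERROR = n_error, UNKNOWN = n_unknown))
```

A non-zero ERROR count after a run would suggest that some lookups are worth retrying.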

Value

a list with the structure: documents >> contain >> words >> contain >> analyses >> contain >> fields

Each element of this list corresponds to an element of text, in the same order. We can call these elements documents.

A document is a list, where each element corresponds to a word, in the same order as in the text. We call these elements words. They are named (the names are the words from the text); however, be careful about using these names for lookup, since a document may contain duplicate words. If no words are detected (e.g. the string in text was empty), NULL is returned.

A word is a list of one or several elements called analyses, each of which represents a parsing.

An analysis is a list of elements called fields, such as pos (part of speech), lemma, case, etc. The inventory of fields varies from word to word (e.g. tense will be present for a verb but not for a noun). All analyses, however, contain the fields form (the word form repeated) and lemma.
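
The nesting can be easier to follow with a small worked example. The sketch below builds a mock result by hand (the field values are invented, and only form and lemma are guaranteed by the description above) and shows why positional indexing is safer than name lookup when a document repeats a word. The names result, doc and lemma_table are illustrative only.

```r
# Mock result with one document that contains the word "et" twice.
result <- list(
  line1 = list(
    et   = list(list(form = "et", lemma = "et", pos = "conj")),
    amor = list(
      list(form = "amor", lemma = "amor", pos = "noun", case = "nom"),
      list(form = "amor", lemma = "amo", pos = "verb", tense = "pres")
    ),
    et   = list(list(form = "et", lemma = "et", pos = "conj"))
  )
)

doc <- result[["line1"]]

# Name lookup returns only the first "et"; the second is reachable
# by position, which is why positions are used below.
first_et  <- doc[["et"]]
second_et <- doc[[3]]

# All lemmas proposed for the second word (one per analysis).
lemmas <- vapply(doc[[2]], function(a) a[["lemma"]], character(1))

# Flatten the document into one row per analysis.
rows <- lapply(seq_along(doc), function(i) {
  data.frame(
    position = i,
    form = names(doc)[i],
    lemma = vapply(doc[[i]], function(a) a[["lemma"]], character(1)),
    stringsAsFactors = FALSE
  )
})
lemma_table <- do.call(rbind, rows)
print(lemma_table)
```

Only form and lemma are used in the table because other fields may be absent from some analyses.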


rushkin/parseR documentation built on May 17, 2019, 12:52 p.m.