phrasemachine: POS tag and extract phrases from a collection of documents


Description

Extracts phrases from a set of documents using the "FilterFSA" method in Handler et al. 2016.

Usage

phrasemachine(documents, regex = "(A|N)*N(PD*(A|N)*N)*",
  maximum_ngram_length = 8, minimum_ngram_length = 2,
  return_phrase_vectors = TRUE, return_tag_sequences = FALSE,
  memory = "-Xmx512M")

Arguments

documents

A vector of strings (one per document).

regex

The regular expression used to find phrases. Defaults to "(A|N)*N(PD*(A|N)*N)*", the "SimpleNP" grammar in Handler et al. 2016. A vector of regular expressions may also be provided if the user wishes to match more than one.
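As a rough illustration of how such a grammar selects phrases, the base R sketch below enumerates every token span within the length limits and keeps the spans whose tag string matches the regular expression in full. It is a simplified stand-in, not the package's implementation: the tags are hand-assigned (reading A, N, P and D as coarse adjective, noun, preposition and determiner tags, with "O" as a made-up placeholder for everything else), and `match_spans` is a hypothetical helper.

```r
# Hand-assigned coarse tags, one character per token. "O" is a placeholder
# for any tag outside the grammar; a real run would get tags from the tagger.
tokens <- c("Hello", "there", "my", "red", "good", "cat")
tags   <- c("O", "O", "O", "A", "A", "N")

simple_np <- "(A|N)*N(PD*(A|N)*N)*"

# Hypothetical helper: keep every span whose tag string matches `regex` in full.
match_spans <- function(tokens, tags, regex, min_len = 2, max_len = 8) {
  anchored <- paste0("^(", regex, ")$")  # require a match over the whole span
  phrases <- character(0)
  n <- length(tokens)
  for (start in seq_len(n)) {
    for (len in min_len:max_len) {
      end <- start + len - 1
      if (end > n) break
      tag_string <- paste(tags[start:end], collapse = "")
      if (grepl(anchored, tag_string)) {
        phrases <- c(phrases, paste(tokens[start:end], collapse = "_"))
      }
    }
  }
  phrases
}

phrases <- match_spans(tokens, tags, simple_np)
print(phrases)
```

With these tags the sketch keeps "red_good_cat" (tags AAN) and "good_cat" (tags AN); the span "red good" (tags AA) is rejected because the grammar requires a phrase to end in a noun.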

maximum_ngram_length

The maximum length (in tokens) of the phrases returned. Defaults to 8. Increasing this number can greatly increase runtime.

minimum_ngram_length

The minimum length (in tokens) of the phrases returned. Defaults to 2. Can be increased to remove shorter phrases, or decreased to 1 to include unigrams.

return_phrase_vectors

Logical indicating whether a list of phrase vectors (with each entry containing a vector of the phrases in one document) should be returned, or whether the phrases should be combined into a single space-separated string. Defaults to TRUE.
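The difference between the two shapes can be sketched in base R without running the tagger. The phrases for the second document are invented for illustration, and keeping the collapsed strings in a list is an assumption rather than something this page states.

```r
# One character vector of phrases per document (second entry is made up).
phrase_vectors <- list(
  c("red_good_cat", "good_cat"),
  c("noun_phrase", "phrase_extraction")
)

# Collapsing each document's phrases into one space-separated string.
collapsed <- lapply(phrase_vectors, paste, collapse = " ")

print(collapsed[[1]])
```

The phrase-vector form is easier to count or filter phrase by phrase, while the collapsed form suits tools that expect one whitespace-tokenized string per document.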

return_tag_sequences

Logical indicating whether tag sequences should be returned along with phrases. Defaults to FALSE.

memory

The amount of memory assigned to the NLP package for POS tagging, given as a Java heap-size option. Defaults to "-Xmx512M" (512MB). This default is often not enough for large documents, which can lead to a "java.lang.OutOfMemoryError"; it can be increased if necessary to accommodate very large documents.
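A call that raises the limit might look like the sketch below. The value "-Xmx1024M" and the second document are arbitrary choices for illustration, and the call is guarded so that it only runs when the phrasemachine package is installed; it also needs a working Java installation.

```r
# Heap-size option passed through the `memory` argument (value is an example).
jvm_memory <- "-Xmx1024M"

docs <- c("Hello there my red good cat.",
          "The quick brown fox jumps over the lazy dog.")

if (requireNamespace("phrasemachine", quietly = TRUE)) {
  result <- phrasemachine::phrasemachine(
    docs,
    maximum_ngram_length = 4,
    minimum_ngram_length = 2,
    return_phrase_vectors = TRUE,
    memory = jvm_memory
  )
  str(result)
}
```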

Value

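A list object with one entry per document. When return_phrase_vectors = TRUE, each entry is a vector of the phrases found in that document.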

Examples

phrasemachine("Hello there my red good cat.")

Example output

phrasemachine: Simple Phrase Extraction
Version 1.1.2 created on 2017-05-29.
copyright (c) 2016, Matthew J. Denny, Abram Handler, Brendan O'Connor.
Type help('phrasemachine') or
vignette('getting_started_with_phrasemachine') to get started.
Development website: https://github.com/slanglab/phrasemachine
Currently tagging document 1 of 1 
OpenJDK 64-Bit Server VM warning: Can't detect initial thread stack location - find_vma failed
Extracting phrases from document 1 of 1 
[[1]]
[1] "red_good_cat" "good_cat"    

Warning message:
system call failed: Cannot allocate memory 

phrasemachine documentation built on May 2, 2019, 8:23 a.m.