extract_phrases: Extract Phrases
In phrasemachine: Simple Phrase Extraction


Extracts phrases from a list of POS-tagged documents using the "FilterFSA" method described in Handler et al. 2016.

extract_phrases(POS_tagged_documents, regex = "(A|N)*N(PD*(A|N)*N)*",
  maximum_ngram_length = 8, minimum_ngram_length = 2,
  return_phrase_vectors = TRUE, return_tag_sequences = FALSE)

`POS_tagged_documents`	A list object of the form produced by the `POS_tag_documents()` function, with either Penn Treebank or Petrov/Gimpel style tags.
`regex`	The regular expression used to find phrases. Defaults to "(A|N)*N(PD*(A|N)*N)*", the "SimpleNP" grammar in Handler et al. 2016. A vector of regular expressions may also be provided to match more than one pattern.
`maximum_ngram_length`	The maximum length (in tokens) of the phrases returned. Defaults to 8. Increasing this number can greatly increase runtime.
`minimum_ngram_length`	The minimum length (in tokens) of the phrases returned. Defaults to 2. Can be increased to remove shorter phrases, or decreased to 1 to include unigrams.
`return_phrase_vectors`	Logical indicating whether a list of phrase vectors (with each entry containing a vector of the phrases in one document) should be returned, or whether the phrases should be combined into a single space-separated string. Defaults to TRUE.
`return_tag_sequences`	Logical indicating whether tag sequences should be returned along with phrases. Defaults to FALSE.
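
To make the roles of `regex`, `minimum_ngram_length` and `maximum_ngram_length` concrete, the following base R sketch applies the default pattern to a toy tag sequence. It is a brute-force enumeration of token spans, not the package's FilterFSA implementation, and it assumes the tags have already been reduced to the single letters used in the pattern (A = adjective, N = noun, P = preposition, D = determiner; O is used here for anything else). The names `tokens`, `tags`, `simple_np` and `find_spans` are hypothetical, as is the choice of joining the tokens of a phrase with underscores.

```r
# Toy input (hypothetical): one document, already tokenized and tagged
# with coarse one-letter tags.
tokens <- c("the", "red", "car", "of", "the", "president", "stopped")
tags   <- c("D",   "A",   "N",   "P",  "D",   "N",         "O")

# Default pattern from the usage above.
simple_np <- "(A|N)*N(PD*(A|N)*N)*"

# Enumerate every span of min_len..max_len tokens and keep those whose
# concatenated tag string matches the whole pattern.
find_spans <- function(tags, pattern, min_len = 2, max_len = 8) {
  stopifnot(min_len >= 1, min_len <= max_len)
  anchored <- paste0("^(", pattern, ")$")  # require a full-span match
  n <- length(tags)
  spans <- list()
  for (start in seq_len(n)) {
    for (len in min_len:max_len) {
      end <- start + len - 1
      if (end > n) break
      tag_string <- paste(tags[start:end], collapse = "")
      if (grepl(anchored, tag_string)) {
        spans[[length(spans) + 1]] <- c(start = start, end = end)
      }
    }
  }
  spans
}

spans <- find_spans(tags, simple_np)

# Join the tokens of each span with underscores so that each phrase stays
# a single unit if the phrases are later pasted into one string.
phrases <- vapply(
  spans,
  function(s) paste(tokens[s["start"]:s["end"]], collapse = "_"),
  character(1)
)
print(phrases)
# "red_car" "red_car_of_the_president" "car_of_the_president"

# Roughly analogous to return_phrase_vectors = FALSE: a single string.
print(paste(phrases, collapse = " "))
```

Overlapping spans are all kept, so a long noun phrase also contributes its matching sub-phrases. Lowering `min_len` to 1 in this sketch would additionally return the single nouns "car" and "president".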

A list object.

## Not run: 
# make sure quanteda is installed
requireNamespace("quanteda", quietly = TRUE)
# load U.S. presidential inaugural speeches from the quanteda example data.
documents <- quanteda::data_corpus_inaugural
# use first 10 documents for example
documents <- documents[1:10,]

# run tagger
tagged_documents <- POS_tag_documents(documents)

phrases <- extract_phrases(tagged_documents,
                           regex = "(A|N)*N(PD*(A|N)*N)*",
                           maximum_ngram_length = 8,
                           minimum_ngram_length = 1)

## End(Not run)