Extractor: Generic extraction function which calls boilerpipe extractors


View source: R/Extractor.R


This function is the workhorse that directly calls the boilerpipe Java library. It is typically called through the wrapper functions named after the values listed for the parameter exname.


Extractor(exname, content, asText = TRUE, ...)



exname: character specifying the extractor to be used. It can take one of the following values:

  • ArticleExtractor: A full-text extractor which is tuned towards news articles.

  • ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting sentences from news articles.

  • CanolaExtractor: A full-text extractor trained on a 'krdwrd'.

  • DefaultExtractor: A quite generic full-text extractor.

  • KeepEverythingExtractor: Marks everything as content.

  • LargestContentExtractor: A full-text extractor which extracts the largest text component of a page.

  • NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the number of words per block.
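
As a minimal sketch of how exname selects the extractor, the following runs each listed extractor on the same input. It assumes the boilerpipeR package is installed, that Extractor is exported from it, and that a working Java setup is available; the HTML snippet and the variable names are made up for illustration. The call is guarded, so nothing is extracted when the package cannot be loaded.

```r
# Hypothetical HTML input, used only for illustration.
html <- paste0(
  "<html><body>",
  "<div>Home | About | Contact</div>",
  "<p>This is the main article text. It is long enough to look like ",
  "real content rather than navigation or other boilerplate.</p>",
  "</body></html>"
)

# Extractor names as listed for the exname parameter.
extractors <- c("ArticleExtractor", "ArticleSentencesExtractor",
                "CanolaExtractor", "DefaultExtractor",
                "KeepEverythingExtractor", "LargestContentExtractor",
                "NumWordsRulesExtractor")

# Assumption: boilerpipeR is installed and exports Extractor.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  # content is passed as text, so asText keeps its default of TRUE.
  results <- lapply(extractors, function(exname) {
    boilerpipeR::Extractor(exname, html, asText = TRUE)
  })
  names(results) <- extractors
  str(results)
}
```

Comparing the results side by side is one way to judge which extractor suits a given kind of page.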


content: Text content or URL as character


asText: should the specified content be treated as the actual text to be extracted (TRUE), or as a URL from which the HTML document is first downloaded and then extracted (FALSE)? Defaults to TRUE.
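
The sketch below illustrates the asText = FALSE case, where content is a URL rather than the text itself. It assumes the boilerpipeR package is installed with Extractor exported, and that network access is available; the URL is a placeholder, not a real article. Errors such as a failed download are caught so the sketch degrades to NA instead of stopping.

```r
# Hypothetical URL, used only for illustration.
url <- "https://www.example.com/news/article.html"

# Assumption: boilerpipeR is installed and exports Extractor.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  # asText = FALSE: content is treated as a URL, downloaded, then extracted.
  text_from_url <- tryCatch(
    boilerpipeR::Extractor("ArticleExtractor", url, asText = FALSE),
    error = function(e) NA_character_  # e.g. no network access
  )
  print(text_from_url)
}
```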


...: additional parameters


extracted text as character


Mario Annau



boilerpipeR documentation built on May 19, 2017, 8:27 a.m.
