Generic extraction function which calls boilerpipe extractors

Description

This function is the actual workhorse: it calls the boilerpipe Java library directly. It is typically called through the wrapper functions named after the values listed for the parameter exname.

Usage

Extractor(exname, content, asText = TRUE, ...)

Arguments

exname

character specifying the extractor to be used. It can take one of the following values:

  • ArticleExtractor: A full-text extractor which is tuned towards news articles.

  • ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting sentences from news articles.

  • CanolaExtractor: A full-text extractor trained on 'krdwrd'.

  • DefaultExtractor: A quite generic full-text extractor.

  • KeepEverythingExtractor: Marks everything as content.

  • LargestContentExtractor: A full-text extractor which extracts the largest text component of a page.

  • NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the number of words per block.
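
Because exname is a plain character string, a misspelled extractor name is easy to pass by accident. The following is a minimal sketch of checking a name against the values listed above before calling Extractor. The helper check_exname() is hypothetical and not part of boilerpipeR; whether Extractor performs a similar check itself is not stated here.

```r
# The extractor names documented above.
extractor_names <- c(
  "ArticleExtractor", "ArticleSentencesExtractor", "CanolaExtractor",
  "DefaultExtractor", "KeepEverythingExtractor",
  "LargestContentExtractor", "NumWordsRulesExtractor"
)

# check_exname() is a hypothetical helper, not part of boilerpipeR.
# It returns the name unchanged if it is a single known extractor name
# and stops with an error otherwise.
check_exname <- function(exname) {
  if (!is.character(exname) || length(exname) != 1L ||
      !(exname %in% extractor_names)) {
    stop("exname must be one of: ", paste(extractor_names, collapse = ", "))
  }
  exname
}

checked <- check_exname("LargestContentExtractor")
```

The checked name could then be passed on as the exname argument.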

content

Text content to extract from, or a URL, as character

asText

logical; should content be treated as the actual text to extract from (TRUE), or as a URL from which the HTML document is first downloaded and then extracted (FALSE)? Defaults to TRUE.

...

additional parameters

Value

extracted text as character
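
Examples

The following is a minimal sketch of a typical call, assuming the boilerpipeR package and a working Java installation are available. The HTML string is made-up sample content, the URL is a placeholder, and the wrapper call assumes that DefaultExtractor accepts content as its first argument.

```r
library(boilerpipeR)

# A small HTML document held in memory (made-up sample content).
html <- paste0(
  "<html><head><title>Sample</title></head><body>",
  "<div id='nav'>Home | About | Contact</div>",
  "<p>This paragraph stands in for the main article text. ",
  "It is deliberately long enough to look like real content.</p>",
  "</body></html>"
)

# asText = TRUE (the default): content is the text to extract from.
txt <- Extractor("DefaultExtractor", html)

# Assumed equivalent call through the wrapper named after the extractor.
txt2 <- DefaultExtractor(html)

# asText = FALSE: content is a URL whose document is downloaded first.
# Placeholder URL, left commented out because it needs network access.
# txt3 <- Extractor("ArticleExtractor", "http://www.example.com/", asText = FALSE)
```

In both calls the result is the extracted text as character; which parts of the page are kept depends on the extractor chosen.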

Author(s)

Mario Annau

References

http://code.google.com/p/boilerpipe/