A class that inherits from the Instance class and implements the functions for extracting the text and the date of a warc-type file.
ExtractorWarc$new(path)
path: (character) Path of the warc-type file.
An object of class R6ClassGenerator
of length 24.
The read_warc function of the jwatr package was overridden because it returned incorrect hours.
Because the jwatr package makes calls to Java, rJava must be installed.
This class inherits from Instance and implements the obtainSource and obtainDate abstract functions.
obtainDate Function that obtains the date of the warc file. It finds the warcinfo-type records in which the date appears and standardizes it to the format "%a %b %d %H:%M:%S %Z %Y" (Example: "Thu May 02 06:52:36 UTC 2013").
Usage
obtainDate()
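The date standardization described above can be illustrated with base R alone. This is a minimal sketch, not the class's actual code: the helper name standardize_warc_date is hypothetical, and it assumes the record date arrives as an ISO 8601 UTC timestamp such as "2013-05-02T06:52:36Z". Because %a and %b depend on the locale, the sketch switches LC_TIME to "C" temporarily so that day and month names come out in English.

```r
# Hypothetical helper, not part of bdpar: converts an ISO 8601 UTC
# timestamp (assumed input shape) into the standardized date format.
standardize_warc_date <- function(warc_date) {
  # Parse the timestamp as UTC; the trailing "Z" is matched literally.
  parsed <- as.POSIXct(warc_date, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

  # %a and %b are locale dependent, so use the C locale while formatting
  # and restore the previous setting when the function exits.
  old_locale <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old_locale), add = TRUE)
  Sys.setlocale("LC_TIME", "C")

  format(parsed, "%a %b %d %H:%M:%S %Z %Y", tz = "UTC")
}

standardize_warc_date("2013-05-02T06:52:36Z")
# "Thu May 02 06:52:36 UTC 2013"
```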
obtainSource Function that obtains the source of the warc file. It first collects the records that contain information, which are those of type resource and response. It then traverses them and obtains the charset of each record. If that charset matches the one returned by guess_encoding, payload_content is used to get the contents of the record. If it does not match, the content is obtained by converting the raw bytes to a string. This is done because encoding problems arise when the detected charset differs from the actual one. In addition, it initializes the data with the initial source.
Usage
obtainSource()
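The charset check described for obtainSource can be sketched in base R as follows. This is only an illustration of the branching logic under stated assumptions, not the class's implementation: decode_record and its parameters are hypothetical, the guessed charset is passed in rather than computed with guess_encoding, and iconv stands in for the payload_content call.

```r
# Hypothetical helper, not part of bdpar or jwatr.
# payload:          raw vector holding the record body
# declared_charset: charset announced by the record
# guessed_charset:  charset detected from the bytes (e.g. by guess_encoding)
decode_record <- function(payload, declared_charset, guessed_charset) {
  if (identical(toupper(declared_charset), toupper(guessed_charset))) {
    # Charsets agree: decode using the declared charset.
    # (iconv is a stand-in for the payload_content call.)
    iconv(rawToChar(payload), from = declared_charset, to = "UTF-8")
  } else {
    # Charsets disagree: fall back to a plain bytes-to-string conversion.
    rawToChar(payload)
  }
}

body <- charToRaw("hello warc")
decode_record(body, "UTF-8", "utf-8")       # matching charsets
decode_record(body, "ISO-8859-1", "UTF-8")  # mismatch, raw conversion
```

The comparison is case-insensitive because charset labels are commonly written in either case.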