TextReuseTextDocument: TextReuseTextDocument

View source: R/TextReuseTextDocument.R

Description

This is the constructor function for TextReuseTextDocument objects. This class is used for comparing documents.

Usage

TextReuseTextDocument(
  text,
  file = NULL,
  meta = list(),
  tokenizer = tokenize_ngrams,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE,
  skip_short = TRUE
)

is.TextReuseTextDocument(x)

has_content(x)

has_tokens(x)

has_hashes(x)

has_minhashes(x)

Arguments

text

A character vector containing the text of the document. This argument can be skipped if supplying file.

file

The path to a text file, if text is not provided.

meta

A list with named elements for the metadata associated with this document. If a document is created using the text parameter, then you must provide an id field, e.g., meta = list(id = "my_id"). If the document is created using file, then the ID will be created from the file name.

tokenizer

A function to split the text into tokens. See tokenizers. If the value is NULL, then tokenizing and hashing will be skipped.

...

Arguments passed on to the tokenizer.

hash_func

A function to hash the tokens. See hash_string.

minhash_func

A function to create minhash signatures of the document. See minhash_generator.

keep_tokens

Should the tokens be saved in the document that is returned or discarded?

keep_text

Should the text be saved in the document that is returned or discarded?

skip_short

Should short documents be skipped? (See details.)

x

An R object to check.
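
As a minimal sketch of how these arguments fit together, the snippet below builds a document from an in-memory string, supplies the required id through meta, and passes a minhash function. The sample sentence, the id "roads", and the values n = 50 and seed = 3552 are arbitrary choices made for illustration and do not come from the package documentation.

```r
library(textreuse)

# Illustrative text and id (assumptions, not from the package docs).
txt <- "How many roads must a man walk down before you call him a man?"

# A minhash function with 50 hashes; n and seed are arbitrary here.
minhash <- minhash_generator(n = 50, seed = 3552)

# When `text` is used instead of `file`, `meta` must carry an `id`.
doc <- TextReuseTextDocument(
  text = txt,
  meta = list(id = "roads"),
  tokenizer = tokenize_ngrams,
  minhash_func = minhash
)

is.TextReuseTextDocument(doc)
has_minhashes(doc)
meta(doc)$id
```

The predicate functions listed under Usage are a convenient way to check which parts of the document were actually stored.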

Details

This constructor function follows a three-step process. It reads in the text, either from a file or from memory. It then tokenizes that text. Then it hashes the tokens. Most of the comparison functions in this package rely only on the hashes to make the comparison. By passing FALSE to keep_tokens and keep_text, you can avoid saving those objects, which can result in significant memory savings for large corpora.
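
The memory trade-off described above can be sketched by building the same document twice, once keeping everything and once keeping only the hashes. The variable names full and lean are illustrative, and the actual size difference depends on the document.

```r
library(textreuse)

file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")

# Keep the text and the tokens alongside the hashes.
full <- TextReuseTextDocument(file = file, keep_tokens = TRUE, keep_text = TRUE)

# Keep only the hashes (and metadata); text and tokens are discarded.
lean <- TextReuseTextDocument(file = file, keep_tokens = FALSE, keep_text = FALSE)

# The hashes are still available for comparison functions.
has_hashes(lean)

# Compare the in-memory footprint of the two objects.
object.size(full)
object.size(lean)
```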

If skip_short = TRUE, this function will return NULL for very short or empty documents. A very short document is one where there are too few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, then the function assumes a value of n = 3. A warning will be printed with the document ID of a skipped document.
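
The arithmetic behind the rule as stated above can be sketched in base R. The helper ngram_count is hypothetical and not part of textreuse; it only illustrates how many word n-grams a document of a given length can yield, and the package's internal check may differ in detail.

```r
# Hypothetical helper (not part of textreuse): the number of word n-grams
# that can be formed from a document containing `words` words.
ngram_count <- function(words, n = 3) {
  max(0, words - n + 1)
}

ngram_count(6, n = 5)  # 2 five-grams: meets the stated minimum
ngram_count(5, n = 5)  # 1 five-gram: too short under the stated rule
ngram_count(4)         # 2 trigrams with the default n = 3
```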

Value

An object of class TextReuseTextDocument. This object inherits from the virtual S3 class TextDocument in the NLP package. It contains the following elements:

content

The text of the document.

tokens

The tokens created from the text.

hashes

Hashes created from the tokens.

minhashes

The minhash signature of the document.

metadata

The document metadata, including the filename (if any) in file.

See Also

Accessors for TextReuse objects.

Examples

file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc  <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"))
print(doc)
meta(doc)
head(tokens(doc))
head(hashes(doc))
## Not run: 
content(doc)

## End(Not run)

Example output

TextReuseTextDocument
file : /usr/lib/R/site-library/textreuse/extdata/legal/ny1850-match.txt 
hash_func : hash_string 
id : ny1850 
tokenizer : tokenize_ngrams 
content : § 597. Every action must be prosecuted in the name
of the real party in interest, except as otherwise provided in section 599.

..a

5./imended Code, § 111.

§598. In the case of an assignmen
$file
[1] "/usr/lib/R/site-library/textreuse/extdata/legal/ny1850-match.txt"

$hash_func
[1] "hash_string"

$id
[1] "ny1850"

$tokenizer
[1] "tokenize_ngrams"

NULL
[1]   864021270 -1576256129  -659164900 -1012025761    45449552 -1175961819
§ 597. Every action must be prosecuted in the name
of the real party in interest, except as otherwise provided in section 599.

..a

5./imended Code, § 111.

§598. In the case of an assignment of a thing in

action, the action by the assignee is without prejudice
to any set-off or other defence existing at the time of, or

before notice of, the sssignment ; but this section does
not apply to a negotiable promissory note or bill of exchange transferred in good faith and upon good considerations, before due.

yfmended Code, § 112.

§ 599. An executor or administrator, a trustee of an
express trust, or a person expressly authorised by statute,

may sue without joining with him the persons for

whose benefit the action is prosecuted. A person with
whom, or in whose name, a contract is made, for the

benefit of another, is a trustee of an express trust, within the meaning of this section.

§ 602. When an infant is a party, he must appear by
guardian, who may be appointed by the court in which
the action is prosecuted, or by a judge thereof

Jlmended Code, § 115.

§ 603. The guardian must be appointed as -follows:

1. When the infant is plaintiff, upon the application

of the infant, if he be of the age of fourteen years, or if
under that age, upon the application of some other party
to the action, or of a relative or friend of the infant:

2. When the infant is defendant, upon the application
of the infant, if he be of the age of fourteen years, and
apply within twenty days after the service of the summons. If he be under the age of fourteen, or neglect

so to apply, then upon the application of any other

party to the action, or of a relative or friend of the infant.

§ 607. When a husband and father has deserted his
family, the wife and mother may prosecute or defend,
in his name, any action which he might have prosecuted or defended, and shall have the same powers and

rights therein as he might have had.

To provide for cases of great hardship, that sometimes
happen.

§ 608. All persons having an interest in the subject
of the action, and in obtaining the relief demanded,
may be joined as plaintiffs, except when otherwise provided in this title.
Jmended Code, § 117.

§609. Any person may be made a defendant, who

has or claims an interest in the controversy, adverse to

the plainti&', or who is a necessary party to a complete
determination or settlement of the question involved

therein.

§610. Of the parties to the action, those who are
united in interest must be joined as plaintiffs or defendants; but if the consent of any one, who should have
been joined as plaintiff, cannot be obtained, he may be
made a defendant, the reason thereof being stated in
the complaint: and when the question is one of a
common or general interest of many persons, or when
the parties are numerous and it is impracticable to

bring them all before the court, one or more may sue
or defend for the benefit of all.

Jimended Code, §119.

§ 611. Persons severally liable upon the same obligation or instrument, including the parties to bills of exchange and promissory notes, and sureties on the same
or separate instruments, may, all or any of them, be
included in the same action, at the option of the plaintiff.

Amended Code, §l20, amended.

§612. An action does not abate by the death, marriage or other disability of a party, or by the transfer of
any interest therein, if the cause of action survi've' or
continue. In case of the death, marriage, or other disabilityof a party, the court on motion, may allow the
action to be continued by or against his representative
or successor in interest. In case of any other transfer
of interest, the action may be continued in the name

of the original party; or the court may allow the person to whom the transfer is made to be substituted in
the action.

dmended Code, § 121.

§ 613. The court may determine any controversy between pilrties before it, when it can be done without
prejudice to the rights of others, or by saving their
rights; but when a complete determination of the controversy cannot be had without the presence of other
parties, the court must order them to be brought in.
And when, in an action for the recovery of real or personal property. a person, not a party to the action, but
having an interest in the subject thereof, makes application to the court to be made a party, it may order
him to be brought in by the proper amendment.

textreuse documentation built on July 8, 2020, 6:40 p.m.