tCorpus_modify_by_reference: Modify tCorpus by reference


If any tCorpus method is used that changes the corpus (e.g., set, subset), the change is made by reference. This is very convenient when working with a large corpus, because the corpus does not have to be copied every time a change is made. Copying is slow and memory-inefficient.

To illustrate, for a tCorpus object named 'tc', the subset method can be called like this:

tc$subset(doc_id %in% selection)

The 'tc' object itself is now modified, so the result does not have to be assigned to a name. The more common R idiom would be to assign the result, like this:

tc = tc$subset(doc_id %in% selection)

The results of both lines of code are the same. The assignment in the second approach is not necessary, but it does no harm either, because tc$subset returns the modified corpus invisibly (see ?invisible if that sounds spooky).

Be aware, however, that the following does not do what you might expect!

tc2 = tc$subset(doc_id %in% selection)

In this case, tc2 does contain the subsetted corpus, but tc itself is subsetted as well, because tc and tc2 refer to the same object!
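
The behaviour described above can be reproduced in base R without corpustools. The following is a minimal sketch that uses an environment as a toy stand-in for a reference object. The names new_toy_corpus and doc_ids are made up for this illustration, and this is not how tCorpus is implemented internally. It only shows why assigning the return value to a new name does not create a second corpus.

```r
# Toy stand-in for a reference object (hypothetical, NOT the tCorpus class):
# an environment that holds document ids and a subset method.
new_toy_corpus = function(doc_ids) {
  self = new.env()
  self$doc_ids = doc_ids
  self$subset = function(keep) {
    # environments are modified in place, so this changes the object itself
    self$doc_ids = self$doc_ids[self$doc_ids %in% keep]
    invisible(self)  # return the modified object itself, invisibly
  }
  self
}

tc = new_toy_corpus(c("d1", "d2", "d3"))
tc2 = tc$subset(c("d1", "d2"))

identical(tc, tc2)  # TRUE: both names refer to the same object
tc$doc_ids          # "d1" "d2": tc was subsetted as well
```

Because the method returns the object itself rather than a new one, tc2 is just a second name for tc.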

We force this approach on you, because it is faster and more memory efficient, which becomes crucial for large corpora. If you do want to make a copy, it has to be done explicitly with the copy() method.

tc2 = tc$copy()

For methods where copying is often useful, such as subset, there is also a copy parameter.

tc2 = tc$subset(doc_id %in% selection, copy=TRUE)

Now tc itself is not subsetted. Instead, the method subsets a copy of tc and returns it, so that it can be assigned to tc2.

Note that tc is also modified by reference if the subset method (or any other method that modifies the corpus) is called within a function. No matter where and how you call the method, tc itself will be subsetted unless you explicitly copy it first or set copy to TRUE.
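
To make the point about functions concrete, here is a small base R sketch under the same kind of assumption: an environment serves as a toy stand-in for a reference object, and new_toy_corpus, count_selected and doc_ids are hypothetical names invented for this illustration, not part of corpustools. The toy copy method only mimics the idea of tc$copy().

```r
# Toy stand-in for a reference object (hypothetical, NOT the tCorpus class).
new_toy_corpus = function(doc_ids) {
  self = new.env()
  self$doc_ids = doc_ids
  self$subset = function(keep) {
    self$doc_ids = self$doc_ids[self$doc_ids %in% keep]
    invisible(self)
  }
  # explicit copy: build a new, independent object with the same contents
  self$copy = function() new_toy_corpus(self$doc_ids)
  self
}

# Hypothetical helper that subsets whatever object it is given.
count_selected = function(corpus, keep) {
  corpus$subset(keep)
  length(corpus$doc_ids)
}

tc = new_toy_corpus(c("d1", "d2", "d3"))

n_on_copy = count_selected(tc$copy(), "d1")  # the function only touches a copy
size_after_copy_call = length(tc$doc_ids)    # 3: tc is unchanged

n_direct = count_selected(tc, "d1")          # the function subsets tc itself
size_after_direct_call = length(tc$doc_ids)  # 1: tc was changed inside the function
```

Passing a copy into the function is the only thing that protects the original here. The function body is identical in both calls.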

kasperwelbers/corpustools documentation built on Sept. 1, 2018, 1:03 p.m.