findPairs: A convenience wrapper around 'findExamples'.

Description

Sift the dataset for word pairs such that the first word contains x and the second word contains y in the corresponding segment or segments.

Usage

findPairs(data, x, y, exact, cols)

Arguments

data

[soundcorrs] The dataset in which to look. Only datasets with two languages are supported.

x

[character] The sequence to find in language1. May be a regular expression. If an empty string, anything will be considered a match.

y

[character] The sequence to find in language2. May be a regular expression. If an empty string, anything will be considered a match.

exact

[numeric] If 0 or FALSE, distance.start=distance.end=-1, na.value=0, and zeros=FALSE. If 0.5, distance.start=distance.end=1, na.value=0, and zeros=FALSE. If 1 or TRUE, distance.start=distance.end=0, na.value=-1, and zeros=TRUE. Defaults to 0.

cols

[character vector] Which columns of the dataset to return as the result. Can be a vector of names, "aligned" (the two columns with segmented, aligned words), or "all" (all columns). Defaults to "aligned".

Details

Probably the most common use of findExamples is with datasets containing pairs of words. This function is a simple wrapper around findExamples that aims to facilitate its use in this most common case. Instead of the five arguments that findExamples requires, this function only takes two. This comes at the cost of control, but should a more fine-tuned search be required, findExamples can always be used instead of findPairs.

The default is the inexact mode (exact set to 0 or FALSE). It corresponds to distance.start and distance.end both being set to -1, na.value being set to 0, and zeros being set to FALSE, which are also the default settings in findExamples(). The risk here is false positives. In my experience, however, those are rare, and because they are displayed, the user has a chance to spot them.

The opposite is the exact mode (exact set to 1 or TRUE), which corresponds to distance.start and distance.end both being set to 0, na.value being set to -1, and zeros to TRUE. The risk here is false negatives, which in my experience are both much more common than false positives in the inexact mode and effectively impossible to spot, as they are simply not displayed.

A middle ground is the semi-exact mode (exact set to 0.5), where distance.start and distance.end are both set to 1, na.value is set to 0, and zeros to FALSE. It decreases the risk of false positives while only slightly increasing the risk of false negatives.
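
The three modes amount to a small lookup from exact to the four fine-tuning settings named above. The sketch below only restates that mapping in base R; exactToArgs is a hypothetical helper written for illustration and is not part of soundcorrs, and the way findPairs performs this translation internally may well differ.

```r
# Hypothetical helper (not part of soundcorrs): translate a value of
# 'exact' into the settings listed in the Details section.
exactToArgs <- function (exact) {
    # FALSE/TRUE are coerced to 0/1, so both spellings select the same mode
    key <- as.character (as.numeric (exact))
    switch (key,
        # inexact mode, the default
        "0"   = list (distance.start=-1, distance.end=-1, na.value=0,  zeros=FALSE),
        # semi-exact mode
        "0.5" = list (distance.start=1,  distance.end=1,  na.value=0,  zeros=FALSE),
        # exact mode
        "1"   = list (distance.start=0,  distance.end=0,  na.value=-1, zeros=TRUE),
        # anything else is not one of the three documented modes
        stop ("'exact' must be 0 (FALSE), 0.5, or 1 (TRUE)"))
}

# inspect the settings behind the semi-exact mode
str (exactToArgs (0.5))
```

Seeing the settings side by side may help in deciding when to drop down to findExamples: if none of the three combinations fits, the individual settings can be chosen there directly.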

Value

[df.findExamples] A subset of the dataset, containing only the pairs with corresponding sequences. Warning: pairs with multiple occurrences of such sequences are only included once.

See Also

findExamples, allPairs

Examples

# In the examples below, non-ASCII characters had to be escaped for technical reasons.
# In the actual usage, Unicode is supported under BSD, Linux, and macOS.

# prepare sample dataset
dataset <- loadSampleDataset ("data-ie")
# run findPairs
findPairs (dataset, "a", "a")
findPairs (dataset, "e", "f", exact=0)
findPairs (dataset, "e", "f", exact=0.5)
findPairs (dataset, "e", "f", exact=1)

soundcorrs documentation built on Nov. 16, 2020, 5:09 p.m.