mpick: Subsetting by picking random levels from multiple factors


View source: R/mpick.R

Description

Like pick, but allows specifying multiple factors (columns) at the same time, trying hard to return the desired result. You want 2 species from the same 3 strata during the same 4 years? Use mpick. Want just one of those? Use pick.

Usage

mpick(X, p, weight = FALSE, limit = 10, screen = TRUE, dt = FALSE)

Arguments

X

A data.table

p

A named vector of integers. Names are columns in X, the integers are the number of levels to select

weight

Logical, default FALSE. Same as w in pick; weight selection of factor level by its relative frequency of occurrence in X. Could have performance implications, see 'Details'.

limit

Time limit for searching, in seconds

screen

Logical. If TRUE (default), then before random searching, screens out factor levels that definitely cannot satisfy the full suite of conditions in p. Can be a little slow, but is extremely effective when most combinations of factors in p do not exist. See 'Details'.

dt

Logical. If TRUE, returns a data.table; if FALSE (default), returns an index into that data.table.

Details

This problem may ultimately be better suited to a real optimization algorithm. Right now, mpick relies on arbitrary guess-and-check. It does not "forget" failed guesses (only specific combinations are worth forgetting, and for large data sets there is a very low probability of happening upon the same combination). Thus, this is a very brute-force approach, with the exception of the checking done when screen=TRUE.
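The guess-and-check idea described above can be sketched roughly as follows. This is not the mpick source: the helper guess_and_check and the toy table toy are hypothetical, base R is used to keep the sketch dependency-free, and it assumes X has one row per unique combination of the columns named in p.

```r
# Toy data: "common" occurs in every stratum-year, "rare" only twice.
toy <- rbind(
	expand.grid(spp = "common", stratum = 1:3, year = 2001:2004,
		stringsAsFactors = FALSE),
	data.frame(spp = "rare", stratum = 1:2, year = 2001L,
		stringsAsFactors = FALSE)
)

# Hypothetical helper: draw random levels for each column named in p and
# accept the first draw whose full cross of levels is present in X.
guess_and_check <- function(X, p, limit = 10) {
	X <- as.data.frame(X) # so data.table input is indexed like a data.frame
	lvls <- lapply(X[names(p)], unique)
	# One string key per observed combination
	have <- do.call(paste, c(unname(as.list(X[names(p)])), sep = "\r"))
	start <- Sys.time()
	while (as.numeric(difftime(Sys.time(), start, units = "secs")) < limit) {
		# Failed guesses are not remembered; just draw again
		draw <- Map(function(l, n) l[sample.int(length(l), n)], lvls, p)
		grid <- expand.grid(draw, stringsAsFactors = FALSE)
		want <- do.call(paste, c(unname(as.list(grid)), sep = "\r"))
		if (all(want %in% have)) return(draw)
	}
	NULL # time limit reached without a successful draw
}

set.seed(1)
res <- guess_and_check(toy, c(spp = 1, stratum = 2, year = 2), limit = 5)
```

In this toy case only "common" can satisfy the request, because "rare" never occurs in 2 strata across 2 years. The loop stops on the first successful draw or when limit seconds have passed.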

It is highly recommended that limit be set to allow for a couple minutes of searching. Of course, this depends on the size of X and the details of p.

screen is very effective when many possible factor levels in p can be ruled out based on their overall scarcity. Consider the example of 2 spp, 3 stratum, 4 year. If a given level of spp does not occur at least 3*4=12 times in the data set, it can be ruled out. Because very rare species comprise the majority of unique spp in trawl data, this screening can be outstandingly effective.
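The counting argument above can be illustrated with a short sketch. It shows only the idea, not the screening code mpick actually uses: the helper screen_levels and the toy table toy are hypothetical, and the sketch assumes one row per unique spp-stratum-year combination.

```r
# Toy data: "common" occurs in all 3 x 4 stratum-years, "rare" only twice.
toy <- rbind(
	expand.grid(spp = "common", stratum = 1:3, year = 2001:2004,
		stringsAsFactors = FALSE),
	data.frame(spp = "rare", stratum = 1:2, year = 2001L,
		stringsAsFactors = FALSE)
)

# Hypothetical helper: keep the levels of `col` that occur at least
# prod(p[other columns]) times in X; rarer levels cannot satisfy p.
screen_levels <- function(X, p, col) {
	min_n <- prod(p[names(p) != col])
	counts <- table(X[[col]])
	names(counts)[counts >= min_n]
}

p <- c(spp = 2, stratum = 3, year = 4)
# A spp level needs at least 3 * 4 = 12 rows to remain a candidate
keep <- screen_levels(toy, p, "spp")
```

Passing this check is necessary but not sufficient: a level that survives can still fail the full set of conditions, so the random search is still needed afterwards.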

Be aware that it is easy to accidentally ask a lot of this function, so don't be surprised when it doesn't give you an answer quickly, or at all. For example, asking for 10 spp, 5 stratum, 5 year might seem meager for a data set observed over 30 years for 100 strata and 800 spp. However, this is a big ask: 10 species found together in the same 5 places in each of 5 years. If the average stratum has about 30 species, you're requesting that a third of the local biodiversity consist of the same species 25 separate times. If a stratum is small or if species are cosmopolitan, you might get a good result; but that'd be lucky.
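The arithmetic behind that example can be checked directly. The variable names below are illustrative only, and the figure of 30 species per stratum is the rough assumption stated in the text.

```r
# Request from the text: 10 spp in the same 5 strata in each of 5 years
p <- c(spp = 10, stratum = 5, year = 5)

# Every chosen spp must be present in every chosen stratum-year, so a
# successful draw needs this many distinct combinations to exist in X
n_required <- prod(p)

# Number of stratum-years in which the same species must co-occur
n_sites <- prod(p[c("stratum", "year")])

# Share of a roughly 30-species stratum taken up by the requested species
share <- p[["spp"]] / 30
```

Here n_required is 250, n_sites is 25, and share is one third, which matches the description above.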

Value

If dt=TRUE, a data.table that is a subset of X. If dt=FALSE (default), an index into X that selects that subset.

Warning

This function is still experimental. See http://stackoverflow.com/q/33714985/2343633 for possible updates (but this was not a popular question).

See Also

pick

Examples

# simple and fast example
set.seed(1337)
mpick(clean.ebs, p=c(spp=2, year=1), weight=TRUE, screen=TRUE, dt=TRUE)

# More complex example
# if we want 5 spp that are
# found in the same 5 strat in
# at least 1 year; but then
# we want to allow for +/- 2 years
# on either side of that shared year
# First we get the 5-5-1 subset index,
# Then we search for those chosen spp-stratum-year,
# but then we also search for the additional years
## Not run: 
set.seed(1337)
ind <- mpick(clean.ebs, p=c(spp=5, stratum=5, year=1), weight=TRUE, limit=60)
logic <- expression(
	spp%in%spp[ind]
	& stratum%in%stratum[ind]
	& as.integer(year)%in%(as.integer(unique(year[ind])) + (-2:2))
)
clean.ebs[eval(logic)]

## End(Not run)

rBatt/trawlData documentation built on May 26, 2019, 7:45 p.m.