psel: Preference Selection
In rPref: Database Preferences and Skyline Computation

Description

Evaluates a preference on a given data set, i.e., returns the maximal elements of a data set for a given preference order.

Usage

psel(df, pref, ...)

psel.indices(df, pref, ...)

peval(pref, ...)

Arguments

df

A data frame, data frame extension (e.g. a tibble), or a grouped data frame from group_by.

pref

A preference object. See complex_pref and base_pref for details. All variables occurring in the definition of pref must be either columns of df or variables/functions of the environment where pref was defined.

...

Additional optional parameters:

top: Integer. A top value of k means that the k-best tuples of the data set are returned. For top = Inf all tuples are returned.
at_least: Integer. An at_least value of k returns the top-k tuples and additionally all tuples which are not dominated by the worst tuples (i.e., the minima) of the top-k set. The number of tuples returned is greater than or equal to at_least. In contrast to top, this is deterministic.
top_level: Integer. A top_level value of k returns all tuples from the k-best levels. See below for the definition of a level.
and_connected: Logical value. Only relevant if more than one of the {top, at_least, top_level} values is given; otherwise it is ignored. In that case, and_connected = TRUE (the default) means that all top-conditions must hold for the returned tuples.
show_level: Logical value. If TRUE, a column .level is added to the result, containing all level values. If at least one of the {top, at_least, top_level} values is given, then show_level is TRUE by default for the psel function. Otherwise, and for psel.indices in all cases, this option is FALSE by default.
show_index: Logical value. If TRUE, a column .index is added to the result. Not applicable for psel.indices.

Details

The difference between the three variants of the preference selection is:

psel: Returns a subset of the data set containing the maxima according to the given preference.
psel.indices: Returns just the row indices of the maxima (except top-k queries with show_level = TRUE, see top-k preference selection). Hence, psel(df, pref) is equivalent to df[psel.indices(df, pref), ] for non-grouped data frames.
peval: Does the same as psel, but assumes that pref has an associated data frame which is used for the preference selection. See base_pref for details, or use assoc.df to explicitly associate a preference with a data frame.
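
The relationship between the three variants can be illustrated with a small sketch. The data frame df below is made up for illustration; the calls are the ones described above, with the data frame associated via low(a, df = df) as in the Examples section.

```r
library(rPref)

# Small illustrative data set (hypothetical values)
df <- data.frame(a = c(1, 1, 3), b = c(1, 2, 3))

# psel returns the maximal tuples themselves
res <- psel(df, low(a))

# psel.indices returns only their row indices
idx <- psel.indices(df, low(a))

# For a non-grouped data frame, indexing with idx should give the same tuples
all.equal(res, df[idx, ])

# peval evaluates a preference that carries its own data frame
res_eval <- peval(low(a, df = df))
```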

Top-k Preference Selection

For a given top value of k, the k best elements and their level values (column .level) are returned. The elements of level n are also called the n-th stratum in the literature. The level values are determined as follows:

All the maxima of a data set w.r.t. a preference have level 1.
The maxima of the remainder, i.e., the data set without the level 1 maxima, have level 2.
The n-th iteration of "Take the maxima from the remainder" returns tuples of level n.

By default, psel.indices does not return the level values. By setting show_level = TRUE this function returns a data frame with the columns '.index' and '.level'. Note that, if none of the top-k values {top, at_least, top_level} is set, then all level values are equal to 1.
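
The level definition above can be reproduced by hand as a rough sketch: repeatedly take the maxima of the remainder with psel.indices and record the iteration number. This only illustrates the definition and is not how rPref computes levels internally; the data frame and the variable names level, remaining, and maxima are chosen for this example.

```r
library(rPref)

df <- data.frame(a = c(1, 1, 3), b = c(1, 2, 3))

level <- integer(nrow(df))        # level value per row, filled in below
remaining <- seq_len(nrow(df))    # rows not yet assigned to a level
n <- 0L

while (length(remaining) > 0) {
  n <- n + 1L
  # Maxima of the remainder; psel.indices is relative to the subset,
  # so map the positions back to row numbers of df
  maxima <- remaining[psel.indices(df[remaining, , drop = FALSE], low(a))]
  level[maxima] <- n
  remaining <- setdiff(remaining, maxima)
}

level
```

For this data set the two tuples with a = 1 end up in level 1 and the remaining tuple in level 2.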

By definition, a top-k preference selection is non-deterministic. A top-1 query on two equivalent tuples (equivalence according to pref) can return either of these tuples.

Consider the following example:

df <- data.frame(a = c(1, 1, 3), b = c(1, 2, 3))

The query psel(df, low(a)) returns:

    a b
  1 1 1
  2 1 2

The top-1 selection psel(df, low(a), top = 1) returns:

    a b .level
  1 1 1      1

Theoretically, the tuple with b = 2 could also be returned by the above query. In contrast, a preference selection using at_least is deterministic because it adds all tuples having the same level as the worst level of the corresponding top-k query. That is, the result is filled with all tuples that are not worse than the top-k result. A preference selection with top_level = k returns all tuples having level k or better.
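
As a brief sketch of this difference, the following compares top = 1 and at_least = 1 on the example data frame from above. The variable names top1 and al1 are chosen for illustration.

```r
library(rPref)

df <- data.frame(a = c(1, 1, 3), b = c(1, 2, 3))

# top = 1: exactly one of the two equivalent tuples with a = 1
top1 <- psel(df, low(a), top = 1)

# at_least = 1: the result is filled up with all tuples that are
# not worse than the top-1 result, so both a = 1 tuples are returned
al1 <- psel(df, low(a), at_least = 1)

nrow(top1)
nrow(al1)
```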

If the top or at_least value is greater than the number of elements in df (i.e., nrow(df)), or top_level is greater than the highest level in df, then all elements of df will be returned without a warning. In addition, we can set top = Inf to return all tuples, i.e., psel(df, low(a), top = Inf) returns:

    a b .level
  1 1 1      1
  2 1 2      1
  3 3 3      2

By setting top_level = 2 we return the first two levels, i.e., all tuples in this case.
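
A short sketch of these cases on the same example data frame, assuming nothing beyond the parameters described above (the variable names are illustrative):

```r
library(rPref)

df <- data.frame(a = c(1, 1, 3), b = c(1, 2, 3))

lvl1 <- psel(df, low(a), top_level = 1)  # only the level-1 tuples
lvl2 <- psel(df, low(a), top_level = 2)  # levels 1 and 2, i.e., all tuples here
big  <- psel(df, low(a), top = 10)       # top exceeds nrow(df): all tuples

c(nrow(lvl1), nrow(lvl2), nrow(big))
```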

If multiple top-k parameters are specified, their interaction is controlled by and_connected. Let cond1 and cond2 be top-conditions like top = 2 or top_level = 3. With the default and_connected = TRUE, psel([...], cond1, cond2) is equivalent to the intersection of psel([...], cond1) and psel([...], cond2). With and_connected = FALSE, the conditions are or-connected, which corresponds to the union of psel([...], cond1) and psel([...], cond2).
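
The following sketch combines two top-conditions on the example data frame. The conditions top = 3 and top_level = 1 are chosen so that both results are deterministic: top = 3 selects all three tuples, while top_level = 1 selects only the two level-1 tuples.

```r
library(rPref)

df <- data.frame(a = c(1, 1, 3), b = c(1, 2, 3))

# Default (and_connected = TRUE): both conditions must hold (intersection)
res_and <- psel(df, low(a), top = 3, top_level = 1)

# and_connected = FALSE: one condition suffices (union)
res_or <- psel(df, low(a), top = 3, top_level = 1, and_connected = FALSE)

nrow(res_and)
nrow(res_or)
```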

Grouped Preference Selection

Using psel it is also possible to perform a preference selection where the maxima are calculated for every group separately. The groups have to be created with group_by from the dplyr package. The preference selection preserves the grouping, i.e., the groups are restored after the preference selection.

For example, if the summarize function from dplyr is applied to psel(group_by(...), pref), the summarizing is done for the set of maxima of each group. This can be used, e.g., to calculate the number of maxima in each group; see the examples below.

A {top, at_least, top_level} preference selection is applied to each group separately. A top = k selection returns the k best tuples for each group. Hence, if there are 3 groups in df, each containing at least 2 elements, and we have top = 2, then 6 tuples will be returned.
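
As a sketch of the grouped top-k case, mtcars grouped by cyl has three groups with more than two rows each, so top = 2 should yield six tuples. The variable names res and per_group are chosen for illustration.

```r
library(rPref)
library(dplyr)

# Two best tuples (lowest mpg) within each cyl group
res <- psel(group_by(mtcars, cyl), low(mpg), top = 2)

# The grouping is preserved, so summarise counts per group
per_group <- summarise(res, n = n())

nrow(res)
per_group
```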

Parallel Computation

On multi-core machines the preference selection can be run in parallel using a divide-and-conquer approach. Depending on the data set, this may be faster than a single-threaded computation. To activate parallel computation within rPref the following option has to be set:

options(rPref.parallel = TRUE)

If this option is not set, rPref uses single-threaded computation by default. The option rPref.parallel.threads specifies the maximum number of threads. The default is the number of cores on your machine. To limit the computation to 4 threads, use:

options(rPref.parallel.threads = 4)
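
A minimal sketch of switching between both modes is shown below. It assumes that the parallel backend is available on the machine; the thread count of 2 is an arbitrary choice for illustration. Since both modes compute the same set of maxima, the indices are compared as sets rather than by order.

```r
library(rPref)

# Remember the previous settings and start single-threaded
old <- options(rPref.parallel = FALSE, rPref.parallel.threads = 2)
idx_seq <- psel.indices(mtcars, high(mpg) * high(hp))

# Same query with parallel computation enabled
options(rPref.parallel = TRUE)
idx_par <- psel.indices(mtcars, high(mpg) * high(hp))

# Restore the previous settings
options(old)

setequal(idx_seq, idx_par)
```

Whether the parallel variant is actually faster depends on the data set, so it is worth timing both with system.time on realistic data before enabling the option globally.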

Examples


# Skyline and top-k/at-least Skyline
psel(mtcars, low(mpg) * low(hp))
psel(mtcars, low(mpg) * low(hp), top = 5)
psel(mtcars, low(mpg) * low(hp), at_least = 5, show_index = TRUE)

# Preference with associated data frame and evaluation
p <- low(mpg, df = mtcars) * (high(cyl) & high(gear))
peval(p)

# Visualizes the Skyline in a plot.
sky1 <- psel(mtcars, high(mpg) * high(hp))
plot(mtcars$mpg, mtcars$hp)
points(sky1$mpg, sky1$hp, lwd = 3)

# Grouped preference with dplyr.
library(dplyr)
psel(group_by(mtcars, cyl), low(mpg))

# Returns the number of maxima in each group.
summarise(psel(group_by(mtcars, cyl), low(mpg)), n())

rPref documentation built on Aug. 21, 2025, 5:44 p.m.