clusterdataframe-class: Data-clustering
In RLogik/clusterby: Data Clustering

This package provides methods for clustering data in data frames. It is particularly useful in fields such as bio-mathematics and the cognitive sciences.

`tib`	Tibble or data frame to be clustered. Vectors are also accepted.
`by`	string vector. Specifies the column(s) for geometric data, according to which the clusters are to be built.
`filter.by`	string vector. Defaults to `c()`. Specifies the columns by which the data is first divided into groups; clusters are then built within each group.
`keep`	string vector. Defaults to `c()`. Specifies the columns that should be kept when using the `$get` method.
`near`	symmetric function in two arguments. This function operates on pairs of entries in the columns with geometric data and returns `TRUE` if the entries are near each other, otherwise `FALSE`. Defaults to a test based on the Manhattan metric.
`min.dist`	a real number. Defaults to `0`. If the default Manhattan metric is used for `near`, this is the minimum tolerated distance between geometric data.
`max.dist`	a real number. Defaults to `Inf`. If the default Manhattan metric is used for `near`, this is the maximum tolerated distance between geometric data.
`strict`	boolean. Defaults to `FALSE`. If the default Manhattan metric is used for `near`, this determines whether proximity is tested with a strict `< dist` (`TRUE`) or with `<= dist` (`FALSE`).
`cluster.name`	string. Defaults to `'cluster'`. Running `tib %>% clusterby(...)` returns a data frame, which extends `tib` by 1 column with this name. This column tags the clusters by a unique index.
`min.size`	a natural number. Defaults to `1`. If a cluster has fewer elements than this, it is not treated as a cluster.
`max.size`	a natural number. Defaults to `Inf`. The maximum allowable size of a cluster.
`split`	boolean. Defaults to `FALSE`. If set to `TRUE`, then the output tibble is grouped by cluster (equivalent to performing `%>% group_by(...)`).
`is.lexical`	boolean. Defaults to `TRUE` if `length(by) == 1`, otherwise to `FALSE`. If set to `TRUE`, then the geometry is assumed to be linear and endowed with a simple difference metric. This allows for faster computation.
`no.overlaps`	boolean. Defaults to `FALSE`. If set to `TRUE` in combination with `is.lexical=TRUE`, then the clusters must occupy intervals that do not overlap.
`summary`	boolean. Defaults to `FALSE`. If set to `TRUE` in combination with `is.lexical=TRUE`, and assuming the user has presorted the data by the `by` column, then a summary of the clusters as intervals is provided. This makes most sense if `no.overlaps=TRUE`. The summary has the columns `filter.by, by, pstart, pend, nstart, nend, n`, where `pstart` and `pend` describe the interval, `nstart` and `nend` give the original indices in the input data, and `n` is the cluster size (number of points).
`as.interval`	boolean. Defaults to `TRUE` if `is.lexical=TRUE` and `no.overlaps=TRUE`, otherwise defaults to `FALSE`. If `TRUE`, then summaries provide information as interval end points. If `FALSE`, then summaries are provided as lists.
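The interplay of `near`, `min.dist`, `max.dist` and `strict` can be illustrated with a small base R sketch. It does not call clusterby and is not the package's implementation: `make_near` is a hypothetical helper, and it assumes that `strict` only affects the upper bound `max.dist`, which the descriptions above do not spell out.

```r
# Hypothetical helper (not part of clusterby): builds a symmetric `near`
# predicate from the Manhattan (L1) distance.
# Assumption: `strict` applies to the upper bound only.
make_near <- function(min.dist = 0, max.dist = Inf, strict = FALSE) {
  function(p, q) {
    d <- sum(abs(p - q))  # Manhattan distance between two points
    upper_ok <- if (strict) d < max.dist else d <= max.dist
    d >= min.dist && upper_ok
  }
}

near_strict <- make_near(max.dist = 5, strict = TRUE)
near_loose  <- make_near(max.dist = 5, strict = FALSE)

near_strict(c(0, 0), c(1, 3))  # distance 4: TRUE
near_strict(c(0, 0), c(2, 3))  # distance 5: FALSE, because the bound is strict
near_loose(c(0, 0), c(2, 3))   # distance 5: TRUE
```

A function of this shape, taking two points and returning a single logical value, is the kind of object the `near` argument is described as expecting.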

cd <- clusterby::clusterdataframe(tib)
cd$build(...)
cd$summarise(...)
cd$get('original', ...)
cd$get('clusters', summary=<lgl>, ...)

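# 1-D example: cluster positions along a line, separately for each gene/active group
cdf <- clusterby::clusterdataframe(gene)
cdf$build(by='position', filter.by=c('gene','active'), min.size=4, max.dist=400, strict=TRUE, is.lexical=TRUE, no.overlaps=TRUE)
# 3-D example: cluster coordinates per cell type; the cluster column is named 'segment'
cdf <- clusterby::clusterdataframe(protein3d)
cdf$build(by=c('x','y','z'), filter.by='celltype', max.dist=5.8e-7, cluster.name='segment')
# 2-D example: cluster soil samples per density/substance group; the cluster column is named 'clump'
cdf <- clusterby::clusterdataframe(soil_data)
cdf$build(by=c('x','y'), filter.by=c('density','substance'), max.dist=10e-3, cluster.name='clump')
# Retrieve the clustered data, optionally keeping extra columns
data <- cdf$get('clusters')
tib <- cdf$get('clusters', keep=c('colour','age'))
tib <- cdf$get('clusters', summary=FALSE)
# Retrieve cluster summaries, as interval end points or as lists
tib_summ <- cdf$get('clusters', summary=TRUE, as.interval=TRUE)
tib_summ <- cdf$get('clusters', summary=TRUE, as.interval=FALSE)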
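
To give a feel for what one-dimensional (`is.lexical=TRUE`) clustering with an interval summary might produce, here is a minimal base R sketch. It does not use clusterby and may differ from the package's actual algorithm: it assumes a new cluster starts wherever the gap between consecutive sorted points exceeds `max.dist`. The toy data is made up; the summary columns `pstart`, `pend` and `n` mirror the names described under `summary`.

```r
# Toy data (made up for illustration).
df <- data.frame(position = c(10, 12, 15, 300, 305, 900))

max.dist <- 20  # largest gap allowed between neighbouring points in a cluster
min.size <- 2   # clusters with fewer points are discarded

# Sort by position, then start a new cluster wherever the gap to the
# previous point exceeds max.dist.
df <- df[order(df$position), , drop = FALSE]
gap <- c(Inf, diff(df$position))
df$cluster <- cumsum(gap > max.dist)

# Drop clusters that are too small.
sizes <- table(df$cluster)
df <- df[df$cluster %in% names(sizes)[sizes >= min.size], , drop = FALSE]

# Interval summary: start point, end point and size of each remaining cluster.
summ <- data.frame(
  cluster = sort(unique(df$cluster)),
  pstart  = as.numeric(tapply(df$position, df$cluster, min)),
  pend    = as.numeric(tapply(df$position, df$cluster, max)),
  n       = as.integer(tapply(df$position, df$cluster, length))
)
print(summ)
```

With this data, two clusters survive: the interval from 10 to 15 with three points and the interval from 300 to 305 with two points. The isolated point at 900 is dropped because it falls below `min.size`.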