clusterdataframe-class: Data-clustering


Description

This package contains methods that enable clustering in data frames. It is particularly useful for bio-mathematics, cognitive sciences, etc.

Arguments

tib

Tibble or data frame to be clustered. The method also works with vectors.

by

string vector. Specifies the column(s) for geometric data, according to which the clusters are to be built.

filter.by

string vector. Defaults to c(). Specifies the columns by which the data is first divided into groups; the clusters are then built within each group.

keep

string vector. Defaults to c(). Specifies the columns that should be kept when using the $get method.

near

symmetric function of two arguments. This function operates on pairs of entries in the columns with geometric data and returns TRUE if the entries are near, otherwise FALSE. Defaults to a predicate based on the Manhattan metric.

min.dist

a real number. Defaults to 0. If the default Manhattan metric is used for near, this is the minimum tolerated distance between geometric data.

max.dist

a real number. Defaults to Inf. If the default Manhattan metric is used for near, this is the maximum tolerated distance between geometric data.

strict

boolean. Defaults to FALSE. If the default Manhattan metric is used for near, this determines whether the proximity test is strict (< dist, if TRUE) or non-strict (<= dist, if FALSE).

cluster.name

string. Defaults to 'cluster'. Running tib %>% clusterby(...) returns a data frame that extends tib by one column with this name. This column tags each cluster with a unique index.

min.size

a natural number. Defaults to 1. If a cluster has fewer elements than this, it will not be regarded as a cluster.

max.size

a natural number. Defaults to Inf. The maximum allowable size of a cluster.

split

boolean. Defaults to FALSE. If set to TRUE, then the output tibble will be grouped by cluster (equivalent to performing %>% group_by(...)).

is.lexical

boolean. Defaults to TRUE if length(by)=1, otherwise to FALSE. If set to TRUE, then the geometry is assumed to be linear and endowed with a simple difference-metric. This allows for faster computation.

no.overlaps

boolean. Defaults to FALSE. If set to TRUE in combination with is.lexical=TRUE, then the clusters must occupy intervals that do not overlap.

summary

boolean. Defaults to FALSE. If set to TRUE in combination with is.lexical=TRUE, and assuming the user has presorted the data by the by column, then a summary of the clusters as intervals is provided. This makes most sense if no.overlaps=TRUE. The summary contains the columns filter.by, by, pstart, pend, nstart, nend, n, where pstart and pend describe the interval, nstart and nend give the original indices in the input data, and n is the cluster size (number of points).

as.interval

boolean. Defaults to TRUE if is.lexical=TRUE and no.overlaps=TRUE, otherwise defaults to FALSE. If TRUE, then summaries provide information as interval end points. If FALSE, then summaries are provided as lists.
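The interplay of near, min.dist, max.dist and strict can be illustrated with a small base-R sketch. This is not the package's implementation: make_manhattan_near is a hypothetical helper, and the sketch assumes that strict affects only the upper bound, so that identical points still count as near under the defaults.

```r
# Hypothetical helper, NOT part of clusterby: builds a symmetric `near`
# predicate from the Manhattan (L1) distance between two numeric vectors.
make_manhattan_near <- function(min.dist = 0, max.dist = Inf, strict = FALSE) {
  function(p, q) {
    d <- sum(abs(p - q))  # Manhattan distance between p and q
    # Assumption: `strict` only changes the comparison against max.dist.
    upper_ok <- if (strict) d < max.dist else d <= max.dist
    d >= min.dist && upper_ok
  }
}

near_loose  <- make_manhattan_near(max.dist = 5)
near_strict <- make_manhattan_near(max.dist = 5, strict = TRUE)

# The two points below are exactly 5 apart in the Manhattan metric.
near_loose(c(0, 0), c(2, 3))   # TRUE:  5 <= 5
near_strict(c(0, 0), c(2, 3))  # FALSE: 5 <  5 fails
```

Any user-supplied near function would presumably need the same shape: two entries in, a single TRUE/FALSE out, with the same result when the arguments are swapped.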

Details

cd <- clusterby::clusterdataframe(tib)
cd$build(...)
cd$summarise(...)
cd$get('original', ...)
cd$get('clusters', summary=<lgl>, ...)
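To make the lexical case (is.lexical=TRUE with presorted data) and the interval summary more concrete, here is a minimal base-R sketch. It is an assumption-laden illustration, not the algorithm clusterby uses: cluster_sorted_1d and summarise_intervals are hypothetical helpers, and clusters are formed simply by starting a new cluster wherever the gap between consecutive sorted points exceeds max.dist. The output column names follow the summary argument described above.

```r
# Hypothetical helpers, NOT part of clusterby.

# Tags presorted 1-D data with cluster indices: a new cluster starts
# wherever the gap to the previous point exceeds max.dist.
cluster_sorted_1d <- function(x, max.dist, min.size = 1) {
  stopifnot(!is.unsorted(x))
  id <- cumsum(c(TRUE, diff(x) > max.dist))
  # Groups smaller than min.size are not regarded as clusters: tag them NA.
  sizes <- tabulate(id)
  id[sizes[id] < min.size] <- NA
  id
}

# Summarises each cluster as an interval [pstart, pend] together with the
# original row indices nstart..nend and the cluster size n.
summarise_intervals <- function(x, id) {
  idx <- split(seq_along(x), id)  # entries tagged NA are dropped by split()
  data.frame(
    cluster = as.integer(names(idx)),
    pstart  = vapply(idx, function(i) x[i[1]], numeric(1)),
    pend    = vapply(idx, function(i) x[i[length(i)]], numeric(1)),
    nstart  = vapply(idx, function(i) i[1], integer(1)),
    nend    = vapply(idx, function(i) i[length(i)], integer(1)),
    n       = lengths(idx),
    row.names = NULL
  )
}

position <- c(10, 12, 15, 40, 41, 100)
id <- cluster_sorted_1d(position, max.dist = 5, min.size = 2)
summarise_intervals(position, id)
```

In this sketch the isolated point at 100 falls below min.size and is left untagged, and the two remaining clusters occupy the intervals [10, 15] and [40, 41]. Because clusters are cut at gaps in sorted data, their intervals cannot overlap, which is the situation no.overlaps=TRUE describes.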

Examples

# 1-D (lexical) clustering of positions, built separately within each filter.by group
cdf <- clusterby::clusterdataframe(gene)
cdf$build(by='position', filter.by=c('gene','actve'), min.size=4, max.dist=400, strict=TRUE, is.lexical=TRUE, no.overlaps=TRUE)
# 3-D clustering; the cluster index column is named 'segment'
cdf <- clusterby::clusterdataframe(protein3d)
cdf$build(by=c('x','y','z'), filter.by='celltype', max.dist=5.8e-7, cluster.name='segment')
# 2-D clustering; the cluster index column is named 'clump'
cdf <- clusterby::clusterdataframe(soil_data)
cdf$build(by=c('x','y'), filter.by=c('density','substance'), max.dist=10e-3, cluster.name='clump')
# Retrieve the clustered data, optionally keeping extra columns
data <- cdf$get('clusters')
tib <- cdf$get('clusters', keep=c('colour','age'))
tib <- cdf$get('clusters', summary=FALSE)
# Retrieve cluster summaries, as interval end points or as lists
tib_summ <- cdf$get('clusters', summary=TRUE, as.interval=TRUE)
tib_summ <- cdf$get('clusters', summary=TRUE, as.interval=FALSE)

RLogik/clusterby documentation built on May 5, 2019, 12:28 p.m.