clusterPeople: Runs kmeans or generalised low rank models on the cluster...

Description

This function clusters patients into subgroups based on covariates that correspond either to sets of concept_ids or to individual concept_ids. When clustering on individual concept_ids, it is recommended to preprocess the data with a generalised low rank model to reduce the dimensionality before applying k-means. When the data extraction used covariate groups, k-means can be run directly.

Usage

clusterPeople()

## Default S3 method:
clusterPeople(clusterData, ageSpan=c(0,100), gender=8507, method='kmeans',
              clusterSize=10, glrmFeat=NULL,normalise=T, binary=T,
              fraction=F, covariatesToInclude=NULL,covariatesToExclude=NULL,
              covariatesGroups=NULL, loc=loc)

Arguments

minAge

class:numeric default(NULL)- the minimum age a person in the cohort must be to be included in the data

maxAge

class:numeric default(NULL)- the maximum age a person in the cohort must be to be included in the data

gender

class:numeric - gender concept_id (8507 - male; 8532 - female)

method

class:character - method used to do clustering (currently only supports kmeans)

clusterSize

class:numeric - number of clusters returned

glrmFeat

class:numeric - number of features engineered by generalised low rank model

normalise

class:boolean - whether to standardise the data (subtract each feature's mean and divide by its standard deviation) prior to clustering

binary

class:boolean - whether to treat features as binary

fraction

class:boolean - whether to treat features as fraction of total records

covariatesToInclude

class:character vector - features to include; default NULL

covariatesToExclude

class:character vector - features to exclude; default NULL

covariatesGroups

class:covariatecluster - result of clusterCovariate(); default NULL

extraparameters

- a list of parameters that can be used when adding a non default cluster method

cohortid

class:numeric - id of cohort in cohort table

Details

This function performs kmeans clustering or generalised low rank model clustering on clusterData extracted from the CDM using dataExtract(). The user can restrict the data to a subset with ageSpan=c(lowerAgeLimit, upperAgeLimit) and gender=gender_concept_id, then choose the clustering method ('kmeans' or 'glrm') and the required number of clusters, for example clusterSize=10.

When method 'kmeans' is chosen, the people are clustered using kmeans from the h2o package into clusterSize number of groups. When method 'glrm' is chosen, a glrm is run on the data to reduce the dimensionality to glrmFeat number of features and then kmeans is run on the reduced dimensionality data to cluster the people into clusterSize number of groups.
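The two paths described above can be sketched in base R. This is only an illustration under stated assumptions: stats::kmeans stands in for the h2o k-means, stats::prcomp (PCA) stands in for the generalised low rank model, and the simulated patientFeatures matrix is hypothetical. It is not the package's implementation.

```r
set.seed(42)

# Hypothetical data: 60 patients, 8 features, three loosely separated groups
patientFeatures <- rbind(
  matrix(rnorm(20 * 8, mean = 0), ncol = 8),
  matrix(rnorm(20 * 8, mean = 4), ncol = 8),
  matrix(rnorm(20 * 8, mean = 8), ncol = 8)
)

clusterSize <- 3  # number of clusters requested
glrmFeat <- 2     # number of reduced features

# 'kmeans' path: cluster directly on the full feature matrix
directFit <- stats::kmeans(patientFeatures, centers = clusterSize, nstart = 10)

# 'glrm' path (stand-in): reduce the dimensionality first, then cluster.
# PCA is used here only as a simple substitute for a low rank model.
pca <- stats::prcomp(patientFeatures, center = TRUE, scale. = FALSE)
reduced <- pca$x[, seq_len(glrmFeat), drop = FALSE]
reducedFit <- stats::kmeans(reduced, centers = clusterSize, nstart = 10)

# Cluster sizes from each path
table(directFit$cluster)
table(reducedFit$cluster)
```

In both paths the final step is the same k-means call; the only difference is whether it sees the original features or the glrmFeat reduced ones.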

The data can be pre-processed using the normalise, binary and fraction variables. When normalise is TRUE, each feature has its mean subtracted and the result is divided by the feature's standard deviation.

When binary is TRUE, each feature for a person is set to 1 if the patient has the feature in the covariate list and 0 otherwise. When binary is FALSE, the feature value is set to the number of concepts in the feature set that the patient has in the covariate list (e.g. if feature 1 consists of three concept_ids, 12, 1 and 304, and patient 1 has none of these concept_ids in the covariate list, he will have 0 in the feature 1 column, whereas if patient 2 has concept_ids 12 and 304, she will have 2 in the feature 1 column).

When fraction is TRUE, the features for each patient are scaled by dividing by the total sum of the patient's feature values (e.g. if patient 1 has value 3 for feature 5, value 1 for feature 10 and 0 for all other features, then with fraction=TRUE these are scaled to 3/4 for feature 5 and 1/4 for feature 10).
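As a rough illustration of the three pre-processing options, the base R sketch below applies each one separately to a small dense count matrix. The matrix and its values are hypothetical, and the order in which the package combines the options is not stated here, so each transformation is shown on its own.

```r
# Hypothetical counts: rows are patients, columns are features
counts <- matrix(
  c(3, 1, 0,
    0, 2, 2,
    1, 0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("patient", 1:3), paste0("feature", 1:3))
)

# binary = TRUE: 1 if the patient has the feature at all, 0 otherwise
binaryFeatures <- (counts > 0) * 1

# fraction = TRUE: divide each patient's row by that patient's total
# (a patient with a total of 0 would produce NaN and needs separate handling)
fractionFeatures <- counts / rowSums(counts)

# normalise = TRUE: subtract each feature's mean, divide by its standard deviation
normalisedFeatures <- scale(counts, center = TRUE, scale = TRUE)

fractionFeatures["patient1", ]  # 0.75, 0.25, 0.00
```

The first row reproduces the 3/4 and 1/4 example from the text: patient 1 has counts 3 and 1, a total of 4.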

The user can also specify covariates to include in or exclude from the clustering by passing the covariate_ids in a vector. For example, setting covariatesToInclude=c(1,3,10,45) will cluster the data using only the four specified covariates, whereas setting covariatesToExclude=c(1,3,10,45) will exclude the specified covariates from the clustering.
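The effect of the two filters can be sketched on a small sparse covariate table. The data frame below and its column names (row_id, covariate_id, covariate_value) are assumptions made for illustration; the package stores covariates in an ffdf and may filter them differently.

```r
# Hypothetical sparse covariates: one row per (person, covariate) pair
covariates <- data.frame(
  row_id          = c(1, 1, 2, 2, 3),
  covariate_id    = c(1, 3, 3, 7, 45),
  covariate_value = 1
)

covariateIds <- c(1, 3, 10, 45)

# covariatesToInclude: keep only the listed covariates
included <- covariates[covariates$covariate_id %in% covariateIds, ]

# covariatesToExclude: drop the listed covariates, keep everything else
excluded <- covariates[!(covariates$covariate_id %in% covariateIds), ]
```

Here covariate 10 is requested but absent from the data, so the include filter simply returns the rows for covariates 1, 3 and 45, while the exclude filter leaves only covariate 7.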

Value

A list is returned of class 'clusterResult' containing:

strata

An ffdf containing the row_id (unique reference of the person), their age and gender

covariates

An ffdf containing the covariates each person has in sparse format

covariateRef

An ffdf containing the description of each covariate

clusters

A data frame containing the cluster allocated for each row_id

centers

A data frame containing the cluster centers returned by the kmeans algorithm

metadata

A list containing information about the parameters used to extract the data and do the clustering

newData

An ffdf containing the reduced dimensionality data returned when glrm pre-processing is done

features

An ffdf containing the clustering of the original covariates by glrm

Author(s)

Jenna Reps

References

todo...

Examples

# set database connection
dbconnection <- DatabaseConnector::createConnectionDetails(dbms = dbms,server = server,
user = user,password = pw,port = port,schema = cdmDatabaseSchema)

# then extract the data - in this example using default groups
clusterData <- dataExtract(dbconnection, cdmDatabaseSchema,
cohortDatabaseSchema=cdmDatabaseSchema,
workDatabaseSchema='scratch.dbo',
cohortid=2000006292, agegroup=NULL, gender=NULL,
type='group', groupDef = 'default',
historyStart=1,historyEnd=365,  loc=getwd())

# initialise the h2o cluster
library(h2o)
h2o.init(nthreads=-1, max_mem_size = '50g')

# cluster the males aged between 30 and 50 into 15 clusters
# (call written to match the signature shown under Usage)
clusterResult <- clusterPeople(clusterData, ageSpan=c(30,50), gender=8507,
                         method='kmeans', clusterSize=15,
                         normalise=FALSE, binary=FALSE, fraction=TRUE)

jreps/patientCluster documentation built on May 20, 2019, 10:46 a.m.