This function clusters patients into subgroups based on covariates that correspond either to sets of concept_ids or to individual concept_ids. When clustering patients using individual concept_ids, it is recommended to preprocess the data with a generalised low rank model to reduce the dimensionality before applying k-means. When the data extraction used covariate groups, k-means can be run directly.
clusterPeople()
## Default S3 method:
clusterPeople(clusterData, ageSpan=c(0,100), gender=8507, method='kmeans',
              clusterSize=10, glrmFeat=NULL, normalise=T, binary=T,
              fraction=F, covariatesToInclude=NULL, covariatesToExclude=NULL,
              covariatesGroups=NULL, loc=loc)
minAge |
class:numeric default(NULL)- the minimum age a person in the cohort must be to be included in the data |
maxAge |
class:numeric default(NULL)- the maximum age a person in the cohort must be to be included in the data |
gender |
class:numeric - gender concept_id (8507 - male; 8532 - female) |
method |
class:character - method used to do clustering (currently only supports kmeans) |
clusterSize |
class:numeric - number of clusters returned |
glrmFeat |
class:numeric - number of features engineered by generalised low rank model |
normalise |
class:boolean - whether to normalise the data prior to clustering (subtract each feature mean and divide by the feature standard deviation) |
binary |
class:boolean - whether to treat features as binary |
fraction |
class:boolean - whether to treat features as fraction of total records |
covariatesToInclude |
class:character vector - features to include; default NULL |
covariatesToExclude |
class:character vector - features to exclude; default NULL |
covariatesGroups |
class:covariatecluster - result of clusterCovariate(); default NULL |
extraparameters |
- a list of parameters that can be used when adding a non default cluster method |
cohortid |
class:numeric - id of cohort in cohort table |
This function performs kmeans clustering or generalised low rank model clustering on clusterData extracted from the CDM using dataExtract(). The user can restrict the data to a subset by setting ageSpan=c(lowerAgeLimit, upperAgeLimit) and gender=gender_concept_id, and then choose the clustering method, 'kmeans' or 'glrm', and the required number of clusters, for example clusterSize=10.
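The subsetting step can be pictured with a small base R sketch. The strata table, its column names and the inclusive age bounds below are assumptions made for illustration only; this is not the package's internal code.

```r
# Hypothetical strata table; the column names are assumptions for illustration.
strata <- data.frame(
  row_id = 1:6,
  age    = c(25, 34, 47, 50, 62, 41),
  gender = c(8507, 8507, 8532, 8507, 8532, 8507)
)

ageSpan <- c(30, 50)   # c(lowerAgeLimit, upperAgeLimit)
gender  <- 8507        # 8507 = male, 8532 = female

# Keep people whose age lies in the span (bounds assumed inclusive)
# and whose gender concept_id matches.
keep <- strata$age >= ageSpan[1] & strata$age <= ageSpan[2] & strata$gender == gender
subgroup <- strata[keep, ]
print(subgroup$row_id)
```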
When method 'kmeans' is chosen, the people are clustered using kmeans from the h2o package into clusterSize number of groups. When method 'glrm' is chosen, a glrm is run on the data to reduce the dimensionality to glrmFeat number of features and then kmeans is run on the reduced dimensionality data to cluster the people into clusterSize number of groups.
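The two routes can be compared in a rough base R sketch. It deliberately swaps in stats::kmeans for the h2o k-means and prcomp (PCA) for the generalised low rank model, so it only illustrates the "reduce, then cluster" idea and not the package's actual implementation. The simulated matrix and all variable names are assumptions.

```r
set.seed(42)  # k-means starts from random centres

# Hypothetical person-by-feature matrix: 60 people, 8 features, three loose groups.
group <- rep(1:3, each = 20)
x <- matrix(rnorm(60 * 8, sd = 0.3), nrow = 60) + group

clusterSize <- 3  # number of clusters to return
glrmFeat    <- 2  # number of reduced features

# 'kmeans' route: cluster the full feature matrix directly.
fitDirect <- kmeans(x, centers = clusterSize, nstart = 10)

# 'glrm'-like route: reduce the dimensionality first, then cluster the reduced data.
# prcomp (PCA) is only a stand-in for a generalised low rank model.
reduced <- prcomp(x, center = TRUE, scale. = FALSE)$x[, seq_len(glrmFeat)]
fitReduced <- kmeans(reduced, centers = clusterSize, nstart = 10)

# Cluster sizes from each route.
print(table(fitDirect$cluster))
print(table(fitReduced$cluster))
```

Clustering on the reduced data is mainly useful when there are many sparse features, as with individual concept_ids.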
The data can be pre-processed using the normalise, binary and fraction arguments. When normalise is TRUE, each feature has its mean subtracted and the result is divided by the feature standard deviation. When binary is TRUE, each feature for a person is set to 1 if the patient has the feature in the covariate list and 0 otherwise. When binary is FALSE, the feature value is set to the number of concepts in the feature set that the patient has in the covariate list (e.g. if feature 1 consists of three concept_ids, 12, 1 and 304, and patient 1 has none of these concept_ids in the covariate list, he will have 0 in the feature 1 column, whereas if patient 2 has concept_ids 12 and 304, she will have 2 in the feature 1 column). When fraction is TRUE, the features for each patient are scaled by dividing by the total sum of the patient's feature values (e.g. if patient 1 has value 3 for feature 5, value 1 for feature 10 and 0 for all other features, then with fraction=TRUE these are scaled to 3/4 for feature 5 and 1/4 for feature 10).
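The pre-processing options can be reproduced on toy data in base R. The sketch below reuses the numbers from the examples in the text; the patient lists, the extra feature column and all variable names are assumptions, and the code is not taken from the package.

```r
# Feature 1 groups three concept_ids (from the example in the text).
feature1 <- c(12, 1, 304)

# Hypothetical concept_ids recorded for each patient.
patientConcepts <- list(patient1 = c(77, 90), patient2 = c(12, 304, 90))

# binary = FALSE: count how many concept_ids of the feature set each patient has.
counts <- sapply(patientConcepts, function(ids) sum(feature1 %in% ids))
# binary = TRUE: 1 if the patient has any of them, 0 otherwise.
binary <- as.numeric(counts > 0)

# fraction = TRUE: divide each patient's feature values by that patient's total.
values <- matrix(c(3, 1, 0,
                   2, 2, 4), nrow = 2, byrow = TRUE,
                 dimnames = list(c("patient1", "patient2"),
                                 c("feature5", "feature10", "feature12")))
fractions <- values / rowSums(values)

# normalise = TRUE: subtract each feature mean and divide by its standard deviation.
normalised <- scale(values, center = TRUE, scale = TRUE)

print(counts)
print(fractions)
```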
The user can also specify covariates to include in or exclude from the clustering by supplying the covariate_ids in a vector. For example, setting covariatesToInclude=c(1,3,10,45) will cluster the data using only the four specified covariates, whereas setting covariatesToExclude=c(1,3,10,45) will exclude the specified covariates from the clustering.
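On a covariate table in sparse (long) format, including and excluding amount to opposite row filters. The following base R sketch uses a hypothetical table with assumed column names to show the difference; it is an illustration, not the package's code.

```r
# Hypothetical covariates table in sparse (long) format; column names are assumptions.
covariates <- data.frame(
  row_id          = c(1, 1, 2, 2, 3, 3),
  covariate_id    = c(1, 3, 10, 45, 7, 3),
  covariate_value = c(1, 1, 2, 1, 1, 1)
)

covariatesToInclude <- c(1, 3, 10, 45)
covariatesToExclude <- c(1, 3, 10, 45)

# Including: keep only the rows for the listed covariates.
included <- covariates[covariates$covariate_id %in% covariatesToInclude, ]
# Excluding: drop the rows for the listed covariates and keep everything else.
excluded <- covariates[!covariates$covariate_id %in% covariatesToExclude, ]

print(included)
print(excluded)
```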
A list is returned of class 'clusterResult' containing:
strata |
An ffdf containing the row_id (unique reference of the person), their age and gender |
covariates |
An ffdf containing the covariates each person has in sparse format |
covariateRef |
An ffdf containing the description of each covariate |
clusters |
A data frame containing the cluster allocated for each row_id |
centers |
A data frame containing the cluster centers returned by the kmeans algorithm |
metadata |
A list containing information about the parameters set to extract the data and do the clustering |
newData |
An ffdf containing the reduced dimensionality data returned when glrm pre-processing is done |
features |
An ffdf containing the clustering of the original covariates by glrm |
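As a rough illustration of working with the returned list, the sketch below builds a small mock object with only the clusters and centers elements and summarises the allocation. The column names (row_id, cluster, feature1, feature2) and the values are assumptions; a real result would come from the clustering call itself.

```r
# Hypothetical stand-in for a returned result; only two elements are mocked here,
# and the column names are assumptions.
result <- structure(
  list(
    clusters = data.frame(row_id = 1:6, cluster = c(1, 2, 1, 3, 2, 1)),
    centers  = data.frame(feature1 = c(0.1, 0.8, 0.3), feature2 = c(0.9, 0.2, 0.5))
  ),
  class = "clusterResult"
)

# Number of people allocated to each cluster.
clusterSizes <- table(result$clusters$cluster)
print(clusterSizes)

# row_ids of the people allocated to cluster 1.
inCluster1 <- result$clusters$row_id[result$clusters$cluster == 1]
print(inCluster1)
```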
Jenna Reps
todo...
# set database connection
dbconnection <- DatabaseConnector::createConnectionDetails(dbms = dbms, server = server,
                    user = user, password = pw, port = port, schema = cdmDatabaseSchema)
# then extract the data - in this example using default groups
clusterData <- dataExtract(dbconnection, cdmDatabaseSchema,
                           cohortDatabaseSchema=cdmDatabaseSchema,
                           workDatabaseSchema='scratch.dbo',
                           cohortid=2000006292, agegroup=NULL, gender=NULL,
                           type='group', groupDef = 'default',
                           historyStart=1, historyEnd=365, loc=getwd())
# initialise the h2o cluster
library(h2o)
h2o.init(nthreads=-1, max_mem_size = '50g')
# cluster the males aged between 30 and 50 into 15 clusters
clusterPeople <- clusterRun(clusterData, minAge=30, maxAge=50, gender=8507,
                            method='kmeans', clusterSize=15,
                            normalise=FALSE, binary=FALSE, fraction=TRUE)