| Team Members | | -- | | Sreya Guha | | Ayla Pearson | | Fan Nie | | Ting Pan |
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). In k-medians clustering, we partition n
observations into k
clusters. It calculates the median for each cluster to determine its centroid. The kmedians
package performs k-medians clustering on the dataset entered by the users and returns clustered data. This can prove to be an extremely beneficial package as k-medians is more robust to outliers than the arithmetic mean(k-means). The objective of this package is to provide a robust clustering method to solve some of the issues related to outliers and missing data.
The three main functions in the package are :
distance
function
mxn
array of m
original observations in an n
-dimensional spacepxk
array of p
original observations in an k
-dimensional space
It returns a mxp
distance matrix. For each i
and j
, the mteric distance(u=X[i], v=Y[j])
is computed and stored in the ij
th entrykmedian
function
summary
function
kmedians
function on the input data. It returns a dataframe that contains information about the model run such as the number of clusters, the number of points in each cluster, the inter and intra cluster distancek-median is more robust to noise and outliers as compared to k-means. Let's look at an example, suppose we want to cluster on one dimension with k=2. One cluster has most of its members around 1000 and the other around -1000; but there is an outlier (or noise) at 100000. It obviously belongs to the cluster around 1000 but k-means will put the center point away from 1000 and towards 100000. This may even make some of the members of the 1000 cluster (say a member with value 500) to be assigned to the -1000 cluster. k-median will select one of the members around 1000 as the median, it'll probably select one that is bigger than 1000, but it will not select an outlier. Thus we would want to use
kGmedian
: This is a fast k-medians clustering based on recursive averaged stochastic gradient algorithms. The advantage of the kGmedian
algorithm compared to kmeans
strategy is that it deals with sum of norms instead of sum of squared norms, ensuring a more robust behaviour against outlying values.
The above ideas are presented as a part of the initial proposal. However, they could be subject to change in the following milestones based on the project timeline or technical complexity.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.