In ginagruenhage/cmdsr: continuous Multi-dimensional Scaling

Personality over time: an example with cMDS

This vignette describes how to use cMDS to produce interactive plots.

The BFI dataset

In this example we'll use data from a personality study, "bfi" from package psych. People answered questions such as "I am full of ideas" with a rating of how well that describes their personality (from 1 to 6).

We load the package:

if (suppressWarnings(require(psych))==FALSE)
    {
        install.packages("psych")
    }
library(psych)
data(bfi)

The questions asked are coded in the data.frame as "A1","A2", etc. The help file for the dataset gives us the corresponding questions, which we encode as a list :

source('personality_questions.R')
questions[["A1"]]

Questions as vectors

The data include the answers of around 3,000 participants. Participants' responses to a certain question can be grouped into a (long) response vector. If two questions measure a similar aspect of personality, then the responses will tend to correlate: presumably, if you're cocky enough to declare that you "know how to captivate people", you'll probably also declare that you are "make friends easily". To start with, we'll measure the similarity of two response vectors using simple Euclidean distance.

First we get rid of the rows that contain NAs

nas <- apply(bfi,1,function(v) any(is.na(v)))
sub <- bfi[!nas,]

then we compute a distance matrix between all the response vectors:

X <- t(sub[,1:25]) #Questions only, no covariates
D <- as.matrix(dist(X))
image(t(D))

Clustering

We can cluster the response vectors:

plot(hclust(dist(X)))

the highest-level cluster is pretty straightforward to interpret:

head(questions[which(cutree(hclust(dist(X)),2)==1)],3)

vs:

head(questions[which(cutree(hclust(dist(X)),2)==2)],3)

This seems to suggest there are questions that people prefer to answer positively and others not.

The age covariate

The bfi dataset also comes with a column vector that gives us the age of the participants.

hist(bfi$age)

Using cMDS to look at the effects of age

We'll use cMDS to see if the relationship between questions changes over time.

We define a function that gives a distance matrix between questions specific to a certain age group:

dmat <- function(age.med)
    {
        sub.age <- subset(sub,abs(age-age.med)<=2)
        X <- sub.age[,1:25]
        as.matrix(dist(t(X)))/sqrt(nrow(X))
    }

we collect distance matrices over a succession of age groups:

ages <- seq(16,60,l=30)
DL <- llply(ages,dmat) #Will output a list of distance matrices
names(DL) <- round(ages) #This will be used later by the plotting function

and run cMDS to get a 2d embedding:

library(cmdsr)
res <- cmds(DL,k=2,l=1)

The structure of cMDS output

The most important bit in the output returned by cMDS is XL, a list of configurations with the same length as DL, the list of distances. Each XL is a k*n matrix, where k is the embedding dimension (here, k=2) and n is the number of points.

plot(res$XL[[1]][1,],res$XL[[1]][2,],xlab="Dimension 1",ylab="Dimension 2",
     pch=19,main="cMDS configuration at the initial timestep")

Each point represents a question, located so that its distance to the other points reflects the initial distance matrix. We have what looks like clustering, and sure enough if we label the points according to the roughest hierarchical clustering division we have:

cols <- c('red','blue')[cutree(hclust(dist(X)),2)]
plot(res$XL[[1]][1,],res$XL[[1]][2,],xlab="Dimension 1",ylab="Dimension 2",
     pch=19,main="cMDS configuration at the initial timestep",col=cols)

A motion chart using RGoogleVis

The embedding can be displayed using the motion widget in the GoogleVis package:

#Create a data.frame with some basic information on what's being displayed
df <- data.frame(id=names(questions),question=do.call("c",questions))
print(googleVis.cmds(res,df),"chart")

Due to a limitation in GoogleVis, the ages appear as years: 1920 actually means people around 20.

A motion chart using animint (work in progress)

Interactive graphs can also be produced using other libraries. Here's an example using the https://github.com/tdhock/animint|animint package:

library(animint)
library(stringr)
df$group <- str_sub(df$id,1,1) #Questions are grouped thematically, we'll plot themes using different colours

#Turn cMDS output into a data.frame
embed <- ldply(1:length(ages),function(ind)
               {
                   df <- as.data.frame(t(res$XL[[ind]]))
                   names(df) <- c("cmds.x1","cmds.x2")
                   df[,"age"] <- ages[ind]
                   df[,"id"] <- names(questions)
                   df
               })

embed <- merge(embed,df) #Add info on the questions
#animint is based on ggplot. see animint documentation for info
pa <- ggplot(df,aes(rep(0,25),25:1,col=group))+geom_text(aes(label=question,clickSelects=id)) +
    theme(axis.line=element_blank(),
          axis.text=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),legend.position="none")
pb <- ggplot(embed)+make_text(embed,0,2,label="age",format="Age %d +/- 2 years")+geom_text(aes(cmds.x1,cmds.x2,label=id,showSelected=id,showSelected2=age),size=2)+geom_text(aes(cmds.x1,cmds.x2,label=id,showSelected=age,col=group),alpha=.8)
gg2animint_knitr(list(pa=pa,pb=pb,time=list(variable="age", ms=500)))