# A clean start rm(list = ls()) graphics.off()
# Important: Remember # build the vignettes with devtools::build_vignettes() knitr::opts_chunk$set( collapse = TRUE, fig.width = 9, comment = "#>" )
# To get the vignette as a pdf file rmarkdown::render("sortingNoodlesInAsia.Rmd", bookdown::pdf_document2(keep_tex = TRUE, toc_depth = 3) , output_dir = '../../test4Spise2018/sortingNoodles/')
# Knitr options here knitr::opts_knit$get()
nK = 30
This document
presents the analysis of the Ramen Noodle sorting task
collected with r nK
participants on Tuesday July 24 2018
in the Advanced SPISE2018 workshop.
Participants were provided with 20 pictures (each printed on a 8cm by 13cm card) of packets of Ramen Noodles (see Figures below). Participants were given the following instructions:
"Look at the pictures and put together
he Ramen noodles that for you belong to the same category.
You can make as many groups as you want
(at least two groups but less than 20)
and you can put as many samples as you want in each group."
```r", fig.height=4, fig.width=6, include=TRUE, out.width='85%'} knitr::include_graphics('../man/figures/ramenNoodlesImages.png')
# Prelude to the analysis If you want to make sure that you have a clean start, you can execute the following commands: ```r rm(list = ls()) graphics.off()
Or, better, you can (should? must?)
use an Rproject
for this project (see
preamble below).
Make sure that you start this analysis
as a new Rproject
so that
the default directory will be correctly set.
Before we start the analysis,
we need to have our three standard packages installed
(from Github
)
and the corresponding libraries loaded:
DistatisR
PTCA4CATA
R4SPISE2018
We also need some other packages namely:
Matrix
factoextra
ExPosition
All these packages are installed or loaded with the following instructions:
# Decomment all/some these lines if the packages are not installed # devtools::install_github('HerveAbdi/PTCA4CATA') # devtools::install_github('HerveAbdi/DistatisR') # install.packages(prettyGraphs) # install.packages('Matrix') # install.packages('factoextra') # install.packages('ExPosition') # install.packages('pander') # nice pdf-tables # # load the libraries that we will need suppressMessages(library(Matrix)) suppressMessages(library(factoextra)) suppressMessages(library(DistatisR)) suppressMessages(library(PTCA4CATA)) suppressMessages(library(prettyGraphs)) suppressMessages(library(ExPosition))
The name of the excel data file and of the excel sheets are given below:
file2read.name <- 'dataSortingRamen.xlsx' path2file <- system.file("extdata", file2read.name, package = "R4SPISE2018") sheetName4Data <- 'DataSort' sheetName4Vocabulary <- "Vocabulary" sheetName4Judges <- "JudgesDescription"
The data are stored in an excel file called
r file2read.name
whose location
is stored in the variable path2file.
```{R findDataPath} path2file <- system.file("extdata", "dataSortingRamen.xlsx", package = "R4SPISE2018")
```r", fig.height=3, fig.width=4, include=TRUE, out.width='70%'} # label.spicesxl = capFigNo #  # decomment if needed knitr::include_graphics('../man/figures/imageOfSortingNoodles.png') #knitr::include_graphics('imageOfSortingNoodles.png') # copy the file # file.copy(path2file, 'ramen.xlsx')
if you open this excel file, it will look like the
Figure above.
The sorting data are stored in the sheet
called r sheetName4Data
(whose name is stored in the variable sheetName4Data
).
In this sheet,
the first column gives the names of the products
(here spices) and the following columns give how
each Judge sorted the products: The products that
were sorted together are assigned the same number
(arbitrarily chosen).
The contingency table storing the number of Judges
using a descriptor (column) for a product (row),
is stored in the sheet r sheetName4Vocabulary
(whose name is stored in the variable sheetName4Vocabulary
).
The description of the Judges (e.g., country, age, gender)
is stored in the sheet r sheetName4Judges
(whose name is stored in the variable sheetName4Judges
).
When you record you own data, make sure that you follow the same format, this way the script described in this vignette will apply to your own analysis with minimal change.
We will first compute the results of the analysis, then create the graphics, and finally save everything into a PowerPoint.
The excel file name and location (i.e., path) are
stored in the variable path2file
.
To read the sorted data and the vocabulary
contingency table we will use the function
DistatisR::read.df.excel()
(based upon
the function readxl::read_excel()
).
We will use again this function to read the description
of the judges.
# read the sortinig data and the vocabulary multiSort.list <- read.df.excel(path = path2file, sheet = sheetName4Data, voc.sheet = sheetName4Vocabulary) multiSort <- multiSort.list$df.data vocabulary <- multiSort.list$df.voc
The sorting data and the vocabulary
are now stored into the list multiSort.list
in which the sorting data are stored in the
dataframe
called multiSort.list$df.data
and in which the vocabulary contingency table is stored
in the
dataframe called multiSort.list$df.voc
.
To facilitate coding, we put the sorting data and the vocabulary
in the data frames called (respectively) multisSort
and
vocabulary
.
# saveFile <- file.copy(from = path2file, to = '~/Downloads/myDataFile.xlsx')
To make sure that we have read the correct file we can
peek at the dataframe multiSort
and look at the data for
the first 5 spices of
the first 10 assessors:.
library(pander) # knitr::kable(multiSort[1:5,1:10]) pander::pander(multiSort[1:5,1:10])
The description of the judges can be found in the
sheet `r sheetName4Judges
and it is read
with the function DistatisR::read.df.excel
as:
# read the sortinig data and the vocabulary judgesDescription <- read.df.excel(path = path2file, sheet = sheetName4Judges)$df.data nVarJudges <- ncol(judgesDescription)
The judges' description is now stored in the data frame
JudgesDescription
.
k <- 1 # this is the first descriptor for the judges
Here we analyze the judges' descriptor number r k
(r colnames(judgesDescription)[k]
).
In order to run an analysis for another descriptor,
we just need to change the value of k
and re-run the analysis.
descJudges <- judgesDescription[,k ]
To color the graph of the participants,
we need to associate a color to each level
of the Judges' variable r colnames(judgesDescription)[k]
;
to do so we
use the function createColorVectorsByDesign()
from the package prettyGraphs.
# Create a 0/1 group matrix with ExPosition::makeNominalData() nominal.Judges <- makeNominalData(as.data.frame(descJudges)) # get the colors color4Judges.list <- prettyGraphs::createColorVectorsByDesign(nominal.Judges) # color4Judges.list
The first step of the analysis is to transform the sorting data
into a brick of distance matrices.
This is done with the function
DistanceFromSort()
as:
DistanceCube <- DistanceFromSort(multiSort)
The previous line of code, created the brick of distance (from
the raw data) and stored it in the data frame DistanceCube
.
The brick of distance matrices
(i.e., the data frame DistanceCube
) is used
as the argument of the function distatis
that will compute
the plain DISTATIS method for the sorting task.
resDistatis <- distatis(DistanceCube)
The list resDistatis
contains the results of the analysis.
This list contains two lists:
The first list is
called resDistatis$res4Cmat
, it contains
the results for the $R_V$ analysis
(because the $R_V$ matrix is also called the C matrix);
The second list is called resDistatis$res4Splus
,
it contains the results for the analysis of the
compromise (called the S matrix) .
If you have forgotten the output format of distatis
,
or want to have more information: have a look
at the help for the functiondistatis
:
In the console type:
?distatis
, or print resDistatis
.
We want to find if there are differences between the
different groups of Judges corresponding to the levels of the variable
r colnames(judgesDescription)[k]
.
The first step is to compute the mean of
these groups and their bootstrap confidence intervals.
This is done as:
# Get the factors from the Cmat analysis G <- resDistatis$res4Cmat$G # Compute the mean by groups of HJudges JudgesMeans.tmp <- aggregate(G, list(descJudges), mean) JudgesMeans <- JudgesMeans.tmp[,2:ncol(JudgesMeans.tmp )] rownames(JudgesMeans) <- JudgesMeans.tmp[,1] # Get the bootstrap estimates BootCube <- PTCA4CATA::Boot4Mean(G, design = descJudges, niter = 100, suppressProgressBar = TRUE) # head(BootCube)
The compromise is obtained as a weighted sum of the
$\alpha_j\mathbf{S}_j$ with each of the $J$ judges belonging to one
of $K$ groups (e.g., r colnames(judgesDescription)[k]
).
In plain DISTATIS
, each observation can be projected onto
the compromise; in a similar manner, when the
assessors are nested into another group factor
(e.g., here r colnames(judgesDescription)[k]
),
we can project
these groups onto the compromise.
The easiest way to do so is to use the partial projections
and compute the weighted sum corresponding to each group.
F_j <- resDistatis$res4Splus$PartialF alpha_j <- resDistatis$res4Cmat$alpha # create the groups of Judges #groupsOfJudges <- substr(names(alpha_j),1,1) groupsOfJudges <- descJudges code4Groups <- unique(groupsOfJudges) nK <- length(code4Groups) # initialize F_K and alpha_k F_k <- array(0, dim = c(dim(F_j)[[1]], dim(F_j)[[2]],nK)) dimnames(F_k) <- list(dimnames(F_j)[[1]], dimnames(F_j)[[2]], code4Groups) alpha_k <- rep(0, nK) names(alpha_k) <- code4Groups Fa_j <- F_j # A horrible loop for (j in 1:dim(F_j)[[3]]){ Fa_j[,,j] <- F_j[,,j] * alpha_j[j] } # Another horrible loop for (k in 1:nK){ lindex <- groupsOfJudges == code4Groups[k] alpha_k[k] <- sum(alpha_j[lindex]) F_k[,,k] <- (1/alpha_k[k])*apply(Fa_j[,,lindex],c(1,2),sum) }
To compute the projection of the vocabulary onto
the compromise space,
we use the function DistatisR::projectVoc
.
F4Voc <- projectVoc(multiSort.list$df.voc, resDistatis$res4Splus$F)
Most of the graphics will be created with either
prettyGraphs
or with PTCA4CATA
,
The graphs will be saved in a PowerPoint file.
In DISTATIS
,
the $R_V$ analysis describes the Judges' similarity structure.
In the first map shows the scree plot of the eigenvalues of the $R_V$ between-Judges matrix.
# 5.A. A scree plot for the RV coef. Using standard plot (PTCA4CATA) scree.rv.out <- PlotScree(ev = resDistatis$res4Cmat$eigValues, title = "RV-map: Explained Variance per Dimension") a1.Scree.RV <- recordPlot() # Save the plot
The eigen-analysis of the $R_V$ between-Judges matrix gives the Judges' factor scores that are used to create the factor maps describing the between Judges similarity structure
# Create the layers of the map gg.rv.graph.out <- createFactorMap(X = resDistatis$res4Cmat$G, axis1 = 1, axis2 = 2, title = "Judges: RVMap", col.points = color4Judges.list$oc, col.labels = color4Judges.list$oc) # create the labels for the dimensions of the RV map labels4RV <- createxyLabels.gen( lambda = resDistatis$res4Cmat$eigValues , tau = resDistatis$res4Cmat$tau, axisName = "Dimension ") # # Create the map from the layers # Here with lables and dots a2a.gg.RVmap <- gg.rv.graph.out$zeMap + labels4RV # Here with colored dots only a2b.gg.RVmap <- gg.rv.graph.out$zeMap_background + gg.rv.graph.out$zeMap_dots + labels4RV
To print the RV map, we simply use the function print()
as described below:
print(a2a.gg.RVmap )
# First the means # A tweak for colors in.tmp <- sort(rownames(color4Judges.list$gc), index.return = TRUE)$ix col4Group <- color4Judges.list$gc[in.tmp] # gg.rv.means <- PTCA4CATA::createFactorMap(JudgesMeans, axis1 = 1, axis2 = 2, constraints = gg.rv.graph.out$constraints, col.points = col4Group , alpha.points = 1, # no transparency col.labels = col4Group) # dimnames(BootCube$BootCube)[[2]] <- paste0('dim ',1: dim(BootCube$BootCube)[[2]]) #c('Dim1','Dim2') GraphElli.rv <- MakeCIEllipses(BootCube$BootCube[,1:2,], names.of.factors = c("dim 1","dim 2"), col = col4Group, p.level = .95) a2d.gg.RVMap.CI <- a2b.gg.RVmap + gg.rv.means$zeMap_dots + GraphElli.rv
The values of the means of the assessors from the r k
groups of Judges
can be seen in the table of means for the first three dimensions
knitr::kable(JudgesMeans[,1:3])
To evaluate the significance of these differences, we plot the means and their bootstrapped derived confidence intervals on the $R_V$ map.
print(a2d.gg.RVMap.CI )
Here we conduct a hierarchical cluster analysis on the factor scores of the $R_V$ matrix.
D <- dist(resDistatis$res4Cmat$G, method = "euclidean") fit <- hclust(D, method = "ward.D2") a05.tree4participants <- fviz_dend(fit, k = 1, k_colors = 'burlywood4', label_cols = color4Judges.list$oc[fit$order], cex = .7, xlab = 'Participants', main = 'Cluster Analysis: Participants')
print(a05.tree4participants)
The cluster analysis suggests that there are groups of participants, but how many groups maybe more difficult to evaluate.
We try $k$-means with 4 centers
# First plain k-means set.seed(42) participants.kMeans <- kmeans(x = G , centers = 4) #_____________________________________________________________________ # Now to get a map by cluster: col4Clusters <- createColorVectorsByDesign( makeNominalData( as.data.frame(participants.kMeans$cluster) ))
baseMap.i.km <- PTCA4CATA::createFactorMap(G, title = "RV map. k-means 4 groups", col.points = col4Clusters$oc, col.labels = col4Clusters$oc, constraints = gg.rv.graph.out$constraints, alpha.points = .4) a06.aggMap.i.km <- baseMap.i.km$zeMap_background + baseMap.i.km$zeMap_dots + baseMap.i.km$zeMap_text + labels4RV print(a06.aggMap.i.km)
First a scree plot of the compromise
#--------------------------------------------------------------------- # A scree plot for the Compromise. scree.S.out <- PlotScree( ev = resDistatis$res4Splus$eigValues, title = "Compromise: Explained Variance per Dimension") b1.Scree.S <- recordPlot() #---------------------------------------------------------------------
# 4.1 Get the bootstrap factor scores (with default 1000 iterations) BootF <- BootFactorScores(resDistatis$res4Splus$PartialF) # 5.2 a compromise plot # General title for the compromise factor plots: genTitle4Compromise = 'Compromise.' # To get graphs with axes 1 and 2: h_axis = 1 v_axis = 2 # To get graphs with say 2 and 3 # change the values of v_axis and h_axis color4Products <- # Create color for the Products from prettyGraph prettyGraphsColorSelection(n.colors = nrow(resDistatis$res4Splus$F)) gg.compromise.graph.out <- createFactorMap(resDistatis$res4Splus$F, axis1 = h_axis, axis2 = v_axis, title = genTitle4Compromise, col.points = color4Products , col.labels = color4Products) # NB for the lines below You need DISTATIS version > 1.0.0 # to get the eigen values and tau for the compromise label4S <- createxyLabels.gen( x_axis = h_axis, y_axis = v_axis, lambda = resDistatis$res4Splus$eigValues , tau = resDistatis$res4Splus$tau, axisName = "Dimension ") b2.gg.Smap <- gg.compromise.graph.out$zeMap + label4S # # 5.4 a bootstrap confidence interval plot # 5.3 create the ellipses gg.boot.graph.out.elli <- MakeCIEllipses( data = BootF[,c(h_axis,v_axis),], names.of.factors = c(paste0('Factor ',h_axis), paste0('Factor ',v_axis)), col = color4Products, ) # Add ellipses to compromise graph b3.gg.map.elli <- gg.compromise.graph.out$zeMap + gg.boot.graph.out.elli + label4S #
print(b2.gg.Smap)
nFac4Prod = 3 D4Prod <- dist(resDistatis$res4Splus$F[,1:nFac4Prod], method = "euclidean") fit4Prod <- hclust(D4Prod, method = "ward.D2") b3.tree4Product <- fviz_dend(fit4Prod, k = 1, k_colors = 'burlywood4', label_cols = color4Products[fit4Prod$order], cex = .7, xlab = 'Products', main = 'Cluster Analysis: Products')
print(b3.tree4Product)
# get the partial map map4PFS <- createPartialFactorScoresMap( factorScores = resDistatis$res4Splus$F, partialFactorScores = F_k, axis1 = 1, axis2 = 2, colors4Items = as.vector(color4Products), names4Partial = dimnames(F_k)[[3]], # font.labels = 'bold') d1.partialFS.map.byProducts <- gg.compromise.graph.out$zeMap + map4PFS$mapColByItems + label4S d2.partialFS.map.byCategories <- gg.compromise.graph.out$zeMap + map4PFS$mapColByBlocks + label4S
print(d1.partialFS.map.byProducts )
print(d2.partialFS.map.byCategories)
# 5.5. Vocabulary # 5.5.2 CA-like Barycentric (same Inertia as products) gg.voc.bary <- createFactorMap(F4Voc$Fvoca.bary, title = 'Vocabulary', col.points = 'red4', col.labels = 'red4', display.points = FALSE, constraints = gg.compromise.graph.out$constraints) # e1.gg.voc.bary.gr <- gg.voc.bary$zeMap + label4S #print(e1.gg.voc.bary.gr) b5.gg.voc.bary.dots.gr <- gg.compromise.graph.out$zeMap_background + gg.compromise.graph.out$zeMap_dots + gg.voc.bary$zeMap_text + label4S #print(gg.voc.bary.dots.gr)
print(e1.gg.voc.bary.gr)
print(b5.gg.voc.bary.dots.gr)
The graphics are saved as a PowerPoint with the following command
r name4Graphs = 'SortingRamenNoodles.pptx'
toto <- PTCA4CATA::saveGraph2pptx(file2Save.pptx = name4Graphs, title = '30 (mostly Asian) participants sort pictures of 20 Ramen noodles', addGraphNames = TRUE)
Note that we could also have created a PowerPoint with
Rmarkdown
by using the following options in the
preamble:
output: powerpoint_presentation: slide_level: 4
instead of (for example):
output: rmarkdown::html_vignette: toc: true number_sections: true
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.