**NOTE:** This `pdf` was generated from the vignette `R4SPISE2018::cheeseMCA` from the `R` Package `R4SPISE2018`. Check the help for the very last version of this document.
If you want to make sure that you have a clean start, you can execute the following commands:
rm(list = ls()) graphics.off()
Or, better (see below, preamble), you can use an Rproject
for this project.
# Important: Remember # build the vignettes with devtools::build_vignettes() knitr::opts_chunk$set( collapse = TRUE, comment = "#>")
knitr::opts_knit$get()
Make sure that you start this analysis as a new Rproject
so that
the default directory will be correctly set.
Before we start the analysis,
we need to have our standard packages installed
(some from Github
)
and the corresponding libraries loaded:
Distatis
(from Github)ExPosition
InPosition
PTCA4CTA
(from Github)R4SPISE2018
(from Github)dplyr
(to recode the data)car
(to recode the data)grid
, gridExtra
, and gTable
(to save tables as graphics)stringi
(to recode strings).and all their extensions:
# Decomment all/some these lines if the packages are not installed # devtools::install_github('HerveAbdi/PTCA4CATA') # devtools::install_github('HerveAbdi/DistatisR') # devtools::install_github('HerveAbdi/R4SPISE2018') # of course! # install.packages(prettyGraphs) # install.packages('Matrix') # install.packages('dplyr') # install.packages('gridExtra') # install.packages('grid') # install.packages('gtable') # install.packages('stringi') # load the libraries that we will need suppressMessages(library(Matrix)) suppressMessages(library(DistatisR)) suppressMessages(library(PTCA4CATA)) suppressMessages(library(prettyGraphs)) suppressMessages(library(ExPosition)) suppressMessages(library(dplyr)) suppressMessages(library(gridExtra)) # to save a table as a graph suppressMessages(library(grid)) # that will be saved in the suppressMessages(library(gtable)) # powerpoint with the figures
The data set can be found from the package R4SPISE2018.
The data are stored in an excel file called
miniCheeseSurvey4MCA.xlsx
whose location
can be found and stored in the
variable path2file
using the R
function system.functionfile()
as shown in the following command:
```{R findDataPath}
path2file <- system.file("extdata",
"miniCheeseSurvey4MCA.xlsx", package = "R4SPISE2018")
```r", fig.height=3, fig.width=4, include=TRUE, out.width='70%'} knitr::include_graphics('../man/figures/maroilesMCA.png')
Now the variable path2file
contains the name (and location)
of the excel data file.
If you open this excel file, it will look like the
figure above.
The sorting data are stored in the sheet labeled Data.
The first column gives the ID of the respondents
and each column represents a question asked to the respondents.
At the intersection of a row and a column we have the
response of a respondent (row) to a question (column).
When you record your own data, make sure that you follow the same format, this way the script described in this vignette will apply to your own analysis with minimal change.
We will first compute the results of the analysis, then create the graphics, and finally save everything into a powerpoint.
Recall that,
the excel file name and location (i.e., path) are
stored in the variable path2file
.
To read the data we will use the function
DistatisR::read.df.excel()
(based upon
the function readxl::read_excel()
).
rawData <- read.df.excel(path = path2file, sheet = 'Data')$df.data
The data are now read and stored in
the data frame called rawData
.
To save the data file
(under the---not very---original name of myDataFile
or maybe a more informative name)
in a directory
(say Downloads
). use the following command
saveFile <- file.copy(from = path2file, to = '~/Downloads/myDataFile.xlsx')
In this example we will analyze the answers to two sets of questions.
The first set evaluates the knowledge of the respondent:
and it comprises the
questions Q01_Know1
to Q08_Know8
(or 1 to 8);
these questions have 5 different options
one correct and 4 incorrect options.
Here we have decided to code these questions
with only two values: correct or incorrect.
the second set
comprises the questions Q15_C01
to Q38_C24
(or 32 to 55);
these questions
evaluate the behaviors, opinions, or attitudes of
the respondents toward cheese that are either farm-made or
industrial. These questions are answered with a 4 point
Likert scale (from 1 meaning "I totally agree" to
4 meaning "I totally disagree")
In addition we have some information about the respondents:
Sex (Q09_Sex)
, Age (Q10_Age)
, the city where they live
(Q11_City)
.
In MCA, before proceeding to analysis per se, all variables need to be nominal variables with the constraint that the different levels of a variable are roughly balanced. When a variable is already nominal, the process is straightforward but some levels may need to be fused to have the levels roughly balanced. For ordinal or quantitative data the procedure is to "bin" the data so that the bins are roughly balanced
From the dataframe rawData
we know create
the new dataframe cleanData
. It will have the
same rows but the columns are obtained by recoding
the columns of rawData.
The first step is to create the new data frame with only the variables we want to use; we also re-order the variables to have the respondent variables (i.e., Sex, Age, city) first. This is done with this line of code
temp.df <- dplyr::select(rawData, Q09_Sex:Q11_City, Q01_Know1:Q08_Know8, Q15_C01:Q38_C24)
We use car::recode
to recode the variable sex
from 1/2 to m/f
# We recode sex as m/f sex = car::recode(temp.df[,'Q09_Sex'], " 1 = 'm';2 = 'f'")
The questions Q01_Know1
to Q08_Know8
are knowledge questions
their sum could go from 0 to 8:
# correct responses correct.Know = c(3,3,1,2,1,1,1,3) # where are the knowledge questions knowQuestions = (substr( colnames(temp.df),5,8) == 'Know') # Get the number of correct amswers correct.answers = rowSums(matrix(correct.Know, nrow = nrow(temp.df), ncol = length(correct.Know), byrow = TRUE) == temp.df[,which(knowQuestions)])
But in this sample the sum goes from 1 to 6 with a mode at 4/5 as can be seen from the histogram of the number of correct answers.
hist(correct.answers, main = '', breaks = 0:8, xlab = 'Number of Correct Answers')
we will concatenate them.and re-evaluate them from 1 to 3
# Make only three categories use recode from package car know = car::recode(correct.answers, "0 = 1; 1 = 1; 2 = 1; 3 = 1; 4 = 2; 5 = 3; 6 = 3")
We will also recode the variable Age
into four levels
oriAge <- temp.df$Q10_Age newAge <- car::recode(oriAge, "1 = 1; 2 = 1; 3 = 2; 4 = 3; 5 = 4; 6 = 4")
The first step when analyzing Likert scales is to look at the distribution of the scores. Here we will create a table with the questions as rows and the columns as levels (from 1 to 4). ```r', outwidth = '100%'} knitr::kable( t(apply(rawData[,32:55],2,function(x){summary(as.factor(x))})))
This table shows that the levels of the responses are not very balanced, so we decided to binarize the answers and to keep only the subset of questions that can be binarized as {1,2}|{3,4} (other choices would have been acceptable of course). The set that we keep is in the variable `Lick2Keep` ```r Lick2Keep = c('Q15_C01','Q19_C05','Q21_C07', 'Q24_C10','Q25_C11','Q32_C18')
Lickert <- temp.df[, colnames(temp.df) %in% Lick2Keep] recl <- function(x){y = car::recode(x, "1 = 1; 2 = 1; 3 = 2; 4 = 2") } rec.Lickert <- apply(Lickert,2,recl)
We can now put together the new data set
cleanData <- cbind(sex,newAge,rawData$Q11_City,know,rec.Lickert) colnames(cleanData) <- c('Sex','Age','City','Know', 'C01','C05','C07','C10','C11','C18') # make sure that there is no NA left cleanData <- cleanData[complete.cases(cleanData),]
The dataframe is now ready to be analyzed with multiple correspondence analysis
Use the function epMCA
from the package ExPosition
resMCA <- epMCA(cleanData, graphs = FALSE)
Because Dimension 1 mostly separates the inhabitants of Lille from the inhabitants of Angers, we decided to perform a subset analysis and analyze separately the data from each city.
Create the "Lille" data set by subsetting the whole data set.
cleanData.Lille = cleanData[cleanData[,'City'] == 'Lille',] cleanData.Lille <- cleanData.Lille[, (colnames(cleanData.Lille) != 'City') ]
resMCA.Lille <- epMCA(cleanData.Lille, graphs = FALSE)
Create the "Angers" data set by subsetting the whole data set.
cleanData.Angers <- cleanData[cleanData[,'City'] == 'Angers',] cleanData.Angers <- cleanData.Angers[, (colnames(cleanData.Angers) != 'City') ]
resMCA.Angers <- epMCA(cleanData.Angers, graphs = FALSE)
Make some graphs.
First, get the colors for the variables
cJ <- resMCA$ExPosition.Data$cj color4Var <- prettyGraphs::prettyGraphsColorSelection(ncol(cleanData))
# Extract the root names before the "." varNames <- stringi::stri_extract(rownames(cJ),regex = '[^.]*') varCtr.tmp <- aggregate(cJ ~ varNames, (cbind(varNames,cJ)),sum) varCtr <- varCtr.tmp[,-1] rownames(varCtr) <- varCtr.tmp[,1] rownames(color4Var) <- varCtr.tmp[,1]
The variable contributions show the variables important for a given dimension.
nFact <- min(5, ncol(cJ) - 1) # knitr::kable(round( varCtr[,1:nFact]*1000 ) ) # save table as a graph ctrTable <- tableGrob(round(varCtr[,1:nFact]*1000)) h <- grobHeight(ctrTable) w <- grobWidth(ctrTable) title <- textGrob("Variable Contributions", y = unit(0.5,"npc") + 1.2*h, vjust = 0, gp = gpar(fontsize=15)) TableWithTitle <- gTree(children = gList(ctrTable, title)) grid.draw(TableWithTitle) a00.1.ctrTable <- recordPlot()
To create graphics we will color the labels of a given variable with the same color. First we create the vector storing the colors.
nM <- nrow(cJ) nVar <- nrow(color4Var) col4Labels <- rep("",nM) for (i in 1:nVar){ lindex <- varNames %in% rownames(color4Var)[i] col4Labels[lindex] <- color4Var[i] }
# 5.A. A scree plot for the RV coef. Using standard plot (PTCA4CATA) scree.mca <- PlotScree(ev = resMCA$ExPosition.Data$eigs, p.ev = NULL, max.ev = NULL, alpha = 0.05, col.ns = "#006D2C", col.sig = "#54278F", title = "MCA. Explained Variance per Dimension", plotKaiser = FALSE, color4Kaiser = "darkorchid4", lwd4Kaiser = 2.5) a1.Scree <- recordPlot() # Save the plot
axis1 = 1 axis2 = 2 Fj <- resMCA$ExPosition.Data$fj # generate the set of maps BaseMap.Fj <- createFactorMap(X = Fj , # resMCA$ExPosition.Data$fj, axis1 = axis1, axis2 = axis2, constraints = NULL, title = 'MCA. Variables', col.points = col4Labels, display.points = TRUE, pch = 19, cex = 1, display.labels = TRUE, col.labels = col4Labels, text.cex = 2.5, font.face = "bold", font.family = "sans", col.axes = "darkorchid", alpha.axes = 0.2, width.axes = 1.1, col.background = adjustcolor("lavender", alpha.f = 0.2), force = 2, segment.size = 0) # add labels labels4MCA <- createxyLabels.gen(x_axis = axis1, y_axis = axis2, lambda = resMCA$ExPosition.Data$eigs, tau = resMCA$ExPosition.Data$t) # make the maps aa.1.BaseMap.Fj <- BaseMap.Fj$zeMap + labels4MCA aa.2.BaseMapNoDot.Fj <- BaseMap.Fj$zeMap_background + BaseMap.Fj$zeMap_text + labels4MCA
print(aa.1.BaseMap.Fj)
Here we color the observations by City
,
to confirm that this is, indeed,
the main source of variation for this data set.
Fi <- resMCA$ExPosition.Data$fi colCity <- c('darkblue', 'red4') nI <- nrow(Fi) col4I.City <- rep("",nI) for (i in 1:length(colCity) ){ lindex <- cleanData[,'City'] %in% unique(cleanData[,'City'])[i] col4I.City[lindex] <- colCity[i] } # generate the set of maps BaseMap.Fi <- createFactorMap(X = Fi , # resMCA$ExPosition.Data$fj, axis1 = axis1, axis2 = axis2, constraints = NULL, title = 'MCA. Variables', col.points = col4I.City, alpha.points = .2, display.points = TRUE, pch = 19, cex = .8, display.labels = TRUE, col.labels = col4I.City, text.cex = 2.5, font.face = "bold", font.family = "sans", col.axes = "darkorchid", alpha.axes = 0.2, width.axes = 1.1, col.background = adjustcolor("lavender", alpha.f = 0.2), force = 2, segment.size = 0) # make the maps aa.5.BaseMapNoLabels.Fi <- BaseMap.Fi$zeMap_background + BaseMap.Fi$zeMap_dots + labels4MCA
print(aa.5.BaseMapNoLabels.Fi)
col4Labels.sub <- col4Labels[varNames != 'City'] axis1 = 1 axis2 = 2 Fj.Angers <- resMCA.Angers$ExPosition.Data$fj # generate the set of maps BaseMap.Fj.Angers <- createFactorMap(X = Fj.Angers , # resMCA$ExPosition.Data$fj, axis1 = axis1, axis2 = axis2, constraints = NULL, title = 'Angers. MCA. Variables', col.points = col4Labels.sub, display.points = TRUE, pch = 19, cex = 1, display.labels = TRUE, col.labels = col4Labels.sub, text.cex = 2.5, font.face = "bold", font.family = "sans", col.axes = "darkorchid", alpha.axes = 0.2, width.axes = 1.1, col.background = adjustcolor("lavender", alpha.f = 0.2), force = 2, segment.size = 0) # add labels labels4MCA.Angers <- createxyLabels.gen(x_axis = axis1, y_axis = axis2, lambda = resMCA.Angers$ExPosition.Data$eigs, tau = resMCA.Angers$ExPosition.Data$t) # make the maps ca.1.BaseMap.Fj.Angers <- BaseMap.Fj.Angers$zeMap + labels4MCA.Angers ca.2.BaseMapNoDot.Fj.Angers <- BaseMap.Fj.Angers$zeMap_background + BaseMap.Fj.Angers$zeMap_text + labels4MCA.Angers
print(ca.1.BaseMap.Fj.Angers)
col4Labels.sub <- col4Labels[varNames != 'City'] axis1 = 1 axis2 = 2 Fj.Lille <- resMCA.Lille$ExPosition.Data$fj # generate the set of maps BaseMap.Fj.Lille <- createFactorMap(X = Fj.Lille , # resMCA$ExPosition.Data$fj, axis1 = axis1, axis2 = axis2, constraints = NULL, title = 'Lille. MCA. Variables', col.points = col4Labels.sub, display.points = TRUE, pch = 19, cex = 1, display.labels = TRUE, col.labels = col4Labels.sub, text.cex = 2.5, font.face = "bold", font.family = "sans", col.axes = "darkorchid", alpha.axes = 0.2, width.axes = 1.1, col.background = adjustcolor("lavender", alpha.f = 0.2), force = 2, segment.size = 0) # add labels labels4MCA.Lille <- createxyLabels.gen(x_axis = axis1, y_axis = axis2, lambda = resMCA.Lille$ExPosition.Data$eigs, tau = resMCA.Lille$ExPosition.Data$t) # make the maps cb.1.BaseMap.Fj.Lille <- BaseMap.Fj.Lille$zeMap + labels4MCA.Lille cb.2.BaseMapNoDot.Fj.Lille <- BaseMap.Fj.Lille$zeMap_background + BaseMap.Fj.Lille$zeMap_text + labels4MCA.Lille
print(cb.1.BaseMap.Fj.Lille)
The graphics are saved as a powerpoint called
'cheeseFrom2Cities.pptx'
with the following command
name4Graphs = 'cheeseFrom2Cities.pptx' list2Graphs <- PTCA4CATA::saveGraph2pptx(file2Save.pptx = name4Graphs, title = 'Attitudes toward Cheese', addGraphNames = TRUE)
Note that we could also have created a powerpoint with
Rmarkdown
by using the following options in the
preamble:
output: powerpoint_presentation: slide_level: 4
instead of (for example):
output: rmarkdown::html_vignette: toc: true number_sections: true
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.