knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 6 )
Exploratory Data Analysis (EDA) is a common activity once data has been cleaned and prepared. EDA involves running functions which allow you to better understand the responses and begin to formulate initial hypotheses based on the data.
This tutorial follows on from Tutorial 1 and guides you through an EDA of data which has been prepared into CoNLL-U format. These EDA functions are contained in r/02_data_exploration.R.
Once the package is installed, you can load the finnsurveytext package as below:
(Other required packages such as dplyr and stringr will also be installed if they are not currently installed in your environment.)
library(finnsurveytext)
The functions covered in this tutorial are:
fst_summarise_short()
fst_summarise()
fst_pos()
fst_length_summary()
fst_freq_table()
fst_ngrams_table()
fst_ngrams_table2()
fst_freq_plot()
fst_ngrams_plot()
fst_wordcloud()
fst_freq()
fst_ngrams()
There are two sets of data files available within the package which could be used in this tutorial. These files have been created following the process demonstrated in Tutorial 1.
You can read these in as follows:
df1 <- fst_child
df2 <- fst_dev_coop
fst_summarise_short() and fst_summarise()
The first function creates a simple summary table for the data that shows the total number of words, the number of unique words, and the number of unique lemmas. You can either view the table in the console or assign it to a variable.
The second function adds information about the number and proportion of survey respondents who answered the question.
fst_summarise_short(data = df1)
summary_table <- fst_summarise(data = df2, desc = "All")
knitr::kable(summary_table)
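To make the quantities in these summary tables concrete, the following base R sketch computes them by hand on a tiny made-up data frame. It is not the package's implementation: the column names (doc_id, token, lemma), the toy rows, and the n_surveyed count are all assumptions chosen for illustration.

```r
# Toy data in a CoNLL-U-like layout: one row per token.
# Column names and values are hypothetical, for illustration only.
toy <- data.frame(
  doc_id = c("r1", "r1", "r1", "r2", "r2", "r3"),
  token  = c("koirat", "juoksevat", "nopeasti", "koira", "juoksee", "kissa"),
  lemma  = c("koira", "juosta", "nopeasti", "koira", "juosta", "kissa"),
  stringsAsFactors = FALSE
)

# Hypothetical: 4 people were surveyed, 3 of them gave a response.
n_surveyed <- 4

summary_sketch <- data.frame(
  respondents   = length(unique(toy$doc_id)),               # responses present
  proportion    = length(unique(toy$doc_id)) / n_surveyed,  # share who answered
  total_words   = nrow(toy),                                # one row per word
  unique_words  = length(unique(tolower(toy$token))),       # distinct word forms
  unique_lemmas = length(unique(toy$lemma))                 # distinct base forms
)
print(summary_sketch)
```

The gap between unique words and unique lemmas shows how much inflection there is in the responses, which is typically large for Finnish text.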
fst_summarise_short() and fst_summarise() take 1 argument:
data is the output of data preparation: prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
fst_summarise() takes an optional second argument:
desc is an optional string describing respondents. This description is included in the first column of the table. If not defined, it defaults to 'All respondents'.
fst_pos()
This function creates a table which counts the number and proportion of words of each part-of-speech (POS) tag within the data. Again, you can either view the table in the console, or define a variable which will contain this table.
fst_pos(data = df1)
pos_table <- fst_pos(data = df2)
knitr::kable(pos_table)
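As a rough illustration of what a POS summary contains, the sketch below counts tags and computes their proportions in base R. The upos vector is made-up data, and the layout of the resulting table is an assumption rather than the exact output of fst_pos().

```r
# Hypothetical universal POS tags, one per word in the responses.
upos <- c("NOUN", "VERB", "ADV", "NOUN", "VERB", "NOUN", "ADJ", "NOUN")

# Count each tag, then express the counts as a share of all words.
counts <- table(upos)
pos_sketch <- data.frame(
  upos       = names(counts),
  count      = as.integer(counts),
  proportion = round(as.integer(counts) / length(upos), 3),
  stringsAsFactors = FALSE
)

# Most frequent tags first; ties are broken alphabetically.
pos_sketch <- pos_sketch[order(-pos_sketch$count, pos_sketch$upos), ]
print(pos_sketch)
```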
fst_pos() takes 1 argument:
data is the output of data preparation: prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
fst_length_summary()
This function creates a table which summarises the distribution of lengths in the responses. Again, you can either view the table in the console, or define a variable which will contain this table.
fst_length_summary(data = df1, desc = "All Children")
length_table <- fst_length_summary(data = df2, incl_sentences = FALSE)
knitr::kable(length_table)
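The idea of a length summary can be sketched in base R as follows. The toy data, the column names (doc_id, sentence_id, token), and the particular statistics shown (minimum, median, mean, maximum) are assumptions for illustration; the table returned by fst_length_summary() may report different measures.

```r
# Toy token-level data: one row per word, with response and sentence ids.
toy <- data.frame(
  doc_id      = c("r1", "r1", "r1", "r1", "r2", "r2", "r3"),
  sentence_id = c(1, 1, 2, 2, 1, 1, 1),
  token       = c("a", "b", "c", "d", "e", "f", "g"),
  stringsAsFactors = FALSE
)

# Length of each response, measured in words and in sentences.
words_per_resp <- tapply(toy$token, toy$doc_id, length)
sents_per_resp <- tapply(toy$sentence_id, toy$doc_id,
                         function(x) length(unique(x)))

# A few summary statistics of a length distribution.
describe <- function(x) {
  x <- as.numeric(x)
  c(min = min(x), median = median(x), mean = mean(x), max = max(x))
}

length_sketch <- rbind(
  words     = describe(words_per_resp),
  sentences = describe(sents_per_resp)
)
print(length_sketch)
```

Dropping the sentences row from this matrix corresponds to the idea behind incl_sentences = FALSE.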
fst_length_summary() takes 3 arguments:
data is the output of data preparation: prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
desc is an optional string describing respondents. If not defined, it is left blank in the table, so the 'Description' column only shows whether a row contains data for words or sentences.
incl_sentences is a boolean indicating whether to include sentence data in the table; the default is TRUE. If incl_sentences = TRUE, the table also provides length information for the number of sentences within responses. If incl_sentences = FALSE, the table shows results only for the number of words in responses.
Next we will demonstrate some functions which are used to create tables and plots of the most frequent words and n-grams occurring in the data. An n-gram is a sequence of n successive words in the data.
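The definition of an n-gram can be illustrated with a few lines of base R. The helper make_ngrams() below is hypothetical and is not part of finnsurveytext; it simply slides a window of n words over a vector of tokens from a single response.

```r
# Build all n-grams (as space-separated strings) from a vector of tokens.
make_ngrams <- function(tokens, n) {
  # A response shorter than n words contains no n-grams.
  if (length(tokens) < n) {
    return(character(0))
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

tokens <- c("koira", "juoksee", "nopeasti", "kotiin")
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A response of four words therefore yields three bigrams and two trigrams, and with n = 1 the n-grams are just the individual words.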
fst_freq_table()
This function creates a table of the most frequently occurring words in the data (noting that "stopwords" may have been removed in previous data preparation steps).
The counts in the top-words table can be normalised if you choose. The variable norm is the method for normalising the data. Valid settings are 'number_words' (the number of words in the responses), 'number_resp' (the number of responses), or NULL (raw count returned, default).
Optionally, you can indicate which POS tags to include.
In this function, you must decide how ties are handled using the variable strict. Words with equal occurrence are presented in alphabetical order. By default, words are presented in order up to the number cutoff word. This means that equally occurring words that fall later in the alphabet than the cutoff word will not be displayed. Alternatively, you can decide that the cutoff is not strict, in which case all words occurring as often as the number cutoff word will be displayed.
(fst_freq_table() will print a message regarding this decision.)
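The difference between a strict and a non-strict cutoff is easiest to see on a small example. The base R sketch below mimics the behaviour described above on made-up counts; top_n_sketch() is a hypothetical helper, not the package's code.

```r
# Made-up word counts with a three-way tie at n = 2.
freq <- data.frame(
  word = c("koti", "perhe", "koulu", "aika", "raha"),
  n    = c(5, 3, 2, 2, 2),
  stringsAsFactors = FALSE
)

top_n_sketch <- function(freq, number, strict = TRUE) {
  # Most frequent first; equally frequent words in alphabetical order.
  freq <- freq[order(-freq$n, freq$word), ]
  if (strict) {
    # Cut off at exactly `number` rows, even if that splits a tie.
    head(freq, number)
  } else {
    # Keep every word that occurs as often as the word at the cutoff.
    cutoff <- freq$n[min(number, nrow(freq))]
    freq[freq$n >= cutoff, ]
  }
}

strict_words <- top_n_sketch(freq, number = 3, strict = TRUE)$word
loose_words  <- top_n_sketch(freq, number = 3, strict = FALSE)$word
print(strict_words)
print(loose_words)
```

With the strict cutoff only "aika" survives from the tied group because it comes first alphabetically, whereas the non-strict cutoff returns all five words even though only three were requested.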
We run the functions as follows:
fst_freq_table(data = df1)
fst_freq_table(
  data = df1,
  number = 15,
  norm = NULL,
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  strict = FALSE,
  use_svydesign_weights = FALSE,
  id = "",
  svydesign = NULL,
  use_column_weights = FALSE
)
table1 <- fst_freq_table(data = df2, number = 5)
knitr::kable(table1)
table2 <- fst_freq_table(
  data = df2, number = 5, norm = "number_resp",
  pos_filter = c("NOUN", "VERB"), strict = FALSE
)
knitr::kable(table2)
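The effect of the norm setting can be sketched in base R. This assumes that normalising means dividing each raw count by the relevant total, which is inferred from the descriptions of 'number_words' and 'number_resp' and is not confirmed from the package source. The counts, totals, and the helper normalise_sketch() are hypothetical.

```r
# Made-up raw counts and totals for one survey question.
counts  <- c(koti = 6, perhe = 3, koulu = 3)
n_words <- 60  # total number of words across all responses
n_resp  <- 20  # total number of responses

normalise_sketch <- function(counts, norm = NULL, n_words, n_resp) {
  # NULL means "no normalisation": return the raw counts unchanged.
  if (is.null(norm)) {
    return(counts)
  }
  switch(norm,
    number_words = counts / n_words,  # occurrences per word in the data
    number_resp  = counts / n_resp,   # occurrences per response
    stop("norm must be 'number_words', 'number_resp', or NULL")
  )
}

raw      <- normalise_sketch(counts, NULL, n_words, n_resp)
per_word <- normalise_sketch(counts, "number_words", n_words, n_resp)
per_resp <- normalise_sketch(counts, "number_resp", n_words, n_resp)
print(per_word)
print(per_resp)
```

Normalised values are useful when comparing groups of different sizes, since raw counts are naturally larger in the group with more responses.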
fst_freq_table() takes the following arguments:
data is the output of data preparation: prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
number is the number of top words to return; the default is 10, which means that the top 10 words will be returned.
norm is the method for normalising the data. Valid settings are 'number_words' (the number of words in the responses), 'number_resp' (the number of responses), or NULL (raw count returned, default).
pos_filter is an optional list of POS tags to include, such as c("NOUN", "VERB", "ADJ", "ADV"). The default is NULL, in which case all words in the data are considered.
strict is a boolean that determines how the function deals with ties. If strict = TRUE, the table cuts off at exactly number words (words are presented in alphabetical order, so equally occurring words that come later in the alphabet than the word at number are not shown). If strict = FALSE, the table shows any words that occur as frequently as the number cutoff word.
use_svydesign_weights is a boolean for whether to get weights for the responses from a svydesign object.
id is the ID field. If weights come from a svydesign object, id must not be empty, as it is used to join the data.
svydesign is the named svydesign object, if weights come from a svydesign object.
use_column_weights is a boolean for whether weights have already been included in the formatted data and should be used.
fst_ngrams_table()
Similar to fst_freq_table(), this function creates a table of the most frequently occurring n-grams in the data (noting that "stopwords" may have been removed in previous data preparation steps).
The counts in the top n-grams table can be normalised if you choose. The variable norm is the method for normalising the data. Valid settings are 'number_words' (the number of words in the responses), 'number_resp' (the number of responses), or NULL (raw count returned, default).
Optionally, you can indicate which POS tags to include.
In this function, you must decide how ties are handled using the variable strict. N-grams with equal occurrence are presented in alphabetical order. By default, n-grams are presented in order up to the number cutoff n-gram. This means that equally occurring n-grams that fall later in the alphabet than the cutoff n-gram will not be displayed. Alternatively, you can decide that the cutoff is not strict, in which case all n-grams occurring as often as the number cutoff n-gram will be displayed.
(fst_ngrams_table() will print a message regarding this decision. There is another function, fst_ngrams_table2(), which doesn't print a message. This function is used within the comparison functions in 04_comparison_functions.R.)
We run the functions as follows:
fst_ngrams_table(data = df1, ngrams = 2)
fst_ngrams_table(data = df1, ngrams = 2, norm = "number_words", strict = FALSE)
table3 <- fst_ngrams_table(data = df2, number = 15, ngrams = 3)
knitr::kable(table3)
table4 <- fst_ngrams_table(
  data = df2, number = 15, ngrams = 2,
  pos_filter = c("NOUN", "VERB"), strict = FALSE
)
knitr::kable(table4)
fst_ngrams_table() has the same setup as fst_freq_table() plus an additional argument ngrams:
data is the output of data preparation: prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
number is the number of top n-grams to return; the default is 10, which means that the top 10 n-grams will be returned.
ngrams is the type of n-grams. The default is 1 (top words). Set ngrams = 2 to get bigrams, ngrams = 3 to get trigrams, etc.
norm is the method for normalising the data. Valid settings are 'number_words' (the number of words in the responses), 'number_resp' (the number of responses), or NULL (raw count returned, default).
pos_filter is an optional list of POS tags to include, such as c("NOUN", "VERB", "ADJ", "ADV"). The default is NULL, in which case all words in the data are considered.
strict is a boolean that determines how the function deals with ties. If strict = TRUE, the table cuts off at exactly number n-grams (n-grams are presented in alphabetical order, so equally occurring n-grams that come later in the alphabet than the n-gram at number are not shown). If strict = FALSE, the table shows any n-grams that occur as frequently as the number cutoff n-gram.
use_svydesign_weights, svydesign, id and use_column_weights are defined as above.
fst_freq_plot()
This function plots the results of fst_freq_table().
fst_freq_plot(table = table1, number = 5, name = "Table 1")
The arguments are:
table is the output of fst_freq_table() or fst_ngrams_table().
number is the number of words/n-grams to plot; the default is 10.
name is an optional "name" for the plot; the default is NULL.
fst_ngrams_plot()
This function plots the results of fst_ngrams_table().
fst_ngrams_plot(table = table3, number = 15, ngrams = 3, name = "Trigrams")
fst_ngrams_plot(table = table4, number = 15, ngrams = 2, name = "Bigrams")
The arguments are:
table is the output of fst_freq_table() or fst_ngrams_table().
number is the number of words/n-grams to plot; the default is 10.
name is an optional "name" for the plot; the default is NULL.
ngrams is the type of n-grams. You can also plot top words by using ngrams = 1.
fst_freq()
This function runs fst_freq_table() and fst_freq_plot() within one function:
fst_freq(data = df2, number = 12, strict = FALSE, name = "Q11_1")
The arguments are as defined in the component functions:
data is the output of data preparation: prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
number is the number of top words to return; the default is 10.
norm is the method for normalising the data. Valid settings are 'number_words' (the number of words in the responses), 'number_resp' (the number of responses), or NULL (raw count returned, default).
pos_filter is an optional list of POS tags to include, such as c("NOUN", "VERB", "ADJ", "ADV"). The default is NULL, in which case all words in the data are considered.
strict is a boolean that determines how the function deals with ties. If strict = TRUE, the table cuts off at exactly number words (words are presented in alphabetical order, so equally occurring words that come later in the alphabet than the word at number are not shown). If strict = FALSE, the table shows any words that occur as frequently as the number cutoff word.
name is an optional "name" for the plot; the default is NULL.
use_svydesign_weights, svydesign, id and use_column_weights are defined as above.
fst_ngrams()
This function runs fst_ngrams_table() and fst_ngrams_plot() within one function:
fst_ngrams(data = df1, number = 12, ngrams = 2)
The arguments are as defined in the component functions:
data is the output of data preparation: prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
number is the number of top n-grams to return; the default is 10.
ngrams is the type of n-grams. The default is 1 (top words). Set ngrams = 2 to get bigrams, ngrams = 3 to get trigrams, etc.
norm is the method for normalising the data. Valid settings are 'number_words' (the number of words in the responses), 'number_resp' (the number of responses), or NULL (raw count returned, default).
pos_filter is an optional list of POS tags to include, such as c("NOUN", "VERB", "ADJ", "ADV"). The default is NULL, in which case all words in the data are considered.
strict is a boolean that determines how the function deals with ties. If strict = TRUE, the table cuts off at exactly number n-grams (n-grams are presented in alphabetical order, so equally occurring n-grams that come later in the alphabet than the n-gram at number are not shown). If strict = FALSE, the table shows any n-grams that occur as frequently as the number cutoff n-gram.
use_svydesign_weights, svydesign, id and use_column_weights are defined as above.
fst_wordcloud()
This function will create a wordcloud plot for the data. There is an option to select only specific word types (POS tag).
fst_wordcloud(data = df1)
fst_wordcloud(
  data = df2,
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  max = 150
)
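To show how pos_filter and max might shape what appears in a word cloud, the base R sketch below filters a toy data frame by POS tag and keeps only the most frequent words. The helper wordcloud_input_sketch(), its max_words parameter (standing in for max), the choice to count lemmas, and the toy data are all assumptions for illustration; the sketch prepares counts only and does not draw anything.

```r
# Toy token-level data with lemmas and universal POS tags.
toy <- data.frame(
  lemma = c("koti", "olla", "koti", "iso", "ja", "perhe", "koti", "perhe", "olla"),
  upos  = c("NOUN", "AUX", "NOUN", "ADJ", "CCONJ", "NOUN", "NOUN", "NOUN", "AUX"),
  stringsAsFactors = FALSE
)

wordcloud_input_sketch <- function(data, pos_filter = NULL, max_words = 100) {
  # Keep only the requested word types, if a filter is given.
  if (!is.null(pos_filter)) {
    data <- data[data$upos %in% pos_filter, ]
  }
  counts <- table(data$lemma)
  # Most frequent first; ties broken alphabetically.
  counts <- counts[order(-as.integer(counts), names(counts))]
  # Keep at most `max_words` words.
  counts[seq_len(min(max_words, length(counts)))]
}

cloud_counts <- wordcloud_input_sketch(toy, pos_filter = c("NOUN", "ADJ"), max_words = 2)
print(cloud_counts)
```

Filtering to content words such as nouns and adjectives removes function words like "olla" and "ja" that would otherwise dominate the cloud.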
fst_wordcloud() takes 7 arguments:
data is the output of data preparation: prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
pos_filter is an optional list of POS tags for inclusion in the wordcloud. The default is NULL.
max is the maximum number of words to display; the default is 100.
use_svydesign_weights, svydesign, id and use_column_weights are defined as above.
EDA of open-ended survey questions can be conducted using the functions in r/02_data_exploration.R, such as finding the most frequent words and n-grams, summarising the length of responses and the words used, and visualising responses in word clouds. The results of this EDA can help researchers better understand their data, create hypotheses based on these initial insights, and inform future analysis of the surveys.
The Office of Ombudsman for Children: Child Barometer 2016 [dataset]. Version 1.0 (2016-12-09). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD3134
Finnish Children and Youth Foundation: Young People's Views on Development Cooperation 2012 [dataset]. Version 2.0 (2019-01-22). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD2821