InDetail5-AnalysisExample1"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6, fig.height = 6
)

Introduction

This tutorial goes into further details demonstrating the use of the package, covering functions in r/02_data_exploration.R and r/03_concept_network.R. The intention is to provide an example where some initial analysis of an open-ended survey question is used to inform the settings and concept words of the Concept Network plot. Then, we demonstrate how the Concept Network settings can be fine-tuned to produce plots which are more persuasive visualisations of your findings.

Installation of package.

Once the package is installed, you can load the finnsurveytext package as below: (Other required packages such as dplyr and stringr will also be installed if they are not currently installed in your environment.)

library(finnsurveytext)

Overview of Functions

The functions covered in this tutorial are:

  1. fst_summarise()
  2. fst_pos()
  3. fst_length_summary()
  4. fst_freq()
  5. fst_ngrams()
  6. fst_wordcloud()
  7. fst_concept_network()

The Data

We're looking again at the FSD2821 Young People's Views on Development Cooperation 2012 (FSD2821 Nuorten ajatuksia kehitysyhteistyöstä 2012) survey which is provided as sample data with the package.

Development Cooperation Data

There are 3 open-ended questions we look at within this data and we will set the threshold and other function arguments to be suit each question.

The open-ended questions

Demonstration of analysis

Format

First, we will format our data data.

dev_coop <- dev_coop

q11_1 <- fst_prepare(data = dev_coop,
                     question = 'q11_1',
                     id = 'fsd_id',
                     model = "ftb",
                     stopword_list = "nltk",
                     weights = 'paino',
                     add_cols = NULL,
                     manual = FALSE,
                     manual_list = "")

q11_2 <- fst_prepare(data = dev_coop,
                     question = 'q11_2',
                     id = 'fsd_id',
                     model = "ftb",
                     stopword_list = "nltk",
                     weights = 'paino',
                     add_cols = NULL,
                     manual = FALSE,
                     manual_list = "")

q11_3 <- fst_prepare(data = dev_coop,
                     question = 'q11_3',
                     id = 'fsd_id',
                     model = "ftb",
                     stopword_list = "nltk",
                     weights = 'paino',
                     add_cols = NULL,
                     manual = FALSE,
                     manual_list = "")

Here we'll print the first 10 rows of the the raw data and then one of the the processed datasets (Q11_1).

knitr::kable(head(dev_coop, 5))

knitr::kable(head(q11_1, 10))

Initial EDA

To begin with, let's look at the differences between the types of reponses we get for each of these open-ended questions by using some of the EDA functions. (For further details into these functions, see "Tutorial2-Data_Exploration")

First, let's consider these functions:

  1. fst_summarise() - This table will indicate response rate, and word counts.
  2. fst_pos() - This table counts the number and proportion of words with each part-of-speech tag.
  3. fst_length_summary() - This table gives the average and quartile values of lengths of responses in words and sentences.

The differences in the types of responses can be seen in the exploratory data analysis results below:

Summarise

knitr::kable(
  fst_summarise(data = q11_1, desc = "Q11_1")
)
knitr::kable(
  fst_summarise(data = q11_2, desc = "Q11_2")
)
knitr::kable(
  fst_summarise(data = q11_3, desc = "Q11_3")
)

Remarks:

POS Summary

knitr::kable(
  fst_pos(data = q11_1)
)
knitr::kable(
  fst_pos(data = q11_2)
)
knitr::kable(
  fst_pos(data = q11_3)
)

Remarks:

Length Summary

knitr::kable(
  fst_length_summary(
    data = q11_1,
    desc = "Q11_1",
    incl_sentences = TRUE
  )
)
knitr::kable(
  fst_length_summary(
    data = q11_2,
    desc = "Q11_2",
    incl_sentences = TRUE
  )
)
knitr::kable(
  fst_length_summary(
    data = q11_3,
    desc = "Q11_2",
    incl_sentences = TRUE
  )
)

Remarks:

Most common words and n-grams

Now, lets create some tables of the most common words, bigrams and trigrams using the following functions:

  1. fst_freq()
  2. fst_ngrams()

Top Words

fst_freq(
  data = q11_1,
  number = 10,
  norm = "number_words",
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  strict = TRUE,
  name = "Q11_1"
)

fst_freq(
  data = q11_2,
  number = 10,
  norm = "number_words",
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  strict = TRUE,
  name = "Q11_2"
)

fst_freq(
  data = q11_3,
  number = 10,
  norm = "number_words",
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  strict = TRUE,
  name = "Q11_3"
)

(We leave the default for norm her so "occurrence" is based on the number of words in the set of responses for each question.)

Here we can see that there is a clear 'top' set of words in each question:

Bi-grams and Trigrams

Here we look at common sets of two or three words.

fst_ngrams(
  data = q11_1,
  number = 10,
  ngrams = 2,
  norm = "number_words",
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  strict = TRUE,
  name = "Q11_1"
)
fst_ngrams(
  data = q11_1,
  number = 10,
  ngrams = 3,
  norm = "number_words",
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  strict = TRUE,
  name = "Q11_1"
)
fst_ngrams(
  data = q11_2,
  number = 10,
  ngrams = 2,
  norm = "number_words",
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  strict = TRUE,
  name = "Q11_2"
)
fst_ngrams(
  data = q11_2,
  number = 10,
  ngrams = 3,
  norm = "number_words",
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  strict = TRUE,
  name = "Q11_2"
)
fst_ngrams(
  data = q11_3,
  number = 10,
  ngrams = 2,
  norm = "number_words",
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  strict = TRUE,
  name = "Q11_3"
)
fst_ngrams(
  data = q11_3,
  number = 10,
  ngrams = 3,
  norm = "number_words",
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  strict = TRUE,
  name = "Q11_3"
)

Recall the open-ended questions

Common themes raised in bigrams and trigrams:

Wordcloud

fst_wordcloud()

We can see the most frequent words identified above also coming out in the wordclouds.

fst_wordcloud(
  data = q11_1,
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  max = 100
)

fst_wordcloud(
  data = q11_2,
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  max = 100
)

fst_wordcloud(
  data = q11_3,
  pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
  max = 100
)

Exploring some concept networks based on thematic frequently occurring words

fst_concept_network()

Based on the most frequently-occurring words, we have chosen some lists of "concepts" for each question and will create Concept Networks based on these. You can see below how the threshold can be used to make sure the concept network isn't too "busy" by removing less frequent connection words. Similarly, you can see how the length of the "concepts" list impacts the "busyness" of the plot. Often, the "concept" list and appropriate threshold will be the product of trial and error.

Q11_1

fst_concept_network(
  data = q11_1,
  concepts = "elintaso, köyhä",
  threshold = NULL,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_1 - No threshold"
)
fst_concept_network(
  data = q11_1,
  concepts = "elintaso, köyhä",
  threshold = 3,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_1 - Threshold = 3"
)
fst_concept_network(
  data = q11_1,
  concepts = "elintaso, köyhä, ihminen",
  threshold = NULL,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_1 - No threshold"
)
fst_concept_network(
  data = q11_1,
  concepts = "elintaso, köyhä, ihminen",
  threshold = 3,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_1 - Threshold = 3"
)

Remarks:

Here, our first plot may be the best. We can see that a threshold is not required as there are not too many words displayed but we can get some insight into the use of these words (in English, they are 'standard-of-living' and 'poor'). As you can see in plots 3 and 4, when including the most common word, 'ihminen' ('human being' or 'man'), a threshold such as 3 is advisable.

Q11_2

fst_concept_network(
  data = q11_2,
  concepts = "kehitysmaa, auttaa, pyrkiä, maa, ihminen",
  threshold = NULL,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_2 - No threshold"
)
fst_concept_network(
  data = q11_2,
  concepts = "kehitysmaa, auttaa, pyrkiä, maa, ihminen",
  threshold = 10,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_2 - Threshold = 10"
)
fst_concept_network(
  data = q11_2,
  concepts = "kehitysmaa, auttaa, pyrkiä, maa, ihminen",
  threshold = 5,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_2 - Threshold = 5"
)

fst_concept_network(
  data = q11_2,
  concepts = "kehitysmaa, auttaa, pyrkiä",
  threshold = NULL,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_2 - No threshold"
)
fst_concept_network(
  data = q11_2,
  concepts = "kehitysmaa, auttaa, pyrkiä",
  threshold = 5,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_2 - Threshold = 5"
)

Remarks:

In Q11_2, plots 1-3 show that if we include all 5 of our words ('developing country', 'help', 'strive', 'country', 'man') or a subset that a threshold is required, but that a threshold of 10 is too large to gain additional words in the Network. Here threshold = 5 seems appropriate.

Q11_3

fst_concept_network(
  data = q11_3,
  concepts = "köyhyys",
  threshold = NULL,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_3 - köyhyys / No threshold"
)
fst_concept_network(
  data = q11_3,
  concepts = "köyhyys, puute",
  threshold = NULL,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_3 - köyhyys, puute / No threshold"
)
fst_concept_network(
  data = q11_3,
  concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute",
  threshold = NULL,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_3 - No threshold"
)
fst_concept_network(
  data = q11_3,
  concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute",
  threshold = 3,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_3 - Threshold = 3"
)
fst_concept_network(
  data = q11_3,
  concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute",
  threshold = 2,
  norm = "number_words",
  pos_filter = NULL,
  title = "Q11_3 - Threshold = 2"
)

Remarks:

In Q11_3, we can see that if your concept word list is short (such as a single word) thresholds generally are not required, but that with longer concept lists, setting a threshold = 2 or threshold = 3, we can simplify and improve a plot that is a little too crowded in this case. The "best" threshold is generally a matter of context and at the analyst's discretion.

Conclusion

From the above, we can see that different settings create different Concept Networks. There is no "right" setting for a Concept Network, so it is worthwhile exploring the Concept Networks that result from different concept words and thresholds to investigate the data and identify trends which warrant further analysis. Initial Concept Network settings can be informed by other functions in finnsurveytext, such as choosing the most frequent words/n-grams or considering insights from wordclouds.

Citation

Finnish Children and Youth Foundation: Young People's Views on Development Cooperation 2012 [dataset]. Version 2.0 (2019-01-22). Finnish Social Science Data Archive distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD2821

unlink('finnish-ftb-ud-2.5-191206.udpipe')


Try the finnsurveytext package in your browser

Any scripts or data that you put into this service are public.

finnsurveytext documentation built on April 4, 2025, 5:07 a.m.