Overview-PackageTutorial

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = FALSE, message = FALSE) 

Overview of finnsurveytext

This tutorial gives a simple overview of what is included in the finnsurveytext package and shows you how to use its main functions.

The table below lists all the functions included in the package. The main functions are outlined in the sections that follow.

| Section | Usage | Functions |
|:---:|---|---|
| 1. Data Preparation | Use the udpipe R package to clean and annotate the raw data into a standardised format (CoNLL-U) suitable for analysis. | fst_format(), fst_format_svydesign(), fst_print_available_models(), fst_find_stopwords(), fst_rm_stop_punct(), fst_prepare(), fst_prepare_svydesign() |
| 2. Data Exploration | Create wordclouds, n-gram tables and summary tables for initial insights into trends across responses. | fst_summarise_short(), fst_summarise(), fst_pos(), fst_length_summary(), fst_use_svydesign(), fst_freq_table(), fst_ngrams_table(), fst_ngrams_table2(), fst_freq_plot(), fst_ngrams_plot(), fst_freq(), fst_ngrams(), fst_wordcloud() |
| 3. Concept Network | Create a concept network using the textrank R package, with node size indicating word importance (PageRank) and edge weight showing co-occurrence of words. | fst_cn_search(), fst_cn_edges(), fst_cn_nodes(), fst_cn_plot(), fst_concept_network() |
| 4. Comparison Functions | Corresponding Data Exploration and Concept Network functions allowing for comparison between groups of survey respondents. | fst_pos_compare(), fst_summarise_compare(), fst_length_compare(), fst_get_unique_ngrams_separate(), fst_get_unique_ngrams(), fst_join_unique(), fst_ngrams_compare_plot(), fst_freq_compare(), fst_ngrams_compare(), fst_comparison_cloud(), fst_cn_get_unique_separate(), fst_cn_get_unique(), fst_cn_compare_plot(), fst_concept_network_compare() |
| 5. RShiny Demo App | A beta version of a UI for the package. | runDemo() |

0. Install and Load Package

First, the finnsurveytext package needs to be installed and loaded into your R environment. You may also want to load the survey package if you plan to use a svydesign object for the data and/or weights.

library(finnsurveytext)
library(survey)
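Before calling library(), it can help to check which of the two packages are actually installed. The following is a minimal base-R sketch of such a check. The variable names are illustrative, and it only reports what is missing rather than installing anything.

```r
# Packages this tutorial relies on.
needed <- c("finnsurveytext", "survey")

# requireNamespace() returns FALSE (rather than raising an error)
# when a package is not installed.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message(
    "Not installed yet: ", paste(missing_pkgs, collapse = ", "),
    ". Run install.packages() on these before continuing."
  )
}
```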

1. Data Preparation

The data preparation functions are used to take your raw survey data (in a dataframe or svydesign object within your R environment) and convert it into a standardised format ready for analysis.

The functions in the remaining sections require your data to be pre-formatted into this format.

(To learn more about the format we use, see the Universal Dependencies Project.)
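To give a feel for the standardised format, here is a small hand-written sketch of an annotated response with one row per token, followed by removal of punctuation and stopwords. The column names are a subset of those udpipe produces. The response text, the annotations and the three-word stopword list are invented for illustration and are not taken from the package or its data.

```r
# Hypothetical annotation of one short response, one row per token.
annotated <- data.frame(
  doc_id   = "1",
  sentence = "Sota ja puute.",
  token_id = as.character(1:4),
  token    = c("Sota", "ja", "puute", "."),
  lemma    = c("sota", "ja", "puute", "."),
  upos     = c("NOUN", "CCONJ", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Assumed stopword list for this sketch only (not the package's "nltk" list).
stopwords <- c("ja", "on", "se")

# Keep content words: drop punctuation tokens and stopword lemmas.
keep <- annotated$upos != "PUNCT" & !(annotated$lemma %in% stopwords)
cleaned <- annotated[keep, ]

print(cleaned[, c("doc_id", "token", "lemma", "upos")])
```

The analysis functions then work on rows like those in `cleaned`, typically on the lemma and upos columns.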

Option 1: Data is in a dataframe

The package comes with sample data. For this demonstration, we use dev_coop. The raw data looks like this:

data(dev_coop)
knitr::kable(head(dev_coop, 5))

We will look at question q11_3 (responses to 'Jatka lausetta: Maailman kolme suurinta ongelmaa ovat... (Avokysymys)', in English 'Continue the sentence: The world's three biggest problems are... (Open-ended question)') as our open-ended survey question. We also want to include our survey weights (in the 'paino' column) and bring in the gender and region columns so we can use these values to compare groups.

The main function here is fst_prepare()

# FUNCTION DEFINITION
fst_prepare <- function(data,
                        question,
                        id,
                        model = "ftb",
                        stopword_list = "nltk",
                        language = "fi",
                        weights = NULL,
                        add_cols = NULL,
                        manual = FALSE,
                        manual_list = "")

We can run the function as follows:

df <- fst_prepare(data = dev_coop,
                  question = 'q11_3', 
                  id = 'fsd_id', 
                  weights = 'paino',
                  add_cols = c('gender', 'region')
                  )

Summary of components

The formatted data looks like this:

knitr::kable(head(df, 2))

Option 2: Data is in a svydesign object

The other option is to get your data from a svydesign object from the survey package. The survey package is a popular package used for analysing surveys.

svy_dev <- survey::svydesign(id = ~1, weights = ~paino, data = dev_coop)

The main function here is fst_prepare_svydesign()

# FUNCTION DEFINITION
fst_prepare_svydesign <- function(svydesign,
                                  question,
                                  id,
                                  model = "ftb",
                                  stopword_list = "nltk",
                                  language = "fi",
                                  use_weights = TRUE,
                                  add_cols = NULL,
                                  manual = FALSE,
                                  manual_list = "") 

We can run the function as follows:

df2 <- fst_prepare_svydesign(svydesign = svy_dev,
                            question = 'q11_3', 
                            id = 'fsd_id', 
                            use_weights = TRUE,
                            add_cols = c('gender', 'region')
                            )

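The only differences between the previous function and this one are that the input is a svydesign object (svydesign) rather than a dataframe (data), and that weights are switched on or off with use_weights rather than named through weights.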

The formatted data looks like this (it should look very similar to the formatted data above):

knitr::kable(head(df2, 2))

2. Data Exploration

Now that we have formatted data, we can begin data exploration. These functions are used to create summary tables and to find the most common themes in your survey responses.

Summary Tables

First, let's create some summaries using fst_summarise(), fst_pos() and fst_length_summary().

These functions are defined as follows:

# FUNCTION DEFINITIONS
fst_summarise <- function(data, 
                          desc = "All respondents") 

fst_pos <- function(data) 

fst_length_summary <- function(data,
                               desc = "All respondents",
                               incl_sentences = TRUE) 

Summary of components

Hence, these functions are run for our sample data as follows:

fst_summarise(df)
fst_pos(df)
fst_length_summary(df)
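As a rough illustration of the kind of counts such summaries are built from, the sketch below computes respondent, word and part-of-speech counts in base R on a small invented token table. It is not the package's implementation, and the actual output columns of the three functions may differ.

```r
# Hypothetical annotated data: one row per token.
toy <- data.frame(
  doc_id = c("1", "1", "2", "2", "2", "3"),
  lemma  = c("sota", "puute", "sota", "ilmastonmuutos", "suuri", "puute"),
  upos   = c("NOUN", "NOUN", "NOUN", "NOUN", "ADJ", "NOUN"),
  stringsAsFactors = FALSE
)

# Overall counts across all responses.
summary_counts <- c(
  respondents   = length(unique(toy$doc_id)),
  total_words   = nrow(toy),
  unique_lemmas = length(unique(toy$lemma))
)
print(summary_counts)

# Number of tokens per part-of-speech tag.
pos_counts <- table(toy$upos)
print(pos_counts)

# Words per response, the basis of a simple length summary.
words_per_response <- tapply(toy$lemma, toy$doc_id, length)
print(summary(as.vector(words_per_response)))
```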

Identification of frequent words and phrases

Wordclouds

The first of our frequent-word visualisations is the wordcloud, which comes from the wordcloud package.

It is defined as follows:

# FUNCTION DEFINITION
fst_wordcloud <- function(data,
                          pos_filter = NULL,
                          max = 100,
                          use_svydesign_weights = FALSE,
                          id = "",
                          svydesign = NULL,
                          use_column_weights = FALSE)

Summary of components

Then, we have options for weighting the words in the cloud. By default, no weights are applied.

Here are some examples of creating wordclouds:

fst_wordcloud(df)
# We can only get weights from svydesign if they are NOT already in our formatted data. Hence we remove them for this demonstration!
df2$weight <- NULL
fst_wordcloud(df2, 
              pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
              max=100, 
              use_svydesign_weights = TRUE, 
              id = 'fsd_id', 
              svydesign = svy_dev)
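To see what weighting changes, the base-R sketch below contrasts plain word counts with counts in which each occurrence contributes its respondent's survey weight. The data and weights are invented, and this is only one plausible way of applying weights, not necessarily what the package does internally.

```r
# Hypothetical tokens with one survey weight per respondent,
# repeated on every row belonging to that respondent.
toy <- data.frame(
  doc_id = c("1", "1", "2", "3", "3"),
  lemma  = c("sota", "puute", "sota", "puute", "ilmastonmuutos"),
  weight = c(0.5, 0.5, 2.0, 1.5, 1.5),
  stringsAsFactors = FALSE
)

# Unweighted: every occurrence counts as 1.
unweighted <- sort(table(toy$lemma), decreasing = TRUE)

# Weighted: every occurrence counts as the respondent's weight.
weighted <- sort(tapply(toy$weight, toy$lemma, sum), decreasing = TRUE)

print(unweighted)
print(weighted)
```

In this toy data "sota" and "puute" are tied without weights, but "sota" comes out ahead once the heavily weighted respondent 2 is taken into account.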

N-gram plots

Then, we have functions to identify and plot the most frequent words or n-grams (sets of n words in order).

# FUNCTION DEFINITIONS
fst_freq <- function(data,
                     number = 10,
                     norm = NULL,
                     pos_filter = NULL,
                     strict = TRUE,
                     name = NULL,
                     use_svydesign_weights = FALSE,
                     id = "",
                     svydesign = NULL,
                     use_column_weights = FALSE)

fst_ngrams <- function(data,
                       number = 10,
                       ngrams = 1,
                       norm = NULL,
                       pos_filter = NULL,
                       strict = TRUE,
                       name = NULL,
                       use_svydesign_weights = FALSE,
                       id = "",
                       svydesign = NULL,
                       use_column_weights = FALSE)

Summary of components

We again have options for weighting the words in the plot and, as before, no weights are applied by default.

fst_freq(df)

fst_ngrams(df, 
           number = 9, 
           ngrams = 2, 
           strict = FALSE,
           use_column_weights = TRUE)

fst_freq(df,
         number = 5, 
         strict = FALSE)

(fst_freq_table() and fst_ngrams_table() can be used to instead create tables of the top words.)

fst_freq_table(df, number = 15, strict = FALSE)
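For readers new to n-grams, here is a minimal base-R sketch of how they can be built and counted. The helper make_ngrams() and the toy responses are hypothetical. The sketch forms n-grams within each response so that they never span two respondents, which is an assumption of this example and may differ from the package's own approach.

```r
# Build all n-grams (n consecutive words, in order) from a vector of words.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Hypothetical cleaned responses (lemmas only), one vector per respondent.
responses <- list(
  c("sota", "ilmastonmuutos", "puute"),
  c("sota", "ilmastonmuutos"),
  c("puute")
)

# Form bigrams per response, then count them across all responses.
bigrams <- unlist(lapply(responses, make_ngrams, n = 2))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(head(bigram_counts, 10))
```

With n = 1 the same helper returns the individual words, so a frequency table of single words is the special case of 1-grams.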

3. Concept Network

The finnsurveytext package currently contains our first iteration of a function which plots a concept network. These plots visualise keywords identified through the TextRank algorithm and map co-occurrences between these terms. Vertices represent words, with vertex size indicating word importance, and edges represent co-occurrence between words, with edge thickness indicating the number of co-occurrences. Word importance is determined recursively (through the unsupervised TextRank algorithm, a graph-based ranking model for text processing): a word gains weight based on how many words it co-occurs with and on the weight of those co-occurring words.

The concept network functions take search terms input by the user, and the algorithm then suggests other words that are related to these terms by co-occurrence. The input terms can be identified through functions in the package (such as fst_cn_search() or fst_freq_table()) or through other analysis conducted separately by the user. The concept network function can be used to identify concepts, which could be individual words or groups of co-occurring words, or to investigate a single 'concept' whose component words are identified within a single network plot.

To utilise the TextRank algorithm in finnsurveytext, we use the textrank package, which implements the TextRank and PageRank algorithms. (PageRank is the algorithm that Google uses to rank webpages.) For further information, see the textrank package documentation and the original descriptions of the TextRank and PageRank algorithms.
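The recursive idea behind word importance can be sketched with a small PageRank power iteration over a word co-occurrence graph. This is a base-R illustration on invented counts, not the code used by textrank or finnsurveytext. The function pagerank(), its damping value of 0.85 and the co-occurrence numbers are assumptions of this sketch, and it requires every word to have at least one co-occurrence.

```r
# Hypothetical symmetric co-occurrence counts between four words.
words <- c("sota", "puute", "ilmastonmuutos", "ruoka")
cooc <- matrix(0, 4, 4, dimnames = list(words, words))
cooc["sota", "puute"] <- 3
cooc["sota", "ilmastonmuutos"] <- 2
cooc["puute", "ruoka"] <- 1
cooc <- cooc + t(cooc)  # undirected graph: make the matrix symmetric

pagerank <- function(adj, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(adj)
  # Column-normalise so each word spreads its score over its neighbours
  # in proportion to edge weight (assumes no column sums to zero).
  transition <- sweep(adj, 2, colSums(adj), "/")
  rank <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    new_rank <- (1 - damping) / n + damping * as.vector(transition %*% rank)
    converged <- sum(abs(new_rank - rank)) < tol
    rank <- new_rank
    if (converged) break
  }
  setNames(rank, rownames(adj))
}

scores <- pagerank(cooc)
print(round(sort(scores, decreasing = TRUE), 3))
```

A word scores highly when it co-occurs often and when the words it co-occurs with also score highly, which is the recursive weighting described above.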

The main concept network function is fst_concept_network(). It is defined as follows:

# FUNCTION DEFINITIONS
fst_concept_network <- function(data,
                                concepts,
                                threshold = NULL,
                                norm = NULL,
                                pos_filter = NULL,
                                title = NULL) 

Summary of components

For example, we can create the following concept network plots:

fst_concept_network(df, 
                    concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute"
                    )

4. Comparison Functions

Recall that when we preprocessed the data, we included two additional columns, gender and region, to allow for comparison between respondents based on these values.

There are counterpart comparison functions for each of the functions we have shown above.

The comparison summary tables are defined as follows:

fst_pos_compare <- function(data,
                            field,
                            exclude_nulls = FALSE,
                            rename_nulls = 'null_data')

fst_summarise_compare <- function(data,
                                  field,
                                  exclude_nulls = FALSE,
                                  rename_nulls = 'null_data')

fst_length_compare <- function(data,
                               field,
                               incl_sentences = TRUE,
                               exclude_nulls = FALSE,
                               rename_nulls = 'null_data') 

Summary of Components

Let's compare our responses based on the region of the respondent:

knitr::kable(fst_pos_compare(df, 'region'))

knitr::kable(fst_summarise_compare(df, 'region'))

knitr::kable(fst_length_compare(df, 'region'))
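Conceptually, a comparison table applies the same summary to each group of respondents. The base-R sketch below does this on invented data with split(). Treating missing group values by relabelling them as 'null_data' mirrors what the rename_nulls default suggests, but that reading of the argument is an assumption here.

```r
# Hypothetical tokens with a grouping column; one respondent has no region.
toy <- data.frame(
  doc_id = c("1", "1", "2", "3", "3", "3", "4"),
  lemma  = c("sota", "puute", "sota", "puute", "ilmastonmuutos", "sota", "ruoka"),
  region = c("north", "north", "south", "south", "south", "south", NA),
  stringsAsFactors = FALSE
)

# Give missing group values an explicit label instead of dropping them.
toy$region[is.na(toy$region)] <- "null_data"

# Split once by group, then compute the same summary for each piece.
by_region <- split(toy, toy$region)
comparison <- data.frame(
  region        = names(by_region),
  respondents   = vapply(by_region, function(d) length(unique(d$doc_id)), integer(1)),
  total_words   = vapply(by_region, nrow, integer(1)),
  unique_lemmas = vapply(by_region, function(d) length(unique(d$lemma)), integer(1)),
  row.names = NULL
)
print(comparison)
```

Dropping the 'null_data' rows before the split would correspond to excluding respondents with no group value.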

The ngrams comparison functions are defined similarly (with some additional new values):

# FUNCTION DEFINITIONS
fst_freq_compare <- function(data,
                             field,
                             number = 10,
                             norm = NULL,
                             pos_filter = NULL,
                             strict = TRUE,
                             use_svydesign_weights = FALSE,
                             id = "",
                             svydesign = NULL,
                             use_column_weights = FALSE,
                             exclude_nulls = FALSE,
                             rename_nulls = 'null_data',
                             unique_colour = "indianred",
                             title_size = 20,
                             subtitle_size = 15)


fst_ngrams_compare <- function(data,
                              field,
                              number = 10,
                              ngrams = 1,
                              norm = NULL,
                              pos_filter = NULL,
                              strict = TRUE,
                              use_svydesign_weights = FALSE,
                              id = "",
                              svydesign = NULL,
                              use_column_weights = FALSE,
                              exclude_nulls = FALSE,
                              rename_nulls = 'null_data',
                              unique_colour = "indianred",
                              title_size = 20,
                              subtitle_size = 15)

The new components, compared with the functions shown so far, are unique_colour, title_size and subtitle_size.

For the ngrams, let's compare respondents by gender.

fst_freq_compare(df, 
                 'gender', 
                 use_column_weights = TRUE,
                 exclude_nulls = TRUE)

fst_ngrams_compare(df, 
                   'gender', 
                   ngrams = 2, 
                   use_column_weights = TRUE, 
                   exclude_nulls = TRUE)
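One thing a comparison of top words can surface is which words appear in only one group's list. The following base-R sketch shows that idea with setdiff(). The group names and word lists are invented, and whether the package defines "unique" in exactly this way is an assumption.

```r
# Hypothetical top words per group (e.g. from separate frequency tables).
top_words <- list(
  group_a = c("sota", "puute", "ilmastonmuutos"),
  group_b = c("sota", "ruoka", "ilmastonmuutos")
)

# A word is "unique" to a group if it appears in no other group's top list.
unique_to_group <- lapply(names(top_words), function(g) {
  others <- unlist(top_words[names(top_words) != g], use.names = FALSE)
  setdiff(top_words[[g]], others)
})
names(unique_to_group) <- names(top_words)

print(unique_to_group)
```

Words found this way are natural candidates for highlighting in a different colour in a side-by-side plot.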

The comparison cloud extends the wordcloud concept.

A comparison cloud compares the relative frequency with which a term is used in two or more documents. This cloud shows words that occur more regularly in responses from a specific type of respondent. For more information about comparison clouds, see the documentation for comparison.cloud() in the wordcloud package.
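The relative-frequency comparison can be sketched in a few lines of base R, following the description in the wordcloud documentation: each word's rate within a group is compared with its average rate across groups, and the largest deviation decides where the word is placed and how big it is drawn. The counts and group names below are invented, and the sketch omits all plotting.

```r
# Hypothetical word counts: rows are words, columns are respondent groups.
counts <- matrix(
  c(6, 2,
    3, 3,
    1, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("sota", "puute", "ilmastonmuutos"), c("group_a", "group_b"))
)

# Rate at which each word occurs within each group.
rates <- sweep(counts, 2, colSums(counts), "/")

# Deviation of each group's rate from the word's average rate across groups.
deviation <- rates - rowMeans(rates)

# Each word is assigned to the group where its deviation is largest,
# and that maximum deviation determines its size in the cloud.
assigned_group <- colnames(counts)[apply(deviation, 1, which.max)]
size <- apply(deviation, 1, max)

print(data.frame(word = rownames(counts), group = assigned_group, size = round(size, 3)))
```

A word used at the same rate in every group, such as "puute" here, has a deviation of zero and so carries no weight in the comparison.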

The comparison cloud is defined as follows, with settings as defined for the previous functions:

# FUNCTION DEFINITION
fst_comparison_cloud <- function(data,
                                 field,
                                 pos_filter = NULL,
                                 norm = NULL,
                                 max = 100,
                                 use_svydesign_weights = FALSE,
                                 id = "",
                                 svydesign = NULL,
                                 use_column_weights = FALSE,
                                 exclude_nulls = FALSE,
                                 rename_nulls = "null_data") 

Thus, we can create comparison clouds:

fst_comparison_cloud(df, 'gender', max = 40, use_column_weights = TRUE, exclude_nulls = TRUE)

Finally, we have the comparison concept network. Its components should be familiar from the previous functions:

# FUNCTION DEFINITION
fst_concept_network_compare <- function(data,
                                        concepts,
                                        field,
                                        norm = NULL,
                                        threshold = NULL,
                                        pos_filter = NULL,
                                        exclude_nulls = FALSE,
                                        rename_nulls = 'null_data',
                                        title_size = 20,
                                        subtitle_size = 15)

We run the comparison concept network as follows:

fst_concept_network_compare(df, 
                            concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute", 
                            field = 'gender',
                            exclude_nulls = TRUE
                            )

For more information on the finnsurveytext functions, see the package website and the documentation available from CRAN.

Data

The package comes with sample data from two Finnish surveys obtained from the Finnish Social Science Data Archive and a survey in English available from GESIS:

1. Child Barometer Data

2. Development Cooperation Data

3. Patient Joe (open-ended question)

unlink('finnish-ftb-ud-2.5-191206.udpipe')



finnsurveytext documentation built on April 4, 2025, 5:07 a.m.