InDetail1-DataPreparation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Many natural language processing (NLP) tasks require data which is systematically pre-processed into a format useful for analysis. Pre-processing commonly involves activities such as:

Installation of package.

Once the package is installed, you can load finnsurveytext as shown below. (Other required packages, such as dplyr and stringr, will also be installed if they are not already present in your environment.)

library(finnsurveytext)

Overview of Functions

The functions covered in this tutorial are:

  1. fst_format()
  2. fst_find_stopwords()
  3. fst_rm_stop_punct()
  4. fst_prepare()

Data

This tutorial uses two sources of data from the Finnish Social Science Data Archive:

1. Child Barometer Data

2. Development Cooperation Data

Both of these will be demonstrated below but either can be used to complete the tutorial. If you would prefer to use your own data, you can read in this data through read.csv() or similar so that you have a 'raw' dataframe ready in your R environment.
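If you are bringing your own data, the read-in step might look like the sketch below. It uses only base R. The toy responses, the temporary file, and the column names fsd_id and q7 are assumptions made for illustration; substitute your own file path and column headers.

```r
# Hypothetical toy survey: one ID column and one open-ended question column.
toy <- data.frame(
  fsd_id = 1:3,
  q7 = c("Koulussa kiusataan liikaa.", "En osaa sanoa.", NA),
  stringsAsFactors = FALSE
)

# Write the toy data to a temporary CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE, fileEncoding = "UTF-8")

# Read it back as a 'raw' dataframe. Keeping text as character (not factor)
# and declaring UTF-8 helps preserve Finnish characters such as ä and ö.
raw <- read.csv(
  csv_path,
  stringsAsFactors = FALSE,
  fileEncoding = "UTF-8",
  na.strings = c("NA", "")
)

# Check that the question column is character and see how many responses are missing.
str(raw)
sum(is.na(raw$q7))
```

The resulting dataframe has one row per respondent, with an ID column and a text column for the open-ended question, which is the shape the data and question arguments used later in this tutorial refer to.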

CoNLL-U Format Overview

The finnsurveytext package uses the CoNLL-U format. This tutorial demonstrates the process of preparing Finnish survey text data into this format using functions in r/01_prepare.r.

CoNLL-U is a popular annotation scheme often used in Natural Language Processing (NLP) tasks to tokenise and annotate text. In CoNLL-U format, the text is split into one line per word, and ten features of each word are recorded, including an ID, a part-of-speech tag, the word itself (e.g. 'likes'), and the word's lemma (e.g. 'like'). CoNLL stands for the Conference on Computational Natural Language Learning, and the CoNLL-U format was introduced in 2014.

More information on CoNLL-U format can be found in the Universal Dependencies Project, https://universaldependencies.org/format.html.
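To make the one-line-per-word layout concrete, here is a minimal hand-written sketch of the ten CoNLL-U fields for the English sentence "She likes cats." The annotations are typed in by hand for illustration (the FEATS values are abbreviated) and were not produced by finnsurveytext or any parser. Only the ten field names come from the CoNLL-U specification.

```r
# One row per token, one column per CoNLL-U field.
# All values are hand-annotated for illustration only.
conllu_toy <- data.frame(
  ID     = 1:4,                                   # token index within the sentence
  FORM   = c("She", "likes", "cats", "."),        # the word as written
  LEMMA  = c("she", "like", "cat", "."),          # dictionary form
  UPOS   = c("PRON", "VERB", "NOUN", "PUNCT"),    # universal part-of-speech tag
  XPOS   = c("PRP", "VBZ", "NNS", "."),           # language-specific tag
  FEATS  = c("Number=Sing|Person=3", "Number=Sing|Person=3|Tense=Pres",
             "Number=Plur", "_"),                 # morphological features
  HEAD   = c(2, 0, 2, 2),                         # ID of the syntactic head (0 = root)
  DEPREL = c("nsubj", "root", "obj", "punct"),    # dependency relation to the head
  DEPS   = "_",                                   # enhanced dependencies (unused here)
  MISC   = "_",                                   # any other annotation (unused here)
  stringsAsFactors = FALSE
)

# Show the fields most relevant to this tutorial.
print(conllu_toy[, c("ID", "FORM", "LEMMA", "UPOS", "HEAD", "DEPREL")])
```

Note that 'likes' and 'like' appear in the FORM and LEMMA columns respectively, matching the example in the text above.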

The Whole Story

A single function, fst_prepare() (which calls all the data preparation functions within the package), can be used to prepare the data into the required CoNLL-U format.

1. Child Barometer Data

Using our Child Barometer bullying data, we can call this function as follows:

prepd_bullying <- fst_prepare(
  data = child,
  question = "q7",
  id = 'fsd_id',
  stopword_list = "nltk",
  language = "fi",
  model = "ftb",
  weights = NULL,
  add_cols = NULL
)

Summary of components

2. Development Cooperation Data

As an example, the Development Cooperation survey q11_2 data could be prepared using this function call:

prepd_dev <- fst_prepare(
  data = dev_coop,
  question = "q11_2",
  id = 'fsd_id',
  stopword_list = "none",
  model = "tdt",
  weights = NULL,
  add_cols = NULL,
  manual = FALSE,
  manual_list = ""
)

In greater detail

To better understand the fst_prepare() function, we will go through each of the functions that this one calls. These are fst_format() and fst_rm_stop_punct(), each covered in its own section below.

Additionally, the fst_find_stopwords() function can be used to find currently available lists of stopwords for exclusion from the data. (The default "language" is "fi".) The "name" column can be used to choose a list for the stopword_list variable above. Stopword lists are lists of common words (e.g. "and", "the", and "is", or in Finnish "olla", "ollet", "ollen", and "on") which are often filtered out of the data, leaving the less frequently occurring, and thus more meaningful, words.

stopwords <- fst_find_stopwords("fi")

The stopword lists can be very long, so only one (nltk) is shown below. Another two lists, snowball and stopwords-iso, can be found by running the fst_find_stopwords() function in your local environment.

knitr::kable(head(stopwords, 1))

Format as CoNLL-U

fst_format()

This function is used to format the data from your open-ended survey question into CoNLL-U format. It also:

Our package works for two of the Finnish language models available, Turku Dependency Treebank (TDT) and FinnTreeBank (FTB). Further information about these treebanks can be found at the links but, in brief, the TDT is considered "broad coverage" and includes texts from Wikipedia and news sources, and FTB consists of manually annotated grammatical examples from VISK.

The fst_format() function utilises the udpipe package and can be run as follows:

conllu_dev_q11_1 <- fst_format(data = dev_coop, question = "q11_1", id = 'fsd_id')
conllu_cb_bullying <- fst_format(data = child, question = "q7", model = "tdt", id = 'fsd_id')

Note: the first time you run this function, it will download the relevant treebank from udpipe for use in the annotations.

The first six rows of the "conllu_cb_bullying" table are shown below:

knitr::kable(head(conllu_cb_bullying))

fst_format() takes 6 arguments:

  1. data is the dataframe containing the survey data.
  2. question is the open-ended survey question header in the table, such as "q9"
  3. id is the unique ID for each survey response.
  4. model is the chosen Finnish treebank for annotation, either "ftb" (the default) or "tdt".
  5. weights, optional, a column containing weights for the responses.
  6. add_cols, optional, any other columns to bring into the formatted data.

Remove stopwords and punctuation from CoNLL-U data

fst_rm_stop_punct()

This (optional) function will remove stopwords and punctuation from the CoNLL-U data. fst_find_stopwords() can be used to find the available stopword lists.

fst_rm_stop_punct() takes 2 arguments:

  1. data is the output from fst_format()
  2. stopword_list is a list of Finnish stopwords; the default is "nltk", but any "Name" column from fst_find_stopwords() can be used.

conllu_dev_q11_1_nltk <- fst_rm_stop_punct(data = conllu_dev_q11_1)
conllu_cb_bullying_iso <- fst_rm_stop_punct(conllu_cb_bullying, "stopwords-iso")

The first six rows of the "conllu_cb_bullying_iso" table are shown below:

knitr::kable(head(conllu_cb_bullying_iso))
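To illustrate the idea behind this step, the base R sketch below drops punctuation and stopwords from a small hand-made token table. It is not the package's implementation: the toy tokens, the three-word stopword vector, and the choice to match stopwords on the lemma column are all assumptions made for illustration. The column names doc_id, token, lemma, and upos follow the dataframe layout that udpipe produces.

```r
# Hand-made token table in a udpipe-like layout (illustrative values only).
tokens <- data.frame(
  doc_id = c("1", "1", "1", "1", "2", "2", "2"),
  token  = c("Kiusaaminen", "on", "ongelma", ".", "Ja", "se", "jatkuu"),
  lemma  = c("kiusaaminen", "olla", "ongelma", ".", "ja", "se", "jatkua"),
  upos   = c("NOUN", "AUX", "NOUN", "PUNCT", "CCONJ", "PRON", "VERB"),
  stringsAsFactors = FALSE
)

# Hypothetical, very short stopword list; real lists are much longer.
toy_stopwords <- c("olla", "ja", "se")

# Keep a row only if it is not punctuation and its lemma is not a stopword.
# Matching on the lemma (rather than the token) is an assumption of this sketch.
keep <- tokens$upos != "PUNCT" & !(tolower(tokens$lemma) %in% toy_stopwords)
cleaned <- tokens[keep, ]

print(cleaned)
```

In this toy example, three content words remain ("kiusaaminen", "ongelma", and "jatkua"), which shows how filtering leaves the less frequent, more meaningful words for analysis.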

Conclusion

Now that you have data in CoNLL-U format, this pre-processed data is ready for analysis using finnsurveytext functions. For more information on these, please review the other vignettes in this package.

Citation

The Office of Ombudsman for Children: Child Barometer 2016 [dataset]. Version 1.0 (2016-12-09). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD3134

Finnish Children and Youth Foundation: Young People's Views on Development Cooperation 2012 [dataset]. Version 2.0 (2019-01-22). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD2821

unlink('finnish-ftb-ud-2.5-191206.udpipe')
unlink("finnish-tdt-ud-2.5-191206.udpipe")



finnsurveytext documentation built on April 4, 2025, 5:07 a.m.