knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Many natural language processing (NLP) tasks require data which is systematically pre-processed into a format useful for analysis. Pre-processing commonly involves activities such as:
finnsurveytext
uses lemmatisation rather than stemming. Once the package is installed, you can load the finnsurveytext
package as below:
(Other required packages such as dplyr
and stringr
will also be installed if they are not currently installed in your environment.)
library(finnsurveytext)
The functions covered in this tutorial are:
fst_format()
fst_find_stopwords()
fst_rm_stop_punct()
fst_prepare()
This tutorial uses two sources of data from the Finnish Social Science Data Archive:
Child Barometer 2016 (FSD3134)
Young People's Views on Development Cooperation 2012 (FSD2821)
Both of these will be demonstrated below but either can be used to complete the tutorial. If you would prefer to use your own data, you can read in this data through read.csv()
or similar so that you have a 'raw' dataframe ready in your R environment.
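As a minimal sketch of reading in your own data, the example below writes a tiny CSV to a temporary file and reads it back with read.csv(). The file, the column names (fsd_id, q1) and the responses are invented for illustration and are not part of the package or of the FSD data.

```r
# Hypothetical example data: one row per respondent, one open-ended question.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "fsd_id,q1",
    "1,Kiusaaminen on vakava ongelma.",
    "2,En osaa sanoa."
  ),
  csv_path
)

# stringsAsFactors = FALSE keeps the open-ended responses as character strings.
raw_df <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the 'raw' dataframe before passing it to any preparation function.
str(raw_df)
```

With real survey data you would replace csv_path with the path to your own file and check that the ID and question columns have been read in as expected.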
The finnsurveytext
package uses the CoNLL-U format. This tutorial demonstrates the process of preparing Finnish survey text data into this format using functions in r/01_prepare.r.
CoNLL-U is a popular annotation scheme often used in Natural Language Processing (NLP) tasks to tokenise and annotate text. In CoNLL-U format, the text is split into one line per word and ten features of each word are recorded, including an ID, a part-of-speech tag, the word itself (e.g. 'likes'), and the word's lemma (e.g. 'like'). CoNLL stands for the Conference on Computational Natural Language Learning, and the CoNLL-U format was introduced in 2014.
More information on CoNLL-U format can be found in the Universal Dependencies Project, https://universaldependencies.org/format.html.
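To make the "one line per word, ten features" idea concrete, the sketch below builds a small table by hand using the ten CoNLL-U field names from the Universal Dependencies specification. The sentence and its annotations are written manually for illustration (in English, to match the 'likes'/'like' example above); they are not output from finnsurveytext or udpipe.

```r
# Hand-written illustration of the ten CoNLL-U fields for "She likes cats."
conllu_example <- data.frame(
  ID     = 1:4,                                  # word index within the sentence
  FORM   = c("She", "likes", "cats", "."),       # the word as it appears
  LEMMA  = c("she", "like", "cat", "."),         # dictionary form of the word
  UPOS   = c("PRON", "VERB", "NOUN", "PUNCT"),   # universal part-of-speech tag
  XPOS   = "_",                                  # language-specific tag (unused here)
  FEATS  = "_",                                  # morphological features (unused here)
  HEAD   = c(2, 0, 2, 2),                        # ID of the syntactic head (0 = root)
  DEPREL = c("nsubj", "root", "obj", "punct"),   # dependency relation to the head
  DEPS   = "_",                                  # enhanced dependencies (unused here)
  MISC   = "_",                                  # any other annotation (unused here)
  stringsAsFactors = FALSE
)

print(conllu_example)
```

An underscore marks an empty field, as in the CoNLL-U specification.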
A single function, fst_prepare() (which calls all the data preparation functions within the package), can be used to prepare the data into the required CoNLL-U format.
Using our Child Barometer bullying data, we can call this function as follows:
prepd_bullying <- fst_prepare(
  data = child,
  question = "q7",
  id = 'fsd_id',
  stopword_list = "nltk",
  language = "fi",
  model = "ftb",
  weights = NULL,
  add_cols = NULL
)
Summary of components
data is the dataframe of interest. In this case, we are using the 'child' data that comes with the package. Otherwise, this is the dataframe you have read in, such as through read.csv() in base R, for use in this tutorial.
question is the name of the column in your data which contains the open-ended survey question. In this example, the responses about bullying are in question 7.
id is a unique identifier for each response.
stopword_list is the list of stopwords to remove. Available lists can be found with the fst_find_stopwords() function which is outlined below. Punctuation is also removed from the data whenever stopwords are removed.
model is the language model from udpipe. In this case we are using the default FinnTreeBank, model = "ftb". (There are two options for the Finnish language model; the other option is the Turku Dependency Treebank. For further detail on the treebanks, see the Format as CoNLL-U section below.)
manual and manual_list can be used if you want to manually provide a list of stopwords to remove from the data.
As an example, the Development Cooperation survey q11_2 data could be prepared using this function call:
prepd_dev <- fst_prepare(
  data = dev_coop,
  question = "q11_2",
  id = 'fsd_id',
  stopword_list = "none",
  model = "tdt",
  weights = NULL,
  add_cols = NULL,
  manual = FALSE,
  manual_list = ""
)
data is the dataframe of interest. In this case, we are using the 'dev_coop' data that comes with the package. Otherwise, this is the dataframe you have read in, such as through read.csv() in base R, for use in this tutorial.
question is the name of the column in your data which contains the open-ended survey question. In this example, the responses are in question 11_2.
This time, no stopwords are removed (stopword_list = "none").
The Turku Dependency Treebank is used for annotation (model = "tdt").
To better understand the fst_prepare() function, we will go through each of the functions that this one calls. These are:
fst_format()
fst_rm_stop_punct()
Additionally, the fst_find_stopwords() function can be used to find currently available lists of stopwords for exclusion from the data. (The default "language" is "fi".) The "name" column can be used to choose a list for the stopword_list variable above.
Stopword lists are lists of common words (e.g. "and", "the", and "is", or in Finnish "olla", "ollet", "ollen", and "on"...) which are often filtered out of the data, leaving the less frequently occurring, and thus more meaningful, words.
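The idea of filtering out stopwords can be sketched in base R as follows. The three-word stopword list and the tokens are made up for illustration; real lists such as nltk are much longer, and this is not how the package itself performs the removal.

```r
# Hypothetical mini stopword list (real lists contain many more words).
stopword_set <- c("ja", "on", "olla")

# Hypothetical tokens from one response: "koulu on iso ja uusi"
# ("the school is big and new").
tokens <- c("koulu", "on", "iso", "ja", "uusi")

# Keep only the tokens that do not appear in the stopword list.
content_tokens <- tokens[!tokens %in% stopword_set]

print(content_tokens)
```

Only the content-bearing words ("koulu", "iso", "uusi") remain once the common words are dropped.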
stopwords <- fst_find_stopwords("fi")
The stopwords lists can be very long, so only one (nltk) is shown below. Another two lists, snowball and stopwords-iso, can be found by running the fst_find_stopwords()
function in your local environment.
knitr::kable(head(stopwords, 1))
fst_format()
This function is used to format the data from your open-ended survey question into CoNLL-U format. It also:
Our package works for two of the Finnish language models available, Turku Dependency Treebank (TDT) and FinnTreeBank (FTB). Further information about these treebanks can be found at the links but, in brief, the TDT is considered "broad coverage" and includes texts from Wikipedia and news sources, and FTB consists of manually annotated grammatical examples from VISK.
The fst_format() function utilises the udpipe package and can be run as follows:
conllu_dev_q11_1 <- fst_format(data = dev_coop, question = "q11_1", id = 'fsd_id')
conllu_cb_bullying <- fst_format(data = child, question = "q7", model = "tdt", id = 'fsd_id')
Note: the first time you run this function, it will download the relevant treebank from udpipe for use in the annotations.
The first six rows of the "conllu_cb_bullying" table are shown below:
knitr::kable(head(conllu_cb_bullying))
fst_format() takes 6 arguments:
data is the dataframe containing the survey data.
question is the open-ended survey question header in the table, such as "q9".
id is the unique ID for each survey response.
model is the chosen Finnish treebank for annotation, either "ftb" (the default) or "tdt".
weights, optional, is a column containing weights for the responses.
add_cols, optional, is any other columns to bring into the formatted data.
fst_rm_stop_punct()
This (optional) function will remove stopwords and punctuation from the CoNLL-U data. fst_find_stopwords() can be used to find options for stopword lists.
fst_rm_stop_punct()
takes 2 arguments:
data is the output from fst_format().
stopword_list is a list of Finnish stopwords. The default is "nltk", but any list from the "Name" column of fst_find_stopwords() can be used.
conllu_dev_q11_1_nltk <- fst_rm_stop_punct(data = conllu_dev_q11_1)
conllu_cb_bullying_iso <- fst_rm_stop_punct(conllu_cb_bullying, "stopwords-iso")
The first six rows of the "conllu_cb_bullying_iso" table are shown below:
knitr::kable(head(conllu_cb_bullying_iso))
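To illustrate what removing stopwords and punctuation from an annotated table involves, here is a rough base R sketch. It assumes a table with lower-case token, lemma and upos columns and assumes that stopwords are matched on the lemma; the data and the two-word stopword list are hand-written for illustration, and this is not the package's own implementation.

```r
# Hand-written annotated table for "Koulu on iso ja uusi."
annotated <- data.frame(
  token = c("Koulu", "on", "iso", "ja", "uusi", "."),
  lemma = c("koulu", "olla", "iso", "ja", "uusi", "."),
  upos  = c("NOUN", "AUX", "ADJ", "CCONJ", "ADJ", "PUNCT"),
  stringsAsFactors = FALSE
)

# Hypothetical mini stopword list, expressed as lemmas.
stopword_set <- c("olla", "ja")

# Drop rows whose lemma is a stopword and rows tagged as punctuation.
keep <- !(annotated$lemma %in% stopword_set) & annotated$upos != "PUNCT"
cleaned <- annotated[keep, ]

print(cleaned)
```

Matching on the lemma rather than the surface form means that the inflected form "on" is removed because its lemma is "olla", which is one reason lemmatisation is useful before stopword removal.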
Now that you have data in CoNLL-U format, this pre-processed data is ready for the analysis using finnsurveytext
functions. For more information on these, please review the other vignettes in this package.
The Office of Ombudsman for Children: Child Barometer 2016 [dataset]. Version 1.0 (2016-12-09). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD3134
Finnish Children and Youth Foundation: Young People's Views on Development Cooperation 2012 [dataset]. Version 2.0 (2019-01-22). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD2821
unlink("finnish-ftb-ud-2.5-191206.udpipe")
unlink("finnish-tdt-ud-2.5-191206.udpipe")