Extra-AnalysingOtherLanguages"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(finnsurveytext)
library(udpipe)
library(stopwords)

How to Use finnsurveytext in another language!

Despite the package's name, finnsurveytext can be used to analyse surveys in LOTS of different languages. This vignette aims to explain how to use finnsurveytext in another language with as little additional effort as possible.

The reason finnsurveytext can be used with other languages is that the packages it employs to process the raw survey data work in multiple languages! So we have the developers of the udpipe and stopwords packages to thank!

There is a survey in English provided with the package called english_sample_survey which we will use to demonstrate the use of the package in a language other than Finnish.

knitr::kable(head(english_sample_survey, 5))

1. Essential: Your language has a language model available for udpipe

The udpipe package is available from the CRAN. The relevant udpipe function we use is udpipe::udpipe_download_model. You can see the list of available models in the udpipe manual.

At the time of writing this vignette, these were:

afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb

Alternatively, you can find the list of available models by running fst_print_available_models(). By providing a search term, the list will be filtered for models containing this language:

fst_print_available_models()

fst_print_available_models(search = 'estonian')

fst_print_available_models('sami')

How to use:

The relevant model, eg "swedish-talbanken", should be used for the model input in fst_format() or fst_prepare()

Demonstration:

We find an English model and format our English data below:

fst_print_available_models("english")

en_df <- fst_format(data = english_sample_survey,
           question = 'text', 
           id = 'id', 
           model = 'english-ewt'
           )

knitr::kable(head(en_df, 5))

2. Recommended: Your language has a stopwords list available for stopwords package

The stopwords package is available from the CRAN. The relevant stopwords functions are stopwords::stopwords, stopwords::stopwords_getsources and stopwords::stopwords_getlanguages. We recommend you first identify the two-letter ISO code for the language you are using. You can see the list of available sources and languages in the stopwords manual or by running the 'get sources' and 'get languages' functions:

stopwords_getsources()
stopwords::stopwords_getlanguages(source = 'nltk')
stopwords('da', source = 'nltk')
stopwords('da') # The default source is 'snowball'

Alternatively, you can use our function fst_find_stopwords to simplify this process. This function provides a table of lists available through the stopwords package for a language and provides the contents for comparison (if you have multiple options!). To run this, you need the two-letter ISO language code:

knitr::kable(fst_find_stopwords(language = 'lv'))
fst_find_stopwords(language = 'no')

How to use:

The relevant language and stopword list ('source'), eg "sv" and "nltk", should be used for the language and stopword_list inputs respectively in fst_prepare() (or fst_rm_stop_punct() which is automatically called within fst_prepare()).

Demonstration:

We can find and compare English stopwords lists as below. Once we have chosen a stopwords list, we can run fst_prepare() to format the data and remove the stopwords:

knitr::kable(head(fst_find_stopwords(language = 'en'), 5))

en_df2 <- fst_prepare(data = english_sample_survey,
                      question = 'text',
                      id = 'id',
                      model = 'english-ewt',
                      stopword_list = 'smart', 
                      language = 'en')

knitr::kable(head(en_df2, 5))

2b. Optional: Provide your own list of stopwords

If a stopword list is not available for your language, or you would like to provide your own, you can use the manual_list option within fst_prepare() (or fst_rm_stop_punct()) making sure to also either set manual = TRUE or stopwords_list = "manual".

You can also chose to not remove stopwords but you may find that you want to remove them to get more meaningful results!

If you provide a manual list, you can leave language as its default values.

Demonstration

#EXAMPLE OF PROVIDING A MANUAL LIST
manualList <- c('and', 'the', 'of', 'you', 'me', 'ours', 'mine', 'them', 'theirs')
manualList2 <- "to, the, I"

df1 <- fst_prepare(data = english_sample_survey,
                  question = 'text',
                  id = 'id',
                  model = 'english-ewt',
                  manual_list = manualList,
                  stopword_list = 'manual'
                  )

knitr::kable(head(df1, 5))

df2 <- fst_prepare(data = english_sample_survey,
                  question = 'text',
                  id = 'id',
                  model = 'english-ewt',
                  manual = TRUE,
                  manual_list = manualList2
                  )

knitr::kable(head(df2, 5))

The remainder of the package works the same regardless of language of survey responses.

unlink('english-ewt-ud-2.5-191206.udpipe')


Try the finnsurveytext package in your browser

Any scripts or data that you put into this service are public.

finnsurveytext documentation built on April 4, 2025, 5:07 a.m.