knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(finnsurveytext) library(udpipe) library(stopwords)
finnsurveytext
in another language!Despite the package's name, finnsurveytext
can be used to analyse surveys in LOTS of different languages. This vignette aims to explain how to use finnsurveytext
in another language with as little additional effort as possible.
The reason finnsurveytext
can be used with other languages is that the packages it employs to process the raw survey data work in multiple languages! So we have the developers of the udpipe
and stopwords
packages to thank!
There is a survey in English provided with the package called english_sample_survey
which we will use to demonstrate the use of the package in a language other than Finnish.
knitr::kable(head(english_sample_survey, 5))
udpipe
The udpipe
package is available from the CRAN. The relevant udpipe
function we use is udpipe::udpipe_download_model
. You can see the list of available models in the udpipe
manual.
At the time of writing this vignette, these were:
afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb
Alternatively, you can find the list of available models by running fst_print_available_models()
. By providing a search
term, the list will be filtered for models containing this language:
fst_print_available_models() fst_print_available_models(search = 'estonian') fst_print_available_models('sami')
The relevant model, eg "swedish-talbanken", should be used for the model
input in fst_format()
or fst_prepare()
We find an English model and format our English data below:
fst_print_available_models("english") en_df <- fst_format(data = english_sample_survey, question = 'text', id = 'id', model = 'english-ewt' ) knitr::kable(head(en_df, 5))
stopwords
packageThe stopwords
package is available from the CRAN. The relevant stopwords
functions are stopwords::stopwords
, stopwords::stopwords_getsources
and stopwords::stopwords_getlanguages
. We recommend you first identify the two-letter ISO code for the language you are using. You can see the list of available sources and languages in the stopwords
manual or by running the 'get sources' and 'get languages' functions:
stopwords_getsources() stopwords::stopwords_getlanguages(source = 'nltk') stopwords('da', source = 'nltk') stopwords('da') # The default source is 'snowball'
Alternatively, you can use our function fst_find_stopwords
to simplify this process. This function provides a table of lists available through the stopwords
package for a language and provides the contents for comparison (if you have multiple options!). To run this, you need the two-letter ISO language code:
knitr::kable(fst_find_stopwords(language = 'lv')) fst_find_stopwords(language = 'no')
The relevant language and stopword list ('source'), eg "sv" and "nltk", should be used for the language
and stopword_list
inputs respectively in fst_prepare()
(or fst_rm_stop_punct()
which is automatically called within fst_prepare()
).
We can find and compare English stopwords lists as below. Once we have chosen a stopwords list, we can run fst_prepare()
to format the data and remove the stopwords:
knitr::kable(head(fst_find_stopwords(language = 'en'), 5)) en_df2 <- fst_prepare(data = english_sample_survey, question = 'text', id = 'id', model = 'english-ewt', stopword_list = 'smart', language = 'en') knitr::kable(head(en_df2, 5))
If a stopword list is not available for your language, or you would like to provide your own, you can use the manual_list
option within fst_prepare()
(or fst_rm_stop_punct()
) making sure to also either set manual = TRUE
or stopwords_list = "manual"
.
You can also chose to not remove stopwords but you may find that you want to remove them to get more meaningful results!
If you provide a manual list, you can leave language
as its default values.
#EXAMPLE OF PROVIDING A MANUAL LIST manualList <- c('and', 'the', 'of', 'you', 'me', 'ours', 'mine', 'them', 'theirs') manualList2 <- "to, the, I" df1 <- fst_prepare(data = english_sample_survey, question = 'text', id = 'id', model = 'english-ewt', manual_list = manualList, stopword_list = 'manual' ) knitr::kable(head(df1, 5)) df2 <- fst_prepare(data = english_sample_survey, question = 'text', id = 'id', model = 'english-ewt', manual = TRUE, manual_list = manualList2 ) knitr::kable(head(df2, 5))
The remainder of the package works the same regardless of language of survey responses.
unlink('english-ewt-ud-2.5-191206.udpipe')
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.