annotated_hymns: Hymns, tokenized and POS-tagged
In lassehjorthmadsen/salmer: A Data Set with Hymns in Danish

annotated_hymns

R Documentation

Hymns, tokenized and POS-tagged

The hymn dataset, tokenized and tagged with part of speech (POS). Each row represents one token (a word or punctuation mark) from a given hymn.

annotated_hymns

A tibble with 153,717 rows and 7 variables:

doc_id: Official hymn number
paragraph_id: Line number in hymn
token_id: Token number in line
token: The original token, i.e. word
lemma: The lemmatized (i.e. dictionary) form of token
upos: POS tag, i.e. part of speech, such as VERB, or PUNCT for punctuation
vowels: Number of vowels in token – good for finding alternative words using the cut_up() function
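
The column layout above lends itself to simple filtering and grouping. The following is a minimal sketch in base R, not taken from the package itself: it builds a small toy data frame with the same seven column names (the rows are made up for illustration and are not real records from annotated_hymns, and the numeric type chosen for doc_id is an assumption), then drops punctuation, counts tokens per POS tag, picks verbs by vowel count, and rebuilds each line from its tokens.

```r
# Toy stand-in for annotated_hymns: same seven columns, made-up rows.
# Assumption: doc_id is treated as numeric here; the real column may differ.
toy <- data.frame(
  doc_id       = c(1, 1, 1, 1, 2, 2, 2),
  paragraph_id = c(1, 1, 1, 1, 1, 1, 1),
  token_id     = c(1, 2, 3, 4, 1, 2, 3),
  token        = c("Gud", "ske", "lov", ".", "Lover", "Herren", "!"),
  lemma        = c("Gud", "ske", "lov", ".", "love", "Herre", "!"),
  upos         = c("PROPN", "VERB", "NOUN", "PUNCT", "VERB", "NOUN", "PUNCT"),
  vowels       = c(1, 1, 1, 0, 2, 2, 0),
  stringsAsFactors = FALSE
)

# Keep only real words by dropping punctuation tokens.
words <- toy[toy$upos != "PUNCT", ]

# Count how many tokens carry each POS tag.
pos_counts <- table(words$upos)

# Verbs with exactly two vowels: the kind of lookup the vowels column supports.
two_vowel_verbs <- words$token[words$upos == "VERB" & words$vowels == 2]

# Rebuild the text of each line from its tokens, in their original order.
lines <- aggregate(token ~ doc_id + paragraph_id, data = words,
                   FUN = paste, collapse = " ")

print(pos_counts)
print(two_vowel_verbs)
print(lines)
```

With the package installed, the same steps should apply to annotated_hymns directly in place of the toy data frame.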