annotated_hymns: Hymns, tokenized and POS-tagged

annotated_hymns    R Documentation

Hymns, tokenized and POS-tagged

Description

The hymn dataset, tokenized and tagged with part-of-speech (POS) labels. Each row represents one token (a word or punctuation mark) from a given hymn.

Usage

annotated_hymns

Format

A tibble with 153,717 rows and 7 variables:

doc_id

Official hymn number

paragraph_id

Line number in hymn

token_id

Token number in line

token

The original token, i.e. word

lemma

The lemmatized form of token, i.e. its dictionary form

upos

POS tag, i.e. part of speech, such as VERB, or PUNCT for punctuation

vowels

Number of vowels in token; useful for finding alternative words with the cut_up() function
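
The columns above can be combined for simple lookups. The sketch below is only an illustration: it builds a small toy data frame with the same seven columns (the rows are invented, not taken from annotated_hymns), counts tokens per POS tag, and finds tokens that share a POS tag and vowel count with a chosen word. It does not call cut_up(), whose arguments are not documented on this page.

```r
# Toy stand-in with the same seven columns as annotated_hymns.
# The rows are invented for illustration and are not from the dataset.
toy <- data.frame(
  doc_id       = c(1, 1, 1, 1, 2, 2, 2),
  paragraph_id = c(1, 1, 1, 1, 1, 1, 1),
  token_id     = c(1, 2, 3, 4, 1, 2, 3),
  token        = c("Gud", "skabte", "dagen", ".", "Solen", "skinner", "."),
  lemma        = c("Gud", "skabe", "dag", ".", "sol", "skinne", "."),
  upos         = c("PROPN", "VERB", "NOUN", "PUNCT", "NOUN", "VERB", "PUNCT"),
  vowels       = c(1, 2, 2, 0, 2, 2, 0),
  stringsAsFactors = FALSE
)

# Count tokens per POS tag, ignoring punctuation.
words <- toy[toy$upos != "PUNCT", ]
pos_counts <- table(words$upos)

# Candidate replacements for one word: same POS tag and same vowel count.
target <- toy[toy$token == "skabte", ]
candidates <- toy[toy$upos == target$upos &
                  toy$vowels == target$vowels &
                  toy$token != target$token, "token"]

print(pos_counts)
print(candidates)
```

With the package loaded, replacing toy with annotated_hymns should run the same lookups on the full data, assuming the column names listed above.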


lassehjorthmadsen/salmer documentation built on April 15, 2022, 3:38 a.m.