mobydick: Lemmatized Text of Moby-Dick (Chapters 1-10)

mobydick R Documentation

Lemmatized Text of Moby-Dick (Chapters 1-10)

Description

This dataset contains the lemmatized text of the first 10 chapters of the novel Moby-Dick by Herman Melville. The data is structured as a data frame in which each token carries linguistic annotations such as its lemma, part-of-speech tags, and dependency relation.

Usage

data(mobydick)

Format

A data frame with multiple rows and 26 columns:

doc_id

Character: Unique document identifier

paragraph_id

Integer: Paragraph index within the document

sentence_id

Integer: Sentence index within the paragraph

sentence

Character: Original sentence text

start

Integer: Start position of the token in the sentence

end

Integer: End position of the token in the sentence

term_id

Integer: Unique term identifier

token_id

Integer: Token index in the sentence

token

Character: Original token (word)

lemma

Character: Lemmatized form of the token

upos

Character: Universal part-of-speech (UPOS) tag

xpos

Character: Language-specific part-of-speech (XPOS) tag

feats

Character: Morphological features

head_token_id

Integer: Head token in dependency tree

dep_rel

Character: Dependency relation label

deps

Character: Enhanced dependency relations

misc

Character: Additional information

folder

Character: Folder containing the document

split_word

Character: The word used to separate the chapters in the original book

filename

Character: Source file name

doc_selected

Logical: Whether the document is selected

POSSelected

Logical: Whether POS was selected

sentence_hl

Character: Highlighted sentence

docSelected

Logical: Whether the document was manually selected

noHapax

Logical: Whether hapax legomena were removed

noSingleChar

Logical: Whether single-character words were removed

lemma_original_nomultiwords

Character: Lemmatized form without multi-word units
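
The token-level columns listed above lend themselves to simple frequency counts. The following is a minimal base R sketch, assuming the data frame holds one row per token. The object `toy` is a hypothetical stand-in that mimics a handful of the documented columns; its values are invented for illustration and are not taken from the dataset. The sketch filters on `upos` and tabulates `lemma`.

```r
# Hypothetical stand-in for a few of the documented columns.
# The values are invented for illustration, not copied from mobydick.
toy <- data.frame(
  doc_id      = rep("doc1", 6),
  sentence_id = c(1L, 1L, 1L, 2L, 2L, 2L),
  token_id    = c(1L, 2L, 3L, 1L, 2L, 3L),
  token       = c("Whales", "swim", "far", "Ships", "chase", "whales"),
  lemma       = c("whale", "swim", "far", "ship", "chase", "whale"),
  upos        = c("NOUN", "VERB", "ADV", "NOUN", "VERB", "NOUN"),
  stringsAsFactors = FALSE
)

# Keep only the rows tagged as nouns.
nouns <- toy[toy$upos == "NOUN", ]

# Count how often each noun lemma occurs, most frequent first.
noun_freq <- sort(table(nouns$lemma), decreasing = TRUE)
print(noun_freq)
```

The same two steps should carry over to the real data by replacing `toy` with `mobydick`, provided the columns behave as described in this section.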

Source

Extracted and processed from the text of Moby-Dick by Herman Melville.

Examples

data(mobydick)
head(mobydick)
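
To check the dimensions and column types against the Format section, a small helper can summarise any data frame. This is a sketch, not part of the package: `column_overview` is a hypothetical helper name, and the call on the real data assumes the dataset ships with the `tall` package named in the page footer.

```r
# Hypothetical helper: list each column with the first element of its class.
column_overview <- function(df) {
  data.frame(
    column = names(df),
    class  = vapply(df, function(x) class(x)[1], character(1),
                    USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Only touch the real dataset if 'tall' is installed (assumed package name).
if (requireNamespace("tall", quietly = TRUE)) {
  data("mobydick", package = "tall")
  print(dim(mobydick))              # number of rows and columns
  print(column_overview(mobydick))  # column names and classes
}
```

Comparing the printed column count with the list in the Format section is a quick way to confirm which columns the installed version provides.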

tall documentation built on April 16, 2025, 5:10 p.m.