mobydick: Lemmatized Text of Moby-Dick (Chapters 1-10)
In tall: Text Analysis for All

mobydick

R Documentation

Description

This dataset contains the lemmatized text of the first 10 chapters of the novel Moby-Dick by Herman Melville. The data is stored as a data frame of tokens, each annotated with its lemma, part-of-speech tags, morphological features, and dependency relations.

Usage

data(mobydick)
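
A minimal sketch of loading and inspecting the dataset. It assumes the tall package is installed; the `loaded` flag is a hypothetical name used only for illustration, and the block does nothing when the package is missing.

```r
# Load the dataset only if the 'tall' package is available (assumption:
# the dataset ships with 'tall', as the page header indicates).
loaded <- FALSE
if (requireNamespace("tall", quietly = TRUE)) {
  data("mobydick", package = "tall")
  loaded <- TRUE

  # Overview of the columns and their types.
  str(mobydick, max.level = 1)

  # Number of rows and columns.
  print(dim(mobydick))
}
```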

Format

A data frame with multiple rows and 26 columns:

doc_id: Character: Unique document identifier
paragraph_id: Integer: Paragraph index within the document
sentence_id: Integer: Sentence index within the paragraph
sentence: Character: Original sentence text
start: Integer: Start position of the token in the sentence
end: Integer: End position of the token in the sentence
term_id: Integer: Unique term identifier
token_id: Integer: Token index in the sentence
token: Character: Original token (word)
lemma: Character: Lemmatized form of the token
upos: Character: Universal part-of-speech (UPOS) tag
xpos: Character: Language-specific part-of-speech (XPOS) tag
feats: Character: Morphological features
head_token_id: Integer: Identifier of the token's head in the dependency tree
dep_rel: Character: Dependency relation label
deps: Character: Enhanced dependency relations
misc: Character: Additional information
folder: Character: Folder containing the document
split_word: Character: The word used to separate the chapters in the original book
filename: Character: Source file name
doc_selected: Logical: Whether the document is selected
POSSelected: Logical: Whether POS was selected
sentence_hl: Character: Highlighted sentence
docSelected: Logical: Whether the document was manually selected
noHapax: Logical: Whether hapax legomena were removed
noSingleChar: Logical: Whether single-character words were removed
lemma_original_nomultiwords: Character: Lemmatized form without multi-word units
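
The columns above can be combined for simple frequency analyses. The sketch below counts lemmas for a chosen set of UPOS tags. To stay runnable without the package, it uses a small toy data frame that only mimics a few of the documented columns; the toy values and the helper name `lemma_freq` are illustrative assumptions, not part of the dataset or of tall.

```r
# Toy stand-in with a few of the documented columns.
# The values are made up for illustration and are not taken from the dataset.
toy <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc1", "doc1"),
  sentence_id = c(1L, 1L, 1L, 2L, 2L),
  token       = c("whales", "swim", "far", "whale", "ships"),
  lemma       = c("whale", "swim", "far", "whale", "ship"),
  upos        = c("NOUN", "VERB", "ADV", "NOUN", "NOUN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lemma frequencies restricted to the given UPOS tags,
# sorted from most to least frequent.
lemma_freq <- function(x, pos = "NOUN") {
  keep <- x[x$upos %in% pos, , drop = FALSE]
  sort(table(keep$lemma), decreasing = TRUE)
}

freq <- lemma_freq(toy, pos = "NOUN")
print(freq)
```

With the package installed and the data loaded, the same helper could be applied as `lemma_freq(mobydick, pos = c("NOUN", "PROPN"))`, assuming the `lemma` and `upos` columns are populated as described.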