epub_sift: Sift EPUB sections
In epubr: Read EPUB File Metadata and Text

Description

Sift out EPUB sections that have suspiciously low word or character count.

Usage

epub_sift(data, n, type = c("word", "char"))

Arguments

`data`	a data frame created by `epub`.
`n`	integer, minimum number of words or characters to retain a section.
`type`	character, `"word"` (the default) or `"char"`; whether `n` counts words or characters.

Details

This function acts like a sieve that lets small section rows fall through. Set `n` to the minimum number of words or characters a section must contain to count as meaningful content worth retaining in the nested data frame, e.g., a book chapter. Rows of the nested data frame that pertain to smaller sections are dropped.
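The sieve idea can be sketched in base R on a toy table. This is only an illustration under stated assumptions, not the epubr implementation: `sift_sections` is a hypothetical helper, the `sections` data are made up, the column names `nword` and `nchar` are assumed to mirror the per-section counts in the nested data frame, and sections with exactly `n` words or characters are assumed to be kept.

```r
# Toy stand-in for one nested section table; values are invented for illustration.
sections <- data.frame(
  section = c("cover", "ch1", "ch2", "notes"),
  nword = c(12L, 5400L, 6100L, 85L),
  nchar = c(70L, 30500L, 34800L, 510L),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of epubr): keep rows at or above the threshold.
sift_sections <- function(d, n, type = c("word", "char")) {
  type <- match.arg(type)
  # Pick the count column that matches the requested unit (assumed names).
  col <- if (type == "word") "nword" else "nchar"
  d[d[[col]] >= n, , drop = FALSE]
}

# Sections with fewer than 3000 words fall through the sieve.
kept <- sift_sections(sections, n = 3000)
kept$section
```

With a word threshold of 3000, only the two chapter-sized rows remain; the short cover and notes rows are dropped.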

This function helps isolate meaningful content by removing extraneous e-book sections that may be difficult to remove by other methods when working with poorly formatted e-books. The EPUB file included in epubr is a good example of this: it does not contain meaningful section identifiers in its metadata, so the text table must be restructured after reading it with `epub` by subsequently calling `epub_recombine`. However, some unavoidable ambiguity in this process leads to many small sections derived from the table of contents. These can then be dropped with `epub_sift`. See a more comprehensive example in the `epub_recombine` documentation. A simpler example is shown below.

Value

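A data frame like `data`, with the rows for sections below the `n` threshold dropped from the nested data frame.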

Examples


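library(epubr)

# Locate the example EPUB file bundled with epubr
file <- system.file("dracula.epub", package = "epubr")
x <- epub(file) # parse entire e-book
x$data[[1]] # nested section table before sifting

# Keep only sections with at least 3000 words
x <- epub_sift(x, n = 3000) # drops last two sections
x$data[[1]]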

epubr documentation built on Sept. 12, 2024, 6:23 a.m.