Reorder text sections in an e-book based on a user-provided function.
a data frame created by
a scalar function to determine a single row index based on a matched regular expression. It must take two strings, the text and the pattern, and return a single number. See examples.
regular expression passed to
Many e-books have chronologically ordered sections based on quality metadata. This results in properly ordered book sections in the nested data frame. However, some poorly formatted e-books have their internal sections occur in an arbitrary order. This can be frustrating to work with when doing text analysis on each section when order matters.
This function addresses this case by reordering the text sections in the nested data frame based on a user-provided function that re-indexes the data frame rows based on their content.
In general, the approach is to find something in the content of each section that describes the section order.
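As a rough illustration of this idea, the sketch below reorders a toy data frame by a number pulled out of each section's text. It uses only base R and does not call epubr; the data, the `sections` object, and the `order_key` helper are all hypothetical stand-ins, not part of the package.

```r
# Toy stand-in for one book's nested data frame (hypothetical data).
sections <- data.frame(
  section = c("s1", "s2", "s3"),
  text = c("Part 3 The return", "Part 1 The departure", "Part 2 The journey"),
  stringsAsFactors = FALSE
)

# Hypothetical ordering function: takes the text and a pattern,
# and returns a single number describing where the section belongs.
order_key <- function(text, pattern) as.numeric(sub(pattern, "\\1", text))

pattern <- "^Part ([0-9]+).*"

# Apply the function to each row's text, then reorder rows by the result.
idx <- vapply(sections$text, order_key, numeric(1),
              pattern = pattern, USE.NAMES = FALSE)
reordered <- sections[order(idx), ]
reordered$text
```

The function here has the same shape as the one described for the `f` argument: two strings in, one number out.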
epub_recombine can use a regular expression to identify chapters. Taking this a step further, epub_reorder can use a function that works with the same information to reorder the rows.
It is enough in the former case to identify where in the text the pattern occurs. There is no need to extract numeric ordering from it.
The latter takes more effort. In the example EPUB file included in epubr, chapters can be identified using a pattern of the word CHAPTER in capital letters followed by a space and then some Roman numerals.
The user must provide a function that would parse the Roman numerals in this pattern so that the rows of the data frame can be reordered properly.
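A minimal sketch of such a parsing function, checked on its own before it is handed to epub_reorder, might look like the following. The helper name `roman_index` and the sample headings are made up for illustration; `as.roman` comes from the utils package that ships with R.

```r
# Hypothetical helper: extract the Roman numeral captured by the pattern
# and convert it to a plain number.
roman_index <- function(text, pattern) {
  numeral <- sub(pattern, "\\1", text)  # keep only the captured numerals
  as.numeric(utils::as.roman(numeral))
}

pattern <- "^CHAPTER ([IVX]+).*"

# Made-up section openings in scrambled order.
headings <- c("CHAPTER IX Letter", "CHAPTER II Journal", "CHAPTER XIV Diary")

idx <- vapply(headings, roman_index, numeric(1),
              pattern = pattern, USE.NAMES = FALSE)
idx         # 9 2 14
order(idx)  # 2 1 3, the row order that restores chapter sequence
```

Note that the character class `[IVX]` only covers numerals up to 39; a longer book would need additional letters such as L and C in the pattern.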
a data frame
library(epubr)
library(dplyr) # for sample_frac()

file <- system.file("dracula.epub", package = "epubr")
x <- epub(file) # parse entire e-book
x <- epub_recombine(x, "CHAPTER [IVX]+", sift = list(n = 1000)) # clean up

# x$data is a list column; the book's sections are in its first element
set.seed(1)
x$data[[1]] <- sample_frac(x$data[[1]]) # randomize rows for example
x$data[[1]]

# f takes the section text and the pattern, and returns the chapter number
f <- function(x, pattern) as.numeric(as.roman(gsub(pattern, "\\1", x)))
x <- epub_reorder(x, f, "^CHAPTER ([IVX]+).*")
x$data[[1]]