knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##"
)

Introduction


This vignette will be discussing the structure of Episode objects, how to query the contents with the {xml2} package, and how to use the methods and active bindings to get information about, extract, and manipulate anything inside of a Markdown or R Markdown document.

Reading Markdown Content

Each Episode object starts from a Markdown file. In particular for {pegboard}, we assume that this Markdown file is written using Pandoc syntax (a superset of CommonMark). It can be any markdown file, but for us to explore what the Episode object has to offer us, let's take an example R Markdown file that is present in a fragment of a Carpentries Workbench lesson that we have in this package. We will be using the {xml2} package to explore the object and the {fs} package to help with constructing file paths.

library("pegboard")
library("xml2")
library("fs")

This is what our lesson fragment looks like. It is a fragment because it's main purpose is to be used for examples and tests, but it contains the basic structure of a lesson that we want.

dir_tree(lesson_fragment("sandpaper-fragment"), recurse = 1, regex = "site/[^R].*", invert = TRUE)

We can retrieve it with the lesson_fragment() function, which loads example data from pegboard. Here we will take that lesson fragment and read in the first episode with the initialization method, Episode$new(), followed by $confirm_sandpaper(), a confirmation that the episode was created to work with {sandpaper}, the user interface and build engine of The Carpentries Workbench (for information on non-workbench content, see the section on Jekyll Lesson Markdown Content) and $protect_math() which will prevent special characters in LaTeX math from being escaped.

lsn <- lesson_fragment("sandpaper-fragment")
# Read in the intro.Rmd document as an `Episode` object
intro_path <- path(lsn, "episodes", "intro.Rmd")
intro <- Episode$new(intro_path)$confirm_sandpaper()$protect_math()

If we print out the Episode object, I'm going to get a long list of methods, fields and active bindings (functions that act like fields) printed:

intro

The actual XML content is in the $body field. This contains all the data from the markdown document, but in XML form.

intro$body

If we want to see what the contents look like, you can use the $show(), $head(), or $tail() methods (note: the $show() method will print out the entire markdown document).

intro$head(10)
intro$tail(10)
intro$show()

File information

For information about the file and its relationship to other files, you can use the following active bindings, which are useful when working with Episodes in a lesson context.

intro$path
intro$name
intro$lesson
# NOTE: relationships to other episodes are automatically handled in the
#       Lesson context
intro$has_parents
intro$has_children
intro$children # separate documents processed as if they were part of this document
intro$parents  # the immediate documents that would require this document to build
intro$build_parents # the final documents that would require this document to build

Accessing Markdown Elements

The Episode object is centered around the $body item, which contains the XML representation of document. It is possible to find markdown elements from XPath statments:

xml2::xml_find_all(intro$body, ".//md:link", ns = intro$ns)
xml2::xml_find_first(intro$body, ".//md:list[@type='ordered']", ns = intro$ns)

However, there are some useful elements that we want to know about, so I have implemented them in active bindings and methods:

# headings where level 2 headings are equivalent to sections
intro$headings
# all callouts/fenced divs
intro$get_divs()
intro$challenges
intro$solutions
# questions, objectives, and keypoints are standard and return char vectors
intro$objectives 
intro$questions
intro$keypoints
# code blocks and output types
intro$code
intro$output
intro$warning
intro$error
# images and links
intro$images
intro$get_images() # parses images embedded in `<img>` tags
intro$links

Much of these are summarized in the $summary() method:

intro$summary()

Code blocks and code chunks

In markdown, a code block is written with fences of at least three backtick characters (`) followed by the language for syntax highlighting:

List all files in reverse temporal order, printing their sizes in
a human-readable format:

```bash
ls -larth /path/to/folder
```

List all files in reverse temporal order, printing their sizes in a human-readable format:

bash ls -larth /path/to/folder

When these are processed by {pegboard}, the resulting XML has this structure where the backticks inform that kind of node (code_block) and the language type is known as the "info" attribute. Everything inside the code block is the node text and has whitespace preserved

cb <- "```bash

ls -larth /path/to/folder
```"
cbx <- xml2::read_xml(commonmark::markdown_xml(cb))
txt <- as.character(xml2::xml_find_first(cbx, ".//d1:code_block"))
writeLines(c("```xml", txt, "```"))

In R Markdown, there are special code blocks that are called code chunks that can be dynamically evaluated. These are distinguished by the curly braces around the language specifier and optional attributes that control the output of the chunk.

There is a code chunk here that will produce a plot, but not show the code:

```r
plot(1:10, type = "l")
```

There is a code chunk here that will produce a plot, but not show the code:

r plot(1:10, type = "l")

When this is processed with {pegboard}, the "info" part of the code block is further split into "language", "name" and further attributes based on the chunk options:

chunk <- 'There is a code chunk here that will produce a plot, but not show the code:

```r

plot(1:10, type = "l")
```'
tmp <- tempfile()
writeLines(chunk, tmp)
chunky <- pegboard::Episode$new(tmp)$code[[1]]
xml2::xml_set_attr(chunky, "sourcepos", NULL)
txt <- as.character(chunky)
writeLines(c("```xml", txt, "```"))
unlink(tmp)

Both code blocks will be encountered, but the difference between them is that the R Markdown code chunks will have the "language" attribute. This is an important concept to know about when you are searching and manipulating R Markdown documents with XPath (see vignette("intro-xml", package = "pegboard")). The next section will walk through some aspects of manipulation that we can do with these documents.

Manipulation

Because everything centers around the $body element and is extracted with {xml2}, it's possible to manipulate the elements of the document. One thing that is possible is that we can add new content to the document using the $add_md() method, which will add a markdown element after any paragraph in the document.

For example, we can add information about pegboard with a new code block after the first heading:

intro$head(26) # first 26 lines
intro$body # first heading is item 11
cb <- c("You can clone the **{pegboard} package**:

```sh
git clone https://github.com/carpentries/pegboard.git
```
")
intro$add_md(cb, where = 11)
intro$head(26) # code block has been added
intro$code

You can also manipulate existing elements. For example, let's say we wanted to make sure all R code chunks were named. We can do so by querying and manipulating the code blocks:

code <- intro$code
code
# executable code chunks will have the "language" attribute
is_chunk <- xml2::xml_has_attr(code, "language")
chunks <- code[is_chunk]
chunk_names <- xml2::xml_attr(chunks, "name")
nonames <- chunk_names == ""
chunk_names[nonames] <- paste0("chunk-", seq(sum(nonames)))
xml2::xml_set_attr(chunks, "name", chunk_names)
code

We can see that the chunks now have names, but the proof is in the rendering:

intro$show()

Handouts

NOTE: This will change in version 0.8.0. It is possible to generate a code handout by using the $handout() method. This will grab all challenge blocks along with any code block that contains purl = TRUE and strip out everything else. You can also specifiy solution = TRUE to include the solutions:

writeLines(intro$handout())
writeLines(intro$handout(solution = TRUE))

I want to show that the purl = TRUE method actually works, so I'll take the chunks from above and include the last one:

xml2::xml_set_attr(chunks[chunk_names == "pyramid"], "purl", TRUE)
writeLines(intro$handout())

The path arguments allows this to be written to a file so that it can be converted to a script using knitr::purl

tmp <- tempfile()
intro$handout(path = tmp)
writeLines(readLines(tmp))

Reset

One of the things about manipulating these documents in code is that it is possible to go back and reset if things are not correct, which is why we have the $reset() method:

intro$reset()$confirm_sandpaper()$protect_math()$head(25)

Jekyll Lesson Markdown Content

This section describes the features that you would expect to find in a lesson that was built with the former infrastructure, https://github.com/carpentries/styles, which was built using the Jekyll static site generator. These style lessons are no longer supported by The Carpentries. {pegboard} does support these lessons so that they can be transitioned to use The Workbench syntax via The Carpentries Lesson Transition Tool. This was the first syntax that was supported by {pegboard} because the package was written initially as a way to explore the structure of our lessons.

The Syntax of Jekyll Lessons


Special methods and active bindings

library("pegboard")
library("xml2")
library("fs")

Episodes written in the Jekyll syntax have special functions and active bindings that allow them to be analyzed and transformed to Workbench episodes. Here is an example from a lesson fragment:

lf <- lesson_fragment()
ep <- Episode$new(path(lf, "_episodes", "14-looping-data-sets.md"))
# show relevant sections of head and tail
ep$head(29)
ep$tail(53)

Notice that the questions, objectives, and keypoints are in the yaml frontmatter. This is why we have an accessor that returns the list instead of the node, for compatibility with the Jekyll lessons:

ep$questions
ep$objectives
ep$keypoints

Even though the challenges are formatted differently, the accessors will still return them correctly:

ep$challenges
ep$solutions

You can also get all of the block quotes using the $get_blocks() method. NOTE: this will extract all block quotes (including those that do not have the ktag attributes.

ep$get_blocks() # default is all top-level blocks (challenges/callouts)
ep$get_blocks(level = 2) # nested blocks are usually solutions
ep$get_blocks(level = 0) # level zero is all levels
ep$get_blocks(type = ".solution", level = 0) # filter by type

One of the things that was advantageous about blockquotes is that we could analyze the pathway through the blockquotes and figure out how they were comonly written in a lesson. The $get_challenge_graph() creates a data frame that describes these relationships:

ep$get_challenge_graph()

You might notice that there is an attribute called ktag. When a Jekyll-formatted episode is read in, all of the IAL tags are processed and placed in an attribute called ktag (kramdown tag), which is accessible via the $tags active binding. This is needed because commonmark does not know how to process postfix tags and it is important for the translation to commonmark syntax:

ep$tags
xml2::xml_parent(ep$tags)

Transformation

It was always known that we would want to use a different syntax to write the lessons as much of the community struggled with the kramdown syntax and it was difficult to parse and validate. The automated transformation workflow is what powers the Lesson Transformation Tool and we have composed it into a few basic steps:

  1. transform block quotes to fenced divs (via the $unblock() method, using the internal replace_with_div() function).
  2. removing the jekyll syntax, liquid templating, and fix relative links (via the $use_sandpaper() method, using the internal use_sandpaper() function.
  3. moving the yaml frontmatter using the $move_ methods

The process looks like this composable chain of methods:

ep$reset()
ep$
  unblock()$
  use_sandpaper()$
  move_questions()$
  move_objectives()$
  move_keypoints()
ep$head(31)
ep$tail(65)

Transformation tip!

There are times where the lesson authors forget to tag a block quote with the correct type of tag to make it a callout block. In this case, the block quote will remain unconverted in the lesson. For example let's say there was a block quote at the very top of the lesson that defined prerequisites, but the author accidentally had a new line between the blockquote and the IAL tag, meaning that the blockquote was not processed:

# add a prerequisite block after the questions and objectives
untranslated_block <- "
> ## Barnaby
>
> - a barn
> - a bee
>

{: .prereq}
"
ep$add_md(untranslated_block, where = 10)
ep$head(31)

Notice that the block quote is picked up only as a blockquote, but not a special block quote.

ep$get_blocks(".prereq")
ep$get_blocks()

If we set the ktag attribute of this block quote to "{: .prereq}", then it will be recognised as a special blockquote, which can be translated.

# remove the orphan prereq tag:
orphan_tag <- xml2::xml_find_all(ep$body, 
  ".//md:paragraph[md:text[contains(text(),'.prereq}')]]", ns = ep$ns)
xml2::xml_remove(orphan_tag)
ep$head(31)

# set the attribute of the block quote:
xml2::xml_set_attr(ep$get_blocks(), "ktag", "{: .prereq}")
ep$head(31)
ep$get_blocks(".prereq")

Now we can convert the block to a fenced div with $unblock()

ep$unblock(force = TRUE)$head(31)


carpentries/pegboard documentation built on Nov. 13, 2024, 8:53 a.m.