This document presents corporaexplorer's three main functions:

  1. prepare_data() converts a data frame to a "corporaexplorerobject".
  2. explore() runs the package's core feature, a Shiny app for fast and flexible exploration of a "corporaexplorerobject".
  3. run_document_extractor() runs a Shiny app for simple retrieval/extraction of documents from a "corporaexplorerobject" in a reading-friendly format.

See the reference section for full details and all available options.

1. Prepare data for the Shiny apps

The prepare_data() function returns a "corporaexplorerobject" that can be explored in the package's two Shiny apps.

The three most important arguments are:

The rest of the arguments can be used to fine-tune the presentation of the corpora in the corporaexplorer apps.

prepare_data() can also be run with a character vector as its only argument. In this case the function returns a simple "corporaexplorerobject" with no metadata.
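As a minimal sketch of the two call styles described above, the following builds a toy data frame and a character vector and passes each to prepare_data(). The column names Date, Title and Text are assumptions made for illustration, not requirements stated in this document; ?prepare_data lists what your installed version expects. The calls are skipped when corporaexplorer is not installed.

```r
# Toy corpus. The column names Date, Title and Text are assumptions
# (check ?prepare_data for the columns your version requires).
toy_df <- data.frame(
  Date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  Title = c("First", "Second", "Third"),
  Text = c("Arctic policy.", "Antarctic research.", "Nothing relevant."),
  stringsAsFactors = FALSE
)

# The same texts as a plain character vector (no metadata).
toy_texts <- toy_df$Text

if (requireNamespace("corporaexplorer", quietly = TRUE)) {
  # Data frame input: metadata columns are available to the apps.
  corpus_from_df <- corporaexplorer::prepare_data(toy_df)
  # Character vector input: a simple object with no metadata.
  corpus_from_vector <- corporaexplorer::prepare_data(toy_texts)
}
```

Either resulting object can then be passed to the two Shiny apps.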

After installing corporaexplorer, run the following in the R console to see full documentation for the prepare_data() function.

library(corporaexplorer)
?prepare_data

2. The corpus exploration app

Start the app by running the explore() function with a "corporaexplorerobject" created by prepare_data() as argument. Run the following in the R console to see documentation for the explore() function.

library(corporaexplorer)
?explore

The default arguments are recommended for most use cases.

Although it should be possible to use the app without reading any further, the rest of this section covers the user interface and some details of the app's inner workings that are relevant for advanced users. A date-based corpus is used as the example.

2a. Sidebar input

Note: Text input -- regular expressions

All text input will be treated as regular expressions (or regexes). Regular expressions can be very powerful for identifying exactly the text patterns one is interested in, but this power comes at a high complexity cost. That said, for simple searches that do not include punctuation, all one needs to know is basically this:

Thus, (in a case insensitive search):

arctic  # will match both "Arctic" and "Antarctic"
\barctic  # will match only "Arctic"

civili.ation  # will match both "civilisation" and "civilization"

For more about regex syntax and the regex flavours available, see the section about regex engines below.

(N.B. As seen in the example, a single backslash (not a double backslash as in the R console) is used as the escape character. For example, \. will match a literal "." and \d will match any digit.)
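The patterns above can be tried out in the R console. This is only a sketch: it uses base R's grepl() with perl = TRUE as a stand-in for the app's own regex engines, and the helper name matches() is made up for illustration. Note the double backslashes, which are needed in R string literals but not in the app's input fields.

```r
words <- c("Arctic", "Antarctic", "civilisation", "civilization")

# Hypothetical helper: return the words that match a pattern,
# ignoring case. perl = TRUE makes \b behave as a word boundary.
matches <- function(pattern) {
  words[grepl(pattern, words, ignore.case = TRUE, perl = TRUE)]
}

matches("arctic")        # "Arctic" "Antarctic"
matches("\\barctic")     # "Arctic"   (typed as \barctic in the app)
matches("civili.ation")  # "civilisation" "civilization"
matches("\\.")           # character(0): no literal "." in these words
```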

Note: Additional search arguments

corporaexplorer offers two optional arguments that can be used separately or together by adding them to the end of a search pattern (with no space between):

  1. The "threshold argument" has the syntax --threshold and sets the minimum number of search hits a day/document must contain in order to be coloured in the corpus map:
Russia--10  # Will find documents that include the pattern "Russia" at least 10 times.
  2. The "column argument" has the syntax --column_name and allows for searches in columns other than the default full text column:
Russia--Title  # Will find documents that have the pattern "Russia" in their "Title" column.
  3. The two arguments can be combined in any order:
Russia--2--Title
Russia--Title--2
# Will both find documents that include the pattern "Russia" at least 2 times
# in the Title column.
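To make the syntax concrete, here is a rough sketch of how such a search term could be split into its parts. This is not the package's own parser: the function name parse_search_term(), the default threshold of 1 and the default column name "Text" are all assumptions made for illustration.

```r
# Hypothetical parser for "pattern--threshold--column" search terms.
parse_search_term <- function(term) {
  parts <- strsplit(term, "--", fixed = TRUE)[[1]]
  pattern <- parts[1]
  extras <- parts[-1]

  # An all-digit extra is read as the threshold, anything else as a column.
  is_number <- grepl("^[0-9]+$", extras)
  threshold <- if (any(is_number)) as.integer(extras[is_number][1]) else 1L
  column <- if (any(!is_number)) extras[!is_number][1] else "Text"

  list(pattern = pattern, threshold = threshold, column = column)
}

parse_search_term("Russia--2--Title")
parse_search_term("Russia--Title--2")  # same result: order does not matter
parse_search_term("Russia")            # assumed defaults: threshold 1, column "Text"
```

The sketch ignores edge cases such as a pattern that itself contains "--".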

These arguments have the following consequences:

2b. Corpus map

The result of the search is an interactive heat map, a corpus map, where the filling indicates how many times the search term is found (legend above the plot).

In the calendar view (only for date-based corpora), each tile represents a day, and the filling indicates how many times the search term is found in the documents that day:

knitr::include_graphics("../man/figures/first_search.png")


In the document wall view, each tile represents one document, and the filling indicates how many times the search term is found in this document:

knitr::include_graphics("../man/figures/wall_1.png")
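The numbers behind the two views can be approximated in a few lines of base R. The sketch below is an illustration only, not the app's implementation: the toy data, the column names Date and Text, and the helper count_hits() are assumptions.

```r
docs <- data.frame(
  Date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
  Text = c("Russia and Russia", "No match here", "russia once"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of case-insensitive matches in each text.
count_hits <- function(texts, pattern) {
  found <- gregexpr(pattern, texts, ignore.case = TRUE, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positions > 0.
  vapply(found, function(m) sum(m > 0), integer(1))
}

# Document wall view: one value per document.
docs$hits <- count_hits(docs$Text, "russia")

# Calendar view: hits summed over all documents on the same day.
hits_per_day <- aggregate(hits ~ Date, data = docs, FUN = sum)

docs$hits          # 2 0 1
hits_per_day$hits  # 2 1
```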


The Corpus info tab presents some very basic summary statistics of the search results. (For quantitative text analysis, see e.g. the excellent R packages quanteda and tidytext. Using such packages together with corporaexplorer is highly recommended in order to combine qualitative and quantitative insights.)

Clicking on a tile in the corpus map opens the document view to the right of the corpus map.

2c. Document view

When in calendar view: Clicking on a day creates a second heat map tile chart where each tile is one document, and where the colour of a tile indicates how many times the search term is found in that document. The box below the chart lists the titles of that day's documents.

knitr::include_graphics("../man/figures/day_corpus.png")


Clicking on a "document tile" produces two things. First, the full text of the document is shown with search terms highlighted. Second, a tile chart consisting of n tiles appears above the text, where each tile represents a 1/n part of the document and the colour of a tile indicates whether and how many times the search term is found in that part. Clicking on a tile scrolls the document to the corresponding part.

knitr::include_graphics("../man/figures/wall.png")
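The idea behind the n-tile chart can be sketched as follows: cut the text into n parts of roughly equal length and count the hits in each part. This is an illustration under simplifying assumptions (parts are cut by character count, and a match that straddles two parts is missed); the function name count_hits_per_part() is hypothetical and the app's own splitting may differ.

```r
# Hypothetical helper: hits per 1/n part of a single document.
count_hits_per_part <- function(text, pattern, n = 10) {
  # Cut points that divide the text into n parts by character position.
  breaks <- floor(seq(0, nchar(text), length.out = n + 1))
  starts <- breaks[-(n + 1)] + 1
  ends <- breaks[-1]
  parts <- substring(text, starts, ends)

  vapply(parts, function(part) {
    m <- gregexpr(pattern, part, ignore.case = TRUE, perl = TRUE)[[1]]
    sum(m > 0)  # gregexpr() gives -1 when nothing is found
  }, integer(1), USE.NAMES = FALSE)
}

count_hits_per_part("ab ab cd cd", "ab", n = 2)  # 2 0
```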

When in document wall view: Clicking on a tile in the corpus map leads straight to the relevant document.

2d. Advanced detail: Regular expression engines

explore() lets you choose among three regex engine setups:

  1. default: use the re2 package for simple searches and the stringr package for complex regexes (details below). This is the recommended option.
  2. use stringr for all searches.
  3. use re2 for all searches.

re2 is very fast but has a more limited feature set than stringr, especially in handling non-ASCII text, including word boundary detection. With the default option, the re2 engine is run when no special regex characters are used; otherwise stringr is used. This option should fit most use cases.

Please consult the documentation for re2 and stringr for full information about syntax flavours.

By default, searches for patterns consisting of a single word and without special characters will be carried out in a document term matrix. Other searches are carried out in full text.

Advanced users can set the optional_info parameter in explore() to TRUE. This prints the following information to the console for each input term: which regex engine was used, and whether the search was carried out in the document term matrix or in the full text documents.
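The default behaviour described in this section can be summarised as a small decision function. This is a sketch of the rules as stated above, not the package's code: the function name choose_search_path() and the exact set of characters treated as "special" are assumptions.

```r
# Hypothetical summary of the default engine and search target rules.
choose_search_path <- function(pattern) {
  # Assumed set of special regex characters; the package's list may differ.
  special <- c("\\", ".", "^", "$", "|", "?", "*", "+",
               "(", ")", "[", "]", "{", "}")
  chars <- strsplit(pattern, "", fixed = TRUE)[[1]]
  has_special <- any(chars %in% special)
  single_word <- !grepl("[[:space:]]", pattern)

  list(
    # re2 for plain patterns, stringr as soon as regex syntax is used.
    engine = if (has_special) "stringr" else "re2",
    # Single plain words can be looked up in the document term matrix.
    target = if (single_word && !has_special) {
      "document term matrix"
    } else {
      "full text"
    }
  )
}

choose_search_path("arctic")          # re2, document term matrix
choose_search_path("\\barctic")       # stringr, full text
choose_search_path("climate change")  # re2, full text
```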

3. The download documents app

This app is a simple helper app to retrieve a subset of the corpus in a format suitable for close reading.

knitr::include_graphics("../man/figures/download_tool.png")

Start the app by running the run_document_extractor() function with a "corporaexplorerobject" created by prepare_data() as argument. Run the following in the R console to see documentation for the run_document_extractor() function.

library(corporaexplorer)
?run_document_extractor

3a. Sidebar input

By default, at most 400 documents can be included in one report. This limit can be changed with the max_html_docs parameter in run_document_extractor().
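For instance, the limit could be raised as in the sketch below. The value 1000 and the toy corpus are arbitrary choices for illustration, and the call is only made in an interactive session where corporaexplorer is installed.

```r
# Arbitrary example value for the maximum number of documents per report.
report_limit <- 1000

if (interactive() && requireNamespace("corporaexplorer", quietly = TRUE)) {
  # Toy corpus built from a character vector (no metadata).
  corpus <- corporaexplorer::prepare_data(c("First text.", "Second text."))
  corporaexplorer::run_document_extractor(corpus, max_html_docs = report_limit)
}
```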

3b. Advanced detail: Regular expression engines

Speed is considered to be of less importance in this app, and all searches are carried out as full text searches with stringr. Again, note that a single backslash is used as the escape character. For example, \. will match a literal "." and \d will match any digit.




kgjerde/corporaexplorer documentation built on July 3, 2023, 7:02 p.m.