Identify terms present in document.

Description

This function takes as input a document which the user wishes to mine, a list of terms which they wish to identify, and an acceptance function for deciding on associations. This is the main function of the package; all others are helper functions, exported for your convenience. For full instructions on this function's usage, please see the documentation at github.com/Chris1221/goldi, or read the associated publication. We recommend it as background regardless.

Usage

goldi(doc,
  terms = "You must put your terms here if not using a precomputed TDM.",
  lims = c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 8), output, syn = FALSE,
  syn.list = NULL, object = FALSE, log = NULL, reader = "local",
  term_tdm = NULL, log.level = "warn")

Arguments

doc

Either a file path to a document which will be read in, or a string already read into R. See "reader" for more details. Depending on the "reader" selected, there are four options for document input.

terms

Either a character vector of terms, with each element being a separate term, or a file path to a newline-separated text document which may be parsed into terms.
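
As a minimal sketch of the two input forms described above, the base R snippet below builds a term vector directly and also writes the same terms to a newline-separated file. The term strings and the temporary file are illustrative assumptions, and the snippet does not call goldi itself.

```r
# Terms supplied directly as a character vector, one term per element
# (illustrative terms, not taken from the package)
terms_vec <- c("ribosomal chaperone activity", "protein folding")

# The same terms stored in a newline-separated text file
terms_file <- tempfile(fileext = ".txt")
writeLines(terms_vec, terms_file)

# Reading the file back yields one term per line
terms_from_file <- readLines(terms_file)
```

Either "terms_vec" or the path held in "terms_file" could then be supplied as "terms".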

lims

Number of identical (or synonymous) words which must be present in a sentence in order for it to be accepted as a match for the term. "interactive" is default and allows you to interactively build your own list, but a list or vector of n elements can be supplied, where n is the length in words of the longest term you wish to search for.
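
To make the role of this vector concrete, here is a rough sketch using the numeric vector shown in the Usage section. It assumes that element k of "lims" is the number of matching words required for a term that is k words long. That indexing is an assumption made for illustration, and "required_matches" is a hypothetical helper, not a goldi function.

```r
# Threshold vector as it appears in the Usage section
lims <- c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 8)

# Hypothetical helper (not part of goldi): number of matching words
# required for a term, ASSUMING lims is indexed by term length in words
required_matches <- function(term, lims) {
  n_words <- length(strsplit(trimws(term), "\\s+")[[1]])
  if (n_words > length(lims)) {
    stop("term has more words than lims has elements")
  }
  lims[n_words]
}

# A three-word term and a four-word term
three_word <- required_matches("ribosomal chaperone activity", lims)
four_word <- required_matches("positive regulation of transcription", lims)
```

Under this assumption, a term longer than the vector has no threshold, which is why the vector needs one element for every term length up to the longest term.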

output

Path to the output file.

syn

If you would like to use synonyms, set "syn = TRUE" with "syn.list" left as default to launch the interactive generator ("goldi::make.syn()"), or give a list if synonyms are already formatted.

syn.list

LIST of synonyms to be used. The first element of each list item is the word that will be counted if any of the other elements of that list item are present.
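
The list structure described above might look like the following sketch. The words are illustrative, and "canonicalise" is a hypothetical helper written only to show how the first element of each item stands in for the others. It is not how goldi is known to implement synonym handling.

```r
# Each list item: first element is the word that is counted,
# remaining elements are its synonyms (illustrative words)
syn.list <- list(
  c("ribosome", "ribosomal", "ribosomes"),
  c("chaperone", "chaperonin")
)

# Hypothetical helper (not part of goldi): map each word to the first
# element of the list item that contains it, leaving other words unchanged
canonicalise <- function(words, syn.list) {
  for (item in syn.list) {
    words[words %in% item] <- item[1]
  }
  words
}

counted <- canonicalise(c("ribosomal", "chaperonin", "activity"), syn.list)
```

A list in this shape would be passed as "syn.list" together with "syn = TRUE".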

object

Return as an R object?

log

If specified, the path to the log you wish to keep.

reader

Option for how to read in the text files. See details.

term_tdm

A precompiled term document matrix (TDM) for the terms, if one is being used instead of "terms".

log.level

Logging level. See ?flog.threshold for details.

Value

A data frame of terms and their context within the document.

Author(s)

Christopher B. Cole <chris.c.1221@gmail.com>

References

See arXiv prepublication.

Examples

## Not run: 

# Give the free form text
doc <- "In this sentence we will talk about ribosomal chaperone activity."

# Load in the included term document matrix for the terms
data("TDM.go.df")

# Pipe output and log to /dev/null
output <- "/dev/null"
log <- "/dev/null"

# Run the function
goldi(doc = doc,
      term_tdm = TDM.go.df,
      output = output,
      log = log,
      object = TRUE)


## End(Not run)