parseDoc: parse a document and place content in a DocSet

View source: R/parseDoc.R

Description

Parse a document (supplied as a CSV file) and place its content in a DocSet.

Usage

parseDoc(csv, DocSetInstance = new("DocSet"), doctitle = NA_character_,
  docabst = NA_character_, rec_id_field = "experiment.accession",
  exclude_fields = c("study.accession"),
  substrings_to_omit = c("http://purl.obolibrary.org/obo/"),
  patterns_to_kill = "....-..-..|.*...,...",
  token_fixups = list(c("t''", "t'"), c(":$", "")), max_tok_nchar = 25,
  min_tok_nchar = 4, cleanFields = list("..*id$", ".name$", "_name$",
  "checksum", "isolate", "filename", "^ID$", "barcode", "Sample.Name"))

Arguments

csv

a character(1) CSV file path

DocSetInstance

if missing, a new DocSet is initialized in this function; otherwise the supplied instance is updated with new content

doctitle

character(1) document title

docabst

character(1) document abstract

rec_id_field

character(1) field in CSV identifying records

exclude_fields

character vector of fields to ignore while parsing

substrings_to_omit

character vector of strings to remove from candidate keywords via gsub

patterns_to_kill

character(1) regexp that identifies tokens to be omitted from keyword set

token_fixups

a list of character(2) vectors that will be used with gsub() to repair irregularities. For example, c("t''", "t'") will transform "Burkitt''s" to "Burkitt's"

max_tok_nchar

numeric(1) defaults to 25, tokens with more characters will be truncated to this length and suffixed with ellipsis

min_tok_nchar

numeric(1) defaults to 4, tokens shorter than this are not included in the index

cleanFields

list of regular expressions identifying fields to ignore
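
To make the token-related arguments more concrete, the following base-R sketch approximates what they describe. It is not the package's implementation: clean_tokens() and keep_fields() are hypothetical helpers, the order of the steps is an assumption, and substrings_to_omit is treated as fixed strings here. Only the default values are taken from the Usage section.

```r
# Hypothetical helper (not part of ssrch): approximates the documented
# effect of the token-cleaning arguments on a character vector of tokens.
clean_tokens <- function(tokens,
    substrings_to_omit = c("http://purl.obolibrary.org/obo/"),
    patterns_to_kill = "....-..-..|.*...,...",
    token_fixups = list(c("t''", "t'"), c(":$", "")),
    max_tok_nchar = 25, min_tok_nchar = 4) {
  # strip unwanted substrings (assumption: matched as fixed strings)
  for (s in substrings_to_omit) tokens <- gsub(s, "", tokens, fixed = TRUE)
  # drop tokens matching the kill pattern (e.g. date-like strings)
  tokens <- tokens[!grepl(patterns_to_kill, tokens)]
  # apply each (pattern, replacement) pair with gsub()
  for (fx in token_fixups) tokens <- gsub(fx[1], fx[2], tokens)
  # drop tokens shorter than min_tok_nchar
  tokens <- tokens[nchar(tokens) >= min_tok_nchar]
  # truncate tokens longer than max_tok_nchar and append an ellipsis
  long <- nchar(tokens) > max_tok_nchar
  tokens[long] <- paste0(substr(tokens[long], 1, max_tok_nchar), "...")
  tokens
}

# Hypothetical helper: keep only field names matching none of the
# cleanFields regular expressions.
keep_fields <- function(fields, cleanFields) {
  pattern <- paste(unlist(cleanFields), collapse = "|")
  fields[!grepl(pattern, fields)]
}

toks <- c("http://purl.obolibrary.org/obo/UBERON_0002107", "2015-06-30",
          "Burkitt''s", "lymphoma:", "RNA",
          "a_very_long_token_that_exceeds_the_limit")
cleaned <- clean_tokens(toks)
print(cleaned)

kept <- keep_fields(
  c("sample_id", "cell.line", "library_name", "barcode", "disease"),
  list("..*id$", ".name$", "_name$", "checksum", "isolate", "filename",
       "^ID$", "barcode", "Sample.Name"))
print(kept)
```

With these inputs, the date-like token and the three-character token are dropped, the doubled apostrophe and trailing colon are repaired, and the 40-character token is cut to 25 characters plus an ellipsis. The field filter keeps only "cell.line" and "disease".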

Value

instance of DocSet

Note

The expected use case has 'DocSetInstance' being updated in a loop, with each call adding one parsed document. Environments can be shared across multiple DocSet instances, which may lead to unexpected behavior. Note also that many of the parameter defaults of parseDoc are chosen for the use case of processing SRA metadata.
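
The warning about shared environments follows from how R environments behave in general. The sketch below uses only base R and assumes, as the use of ls(docs2kw(...)) in the Examples suggests, that a DocSet keeps its indexes in environments. The names idx_a, idx_b and idx_c are illustrative and are not part of ssrch.

```r
# Environments have reference semantics: assignment does not copy them.
idx_a <- new.env()
idx_b <- idx_a          # idx_b is the same environment, not a copy

# Adding an entry through one name is visible through the other.
assign("doc1", c("lymphoma", "Burkitt's"), envir = idx_a)
shared <- exists("doc1", envir = idx_b, inherits = FALSE)
print(shared)           # TRUE

# A separately created environment is independent.
idx_c <- new.env()
independent <- exists("doc1", envir = idx_c, inherits = FALSE)
print(independent)      # FALSE
```

If two DocSet instances end up holding the same environment, content parsed into one will also appear in the other, which is the kind of unexpected behavior the note describes.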

Examples

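library(ssrch)
myob = ssrch::docset_cancer68
td = tempdir()
alld = ls(docs2kw(myob))
r1 = retrieve_doc(alld[1], myob)
# write the retrieved document to a temporary CSV file;
# write.csv() returns NULL, so its result is not assigned
csvpath = file.path(td, "expo.csv")
write.csv(r1, csvpath)
# parse the CSV into a new DocSet, supplying a title and a dummy abstract
pd = parseDoc(csvpath, doctitle=ssrch::titles68[alld[1]],
    docabst="qwerty")
pd
searchDocs("quer", pd) # query will fail
searchDocs("qwer", pd) # should succeed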

ssrch documentation built on Nov. 8, 2020, 5:39 p.m.