export_browser_data: Output data files for dfr-browser
In agoldst/dfrtopics: Tools for exploring topic models of text

export_browser_data

R Documentation

Output data files for dfr-browser

Description

Transform and save modeling results in a format suitable for use by dfr-browser, the web-browser based model browser. For a quick export and immediate viewing, see also dfr_browser.

Usage

export_browser_data(
  m,
  out_dir,
  zipped = TRUE,
  n_top_words = 50,
  n_scaled_words = 1000,
  supporting_files = FALSE,
  overwrite = FALSE,
  internalize = FALSE,
  info = NULL,
  proper = FALSE,
  digits = getOption("digits"),
  permute = NULL,
  metadata_header = FALSE
)

Arguments

`m`	`mallet_model` object from `train_model` or `load_mallet_model`
`out_dir`	directory for output. If `supporting_files` is TRUE, the exported data files will go in a `"data"` directory under `out_dir`.
`zipped`	should the larger data files be zipped? (If TRUE, uses `zip`, which requires the `zip` utility be available.)
`n_top_words`	how many top words per topic to save?
`n_scaled_words`	how many word types to use in scaled coordinates calculation?
`supporting_files`	if TRUE (FALSE is default), all the files needed to run the browser are copied to `out_dir`, with the exported data placed appropriately. From a shell in `out_dir`, run `bin/server` to launch a local web server.
`overwrite`	if TRUE, this will clobber existing files
`internalize`	always set to FALSE. If TRUE, model data is in the browser home page rather than separate files, but this behavior is deprecated. See Details.
`info`	a list of dfr-browser parameters. Converted to JSON with `toJSON` and stored in `info.json`. If omitted, default values (`getOption("dfrtopics.browser_info")`) are used. No file is written if `info=FALSE`.
`proper`	if TRUE, the document-topic and topic-word matrices will be smoothed by the hyperparameters alpha and beta (respectively) and normalized before export, instead of the "raw" sampling weights (which is the default). For MALLET models, the smoothed and normalized weights give the maximum a posteriori estimates of the corresponding probabilities, which is "properly" what the modeling process yields (but may disguise the effects of variations in document length, and increase the storage space required).
`digits`	if `proper` is TRUE, probabilities are rounded to this decimal place, yielding a somewhat sparser doc-topics matrix (the topic-word matrix is more aggressively truncated anyway). Set to NULL for no rounding. Rounded weights are renormalized within dfr-browser itself.
`permute`	if non-NULL, specifies a renumbering of the topics: the new topic `k` is old topic `permute[k]`. (If what you have is the inverse mapping, from old topic numbers to new ones, apply `order` to it to obtain `permute`.)
`metadata_header`	if TRUE (FALSE is default), the exported metadata CSV will have a header row (not expected by dfr-browser by default)
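
The effect of `proper`, `digits`, and `permute` can be illustrated on toy data without a trained model. The following is a minimal base-R sketch, not the package's own code: `doc_topics_raw`, `alpha`, `old_labels`, and the other names are hypothetical, and the smoothing shown is only one plausible reading of "smoothed by the hyperparameters and normalized" as described above.

```r
# Toy document-topic counts: 2 documents (rows) by 3 topics (columns).
# All names and values here are made up for illustration.
doc_topics_raw <- matrix(c(3, 0, 1,
                           0, 5, 0), nrow = 2, byrow = TRUE)
alpha <- c(0.01, 0.01, 0.01)   # hypothetical per-topic hyperparameter

# "proper" weights: add alpha to each topic column, then normalize each row
smoothed <- sweep(doc_topics_raw, 2, alpha, "+")
doc_topics_proper <- smoothed / rowSums(smoothed)

# rounding (cf. the digits argument) can turn small probabilities into
# exact zeros, which is what makes the rounded matrix sparser
doc_topics_rounded <- round(doc_topics_proper, 2)

# permute: new topic k is old topic permute[k]
permute <- c(3L, 1L, 2L)
old_labels <- c("old1", "old2", "old3")
new_labels <- old_labels[permute]

# order() of a permutation is its inverse: old topic j becomes
# new topic old_to_new[j], and order(old_to_new) gives permute back
old_to_new <- order(permute)
```

Here `new_labels` lists the old topics in their new order, and `order(old_to_new)` recovers `permute`, which is the conversion the `permute` entry refers to.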

Details

This routine reports on its progress. By default, it saves zipped versions of the document-topics matrix and metadata files; dfr-browser supports client-side unzipping. Files are compressed with R's zip function, which depends on an external zip utility being available. If that fails, set zipped=FALSE (and, if you wish, zip the files using another program).
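
One way to avoid a failed export is to check for the zip utility first. This is a hedged sketch: it assumes the export relies on `utils::zip`, whose default command is taken from the `R_ZIPCMD` environment variable (falling back to `zip`). The helper names `zip_available` and `export_with_zip_fallback` are hypothetical and not part of dfrtopics.

```r
# TRUE if the external program that utils::zip() would call by default
# can be found on the search path.
zip_available <- function() {
    zip_cmd <- Sys.getenv("R_ZIPCMD", "zip")
    unname(nzchar(Sys.which(zip_cmd)))
}

# Hypothetical wrapper: export unzipped files when no zip program is found.
# Calling it requires dfrtopics and a model object m.
export_with_zip_fallback <- function(m, out_dir, ...) {
    dfrtopics::export_browser_data(m, out_dir = out_dir,
                                   zipped = zip_available(), ...)
}
```

Unzipped output can still be compressed afterwards with any other program, as noted above.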

A detailed description of the output files can be found in the dfr-browser technical notes at http://github.com/agoldst/dfr-browser.

This package includes a copy of the dfr-browser files necessary to run the browser. By default, this routine only exports data files. To also copy over the dfr-browser source (JavaScript, HTML, and CSS), pass supporting_files=TRUE.

Metadata format

If you are working with non-JSTOR documents, the one file that will reflect this is the exported metadata. dfr-browser expects eight metadata columns by default: id,title,author,journaltitle,volume,issue,pubdate,pagerange. This function looks for these eight columns and, if it finds them all, writes the metadata with these columns first, in this order. Any remaining columns are pushed all the way to the right of the output. (dfr-browser ignores them unless you customize it.) If any of these columns is not present in metadata(m), then export_browser_data will simply save all the metadata as is, adjusting only the CSV format to match the baseline expectation of dfr-browser (namely, a headerless CSV conforming to RFC 4180).
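
The column handling described above can be previewed on a plain data frame before exporting. This is a minimal base-R sketch of the described behavior, not the package's actual implementation; the function name `arrange_metadata` and the toy data frame are hypothetical.

```r
# Column names dfr-browser expects by default, in order
expected <- c("id", "title", "author", "journaltitle",
              "volume", "issue", "pubdate", "pagerange")

# Hypothetical helper mimicking the described rule: if every expected
# column is present, put them first in the expected order and push the
# rest to the right; otherwise return the metadata unchanged.
arrange_metadata <- function(md) {
    if (all(expected %in% names(md))) {
        md[, c(expected, setdiff(names(md), expected)), drop = FALSE]
    } else {
        md
    }
}

# Toy metadata with an extra column in front (values are made up)
md <- data.frame(publisher = "P", pagerange = "1-10", pubdate = "1990",
                 issue = "2", volume = "5", journaltitle = "J",
                 author = "A", title = "T", id = "doc1",
                 stringsAsFactors = FALSE)
arranged <- arrange_metadata(md)
```

With the toy data, `names(arranged)` starts with the expected columns in order and ends with `publisher`; dropping any expected column from `md` leaves the column order untouched.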

If your metadata does not match these expectations, an alternative is to set dfr-browser's configuration parameters VIS.metadata.type and VIS.bib.type to "base" (using the info parameter) and to write out a metadata file with a header by passing metadata_header=TRUE to this function or dfr_browser. For polished results, more customization of dfr-browser might be necessary.
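
A sketch of what such a call might look like follows. It rests on an assumption that should be checked against the dfr-browser technical notes: that the dotted names VIS.metadata.type and VIS.bib.type correspond to nested lists under a `VIS` element of `info`. The names `browser_info` and `export_base_metadata` are hypothetical.

```r
# Assumed layout: nested lists mirroring the dotted parameter names
# VIS.metadata.type and VIS.bib.type (verify against dfr-browser's notes).
browser_info <- list(
    VIS = list(
        metadata = list(type = "base"),
        bib = list(type = "base")
    )
)

# Hypothetical wrapper: export with generic metadata handling and a
# header row. Calling it requires dfrtopics and a model object m.
export_base_metadata <- function(m, out_dir, ...) {
    dfrtopics::export_browser_data(m, out_dir = out_dir,
                                   info = browser_info,
                                   metadata_header = TRUE, ...)
}
```

Because a supplied `info` takes the place of the defaults in `getOption("dfrtopics.browser_info")`, you may want to add any other settings from those defaults that you still need to this list.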

Note that you can adjust the metadata held on the model object by assigning to metadata(m) before exporting the browser data. In particular, if you have many documents, you may wish to conserve space by eliminating metadata columns that are not used by the visualization: for example, metadata(m)$publisher <- NULL. Earlier versions of dfrtopics tried to eliminate such columns automatically, but this more conservative approach aims to allow you more flexibility about what gets exported.

Deprecated option

To insert the data directly into the main index.html file, pass internalize=TRUE. This behavior is now deprecated and will be removed in a future version.

Examples


## Not run: 
m <- model_dfr_documents("citations.CSV", "wordcounts",
    "stoplist.txt", n_topics=40)

# export the model data plus all files needed to run the browser
export_browser_data(m, out_dir="browser", supporting_files=TRUE)

# or: overwrite model data only for an already-existing browser
export_browser_data(m, out_dir="browser/data",
    supporting_files=FALSE, overwrite=TRUE)

## End(Not run)

agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.