Multi-hypothesis parsing of CSV files.

Description

hypoparsr takes a different approach to CSV parsing: it creates multiple parsing hypotheses for a given file and ranks them by data quality features. parse_file creates and returns the ranked parsing results.

Usage

parse_file(file, pruning_level = 0.1,
           quality_weights = c(warnings = -1, edits = -1, moves = -1,
                               confidence = 1, total_cells = 1,
                               typed_cells = 1, empty_header = -1,
                               empty_cells = -1, non_latin_chars = -1,
                               row_col_ratio = 1))

Arguments

file

Path to a CSV file.

pruning_level

Numeric value between 0 and 1 that defines the lower threshold for the confidence values of parsing hypotheses. The higher the value, the fewer hypotheses are created, which increases the risk that the correct hypothesis is omitted.

quality_weights

A named numeric vector (or list) of quality feature weights that influence the hypothesis ranking. Positive weights improve the ranking of results with the respective characteristic; negative weights penalize them.
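
To illustrate how signed weights can influence a ranking, here is a minimal base R sketch. It assumes a plain weighted sum over a few features; the feature values and hypothesis names (h1 to h3) are made up, and the scoring that hypoparsr actually performs internally (for example, any normalization of features) may differ.

```r
# Hypothetical feature values for three parsing hypotheses (made-up numbers).
features <- data.frame(
  warnings    = c(0, 2, 5),
  empty_cells = c(1, 0, 10),
  typed_cells = c(40, 38, 12),
  row.names   = c("h1", "h2", "h3")
)

# Weights in the same named-vector form as the quality_weights argument.
weights <- c(warnings = -1, empty_cells = -1, typed_cells = 1)

# Weighted sum per hypothesis; higher is better under this simplified model.
score <- drop(as.matrix(features[, names(weights)]) %*% weights)

# Hypothesis names ordered from best to worst score.
ranking <- names(score)[order(score, decreasing = TRUE)]
print(ranking)
```

In this simplified model, raising the magnitude of a negative weight such as empty_cells pushes hypotheses with many empty cells further down the ranking.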

Value

A hypoparsr_result, which contains all created hypotheses and their ranking. Call as.data.frame() on this object to retrieve the highest ranked parsing result.

Examples

# generate a CSV
csv <- tempfile()
write.csv(iris, csv, row.names = FALSE)

# call hypoparsr
res <- hypoparsr::parse_file(csv)

# get result data frames
best_guess <- as.data.frame(res)
second_best_guess <- as.data.frame(res, rank = 2)
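
The documented arguments can also be set explicitly. The following sketch uses only the parameters listed under Arguments; the specific values (pruning_level = 0.2 and a doubled penalty for empty_header) are arbitrary choices for illustration, and the full weight vector is passed because it is an assumption, not a documented fact, that a partial vector would be accepted.

```r
# Generate a CSV to parse.
csv <- tempfile(fileext = ".csv")
write.csv(iris, csv, row.names = FALSE)

# Default weights from the Usage section, with a stronger (illustrative)
# penalty for empty header cells.
weights <- c(warnings = -1, edits = -1, moves = -1, confidence = 1,
             total_cells = 1, typed_cells = 1, empty_header = -2,
             empty_cells = -1, non_latin_chars = -1, row_col_ratio = 1)

# Only run the parser if the package is installed.
if (requireNamespace("hypoparsr", quietly = TRUE)) {
  # A higher pruning level creates fewer hypotheses.
  res <- hypoparsr::parse_file(csv, pruning_level = 0.2,
                               quality_weights = weights)
  best_guess <- as.data.frame(res)
}
```

A higher pruning_level trades completeness for speed, so it is worth comparing the result against a run with the default value when the output looks wrong.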
