auk_clean: Clean an EBD file
In mstrimas/auk: eBird Data Processing with AWK


Some rows in the eBird Basic Dataset (EBD) may have an incorrect number of columns, often resulting from tabs embedded in the comments field. This function drops these problematic records. Note that this function typically takes at least 3 hours to run on the full EBD.
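The kind of problem described above can be spotted from R before any cleaning is done. The following is a minimal sketch, not the method `auk_clean()` itself uses: it relies on base R's `count.fields()` and a small made-up file whose column names and values are illustrative only, not real EBD fields.

```r
# write a tiny tab separated file; row 3 has a stray tab in its comments
# (column names and values are invented for illustration)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tspecies\tcomments",
  "1\tMallard\tseen at pond",
  "2\tOsprey\tnest\twith stray tab",
  "3\tDunlin\tflock of 40"
), tmp)

# count the fields on every line, disabling quote and comment handling
# so that free text is not misinterpreted
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# treat the header's field count as the expected number of columns
expected <- n_fields[1]
bad_rows <- which(n_fields != expected)
unlink(tmp)
```

Here `bad_rows` holds the line numbers (counting the header as line 1) whose field count differs from the header.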

auk_clean(f_in, f_out, sep = "\t", remove_blank = TRUE, overwrite = FALSE)

`f_in`	character; input file.
`f_out`	character; output file.
`sep`	character; the input field separator. The EBD is tab separated, which is the default. Must be a single character; a space is not allowed as the separator because spaces appear in many of the fields.
`remove_blank`	logical; whether the trailing blank should be removed from the end of each row. The EBD comes with an extra tab at the end of each line, which causes an extra blank column.
`overwrite`	logical; whether to overwrite the output file if it already exists.
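To see why `remove_blank` matters, the sketch below reads two small in-memory tables with base R's `read.delim()`, one with a trailing tab on every line and one without. The data are invented for illustration and are not EBD records.

```r
# same table twice: once with a trailing tab on each line, once without
with_trailing    <- "a\tb\t\n1\t2\t\n3\t4\t\n"
without_trailing <- "a\tb\n1\t2\n3\t4\n"

# the trailing tab is parsed as one additional, empty field per line
n_with    <- ncol(read.delim(text = with_trailing, quote = ""))
n_without <- ncol(read.delim(text = without_trailing, quote = ""))

# difference in the number of columns between the two versions
n_with - n_without
```

The version with trailing tabs is read with one more column than the version without, which is the blank column that `remove_blank = TRUE` is meant to drop.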

This function can clean an EBD file or an EBD sampling file.

Calling this function requires that the command line utility AWK is installed. Linux and Mac machines should have AWK by default; Windows users will likely need to install Cygwin.
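Whether an AWK executable is visible to R can be checked up front. This is a generic base R sketch using `Sys.which()`; it is an assumption that searching the `PATH` for `awk` or `gawk` is sufficient, and it is not necessarily how the package itself locates AWK (a Cygwin install outside the `PATH` would not be found this way).

```r
# look for common AWK executable names on the PATH;
# Sys.which() returns "" for names that are not found
awk_path <- Sys.which(c("awk", "gawk"))
awk_found <- any(nzchar(awk_path))

if (awk_found) {
  message("AWK found at: ", awk_path[nzchar(awk_path)][1])
} else {
  message("AWK not found on the PATH; it may need to be installed")
}
```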

If AWK ran without errors, the output filename is returned. If an error was encountered, the exit code is returned instead.
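Because the return value is either a filename or an exit code, calling code may want to distinguish the two cases. The helper below, `check_clean_result()`, is hypothetical and not part of the package; it assumes the filename is returned as a character value and the exit code as a number.

```r
# hypothetical helper (not part of auk): interpret the value
# returned by auk_clean()
check_clean_result <- function(result) {
  if (is.character(result)) {
    # success: result is the path to the cleaned file
    message("cleaned file written to: ", result)
    invisible(TRUE)
  } else {
    # failure: result is assumed to be the AWK exit code
    warning("AWK exited with code ", result)
    invisible(FALSE)
  }
}
```

It could be used as `ok <- check_clean_result(auk_clean(f, tmp))`, stopping the script early when `ok` is `FALSE`.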

## Not run: 
# example data with errors
f <- system.file("extdata/ebd-sample_messy.txt", package = "auk")
tmp <- tempfile()

# clean file to remove problem rows
auk_clean(f, tmp)
# number of lines in input
length(readLines(f))
# number of lines in output
length(readLines(tmp))

# note that the extra blank column has also been removed
ncol(read.delim(f, nrows = 5, quote = ""))
ncol(read.delim(tmp, nrows = 5, quote = ""))
unlink(tmp)

## End(Not run)