clean_file: Clean File


Description

A wrapper around the clean.py script in /inst/python/. Specify the path to the file you want to clean, an output directory, and a string containing the commands you want to send to the cleaning script. Be sure to pass the --maintain-newlines option if your files are in a format where many documents are stored in one text file, delimited by newlines. For convenience, all the commands that can be passed to the Python script are listed here.

-l : if words should be lowercased
-n : if digits should be stripped
-p : if punctuation should be stripped
-r : if roman numerals should be stripped
-s : if stop words should be stripped
-d : if non dictionary words should be stripped
-t : if tweet specific cleaning options should be used
--additional : if you want to add all stopwords and dictionary files
--no-usernames : remove Twitter usernames (@<name>)
--maintain-newlines : use space for delim instead of default (newline)
--min-size [N] : specify the minimum size for a token (default=2)
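
Because the commands are passed as a single string, it can help to assemble that string programmatically. The following is a minimal sketch, not part of the package: `build_clean_commands` and its arguments are hypothetical names introduced here for illustration. It assumes that short flags can be combined behind one dash, as the calls in the Examples section do (for instance "-lnp").

```r
# Hypothetical helper (not part of the package): assemble a value for
# clean_commands_str from the short flags and long options listed above.
build_clean_commands <- function(short_flags = character(0),
                                 maintain_newlines = FALSE,
                                 min_size = NULL) {
  parts <- character(0)
  if (length(short_flags) > 0) {
    # Combine short flags behind a single dash, e.g. c("l", "n") -> "-ln".
    parts <- c(parts, paste0("-", paste(short_flags, collapse = "")))
  }
  if (maintain_newlines) {
    parts <- c(parts, "--maintain-newlines")
  }
  if (!is.null(min_size)) {
    parts <- c(parts, paste("--min-size", min_size))
  }
  # Join all pieces with single spaces; an empty input gives "".
  paste(parts, collapse = " ")
}

cmds <- build_clean_commands(c("l", "n", "p"),
                             maintain_newlines = TRUE,
                             min_size = 3)
print(cmds)
```

The resulting string could then be passed as the third argument of clean_file.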

Usage

clean_file(ifile, odir, clean_commands_str)

Arguments

ifile

A string containing the path to the input file.

odir

A string containing the path to the output directory.

clean_commands_str

A string containing the combined commands for the cleaning script.

Value

A string containing the name of the file that was cleaned.
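
Since each call returns one string, results from several input files can be collected into a character vector. The sketch below is an illustration under stated assumptions: `clean_many` and its `cleaner` argument are hypothetical and not part of the package, and because this page does not say whether clean_file creates a missing output directory, the sketch creates it defensively.

```r
# Hypothetical helper (not part of the package): apply a cleaning function
# to several input files and collect the returned file names.
# `cleaner` defaults to clean_file; it is a parameter so that the helper
# can be tried out with a stand-in function.
clean_many <- function(ifiles, odir, clean_commands_str, cleaner = clean_file) {
  # Assumption: the output directory may not exist yet, so create it.
  if (!dir.exists(odir)) {
    dir.create(odir, recursive = TRUE)
  }
  # Each call is expected to return a single string.
  vapply(ifiles,
         function(f) cleaner(f, odir, clean_commands_str),
         character(1),
         USE.NAMES = FALSE)
}

# Stand-in for clean_file, used only to exercise the helper without Python.
stub_cleaner <- function(ifile, odir, clean_commands_str) basename(ifile)

cleaned <- clean_many(c("a.txt", "b.txt"), tempfile("cleaned"), "-lnp",
                      cleaner = stub_cleaner)
print(cleaned)
```

With the package loaded, the `cleaner` argument can be left out so that clean_file itself is called.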

Examples

## Not run: 
clean_file("myfile.txt", "./cleaned/", "-d")
clean_file("myfile.txt", "./cleaned/", "-lnp")
clean_file("myfile.txt", "./cleaned/", "-lnprsdt")
clean_file("myfile.txt", "./cleaned/", "-lnprsdt --additional")
clean_file("myfile.txt", "./cleaned/", "-lnprsdt --tags --maintain-newlines --min-size 3")

## End(Not run)

avkoehl/textprocessingDSI documentation built on June 5, 2019, 7:41 p.m.