A wrapper around the clean.py script in /inst/python/.
Specify the path to the file you want to clean, an output directory, and a string
containing the commands you want to send to the cleaning script.
Be sure to pass the --maintain-newlines option if your files are in a format
where many documents are stored in one text file, delimited by newlines.
For convenience, all the possible commands that can be passed to that Python
script are listed here.
-l | : if words should be lowercased |
-n | : if digits should be stripped |
-p | : if punctuation should be stripped |
-r | : if roman numerals should be stripped |
-s | : if stop words should be stripped |
-d | : if non dictionary words should be stripped |
-t | : if tweet specific cleaning options should be used |
--additional | : if you want to add all stopwords and dictionary files |
--no-usernames | : remove Twitter usernames (@<name>) |
--maintain-newlines | : use space as the delimiter instead of the default (newline) |
--min-size [N] | : specify the minimum size for a token (default=2) |
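Because the third argument is just the flags above joined into one string, it can help to assemble that string from named switches rather than typing it by hand. The following is a minimal sketch only: `build_clean_commands()` and its argument names are hypothetical and not part of the package, and it assumes the short flags can be combined after a single dash, as the examples on this page do.

```r
# Hypothetical helper (not part of the package): builds the command string
# for clean_file() from named switches, using only the flags listed above.
build_clean_commands <- function(lowercase = FALSE, digits = FALSE,
                                 punctuation = FALSE, roman = FALSE,
                                 stopwords = FALSE, dictionary = FALSE,
                                 tweets = FALSE, additional = FALSE,
                                 no_usernames = FALSE,
                                 maintain_newlines = FALSE,
                                 min_size = NULL) {
  # Short flags are collected into one group, e.g. "-lnp".
  short <- c(l = lowercase, n = digits, p = punctuation, r = roman,
             s = stopwords, d = dictionary, t = tweets)
  parts <- character(0)
  if (any(short)) {
    parts <- c(parts, paste0("-", paste(names(short)[short], collapse = "")))
  }
  # Long flags are appended one by one, separated by spaces.
  if (additional) parts <- c(parts, "--additional")
  if (no_usernames) parts <- c(parts, "--no-usernames")
  if (maintain_newlines) parts <- c(parts, "--maintain-newlines")
  if (!is.null(min_size)) {
    parts <- c(parts, paste("--min-size", as.integer(min_size)))
  }
  paste(parts, collapse = " ")
}

cmds <- build_clean_commands(lowercase = TRUE, digits = TRUE,
                             punctuation = TRUE)
print(cmds)  # "-lnp"
```

The resulting string can then be passed as clean_commands_str.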
clean_file(ifile, odir, clean_commands_str)
ifile |
A string containing the path to the input file. |
odir |
A string containing the path to the output directory. |
clean_commands_str |
A string containing the combined commands for the cleaning script. |
A string containing the name of the file that was cleaned.
## Not run:
clean_file("myfile.txt", "./cleaned/", "-d")
clean_file("myfile.txt", "./cleaned/", "-lnp")
clean_file("myfile.txt", "./cleaned/", "-lnprsdt")
clean_file("myfile.txt", "./cleaned/", "-lnprsdt --additional")
clean_file("myfile.txt", "./cleaned/", "-lnprsdt --tags --maintain-newlines --min-size 3")
## End(Not run)
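Since clean_file() takes one input file per call, cleaning a whole directory means looping over the files. The sketch below shows one possible way to do that. `clean_many()` is a hypothetical helper, not part of the package; it takes the cleaning function as an argument so that it runs on its own, and it assumes that function returns a single string, as described under Value. Whether clean_file() creates the output directory itself is not stated here, so the helper creates it as a precaution.

```r
# Hypothetical helper (not part of the package): applies a cleaning function
# with the same signature as clean_file() to several input files.
clean_many <- function(files, odir, clean_commands_str, cleaner) {
  # Assumption: the output directory may not exist yet, so create it.
  if (!dir.exists(odir)) dir.create(odir, recursive = TRUE)
  # One call per file; each call is expected to return one string.
  vapply(files,
         function(f) cleaner(f, odir, clean_commands_str),
         character(1),
         USE.NAMES = FALSE)
}

# Usage, assuming the package that provides clean_file() is loaded:
# txt_files <- list.files("raw", pattern = "\\.txt$", full.names = TRUE)
# clean_many(txt_files, "./cleaned/", "-lnp", cleaner = clean_file)
```

Passing the cleaner as an argument also makes it easy to try the loop with a stand-in function before running the real script.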