stringi-package: THE String Processing Package


stringi is THE R package for very fast, correct, consistent, and convenient string/text manipulation in each locale and any character encoding. We put great effort into creating software that works as you expect on any platform, in each locale, and with any “native” system encoding.

Keywords: internationalization, localization, ICU, ICU4C, i18n, l10n, Unicode

Homepage: http://stringi.rexamine.com

License: The MIT license for the package code, the ICU license for the accompanying ICU4C distribution, and the UCD license for the Unicode Character Database. See the COPYRIGHTS and LICENSE files for more details.

Manual pages on general topics (must-read):

stringi-encoding – character encoding issues, including information on encoding management in stringi, as well as on encoding detection, conversion, and Unicode normalization.
stringi-locale – locale issues, including, among others, locale management and specification in stringi, and the list of locale-sensitive operations. In particular, see stri_opts_collator for a description of the string collation algorithm, which is used for comparing, ordering, sorting, case-folding, and searching strings.
stringi-arguments – how stringi deals with its functions' arguments.
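
The general topics above can be illustrated with a minimal sketch. It assumes the stringi package is installed and that the ICU build in use ships Slovak and English collation data; the input strings are made up for illustration. Collator settings are passed through stri_opts_collator, as described in stringi-locale.

```r
library(stringi)

# Argument handling: shorter arguments are recycled against longer ones.
dups <- stri_dup("ab", 1:3)

# Locale-sensitive sorting: Slovak treats "ch" as a letter that follows "h".
sorted_sk <- stri_sort(c("hladny", "chladny"),
                       opts_collator = stri_opts_collator(locale = "sk_SK"))
sorted_en <- stri_sort(c("hladny", "chladny"),
                       opts_collator = stri_opts_collator(locale = "en_US"))

# Collator strength: at strength 2, case differences are ignored,
# so the comparison reports the two strings as equivalent (0).
cmp <- stri_compare("hello", "HELLO",
                    opts_collator = stri_opts_collator(strength = 2))
```

Here sorted_sk places "hladny" first while sorted_en places "chladny" first, which shows why the locale has to be stated explicitly when a result must be reproducible across systems.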

Refer to the following:

stringi-search for string searching facilities; these include pattern searching, matching, string splitting, and so on. The following independent search engines are provided:
- stringi-search-regex – with ICU (Java-like) regular expressions;
- stringi-search-fixed – locale-aware or byte-exact fixed pattern searching;
- stringi-search-charclass – for finding character classes, like “all whitespaces” or “all digits”.
stri_stats_general and stri_stats_latex for gathering some statistics on a character vector's contents.
stri_join, stri_dup, and stri_flatten for concatenation-based operations.
stri_sub for extracting and replacing substrings, and stri_reverse for reversing the order of characters in a string.
stri_trim (among others) for trimming characters from the beginning and/or end of a string; see also stringi-search-charclass.
stri_length (among others) for determining the number of code points in a string.
stri_trans_tolower (among others) for case mapping, i.e. conversion to lower, UPPER, or Title case.
stri_compare, stri_order, and stri_sort for comparison-based, locale-aware operations; see also stringi-locale.
stri_split_lines (among others) to split a string into text lines.
stri_escape_unicode (among others) for escaping certain code points.
DRAFT API: stri_read_raw, stri_read_lines, and stri_write_lines for reading and writing text files.
TO DO [these will appear in future versions of stringi]: pad, wrap, justify, HTML entities, character translation, MIME Base 64 encode/decode, random string generation, number and date/time formatting, and many more.
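
A few of the facilities listed above can be tried as follows. This is only a sketch with made-up input. The functions stri_detect_regex and stri_count_fixed are not named on this page; they are taken from the stringi-search family, and exact argument lists may differ between package versions.

```r
library(stringi)

x <- c("  Hello World  ", "stringi 123")

# Trimming, length, case mapping, substrings
trimmed <- stri_trim(x)               # whitespace removed from both ends
lens    <- stri_length(trimmed)       # number of code points per string
lower   <- stri_trans_tolower(trimmed)
heads   <- stri_sub(trimmed, 1, 5)    # first five code points

# Concatenation-based operations
joined  <- stri_join("a", "b", sep = "-")
flat    <- stri_flatten(c("a", "b", "c"), collapse = ", ")

# Reversing, splitting into lines, escaping non-ASCII code points
rev_abc    <- stri_reverse("abc")
text_lines <- stri_split_lines("first\nsecond")[[1]]
escaped    <- stri_escape_unicode("a\u0105!")

# Two of the search engines: ICU regular expressions and fixed patterns
has_digits <- stri_detect_regex(trimmed, "[0-9]+")
n_an       <- stri_count_fixed("banana", "an")
```

All of these functions are vectorized over their main arguments, so the same calls work unchanged on longer character vectors.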

Note that each man page has many links to other interesting facilities.

Marek Gagolewski gagolews@rexamine.com,
Bartek Tartanus bartektartanus@rexamine.com,
with some contributions from Marcin Bujarski at an early stage of package development. ICU4C was developed by IBM and others. The Unicode Character Database is due to Unicode, Inc.

stringi Package homepage, http://stringi.rexamine.com

ICU – International Components for Unicode, http://www.icu-project.org/

ICU4C API Documentation, http://www.icu-project.org/apiref/icu4c/

The Unicode Consortium, http://www.unicode.org/

UTF-8, a transformation format of ISO 10646 – RFC 3629, http://tools.ietf.org/html/rfc3629

Other stringi_general_topics: stringi-arguments; stringi-encoding; stringi-locale; stringi-search-charclass; stringi-search-fixed; stringi-search-regex; stringi-search