Locale-Sensitive Text Searching in stringi


Description

The string searching facilities described on this man page provide a way to detect and extract a specific piece of text. Note that locale-sensitive searching, especially in non-English text, is a much more complex process than it may seem at first glance.

Locale-Aware String Search Engine

By default, all stri_*_fixed functions in stringi use ICU's StringSearch engine, a language-aware string search algorithm. Note that a byte-wise match will not give correct results for text containing:

  1. accented letters;

  2. conjoined letters;

  3. ignorable punctuation;

  4. ignorable case.

The matches are defined using the notion of “canonical equivalence” between strings.
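To make the notion of canonical equivalence concrete, here is a minimal base-R sketch (it does not call stringi). The word "café" and the variable names are illustrative choices only. It shows two canonically equivalent spellings of the same word whose byte sequences differ, which is why a byte-wise comparison cannot treat them as equal.

```r
# The same visible word, encoded in two canonically equivalent ways.
precomposed <- "caf\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed  <- "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Compare the underlying UTF-8 bytes rather than the rendered text.
bytes_pre <- charToRaw(enc2utf8(precomposed))
bytes_dec <- charToRaw(enc2utf8(decomposed))

print(bytes_pre)
print(bytes_dec)

# The byte sequences differ (5 bytes vs. 6 bytes), so this is FALSE.
same_bytes <- identical(bytes_pre, bytes_dec)
print(same_bytes)
```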

This string search engine uses a modified version of the Boyer-Moore algorithm (cf. Werner, 1999), with time complexity of O(n+p), where n == length(str) and p == length(pattern). According to the ICU User Guide, the Boyer-Moore algorithm relies on automata or combinatorial properties of strings and pre-processes the pattern; it is faster than a linear search once the pattern is longer than 3 or 4 characters.

Tuning the Collator's parameters allows you to perform correct matching that properly takes into account accented letters, conjoined letters, and ignorable punctuation and letter case.

For more information on ICU's Collator and SearchEngine and how to tune them in stringi, refer to stri_opts_collator.
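As a rough illustration of collator tuning, the sketch below contrasts a default-strength search with a primary-strength one that ignores accents and case. The helper detect_collated is hypothetical and not part of stringi. It is there because this page describes stri_detect_fixed as taking an opts_collator argument, whereas later stringi releases provide the collator-based search as stri_detect_coll. The locale "en" and the sample strings are assumptions made for the example.

```r
library(stringi)

# Hypothetical helper (not part of stringi): use whichever collator-based
# interface the installed stringi version offers.
detect_collated <- function(str, pattern, opts) {
  if ("opts_collator" %in% names(formals(stri_detect_fixed))) {
    # Interface described on this page.
    stri_detect_fixed(str, pattern, opts_collator = opts)
  } else {
    # Later releases: collator-based search lives in stri_detect_coll.
    stri_detect_coll(str, pattern, opts_collator = opts)
  }
}

# Default (tertiary) strength: accents and letter case are significant.
strict <- detect_collated("R\u00e9sum\u00e9", "resume",
                          stri_opts_collator(locale = "en"))

# Primary strength: only base letters are compared.
loose <- detect_collated("R\u00e9sum\u00e9", "resume",
                         stri_opts_collator(locale = "en", strength = 1L))

print(strict)
print(loose)
```

With these settings the primary-strength search is expected to report a match, while the default-strength search is not.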

Byte Compare

If opts_collator is NA, then a byte-wise (locale-independent) search is performed. It is very fast for small p, with time complexity of O(n*p), where n == length(str) and p == length(pattern). [This is a naive implementation, to be upgraded in some future version of stringi.] For natural-language, non-English text this is, however, probably not what you want.

Note, however, that input data are still converted to Unicode as usual.
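The kind of search described above can be sketched in base R as follows. This is an illustration of a naive O(n*p) scan over UTF-8 bytes, not stringi's actual implementation; the function name naive_byte_find is hypothetical. It returns the 1-based byte offset of the first match, or NA if there is none.

```r
# Hypothetical helper, not part of stringi: naive O(n*p) search on raw bytes.
naive_byte_find <- function(str, pattern) {
  s <- charToRaw(enc2utf8(str))       # input is converted to UTF-8 first
  p <- charToRaw(enc2utf8(pattern))
  n <- length(s)
  m <- length(p)
  # Empty or over-long patterns cannot match; report NA.
  if (m == 0L || m > n) return(NA_integer_)
  # Try every start position and compare p bytes at each one.
  for (i in seq_len(n - m + 1L)) {
    if (all(s[i:(i + m - 1L)] == p)) return(i)
  }
  NA_integer_
}

# Plain ASCII text: the pattern is found at byte offset 8.
hit <- naive_byte_find("locale-sensitive", "sens")

# Accented text: "cafe" is not a byte-wise substring of "caf\u00e9".
miss <- naive_byte_find("caf\u00e9", "cafe")

print(hit)
print(miss)
```

The second call shows why such a search is usually not what you want for natural-language text: the accented letter occupies different bytes, so no match is reported.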

General Notes

In all the functions, if a given fixed search pattern is empty, then the result is NA and a warning is generated.
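The empty-pattern rule can be checked with a short sketch like the one below. It captures the warning so that both the NA result and the warning message can be inspected. The variable names are illustrative, and the exact wording of the warning may differ between stringi versions.

```r
library(stringi)

caught <- NULL  # will hold the warning message, if one is raised

res <- withCallingHandlers(
  stri_detect_fixed("abc", ""),          # empty fixed search pattern
  warning = function(w) {
    caught <<- conditionMessage(w)       # remember the warning text
    invokeRestart("muffleWarning")       # and keep it from being printed
  }
)

print(res)     # NA
print(caught)  # the warning text issued by stringi
```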

References

ICU String Search Service – ICU User Guide, http://userguide.icu-project.org/collation/icu-string-search-service

L. Werner, Efficient Text Searching in Java, 1999, http://icu-project.org/docs/papers/efficient_text_searching_in_java.html

See Also

Other locale_sensitive: stri_cmp, stri_compare; stri_count_fixed; stri_detect_fixed; stri_enc_detect2; stri_locate_all_fixed, stri_locate_first_fixed, stri_locate_last_fixed; stri_opts_collator; stri_order, stri_sort; stri_replace_all_fixed, stri_replace_first_fixed, stri_replace_last_fixed; stri_split_fixed; stri_trans_tolower, stri_trans_totitle, stri_trans_toupper; stringi-locale

Other search_fixed: stri_count_fixed; stri_detect_fixed; stri_extract_all_fixed, stri_extract_first_fixed, stri_extract_last_fixed; stri_locate_all_fixed, stri_locate_first_fixed, stri_locate_last_fixed; stri_opts_collator; stri_replace_all_fixed, stri_replace_first_fixed, stri_replace_last_fixed; stri_split_fixed; stringi-search

Other stringi_general_topics: stringi-arguments; stringi-encoding; stringi-locale; stringi-package; stringi-search-charclass; stringi-search-regex; stringi-search
