Canned Regular Expressions (United States of America)


Description

A dataset containing a list of U.S.-specific, canned regular expressions for use in various functions within the qdapRegex package.

Usage

data(regex_usa)

Format

A list with 54 elements

Details

The following canned regular expressions are included:

rm_abbreviation

abbreviations containing a single lower case or capital letter followed by a period and then an optional space (this must be repeated 2 or more times)

rm_between

Remove characters between a left and right boundary including the boundaries; note that the pattern contains "%s" placeholders that are filled in by sprintf, so it is not a valid regex on its own

rm_between2

Remove characters between a left and right boundary NOT including the boundaries; note that the pattern contains "%s" placeholders that are filled in by sprintf, so it is not a valid regex on its own
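
The template idea behind rm_between and rm_between2 can be sketched as follows. This is a minimal Python illustration, not the package's R code: the two template strings and the fill() helper are hypothetical, written only from the descriptions above.

```python
import re

# Hypothetical templates (assumptions); the real regex_usa strings may differ.
BETWEEN_INCLUSIVE = r"%s.*?%s"            # match includes both boundaries
BETWEEN_EXCLUSIVE = r"(?<=%s).*?(?=%s)"   # match excludes both boundaries

def fill(template, left, right):
    # Escape the boundaries so characters such as "(" are taken literally,
    # then substitute them for the two "%s" placeholders.
    return template % (re.escape(left), re.escape(right))

text = "Keep this (drop this) and this."
inclusive = re.findall(fill(BETWEEN_INCLUSIVE, "(", ")"), text)
exclusive = re.findall(fill(BETWEEN_EXCLUSIVE, "(", ")"), text)
print(inclusive, exclusive)
```

Until the placeholders are filled, the template is only a format string, which is why it cannot be used as a regex directly.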

rm_caps

words containing 2 or more consecutive upper case letters and no lower case
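
As a rough illustration of the rm_caps description, the Python sketch below uses a pattern written from the prose above; CAPS_WORD is an assumed approximation, not the string stored in regex_usa.

```python
import re

# Assumed approximation: a whole word made of 2 or more upper case letters.
CAPS_WORD = r"\b[A-Z]{2,}\b"

text = "NASA sent an A to the FBI and Nasa."
caps = re.findall(CAPS_WORD, text)
print(caps)
```

The single letter "A" and the mixed-case "Nasa" are skipped because neither is a run of two or more capitals bounded on both sides.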

rm_caps_phrase

phrases of 1 word or more containing 1 or more consecutive upper case letters and no lower case; if phrase is one word long then phrase must be 2 or more consecutive capital letters

rm_citation

substring that looks for in-text and parenthetical APA6 style citations (attempts to exclude references)

rm_citation2

substring that looks for in-text APA6 style citations (attempts to exclude references)

rm_citation3

substring that looks for parenthetical APA6 style citations (attempts to exclude references)

rm_city_state

substring with city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters)

rm_city_state_zip

substring with city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters) & zip code (exactly 5 or 5+4 consecutive digits)

rm_date

dates in the form of 2 digit month, 2 digit day, and 2 or 4 digit year. Separator between month, day, and year may be dot (.), slash (/), or dash (-)

rm_date2

dates in the form of 3-9 letters followed by one or more spaces, 2 digits, a comma (,), one or more spaces, and 4 digits

rm_date3

dates in the form of XXXX-XX-XX; hyphen separated string of 4 digit year, 2 digit month, and 2 digit day

rm_date4

dates in any of the forms matched by rm_date, rm_date2, and rm_date3
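
The four date entries can be pictured as three patterns plus their alternation. The Python sketch below is written from the descriptions above; all four pattern names and strings are assumptions and are not copied from regex_usa.

```python
import re

# Assumed approximations of the three date forms described above.
DATE_NUMERIC = r"\b\d{2}[./-]\d{2}[./-](?:\d{4}|\d{2})\b"  # 06/27/2014, 06.27.14
DATE_WRITTEN = r"\b[A-Za-z]{3,9}\s+\d{2},\s+\d{4}\b"        # June 27, 2014
DATE_ISO = r"\b\d{4}-\d{2}-\d{2}\b"                         # 2014-06-27

# Combining the three with alternation gives a catch-all pattern.
DATE_ANY = "|".join([DATE_ISO, DATE_NUMERIC, DATE_WRITTEN])

text = "Due 06/27/2014, moved to June 30, 2014, filed 2014-07-01."
dates = re.findall(DATE_ANY, text)
print(dates)
```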

rm_dollar

substring with dollar sign ($) followed by (1) just dollars (no decimal), (2) dollars and cents (whole number and decimal), or (3) just cents (decimal value)
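
The three cases in the rm_dollar description map onto a short alternation. The Python pattern below is an assumed approximation written from that description, not the package's own regex.

```python
import re

# Assumed approximation: "$" then dollars, dollars and cents, or cents only.
DOLLAR = r"\$(?:\d+(?:\.\d+)?|\.\d+)"

text = "Paid $5, then $12.50, then $.75 in change."
amounts = re.findall(DOLLAR, text)
print(amounts)
```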

rm_email

substring with (1) alphanumeric characters or dash (-), plus (+), or underscore (_) (this may be repeated), (2) followed by at (@), (3) followed by the same regex sequence as before the at (@), and (4) ending with dot (.) and 2-14 letters
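
A Python sketch of the structure described for rm_email follows. The pattern is an assumption reconstructed from the prose; in particular, the final segment after the last dot is taken to be alphabetic, which is a guess about the intended behavior rather than a copy of the stored regex.

```python
import re

# Assumed approximation: local part, "@", domain labels, then a 2-14 letter suffix.
EMAIL = (
    r"[A-Za-z0-9_+-]+(?:\.[A-Za-z0-9_+-]+)*"   # local part, dot-separated
    r"@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*"      # domain labels
    r"\.[A-Za-z]{2,14}"                        # final suffix (assumed letters)
)

text = "Mail first.last+tag@example.co.uk or bad@host."
emails = re.findall(EMAIL, text)
print(emails)
```

The second candidate, "bad@host.", is rejected because nothing follows the final dot.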

rm_emoticon

common emoticons (logic is complicated to explain in words) using ">?[:;=8XB]{1}[-~+o^]?[|\")(&gt;DO>{pP3/]+|</?3|XD+|D:<|x[-~+o^]?[|\")(&gt;DO>{pP3/]+" regex pattern; general pattern is optional hat character, followed by eyes character, followed by optional nose character, and ending with a mouth character

rm_endmark

substring of the last endmark group in a string; endmarks include (! ? . * OR |)

rm_endmark2

substring of the last endmark group in a string; endmarks include (! ? OR .)

rm_endmark3

substring of the last endmark group in a string; endmarks include (! ? . * | ; OR :)

rm_hash

substring that begins with a hash (#) followed by a word

rm_nchar_words

substring of letters (that may contain apostrophes) n letters long (apostrophe not counted in length); note that the pattern contains "%s" placeholders that are filled in by sprintf, so it is not a valid regex on its own

rm_nchar_words2

substring of letters (that may contain apostrophes) n letters long (apostrophe counted in length); note that the pattern contains "%s" placeholders that are filled in by sprintf, so it is not a valid regex on its own
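
The simpler of the two variants, where the apostrophe counts toward the length, can be sketched as a template with the word length substituted in. The Python template and the words_of_length() helper below are hypothetical and only approximate the description above.

```python
import re

# Hypothetical template (assumption): the length is substituted for "%d".
NCHAR_WORDS2 = r"(?<![A-Za-z'])[A-Za-z']{%d}(?![A-Za-z'])"

def words_of_length(text, n):
    # Fill in the length, then find words of exactly n characters,
    # counting apostrophes as characters.
    return re.findall(NCHAR_WORDS2 % n, text)

text = "I can't find the lost keys"
four = words_of_length(text, 4)
five = words_of_length(text, 5)
print(four, five)
```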

rm_non_ascii

substring of 2 digits or letters a-f inside of a left and right angle brace in the form of "<a4>"
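
The "<a4>" form described for rm_non_ascii is easy to picture with a small Python sketch. NON_ASCII_TAG is an assumed approximation of the description, not the stored pattern.

```python
import re

# Assumed approximation: exactly 2 hex-style characters (0-9, a-f) in angle braces.
NON_ASCII_TAG = r"<[0-9a-f]{2}>"

text = "caf<e9> cr<e8>me <b>bold</b>"
tags = re.findall(NON_ASCII_TAG, text)
print(tags)
```

Ordinary markup such as "<b>" is left alone because it does not contain exactly two such characters.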

rm_non_words

substring of any character that isn't a letter, apostrophe, or single space

rm_number

substring that may begin with dash (-) for negatives, and is (1) just whole number (no decimal), (2) whole number and decimal, or (3) just decimal value; regex pattern provided by Jason Gray
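
The three numeric cases in the rm_number description can be sketched in Python as below. The pattern is an assumption written from the prose and is not the regex credited to Jason Gray.

```python
import re

# Assumed approximation: optional "-", then decimal, whole, or bare-decimal value.
NUMBER = r"-?(?:\d+\.\d+|\d+|\.\d+)"

text = "Temps were -3, 12.5, and .75 degrees."
numbers = re.findall(NUMBER, text)
print(numbers)
```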

rm_percent

substring beginning with (1) just whole number (no decimal), (2) whole number and decimal, or (3) just decimal value and followed by a percent sign (%)

rm_phone

phone numbers in the form of optional country code, valid 3 digit prefix, and 7 digits (may contain hyphens and parentheses); logic is complex to explain (see http://stackoverflow.com/a/21008254/1000343 for more)

rm_postal_code

U.S. state abbreviations (and District of Columbia), constrained to the valid state abbreviations rather than any two consecutive capital letters; taken from Mike Hamilton's submission found at http://regexlib.com/REDetails.aspx?regexp_id=2177

rm_repeated_characters

substring with a repetition of repeated characters within a word; regex pattern retrieved from Stack Overflow user vks: http://stackoverflow.com/a/29438461/1000343

rm_repeated_phrases

substring with a phrase (a sequence of 1 or more words) that is repeated 2 or more times (case is ignored; separating periods and commas are ignored); regex pattern retrieved from Stack Overflow user BrodieG: http://stackoverflow.com/a/28786617/1000343

rm_repeated_words

substring with a word (marked with a boundary) that is repeated 2 or more times (case is ignored)
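
Repeated-word detection relies on a backreference to the first captured word. The Python sketch below is an assumed approximation of that idea, not the package's pattern.

```python
import re

# Assumed approximation: a word, then one or more repeats of the same word.
# \1 refers back to the captured word; IGNORECASE lets "test Test" match.
REPEATED_WORDS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)

text = "This is is a test Test sentence."
repeats = [m.group(0) for m in REPEATED_WORDS.finditer(text)]
print(repeats)
```

The leading word boundary keeps the "is" inside "This" from starting a match.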

rm_tag

substring that begins with an at (@) followed by a word

rm_tag2

Twitter substring that begins with an at (@) followed by a word composed of alpha-numeric characters and underscores, no longer than 15 characters
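
The 15-character limit in rm_tag2 can be illustrated with a bounded quantifier. The Python pattern below is an assumption; in particular, rejecting over-long handles outright (rather than matching their first 15 characters) is a design choice of this sketch and may not mirror the package.

```python
import re

# Assumed approximation: "@" then 1-15 letters, digits, or underscores,
# not preceded by a word character and ending on a word boundary.
TWITTER_TAG = r"(?<![A-Za-z0-9_])@[A-Za-z0-9_]{1,15}\b"

text = "Ping @r_user and @a_very_long_handle_name please"
handles = re.findall(TWITTER_TAG, text)
print(handles)
```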

rm_title_name

substring beginning with title (Mrs., Mr., Ms., Dr.) that is case independent or full title (Miss, Mizz, mizz) followed by a single lower case word or multiple capitalized words

rm_time

substring that (1) must begin with 0-2 digits, (2) must be followed by a single colon (:), (3) optionally may be followed by either a colon (:) or a dot (.), (4) optionally may be followed by 1-infinite digits (if previous condition is true)

rm_time2

substring that is identical to rm_time with the additional search for Ante Meridiem/Post Meridiem abbreviations (e.g., AM, p.m., etc.)

rm_transcript_time

substring that is specific to transcription time stamps in the form of HH:MM:SS.OS where OS is milliseconds. HH: and .OS are optional. The SS.OS period divide may also be a comma or additional colon. The HH:SS divide may also be a period. String may be affixed with pound sign (#).

rm_twitter_url

Twitter short link/url; substring optionally beginning with http, followed by t.co ending on a space or end of string (whichever comes first)

rm_url

substring beginning with http, www., or ftp and ending on a space or end of string (whichever comes first); note that this regex is simple and may not cover all valid URLs or may include invalid URLs
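
The caveat about rm_url being simple is easiest to see with an example. The Python pattern below is an assumed approximation (it requires "://" after http and ftp, which the description does not state) and is not the stored regex.

```python
import re

# Assumed approximation: a known prefix, then everything up to the next space.
URL = r"(?:https?://|ftp://|www\.)\S+"

urls = re.findall(URL, "See http://example.com/a?b=1 and www.example.org now")
trailing = re.findall(URL, "Visit http://example.com.")
print(urls, trailing)
```

Because the match runs to the next space, the sentence-ending period in the second string is captured as part of the URL.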

rm_url2

substring beginning with http, www., or ftp and more constrained than rm_url; based on @imme_emosol's response from https://mathiasbynens.be/demo/url-regex

rm_url3

substring beginning with http or ftp and more constrained than rm_url & rm_url2 though light-weight, making it ideal for validation purposes; taken from @imme_emosol's response found at https://mathiasbynens.be/demo/url-regex

rm_white

substring of white space(s); this regular expression combines rm_white_bracket, rm_white_colon, rm_white_comma, rm_white_endmark, rm_white_lead, rm_white_trail, and rm_white_multiple

rm_white_bracket

substring of white space(s) following left brackets ("{", "(", "[") or preceding right brackets ("}", ")", "]")

rm_white_colon

substring of white space(s) preceding colon(s)/semicolon(s)

rm_white_comma

substring of white space(s) preceding a comma

rm_white_endmark

substring of white space(s) preceding a single occurrence/combination of period(s), question mark(s), and exclamation point(s)

rm_white_lead

substring of leading white space(s)

rm_white_lead_trail

substring of leading/trailing white space(s)

rm_white_multiple

substring of multiple, consecutive white spaces

rm_white_punctuation

substring of white space(s) preceding a comma or a single occurrence/combination of colon(s), semicolon(s), period(s), question mark(s), and exclamation point(s)

rm_white_trail

substring of trailing white space(s)
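
Several of the white space entries above can be chained into one clean-up pass. The Python sketch below uses three assumed patterns written from the descriptions; the tidy() helper is hypothetical and is not part of qdapRegex.

```python
import re

def tidy(text):
    # Leading or trailing white space (compare rm_white_lead_trail).
    text = re.sub(r"^\s+|\s+$", "", text)
    # White space directly before punctuation (compare rm_white_punctuation).
    text = re.sub(r"\s+(?=[,.?!:;])", "", text)
    # Runs of 2 or more white spaces collapse to one (compare rm_white_multiple).
    return re.sub(r"\s{2,}", " ", text)

cleaned = tidy("  Wait , too   many  spaces !  ")
print(cleaned)
```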

rm_zip

substring of 5 digits optionally followed by a dash and 4 more digits
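
The rm_zip description translates almost directly into a pattern. The Python version below is an assumed approximation; the word boundaries are an addition of this sketch to keep longer digit runs from matching.

```python
import re

# Assumed approximation: 5 digits, optionally followed by "-" and 4 digits.
ZIP = r"\b\d{5}(?:-\d{4})?\b"

text = "Send to 14222 or 14222-1234, not 1422."
zips = re.findall(ZIP, text)
print(zips)
```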

Extra

Use qdapRegex:::examine_regex() to interactively explore the regular expressions in regex_usa. This will provide a browser + console based breakdown of each regex in the dictionary.
