language: Parse IETF Language Tag
In NLP: Natural Language Processing Infrastructure

Description

Extract language, script, region and variant subtags from IETF language tags.

Usage

parse_IETF_language_tag(x, expand = FALSE, strict = TRUE)

Arguments

`x`	a character vector with IETF language tags.
`expand`	a logical indicating whether to expand subtags into their description(s).
`strict`	a logical indicating whether invalid language tags should result in an error (default) or not.
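As a hedged illustration of the `strict` argument: with the default `strict = TRUE`, an invalid tag raises an error, so code that handles untrusted input may want to catch it. The wrapper name `safe_parse` below is hypothetical and not part of NLP; the sketch assumes the NLP package is installed and only calls it when it is available.

```r
# Hypothetical wrapper (not part of NLP): return NULL instead of stopping
# when parse_IETF_language_tag() signals an error for an invalid tag.
safe_parse <- function(x) {
    tryCatch(
        NLP::parse_IETF_language_tag(x, strict = TRUE),
        error = function(e) {
            message("Could not parse: ", conditionMessage(e))
            NULL
        }
    )
}

# Only run the calls if the NLP package is available.
if (requireNamespace("NLP", quietly = TRUE)) {
    print(safe_parse("de-CH"))
}
```

Passing `strict = FALSE` directly is the alternative when an error is not wanted at all.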

Details

Internet Engineering Task Force (IETF) language tags are defined by IETF BCP 47, which currently consists of the normative RFC 5646 and RFC 4647, along with the normative content of the IANA Language Subtag Registry regulated by these RFCs. These tags are used in a number of modern computing standards, for example to identify content languages in HTTP, HTML, and XML.

Each language tag is composed of one or more “subtags” separated by hyphens. Normal language tags have the following subtags:

a language subtag (optionally, with language extension subtags),
an optional script subtag,
an optional region subtag,
optional variant subtags,
optional extension subtags,
an optional private use subtag.
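The structure listed above can be sketched in base R by splitting a tag at the hyphens and classifying each piece by its shape. This is a simplified illustration under stated assumptions, not the parser used by the package: the helper name `classify_subtags` is hypothetical, it only recognizes language, script, region and variant subtags, it does not check subtag order, and it performs no lookup in the IANA registry.

```r
# Hypothetical helper: label the subtags of a simple language tag by shape.
# Extlang, extension, private use and grandfathered forms are not handled
# and are reported as "Unknown".
classify_subtags <- function(tag) {
    parts <- strsplit(tag, "-", fixed = TRUE)[[1L]]
    types <- character(length(parts))
    for (i in seq_along(parts)) {
        p <- parts[i]
        types[i] <- if (i == 1L && grepl("^[A-Za-z]{2,3}$", p)) {
            "Language"                      # 2 or 3 letters, first position
        } else if (grepl("^[A-Za-z]{4}$", p)) {
            "Script"                        # 4 letters
        } else if (grepl("^([A-Za-z]{2}|[0-9]{3})$", p)) {
            "Region"                        # 2 letters or 3 digits
        } else if (grepl("^([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})$", p)) {
            "Variant"                       # 5 to 8 characters, or digit + 3
        } else {
            "Unknown"
        }
    }
    paste(types, parts, sep = "=")
}

classify_subtags("sr-Latn-CS")
classify_subtags("es-419")
```

Because the classification is purely by shape, a well-formed but unregistered subtag is still labeled here; only the registry can say whether a subtag is actually valid.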

Language subtags are mainly derived from ISO 639-1 and ISO 639-2, script subtags from ISO 15924, and region subtags from ISO 3166-1 alpha-2 and UN M.49; see package ISOcodes for more information about these standards. Variant subtags are not derived from any standard. The Language Subtag Registry (https://www.iana.org/assignments/language-subtag-registry), maintained by the Internet Assigned Numbers Authority (IANA), lists the current valid public subtags, as well as the so-called “grandfathered” language tags.

See https://en.wikipedia.org/wiki/IETF_language_tag for more information.

Value

If expand is false, a list of character vectors of the form "type=subtag", where type gives the type of the corresponding subtag (one of ‘Language’, ‘Extlang’, ‘Script’, ‘Region’, ‘Variant’, or ‘Extension’), or "type=tag" with type either ‘Privateuse’ or ‘Grandfathered’.

Otherwise, a list of lists of character vectors obtained by replacing the subtags by their corresponding descriptions (which may be multiple) from the IANA registry. Note that no such descriptions for Extension and Privateuse subtags are available in the registry; on the other hand, empty expansions of the other subtags indicate malformed tags (as these subtags must be available in the registry).
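For the `expand = FALSE` case, the "type=subtag" strings can be taken apart with base R when individual subtags are needed. The sketch below uses a hand-written vector in the form described above as a stand-in for one element of the result; it is not captured from a real call, and the variable names are illustrative only.

```r
# Hand-written stand-in for one element of the returned list, following the
# "type=subtag" form described above (not output copied from a real call).
parsed <- c("Language=sr", "Script=Latn", "Region=CS")

# Split each entry at the first "=" into its type and its subtag.
types   <- sub("=.*$", "", parsed)
subtags <- sub("^[^=]*=", "", parsed)

# Group the subtags by type so they can be looked up by name.
by_type <- split(subtags, types)
by_type$Region
by_type$Script
```

Using `split()` rather than a named vector keeps this workable for types that can occur more than once in a tag, such as variants.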

Examples

## German as used in Switzerland:
parse_IETF_language_tag("de-CH")
## Serbian written using Latin script as used in Serbia and Montenegro:
parse_IETF_language_tag("sr-Latn-CS")
## Spanish appropriate to the UN Latin American and Caribbean region:
parse_IETF_language_tag("es-419")
## All in one:
parse_IETF_language_tag(c("de-CH", "sr-Latn-CS", "es-419"))
parse_IETF_language_tag(c("de-CH", "sr-Latn-CS", "es-419"),
                        expand = TRUE)
## Two grandfathered tags:
parse_IETF_language_tag(c("i-klingon", "zh-min-nan"),
                        expand = TRUE)

NLP documentation built on April 12, 2025, 1:36 a.m.