    knitr::opts_chunk$set(message = FALSE, warning = FALSE)

    if (!require(regioncode)) install.packages("regioncode")
    library(regioncode)
    library(dplyr)
The term "city" in China is ambiguous: it may denote a county-level, prefecture-level, or provincial-level administrative unit. Scholars of China often struggle to convert these names or their corresponding geocodes, particularly when handling data that span multiple years. The difficulty is compounded because the central government periodically changes or abolishes the names of some units [@GuoJiaTongJiJu2022].
Inspired by Vincent Arel-Bundock's `countrycode` package, we developed `regioncode`. This package aims to perform similar functions but is tailored specifically to region name/code conversions within China for the period 1986--2019.
Why `regioncode`?
The Chinese government assigns a unique geocode to each county-level, prefecture-level (city), and provincial-level administrative unit. These "administrative division codes" are regularly adjusted and updated to align with national and regional development plans [@MinZhengBu2022]. These adjustments pose challenges for researchers who conduct longitudinal studies or merge geo-based data from different years. For instance, inconsistencies between map data and statistical data can produce erroneous output when statistical data are rendered on a map of China.
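The merging problem can be sketched in base R. The geocodes, names, and values below are made up for illustration and are not taken from any official table: one unit is recoded between two years, so a direct join on the geocode silently drops it.

```r
# Hypothetical geocodes and values, for illustration only.
stats_1999 <- data.frame(code = c(100100, 100200, 100300),
                         gdp  = c(12.5, 8.1, 3.4))
# In this toy map, unit 100300 has been recoded to 100900.
map_2019 <- data.frame(code = c(100100, 100200, 100900),
                       name = c("A", "B", "C"))

# A direct (inner) join keeps only the codes present in both years.
naive <- merge(stats_1999, map_2019, by = "code")
nrow(naive)

# Codes in the statistics that have no counterpart on the map
unmatched <- setdiff(stats_1999$code, map_2019$code)
unmatched
```

Checking for unmatched codes before a join, as in the last step, is a cheap way to spot units whose codes changed between the two sources.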
A One-Step Solution: regioncode
`regioncode` offers a one-step solution to these challenges. In its current version, it converts among the formal names, commonly used names, and administrative division codes of Chinese provinces and prefectures, covering the thirty-four years from 1986 to 2019.
To install the CRAN release, run `install.packages("regioncode")`; to install the version hosted on GitHub, run `remotes::install_github("sammo3182/regioncode")`.
We demonstrate the basic application of `regioncode` with toy data randomly sampled from @Wang2020c's China's Corruption Investigations Dataset.
In `regioncode`, administrative division codes are referred to as `code` and the formal names of regions as `name`. The current version supports conversion between any pair of these elements. Users only need to pass a character vector of names or a numeric vector of geocodes to the function and specify the desired output type with the `convert_to` argument.
The following example converts the 2019 geocodes in the sample data to their 1989 version. Users must set the `year_from` argument to the year in which the input data were coded. The `year_to` and `convert_to` arguments then determine the target year and the output format.
    library(regioncode)
    data("corruption")

    # Conversion to the 1989 version
    regioncode(data_input = corruption$prefecture_id,
               convert_to = "code", # default setting
               year_from = 2019,
               year_to = 1989)

    # Comparison of codes and names across the two years
    tibble(
      code2019 = corruption$prefecture_id,
      code1989 = regioncode(data_input = corruption$prefecture_id,
                            convert_to = "code", # default setting
                            year_from = 2019,
                            year_to = 1989),
      name2019 = regioncode(data_input = corruption$prefecture_id,
                            convert_to = "name",
                            year_from = 2019,
                            year_to = 2019),
      name1989 = regioncode(data_input = corruption$prefecture_id,
                            convert_to = "name",
                            year_from = 2019,
                            year_to = 1989)
    )
Note that if a region was initially geocoded in, for example, 1989 and later included in a new region in 2019, the new region geocode will be subsequently used. If a large area was divided into several regions, the later-year codes will align with the first region according to the ascending order of the regions' numeric geocodes.
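The tie-breaking rule for split regions can be sketched in base R. This is an illustration of the rule as described, not the package's internal implementation; the crosswalk table, its column names, and all codes are hypothetical.

```r
# Hypothetical crosswalk: each row links a source-year code to a target-year code.
# 200100 was split into three units; 200500 and 200600 were merged into 200700.
crosswalk <- data.frame(
  code_from = c(200100, 200100, 200100, 200500, 200600),
  code_to   = c(200130, 200110, 200120, 200700, 200700)
)

# For one-to-many links, keep the smallest target code (ascending numeric order).
resolved <- aggregate(code_to ~ code_from, data = crosswalk, FUN = min)

# Look up a vector of source codes; codes absent from the crosswalk become NA.
convert <- function(x) resolved$code_to[match(x, resolved$code_from)]

converted <- convert(c(200100, 200500, 200600, 999999))
converted
```

Under this rule a split unit maps to exactly one successor, so the output always has the same length as the input, at the cost of ignoring the other successors.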
In the current version, `regioncode` automatically identifies the input format: numeric vectors are treated as geocodes and character vectors as names. The following example demonstrates conversions from different input types to different output formats:
    # Original name
    tibble(
      id = corruption$prefecture_id,
      name = corruption$prefecture
    )

    # Codes to name
    regioncode(data_input = corruption$prefecture_id,
               convert_to = "name",
               year_from = 2019,
               year_to = 1989)

    # Name to codes of the same year
    regioncode(data_input = corruption$prefecture,
               convert_to = "code",
               year_from = 2019,
               year_to = 2019)

    # Name to name of a different year
    regioncode(data_input = corruption$prefecture,
               convert_to = "name",
               year_from = 2019,
               year_to = 1989)
The `regioncode` package also offers specialized conversion options for more complex data and diverse requirements, covering incomplete names, municipalities, city ranks, pinyin output, and province-level conversion.
Data often omit the administrative level when recording geographic information, for example "北京" instead of "北京市" or "内蒙" instead of "内蒙古自治区". We refer to these as "incomplete names." To convert such data, set the `incomplete_name` argument to `TRUE`. As long as two characters are available to identify the city or province, `regioncode` can conduct the conversion. In the following example, we randomly truncate seven of the input city names to their first two characters and show how `regioncode` handles them:
    # Original full names
    corruption$prefecture

    # Truncate seven randomly chosen names to their first two characters
    fake_incomplete <- corruption$prefecture
    index_incomplete <- sample(seq_along(corruption$prefecture), 7)
    fake_incomplete[index_incomplete] <- fake_incomplete[index_incomplete] |>
      substr(start = 1, stop = 2)
    fake_incomplete

    # Conversion to full names in 2008
    regioncode(data_input = fake_incomplete,
               convert_to = "name",
               year_from = 2019,
               year_to = 2008,
               incomplete_name = TRUE)
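The idea behind completing a truncated name can be sketched as a prefix match against a list of full names. This is a minimal illustration, not how `regioncode` performs the match internally; `full_names` and `complete_name()` are hypothetical, and the reference list is deliberately tiny.

```r
# A small reference list of full names (real names, but far from complete).
full_names <- c("北京市", "内蒙古自治区", "济南市", "上海市")

# Return the first full name that starts with the given fragment, or NA.
complete_name <- function(fragment) {
  hits <- full_names[startsWith(full_names, fragment)]
  if (length(hits) == 0) NA_character_ else hits[1]
}

completed <- vapply(c("北京", "内蒙", "济南市", "广州"), complete_name,
                    character(1), USE.NAMES = FALSE)
completed
```

A fragment with no match in the reference list, such as "广州" here, yields `NA` rather than a guess.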
Municipalities ("直辖市") in China are geographically cities but administratively provincial. Different geographic data may categorize them differently. Some data may treat municipalities as equivalent to prefectures.
To convert this type of data, `regioncode` provides a dedicated argument, `zhixiashi`. The default value is `FALSE`, which treats municipalities as provinces. When it is set to `TRUE`, municipalities are treated as prefectures and their provincial codes are used as geocodes. The following example illustrates the argument with a mixed vector of municipality names, their districts, and a prefecture:
    names_municipality <- c("北京市", # Beijing, a municipality
                            "海淀区", # A district of Beijing
                            "上海市", # Shanghai, a municipality
                            "静安区", # A district of Shanghai
                            "济南市") # A prefecture of Shandong

    # When `zhixiashi` is FALSE, only the districts are recognized
    regioncode(data_input = names_municipality,
               year_from = 2019,
               year_to = 2019,
               convert_to = "code",
               zhixiashi = FALSE)

    # When `zhixiashi` is TRUE, municipalities are recognized
    regioncode(data_input = names_municipality,
               year_from = 2019,
               year_to = 2019,
               convert_to = "code",
               zhixiashi = TRUE)
The Statistical Yearbook of Urban and Rural Construction classifies Chinese cities into different levels, largely based on their populations [@GuoJiaTongJiJu2022a]. From 1989 to 2014, there were four levels of cities, and the system expanded to a 7-level scale after 2014, as detailed in the following table:
| Criterion          | Population            | Rank             |
|:-------------------|-----------------------|------------------|
| Old (1989)| > 1 million | 超大城市 |
| | 500,000 ~ 1 million | 大城市 |
| | 200,000 ~ 500,000 | 中等城市 |
| | < 200,000 | 小城市 |
| | | |
| New (2014)| > 10 million | 超大城市 |
| | 5 million ~ 10 million| 特大城市 |
| | 3 million ~ 5 million | I型大城市 |
| | 1 million ~ 3 million | II型大城市 |
| | 500,000 ~ 1 million | 中等城市 |
| | 200,000 ~ 500,000 | I型小城市 |
| | < 200,000 | II型小城市 |
The `regioncode` function can return the rank of cities according to their populations in a given year. Users simply set `convert_to = "rank"` to perform the conversion. For regions in and before 1989, the old ranking system is applied; for other region-years, the function returns the new ranks. For cities whose populations we could not find in official sources, the rank is `NA`. The following example compares the ranks of the same input in different years:
    tibble(
      city = corruption$prefecture,
      rank1989 = regioncode(data_input = corruption$prefecture,
                            year_from = 2019,
                            year_to = 1989,
                            convert_to = "rank"),
      rank2014 = regioncode(data_input = corruption$prefecture,
                            year_from = 2019,
                            year_to = 2014,
                            convert_to = "rank")
    )
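For readers who want to see how a population maps onto the new scale, the thresholds in the "New (2014)" rows of the table can be written as a base R classifier. This is a sketch for illustration only: `rank_2014()` is a hypothetical helper, not part of `regioncode`, and the treatment of populations that fall exactly on a boundary is an assumption.

```r
# Thresholds and labels follow the "New (2014)" rows of the table.
# Assumption: each interval includes its lower bound and excludes its upper bound.
rank_2014 <- function(population) {
  as.character(cut(
    population,
    breaks = c(0, 2e5, 5e5, 1e6, 3e6, 5e6, 1e7, Inf),
    labels = c("II型小城市", "I型小城市", "中等城市", "II型大城市",
               "I型大城市", "特大城市", "超大城市"),
    right = FALSE
  ))
}

# A missing population yields a missing rank.
ranks <- rank_2014(c(150000, 750000, 4200000, 21000000, NA))
ranks
```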
Pinyin is a phonetic romanization of Chinese characters. Some data store region names in pinyin instead of Chinese characters. By default, `regioncode` returns names in Chinese characters. Thanks to Peng Zhao and Qu Cheng's `pinyin` package, users can obtain pinyin output from the `regioncode` function by setting the argument `to_pinyin = TRUE`. This option also corrects the romanization of areas with special spellings, such as Shanxi vs. Shaanxi, Inner Mongolia, and the special administrative regions. It works for official names, incomplete names, and administrative area outputs. The following example demonstrates the option in several settings:
    tibble(
      city = corruption$prefecture,
      cityPY = regioncode(data_input = corruption$prefecture,
                          year_from = 2019,
                          year_to = 1989,
                          convert_to = "name",
                          to_pinyin = TRUE),
      areaPY = regioncode(data_input = corruption$prefecture,
                          year_from = 2019,
                          year_to = 1989,
                          convert_to = "area",
                          to_pinyin = TRUE)
    )

    # Regions with special spelling
    regioncode(data_input = c("山西", "陕西", "内蒙古", "香港", "澳门"),
               year_from = 2019,
               year_to = 2008,
               convert_to = "name",
               incomplete_name = TRUE,
               province = TRUE,
               to_pinyin = TRUE)
The `regioncode` function also supports conversions at the provincial level. By setting the argument `province = TRUE`, users can convert all geocodes and names at this level. Chinese provinces also have abbreviations. When the input data contain only abbreviations, users can set the `convert_to` argument to `abbreTocode`, `abbreToname`, or `abbreToarea` to obtain the desired output type. To receive abbreviations as output, set `convert_to = "abbre"`. The following example converts a vector of province geocodes to their official names and abbreviations:
    tibble(
      province = corruption$province_id,
      prov_name = regioncode(data_input = corruption$province_id,
                             convert_to = "name",
                             year_from = 2019,
                             year_to = 1989,
                             province = TRUE),
      prov_abbre = regioncode(data_input = corruption$province_id,
                              convert_to = "codeToabbre",
                              year_from = 2019,
                              year_to = 1989,
                              province = TRUE)
    )
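When the data hold only prefecture-level codes, a province-level code can be derived arithmetically before calling the function with `province = TRUE`. The sketch below assumes the standard six-digit layout in which the first two digits identify the province; `to_province_code()` is a hypothetical helper and not part of `regioncode`.

```r
# Assumes six-digit administrative division codes whose first two digits
# identify the province (e.g. 370100, Jinan, belongs to 370000, Shandong).
to_province_code <- function(code) (code %/% 10000) * 10000

province_codes <- to_province_code(c(370100, 110000, 310000))
province_codes
```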
The current version of `regioncode` supports two types of region conversion beyond the provincial level: administrative areas and linguistic zones.
Chinese regions are divided into seven areas for social, political, and military reasons [@SunPing2020]:
| Region | Provincial-level Administrative Unit                           |
|:-------|----------------------------------------------------------------|
| 华北   | 北京市, 天津市, 山西省, 河北省, 内蒙古自治区                   |
| 东北   | 黑龙江省, 吉林省, 辽宁省                                       |
| 华东   | 上海市, 江苏省, 浙江省, 安徽省, 福建省, 台湾省, 江西省, 山东省 |
| 华中   | 河南省, 湖北省, 湖南省                                         |
| 华南   | 广东省, 海南省, 广西壮族自治区, 香港特别行政区, 澳门特别行政区 |
| 西南   | 重庆市, 四川省, 贵州省, 云南省, 西藏自治区                     |
| 西北   | 陕西省, 甘肃省, 青海省, 宁夏回族自治区, 新疆维吾尔自治区       |
In certain cases, users may wish to identify the area to which a prefecture or province belongs. `regioncode` converts the codes and names of regions (both prefectures and provinces) into areas when the output format is set to `"area"`:
    regioncode(data_input = corruption$prefecture,
               year_from = 2019,
               year_to = 1989,
               convert_to = "area")
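The area assignment itself is a plain lookup. As a minimal sketch, the named vector below encodes a handful of rows from the area table; `area_of` is a hypothetical object for illustration and covers only some provinces.

```r
# Province-to-area lookup built from a few rows of the area table.
area_of <- c("北京市" = "华北", "山西省" = "华北",
             "辽宁省" = "东北", "山东省" = "华东",
             "湖北省" = "华中", "广东省" = "华南",
             "四川省" = "西南", "陕西省" = "西北")

# Names missing from the lookup return NA.
provinces <- c("山东省", "陕西省", "北京市", "未知省")
areas <- unname(area_of[provinces])
areas
```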
China is a multilingual country with various dialects. A dialect may be spoken across several prefectures in a province or even across different provinces. For political and sociolinguistic studies, `regioncode` can return the approximate linguistic zones of given geocodes or prefectural names. In the current version, `regioncode` offers two levels of linguistic zone identification: dialect groups (`dia_group`, "方言大类") and dialect sub-groups (`dia_sub_group`, "分区片"), according to the 1987 language atlas of China [@LiEtAl1987]. (When `province = TRUE`, the linguistic conversion is available only at the dialect group level.) The following example converts the toy data to dialect groups and sub-groups:
    tibble(
      city = corruption$prefecture,
      dialectGroup = regioncode(data_input = corruption$prefecture,
                                year_from = 2019,
                                year_to = 1989,
                                to_dialect = "dia_group"),
      dialectSubGroup = regioncode(data_input = corruption$prefecture,
                                   year_from = 2019,
                                   year_to = 1989,
                                   to_dialect = "dia_sub_group")
    )
Note that the linguistic distribution in China is too complex to measure precisely at the prefectural level, and it changes continually with population dynamics. The linguistic zone output from `regioncode` is thus intended for reference rather than for rigorous linguistic research.
`regioncode` offers a convenient way to convert Chinese administrative division codes and official names and to perform a range of more specific conversions. Development of the package is ongoing, and future versions aim to add more administrative levels and richer data. Collaboration is welcome, and questions, comments, or bug reports can be directed to GitHub Issues.
We extend our appreciation to LI Ruizhe, ZHU Meng, SHI Yuyang, XU Yujia, PAN Yuxin, TIAN Haiting, SHAO Weihang, CHEN Yuanqian, and LIU Xueyan for their contributions to data collection and function editing of this package.
::: {#refs}
:::
Yue Hu
Department of Political Science, Tsinghua University
Email: yuehu@tsinghua.edu.cn
Website: https://www.drhuyue.site
Xinyi Ye
Department of Political Science, Tsinghua University
Email: yexy23@mails.tsinghua.edu.cn