dbutils | R Documentation |
The dbutils package is a set of utility functions for processing and modifying BC Stats' population data via four main categories:
database access functions (getDBPath
, dbCheck
,
dbInfo
, dbRead
, dbWrite
)
conversion functions (conversionTables
, conversionRead
,
dbConvert
)
raking functions (dbRake
, raking helper functions)
app function (updateAppData
to update app data)
The BC Stats Population section has historically depended on APL; the 'dbutils' package is a conversion of PEOPLE's APL functions to R functions.
There are two sets of databases: estimates and projections. The naming convention in R of the database file differs from APL. First, estimates are kept separate from projections. This is represented in the naming of the databases as POPRR[E/P]YY, where an "E" is for estimates, and "P" is for projections (APL used "O" for projections). Second, APL databases used the year of the RUN for the file name (e.g., POPHAO19 was ran in 2019). This can be confusing, so the new database files use the years of the most recent estimates. In other words, POPRRE18 contains the estimates up to 2018 (July 1st 2018), and POPRRP18 includes projections data with base year 2018 (July 1st 2018). Projections files also include estimates years.
Database types include: population, deaths, births. Databases must have 7 columns: Year, Type, TypeID, Age, Male, Female, and Total. Each Year of data must have the same number of occurrences, and Years must be continuous (no year can be missing).
There are several naming conventions used with such data, where files are named DDDRRTYY:
DDD is the database type, such as POP = Population, BIR = Birth, DEA = Death.
RR is the region code (see below)
T is the data type (i.e., E = Estimates or P = Projections)
YY is the year (or two-digit PEOPLE version number, indicating which PEOPLE run) you want data for
For historical reasons, population data is saved as POPRREYY (population estimates), POPRRPYY (population projections), BIRRRYY (births) or DEARRYY (deaths), where RR is the shorthand for the region code, and YY is the last two digits of the year (this was passed over from decades of BC Stats PEOPLE development limited to 2-digit codes).
Common region codes are as follows:
RD - Regional District (which is the same data as Census Division (CD))
CS/CSD - Census Subdivision (CDCSD)
DR - Development Region (DR)
HA - Local Heath Area (LHA)
HS - Health Service Delivery Area (HSDA)
HY - Health Authority (HA)
CH - Community Health Service Area (CHSA)
CF - Ministry of Children and Family Development (MCFD)
CA - MCFD Service Delivery Area (MCFD_SDA)
CL - MCFD Local Service Area (MCFD_LSA)
MP/RCMP - Police Respondent Areas (RESP)
ED/PED - Provincial Electoral District (PED)
SD - School District (SD)
PS - College Region or Post-Secondary (CR)
SR - Special Regions (includes CMA/CA, Van Isle (VI) and Van Isle excluding CRD (VN). Does not add up to BC total)
Age vectors can include combinations of single age years, 5-year age groups (e.g., -4, -9, etc.), grouped older populations (e.g., -65, -90, -100, -110, -120, etc.), and Total (i.e., -999). Negative ages refer to age groups: those that end in 4 or 9 are five year age groupings ending on that age (e.g., -4 refers to those aged 0-4, -9 is those aged 5-9, -14 is those aged 10-14, etc.), while those that end in 0 or 5 are "and over" groups beginning with that age (e.g., -90 refers to those aged 90 and over, -100 is those aged 100 and over, etc.).
Population data is available in single age years up to 109 years old, but is usually grouped for 100 and over. (APL did not produce population for ages beyond 100+, so 100:109 and -110 have 0 population, although they do not have to be in future.) Death data is available in single age years up to 120 years old. Birth data is available only for ages 15 to 65 and over. (Age in birth data refers to the mother's age.) Population estimates and projections apps only need total and individual ages 0:89, and 90+ (i.e., exclude ages above 90 and any 5 year age groups) as the app creates age groups on the fly.
popall <- c(seq(0, 109), seq(-4, -109, -5), -90, -100, -110, -999)
pop1yr90 <- c(seq(0, 89), -90, -999)
pop1yr100 <- c(seq(0, 99), -100, -999)
pop1yrOver90 <- c(seq(90, 120))
pop5yr90 <- c(seq(-4, -89, -5), -90, -999)
pop5yr100 <- c(seq(-4, -99, -5), -100, -999)
deaall <- c(seq(0, 119), seq(-4, -119, -5), -90, -120, -999)
dea1yr120 <- c(seq(0, 119), -120, -999)
dea1yr110 <- c(seq(0, 109), -110, -999)
dea1yr90 <- c(seq(0, 89), -90, -999)
dea5yr120 <- c(seq(-4, -119, -5), -120, -999)
dea5yr110 <- c(seq(-4, -109, -5), -110, -999)
dea5yr90 <- c(seq(-4, -89, -5), -90, -999)
birall <- c(seq(15, 64), seq(-19, -64, -5), -65, -999)
bir1yr65 <- c(seq(15, 64), -65, -999)
bir5yr65 <- c(seq(-19, -64, -5), -65, -999)
Data is available at varying levels of detail, mostly dependent on some form of geography. For example, Census-related data may be accessible for every Dissemination Block, while health-related data may only be available at the Local Health Area (LHA) level. As such, there exists a need to transfer information from one geography to another. This is achieved using a conversion table that converts (demographic) data from Geography A to Geography B.
A conversion table is based on population counts at a very high level of detail. BC Stats uses Dissemination Block-level population totals from the most recent Census. Additionally, one would need geographic software (like ArcMap or QGIS) to establish the allocation of Dissemination Blocks to the conversion geographies. While the following mentions the underlying linkage between geographies being established via Dissemination Blocks, one could technically use any (small) geography to establish this relationship.
Essentially, the following is required:
(Total) population at the Dissemination Block-level
Dissemination Block allocation to Geography A (source)
Dissemination Block allocation to Geography B (destination)
Item (1) will provide the fundamental relationship between (2) and (3).
First, the Dissemination Block populations are aggregated by all possible combinations of intersections between Geography A and B. This is achievable by creating a table from four columns containing (a) the Dissemination Block number, (b) the Dissemination Block population, (c) Geography A's alignment and (d) Geography B's alignment with the Dissemination Block. Then, percentages are assigned that link Geography A to B.
This approach provides a "match" (or relationship) between Geography A and B that is based on the most recent Census population (in year X, say). Use of the conversion table therefore relies on the assumption that, if converting information from A to B in year Y != X, the conversion relationship holds for year Y. Essentially, this assumes no change in the match between A and B from X to Y. This may be valid (or sufficient) in most cases.
The setup requires a "Dissemination Block allocation to Geography A and B". This allocation assumes a 100 (or B) will have to be completely allocated to one region or the other. Another option may be to allocate its population in some proportion to each of the straddled regions. One such proportional allocation could be area. For example, assume Dissemination Block X with population 100 straddles Region Y and Z such that 40 allocate 40 persons from X to Y and 60 persons from X to Z.
Once conversion table(s) are available (from above process), the appropriate table can be read
in with conversionRead
and applied with dbConvert
as follows:
Evaluate the conversion table between Geography A (source) and B (destination) for regions that require no conversion (i.e., regions are converted 1-to-1 fromGgeography A to B). Remove these regions from the conversion process.
Sequentially share out the data relating to the region in Geography A using the conversion factor to the respective region(s) in Geography B.
Possibly aggregate Geography B components that received contributions from multiple regions in Geography A.
Depending on the application, there may be a requirement to conform to some pre-specified
"control total(s)". These control totals are usually the provincial-level totals and conformance
is achieved using raking (dbRake
). If Step (1) removed some 1-to-1 regions, the
control totals will be adjusted downward.
Any 1-to-1 regions removed during Step (1) are added back to Geography B (under the Geography B naming convention).
An example:
Consider the following conversion table which represents the demographic relationship between Geography A and B:
Geography A | Geography B | Percentage |
A-1 | B-1 | 21.96943503% |
A-1 | B-2 | 45.39105179% |
A-1 | B-3 | 32.63951317% |
A-2 | B-3 | 60.84714549% |
A-2 | B-4 | 39.15285451% |
A-3 | B-5 | 100.00000000% |
A-4 | B-6 | 49.22285940% |
A-4 | B-7 | 50.77714060% |
Also, assume we have some statistic (population, say) for Geography A for a number of years:
Geography A | Year 1 | Year 2 | Year 3 | Year 4 | Year 5 |
A-1 | 14,461 | 14,606 | 14,752 | 14,900 | 15,049 |
A-2 | 16,290 | 16,534 | 16,782 | 17,034 | 17,290 |
A-3 | 489 | 504 | 519 | 535 | 551 |
A-4 | 25,285 | 26,119 | 26,981 | 27,871 | 28,791 |
Total | 56,525 | 57,763 | 59,034 | 60,340 | 61,681 |
In the conversion table above, region A-3 is converted 100 technically no conversion necessary, so we remove A-3 from the conversion process. One could perform the following conversion for Year 1 in the above data:
Geography A | Geography B | Percentage | Population |
A-1 | **14,461** | ||
B-1 | 21.96943503% | 3,177 | |
B-2 | 45.39105179% | 6,564 | |
B-3 | 32.63951317% | 4,720 | |
A-2 | **16,290** | ||
B-3 | 60.84714549% | 9,912 | |
B-4 | 39.15285451% | 6,378 | |
A-4 | **25,285** | ||
B-6 | 49.2228594% | 12,446 | |
B-7 | 50.7771406% | 12,839 | |
Total | **56,036** | ||
Aggregating the breakout for region B-3 (that is, split across A-1 and A-2) would yield the final conversion for all years:
Geography B | Year 1 | Year 2 | Year 3 | Year 4 | Year 5 |
B-1 | 3,177 | 3,209 | 3,241 | 3,273 | 3,306 |
B-2 | 6,564 | 6,630 | 6,696 | 6,763 | 6,831 |
B-3 | 14,632 | *14,827* | 15,026 | *15,229* | 15,432 |
B-4 | 6,378 | 6,474 | 6,571 | 6,669 | 6,770 |
*B-5* | *489* | *504* | *519* | *535* | *551* |
B-6 | 12,446 | 12,857 | 13,281 | 13,719 | 14,172 |
B-7 | 12,839 | 13,262 | 13,700 | 14,152 | 14,619 |
Total | 56,525 | 57,763 | 59,034 | 60,340 | 61,681 |
In the above example, the actual value for the statistic has been truncated to zero decimals and made to sum to the "control totals" (assigning any remainder to the *largest value*; region B-3 in years 2 and 4) identified by Total. This is an elementary method for making totals conform to some controls. When dealing with higher-dimensional data (e.g., including age and gender detail with the above), raking is the suggested means for conformance. B-5 was added back as-is from A-3 after making sure that the aggregated sub-provincial remainders matched the control totals.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.