iea_df: Load IEA data from an extended energy balances .csv file

View source: R/initialize.R

iea_df R Documentation

Load IEA data from an extended energy balances .csv file

Description

If .slurped_iea_df is supplied, the arguments .iea_file and text are ignored. If .slurped_iea_df is absent, either .iea_file or text is required, and the helper function slurp_iea_to_raw_df() is called internally to load a raw data frame.

Usage

iea_df(
  .iea_file = NULL,
  text = NULL,
  expected_1st_line_start = ",,TIME",
  expected_2nd_line_start = "COUNTRY,FLOW,PRODUCT",
  expected_simple_start = expected_2nd_line_start,
  .slurped_iea_df = NULL,
  flow = "FLOW",
  missing_data = "..",
  not_applicable_data = "x",
  confidential_data = "c",
  estimated_year = "E"
)

Arguments

.iea_file

A string containing the path to a .csv file of extended energy balances from the IEA. Can be a vector of file paths, in which case each file is loaded sequentially and stacked together with dplyr::bind_rows(). Default is the path to a sample IEA file provided in this package.

text

A character string that can be parsed as IEA extended energy balances. (This argument is useful for testing.)

expected_1st_line_start

the expected start of the first line of .iea_file. Default is ",,TIME".

expected_2nd_line_start

the expected start of the second line of .iea_file. Default is "COUNTRY,FLOW,PRODUCT".

expected_simple_start

the expected start of the first line of .iea_file when the file is in the simple format, i.e., a single header line. Default is the value of expected_2nd_line_start. Note that the simple format is sometimes encountered in data supplied by the IEA. Furthermore, it could be the format of the file when somebody "helpfully" fiddles with the raw data from the IEA.

.slurped_iea_df

a data frame created by slurp_iea_to_raw_df()

flow

the name of the flow column, entries of which are stripped of leading and trailing white space. Default is "FLOW".

missing_data

a string that identifies missing data. Default is "..". Entries of missing_data are coded as 0 in output.

not_applicable_data

a string that identifies not-applicable data. Default is "x". Entries of not_applicable_data are coded as 0 in output.

confidential_data

a string that identifies confidential data. Default is "c". Entries of confidential_data are coded as 0 in output.

estimated_year

a string that identifies an estimated year. Default is "E". E.g., in "2014E", the "E" indicates that data for 2014 are estimated. Data from estimated years are removed from output.

Details

After loading the raw data, this function cleans it in the following ways.

In the IEA's data, some entries in the "FLOW" column are quoted to avoid creating too many columns. For example, "Paper, pulp and printing" is quoted in the raw .csv file: " Paper, pulp and printing". Internally, this function uses data.table::fread(), which, unfortunately, does not strip leading and trailing white space from quoted entries. So the function uses base::trimws() to finish the job.
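The trimming step can be illustrated with a minimal sketch. It uses base::read.csv() in place of data.table::fread() so that it runs without extra packages, and the sample text and variable names are made up for illustration; it is not the package's internal code.

```r
# Illustrative sketch only: base R stands in for data.table::fread().
raw_text <- paste0(
  "COUNTRY,FLOW,PRODUCT,1960\n",
  "World,\" Paper, pulp and printing\",Total,42\n"
)

# Quoting keeps the comma inside the FLOW entry from splitting the field,
# but the leading space inside the quotes survives parsing.
raw_df <- read.csv(text = raw_text, check.names = FALSE,
                   stringsAsFactors = FALSE)

flow <- "FLOW"  # plays the same role as the flow argument of iea_df()

# Strip leading and trailing white space from the flow column.
raw_df[[flow]] <- trimws(raw_df[[flow]])
raw_df[[flow]]
```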

When the IEA includes estimated data for a year, the column name of the estimated year includes an "E" appended. (E.g., "2017E".) This function eliminates estimated columns.
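One way to drop estimated-year columns is sketched below with base R. The data frame, the regular expression, and the variable names are assumptions made for illustration; the package may implement this step differently.

```r
# Illustrative sketch only: a tiny wide data frame with one estimated year.
wide_df <- data.frame(
  COUNTRY = "World",
  FLOW = "Production",
  PRODUCT = "Hard coal (if no detail)",
  `2016` = 42,
  `2017E` = 43,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

estimated_year <- "E"  # same role as the estimated_year argument of iea_df()

# A column is treated as estimated when its name is digits followed by the marker.
is_estimated <- grepl(paste0("^[0-9]+", estimated_year, "$"), names(wide_df))

# Keep every column that is not an estimated year.
clean_df <- wide_df[, !is_estimated, drop = FALSE]
names(clean_df)
```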

The IEA data have indicators for not applicable values ("x") and for unavailable values (".."). (See "World Energy Balances: Database Documentation (2018 edition)" at http://wds.iea.org/wds/pdf/worldbal_documentation.pdf.) R has three concepts that could be used for "x" and "..":

0 would indicate a value known to be zero.
NULL would indicate an undefined value.
NA would indicate a value that is unavailable.

In theory, mapping from the IEA's indicators to R should proceed as follows:

".." (unavailable) in the IEA data would be converted to NA in R.
"x" (not applicable) in the IEA data would be converted to 0 in R.
NULL would not be used.

However, the IEA are not consistent with their coding. In some places ".." (indicating unavailable) is used for not applicable values, e.g., World Anthracite supply in 1971. (World Anthracite supply in 1971 is actually not applicable, because Anthracite was classified under "Hard coal (if no detail)" in 1971.) In other places, ".." is used for data in the most recent year when those data have not yet been incorporated into the database.

In the face of the IEA's inconsistencies, the only rational way to proceed is to convert both "x" and ".." in the IEA files to 0 in the output data frame from this function. Furthermore, confidential data (coded by the IEA as "c") are also interpreted as 0. (What else can we do?)
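The recoding rule can be shown on a plain character vector. This is a minimal sketch with made-up values and variable names, not the package's own implementation.

```r
# Illustrative sketch only: raw entries as they might appear in one year column.
values <- c("42", "..", "x", "c", "43.5")

# The three markers, named after the corresponding arguments of iea_df().
markers <- c(missing_data = "..",
             not_applicable_data = "x",
             confidential_data = "c")

# Every marker becomes "0"; afterwards the column can be converted to numeric.
values[values %in% markers] <- "0"
numeric_values <- as.numeric(values)
numeric_values
```

Because all three markers collapse to 0, the output no longer distinguishes a true zero from an unavailable, not applicable, or confidential entry.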

The data frame returned from this function is not ready to be used in R, because rows are not unique. To further prepare the data frame for use, call augment_iea_df(), passing the output of this function to the .iea_df argument of augment_iea_df().

This function is vectorized over .iea_file.

Value

a data frame containing the IEA extended energy balances data

Examples

# Original file format
iea_df(text = paste0(",,TIME,1960,1961\n",
                     "COUNTRY,FLOW,PRODUCT\n",
                     "World,Production,Hard coal (if no detail),42,43"))
# With extra commas on the 2nd line
iea_df(text = paste0(",,TIME,1960,1961\n",
                     "COUNTRY,FLOW,PRODUCT,,,\n",
                     "World,Production,Hard coal (if no detail),42,43"))
# With a clean first line
iea_df(text = paste0("COUNTRY,FLOW,PRODUCT,1960,1961\n",
                     "World,Production,Hard coal (if no detail),42,43"))

MatthewHeun/IEATools documentation built on Dec. 14, 2024, 12:08 a.m.