Atmospheric CO2
In RCMIP5: Tools for Manipulating and Summarizing CMIP5 Data

One common use for CMIP5 data, and the RCMIP5 package, is to extract the global mean of some variable of interest for particular models, experiments, and/or ensembles. Here we give an example of this, computing the global mean atmospheric CO2 value by year.

Make sure the required data are present

As the package documentation notes (see e.g. https://github.com/JGCRI/RCMIP5) you first have to download the necessary data from an Earth System Grid Federation node; RCMIP5 won't do this for you. Assuming this step is complete, then, we check to make sure all the data we'll need are present. First load information about all the CMIP5 files available:

> library(RCMIP5)
> mypath <- "~/my/cmip5/files/"
> c5files <- getFileInfo(mypath)

By default, getFileInfo will recursively scan the entire directory tree and return a data frame containing the following information:

File path
File name
Variable name
Domain
Model
Experiment (typically historical, rcp45, rcp85, etc.)
Ensemble
Time^[Unless it's an 'fx' (fixed) domain file, with no time dimension.]
Size (kb)

We're interested in the CO2 files; more specifically, we want to know whether all the necessary files have been downloaded.

> co2files <- subset(c5files, variable=="co2")
> co2filechk <- checkTimePeriod(co2files)
> co2filechk[-1]  # don't display the file path

For the CMIP5 data we have^[Again, this depends on what you've downloaded from the ESGF! Naively running the commands in this vignette, without the appropriate CMIP5 data, will produce nothing but errors.], this produces the following output:

  experiment   model variable ensemble         yrStr allHere startDate  endDate files
1      rcp85 CanESM2      co2   r1i1p1 200601-210012    TRUE      2006 2100.917     1
2      rcp85 CanESM2      co2   r2i1p1 200601-210012    TRUE      2006 2100.917     1

Here we see that two ensembles are available, both from the 'rcp85' experiment run by the 'CanESM2' model. Crucially, both are complete: the allHere flag is TRUE (it would be FALSE if, for example, we'd downloaded 2006-2049 and 2051-2100 but forgotten year 2050).

Load the data

Time to load these data into memory. To do so we use the loadCMIP5 function, telling to it to load the 'co2' variable from the 'CanESM2' model running the 'rcp85' experiment. Note that here we're using the yearRange parameter to limit the data to five years:

> co2 <- loadCMIP5("co2", "CanESM2", "rcp85", path=mypath, verbose=T, yearRange=c(2006, 2010))
Averaging ensembles: r1i1p1 r2i1p1 
Loading ~/my/cmip5/files/co2_Amon_CanESM2_rcp85_r1i1p1_200601-210012.nc 
- co2 dimension names: lon lat plev time 
- loading only timeslices 1 - 60 
- data 128 64 22 60 
Loading ~/my/cmip5/files/co2_Amon_CanESM2_rcp85_r2i1p1_200601-210012.nc 
- co2 dimension names: lon lat plev time 
- loading only timeslices 1 - 60 
- data 128 64 22 60 
Melting to data frame

loadCMIP5 reports that there are two ensembles to average (r1i1p1 and r2i1p1), the name of each file it loads, the dimension names, the fact that it's only reading the first five years, and the overall data array size: 128 (lon) x 64 (lat) x 22 (plev, which will be renamed 'Z' when read in) x 60 (time). These data are converted to a data frame for fast processing in later steps.

Examining the data

Summarize these data:

> summary(co2)
CMIP5 data 
Variable: co2 (1e-6) from model CanESM2
Data range: 379.81-390.28  Mean: 384.85
Experiment: rcp85 - 2 ensemble(s)
Spatial dimensions: lon [128] lat [64] Z [22]
Time dimension: mon [60] days since 1850-1-1 
Size: 82.5 MB
Provenance has 6 entries

Here we see the name of the variable, model, experiment, the fact that this came from two ensembles, the data range and spatial dimensions, time dimensions and frequency ('mon' - monthly), size in memory, and a note about the provenance (see below). Everything looks good. Assuming the ggplot2 package is installed, we might want to make a quick plot of the data to check:

> worldPlot(co2, time=1:12)  # first 12 months, i.e. all of 2006

The resulting plot isn't very interesting, as CO2 is a well-mixed gas, but would be useful if we were working with a spatially variable output.

Compute global means

There are three steps in our processing chain. First, we want to filter the data: this model reports atmospheric CO2 for 22 different levels (the 'Z' dimension), and we're only interested in the level closest to the surface. Second, these are monthly data, and we're interested in an annual average. Third, we want to reduce the gridded data to a global mean value. This can be done as follows:

> co2surf <- filterDimensions(co2, Zs=co2$Z[1], verbose=T)
Filtered by Z, dim = 128 64 1 60 
> co2annual <- makeAnnualStat(co2surf, verbose=T)
Took 1.41 s
> co2summary <- makeGlobalStat(co2annual, verbose=T)
No grid areas supplied; using calculated values
Calculating grid cell areas...
Took 0.03 s

Note that because we didn't supply any grid cell area data to makeGlobalStat, it computed its own. Now we look at the resulting data:

> summary(co2summary)
CMIP5 data (annual summary of 12 months) (filtered) 
Variable: weighted.mean of co2 (1e-6) from model CanESM2
Data range: 380.72-389.22  Mean: 384.85
Experiment: rcp85 - 2 ensemble(s)
Spatial dimensions: lon [0] lat [0] Z [1] 
Time dimension: years (summarized) [5] days since 1850-1-1 
Size: 0.0 MB
Provenance has 10 entries

The summary reports that co2annual is an annual summary of filtered data, and a weighted mean (from the makeGlobalStat operation). Note the dimensions of the data: no longitude or latitude (since we summarized these to a global mean); a single level (Z) that's the one filtered to; and 5 time points. The summary also explicitly tells us that these data have been filtered and summarized. At this point, we may want to convert this object to a regular data frame for easy plotting:

> co2_df <- as.data.frame(co2summary)
> co2_df
      Z time    value
1 1e+05 2006 380.7173
2 1e+05 2007 382.6840
3 1e+05 2008 384.7202
4 1e+05 2009 386.9201
5 1e+05 2010 389.2190
> plot(co2_df$time, co2_df$value, type='b')

co2_df <- structure(list(Z = c(1e+05, 1e+05, 1e+05, 1e+05, 1e+05), time = c(2006, 
2007, 2008, 2009, 2010), value = c(380.717267354329, 382.683990478516, 
384.720212300618, 386.920120239258, 389.219032287598)), .Names = c("Z", 
"time", "value"), row.names = c(NA, -5L), class = "data.frame")
plot(co2_df$time, co2_df$value, type='b')

Provenance

With each operation we perform, RCMIP5 adds entries to the object's provenance, a data frame that records various information about the data's history. If we look at the provenance of the co2summary object, we see a record of the steps performed that produced it:

> co2summary$provenance[c(1,3)]
             timestamp                                                 message
1  2014-10-07 15:46:37           RCMIP5 0.1 under R version 3.1.0 (2014-04-10)
2  2014-10-07 15:46:37   Loaded co2_Amon_CanESM2_rcp85_r1i1p1_200601-210012.nc
3  2014-10-07 15:46:39                          Merged (*) provenance follows:
21 2014-10-07 15:46:39 * Loaded co2_Amon_CanESM2_rcp85_r2i1p1_200601-210012.nc
5  2014-10-07 15:46:39                                   Added ensemble r2i1p1
6  2014-10-07 15:46:40               Computed mean of ensembles: r1i1p1 r2i1p1
7  2014-10-07 15:46:41               Filtered for Zs in range [ 1e+05, 1e+05 ]
8  2014-10-07 15:46:47                   Calculated mean for years 2006 - 2010
9  2014-10-07 15:46:47    About to compute global stat. Grid areas calculated.
10 2014-10-07 15:46:48                    Computed global weighted.mean of co2

The provenance also records what function (with parameter values) wrote each message, the data dimensions at each step, and a checksum (MD5 hash) of the data. The provenance can be exported, saved alongside the data, etc.

Speed considerations

The CMIP5 data files can be very large, such that loading (not to mention processing) them is slow or impossible even on computers with plenty of memory. Some tips on dealing with this:

Use the yearRange parameter of loadCMIP5() to process smaller chunks of the data, re-combining only at the end. For the largest CMIP5 data sets, we've found it sometimes necessary to process 1-5 years at a time, even on a powerful machine.
If combining results as you go, don't repeatedly rbind() data frames together. Either write data out to a tempfile, or pre-allocate one results array beforehand.
Order of operations matters! Process the most expensive (largest) dimensions first; for example, makeGlobalStat() should usually come before makeAnnualStat().

Next steps and final notes

As noted above, we can simply use as.data.frame to convert our cmip5data object to a standard data frame. We could also saveNetCDF(co2summary) to save it as a Network Common Data Format (NetCDF) file, for example to send to a colleague; the provenance is saved as NetCDF global attributes, and optionally as an accompanying text file.

Given the above example, it's straightforward to process a whole collection of CMIP5 data-for example, all the co2 data across experiments and models. For example:

for(mo in co2filechk$model) {
    for(ex in co2filechk$experiment) {
        co2 <- loadCMIP5(variable="co2", model=mo, experiment=ex, path=mypath)
        #
        # ... process ...
        # 
        co2_df <- as.data.frame(co2)
        co2_df$model <- mo
        co2_df$experiment <- ex
        #
        # ... save, or combine with running results
        #
    }
}

This concludes the Atmospheric CO2 vignette.

# Here's the short script that produced the outputs in this vignette.
# (Because it uses very large CMIP5 data files, we can't run it as 
# part of the vignette itself.)
sink("sink.txt")

library(RCMIP5)

c5files<-getFileInfo()

co2files <- subset(c5files, variable =="co2")
co2filechk <- checkTimePeriod(co2files)
print(co2filechk[-1])

co2 <- loadCMIP5("co2", "CanESM2", "rcp85", verbose=T, yearRange = c(2006,2010))
print(summary(co2))
save(co2, file="co2")

print(str(co2))
print(co2$Z)

co2surf <- filterDimensions(co2, Zs=co2$Z[1], verbose=T)
print(co2surf)

co2annual <- makeAnnualStat(co2surf, verbose=T)
co2summary <- makeGlobalStat(co2annual, verbose=T)
print(summary(co2summary))
print(as.data.frame(co2summary))

write.csv(as.data.frame(co2summary), "co2_canesm2.csv")

save(co2summary, file="co2summary")

sink()