Climate models frequently exhibit bias relative to observations. Many models have a tendency to "run hot" or "run cold" with regard to near-surface air temperatures, and most models suffer from the "drizzle problem" of too much light precipitation and too little heavy precipitation.
This vignette demonstrates how to bias-correct regional climate model (RCM) output using Kernel Density Distribution Mapping (KDDM).
Distribution mapping is the general term for bias-correction techniques that adjust the individual values in the model output so that their statistical distribution matches that of observations. These techniques include quantile mapping, simple mean & variance adjustments, and adjustments based on fitting parametric distributions. Teutschbein & Seibert [-@teutschbein2012bias] show that, of the various bias correction approaches that have been developed, distribution mapping has the best overall performance across multiple different metrics. McGinnis, et al. [-@mcginnis2015kddm] show that, of the different implementations of distribution mapping, KDDM similarly has the best overall performance.
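For concreteness, the simplest non-parametric variant, empirical quantile mapping, can be sketched in a few lines of base R. (The function `eqmap` below is purely illustrative and not part of the climod package.) Each model value is replaced by the observed value at the same empirical quantile:

```r
## Empirical quantile mapping (illustrative sketch):
## find each value's quantile in the model distribution,
## then look up the observed value at that same quantile.
eqmap <- function(x, mod, obs){
  quantile(obs, probs=ecdf(mod)(x), names=FALSE)
}
```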
KDDM non-parametrically models the distribution of the data using kernel density estimation. The climod package uses binned kernel density estimation from the KernSmooth package [@kernsmooth] by default. It then converts the PDFs so generated into CDFs via numerical integration using the trapezoid rule and creates a transfer function by mapping the CDF for the model data onto the CDF for the observational data. When applied to model output, this transfer function accomplishes the KDDM transformation.
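The core of KDDM can be sketched in a few lines following the steps just described. (This is a simplified illustration, not climod's actual implementation; in particular, it uses linear interpolation where the package uses a monotone spline for the final transfer function.)

```r
library(KernSmooth)

## Sketch of a KDDM transfer function: KDE -> CDFs -> quantile mapping
kddm_transfer <- function(mod, obs){
  ## kernel density estimates of the two PDFs
  pdf.m <- bkde(mod)
  pdf.o <- bkde(obs)
  ## convert PDF to CDF via trapezoid-rule integration
  pdf2cdf <- function(k){
    p <- cumsum((head(k$y, -1) + tail(k$y, -1)) / 2 * diff(k$x))
    list(x=k$x[-1], p=p/max(p))
  }
  cdf.m <- pdf2cdf(pdf.m)
  cdf.o <- pdf2cdf(pdf.o)
  ## transfer function: x -> F_model(x) -> F_obs^(-1)
  function(x){
    p <- approx(cdf.m$x, cdf.m$p, x, rule=2)$y
    approx(cdf.o$p, cdf.o$x, p, rule=2, ties="ordered")$y
  }
}

## usage: tf <- kddm_transfer(model, observations); corrected <- tf(model)
```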
This vignette uses climate model output from NARCCAP, the North American Regional Climate Change Assessment Program [@narccap]. NARCCAP generates high-resolution output for use in climate impacts studies by dynamically downscaling coarser global climate model (GCM) output to a 50-km grid scale [@mearns2009narccap]. We use output from the ECP2 RCM driven by the GFDL GCM. See the NARCCAP website (http://narccap.ucar.edu) for more information about the simulations.
For observational data, we use the daily 1/8-degree gridded meteorological data from the University of Washington [@maurer2002long]. (This dataset is sometimes referred to as the Maurer observational dataset.)
We regrid both the model outputs and the observational datasets to a common half-degree lat/lon grid. We regrid the model output using the ESMF "patch recovery" interpolation method implemented in NCL. Because the observational data is already on a higher-resolution lat/lon grid, we aggregate it spatially by simple averaging.
It's important to understand the spatial resolution of the data to make sense of the output. The following code plots elevation data (from [@gebco]) for the state of Colorado, and overlays the boundaries of the six half-degree gridcells used in this example. These gridcells are roughly centered on the city of Boulder, shown in blue.
```r
library(climod)
library(fields)

## read in topographic data
colo <- nc_ingest("gebco-topo/GEBCO-topo-colorado.nc")

## plot elevation
cmap <- terrain.colors(100)
image.plot(colo$lon, colo$lat, colo$elevation, col=cmap, asp=1,
           xlab="lon", ylab="lat", main="Colorado elevation")

## overlay RCM gridboxes
boxlat <- c(39.5, 40.0, 40.5)
boxlon <- c(-106.0, -105.5, -105.0, -104.5)

vlat = c(rbind(boxlat, boxlat, NA))
vlon = rep(c(range(boxlon), NA), length(boxlat))
hlat = rep(c(range(boxlat), NA), length(boxlon))
hlon = c(rbind(boxlon, boxlon, NA))

lines(c(vlon, hlon), c(vlat, hlat))

## Add location of Boulder, Colorado
points(-105.251945, 40.027435, col="blue")
```
This section demonstrates how to bias-correct daily maximum temperature (Tmax) data as described above.
There are three datasets to work with: observational data (`obs`), model output for the current climate (`cur`), and model output for the future climate (`fut`). In NARCCAP, the "current" period (used by both the `cur` and `obs` datasets) runs from 1970-2000, while the future period runs from 2040-2070.

There are many steps where all three datasets need to be manipulated in the same way. To cut down on error-prone duplication of code, we organize the datasets into lists or vectors of three elements named `obs`, `cur`, and `fut` and operate on them using the various `apply` functions.
varname <- "tmax" norm <- "zscore" ## the type of normalization to use indir <- "boulder-data/" infiles <- c() infiles["obs"] <- paste0(indir,varname,".maurer.observed.daily.common.nc") infiles["cur"] <- paste0(indir,varname,".ECP2.gfdl-current.daily.common.nc") infiles["fut"] <- paste0(indir,varname,".ECP2.gfdl-future.daily.common.nc") nc <- lapply(infiles, nc_ingest) ## extract variables of interest from the netcdf objects time <- lapply(nc,"[[","time") indata <- lapply(nc,"[[",varname) units <- indata$cur@units ### data structures for storage of results outdata <- lapply(indata,"+",NA)
The function `nc_ingest` reads in the contents of a netcdf file and creates variables with the same names and attributes as the netcdf file. Compare the output from the `ncdump` command with the in-memory representation of the ingested data. (Note: we pipe the output through `sed` to skip over the global attributes.)
```r
command <- paste("ncdump -h", infiles["cur"], "| sed '/global/,$d'")
cat(system(command, intern=TRUE), sep="\n")
```
```r
str(nc$cur)
```
The seasonal variation of climate variables is typically much larger than the bias or the climate change signal. Therefore, we have to perform bias-correction on only a portion of the annual cycle at a time, in order to prevent the seasonal cycle from drowning out everything else.
Climatology studies typically operate on a monthly timescale. That is, they assume that the weather is approximately "the same" (statistically speaking) within a roughly 30-day period. We will use the same 30-day window length in this vignette, but rather than bias-correcting the data month-by-month, we will use a sliding window, and correct each individual day based on the 30-day period surrounding it.
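The wrap-around day-of-year windowing can be illustrated with a standalone sketch in base R (this is not how climod implements it, just the underlying idea):

```r
## indices of all observations within +/- 15 days of day-of-year d,
## wrapping around the year boundary (assumes a 365-day year)
window_indices <- function(doy, d, halfwidth=15, ylen=365){
  delta <- (doy - d + ylen/2) %% ylen - ylen/2
  which(abs(delta) <= halfwidth)
}

## example: a 30-day window centered on January 5 picks up
## late-December days from the ends of other years
doy <- rep(1:365, times=30)   # thirty years of daily data
idx <- window_indices(doy, d=5)
range(doy[idx])               # 1 and 365: the window wraps around
```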
The `cslice` function takes a vector of times and generates sets of indexes that divide the vector up into climatological windows, each of which contains points from a particular time of year. The function returns an object that can be used to extract all the values falling within a window from any vector matching the time input.

Note that the times in the input vector do not need to be uniformly spaced, nor do they need to have an integer number of timesteps per year. The `cslice` function divides the annual cycle up into contiguous windows, finds all the times falling within each window, and returns the corresponding indexes. The times do need to be stored in NetCDF-CF representation, as time elapsed since some starting point. The `yearlength` function is used to derive the number of days per year from the `calendar` attribute of the time vector, if it is defined; otherwise it defaults to the standard real-world value of 365.2425 days.
The following figure shows how temperature data can be sliced into 12 "monthly" climatological windows across multiple years. ("Monthly" is in scare quotes because the windows are all identically 365/12 days long, and therefore don't strictly match the calendar months.)
```r
x <- indata$cur[2,1,]
t <- time$cur

## Slicing for 12 equally-sized monthly windows
monthly <- cslice(t, num=12, ratio=1, names=month.abb, split=FALSE)

## Slice vectors into climatological windows
mx <- slice(x, monthly)
mt <- slice(t, monthly)

## Plot data colored by window
par(bg="gray")
col <- topo.colors(12)
names(col) <- month.abb
xr <- c(0, 365*2) + min(t)
yr <- range(x)
plot(NA, xlim=xr, ylim=yr, xlab=t@units, ylab=units, main=varname)
for(i in month.abb){
  points(mt[[i]], mx[[i]], col=col[i])
}
legend("topright", names(col), col=col, pch=1, ncol=4, cex=0.75)
```
The `cslice` function generates sets of indices for both inner and outer windows. The smaller inner windows are contiguous, while the larger outer windows are allowed to overlap. This is important for bias correction because you need a large window to get a good estimate of the distribution, but days near the beginning and end of a large window are not well-represented by the overall distribution, especially in the shoulder seasons (spring and fall). If you use a small window, the distribution within the window is more stationary, but not well-sampled. We can avoid this dilemma by using a large outer window to develop the transfer function, and then applying it only to the data in the small inner window. This is analogous to using a moving average to calculate anomaly values instead of a monthly mean. These relationships are shown in the following figure.
```r
## Outer slice example, continuing from above: compare normalizing
## daily data based on 30-day moving climatological window vs
## normalizing using monthly window

## day of year
doy <- t %% yearlength(t)

daily <- cslice(t, inner=1, outer=30, split=FALSE)

dox <- slice(x, daily, outer=TRUE)
dot <- slice(t, daily, outer=TRUE)
dix <- slice(x, daily, outer=FALSE)
dit <- slice(t, daily, outer=FALSE)

par(mfrow=c(4,1))

plot(doy, x, pch=".", xlab="day of year", ylab=units,
     main="30:1 windows at day 100")
points(dot[[100]] %% 365, dox[[100]], col="blue")
points(dit[[100]] %% 365, dix[[100]], col="red")
legend("bottom", c("All data", "Outer window", "Inner window"),
       pch=c(46,1,1), col=c("black","blue","red"))

boxplot(mx, pch='.', boxwex=1, outline=FALSE,
        main="Monthly windows", ylab=units)

boxplot(dox, pch='.', boxwex=1, outline=FALSE,
        main="30-day moving windows", whisklty=0, boxlty=0,
        boxfill="gray", xlab="day of year", ylab=units)

daynorm <- lapply(dox, normalize)
dayanom <- unslice(daynorm, daily)

monnorm <- lapply(mx, normalize)
monanom <- unslice(monnorm, monthly)

plot(doy, dayanom - monanom, pch=".",
     main="Difference in anomalies", ylab=units, xlab="day of year")
abline(v=seq(0, 365, len=13), col="gray")
```
Using a sliding window of a given length is also useful for harmonizing different calendars. Observational data from the real world uses the familiar Gregorian calendar, which has 365 days in a year, except for leap years, which have 366 days. Climate models, on the other hand, typically use idealized calendars. Most of them use a 365-day, or 'noleap' calendar, in which every year is exactly 365 days long, with no leap years ever. However, there are also some climate models that use a 360-day calendar, composed of 12 months, each exactly 30 days long.
These differences are difficult to deal with if you are attempting to establish a one-to-one correspondence between the datasets. By slicing the datasets into a certain number of windows per year, we can reconcile the different calendars by aligning the windows. The fact that the windows may have differing numbers of days is not a problem as long as each outer window has a large number of days in it. In this example, a roughly 30-day window sliced across 30 years contains around 900 days, which is more than enough to avoid problems.
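For reference, the number of days per year can be derived from a CF `calendar` attribute along these lines, analogous to what the `yearlength` function does (an illustrative sketch; see the climod documentation for the actual behavior):

```r
## days per year for common CF calendar names
days_per_year <- function(calendar="standard"){
  switch(calendar,
         "360_day"  = 360,
         "noleap"   = ,
         "365_day"  = 365,
         "all_leap" = ,
         "366_day"  = 366,
         365.2425)  # standard / gregorian / proleptic_gregorian
}

days_per_year("noleap")   # 365
days_per_year("360_day")  # 360
days_per_year()           # 365.2425
```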
The following section shows the high-level flow of the bias-correction procedure. At the outermost level, we loop over gridcells, correcting each one separately. After extracting data for a given gridcell, we slice it into climatological windows. This gives us a nested list with three outer elements (one for each dataset), each containing 360 inner elements (one for each window). We use the `renest` function to invert the nesting, so that we have 360 outer (window) elements that each contain 3 inner (dataset) elements. We apply the `biascorrect` function (described in more detail later on) to each window, re-invert the list nesting, and then collate the inner windows of the corrected data back into a single timeseries for each dataset using the `unslice` function. Finally, we save the results for this gridcell into the appropriate section of the storage arrays.
```r
## width of moving window
mwinwidth = 30

### generate climatology moving window index arrays
### 360 ~1-day moving windows w/ 30-day outer pool
cwin <- lapply(time, cslice, outer=mwinwidth, num=360)

### Well-formatted NetCDF files have the time dimension varying most rapidly
nxy <- dim(indata$obs)[1:2]
nx <- nxy[1]
ny <- nxy[2]

for(x in 1:nx){
  for(y in 1:ny){

    ## extract data for this gridcell
    data <- lapply(indata, function(a){a[x,y,]})

    ## window data using outer window
    wind <- mapply(slice, data, cwin,
                   MoreArgs=list(outer=TRUE), SIMPLIFY=FALSE)

    ## invert list nesting
    datatobc <- renest(wind)

    ## bias-correct each window
    fixdata <- lapply(datatobc, biascorrect, norm)

    ## re-invert the list
    bc <- renest(fixdata)

    ## collate inner windows back into timeseries
    result <- mapply(unslice, bc, cwin, SIMPLIFY=FALSE)

    ## save results in array
    for(i in names(result)){
      outdata[[i]][x,y,] <- result[[i]]
    }
  }
}
```
In this section, we bias-correct data from a single slice of a single gridcell to show the procedure in detail. The sliced data has been further subdivided into segments by year. This subdivision is important for dealing with the non-stationarity of the climate change signal in transient climate simulations. It also makes it easier to visualize what's happening in each step of the bias-correction process.
First, we create a custom plot function for the detailed data. It creates three boxplots (one for each of the three datasets, with one box per yearly segment), with a set of overlaid PDF curves next to them.
```r
library(KernSmooth)

## color palette for bias-correction
ocf <- c(obs="black", cur="blue", fut="red")

## A plotting function to show how the data changes
bplot <- function(ocfdata){
  yr <- range(unlist(ocfdata))
  mu <- sapply(ocfdata, function(x){mean(unlist(x))})
  sig <- sapply(ocfdata, function(x){sd(unlist(x))})

  par(mar=c(3,3,2,2))
  layout(t(matrix(c(c(1,1,1,4), c(2,2,2,4), c(3,3,3,4)), 4, 3)))

  for(i in names(ocf)){
    boxplot(ocfdata[[i]], border=ocf[i], ylim=yr, main=i, xlab="", ylab="")
    abline(h=mu, col=ocf)
  }

  kde <- lapply(lapply(ocfdata, unlist), bkde)
  mplot(kde, x="y", y="x", type="l", col=ocf, lty=1)
  abline(h=mu, col=ocf)
  legend("bottomright", text.col=ocf,
         c(paste("mean:", sprintf("%3.1f", mu)),
           paste("stdev:", sprintf("%3.2f", sig))))
}
```
We will operate on data from the hottest part of the year (near the middle of July in this location), when the bias in extreme temperatures will be highest. Note how the current-period model output has a significantly lower mean and higher variance than the observed data for the same period.
```r
imax <- which.max(sapply(lapply(wind$obs, unlist), mean))
bcdata <- datatobc[[imax]]
bplot(bcdata)
```
The first step in bias correction is to normalize the data. We normalize each annual segment of the data separately in order to handle the transience of the climate change signal. For Tmax data, we use "zscore" normalization, which subtracts the mean and divides by the standard deviation. This is an appropriate normalization for data that is roughly Gaussian in character.
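For intuition, the bare arithmetic of z-score normalization is just this (a sketch only; climod's `normalize` additionally records the mean and standard deviation on the result so that `denormalize` can restore or adjust them later):

```r
## z-score normalization: subtract the mean, divide by the standard deviation
zscore <- function(x){
  (x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
}
```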
```r
## normalize the three data components
nbcd <- rapply(bcdata, normalize, how="replace", norm=norm)
bplot(nbcd)
```
Now we construct a transfer function from the normalized data. The transfer function needs a lot of data to get a good estimate of the distribution, so we pool the data across years. Note that in this case, the normalized datasets look very similar. Not much reshaping of the distribution is required to get the current-period data to look like the observational data, so the adjustments that we will apply when denormalizing the data will be sufficient to handle most of the bias correction. However, looking at the lower tail of the transfer function shows that there are some excessive values on the low end that are reined in by reshaping the distribution using KDDM.
```r
mapping <- distmap(unlist(nbcd$cur), unlist(nbcd$obs))
plot(mapping, xlab="model", ylab="observations")
```
Application of the transfer function adjusts the individual values so that the PDF of `cur` better matches the PDF of `obs`.
```r
## drop obs data
fixme <- nbcd[-(names(nbcd) == "obs")]

## apply KDDM transfer function
fixed <- rapply(fixme, function(x){predict(mapping, x)}, how="replace")

fixed$obs <- nbcd$obs
bplot(fixed)
```
The improvement is difficult to see in the visualization above, in part because most of the adjustments happen in the tails of the distribution. Calculating PDF and tail skill scores (table below) shows a clear, though small, improvement.
Although we could make the PDFs match up more exactly by using a less smooth transfer function, that would result in overfitting the data; the roughness of the line of points in the Q-Q plot above reflects innate variability in the data, not some kind of very complex bias structure. The transfer function is smooth because the `distmap` function uses a low-order monotone Hermite spline (created via `splinefun`) to generate the transfer function.
```r
library(knitr)

pskill <- c(pdfskill(unlist(nbcd$obs), unlist(nbcd$cur)),
            pdfskill(unlist(fixed$obs), unlist(fixed$cur)))
tskill <- c(tailskill(unlist(nbcd$obs), unlist(nbcd$cur)),
            tailskill(unlist(fixed$obs), unlist(fixed$cur)))

skill <- data.frame(cbind(pskill, tskill))
rownames(skill) <- c("raw", "BC")
colnames(skill) <- c("PDF skill", "tail skill")

kable(skill, digits=2)
```
This figure shows the adjustments generated by applying the transfer function to the normalized current and future model output. As described above, the most significant adjustments happen in the tails.
```r
par(mfrow=c(2,1))
yr <- c(-0.3, 0.3)

plot(unlist(nbcd$cur), unlist(fixed$cur) - unlist(nbcd$cur),
     col="blue", main="current",
     xlab="uncorrected value", ylab="adjustment", ylim=yr)
abline(h=0)

plot(unlist(nbcd$fut), unlist(fixed$fut) - unlist(nbcd$fut),
     col="red", main="future",
     xlab="uncorrected value", ylab="adjustment", ylim=yr)
abline(h=0)
```
The last step in bias-correction is to denormalize the adjusted data. We want the mean and standard deviation of the current period to match those of the observed data, so we pass correction factors to the denormalization routine, which applies the appropriate adjustments to both the current and future results. For z-score normalization, the adjustment factors are the difference in the means and the ratio of the standard deviations of the observations versus the current-period model output.
We apply the same transfer function and the same correction factor to the current period and to the future period. This is equivalent to assuming that the bias is stationary, and does not change significantly from the current period to the future period. This assumption is unlikely to be strictly true, but there are studies that point towards it not being badly wrong. Moreover, every bias correction technique must necessarily make an assumption about the character of bias in the future, and without strong evidence to the contrary in the current period, assuming that it doesn't change dramatically is the least unwarranted assumption.
Note that these correction factors should be based on the entire 30-year period shared between `obs` and `cur`; if we did it year-by-year, not only would the parameter estimates be less certain, but we would also be forcing the model output to match the interannual variation of the obs data, which would be unphysical and wrong.
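To see why these particular correction factors do the job, note that with an identity transfer function, z-score normalization followed by shifted-and-scaled denormalization collapses to a simple mean-and-variance correction. A quick illustrative check (the variable names here are hypothetical, not climod internals):

```r
mu.c <- 20; sig.c <- 5   # hypothetical current-model mean and sd
mu.o <- 22; sig.o <- 4   # hypothetical observed mean and sd

x <- rnorm(1000, mu.c, sig.c)
z <- (x - mu.c) / sig.c                 # normalize (z-score)

dm <- mu.o - mu.c                       # difference in means
ds <- sig.o / sig.c                     # ratio of standard deviations
xbc <- (mu.c + dm) + z * (sig.c * ds)   # denormalize with corrections

## the corrected data has the observed moments: xbc == mu.o + z * sig.o
all.equal(xbc, mu.o + z * sig.o)        # TRUE
```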
```r
mu <- lapply(fixed, function(x){mean(sapply(x, function(y){y@mean}))})
sig <- lapply(fixed, function(x){mean(sapply(x, function(y){y@sd}))})

dm <- mu$obs - mu$cur
ds <- sig$obs / sig$cur

result <- list()
result$obs <- bcdata$obs
result$cur <- rapply(fixed$cur, denormalize, how="replace", shift=dm, scale=ds)
result$fut <- rapply(fixed$fut, denormalize, how="replace", shift=dm, scale=ds)

bplot(result)
```
The following figure shows the bias-corrected results plotted against the original raw data. The results are similar to what we would get if we had applied simple shift-and-scale adjustments to the mean and standard deviation (which is equivalent to using a linear transfer function created by modeling the data as purely Gaussian), but with additional nuances, such as the stronger correction in the lower tail.
```r
tocf <- adjustcolor(ocf, 0.3)
names(tocf) <- names(ocf)

rg <- range(unlist(bcdata), unlist(result))

plot(unlist(bcdata$obs), unlist(result$obs), cex=0.5, col=tocf["obs"],
     ylim=rg, xlim=rg, xlab="raw", ylab="bias-corrected")
abline(0, 1)
points(unlist(bcdata$cur), unlist(result$cur), col=tocf["cur"], cex=0.75)
points(unlist(bcdata$fut), unlist(result$fut), col=tocf["fut"], cex=0.75)
legend("bottomright", col=ocf, c("observed","current","future"), pch=1)
```
This section shows how we save the bias-corrected results to a NetCDF file. Generating properly-formatted NetCDF files from R can be a complex and laborious undertaking. You can save a considerable amount of effort by using an extant well-formatted file as a template. In this case, we already have the input files readily available, and they have exactly the same structure as the output file. So instead of constructing an entire netcdf file from scratch, we can just copy the corresponding input file and overwrite its data variable with the bias-corrected data. We then add a few new attributes to document the modifications that we have made, and that's all it takes. The `nc_history` function is a convenience function in this package for appending to the global `history` attribute in a NetCDF file. See the documentation for the `ncdf4` package for more information about the other NetCDF functions used in this section.
```r
library(ncdf4)

outfiles <- gsub(".common.nc", ".common.bc.nc", infiles, fixed=TRUE)

for(i in c("cur","fut")){
  file.copy(infiles[i], outfiles[i], overwrite=TRUE, copy.mode=FALSE)
  fout <- nc_open(outfiles[i], write=TRUE)

  ncvar_put(fout, varname, outdata[[i]])
  ncatt_put(fout, varname, "bias_correction", "KDDM")

  nc_history(fout, paste("bias-corrected using R package 'climod'\n",
                         " see bias_correction attribute for details"))
  bcnote = paste("Bias-corrected using kernel density distribution mapping\n",
                 "calculated over a", mwinwidth,
                 "day moving window across years\n",
                 "observed data from", infiles["obs"], "\n",
                 "current data from", infiles["cur"], "\n",
                 "future data from", infiles["fut"])
  ncatt_put(fout, 0, "bias_correction", bcnote)

  nc_close(fout)
}
```
The following figure shows the bias-correction of the entire dataset, by plotting the corrected data values against their uncorrected counterparts. Recall that the two leftmost of the six gridpoints lie in the mountains to the west of Boulder, while the other four are primarily on the plains. We can see that in the two western gridcells (x=1), the values in the upper tail of the distribution have a positive bias (i.e., the colored points lie below the identity line), while out on the plains, nearly all the values have a negative bias, and lie above the identity line.
```r
par(mfcol=c(2,3), oma=c(0,0,2,0), mar=c(2.1,2.1,3,1))

ttocf <- adjustcolor(ocf, 0.02)
names(ttocf) <- names(ocf)

for(x in 1:nx){
  for(y in 1:ny){
    raw <- lapply(indata, function(a){a[x,y,]})
    bc <- lapply(outdata, function(a){a[x,y,]})
    plot(raw$obs, bc$obs, pch='.', asp=1, xlab="", ylab="",
         main=paste("x",x,"y",y))
    abline(0, 1, col="gray")
    points(raw$cur, bc$cur, pch='.', col=ttocf["cur"])
    points(raw$fut, bc$fut, pch='.', col=ttocf["fut"])
  }
}
mtext("Q-Q plots of raw vs bias-corrected data", outer=TRUE)
```
Finally, the following standalone script applies the same workflow end-to-end to precipitation data. Precipitation uses "power" normalization rather than z-scores, and requires removing excess drizzle and masking zero values (`dedrizzle` and `unzero`) before bias correction, then restoring the zeroes (`rezero`) afterward.

```r
rm(list=ls())

ocf <- c("obs","cur","fut")

varname <- "prec"
norm <- "power"

indir <- "boulder-data"
infiles <- paste0(indir, "/", varname, ".", ocf, ".nc")
names(infiles) <- ocf

nc <- lapply(infiles, nc_ingest)

## get netcdf data in friendly form
time <- lapply(nc, "[[", "time")
indata <- lapply(nc, "[[", varname)

### set up storage data structures
outdata <- lapply(indata, "+", NA)

## width of moving window
mwinwidth = 30

### generate climatology moving window index arrays
### 360 1-ish-day moving windows w/ 30-day outer pool
cwin <- lapply(time, cslice, outer=mwinwidth, num=360)

### Assert all 3 datasets have same spatial dimensions and are
### ordered lon, lat, time...
nxy <- dim(indata$obs)[1:2]
nx <- nxy[1]
ny <- nxy[2]

for(x in 1:nx){
  for(y in 1:ny){

    print(paste0("x:", x, ", y:", y))

    ## extract data for this gridcell
    data <- lapply(indata, function(a){a[x,y,]})

    if(varname == "prec"){
      ## Remove excess drizzle and set zero values to NA
      ## For stability, it's best to dedrizzle all at once, before slicing
      data <- dedrizzle(data)
      data <- lapply(data, unzero)
    }

    ## If all data for one input dataset is NA, result is NA.
    ## No warnings or errors are needed here; all NA is
    ## expected over oceans, outside domain, etc.
    if(any(sapply(data, function(a){all(is.na(a))}))){
      for(i in names(outdata)){
        outdata[[i]][x,y,] <- NA
      }
      next
    }

    ## window data using outer window
    wind <- mapply(slice, data, cwin,
                   MoreArgs=list(outer=TRUE), SIMPLIFY=FALSE)

    ## invert list nesting
    datatobc <- renest(wind)

    ## bias-correct each window
    fixdata <- lapply(datatobc, biascorrect, norm)

    ## re-invert the list
    bc <- renest(fixdata)

    ## collate inner windows back into timeseries
    result <- mapply(unslice, bc, cwin, SIMPLIFY=FALSE)

    ## rezero precipitation
    if(varname == "prec"){
      result <- rapply(result, rezero, how="replace")
    }

    ## save results in array
    for(i in names(result)){
      outdata[[i]][x,y,] <- result[[i]]
    }
  }
}

###################

outdir <- "tests"
outfiles <- paste0(outdir, "/", varname, ".", ocf, ".bc.nc")
names(outfiles) <- ocf

for(i in c("cur","fut")){
  file.copy(infiles[i], outfiles[i], overwrite=TRUE, copy.mode=FALSE)
  fout <- nc_open(outfiles[i], write=TRUE)

  ncvar_put(fout, varname, outdata[[i]])
  ncatt_put(fout, varname, "bias_correction", "KDDM")

  nc_history(fout, paste("bias-corrected using R package 'climod'\n",
                         " see bias_correction attribute for details"))
  bcnote = paste("Bias-corrected using kernel density distribution mapping\n",
                 "calculated over a", mwinwidth,
                 "day moving window across years\n",
                 "observed data from", infiles["obs"], "\n",
                 "current data from", infiles["cur"], "\n",
                 "future data from", infiles["fut"])
  ncatt_put(fout, 0, "bias_correction", bcnote)

  nc_close(fout)
}
```