homogen: Automatic homogenization of climatological series
In climatol: Climate Tools (Series Homogenization and Derived Products)

homogen

R Documentation

Automatic homogenization of climatological series

Description

Automatic homogenization of climatological series, including missing data filling and detection and correction of outliers and shifts in the mean of the series.

Usage

homogen(varcli, anyi, anyf, test='snht', nref=NULL, std=NA, swa=NA,
ndec=1, niqd=c(4,6,1), dz.max=.01, dz.min=-dz.max, cumc=NA, wd=NULL, inht=25,
sts=5, maxdif=NA, maxite=999, force=FALSE, wz=.001, mindat=NA, onlyQC=FALSE,
annual=c('mean','sum','total'), x=NULL, ini=NA, na.strings="NA", vmin=NA,
vmax=NA, hc.method='ward.D2', nclust=300, cutlev=NA, grdcol=grey(.4),
mapcol=grey(.4), expl=FALSE, metad=FALSE, sufbrk='m', tinc=NA, tz='utc',
rlemin=NA, rlemax=NA, cex=1.1, uni=NA, raway=2, graphics=TRUE, verb=TRUE,
logf=TRUE, snht1=NA, snht2=NA, gp=NA)

Arguments

`varcli`	Short name of the studied climatic variable.
`anyi`	Initial year of the data.
`anyf`	Final year of the data.
`test`	Inhomogeneity test to apply: 'snht' (the default) or 'cuct' (Cucconi test, experimental).
`nref`	Maximum number of references for data estimation [defaults to 10 in the detection stages, and to 4 in the final series adjustments].
`std`	Type of normalization: 1: deviations from the mean, 2: rates to the mean (only for means greater than 1), 3: standardization (subtract the mean and divide by the sample standard deviation).
`swa`	Size of the step forward to be applied to the overlapping window application of the detection test [365 terms (one year) for daily data, and 60 otherwise].
`ndec`	Number of decimal digits to round the homogenized data [1].
`niqd`	Number of interquartilic distances to delete big outliers [4], spikes [6] and too long runs of identical values [1]. [Defaults to `c(4,6,1)`]
`dz.max`	Threshold of outlier tolerance, in standard deviations if greater than one, or as a percentage of data to reject otherwise [0.01].
`dz.min`	Lower threshold of outlier tolerance if different from `dz.max`.
`cumc`	Code of accumulated missing daily precipitation.
`wd`	Distance (in km) at which reference data will weight half of those located at zero distance [`c(0,0,100)`].
`inht`	Thresholds for the change in the mean detection tests [25].
`sts`	Series tail size (no. of terms) not tested for inhomogeneities [5].
`maxdif`	Maximum data difference from previous iteration [`ndec/2`].
`maxite`	Maximum number of iterations to compute means of the series [999].
`force`	Force direct homogenization of daily or sub-daily series [`FALSE`].
`wz`	Scale parameter of the vertical coordinate `Z` [0.001].
`mindat`	Minimum number of data for a split fragment to become a new series [`swa/2` for daily series or 12 terms otherwise].
`onlyQC`	Set to `TRUE` if only initial Quality Controls are requested [`FALSE`]
`annual`	Running annual value to graph in the PDF output. One of 'mean' (the default), 'sum' or 'total' (equivalent to 'sum').
`x`	Time vector. Only needed if data are taken at irregular intervals.
`ini`	Initial date, with format `'YYYY-MM-DD'`, if series does not begin on January first (as recommended).
`na.strings`	Character strings to be treated as missing data [`'NA'`].
`vmin`	Minimum possible value (lower limit) of the studied variable.
`vmax`	Maximum possible value (upper limit) of the studied variable.
`hc.method`	Hierarchical clustering method ['ward.D2'].
`nclust`	Maximum number of series for the cluster analysis [300].
`cutlev`	Level to cut the dendrogram to define clusters [`NA`].
`grdcol`	Color of the graphic background grids [`grey(0.04)`].
`mapcol`	Color of coastlines and borders in the stations map [`grey(0.04)`].
`expl`	Perform an exploratory analysis? [`FALSE`].
`metad`	Use the breakpoints file as metadata? [`FALSE`].
`sufbrk`	Suffix to add to `varcli` to form the name of the provided metadata file [`'m'`].
`tinc`	Time increment between data [`NA`].
`tz`	Time zone [`'utc'`]. Only relevant for subdaily data.
`rlemin`	Data run lengths will exclude values `<= rlemin` in quality control [`NA`].
`rlemax`	Data run lengths will exclude values `>= rlemax` in quality control [`NA`].
`cex`	Character expansion factor for graphic labels and titles [1.1].
`uni`	Units to use in some axis labels [”].
`raway`	Factor to increase internal distance to reanalysis series to make them weight less than the observed series [`2`].
`graphics`	Output graphics to a PDF file [`TRUE`].
`verb`	Verbosity [`TRUE`].
`logf`	Save console messages to a log file? [`TRUE`].
`snht1`, `snht2`	Obsolete (use `inht` instead), but kept for backwards compatibility.
`gp`	Obsolete (use `graphics=FALSE` for `gp=0`, `onlyQC=TRUE` for `gp=1` or `annual="total"` for `gp=4`), but kept for backwards compatibility.

Details

Input data must be provided in two text files, one with the data (with extension dat) and another with the station coordinates (with extension est). Both have as base name, ‘VRB_YEAR-YEAR’, composed by the short name of the climatological variable, and the initial and final years of the data, as set in the first three parameters of the call, varcli, anyi and anyf.

Data are stored in a free blank separated format (any number of data items per line is allowed), in chronological order, station by station (all data from station 1 go first, then all data from station 2, and so on). As dates are not stored in this file, all data must be present in the file, using a code for any missing data in the records (NA by default, but any other code can be used, provided that they are specified in the parameter na.strings).

The stations file, with extension est, is also a blank separated text file where each line identifies a single station, with structure 'X Y Z CODE NAME'. Coordinates X and Y are expected in geographical degrees (longitude and latitude, in this order and in decimal form). Otherwise they will be assumed to be in km, or in m if the mean of either X and Y is greater than 10000; elevation Z must be supplied in m; and the identification CODE and the full NAME of the station must be quoted if they contains blanks). Fully reliable series may be marked by putting an asterisk (*) at the beginning of their CODE to skip their outlier and break-point analysis. This is not recommended with observed series, but can be useful when using reanalysis series as references in data sparse regions.

This function will stop with an error condition if any time step becomes void of data in all stations at the same time. One or more series with data in the void time steps must be added to successfully run homogen again. If no other series are available in the area, reanalysis series of the closer grid-points can be used, adding their coordinates to the *.est file and prepending an asterisk (*) to the codes assigned to the series as mentioned above.

dz.max (and dz.min if different from dz.max) can be a vector of two values, one for suspect data and the other for probable errors. Only the latter will be deleted, but all will be listed in the ‘*_out.csv’ output file. By default, the more extreme 0.01% in each tail of the distribution will be considered errors, and values exceeding 0.1% will be suspect data. Inspection of the anomalies histogram near the end of the PDF output file will help in tuning these parameters by setting number of standard deviations to be used as rejection thresholds.

inht has a default value of 25, which is a suitable conservative value for monthly values of temperature, but not so much for precipitation or for daily series. Therefore it is advisable to adjust it empirically with the help of the histograms available by the end of the graphic output. Anyway, inhomogeneities in daily or subdaily series should be detected on their monthly aggregates, which can be easily obtained by means of the function dd2m. Two values can be given to this parameter (one for each of the two detection stages), as in e.g. inht=c(30,25). When only one value is provided, it will be used for both stages. If any or both values are zeros, the corresponding homogenization stage will be skipped.

The default value wz=0.001 gives to the vertical coordinate (in m) the same weight as the horizontal coordinates (internally managed in km). Other values can be set to overweight elevation differences (wz>0.001) or to calculate only horizontal distances (wz=0).

vmin and vmax are unset by default, but if the variable is found to have a skewed probability distribution with a minimum value of zero, vmin will be set to zero. The same will happen if the user sets std=2. In these cases, fragments with zero mean will be deleted because they will be considered as periods of absence of observation confused with zero precipitation.

Parameter raway was introduced after realizing that observational series are better references than series derived from reanalysis products. When greater than 1, distances between series involving one lableled as trusted will be added the median value of distances between observational series and then multiplied by this parameter. (Note that if you have labeled any observed series as trusted, this parameter should be set to 1 or any lower value to disable this feature.)

Sometimes daily precipitation series contain a code to indicate days where observations could not be made and precipitation accumulated in the rain gauge. In these cases you can assign that code to the cumc parameter, and then homogen will distribute the accumulated precipitation proportionally to the estimated data before continuing the process.

sufbrk is only relevant when metad=TRUE. Its default value 'm' is meant to read the file of break-points detected at the monthly scale, but if the data were originally monthly, sufbrk='' should be set.

tinc, unset by default, can be defined for subdaily data, as in e.g.: tinc='3 hours', especially if first and/or last years are incomplete. Units can be 'hours', 'mins' or 'secs'.

The default cex=1.1 increase by a 10% the size of labels in the graphic output. Note that if station names are long, they will not fit in titles when increasing this parameter too much.

The graphic output file (in PDF format) begins with a first quality control of the series, providing box-plots for every series showing (1) the range of their values, (2) their second differences and (3) the length of segments with constant data. Too big outliers are deleted at this stage because they would compromise the quality of the homogenization and missing data filling. During the rest of the process outlier detection and deletion is based on spatial differences between neighboring normalized data. (Deleted data which were not errors but due to local phenomena can be restored to the homogenized series with the help of the datrestore function.)

The following pages offer: (a) a summary of the data availability and frequency distribution; (b) a correlogram of the first differences of the series, (c) a dendrogram based on these correlations and a map with the station locations (marked with numbers if less than 100, and with symbols otherwise; (d) graphics of normalized spatial anomalies showing the detected breaks, the minimum distance to a reference data and the number of references used; (e) a histogram of maximum inht values found in overlapping window analysis; (d) and (e) are repeated for the analysis on the whole series; (f) histograms of number of splits per station and per year; (g) graphics of final anomalies of the series; (h) graphics of the reconstructed series and applied corrections; (i) a histogram of the normalized anomalies of all data (useful to set rejection thresholds for the outliers); (j) final histograms of inht values; and (k) a plot of quality/singularity of the stations (a bad score may be due to a bad quality of the series, but also to a singular siting with a peculiar micro-climate).

Note that every time that a significant shift in the mean of the series is detected, it will be split into two (potentially) homogeneous sub-periods, and hence the final number of homogenized series will be increased, as complete homogeneous series will be reconstructed from all of them. When several homogeneous series have been yielded for the same location, the user can choose to use that reconstructed from the last sub-period (the usual behavior of other homogenization packages), which is perfect for climate monitoring of newly incoming data. However, statistics derived from all of them can be useful for climatic mapping, when no a priori knowledge can indicate which of the sub-periods will be more representative at the spatial scale of the map).

The processing time can range from seconds (a few monthly series) to many hours (hundreds of daily series) or days (hundreds of subdaily series). If you must process a huge amount of data, you should consider splitting your study region into smaller areas and homogenize their series independently.

Value

This function does not return any value, its results being saved to files with the same base name as the input files, and extensions:

*.txt:

A text file that logs all the processing output,

*_out.csv:

List of corrected outliers,

*_brk.csv:

List of corrected breaks,

*.pdf:

PDF file with a collection of diagnostic graphics,

*.rda:

Homogenization results in R binary format, used by the dahstat and other post-processing functions, but can be loaded by the user for further data manipulation with the function load. This file contains the following objects:

dat

matrix of the original series,

dah

matrix of the homogenized series,

nd

number of data (time steps) in every series,

ndec

number of decimals in the data,

uni

data units,

est.c

data frame with columns:

X: longitude,
Y: latitude,
Z: elevation,
Code: station code,
Name: station name,
pod: percentage of original data,
snht: (or cuct when test='cuct'): Remaining inhomogeneity test values in the homogenized series. Can be greater than the set inht threshold because of a lower number of reference stations,
rmse: estimated root mean squared errors of the homogenized series

ct

Cluster Analysis series groups,

nei

number of input series,

ne

number of series after the homogenization,

nm

number of "months" (data items) in a year (0=daily data),

std

type of normalization applied to the data,

x

vector of the time dimension,

ini

initial date of the period under study.

References

Guijarro JA (2014): Quality Control and Homogenization of Climatological Series. In Eslamian S (Ed.), Handbook of Engineering Hydrology, Vol. 1: Fundamentals and Applications. Francis and Taylor, CRC Group, USA, ISBN 9781466552357, 636 pp.

Azorin-Molina C, Guijarro JA, McVicar TR, Trewin BC, Frost AJ, Chen D (2019): An approach to homogenize daily peak wind gusts: An application to the Australian series. Int. J. Climatol., 39:2260-2277. doi: 10.1002/joc.5949

Dumitrescu A, Cheval S, Guijarro JA (2019): Homogenization of a combined hourly air temperature dataset over Romania. Int. J. Climatol., 40:2599-2608, DOI: 10.1002/joc.6353

Visit <https://climatol.eu/> for updates of code and documentation (user's guide, links to videos, etc).

Examples

## Set a temporal working directory and write input files:
wd <- tempdir()
wd0 <- setwd(wd)
data(climatol_data)
Temp.dat[320:340,4] <- NA #remove some wrong data 
write(Temp.dat,'Temp_1961-2005.dat')
write.table(Temp.est,'Temp_1961-2005.est',row.names=FALSE,col.names=FALSE)

## Now run the example:
homogen('Temp',1961,2005)

## Return to user's working directory:
setwd(wd0)

## Input and output files can be found in directory:
print(wd)

climatol documentation built on June 8, 2025, 10:54 a.m.