homogen: Automatic homogenization of climatological series


View source: R/depurdat.R


Description

Automatic homogenization of climatological series, including missing data filling and detection and correction of outliers and shifts in the mean of the series.


Usage

homogen(varcli, anyi, anyf, suf=NA, nm=NA, nref=c(10,10,4), std=3, swa=NA,
ndec=1, dz.max=5, dz.min=-dz.max, wd=c(0,0,100), snht1=25, snht2=snht1,
tol=.02, maxdif=NA, mxdif=maxdif, maxite=999, force=FALSE, wz=.001, trf=0,
mndat=NA, gp=3, ini=NA, na.strings="NA", vmin=NA, vmax=NA, nclust=100,
cutlev=NA, grdcol=grey(.4), mapcol=grey(.4), hires=TRUE, expl=FALSE,
metad=FALSE, sufbrk='m', tinc=NA, tz='UTC', cex=1.2, verb=TRUE)



Arguments

varcli: Acronym of the name of the studied climatic variable, as in the input data files.


anyi: Initial year of the data present in the file.


anyf: Final year of the data present in the file.


suf: Optional suffix appended (with a '-') to the name of the variable in the input file names.


nm: Number of data per year in each station. (Defaults to NA, in which case it will be computed from the total number of data.)


nref: Maximum number of references for data estimation. (Defaults to c(10,10,4): 10 in the two detection stages, and 4 in the final series adjustments.)


std: Type of normalization:
  1: deviations from the mean,
  2: rates to the mean (only for means greater than 1),
  3: standardization (subtract the mean and divide by the sample standard deviation).


swa: Size of the step forward applied in the staggered window application of SNHT. If not set (the default), 365 terms (one year) will be used for daily data, and 60 otherwise.


ndec: Number of decimal digits to which the homogenized data must be rounded.


dz.max: Threshold of outlier tolerance, in standard deviations. (5 by default.)


dz.min: Lower threshold of outlier tolerance, if different from the upper one. (By default they are the same, with opposite signs. If a positive value is given, its sign will be changed.)


wd: Distance (in km) at which reference data will weigh half as much as those of a station located at zero distance. (Defaults to c(0,0,100), meaning that no weighting will be applied in the first two stages, and 100 km in the third.)


snht1: Threshold value for the stepped SNHT window test applied in stage 1. (25 by default. No SNHT analysis will be performed if snht1=0.)


snht2: Threshold value for the SNHT test applied to the complete series in stage 2. (Same value as snht1 by default.)


tol: Tolerance factor for splitting several series at a time. The default is 0.02, meaning that an extra 2% will be allowed for every reference data item. (E.g.: if the maximum SNHT test value in a series is 30 and 10 references were used to compute the anomalies, the series will be split only if the maximum test value of the reference series is lower than 30*(1+0.02*10)=36. Set tol=0 to disable further splits when any reference series has already been split in the same iteration.)


maxdif: Maximum difference allowed for any data item between consecutive iterations. If not set, defaults to half of the data precision (defined by the number of decimals).


mxdif: Old name of the maxdif parameter (maintained for compatibility).


maxite: Maximum number of iterations when computing the means of the series. (999 by default.)


force: Force the split of a series even when only one reference is available. (FALSE by default.)


wz: Scale parameter of the vertical coordinate Z. (0.001 by default, which gives the vertical coordinate, in m, the same weight as the horizontal coordinates, which are managed internally in km.)


trf: By default, data are not transformed (trf=0), but if the data frequency distribution is very skewed the user can choose to apply a log(x+1) transformation (trf=1) or any root of index trf>1 (2 for square root, 3 for cubic root, etc.; fractional values are allowed).


mndat: Minimum number of data for a split fragment to become a new series. Defaults to half of the swa value for daily data, or to nm otherwise, with an absolute minimum of 5 terms.


gp: Graphic parameter:
  0: no graphic output,
  1: only descriptive graphics of the input data,
  2: as with 1, plus diagnostic graphics of anomalies,
  3: as with 2, plus graphics of running annual means and applied corrections,
  4: as with 3, but running annual totals (instead of means) will be plotted in the last set of graphics. (Better when working with precipitation data.)


ini: Initial date, in 'YYYY-MM-DD' format. If not set, the series will be assumed to begin on the first of January of the initial year anyi.


na.strings: Character string to be treated as a missing value. (It can be a vector of strings if more than one is needed. Defaults to 'NA', the standard R missing data code.)


vmin: Minimum possible value (lower limit) of the studied variable. Unset by default, but note that vmin=0 will be applied if std is set to 2.


vmax: Maximum possible value (upper limit) of the studied variable. (E.g., for relative humidity or relative sunshine hours it is advisable to set vmax=100.)


nclust: Maximum number of stations for the cluster analysis. (If much greater than 100, the default value, the process may take too long and the graphic may become too dense.)


cutlev: Level at which to cut the dendrogram to define clusters. (Automatic by default.)


grdcol: Color of the graphic background grids. (Gray by default.)


mapcol: Color of the background map. (Gray by default.)


hires: By default, the background map is drawn in high resolution. Set this parameter to FALSE when studying a big geographical area (>1000 km).


expl: Set this to TRUE to perform an exploratory analysis.


metad: Set this to TRUE if a metadata file is provided (see the details).


sufbrk: Suffix added to varcli to form the name of the provided metadata file. This parameter is only relevant when metad=TRUE. Its default value 'm' is meant to read the file of break-points detected at the monthly scale. (Do not set a sufbrk longer than 3 characters, to avoid interferences when homogen is called from the homogsplit function.)


tinc: Time increment between data. Not set by default, but it can be defined for subdaily data, as in e.g. tinc='3 hour'.


tz: Time zone. Only relevant for subdaily data. ('UTC' by default.)


cex: Character expansion factor for graphic labels and titles. (Defaults to 1.2. Note that if station names are long, they will not fit in the titles when this parameter is increased too much.)


verb: Verbosity. Set to FALSE to prevent messages from being printed to the console. (They will appear in the output log file anyway.)


Details

Input data must be provided in two text files, one with the data (with extension dat) and another with the station coordinates (with extension est). Both have the base name 'VAR_YEAR-YEAR', composed of the acronym of the climatological variable and the initial and final years of the data, as set in the first three parameters of the call (varcli, anyi and anyf).

Data are stored in a free blank-separated format (any number of data items per line is allowed), in chronological order, station by station (all data from station 1 go first, then all data from station 2, and so on). As dates are not stored in this file, all data must be present in it, using a code for any missing data in the records (NA by default, but any other code can be used, provided that it is specified in the parameter na.strings).
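As a minimal sketch of this format (the 'pcp' base name, station count and values below are invented for illustration, not taken from the package), such a data file can be written from a numeric matrix with base R:

```r
# Hypothetical input: 240 monthly values (20 years, 1991-2010) for 5 stations.
# Each column is one station, so dumping the matrix column by column yields
# the required chronological, station-by-station order.
set.seed(1)
dat <- matrix(round(abs(rnorm(240*5, mean=60, sd=25)), 1), nrow=240, ncol=5)
# write() outputs the values in free blank-separated format (5 per line here);
# missing values would appear as the literal string "NA".
write(dat, file='pcp_1991-2010.dat')
```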

The stations file, with extension est, is also a blank-separated text file where each line identifies a single station, with structure 'X Y Z CODE NAME'. Coordinates X and Y may be in geographical degrees (longitude and latitude, in decimal form); otherwise they will be assumed to be in km, or in m if the mean of either X or Y is greater than 10000. Elevation Z must be supplied in m, and the identification CODE and the full NAME of the station must be quoted if they contain blanks. Fully reliable series may be marked by putting an asterisk (*) at the beginning of their CODE to skip their outlier and break analysis. This is not recommended with observed series, but can be useful when using reanalysis series as references in data-sparse regions.
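The matching stations file can be sketched the same way (coordinates, codes and names below are made up for illustration):

```r
# One 'X Y Z CODE NAME' line per station: longitude and latitude in decimal
# degrees, elevation in m, then the station code and name.
est <- data.frame(X=c(2.10, 2.25, 2.18, 2.33, 2.05),
                  Y=c(41.50, 41.62, 41.41, 41.73, 41.36),
                  Z=c(183, 125, 100, 129, 79),
                  Code=c('S031','S047','S098','S051','S081'),
                  Name=c('Station_031','Station_047','Station_098',
                         'Station_051','Station_081'))
# write.table() quotes the character columns by default, as required when
# codes or names contain blanks:
write.table(est, file='pcp_1991-2010.est', row.names=FALSE, col.names=FALSE)
```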

The transformation of the input data may be very useful to normalize highly skewed (L-shaped) distributed variables (as is often the case with precipitation, wind speed, ...), but use it with caution. Alternatively, you can use the rate normalization (std=2) on the raw data if the variable has a natural zero lower limit. (This alternative has yielded better results than transformation in some applications, provided that no homogeneous sub-period has means much lower than 1. If this is the case, a workaround may be to multiply all data values by a constant prior to their homogenization).

The default values of dz.max, snht1 and snht2 can be appropriate for monthly values of temperature, but not so much for precipitation or for daily series. Therefore it is advisable to adjust them empirically with the help of the histograms in the graphic output of a first exploratory application, using expl=TRUE in the call to the function.
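Such an exploratory pass only needs expl=TRUE added to the call; e.g., reusing the 'pcp' input files of the Examples section (the climatol package must be installed, and the input files must exist in the working directory):

```r
library(climatol)
# Exploratory run: the series are analysed but the results are not meant to
# be final; inspect the histograms in the resulting pcp_1991-2010.pdf to
# choose suitable dz.max, snht1 and snht2 values for the definitive run.
homogen('pcp', 1991, 2010, std=2, expl=TRUE)
```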

This graphic output includes: a) a summary of the data availability and frequency distribution; b) a correlogram of the first differences of the series; c) a dendrogram based on these correlations and a map with the station locations (marked with numbers if fewer than 100, and with symbols otherwise); d) graphics of anomalies showing the detected breaks, the minimum distance to a reference datum and the number of references used; e) a histogram of the maximum SNHT values found in the overlapping window analysis; d) and e) are repeated for the analysis on the whole series; f) histograms of the number of splits per station and per year; g) graphics of final anomalies of the series; h) graphics of the reconstructed series and applied corrections; i) a histogram of the normalized anomalies of all data (which may help to set rejection thresholds for the outliers), plus final histograms of SNHT values; and j) a plot of quality/singularity of the stations (a bad score may be due to a bad quality of the series, but also to a singular siting with a peculiar micro-climate).

Note that every time a significant shift in the mean of the series is detected, it will be split into two (potentially) homogeneous sub-periods, and hence the final number of homogenized series will be increased, as complete homogeneous series will be reconstructed from them. When several homogeneous series have been yielded for the same location, the user can choose to use the one reconstructed from the last sub-period (for climate monitoring), from the period with the highest percentage of original data (the most robust reconstruction), from the one with the lowest final SNHT (the most homogeneous sub-period, although short sub-periods tend to appear more homogeneous and this can be misleading), or all of them (e.g., for climatic mapping, when no a priori knowledge can indicate which of the sub-periods will be more representative at the studied spatial scale). Additional columns pod, ios, ope, snht and rmse in the stations data frame can help the user to choose the most appropriate series for each location when using the post-processing functions dahstat and dahgrid (see below).
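A sketch of such a selection, assuming a completed run with the file names of the Examples section, and assuming (check with ls() after load(), since the object name is not guaranteed here) that the stations data frame stored in the rda file is called est.c:

```r
load('pcp_1991-2010.rda')  # results of a previous homogenization run
# Assumption for this sketch: the stations data frame is named est.c.
# For every original station index (ios), keep the reconstruction with the
# highest percentage of original data (pod), i.e. the most robust choice:
best <- do.call(rbind, lapply(split(est.c, est.c$ios),
                              function(s) s[which.max(s$pod), ]))
best[, c('Code', 'Name', 'pod', 'snht', 'rmse')]
```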

This function will stop with an error condition if any time step becomes void of data in all stations at the same time. Ideally the user would repeat the process after adding one or more series with data in the void time steps. Alternatively, setting fillall=FALSE will let the process continue, but the final homogenized series will contain missing data, and the post-processing function dahstat will give errors or results with missing data.


Value

This function does not return any value; its results are saved to files with the same base name as the input files and the following extensions:


txt: A text file that logs all the processing output,


_out.csv: List of corrected outliers,


_brk.csv: List of corrected breaks,


pdf: PDF file with a collection of diagnostic graphics,


rda: Homogenization results in R binary format, used by the dahstat and dahgrid post-processing functions, but which can also be loaded by the user for further analysis with the function load(). This file contains the following objects:


matrix or array of original data,


matrix or array of homogenized data,


data frame with columns: X (X coordinate), Y (Y coordinate), Z (elevation), Code (code of the station), Name (name of the station), pod (percentage of original data), ios (index of the original station), ope (operating at the end of the period? 0=no, 1=yes), snht (remaining SNHT of the homogenized series) and rmse (estimated root mean squared errors of the homogenized series),


number of time steps in every series,


number of input series,


number of series after the homogenization,


number of "months" in a year (0=daily data),


vector of the time dimension,


number of decimals in the data,


type of standardization used (as explained in the details),


initial date of the period under study.
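These objects can be listed and inspected without knowing their names beforehand by loading the results into a separate environment (file name taken from the Examples section):

```r
res <- new.env()
load('pcp_1991-2010.rda', envir=res)  # load results into a clean environment
ls(res)                               # names of all stored objects
# ... and the class of each one:
sapply(ls(res), function(n) class(get(n, envir=res)))
```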


Author(s)

Jose A. Guijarro


References

Guijarro JA (2014): Quality Control and Homogenization of Climatological Series. In Eslamian S (Ed.), Handbook of Engineering Hydrology, Vol. 1: Fundamentals and Applications. Francis and Taylor, CRC Group, USA, ISBN 9781466552357, 636 pp.

Guijarro JA (2015): Homogenization of Spanish mean wind speed monthly series. In 8th Seminar for Homogenization and Quality Control in Climatological Databases and Third Conference on Spatial Interpolation Techniques in Climatology and Meteorology, Budapest, 12-16 May 2014 (Lakatos et al., Eds.), WMO Climate Data and Monitoring WCDMP-No. 84, pp. 98-106.

Azorin-Molina C, Guijarro JA, McVicar TR, Vicente-Serrano SM, Chen D, Jerez S, Espírito-Santo F (2016): Trends of daily peak wind gusts in Spain and Portugal, 1961-2014. J. Geophys. Res. Atmos., 121, doi:10.1002/2015JD024485, 20 pp.

See Also

dahstat, dahgrid, outrename, dd2m


Examples

#Set a temporary working directory and write input files:
wd <- tempdir()
wd0 <- setwd(wd)
dim(dat) <- c(720,20)
dat[601:720,5] <- dat[601:720,5]*1.8
#Now run the example:
homogen('pcp',1991,2010,std=2) #homogenization
#Return to user's working directory:
setwd(wd0)
#Input and output files can be found in directory:
print(wd)

Example output

Loading required package: maps
Loading required package: mapdata

HOMOGEN() APPLICATION OUTPUT  (From R's contributed package 'climatol' 3.1.1)

=========== Homogenization of pcp, 1991-2010. (Tue Aug  6 19:45:11 2019)

Parameters: varcli=pcp anyi=1991 anyf=2010 suf=NA nm=NA nref=10,10,4 std=2 swa=NA ndec=1 dz.max=5 dz.min=-5 wd=0,0,100 snht1=25 snht2=25 tol=0.02 maxdif=0.05 mxdif=0.05 maxite=999 force=FALSE wz=0.001 trf=0 mndat=NA gp=3 ini=NA na.strings=NA vmin=NA vmax=NA nclust=100 cutlev=NA grdcol=#666666 mapcol=#666666 hires=TRUE expl=FALSE metad=FALSE sufbrk=m tinc=NA tz=UTC cex=1.2 verb=TRUE

Read 1200 items
Data matrix: 240 data x 5 stations

Stations in the 2 clusters:

    Z Code        Name
1 183 S031 Station_031
4 129 S051 Station_051

    Z Code        Name
2 125 S047 Station_047
3 100 S098 Station_098
5  79 S081 Station_081

Computing inter-station distances:  1  2  3  4

========== STAGE 1 (SNHT on overlapping temporal windows) ===========

Computation of missing data with outlier removal
(Suggested data replacements are provisional)
  Station(rank) Date: Observed -> Suggested (Anomaly, in std. devs.)
S098(3) 1991-10-01: 298.4 -> 92.9 (5.56)
S081(5) 2002-10-01: 724.68 -> 390.7 (6.02)

Performing shift analysis on the 5 series...

========== STAGE 2 (SNHT on the whole series) =======================

Computation of missing data with outlier removal
(Suggested data replacements are provisional)
  Station(rank) Date: Observed -> Suggested (Anomaly, in std. devs.)
S098(3) 2008-09-01: 111.3 -> 290.8 (-5.09)

Performing shift analysis on the 5 series...

S081(5) breaks at 2000-12-01 (26.3)

Update number of series:  5 + 1 = 6 

Computation of missing data with outlier removal
(Suggested data replacements are provisional)
  Station(rank) Date: Observed -> Suggested (Anomaly, in std. devs.)
S081-2(6) 1993-10-01: 309 -> 111.2 (5.58)

Performing shift analysis on the 6 series...

========== STAGE 3 (Final computation of all missing data) ==========

Computing inter-station weights... (done)

Computation of missing data with outlier removal
(Suggested data replacements are provisional)

The following lines will have one of these formats:
  Station(rank) Date: Observed -> Suggested (Anomaly, in std. devs.)
  Iteration Max.data.difference (Station_code)
2 -5.569 (S081-2)
3 -3.255 (S081-2)
4 -2.049 (S081-2)
5 -1.308 (S081-2)
6 -0.843 (S081-2)
7 -0.547 (S081-2)
8 -0.356 (S081-2)
9 -0.232 (S081-2)
10 -0.152 (S081-2)
11 -0.099 (S081-2)
12 -0.065 (S081-2)
13 -0.043 (S081-2)

Last series readjustment (please, be patient...)

======== End of the homogenization process, after 2.84 secs 

----------- Final computations:

ACmx: Station maximum absolute autocorrelations of anomalies
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2000  0.2275  0.2900  0.2733  0.3000  0.3500 

SNHT: Standard normal homogeneity test (on anomaly series)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.200   2.300   2.400   2.583   2.575   4.600 

RMSE: Root mean squared error of the estimated data
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  30.49   31.56   32.73   36.98   38.67   54.11 

POD: Percentage of original data
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  49.00   57.50   89.50   78.67   96.75   98.00 

  ACmx SNHT RMSE POD Code   Name         
1 0.30 2.6  40.5 98  S031   Station_031  
2 0.20 2.3  31.3 96  S047   Station_047  
3 0.21 2.3  32.2 97  S098   Station_098  
4 0.28 2.5  33.3 83  S051   Station_051  
5 0.35 1.2  54.1 49  S081   Station_081  
6 0.30 4.6  30.5 49  S081-2 Station_081-2

----------- Generated output files: -------------------------

pcp_1991-2010.txt :  This text output 
pcp_1991-2010_out.csv :  List of corrected outliers 
pcp_1991-2010_brk.csv :  List of corrected breaks 
pcp_1991-2010.pdf :  Diagnostic graphics 
pcp_1991-2010.rda :  Homogenization results. Postprocess with (examples):
   dahstat('pcp',1991,2010) #get averages in file pcp_1991-2010-me.csv 
   dahstat('pcp',1991,2010,stat='tnd') #get OLS trends and their p-values 
   dahgrid('pcp',1991,2010,grid=YOURGRID) #get homogenized grids 
   ... (See other options in the package documentation)

[1] "/work/tmp/tmp/RtmpZQ5Fbd"

climatol documentation built on Aug. 6, 2019, 1:02 a.m.
