Estimate a model

knitr::opts_chunk$set(
  fig.width = 10, # Set default plot width (adjust as needed)
  fig.height = 8, # Set default plot height (adjust as needed)
  fig.align = "center" # Center align all plots
)

# knitr::opts_chunk$set(eval = FALSE)

Model and estimator

The gmwmx2 R package allows to estimate linear model with correlated residuals in presence of missing data.

More precisely, we assume the following model: \begin{equation} \boldsymbol{Y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{F}\left{\boldsymbol{\Sigma}\left(\boldsymbol{\gamma}\right)\right}, \end{equation}

where $\boldsymbol{X} \in \mathbb{R}^{n \times p}$ is a design matrix of observed predictors, $\boldsymbol{\beta} \in \mathbb{R}^p$ is the regression parameter vector and $\boldsymbol{\varepsilon} = {\varepsilon_{i}}_{i=1,\ldots,n}$ is a zero-mean process following an unspecified joint distribution $\mathcal{F}$ with positive-definite covariance function $\boldsymbol{\Sigma}(\boldsymbol{\gamma}) \in \mathbb{R}^{n\times n}$ characterizing the second-order dependence structure of the process and parameterized by the vector $\boldsymbol{\gamma} \in \boldsymbol{\Gamma} \subset \mathbb{R}^q$.

We then define the a random variable $\boldsymbol{Z} ={Z_{i}}_{i=1,\ldots,n}$ which describes the missing observation mechanism. More specifically, the vector $\boldsymbol{Z} \in {0, 1}^n$ is a binary-valued stationary process independent of $\boldsymbol{Y}$ with expectation $\mu(\boldsymbol{\vartheta}) = \mathbb{E}[Z_i] \in [0, \, 1)$, $\forall \, i$, and covariance matrix $\boldsymbol{\Lambda}(\boldsymbol{\vartheta}) = \mathbb{V}[\boldsymbol{Z}] \in \mathbb{R}^{n\times n}$ whose structure is assumed known up to the parameter vector $\boldsymbol{\vartheta} \in \boldsymbol{\Upsilon} \subset \mathbb{R}^k$

We then assume that we only observe the stochastic process $\tilde{\boldsymbol{Y}} = \boldsymbol{Z} \odot \boldsymbol{Y}$, where $\odot$ denotes the Hadamard product. Hence, $\tilde{\boldsymbol{Y}} \in \mathbb{R}^n = \boldsymbol{Y} \odot \boldsymbol{Z}$ represents the observed process vector with null elements in the positions where observations are missing.

Using $\otimes$ to denote the Kronecker product, we then define $\tilde{\boldsymbol{X}} = \boldsymbol{Z} \otimes \boldsymbol{1}^T \odot \boldsymbol{X} \in \mathbb{R}^{n \times p}$ as the design matrix $\boldsymbol{X}$ with zero-valued vectors for the rows where observations are missing in $\boldsymbol{Y}$ (where $\boldsymbol{1}$ represents a vector of ones of dimension $p$).

We estimate the parameters $\boldsymbol{\beta}$ with the least square estimator:

\begin{equation} \hat{\boldsymbol{\beta}} = \left(\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}}\right)^{-1} \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{Y}}. \end{equation}

We compute the estimated residuals as $\hat{\boldsymbol{\varepsilon}} = {\tilde{\boldsymbol{Y}}} - \tilde{{\boldsymbol{X}}} \hat{\boldsymbol{\beta}}$.

We then estimate with the Maximum Likelihood Estimator the parameters $\boldsymbol{\vartheta}$ of the missingness process $\boldsymbol{Z}$ assuming that $\boldsymbol{Z}$ is generated from a Markov model with the following transition probabilities:

\begin{equation} \label{eq:markov_model_def} \begin{aligned} & P\left( Z_2=1 \mid Z_1=1\right)=1 - p_1 \ & P\left(Z_2=1 \mid Z_1=0\right) = p_2 \ & P\left(Z_2=0 \mid Z_1=1\right)=p_1 \ & P\left(Z_2=0 \mid Z_1=0\right)=1-p_2. \end{aligned} \end{equation}

We then estimate the parameters $\boldsymbol{\gamma}$ using a Generalized method of Wavelet Moments approach [@guerrier2013wavelet] and using the fact that the variance-covariance matrix of $\hat{\boldsymbol{\varepsilon}}$ is given by:

$$\boldsymbol{\Sigma}_{\hat{\boldsymbol{\varepsilon}}}(\boldsymbol{\gamma}) = [\boldsymbol{\Lambda}(\hat{\boldsymbol{\vartheta}}) + \mu(\hat{\boldsymbol{\vartheta}})^2 \mathbf{1} \mathbf{1}^T ] \odot ( \boldsymbol{I} - \boldsymbol{P})\boldsymbol{\Sigma}(\boldsymbol{\gamma}) (\boldsymbol{I} - \boldsymbol{P})$$

where $\boldsymbol{P} = \boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T$ and $\boldsymbol{I}$ is the identity matrix of dimension $n\times n$.

More precisely, we rely on the result of @xu2017study that provide a computationally efficient form of the theoretical Allan variance (equivalent to the Haar wavelet variance up to a constant) for zero-mean stochastic processes such as $\boldsymbol{\varepsilon}$ to avoid computing these large matrices multiplication in the objective function. Indeed in @xu2017study they generalize the results in @zhang2008allan to zero-mean non-stationary processes by using averages of the diagonals and super-diagonals of the covariance matrix of $\boldsymbol{\varepsilon}$. What this implies is that the GMWM, which uses this form, does not require the storage of the $n \times n$ covariance matrix of $\boldsymbol{\varepsilon}$, but only of a vector of dimension $n$ which is then plugged into an explicit formula consisting in a linear combination of the elements of this vector (these elements being averages of the diagonal and super-diagonals of the covariance matrix).

Estimating tectonic velocities and crustal uplift

While the GMWMX as described above and in more details in @voirol2024inference, is a general method for estimating large linear model with complex dependence structure in presence of missing data, the gmwmx2 R package is currently developed specifically to estimate tectonic velocities from position times series in graticule distance coordinates system (GD) provided by the Nevada geodetic Laboratory [@blewitt2024improved; @blewitt2018harnessing].

Trajectory model

To estimate the trajectory model (see e.g., @bevis2014 for more details), we construct the design matrix $\boldsymbol{X}$ such that $i$-th component of the vector $\mathbf{X} {{\boldsymbol{\beta}}}$ can be described as follows with $t_i$ representing the $i^{th}$ ordered time point (epoch) indicated in Modified Julian Date and $t_0$ representing the reference epoch located exactly in the middle of start and end point of the time series:

\begin{split} \mathbb{E}[\mathbf{Y}i] &= \mathbf{X}_i^T {{\boldsymbol{\beta}}} \ &= a + b \left(t{i} - t_0\right) + \sum_{h=1}^{m}\left[c_{h} \sin \left(2 \pi f_{h} t_{i}\right) + d_h \cos \left(2 \pi f_h t_i\right)\right] + \& \sum_{j=1}^{r}e_j H\left(t_i - t_j\right) + \sum_{k = 1 }^{s} l_k \left[1- \exp\left(\frac{-(t_i-t_k)}{\tau_k}\right)\right]H\left(t_{i}-\tau_k\right) \end{split}

where $a$ is the initial position at the reference epoch $t_0$, $b$ is the velocity parameter, $c_h$ and $d_h$ are the periodic motion parameters ($h = 1$ and $h = 2$ represent the annual and semi-annual seasonal terms, respectively with $f_1 = \frac{1}{365.25}$ and $f_2 = \frac{2}{365.25}$). The offset terms models earthquakes, equipment changes or human intervention in which $e_j$ is the magnitude of the step at epochs $t_j$, $r$ is the total number of offsets, $H$ is the Heaviside step function defined as $H(x):= \begin{cases}1, & x \geq 0 \ 0, & x<0\end{cases}$ and the last term allow to model post-seismic deformation (see e.g., @sobrero2020logarithmic) where $s$ is the number of post seismic relaxation time specified, $t_k$ is the time when the relaxation $k$ starts in Modified Julian Date (MJD), $\tau_k$ is the relaxation period in days for the post-seismic relaxation $k$ and $l_k$ is the amplitude of the transient. Note that by default the estimates of the functional parameters are provided in unit/day.

When loading data from a specific station using the function gmwmx2::download_station_ngl(), we extract from the Nevada Geodetic Laboratory the position time series in GD coordinates, the time steps associated with a equipment or software change and the time steps associated with an earthquake near the station. All these objects are stored in a object of class gnss_ts_ngl.

When applying the function gmwmx2::gmwmx2() to an object of class gnss_ts_ngl, we construct the design matrix $\boldsymbol{X}$ by considering an offset term for all equipment or software changes steps and all earthquakes indicated by the NGL. We also specify a post-seismic relaxation term for all earthquakes indicated by the NGL. If no relaxation time is specified in the argument vec_earthquakes_relaxation_time, we consider a default relaxation time of $365.25$ days.

Stochastic model

It is generally recognized that noise in GNSS time series is best described by a combination of colored noise plus white noise [@he2017review; @langbein2008noise; @williams2004error; @bos2013fast] where the white noise generally model noise at high frequencies and the colored noise model the lower frequencies of the spectrum. In a large study on the noise properties of GNSS signals, \cite{he2019investigation} concluded that the optimal noise models for 80–90\% of GNSS time series signals are the power law and white noise model or white noise and flicker/pink noise with model. The \texttt{gmwmx} \texttt{R} package currently support both stochastic model specification.

More precisely, the power spectrum of a power-law noise has the following form: $$ P(f)=P_0\left(f / f_s\right)^\kappa $$ where $f$ is the frequency, $P_0$ is a constant, $f_s$ the sampling frequency and the exponent $\kappa$ is called the spectral index.

Many stochastic noise can be expressed as such, for example, a spectral index $\kappa=0$ produces a white noise, a spectral index $\kappa=-2$ produces a red noise or random walk and a spectral index $\kappa=-1$ produce a flicker noise, also called pink noise.

@granger1980long and @hosking1981fractional showed that power-law noise with a spectral index between $-1$ and $1$ can be obtained by using fractional differencing of Gaussian noise:

$$ (1-B)^{-\kappa / 2} \mathbf{v} $$

where $B$ is the backward-shift operator $\left(B v_i=v_{i-1}\right)$ and $\mathbf{v}$ a vector with independent and identically distributed (IID) Gaussian noise.

Following from Hosking's definition of the fractional differencing, we obtain

$$ \begin{aligned} (1-B)^{-\kappa / 2} & =\sum_{i=0}^{\infty}\binom{-\kappa / 2}{i}(-B)^i \ & =1-\frac{\kappa}{2} B-\frac{1}{2} \frac{\kappa}{2}\left(1-\frac{\kappa}{2}\right) B^2+\ldots \ & =\sum_{i=0}^{\infty} h_i \end{aligned} $$ with the coefficient $h_i$ that can be computed using the following recurrence relation [@kasdin1992discrete]:

$$ \begin{aligned} h_0 & =1 \ h_i & =\left(i-\frac{\kappa}{2}-1\right) \frac{h_{i-1}}{i} \quad \text { for } i>0 \end{aligned} $$

Assuming an infinite sequence of zero-mean white noise $\mathbf{v}$, with variance $\sigma_{P L}^2$, and a spectral index $kappa > -1$ then the autocovariance $\gamma(\tau)=\operatorname{Cov}\left(X_t, X_{t+\tau}\right)=\mathbb{E}\left[\left(X_t-\mu\right)\left(X_{t+\tau}-\mu\right)\right]$ is [@bos2008fast]:

$$ \begin{aligned} & \gamma(0)=\sigma_{p l}^2 \frac{\Gamma(1-\alpha)}{\left(\Gamma\left(1-\frac{\alpha}{2}\right)\right)^2} \ & \gamma(\tau)=\frac{\frac{\alpha}{2}+\tau-1}{-\frac{\alpha}{\alpha}+\tau} \gamma(\tau-1) \text { for } \tau>0 \end{aligned} $$ When the argument stochastic_model is set to "wn + pl", the stochastic model considered includes both white noise and power-law with the specified above stationary autocovariance structure. The parameters estimated are: $\sigma^2_{W N}$, $\kappa$ (constrained to be greater than $-1$) and $\sigma^2_{P L}$.

When the argument stochastic_model is set to "wn + fl", the stochastic model considered includes both white noise and flicker noise (not stationary power-law noise with a spectral index $\kappa=-1$) where the variance covariance of the flicker noise $\omega$ is obtained as follows (see e.g., [@bos2008fast]):

$$ \operatorname{Cov}(\omega) = \sigma^2_{F L}\mathbf{U}^T \mathbf{U} $$

where $$ \mathbf{U}=\left(\begin{array}{cccc} h_0 & h_1 & \ldots & h_n \ 0 & h_0 & & h_{n-1} \ \vdots & & \ddots & \vdots \ 0 & 0 & \ldots & h_0 \end{array}\right) $$ with the coefficients $h_i$ computed considering a spectral index $\kappa=-1$.

The stochastic parameters estimated are: $\sigma^2_{W N}$, $\sigma^2_{F L}$ and $\kappa$ is fixed to $-1$.

Example of estimation

Let us showcase how to estimate the tectonic velocity in for one specific component (North, East or Vertical) of one station.

Let us first load the gmwmx2 package.

library(gmwmx2)

Download a station from Nevada Geodetic Laboratory

station_data <- download_station_ngl("CHML")

Plot Station

plot(station_data)

Plot Northing component

plot(station_data, component = "N")

Estimate model on station data

fit1 <- gmwmx2(station_data, n_seasonal = 2, component = "N", stochastic_model = "wn + pl")

Extract estimated parameters

summary(fit1)

By default, the estimated parameters are provided in m/day, we can optionally scale the estimated functional parameters so that they are returned in m/year with the argument scale_parameters.

summary(fit1, scale_parameters = TRUE)

Plot estimated model

plot(fit1)
fit2 <- gmwmx2(station_data, n_seasonal = 2, component = "N", stochastic_model = "wn + fl")
summary(fit2)
plot(fit2)

References



Try the gmwmx2 package in your browser

Any scripts or data that you put into this service are public.

gmwmx2 documentation built on Aug. 21, 2025, 5:56 p.m.