knitr::opts_chunk$set( fig.width = 10, # Set default plot width (adjust as needed) fig.height = 8, # Set default plot height (adjust as needed) fig.align = "center" # Center align all plots ) # knitr::opts_chunk$set(eval = FALSE)
The gmwmx2
R
package allows to estimate linear model with correlated residuals in presence of missing data.
More precisely, we assume the following model: \begin{equation} \boldsymbol{Y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{F}\left{\boldsymbol{\Sigma}\left(\boldsymbol{\gamma}\right)\right}, \end{equation}
where $\boldsymbol{X} \in \mathbb{R}^{n \times p}$ is a design matrix of observed predictors, $\boldsymbol{\beta} \in \mathbb{R}^p$ is the regression parameter vector and $\boldsymbol{\varepsilon} = {\varepsilon_{i}}_{i=1,\ldots,n}$ is a zero-mean process following an unspecified joint distribution $\mathcal{F}$ with positive-definite covariance function $\boldsymbol{\Sigma}(\boldsymbol{\gamma}) \in \mathbb{R}^{n\times n}$ characterizing the second-order dependence structure of the process and parameterized by the vector $\boldsymbol{\gamma} \in \boldsymbol{\Gamma} \subset \mathbb{R}^q$.
We then define the a random variable $\boldsymbol{Z} ={Z_{i}}_{i=1,\ldots,n}$ which describes the missing observation mechanism. More specifically, the vector $\boldsymbol{Z} \in {0, 1}^n$ is a binary-valued stationary process independent of $\boldsymbol{Y}$ with expectation $\mu(\boldsymbol{\vartheta}) = \mathbb{E}[Z_i] \in [0, \, 1)$, $\forall \, i$, and covariance matrix $\boldsymbol{\Lambda}(\boldsymbol{\vartheta}) = \mathbb{V}[\boldsymbol{Z}] \in \mathbb{R}^{n\times n}$ whose structure is assumed known up to the parameter vector $\boldsymbol{\vartheta} \in \boldsymbol{\Upsilon} \subset \mathbb{R}^k$
We then assume that we only observe the stochastic process $\tilde{\boldsymbol{Y}} = \boldsymbol{Z} \odot \boldsymbol{Y}$, where $\odot$ denotes the Hadamard product. Hence, $\tilde{\boldsymbol{Y}} \in \mathbb{R}^n = \boldsymbol{Y} \odot \boldsymbol{Z}$ represents the observed process vector with null elements in the positions where observations are missing.
Using $\otimes$ to denote the Kronecker product, we then define $\tilde{\boldsymbol{X}} = \boldsymbol{Z} \otimes \boldsymbol{1}^T \odot \boldsymbol{X} \in \mathbb{R}^{n \times p}$ as the design matrix $\boldsymbol{X}$ with zero-valued vectors for the rows where observations are missing in $\boldsymbol{Y}$ (where $\boldsymbol{1}$ represents a vector of ones of dimension $p$).
We estimate the parameters $\boldsymbol{\beta}$ with the least square estimator:
\begin{equation} \hat{\boldsymbol{\beta}} = \left(\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}}\right)^{-1} \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{Y}}. \end{equation}
We compute the estimated residuals as $\hat{\boldsymbol{\varepsilon}} = {\tilde{\boldsymbol{Y}}} - \tilde{{\boldsymbol{X}}} \hat{\boldsymbol{\beta}}$.
We then estimate with the Maximum Likelihood Estimator the parameters $\boldsymbol{\vartheta}$ of the missingness process $\boldsymbol{Z}$ assuming that $\boldsymbol{Z}$ is generated from a Markov model with the following transition probabilities:
\begin{equation} \label{eq:markov_model_def} \begin{aligned} & P\left( Z_2=1 \mid Z_1=1\right)=1 - p_1 \ & P\left(Z_2=1 \mid Z_1=0\right) = p_2 \ & P\left(Z_2=0 \mid Z_1=1\right)=p_1 \ & P\left(Z_2=0 \mid Z_1=0\right)=1-p_2. \end{aligned} \end{equation}
We then estimate the parameters $\boldsymbol{\gamma}$ using a Generalized method of Wavelet Moments approach [@guerrier2013wavelet] and using the fact that the variance-covariance matrix of $\hat{\boldsymbol{\varepsilon}}$ is given by:
$$\boldsymbol{\Sigma}_{\hat{\boldsymbol{\varepsilon}}}(\boldsymbol{\gamma}) = [\boldsymbol{\Lambda}(\hat{\boldsymbol{\vartheta}}) + \mu(\hat{\boldsymbol{\vartheta}})^2 \mathbf{1} \mathbf{1}^T ] \odot ( \boldsymbol{I} - \boldsymbol{P})\boldsymbol{\Sigma}(\boldsymbol{\gamma}) (\boldsymbol{I} - \boldsymbol{P})$$
where $\boldsymbol{P} = \boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T$ and $\boldsymbol{I}$ is the identity matrix of dimension $n\times n$.
More precisely, we rely on the result of @xu2017study that provide a computationally efficient form of the theoretical Allan variance (equivalent to the Haar wavelet variance up to a constant) for zero-mean stochastic processes such as $\boldsymbol{\varepsilon}$ to avoid computing these large matrices multiplication in the objective function. Indeed in @xu2017study they generalize the results in @zhang2008allan to zero-mean non-stationary processes by using averages of the diagonals and super-diagonals of the covariance matrix of $\boldsymbol{\varepsilon}$. What this implies is that the GMWM, which uses this form, does not require the storage of the $n \times n$ covariance matrix of $\boldsymbol{\varepsilon}$, but only of a vector of dimension $n$ which is then plugged into an explicit formula consisting in a linear combination of the elements of this vector (these elements being averages of the diagonal and super-diagonals of the covariance matrix).
While the GMWMX as described above and in more details in @voirol2024inference, is a general method for estimating large linear model with complex dependence structure in presence of missing data, the gmwmx2
R
package is currently developed specifically to estimate tectonic velocities from position times series in graticule distance coordinates system (GD) provided by the Nevada geodetic Laboratory [@blewitt2024improved; @blewitt2018harnessing].
To estimate the trajectory model (see e.g., @bevis2014 for more details), we construct the design matrix $\boldsymbol{X}$ such that $i$-th component of the vector $\mathbf{X} {{\boldsymbol{\beta}}}$ can be described as follows with $t_i$ representing the $i^{th}$ ordered time point (epoch) indicated in Modified Julian Date and $t_0$ representing the reference epoch located exactly in the middle of start and end point of the time series:
\begin{split} \mathbb{E}[\mathbf{Y}i] &= \mathbf{X}_i^T {{\boldsymbol{\beta}}} \ &= a + b \left(t{i} - t_0\right) + \sum_{h=1}^{m}\left[c_{h} \sin \left(2 \pi f_{h} t_{i}\right) + d_h \cos \left(2 \pi f_h t_i\right)\right] + \& \sum_{j=1}^{r}e_j H\left(t_i - t_j\right) + \sum_{k = 1 }^{s} l_k \left[1- \exp\left(\frac{-(t_i-t_k)}{\tau_k}\right)\right]H\left(t_{i}-\tau_k\right) \end{split}
where $a$ is the initial position at the reference epoch $t_0$, $b$ is the velocity parameter, $c_h$ and $d_h$ are the periodic motion parameters ($h = 1$ and $h = 2$ represent the annual and semi-annual seasonal terms, respectively with $f_1 = \frac{1}{365.25}$ and $f_2 = \frac{2}{365.25}$). The offset terms models earthquakes, equipment changes or human intervention in which $e_j$ is the magnitude of the step at epochs $t_j$, $r$ is the total number of offsets, $H$ is the Heaviside step function defined as $H(x):= \begin{cases}1, & x \geq 0 \ 0, & x<0\end{cases}$ and the last term allow to model post-seismic deformation (see e.g., @sobrero2020logarithmic) where $s$ is the number of post seismic relaxation time specified, $t_k$ is the time when the relaxation $k$ starts in Modified Julian Date (MJD), $\tau_k$ is the relaxation period in days for the post-seismic relaxation $k$ and $l_k$ is the amplitude of the transient. Note that by default the estimates of the functional parameters are provided in unit/day.
When loading data from a specific station using the function gmwmx2::download_station_ngl()
, we extract from the Nevada Geodetic Laboratory the position time series in GD coordinates, the time steps associated with a equipment or software change and the time steps associated with an earthquake near the station. All these objects are stored in a object of class gnss_ts_ngl
.
When applying the function gmwmx2::gmwmx2()
to an object of class gnss_ts_ngl
, we construct the design matrix $\boldsymbol{X}$ by considering an offset term for all equipment or software changes steps and all earthquakes indicated by the NGL. We also specify a post-seismic relaxation term for all earthquakes indicated by the NGL. If no relaxation time is specified in the argument vec_earthquakes_relaxation_time
, we consider a default relaxation time of $365.25$ days.
It is generally recognized that noise in GNSS time series is best described by a combination of colored noise plus white noise [@he2017review; @langbein2008noise; @williams2004error; @bos2013fast] where the white noise generally model noise at high frequencies and the colored noise model the lower frequencies of the spectrum. In a large study on the noise properties of GNSS signals, \cite{he2019investigation} concluded that the optimal noise models for 80–90\% of GNSS time series signals are the power law and white noise model or white noise and flicker/pink noise with model. The \texttt{gmwmx} \texttt{R} package currently support both stochastic model specification.
More precisely, the power spectrum of a power-law noise has the following form: $$ P(f)=P_0\left(f / f_s\right)^\kappa $$ where $f$ is the frequency, $P_0$ is a constant, $f_s$ the sampling frequency and the exponent $\kappa$ is called the spectral index.
Many stochastic noise can be expressed as such, for example, a spectral index $\kappa=0$ produces a white noise, a spectral index $\kappa=-2$ produces a red noise or random walk and a spectral index $\kappa=-1$ produce a flicker noise, also called pink noise.
@granger1980long and @hosking1981fractional showed that power-law noise with a spectral index between $-1$ and $1$ can be obtained by using fractional differencing of Gaussian noise:
$$ (1-B)^{-\kappa / 2} \mathbf{v} $$
where $B$ is the backward-shift operator $\left(B v_i=v_{i-1}\right)$ and $\mathbf{v}$ a vector with independent and identically distributed (IID) Gaussian noise.
Following from Hosking's definition of the fractional differencing, we obtain
$$ \begin{aligned} (1-B)^{-\kappa / 2} & =\sum_{i=0}^{\infty}\binom{-\kappa / 2}{i}(-B)^i \ & =1-\frac{\kappa}{2} B-\frac{1}{2} \frac{\kappa}{2}\left(1-\frac{\kappa}{2}\right) B^2+\ldots \ & =\sum_{i=0}^{\infty} h_i \end{aligned} $$ with the coefficient $h_i$ that can be computed using the following recurrence relation [@kasdin1992discrete]:
$$ \begin{aligned} h_0 & =1 \ h_i & =\left(i-\frac{\kappa}{2}-1\right) \frac{h_{i-1}}{i} \quad \text { for } i>0 \end{aligned} $$
Assuming an infinite sequence of zero-mean white noise $\mathbf{v}$, with variance $\sigma_{P L}^2$, and a spectral index $kappa > -1$ then the autocovariance $\gamma(\tau)=\operatorname{Cov}\left(X_t, X_{t+\tau}\right)=\mathbb{E}\left[\left(X_t-\mu\right)\left(X_{t+\tau}-\mu\right)\right]$ is [@bos2008fast]:
$$
\begin{aligned}
& \gamma(0)=\sigma_{p l}^2 \frac{\Gamma(1-\alpha)}{\left(\Gamma\left(1-\frac{\alpha}{2}\right)\right)^2} \
& \gamma(\tau)=\frac{\frac{\alpha}{2}+\tau-1}{-\frac{\alpha}{\alpha}+\tau} \gamma(\tau-1) \text { for } \tau>0
\end{aligned}
$$
When the argument stochastic_model
is set to "wn + pl"
, the stochastic model considered includes both white noise and power-law with the specified above stationary autocovariance structure. The parameters estimated are: $\sigma^2_{W N}$, $\kappa$ (constrained to be greater than $-1$) and $\sigma^2_{P L}$.
When the argument stochastic_model
is set to "wn + fl"
, the stochastic model considered includes both white noise and flicker noise (not stationary power-law noise with a spectral index $\kappa=-1$) where the variance covariance of the flicker noise $\omega$ is obtained as follows (see e.g., [@bos2008fast]):
$$ \operatorname{Cov}(\omega) = \sigma^2_{F L}\mathbf{U}^T \mathbf{U} $$
where $$ \mathbf{U}=\left(\begin{array}{cccc} h_0 & h_1 & \ldots & h_n \ 0 & h_0 & & h_{n-1} \ \vdots & & \ddots & \vdots \ 0 & 0 & \ldots & h_0 \end{array}\right) $$ with the coefficients $h_i$ computed considering a spectral index $\kappa=-1$.
The stochastic parameters estimated are: $\sigma^2_{W N}$, $\sigma^2_{F L}$ and $\kappa$ is fixed to $-1$.
Let us showcase how to estimate the tectonic velocity in for one specific component (North, East or Vertical) of one station.
Let us first load the gmwmx2
package.
library(gmwmx2)
station_data <- download_station_ngl("CHML")
plot(station_data)
plot(station_data, component = "N")
fit1 <- gmwmx2(station_data, n_seasonal = 2, component = "N", stochastic_model = "wn + pl")
summary(fit1)
By default, the estimated parameters are provided in m/day, we can optionally scale the estimated functional parameters so that they are returned in m/year with the argument scale_parameters
.
summary(fit1, scale_parameters = TRUE)
plot(fit1)
fit2 <- gmwmx2(station_data, n_seasonal = 2, component = "N", stochastic_model = "wn + fl")
summary(fit2)
plot(fit2)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.