knitr::opts_chunk$set(echo = TRUE)

The current version of hawkes has a number of limitations that are outlined below along with questions comments about how then can be addresed. After that a stepwise plan laid out for version 0.2.0

Current limitations

It's slow

On each iteration of the EM algorithm, a 5-deep nesting of for-loops appears. Doing this 5-layer deep nested for loop quickly is important.

The outer loop is over $N$, the number of realizations of process that are to be analysed (e.g., persons). The next two loops are over the dimension the process, $M$. Similar to a covariance matrix, computations are required for each pairwise combination of processes. The inner two loops are over the observed number of time points $T$. For each time point $t_i$, it is required to loop over all $t_j < t_i$. This puts the order of computations at $O(NM^2T^2)$ -- it grows linearly with the sample size, $N$, quadratically with the of dimensionality of the process,$M$, and quadratically with duration of the time series $T$.

$N$ and $M$ aren't going away, but one $N$ is linear and the $M$ will by relatively small in applications I have in mind. This leaves $T$. The double-loop over $T$ is currently implemented by compute_I (compute the response intensity), which is the workhorse of the Q step and the E-step -- I would say about 80% of the rest of the EM is just bookeeping for compute_I.

Let's take a look:

# devtools::install_github("peterhalpin/hawkes")
library("hawkes")
compute_I

Now that I know more about how R searches for variables and enviromental scoping, I would consider just passing out, inp, and Log to compute_I and having other args sit in the next enviornment up (the EM function). Or at least passing the parameterized response kernel rather than the kernel function and its parms separately. Sorting out how to allow for user-chosen kernel with possibly different numbers of parameters was a problem (ask Zack about scoping and efficiency.)

To avoid the double loop it calls lapply and mapply to find the possible parents for each output event (parent_ind -- this is rounding error trick), to compute the lag bewteen each parent and each output (parent_lag), and to compute the response kernel for each the lags (temp_I). It seems this should be fast? But it isn't...

Random-effects

The current verions of hawkes goes one process at a time. That has a number of problems that result can be addressed via a random effects approach, which can be run on multiple processes at a time.

To include a random effect for the intensity parameter \alpha (the multiplier on the response kernel), there are at least two approaches.

So far I have worked out the EM for first approach, using the two-parameter gamma density for $G(\alpha)$. This is OK for the univariate case, but for M-dimensional case $A = {\alpha_{jk}}$ is an $M \times M$ non-symmetrical matrix. There are different considerations here...

The main case I am interested in is where $N$ groups of responsdents are observed, each with $M_n = M$ (for simplicity) members. In this case, we might better think of the situation as an $M \times N$ dimensional process with two types of dependence, each with its own random effect. The following matrix depicts this situation for $M = 2$ (dyads).

[A_{MN \times MN} = \left[\begin{matrix}
\alpha_{11} & \beta_{21} & 0 & 0 & 0 & 0 & \dots \ \beta_{21} & \alpha_{22} & 0 & 0 & 0 & 0 & \dots \ 0 & 0 & \alpha_{33} & \beta_{34} & 0 & 0 & \dots \ 0 & 0 & \beta_{43} & \alpha_{44} & 0 & 0 & \dots \ 0 & 0 & 0 & 0 & \alpha_{55} & \beta_{56} & \dots \ 0 & 0 & 0 & 0 & \beta_{65} & \alpha_{66} & \dots \ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \ \end{matrix} \right] \quad \text{with} \quad \begin{matrix} \alpha_{ii} \sim G_\alpha \
\beta_{ij} \sim G_\beta \ \alpha_{ii} \perp \beta_{ij} \end{matrix} ]

There are some issues not addressed here, but I am willing to treat this as a first approximation. For example, should the components of the vector $\gamma_i = [\alpha_{ii}, \beta_{ij}]$ be treated as related to one other? In this case, we would be taking $i$ samples from an $M$-dimensional distribution. We could treat the $M$ parameters as independent in a heirarhical model...but, this is making me quezy. Anyway, if we want to do all this we should just go full Bayes.

The second case occurs where there are $M$ input streams each of $N$ individuals, and the input streams are the same across individuals (e.g., text, click, voice). We might actually treat this as a univariate stream of event times with different covariates describing the input source. The marks could affect the probability of another event occuring as well as the mark of the next event. Not sure how to get the cross-dependence tho.... Perhaps we need a properly multivariate distribution and cannot get away with simplifying sturctures. Or perhaps we just care about the sequence...

Follow-up analyses

Let's say we fit a viable random effects Hawkes process to some data in which $N$ individuals are nested within $M$ groups and we have predicted values for each element of the matrix given above, and, let's even say, prediction errors for each element. What is required next are a number of summary quantities computed on the team's responsiveness matrices. It seems that we can accomplish a lot of this via post processing with node- and graph-level statistics for bi-directional weighted networks. But I don't know much about this...

Let's say that we want to test structural constraints on the responsiveness matrices. For example, symmetry constraints would allow us to test for mirroring, row constraints would all us to test whether someone is equally responsive to all team members (only useful when M > 2).

Let's say we want explain differences in groups performance in terms of group or node level covariates. Hmmm

Plans

EM |-- logL |-- compute_mu |-- compute_alpha |-- probs |-- Q_fun |-- compute_Q |-- compute_alpha |-- compute_I |-- compensator |-- set_parms



peterhalpin/hawkes documentation built on July 2, 2023, 1:04 p.m.