\newcommand{\bomega}{\mbox{\boldmath $\omega$}} \newcommand{\bbeta}{\mbox{\boldmath $\beta$}} \newcommand{\bTheta}{\mbox{\boldmath $\Theta$}} \newcommand{\bOmega}{\mbox{\boldmath $\Omega$}} \newcommand{\bLambda}{\mbox{\boldmath $\Lambda$}} \def\extLambda{\bLambda_{\rm ext}} \def\intLambda{\bLambda_{\rm int}} \def\sumi{\hbox{$\sum_{i=1}^{n}$}} \def\sumj{\hbox{$\sum_{j=1}^J$}} \def\suml{\hbox{$\sum_{\ell=1}^L$}} \def\sumk{\hbox{$\sum_{k=1}^{K_{\ell}}$}} \def\sumkl{\hbox{$\sum_{k,\ell}$}} \def\sumb{\hbox{$\sum_{b=1}^{B}$}} \def\exPhi{\Phi_{\rm true}} \def\catPhi{\Phi_{\rm cat}} \def\wh{\widehat} \def\trans{^{\rm T}} \def\pr{\hbox{pr}}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library("xtable"); options(xtable.comment = FALSE)

Abstract

This document provides further details and a concrete illustration for the R programs used in the paper Categorizing a Continuous Predictor Subject to Measurement Error [@twbb2018]. This package is mainly focused on logistic regression and linear regression, though the proposed method has much weaker assumptions and can be applied in many scenarios. This document provides a brief overview of the methodology, especially for linear regression and logistic regression. Further, we use simulation studies and a real data example, the EATS data [@EATS2001], to show 4 ways to use ccp, the main function in the CCP package.

Introduction

In epidemiology, it is common to fit a categorical risk model to a continuous risk predictor, because the categorical one is thought to be more robust and interpretable. When the risk predictor is observed with measurement error, epidemiologists typically ignore the underlying measurement error and perform a naive approach, e.g., logistic regression, as what they would have done if they observe the true predictor. Here we introduce some notation to help describe the problem background.

Using the notation stated above, ideally, epidemiologists categorize $X$ and then use $X_{\rm C}$ to fit the model. However, if they observe $W$ instead of $X$, they would use $W_{\rm C}$ in the original categorical model without correcting the measurement error.

@white1982maximum shows that when $X$ is observed, though the categorical model is a misspecified model, the estimates are unbiased with respect to the true value of what epidemiologists are interested in - the parameters with respect to $X_C$ in the categorical risk model. When $X$ is not observed, however, substituting $W$ for $X$ leads to a biased estimate, as well as a poor inference quality. To address the problem based on $W$, the relationship between $W$ and $X$ needs to be specified.

We address this problem and provide a general method to get unbiased estimates and correct inference even with measurement error in the data. The key of our method is adding another layer of conditional expectation given observed predictor $W$. Thus, the original estimating equation is now relying on $W$ but not on $X$. We then need to estimate the expectations of functions of $X$ given $W$. For example, suppose the original estimating equation is formed based on $E{f(X)} = 0$. Adding a layer of conditional expectation leads to $E[E{f(X)|W}]=0$. Hence, the goal turns out to be estimating $E{f(X)|W}$, depending on the conditional density $f_{X|W}$. Although @twbb2018 focuses on the general case, this document aims to provide more details for logistic regression and linear regression.

Due to the complexity of the problem itself, in this package, we do not consider other covariates measured without error. Readers can find more general formulas in the original paper.

The rest of this document is organized as follows: we first provide a brief methodology review for readers to gain more background without looking at the original paper; then, we present estimating equations in logistic regression and linear regression. Finally, we show different ways to use the main function ccp through simulation studies, as well as the analysis for EATS data [@EATS2001].

Methodology review

General overview

Here we present two cases: linear regression and logistic regression, corresponding to continuous or binary response. For the more general model and its assumptions, we refer readers to Categorizing a Continuous Predictor for more details.

This package allows users to use two types of data:

In the following part, we explain the external-internal and internal-only cases in linear regression and logistic regression, respectively.

In the R package CCP, we assume that $$ W= X + U; \hskip 5mm X \sim N(\mu_x,\sigma_x^2); \hskip 5mm U \sim N(0,\sigma^2_u). $$ Also, $X$ and $U$ are independent. For convenience, we define nuisance parameter $\bLambda = (\mu_x, \sigma^2_x, \sigma^2_u)$.

For the continuous risk predictor $X$, we denote $m(X, \bbeta) = \alpha + X \beta$, where $\bbeta = ( \alpha, \beta)$. To categorize $X$ into $j = 1,..., J$ categories $(C_1, ..., C_J)$, we define $M(X) = { I(X\in C_1),..., I(X\in C_J)}\trans$. Thus, the corresponding parameters in the categorical model are $\bTheta = (\theta_1,..., \theta_J)$.

The parameter we are mainly interested in is $\theta_J - \theta_1$, which is the log relative risk in logistic regression.

Now we introduce three assumptions required for our approach:

(a) When $X$ is observed, the true risk model in the continuous scale has unbiased estimating functions known up to parameters $\bbeta$.

(b) When $X$ is not observed, we can find a function $g(X, \bbeta)$ that $E[E{g(X, \bbeta)|W}]=0$, with its conditional expectation $E{g(X, \bbeta)|W}$ depends on $\bLambda$ and can be estimated. A special case is knowing the distribution of $X$ given $W$ up to parameters $\bLambda$.

(c) If the external data are necessary for model identification, the parameter estimated from external data, i.e. $\sigma^2_u$, should be transportable. See Chapter 2.2.4-2.2.5 of @Carroll2006.

For linear regression and logistic regression considered in this package, all three assumptions are satisfied. Further, we would like to point out that neither normally distributed $X$ and $U$, nor logistic or linear regression model is specifically required for the proposed method itself.

To estimate nuisance parameters $\bLambda$, we now introduce the estimating equations based on using external-internal or internal-only data. Then the estimating equations for $\bbeta$ and $\bTheta$ are introduced, depending on using linear regression or logistic regression.

External-internal data

If there are no replicates in the internal data, we use the external data only to estimate $\sigma_u^2$. Suppose we observe $W_{ik} = X_i + U_{ik}$ for $k=1,...,K$ and $i=n+1,...,n+N$. We use internal data to estimate $\mu_x,\sigma^2_x$ without replicates.

In the external data, let $\overline{W}{i\cdot} = K^{-1} \sum{k=1}^{K}W_{ik}$. Define $\wh{\sigma}^2_{u,i} = (K-1)^{-1} \sum_{k=1}^K (W_{ik} - \overline{W}{i\cdot})^2$ to be the sample variance of the $W{ik}$ for a given $i$.

Because $E{ (W_{i} - \mu_x)^2} = \sigma_x^2 + \sigma_u^2$, unbiased estimating equations for $\bLambda=(\mu_x,\sigma^2_x,\sigma^2_u)$ are

Internal-only data

Suppose there are no external data, and we have replicates $W_{ir}$ for $r= 1,...,R$ in the internal data. Now we use the internal data to estimate $\Lambda =(\mu_x,\sigma_x^2,\sigma_{uR}^2)$, and we observe $W_{ir} = X_i + U_{ir}$ for $r=1,...,R$ and $i=1,...,n$. Define $\overline{W}{i\cdot} = R^{-1}\sum{r=1}^R W_{ir}$. Define $\wh{\sigma}^2_{u,i}$ to be the sample variance of the $W_{ir}$ within subject $i$, and define $\sigma_u^2/R = \sigma^2_{uR}$.The estimating equations are

Using the external-internal or internal-only data influences how to estimate nuisance parameters $\bLambda$, while fitting linear or logistic regression affects the estimating equations of $\bbeta, \bTheta$ as described below.

Linear regression

We assume the true model in the continuous scale is$$ Y = \alpha + X \beta+\epsilon = m(X, \bbeta) + \epsilon,$$ where $m(X, \bbeta) = \alpha + X\beta.$ For external-internal and internal-only cases, the estimation equations for $\bbeta$ and $\bTheta$ are the same.

The estimating function for $\bbeta=(\alpha,\beta)$ is $$ \Phi ( \bbeta,\wh\bLambda) = n^{-1}\sumi E[{Y_i - m(X_i, \bbeta) }\partial m(X_i, \bbeta) / \partial\bbeta\trans \vert W_i ]. $$ The estimating function for $\bTheta$ is $$ Q(W_i,\bTheta, \wh\bbeta,\wh\bLambda) = E \left[\begin{array}{c} m(X_i,\wh\bbeta) I(X_i \in C_1) - \bTheta_1 I(X_i \in C_1) \ \vdots \ m(X_i,\wh\bbeta) I(X_i \in C_J) - \bTheta_J I(X_i \in C_J) \end{array} \vline \hbox{ } W_i \right]. $$ The integration above is calculated using the integrate function in the R package stats.

Logistic regression

Let $H(\cdot)$ denote the logistic distribution function. Here we consider the special case of linear logistic regression with the classical measurement error model in both the external and internal datasets: $$ \pr(Y = 1 \vert X,Z) = H(\alpha + X\beta) = H{(1, X)\bbeta}$$

Let $p_i = \pr(Y=1 \vert W_i) = \int H{(1, x)\bbeta} f_{x\vert W_i}(x,W_i,\bLambda) dx$, we use the integrate function in the R package stats to compute this quantity and calculate the loglikelihood $\propto n^{-1}\sumi Y_i\log(p_i)+(1-Y_i)\log(1-p_i)$. We then use the optim function in the R package stats to minimize negative loglikelihood to estimate $\bbeta$.

Given the logistic regression model, the categorical estimating function is $$ \catPhi{Y,M\trans(X)\bTheta} = M(X) [Y - H{M\trans(X)\bTheta}], $$ Where $M(X) = { I(X\in C_1),..., I(X\in C_J)}\trans$ for categories $(C_1, ..., C_J)$. Hence, with $\bOmega = (\bTheta,\bbeta,\bLambda)$, $$ Q(W,\bOmega) = E\left( M(X) \left[H{m(X, \bbeta)} - H{M\trans (X)\bTheta}\right]\bigg\vert W\right). $$ In the R program, \begin{equation} Q(W_i,\bTheta, \wh\bbeta,\wh\bLambda) = E \left[\begin{array}{c} H{m(X_i, \wh\bbeta)} I(X_i \in C_1) - H(\bTheta_1) I(X_i \in C_1) \ \vdots \ H{m(X_i, \wh\bbeta)} I(X_i \in C_J) - H(\bTheta_J) I(X_i \in C_J) \end{array} \vline \hbox{ } W_i \right]. \end{equation}

Again, we use the integrate function in the R package stats to compute the integrals.

Function overview

knitr::include_graphics("CCP.pdf")

As shown in Figure 1, the package CCP contains one main function named 'ccp'. Based on the types of response, we can specify logistic regression for binary $Y$, or linear regression for continuous $Y$. Further, in each of the two cases we mentioned before, ccp provides the choice of external-internal and internal-only cases as introduced in the methodology review.

Get started

First, let us install the R package CCP.

install.packages("~/Desktop/CCP_1.1.tar.gz", repos = NULL, type = "source")
library(CCP)

Once the package has been loaded, one can call the main function as follows.

ccp(y, W_int, W_ext = NULL, C = NULL, Type, print.summary = TRUE, standardize = TRUE)

To check the package CCP or the usage of a specific function, you can either use help or ??.

help( package = "CCP" )
??ccp

The former command gives a brief summary of all functions in the package, while the later one offers more detailed information for function ccp.

Simulation study

Here we show the external-internal and internal-only cases for logistic regression. We also compare the proposed method with the naive approach: substituting $W$ for $X$ in the categorical model with no adjustment for measurement error.

(1) Define a function to calculate the naive estimates.

thetaw <- function(y, w, mux_hat, s2x_hat){

  # Define cut points

  J = 5 # categorize W into quintiles
  C = rep(0, 4)
  C[1] = qnorm(0.2, mean = mux_hat, sd = sqrt(s2x_hat))
  C[2] = qnorm(0.4, mean = mux_hat, sd = sqrt(s2x_hat))
  C[3] = qnorm(0.6, mean = mux_hat, sd = sqrt(s2x_hat))
  C[4] = qnorm(0.8, mean = mux_hat, sd = sqrt(s2x_hat))

  # Define a function to categorize W
  fMx <- function(x){
    Mx = vector()
    Mx[1] = ifelse(x < C[1], 1, 0)
    Mx[2] = ifelse((C[1] <= x) & (x < C[2]), 1, 0)
    Mx[3] = ifelse((C[2] <= x) & (x < C[3]), 1, 0)
    Mx[4] = ifelse((C[3] <= x) & (x < C[4]), 1, 0)
    Mx[5] = ifelse(x >= C[4], 1, 0)
    return (Mx)
  }  

# Categorize W
cw = matrix(0, ncol = J,nrow = n)
for(i in 1:n){cw[i, ] = fMx(w[i])}

# Run standard logistic regression using glm (no intercept)
thetaw_out = glm(y ~ cw - 1 , family = binomial(link = "logit"))

# Get estimates theta1, .., that_5 and standard errors
thetaw_w = summary(thetaw_out)$coef[1:J] 
s.e.thw = summary(thetaw_out)$coef[1:J, 2]

# Calculate the standard error for theta_5 - theta_1
s.e_thetaw_J1 = sqrt(s.e.thw[J]^2+s.e.thw[1]^2 -2*vcov(thetaw_out)[1,J])
s.e_thetaw_w = c(s.e.thw, s.e_thetaw_J1) # SE of theta1, .., theta_5 and (theta_5-theta_1)

# Report results 
theta.par = c(thetaw_w, thetaw_w[5] - thetaw_w[1])
names(theta.par) = names(s.e_thetaw_w) = c("theta1", "theta2", "theta3", 
                                           "theta4", "theta5", "theta5-theta1")
out1 = list(theta.par, s.e_thetaw_w)
names(out1) = c( "theta", "stderr.theta")

return(out1)  }

(2) Set parameters values for data generation.

# Parameter values
mux = 0 #true mean of X
su2 = 1 #true variance of U
sx2 = 1 #true variance of X

b = log(1.5) #beta_1
a = -0.42 #beta_0

# Sample size
n = 500 # internal data
m = 300 # external data
r = 2 # replicates 

(3) Generate the external and internal datasets. Note that $X$ in the external data has no replicates. The replicates are generated due to the error term $U$.

# Set seed
set.seed(107852)

# Generate external dataset
X_ext = rnorm(m, mux, sqrt(sx2)) # X is a vector, not a matrix
U_ext = matrix(rnorm(m * r, 0, sqrt(su2)), m, r)
W_ext = matrix(rep(X_ext, r), m, r, byrow = FALSE) + U_ext 

# Generate internal dataset
X_int = rnorm(n, mux, sqrt(sx2))
U_int = rnorm(n, 0, sqrt(su2))
W_int = X_int + U_int  # internal data has no replicates

## Generate response y for internal dataset
fHm <- function(x, a, b){1 / (1 + exp( - (a + b * x)))}
pr = fHm(X_int, a, b)
y = vector()
for(i in 1:n){y[i] = rbinom(1, 1, pr[i])}

(4) Perform the proposed method using function ccp. Type = "logistic" needs to be specified for logistic regression.

outcome1 = ccp( y = y, W_int = W_int, W_ext = W_ext, Type = "logistic")
outcome1

(5) Compare results from the proposed method to the naive approach.

# Estimate mean of X and variance of X (used for categorization)
mux_hat = mean(W_int)
su2e = mean(apply(W_ext, 1, var))
s2x_hat=max((var(W_int)-su2e), 0.2*var(W_int)) 
# 0.2*var(W_int) is the common bound to control the variance of X

# Run naive approach
thetaw(y, W_int, mux_hat, s2x_hat)

Given in the paper [@twbb2018], the true $\bTheta = (-0.98, -0.64, -0.42, -0.21, 0.14)\trans$. Thus, the true $\theta_5 - \theta_1 = 1.12$. The proposed method has the estimate 1.121, while the naive approach provides an estimate 0.470. The results show that ignoring measurement error and applying standard logistic regression directly with respect to $W$ lead to poor inference quality. On the contrary, the proposed method gives consistent estimate as expected.

Now we present the internal-only case for logistic regression.

# set seed
set.seed(1029356)

# Generate dataset
X_int = rnorm(n, mux, sqrt(sx2)) # X has no replicates
U_int = matrix(rnorm(n * r, 0, sqrt(su2)), n, r)
W_int = matrix(rep(X_int, r), n, r, byrow = FALSE) + U_int  

# Generate response y
fHm <- function(x, a, b){1 / (1 + exp(-(a + b * x)))}
pr = fHm(X_int, a, b)
y = vector()
for(i in 1:n){y[i] = rbinom(1, 1, pr[i])}

Run the proposed method:

outcome2 = ccp( y = y, W_int = W_int, Type = "logistic")
outcome2

Run standard logistic regression:

row_mean_w  = apply(W_int, 1, mean)
mux_hat = mean(row_mean_w)
s2w = apply(W_int, 1, var)
su2e =  mean(s2w)/r   
s2x_hat = max(mean((row_mean_w - mux_hat) ^ 2) - su2e, 
              0.2 * (mean((row_mean_w - mux_hat) ^ 2)))
thetaw(y, row_mean_w, mux_hat, s2x_hat)

We observe the similar pattern as shown in the external-internal case. For $\theta_5- \theta_1 = 1.12$, the naive estimate is 0.772, while the proposed method estimates it as 1.126.

Real data example: EATS data

Data

Here we use the Eating at America's Table (EATS) Study [@EATS2001] data as an example to illustrate the usage of the package. The dataset contains 964 participants with multiple 24-hour recalls of diet per each person. Define Fat Density as the percentage of calories coming from fat. We want to use this data to analyze the relative risk of being obese, comparing the group of people with the highest level of Fat Density versus people with the lowest level of Fat Density.

First, we load the data. However, we are not allowed to share this data. Thus, we show several lines of the data so you can get a sense of what the data looks like.

data_external = read.csv('~/Desktop/data_external_clean.csv')
data_internal = read.csv('~/Desktop/data_internal_clean.csv')
EATSdata_all = rbind(data_internal, data_external)
head(EATSdata_all)

In this study, we mainly focus on the following variables:

For numerical stability, we first preprocess the data:

(1) delete outliers;

(2) centered and standardized $W$ using $(15*W - 5)/\sqrt{0.5}$.

Then we obtain a dataset with 929 observations. We then randomly selected 200 observations as the external dataset, the remaining 729 observations are the internal dataset. More details are provided within the examples.

Before formally applied the our approach, we need to check the assumptions. In the paper, we showed that it is reasonable to take (a) $X$ to be normally distributed, (b) $U$ to be normally distributed, and (c) $X$ and $U$ to be independent. Hence, here we do not repeat the detailed measurements we have done previously.

We first show the external-internal case, then the internal-only case. Each case contains logistic regression with binary $Y$ and linear regression with continuous $Y$.

External-internal case

First, we choose variables from the two datasets. We choose the first 2 records from the external dataset, and the 3rd record from the internal dataset.

Logistic regression

The response variable BMI has been transferred to a binary variable with threshold 30. In other words, $Y_{\rm original} > 30\Longrightarrow Y_{\rm new} = 1$, which indicates obesity.

# select the first 2 records from external data
W_ext = data_external[,2:3]

# select the 3rd record from internal data
W_int = as.matrix(data_internal[,4])

# transfer continuous Y into binary
y = 1*((data_internal[,1])>30)

The following table shows the size of the external and internal datasets.

CN = c("size", "recalls")
RN = c("internal", "external")
datastructure = matrix(c(dim(W_int)[1], dim(W_int)[2], dim(W_ext)[1], dim(W_ext)[2]), byrow = TRUE, ncol = 2, nrow = 2)
rownames(datastructure) = RN
colnames(datastructure) = CN
print(xtable(datastructure, caption = "summary for the external-internal case", digits = 0))

Now we apply ccp with specified Type = "logistic".

results = ccp( y = y, W_int = W_int, W_ext = W_ext, Type = "logistic")

The log relative risk - the term theta 5 - theta 1- is estimated as 0.984 with p-value = 0.036 and is significant at the 0.05 level.

Linear regression

To fit linear regression, we use the scaled BMI as the continuous response. The internal and external data are the same as before.

# select the first 2 records from external data
W_ext = data_external[,2:3]

# select the 3rd record from internal data
W_int = data_internal[,4]

# scale continuous Y
y = data_internal[,1]
y=as.numeric(scale(y))

Specifying Type = "linear" fits a linear regression. Because we already standardized the data, we can simply choose standardize = FALSE. The default is TRUE.

results = ccp(y = y,W_int = W_int, W_ext = W_ext, Type = "linear", standardize = FALSE)

The estimate for $\theta_5 - \theta_1$ is 0.59336, which is highly significant with a small $p$-value 0.00098.

Internal-only case

For the internal-only case, the syntax is similar to what we showed above. However, only the internal data needs to be provided. If the user also provides external data, as long as the internal data has replicates, the external data are ignored.

Logistic regression

# use first two replicates 
W_int = EATSdata_all[, 2:3]
# transfer continuous Y into binary
y = 1*((EATSdata_all[, 1])>30) 
CN = c("size", "recalls")
RN = c("internal", "external")
datastructure = matrix(c(dim(W_int)[1], dim(W_int)[2],0, 0), byrow = TRUE, ncol = 2, nrow = 2)
rownames(datastructure) = RN
colnames(datastructure) = CN
print(xtable(datastructure, caption = "summary for the internal-only case", digits = 0))
results = ccp(y = y, W_int = W_int, Type = "logistic")

Linear regression

W_int = EATSdata_all[, 2:3]
y = EATSdata_all[, 1]
y = as.numeric(scale(y))
results =ccp(y = y, W_int = W_int, Type = "linear", standardize = FALSE)

For the external-internal and internal-only cases, our approach provides similar estimates. However, the naive logistic regression, which is often used by epidemiology, has different results in the two cases and not similar. Since method comparison is not in the scope of this document, we refer readers to check the paper Categorizing a Continuous Predictor Subject to Measurement Error for more details.

References



tianyingw/CCP documentation built on Aug. 20, 2022, 1:30 a.m.