In Garylikai/PMLi-1.0-R-Package: AMS 597 Project Partially Matched Samples R Package

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"

)

Package Info

The user manual is designed for users of the PMLi package. This R package was created as a project for AMS 597: Statistical Computing, Spring 2021, in Stony Brook University. PMLi contains statistical procedures to mainly analyze partially matched samples - an experimental design based on independent samples and matched pairs designs. For example, in clinical studies, a medical researcher may be interested in comparing two methods, call them methods A and B, for measuring cardiac output. If no missing value is present in the samples, the experimental design is called matched pairs design. Regular hypothesis testing procedures for matched pairs design, such as paired t-test method, can test the null hypothesis of the mean difference between two sets of paired observations. However, in reality, data has missing values for some reason. Appropriate statistical approaches are required for partially matched pairs. The project also investigates the statistical methods for numerous exceptional cases related to partially matched samples. In particular, data inputs may not be partially matched samples but uncomplicated cases that can be analyzed using standard analysis procedures. Therefore, the package handles partially matched samples and corner cases where statistical approaches for partially matched samples may not be suitable.

The statistical procedures for partially matched samples discussed here are largely based on Kuan and Huang's work @kuan. Prof. Pei Fen Kuan is the course instructor for AMS 597, Spring 2021. More information about Kuan and Huang's work on partially matched samples can be found here. More information about the course AMS 597 can be found here.

Package Installation

The PMLi package can be installed in R directly.

install.packages("PMLi_1.0.tar.gz", type = "source")
library(PMLi)

More Background

Mathematically speaking, suppose that $2n$ samples of cardiac output, i.e., $n$ pairs of matched samples of cardiac output with missing values, are drawn from a subject population. Suppose that $n_1$ pairs of the samples are entirely matched. That is, the $n_1$ subject pairs do not have any missing cardiac output values. Further, suppose that $n_2$ and $n_3$ subject pairs have only missing cardiac output values for methods A and B, respectively. Then, the data for measuring cardiac output is an example of partially matched data.

It is also possible that $n_4$ pairs of the cardiac outputs are all missing. This scenario will not change the inherent characteristics of partially matched samples because the $n_4$ pairs of missing values can be omitted directly. A table illustration with general ordered partially matched samples is shown below. Note that in reality, the pairs are typically not ordered like the following.

Table: Demonstration of Partially Matched Samples

Pair $1$ Pair $2$ ... Pair $n_1$ Pair $n_1+1$ ... Pair $n_1+n_2$ Pair $n_1+n_2+1$ ... Pair $n_1+n_2+n_3$ Pair $n_1+n_2+n_3+1$ ... Pair $n$

$x_1$ $x_2$ ... $x_{n_1}$ $x_{n_1+1}$ ... $x_{n_1+n_2}$ NA ... NA NA ... NA $y_1$ $y_2$ ... $y_{n_1}$ NA ... NA $y_{n_1+n_2+1}$ ... $y_{n_1+n_2+n_3}$ NA ... NA

Pair $n$ is, in fact, Pair $n_1+n_2+n_3+n_4$. Denote the second and the third rows of the table above Sample 1 and Sample 2, correspondingly. Then, Method A and Method B are Sample 1 and Sample 2, respectively, in the cardiac output example. Furthermore, denote the subset of data data1 with $n_1$ fully matched pairs, the subset of data data2 with $n_2$ Sample 1 observations paired with missing values, the subset of data data3 with $n_3$ Sample 2 observations paired with missing values, and the subset of data data4 with $n_4$ matched NA observations. In particular, partially matched samples should be either a combined data of data1, data2, and data3 or a combined data of data1, data2, data3, and data4. Again, the partially matched pairs between and within datasets can be in any order.

Partially matched samples is an experimental design that can be considered as a combination of the following two experimental designs @kuan:

$n_1$ matched pairs or repeated measures (data1)
independent groups with $n_2$ and $n_3$ per group (Sample 1 from data2 and Sample 2 from data3), where both group's experimental designs intend to estimate the same parameter (e.g., difference of means of the two groups is 0)

Again, data4 with $n_4$ pairs of missing values is discarded in hypothesis testing since it is only meaningful for the completeness of justifying the data structure. Now, it is ready to move on to some details of implementation.

Implementation Details for Special Cases Related to Partially Matched Samples^[Note that some of the code presented in this user manual has been simplified for illustration purposes. For the actual implementation, please refer to the source code files.]

All functions in the PMLi package will verify whether the experimental design of the input dataset is partially matched samples. If the dataset is not partially matched, appropriate alternative testing procedures will be used, depending on the dataset's structure. The sample dataset pm in the PMLi package is borrowed to clarify the scenarios below.

0. Data Input

pm is the sample dataset that is lazily loaded already. The dataset can be retrieved directly.

data <- pm

1. $n_1=0,\, n_2=0,\, n_3=0,\, n_4=0\, (n=0)$

The first case concerns an empty dataset. In this package, inputting an empty dataset will return an error message without performing any hypothesis tests.

if(length(data) == 0){
  stop("The data is empty. Please input a nonempty dataset to begin.\n")
}

If the input argument is a vector, an error message will also be returned because the data does not follow a matched sample design. The data argument of all the functions in PMLi accepts either a $2\times n$ or an $n\times 2$ data frame or matrix. Here, an $n\times 2$ data frame or matrix input will be converted into a $2\times n$ data frame to align with the data structure demonstrated in the table. Inputting other dimensions, such as an $n\times 5$ matrix, will produce an error message.

if(is.vector(data)){
    stop("The data is a vector. Please input a nonempty dataset to begin.\n")
}else if(length(data[1, ]) == 2){
  data <- as.data.frame(t(data))
}else if(length(data[, 1]) == 2){
  data <- as.data.frame(data)
}else{
  stop("The data inputted is not partially matched.\n")
}

After verifying the input data dimension, separate the complete input data into subsets of data named data1, data2, and data3 with the same notation illustrated before. Define n1, n2, and n3 to be the corresponding number of paired samples in the implementation so that the exceptional cases can be described easily.

data1 <- data[which(is.na(data[1, ]) == FALSE & is.na(data[2, ]) == FALSE)]
data2 <- data[which(is.na(data[1, ]) == FALSE & is.na(data[2, ]))]
data3 <- data[which(is.na(data[1, ]) & is.na(data[2, ]) == FALSE)]

n1 <- length(data1)
n2 <- length(data2)
n3 <- length(data3)

2. $n_1=0,\, n_2=0,\, n_3=0,\, n_4>0\, (n>0)$

The second case is an extension of the first case: only missing value pairs are inputted. Note that the first case is examined before the second case. Therefore, the only possibility that corresponds to the second case is a dataset with only missing data pairs, which will return an error message. In future implementations, because data4 is not of any interest, whether data4 is empty or not will not be considered individually.

if(n1 == 0 & n2 == 0 & n3 == 0){
    stop("The data only contains missing values. Please input a valid dataset
          to begin.\n")
}

3. $n_1=0,\, n_2>0,\, n_3=0$

The third case considers that $n_2$ pairs of samples are inputted, with all missing values in Sample 2. This scenario matches a data input of data2. In this case, the one-sample Wilcoxon signed rank test method will be used if the Shapiro-Wilk test for the first sample of data2 is significant. Otherwise, the one-sample t-test method will be used.

if(n1 == 0 & n2 > 0 & n3 == 0){
  if(shapiro.test(unlist(data2[1, ]))$p.value <= 0.05){
    test <- wilcox.test(unlist(data2[1, ]), alternative = alternative, mu = mu,
                        paired = FALSE, conf.int = TRUE, conf.level = conf.level)
  }else{
    test <- t.test(data2[1, ], alternative = alternative, mu = mu, 
                   conf.level = conf.level, paired = FALSE, var.equal = FALSE)
  }
}

4. $n_1=0,\, n_2=0,\, n_3>0$

This case is identical to the third case except that all Sample 1 values of data3 are missing.

if(n1 == 0 & n2 == 0 & n3 > 0){
  if(shapiro.test(unlist(data3[2, ]))$p.value <= 0.05){
    test <- wilcox.test(unlist(data3[2, ]), alternative = alternative, mu = mu,
                        paired = FALSE, conf.int = TRUE, conf.level = conf.level)
  }else{
    test <- t.test(data3[2, ], alternative = alternative, mu = mu, 
                   conf.level = conf.level, paired = FALSE, var.equal = FALSE)
  }
}

5. $n_1=0,\, n_2>0,\, n_3>0$

Now suppose that no matched pairs are found in the data input but only pairs of (observation, missing) and (missing, observation). This case corresponds to an input of a combined dataset of data2 and data3. Here, treating the first and the second samples of data2 and data3, respectively, as independent groups is more appropriate. Hence, the two-sample Wilcoxon rank sum test will be used if the Shapiro-Wilk test is significant for at least one of the samples. Otherwise, the two-sample t-test method will be used. Specifically, if the F Test to compare two variances is significant, the two-sample t-test method with unequal variance assumption will be used. If not, the two-sample t-test method with equal variance assumption will be used.

if(n1 == 0 & n2 > 0 & n3 > 0){
  if(shapiro.test(unlist(data2[1, ]))$p.value <= 0.05 | 
       shapiro.test(unlist(data3[2, ]))$p.value <= 0.05){
      test <- wilcox.test(unlist(data2[1, ]), unlist(data3[2, ]),
                          alternative = alternative, mu = mu, paired = FALSE, 
                          conf.int = TRUE, conf.level = conf.level)
  }else{
    if(var.test(unlist(data2[1, ]), unlist(data3[2, ]))$p.value <= 0.05){
      test <- t.test(data2[1, ], data3[2, ], alternative = alternative, mu = mu,
                     conf.level = conf.level, paired = FALSE, var.equal = FALSE)
    }else{
      test <- t.test(data2[1, ], data3[2, ], alternative = alternative, mu = mu,
                     conf.level = conf.level, paired = FALSE, var.equal = TRUE)
    }
  }
}

6. $n_1=n,\, n_2=0,\, n_3=0$

Consider the case that the input samples are complete (data1). Here, treating the first and the second samples of data1 as matched pairs is more appropriate. Hence, the Wilcoxon signed rank test method for matched pairs will be used if the Shapiro-Wilk test is significant for at least one of the samples. Otherwise, the matched t-test method will be used.

if(length(data) == n1){
  if(shapiro.test(unlist(data1[1, ]))$p.value <= 0.05 |
     shapiro.test(unlist(data1[2, ]))$p.value <= 0.05){
    test <- wilcox.test(unlist(data1[1, ]), unlist(data1[2, ]),
                        alternative = alternative, mu = mu, paired = TRUE,
                        conf.int = TRUE, conf.level = conf.level)
  }else{
    test <- t.test(unlist(data1[1, ]), unlist(data1[2, ]), 
                   alternative = alternative, mu = mu, conf.level = conf.level,
                   paired = TRUE, var.equal = FALSE)
  }
}

7. $n_1>0,\, n_2>0,\, n_3=0$

In this special case, the dataset input is a combination of data1 and data2. There are two choices to analyze the data if only standard complete analysis procedures are given:

Only analyze the $n_1$ samples data1 using paired-sample methods.
Treat $n_1+n_2$ Sample 1 observations (the first sample of the combined data of data1 and data2) and $n_1$ Sample 2 observations (the second sample of data1) as two independent samples.

In this package, both approaches will be run and outputted for the user's information. The first approach can be handle using the procedures discussed in Case 6. The second approach can be dealt with using the methods discussed in Case 5 except for considering Sample 1 of the combined data of data1 and data2 and Sample 2 of data1 as independent groups.

if(n1 > 0 & n2 > 0 & n3 == 0){
  if(shapiro.test(unlist(data1[1, ]))$p.value <= 0.05 |
     shapiro.test(unlist(data1[2, ]))$p.value <= 0.05){
    test1 <- wilcox.test(unlist(data1[1, ]), unlist(data1[2, ]),
                         alternative = alternative, mu = mu, paired = TRUE,
                         conf.int = TRUE, conf.level = conf.level)
    if(shapiro.test(unlist(cbind(data1, data2)[1, ]))$p.value <= 0.05 |
       shapiro.test(unlist(data1[2, ]))$p.value <= 0.05){
      test2 <- wilcox.test(unlist(cbind(data1, data2)[1, ]), unlist(data1[2, ]),
                           alternative = alternative, mu = mu, paired = FALSE,
                           conf.int = TRUE, conf.level = conf.level)  
    }else{
      if(var.test(unlist(cbind(data1, data2)[1, ]), unlist(data1[2, ]))$
         p.value <= 0.05){
        test2 <- t.test(cbind(data1, data2)[1, ], data1[2, ],
                        alternative = alternative, mu = mu, 
                        conf.level = conf.level, paired = FALSE,
                        var.equal = FALSE)
      }else{
        test2 <- t.test(cbind(data1, data2)[1, ], data1[2, ],
                        alternative = alternative, mu = mu, 
                        conf.level = conf.level, paired = FALSE, 
                        var.equal = TRUE)
      }
    }
  }else{
    test1 <- t.test(unlist(data1[1, ]), unlist(data1[2, ]), 
                    alternative = alternative, mu = mu, conf.level = conf.level,
                    paired = TRUE, var.equal = FALSE)
    if(shapiro.test(unlist(cbind(data1, data2)[1, ]))$p.value <= 0.05 |
       shapiro.test(unlist(data1[2, ]))$p.value <= 0.05){
      test2 <- wilcox.test(unlist(cbind(data1, data2)[1, ]), unlist(data1[2, ]),
                           alternative = alternative, mu = mu, paired = FALSE,
                           conf.int = TRUE, conf.level = conf.level)
    }else{
      if(var.test(unlist(cbind(data1, data2)[1, ]), unlist(data1[2, ]))$
         p.value <= 0.05){
        test2 <- t.test(cbind(data1, data2)[1, ], data1[2, ],
                        alternative = alternative, mu = mu, 
                        conf.level = conf.level, paired = FALSE, 
                        var.equal = FALSE)
      }else{
        test2 <- t.test(cbind(data1, data2)[1, ], data1[2, ],
                            alternative = alternative, mu = mu, 
                            conf.level = conf.level, paired = FALSE, 
                            var.equal = TRUE)
      }
    }
  }
}

8. $n_1>0,\, n_2=0,\, n_3>0$

The last special case is identical to the previous one except that the input dataset is a combined data of data1 and data3.

if(n1 > 0 & n2 == 0 & n3 > 0){
  if(shapiro.test(unlist(data1[1, ]))$p.value <= 0.05 |
     shapiro.test(unlist(data1[2, ]))$p.value <= 0.05){
    test1 <- wilcox.test(unlist(data1[1, ]), unlist(data1[2, ]),
                         alternative = alternative, mu = mu, paired = TRUE,
                         conf.int = TRUE, conf.level = conf.level)
    if(shapiro.test(unlist(cbind(data1, data3)[2, ]))$p.value <= 0.05 |
       shapiro.test(unlist(data1[1, ]))$p.value <= 0.05){
      test2 <- wilcox.test(unlist(cbind(data1, data3)[2, ]), unlist(data1[1, ]),
                           alternative = alternative, mu = mu, paired = FALSE,
                           conf.int = TRUE, conf.level = conf.level)  
    }else{
      if(var.test(unlist(cbind(data1, data3)[2, ]), unlist(data1[1, ]))$
         p.value <= 0.05){
        test2 <- t.test(cbind(data1, data3)[2, ], data1[1, ],
                        alternative = alternative, mu = mu, 
                        conf.level = conf.level, paired = FALSE,
                        var.equal = FALSE)
      }else{
        test2 <- t.test(cbind(data1, data3)[2, ], data1[1, ],
                        alternative = alternative, mu = mu, 
                        conf.level = conf.level, paired = FALSE, 
                        var.equal = TRUE)
      }
    }
  }else{
    test1 <- t.test(unlist(data1[1, ]), unlist(data1[2, ]), 
                    alternative = alternative, mu = mu, conf.level = conf.level,
                    paired = TRUE, var.equal = FALSE)
    if(shapiro.test(unlist(cbind(data1, data3)[2, ]))$p.value <= 0.05 |
       shapiro.test(unlist(data1[1, ]))$p.value <= 0.05){
      test2 <- wilcox.test(unlist(cbind(data1, data3)[2, ]), unlist(data1[1, ]),
                           alternative = alternative, mu = mu, paired = FALSE,
                           conf.int = TRUE, conf.level = conf.level)
    }else{
      if(var.test(unlist(cbind(data1, data3)[2, ]), unlist(data1[1, ]))$
         p.value <= 0.05){
        test2 <- t.test(cbind(data1, data3)[2, ], data1[1, ],
                        alternative = alternative, mu = mu, 
                        conf.level = conf.level, paired = FALSE, 
                        var.equal = FALSE)
      }else{
        test2 <- t.test(cbind(data1, data3)[2, ], data1[1, ],
                            alternative = alternative, mu = mu, 
                            conf.level = conf.level, paired = FALSE, 
                            var.equal = TRUE)
      }
    }
  }
}

Statistical Analysis Methods for Partially Matched Samples^[For a complete statistical analysis strategies for partially matched samples, see Kuan and Huang's @kuan article here].

After inspecting and providing solutions for the special cases that users may encounter, it is ready to discuss implementations for partially matched samples. Detailed implementation for partially matched samples will not be shown in this manual because the implementation is straightforward. More information can be found in the source code and the help files of the functions. The more critical information - the methods to handle partially matched samples - will be discussed in detail.

Partially matched samples correspond to the case $n_1>0,\, n_2>0,\, n_3>0$. There are two options for users to select if only standard complete analysis procedures are provided for analyzing partially matched samples @kuan:

Only analyze the $n_1$ matched samples data1 using paired-sample methods
Treat $n_1+n_2$ samples (the first samples of the combined data of data1 and data2) and $n_1+n_3$ samples (the second samples of the combined data of data1 and data3 ) as two independent samples.

Kuan and Huang claim @kuan that the above approaches are not the best choices because paired-sample methods for a subset of data do not utilize all the information given in the original data, whereas considering the partially matched samples as two independent samples may lose the inherent pair correlation structure between the matched samples. Thus, five types of statistical analysis strategies are developed for analyzing partially matched samples.

Liptak's Weighted Z-Test

Given a dataset of partially matched samples, let $p_{1i}$ be the p-value of $n_1$ matched samples data1 using paired-sample methods and $p_{2i}$ be the p-value of two independent samples methods of $n_2$ Sample 1 (the first samples of data2) and $n_3$ Sample 2 (the second samples of data3). Liptak's weighted Z-test statistics are $Z_{1i}=\Phi^{-1}(1-p_{1i})$ and $Z_{2i}=\Phi^{-1}(1-p_{2i})$, where $\Phi^{-1}(x)$ is the inverse standard normal cumulative distribution function of $x$. The combined p-value of Liptak's weighted Z-test @liptak is given by \begin{equation} p_{ci}=1-\Phi{\left(\frac{w_1Z_{1i}+w_2Z_{2i}}{\sqrt{w_1^2+w_2^2}}\right)}, \end{equation} where $\Phi(x)$ is the standard normal cumulative distribution function of $x$. $w_1$ and $w_2$ are the weights for the first and the second p-values, respectively. There are several choices of weights. Liptak @liptak proposed the default weights as $w_1=\sqrt{2n_1}$ and $w_2=\sqrt{n_2+n_3}$.

It is important to remark that the combined p-value above is meaningful only when $p_{1i}$ and $p_{2i}$ are one-sided p-values @kuan. If a two-sided combined p-value is preferred, Zaykin @zaykin suggests obtaining one-sided (either "greater" or "less") p-values $p_{1i}$ and $p_{2i}$ first. Then the one-sided combined p-value can be obtained using the above formula. Finally, adjust the one-sided combined p-value $p_{ci}$ to obtain a two-sided combined p-value $p_{ci}^*$ using

\begin{equation} p_{ci}^*= \begin{cases} 2p_{ci}, & \mbox{if }p_{ci}<1/2 \ 2(1-p_{ci}), & \mbox{otherwise} \end{cases}. \end{equation}

Kim et al.'s Modified t-Statistic

Let $\bar{D}$ and $S_D$ be the mean and the standard deviation of the difference of Sample 1 and Sample 2 in data1. Also, let $\bar{T}$ and $S_T$ be the mean and the standard deviation of Sample 1 in data2. Similarly define $\bar{N}$ and $S_N$ for Sample 2 in data3. Moreover, let $n_H=\frac{2}{1/n_2+1/n_3}$. Then, Kim et al.'s modified t-statistic @kim is given by

\begin{equation} t_3=\frac{n_1\bar{D}+n_H(\bar{T}-\bar{N})}{\sqrt{n_1S_D^2+n_H^2(S_T^2/n_2+S_N^2/n_3)}}. \end{equation}

Note that the distribution of $t_3$ approximately follows a standard normal distribution under the null hypothesis @kuan. Therefore, p-values and confidence intervals can be obtained from $t_3$.

Looney and Jones's Corrected Z-Test

Let $\bar{T}^$ and $S_T^$ be the mean and the standard deviation of Sample 1 in the combined data of data1 and data2. Similarly define $\bar{N}^$ and $S_N^$ for Sample 2 in the combined data of data1 and data3. Moreover, let $S_{TN_1}$ be the sample covariance of Sample 1 and Sample 2 in data1. Then, Looney and Jones's corrected Z-test @looney is given by

\begin{equation} Z_{corr}=\frac{\bar{T}^-\bar{N}^}{\sqrt{{S_T^}^2/(n_1+n_2)+{S_N^}^2/(n_1+n_3)-2n_1S_{TN_1}/((n_1+n_2)(n_1+n_3))}}. \end{equation}

Note that the distribution of $Z_{corr}$ follows a standard normal distribution. Therefore, p-values and confidence intervals can be obtained from $Z_{corr}$.

Lin and Stivers's MLE-Based Test under Heteroscedasticity

Let $\bar{T_1}$ and $S_{T_1}$ be the mean and the standard deviation of Sample 1 in data1. Similarly define $\bar{N_1}$ and $S_{N_1}$ for Sample 2 in data1. Let $\bar{T}$ be the mean of Sample 1 in data2, as defined in Kim et al.'s modified t-statistic @kim. Similarly define $\bar{N}$ for Sample 2 in data3. Let $S_{TN_1}$ be the sample covariance of Sample 1 and Sample 2 in data1, as defined in Looney and Jones's corrected Z-test @looney. Then, Lin and Stivers's MLE-based test under heteroscedasticity @lin is given by

\begin{equation} Z_{LS}=\frac{f(\bar{T}_1-\bar{T})-g(\bar{N}_1-\bar{N})+\bar{T}-\bar{N}}{\sqrt{V_1}}, \end{equation}

where

\begin{equation} V_1=\frac{[f^2/n_1+(1-f)^2/n_2]S_{T_1}^2(n_1-1)+[g^2/n_1+(1-g)^2/n_3]S_{N_1}^2(n_1-1)-2fgS_{TN_1}(n_1-1)/n_1}{n_1-1}, \end{equation}

\begin{equation} f=n_1(n_1+n_3+n_2S_{TN_1}/S_{T_1}^2)[(n_1+n_2)(n_1+n_3)-n_2n_3r^2]^{-1}, \end{equation}

\begin{equation} g=n_1(n_1+n_2+n_3S_{TN_1}/S_{N_1}^2)[(n_1+n_2)(n_1+n_3)-n_2n_3r^2]^{-1}, \end{equation}

\begin{equation} r=S_{TN_1}/(S_{T_1}S_{N_1}). \end{equation}

Note that the distribution of $Z_{LS}$ approximately follows a $t$ distribution with $n_1$ degrees of freedom under the null hypothesis @kuan. Therefore, p-values and confidence intervals can be obtained from $Z_{LS}$.

Ekbohm's MLE-Based Test under Homoscedasticity

Let $\bar{T_1}$ and $S_{T_1}$ be the mean and the standard deviation of Sample 1 in data1, and $r=S_{TN_1}/(S_{T_1}S_{N_1})$, as defined in Lin and Stivers's MLE-based test under heteroscedasticity @lin. Similarly define $\bar{N_1}$ and $S_{N_1}$ for Sample 2 in data1. Let $\bar{T}$ and $S_T$ be the mean and the standard deviation of Sample 1 in data2, as defined in Kim et al.'s modified t-statistic @kim. Similarly define $\bar{N}$ and $S_N$ for Sample 2 in data3. Then, Ekbohm's MLE-based test under homoscedasticity @ekbohm is given by \begin{equation} Z_E=\frac{f^(\bar{T}_1-\bar{T})-g^(\bar{N_1}-\bar{N})+\bar{T}-\bar{N}}{\sqrt{V^*_1}}, \end{equation}

where

\begin{equation} V_1^*=\hat{\sigma}^2\frac{2n_1(1-r)+(n_2+n_3)(1-r^2)}{(n_1+n_2)(n_1+n_3)-n_2n_3r^2}, \end{equation}

\begin{equation} \hat{\sigma}^2=\frac{S_{T_1}^2(n_1-1)+S_{N_1}^2(n_1-1)+(1+r^2)[S_T^2(n_2-1)+S_N^2(n_3-1)]}{2(n_1-1)+(1+r^2)(n_2+n_3-2)}, \end{equation}

\begin{equation} f^*=n_1(n_1+n_3+n_2r)[(n_1+n_2)(n_1+n_3)-n_2n_3r^2]^{-1}, \end{equation}

\begin{equation} g^*=n_1(n_1+n_2+n_3r)[(n_1+n_2)(n_1+n_3)-n_2n_3r^2]^{-1}. \end{equation}

Note that the distribution of $Z_E$ approximately follows a $t$ distribution with $n_1$ degrees of freedom under the null hypothesis @kuan. Therefore, p-values and confidence intervals can be obtained from $Z_E$.

Discussion of Statistical Approaches for Partially Matched Samples

It is important to realize that Liptak's weighted Z-test @liptak is a p-values pooling strategy, and the rest of the methods illustrated in this manual @kim@looney@lin@ekbohm are based on modified test statistics.

P-Values Pooling Strategies vs. Modified Test Statistics Approaches for Partially Matched Samples

Note that the distribution of the modified test statistics methods discussed here follow either a standard normal distribution or a Student's t-distribution. Therefore, Kuan and Huang @kuan state that the above four modified test statistics approaches @kim@looney@lin@ekbohm may not be robust for nonnormal datasets that are small or moderate in size. Comparatively, because Liptak's weighted Z-test @liptak considers p-values directly, which enables a more robust choice of hypothesis testing procedures. For example, if at least one of the two independent samples, $n_2$ Sample 1 (the first samples of data2) and $n_3$ Sample 2 (the second samples of data3), is not normal, the two independent samples using the two-sample Wilcoxon rank sum test method is a more robust nonparametric approach, rather than the two-sample t-test method. To summarize, p-values pooling strategies can be used directly to any statistical tests for both paired and independent samples @kuan.

Additional Comments on Modified Test Statistics Methods

As the name of Kim et al.'s modified t-statistic @kim suggests, Kim et al.'s test statistic $t_3$ @kim is based on a modification of the standard t-test statistic. Looney and Jones's corrected Z-test statistic $Z_{corr}$ @looney is based on a modified variance estimation of the standard Z-test @kuan.
The standard matched t-test method and the the two-sample t-test method are special cases of Looney and Jones's corrected Z-test @looney when $n_2=n_3=0$ and $n_1=0$, respectively @kuan.
Lin and Stivers's @lin and Ekbohm's @ekbohm tests are based on a modified maximum likelihood estimator (MMLE). More precisely, suppose that Sample 1 and Sample 2 are bivariate normal. Both procedures consider the MMLE of the two samples' correlation mean difference @kuan. In particular, Lin and Stivers's @lin work extends to the nonconstant variance case for Sample 1 and Sample 2. Ekbohm's @ekbohm work concerns about the constant variance case.

Examples

Partially matched samples analysis examples using the pm dataset in the PMLi package are illustrated here. The common function arguments in the PMLi package are data, alternative, mu, and conf.level. Liptak's weighted Z-test @liptak has additional two parameters, w1 and w2, for the weights of the p-values. Various components will be outputted. More details can be found in the help files of the functions.

# Run the following command if you have questions about weighted.z() function.
# ?weighted.z()

# p-value is returned.
weighted.z(pm)$p.value

# Formula interface is outputted.
modified.t(pm)

# Test statistic is returned.
corrected.z(pm, "greater")$statistic

# Confidence interval is returned with specified arguments.
mle.hetero(pm, alternative = "less", conf.level = 0.90)$conf.int

# Formula interface is returned with specified arguments.
mle.homo(pm, alternative = "greater", mu = 0.01, conf.level = 0.99)

References

Garylikai/PMLi-1.0-R-Package documentation built on Dec. 28, 2021, 12:12 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Garylikai/PMLi-1.0-R-Package
AMS 597 Project Partially Matched Samples R Package

In Garylikai/PMLi-1.0-R-Package: AMS 597 Project Partially Matched Samples R Package

Package Info

Package Installation

More Background

Implementation Details for Special Cases Related to Partially Matched Samples^[Note that some of the code presented in this user manual has been simplified for illustration purposes. For the actual implementation, please refer to the source code files.]

0. Data Input

1. $n_1=0,\, n_2=0,\, n_3=0,\, n_4=0\, (n=0)$

2. $n_1=0,\, n_2=0,\, n_3=0,\, n_4>0\, (n>0)$

3. $n_1=0,\, n_2>0,\, n_3=0$

4. $n_1=0,\, n_2=0,\, n_3>0$

5. $n_1=0,\, n_2>0,\, n_3>0$

6. $n_1=n,\, n_2=0,\, n_3=0$

7. $n_1>0,\, n_2>0,\, n_3=0$

8. $n_1>0,\, n_2=0,\, n_3>0$

Statistical Analysis Methods for Partially Matched Samples^[For a complete statistical analysis strategies for partially matched samples, see Kuan and Huang's @kuan article here].

Liptak's Weighted Z-Test

Kim et al.'s Modified t-Statistic

Looney and Jones's Corrected Z-Test

Lin and Stivers's MLE-Based Test under Heteroscedasticity

Ekbohm's MLE-Based Test under Homoscedasticity

Discussion of Statistical Approaches for Partially Matched Samples

P-Values Pooling Strategies vs. Modified Test Statistics Approaches for Partially Matched Samples

Additional Comments on Modified Test Statistics Methods

Examples

References

R Package Documentation

Browse R Packages

We want your feedback!

Garylikai/PMLi-1.0-R-Package AMS 597 Project Partially Matched Samples R Package

In Garylikai/PMLi-1.0-R-Package: AMS 597 Project Partially Matched Samples R Package

Package Info

Package Installation

More Background

Implementation Details for Special Cases Related to Partially Matched Samples^[Note that some of the code presented in this user manual has been simplified for illustration purposes. For the actual implementation, please refer to the source code files.]

0. Data Input

1. $n_1=0,\, n_2=0,\, n_3=0,\, n_4=0\, (n=0)$

2. $n_1=0,\, n_2=0,\, n_3=0,\, n_4>0\, (n>0)$

3. $n_1=0,\, n_2>0,\, n_3=0$

4. $n_1=0,\, n_2=0,\, n_3>0$

5. $n_1=0,\, n_2>0,\, n_3>0$

6. $n_1=n,\, n_2=0,\, n_3=0$

7. $n_1>0,\, n_2>0,\, n_3=0$

8. $n_1>0,\, n_2=0,\, n_3>0$

Statistical Analysis Methods for Partially Matched Samples^[For a complete statistical analysis strategies for partially matched samples, see Kuan and Huang's @kuan article here].

Liptak's Weighted Z-Test

Kim et al.'s Modified t-Statistic

Looney and Jones's Corrected Z-Test

Lin and Stivers's MLE-Based Test under Heteroscedasticity

Ekbohm's MLE-Based Test under Homoscedasticity

Discussion of Statistical Approaches for Partially Matched Samples

P-Values Pooling Strategies vs. Modified Test Statistics Approaches for Partially Matched Samples

Additional Comments on Modified Test Statistics Methods

Examples

References

R Package Documentation

Browse R Packages

We want your feedback!

Garylikai/PMLi-1.0-R-Package
AMS 597 Project Partially Matched Samples R Package