knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Before you use this package you need to answer the following 2 questions:

  1. Do I want a randomized list of individuals or do I want the Top N observations for a list of individuals that I already have?
  2. If randomized, then use a match_* function.
  3. if top n, then use a top_n* function.
  4. Do you have data with only numeric variables, or does your data have both numeric and categorical?
  5. If numeric, then use a *_numeric function.
  6. If mixed, then use a *_mixed function.

So your choices for how you want to proceed are listed below:

Randomized Sample, match_* functions

R contains a crime data set for the all 50 states. This data set contains data on murder rates, assaults, urban population and the occurrences of rape. The TestContR can be used to match states that have similar crime rates.

library(dplyr)
library(TestContR)

match_numeric(): Random selection of test and control groups/individuals for numeric metrics/variables.

Numeric only dataframe:

df <- datasets::USArrests %>% dplyr::mutate(state = base::row.names(USArrests)) %>%
                               dplyr::select(state, everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))

Build Test and Control list:

# defaults to 10 obs for the test group with matching controls. Change the size of the test group w/ param "n".

set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_numeric(df) 

Results of random selection option:

knitr::kable(TEST_CONTROL_LIST)



Providing a list of Test Groups/Individuals (No randomization of the test group)

TEST_GRP <- tribble(~'TEST','Colorado','Minnesota','Florida','South Carolina')

Example of data frame for the "test_list" input parameter:

knitr::kable(TEST_GRP)
set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_numeric(df, test_list = TEST_GRP)

Results for the "test_list" input parameter:

knitr::kable(TEST_CONTROL_LIST)

match_mixed(): Random selection of test and control groups/individuals with mixed metrics/variables, meaning both numeric and categorical.

Numeric and categorical dataframe:

df <- datasets::USArrests %>% dplyr::mutate(state = base::row.names(datasets::USArrests)) %>%
  base::cbind(datasets::state.division) %>%
  dplyr::select(state, dplyr::everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))

Build Test and Control list from mixed metrics:

# defaults to 10 obs for the test group with matching controls. Change the size of the test group w/ param "n".

set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_mixed(df)

Results of random selection option:

knitr::kable(TEST_CONTROL_LIST)

Providing a list of Test Groups/Individuals (No randomization of the test group)

TEST_GRP <- tribble(~'TEST','Colorado','Minnesota','Florida','South Carolina')

Example of data frame for the "test_list" input parameter:

knitr::kable(TEST_GRP)
set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_mixed(df, test_list = TEST_GRP)

Results for the "test_list" input parameter:

knitr::kable(TEST_CONTROL_LIST)


Top N matches for individuals or groups, the topn_* functions

topn_numeric(): Select Top N Controls for a set of groups or individuals

Build/provide a list of the obs of interest in the test_list:

test_list <- tribble(~"TEST","Colorado")

Numeric only dataframe:

df <- datasets::USArrests %>% dplyr::mutate(state = base::row.names(USArrests)) %>%
                               dplyr::select(state, everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))

Build the list of Top N matches: Provide the test_list dataframe to the test_list parameter in the function as below.

TOPN_CONTROL_LIST <- TestContR::topn_numeric(df, topN = 10, test_list = test_list)

Results of Top N selection option:

knitr::kable(head(TOPN_CONTROL_LIST,20))

Top N without a Test List: Don't be concerned about the warning; I just wanted to let users know that it would use all the labels in the dataframe.

TOPN_CONTROL_LIST <- TestContR::topn_numeric(df, topN = 10)

Results of Top N selection without Test List:

knitr::kable(head(TOPN_CONTROL_LIST,20))



topN_mixed(): Random selection of test and control groups/individuals with mixed metrics/

Numeric and categorical dataframe:

df <- datasets::USArrests %>% dplyr::mutate(state = base::row.names(datasets::USArrests)) %>%
  base::cbind(datasets::state.division) %>%
  dplyr::select(state, dplyr::everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))

Build Test and Control list from mixed metrics:

set.seed(99)
TOPN_CONTROL_LIST <- TestContR::topn_mixed(df, topN = 10, test_list = test_list)

Results of Top N selection without Test List:

knitr::kable(head(TOPN_CONTROL_LIST,20))

Top N Mixed without a Test List Don't be concerned about the warning; I just wanted to let users know that it would use all the labels in the dataframe.

TOPN_CONTROL_LIST <- TestContR::topn_mixed(df, topN = 10)

Results of Top N selection without Test List:

knitr::kable(head(TOPN_CONTROL_LIST,20))


Conclusion

Depending on your experiment, it may be prudent to add categorical metrics/variables that will help align your data better. In the above examples, when only using the numerical data Alabama's nearest match is Louisiana, but once region is taken into consideration, Alabama's nearest match is Tennessee. Now you have the tools to create a list of nearest matches for your data whether it is numeric or mixed.





Fredo-XVII/TestContR documentation built on Nov. 5, 2022, 5:54 p.m.