knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(CPTtools)

Conditional Probabilities over Discrete variables.

In a discrete Bayesian network, each node, $Y$, has an associated conditional probability table (CPT). Let $X_1, \ldots, X_K$ be the parent nodes, each of which has $|X_k|$ states. The conditional probability distribution $\Pr(Y|X_1=x_1,\ldots,X_K=x_K)$ is a discrete distribution over the $|Y|=M$ states of $Y$. Note that there are $|X_1|\times\cdots\times|X_K|=S$ possible configurations of the parent variables, so the conditional probability distribution is actually a set of $S$ probability distributions. Stacking them as rows produces an $S\times M$ matrix, which is the CPT. If there are no parents, the unconditional probability table consists of a single row ($S=1$).
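
To make the dimensions concrete, here is a minimal base-R sketch of a CPT laid out as an $S\times M$ matrix. The parent names (`X1`, `X2`), the state labels, and the probabilities are all invented for illustration; nothing here uses the CPTtools API.

```r
## Hypothetical parents X1 (3 states) and X2 (2 states); child Y with M = 2 states.
parents <- list(X1 = c("High", "Medium", "Low"), X2 = c("Yes", "No"))
childStates <- c("Correct", "Incorrect")

## One row per configuration of the parent variables.
configs <- expand.grid(parents)
S <- nrow(configs)          # 3 * 2 = 6 configurations
M <- length(childStates)    # 2 child states

## Illustrative probabilities of the first child state for each configuration.
pCorrect <- c(0.9, 0.7, 0.5, 0.6, 0.4, 0.2)
cpt <- cbind(pCorrect, 1 - pCorrect)
colnames(cpt) <- childStates

dim(cpt)       # S x M
rowSums(cpt)   # each row is a probability distribution
```

Each row of `cpt` pairs with the same row of `configs`, which is the sense in which the conditional distribution is a set of $S$ distributions.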

The CPTtools package offers two ways of representing conditional probability distributions, the CPF (a data frame) and the CPA (an array), as well as a number of tools for manipulating them.

Note that the contents of the CPF and CPA are not constrained to be probability distributions (i.e., each row need not sum to one). In particular, contingency tables (tables of counts of cases occurring in various configurations) are the natural conjugate of the CPT and are often used as data. These can also be stored in the CPF and CPA classes.

CPF

The class "CPF" is a subclass of data.frame; usually, CPF objects have class c("CPF","data.frame"). This means that operations which operate on data frames should do something sensible with CPFs. Note that the columns representing the parent variable states must be of class factor(), which might require an explicit call to factor() or as.factor(), if the values are character. The function as.CPF() coerces an object to be a CPF and is.CPF() tests whether or not it is a CPF.

# Note: since R 4.0, data.frame() no longer converts strings to factors
# by default, so the factor() calls are required.
arf <- data.frame(A=factor(rep(c("a1","a2"),each=3)),
                  B=factor(rep(c("b1","b2","b3"),2)),
                  C.c1=1:6, C.c2=7:12, C.c3=13:18, C.c4=19:24)
arf <- as.CPF(arf)
arf

Note that by convention, the names of the columns for the parent variables are the names of the parent variables, and the names of the numeric columns have the format "childname.statename".
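
As a small illustration of this naming convention, the base-R sketch below splits a set of column names into parent names, the child name, and the child state names. It assumes that parent names contain no "." and is only a reading of the convention, not the way the package itself parses names.

```r
## Hypothetical column names following the "childname.statename" convention.
cols <- c("A", "B", "C.c1", "C.c2", "C.c3", "C.c4")

## Assumption: only the numeric (child) columns contain a ".".
isNumericCol <- grepl(".", cols, fixed = TRUE)
parentNames <- cols[!isNumericCol]

## Child name is the text before the first "."; state name is the text after it.
childName <- unique(sub("\\..*$", "", cols[isNumericCol]))
stateNames <- sub("^[^.]*\\.", "", cols[isNumericCol])

parentNames
childName
stateNames
```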

Graphics

The CPTtools package supplies a method for the lattice::barchart() generic function for CPFs (barchart.CPF()). Each conditional probability distribution is represented by a separate bar using color intensity to indicate the states. Here are some examples.

First, set up some information about the variables, in particular, the lists of states for each variable.

## Set up variables
skill1l <- c("High","Medium","Low") 
skill2l <- c("High","Medium","Low","LowerYet") 
correctL <- c("Correct","Incorrect") 
pcreditL <- c("Full","Partial","None")
gradeL <- c("A","B","C","D","E") 

Next, generate some bar charts illustrating the method. We are using the function CPTtools::calcDPCFrame() to build the CPF objects. (This function is described more fully in the vignette DPCModels.Rmd.)

cpfTheta <- calcDPCFrame(list(),skill1l,numeric(),0,rule="Compensatory",
                         link="normalLink",linkScale=.5)

barchart.CPF(cpfTheta)
cptComp <- calcDPCFrame(list(S2=skill2l,S1=skill1l),correctL,
                        lnAlphas=log(c(1.2,.8)), betas=0,
                        rule="Compensatory")
barchart.CPF(cptComp,layout=c(3,1))
cptPC1 <- calcDPCFrame(list(S1=skill1l,S2=skill2l),pcreditL,
                       lnAlphas=log(1),
                       betas=list(full=c(S1=0,S2=999),partial=c(S1=999,S2=0)),
                       rule="OffsetDisjunctive")
barchart.CPF(cptPC1,baseCol="slateblue")

CPA

The class "CPA" is a subclass of array; usually, CPA objects have class c("CPA","array"). This means that operations which operate on arrays should do something sensible with CPAs. All of the entries in the CPA are numeric; the names of the parent variables and the state labels are given in the dimnames() of the array. The function as.CPA() coerces an object to be a CPA and is.CPA() tests whether or not it is a CPA.

arr <- array(1:24,c(2,3,4),
             dimnames=list(A=c("a1","a2"),B=c("b1","b2","b3"),
                           C=c("c1","c2","c3","c4")))
arr <- as.CPA(arr)
arr

Note that as.CPF() and as.CPA() can be used to freely convert between the two formats:

cat("The dimensions of this CPA are ",
    paste(dim(as.CPA(arf)),collapse=" x "),
    ".\n")
print(as.CPF(arr))

Accessing data and metadata

A properly labeled CPF contains metadata about the names of the parent and child variables. The functions getTableParents() and getTableStates() get some of that metadata. The function numericPart() strips out the metadata and leaves just the numeric values; factorPart() strips out the numeric values and leaves the parent state configurations.

getTableStates(arf)
getTableParents(arf)
numericPart(arf)
factorPart(arf)

Hyperdirichlet distribution

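CPFs and CPAs can be used to store three kinds of mathematical objects: conditional probability tables, contingency tables (counts of cases), and the parameters of hyperdirichlet distributions.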

A conditional probability table corresponds to a conditional multinomial distribution, or contingency table. Each row of the CPT gives the probability for the categories in the corresponding row of the contingency table. The sum of each row of the contingency table depends on how often that configuration of parent states occurs in the data.

The Dirichlet distribution is the natural conjugate of the multinomial distribution. The hyperdirichlet distribution is a series of independent Dirichlet distributions, one for each row of the contingency table. The name comes from Spiegelhalter and Lauritzen (2000), where they use it to refer to an entire Bayesian network in which every CPT is parameterized in this way. In CPTtools, the term is used for any CPT parameterized in this way, and the package deliberately allows some CPTs to have hyperdirichlet distributions and others to use parametric models (see DPCmodels.Rmd).

Because of the conjugacy, there are two important relationships with hyperdirichlet models. First, the expected conditional probability table can be found from the hyperdirichlet parameters by dividing each row by its sum (normalizing the table). Second, if the prior distribution parameters are given in a CPF and the data (contingency table) are given in a CPF as well, then the posterior parameters will be the CPF produced by adding the numeric parts of the two CPFs.
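
Both relationships can be sketched in base R on plain matrices. The prior parameters and counts below are hypothetical, and the matrices stand in for the numeric parts of two CPFs with the same row and column layout; this is an illustration of the arithmetic rather than package code.

```r
## Hypothetical prior hyperdirichlet parameters: one row per parent configuration.
prior <- matrix(c(2, 1,
                  1, 1,
                  1, 2), nrow = 3, byrow = TRUE,
                dimnames = list(c("High", "Medium", "Low"),
                                c("Correct", "Incorrect")))

## Hypothetical contingency table of observed counts, same layout as the prior.
obsCounts <- matrix(c(8, 2,
                      5, 5,
                      1, 9), nrow = 3, byrow = TRUE,
                    dimnames = dimnames(prior))

## Conjugate update: posterior parameters are prior parameters plus counts.
posterior <- prior + obsCounts

## Expected CPT: divide each row of the parameters by its row sum.
expectedCPT <- posterior / rowSums(posterior)
expectedCPT
```

Dividing a matrix by `rowSums()` works here because R recycles the length-3 vector down each column, so every entry is divided by the sum of its own row.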

Scaling and Normalization

If the rows of a CPF represent a probability simplex, they should all be non-negative and sum to 1. Often it is convenient to force a set of numbers into a probability simplex by simply dividing by the sum. The normalize() generic function attempts to do this. It operates on CPFs and CPAs in a way consistent with their conditional probability distribution. It will also operate on a generic array or matrix (normalizing the last dimension) or data.frame (normalizing rows but ignoring non-numeric columns).

normalize(arf)
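
For a plain array, "normalizing the last dimension" can be read as follows. This base-R sketch uses a hypothetical 2 x 3 x 4 array and shows the arithmetic only; it is not the package's implementation of normalize().

```r
## Hypothetical 2 x 3 x 4 array; the last dimension indexes the child states.
demoArr <- array(1:24, c(2, 3, 4))

## Total over the last dimension for each (i, j) parent configuration.
totals <- apply(demoArr, c(1, 2), sum)

## Divide every slice along the last dimension by those totals.
demoNormed <- sweep(demoArr, c(1, 2), totals, "/")

## After normalizing, the sums over the last dimension are all 1.
apply(demoNormed, c(1, 2), sum)
```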

Dividing each row by its sum is one way we can rescale a table. However, there are other reasons we might want to multiply each row by a constant as well. One way to store contingency table data is to store a probability vector in each row of the CPF and a separate vector of weights to represent the sample size of each row. The function rescaleTable() rescales the table by the specified factor. normalizeTable() rescales the table by the row sums, and hence is equivalent to normalize.CPF().

arf1 <- data.frame(A=factor(rep(c("a1","a2"),each=3)),
                   B=factor(rep(c("b1","b2","b3"),2)),
                   C.c1=rep(1,6), C.c2=rep(1,6), C.c3=rep(1,6),
                   C.c4=rep(1,6))
arf1

rescaleTable(arf1,1:6)

normalizeTable(arf1)
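
The relationship between a probability table plus weights and a table of counts can also be sketched in base R. The probabilities and sample sizes below are hypothetical, and the code works on plain matrices rather than calling rescaleTable() or normalizeTable().

```r
## Hypothetical probability rows and per-row sample sizes (weights).
probs <- matrix(c(0.5, 0.5,
                  0.2, 0.8), nrow = 2, byrow = TRUE)
weights <- c(10, 50)

## Multiplying each row by its weight gives the expected counts.
countsFromProbs <- sweep(probs, 1, weights, "*")

## Dividing each row by its sum recovers the probabilities.
backToProbs <- countsFromProbs / rowSums(countsFromProbs)

countsFromProbs
backToProbs
```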

Generating Contingency Tables

As mentioned previously, a contingency table is the natural conjugate of the hyperdirichlet distribution. The function dataTable() can be used to construct contingency tables from data.

## State names
skill1l <- c("High","Medium","Low") 
skill3l <- c("High","Better","Medium","Worse","Low") 
correctL <- c("Correct","Incorrect") 

## Read data from file
x <- read.csv(system.file("testFiles", "randomPinned100.csv",
                          package="CPTtools"),
            header=FALSE, as.is=TRUE,
            col.names = c("Skill1", "Skill2", "Skill3",
                          "Comp.Correct", "Comp.Grade",
                          "Conj.Correct", "Conj.Grade",
                          "Cor.Correct", "Cor.Grade",
                          "Dis.Correct", "Dis.Grade",
                          "Inhib.Correct", "Inhib.Grade"
                          ))
## Force variables to be ordered categories
x[,"Skill1"] <- ordered(x[,"Skill1"],skill1l)
x[,"Skill3"] <- ordered(x[,"Skill3"],skill3l)
x[,"Comp.Correct"] <- ordered(x[,"Comp.Correct"],correctL)


tab <- dataTable(x, c("Skill1","Skill3"),"Comp.Correct",correctL)

## Tab is just the numeric part, so use expand.grid to generate
## labels.
data.frame(expand.grid(list(Skill1=skill1l,Skill3=skill3l)),tab)
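
For comparison, the same kind of cross-tabulation can be sketched with base R's table(). The data frame below is hypothetical, with one parent and one child, and this is an analogy to what a contingency table contains rather than a description of how dataTable() works internally.

```r
## Hypothetical observations: one parent (Skill) and one child (Correct).
obs <- data.frame(
  Skill = factor(c("High", "High", "Low", "Low", "Low"),
                 levels = c("High", "Low")),
  Correct = factor(c("Correct", "Correct", "Correct", "Incorrect", "Incorrect"),
                   levels = c("Correct", "Incorrect"))
)

## Rows are parent configurations, columns are child states, cells are counts.
tabSketch <- table(obs$Skill, obs$Correct)
tabSketch
```

Declaring the factor levels explicitly keeps configurations with zero observed cases in the table, which matters when the counts are later added to prior parameters of the same shape.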

Acknowledgements

Work on RNetica, CPTtools and Peanut has been sponsored in part by the following grants:



ralmond/CPTtools documentation built on Dec. 27, 2024, 7:15 a.m.