Educational Data Synthesizer

Installation

library(devtools)
install_github('thtrieu/edmsyn')

Intro

For an informal introduction, read Project debut - R - Educational Data Synthesizer

Full vignettes

Please refer to README.pdf for the fully compiled code. This vignette consists of two parts, which can be read separately in /vignettes.

Introduction

In general, generating data from a given set of parameters under a specified model's assumptions is straightforward, as long as the following two conditions are satisfied:

  1. Sufficiency (condition 1): the set of parameters provides enough information to generate data under the model.

  2. Consistency (condition 2): no two parameters in the set carry conflicting information.

However, checking these conditions is not always straightforward. Difficulties arise in the Educational Data Mining literature, where parameters are usually organised in a complex manner. Some of them are shared between different models (e.g. average success rate, student variance, number of concepts, etc.), and the collection of all parameters can be arranged into a hierarchical structure, in the sense that some parameters can be used to generate others, as opposed to a set of parameters that directly generates data.

For example, with Non-negative Matrix Factorization models, the Q-matrix and the Skill matrix are parameters that directly generate data, while these two can themselves be generated from lower-level parameters such as the number of students, the number of items, or the number of concepts. A sufficient set of parameters to generate data in this case could be {Q-matrix, Skill matrix} or {Q-matrix, number of students}. For the former set to be consistent, the number of columns of the Q-matrix must equal the number of rows of the Skill matrix, because both represent the number of concepts.
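
To make the consistency requirement concrete, here is a minimal sketch in plain R (not edmsyn code; the matrices and dimensions are made up for illustration):

# Q: items x concepts, S: concepts x students -- the shared dimension
# (number of concepts) must agree for the pair to be consistent.
Q <- matrix(sample(0:1, 20 * 5, replace = TRUE), nrow = 20, ncol = 5)
S <- matrix(runif(5 * 15), nrow = 5, ncol = 15)
stopifnot(ncol(Q) == nrow(S))  # condition 2 for the set {Q-matrix, Skill matrix}
R <- Q %*% S                   # items x students product underlying NMF-style generation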

One can conclude from the above observation that there are many different ways to generate data, depending on which model is being used and which parameters are given (assuming that they all satisfy condition 1 and condition 2). At this point it is important to stress that different sufficient and consistent sets of parameters, even with respect to a single model M, must be expected to produce different amounts of variance in the generated data. The reason is that a set of parameters at a lower level requires more intermediate generating steps before the final one (generating data), and thus produces more variability in the corresponding generated data.

The R package edmsyn provides users with a simple framework that conveniently handles all of the situations above while generating data, including checking for condition 1 and condition 2. It also automates the process of learning parameters from raw data, then modifying and using this information to create synthetic data from 10 different models. This document is intended to give a quick and thorough tutorial on using edmsyn.

Loading the package

library(edmsyn)

The vector edmconst$ALL.MODELS, loaded when the package is attached, contains the names of all available models.

ALL.MODELS <- edmconst$ALL.MODELS
ALL.MODELS

Class context and function pars()

Since some of the parameters are shared between different models, edmsyn does not treat the collections of parameters of the 10 models separately, but jointly, as different aspects describing a single instance of reality. For example, the vector of item discrimination factors is a parameter belonging to the IRT-2pl model, while the Q-matrix is not; nevertheless, edmsyn introduces the class context, in which these two parameters can co-exist in a single object and be utilized according to the user's purpose.

Function pars() produces an object of class context.

p <- pars(students = 15, items = 20)
class(p)

After the assignment p <- pars(students = 15, items = 20), there is currently no information in the context p other than students, items, and default.vals.

print(p)

p, a context object, is essentially a list with different components. default.vals is the component that is always available (activated) in any context object; it is an environment containing the default values for some of the parameters. Whenever one of these parameters is needed without user-indicated input, the corresponding value will be fetched from default.vals and loaded into the context.

class(p$default.vals)
names(p$default.vals)
p$default.vals$avg.success
p$avg.success

At some point later, if data need to be generated from p by a model that requires avg.success and this parameter has not been supplied by the user, the default value of 0.5 will be used.

We can change these default values by using the default function

p_ <- pars(default.vals = default(avg.success = 0.6))
p_$default.vals$avg.success

Function pars can also be used to update an existing context. For example, we can change the number of students in p from 15 to 20, add the number of concepts to p, and delete the number of items.

p <- pars(p, concepts = 5, students = 20, items = NULL)
p
# Recover items = 20 for later use
p <- pars(p, items = 20)
p

pars also automatically figures out obtainable parameters at lower level whenever a parameter is activated. For example, dimensions of M imply the number of concepts and the number of students. Thus, if a context is supplied with M, parameters concepts and students (and some other obtainable parameters) are activated.

p_ <- pars(M = matrix(1, 7, 30))
p_
p_$concepts

While assembling input parameters from users, pars will check for their consistency (condition 2) ^[Currently edmsyn is able to detect whether a parameter is receiving different values, or if it violates some bounds indicated by users]. Since p is a context where there are 20 question items, a Q-matrix with 30 rows is incompatible.

p <- pars(p, Q = matrix(1,30,5))

In short, the introduction of class context and function pars is mostly for convenience in the following cases:

Get the value of a parameter from a context by get.par

We can access components of a context object just as we access components of a list. edmsyn provides an alternative approach via the function get.par.

p$students
get.par("students", p)

The advantage of get.par is that it returns a result even when the parameter is not yet available, as long as the parameter can be generated. For example, if we want to produce the value of the skill mastery matrix (M) from p, the conventional operator $ will return NULL, while get.par finds a way to generate M from concepts and students. Set the argument progress = TRUE to see how get.par does it.

p$M
M_ <- get.par("M", p, progress = TRUE)
M_

As can be seen, the progress involves generating an intermediate parameter called concept.exp before generating M. concept.exp is a vector of the expected mastery rates for 5 concepts in p. By using get.par, we can examine the values of these intermediate generations.

M_$context$concept.exp

For example, looking at the above concept.exp, we expect about r round(M_$context$concept.exp[1]*100) percent of the students to have mastered the first concept in p. In fact, because many parameters can co-exist in a context, generating concept.exp is just one amongst many ways by which get.par can reach M; the reason why this particular way is chosen will be discussed in later parts.

If we run get.par to obtain M again, in general the result will be different. This is because each run carries out another probabilistic generation, so we receive a different concept.exp, and therefore a different M.

identical(M_$value, get.par("M", p)$value)

get.par can detect whether there is enough information to generate the required parameter. For example, the Q-matrix needs the number of concepts to be defined; without the number of concepts, there is no way the Q-matrix can be derived.

p_ <- pars(students = 20, items = 15)
get.par("Q", p_)

One way to fix this is to supply M to the context, which essentially contains the number of concepts. With this, the process becomes possible.

p_ <- pars(p_, M = matrix(1, 4, 20))
get.par("Q", p_)

Generate data from a context using gen

Parameters activated in a single context are guaranteed not to conflict, since they were put together by pars (condition 2). If a context also satisfies condition 1, there is a way to generate data from it. edmsyn provides the function gen, which checks for condition 1 and generates data with the specified model. For example, the following code generates data from p with the POKS model (Partial Order Knowledge Structure) twice:

poks.data <- gen("poks", p, n = 2, progress = TRUE)
poks.data

As can be seen, gen returns a list of two components, corresponding to the two generations requested by n = 2. Each of the two is a context: edmsyn views the generated data as part of the context. In fact, we can consider data to be a parameter at the highest level. In each generated context, there is a component named after the specified model, poks; this component contains the generated data.

poks.data[[1]]$poks

Looking at the result, the poks component is a list with component R being the response matrix, and other components representing other information that is needed for the learning process. This point will be explained in later sections.

Since data is considered to be just another parameter in a context, we can actually use get.par to generate data. In fact, gen is simply a wrapper of get.par with an additional argument n, where users can specify how many times the process should be repeated.

get.par("poks", p)

Following are two more examples of using gen. In the first, gen generates data with the POKS model again, but from a different context where the user has more control over the Partial Order structure of the items. In the second, gen generates data with the DINA model (Deterministic Input Noisy And model).

Example 1

Suppose this time we want the Partial Order structure of the items to have two connected components, each with height 3 and no transitive links between items; this can be done by updating p with some more parameters. To visualize this structure, whose dependency matrix will be the component po in the context returned by gen, we can use the function viz.

p <- pars(p, min.depth = 3, max.depth = 3, min.ntree = 2, max.ntree = 2, trans = FALSE)
poks.data <- gen("poks", p)
v <- viz(poks.data$po)

v contains analysed information about the structure. Its first component is identical to po; the second is a list with as many components as there are connected components in the structure that po represents. Each of these is in turn a list whose first component is the corresponding dependency matrix and whose second component records how many items there are at each level of depth.

print(v)

Example 2

dina.data <- gen("dina", p, progress = TRUE)
dina.data

As reported by gen, unlike the previous case, M is now generated from the three parameters students, skill.space, and skill.dist instead of concept.exp and students. The reason is that skill.space and skill.dist are formal parameters of the DINA model, so they should be used in the process of generating data. Again, we can access this data through the component dina of the generated context. Besides the response matrix, dina.data$dina also has another component Q; this is because, from DINA's point of view, a response matrix without its corresponding Q-matrix is incomplete.

dina.data$dina

gen only allows one model and one context at a time; we can save time generating data across different models and contexts using gen.apply. Setting the argument multiply to TRUE or FALSE decides what kind of matching is made between models and contexts.

dat.1 <- gen.apply(ALL.MODELS, list(p1 = p,p2 = p_), multiply = FALSE, n = 5)
dat.1
dat.1["dino.p1", 3]
dat.2 <- gen.apply(ALL.MODELS, list(p1 = p,p2 = p_), multiply = TRUE, n = 5)
dat.2
dat.2["nmf.com", "p2"]

Learning the most probable context from data using learn

Let's say we want to get the third data generation from the matching between context p1 and model poks, and use POKS model to learn from this data.

poks.data <- dat.2["poks", "p1"][[1]][[3]]
poks.data
poks.data$poks
learn.poks <- learn("poks", data = poks.data$poks)
learn.poks
learn.poks$po

If we want to learn from this same data using the DINA model, poks.data$poks cannot be used because the components p.min, alpha.p, and alpha.c are meaningless to DINA. To do the task, one needs to hand-design this data. DINA requires one additional component besides the response matrix: the Q-matrix. Normally the Q-matrix is expected to be expert-defined; however, this illustration will simply generate it randomly.

Q <- get.par("Q", p)$value
R <- poks.data$poks$R
dina.data <- list(R=R,Q=Q)
learn.dina <- learn("dina", data = dina.data)
learn.dina

Here we have a look at two parameters skill.space and skill.dist

learn.dina$skill.space
learn.dina$skill.dist

Generate synthetic data using syn

Generating synthetic data includes three steps (a rough manual sketch follows the list below):

  1. Learn the most probable context from given data with a specified model.

  2. Modify the learned context by keeping some of the parameters, changing some and discarding the rest.

  3. Generate data from the new context.
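
To make this concrete, here is a rough manual equivalent of the three steps for the DINA data above (a sketch only; the parameters kept here mirror the keep.pars example shown later, while syn chooses its defaults via edmconst$KEEP):

learned <- learn("dina", data = dina.data)        # step 1: learn a context from data
keep.Q  <- get.par("Q", learned)$value            # step 2: keep Q and concept.exp,
keep.ce <- get.par("concept.exp", learned)$value  #         set a new number of students
new.ctx <- pars(Q = keep.Q, concept.exp = keep.ce, students = 12)
dina.manual <- gen("dina", new.ctx, n = 10)       # step 3: generate from the new context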

edmsyn provides the function syn, which automates the above process. Specifically, syn consists of three parts:

  1. Learn the most probable context by using learn.

  2. Keep some parameters (the default choice is stored in edmconst$KEEP^[edmconst$KEEP is designed in such a way that, with any new value the user supplies for students, the new context is still consistent and sufficient; in this sense, syn generates synthetic data by creating simulated students.]) and discard the rest, also allowing the user to change the parameter students.

edmconst$KEEP
  3. Generate synthetic data from this new context with gen.

Now we synthesize dina.data with a new number of students.

dina.syn <- syn("dina", data = dina.data, students = 12, n = 10)
dina.syn$synthetic[[5]]$dina

In case the default option is not favoured, syn also allows users to manually specify which parameters to keep through argument keep.pars.

dina.syn <- syn("dina", data = dina.data, keep.pars = c("Q", "concept.exp"), students = 12)

However, in this case users are on their own if the kept parameters (together with the new number of students, if students is redefined) form an inconsistent or insufficient set with respect to the specified model. For example, when synthesizing data with the DINA model, if we choose to keep M (which essentially retains the number of students) and also define a new number of students, there will be a conflict.

dina.syn <- syn("dina", data = dina.data, keep.pars = c("Q", "M"), students = 12)

Modifying edmsyn

edmsyn comes with a pre-defined set of parameters and relationships among them. These relationships are rules that help edmsyn derive values for one or more parameters from some others. Specifically, these rules are represented as functions in the package. For example, a function that takes two integers and randomly produces a binary matrix with those two integers as its dimensions can be used as the rule to derive M (the skill mastery matrix) from students and concepts (see the sketch below). Rules that derive values for "data parameters" such as poks, dina, or dino encode the POKS, DINA, and DINO models respectively. Similarly, rules that derive values in the opposite direction encode the corresponding learning algorithms.
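
For instance, a hypothetical rule of the kind just described might look like this (an illustration, not the package's built-in implementation):

# Derive a random binary skill mastery matrix M (concepts x students)
# from the two integers students and concepts.
rule.M.from.counts <- function(students, concepts) {
  matrix(sample(0:1, concepts * students, replace = TRUE),
         nrow = concepts, ncol = students)
}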

The choice of built-in parameters, models, and learning algorithms was made at development time and thus may or may not satisfy users' needs. That is why edmsyn also comes with a set of tools that allow users to re-define all of these components, to the extent of building a whole new set of parameters and models, while still retaining all the original benefits the package offers. All modifications to edmsyn are made on a single graphical structure that edmsyn creates at loading time, with vertices representing parameters and edges encoding their relationships.

Functions that allow these modifications have names starting with edmtree. There are just a few of them and they provide everything you need to work with edmsyn at the internal level. This tutorial also walks through their building blocks to give an in-depth understanding of the edmsyn mechanism. By the end of this section, you should be able to handle the edmtree. functions properly and efficiently.

If you do not feel like going that deep, feel free to skip to A toy model at your own risk. Reading A toy model before going through the whole tutorial is also a good option.

Fetching a node using edmtree.fetch

Firstly, let's start with the skill mastery matrix M

M.node <- edmtree.fetch('M')
class(M.node)
names(M.node)

The representation of M in edmsyn is essentially a list with four components: tell, f.tell, gen, and f.gen. The first one, tell, is the set of names of parameters that receive information when the value of M is known.

M.node$tell

In this case, the value of M tells edmsyn the values of concepts, students and concept.exp (expected mastery rate for each concept). This is quite straightforward since concepts, students, and concept.exp are respectively the row dimension, column dimension, and row means of M. At this point, it is reasonable to look at the third component f.tell

M.node$f.tell

The third component, f.tell, is the function that derives the value of each parameter listed in tell from the value of M. As can be seen, this function does exactly what we expect: it takes the row dimension, column dimension, and row means of its input M and assembles these results into a list as its return value.
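
A sketch consistent with that description (the packaged f.tell may differ in details) could be:

f.tell.M.sketch <- function(M) {
  list(nrow(M),      # concepts: row dimension
       ncol(M),      # students: column dimension
       rowMeans(M))  # concept.exp: expected mastery rate per concept
}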

The second component, gen, is a list of generating methods for M.

M.node$gen

In this example, edmsyn knows that there are three different methods to reach M: using (S), (students, skill.space, skill.dist), or (students, concept.exp). If the generating process of M is determined to use S (the skill matrix, i.e. the probabilistic version of M), then one way to proceed is to round S element-wise to obtain M. In fact, this is the method chosen in edmsyn; let's have a look at the last component.

M.node$f.gen

As can be seen, f.gen is a list of three functions corresponding to the three generating methods in gen, and the first one indeed rounds its input to produce its return value. The details of these functions will be discussed in a later part; at this point it is sufficient to understand how a parameter is represented in edmsyn and which tasks that representation is trying to achieve.

From this point on, all f.tell functions will be referred to as type-1 connections. Similarly, the collections of functions in f.gen will be referred to as type-2 connections. These are the two types of connections along which all data flow. The main benefit edmsyn offers is a convenient interface that allows easy access to and control over these data flows.

Replacing using edmtree.replace

Replacing type-2 connections

Suppose you are not satisfied with the current generation method for M from S (rounding) and want a fully probabilistic implementation instead. To do this, the first function in f.gen should be replaced. First, we design a function that takes one input, the matrix S, samples binary values from its entries, and returns the value of M as follows:

new.gen.S.to.M <- function(S){
  M <- matrix(0, nrow(S), ncol(S))
  for (row in 1:nrow(S))
    for (col in 1:ncol(S)){
      p <- S[row,col]
      M[row, col] <- sample(x = 0:1, 1, prob = c(1 - p, p))
    }
  return(M)
}

Next, this function is replaced into the internal structure of M

edmtree.replace.gen('M','S',new.gen.S.to.M)
M.node <- edmtree.fetch('M')
M.node$f.gen

Now the first function in f.gen has changed. But apparently it is not the original new.gen.S.to.M that we designed; the reason is that edmsyn has wrapped this function inside one or more layers to make sure it fits into the internal working environment. We can check whether the replacement was carried out successfully by simply generating M from S several times: if the result is different each time, the probabilistic version has been installed successfully.

S <- matrix(runif(15),3,5)
p <- pars(S = S)
M1 <- get.par('M',p)$value
M2 <- get.par('M',p)$value
# This is very unlikely to be TRUE
identical(M1, M2)
# Let's try something else
M = matrix(0, 3, 5)
big.number = 10000
for (i in 1:big.number)
  M = M + get.par('M',p)$value
# This is likely to be TRUE
identical(round(M/big.number), round(S))

Similarly, we can change the functions of the other generating methods by simply creating a new function for each and replacing them in the structure.

# Examples of changing generating methods
new.f.gen.2 <- function(students, skill.space, skill.dist){
  # Evaluate M here
  return(value.of.M)
}
edmtree.replace.gen('M', M.node$gen[[2]], new.f.gen.2)

As long as the number and order of arguments match what is defined in the corresponding gen, the replacement will succeed. Otherwise, an error will be detected either at replacement time or at run time, depending on the error.

# Some more playing around with replacement
M.node$gen[[3]]
new.gen.3 <- rev(M.node$gen[[3]])
new.gen.3
# Notice that the order of arguments of new.f.gen.3 below
# matches the order of new.gen.3
new.f.gen.3 <- function(concept.expectation, num.of.students){
  # evaluate M here
  return(value.of.M)
}
# Successful
edmtree.replace.gen('M', new.gen.3, new.f.gen.3)

Below are some examples where edmtree.replace fails to execute.

# Unsuccessful examples:
# new f.gen is supposed to have 3 arguments as M.node$gen[[2]] has length 3
edmtree.replace.gen('M', M.node$gen[[2]], function(arg1, arg2) {
  return(NULL) # does not matter
})
# (items, poks) is not an existing method
edmtree.replace.gen('M', c('items', 'poks'), function(arg1, arg2){
  return(NULL) # does not matter
})

Replacing type-1 connections

Besides edmtree.replace.gen, edmsyn also provides edmtree.replace.tell, with which users can modify tell and f.tell in flexible ways. Knowing the value of S, it is straightforward to infer the values of students and concepts, which are the number of columns and rows respectively. As a toy example, let's swap the inference of students and concepts (and thus produce an incorrect result).

S.node <- edmtree.fetch('S')
S.node$tell
p <- pars(S = S)
print(dim(S))
# this would give correct result
c(p$concepts, p$students)
# Now let's alter f.tell so that concepts
# is the number of columns of S and students
# is the number of rows of S.
edmtree.replace.tell('S',c('concepts','students'), function(S){
  return(list( ncol(S), nrow(S) ))
})
p <- pars(S=S)

The reason why pars fails to execute is that not only are concepts and students inferred from S, but S.con.exp, a vector of length concepts, is also inferred (correctly). This causes a conflict that edmsyn is able to detect. One way to force the incorrect result through is as follows:

edmtree.replace.tell('S',c('S.con.exp','concepts','students'), function(S){
  return(list( colMeans(S), ncol(S), nrow(S) ))
})
p <- pars(S=S)
print(dim(S))
# Now we successfully produce a wrong result.
c(p$concepts, p$students)

There are two important points to make here. Firstly, it is dangerous to install such incorrect calculations into the edmsyn structure: as we have seen above, these faulty implementations may take a long time (sometimes forever) to be detected after the replacement, and your application may operate incorrectly for many steps without anyone noticing. Examine your changes carefully before you decide to alter edmsyn's built-in functions.

Secondly, the replacement we did above is called a partial replacement because not all parameters in tell are modified. Unlike many other classes of objects in R, f.tell, which is essentially an R function, is inherently not modular, meaning it cannot be modified partially. So how did edmsyn finish the task above? It simply executes the old function and then the new one, after which the new results replace the old ones at the appropriate positions in the returned list. This may involve a great deal of redundant calculation, so using edmtree.replace.tell to modify only a subset of tell is strongly discouraged. In other words, whenever possible, use edmtree.replace.tell with the argument tell being a set identical to S$tell.
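
Conceptually, the composition could be sketched as below (an assumption about the mechanism, not edmsyn's actual internals):

compose.partial.tell <- function(old.tell, old.f.tell, new.tell, new.f.tell) {
  function(x) {
    res <- old.f.tell(x)                             # full, partly redundant computation
    res[match(new.tell, old.tell)] <- new.f.tell(x)  # overwrite the replaced positions
    res
  }
}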

To proceed normally, let's recover the original correct version of f.tell inside S

edmtree.replace.tell('S', S.node$tell, S.node$f.tell)
p <- pars(S=S)
dim(S)
c(p$concepts, p$students)

Naive replacing with edmtree.replace

The previous replacing functions are smart because they are flexible in terms of partial replacement: they retain the parts that are not declared in their function call and properly handle the integration of new parts with old ones.

If you wish to simply throw away some parts of the node and substitute new ones (in other words, you do not care about retaining any part of the existing functions, whether or not they would be useful to reuse), use edmtree.replace. See some examples below.

# replacing tell
edmtree.replace('S', tell = new.tell)
# replacing tell and f.gen
edmtree.replace('S', tell = new.tell, f.gen = new.f.gen)
# replacing gen, f.gen and f.tell
edmtree.replace('S', f.gen = new.f.gen, f.tell = new.f.tell, gen = new.gen)

Removing using edmtree.remove

Removing type-2 connections

Let's start with a context where students and concepts are defined

p <- pars(students = 20, concepts = 15)
M <- get.par('M', p, progress = TRUE)

Assume that we do not want to produce M via the process that requires generating concept.exp. In fact, we want to remove this method of generating M from edmsyn completely. edmtree.remove.gen is here to help:

edmtree.remove.gen('M',c('concept.exp','students'))
M <- get.par('M', p, progress = TRUE)
edmtree.remove.gen('M',c('skill.dist','students','skill.space'))
M <- get.par('M', p, progress = TRUE)
edmtree.remove.gen('M', 'S')
M <- get.par('M', p, progress = TRUE)

As we consecutively deleted all three generating methods for M, a reasonable question arises: why did it have to be in the order (concept.exp, students), then (skill.dist, students, skill.space), and lastly (S)? In other words, based on which criteria did edmsyn prioritise the method (concept.exp, students) over (skill.dist, students, skill.space), and (skill.dist, students, skill.space) over (S)? Note that when students and concepts are provided, all three options are feasible.

The generating criteria

One complication of the edmsyn perspective is that there will be times when there is more than one way to generate data; the case above is one example. There are criteria that edmsyn follows when prioritizing one option over another. Note that these criteria were chosen by the author at development time and have no particular theoretical foundation. It is, however, beneficial for users to be aware of them.

If more than one method is available to generate a node's value (a rough scoring sketch follows the list below):

  1. Choose the one that most utilizes the inputted data. In the previous example, the inputted data comprises students and concepts, so both (concept.exp, students) and (skill.dist, students, skill.space), each utilizing half of the input, are certainly better than (S).

  2. If there is a tie, as in the M$gen example above, choose the one that is most covered by the input. In this case, one half of (concept.exp, students) is covered (by students), while only one third of (skill.dist, students, skill.space) is covered (also by students). This is why (concept.exp, students) is the most favoured amongst the three.

  3. If there is still a tie, then choose the one that covers the largest portion of the target node's tell. For example, in this case M$tell consists of concepts, students, and concept.exp; two of them are covered by (concept.exp, students), one is covered by (skill.dist, students, skill.space), and none is covered by (S). For ties beyond this point, the choice is completely random.
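
The following rough scoring sketch (an assumption about the mechanism, not edmsyn's internals) reproduces the ordering described above for the M example:

# Score each generating method against the inputted parameters and the target's tell.
rank.gen.methods <- function(methods, input, tell) {
  score <- sapply(methods, function(m)
    c(utilised = sum(input %in% m) / length(input),  # criterion 1: input utilised
      covered  = sum(m %in% input) / length(m),      # criterion 2: method covered by input
      tell.cov = sum(tell %in% m) / length(tell)))   # criterion 3: tell of target covered
  methods[order(-score["utilised", ], -score["covered", ], -score["tell.cov", ])]
}

rank.gen.methods(
  methods = list(c("concept.exp", "students"),
                 c("skill.dist", "students", "skill.space"),
                 c("S")),
  input   = c("students", "concepts"),
  tell    = c("concepts", "students", "concept.exp"))
# expected order: (concept.exp, students), then (skill.dist, students, skill.space), then (S)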

Removing type-1 connections

Now we move on to removing type-1 connections. Let's say we want to discard any information about concepts inferred from M; edmtree.remove.tell should be used as follows:

edmtree.remove.tell('M', 'concepts')
p <- pars(M = M1)
print(p)

Why is concepts still there in p? Actually, in this case concepts is not inferred directly from M but from concept.exp; to completely eliminate concepts, one needs to remove concept.exp as well.

edmtree.remove.tell('M','concept.exp')
p <- pars(M = M1)
# now concept.exp is gone, along with concepts
print(p)
# recover
edmtree.replace('M', tell = M.node$tell, f.tell = M.node$f.tell)
edmtree.fetch('M')

Again, since f.tell cannot be partially modified, users should be aware that redundant calculations happen behind the removal of concepts. In fact, concepts is still calculated, but then discarded from the returned list each time information is inferred from M. So it is highly recommended to use the f.tell argument of edmtree.remove.tell whenever possible. Namely, the recommended way to completely remove concepts is as follows:

edmtree.remove.tell('M', c('concept.exp', 'concepts'), f.tell = function(M){
  return(ncol(M)) # since the only one left in tell is students
})
p <- pars(M = M1)

The warning here tells us that the return line inside our new f.tell should be return(list(ncol(M))) instead of return(ncol(M)). However, edmsyn is smart enough to prevent this error with a function wrapper that detects and fixes many errors like this whenever possible. Note that since this is an error in the return value of our function (in other words, it only shows up when the function is actually executed), it is a run-time error and cannot be detected immediately in the edmtree.remove step, but only much later, when pars calls for data to flow from M to lower-level parameters during the execution of p <- pars(M = M1).

To avoid displaying this warning each time data flows through M, let's replace f.tell with a proper one.

# recover
edmtree.replace('M', tell = M.node$tell, f.tell = M.node$f.tell)
# remove
edmtree.remove.tell('M', c('concept.exp', 'concepts'), f.tell = function(M){
  return(list(ncol(M)))
})
p <- pars(M = M1)
print(p)

Removing a node from the structure

Removing a whole node is syntactically simple but requires a high level of awareness. Removing root parameters such as students or concepts can cause the whole system to crash, since everything is built up from these root parameters. In short, removing parameters at a higher level has less impact on the whole structure than removing lower ones.

In this first example, let's remove a "data parameter", namely bkt:

bkt.node <- edmtree.remove("bkt")
edmtree.fetch("bkt")

Later on, plugging bkt.node back into the structure will be simple, but only because this is a special case where bkt is at the data level. The situation is more complicated when deleting a lower-level node such as concept.exp.

# For the purpose of illustration, let's first recover everything for M
edmtree.replace('M', tell = M.node$tell, f.tell = M.node$f.tell,
                gen = M.node$gen, f.gen = M.node$f.gen)
concept.exp.node <- edmtree.remove("concept.exp")
get.par('concept.exp', pars(concepts = 3))

As can be seen, removing concept.exp consequently removes several other parts of the structure. If you want to revert edmsyn to the state before this removal, you will have to plug concept.exp.node back in and manually recover everything else that was removed along with it.

To do this, for example to add the method (students, concept.exp) back into M, edmtree.add is needed.

Adding with edmtree.add

Adding a node

This task is simple as long as you have already done the hard part: defining all four components tell, f.tell, gen, and f.gen of the node. For the node concept.exp from the removal section above, luckily we saved these components into concept.exp.node. Let's reuse them for a quick illustration.

edmtree.add('concept.exp', tell = concept.exp.node$tell,
            f.tell = concept.exp.node$f.tell,
            gen = concept.exp.node$gen,
            f.gen = concept.exp.node$f.gen)
concept.exp <- get.par('concept.exp', pars(concepts=3))$value
print(concept.exp)

Adding type-2 connection

Now we move on to adding the method (concept.exp, students) to M. Again, we will reuse M.node for a quick and clean illustration.

# First let's see what method is being used to generate M
# When students and concept.exp is given
p <- pars(students = 10, concept.exp = concept.exp)
M <- get.par('M', p, progress = TRUE)

# See what we have in M.node
M.node$gen

# Add the new method
edmtree.add.gen('M', gen.method = M.node$gen[[3]], 
                f.gen.method = M.node$f.gen[[3]])

# Generate M again to see the change
M <- get.par('M', p, progress = TRUE)$value

Now that the method (students, concept.exp) is available, and clearly has better input utilisation than (students, skill.space, skill.dist) (see The generating criteria section above), edmsyn opts for this method to generate M.

Adding type-1 connection

Adding concept.exp into M$tell is also a necessary step in recovering the deleted concept.exp node. In this task, edmtree.add.tell is used.

# For the purpose of illustration, let's say knowing M only tells us about the number of concepts
edmtree.replace('M', tell = 'concepts', f.tell = function(M){
  return(list(nrow(M)))
})

# Now see that currently knowing M tells nothing about the expected mastery rate for each concept or the number of students
p <- pars(M = M)
print(p)

# It is time to add the full set of tell to M
edmtree.add.tell('M', c('concept.exp', 'students'), function(M){
  return(list( rowMeans(M), ncol(M) ))
})
p <- pars(M = M)
print(p)
# Check if all calulations are correct
identical( dim(M), c(p$concepts, p$students) )
identical( p$concept.exp, rowMeans(M) )

A toy model

Let's wrap up this section by going through the process of adding a whole new model to the edmsyn structure. Along the way, a few useful techniques that edmsyn provides will also be introduced, so it is beneficial to pay attention to the details in this section. Note that:

  1. None of the details in this toy model make sense; all of them are specifically designed for illustration purposes.

  2. This part covers only edmtree.add. edmtree.replace and edmtree.remove will not be used since we are not modifying any of the existing models and parameters.

Below is an outline of what is added to edmsyn by this new model (named toy):

# foo is a root node, with no default values
edmtree.add('foo', integer = TRUE)
edmtree.add('lower.foo', integer = TRUE,
            tell = 'foo', f.tell = less.strict,
            gen = 'default.vals', f.gen = 1)

less.strict is a special function provided by edmsyn for cases when you want to tell the structure that lower.foo should be strictly less than foo. Alternatively, edmtree.add.tell('foo', tell = 'lower.foo', f.tell = greater.strict) gives the same effect. There are four such special functions recognised by edmsyn: less.equal, less.strict, greater.equal, and greater.strict.

The presence of these four functions highlights the fact that inferring information in edmsyn is not solely about inferring values; it can also involve inferring different aspects of a value, namely its bounds in this case.

# Now add upper.foo
edmtree.add('upper.foo', integer = TRUE,
            gen = c('default.vals', 'concepts'),
            f.gen = function(default.vals, concepts){
              return(default.vals$min.it.per.tree + concepts)
            })
# Another use of special bound function
edmtree.add.tell('upper.foo', 'concepts', function(upper.foo){
  list(greater.equal(upper.foo - default()$min.it.per.tree))
})
# Instead of upper.foo telling the bound of foo,
# we will do it in the opposite direction,
# just for the purpose of illustration
edmtree.add.tell('foo', tell = 'upper.foo', f.tell = less.equal)
# Now since lower.foo and upper.foo are both presented in the structure
# it's time to add a generating method for foo
edmtree.add.gen('foo', gen.method = c('lower.foo', 'upper.foo'),
                f.gen.method = function(lower.foo, upper.foo){
                  sample((lower.foo+1) : upper.foo, 1)
                })
# add bar
edmtree.add('bar', gen = c('foo', 'concepts'),
            f.gen = function(foo, concepts){
              matrix(runif(foo * concepts), foo, concepts)
            })
# dimensions of bar are foo and concepts
edmtree.add.tell('bar', tell = c('foo', 'concepts'),
                 f.tell = function(bar){
                   list(nrow(bar), ncol(bar))
                 })

Note that it is okay not to add the tell component for bar (in fact, it is okay to skip defining tell and f.tell for every parameter; your application will still run just fine as long as the rest is properly designed). However, doing so limits edmsyn's ability to recognise conflicts (condition 2). For example, pars(bar = matrix(0, 3, 5), concepts = 4) will not raise the conflict between 4 and 5 if bar$tell does not include the inference for concepts. Adding tell and f.tell is good practice if you want more debugging power in a big and complicated application.
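
A quick check of that point (assuming the tell and f.tell for bar defined above are in place):

# The 5 columns of bar imply concepts = 5, conflicting with concepts = 4,
# so this call is expected to raise a conflict.
p <- pars(bar = matrix(0, 3, 5), concepts = 4)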

# to make the model a little more sophisticated,
# we add another generating method for bar
edmtree.add.gen('bar', gen = c('M', 'foo'),
                f.gen = function(M, foo){
                  concepts = nrow(M)
                  matrix(runif(foo * concepts), foo, concepts)
                })
# finally, add the data node "toy"
edmtree.add('toy', data = TRUE,
            gen = c('bar','M'), f.gen = function(bar, M){
              list(R = round(bar %*% M), concepts = nrow(M))
            },
            tell = c('bar', 'M'), f.tell = function(toy){
              # Note that the following learning algorithm makes no sense
              # it is just for the purpose of illustration
              concepts = toy$concepts
              R = toy$R
              foo = nrow(R)
              students = ncol(R)
              bar = matrix(runif(foo * concepts), foo, concepts)
              M = matrix(sample(0:1, concepts * students, TRUE),
                         concepts, students)
              list(bar, M)
            })

# Check if ALL.MODELS includes "toy" (yes it does)
edmconst$ALL.MODELS

So, that is all it takes to plug a new model into edmsyn. The process is simple and it forces users to think through various aspects while doing so. Another benefit is that, before toy was added, a whole system of 62 parameters with carefully built and well-tested connections was already there. This makes the work even lighter; we saved a lot of time before moving on to testing our model.

Test the toy model

# 1. Test the bounds
p <- pars(lower.foo = 3, foo = 3)
p <- pars(foo = 4, upper.foo = 3)
# This one requires reasoning to detect
# Thus error is not raised immediately
p <- pars(lower.foo = 3, upper.foo = 2)
# But nevertheless, when p goes into use, it immediately fails
get.par('foo', p)
p <- pars(upper.foo = 5, concepts = 5)

# 2. Test foo$gen
get.par('foo', pars())
p <- pars(p, upper.foo = 15)
get.par('foo', p)
p <- pars(concepts = 5)
get.par('foo', p, progress = TRUE)
p <- pars(M = M)
p <- get.par('foo', p, progress = TRUE)
print(p)

# 3. Test bar
get.par('bar', pars())
get.par('bar', pars(upper.foo = 15))
get.par('bar', pars(lower.foo = 3, concepts = 5), progress = TRUE)
get.par('bar', pars(M = M), progress = TRUE)

# 4. Test data
toys <- gen('toy', pars(M = M, bar = matrix(0, 3, 5)))
toys <- gen('toy', pars(M = M, bar = matrix(1, 3, 3)),
            n = 2, progress = TRUE)
toys <- gen('toy', pars(students = 20, concepts = 4), 
            n = 3, progress = TRUE)
toys <- gen('toy', pars(M = M), n = 3, progress = TRUE)
toys.syn <- syn('toy', toys[[2]]$toy,
                keep.pars = c("foo","concept.exp"),
                students = 12, n = 3, progress = TRUE)

Working with the whole structure

Now that you have everything needed to fully manipulate the internal structure of edmsyn, it is time to move on to working with the whole structure (as opposed to working with nodes and edges as before). The first thing to know is that each time library(edmsyn) is executed, the original edmsyn graphical structure (without foo, bar, toy, etc.) is restored and everything we have built so far is lost. Thus, occasionally saving the modified structure is important.

toy.save <- edmtree.dump()

Assume a different situation where no part of the current structure is needed and you will manually build a whole new structure from scratch. In this case, the first thing to do is to clear out all nodes.

edmtree.clear()
edmtree.fetch('toy')
edmtree.fetch('M')

Lastly, you can always restore a saved structure

edmtree.load(toy.save)
gen('toy', pars(M = M))
gen('bkt', pars(students = 15, items = 20))
# Leave the argument part empty
# if you want the original edmsyn structure
edmtree.load()
gen('toy', pars(M = M))
gen('bkt', pars(students = 15, items = 20, 
                concepts = 4, time = 10))
