library(devtools)
install_github('thtrieu/edmsyn')
For an informal introduction, read Project debut - R - Educational Data Synthesizer
Please refer to README.pdf for the fully compiled code. This vignette has two parts, which can be read separately in /vignettes.
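In general, generating data from a given set of parameters under a specified model's assumptions is straightforward, as long as the following two conditions are satisfied:
Condition 1: the set of parameters is sufficient to generate data under the model.
Condition 2: the set of parameters is consistent, i.e. the given values do not conflict with one another.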
However, checking these conditions is not always straightforward. Difficulties arise because, in the Educational Data Mining literature, parameters are usually organised in a complex manner. Some of them are shared between different models (e.g. average success rate, student variance, number of concepts). The collection of all parameters can also be arranged into a hierarchical structure, in the sense that some parameters can be used to generate others, as opposed to a flat set of parameters that directly generates data.
For example, with Non-negative Matrix Factorization models, the Q-matrix and the Skill matrix are parameters that directly generate data, while these two parameters can themselves be generated from lower-level parameters such as the number of students, the number of items, or the number of concepts. A sufficient set of parameters to generate data in this case can be {Q-matrix, Skill matrix} or {Q-matrix, number of students}. For the former set to be consistent, the number of columns of the Q-matrix must equal the number of rows of the Skill matrix, because both represent the number of concepts.
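The consistency requirement on {Q-matrix, Skill matrix} can be illustrated in base R. This is only a sketch: the helper check.concepts and the toy matrices are assumptions made for illustration and are not part of edmsyn.

```r
# Hypothetical helper (not part of edmsyn): checks that a Q-matrix
# (items x concepts) and a skill matrix (concepts x students) agree
# on the number of concepts.
check.concepts <- function(Q, S) {
  ncol(Q) == nrow(S)
}

set.seed(1)
Q <- matrix(rbinom(20 * 5, 1, 0.5), nrow = 20, ncol = 5)  # 20 items, 5 concepts
S <- matrix(runif(5 * 15), nrow = 5, ncol = 15)           # 5 concepts, 15 students

consistent.pair   <- check.concepts(Q, S)         # both matrices describe 5 concepts
inconsistent.pair <- check.concepts(Q, S[-1, ])   # dropping a concept row breaks the match
```

In the package itself this kind of check is performed by pars, as described later in this document.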
It follows that there are many different ways to generate data, depending on which model is being used and which parameters are given (assuming that they satisfy conditions 1 and 2). It is important to stress that different sufficient and consistent sets of parameters, even with respect to a single model M, produce different amounts of variance in the generated data. The reason is that a set of lower-level parameters requires more intermediate generating steps before reaching the final one (generating data), and each step adds variability to the generated data.
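The effect of extra generating steps on variance can be seen in a small base R simulation. This is a sketch under stated assumptions: the functions gen.M and gen.M.low, and the choice of a uniform distribution for the mastery rates, are hypothetical and do not reproduce the rules used inside edmsyn.

```r
set.seed(1)
concepts <- 5
students <- 20

# One probabilistic step: the mastery rate of each concept is given.
gen.M <- function(concept.exp, students) {
  t(sapply(concept.exp, function(p) rbinom(students, 1, p)))
}

# Two probabilistic steps: the mastery rates are drawn first
# (assumed uniform on [0, 1]), then the mastery matrix is drawn.
gen.M.low <- function(concepts, students) {
  gen.M(runif(concepts), students)
}

# Overall mastery rate of each generated matrix, over many replications.
rate.high <- replicate(2000, mean(gen.M(rep(0.5, concepts), students)))
rate.low  <- replicate(2000, mean(gen.M.low(concepts, students)))

var(rate.high)
var(rate.low)   # larger, since the intermediate step adds variability
```

Both versions have the same expected mastery rate (0.5); only the spread differs.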
The R package edmsyn provides users with a simple framework that conveniently handles all of the situations above while generating data, including checking conditions 1 and 2. It also automates the process of learning parameters from raw data, then modifying and using this information to create synthetic data from 10 different models. This document is intended as a quick yet thorough tutorial on using edmsyn.
library(edmsyn)
The vector edmconst$ALL.MODELS, loaded with the package, contains the names of all available models.
ALL.MODELS <- edmconst$ALL.MODELS
ALL.MODELS
context and function pars()
Since some parameters are shared between different models, edmsyn does not treat the parameters of the 10 models as separate collections, but jointly, as different aspects describing a single instance of reality. For example, the vector of discrimination factors of all items is a parameter that belongs to the IRT-2pl model, whereas the Q-matrix is not. Nevertheless, edmsyn introduces the class context, in which these two parameters can co-exist in a single object and be used according to the user's purpose.
The function pars() produces an object of class context.
p <- pars(students = 15, items = 20)
class(p)
After the assignment p <- pars(students = 15, items = 20), the context p contains no information other than students, items, and default.vals.
print(p)
p, a context object, is essentially a list object with different components. default.vals is the component that is always available (activated) in any context object; it is an environment containing the default values for some of the parameters. Whenever one of these parameters is needed and the user has not supplied it, the corresponding value is fetched from default.vals and loaded into the context.
class(p$default.vals)
names(p$default.vals)
p$default.vals$avg.success
p$avg.success
At some point in the future, if data need to be generated from p by a model that requires avg.success, and this parameter has not been supplied by the user, the value 0.5 will be used.
We can change these default values by using the default function:
p_ <- pars(default.vals = default(avg.success = 0.6))
p_$default.vals$avg.success
The function pars can also be used to update an existing context. For example, we can change the number of students in p from 15 to 20, add the number of concepts to p, and delete the number of items.
p <- pars(p, concepts = 5, students = 20, items = NULL)
p
# Recover items = 20 for later use
p <- pars(p, items = 20)
p
pars also automatically figures out which lower-level parameters can be obtained whenever a parameter is activated. For example, the dimensions of M imply the number of concepts and the number of students. Thus, if a context is supplied with M, the parameters concepts and students (and some other obtainable parameters) are activated.
p_ <- pars(M = matrix(1, 7, 30))
p_
p_$concepts
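The kind of inference described above can be sketched in base R. The function infer.from.M below is a hypothetical stand-in, not the code edmsyn runs internally; it only shows which lower-level values a concepts-by-students matrix determines.

```r
# Hypothetical stand-in for the inference pars performs when M is supplied:
# M has one row per concept and one column per student.
infer.from.M <- function(M) {
  list(
    concepts    = nrow(M),      # number of concepts
    students    = ncol(M),      # number of students
    concept.exp = rowMeans(M)   # observed mastery rate of each concept
  )
}

info <- infer.from.M(matrix(1, 7, 30))
info$concepts
info$students
```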
While assembling input parameters from users, pars checks them for consistency (condition 2) ^[Currently edmsyn is able to detect whether a parameter is receiving different values, or whether it violates bounds indicated by users]. Since p is a context with 20 question items, a Q-matrix with 30 rows is incompatible.
p <- pars(p, Q = matrix(1,30,5))
In short, the class context and the function pars are introduced mostly for convenience in the following cases:
context
.get.par
We can access the components of a context object just as we access the components of a list. edmsyn provides an alternative through the function get.par:
p$students
get.par("students", p)
The advantage of get.par is that it gives a result even when the parameter is not available, as long as the parameter can be generated. For example, if we want the value of the skill mastery matrix (M) from p, the conventional operator $ returns NULL, while get.par finds a way to generate M from concepts and students. Set the argument progress = TRUE to see how get.par does it.
p$M
M_ <- get.par("M", p, progress = TRUE)
M_
As can be seen, the process involves generating an intermediate parameter called concept.exp before generating M. concept.exp is a vector of the expected mastery rates for the 5 concepts in p. By using get.par, we can examine the values of these intermediate generations.
M_$context$concept.exp
For example, looking at the above concept.exp, we expect about r round(M_$context$concept.exp[1]*100) percent of the students to have mastered the first concept in p. In fact, because many parameters co-exist in a context, generating concept.exp is just one of many ways by which get.par can reach M; the reason why this particular way is chosen is discussed in later parts.
If we run get.par to obtain M again, the result will in general be different. This is because each run carries out another probabilistic generation, so we receive a different concept.exp and, after that, a different M.
identical(M_$value, get.par("M", p)$value)
get.par can detect whether there is enough information to generate the required parameters. For example, the Q-matrix needs the number of concepts to be defined; without the number of concepts, there is no way to derive the Q-matrix.
p_ <- pars(students = 20, items = 15)
get.par("Q", p_)
One way to fix this is to supply M to the context, since M carries the number of concepts. With this, the process becomes possible:
p_ <- pars(p_, M = matrix(1, 4, 20))
get.par("Q", p_)
gen
Parameters activated in a single context are guaranteed not to conflict, since pars checks them while putting them together (condition 2). If a context satisfies condition 1, there is a way to generate data from it. edmsyn provides the function gen, which checks condition 1 and generates data with the specified model. For example, the following code generates data from p by the POKS model (Partial Order Knowledge Structure model) twice:
poks.data <- gen("poks", p, n = 2, progress = TRUE)
poks.data
As can be seen, gen returns a list of two components, corresponding to the two generations requested by the option n = 2. Each of the two is a context. edmsyn views the generated data as a part of the context. In fact, we can consider data as a parameter at the highest level. In each generated context, there is a component with the same name as the specified model, poks; this component contains the generated data.
poks.data[[1]]$poks
Looking at the result, the poks component is a list whose component R is the response matrix, while the other components hold other information needed for the learning process. This point is explained in later sections.
Since data is considered to be just another parameter in a context, we can actually use get.par to generate data. In fact, gen is simply a wrapper of get.par with an additional argument n, with which users can specify how many times the process should be repeated.
get.par("poks", p)
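The relationship between a single generation and n repeated generations can be sketched in base R as follows. This is a hypothetical illustration of the wrapper idea only: repeat.gen and generate.once are made-up names, and the actual implementation of gen may differ.

```r
# Hypothetical wrapper: repeats a single-shot generator n times and
# collects the independent results in a list, one entry per generation.
repeat.gen <- function(generate.once, n = 1) {
  lapply(seq_len(n), function(i) generate.once())
}

set.seed(3)
# Stand-in for one probabilistic generation (a 2 x 3 binary response matrix).
generate.once <- function() matrix(rbinom(6, 1, 0.5), 2, 3)

two.runs <- repeat.gen(generate.once, n = 2)
length(two.runs)
```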
The following are two more examples of using gen. In the first one, gen generates data by the POKS model again, but from a different context in which the user has more control over the Partial Order structure of items. In the second one, gen generates data by the DINA model (Deterministic Input Noisy And model).
Example 1
Suppose this time we want the Partial Order structure of items to have two connected components, each with height 3 and no transitive links between items. This can be done by updating p with some more parameters. To visualize this structure, whose dependency matrix is the component po in the context returned by gen, we can use the function viz.
p <- pars(p, min.depth = 3, max.depth = 3, min.ntree = 2, max.ntree = 2, trans = FALSE)
poks.data <- gen("poks", p)
v <- viz(poks.data$po)
v contains analysed information about the structure. Its first component is identical to po; the second one is a list with as many components as there are connected components in the structure that po represents. Each of these components is in turn a list whose first component is the corresponding dependency matrix and whose second component gives the number of items at each level of depth.
print(v)
Example 2
dina.data <- gen("dina", p, progress = TRUE)
dina.data
As reported by gen, unlike the previous case, M is now generated from the three parameters students, skill.space, and skill.dist instead of concept.exp and students. The reason is that skill.space and skill.dist are formal parameters of the DINA model, so they should be used in the process of generating data. Again, we can access the data through the component dina of the generated context. Besides the response matrix, dina.data$dina also has another component, Q, because from DINA's point of view a response matrix without its corresponding Q-matrix is incomplete.
dina.data$dina
gen only allows one model and one context at a time. We can save time when generating data across different models and contexts by using gen.apply. Setting the argument multiply to TRUE or FALSE decides what kind of matching is made between models and contexts.
dat.1 <- gen.apply(ALL.MODELS, list(p1 = p, p2 = p_), multiply = FALSE, n = 5)
dat.1
dat.1["dino.p1", 3]
dat.2 <- gen.apply(ALL.MODELS, list(p1 = p, p2 = p_), multiply = TRUE, n = 5)
dat.2
dat.2["nmf.com", "p2"]
learn
Let's say we want to take the third data generation from the matching between context p1 and model poks, and use the POKS model to learn from this data.
poks.data <- dat.2["poks", "p1"][[1]][[3]]
poks.data
poks.data$poks
learn.poks <- learn("poks", data = poks.data$poks)
learn.poks
learn.poks$po
If we want to learn from this same data using the DINA model, poks.data$poks cannot be used, because the components p.min, alpha.p, and alpha.c are meaningless to DINA. To do the task, the data must be assembled by hand. DINA requires one additional component besides the response matrix: the Q-matrix. Normally the Q-matrix is expected to be expert-defined; this illustration simply generates it randomly.
Q <- get.par("Q", p)$value
R <- poks.data$poks$R
dina.data <- list(R = R, Q = Q)
learn.dina <- learn("dina", data = dina.data)
learn.dina
Here we have a look at the two parameters skill.space and skill.dist:
learn.dina$skill.space
learn.dina$skill.dist
syn
Generating synthetic data includes three steps:
Learn the most probable context from a given data by a specified model.
Modify the learned context by keeping some of the parameters, changing some and discarding the rest.
Generate data from the new context.
edmsyn provides the function syn, which automates the above process. Specifically, syn consists of three parts:
Learn the most probable context by using learn.
Keep some parameters (the default choice is stored in edmconst$KEEP ^[edmconst$KEEP is designed so that, for any new value the user sets for students, the new context is still consistent and sufficient. In this sense, syn generates synthetic data by creating simulated students.]) and discard the rest, while also allowing the user to change the parameter students.
edmconst$KEEP
Generate data from the new context by using gen.
Now we synthesize dina.data with a new number of students:
dina.syn <- syn("dina", data = dina.data, students = 12, n = 10)
dina.syn$synthetic[[5]]$dina
If the default option is not suitable, syn also allows users to specify manually which parameters to keep through the argument keep.pars.
dina.syn <- syn("dina", data = dina.data, keep.pars = c("Q", "concept.exp"), students = 12)
However, in this case users proceed at their own risk if the kept parameters (together with the new number of students, if students is redefined) form an inconsistent or insufficient set with respect to the specified model. For example, when synthesizing data by the DINA model, if we choose to keep M (which implicitly retains the number of students) and also define a new number of students, there will be a conflict.
dina.syn <- syn("dina", data = dina.data, keep.pars = c("Q", "M"), students = 12)
edmsyn
edmsyn comes with a pre-defined set of parameters and relationships amongst them. These relationships are rules that help edmsyn derive values for one or more parameters from others. Specifically, these rules are represented as functions in the package. For example, a function that takes two integers and randomly produces a binary matrix whose dimensions are the two given integers can be used as the rule to derive M (the skill mastery matrix) from students and concepts. Rules that derive the value of a "data parameter" such as poks, dina, or dino encode the POKS, DINA, and DINO models respectively. Similarly, rules that derive values in the opposite direction encode the corresponding learning algorithms.
The choice of built-in parameters, models, and learning algorithms was made at the time edmsyn was developed, and thus it may or may not satisfy users' needs. That is why edmsyn also comes with a set of tools that allow users to redefine all of these components, to the extent of building a whole new set of parameters and models, while still retaining all the original benefits. All possible changes to edmsyn are made on a single graph structure that edmsyn creates at loading time, with vertices representing parameters and edges encoding the respective relationships.
Functions that allow these modifications have names starting with edmtree. There are only a few such functions, and they provide everything you need to work with edmsyn at its internal level. This tutorial also walks you through their building blocks, so that you gain an in-depth understanding of the edmsyn mechanism. By the end of this section, you should be able to use the edmtree. functions properly and efficiently.
If you do not feel like going that deep, feel free to skip to A toy model at your own risk. Reading A toy model before going through the whole tutorial is also a good idea.
edmtree.fetch
Firstly, let's start with the skill mastery matrix M
M.node <- edmtree.fetch('M')
class(M.node)
names(M.node)
The representation of M in edmsyn is essentially a list with four components: tell, gen, f.tell, and f.gen.
The first one, tell, is the set of names of the parameters that receive information once the value of M is known.
M.node$tell
In this case, the value of M tells edmsyn the values of concepts, students, and concept.exp (the expected mastery rate for each concept). This is straightforward, since concepts, students, and concept.exp are respectively the row dimension, the column dimension, and the row means of M. At this point, it is reasonable to look at the third component, f.tell:
M.node$f.tell
The third component, f.tell, is the function that derives the value of each parameter listed in tell from the value of M. As can be seen, this function does exactly what we expect: it takes the row dimension, the column dimension, and the row means of its input M, and assembles these results into a list as its return value.
The second component, gen, is a list of generating methods for M.
M.node$gen
In this example, edmsyn knows three different methods to reach M: using (S), (students, skill.space, skill.dist), or (students, concept.exp). If the generating process of M uses S (the skill matrix, i.e. the probabilistic version of M), then one way to proceed is to round S element-wise to obtain M. In fact, this is the method chosen in edmsyn. Let's have a look at the last component:
M.node$f.gen
As can be seen, f.gen is a list of three functions corresponding to the three generating methods in gen, and the first one rounds its input to produce the return value. The details of these functions are discussed in a later part; at this point it is sufficient to understand how a parameter is represented in edmsyn and which tasks that representation serves.
From this point on, all f.tell functions will be referred to as type-1 connections. Similarly, the collections of functions in f.gen will be referred to as type-2 connections. These are the two types of connections along which all data flow. The main benefit edmsyn offers is a convenient interface that allows easy access to and control over the various data processes.
edmtree.replace
Suppose you are not satisfied with the current method of generating M from S (rounding) and want a fully probabilistic implementation instead. To do this, the first function in f.gen should be replaced. First, we must design a function that takes the matrix S as its only input, samples a binary result from each entry of S, and returns the value of M, as follows:
new.gen.S.to.M <- function(S){
  M <- matrix(0, nrow(S), ncol(S))
  for (row in seq_len(nrow(S)))
    for (col in seq_len(ncol(S))){
      # entry (row, col) of S is the probability that M[row, col] equals 1
      p <- S[row, col]
      M[row, col] <- sample(x = 0:1, 1, prob = c(1 - p, p))
    }
  return(M)
}
Next, this function is installed into the internal structure of M:
edmtree.replace.gen('M', 'S', new.gen.S.to.M)
M.node <- edmtree.fetch('M')
M.node$f.gen
Now f.gen[1] has changed. It is apparently not the original new.gen.S.to.M that we designed; the reason is that edmsyn has wrapped this function inside one or more layers to make sure it fits the internal working environment. We can check whether the replacement succeeded by generating M from S several times: if the result differs each time, the probabilistic version has been installed successfully.
S <- matrix(runif(15), 3, 5)
p <- pars(S = S)
M1 <- get.par('M', p)$value
M2 <- get.par('M', p)$value
# This is very unlikely to be TRUE
identical(M1, M2)
# Let's try something else
M <- matrix(0, 3, 5)
big.number <- 10000
for (i in 1:big.number) M <- M + get.par('M', p)$value
# This is likely to be TRUE
identical(round(M/big.number), round(S))
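The same sanity check can be reproduced in base R without going through the package. The sketch below uses a vectorised sampler, sample.M.from.S, which is an assumed alternative to the loop-based function above and not the code stored in edmsyn; it relies on rbinom drawing one Bernoulli trial per entry of S.

```r
# Assumed vectorised alternative (not the package's code): rbinom draws one
# Bernoulli trial per entry of S, in column-major order, so the result can be
# reshaped back into a matrix with the dimensions of S.
sample.M.from.S <- function(S) {
  matrix(rbinom(length(S), size = 1, prob = S), nrow(S), ncol(S))
}

set.seed(42)
S.demo <- matrix(runif(15), 3, 5)
n.rep  <- 5000

# Average many sampled matrices; each entry estimates the matching entry of S.
total <- matrix(0, 3, 5)
for (i in seq_len(n.rep)) total <- total + sample.M.from.S(S.demo)
freq <- total / n.rep

max(abs(freq - S.demo))   # small when n.rep is large
```

Comparing frequencies with a tolerance is a little more robust than comparing rounded values, which can disagree when an entry of S is close to 0.5.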
Similarly, we can change the functions of the other generating methods by creating a new function for each and replacing it into the structure.
# Examples of changing generating methods
new.f.gen.2 <- function(students, skill.space, skill.dist){
  # Evaluate M here
  return(value.of.M)
}
edmtree.replace.gen('M', M.node$gen[[2]], new.f.gen.2)
As long as the number and the order of arguments match what is defined in the corresponding gen, the replacement succeeds. Otherwise, an error is raised either at replacement time or at run time, depending on the error.
# Some more playing around with replacement
M.node$gen[[3]]
new.gen.3 <- rev(M.node$gen[[3]])
new.gen.3
# Notice that the order of arguments of new.f.gen.3 below
# matches the order of new.gen.3
new.f.gen.3 <- function(concept.expectation, num.of.students){
  # evaluate M here
  return(value.of.M)
}
# Successful
edmtree.replace.gen('M', new.gen.3, new.f.gen.3)
Below are some examples where edmtree.replace.gen fails to execute.
# Unsuccessful examples:
# new f.gen is supposed to have 3 arguments as M.node$gen[[2]] has length 3
edmtree.replace.gen('M', M.node$gen[[2]], function(arg1, arg2) {
  return(NULL) # does not matter
})
# (items, poks) is not an existing method
edmtree.replace.gen('M', c('items', 'poks'), function(arg1, arg2){
  return(NULL) # does not matter
})
Besides edmtree.replace.gen, edmsyn also provides edmtree.replace.tell, with which users can modify tell and f.tell in many flexible ways. Knowing the value of S, it is straightforward to infer the values of students and concepts, which are its number of columns and rows respectively. As a toy example, let's swap the inference of students and concepts (and thus produce an incorrect result):
S.node <- edmtree.fetch('S')
S.node$tell
p <- pars(S = S)
print(dim(S))
# this would give correct result
c(p$concepts, p$students)
# Now let's alter f.tell so that concepts
# is the number of columns of S and students
# is the number of rows of S.
edmtree.replace.tell('S', c('concepts', 'students'), function(S){
  return(list(
    ncol(S),
    nrow(S)
  ))
})
p <- pars(S = S)
The reason why pars fails to execute is that concepts and students are not the only parameters inferred from S: S.con.exp, a vector of length concepts, is also inferred, and still correctly. This causes a conflict that edmsyn is able to detect. One way to force the incorrect result is as follows:
edmtree.replace.tell('S', c('S.con.exp', 'concepts', 'students'), function(S){
  return(list(
    colMeans(S),
    ncol(S),
    nrow(S)
  ))
})
p <- pars(S = S)
print(dim(S))
# Now we successfully produce a wrong result.
c(p$concepts, p$students)
There are two important points to make here. Firstly, it is dangerous to put such incorrect calculations into the edmsyn structure: as seen above, false implementations may take a long time (sometimes forever) to be detected after the replacement, so your application may operate incorrectly for many steps without anyone noticing. Examine your changes carefully before you modify edmsyn's built-in functions.
Secondly, the replacement above is called a partial replacement, because not all parameters in tell are modified. Unlike many other classes of objects in R, f.tell, which is essentially an R function, is inherently not modular, which means it cannot be modified partially. So how did edmsyn finish the task above? It simply executes the old function and then the new one, after which the new results replace the old ones at the appropriate positions in the returned list. This may involve a great deal of redundant calculation, so using edmtree.replace.tell to modify only a subset of tell is strongly discouraged. In other words, whenever possible, use edmtree.replace.tell with the argument tell being identical to S$tell.
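The "run the old function, then overwrite with the new results" idea can be sketched in base R. Everything here is hypothetical: partial.replace is not an edmsyn function, and the original f.tell of S is only assumed to return row means, row count, and column count in that order.

```r
# Hypothetical illustration of partial replacement (not edmsyn's wrapper).
# old.f returns one value per name in old.names; new.f returns one value
# per name in new.names, which must be a subset of old.names.
partial.replace <- function(old.f, old.names, new.f, new.names) {
  function(x) {
    out <- setNames(old.f(x), old.names)   # the old function still runs in full
    out[new.names] <- new.f(x)             # new results overwrite matching slots
    unname(out)
  }
}

# Assumed original inference from S: (S.con.exp, concepts, students).
old.f <- function(S) list(rowMeans(S), nrow(S), ncol(S))
# Deliberately swapped replacement for (concepts, students) only.
new.f <- function(S) list(ncol(S), nrow(S))

merged <- partial.replace(old.f, c("S.con.exp", "concepts", "students"),
                          new.f, c("concepts", "students"))

set.seed(7)
S.demo <- matrix(runif(15), 3, 5)
res <- merged(S.demo)
```

The values computed by old.f for concepts and students are thrown away, which is the redundancy the text warns about.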
To proceed normally, let's restore the original, correct version of f.tell inside S:
edmtree.replace.tell('S', S.node$tell, S.node$f.tell)
p <- pars(S = S)
dim(S)
c(p$concepts, p$students)
edmtree.replace
The previous replacing functions are smart in that they support partial replacement: they retain the parts that are not declared in the function call and properly integrate the new parts with the old ones.
If you simply wish to throw away some parts of the node and substitute new ones (in other words, you do not care about retaining any part of the existing functions, whether or not they could be reused), use edmtree.replace. See some examples below:
# replacing tell
edmtree.replace('S', tell = new.tell)
# replacing tell and f.gen
edmtree.replace('S', tell = new.tell, f.gen = new.f.gen)
# replacing gen, f.gen and f.tell
edmtree.replace('S', f.gen = new.f.gen, f.tell = new.f.tell, gen = new.gen)
edmtree.remove
Let's start with a context where students and concepts are defined:
p <- pars(students = 20, concepts = 15)
M <- get.par('M', p, progress = TRUE)
Suppose we do not want to produce M through the process that requires generating concept.exp. In fact, we want to remove this method of generating M from edmsyn completely. edmtree.remove.gen is here to help:
edmtree.remove.gen('M', c('concept.exp', 'students'))
M <- get.par('M', p, progress = TRUE)
edmtree.remove.gen('M', c('skill.dist', 'students', 'skill.space'))
M <- get.par('M', p, progress = TRUE)
edmtree.remove.gen('M', 'S')
M <- get.par('M', p, progress = TRUE)
As we consecutively deleted all three generating methods for M, a reasonable question arises: why were they used in the order (concept.exp, students), then (skill.dist, students, skill.space), and lastly (S)? In other words, based on which criteria did edmsyn prioritise the method (concept.exp, students) over (skill.dist, students, skill.space), and (skill.dist, students, skill.space) over (S)? Note that when students and concepts are provided, all three options are possible.
One complication of the edmsyn perspective is that there will be times when there is more than one way to generate data. The case above is one example. edmsyn follows a set of criteria to prioritise one option over another. Note that these criteria were chosen by the author at the time of developing edmsyn and have no particular theoretical foundation. It is, however, beneficial for users to be aware of them.
If more than one method is available to generate a node's value:
Choose the one that uses the most of the input data. In the previous example, the input consists of students and concepts, so both (concept.exp, students) and (skill.dist, students, skill.space), each using half of the input, are preferred over (S).
If there is a tie, as in the M$gen example above, choose the one that is most covered by the input. In this case, one half of (concept.exp, students) is covered (by students), while only one third of (skill.dist, students, skill.space) is covered (also by students). This is why (concept.exp, students) is the most favoured amongst the three.
If there is still a tie, choose the one that covers the most of the tell of the target node. In this case, M$tell consists of concepts, students, and concept.exp; two of them are covered by (concept.exp, students), one is covered by (skill.dist, students, skill.space), and none is covered by (S). For ties beyond this point, the choice is completely random.
Now we move on to removing type-1 connections. Suppose we want to discard any information about concepts inferred from M; edmtree.remove.tell should be used as follows:
edmtree.remove.tell('M', 'concepts')
p <- pars(M = M1)
print(p)
Why is concepts still there in p? In this case, concepts is not inferred from M but from concept.exp. To completely eliminate concepts, one also needs to remove concept.exp:
edmtree.remove.tell('M', 'concept.exp')
p <- pars(M = M1)
# now concept.exp is gone, along with concepts
print(p)
# recover
edmtree.replace('M', tell = M.node$tell, f.tell = M.node$f.tell)
edmtree.fetch('M')
Again, since f.tell cannot be partially modified, users should be aware that redundant calculations take place behind the removed concepts. In fact, concepts is still calculated, but then discarded from the returned list each time information is inferred from M. It is therefore highly recommended to use the f.tell argument of edmtree.remove.tell whenever possible. Namely, the recommended way to completely remove concepts is as follows:
edmtree.remove.tell('M', c('concept.exp', 'concepts'), f.tell = function(M){
  return(ncol(M)) # since the only one left in tell is students
})
p <- pars(M = M1)
The warning here tells us that the return line inside our new f.tell should be return(list(ncol(M))) instead of return(ncol(M)). However, edmsyn is smart enough to recover from this error through a function wrapper that detects and fixes many errors of this kind whenever possible. Note that since this error occurs when our function returns (in other words, it only shows up when the function is actually executed), it is a run-time error and cannot be detected in the edmtree.remove.tell step, but only much later, when pars makes data flow from M to lower-level parameters during the execution of p <- pars(M = M1).
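A wrapper of the kind described above might look like the following base R sketch. It is an assumption about the general technique, not the wrapper edmsyn actually uses; wrap.f.tell is a hypothetical name.

```r
# Hypothetical wrapper: runs a user-supplied f.tell and, if the result is
# not a list, warns and repairs it by wrapping the value in list().
wrap.f.tell <- function(f) {
  function(...) {
    out <- f(...)
    if (!is.list(out)) {
      warning("f.tell should return a list; wrapping the value in list()")
      out <- list(out)
    }
    out
  }
}

# The problematic f.tell from the text returns a bare number.
safe.f.tell <- wrap.f.tell(function(M) ncol(M))

# The problem only surfaces when the function is actually called (run time);
# suppressWarnings is used here just to keep the demonstration quiet.
res <- suppressWarnings(safe.f.tell(matrix(0, 3, 5)))
res
```

Because the check lives inside the returned closure, nothing can be reported when the function is installed, only when data first flow through it.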
To avoid displaying this warning each time data flow through M, let's replace f.tell with a proper one.
# recover
edmtree.replace('M', tell = M.node$tell, f.tell = M.node$f.tell)
# remove
edmtree.remove.tell('M', c('concept.exp', 'concepts'), f.tell = function(M){
  return(list(ncol(M)))
})
p <- pars(M = M1)
print(p)
Removing a whole node is syntactically simple but requires great care. Removing root parameters such as students or concepts can cause the whole system to crash, since everything is built up from these root parameters. In short, removing higher-level parameters has less impact on the whole structure than removing lower-level ones.
In this first example, let's remove a "data parameter", namely bkt:
bkt.node <- edmtree.remove("bkt")
edmtree.fetch("bkt")
Later on, plugging bkt.node back into the structure is simple, but only because this is a special case where bkt is at the data level. The situation is more complicated when deleting a lower-level node such as concept.exp.
# For the purpose of illustration, let's first recover everything for M
edmtree.replace('M', tell = M.node$tell, f.tell = M.node$f.tell, gen = M.node$gen, f.gen = M.node$f.gen)
concept.exp.node <- edmtree.remove("concept.exp")
get.par('concept.exp', pars(concepts = 3))
As can be seen, removing concept.exp consequently removes several other parts of the whole structure. If you want to revert edmsyn to the state before this removal, you have to plug concept.exp.node back in and manually recover everything else that went along with it.
To do this, for example to add the method (students, concept.exp) back into M, edmtree.add is needed.
edmtree.add
This task is simple as long as you have already done the hard part: defining all four components tell, f.tell, gen, and f.gen of the node. For the node concept.exp from the removal section above, we have luckily saved these components in concept.exp.node. Let's reuse them for a quick illustration:
edmtree.add('concept.exp', tell = concept.exp.node$tell, f.tell = concept.exp.node$f.tell, gen = concept.exp.node$gen, f.gen = concept.exp.node$f.gen)
concept.exp <- get.par('concept.exp', pars(concepts = 3))$value
print(concept.exp)
Now we move on to adding the method (concept.exp, students) to M. Again, we reuse M.node for a quick and clean illustration:
# First let's see what method is being used to generate M
# when students and concept.exp are given
p <- pars(students = 10, concept.exp = concept.exp)
M <- get.par('M', p, progress = TRUE)
# See what we have in M.node
M.node$gen
# Add the new method
edmtree.add.gen('M', gen.method = M.node$gen[[3]], f.gen.method = M.node$f.gen[[3]])
# Generate M again to see the change
M <- get.par('M', p, progress = TRUE)$value
Now that the method (students, concept.exp) is available and clearly has better input utilisation than (students, skill.space, skill.dist) (see The generating criteria section above), edmsyn opts for this method to generate M.
Adding concept.exp to M$tell is also a necessary step in recovering the deleted concept.exp node. For this task, edmtree.add.tell is used.
# For the purpose of illustration, let's say knowing M only tells us about the number of concepts
edmtree.replace('M', tell = 'concepts',
  f.tell = function(M){
    return(list(nrow(M)))
  })
# Now see that currently knowing M tells nothing about the expected mastery rate
# for each concept or the number of students
p <- pars(M = M)
print(p)
# It is time to add the full set of tell to M
edmtree.add.tell('M', c('concept.exp', 'students'),
  function(M){
    return(list(rowMeans(M), ncol(M)))
  })
p <- pars(M = M)
print(p)
# Check if all calculations are correct
identical(dim(M), c(p$concepts, p$students))
identical(p$concept.exp, rowMeans(M))
Let's wrap up this section by going through the process of adding a whole new model to the edmsyn structure. Along the way, a few useful techniques that edmsyn provides will also be introduced, so it is worth paying attention to the details. Note that:
None of the details in this toy model are meant to make sense; all of them are designed purely for illustration.
This part covers only edmtree.add. edmtree.replace and edmtree.remove will not be used, since we are not modifying any of the existing models or parameters.
Below is an outline of what this new model (named toy) adds to edmsyn:
foo
lower.foo, the strict lower bound of foo (i.e. foo must be greater than lower.foo). lower.foo has a default value of 1.
upper.foo, the upper bound of foo (i.e. foo must be less than or equal to upper.foo). The default value of upper.foo is the sum of the default value of min.it.per.tree and concepts.
bar, a matrix with dimensions (foo, concepts) whose entries are real numbers between 0 and 1.
toy, the data node, a list with two components: the first is a matrix obtained by rounding the matrix product of bar and M, the second is the number of concepts.
# foo is a root node, with no default values
edmtree.add('foo', integer = TRUE)
edmtree.add('lower.foo', integer = TRUE,
  tell = 'foo', f.tell = less.strict,
  gen = 'default.vals', f.gen = 1)
less.strict is a special function provided by edmsyn for cases when you want to tell the structure that lower.foo should be strictly less than foo. Alternatively, edmtree.add.tell('foo', tell = 'lower.foo', f.tell = greater.strict) gives the same effect. There are four such special functions recognised by edmsyn: less.equal, less.strict, greater.equal, and greater.strict.
The presence of these four functions highlights the fact that inferring information in edmsyn is not solely about inferring values; it can also mean inferring other aspects of a value, namely its bounds in this case.
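To make the idea of inferring bounds rather than values more concrete, here is a minimal base-R sketch. It is not how edmsyn implements less.strict and its siblings; the helpers bound and respects are hypothetical, and a bound is simply modelled as a comparison operator plus a limit on the target parameter.

```r
# A bound on a target parameter: a comparison operator and a limit.
bound <- function(op, limit) list(op = op, limit = limit)

# TRUE when `value` respects every bound in `bounds`.
respects <- function(value, bounds) {
  all(vapply(bounds, function(b) {
    compare <- match.fun(b$op)  # ">", ">=", "<", "<=" are ordinary R functions
    compare(value, b$limit)
  }, logical(1)))
}

# Knowing lower.foo = 3 and upper.foo = 7 says nothing about the value of foo,
# only that foo > 3 (strict) and foo <= 7 (inclusive).
foo_bounds <- list(bound(">", 3), bound("<=", 7))

respects(4, foo_bounds)  # TRUE
respects(3, foo_bounds)  # FALSE: foo must be strictly greater than lower.foo
respects(8, foo_bounds)  # FALSE: foo must not exceed upper.foo

# Two bounds contradict each other when no candidate satisfies both,
# e.g. lower.foo = 3 together with upper.foo = 2.
bad_bounds <- list(bound(">", 3), bound("<=", 2))
any(vapply(0:10, respects, logical(1), bounds = bad_bounds))  # FALSE
```

Note that the contradiction in the last case only shows up once the two bounds are combined, which is why a conflict of this kind takes some reasoning to detect.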
# Now add upper.foo
edmtree.add('upper.foo', integer = TRUE,
  gen = c('default.vals', 'concepts'),
  f.gen = function(default.vals, concepts){
    return(default.vals$min.it.per.tree + concepts)
  })
# Another use of special bound function
edmtree.add.tell('upper.foo', 'concepts',
  function(upper.foo){
    list(greater.equal(upper.foo - default()$min.it.per.tree))
  })
# Instead of upper.foo telling the bound of foo,
# we will do it in the opposite direction,
# just for the purpose of illustration
edmtree.add.tell('foo', tell = 'upper.foo', f.tell = less.equal)
# Now since lower.foo and upper.foo are both present in the structure
# it's time to add a generating method for foo
edmtree.add.gen('foo', gen.method = c('lower.foo', 'upper.foo'),
  f.gen.method = function(lower.foo, upper.foo){
    sample((lower.foo+1) : upper.foo, 1)
  })
# add bar
edmtree.add('bar', gen = c('foo', 'concepts'),
  f.gen = function(foo, concepts){
    matrix(runif(foo * concepts), foo, concepts)
  })
# dimensions of bar are foo and concepts
edmtree.add.tell('bar', tell = c('foo', 'concepts'),
  f.tell = function(bar){
    list(nrow(bar), ncol(bar))
  })
Note that it is okay not to add the tell component for bar. In fact, it is okay to skip defining tell and f.tell for every parameter; your application will still run just fine as long as the rest is properly designed. However, doing so limits the ability of edmsyn to recognise conflicts (condition 2). For example, pars(bar = matrix(0, 3, 5), concepts = 4) will not raise the conflict between 4 and 5 if bar$tell does not include the inference for concepts. Adding tell and f.tell is good practice if you want more debugging power in a big and complicated application.
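The role of tell in conflict detection can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not edmsyn's actual checking code: tell_bar, tell_bar_partial and find_conflicts are hypothetical helpers. The point is that a conflict can only be noticed for parameters that the tell function actually infers.

```r
# Hypothetical tell function for bar: a matrix reveals its own dimensions.
tell_bar <- function(bar) list(foo = nrow(bar), concepts = ncol(bar))

# Hypothetical weaker version that infers foo only.
tell_bar_partial <- function(bar) list(foo = nrow(bar))

# Compare user-supplied values with inferred ones and
# return the names of the parameters that disagree.
find_conflicts <- function(given, inferred) {
  shared <- intersect(names(given), names(inferred))
  Filter(function(name) !isTRUE(all.equal(given[[name]], inferred[[name]])),
         shared)
}

given <- list(concepts = 4)
bar <- matrix(0, 3, 5)

find_conflicts(given, tell_bar(bar))          # "concepts": 4 given, 5 inferred
find_conflicts(given, tell_bar_partial(bar))  # empty: nothing to compare against
```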
# to make the model a little more sophisticated,
# we add another generating method for bar
edmtree.add.gen('bar', gen.method = c('M', 'foo'),
  f.gen.method = function(M, foo){
    concepts <- nrow(M)
    matrix(runif(foo * concepts), foo, concepts)
  })
# finally, add the data node "toy"
edmtree.add('toy', data = TRUE,
  gen = c('bar','M'),
  f.gen = function(bar, M){
    list(R = round(bar %*% M), concepts = nrow(M))
  },
  tell = c('bar', 'M'),
  f.tell = function(toy){
    # Note that the following learning algorithm makes no sense
    # it is just for the purpose of illustration
    concepts <- toy$concepts
    R <- toy$R
    foo <- nrow(R)
    students <- ncol(R)
    bar <- matrix(runif(foo * concepts), foo, concepts)
    M <- matrix(sample(0:1, concepts * students, TRUE), concepts, students)
    list(bar, M)
  })
# Check if ALL.MODELS includes "toy" (yes it does)
edmconst$ALL.MODELS
So, that is all there is to plugging a new model into edmsyn. The process is simple, and it forces users to think about various aspects of the model along the way. Another benefit is that before adding toy, a whole system of 62 parameters with carefully built and well-tested connections is already there. This makes the work even lighter and saves a lot of time before we move on to testing our model.
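Before testing, it may help to see the chain foo, bar, toy written out in plain R, independent of edmsyn. The sketch below follows the generating functions defined above; the concrete values for concepts, students, lower.foo and upper.foo, as well as the randomly drawn M, are arbitrary assumptions made for this illustration.

```r
set.seed(1)  # only to make this sketch reproducible

# Assumed inputs for illustration: 3 concepts, 6 students, a random 0/1 M
concepts <- 3
students <- 6
M <- matrix(sample(0:1, concepts * students, replace = TRUE), concepts, students)

lower.foo <- 1
upper.foo <- 7  # arbitrary value chosen for this sketch

# foo: an integer strictly above lower.foo and at most upper.foo
candidates <- (lower.foo + 1):upper.foo
foo <- candidates[sample.int(length(candidates), 1)]

# bar: a foo-by-concepts matrix with entries between 0 and 1
bar <- matrix(runif(foo * concepts), foo, concepts)

# toy: the rounded product of bar and M, plus the number of concepts
toy <- list(R = round(bar %*% M), concepts = nrow(M))

dim(toy$R)  # foo rows, one column per student
```

Indexing into candidates with sample.int is a deliberate choice in this sketch: when given a single number greater than 1, sample() draws from 1 up to that number instead of returning the number itself, which matters when lower.foo + 1 equals upper.foo.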
# 1. Test the bounds
p <- pars(lower.foo = 3, foo = 3)
p <- pars(foo = 4, upper.foo = 3)
# This one requires reasoning to detect
# Thus error is not raised immediately
p <- pars(lower.foo = 3, upper.foo = 2)
# But nevertheless, when p goes into use, it immediately fails
get.par('foo', p)
p <- pars(upper.foo = 5, concepts = 5)
# 2. Test foo$gen
get.par('foo', pars())
p <- pars(p, upper.foo = 15)
get.par('foo', p)
p <- pars(concepts = 5)
get.par('foo', p, progress = TRUE)
p <- pars(M = M)
p <- get.par('foo', p, progress = TRUE)
print(p)
# 3. Test bar
get.par('bar', pars())
get.par('bar', pars(upper.foo = 15))
get.par('bar', pars(lower.foo = 3, concepts = 5), progress = TRUE)
get.par('bar', pars(M = M), progress = TRUE)
# 4. Test data
toys <- gen('toy', pars(M = M, bar = matrix(0, 3, 5)))
toys <- gen('toy', pars(M = M, bar = matrix(1, 3, 3)), n = 2, progress = TRUE)
toys <- gen('toy', pars(students = 20, concepts = 4), n = 3, progress = TRUE)
toys <- gen('toy', pars(M = M), n = 3, progress = TRUE)
toys.syn <- syn('toy', toys[[2]]$toy,
  keep.pars = c("foo","concept.exp"),
  students = 12, n = 3, progress = TRUE)
Now that you have everything needed to fully manipulate the internal structure of edmsyn, it is time to move on to working with the whole structure (as opposed to working with individual nodes and edges as before). The first thing to know is that each time library(edmsyn) is executed, the original edmsyn graphical structure (without foo, bar, toy, etc.) is restored and everything we have built so far is lost. Thus, occasionally saving the modified structure is important.
toy.save <- edmtree.dump()
Now consider a different situation where no part of the current structure is needed and you want to build a whole new structure from scratch. In this case, the first thing to do is to clear out all nodes:
edmtree.clear()
edmtree.fetch('toy')
edmtree.fetch('M')
Lastly, you can always restore a saved structure
edmtree.load(toy.save)
gen('toy', pars(M = M))
gen('bkt', pars(students = 15, items = 20))
# Leave the argument part empty
# if you want the original edmsyn structure
edmtree.load()
gen('toy', pars(M = M))
gen('bkt', pars(students = 15, items = 20, concepts = 4, time = 10))