Review of Byron's paper
Note: I have marked up the paper on my iPad, but I've written corresponding longer-form notes here.
<< BYRON'S COMMENTS APPEAR LIKE THIS >>
I feel that the main idea of the paper starts in the third paragraph; I wonder if it might be possible to start there instead? The first two paragraphs provide a great introduction to cross-validation, but they initially set me up to think the paper is about cross-validation rather than missing data. I guess in some ways it is about cross-validation as well, so perhaps that is warranted.
<< This is a good point. I have trouble going straight to the punch line because I feel it is important to define CV and missing data imputation before talking about doing imputation before CV. I would like to proceed with the original paragraph order for now, but we can talk more about it if you'd like. >>
It seems to me that the large benefit of this new approach is that you get essentially the same results in a fraction of the computational time. Do you think the paragraph before the aims could outline this potential benefit, to fortify the motivation for the paper?
<>
I am not clear on what the overall goal/aim/objectives of the paper are. Can I suggest something like the following:
The overall aim of this paper is to understand whether performing unsupervised imputation before or during CV impacts model pipeline error. The two conditions are defined as follows: unsupervised imputation applied to the training data before CV is denoted "I → CV"; unsupervised imputation applied during each replicate of CV is denoted "CV ↻ I". To achieve this aim, we conduct XX experiments using simulated data and XX experiments using real data.
<< Wonderful. I incorporated this in the second to last paragraph of the intro. >>
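For concreteness, here is a minimal sketch of the two conditions in R. Mean imputation stands in for the paper's unsupervised imputation methods, and the data, fold count, and lm() model are illustrative placeholders, not the simulation's actual settings:

```r
## Sketch of I -> CV versus CV -> I (mean imputation as a stand-in).
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- dat$x1 + rnorm(n)
dat$x1[sample(n, 40)] <- NA                   # inject missing values

impute_mean <- function(train, test) {
  mu <- sapply(train[c("x1", "x2")], mean, na.rm = TRUE)
  for (j in names(mu)) {
    train[[j]][is.na(train[[j]])] <- mu[[j]]
    test[[j]][is.na(test[[j]])]   <- mu[[j]]  # assessment set gets analysis-set means
  }
  list(train = train, test = test)
}

v <- 10
fold <- sample(rep(seq_len(v), length.out = n))

## I -> CV: impute the full training data once, then cross-validate
pre <- impute_mean(dat, dat)$train
mse_icv <- sapply(seq_len(v), function(k) {
  fit <- lm(y ~ x1 + x2, data = pre[fold != k, ])
  mean((pre$y[fold == k] - predict(fit, pre[fold == k, ]))^2)
})

## CV -> I: re-impute within each CV replicate
mse_cvi <- sapply(seq_len(v), function(k) {
  sets <- impute_mean(dat[fold != k, ], dat[fold == k, ])
  fit  <- lm(y ~ x1 + x2, data = sets$train)
  mean((sets$test$y - predict(fit, sets$test))^2)
})

c(icv = mean(mse_icv), cvi = mean(mse_cvi))
```

Note that the imputation work in the CV -> I branch is repeated v times, which also previews the roughly factor-of-v computation difference discussed later.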
I'm not sure that your crowding example is MCAR; I would consider it MAR, since crowding is something observed that could be related to the missing data. For example, there could be more missing data for COVID tests when there is more community transmission (since with high community transmission, more people are getting tests). Perhaps suggest random clerical data entry error instead?
<< Good catch - I updated the MCAR example >>
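If it helps the revision, the MCAR/MAR distinction can be demonstrated directly with mice::ampute(); the data and missingness proportion below are placeholders:

```r
## MCAR vs. MAR missingness generated with mice::ampute()
library(mice)

set.seed(2)
full <- data.frame(x1 = rnorm(500), x2 = rnorm(500), y = rnorm(500))

## MCAR: cells go missing with probability unrelated to any variable
amp_mcar <- ampute(full, prop = 0.3, mech = "MCAR")$amp

## MAR: missingness probability depends on observed values of other variables
amp_mar <- ampute(full, prop = 0.3, mech = "MAR")$amp

colMeans(is.na(amp_mcar))  # per-variable missingness rates
colMeans(is.na(amp_mar))
```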
I've suggested defining Type I error, but that probably isn't needed; just a thought. We might get machine learning folks reading this paper, and my understanding is that they don't know what that means. Since Type I error is only described once in the paper, perhaps we could say "false positive" or "falsely rejecting a true null hypothesis" instead.
<< Not hard to include this at all, it's been added in parentheses >>
I've also suggested citing Max Kuhn's book: https://bookdown.org/max/FES/handling-missing-data.html. I thought this book might also come in handy when explaining the idea of the "analysis" and "assessment" sets from Max Kuhn... but I can't seem to find his explanation anywhere? Hmmmm.
<< A great book - it is now cited to back up the statement that most applied ML analyses use single imputation methods >>
I don't think "analysis set" and "assessment set" have been defined yet.
<< Ah, we wouldn't want that - here is what I have now: We refer to the $k-1$ folds and the remaining $k$th fold used to internally train and test a modeling algorithm as the analysis and assessment sets, respectively, to avoid abuse of notation. >>
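As an aside, this is also the vocabulary used by the rsample package, so a reader could reproduce the terminology in code (a sketch, assuming rsample is available; mtcars is just a placeholder dataset):

```r
## 'analysis' and 'assessment' sets as implemented in rsample
library(rsample)

folds <- vfold_cv(mtcars, v = 5)  # 5-fold CV splits
split_1 <- folds$splits[[1]]

nrow(analysis(split_1))    # the k-1 folds used to internally train
nrow(assessment(split_1))  # the remaining fold used to internally test
```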
<< It was half-heartedly so. Here is what I have now: The 'oracle' performance is the performance of an 'oracle' model whose specification coincides with the specification of the process that generated the data the model is fitted to. >>
<< I like this idea. Can we make this part of the README on GitHub? >>
<< Yes, good idea. Here is what I have now: In practice, $\textbf{X}$ often has some 'junk' variables that are not related to the outcome. To make our simulations similar to applied settings, we generated normally distributed variables that had no relation to the simulated outcome. >>
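A short sketch of that step (the dimensions and column names are hypothetical):

```r
## Append normally distributed 'junk' predictors unrelated to the outcome
set.seed(3)
X <- matrix(rnorm(100 * 3), ncol = 3,
            dimnames = list(NULL, paste0("signal_", 1:3)))
junk <- matrix(rnorm(100 * 5), ncol = 5,
               dimnames = list(NULL, paste0("junk_", 1:5)))
X <- cbind(X, junk)
colnames(X)
```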
<< It is a rare hiccup related to the ampute() function in mice. If the ampute() function can't determine a set of weights for MAR amputation, it fails and the simulation cannot proceed. The error only comes up with small sample sizes and when MAR amputation is needed. I appreciate that you appreciate this! =] >>
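In case it's useful, one defensive pattern is to catch the error and skip the replicate; whether this matches what the simulation code actually does is an assumption on my part:

```r
## Skip replicates where ampute() cannot determine MAR weights
library(mice)

set.seed(4)
results <- list()
for (r in 1:100) {
  full <- data.frame(x1 = rnorm(25), x2 = rnorm(25))  # small n, where failures occur
  amp <- tryCatch(
    ampute(full, prop = 0.3, mech = "MAR")$amp,
    error = function(e) NULL
  )
  if (is.null(amp)) next  # ampute() failed; move on to the next replicate
  results[[length(results) + 1]] <- amp
}
length(results)  # replicates that proceeded
```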
<< Absolutely. I added an in-line summary of the time used to create imputed data and also added a figure to the Ames analysis (see Figure 5). >>
<< Yes, that would be a more interesting hypothesis to test. I am not sure how to set up the equivalence test based on the 5,000 simulation replicates. Is it just as simple as making an empirical confidence interval and showing that both upper and lower bounds fall within a pre-defined band of equivalence? Let's discuss. >>
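To seed that discussion, here is one reading of the idea, in the spirit of two one-sided tests (TOST); the difference vector and the equivalence margin are placeholders:

```r
## Equivalence check from simulation replicates (sketch).
## `diffs` would hold the per-replicate difference in error
## (I -> CV minus CV -> I); here it is simulated as a placeholder,
## and `delta` is a hypothetical pre-defined equivalence margin.
set.seed(5)
diffs <- rnorm(5000, mean = 0.001, sd = 0.05)
delta <- 0.01

## A 90% interval corresponds to TOST at the 5% level
ci <- quantile(diffs, probs = c(0.05, 0.95))
equivalent <- unname(ci[1] > -delta & ci[2] < delta)
equivalent

## Note: quantile() summarizes the replicate distribution; a confidence
## interval for the *mean* difference (mean(diffs) +/- 1.645 * sd(diffs) /
## sqrt(length(diffs))) is another option -- which one matches the
## hypothesis is exactly what we should discuss.
```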
<< I agree with this; it needs to be highlighted that this can save lots of time. I added a sentence to the discussion: 'Throughout our analysis, \icv\space required less computation time than \cvi\space by a factor of roughly $v$, the number of folds employed by CV.' >>
<< I really like the three-stage approach to figure interpretation. For Figure 1, the 'overview' should be 'this is the standard workflow to develop and validate a modeling pipeline'. The description is hopefully clear from the caption and annotations in the graph. As far as 'what you learn', I think the reader may already be familiar with the workflow, but they will understand that this workflow is what we will be studying in the current analysis. Would you like to discuss this more when we talk? We may be able to tweak Figure 1 so that it is more aligned with what I wrote here. >>
Table 1
Figure 4
<< I love the figure too and have included grid lines. Some journals allow you to specify a 'central' illustration, and we will definitely go with Figure 4 if that is something SIM wants. >>
mice, and randomForest in the paper, I've marked these.