R/data.R
In reproducer: Reproduce Statistical Analyses and Meta-Analyses

#' Madeyski15SQJ.NDC data
#'
#' If you use this data set please cite: Lech Madeyski and Marian Jureczko, 'Which Process Metrics Can Significantly Improve Defect Prediction Models? An Empirical Study,' Software Quality Journal, vol. 23, no. 3, pp.393-422, 2015. DOI: 10.1007/s11219-014-9241-7
#'
#' 'This paper presents an empirical evaluation in which several process metrics were investigated in order to identify the ones which significantly improve the defect prediction models based on product metrics. Data from a wide range of software projects (both, industrial and open source) were collected. The predictions of the models that use only product metrics (simple models) were compared with the predictions of the models which used product metrics, as well as one of the process metrics under scrutiny (advanced models). To decide whether the improvements were significant or not, statistical tests were performed and effect sizes were calculated. The advanced defect prediction models trained on a data set containing product metrics and additionally Number of Distinct Committers (NDC) were significantly better than the simple models without NDC, while the effect size was medium and the probability of superiority (PS) of the advanced models over simple ones was high (p=.016, r=-.29, PS=.76), which is a substantial finding useful in defect prediction. A similar result with slightly smaller PS was achieved by the advanced models trained on a data set containing product metrics and additionally all of the investigated process metrics (p=.038, r=-.29, PS=.68). The advanced models trained on a data set containing product metrics and additionally Number of Modified Lines (NML) were significantly better than the simple models without NML, but the effect size was small (p=.038, r=.06). Hence, it is reasonable to recommend the NDC process metric in building the defect prediction models.' [https://dx.doi.org/10.1007/s11219-014-9241-7]
#'
#'
#' @format A data frame with variables:
#' \describe{
#' \item{Project}{In case of open source projects this field includes the name of the project as well as its version. In case of industrial projects this field includes the string 'proprietary' (we were not allowed to disclose the names of the analyzed industrial software projects developed by Capgemini Polska).}
#' \item{simple}{The percentage of classes that must be tested in order to find 80\% of defects in case of simple defect prediction models, i.e., using only software product metrics as predictors.}
#' \item{advanced}{The percentage of classes that must be tested in order to find 80\% of defects in case of advanced defect prediction models, using not only software product metrics but also the NDC (Number of distinct committers) process metric.}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' Madeyski15SQJ.NDC
#'
"Madeyski15SQJ.NDC"

#' Madeyski15EISEJ.OpenProjects data
#'
#' If you use this data set please cite: Marian Jureczko and Lech Madeyski, 'Cross-project defect prediction with respect to code ownership model: An empirical study', e-Informatica Software Engineering Journal, vol. 9, no. 1, pp. 21-35, 2015. DOI: 10.5277/e-Inf150102 (https://dx.doi.org/10.5277/e-Inf150102) URL: https://madeyski.e-informatyka.pl/download/JureczkoMadeyski15.pdf)
#'
#' This paper presents an analysis of 84 versions of industrial, open-source and academic projects. We have empirically evaluated whether those project types constitute separate classes of projects with regard to defect prediction. The predictions obtained from the models trained on the data from the open source projects were compared with the predictions from the other models (built on proprietary, i.e. industrial, student, open source, and not open source projects).
#'
#'
#' @format A data frame with variables:
#' \describe{
#' \item{PROP}{The percentage of classes of proprietary (i.e., industrial) projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on open source projects.}
#' \item{NOTOPEN}{The percentage of classes of projects which are not open source projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on open source projects.}
#' \item{STUD}{The percentage of classes of student (i.e., academic) projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on open source projects.}
#' \item{OPEN}{The percentage of classes of open source projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on open source projects.}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' Madeyski15EISEJ.OpenProjects
#'
"Madeyski15EISEJ.OpenProjects"

#' Madeyski15EISEJ.PropProjects data
#'
#' If you use this data set please cite: Marian Jureczko and Lech Madeyski, 'Cross-project defect prediction with respect to code ownership model: An empirical study', e-Informatica Software Engineering Journal, vol. 9, no. 1, pp. 21-35, 2015. DOI: 10.5277/e-Inf150102 (https://dx.doi.org/10.5277/e-Inf150102) URL: https://madeyski.e-informatyka.pl/download/JureczkoMadeyski15.pdf)
#'
#' @format A data frame with variables:
#' \describe{
#' \item{NOTPROP}{The percentage of classes of non-proprietary (i.e., non-industrial) projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on proprietary (i.e., industrial) projects.}
#' \item{OPEN}{The percentage of classes of open source projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on proprietary (i.e., industrial) projects.}
#' \item{STUD}{The percentage of classes of student (i.e., academic) projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on proprietary (i.e., industrial) projects.}
#' \item{PROP}{The percentage of classes of proprietary (i.e., industrial) projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on proprietary (i.e., industrial) projects.}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' Madeyski15EISEJ.PropProjects
#'
"Madeyski15EISEJ.PropProjects"


#' Madeyski15EISEJ.StudProjects data
#'
#' If you use this data set please cite: Marian Jureczko and Lech Madeyski, 'Cross-project defect prediction with respect to code ownership model: An empirical study', e-Informatica Software Engineering Journal, vol. 9, no. 1, pp. 21-35, 2015. DOI: 10.5277/e-Inf150102 (https://dx.doi.org/10.5277/e-Inf150102) URL: https://madeyski.e-informatyka.pl/download/JureczkoMadeyski15.pdf)
#'
#' @format A data frame with variables:
#' \describe{
#' \item{PROP}{The percentage of classes of proprietary (i.e., industrial) projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on student (i.e., academic) projects.}
#' \item{NOTSTUD}{The percentage of classes of projects which are not student projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on student (i.e., academic) projects.}
#' \item{STUD}{The percentage of classes of student (i.e., academic) projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on student (i.e., academic) projects.}
#' \item{OPEN}{The percentage of classes of open source projects that must be tested in order to find 80\% of defects in case of software defect prediction models built on student (i.e., academic) projects.}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' Madeyski15EISEJ.StudProjects
#'
"Madeyski15EISEJ.StudProjects"

#' Ciolkowski09ESEM.MetaAnalysis.PBRvsCBRorAR data
#'
#' Data form a set of primary studies on reading methods for software inspections. They were reported and analysed by M. Ciolkowski ('What do we know about perspective-based reading? an approach for quantitative aggregation in software engineering', in Proceedings of the 3rd International Symposium on Empirical Software Engineering and Measurement, ESEM'09, pp. 133-144, IEEE Computer Society, 2009), corrected and re-analysed by Madeyski and Kitchenham ('How variations in experimental designs impact the construction of comparable effect sizes for meta-analysis' (to be submitted)).
#'
#' If you use this data set please cite: Lech Madeyski and Barbara Kitchenham, 'How variations in experimental designs impact the construction of comparable effect sizes for meta-analysis', 2015.
#'
#' @format A data frame with 21 rows and 7 variables:
#' \describe{
#' \item{Study}{Name of empirical study}
#' \item{Ref.}{Reference to the paper reporting primary study or experimental run where data were originally reported}
#' \item{Control}{Control treatment: Check-Based Reading (CBR) or Ad-hoc Reading (AR)}
#' \item{Within-subjects}{Yes - if the primary study used the within-subjects experimental design, No - if the primary study did not use the within-subjects experimental design}
#' \item{Cross-over}{Yes - if the primary study used the cross-over experimental design, No - if the primary study did not use the cross-over experimental design}
#' \item{d_ByCiolkowski}{d effect size calculated by Ciolkowski}
#' \item{d_ByOriginalAuthors}{d effect size as reported by the original authors}
#' }
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' Ciolkowski09ESEM.MetaAnalysis.PBRvsCBRorAR
#'
"Ciolkowski09ESEM.MetaAnalysis.PBRvsCBRorAR"


#' MadeyskiKitchenham.MetaAnalysis.PBRvsCBRorAR data
#'
#' Data form a set of primary studies on reading methods for software inspections. They were analysed by Lech Madeyski and Barbara Kitchenham, 'How variations in experimental designs impact the construction of comparable effect sizes for meta-analysis', 2015.
#'
#' If you use this data set please cite: Lech Madeyski and Barbara Kitchenham, 'How variations in experimental designs impact the construction of comparable effect sizes for meta-analysis', 2015.
#'
#' @format A data frame with 17 rows and 26 variables:
#' \describe{
#' \item{Study}{Name of empirical study}
#' \item{Ref.}{Reference to the paper reporting primary study or experimental run where data were originally reported}
#' \item{Teams}{The number of teams including both, PBR and Control teams}
#' \item{DesignDesc}{Experimental design description: Before-after, Between-groups, Cross-over}
#' \item{ExpDesign}{Experimental design: between-groups (BG), within-subjects cross-over (WSCO), within-subjects before-after (WSBA)}
#' \item{M_PBR}{The average proportion of defects found by teams using PBR}
#' \item{M_C}{The average proportion of defects found by teams using Control treatment: Check-Based Reading (CBR) or Ad-Hoc Reading (AR)}
#' \item{Diff}{The difference between M_PBR and M_C, i.e. Diff = M_PBR - M_C}
#' \item{Inc}{The percentage increase in defect rate detection, i.e. Inc=100*[(M_PBR-M_C)/M_C]}
#' \item{SD_C_ByAuthors}{The standard deviation of the control group values reported by the original Authors, i.e., obtained from the papers/raw data}
#' \item{SD_C}{The standard deviation of the control group values equals SD_C_ByAuthors for studies for which the data was available OR the weighted average of SD_C_ByAuthors (i.e., 0.169) for studies where SD_C_ByAuthors is missing.}
#' \item{V_C}{The variance of the Control group observations, i.e., the variance obtained from the teams using the Control method V_C=SD_C^2}
#' \item{V_D}{The variance of the unstandardized mean difference D (between the mean value for the treatment group and the mean value for the Control group)}
#' \item{SD_C_Alt}{This is the equivalent of SD_C (the standard deviation of the control group) based on a different variance for the student studies or the practitioner studies depending on the subject type of the study with the missing value.}
#' \item{V_Alt}{The variance of the mean difference in the meta-analysis based on SD_C_Alt}
#' \item{SS_C}{The sum of squares of the Control group values. For within subjects studies SS=V_C*(n-1). For between subjects studies SS=V_C*(n_C-1)}
#' \item{n_PBR}{The number of PBR teams}
#' \item{n_C}{The number of Control (CBR or AR) teams}
#' \item{ControlType}{Type of Control treatment: CRB or AR}
#' \item{ParticipantsType}{Type of participants: Engineers or Students}
#' \item{TeamType}{Type of team: Nominal or Real}
#' \item{TwoPersonTeamVsLargerTeam}{Reflects size of the teams: 2-PersonTeam or LargerTeam}
#' \item{ArtefactType}{The type of artefact: Requirements or Other}
#' \item{AssociatedWithBasili}{Whether study is associated with Basili (the forerunner): Yes or No}
#' \item{ControlType_Basili}{Combined ControlType and AssociatedWithBasili: AH_AssociatedWithBasili, CBR_AssociatedWithBasili, CBR_NotAssociatedWithBasili}
#' }
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' MadeyskiKitchenham.MetaAnalysis.PBRvsCBRorAR
#'
"MadeyskiKitchenham.MetaAnalysis.PBRvsCBRorAR"


#' KitchenhamMadeyskiBudgen16.FINNISH data
#'
#' If you use this data set please cite this R package and the following paper when accepted: Barbara Kitchenham, Lech Madeyski, David Budgen, Jacky Keung, Pearl Brereton, Stuart Charters, Shirley Gibbs, and Amnart Pohthong, 'Robust Statistical Methods for Empirical Software Engineering', Empirical Software Engineering, vol. 22, no.2, p. 579-630, 2017. DOI: 10.1007/s10664-016-9437-5 (https://dx.doi.org/10.1007/s10664-016-9437-5), URL: https://madeyski.e-informatyka.pl/download/KitchenhamMadeyskiESE.pdf
#'
#' Data set collected from 9 Finish companies by Mr Hanna M\'aki from the TIEKE organisation see Barbara Kitchenham and Kari Kansala, Inter-item correlations among function points, Proceedings ICSE 15, 1983, pp 477-480
#'
#' @format A data frame with variables:
#' \describe{
#' \item{Project}{Project ID}
#' \item{DevEffort}{Development Effort measured in hours}
#' \item{UserEffort}{Effort provided by the customer/user organisation measured in hours}
#' \item{Duration}{Project duration measured in months}
#' \item{HWType}{A categorical variable defining the hardware type}
#' \item{AppType}{A categorical variable defining the application type}
#' \item{FP}{Function Points measured using the TIEKE organisation method}
#' \item{Co}{A categorical variable defining the company}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBudgen16.FINNISH
#'
"KitchenhamMadeyskiBudgen16.FINNISH"

#' KitchenhamMadeyskiBudgen16.PolishSubjects data
#'
#' If you use this data set please cite this R package and the following paper when accepted: Barbara Kitchenham, Lech Madeyski, David Budgen, Jacky Keung, Pearl Brereton, Stuart Charters, Shirley Gibbs, and Amnart Pohthong, 'Robust Statistical Methods for Empirical Software Engineering', Empirical Software Engineering, vol. 22, no.2, p. 579-630, 2017. DOI: 10.1007/s10664-016-9437-5 (https://dx.doi.org/10.1007/s10664-016-9437-5), URL: https://madeyski.e-informatyka.pl/download/KitchenhamMadeyskiESE.pdf
#'
#' Data set collected at Wroclaw University of Technology (POLAND) by Lech Madeyski includes separate entries for each abstract assessed by a judge, that is 4 entries for each judge. Data collected from 16 subjects recruited from Wroclaw  University of Technology who were each asked to assess 4 abstracts.
#'
#' Note Only completeness question 2 was expected to be context dependent and have a NA (not applicable)  answer, if other completeness answers were left blank, BAK coded the answer as NA
#'
#' polishsubjects.txt
#' @format A data frame with variables:
#' \describe{
#' \item{Judge}{The identifier for each subject}
#' \item{Abstract}{The identifier for each abstract - the code starts with a three alphanumeric string that defines the source of the abstract}
#' \item{OrderViewed}{Each judge assessed 4 abstracts in sequence, this data item identifies the order in which the subject viewed the specified abstract}
#' \item{Completness1}{Assessment by judge of question 1:Is the reason for the project clear? Can take values:  Yes/No/Partly}
#' \item{Completness2}{Assessment by judge of question 2: Is the specific aim/purpose of the study clear? Can take values:  Yes/No/Partly}
#' \item{Completness3}{Assessment by judge of question 3: If the aim is to describe a new or enhanced software technology (e.g. method, tool, procedure or process) is the method used to develop this technology defined? Can take values:  Yes/No/Partly/NA}
#' \item{Completness4}{Assessment by judge of question 4: Is the form (e.g. experiment, general empirical study, data mining, case study, survey, simulation etc.) that was used to evaluate the technology made clear? Can take values:  Yes/No/Partly}
#' \item{Completness5}{Assessment by judge of question 5: Is there a description of how the evaluation process was organised? Can take values:  Yes/No/Partly}
#' \item{Completness6}{Assessment by judge of question 6: Are the results of the evaluation clearly described? Can take values:  Yes/No/Partly}
#' \item{Completness7}{Assessment by judge of question 7: Are any limitations of the study reported?:  Yes/No/Partly}
#' \item{Completness8}{Assessment by judge of question 8: Are any ideas for future research presented?:  Yes/No/Partly}
#' \item{Clarity}{Assessment by judge of question regarding the overall understandability of the abstract: Please give an assessment of the clarity of this abstract by circling a number on the scale of 1-10 below, where a value of 1 represents Very Obscure and 10 represents Extremely Clearly Written.}
#' \item{Completness1NumValue}{A numerical value for completeness question 1 where 0=No, Partly=0.5, yes =1}
#' \item{Completness2NumValue}{A numerical value for completeness question 2 where 0=No, Partly=0.5, yes =1, NA means not applicable}
#' \item{Completness3NumValue}{A numerical value for completeness question 3 where 0=No, Partly=0.5, yes =1, NA means not applicable or not answered}
#' \item{Completness4NumValue}{A numerical value for completeness question 4 where 0=No, Partly=0.5, yes =1, NA means not applicable}
#' \item{Completness5NumValue}{A numerical value for completeness question 5 where 0=No, Partly=0.5, yes =1, NA means not applicable}
#' \item{Completness6NumValue}{A numerical value for completeness question 6 where 0=No, Partly=0.5, yes =1, NA means not applicable}
#' \item{Completness7NumValue}{A numerical value for completeness question 7 where 0=No, Partly=0.5, yes =1, NA means not applicable}
#' \item{Completness8NumValue}{A numerical value for completeness question 8 where 0=No, Partly=0.5, yes =1, NA means not applicable}
#' \item{Sum}{The sum of the numerical completeness questions excluding those labelled NA}
#' \item{TotalQuestions}{The count of the number of question related to completeness excluding questions considered not applicable }
#' \item{Completeness}{Sum/TotalQuestions}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBudgen16.PolishSubjects
#'
"KitchenhamMadeyskiBudgen16.PolishSubjects"


#' KitchenhamMadeyskiBudgen16.SubjectData
#'
#' If you use this data set please cite this R package and the following paper when accepted: Barbara Kitchenham, Lech Madeyski, David Budgen, Jacky Keung, Pearl Brereton, Stuart Charters, Shirley Gibbs, and Amnart Pohthong, 'Robust Statistical Methods for Empirical Software Engineering', Empirical Software Engineering, vol. 22, no. 2, pp. 579–630, 2017. DOI: 10.1007/s10664-016-9437-5 (https://dx.doi.org/10.1007/s10664-016-9437-5), URL: https://madeyski.e-informatyka.pl/download/KitchenhamMadeyskiESE.pdf
#'
#' Data set collected from 16 judges assessing 4 abstracts at 6 sites: Lincoln University NZ=1, Hong Kong Polytechnic University=2, PSu Thailand=3, Durham=4, Keele=5, Hong Kong City University=6
#'
#' subjectdata.txt: Judge Institution JudgeID age eng1st years.study abs.read Absid Treat TreatID Order Com.1 Com.2 Com.3 Com.4 Com.5 Com.6
#' Com.7 Com.8 Clarity num.questions total.score av.score Site
#' @format A data frame with variables:
#' \describe{
#' \item{Judge}{Alphanumeric identifier for each judge}
#' \item{Institution}{Numerical value identifying each site from which data was collected}
#' \item{JudgeID}{Numerical value identifying each judge}
#' \item{Age}{Age of the judge in years}
#' \item{Eng1st}{Whether the judge's first language was Enlish: Yes/No}
#' \item{YearsStudy}{The number of years have student been studying computing at University: 1, 2, 3, 4}
#' \item{AbstractsRead}{Number of abstracts the judge had read prior to the study' 0, 1 to 10, 10+}
#' \item{AbstractsWritten}{Whether the judge had ever written an abstract for a scientific report/article}
#' \item{AbstractID}{Alphanumeric identifier for an abstract. The first character identifies the journal, I=IST, J=JSS, the third digit identifies the time period as 1 or 2, the remaining digits identify the abstract number within the set of abstracts found for the specified journal and time period}
#' \item{Treat}{The initial 3 characters of AbstractID}
#' \item{TreatID}{A numeric identifier for the journal and time period, 1=IB1, 2=IB2, 3=JB1, 4=JB2}
#' \item{Order}{The order in which the judge should have viewed the specified abstract}
#' \item{Completness1NumValue}{The numeric answer to completeness question 1}
#' \item{Completness2NumValue}{The numeric answer to completeness question 2}
#' \item{Completness3NumValue}{The numeric answer to completeness question 3}
#' \item{Completness4NumValue}{The numeric answer to completeness question 4}
#' \item{Completness5NumValue}{The numeric answer to completeness question 5}
#' \item{Completness6NumValue}{The numeric answer to completeness question 6}
#' \item{Completness7NumValue}{The numeric answer to completeness question 7}
#' \item{Completness8NumValue}{The numeric answer to completeness question 8}
#' \item{Clarity}{The response to the clarity question or NA if not answered}
#' \item{NumberOfAnsweredCompletnessQuestions}{The number of completeness questions excluding those with NA}
#' \item{TotalScore}{Sum of the numeric values of the 8 completeness questions}
#' \item{MeanScore}{Sum of the completeness questions 1 to 8 divided by TotalScore}
#' \item{Site}{The name of the site which provided the data. HongKong refers to the Polytechnic University, HongKong.2 refers to the City University}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBudgen16.SubjectData
#'
"KitchenhamMadeyskiBudgen16.SubjectData"


#' KitchenhamMadeyskiBudgen16.PolishData data
#'
#' If you use this data set please cite this R package and the following paper when accepted: Barbara Kitchenham, Lech Madeyski, David Budgen, Jacky Keung, Pearl Brereton, Stuart Charters, Shirley Gibbs, and Amnart Pohthong, 'Robust Statistical Methods for Empirical Software Engineering', Empirical Software Engineering, vol. 22, no.2, p. 579-630, 2017. DOI: 10.1007/s10664-016-9437-5 (https://dx.doi.org/10.1007/s10664-016-9437-5), URL: https://madeyski.e-informatyka.pl/download/KitchenhamMadeyskiESE.pdf
#'
#' Data set derived from PolishSubjects data set collected at Wroclaw University. It summarizes the completeness and clarity data collected from 4 judges about the same abstract.
#'
#' PolishData.txt
#' @format A data frame with variables:
#' \describe{
#' \item{Abstract}{The abstract identifier}
#' \item{Site}{Numeric identifier for the site}
#' \item{Treatment}{The first three characters of the Abstract field which identifies the journal and time period of the abstract}
#' \item{Journal}{An acronym for the journal from which the abstract was obtained: IST or JSS}
#' \item{Timeperiod}{The Time period in which the abstract was found: 1 or 2}
#' \item{J1}{The identifier for the judge who made the next 2 assessments}
#' \item{J1Completeness}{The average completeness made by judge J1 based on the 8 completeness questions}
#' \item{J1Clarity}{The clarity assessment made by judge J1}
#' \item{J2}{The identifier for the judge who made the next 2 assessments}
#' \item{J2Completeness}{The average completeness made by judge J2 based on the 8 completeness questions}
#' \item{J2Clarity}{The clarity assessment made by judge J2}
#' \item{J3}{The identifier for the judge who made the next 2 assessments}
#' \item{J3Completeness}{The average completeness made by judge J3 based on the 8 completeness questions}
#' \item{J3Clarity}{The clarity assessment made by judge J3}
#' \item{J4}{The identifier for the judge who made the next 2 assessments}
#' \item{J4Completeness}{The average completeness made by judge J4 based on the 8 completeness questions}
#' \item{J4Clarity}{The clarity assessment made by judge J4}
#' \item{MedianCompleteness}{The median of J1Completeness, J2Completeness, J3Completeness, J4Completeness}
#' \item{MedianClarity}{The median of J1Clarity, J2Clarity, J3Clarity, J4Clarity}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBudgen16.PolishData
#'
"KitchenhamMadeyskiBudgen16.PolishData"


#' KitchenhamMadeyskiBudgen16.DiffInDiffData data
#'
#' If you use this data set please cite this R package and the following paper when accepted: Barbara Kitchenham, Lech Madeyski, David Budgen, Jacky Keung, Pearl Brereton, Stuart Charters, Shirley Gibbs, and Amnart Pohthong, 'Robust Statistical Methods for Empirical Software Engineering', Empirical Software Engineering, vol. 22, no.2, p. 579-630, 2017. DOI: 10.1007/s10664-016-9437-5 (https://dx.doi.org/10.1007/s10664-016-9437-5), URL: https://madeyski.e-informatyka.pl/download/KitchenhamMadeyskiESE.pdf
#'
#' Data set was derived from the data reported in the SubjectData data set (subjectdata.txt). It contains the summary completeness and clarity data from 4 judges who assessed the same abstract. Only the initial 5 sites are included.
#'
#' dinddata.txt
#' @format A data frame with variables:
#' \describe{
#' \item{Abstract}{The abstract identifier}
#' \item{Site}{A numeric identifier of the site}
#' \item{Treatment}{A three character alphanumeric identifying the journal and time period of the abstract}
#' \item{Journal}{The journal in which the abstract was published: IST or JSS}
#' \item{Timeperiod}{The time period in which the abstract: 1 or 2}
#' \item{J1}{The identifier for the judge who made the next 2 assessments}
#' \item{J1Completeness}{The average completeness made by judge J1 based on the 8 completeness questions}
#' \item{J1Clarity}{The clarity assessment made by judge J1}
#' \item{J2}{The identifier for the judge who made the next 2 assessments}
#' \item{J2Completeness}{The average completeness made by judge J2 based on the 8 completeness questions}
#' \item{J2Clarity}{The clarity assessment made by judge J2}
#' \item{J3}{The identifier for the judge who made the next 2 assessments}
#' \item{J3Completeness}{The average completeness made by judge J3 based on the 8 completeness questions}
#' \item{J3Clarity}{The clarity assessment made by judge J3}
#' \item{J4}{The identifier for the judge who made the next 2 assessments}
#' \item{J4Completeness}{The average completeness made by judge J4 based on the 8 completeness questions}
#' \item{J4Clarity}{The clarity assessment made by judge J4}
#' \item{MeanCompleteness}{The mean of J1Completeness, J2Completeness, J3Completeness, J4Completeness}
#' \item{MedianCompleteness}{The median of J1Completeness, J2Completeness, J3Completeness, J4Completeness}
#' \item{MedianClarity}{The median clarity of J1Clarity, J2Clarity, J3Clarity, J4Clarity}
#' \item{MeanClarity}{The mean clarity of J1Clarity, J2Clarity, J3Clarity, J4Clarity}
#' \item{VarCompleteness}{The variance of J1Completeness, J2Completeness, J3Completeness, J4Completeness}
#' \item{VarClarity}{The variance clarity of J1Clarity, J2Clarity, J3Clarity, J4Clarity}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBudgen16.DiffInDiffData
#'
"KitchenhamMadeyskiBudgen16.DiffInDiffData"


#' KitchenhamMadeyskiBudgen16.COCOMO data
#'
#' If you use this data set please cite this R package and the following paper when accepted: Barbara Kitchenham, Lech Madeyski, David Budgen, Jacky Keung, Pearl Brereton, Stuart Charters, Shirley Gibbs, and Amnart Pohthong, 'Robust Statistical Methods for Empirical Software Engineering', Empirical Software Engineering, vol. 22, no.2, p. 579-630, 2017. DOI: 10.1007/s10664-016-9437-5 (https://dx.doi.org/10.1007/s10664-016-9437-5), URL: https://madeyski.e-informatyka.pl/download/KitchenhamMadeyskiESE.pdf
#'
#' Data set collected at TRW by Barry Boehm see: B.W. Boehm. 1981.  Software Engineering Economics. Prentice-Hall.
#'
#' Explanations by Barbara Kitchenham / https://terapromise.csc.ncsu.edu:8443/!/#repo/view/head/effort/cocomo/cocomo1/nasa93/nasa93.arff
#'
#' COCOMO.txt: pro type year Lang Rely Data CPLX aaf time store virt turn type2 acap aexp pcap vexp lexp cont modp TOOL TOOLcat SCED RVOL Select rvolcat Modecat Mode1 Mode2 Mode3 KDSI AKDSI Effort Dur Productivity
#' @format A data frame with variables:
#' \describe{
#' \item{Project}{Project ID}
#' \item{Type}{A categorical variable describing the type of the project}
#' \item{Year}{The year the project was completed}
#' \item{Lang}{A categorical variable describing the development language used}
#' \item{Rely}{Ordinal value defining the required software reliability}
#' \item{Data}{Ordinal value defining the data complexity / Data base size}
#' \item{Cplx}{Ordinal value defining the complexity of the software / Process complexity}
#' \item{Aaf}{??}
#' \item{Time}{Ordinal value defining the stringency of timing constraints / Time constraint for cpu}
#' \item{Stor}{Ordinal value defining the stringency of the data storage requirements / Main memory constraint}
#' \item{Virt}{Virtual Machine volatility}
#' \item{Turn}{Turnaround time}
#' \item{Type2}{A categorical variable defining the hardware type: mini, max=mainframe, midi}
#' \item{Acap}{Ordinal value defining the analyst capability}
#' \item{Aexp}{Ordinal value defining the analyst experience / application experience}
#' \item{Pcap}{Ordinal value defining the programming capability of the team / Programmers capability}
#' \item{Vexp}{Ordinal value defining the virtual machine experience of the team}
#' \item{Lexp}{Ordinal value defining the programming language experience of the team}
#' \item{Cont}{??}
#' \item{Modp}{ / Modern programming practices}
#' \item{Tool}{Ordinal value defining the extent of tool use / Use of software tools}
#' \item{ToolCat}{Recoding of Tool to labelled ordinal scale}
#' \item{Sced}{Ordinal value defining the stringency of the schedule requirements / Schedule constraint}
#' \item{Rvol}{Ordinal value defining the requirements volatility of the project}
#' \item{Select}{Categorical value calculated by BAK for an analysis example}
#' \item{Rvolcat}{Recoding of Rvol to a labelled ordinal scale}
#' \item{Modecat}{Mode of the projects: O=Organic, E=Embedded, SD-Semi-Detached}
#' \item{Mode1}{Dummy variable calculated by BAK: 1 if the project is Organic, 0 otherwise}
#' \item{Mode2}{Dummy variable calculated by BAK: 1 if the project is Semi-detached, 0 otherwise}
#' \item{Mode3}{Dummy variable calculated by BAK: 1 if the project is Embedded, 0 otherwise}
#' \item{KDSI}{Product Size Thousand of Source Instructions}
#' \item{AKDSI}{Adjusted Product Size for Project in Thousand Source Instructions - differs from KDSI for enhancement projects}
#' \item{Effort}{Project Effort in Man months}
#' \item{Duration}{Duration in months}
#' \item{Productivity}{Productivity of project calculated by BAK as AKDSI/Effort, so the the larger the value the better the productivity}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBudgen16.COCOMO
#'
"KitchenhamMadeyskiBudgen16.COCOMO"


#' KitchenhamMadeyski.SimulatedCrossoverDataSets data
#'
#' If you use this data set please cite this R package and the following paper: Lech Madeyski and Barbara Kitchenham, 'Effect Sizes and their Variance for AB/BA Crossover Design Studies', Empirical Software Engineering, vol. 24, no.4, p. 1982-2017, 2018. DOI: 10.1007/s10664-017-9574-5
#'
#' This is simulated normally distributed data from 30 subjects, with technique A being 10 units more effective than technique B, and there is a period effect equaling 5 units. Subject 1 to 15 used technique B first while subjects 16 to 30 used technique A first.
#'
#' @format A data frame with variables:
#' \describe{
#' \item{actualSampleSize}{Sample size}
#' \item{SSFull}{Sample Size}
#' \item{CFull}{Correlation}
#' \item{ESFull}{Effect Size}
#' \item{Accuracy}{Accuracy}
#' \item{PropSig}{...}
#' \item{WrongTSig}{...}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyski.SimulatedCrossoverDataSets
#'
"KitchenhamMadeyski.SimulatedCrossoverDataSets"


#' MadeyskiKitchenham.EUBASdata data
#'
#' If you use this data set please cite this R package and the paper where we analyze the data set: Lech Madeyski and Barbara Kitchenham, 'Effect Sizes and their Variance for AB/BA Crossover Design Studies', Empirical Software Engineering, vol. 24, no.4, p. 1982-2017, 2018. DOI: 10.1007/s10664-017-9574-5
#'
#' Data set comes from an experiment conducted in Italy at the University of Basilicata (with 24 first-year students from the Master's Program in Computer Science) to answer the question 'Do the software models produced in the requirements analysis process aid in the comprehensibility and modifiability of source code?', see G. Scanniello, C. Gravino, M. Genero, J. A. Cruz-Lemus, and G. Tortora, 'On the Impact of UML Analysis Models on Source-code Comprehensibility and Modifiability,' ACM Transactions on Software Engineering and Methodology, vol. 23, pp. 13:1-13:26, Apr. 2014. However, the inconsistent subject data for subject 2 was removed, see the aforementioned paper by Madeyski and Kitchenham.
#'
#' @format A data frame with variables:
#' \describe{
#' \item{ID}{Project ID}
#' \item{TimePeriod}{Period of time (run): R1, R2}
#' \item{SequenceGroup}{Sequence group: G1, G2, G3, G4}
#' \item{System}{Software system identifier indicates the system (i.e., S1 or S2) used as the experimental object: S1. A software system to sell and manage CDs/DVDs in a music shop, S2. A software system to book and buy theater tickets}
#' \item{Technique}{The independent variable. It is a nominal variable that can assume the following two values: AM (analysis models plus source code) and SC (source code alone)}
#' \item{Comp_Level}{This denotes the comprehension level of the source code achieved by a software engineer}
#' \item{Modi_Level}{This denotes the capability of a maintainer to modify source code}
#' }
#'
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' MadeyskiKitchenham.EUBASdata
#'
"MadeyskiKitchenham.EUBASdata"



#' KitchenhamMadeyskiBrereton.MetaAnalysisReportedResults data
#'
#' This data is used in the paper: Barbara Kitchenham, Lech Madeyski and Pearl Brereton. Meta-analysis for Families of Experiments: A Systematic Review and Reproducibility Assessment (to be submitted).
#' This data set reports the meta-analysis results reported by the authors of the 13 primary studies included in the systematic review.
#' @format A text file file with variables:
#' \describe{
#' \item{Study}{This field includes the study identifier of each of the 13 primary studies which were included in the systematic review.}
#' \item{Type}{This identifies the type of effect size used by the study authors. d  or g refer to d_IG and g_IG, P is the aggregated p values, if the repeated measures estimate was obtained it is appropriately specified, r refers to the point bi-serial correlation.}
#' \item{Source}{Always set to Rep. This identifies that the data was as reported by the primary study authors.}
#' \item{mean}{The overall mean effect size reported by the study authors}
#' \item{pvalue}{The one-sided p-value associated with the overall mean reported by the study authors. NA means the authors did not report this statistic.}
#' \item{UB}{The upper bound of the confidence interval of the overall mean as reported by the primary study authors. NA means the authors did not report this statistic.}
#' \item{LB}{The lower bound of the confidence interval of the overall mean as reported by the primary study authors. NA means the authors did not report this statistic.}
#' \item{QE}{The heterogeneity statistic associated with the meta-analysis as reported by the study authors. NA means the authors did not report this statistic.}
#' \item{Qep}{The p-value of the heterogeneity statistic associated with the meta-analysis as reported by the study authors. NA means the authors did not report this statistic.}
#' }
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBrereton.MetaAnalysisReportedResults
#'
"KitchenhamMadeyskiBrereton.MetaAnalysisReportedResults"

#' KitchenhamMadeyskiBrereton.ABBAMetaAnalysisReportedResults data
#'
#' This data is used in the paper: Barbara Kitchenham, Lech Madeyski and Pearl Brereton. Meta-analysis for Families of Experiments: A Systematic Review and Reproducibility Assessment, Empirical Software Engineering (2019) doi:10.1007/s10664-019-09747-0.
#' This data set reports the meta-analysis results reported by the authors of the primary studies included in the systematic review that reported results on a per document basis which for S7 and S11 was equivalent to reporting the results for each time period.
#' @format A text file with variables:
#' \describe{
#' \item{Study}{This field includes the study identifier of each of the the 3 primary studies which reported results per document.}
#' \item{Type}{This identifies the type of effect size used by the study authors. d or g refer to d_IG and g_IG, P is the aggregated p values, if the repeated measures (RM) estimate was obtained it is appropriately specified.}
#' \item{Source}{Always set to Rep. This identifies that the data was as reported by the primary study authors.}
#' \item{mean}{The overall mean effect size reported by the study authors}
#' \item{pvalue}{The one-sided p-value associated with the overall mean reported by the study authors. NA means the authors did not report this statistic.}
#' \item{UB}{The upper bound of the confidence interval of the overall mean as reported by the primary study authors. NA means the authors did not report this statistic.}
#' \item{LB}{The lower bound of the confidence interval of the overall mean as reported by the primary study authors. NA means the authors did not report this statistic.}
#' \item{QE}{The heterogeneity statistic associated with the meta-analysis as reported by the study authors. NA means the authors did not report this statistic.}
#' \item{Qep}{The p-value of the heterogeneity statistic associated with the meta-analysis as reported by the study authors. NA means the authors did not report this statistic.}
#' }
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBrereton.ABBAMetaAnalysisReportedResults
#'
"KitchenhamMadeyskiBrereton.ABBAMetaAnalysisReportedResults"


#' KitchenhamMadeyskiBrereton.ReportedEffectSizes data
#'
#' This data is used in the paper: Barbara Kitchenham, Lech Madeyski and Pearl Brereton. Meta-analysis for Families of Experiments: A Systematic Review and Reproducibility Assessment, Empirical Software Engineering (2019) doi:10.1007/s10664-019-09747-0.
#' This file holds the individual effect sizes for each experiment, as reported by 13 primary studies in the systematic review.
#' @format A text file with variables:
#' \describe{
#' \item{Study}{This field includes the study identifier of each of the 13 primary studies which were included in the systematic review.}
#' \item{Type}{This identifies the type of effect size used by the study authors. d  or g refer to dIG and gIG, p is the p-value used for aggregation, if the repeated measures estimate was obtained it is appropriately specified as gRM, r refers to the point bi-serial correlation.}
#' \item{Source}{Always set to Rep. This identifies that the data was as reported by the primary study authors.}
#' \item{Design}{The refers to the design method used by the study author. 4GroupCO is a 4-group crossover design. Mixed means different experiments in a particular family used different methods (only S3 used mixed methods and 4 experiments used the 4 group crossover and one used an independent groups design). ABBACO is the standard 2-group crossover design. IndGroups is the independent groups design also called between groups design or a randomised design. PrePost is pretest and posttest design with a post test control.}
#' \item{Exp1}{This is the reported standardized effect size for the first experiment in the family.}
#' \item{Exp2}{This is the reported standardized effect size for the second experiment in the family.}
#' \item{Exp3}{This is the reported standardized effect size for the third experiment in the family.}
#' \item{Exp4}{This is the reported standardized effect size for the fourth experiment in the family. NA means there was no fourth experiment in the family.}
#' \item{Exp5}{This is the reported standardized effect size for the fifth experiment in the family. NA means there was no fifth experiment in the family.}
#' }
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBrereton.ReportedEffectSizes
#'
"KitchenhamMadeyskiBrereton.ReportedEffectSizes"


#' KitchenhamMadeyskiBrereton.ABBAReportedEffectSizes data
#'
#' This data is used in the paper: Barbara Kitchenham, Lech Madeyski and Pearl Brereton. Meta-analysis for Families of Experiments: A Systematic Review and Reproducibility Assessment, Empirical Software Engineering (2019) doi:10.1007/s10664-019-09747-0.
#' This file holds the individual effect sizes for the first time period (or equivalently the first document), as reported by the 3 primary studies in the systematic review that reported results for each document/time period separately.
#' @format A text file with variables:
#' \describe{
#' \item{Study}{This field includes the study identifier of each of the 3 primary studies which were included in the systematic review. The studies are S3, S7 and S11.}
#' \item{Type}{This identifies the type of effect size used by the study authors. d  or g refer to dIG and gIG.}
#' \item{Source}{Always set to Rep. This identifies that the data was as reported by the primary study authors.}
#' \item{Design}{Mixed means different experiments in a particular family used different methods (onlyS3 used mixed methods and 4 experiments used the 4 group crossover and one used an independent groups design). ABBACO is the standard 2-group crossover design. }
#' \item{Exp1}{This is the reported standardised effect size for the first time period and the first experiment in the family.}
#' \item{Exp2}{This is the reported standardised effect size for the first time period and second experiment in the family.}
#' \item{Exp3}{This is the reported standardised effect size for the first time period and the third experiment in the family.}
#' \item{Exp4}{This is the reported standardised effect size for the first time period and the fourth experiment in the family. NA means there was no fourth experiment in the family.}
#' \item{Exp5}{This is the reported standardised effect size for the first time period and the fifth experiment in the family. NA means there was no fifth experiment in the family.}
#' }
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBrereton.ABBAReportedEffectSizes
#'
"KitchenhamMadeyskiBrereton.ABBAReportedEffectSizes"


#' KitchenhamMadeyskiBrereton.ExpData data
#'
#' This data is used in the paper: Barbara Kitchenham, Lech Madeyski and Pearl Brereton. Meta-analysis for Families of Experiments: A Systematic Review and Reproducibility Assessment, Empirical Software Engineering (2019) doi:10.1007/s10664-019-09747-0.
#' This file holds the descriptive data for each experiment which include the mean, standard deviation and sample size for the control and treatment techniques. Note in the case of studies 3, 7 and 11, which reported descriptive data for each time period (or equivalently each document) separately, the values for of the descriptive data were obtained by analysing the data reported in the DocData file.
#' @format A text file with variables:
#' \describe{
#' \item{Study}{This field includes the study identifier of each of the 13 primary studies which were included in the systematic review. }
#' \item{Exp}{This identifies the experiment to which the descriptive data belongs.}
#' \item{Source}{Always set to Rep. This identifies that the data was as reported by the primary study authors.}
#' \item{Mc}{The mean value of the observations obtained using the control technique.}
#' \item{SDc}{The standard deviation of the observations obtained using the control technique.}
#' \item{Nc}{The number of participants using the control technique in the first time period.}
#' \item{Mt}{The mean value of the observations obtained using the treatment technique.}
#' \item{SDt}{The standard deviation of the observations obtained using the treatment technique.}
#' \item{Nt}{The number of participants using the treatment technique in the first time period.}
#' \item{r}{The correlation between repeated measures. NA if not reported. Note only study 13 reported this correlation.}
#' }
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBrereton.ExpData
#'
"KitchenhamMadeyskiBrereton.ExpData"


#' KitchenhamMadeyskiBrereton.DocData data
#'
#' This data is used in the paper: Barbara Kitchenham, Lech Madeyski and Pearl Brereton. Meta-analysis for Families of Experiments: A Systematic Review and Reproducibility Assessment, Empirical Software Engineering (2019) doi:10.1007/s10664-019-09747-0.
#' This file holds the descriptive data for each document and each experiment for studies 3, 7 and 11 which include the mean, standard deviation and sample size for the control and treatment techniques. These studies performed ABBA crossover experiments and reported data for each document separately. Note Study 3 also undertook an independent groups study but data from that experiment is held in the ExpData file.
#' @format A text file with variables:
#' \describe{
#' \item{Study}{This field includes the study identifier of each of the 3 primary studies which reported their basic statistics on a time period & document basis. }
#' \item{Exp}{This identifies the experiment to which the descriptive data belongs.}
#' \item{Doc}{This identifies whether the data arose from the document used in the first or second time period. The value 'Doc1' identifies the data as coming from the first document or first time period. The value 'Doc2' identifies the data as coming from the second time period or document. Note for Study 3 we used the analysis of a specific document that was used in all 4 ABBA experiments. For studies 7 and 11, the authors identified which we used in r=each time period and Doc1 refers to data from the first time period.}
#' \item{Mc}{The mean value of the observations obtained using the control technique for the identified document.}
#' \item{SDc}{The standard deviation of the observations obtained using the control technique for the identified document.}
#' \item{Nc}{The number of participants using the control technique in the first time period for the identified document.}
#' \item{Mt}{The mean value of the observations obtained using the treatment technique for the identified document.}
#' \item{SDt}{The standard deviation of the observations obtained using the treatment technique for the identified document.}
#' \item{Nt}{The number of participants using the treatment technique in the first time period for the identified document.}
#' }
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' KitchenhamMadeyskiBrereton.DocData
#'
"KitchenhamMadeyskiBrereton.DocData"



#' MadeyskiLewowski.IndustryRelevantGitHubJavaProjects20190324 data
#'
#' This data is used in the paper:
#' Tomasz Lewowski and Lech Madeyski, 'Creating Evolving Project Data Sets in Software Engineering', vol. 851 of Studies in Computational Intelligence, pp. 1–14. Cham: Springer, 2020. DOI: 10.1007/978-3-030-26574-8_1
#' @format A text file with variables:
#' \describe{
#' \item{rowID}{unique id assigned to projects before filtering (source: API)}
#' \item{id}{GitHub repository ID (source: API)}
#' \item{repository owner}{the organization or user owning the repository (source: API)}
#' \item{project name}{name of the project (source: API)}
#' \item{manual}{link to best found project documentation - wiki, webpage, documentation directory or readme. Projects with limited documentation were marked with (limited) and ones that had documentation in Chinese - (Chinese) (source: manual)}
#' \item{installation}{the recommended installation medium(s) for the project. Some mediums may be missing for projects with multiple recommendations. (source: manual)}
#' \item{support}{channel(s) that can be used to get support and/or report bugs. Some channels may be missing for projects with multiple ones. Abbreviations used (source: manual):
#' GH GitHub Issues
#' SO Stack Overflow
#' GG Google Groups
#' ML Mailing list
#' FB Facebook
#' MM Mattermost
#' LI LinkedIn
#' ?  not found}
#' \item{is not sample/playground/docs/...}{1 if the project is an actual application or library, 0 if it is a set of samples, only documentation or some experimental area (source: manual)}
#' \item{is industrial}{whether the project can be treated as industrial quality one. Values and their meanings:
#' 1   the repository can be classified as industrial grade;
#' 0,5 the repository can sometimes be classified as industrial grade, but it is either a minor project or its documentation or support may be lacking the depth;
#' 0   the repository cannot be classified as industrial-grade;
#' -1  the repository is no longer actively maintained as of the date of data acquisition;
#' -2  the repository is no longer in Java as of the date of data acquisition. (source: manual)}
#' \item{createdAt}{the date at which the repository was created (source: API)}
#' \item{updatedAt}{the date of last repository update - including changes in projects, watchers, issues etc. (source: API)}
#' \item{pushedAt}{the date of last push to the repository - NOT the date of last pushed commit (source: API)}
#' \item{diskUsage}{total number of bytes on disk that are needed to store the repository (source: API)}
#' \item{forkCount}{number of existing repository forks (independent copies managed by other entities) (source: API)}
#' \item{isArchived}{true if the repository is archived (no longer maintained), false otherwise (source: API)}
#' \item{isFork}{true if the repository is a fork (not the main repository), false otherwise (source: API)}
#' \item{isMirror}{true if the repository is a mirror, false otherwise (source: API)}
#' \item{sshUrlOfRepository}{URL that can be used to immediately clone the repository (source: API)}
#' \item{licenseInfo.name}{name of license under which the project is distributed. Names are the same as in https://choosealicense.com/appendix/ (source: API)}
#' \item{commitSHA}{unique Git identifier of commit that was top of the main branch at the time of data acquisition (source: API)}
#' \item{defaultBranchRef.target.history.totalCount}{number of commits on the default branch in the repository (usually master) at the time of data acquisition (source: API)}
#' \item{stargazers.totalCount}{number of stargazers for the repository at the time of data acquisition (source: API)}
#' \item{watchers.totalCount}{number of watchers for the repository at the time of data acquisition (source: API)}
#' \item{languages.totalSize}{total size of all source code files (source: API)}
#' \item{Java.byte.count}{total size of Java files (source: API)}
#' \item{Language}{main programming language used in the repository, i.e. one that the most code is written in (source: API)}
#' \item{searchQuery}{query used during search that obtained this project (source: API)}
#' }
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' MadeyskiLewowski.IndustryRelevantGitHubJavaProjects20190324
#'
"MadeyskiLewowski.IndustryRelevantGitHubJavaProjects20190324"


#' MadeyskiLewowski.IndustryRelevantGitHubJavaProjects20191022 data
#'
#' This data is used in the paper: Tomasz Lewowski and Lech Madeyski, 'How do software engineering data sets evolve? A reproduction study', 2020 (submitted).
#' Generated by:
#' token <- '...'
#' MadeyskiLewowski.IndustryRelevantGitHubJavaProjects20191022<-searchForIndustryRelevantGitHubProjects(token, '2019-03-01', '2018-08-01')
#' usethis::use_data(MadeyskiLewowski.IndustryRelevantGitHubJavaProjects20191022)
#' @format A text file with variables:
#' \describe{
#' \item{rowID}{unique id assigned to projects before filtering (source: API)}
#' \item{id}{GitHub repository ID (source: API)}
#' \item{repository owner}{the organization or user owning the repository (source: API)}
#' \item{project name}{name of the project (source: API)}
#' \item{manual}{link to best found project documentation - wiki, webpage, documentation directory or readme. Projects with limited documentation were marked with (limited) and ones that had documentation in Chinese - (Chinese) (source: manual)}
#' \item{installation}{the recommended installation medium(s) for the project. Some mediums may be missing for projects with multiple recommendations. (source: manual)}
#' \item{support}{channel(s) that can be used to get support and/or report bugs. Some channels may be missing for projects with multiple ones. Abbreviations used (source: manual):
#' GH GitHub Issues
#' SO Stack Overflow
#' GG Google Groups
#' ML Mailing list
#' FB Facebook
#' MM Mattermost
#' LI LinkedIn
#' ?  not found}
#' \item{is not sample/playground/docs/...}{1 if the project is an actual application or library, 0 if it is a set of samples, only documentation or some experimental area (source: manual)}
#' \item{is industrial}{whether the project can be treated as industrial quality one. Values and their meanings:
#' 1   the repository can be classified as industrial grade;
#' 0,5 the repository can sometimes be classified as industrial grade, but it is either a minor project or its documentation or support may be lacking the depth;
#' 0   the repository cannot be classified as industrial-grade;
#' -1  the repository is no longer actively maintained as of the date of data acquisition;
#' -2  the repository is no longer in Java as of the date of data acquisition. (source: manual)}
#' \item{createdAt}{the date at which the repository was created (source: API)}
#' \item{updatedAt}{the date of last repository update - including changes in projects, watchers, issues etc. (source: API)}
#' \item{pushedAt}{the date of last push to the repository - NOT the date of last pushed commit (source: API)}
#' \item{diskUsage}{total number of bytes on disk that are needed to store the repository (source: API)}
#' \item{forkCount}{number of existing repository forks (independent copies managed by other entities) (source: API)}
#' \item{isArchived}{true if the repository is archived (no longer maintained), false otherwise (source: API)}
#' \item{isFork}{true if the repository is a fork (not the main repository), false otherwise (source: API)}
#' \item{isMirror}{true if the repository is a mirror, false otherwise (source: API)}
#' \item{sshUrlOfRepository}{URL that can be used to immediately clone the repository (source: API)}
#' \item{licenseInfo.name}{name of license under which the project is distributed. Names are the same as in https://choosealicense.com/appendix/ (source: API)}
#' \item{commitSHA}{unique Git identifier of commit that was top of the main branch at the time of data acquisition (source: API)}
#' \item{defaultBranchRef.target.history.totalCount}{number of commits on the default branch in the repository (usually master) at the time of data acquisition (source: API)}
#' \item{stargazers.totalCount}{number of stargazers for the repository at the time of data acquisition (source: API)}
#' \item{watchers.totalCount}{number of watchers for the repository at the time of data acquisition (source: API)}
#' \item{languages.totalSize}{total size of all source code files (source: API)}
#' \item{Java.byte.count}{total size of Java files (source: API)}
#' \item{Language}{main programming language used in the repository, i.e. one that the most code is written in (source: API)}
#' \item{searchQuery}{query used during search that obtained this project (source: API)}
#' }
#' @source \url{https://madeyski.e-informatyka.pl/reproducible-research/}
#' @examples
#' MadeyskiLewowski.IndustryRelevantGitHubJavaProjects20191022
#'
"MadeyskiLewowski.IndustryRelevantGitHubJavaProjects20191022"