knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
suppressPackageStartupMessages(library("mmstat4"))

Introduction

Aim

This package distributes teaching materials for statistics courses to students.

Once the teaching materials have been downloaded, the primary functions of this package open, load, and run the files they contain.

Students can thus access interactive apps, R code, data files, and other resources that help in learning statistical concepts. Easy access to these materials is meant to make learning more interactive and engaging.

GitHub allows you to download the repository as a ZIP file. You can find the option to download under the Code button (Download ZIP) in the repository. mmstat4 works with this ZIP file, but you can also use one of your own ZIP files.

In my courses, I assume that all R programs run in a freshly started R environment, meaning there are no path dependencies, and all necessary libraries are loaded within the R program. My repositories contain not only the example programs for the students but also the programs I use to create images and tables, as well as the Shiny Apps I demonstrate.

Installation

You can install mmstat4 from CRAN using:

install.packages("mmstat4")

Alternatively, you can install the development version from GitHub using devtools:

devtools::install_github("sigbertklinke/mmstat4")

Getting started

The package ships with a small ZIP file of educational materials. First, we tell mmstat4 to use this ZIP file instead of the larger ZIP files for my Data Analysis I and II courses.

ghget('local') 
ghopen("example_mcnemar.R")  # open an R example file

ghget returns the key (local) associated with the currently active ZIP file. ghopen launches the example file in RStudio. To access the equivalent Python script, use:

ghopen("example_mcnemar.py")  # open a Python example file

Note: Running Python scripts requires a local Python installation. Scripts are executed in a virtual environment named mmstat4.xxxx, which is created the first time a script is run or opened; creating it requires your approval. When the environment is set up, mmstat4 checks whether the ZIP file contains a file init_py.R. If it does, the file is executed; it typically installs Python modules with reticulate::py_install('module name').

Shiny apps can also be launched in RStudio and run locally.

ghopen("pca_best_line/app.R")  # open a Shiny app

Data files can be loaded with:

x <- ghload("TelefonDaten.csv")  # load a data set
head(x)

HTML and PDF files will open in the default application:

ghopen("Formelsammlung.pdf")  # open a PDF file

Using a ZIP file or repository

ghget

A ZIP file or repository can be stored locally or on the internet. A key-value approach is used to specify the location of the source ZIP file. If no key is given, ghget uses the base name of the source ZIP file as the key.

ghget(dummy="https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")

Three repository keys are predefined: hu.data, hu.stat and dummy. You can retrieve them via

ghget('dummy')
ghget('hu.stat')
ghget('hu.data')

If you do not supply a key, the programme creates one and returns it as the result.

ghget(system.file("zip/mmstat4.dummy.zip", package = "mmstat4"))
ghget("https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")
# tries https://github.com/my/github_repo/archive/refs/heads/[main|master].zip 
ghget("my/github_repo")  # will fail
#
ghget()                  # uses 'hu.data'

ghget downloads the ZIP file, saves it to a temporary location and unpacks it. For non-temporary locations, see the FAQ.
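The two naming conventions described above can be sketched in base R. Both helpers below, default_key() and candidate_urls(), are hypothetical and not part of mmstat4. The sketch assumes that the default key is the base name without its file extension and that a user/repo shorthand is expanded to the main and master branch archives, as the comment in the example above suggests; the package may handle details differently.

```r
# Hypothetical helpers for illustration only; they are NOT mmstat4 functions.

# Assumption: the default key is the base name of the source without extension.
default_key <- function(source) {
  tools::file_path_sans_ext(basename(source))
}

# Assumption: "user/repo" is expanded to the archive URLs of the usual branches.
candidate_urls <- function(repo, branches = c("main", "master")) {
  sprintf("https://github.com/%s/archive/refs/heads/%s.zip", repo, branches)
}

key  <- default_key("/some/dir/mmstat4.dummy.zip")
urls <- candidate_urls("my/github_repo")
print(key)   # "mmstat4.dummy"
print(urls)
```

Supplying an explicit key, as in ghget(dummy = "..."), avoids having to rely on whatever name is derived from the file.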

Full and short names for files

After the ZIP file has been unpacked, a unique short name is generated for each file from the components of its path.

ghget('dummy')
gd <- ghdecompose(ghlist(full.names=TRUE))
head(gd)

The file name is split into four parts. The last two parts, minpath and filename, are used to create short names:

  1. The short name for /tmp/RtmpXXXXXX/mmstat4.dummy-main/LICENSE is LICENSE. No other file in the ZIP file is named LICENSE, so the file name alone is sufficient to address it.
  2. The short name for /tmp/RtmpXXXXXX/mmstat4.dummy-main/data/BANK2.sav is data/BANK2.sav. The ZIP file contains a second file named BANK2.sav (dbscan/BANK2.sav), so one directory is needed to address each of them uniquely. Currently, no check is made whether two files with identical base names also have identical content.
ghlist("BANK2", full.names=TRUE) # full names
ghlist("BANK2")                  # short names
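The idea behind the short names can be sketched in base R. The helper short_names() below is hypothetical and not part of mmstat4. It assumes that a short name is the shortest trailing part of a path that no other file shares, which reproduces the two cases above but is not necessarily how the package computes them. The example paths are invented.

```r
# Hypothetical helper (not part of mmstat4): lengthen each name by one
# directory at a time until no two files share the same trailing path.
short_names <- function(paths) {
  parts <- strsplit(paths, "/", fixed = TRUE)
  k <- rep(1L, length(paths))              # trailing components used per file
  repeat {
    cand <- unname(mapply(
      function(p, n) paste(utils::tail(p, n), collapse = "/"),
      parts, k
    ))
    clash <- duplicated(cand) | duplicated(cand, fromLast = TRUE)
    grow  <- clash & k < lengths(parts)    # grow only while components remain
    if (!any(grow)) break
    k[grow] <- k[grow] + 1L
  }
  cand
}

# Invented example paths, modelled on the files discussed above
paths <- c(
  "mmstat4.dummy-main/LICENSE",
  "mmstat4.dummy-main/data/BANK2.sav",
  "mmstat4.dummy-main/examples/dbscan/BANK2.sav"
)
short <- short_names(paths)
print(short)
```

Files with a unique base name keep it unchanged; only the two BANK2.sav files receive a directory prefix.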

ghopen, ghload, ghsource

The short names (or the full names) can be used to work with the files:

x <- ghload("data/BANK2.sav")          # load data via rio::import
ghopen("univariate/example_ecdf.R")    # open file in RStudio editor
ghsource("univariate/example_ecdf.R")  # execute file via source
ghlist("example_ecdf")                 # "univariate/" was unnecessary

ghlist, ghquery

With ghlist you can get a list of unique (short) names for all files in the repository, or for a subset selected by a regular expression pattern.

str(ghlist())     # get all short names
ghlist("\\.pdf$") # get all short names of PDF files

With ghquery you can query the list of unique (short) names for all files based on the overlap distance.

ghlist("bnk")  # pattern = "bnk"
ghquery("bnk") # nearest string matching to "bnk"
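The difference between pattern matching and nearest-string matching can be illustrated in base R. The sketch below does not use the overlap distance of ghquery; it swaps in the generalised Levenshtein distance from utils::adist, so the ranking may differ from what ghquery returns. The file names are invented.

```r
# Invented short names for illustration
files <- c("data/BANK2.sav", "LICENSE", "univariate/example_ecdf.R")

# A regular expression needs the literal sequence "bnk" and finds nothing
regex_hits <- grep("bnk", files, ignore.case = TRUE, value = TRUE)

# Edit distance between the query and each file name without path and extension
# (assumption: Levenshtein distance stands in for the overlap distance)
stems   <- tools::file_path_sans_ext(basename(files))
d       <- drop(utils::adist("bnk", stems, ignore.case = TRUE))
nearest <- files[which.min(d)]

print(regex_hits)
print(nearest)
```

A distance-based query still returns a candidate when the search term contains a typo, whereas the pattern search comes back empty.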

ghfile, ghpath, ghdecompose

ghfile tries to find a unique match for a given file and returns the full path. If there is no unique match, an error is returned with some possible matches.

ghdecompose builds a data frame that decomposes the full name of each file into four parts, among them uniquepath, minpath and filename.

The short names for the files are built from the components minpath and filename.

ghpath builds up the short name with various path components from a ghdecompose object.

ghfile('data/BANK2.sav')
ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
fnf <- ghlist(full.names=TRUE)
dfn <- ghdecompose(fnf)
head(dfn)
head(ghpath(dfn))
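The lookup behaviour described for ghfile can be sketched as follows. The helper resolve_file() is hypothetical and not part of mmstat4; it assumes that a name matches a file when it equals the full path or a trailing part of it, and it stops with an error listing the candidates when the match is not unique. The paths are invented.

```r
# Hypothetical helper (not part of mmstat4): resolve a (short) name to exactly
# one full path, or fail with a message that lists the candidates.
resolve_file <- function(name, paths) {
  hit <- paths == name | endsWith(paths, paste0("/", name))
  if (sum(hit) != 1L) {
    stop("no unique match for '", name, "'; candidates: ",
         paste(paths[hit], collapse = ", "))
  }
  paths[hit]
}

# Invented full names
paths <- c("/repo/data/BANK2.sav", "/repo/examples/dbscan/BANK2.sav", "/repo/LICENSE")

unique_hit <- resolve_file("data/BANK2.sav", paths)
ambiguous  <- tryCatch(resolve_file("BANK2.sav", paths),
                       error = function(e) conditionMessage(e))
print(unique_hit)
print(ambiguous)
```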

RStudio addins

The package comes with two RStudio addins, which are listed under Addins -> MMSTAT4.

Creating your own ZIP file

Preparation 1: Libraries/Modules used

Currently there are the following routines to support R/Python code snippets:

ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
files <- ghlist(pattern="\\.R$", full.names = TRUE)  # regular expression: names ending in .R
cat(head(pkglist(files, repos="https://cloud.r-project.org"), 12))

Note that the line for CHAID is commented out. The package cannot be found in CRAN, but you can install it from R-Forge.

cat(head(pkglist(files, repos=c("https://cloud.r-project.org", "http://R-Forge.R-project.org")), 12))

You can add a file init_R.R or init_py.R to your ZIP file, which installs the necessary R packages or Python modules.

Preparation 2: Scripts run independently

checkFiles checks whether each R code snippet runs without errors in a freshly started R session.

# just check the last files from the list 
# Note that the R console will show more output (warnings etc.)
checkFile(files, start=435)  # alternatively: Rsolo

Three modes are available for checking a file:

  1. exist: Does the source file exist?
  2. parse: Is parse(file) or python -m "file" successful? (default)
  3. run: Is Rscript "file" or python3 "file" successful?
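For R files, the three modes can be sketched in base R. The helper check_r_file() is hypothetical and not part of mmstat4; it only illustrates what each mode tests and ignores Python files. The run mode starts a separate Rscript process, so nothing from the current session leaks into the check.

```r
# Hypothetical helper (not part of mmstat4) illustrating the three modes.
check_r_file <- function(file, mode = c("parse", "exist", "run")) {
  mode <- match.arg(mode)
  if (!file.exists(file)) return(FALSE)
  if (mode == "exist") return(TRUE)
  if (mode == "parse") {
    # syntax check only, nothing is evaluated
    return(!inherits(try(parse(file), silent = TRUE), "try-error"))
  }
  # run: execute the file in a fresh R process and inspect the exit status
  rscript <- file.path(R.home("bin"), "Rscript")
  status  <- system2(rscript, shQuote(file), stdout = FALSE, stderr = FALSE)
  identical(status, 0L)
}

# Two temporary files: one syntactically valid, one with an unclosed bracket
good <- tempfile(fileext = ".R"); writeLines("x <- 1 + 1", good)
bad  <- tempfile(fileext = ".R"); writeLines("x <- (1 +", bad)

good_parses <- check_r_file(good, "parse")
bad_parses  <- check_r_file(bad, "parse")
bad_exists  <- check_r_file(bad, "exist")
print(c(good_parses, bad_parses, bad_exists))
```

The modes are ordered by cost: exist is the cheapest, parse catches syntax errors, and only run detects missing packages or path dependencies.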

Preparation 3: Searching for (and removing) duplicate files

dupFiles uses checksums to find files that occur more than once.

files <- ghlist(full.names = TRUE)
head(dupFiles(files))  # alternatively: Rdups
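Checksum-based duplicate detection can be sketched with tools::md5sum from base R. The helper find_duplicates() is hypothetical and not part of mmstat4, and the choice of MD5 is an assumption; which checksum dupFiles uses is not stated here.

```r
# Hypothetical helper (not part of mmstat4): group files by MD5 checksum and
# keep only the groups that contain more than one file.
find_duplicates <- function(files) {
  sums   <- unname(tools::md5sum(files))
  groups <- split(files, sums)
  unname(groups[lengths(groups) > 1L])
}

# Three temporary files, two of them with identical content
f1 <- tempfile(fileext = ".txt"); writeLines("same content", f1)
f2 <- tempfile(fileext = ".txt"); writeLines("same content", f2)
f3 <- tempfile(fileext = ".txt"); writeLines("other content", f3)

dups <- find_duplicates(c(f1, f2, f3))
print(dups)
```

Because the comparison is based on content, files are reported as duplicates even when their names differ.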

Note: there is also an error message if the necessary libraries are not installed!

ZIP file and access names

Once you have created your ZIP file, you need to know under which names a specific file can be accessed. The example uses a ZIP file that comes with the package mmstat4:

ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
ghnames <- ghdecompose(ghlist(full.names=TRUE))
ghnames[58,]

The shortest possible name is determined by minpath and filename. But all other paths determined by uniquepath, minpath and filename should also work.

For file number 58, the following access names are possible:

x1 <- ghload("BANK2.sav")
x2 <- ghload("dbscan/BANK2.sav")
x3 <- ghload("cluster/dbscan/BANK2.sav")
x4 <- ghload("data/cluster/dbscan/BANK2.sav")
x5 <- ghload("examples/data/cluster/dbscan/BANK2.sav")

Frequently asked questions

Something is not working properly. Where can I get help?

Please email me at sigbert@hu-berlin.de. You can also try the current development version of the package from GitHub:

# install.packages("devtools")
devtools::install_github("sigbertklinke/mmstat4")

Can I use a password protected ZIP file?

No, this is not supported.

How can I force a reload of a zip file?

ghget("dummy", .force=TRUE)

How can I store a zip file permanently?

ghget("dummy", .tempdir=FALSE)        # install non-temporarily
ghget("dummy", .tempdir="~/mmstat4")  # install non-temporarily to ~/mmstat4
ghget("dummy", .tempdir=TRUE)         # install again temporarily

Note: If a repository was installed permanently and you switch back to temporary storage, the downloaded files will not be deleted.

How can I find all directories with Shiny apps?

ghget("dummy", .tempdir=TRUE)
ghlist(pattern="/(app|server)\\.R$")
ghopen("dbscan") # open the app
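What the pattern selects can be checked in base R on a vector of names. The paths below are invented for illustration; the pattern is the one used above.

```r
# Invented short names for illustration
files <- c(
  "examples/dbscan/app.R",
  "examples/pca_best_line/server.R",
  "examples/pca_best_line/ui.R",
  "univariate/example_ecdf.R"
)

# Keep files named app.R or server.R that sit inside some directory ...
app_files <- grep("/(app|server)\\.R$", files, value = TRUE)
# ... and reduce them to the directories that hold the apps
app_dirs <- unique(dirname(app_files))
print(app_dirs)
```

The leading slash in the pattern prevents a match on names such as myapp.R, and ui.R is left out so that an app split into ui.R and server.R is listed only once.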

How can I find all csv data files?

ghget("dummy", .tempdir=TRUE)
ghlist(pattern="\\.csv$", ignore.case=TRUE, full.names=TRUE)
# use mmstat4::ghload for importing
ghlist(pattern="\\.csv$")
pechstein <- ghload("pechstein.csv")
str(pechstein)

What should I install to use Python scripts?

For Ubuntu (Linux) install:

sudo apt-get install python3 python3-dev python3-pip python3-venv libbz2-dev

Note: By default, mmstat4 installs the following Python modules: numpy, scipy, statsmodels, pandas, scikit-learn, matplotlib, and seaborn.

init_py.R is only called if the virtual environment is created. Can I force a new call?

Yes, delete the virtual environment and recreate it:

reticulate::virtualenv_remove('mmstat4')
ghinstall('py', force=TRUE)

Default repositories

The package recognises three standard repositories: dummy, hu.stat, and hu.data.

tabl <- "
| Repository       | Size          | ZIP file location |
| :--------------- | :-------------| :--------|
| `dummy`          | 3 MB          | `https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip`  |
| `hu.data`        | 29 MB         | `https://github.com/sigbertklinke/mmstat4.data/archive/refs/heads/main.zip`   |
| `hu.stat`        | 31 MB         | `https://github.com/sigbertklinke/mmstat4.stat/archive/refs/heads/main.zip`   |
"
cat(tabl) 

dummy is a small subsample of hu.stat and hu.data, intended for examples and testing.

Lecture Notes Sigbert Klinke, HU Berlin

Basic statistics I+II (in German)

Mathematical foundations - Introduction - Basic concepts - Univariate distributions - Parameters of univariate distributions - Bivariate distributions - Parameters of bivariate distributions - Regression analysis - Time series analysis - Index numbers - Probability theory - Random variables - How to lie with statistics - Important distribution models - Sampling theory - Statistical estimation methods - Regression model - Confidence intervals - Statistical tests - Parametric tests - Nonparametric tests

ghget("hu.stat")
ghopen("Statistik.pdf")
ghopen("Aufgaben.pdf")
ghopen("Loesungen.pdf")
ghopen("Formelsammlung.pdf")

Data analysis

General - R - Basics and data generation - Test and estimation theory - Parameter of distributions - Distribution - Transformations - Robust statistics - Missing values - Subgroup analysis - Correlation and association - Multivariate graphics - Principal component analysis - Exploratory factor analysis - Reliability - Cluster analysis - Regression analysis - Linear regression - Nonparametric regression - Classification and regression trees - Neural networks

ghget("hu.data")
ghopen("dataanalysis.pdf")

Lecture Notes Bernd Rönz, HU Berlin (in German)

Computergestützte Statistik I mit SPSS 10 (2001)

Introduction - Detection and identification of outliers - Testing the distributional form of variables - Comparing parameters for independent samples - Appendices A-D, bibliography, index

ghget("hu.data")
ghopen("cs1_roenz.pdf")

Computergestützte Statistik II mit SPSS 10 (2000)

Preface - Testing relationships - Regression analysis - Reliability and homogeneity analysis of constructs - Appendices A-H, bibliography, subject index

ghget("hu.data")
ghopen("cs2_roenz.pdf")

Generalisierte lineare Modelle mit SPSS 10 (2001)

Introduction - Generalized linear models (GLM) - Modelling binary data - The multinomial logit model - Modelling multinomial data (log-linear models) - Bibliography, index

ghget("hu.data")
ghopen("glm_roenz.pdf")


sigbertklinke/mmstat4 documentation built on Sept. 13, 2024, 4:46 p.m.