Home

/

GitHub

/

In matthiasgomolka/sdcLog: Tools for Statistical Disclosure Control in Research Data Centers

library(xaringanthemer)
library(sdcLog)
library(DiagrammeR)
style_mono_accent(
  base_color = "#1d3557",
  header_font_google = google_font("Fira Sans"),
  text_font_google   = google_font("Fira Sans", "300", "300i"),
  code_font_google   = google_font("Fira Mono")
)

options(htmltools.dir.version = FALSE)
options(datatable.print.keys = FALSE)
options(datatable.print.class = FALSE)

knitr::opts_chunk$set(
  # prompt = TRUE,
  # comment = "#>",
  collapse = TRUE,
  background = "#FAFAFA"
)

Wer bin ich?

Ich arbeite im Forschungsdaten- und Servicezentrum (FDSZ) der Deutschen Bundesbank.

--

Inhaltliche Schwerpunkte:

--

Wertpapiertransaktionsdaten
Data Production Pipelines
R-Tools, die das FDSZ-Leben einfacher machen

Motivation

Problem

Forschende stehen in der Pflicht, zu zeigen, dass ihr Output den Regeln des FDSZ entspricht.

--

Das kann schnell sehr komplex werden.

--

Außerdem ist der Aufwand für das FDSZ sehr hoch, wenn zusätzlich geprüft werden muss, wie die Forschenden die Nachweise für Ihren Output erbringen.

--

Lösung

FDSZ stellt Forschenden Werkzeuge zur Verfügung, um die Konformität mit den Outputregeln nachzuweisen: sdcLog

Theorie

Zwei einfache Regeln:

--

Jedes Ergebnis muss auf mindestens 5 unterschiedlichen Entitäten basieren (distinct ID's).

--

Die beiden größten Entitäten dürfen nicht mehr als 85% eines Ergebnisses ausmachen (dominance).

Ein Beispiel

Forschende möchten das arithmetische Mittel einer Variable berechnen und das Ergebnis in ihrer Publikation zeigen. Dazu müssen sie vorab mit sdc_descriptives() zeigen, dass das Ergebnis den Output-Regeln entspricht.

--

.pull-left[

data("sdc_descriptives_DT")
DT <- sdc_descriptives_DT
head(DT)

]

--

.pull-right[

# gesuchtes Ergebnis
DT[, .(mean = mean(val_1, na.rm = TRUE)),
   by = "sector"]

]

--

# Nachweis, dass das Ergebnis den Regeln entspricht
sdc_descriptives(DT, id_var = "id", val_var = "val_1", by = "sector")

Noch ein Beispiel

Diesmal berechnen die Forschenden das arithmetische Mittel gruppiert nach sector und year.

--

sdc_descriptives(DT, id_var = "id", val_var = "val_1", by = c("sector", "year"))

Minimum und Maximum

Jetzt möchten Forschende neben dem arithmetischen Mittel auch noch das Minimum und Maximum einer Variablen zeigen.

--

Problem

Minimum und Maximum sind vertrauliche Einzeldaten.

--

Lösung

"Minumum" und "Maximum" als arithmetisches Mittel der kleinsten bzw. größten Werte mit sdc_min_max():

--

sdc_min_max(DT, id_var = "id", val_var = "val_1")

Outputkontrolle bei statistischen Modellen

Jetzt möchten Forschende die Ergebnisse einer linearen Regression veröffentlichen.

--

options(sdc.n_ids = 3)

# Modell berechnen
mod <- lm(val_1 ~ sector + year + val_2, data = DT)

# Prüfen, ob Ergebnise freigegeben werden können
sdc_model(DT, model = mod, id_var = "id")

Warum heißt es sdcLog?

grViz("
digraph boxes_and_circles {

  # a 'graph' statement
  graph [overlap = true, fontsize = 30]

  # several 'node' statements
  node [shape = box, fontname = 'Fira Sans', width = 5.5]
  script [label = 'analysis.R\nenthält sdc_descriptives() / sdc_min_max() / sdc_model()'];

  node [width = 1] 
  log [label = 'Log Datei']

  node [shape = box, width = 3]
  checks [label = 'Prüfung durch FDSZ']

  node [shape = oval, width = 1] 
  sdc_log [label = '@@1']

  # several 'edge' statements
  script -> sdc_log
  sdc_log -> log
  log -> checks
}

[1]: 'sdc_log(analysis.R)'
")

Installation und Kontakt

CRAN

install.packages("sdcLog")

GitHub

https://github.com/matthiasgomolka/sdcLog/issues

E-Mail

matthias.gomolka@bundesbank.de

Telefon

069 9556-4991

matthiasgomolka/sdcLog documentation built on July 17, 2025, 3:21 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

matthiasgomolka/sdcLog
Tools for Statistical Disclosure Control in Research Data Centers

In matthiasgomolka/sdcLog: Tools for Statistical Disclosure Control in Research Data Centers

Wer bin ich?

Motivation

Problem

Lösung

Theorie

Ein Beispiel

Noch ein Beispiel

Minimum und Maximum

Problem

Lösung

Outputkontrolle bei statistischen Modellen

Warum heißt es sdcLog?

Installation und Kontakt

CRAN

GitHub

E-Mail

Telefon

R Package Documentation

Browse R Packages

We want your feedback!

matthiasgomolka/sdcLog Tools for Statistical Disclosure Control in Research Data Centers

In matthiasgomolka/sdcLog: Tools for Statistical Disclosure Control in Research Data Centers

Wer bin ich?

Motivation

Problem

Lösung

Theorie

Ein Beispiel

Noch ein Beispiel

Minimum und Maximum

Problem

Lösung

Outputkontrolle bei statistischen Modellen

Warum heißt es sdcLog?

Installation und Kontakt

CRAN

GitHub

E-Mail

Telefon

R Package Documentation

Browse R Packages

We want your feedback!

matthiasgomolka/sdcLog
Tools for Statistical Disclosure Control in Research Data Centers