The goal of gpt2samples is to help users explore the sample texts generated by OpenAI’s GPT-2 transformer-based language model.
OpenAI has published an original implementation of a smaller version of GPT-2, along with the original sample text files.
This package contains the following data, stored as tibbles:
| tibble | description |
| :------------------- | :------------------------------------------------------------------------------------------------------------------------------------ |
| conditional | Conditionally generated samples, with context prompts from the WebText test corpus, default settings (temperature 1 and no truncation). |
| conditional-t07 | Conditionally generated samples, with context prompts from the WebText test corpus, with temperature 0.7. |
| conditional-topk40 | Conditionally generated samples, with context prompts from the WebText test corpus, with truncation and top_k 40. |
| unconditional | Unconditionally generated samples, default settings. |
| unconditional-t07 | Unconditionally generated samples, with temperature 0.7. |
| unconditional-topk40 | Unconditionally generated samples, with truncation and top_k 40. |
Additionally, all the generated samples (conditional and unconditional) can be explored by calling `all_samples()`.
You can install the released version of gpt2samples from GitHub with:

    # install.packages("gpt2samples")
    # install.packages("devtools")
    devtools::install_github("kanishkamisra/gpt2samples")

This is a basic example of exploring the data with dplyr verbs:

    library(dplyr)
    library(gpt2samples)

    # Rows belonging to sample 100 of the conditional tibble
    conditional %>%
      filter(id == 100)
    #> # A tibble: 2 x 4
    #> file id type text
    #> <chr> <int> <chr> <chr>
    #> 1 conditio… 100 sample the waterbody that you are managing, getting pr…
    #> 2 conditio… 100 complet… Permit, WDFW ensures that nonconventional child…

    # Every line of sample 250 generated at temperature 0.7
    unconditional_t07 %>%
      filter(id == 250)
    #> # A tibble: 213 x 3
    #> file id text
    #> <chr> <int> <chr>
    #> 1 unconditional… 250 This question already has an answer here: How do I…
    #> 2 unconditional… 250 ""
    #> 3 unconditional… 250 This is a basic question regarding text editing. T…
    #> 4 unconditional… 250 ""
    #> 5 unconditional… 250 (A)
    #> 6 unconditional… 250 ""
    #> 7 unconditional… 250 (B)
    #> 8 unconditional… 250 ""
    #> 9 unconditional… 250 (A)
    #> 10 unconditional… 250 ""
    #> # … with 203 more rows

    # Last rows of the conditional samples within the combined data
    all_samples() %>%
      filter(file == "conditional") %>%
      tail()
    #> # A tibble: 6 x 4
    #> file id type text
    #> <chr> <int> <chr> <chr>
    #> 1 conditio… 500 complet… "BOP will be remembered for it's technically in…
    #> 2 conditio… 500 complet… ""
    #> 3 conditio… 500 complet… There were literal lap times in running the wat…
    #> 4 conditio… 500 complet… ""
    #> 5 conditio… 500 complet… ""
    #> 6 conditio… 500 complet… I was voiced by legendary actor turns down play…

    # Number of text lines per source file
    all_samples() %>%
      group_by(file) %>%
      summarise(total_lines = n())
    #> # A tibble: 6 x 2
    #> file total_lines
    #> <chr> <int>
    #> 1 conditional 18067
    #> 2 conditional-t07 24081
    #> 3 conditional-topk40 20405
    #> 4 unconditional 19469
    #> 5 unconditional-t07 28841
    #> 6 unconditional-topk40 21188

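The printed output suggests that each tibble stores one row per line of generated text, including empty lines. For analyses that need one row per sample, the lines can be collapsed back into a single string. The following is a minimal sketch of that idea, not part of the package: `toy_samples` is a hypothetical stand-in with the same `file`, `id` and `text` columns, used so the example runs without gpt2samples installed. Swapping in a real tibble such as `unconditional_t07` should work the same way, assuming its columns match the output shown above.

```r
library(dplyr)

# Hypothetical stand-in with the columns seen in the package output
# (file, id, text); replace with a real tibble from gpt2samples.
toy_samples <- tibble(
  file = "unconditional-t07",
  id   = c(1L, 1L, 1L, 2L, 2L),
  text = c("First line.", "", "Second line.", "Another sample.", "")
)

# Drop empty lines, then paste the remaining lines of each sample together.
toy_documents <- toy_samples %>%
  filter(text != "") %>%
  group_by(file, id) %>%
  summarise(
    n_lines = n(),                            # non-empty lines in the sample
    text    = paste(text, collapse = "\n"),   # one string per sample
    .groups = "drop"
  )

toy_documents
```

Grouping by both `file` and `id` keeps samples from different files apart when the same approach is applied to the result of `all_samples()`.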
For further exploration, packages such as Julia Silge and David Robinson’s tidytext can be used to analyze the text generated by GPT-2.
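As an illustration of the tidytext route, the sketch below tokenizes text into words, removes common stop words and counts what remains. It is only one possible starting point. `toy_samples` is again a hypothetical stand-in with `file`, `id` and `text` columns so the example runs on its own, while `unnest_tokens()` and the `stop_words` data set come from tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for a gpt2samples tibble (file, id, text).
toy_samples <- tibble(
  file = "unconditional",
  id   = c(1L, 1L, 2L),
  text = c(
    "The model writes text.",
    "The text reads well.",
    "Another sample of text."
  )
)

word_counts <- toy_samples %>%
  unnest_tokens(word, text) %>%          # one lowercased word per row
  anti_join(stop_words, by = "word") %>% # drop common English stop words
  count(file, word, sort = TRUE)         # most frequent words first

word_counts
```

Counting by `file` as well as `word` makes it possible to compare vocabularies across sampling settings when the input is the combined output of `all_samples()`.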
Please note that the ‘gpt2samples’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.