
ldccr

Overview

ldccr is a collection of utilities for working with various Japanese corpora.

The goal of the ldccr package is to make Japanese language resources easy to use.

This package provides:

  1. parsers for several Japanese corpora that are distributed under free or open (non-proprietary) licenses.
  2. a downloader for zipped text files published on Aozora Bunko.

Installation

install.packages("ldccr", repos = c("https://paithiov909.r-universe.dev", "https://cloud.r-project.org"))

Supported Corpora

Monolingual

| …                  | Name                                         | License         | Link |
| ------------------ | -------------------------------------------- | --------------- | ---- |
| :heavy_check_mark: | Live Door News Corpus                        | CC BY-ND 2.1 JP | #    |
| :heavy_check_mark: | Japanese Realistic Textual Entailment Corpus | CC BY-NC-SA 4.0 | #    |
| :heavy_check_mark: | ja.text8 corpus                              | CC BY-SA        | #    |

Multilingual

Currently not supported.

Download a text file from Aozora Bunko

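# Create a local directory for the downloaded file if it does not exist yet.
if (!dir.exists("cache")) dir.create("cache")

# Pick one random work from the bundled snapshot, take its text-file URL
# (the "テキストファイルURL" column), download it into "cache", and read it line by line.
text <- ldccr::AozoraBunkoSnapshot |>
  dplyr::sample_n(1L) |>
  dplyr::pull("テキストファイルURL") |>
  ldccr::read_aozora(directory = "cache") |>
  readr::read_lines()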

dplyr::glimpse(text)
#>  chr [1:16] "雪子さんの泥棒よけ" "夢野久作" ...
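
Aozora Bunko texts conventionally mark ruby readings with `《…》`, the start of a ruby base with `｜`, and annotator notes with `［＃…］`. Whether `read_aozora()` already removes this markup is not verified here, so the following is only a minimal sketch of how such markup could be stripped from a character vector like `text`. The helper `strip_aozora_markup()` is hypothetical (it is not part of ldccr), and the sample lines are made up for illustration.

```r
# Hypothetical helper, not part of ldccr: remove common Aozora Bunko markup.
strip_aozora_markup <- function(x) {
  x <- gsub("［＃.*?］", "", x, perl = TRUE)  # annotator notes, e.g. ［＃改ページ］
  x <- gsub("《.*?》", "", x, perl = TRUE)    # ruby readings, e.g. 《ゆきこ》
  x <- gsub("｜", "", x, fixed = TRUE)        # ruby start marker
  x
}

# Made-up sample lines standing in for downloaded text.
sample_lines <- c(
  "雪子《ゆきこ》さんは｜泥棒《どろぼう》よけを考えました。",
  "［＃改ページ］"
)

cleaned <- strip_aozora_markup(sample_lines)
cleaned <- cleaned[nzchar(cleaned)]  # drop lines that became empty
print(cleaned)
#> [1] "雪子さんは泥棒よけを考えました。"
```

The same function could be applied to the `text` vector from the example above if the downloaded file still contains this markup.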

License

MIT license.



paithiov909/ldccr documentation built on Oct. 14, 2024, 3:44 a.m.