read_ja_text8: Read the ja.text8 corpus

View source: R/ja-text8-reader.R

read_ja_text8    R Documentation

Description

Downloads and reads the ja.text8 corpus as a tibble.

Usage

read_ja_text8(
  url =
    "https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.text8.zip",
  size = NULL
)

Arguments

url

String. URL of the ja.text8 zip archive to download.

size

Integer. If supplied, the number of rows to sample from the corpus. If NULL (the default), all rows are returned.
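
The exact sampling behaviour behind size is not spelled out on this page. As a rough illustration only, the sketch below assumes size is a row count and that rows are drawn without replacement. The helper sample_rows and the toy data frame are hypothetical and are not part of ldccr.

```r
# Hypothetical helper, not part of ldccr: return `df` unchanged when `size`
# is NULL, otherwise draw up to `size` rows without replacement.
sample_rows <- function(df, size = NULL) {
  if (is.null(size)) {
    return(df)
  }
  n <- min(as.integer(size), nrow(df))
  df[sample.int(nrow(df), n), , drop = FALSE]
}

# Toy stand-in for a corpus with one sentence per row (made-up data).
toy <- data.frame(
  doc_id = 1:5,
  text = c("a b c", "d e", "f g h i", "j", "k l"),
  stringsAsFactors = FALSE
)

set.seed(1)
sampled <- sample_rows(toy, size = 3L)
print(sampled)
```

Capping n at nrow(df) is a design choice of this sketch, so that a size larger than the data returns every row rather than raising an error. Whether read_ja_text8 does the same is not documented here.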

Details

By default, this function reads the ja.text8 corpus into a tibble, splitting it into sentences. As a whole, the ja.text8 corpus consists of over 582,000 sentences, 16,900,026 tokens, and a vocabulary of 290,811 words.

Value

A tibble.
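
A minimal usage sketch follows. It assumes the ldccr package is installed and that network access to the default URL is available. The value 100L is an arbitrary choice for illustration, and the columns of the returned tibble are not documented on this page, so inspect them after loading.

```r
# Run only if ldccr is installed; the call downloads the zip archive
# from the default URL, so it needs network access.
if (requireNamespace("ldccr", quietly = TRUE)) {
  # Request a sample instead of the full corpus to keep the result small.
  corpus <- ldccr::read_ja_text8(size = 100L)

  # Check the structure of the result rather than assuming column names.
  print(corpus)
  print(dim(corpus))
}
```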


paithiov909/ldccr documentation built on Feb. 3, 2025, 12:16 a.m.