read_ja_text8: Read the ja.text8 corpus

View source: R/ja-text8-reader.R

read_ja_text8    R Documentation

Read the ja.text8 corpus

Description

Download and read the ja.text8 corpus as a tibble.

Usage

read_ja_text8(
  url =
    "https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.text8.zip",
  size = NULL
)

Arguments

url

String. URL of the zip archive of the ja.text8 corpus to download.

size

Integer. If supplied, the number of rows to sample from the corpus.

Details

By default, this function reads the ja.text8 corpus into a tibble, splitting it into sentences. The whole ja.text8 corpus consists of over 582,000 sentences, 16,900,026 tokens, and a vocabulary of 290,811 unique items.

Value

A tibble.
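
Examples

The following is a minimal usage sketch rather than an example taken from the package itself. It assumes that the ldccr package is installed and that the default URL is reachable, since the call downloads the zip archive. The sample size of 100 is an arbitrary illustrative value, and the columns of the returned tibble are not documented here, so the sketch only inspects the result generically.

```r
## Not run: requires network access and downloads the corpus archive.
library(ldccr)

# Read a sample of the corpus instead of every sentence.
# `size = 100` is an arbitrary choice for illustration.
corpus <- read_ja_text8(size = 100)

# Inspect the result; the page above only states that it is a tibble.
print(class(corpus))
print(nrow(corpus))
print(head(corpus))
```

Leaving `size` at its default of NULL reads the corpus without sampling, which is considerably larger given the sentence count noted under Details.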


paithiov909/ldccr documentation built on Oct. 14, 2024, 3:44 a.m.