Introduction

Aozora Bunko in a nutshell http://www.aozora.gr.jp/

Basic workflow

  1. Volunteers transcribe or scan works from books into a custom text format.
  2. A separate set of volunteers then checks the transcription for errors.
  3. If the work is no longer under copyright, it is released publicly on the Aozora Bunko website.

Basic information

Situating Aozora Bunko in relation to other corpora

References: Meiroku Zasshi: @近藤明日子2012明六雑誌コーパス; Kokumin no Tomo: @近藤明日子2016明六雑誌コーパス; The Sun: @田中牧郎2005言語資料としての雑誌; Kindai Josei Zasshi: @田中牧郎2006近代女性雑誌コーパス; Balanced Corpus of Contemporary Written Japanese: @Maekawa2013

Examples of Aozora Bunko as corpus for linguistic analysis

Preprocessing
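
One typical preprocessing step is removing the markup of the Aozora Bunko text format before analysis. The sketch below is an illustration and not the pipeline used in this talk. It relies on the format's conventions that ruby readings are written in 《》, that ｜ marks where a ruby base begins, and that editorial notes are written as ［＃…］. The function name `strip_aozora_markup` is made up for this example.

```python
import re

RUBY = re.compile(r"《[^》]*》")        # ruby readings, e.g. 吾輩《わがはい》
NOTE = re.compile(r"［＃[^］]*］")      # transcriber annotations, e.g. ［＃「猫」に傍点］

def strip_aozora_markup(text):
    """Remove ruby readings, annotations and ruby-start markers (hypothetical helper)."""
    text = RUBY.sub("", text)
    text = NOTE.sub("", text)
    return text.replace("｜", "")     # ｜ only marks where a ruby base starts

line = "｜吾輩《わがはい》は猫である。［＃「猫」に傍点］"
clean = strip_aozora_markup(line)
print(clean)
```

Real files also carry a header and footer (title, author, notes on the source edition) that would need to be cut separately.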

Viewers

Many online viewers and smartphone apps exist for reading Aozora Bunko material; I won’t cover them here.

Aozora Bunko as corpus resource

Unofficial search interfaces

Aozora Bunko metadata schema

Official database schema for Aozora Bunko’s author-work CSV file copied from https://www.slideshare.net/takahashim/osc201703-aozorahack.
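
As a rough sketch of how the author-work CSV can be used to select texts, the example below filters rows by copyright status and NDC class. The column names (作品ID, 作品名, 姓, 名, 分類番号, 作品著作権フラグ) and the flag value なし are assumptions about the CSV layout and should be checked against the actual file; the rows are invented placeholders, not real catalogue entries.

```python
import csv
import io

# Placeholder rows; column names are assumed, values are invented for illustration.
SAMPLE_CSV = """作品ID,作品名,姓,名,分類番号,作品著作権フラグ
000001,Sample work A,Author,One,NDC 913,なし
000002,Sample work B,Author,Two,NDC 914,なし
000003,Sample work C,Author,Three,NDC 913,あり
"""

def public_domain_works(csv_file, ndc="913"):
    """Return rows that are out of copyright and carry the given NDC class."""
    rows = csv.DictReader(csv_file)
    return [
        row for row in rows
        if row["作品著作権フラグ"] == "なし"      # assumed value meaning "no copyright"
        and ndc in row["分類番号"].split()        # e.g. "NDC 913" -> ["NDC", "913"]
    ]

selected = public_domain_works(io.StringIO(SAMPLE_CSV))
print([row["作品名"] for row in selected])
```

NDC 913 is the class for Japanese fiction (小説、物語), which is the class given for the authors later in these slides.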

Projects that worked to expand metadata

The basic LOD metadata structure has been done before:

Issues with and solutions for metadata relevant for us today:

Aozora Bunko

Interim solution: Aozora Bunko Corpus Generator

(*) This is a slight issue.

Morphological processing issues

References: @UniDic; @OkaUnidicCWJ220en

Dictionary comparisons

Japanese novels corpus

Japanese novels corpus: Data formats

Built with the Aozora Bunko Corpus Generator from the latest Aozora Bunko GitHub repository, combined with a non-free dataset.

Plain version

Tokenized version

Japanese novels corpus: Issues

Japanese novels corpus overview

High group (6 authors, 25 works)

Low group (6 authors, 22 works) (cf. proletarian literature)

Japanese novels corpus: 夏目漱石 Natsume Soseki (1867–1916) HIGH

NDC: 小説、物語 Fiction

Japanese novels corpus: 太宰治 Dazai Osamu (1909–1948) HIGH

NDC: 小説、物語 Fiction

Comments:

Japanese novels corpus: 坂口安吾 Sakaguchi Ango (1906–1955) LOW

NDC: 小説、物語 Fiction

Japanese novels corpus: 夢野久作 Yumeno Kyusaku (1889–1936) LOW

NDC: 小説、物語 Detective

Japanese novels corpus: 江戸川乱歩 Edogawa Ranpo (1894–1965) LOW

NDC: 小説、物語 (児童書) Detective

Japanese novels corpus: 海野十三 Unno Juza (1897–1949) LOW

NDC: 小説、物語 (児童書あり) SF/Young SF

Japanese novels corpus: 岡本綺堂 Okamoto Kido (1872–1939) LOW

NDC: 小説、物語 Horror (怪奇)/Detective

Japanese novels corpus: 森鴎外 Mori Ogai (1862–1922) HIGH

NDC: 小説、物語 Fiction/Non-Fiction

Japanese novels corpus: 小栗虫太郎 Oguri Mushitaro (1901–1946) LOW

NDC: 小説、物語 Horror (怪奇)

Comments:

Japanese novels corpus: 小林多喜二 Kobayashi Takiji (1903–1933) HIGH

NDC: 小説、物語 Fiction

Japanese novels corpus: 谷崎潤一郎 Tanizaki Jun’ichiro (1886–1965) HIGH

NDC: 小説、物語 Fiction

DBpedia entry

Japanese novels corpus: 川端康成 Kawabata Yasunari (1899–1972) HIGH

NDC: 小説、物語 Fiction

Comments:

Stylistics

General

Text complexity measures

Lexical complexity measures

ttr        = v/l
guiraud_r  = v/sqrt(l)
herdan_c   = log(v)/log(l)
dugast_k   = log(v)/log(log(l))
dugast_u   = log(l)^2/(log(l) - log(v))
maas_a2    = (log(l) - log(v))/log(l)^2
tuldava_ln = (1 - v^2)/(v^2 * log(l))
brunet_w   = l^(v^-0.172)
cttr       = v/sqrt(2 * l)
summer_s   = log(log(v))/log(log(l))
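
To make the notation concrete, here is a minimal Python sketch of some of the measures above, reading `l` as the text length in tokens and `v` as the vocabulary size in types. The function name `lexical_measures` and the toy token list are made up for illustration; in practice the tokens would come from the morphological analysis step.

```python
import math

def lexical_measures(tokens):
    """Compute some type/token-based measures from a list of tokens."""
    l = len(tokens)          # text length (tokens)
    v = len(set(tokens))     # vocabulary size (types)
    log_l, log_v = math.log(l), math.log(v)
    return {
        "ttr": v / l,
        "guiraud_r": v / math.sqrt(l),
        "herdan_c": log_v / log_l,
        # Dugast's U is undefined when every token is unique (v == l).
        "dugast_u": log_l ** 2 / (log_l - log_v) if v < l else math.inf,
        "maas_a2": (log_l - log_v) / log_l ** 2,
        "brunet_w": l ** (v ** -0.172),
        "cttr": v / math.sqrt(2 * l),
    }

# Toy, pre-tokenized input (9 tokens, 8 types).
tokens = ["吾輩", "は", "猫", "で", "ある", "名前", "は", "まだ", "無い"]
measures = lexical_measures(tokens)
for name, value in measures.items():
    print(f"{name:10s} {value:.4f}")
```

Most of these measures still depend on text length, so values are only comparable across texts of similar size.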

Frequency spectrum and vocabulary size-based measures

sichel_s = dislegomena/v
michea_m = v/dislegomena
honore_h = 100.0 * log(l)/(1 - hapaxlegomena/v)
ent = sum(freq_size * -log(freq / l) * (freq / l) for (freq, freq_size) in fs)
yule_k = 10000 * (sum(freq_size * (freq / l)^2 for (freq, freq_size) in fs) - (1 / l))
simpson_d = sum(freq_size * (freq / l) * ((freq - 1) / (l - 1)) for (freq, freq_size) in fs)
herdan_vm = sqrt(sum(freq_size * (freq / l)^2 for (freq, freq_size) in fs) - (1 / v))
hdd(l, fs, sample_size=42) = sum(freq_size * (1 - dhyper(0, freq, l - freq, sample_size)) / sample_size for (freq, freq_size) in fs)
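
The following Python sketch shows one way to build the frequency spectrum `fs` as a list of `(freq, freq_size)` pairs, where `freq_size` is the number of types occurring exactly `freq` times, and to compute several of the measures from it. The function names are made up for this example, and the hypergeometric term is written out with `math.comb` in place of `dhyper`.

```python
import math
from collections import Counter

def frequency_spectrum(tokens):
    """Return sorted (freq, freq_size) pairs: freq_size types occur freq times."""
    type_freqs = Counter(tokens)
    return sorted(Counter(type_freqs.values()).items())

def spectrum_measures(tokens, sample_size=42):
    l = len(tokens)
    if sample_size > l:
        raise ValueError("sample_size must not exceed the text length")
    fs = frequency_spectrum(tokens)
    v = sum(freq_size for _, freq_size in fs)
    dislegomena = dict(fs).get(2, 0)
    sum_sq = sum(freq_size * (freq / l) ** 2 for freq, freq_size in fs)

    def p_present(freq):
        # 1 - P(X = 0) under the hypergeometric distribution: probability that
        # a type with `freq` occurrences shows up in a random sample.
        absent = math.comb(l - freq, sample_size) / math.comb(l, sample_size)
        return 1 - absent

    return {
        "sichel_s": dislegomena / v,
        "ent": sum(n * -math.log(f / l) * (f / l) for f, n in fs),
        "yule_k": 10000 * (sum_sq - 1 / l),
        "simpson_d": sum(n * (f / l) * ((f - 1) / (l - 1)) for f, n in fs),
        # max() guards against tiny negative values from floating-point rounding.
        "herdan_vm": math.sqrt(max(0.0, sum_sq - 1 / v)),
        "hdd": sum(n * p_present(f) / sample_size for f, n in fs),
    }

tokens = "a b a c b a d".split()   # toy input: a=3, b=2, c=1, d=1
result = spectrum_measures(tokens, sample_size=3)
print(frequency_spectrum(tokens), result)
```

The default `sample_size=42` follows the formula above; the toy call uses 3 only because the example text has just 7 tokens.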

Authorship attribution

東京<名詞>
の<助詞>
まん中<名詞>
に<助詞>
ある<動詞>
有名<形状詞>
[...]
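
A small sketch of how input in this `surface<POS>` format could be read and turned into a feature profile. Which features were actually used for the attribution experiments is not stated here, so the relative POS frequency profile below is only an assumed example, and the helper names are hypothetical.

```python
import re
from collections import Counter

# One token per line: surface form followed by the POS tag in angle brackets.
TOKEN_RE = re.compile(r"^(?P<surface>.+)<(?P<pos>[^<>]+)>$")

def parse_tagged_lines(lines):
    """Return (surface, pos) pairs, skipping lines that are not tagged tokens."""
    pairs = []
    for line in lines:
        match = TOKEN_RE.match(line.strip())
        if match:  # elisions such as "[...]" and blank lines do not match
            pairs.append((match.group("surface"), match.group("pos")))
    return pairs

def relative_frequencies(items):
    """Map each item to its share of all items."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}

sample = ["東京<名詞>", "の<助詞>", "まん中<名詞>", "に<助詞>",
          "ある<動詞>", "有名<形状詞>", "[...]"]
pairs = parse_tagged_lines(sample)
pos_profile = relative_frequencies(pos for _, pos in pairs)
print(pos_profile)
```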

Authorship attribution (Euclidean distance)

©Uesaka Ayaka

Authorship attribution (Cosine distance)

©Uesaka Ayaka

Authorship attribution (KL distance)

©Uesaka Ayaka
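
For reference, the three distances named in the slides above can be sketched in Python as follows. Profiles are assumed to be dictionaries mapping a feature to its relative frequency. The smoothing constant `eps` and the symmetrized KL variant are assumptions of this sketch; the exact formulation behind the figures may differ.

```python
import math

def align(p, q):
    """Put two sparse profiles on the same feature axis."""
    keys = sorted(set(p) | set(q))
    return [p.get(k, 0.0) for k in keys], [q.get(k, 0.0) for k in keys]

def euclidean_distance(p, q):
    x, y = align(p, q)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(p, q):
    x, y = align(p, q)
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with additive smoothing so unseen features do not give log(0)."""
    x, y = align(p, q)
    x = [a + eps for a in x]
    y = [b + eps for b in y]
    sx, sy = sum(x), sum(y)
    return sum((a / sx) * math.log((a / sx) / (b / sy)) for a, b in zip(x, y))

def symmetric_kl(p, q, eps=1e-9):
    """Average of both directions, since plain KL is not symmetric."""
    return 0.5 * (kl_divergence(p, q, eps) + kl_divergence(q, p, eps))

a = {"名詞": 0.5, "助詞": 0.5}
b = {"名詞": 0.5, "動詞": 0.5}
print(euclidean_distance(a, b), cosine_distance(a, b), symmetric_kl(a, b))
```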

Topic model on Dec. 2017 version of Aozora Bunko

Topic model on Jan. 2018 version of Aozora Bunko

Discussion points

References