Works are transcribed or scanned from books into a custom text format by volunteers.
A separate set of volunteers then checks the transcription for errors.
If the work is not under copyright anymore, it is released publicly on the Aozora Bunko website.
Basic information
The closest English equivalent is Project Gutenberg, but there are many differences, in particular regarding what kind of information is encoded about the works.
It is relatively large: currently around 76M tokens across 4,000 works in just the most modern Japanese orthography (新仮名新漢字, new kana and new kanji). The full collection is larger still (>10k works).
Situating Aozora Bunko in relation to other corpora
References: Meiroku Zasshi: @近藤明日子2012明六雑誌コーパス; Kokumin no Tomo: @近藤明日子2016明六雑誌コーパス; The Sun: @田中牧郎2005言語資料としての雑誌; Kindai Josei Zasshi: @田中牧郎2006近代女性雑誌コーパス; Balanced Corpus of Contemporary Written Japanese: @Maekawa2013
Examples of Aozora Bunko as corpus for linguistic analysis
The most complete parser for the Aozora Bunko text format is written in Emacs Lisp and runs to over 1,000 lines. There is also a PEG grammar for the peculiar format used to encode metadata in the text files!
University of Chicago’s ARTFL Project: Aozora Search can do concordances, KWIC, collocation and time series search courtesy of their PhiloLogic4 system.
Metadata issues, and solutions for them, that are relevant to us today:
Aozora Bunko
Research using Aozora Bunko mentions issues with the date-of-publication fields (many postdate the author's death) and with preprocessing: @金川絵利子2017en; @HatanoBN0376330X; @ishida40020684966; @weko_32961_1
Preliminary investigations suggest that by using DBpedia's extraction framework we can supplement or correct some of these faults (DBpedia extracts Linked Open Data from Wikipedia database dumps: @lehmann2015dbpedia).
Interim solution: Aozora Bunko Corpus Generator
Aozora Bunko Corpus Generator is a Python preprocessing script that generates plain and tokenized text files from the Aozora Bunko for use in corpus-based studies.
Operates on the official GitHub repository of Aozora Bunko, and specifically on the HTML versions of works (*).
Particular authors and works can be selected for extraction, or all data can be extracted.
Basic punctuation/symbol removal; morphological feature extraction (surface form, lemma, or a combination of any MeCab dictionary fields); filtering of works smaller than a specified token threshold.
Outputs a CSV of metadata.
(*) This is a slight issue.
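The punctuation removal and token-threshold filtering described above can be sketched as follows. This is an illustrative reconstruction, not the tool's actual API: the function names, the punctuation set, and the plain-list token representation are assumptions (in the real pipeline, tokens would come from MeCab).

```python
import re

# Illustrative set of basic Japanese punctuation/symbols to strip;
# the actual tool's set may differ.
PUNCT_RE = re.compile(r"[、。「」『』（）！？・…]")

def strip_punctuation(text: str) -> str:
    """Remove basic Japanese punctuation and symbols from a text."""
    return PUNCT_RE.sub("", text)

def filter_by_length(works: dict, min_tokens: int) -> dict:
    """Drop works whose token count falls below the threshold.

    `works` maps a work's title to its token list (hypothetical
    representation; the real tool reads tokenized files from disk).
    """
    return {title: toks for title, toks in works.items()
            if len(toks) >= min_tokens}
```

For example, `strip_punctuation("吾輩は猫である。")` yields `"吾輩は猫である"`, and a corpus dict passed through `filter_by_length(works, 1000)` retains only works of at least 1,000 tokens.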
Morphological processing issues
At this point, morphological processing issues come down to choice of dictionary:
IPAdic 2.7.0
UniDic 2.1.0
UniDic CWJ 2.2.0
UniDic CSJ 2.2.0
Kindaigo (Modern Japanese) UniDic 1603
At first glance, the Kindaigo Modern Japanese dictionary might seem like the right choice…
References: @UniDic; @OkaUnidicCWJ220en
Dictionary comparisons
Developed a new tool (ja-morph-diff) to make comparisons easy.
Two major differences between results due to dictionary choice:
Word chunking differences (e.g. is ‘二十歳’ one token or two?)
Morphological feature differences (e.g. is a given token tagged as a conjunction or an adverb?)
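The chunking comparison can be made concrete by comparing token-boundary offsets between two segmentations of the same string, which is one simple way a tool like ja-morph-diff could surface disagreements. This is a sketch under that assumption, not ja-morph-diff's actual implementation; the example segmentations of ‘二十歳’ are hypothetical dictionary outputs.

```python
def boundaries(tokens: list[str]) -> set[int]:
    """Character offsets at which a segmentation places token boundaries."""
    out, pos = set(), 0
    for t in tokens:
        pos += len(t)
        out.add(pos)
    return out

def chunking_disagreements(tokens_a: list[str], tokens_b: list[str]) -> set[int]:
    """Offsets where exactly one of the two analyses places a boundary."""
    return boundaries(tokens_a) ^ boundaries(tokens_b)

# Hypothetical outputs from two different dictionaries for 彼女は二十歳だ:
ana_a = ["彼女", "は", "二十歳", "だ"]       # ‘二十歳’ as one token
ana_b = ["彼女", "は", "二十", "歳", "だ"]   # ‘二十歳’ split in two
```

Here `chunking_disagreements(ana_a, ana_b)` returns `{5}`: the only disputed boundary is inside ‘二十歳’.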
All metadata is included in the “groups.csv” file. The corpus parts (brow level and genre) are sourced from the manually specified ‘author-title.csv’ in the aozora-corpus-generator repo, while the rest is extracted from the master list in the Aozora Bunko repository (list_person_all_extended_utf8.zip).
An additional non-free part (five works by Yasunari Kawabata) is then added to supplement the high-brow part.
One sentence per line, with an extra linebreak between paragraphs.
Tokenized version
End of sentence marker: <EOS>
End of paragraph marker: <PGB>
One line per morpheme segmented using the morphological analyzer MeCab and the UniDic CWJ (Contemporary Written Japanese) dictionary version 2.2.0.
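The tokenized format above (one morpheme per line, `<EOS>` after each sentence, `<PGB>` between paragraphs) can be sketched as a small writer function. The function name and the nested-list input are assumptions for illustration; in the real pipeline the morphemes come from MeCab with UniDic CWJ 2.2.0.

```python
def write_tokenized(paragraphs: list[list[list[str]]]) -> str:
    """Render paragraphs of tokenized sentences in the corpus format:
    one morpheme per line, <EOS> ending each sentence, <PGB> between
    paragraphs. `paragraphs` is a list of paragraphs, each a list of
    sentences, each a list of morpheme strings (hypothetical layout).
    """
    lines = []
    for i, paragraph in enumerate(paragraphs):
        if i > 0:
            lines.append("<PGB>")       # paragraph break marker
        for sentence in paragraph:
            lines.extend(sentence)      # one morpheme per line
            lines.append("<EOS>")       # end-of-sentence marker
    return "\n".join(lines)
```

For a two-paragraph input of one sentence each, the output is the morphemes of the first sentence, `<EOS>`, `<PGB>`, then the second sentence and its `<EOS>`.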
Japanese novels corpus: Issues
“Gaiji” characters with provided JIS X 0213 codepoints are converted to their equivalent Unicode codepoints. Aozora Bunko is conservative in encoding rare kanji and therefore uses images (HTML version) or textual descriptions (plaintext version) instead.
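One way the plaintext-side conversion could work is to match the gaiji note, pull out its JIS X 0213 men-ku-ten code, and look it up in a JIS X 0213 → Unicode table. This is a sketch under that assumption: the regular expression, the radical description in the example, and the one-entry table (mapping to the geta placeholder 〓) are illustrative, not the tool's actual notation handling or a real codepoint mapping.

```python
import re

# Matches Aozora-style gaiji notes such as
# ※［＃「くさかんむり＋日」、第3水準1-84-22］ and captures the
# men-ku-ten code (plane-row-cell). Illustrative pattern only.
GAIJI_NOTE = re.compile(r"※［＃「[^」]*」、第[34]水準(\d+-\d+-\d+)］")

def replace_gaiji(text: str, table: dict[str, str]) -> str:
    """Replace gaiji notes whose code appears in `table`; leave the
    rest untouched. A real table would cover all of JIS X 0213."""
    def sub(m: re.Match) -> str:
        return table.get(m.group(1), m.group(0))
    return GAIJI_NOTE.sub(sub, text)
```

With a toy table `{"1-84-22": "〓"}`, the note in the comment above collapses to the single placeholder character while unknown codes are preserved verbatim.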
In Japanese text, words are sometimes emphasized with dots above the characters; Aozora Bunko uses bold text in their place. Emphasis tags are currently stripped.
Not all footer information is marked up with metadata. Several heuristics are currently in place to detect these occurrences (example regular expression: '^[ 【]?(底本:|訳者あとがき|この翻訳は|この作品.*翻訳|この翻訳.*全訳)').
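The footer heuristic quoted above can be applied line by line: once a line matches, the remainder of the file is treated as footer matter and dropped. The pattern is taken verbatim from the example; the function name and the cut-everything-after behavior are assumptions about how such a heuristic would be wired up.

```python
import re

# Example footer-detection pattern quoted in the notes above.
FOOTER_RE = re.compile(r'^[ 【]?(底本:|訳者あとがき|この翻訳は|この作品.*翻訳|この翻訳.*全訳)')

def strip_footer(lines: list[str]) -> list[str]:
    """Keep lines up to (but not including) the first footer match."""
    body = []
    for line in lines:
        if FOOTER_RE.match(line):
            break
        body.append(line)
    return body
```

For instance, a file whose last lines begin with 底本: would have everything from that line onward removed.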
Uses the newer UniDic CWJ 2.2.0 and a newer snapshot of Aozora Bunko; includes better handling of the kanji-katakana-majiri writing style and other small bugfixes: [Interactive visualization link]
Discussion points
Balance issues:
high vs. low brow
genre
length
number of works per author
variation within works by author
Dictionary choice:
kanji-katakana-majiri
spoken (dialogue) vs written
leaning toward the new UniDic CWJ 2.2.0 (and perhaps also a custom dictionary for issues uncovered by the tokenization comparisons)