Works are transcribed or scanned from books into a custom text format by volunteers.
A separate set of volunteers then checks the transcription for errors.
If the work is not under copyright anymore, it is released publicly on the Aozora Bunko website.
Basic information
The closest English equivalent is Project Gutenberg, but there are many differences, in particular regarding what kind of information is encoded about the works.
It is relatively large: currently around 76M tokens across 4,000 works in just the most modern Japanese orthography (新仮名新漢字, new kana and new kanji). The full collection is larger still (>10k works).
Situating Aozora Bunko in relation to other corpora
References: Meiroku Zasshi: @近藤明日子2012明六雑誌コーパス; Kokumin no Tomo: @近藤明日子2016明六雑誌コーパス; The Sun: @田中牧郎2005言語資料としての雑誌; Kindai Josei Zasshi: @田中牧郎2006近代女性雑誌コーパス; Balanced Corpus of Contemporary Written Japanese: @Maekawa2013
Examples of Aozora Bunko as corpus for linguistic analysis
The most complete parser for the Aozora Bunko text format is written in Emacs Lisp and runs to over 1,000 lines. There is also a PEG grammar for the peculiar format used to encode metadata in the text files!
University of Chicago’s ARTFL Project: Aozora Search can do concordances, KWIC, collocation and time series search courtesy of their PhiloLogic4 system.
Metadata issues, and solutions for them, that are relevant to us today:
Aozora Bunko
Research using Aozora Bunko mentions issues with the date-of-publication fields (many postdate the author's death) and with preprocessing: @金川絵利子2017en; @HatanoBN0376330X; @ishida40020684966; @weko_32961_1
Preliminary investigations suggest that by using DBpedia's extraction framework we can supplement or correct some of these faults (DBpedia extracts Linked Open Data from Wikipedia database dumps: @lehmann2015dbpedia).
Interim solution: Aozora Bunko Corpus Generator
Aozora Bunko Corpus Generator is a Python preprocessing script that generates plain and tokenized text files from the Aozora Bunko for use in corpus-based studies.
Operates on the official GitHub repository of Aozora Bunko, and specifically on the HTML versions of works (*).
Particular authors and works can be selected for extraction, or all data can be extracted.
Basic punctuation/symbol removal; morphological feature extraction (surface form, lemma, or a combination of any MeCab dictionary fields); filtering of works smaller than a specified token threshold.
Outputs a CSV of metadata.
(*) This is a slight issue.
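The punctuation removal and token-threshold filtering described above can be sketched as follows. This is an illustrative reconstruction, not the tool's actual API: the function names, the punctuation set, and the plain-list token representation are assumptions (in the real pipeline, tokens would come from MeCab).

```python
import re

# Illustrative set of basic Japanese punctuation/symbols to strip;
# the actual tool's set may differ.
PUNCT_RE = re.compile(r"[、。「」『』（）！？・…]")

def strip_punctuation(text: str) -> str:
    """Remove basic Japanese punctuation and symbols from a text."""
    return PUNCT_RE.sub("", text)

def filter_by_length(works: dict, min_tokens: int) -> dict:
    """Drop works whose token count falls below the threshold.

    `works` maps a work's title to its token list (hypothetical
    representation; the real tool reads tokenized files from disk).
    """
    return {title: toks for title, toks in works.items()
            if len(toks) >= min_tokens}
```

For example, `strip_punctuation("吾輩は猫である。")` yields `"吾輩は猫である"`, and a corpus dict passed through `filter_by_length(works, 1000)` retains only works of at least 1,000 tokens.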
Morphological processing issues
At this point, morphological processing issues come down to choice of dictionary:
IPAdic 2.7.0
UniDic 2.1.0
UniDic CWJ 2.2.0
UniDic CSJ 2.2.0
Kindaigo (Modern Japanese) UniDic 1603
At first glance, the Kindaigo Modern Japanese dictionary might seem like the right choice…
References: @UniDic; @OkaUnidicCWJ220en
Dictionary comparisons
Developed a new tool (ja-morph-diff) to make comparisons easy.
Two major differences between results due to dictionary choice:
Word chunking differences (e.g. is ‘二十歳’ one token or two?)
Morphological feature differences (e.g. is a given token tagged as a conjunction or an adverb?)
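The chunking comparison can be made concrete by comparing token-boundary offsets between two segmentations of the same string, which is one simple way a tool like ja-morph-diff could surface disagreements. This is a sketch under that assumption, not ja-morph-diff's actual implementation; the example segmentations of ‘二十歳’ are hypothetical dictionary outputs.

```python
def boundaries(tokens: list[str]) -> set[int]:
    """Character offsets at which a segmentation places token boundaries."""
    out, pos = set(), 0
    for t in tokens:
        pos += len(t)
        out.add(pos)
    return out

def chunking_disagreements(tokens_a: list[str], tokens_b: list[str]) -> set[int]:
    """Offsets where exactly one of the two analyses places a boundary."""
    return boundaries(tokens_a) ^ boundaries(tokens_b)

# Hypothetical outputs from two different dictionaries for 彼女は二十歳だ:
ana_a = ["彼女", "は", "二十歳", "だ"]       # ‘二十歳’ as one token
ana_b = ["彼女", "は", "二十", "歳", "だ"]   # ‘二十歳’ split in two
```

Here `chunking_disagreements(ana_a, ana_b)` returns `{5}`: the only disputed boundary is inside ‘二十歳’.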
All metadata is included in the “groups.csv” file. The corpus parts (brow level and genre) are sourced from the manually specified ‘author-title.csv’ in the aozora-corpus-generator repo, while the rest is extracted from the master list in the Aozora Bunko repository (list_person_all_extended_utf8.zip).
An additional non-free part (five works by Yasunari Kawabata) is then added to supplement the high-brow part.
One sentence per line, with an extra linebreak between paragraphs.
Tokenized version
End of sentence marker: <EOS>
End of paragraph marker: <PGB>
One line per morpheme segmented using the morphological analyzer MeCab and the UniDic CWJ (Contemporary Written Japanese) dictionary version 2.2.0.
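The tokenized format above (one morpheme per line, `<EOS>` after each sentence, `<PGB>` between paragraphs) can be sketched as a small writer function. The function name and the nested-list input are assumptions for illustration; in the real pipeline the morphemes come from MeCab with UniDic CWJ 2.2.0.

```python
def write_tokenized(paragraphs: list[list[list[str]]]) -> str:
    """Render paragraphs of tokenized sentences in the corpus format:
    one morpheme per line, <EOS> ending each sentence, <PGB> between
    paragraphs. `paragraphs` is a list of paragraphs, each a list of
    sentences, each a list of morpheme strings (hypothetical layout).
    """
    lines = []
    for i, paragraph in enumerate(paragraphs):
        if i > 0:
            lines.append("<PGB>")       # paragraph break marker
        for sentence in paragraph:
            lines.extend(sentence)      # one morpheme per line
            lines.append("<EOS>")       # end-of-sentence marker
    return "\n".join(lines)
```

For a two-paragraph input of one sentence each, the output is the morphemes of the first sentence, `<EOS>`, `<PGB>`, then the second sentence and its `<EOS>`.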
Japanese novels corpus: Issues
“Gaiji” characters with provided JIS X 0213 codepoints are converted to their equivalent Unicode codepoints. Aozora Bunko is conservative in encoding rare kanji and therefore uses images (HTML version) or textual descriptions (plaintext version) instead.
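One way the plaintext-side conversion could work is to match the gaiji note, pull out its JIS X 0213 men-ku-ten code, and look it up in a JIS X 0213 → Unicode table. This is a sketch under that assumption: the regular expression, the radical description in the example, and the one-entry table (mapping to the geta placeholder 〓) are illustrative, not the tool's actual notation handling or a real codepoint mapping.

```python
import re

# Matches Aozora-style gaiji notes such as
# ※［＃「くさかんむり＋日」、第3水準1-84-22］ and captures the
# men-ku-ten code (plane-row-cell). Illustrative pattern only.
GAIJI_NOTE = re.compile(r"※［＃「[^」]*」、第[34]水準(\d+-\d+-\d+)］")

def replace_gaiji(text: str, table: dict[str, str]) -> str:
    """Replace gaiji notes whose code appears in `table`; leave the
    rest untouched. A real table would cover all of JIS X 0213."""
    def sub(m: re.Match) -> str:
        return table.get(m.group(1), m.group(0))
    return GAIJI_NOTE.sub(sub, text)
```

With a toy table `{"1-84-22": "〓"}`, the note in the comment above collapses to the single placeholder character while unknown codes are preserved verbatim.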
In Japanese text, words are sometimes emphasized with dots above the characters; Aozora Bunko uses bold text in their place. Emphasis tags are currently stripped.
Not all footer information is marked up with metadata. Several heuristics are currently in place to detect these occurrences (example regular expression: '^[ 【]?(底本:|訳者あとがき|この翻訳は|この作品.*翻訳|この翻訳.*全訳)').
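The footer heuristic quoted above can be applied line by line: once a line matches, the remainder of the file is treated as footer matter and dropped. The pattern is taken verbatim from the example; the function name and the cut-everything-after behavior are assumptions about how such a heuristic would be wired up.

```python
import re

# Example footer-detection pattern quoted in the notes above.
FOOTER_RE = re.compile(r'^[ 【]?(底本:|訳者あとがき|この翻訳は|この作品.*翻訳|この翻訳.*全訳)')

def strip_footer(lines: list[str]) -> list[str]:
    """Keep lines up to (but not including) the first footer match."""
    body = []
    for line in lines:
        if FOOTER_RE.match(line):
            break
        body.append(line)
    return body
```

For instance, a file whose last lines begin with 底本: would have everything from that line onward removed.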
Uses the newer UniDic CWJ 2.2.0 and a newer snapshot of Aozora Bunko; includes better handling of the kanji-katakana-majiri writing style and other small bugfixes: [Interactive visualization link]
Discussion points
Balance issues:
high vs. low brow
genre
length
number of works per author
variation within works by author
Dictionary choice:
kanji-katakana-majiri
spoken (dialogue) vs written
leaning toward the new UniDic CWJ 2.2.0 (and perhaps also a custom dictionary for issues uncovered by the tokenization comparisons)