References: Meiroku Zasshi: @近藤明日子2012明六雑誌コーパス; Kokumin no Tomo: @近藤明日子2016明六雑誌コーパス; The Sun: @田中牧郎2005言語資料としての雑誌; Kindai Josei Zasshi: @田中牧郎2006近代女性雑誌コーパス; Balanced Corpus of Contemporary Written Japanese: @Maekawa2013
Many online viewers and smartphone apps can be used to read Aozora Bunko material; I won’t cover them here.
Official database schema for Aozora Bunko’s author-work CSV file, copied from https://www.slideshare.net/takahashim/osc201703-aozorahack
The basic LOD metadata structure has been done before:
Issues with and solutions for metadata relevant for us today:
(*) This is a slight issue.
References: @UniDic; @OkaUnidicCWJ220en
ja-morph-diff
[...] to make comparisons easy.
list_person_all_extended_utf8.zip
Uses Aozora Bunko Corpus Generator with the latest Aozora Bunko GitHub repository, as well as a non-free dataset.
<EOS>
<PGB>
'^[ 【]?(底本:|訳者あとがき|この翻訳は|この作品.*翻訳|この翻訳.*全訳)'
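As a rough illustration of what a pattern like this can be used for, the sketch below cuts a text off at the first line that looks like a colophon (底本, "source edition") or a translator's note (訳者あとがき, "translator's afterword"). The pattern is copied from the text above; the helper name `strip_trailing_notes` and the choice to drop everything from the first match onward are assumptions made for illustration, not a description of how the original tool applies the pattern.

```python
import re

# Pattern copied from the text; how it is applied below is an assumption.
NOTE_RE = re.compile(r'^[ 【]?(底本:|訳者あとがき|この翻訳は|この作品.*翻訳|この翻訳.*全訳)')


def strip_trailing_notes(lines):
    """Keep lines up to, but not including, the first line matching NOTE_RE."""
    kept = []
    for line in lines:
        if NOTE_RE.match(line):
            break  # everything from here on is treated as non-body text
        kept.append(line)
    return kept


body = strip_trailing_notes(["本文の一行目。", "本文の二行目。", "訳者あとがき", "あとがきの本文。"])
print(body)
```

Because the pattern is anchored with `^` and applied with `re.match`, only lines that begin with one of the listed phrases (optionally preceded by a space or 【) trigger the cut-off.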
High group (6 authors, 25 works)
Low group (6 authors, 22 works) (cf. proletarian literature)
NDC: 小説、物語 Fiction
NDC: 小説、物語 Fiction
Comments:
NDC: 小説、物語 Fiction
NDC: 小説、物語 Detective
NDC: 小説、物語 (児童書) Detective
NDC: 小説、物語 (児童書あり) SF/Young SF
NDC: 小説、物語 Horror (怪奇)/Detective
NDC: 小説、物語 Fiction/Non-Fiction
NDC: 小説、物語 Horror (怪奇)
Comments:
NDC: 小説、物語 Fiction
NDC: 小説、物語 Fiction
NDC: 小説、物語 Fiction
Comments:
- `l` is text length (tokens)
- `v` is vocabulary size (types)
- `fs` is frequency spectrum (histogram of frequencies)
- `hapaxlegomena` is number of types with frequency 1
- `dislegomena` is number of types with frequency 2
- `sichel_s = dislegomena/v`
- `michea_m = v/dislegomena`
- `honore_h = 100.0 * log(l)/(1 - hapaxlegomena/v)`
- `ent = sum(freq_size * -log(freq / l) * (freq / l) for (freq, freq_size) in fs)`
- `yule_k = 10000 * (sum(freq_size * (freq / l)^2 for (freq, freq_size) in fs) - (1 / l))`
- `simpson_d = sum((freq_size * (freq / l) * ((freq - 1) / (l - 1)) for (freq, freq_size) in fs))`
- `herdan_vm = sqrt(sum(freq_size * (freq / l)^2 for (freq, freq_size) in fs) - (1 / v))`
- `hdd(l, fs, sample_size=42) = sum((1 - dhyper(0, freq, l - freq, sample_size)) / sample_size for (word, freq) in fs)`
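The measures listed above can be sketched in plain Python as follows. This is a minimal illustration rather than the original implementation: the function names `frequency_spectrum`, `lexical_measures` and `hdd` are hypothetical, `fs` is assumed to map each frequency to the number of types with that frequency, and `dhyper(0, ...)` is replaced by an explicit ratio of binomial coefficients. HD-D is summed over the spectrum (weighting each frequency by `freq_size`), which should be equivalent to summing over individual types. Undefined cases return `nan`, which is also an assumption.

```python
import math
from collections import Counter


def frequency_spectrum(tokens):
    """Return (l, v, fs); fs maps a frequency to the number of types having it."""
    type_freqs = Counter(tokens)
    return len(tokens), len(type_freqs), Counter(type_freqs.values())


def hdd(l, fs, sample_size=42):
    """HD-D: summed probability that each type occurs in a random sample."""
    if l < sample_size:
        return math.nan  # assumption: undefined for texts shorter than the sample
    total_draws = math.comb(l, sample_size)
    score = 0.0
    for freq, freq_size in fs.items():
        # P(type absent from sample) = C(l - freq, n) / C(l, n), i.e. dhyper(0, ...)
        p_absent = math.comb(l - freq, sample_size) / total_draws
        score += freq_size * (1 - p_absent) / sample_size
    return score


def lexical_measures(tokens, sample_size=42):
    """Compute the listed measures for a token list (assumes at least 2 tokens)."""
    l, v, fs = frequency_spectrum(tokens)
    hapaxlegomena = fs.get(1, 0)
    dislegomena = fs.get(2, 0)
    # Sum of squared relative frequencies, shared by Yule's K and Herdan's Vm.
    sum_sq = sum(n * (freq / l) ** 2 for freq, n in fs.items())
    return {
        "sichel_s": dislegomena / v,
        "michea_m": v / dislegomena if dislegomena else math.nan,
        "honore_h": (100.0 * math.log(l) / (1 - hapaxlegomena / v)
                     if hapaxlegomena < v else math.nan),
        "ent": sum(n * -math.log(freq / l) * (freq / l) for freq, n in fs.items()),
        "yule_k": 10000 * (sum_sq - 1 / l),
        "simpson_d": sum(n * (freq / l) * ((freq - 1) / (l - 1))
                         for freq, n in fs.items()),
        # max() guards against tiny negative values from floating-point rounding.
        "herdan_vm": math.sqrt(max(0.0, sum_sq - 1 / v)),
        "hdd": hdd(l, fs, sample_size),
    }


print(lexical_measures(["a", "a", "b", "b", "c", "d"], sample_size=3))
```

Working from the frequency spectrum rather than per-type counts keeps every measure to a single pass over a small histogram, even for long texts.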
`features` flag to append the part-of-speech to each token. Example:
東京<名詞>
の<助詞>
まん中<名詞>
に<助詞>
ある<動詞>
有名<形状詞>
[...]
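The output format shown above can be reproduced with a few lines of Python. This is only a sketch of the formatting step: the helper `append_pos` is hypothetical, and the input is assumed to be a list of (surface, part-of-speech) pairs that a morphological analyser has already produced.

```python
def append_pos(tokens):
    """Format (surface, pos) pairs as surface<pos>, one token per line."""
    return "\n".join(f"{surface}<{pos}>" for surface, pos in tokens)


# Hypothetical pre-analysed input; the tags follow the example above.
analysed = [("東京", "名詞"), ("の", "助詞"), ("まん中", "名詞"),
            ("に", "助詞"), ("ある", "動詞"), ("有名", "形状詞")]
print(append_pos(analysed))
```

Attaching the tag to the surface form keeps homographs with different parts of speech apart when the tokens are later counted as types.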