Agenda

Aozora Bunko

Aozora Bunko abridged history

Significance

GitHub and infrastructure

Aozora Bunko on GitHub

Issues

Research aims

  1. Develop the infrastructure needed to provide continuously updated plaintext and TEI versions of works in Aozora Bunko.
  2. Automatically generate Linked Open Data from Aozora Bunko metadata to enable rich faceted search of this large dataset, with applications to constructing datasets for diachronic, genre, authorship attribution, stylistic, and similar studies.
  3. Expose the resource to the wider non-Japanese-speaking research community.

Text Processing

Japanese language text processing

Annotations included and missing in Aozora Bunko

Natural Language Toolset

Text processing with the Aozora Corpus Generator (1)

The Aozora Bunko Corpus Generator is a Python script that generates plaintext and tokenized corpora from lists of authors or works in Aozora Bunko.

Text processing with the Aozora Corpus Generator (2)

LOD

Linked Open Data: Previous efforts

<http://www.aozora.gr.jp/cards/001257/card46658.html> a
    aozora:BibResource .
<http://www.aozora.gr.jp/index_pages/person1257.html> a
    aozora:Person ,
    foaf:Person ;
    aozora:authorID "1257"^^xsd:int ;
    foaf:familyName "アーヴィング" ;
    foaf:givenName "ワシントン" ;
    aozora:familyNameTranscription "アーヴィング" ;
    aozora:givenNameTranscription "ワシントン" ;
    aozora:familyNameForSort "ああういんく" ;
    aozora:givenNameForSort "わしんとん" ;
    aozora:familyNameInRomaji "Irving" ;
    aozora:givenNameInRomaji "Washington" ;
    rdag2:dateOfBirth "1783-04-03"^^xsd:date ;
    rdag2:dateOfDeath "1859-11-28"^^xsd:date ;
    foaf:name "アーヴィング ワシントン" ;
    rdfs:seeAlso <http://id.ndl.go.jp/auth/ndlna/00444295> , ...

LOD: Problems

_:node17fc51317x3 dcterms:title "スケッチ・ブック" ;
    aozora:firstIssued "1957(昭和32)年5月20日" ;
    aozora:inputVersion "2000(平成12)年2月20日33刷改版" ;
    aozora:modifiedVersion "2000(平成12)年 2月20日33刷改版" ;
    dcterms:publisher "新潮文庫、新潮社" .
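
The date values above are free-text strings that mix the Western year, the Japanese era year, and edition notes. A minimal sketch of how such a string might be reduced to an ISO date is shown below; the function name and pattern are hypothetical, and the sketch assumes the Western year always comes first, as in the examples on this slide.

```python
import re
from datetime import date
from typing import Optional

# Western year, optional era year in (full- or half-width) parentheses,
# then month and day. Trailing edition notes such as "33刷改版" are ignored.
DATE_PATTERN = re.compile(
    r"(\d{4})(?:[((][^))]*[))])?年\s*(\d{1,2})月\s*(\d{1,2})日"
)


def parse_aozora_date(raw: str) -> Optional[date]:
    """Return the first date found in raw, or None if no valid date is present."""
    match = DATE_PATTERN.search(raw)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 from a typo in the source record
        return None


first_issued = parse_aozora_date("1957(昭和32)年5月20日")
input_version = parse_aozora_date("2000(平成12)年 2月20日33刷改版")
```

Both variants of the 2000 date on this slide, with and without the space after 年, normalize to the same value, which can then be typed as xsd:date.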

LOD: Current solutions (1)

Initial release of the Aozora Bunko Corpus Converter library.

LOD: Current solutions (2)

<http://abc.com/0.1/w050367>
    abc:aozoraLastModifiedDate
        "2014-09-21"^^xsd:date ;
    abc:aozoraPublishingDate
        "2010-06-21"^^xsd:date ;
    abc:author    "p001263" ;
    abc:bibResource
        <https://www.aozora.gr.jp/cards/001263/card50367.html> ;
    abc:copyrightExpired
        false ;
    abc:firstPublished
        "1933-09-01"^^xsd:date ;
    abc:ndc       [ abc:ndc/category "中国" ] ;
    abc:ndc       [ abc:ndc/category "朝鮮" ] ;

*abc.com is a placeholder

LOD: Current solutions (3)

    abc:orthographicStyle
        "新字新仮名" ;
    abc:references
        <http://abc.com/0.1/黒船前後・志士と経済他十六篇> ;
    abc:revisionSource
        "1981(昭和56)年7月16日第1刷" ;
    abc:revisor   "小林繁雄" ;
    abc:sources   <https://www.aozora.gr.jp/cards/001263/files/50367_39396.html> , <https://www.aozora.gr.jp/cards/001263/files/50367_ruby_38317.zip> ;
    abc:subtitle  "はつりょうえんせいたい" ;
    abc:title     "撥陵遠征隊" ;
    abc:transcriber
        "ゆうき" ;
    abc:transcription
        "はつりょうえんせいたい" ;
    abc:transcriptionSource
        "1981(昭和56)年7月16日第1刷" .
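
The identifiers in the record above appear to be zero-padded forms of the numeric Aozora Bunko IDs (work 50367 becomes w050367, author 1263 becomes p001263). The sketch below reproduces that pattern; the function names are hypothetical, the base IRI is the abc.com placeholder, and whether the converter derives identifiers this way is an assumption.

```python
BASE_IRI = "http://abc.com/0.1/"  # placeholder namespace, as on the slide


def work_id(work_number: int) -> str:
    """Zero-pad a work number to six digits and prefix it with 'w'."""
    return f"w{work_number:06d}"


def author_id(author_number: int) -> str:
    """Zero-pad an author number to six digits and prefix it with 'p'."""
    return f"p{author_number:06d}"


def card_url(author_number: int, work_number: int) -> str:
    """Build the Aozora Bunko card page URL for a work."""
    return (
        "https://www.aozora.gr.jp/cards/"
        f"{author_number:06d}/card{work_number}.html"
    )


subject = BASE_IRI + work_id(50367)
```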

TEI Encoding

TEI Header Mapping (1)

fileDesc

TEI Header Mapping (2)

profileDesc

revisionDesc

TEI Body Mapping

Discussion

Future Work

References