Bor Hodošček: Academic Website

自然言語処理の基本と技術

1章

自然言語対人工言語 / natural vs. artificial languages
言語の曖昧性 / ambiguity in language
自然言語処理の技術と応用例 / NLP technologies and applications
- 仮名漢字変換 / kana-kanji conversion, Input Method Editor (IME)
  - ATOK, MS-IME, Google IME
- 機械翻訳 / machine translation
- 音声翻訳 / speech translation
- 検索エンジン / search engine
  - Linguee word sense parallel corpus search
- クエリ，文書 / query, document
- スペル訂正 / spellcheck
  - Grammarlyで自動文章校正 / automatic proofreading with Grammarly
- 対話システム / dialogue system
- 質問応答 / Q&A system
  - IBM Watson
- ファクトイド / factoid
  - Example 1, Example 2, Example 3 (see also)
歴史 / history
品詞タグ付与と係り受け解析例 / Example of part-of-speech tagging and dependency parsing
Universal Dependencies Part-Of-Speech tagset
知識ベース・言語資源 / knowledge bases and language resources
- WordNet, 検索インターフェース / Search interface
- FrameNet, FrameGrapher
Google AI: Semantic Experiences

2章

コーパス / corpus
均衡コーパス / balanced corpus
辞書 / dictionary
形態素解析用辞書 / dictionary for morphological analysis
現代日本語均衡書き言葉コーパス (BCCWJ) / Balanced Corpus of Contemporary Written Japanese
Brown Corpus
知識獲得 / knowledge acquisition
異表記 / orthographic variants
部分全体関係 / part-whole relation (meronymy)
上位下位関係 / hypernym-hyponym relation (hypernymy/hyponymy)
意味カテゴリ関係 / semantic category relation
属性関係 / attribute (property) relation
分布仮説 / distributional hypothesis
分布類似度 / distributional similarity
単語のクラスタリング / word clustering
単語のベクトル表現 / word vector representation
語彙統語パターン / lexico-syntactic patterns
言い換え / paraphrasing
シソーラス / thesaurus
データベース / database
ワードネット / WordNet
オントロジー / ontology
知識ベース / knowledge base
情報抽出 / information extraction
固有表現抽出 / named entity extraction/recognition (NER)
関係抽出 / relation extraction
イベント情報抽出 / event extraction
スロット付きのテンプレート / slotted template
分野適応 / domain adaptation
テキストマイニング / text mining
形態素 / morpheme
単語分割 / word segmentation
未知語 / unknown word (out-of-vocabulary word)
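
The distributional hypothesis (分布仮説) and distributional similarity (分布類似度) listed above can be illustrated with a minimal sketch: each word is represented by counts of the words that occur near it, and two words are compared by the cosine of those count vectors. The toy corpus, the window size of 2, and the function names are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Toy corpus (hypothetical), already split into words.
sentences = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the mouse".split(),
    "the dog chases the cat".split(),
]
vectors = context_vectors(sentences)

# "cat" and "dog" share contexts, so they score higher than "cat" and "milk".
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["milk"]))
```

Real systems use far larger corpora and usually reweight the raw counts, but the principle of comparing words by their contexts is the same.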

3章

Markov chains in a nutshell:

Markov Chains: A visual explanation by Victor Powell
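
As a complement to the visual explanation, here is a minimal sketch of a two-state Markov chain. The states and transition probabilities are made up for illustration; the code repeatedly applies the transition table to a probability distribution over states.

```python
def step(dist, transitions):
    """Advance a probability distribution over states by one step of the chain."""
    nxt = {state: 0.0 for state in transitions}
    for state, p in dist.items():
        for target, q in transitions[state].items():
            nxt[target] += p * q
    return nxt


# Hypothetical weather model: each row gives P(next state | current state).
transitions = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

dist = {"sunny": 1.0, "rainy": 0.0}  # start in "sunny" with certainty
for _ in range(50):
    dist = step(dist, transitions)

print(dist)  # approaches the stationary distribution: sunny 5/6, rainy 1/6
```

The next state depends only on the current state, which is the defining (Markov) property; the long-run distribution does not depend on where the chain started.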

Hidden Markov Models (HMM) and the Viterbi algorithm (the decoding step in an HMM):

Jurafsky and Martin’s chapter on Hidden Markov Models
Viterbi algorithm graphical explanation: https://www.youtube.com/watch?v=RwwfUICZLsA
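
The following is a minimal sketch of Viterbi decoding for a small HMM. The states, observations, and probabilities are the widely used healthy/fever toy example, not taken from these notes, and the function signature is an assumption for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               the state that precedes s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            column[s] = max(
                (prev[r][0] * trans_p[r][s] * emit_p[s][obs], r) for r in states
            )
        best.append(column)

    # Pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob


states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)  # ['Healthy', 'Healthy', 'Fever'], probability about 0.01512
```

In part-of-speech tagging the hidden states would be tags and the observations words. Practical implementations work with log probabilities to avoid numerical underflow on long sequences.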

4章

Traditionally, machine translation has been classified according to four levels:

word-for-word
- Take a word string from one language and translate, word by word, into another.
syntactic transfer
- Take a syntactic parse of one language and, using special syntactic transfer rules, generate a syntactic parse for another language.
semantic transfer
- Take a semantic parse (usually a syntactic parse with additional semantic parse information) of one language and, using special semantic and syntactic transfer rules, generate a semantic parse for another language.
knowledge-based translation
- Translate via a language-independent knowledge representation.
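
The lowest of the four levels above, word-for-word translation, can be sketched in a few lines. The lexicon below is a hypothetical toy dictionary; the output shows why this level alone is insufficient, since word order and function words are carried over unchanged.

```python
# Hypothetical toy lexicon mapping Japanese tokens to English glosses.
LEXICON = {
    "私": "I",
    "は": "(topic)",
    "猫": "cat",
    "が": "(subject)",
    "好き": "like",
}


def word_for_word(tokens, lexicon):
    """Translate each token independently; unknown tokens are passed through."""
    return [lexicon.get(token, token) for token in tokens]


print(" ".join(word_for_word(["私", "は", "猫", "が", "好き"], LEXICON)))
# I (topic) cat (subject) like
```

The higher levels (syntactic transfer, semantic transfer, knowledge-based translation) exist precisely to repair the reordering and ambiguity problems visible in this output.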

Chris Callison-Burch’s class on Machine Translation

Alignment

Phrase-based Translation

Phrase- and tree-based machine translation are implemented in the Moses system.

Neural Machine Translation

Following the trend toward learning a task jointly rather than splitting it into several independent components (alignment and translation, in the case of MT), @DBLP:journals/corr/BahdanauCB14 present a deep learning approach that jointly learns to align and translate between languages.
TensorFlow provides an English-French sequence-to-sequence model here.

Evaluation

Overview in Japanese
“An Awkward Disparity between BLEU/RIBES Scores and Human Judgements in Machine Translation” by @tanawkward [[PDF](http://www.aclweb.org/website/old_anthology/W/W15/W15-5009.pdf)]

BLEU

Read the description on Wikipedia.
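
To make the description concrete, here is a minimal sketch of sentence-level BLEU against a single reference: clipped (modified) n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by the brevity penalty. This is an unsmoothed simplification for illustration; BLEU is normally computed over a whole test corpus, often with several references, and the function names here are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
print(bleu("the cat is on the mat".split(), reference))    # identical output scores 1.0
print(bleu("the cat is on mat".split(), reference))        # about 0.58
print(bleu("the the the the the the".split(), reference))  # 0.0 (no matching bigram)
```

The third call shows why the counts are clipped: repeating a common word cannot inflate the score.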

RIBES

A description of RIBES, with Python source code, is available on the NTT Communication Science Laboratories site.

Commercial Machine Translation systems

Open Source MT systems

Phrase/Statistics-based (SMT) open-source systems

Moses

Neural network-based open-source systems

OpenNMT: an industrial-strength, open-source (MIT license) neural machine translation system built on the Torch mathematical toolkit. [Paper] PyTorch

MT Resources

Parallel corpora

OPUS … the open parallel corpus
Aggregates many parallel corpora from different sources. Below are some examples with wide coverage (covering at least Japanese, English, and Chinese).
- OpenSubtitles 2016 [PDF]
  Has a wide range of translation pairs. Note that subtitles are often translated by volunteers rather than professionals.
- Global Voices Parallel Corpus 2016Q4
日本薬局方/ICHガイドライン対訳集 / Japanese Pharmacopoeia and ICH guideline parallel translations

Human translator workflow (translation memory)

SDL Trados Studio

5章

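情報検索システム / information retrieval systems

用語：文書（document），情報要求（information need），検索質問／クエリ（query），適合文書（relevant document），構造化された情報⇔非構造化された情報（structured vs. unstructured information）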

情報検索の基礎 / fundamentals of information retrieval

用語：全文検索（full-text search），ランキング（ranking），索引（index）

索引付け / indexing

用語：索引語（index term），索引語行列（term-document matrix），索引付け（indexing），疎行列（そぎょうれつ; sparse matrix），転置索引（inverted index），文章処理（text processing），見出し語化（lemmatization），語幹化（stemming），整数列圧縮（integer sequence compression）

ブーリアンモデル / Boolean model

用語：クエリ処理（query processing），AND/OR/NOT，NEAR
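
An inverted index (転置索引) and Boolean query processing can be sketched together: the index maps each term to the set of documents containing it, and AND/OR/NOT become set intersection, union, and difference. The three documents and the whitespace tokenization are simplifying assumptions; NEAR would additionally require storing word positions and is not covered here.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical toy collection.
docs = {
    1: "natural language processing",
    2: "language model",
    3: "image processing",
}
index = build_inverted_index(docs)

both = index["language"] & index["processing"]             # language AND processing
either = index["language"] | index["processing"]           # language OR processing
only_processing = index["processing"] - index["language"]  # processing AND NOT language

print(both, either, only_processing)  # {1} {1, 2, 3} {3}
```

Because only the posting sets of the query terms are touched, the documents themselves never need to be scanned at query time.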

ベクトル空間モデル / vector space model

用語：TF（Term Frequency; 索引語頻度），IDF（Inverse Document Frequency; 逆文書頻度），TF-IDF，文書ベクトル（document vector），コサイン類似度（cosine similarity）

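$$tf = \log_{10}(n) + 1$$

$$idf = \log_{10}\left(\frac{N}{df}\right)$$

$$tfidf = tf \times idf$$

Here n is the number of times the term occurs in the document (n > 0), N is the total number of documents, and df is the number of documents that contain the term.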
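
The formulas above can be turned into a minimal sketch that builds TF-IDF document vectors and compares them with cosine similarity. Here n is read as the count of a term in a document, N as the number of documents, and df as the number of documents containing the term; the toy documents and function names are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    N = len(docs)
    counts = [Counter(doc) for doc in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    return [
        {t: (math.log10(n) + 1) * math.log10(N / df[t]) for t, n in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy collection, already tokenized.
docs = [
    ["nlp", "nlp", "corpus", "text"],
    ["nlp", "parser", "text"],
    ["image", "pixel", "text"],
]
vecs = tfidf_vectors(docs)

print(vecs[0])                   # "text" occurs in every document, so its weight is 0
print(cosine(vecs[0], vecs[1]))  # positive: both documents contain "nlp"
print(cosine(vecs[0], vecs[2]))  # 0.0: the only shared term has zero weight
```

The example shows the purpose of idf: a term that appears in every document contributes nothing to similarity, however often it occurs.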

Web検索 / Web search

用語：クローリング（crawling），クローラー（crawler），スニペット（snippet），クエリサジェスチョン（query suggestion）

情報検索の評価 / evaluation of information retrieval

用語：適合文書（relevant document），正解文書（gold-standard document），偽陽性（False Positive），偽陰性（False Negative），精度（precision），再現率（recall），曲線（curve），評価指標（MAP, MRR, DCG）
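
Several of the measures listed above can be sketched for a single query. Precision and recall treat the result as a set, while average precision and reciprocal rank take the ranking into account; MAP and MRR are the means of the latter two over many queries. The document ids and function names are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks where relevant documents appear."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


ranking = ["d3", "d1", "d7", "d2"]  # system output, best first
relevant = {"d1", "d2", "d5"}       # gold-standard relevant documents

print(precision_recall(ranking, relevant))   # precision 0.5, recall 2/3
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 3 = 1/3
print(reciprocal_rank(ranking, relevant))    # 0.5
```

In this example "d3" and "d7" are false positives and "d5" is a false negative.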

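情報検索システムの現在と課題 / current state and challenges of information retrieval systems

用語：クロスリンガル情報検索（cross-lingual information retrieval），自然文検索（natural-language search），言外の意味（implied meaning），LSI/PLSI/LDA/Word2Vec等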

Text Embedding Models Contain Bias. Here’s Why That Matters.

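6章: Webと自然言語処理 / The Web and natural language processing

自然言語処理のWebサービスへの応用 / applying NLP to Web services

用語：サービス（service），大規模（large scale），ノイズ（noise）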

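文章分類 / text classification

用語：文章分類（text classification），クラス（class），クラス重み（class weight），素性（feature），サポート・ベクトル・マシン（SVM），ロジスティック回帰（Logistic Regression），言語識別（language identification），著者推定（authorship attribution），スパム（spam）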

類似文書検索 / similar document search

用語：類似度（similarity），TF-IDF

連想検索（associative search）：Webcat Plus

クラスタリング / clustering

用語：k平均法（k-means），階層的クラスタリング（hierarchical clustering），系統樹（dendrogram）
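
As an illustration of k-means (k平均法), here is a minimal sketch that alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster. Initializing with the first k points and running a fixed number of iterations are simplifying assumptions; practical implementations use random or k-means++ initialization and stop at convergence.

```python
def kmeans(points, k, iterations=20):
    """Cluster points (tuples of numbers) into k groups; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # naive initialization: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

For document clustering the points would be document vectors such as TF-IDF vectors, usually compared with cosine similarity rather than the squared Euclidean distance used here.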

マイニング / mining

用語：テキストマイニング（text mining），関係抽出（relation extraction），評判分析（sentiment analysis）

スペル訂正 / spelling correction

用語：スペラー（speller），表記誤り（spelling error），表記揺れ（orthographic variation），同義語（synonym），類義語（near-synonym）

レコメンド / recommendation

用語：レコメンド（recommendation (engine)），協調フィルタリング（collaborative filtering），コンテンツに基づくレコメンド（content-based recommendation），コールドスタート問題（cold-start problem）

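文書要約 / document summarization

用語：文書要約（document summarization），単一・複数文書要約（single- and multi-document summarization），リード法（lead method），抽出型要約（extractive summarization），抽象型要約（abstractive summarization），MMR（Maximal Marginal Relevance），代表的な文⇔冗長でない文（representative vs. non-redundant sentences）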
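
MMR (Maximal Marginal Relevance) captures the trade-off between representative and non-redundant sentences named above: at each step it selects the sentence that maximizes λ × (similarity to the query) − (1 − λ) × (maximum similarity to the sentences already selected). The sketch below uses Jaccard similarity over word sets, which is an assumption made to keep the example self-contained; any similarity measure, such as TF-IDF cosine, could be substituted.

```python
def jaccard(a, b):
    """Jaccard similarity between two token lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def mmr(query, sentences, k, lam=0.5, sim=jaccard):
    """Greedily select the indices of k sentences by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:

        def score(i):
            # Redundancy is the similarity to the closest already-selected sentence.
            redundancy = max(
                (sim(sentences[i], sentences[j]) for j in selected), default=0.0
            )
            return lam * sim(sentences[i], query) - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical query and candidate sentences, already tokenized.
query = "cat food".split()
sentences = [
    "cat food is cheap".split(),
    "cat food is very cheap".split(),
    "dog food is expensive".split(),
]

print(mmr(query, sentences, k=2, lam=0.5))  # [0, 2]: the near-duplicate is skipped
print(mmr(query, sentences, k=2, lam=1.0))  # [0, 1]: relevance only
```

With lam = 1.0 the method reduces to ranking by relevance alone, so the two nearly identical sentences are both selected; lowering lam trades some relevance for diversity.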

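質問応答 / question answering

用語：質問応答（QA/Q&A），ファクトイド型質問（factoid question），ノンファクトイド型質問（non-factoid question），文書に対する質問応答システム（question answering over documents），構造化データベースに対する質問応答（question answering over structured databases）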

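Webサービスにおける自然言語処理の課題 / challenges for NLP in Web services

用語：新しい応用技術（new applied technologies），頑健な言語解析（robust language analysis），深いレベルのマイニング（deep-level mining），機械読解（machine reading）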

Google AI Experiments

7章

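文の意味を知る技術 / techniques for understanding sentence meaning

用語：部分文字列（substring），マークアップ（HTML/XMLなど），アノテーション（annotation），ガゼッター（gazetteer），固有表現の曖昧性（named entity ambiguity），辞書の保守性（dictionary maintainability），Infobox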

固有表現抽出 (Named Entity Recognition)
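述語項構造解析／格解析（predicate-argument structure analysis / case analysis）
- 用語：格（case），格構造（case structure），必須格（obligatory case），任意格（optional case），格フレーム（case frame），表層格（surface case），事象性名詞（eventive noun），述語⇔項（predicate vs. argument），意味役割付与（semantic role labeling）
語義曖昧性解消（word sense disambiguation）
- 用語：分類問題（classification problem），正解（教師）データ（labeled training data），新しい語義の検出（novel sense detection）
感情推定・評判解析（emotion estimation / sentiment analysis）
- 用語：消費者生成メディア／CGM（Consumer Generated Media），肯定的⇔否定的（positive vs. negative），極性（polarity），中性（neutral），段階評価（graded rating），回帰分析（regression analysis）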

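文を超えたつながりを知る技術 / techniques for understanding connections beyond the sentence

照応省略解析（anaphora and ellipsis resolution）
- 用語：照応（anaphora），省略（ellipsis），先行詞（antecedent），照応詞（anaphor），中心性アルゴリズム（センタリング理論; centering theory）
談話と対話（discourse and dialogue）
- 用語：独話（monologue），発話（utterance），談話（discourse），談話表示構造（DRS），チャットシステム（chat system），タスク指向対話システム（task-oriented dialogue system），ボット（bot），対話管理（dialogue management），音声対話システム（spoken dialogue system），音声認識（speech recognition），音声合成（speech synthesis）
含意関係認識（recognizing textual entailment）
- 用語：含意関係認識（(recognizing) textual entailment），モダリティ（modality）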

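自然言語処理の限界・課題 / limits and challenges of NLP

適応分野の広がりと機械学習の発展（broadening application domains and advances in machine learning）
- 用語：未知語（UNK），分野適応（domain adaptation），能動学習（active learning），半教師あり学習（semi-supervised learning），正解データ（labeled data），正解なしデータ（unlabeled data），教師なし学習（unsupervised learning），クラスタリング（clustering）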
- See Yann LeCun’s “cake” analogy from NIPS 2016 for a recent take on some of these concepts
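多様化・大規模化（diversification and growth in scale）
- 用語：ビッグデータ（big data），オンライン学習（online learning）⇔バッチ学習（batch learning）
「意味」の問題（the problem of “meaning”）
- 用語：ルール（規則）ベース手法（rule-based methods），統計ベース手法（statistical methods），機械学習を基本とした手法（machine-learning-based methods），再帰性（recursion），世界知識（world knowledge）
- 句に基づく統計的機械翻訳（phrase-based machine translation）対（ディープ）ニューラルネットベース機械翻訳（(deep) neural network based machine translation）
  - どっちがどっちか当ててみよう / Try to guess which is which（新MS/Bing機械翻訳の解説記事 / article explaining the new MS/Bing machine translation）
  - Googleのニューラルネットベース機械翻訳の論文 / Google’s paper on neural network based machine translation（論文で提案されている手法のレビュー / review of the method proposed in the paper）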

NLP A Notes