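Natural Language Processing A

Main Tasks in Natural Language Processing

Today's Goals

  • List the tasks that Natural Language Processing A focuses on and show where each one fits
  • Preview regular expressions as a tool for text preprocessing
  • By next week, experience for yourself the strengths and weaknesses of rule-based language processing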

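Tasks Covered in Natural Language Processing A

  • Text preprocessing (→ the sequence of preprocessing steps needed to fit the text to each task)
  • Building a corpus from the internet (→ to put the preprocessing into practice)
  • Morphological analysis (→ word segmentation and part-of-speech tagging for languages, such as Japanese, that do not separate words with spaces)
  • Vector space models (→ computing semantic relations between words with word embedding models)
  • Syntactic parsing (→ computing relations between words, with a focus on dependency parsing)
  • Other tasks (→ based on neural networks (deep learning))
    • Machine translation, text classification, and others as needs dictate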

Data

Correspondence with Speech and Language Processing

Preparing for Preprocessing

ELIZA

Text Normalization Example (1)

“I desire you will do no such thing. Lizzy is not a bit better than the others; and I am sure she is not half so handsome as Jane, nor half so good-humoured as Lydia. But you are always giving her the preference.”

" I desire you will do no such thing . Lizzy is not a bit better than the others ; and I am sure she is not half so handsome as Jane, nor half so good-humoured as Lydia . But you are always giving her the preference . ”

Text Normalization Example (2)

“I desire you will do no such thing. Lizzy is not a bit better than the others; and I am sure she is not half so handsome as Jane, nor half so good-humoured as Lydia. But you are always giving her the preference.”

“ I desir you will do no such thing . lizzi is not a bit better than the other ; and I am sure she is not half so handsom as jane , nor half so good - humour as lydia . but you are alway give her the prefer . ”

(Using NLTK's Porter Stemmer)

Text Normalization Example (3)

“I desire you will do no such thing. Lizzy is not a bit better than the others; and I am sure she is not half so handsome as Jane, nor half so good-humoured as Lydia. But you are always giving her the preference.”

" -PRON- desire -PRON- will do no such thing . Lizzy be not a bit well than the other ; and -PRON- be sure -PRON- be not half so handsome as Jane , nor half so good - humoured as Lydia . but -PRON- be always give -PRON- the preference . "

(-PRON- is the lemma that spaCy assigns to pronouns; this is spaCy 2.x behavior)

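Text Normalization

  • As the preceding examples show, unless the text is normalized in a way that suits the task, words with the same meaning cannot be processed as the same item
    • tokenization
    • stemming
    • lemmatization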
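A minimal sketch of why normalization matters when counting words, using only the Python standard library. The token pattern and the lemma table `TOY_LEMMAS` are assumptions made for illustration; they are not the tokenizer, stemmer, or lemmatizer used to produce the example outputs, and stemming is left out entirely.

```python
import re
from collections import Counter

text = ("“I desire you will do no such thing. Lizzy is not a bit better than "
        "the others; and I am sure she is not half so handsome as Jane, nor "
        "half so good-humoured as Lydia. But you are always giving her the "
        "preference.”")

# Assumed token pattern: a word (optionally with internal hyphens or
# apostrophes), or a single non-space symbol such as punctuation.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

# Hypothetical hand-written lemma table covering only a few forms;
# a real lemmatizer handles far more vocabulary.
TOY_LEMMAS = {"is": "be", "am": "be", "are": "be",
              "others": "other", "giving": "give"}


def tokenize(text: str) -> list[str]:
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def normalize(token: str) -> str:
    """Lowercase a token, then map it to its lemma if the table knows it."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


tokens = tokenize(text)
raw_counts = Counter(tokens)
normalized_counts = Counter(normalize(t) for t in tokens)

# Without normalization the forms of "be" are counted separately.
print(raw_counts["is"], raw_counts["am"], raw_counts["are"])  # 2 1 1
# After normalization they are counted as one item.
print(normalized_counts["be"])  # 4
```

Lowercasing is itself a normalization choice: it merges "But" with "but", but it also turns names such as "Lizzy" into "lizzy", which may or may not be wanted depending on the task.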

Other Steps

  • Unicode normalization
  • Sentence segmentation
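For Unicode normalization, a short sketch with Python's standard `unicodedata` module. The helper name `normalize_unicode` and the choice of NFKC as the default form are assumptions for illustration; which form is appropriate depends on the task.

```python
import unicodedata


def normalize_unicode(text: str, form: str = "NFKC") -> str:
    """Return text converted to the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


# Full-width Latin letters and digits become their ASCII counterparts under NFKC.
print(normalize_unicode("ＮＬＰ１０１"))  # NLP101

# "e" followed by a combining acute accent becomes one precomposed character under NFC.
decomposed = "e\u0301"
composed = normalize_unicode(decomposed, form="NFC")
print(len(decomposed), len(composed))  # 2 1
```

Strings that look identical on screen can compare as different until they are normalized to the same form, which is why this step usually comes before tokenization or counting.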

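Regular Expressions

Useful Sites for Regular Expressions

Homework

Preprocessing and Regular Expressions (Setup)

  • regular expressions 101

  • Paste the following text as the Test string (from the opening of Pride and Prejudice)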

    “My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently. “You want to tell me, and I have no objection to hearing it.” This was invitation enough.

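Preprocessing and Regular Expressions (Assignment)

  • Decide for yourself where each sentence ends, then write a regular expression that matches the strings at those sentence boundaries (a match may cover more than one character, such as ?”)
  • Write a regular expression that matches each (and every) English word, excluding punctuation (|My|dear|Mr.|Bennet|said|…)
  • Send your answers by email (if your regular expression does not work properly, say so in your message, since sentence boundaries depend on your own judgment)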
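If you would rather try patterns locally than in the browser, the following sketch shows one way to inspect matches with Python's `re` module. The pattern is deliberately naive and is not a solution to the assignment; the helper `show_matches` and its `width` parameter are hypothetical names introduced here. Regex flavors differ slightly, so a pattern written on the site under another flavor may need small changes.

```python
import re

sample = ("Mr. Bennet made no answer. “Do you not want to know who has "
          "taken it?” cried his wife impatiently.")

# Deliberately naive candidate: every ., ? or ! is treated as a boundary.
naive_boundary = re.compile(r"[.?!]")


def show_matches(pattern: re.Pattern, text: str, width: int = 10) -> list[str]:
    """Return each match together with up to `width` characters of left context."""
    return [text[max(0, m.start() - width):m.end()]
            for m in pattern.finditer(text)]


for snippet in show_matches(naive_boundary, sample):
    print(repr(snippet))
```

The first snippet printed is 'Mr.', which shows the kind of false positive a purely rule-based boundary detector has to deal with.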