Universal Dependencies and the Universal POS tagset
The Universal Dependencies (UD) project is an ongoing effort to provide a practical, unified annotation scheme for natural language processing across many languages. The framework covers dependency relations between words as well as lower levels of annotation such as token segmentation and POS tagging.
UD Label Scheme
In general, all information is encoded within three label schemes: universal POS tags, universal morphological features, and universal dependency relations.
Notably, this core of UD only annotates one sentence at a time; it does not concern itself with tasks such as textual entailment, question answering, or translation.
CoNLL-U
The CoNLL-U format is an extension of the standard formats used in CoNLL shared tasks (CoNLL is the SIGNLL Conference on Computational Natural Language Learning). It is the format in which all UD treebanks are distributed.
Each sentence is encoded according to the following rules (see documentation):
- One or more metadata lines prefixed with the hash sign `#` (currently only `sent_id` and `text`)
- Followed by a tab-delimited line providing various information for each word:
  - ID: Word index, integer starting at 1 for each new sentence; may be a range for tokens with multiple words.
  - FORM: Word form or punctuation symbol.
  - LEMMA: Lemma or stem of word form.
  - UPOSTAG: Universal part-of-speech tag drawn from our revised version of the Google universal POS tags.
  - XPOSTAG: Language-specific part-of-speech tag; underscore if not available.
  - FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
  - HEAD: Head of the current token, which is either a value of ID or zero (0).
  - DEPREL: Universal Stanford dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  - DEPS: List of secondary dependencies (head-deprel pairs).
  - MISC: Any other annotation.
- Example from UD Japanese GSD (Link to position in corpus):
# sent_id = dev-s510
# text = それでいてこの対応である。
1 それ それ PRON NP _ 3 obl _ SpaceAfter=No
2 で で ADP PS _ 1 case _ SpaceAfter=No
3 い いる VERB VV _ 6 advcl _ SpaceAfter=No
4 て て SCONJ PC _ 3 mark _ SpaceAfter=No
5 この この DET JR _ 6 det _ SpaceAfter=No
6 対応 対応 NOUN NN _ 0 root _ SpaceAfter=No
7 である だ AUX AV _ 6 cop _ SpaceAfter=No
8 。 。 PUNCT SYM _ 6 punct _ SpaceAfter=No
- Example using the `ginza` command:
# text = それでいてこの対応である。
1 それ 其れ PRON 代名詞 _ 3 nmod _ BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_B
2 で だ AUX 助動詞 _ 1 aux _ BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
3 い 居る VERB 動詞-非自立可能 _ 6 acl _ BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No
4 て て SCONJ 助詞-接続助詞 _ 3 mark _ BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
5 この 此の DET 連体詞 _ 6 det _ BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No
6 対応 対応 NOUN 名詞-普通名詞-サ変可能 _ 0 root _ BunsetuBILabel=B|BunsetuPositionType=ROOT|SpaceAfter=No|NP_B
7 で だ AUX 助動詞 _ 6 aux _ BunsetuBILabel=I|BunsetuPositionType=FUNC|SpaceAfter=No
8 ある 有る AUX 動詞-非自立可能 _ 6 aux _ BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
9 。 。 PUNCT 補助記号-句点 _ 6 punct _ BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No
If we convert this into a table:
id | form | lemma | upos | xpos | feats | head | deprel | deps | misc |
---|---|---|---|---|---|---|---|---|---|
1 | それ | 其れ | PRON | 代名詞 | _ | 3 | nmod | _ | BunsetuBILabel=B\|BunsetuPositionType=SEM_HEAD\|SpaceAfter=No\|NP_B |
2 | で | だ | AUX | 助動詞 | _ | 1 | aux | _ | BunsetuBILabel=I\|BunsetuPositionType=SYN_HEAD\|SpaceAfter=No |
3 | い | 居る | VERB | 動詞-非自立可能 | _ | 6 | acl | _ | BunsetuBILabel=B\|BunsetuPositionType=SEM_HEAD\|SpaceAfter=No |
4 | て | て | SCONJ | 助詞-接続助詞 | _ | 3 | mark | _ | BunsetuBILabel=I\|BunsetuPositionType=SYN_HEAD\|SpaceAfter=No |
5 | この | 此の | DET | 連体詞 | _ | 6 | det | _ | BunsetuBILabel=B\|BunsetuPositionType=SEM_HEAD\|SpaceAfter=No |
6 | 対応 | 対応 | NOUN | 名詞-普通名詞-サ変可能 | _ | 0 | root | _ | BunsetuBILabel=B\|BunsetuPositionType=ROOT\|SpaceAfter=No\|NP_B |
7 | で | だ | AUX | 助動詞 | _ | 6 | aux | _ | BunsetuBILabel=I\|BunsetuPositionType=FUNC\|SpaceAfter=No |
8 | ある | 有る | AUX | 動詞-非自立可能 | _ | 6 | aux | _ | BunsetuBILabel=I\|BunsetuPositionType=SYN_HEAD\|SpaceAfter=No |
9 | 。 | 。 | PUNCT | 補助記号-句点 | _ | 6 | punct | _ | BunsetuBILabel=I\|BunsetuPositionType=CONT\|SpaceAfter=No |
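
The conversion from CoNLL-U lines to a table can be sketched in a few lines of Python. This is a minimal illustration rather than a full CoNLL-U reader: it assumes tab-delimited token lines with exactly ten columns, ignores multiword-token ranges, and uses hypothetical helper names (`parse_tokens`, `to_pipe_table`). Because the MISC column itself contains `|`, the sketch escapes that character before writing a pipe table.

```python
from typing import Dict, List

# Column names in CoNLL-U order, lower-cased as in the table header above.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_tokens(block: str) -> List[Dict[str, str]]:
    """Return one dict per token line, skipping '#' metadata and blank lines."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        tokens.append(dict(zip(FIELDS, cols)))
    return tokens


def to_pipe_table(tokens: List[Dict[str, str]]) -> str:
    """Render tokens as a pipe table, escaping '|' inside cell values."""
    rows = [" | ".join(FIELDS) + " |", "|".join(["---"] * len(FIELDS)) + "|"]
    for tok in tokens:
        cells = [tok[name].replace("|", "\\|") for name in FIELDS]
        rows.append(" | ".join(cells) + " |")
    return "\n".join(rows)


# Tokens 6 and 7 of the ginza output above, re-joined with tabs
# (the listing in this page shows the columns separated by spaces).
SAMPLE = "\n".join([
    "# text = それでいてこの対応である。",
    "\t".join(["6", "対応", "対応", "NOUN", "名詞-普通名詞-サ変可能", "_", "0",
               "root", "_",
               "BunsetuBILabel=B|BunsetuPositionType=ROOT|SpaceAfter=No|NP_B"]),
    "\t".join(["7", "で", "だ", "AUX", "助動詞", "_", "6", "aux", "_",
               "BunsetuBILabel=I|BunsetuPositionType=FUNC|SpaceAfter=No"]),
])

tokens = parse_tokens(SAMPLE)
print(to_pipe_table(tokens))
```

Keeping every value as a string (including `id` and `head`) is a deliberate simplification: ID may be a range for multiword tokens, so converting it to an integer would need extra handling.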
spaCy models and UD
- English model label scheme
  - note that the `tagger` scheme corresponds to the Penn one below
The GiNZA model does not currently have a summary page, but its label scheme is similar to that of the English models. The metadata below gives the details (printed as a Python dictionary rather than strict JSON). Note that dependency labels ending in `_as_` plus a POS tag (for example `acl_as_AUX`) are not displayed in the final analysis, because they are only used for disambiguation in an earlier analysis step. This is detailed in the GiNZA paper [@田中貴秋2019日本語の文法機能タイプ付き単語依存構造解析].
GiNZA model metadata
{'accuracy': {'ents_f': 0.0,
'ents_p': 0.0,
'ents_r': 0.0,
'las': 86.8687191322,
'tags_acc': 99.9137330714,
'token_acc': 98.8075260789,
'uas': 90.4732435529},
'author': 'Megagon Labs, Tokyo.',
'description': 'Japanese multi-task CNN trained on UD-Japanese BCCWJ v2.4 + GSK2014-A. Assigns word2vec token vectors, POS tags, dependency parse and named entities.',
'email': 'ginza@megagon.ai',
'lang': 'ja',
'license': 'MIT',
'name': 'ginza',
'parent_package': 'spacy',
'pipeline': ['tagger', 'parser', 'ner', 'JapaneseCorrector'],
'sources': ['SudachiPy',
'jawiki-cirrussearch-content',
'UD-Japanese BCCWJ v2.4',
'KWDLC'],
'spacy_version': '>=2.1.0',
'speed': {'cpu': 21097.6865736948, 'gpu': None, 'nwords': 145999},
'url': 'https://github.com/megagonlabs/ginza',
'vectors': {'width': 100,
'vectors': 117951,
'keys': 100000,
'name': 'ja_nopn.vectors'},
'version': '2.1.0',
'labels': OrderedDict([('tagger',
['_SP',
'web誤脱',
'代名詞',
'副詞',
'助動詞',
'助詞-係助詞',
'助詞-副助詞',
'助詞-接続助詞',
'助詞-格助詞',
'助詞-準体助詞',
'助詞-終助詞',
'動詞-一般',
'動詞-非自立可能',
'名詞-助動詞語幹',
'名詞-固有名詞-一般',
'名詞-固有名詞-人名-一般',
'名詞-固有名詞-人名-名',
'名詞-固有名詞-人名-姓',
'名詞-固有名詞-地名-一般',
'名詞-固有名詞-地名-国',
'名詞-数詞',
'名詞-普通名詞-サ変可能',
'名詞-普通名詞-サ変形状詞可能',
'名詞-普通名詞-一般',
'名詞-普通名詞-副詞可能',
'名詞-普通名詞-助数詞可能',
'名詞-普通名詞-形状詞可能',
'形容詞-一般',
'形容詞-非自立可能',
'形状詞-タリ',
'形状詞-一般',
'形状詞-助動詞語幹',
'感動詞-フィラー',
'感動詞-一般',
'接尾辞-動詞的',
'接尾辞-名詞的-サ変可能',
'接尾辞-名詞的-一般',
'接尾辞-名詞的-副詞可能',
'接尾辞-名詞的-助数詞',
'接尾辞-形容詞的',
'接尾辞-形状詞的',
'接続詞',
'接頭辞',
'補助記号-一般',
'補助記号-句点',
'補助記号-括弧閉',
'補助記号-括弧開',
'補助記号-読点',
'補助記号-AA-一般',
'補助記号-AA-顔文字',
'記号-一般',
'記号-文字',
'連体詞']),
('parser',
['ROOT',
'acl',
'acl_as_AUX',
'acl_as_VERB',
'advcl',
'advcl_as_AUX',
'advcl_as_VERB',
'advmod',
'advmod_as_ADV',
'advmod_as_AUX',
'amod',
'amod_as_ADJ',
'amod_as_AUX',
'appos',
'appos_as_NOUN',
'appos_as_X',
'as_NOUN',
'as_NUM',
'as_PROPN',
'as_X',
'aux',
'aux_as_ADJ',
'aux_as_AUX',
'aux_as_NOUN',
'aux_as_PART',
'aux_as_VERB',
'case',
'case_as_ADP',
'cc',
'cc_as_CCONJ',
'compound',
'compound_as_NOUN',
'compound_as_PROPN',
'cop',
'dep',
'dep_as_ADJ',
'dep_as_AUX',
'dep_as_CCONJ',
'dep_as_NOUN',
'dep_as_SCONJ',
'dep_as_SYM',
'dep_as_VERB',
'dep_as_X',
'det',
'discourse',
'iobj',
'iobj_as_ADJ',
'iobj_as_ADV',
'iobj_as_NOUN',
'iobj_as_VERB',
'mark',
'mark_as_SCONJ',
'nmod',
'nmod_as_ADJ',
'nmod_as_AUX',
'nmod_as_NOUN',
'nmod_as_PRON',
'nmod_as_VERB',
'nsubj',
'nsubj_as_ADJ',
'nsubj_as_AUX',
'nsubj_as_NOUN',
'nsubj_as_VERB',
'nummod',
'nummod_as_NUM',
'obj',
'obj_as_NOUN',
'obj_as_VERB',
'obl',
'obl_as_ADJ',
'obl_as_ADV',
'obl_as_AUX',
'obl_as_NOUN',
'obl_as_PRON',
'obl_as_VERB',
'punct',
'root_as_AUX',
'root_as_NOUN',
'root_as_SYM',
'root_as_VERB',
'root_as_X',
'subtok']),
('ner',
['DATE',
'LOC',
'MONEY',
'ORG',
'PERCENT',
'PERSON',
'PRODUCT',
'TIME'])])}
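
To see which plain dependency relations remain once the `_as_` labels are set aside, the parser label list can be post-processed as below. This is only a sketch: it assumes the extra information is always a trailing `as_<UPOS>` part, and `strip_as_suffix` and `plain_relations` are hypothetical helper names. How GiNZA itself removes these labels is not shown in the metadata.

```python
import re

# Assumption: the disambiguation part is a trailing "as_<UPOS>",
# optionally preceded by "_" (e.g. "acl_as_AUX" or the bare "as_NOUN").
_AS_SUFFIX = re.compile(r"_?as_[A-Z]+$")


def strip_as_suffix(label: str) -> str:
    """Drop a trailing as_<UPOS> part; labels without one are unchanged."""
    return _AS_SUFFIX.sub("", label)


def plain_relations(labels):
    """Return the sorted set of labels left after stripping, minus empties."""
    return sorted({strip_as_suffix(label) for label in labels} - {""})


# A few labels copied from the 'parser' list in the metadata above.
sample_labels = ["ROOT", "acl", "acl_as_AUX", "as_NOUN",
                 "punct", "root_as_VERB"]
print(plain_relations(sample_labels))
```

Labels such as `as_NOUN` carry no relation name at all, so the sketch simply discards them; whether that matches the intended interpretation is an open assumption.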
Relation to standard English tagsets
@peng2019roads
Penn Treebank
Number | Tag | Description |
---|---|---|
1. | CC | Coordinating conjunction |
2. | CD | Cardinal number |
3. | DT | Determiner |
4. | EX | Existential there |
5. | FW | Foreign word |
6. | IN | Preposition or subordinating conjunction |
7. | JJ | Adjective |
8. | JJR | Adjective, comparative |
9. | JJS | Adjective, superlative |
10. | LS | List item marker |
11. | MD | Modal |
12. | NN | Noun, singular or mass |
13. | NNS | Noun, plural |
14. | NNP | Proper noun, singular |
15. | NNPS | Proper noun, plural |
16. | PDT | Predeterminer |
17. | POS | Possessive ending |
18. | PRP | Personal pronoun |
19. | PRP$ | Possessive pronoun |
20. | RB | Adverb |
21. | RBR | Adverb, comparative |
22. | RBS | Adverb, superlative |
23. | RP | Particle |
24. | SYM | Symbol |
25. | TO | to |
26. | UH | Interjection |
27. | VB | Verb, base form |
28. | VBD | Verb, past tense |
29. | VBG | Verb, gerund or present participle |
30. | VBN | Verb, past participle |
31. | VBP | Verb, non-3rd person singular present |
32. | VBZ | Verb, 3rd person singular present |
33. | WDT | Wh-determiner |
34. | WP | Wh-pronoun |
35. | WP$ | Possessive wh-pronoun |
36. | WRB | Wh-adverb |
A mapping to UD is available in this table.
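
Penn tags and universal POS tags do not line up one to one, so a lookup from Penn to UD is better thought of as returning candidates. The sketch below uses a small hand-picked subset chosen for illustration, not the contents of the linked table, and `PENN_TO_UPOS` and `upos_candidates` are hypothetical names. The `IN` entry shows the ambiguity: Penn uses one tag for prepositions and subordinating conjunctions, whereas UD separates `ADP` and `SCONJ`.

```python
from typing import Dict, List

# Illustrative subset only; ambiguous Penn tags list more than one candidate.
PENN_TO_UPOS: Dict[str, List[str]] = {
    "CD": ["NUM"],
    "IN": ["ADP", "SCONJ"],   # preposition or subordinating conjunction
    "JJ": ["ADJ"],
    "JJR": ["ADJ"],
    "JJS": ["ADJ"],
    "NN": ["NOUN"],
    "NNS": ["NOUN"],
    "NNP": ["PROPN"],
    "NNPS": ["PROPN"],
    "UH": ["INTJ"],
}


def upos_candidates(penn_tag: str) -> List[str]:
    """Return the candidate universal POS tags, or [] for an unlisted tag."""
    return PENN_TO_UPOS.get(penn_tag, [])


for tag in ["NNS", "IN", "XYZ"]:
    print(tag, upos_candidates(tag))
```

Choosing between several candidates needs context (the word itself or its syntactic role), which a table lookup alone cannot provide.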
Stanford Typed Dependencies and POS
Converting Stanford Typed Dependencies and the accompanying POS tags to UD is (also) not straightforward:
- Stanford Typed Dependencies documentation
- one possible set of rules for conversion of dependency labels and POS into UD
Relation to UniDic (Japanese)
UD | 大分類 (major class) | 中分類 (middle class) | 小分類 (minor class) | 細分類 (fine class) |
---|---|---|---|---|
NOUN | 名詞 | 普通名詞 | 一般 | |
NOUN/VERB | 名詞 | 普通名詞 | サ変可能 | |
NOUN | 名詞 | 普通名詞 | 形状詞可能 | |
NOUN/VERB/ADJ | 名詞 | 普通名詞 | サ変形状詞可能 | |
NOUN | 名詞 | 普通名詞 | 副詞可能 | |
PROPN | 名詞 | 固有名詞 | 一般 | |
PROPN | 名詞 | 固有名詞 | 人名 | 一般 |
PROPN | 名詞 | 固有名詞 | 人名 | 姓 |
PROPN | 名詞 | 固有名詞 | 人名 | 名 |
PROPN | 名詞 | 固有名詞 | 地名 | 一般 |
PROPN | 名詞 | 固有名詞 | 地名 | 国 |
PROPN | 名詞 | 固有名詞 | 組織名 | |
NUM | 名詞 | 数詞 | | |
AUX | 名詞 | 助動詞語幹 | | |
PRON | 代名詞 | | | |
ADJ | 形状詞 | 一般 | | |
ADJ | 形状詞 | タリ | | |
ADJ | 形状詞 | 助動詞語幹 | | |
DET | 連体詞 | | | |
ADV | 副詞 | | | |
SCONJ | 接続詞 | | | |
INTJ | 感動詞 | 一般 | | |
INTJ | 感動詞 | フィラー | | |
VERB | 動詞 | 一般 | | |
VERB/AUX | 動詞 | 非自立可能 | | |
ADJ | 形容詞 | 一般 | | |
ADJ | 形容詞 | 非自立可能 | | |
AUX | 助動詞 | | | |
ADP | 助詞 | 格助詞 | | |
ADP | 助詞 | 副助詞 | | |
ADP | 助詞 | 係助詞 | | |
CCONJ | 助詞 | 接続助詞 | | |
PART | 助詞 | 終助詞 | | |
SCONJ | 助詞 | 準体助詞 | | |
NOUN | 接頭辞 | | | |
NOUN | 接尾辞 | 名詞的 | 一般 | |
NOUN | 接尾辞 | 名詞的 | サ変可能 | |
NOUN | 接尾辞 | 名詞的 | 形状詞可能 | |
NOUN | 接尾辞 | 名詞的 | 副詞可能 | |
NOUN | 接尾辞 | 名詞的 | 助数詞 | |
NOUN | 接尾辞 | 形状詞的 | | |
NOUN | 接尾辞 | 動詞的 | | |
NOUN | 接尾辞 | 形容詞的 | | |
SYM | 記号 | 一般 | | |
SYM | 記号 | 文字 | | |
PUNCT | 補助記号 | 一般 | | |
PUNCT | 補助記号 | 句点 | | |
PUNCT | 補助記号 | 読点 | | |
PUNCT | 補助記号 | 括弧開 | | |
PUNCT | 補助記号 | 括弧閉 | | |
SYM | 補助記号 | AA | 一般 | |
SYM | 補助記号 | AA | 顔文字 | |
SPACE* | 空白 | | | |
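
Since UniDic tags are hierarchical (the hyphen-joined form, such as `名詞-普通名詞-サ変可能`, appears in the GiNZA tagger labels), the table above can be read as a longest-prefix lookup. The sketch below encodes only a handful of rows from the table and returns every UD candidate for a tag. `UNIDIC_TO_UD` and `ud_candidates` are hypothetical names, and the prefix-matching strategy is an assumption made for illustration, not a description of how any particular tool performs the mapping.

```python
from typing import Dict, List, Tuple

# A few rows of the table above, keyed by the UniDic class path.
# Shorter keys act as defaults for everything beneath them.
UNIDIC_TO_UD: Dict[Tuple[str, ...], List[str]] = {
    ("名詞", "普通名詞", "一般"): ["NOUN"],
    ("名詞", "普通名詞", "サ変可能"): ["NOUN", "VERB"],
    ("名詞", "固有名詞"): ["PROPN"],
    ("名詞", "数詞"): ["NUM"],
    ("動詞", "一般"): ["VERB"],
    ("動詞", "非自立可能"): ["VERB", "AUX"],
    ("助動詞",): ["AUX"],
    ("補助記号",): ["PUNCT"],
    ("補助記号", "AA"): ["SYM"],
}


def ud_candidates(xpos: str) -> List[str]:
    """Look up a hyphen-joined UniDic tag, trying the longest prefix first."""
    parts = tuple(xpos.split("-"))
    for n in range(len(parts), 0, -1):
        hit = UNIDIC_TO_UD.get(parts[:n])
        if hit is not None:
            return hit
    return []  # tag not covered by this partial table


for tag in ["名詞-普通名詞-サ変可能", "補助記号-AA-顔文字", "動詞-非自立可能"]:
    print(tag, ud_candidates(tag))
```

Rows with several UD values (for example `VERB/AUX` for `動詞-非自立可能`) cannot be resolved from the tag alone; in the ginza output shown earlier, the same UniDic tag surfaces once as `VERB` and once as `AUX`.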
Critiques of UD from a Japanese perspective
@DBLP:journals/corr/abs-1906-09719
Relation to Chinese tagsets
@leung-etal-2016-developing provides an overview of efforts to create a UD (v1) Chinese dataset. spaCy also has a tag mapping to UD here.
General critiques
@osborne2019status