Universal Dependencies and the Universal POS tagset

The Universal Dependencies (UD) project is an ongoing effort to provide a practical and unified annotation scheme for natural language processing across many languages. The annotation framework covers dependency relations between words as well as lower levels such as token segmentation and POS tagging.

UD Label Scheme

In general, all information is encoded within three label schemes: universal POS tags, universal morphological features, and universal dependency relations.

Notably, this core of UD only annotates one sentence at a time and does not concern itself with tasks such as textual entailment, question answering, or translation.

CoNLL-U

The CoNLL-U format is an extension of the standard formats used in the shared tasks of CoNLL (the SIGNLL Conference on Computational Natural Language Learning). It is the format in which all UD treebanks are distributed.

Each sentence is encoded according to the following rules (see documentation):

  • One or more metadata lines prefixed with the hash sign # (currently only sent_id and text)
  • Followed by one tab-delimited line per word, with the following ten fields:
    1. ID: Word index, integer starting at 1 for each new sentence; may be a range for tokens with multiple words.
    2. FORM: Word form or punctuation symbol.
    3. LEMMA: Lemma or stem of word form.
    4. UPOSTAG: Universal part-of-speech tag drawn from our revised version of the Google universal POS tags.
    5. XPOSTAG: Language-specific part-of-speech tag; underscore if not available.
    6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
    7. HEAD: Head of the current token, which is either a value of ID or zero (0).
    8. DEPREL: Universal Stanford dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
    9. DEPS: List of secondary dependencies (head-deprel pairs).
    10. MISC: Any other annotation.
  • Example from UD Japanese GSD (Link to position in corpus):
# sent_id = dev-s510
# text = それでいてこの対応である。
1   それ  それ  PRON    NP  _   3   obl _   SpaceAfter=No
2   で   で   ADP PS  _   1   case    _   SpaceAfter=No
3   い   いる  VERB    VV  _   6   advcl   _   SpaceAfter=No
4   て   て   SCONJ   PC  _   3   mark    _   SpaceAfter=No
5   この  この  DET JR  _   6   det _   SpaceAfter=No
6   対応  対応  NOUN    NN  _   0   root    _   SpaceAfter=No
7   である だ   AUX AV  _   6   cop _   SpaceAfter=No
8   。   。   PUNCT   SYM _   6   punct   _   SpaceAfter=No
  • Example using the ginza command:
# text = それでいてこの対応である。
1       それ    其れ    PRON    代名詞  _       3       nmod    _       BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_B
2       で      だ      AUX     助動詞  _       1       aux     _       BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
3       い      居る    VERB    動詞-非自立可能 _       6       acl     _       BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No
4       て      て      SCONJ   助詞-接続助詞   _       3       mark    _       BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
5       この    此の    DET     連体詞  _       6       det     _       BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No
6       対応    対応    NOUN    名詞-普通名詞-サ変可能  _       0       root    _       BunsetuBILabel=B|BunsetuPositionType=ROOT|SpaceAfter=No|NP_B
7       で      だ      AUX     助動詞  _       6       aux     _       BunsetuBILabel=I|BunsetuPositionType=FUNC|SpaceAfter=No
8       ある    有る    AUX     動詞-非自立可能 _       6       aux     _       BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
9       。      。      PUNCT   補助記号-句点   _       6       punct   _       BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No
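
As an illustration of how the HEAD and DEPREL columns fit together, the following Python sketch checks the rule "root iff HEAD = 0" and groups each token under its head. The (ID, FORM, HEAD, DEPREL) tuples are copied from the UD Japanese GSD example above; check_heads and children_of are hypothetical helper names, not part of any UD tool.

```python
# (ID, FORM, HEAD, DEPREL) copied from the UD Japanese GSD example above.
TOKENS = [
    (1, "それ", 3, "obl"),
    (2, "で", 1, "case"),
    (3, "い", 6, "advcl"),
    (4, "て", 3, "mark"),
    (5, "この", 6, "det"),
    (6, "対応", 0, "root"),
    (7, "である", 6, "cop"),
    (8, "。", 6, "punct"),
]


def check_heads(tokens):
    """Return a list of human-readable problems; the list is empty if none are found."""
    problems = []
    ids = {tok_id for tok_id, _, _, _ in tokens}
    for tok_id, form, head, deprel in tokens:
        # HEAD must be 0 or the ID of another token in the same sentence.
        if head != 0 and head not in ids:
            problems.append(f"{tok_id} ({form}): HEAD {head} is not a token ID")
        if head == tok_id:
            problems.append(f"{tok_id} ({form}): token is its own head")
        # DEPREL is root if and only if HEAD is 0.
        if (head == 0) != (deprel == "root"):
            problems.append(f"{tok_id} ({form}): 'root iff HEAD = 0' is violated")
    return problems


def children_of(tokens):
    """Group dependent IDs under their head ID (0 stands for the artificial root)."""
    children = {}
    for tok_id, _, head, _ in tokens:
        children.setdefault(head, []).append(tok_id)
    return children


print(check_heads(TOKENS))
print(children_of(TOKENS))
```

For this sentence the check finds no problems, and token 6 (対応) is the only child of the artificial root.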

Converting the ginza output above into a table gives:

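| id | form | lemma | upos | xpos | feats | head | deprel | deps | misc |
|---|---|---|---|---|---|---|---|---|---|
| 1 | それ | 其れ | PRON | 代名詞 | _ | 3 | nmod | _ | BunsetuBILabel=B\|BunsetuPositionType=SEM_HEAD\|SpaceAfter=No\|NP_B |
| 2 | で | だ | AUX | 助動詞 | _ | 1 | aux | _ | BunsetuBILabel=I\|BunsetuPositionType=SYN_HEAD\|SpaceAfter=No |
| 3 | い | 居る | VERB | 動詞-非自立可能 | _ | 6 | acl | _ | BunsetuBILabel=B\|BunsetuPositionType=SEM_HEAD\|SpaceAfter=No |
| 4 | て | て | SCONJ | 助詞-接続助詞 | _ | 3 | mark | _ | BunsetuBILabel=I\|BunsetuPositionType=SYN_HEAD\|SpaceAfter=No |
| 5 | この | 此の | DET | 連体詞 | _ | 6 | det | _ | BunsetuBILabel=B\|BunsetuPositionType=SEM_HEAD\|SpaceAfter=No |
| 6 | 対応 | 対応 | NOUN | 名詞-普通名詞-サ変可能 | _ | 0 | root | _ | BunsetuBILabel=B\|BunsetuPositionType=ROOT\|SpaceAfter=No\|NP_B |
| 7 | で | だ | AUX | 助動詞 | _ | 6 | aux | _ | BunsetuBILabel=I\|BunsetuPositionType=FUNC\|SpaceAfter=No |
| 8 | ある | 有る | AUX | 動詞-非自立可能 | _ | 6 | aux | _ | BunsetuBILabel=I\|BunsetuPositionType=SYN_HEAD\|SpaceAfter=No |
| 9 | 。 | 。 | PUNCT | 補助記号-句点 | _ | 6 | punct | _ | BunsetuBILabel=I\|BunsetuPositionType=CONT\|SpaceAfter=No |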
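
The conversion from CoNLL-U lines to table rows can be sketched in a few lines of Python. This is a minimal reader that assumes well-formed input with exactly ten tab-separated fields per word line; FIELDS, parse_sentence, parse_misc and rebuild_text are hypothetical names, and multiword-token ranges and empty nodes are not handled specially.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Rows from the ginza output (FEATS and DEPS are "_" in every row).
RAW_ROWS = [
    ("1", "それ", "其れ", "PRON", "代名詞", "3", "nmod",
     "BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_B"),
    ("2", "で", "だ", "AUX", "助動詞", "1", "aux",
     "BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No"),
    ("3", "い", "居る", "VERB", "動詞-非自立可能", "6", "acl",
     "BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No"),
    ("4", "て", "て", "SCONJ", "助詞-接続助詞", "3", "mark",
     "BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No"),
    ("5", "この", "此の", "DET", "連体詞", "6", "det",
     "BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No"),
    ("6", "対応", "対応", "NOUN", "名詞-普通名詞-サ変可能", "0", "root",
     "BunsetuBILabel=B|BunsetuPositionType=ROOT|SpaceAfter=No|NP_B"),
    ("7", "で", "だ", "AUX", "助動詞", "6", "aux",
     "BunsetuBILabel=I|BunsetuPositionType=FUNC|SpaceAfter=No"),
    ("8", "ある", "有る", "AUX", "動詞-非自立可能", "6", "aux",
     "BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No"),
    ("9", "。", "。", "PUNCT", "補助記号-句点", "6", "punct",
     "BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No"),
]
# Rebuild the tab-delimited CoNLL-U text for one sentence.
SAMPLE = "# text = それでいてこの対応である。\n" + "\n".join(
    "\t".join([i, form, lemma, upos, xpos, "_", head, deprel, "_", misc])
    for i, form, lemma, upos, xpos, head, deprel, misc in RAW_ROWS
)


def parse_sentence(block):
    """Split one CoNLL-U sentence into (metadata dict, list of row dicts)."""
    meta, rows = {}, []
    for line in block.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
        rows.append(dict(zip(FIELDS, values)))
    return meta, rows


def parse_misc(misc):
    """Turn 'Key=Value|Flag' into a dict; bare flags such as NP_B map to True."""
    if misc == "_":
        return {}
    result = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def rebuild_text(rows):
    """Join word forms, adding a space unless MISC contains SpaceAfter=No."""
    parts = []
    for row in rows:
        parts.append(row["form"])
        if parse_misc(row["misc"]).get("SpaceAfter") != "No":
            parts.append(" ")
    return "".join(parts).rstrip()


meta, rows = parse_sentence(SAMPLE)
print(rows[5]["form"], rows[5]["deprel"])
print(rebuild_text(rows) == meta["text"])
```

Each row dictionary corresponds to one line of the table, and the SpaceAfter=No entries in MISC are what allow the original unsegmented sentence to be recovered from the tokens.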

spaCy models and UD

The GiNZA model does not currently have a summary page, but it is similar to the English models. The metadata below gives the details. Note that dependency labels containing _as_ (for example acl_as_AUX) are not displayed in the final analysis, because they are only used for disambiguation in an earlier analysis step. This is detailed in the paper on GiNZA [@田中貴秋2019日本語の文法機能タイプ付き単語依存構造解析].

GiNZA model metadata
{'accuracy': {'ents_f': 0.0,
  'ents_p': 0.0,
  'ents_r': 0.0,
  'las': 86.8687191322,
  'tags_acc': 99.9137330714,
  'token_acc': 98.8075260789,
  'uas': 90.4732435529},
 'author': 'Megagon Labs, Tokyo.',
 'description': 'Japanese multi-task CNN trained on UD-Japanese BCCWJ v2.4 + GSK2014-A. Assigns word2vec token vectors, POS tags, dependency parse and named entities.',
 'email': 'ginza@megagon.ai',
 'lang': 'ja',
 'license': 'MIT',
 'name': 'ginza',
 'parent_package': 'spacy',
 'pipeline': ['tagger', 'parser', 'ner', 'JapaneseCorrector'],
 'sources': ['SudachiPy',
  'jawiki-cirrussearch-content',
  'UD-Japanese BCCWJ v2.4',
  'KWDLC'],
 'spacy_version': '>=2.1.0',
 'speed': {'cpu': 21097.6865736948, 'gpu': None, 'nwords': 145999},
 'url': 'https://github.com/megagonlabs/ginza',
 'vectors': {'width': 100,
  'vectors': 117951,
  'keys': 100000,
  'name': 'ja_nopn.vectors'},
 'version': '2.1.0',
 'labels': OrderedDict([('tagger',
               ['_SP',
                'web誤脱',
                '代名詞',
                '副詞',
                '助動詞',
                '助詞-係助詞',
                '助詞-副助詞',
                '助詞-接続助詞',
                '助詞-格助詞',
                '助詞-準体助詞',
                '助詞-終助詞',
                '動詞-一般',
                '動詞-非自立可能',
                '名詞-助動詞語幹',
                '名詞-固有名詞-一般',
                '名詞-固有名詞-人名-一般',
                '名詞-固有名詞-人名-名',
                '名詞-固有名詞-人名-姓',
                '名詞-固有名詞-地名-一般',
                '名詞-固有名詞-地名-国',
                '名詞-数詞',
                '名詞-普通名詞-サ変可能',
                '名詞-普通名詞-サ変形状詞可能',
                '名詞-普通名詞-一般',
                '名詞-普通名詞-副詞可能',
                '名詞-普通名詞-助数詞可能',
                '名詞-普通名詞-形状詞可能',
                '形容詞-一般',
                '形容詞-非自立可能',
                '形状詞-タリ',
                '形状詞-一般',
                '形状詞-助動詞語幹',
                '感動詞-フィラー',
                '感動詞-一般',
                '接尾辞-動詞的',
                '接尾辞-名詞的-サ変可能',
                '接尾辞-名詞的-一般',
                '接尾辞-名詞的-副詞可能',
                '接尾辞-名詞的-助数詞',
                '接尾辞-形容詞的',
                '接尾辞-形状詞的',
                '接続詞',
                '接頭辞',
                '補助記号-一般',
                '補助記号-句点',
                '補助記号-括弧閉',
                '補助記号-括弧開',
                '補助記号-読点',
                '補助記号-AA-一般',
                '補助記号-AA-顔文字',
                '記号-一般',
                '記号-文字',
                '連体詞']),
              ('parser',
               ['ROOT',
                'acl',
                'acl_as_AUX',
                'acl_as_VERB',
                'advcl',
                'advcl_as_AUX',
                'advcl_as_VERB',
                'advmod',
                'advmod_as_ADV',
                'advmod_as_AUX',
                'amod',
                'amod_as_ADJ',
                'amod_as_AUX',
                'appos',
                'appos_as_NOUN',
                'appos_as_X',
                'as_NOUN',
                'as_NUM',
                'as_PROPN',
                'as_X',
                'aux',
                'aux_as_ADJ',
                'aux_as_AUX',
                'aux_as_NOUN',
                'aux_as_PART',
                'aux_as_VERB',
                'case',
                'case_as_ADP',
                'cc',
                'cc_as_CCONJ',
                'compound',
                'compound_as_NOUN',
                'compound_as_PROPN',
                'cop',
                'dep',
                'dep_as_ADJ',
                'dep_as_AUX',
                'dep_as_CCONJ',
                'dep_as_NOUN',
                'dep_as_SCONJ',
                'dep_as_SYM',
                'dep_as_VERB',
                'dep_as_X',
                'det',
                'discourse',
                'iobj',
                'iobj_as_ADJ',
                'iobj_as_ADV',
                'iobj_as_NOUN',
                'iobj_as_VERB',
                'mark',
                'mark_as_SCONJ',
                'nmod',
                'nmod_as_ADJ',
                'nmod_as_AUX',
                'nmod_as_NOUN',
                'nmod_as_PRON',
                'nmod_as_VERB',
                'nsubj',
                'nsubj_as_ADJ',
                'nsubj_as_AUX',
                'nsubj_as_NOUN',
                'nsubj_as_VERB',
                'nummod',
                'nummod_as_NUM',
                'obj',
                'obj_as_NOUN',
                'obj_as_VERB',
                'obl',
                'obl_as_ADJ',
                'obl_as_ADV',
                'obl_as_AUX',
                'obl_as_NOUN',
                'obl_as_PRON',
                'obl_as_VERB',
                'punct',
                'root_as_AUX',
                'root_as_NOUN',
                'root_as_SYM',
                'root_as_VERB',
                'root_as_X',
                'subtok']),
              ('ner',
               ['DATE',
                'LOC',
                'MONEY',
                'ORG',
                'PERCENT',
                'PERSON',
                'PRODUCT',
                'TIME'])])}
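
To make the _as_ naming pattern in the parser labels concrete, here is a small Python sketch that separates such a label into its dependency relation and the POS hint that follows _as_. It only illustrates the pattern visible in the label list above; how GiNZA itself consumes these labels is not shown here, and split_label is a hypothetical helper.

```python
# A few parser labels copied from the metadata above.
LABELS = ["ROOT", "acl", "acl_as_AUX", "acl_as_VERB", "as_NOUN",
          "aux", "aux_as_PART", "case", "case_as_ADP", "root_as_VERB", "subtok"]


def split_label(label):
    """Return (relation, pos_hint); pos_hint is None when the label has no _as_ part."""
    # Labels such as 'as_NOUN' carry only a POS hint and no relation name.
    if label.startswith("as_"):
        return "", label[len("as_"):]
    relation, sep, pos_hint = label.partition("_as_")
    return relation, (pos_hint if sep else None)


for label in LABELS:
    print(label, split_label(label))

# Relation names that remain once the POS hints are stripped.
relations = sorted({rel for rel, _ in map(split_label, LABELS) if rel})
print(relations)
```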

Relation to standard English tagsets

@peng2019roads

Penn Treebank

| Number | Tag | Description |
|---|---|---|
| 1. | CC | Coordinating conjunction |
| 2. | CD | Cardinal number |
| 3. | DT | Determiner |
| 4. | EX | Existential there |
| 5. | FW | Foreign word |
| 6. | IN | Preposition or subordinating conjunction |
| 7. | JJ | Adjective |
| 8. | JJR | Adjective, comparative |
| 9. | JJS | Adjective, superlative |
| 10. | LS | List item marker |
| 11. | MD | Modal |
| 12. | NN | Noun, singular or mass |
| 13. | NNS | Noun, plural |
| 14. | NNP | Proper noun, singular |
| 15. | NNPS | Proper noun, plural |
| 16. | PDT | Predeterminer |
| 17. | POS | Possessive ending |
| 18. | PRP | Personal pronoun |
| 19. | PRP$ | Possessive pronoun |
| 20. | RB | Adverb |
| 21. | RBR | Adverb, comparative |
| 22. | RBS | Adverb, superlative |
| 23. | RP | Particle |
| 24. | SYM | Symbol |
| 25. | TO | to |
| 26. | UH | Interjection |
| 27. | VB | Verb, base form |
| 28. | VBD | Verb, past tense |
| 29. | VBG | Verb, gerund or present participle |
| 30. | VBN | Verb, past participle |
| 31. | VBP | Verb, non-3rd person singular present |
| 32. | VBZ | Verb, 3rd person singular present |
| 33. | WDT | Wh-determiner |
| 34. | WP | Wh-pronoun |
| 35. | WP$ | Possessive wh-pronoun |
| 36. | WRB | Wh-adverb |

UD mappings are available in this table.

Stanford Typed Dependencies and POS

This process is (also) not straightforward:

Relation to UniDic (Japanese)

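| UD | 大分類 (major class) | 中分類 (middle class) | 小分類 (minor class) | 細分類 (fine class) |
|---|---|---|---|---|
| NOUN | 名詞 | 普通名詞 | 一般 | |
| NOUN/VERB | 名詞 | 普通名詞 | サ変可能 | |
| NOUN | 名詞 | 普通名詞 | 形状詞可能 | |
| NOUN/VERB/ADJ | 名詞 | 普通名詞 | サ変形状詞可能 | |
| NOUN | 名詞 | 普通名詞 | 副詞可能 | |
| PROPN | 名詞 | 固有名詞 | 一般 | |
| PROPN | 名詞 | 固有名詞 | 人名 | 一般 |
| PROPN | 名詞 | 固有名詞 | 人名 | 姓 |
| PROPN | 名詞 | 固有名詞 | 人名 | 名 |
| PROPN | 名詞 | 固有名詞 | 地名 | 一般 |
| PROPN | 名詞 | 固有名詞 | 地名 | 国 |
| PROPN | 名詞 | 固有名詞 | 組織名 | |
| NUM | 名詞 | 数詞 | | |
| AUX | 名詞 | 助動詞語幹 | | |
| PRON | 代名詞 | | | |
| ADJ | 形状詞 | 一般 | | |
| ADJ | 形状詞 | タリ | | |
| ADJ | 形状詞 | 助動詞語幹 | | |
| DET | 連体詞 | | | |
| ADV | 副詞 | | | |
| SCONJ | 接続詞 | | | |
| INTJ | 感動詞 | 一般 | | |
| INTJ | 感動詞 | フィラー | | |
| VERB | 動詞 | 一般 | | |
| VERB/AUX | 動詞 | 非自立可能 | | |
| ADJ | 形容詞 | 一般 | | |
| ADJ | 形容詞 | 非自立可能 | | |
| AUX | 助動詞 | | | |
| ADP | 助詞 | 格助詞 | | |
| ADP | 助詞 | 副助詞 | | |
| ADP | 助詞 | 係助詞 | | |
| CCONJ | 助詞 | 接続助詞 | | |
| PART | 助詞 | 終助詞 | | |
| SCONJ | 助詞 | 準体助詞 | | |
| NOUN | 接頭辞 | | | |
| NOUN | 接尾辞 | 名詞的 | 一般 | |
| NOUN | 接尾辞 | 名詞的 | サ変可能 | |
| NOUN | 接尾辞 | 名詞的 | 形状詞可能 | |
| NOUN | 接尾辞 | 名詞的 | 副詞可能 | |
| NOUN | 接尾辞 | 名詞的 | 助数詞 | |
| NOUN | 接尾辞 | 形状詞的 | | |
| NOUN | 接尾辞 | 動詞的 | | |
| NOUN | 接尾辞 | 形容詞的 | | |
| SYM | 記号 | 一般 | | |
| SYM | 記号 | 文字 | | |
| PUNCT | 補助記号 | 一般 | | |
| PUNCT | 補助記号 | 句点 | | |
| PUNCT | 補助記号 | 読点 | | |
| PUNCT | 補助記号 | 括弧開 | | |
| PUNCT | 補助記号 | 括弧閉 | | |
| SYM | 補助記号 | AA | 一般 | |
| SYM | 補助記号 | AA | 顔文字 | |
| SPACE* | 空白 | | | |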
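
Because several UniDic classes map to more than one UD tag, a lookup from the language-specific tag to a set of candidate UPOS tags is one way to represent the table. The Python sketch below copies a handful of rows from the table, writing the UniDic levels joined by hyphens as in the ginza output; the names UNIDIC_TO_UPOS, candidates and is_ambiguous are hypothetical, and choosing among the candidates in context is left out entirely.

```python
# A few rows of the table above: UniDic tag -> candidate UD POS tags.
UNIDIC_TO_UPOS = {
    "名詞-普通名詞-一般": ("NOUN",),
    "名詞-普通名詞-サ変可能": ("NOUN", "VERB"),
    "名詞-普通名詞-サ変形状詞可能": ("NOUN", "VERB", "ADJ"),
    "代名詞": ("PRON",),
    "連体詞": ("DET",),
    "動詞-一般": ("VERB",),
    "動詞-非自立可能": ("VERB", "AUX"),
    "助動詞": ("AUX",),
    "助詞-格助詞": ("ADP",),
    "補助記号-句点": ("PUNCT",),
}


def candidates(xpos):
    """Return the candidate UPOS tags for a UniDic tag (empty tuple if not listed)."""
    return UNIDIC_TO_UPOS.get(xpos, ())


def is_ambiguous(xpos):
    """True when the table lists more than one UD tag for this UniDic tag."""
    return len(candidates(xpos)) > 1


# (FORM, XPOS, UPOS) triples taken from the ginza output earlier in this document.
OBSERVED = [
    ("それ", "代名詞", "PRON"),
    ("い", "動詞-非自立可能", "VERB"),
    ("この", "連体詞", "DET"),
    ("対応", "名詞-普通名詞-サ変可能", "NOUN"),
    ("ある", "動詞-非自立可能", "AUX"),
    ("。", "補助記号-句点", "PUNCT"),
]

for form, xpos, upos in OBSERVED:
    status = "ambiguous" if is_ambiguous(xpos) else "unique"
    print(form, xpos, upos, status, upos in candidates(xpos))
```

In the ginza output, い and ある share the UniDic tag 動詞-非自立可能 but receive VERB and AUX respectively, which is the kind of case where the mapping alone cannot decide the UD tag.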

Critiques of UD from a Japanese perspective

@DBLP:journals/corr/abs-1906-09719

Relation to Chinese tagsets

@leung-etal-2016-developing provides an overview of efforts to create a UD (v1) Chinese dataset. spaCy also has a tag mapping to UD here.

General critiques

@osborne2019status