Bor Hodošček: Academic Website

Universal Dependencies and the Universal POS tagset

The Universal Dependencies (UD) project is an ongoing effort to provide a practical and unified annotation scheme for natural language processing across many languages. The framework covers dependency relations between words as well as lower levels of annotation such as token segmentation and POS tagging.

UD Label Scheme

In general, all information is encoded within three label schemes: universal POS tags (UPOS), universal morphological features, and universal dependency relations.

Notably, this core of UD annotates only one sentence at a time and does not concern itself with tasks such as textual entailment, question answering, or translation.

CoNLL-U

The CoNLL-U format is an extension to the standard formats used in CoNLL tasks (The SIGNLL Conference on Computational Natural Language Learning). It is the format in which all treebanks are distributed.

Each sentence is encoded according to these rules (see documentation):

One or more metadata lines prefixed with the hash sign # (currently only sent_id and text)
Followed by one tab-delimited line per word, with the following ten fields:
1. ID: Word index, integer starting at 1 for each new sentence; may be a range for tokens with multiple words.
2. FORM: Word form or punctuation symbol.
3. LEMMA: Lemma or stem of word form.
4. UPOSTAG: Universal part-of-speech tag drawn from the UD project's revised version of the Google universal POS tags.
5. XPOSTAG: Language-specific part-of-speech tag; underscore if not available.
6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
7. HEAD: Head of the current token, which is either a value of ID or zero (0).
8. DEPREL: Universal Stanford dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
9. DEPS: List of secondary dependencies (head-deprel pairs).
10. MISC: Any other annotation.
Example from UD Japanese GSD (Link to position in corpus):

# sent_id = dev-s510
# text = それでいてこの対応である。
1   それ  それ  PRON    NP  _   3   obl _   SpaceAfter=No
2   で   で   ADP PS  _   1   case    _   SpaceAfter=No
3   い   いる  VERB    VV  _   6   advcl   _   SpaceAfter=No
4   て   て   SCONJ   PC  _   3   mark    _   SpaceAfter=No
5   この  この  DET JR  _   6   det _   SpaceAfter=No
6   対応  対応  NOUN    NN  _   0   root    _   SpaceAfter=No
7   である だ   AUX AV  _   6   cop _   SpaceAfter=No
8   。   。   PUNCT   SYM _   6   punct   _   SpaceAfter=No
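
The field list above can be turned into a small reader. The following is a minimal sketch rather than a full CoNLL-U parser: the names `FIELDS` and `parse_sentence` are made up for this illustration, multiword-token ranges are kept as plain strings, and the sample is the GSD sentence above rewritten with explicit tab characters.

```python
from typing import Dict, List, Tuple

# The ten CoNLL-U columns in order (lower-cased here for use as dict keys).
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc"]


def parse_sentence(block: str) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Split one CoNLL-U sentence block into (metadata, word rows)."""
    meta: Dict[str, str] = {}
    rows: List[Dict[str, str]] = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            # Metadata lines look like "# key = value".
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 tab-separated columns: {line!r}")
        rows.append(dict(zip(FIELDS, columns)))
    return meta, rows


sample = "\n".join([
    "# sent_id = dev-s510",
    "# text = それでいてこの対応である。",
    "1\tそれ\tそれ\tPRON\tNP\t_\t3\tobl\t_\tSpaceAfter=No",
    "2\tで\tで\tADP\tPS\t_\t1\tcase\t_\tSpaceAfter=No",
    "3\tい\tいる\tVERB\tVV\t_\t6\tadvcl\t_\tSpaceAfter=No",
    "4\tて\tて\tSCONJ\tPC\t_\t3\tmark\t_\tSpaceAfter=No",
    "5\tこの\tこの\tDET\tJR\t_\t6\tdet\t_\tSpaceAfter=No",
    "6\t対応\t対応\tNOUN\tNN\t_\t0\troot\t_\tSpaceAfter=No",
    "7\tである\tだ\tAUX\tAV\t_\t6\tcop\t_\tSpaceAfter=No",
    "8\t。\t。\tPUNCT\tSYM\t_\t6\tpunct\t_\tSpaceAfter=No",
])

meta, rows = parse_sentence(sample)
# The root is the word whose HEAD is 0.
root = [row for row in rows if row["head"] == "0"]
print(meta["sent_id"], root[0]["form"], root[0]["deprel"])  # dev-s510 対応 root
```

All values are kept as strings so that ID ranges and underscores pass through unchanged; converting `head` to an integer is left to the caller.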

Example using the ginza command:

# text = それでいてこの対応である。
1       それ    其れ    PRON    代名詞  _       3       nmod    _       BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_B
2       で      だ      AUX     助動詞  _       1       aux     _       BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
3       い      居る    VERB    動詞-非自立可能 _       6       acl     _       BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No
4       て      て      SCONJ   助詞-接続助詞   _       3       mark    _       BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
5       この    此の    DET     連体詞  _       6       det     _       BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No
6       対応    対応    NOUN    名詞-普通名詞-サ変可能  _       0       root    _       BunsetuBILabel=B|BunsetuPositionType=ROOT|SpaceAfter=No|NP_B
7       で      だ      AUX     助動詞  _       6       aux     _       BunsetuBILabel=I|BunsetuPositionType=FUNC|SpaceAfter=No
8       ある    有る    AUX     動詞-非自立可能 _       6       aux     _       BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
9       。      。      PUNCT   補助記号-句点   _       6       punct   _       BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No

If we convert this into a table:

id	form	lemma	upos	xpos	feats	head	deprel	deps	misc
1	それ	其れ	PRON	代名詞	_	3	nmod	_	BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_B
2	で	だ	AUX	助動詞	_	1	aux	_	BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
3	い	居る	VERB	動詞-非自立可能	_	6	acl	_	BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No
4	て	て	SCONJ	助詞-接続助詞	_	3	mark	_	BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
5	この	此の	DET	連体詞	_	6	det	_	BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No
6	対応	対応	NOUN	名詞-普通名詞-サ変可能	_	0	root	_	BunsetuBILabel=B|BunsetuPositionType=ROOT|SpaceAfter=No|NP_B
7	で	だ	AUX	助動詞	_	6	aux	_	BunsetuBILabel=I|BunsetuPositionType=FUNC|SpaceAfter=No
8	ある	有る	AUX	動詞-非自立可能	_	6	aux	_	BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
9	。	。	PUNCT	補助記号-句点	_	6	punct	_	BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No
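
The MISC column in the ginza output packs several items separated by `|`, most of them `key=value` pairs. The sketch below unpacks that column and uses `BunsetuBILabel` to regroup the words into bunsetsu. It assumes that `B` marks the first word of a bunsetsu and `I` a continuation, which is how the labels read in this output; `parse_misc` and `group_bunsetsu` are hypothetical helper names, not part of GiNZA.

```python
from typing import Dict, List, Tuple, Union


def parse_misc(misc: str) -> Dict[str, Union[str, bool]]:
    """Turn 'A=1|B=2|FLAG' into {'A': '1', 'B': '2', 'FLAG': True}."""
    if misc == "_":
        return {}
    parsed: Dict[str, Union[str, bool]] = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        # Items without '=' (such as NP_B) are stored as boolean flags.
        parsed[key] = value if sep else True
    return parsed


def group_bunsetsu(tokens: List[Tuple[str, str]]) -> List[str]:
    """Concatenate word forms into chunks, starting a new chunk at each B label."""
    chunks: List[str] = []
    for form, misc in tokens:
        label = parse_misc(misc).get("BunsetuBILabel")
        if label == "B" or not chunks:
            chunks.append(form)
        else:
            chunks[-1] += form
    return chunks


# (form, misc) pairs copied from the ginza output above.
tokens = [
    ("それ", "BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_B"),
    ("で", "BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No"),
    ("い", "BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No"),
    ("て", "BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No"),
    ("この", "BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No"),
    ("対応", "BunsetuBILabel=B|BunsetuPositionType=ROOT|SpaceAfter=No|NP_B"),
    ("で", "BunsetuBILabel=I|BunsetuPositionType=FUNC|SpaceAfter=No"),
    ("ある", "BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No"),
    ("。", "BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No"),
]

print(group_bunsetsu(tokens))  # ['それで', 'いて', 'この', '対応である。']
```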

spaCy models and UD

English model label scheme
- Note that the tagger scheme corresponds to the Penn Treebank tagset below.

The GiNZA model does not currently have a summary page, but its label scheme is similar to that of the English models. The metadata below gives the details in JSON format. Note that dependency labels containing _as_ are not displayed in the final analysis, because they are only used for disambiguation in an earlier analysis step. This is detailed in the paper on GiNZA [@田中貴秋2019日本語の文法機能タイプ付き単語依存構造解析].
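
To make the `_as_` convention concrete, a label can be split into a plain dependency relation and the POS it was paired with. This is only an illustrative sketch: `split_label` is a hypothetical helper, and the way GiNZA itself post-processes these labels is not shown here.

```python
from typing import Optional, Tuple


def split_label(label: str) -> Tuple[str, Optional[str]]:
    """Split 'acl_as_AUX' into ('acl', 'AUX'); plain labels get POS None."""
    if label.startswith("as_"):
        # Labels such as 'as_NOUN' carry a POS but no relation prefix.
        return "", label[len("as_"):]
    dep, sep, pos = label.partition("_as_")
    return (dep, pos) if sep else (label, None)


for label in ["acl_as_AUX", "nsubj", "root_as_NOUN", "as_NOUN", "case"]:
    print(label, split_label(label))
# acl_as_AUX ('acl', 'AUX')
# nsubj ('nsubj', None)
# root_as_NOUN ('root', 'NOUN')
# as_NOUN ('', 'NOUN')
# case ('case', None)
```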

GiNZA model metadata

{'accuracy': {'ents_f': 0.0,
  'ents_p': 0.0,
  'ents_r': 0.0,
  'las': 86.8687191322,
  'tags_acc': 99.9137330714,
  'token_acc': 98.8075260789,
  'uas': 90.4732435529},
 'author': 'Megagon Labs, Tokyo.',
 'description': 'Japanese multi-task CNN trained on UD-Japanese BCCWJ v2.4 + GSK2014-A. Assigns word2vec token vectors, POS tags, dependency parse and named entities.',
 'email': 'ginza@megagon.ai',
 'lang': 'ja',
 'license': 'MIT',
 'name': 'ginza',
 'parent_package': 'spacy',
 'pipeline': ['tagger', 'parser', 'ner', 'JapaneseCorrector'],
 'sources': ['SudachiPy',
  'jawiki-cirrussearch-content',
  'UD-Japanese BCCWJ v2.4',
  'KWDLC'],
 'spacy_version': '>=2.1.0',
 'speed': {'cpu': 21097.6865736948, 'gpu': None, 'nwords': 145999},
 'url': 'https://github.com/megagonlabs/ginza',
 'vectors': {'width': 100,
  'vectors': 117951,
  'keys': 100000,
  'name': 'ja_nopn.vectors'},
 'version': '2.1.0',
 'labels': OrderedDict([('tagger',
               ['_SP',
                'web誤脱',
                '代名詞',
                '副詞',
                '助動詞',
                '助詞-係助詞',
                '助詞-副助詞',
                '助詞-接続助詞',
                '助詞-格助詞',
                '助詞-準体助詞',
                '助詞-終助詞',
                '動詞-一般',
                '動詞-非自立可能',
                '名詞-助動詞語幹',
                '名詞-固有名詞-一般',
                '名詞-固有名詞-人名-一般',
                '名詞-固有名詞-人名-名',
                '名詞-固有名詞-人名-姓',
                '名詞-固有名詞-地名-一般',
                '名詞-固有名詞-地名-国',
                '名詞-数詞',
                '名詞-普通名詞-サ変可能',
                '名詞-普通名詞-サ変形状詞可能',
                '名詞-普通名詞-一般',
                '名詞-普通名詞-副詞可能',
                '名詞-普通名詞-助数詞可能',
                '名詞-普通名詞-形状詞可能',
                '形容詞-一般',
                '形容詞-非自立可能',
                '形状詞-タリ',
                '形状詞-一般',
                '形状詞-助動詞語幹',
                '感動詞-フィラー',
                '感動詞-一般',
                '接尾辞-動詞的',
                '接尾辞-名詞的-サ変可能',
                '接尾辞-名詞的-一般',
                '接尾辞-名詞的-副詞可能',
                '接尾辞-名詞的-助数詞',
                '接尾辞-形容詞的',
                '接尾辞-形状詞的',
                '接続詞',
                '接頭辞',
                '補助記号-一般',
                '補助記号-句点',
                '補助記号-括弧閉',
                '補助記号-括弧開',
                '補助記号-読点',
                '補助記号-ＡＡ-一般',
                '補助記号-ＡＡ-顔文字',
                '記号-一般',
                '記号-文字',
                '連体詞']),
              ('parser',
               ['ROOT',
                'acl',
                'acl_as_AUX',
                'acl_as_VERB',
                'advcl',
                'advcl_as_AUX',
                'advcl_as_VERB',
                'advmod',
                'advmod_as_ADV',
                'advmod_as_AUX',
                'amod',
                'amod_as_ADJ',
                'amod_as_AUX',
                'appos',
                'appos_as_NOUN',
                'appos_as_X',
                'as_NOUN',
                'as_NUM',
                'as_PROPN',
                'as_X',
                'aux',
                'aux_as_ADJ',
                'aux_as_AUX',
                'aux_as_NOUN',
                'aux_as_PART',
                'aux_as_VERB',
                'case',
                'case_as_ADP',
                'cc',
                'cc_as_CCONJ',
                'compound',
                'compound_as_NOUN',
                'compound_as_PROPN',
                'cop',
                'dep',
                'dep_as_ADJ',
                'dep_as_AUX',
                'dep_as_CCONJ',
                'dep_as_NOUN',
                'dep_as_SCONJ',
                'dep_as_SYM',
                'dep_as_VERB',
                'dep_as_X',
                'det',
                'discourse',
                'iobj',
                'iobj_as_ADJ',
                'iobj_as_ADV',
                'iobj_as_NOUN',
                'iobj_as_VERB',
                'mark',
                'mark_as_SCONJ',
                'nmod',
                'nmod_as_ADJ',
                'nmod_as_AUX',
                'nmod_as_NOUN',
                'nmod_as_PRON',
                'nmod_as_VERB',
                'nsubj',
                'nsubj_as_ADJ',
                'nsubj_as_AUX',
                'nsubj_as_NOUN',
                'nsubj_as_VERB',
                'nummod',
                'nummod_as_NUM',
                'obj',
                'obj_as_NOUN',
                'obj_as_VERB',
                'obl',
                'obl_as_ADJ',
                'obl_as_ADV',
                'obl_as_AUX',
                'obl_as_NOUN',
                'obl_as_PRON',
                'obl_as_VERB',
                'punct',
                'root_as_AUX',
                'root_as_NOUN',
                'root_as_SYM',
                'root_as_VERB',
                'root_as_X',
                'subtok']),
              ('ner',
               ['DATE',
                'LOC',
                'MONEY',
                'ORG',
                'PERCENT',
                'PERSON',
                'PRODUCT',
                'TIME'])])}

Relation to standard English tagsets

@peng2019roads

Penn Treebank

Number	Tag	Description
1.	CC	Coordinating conjunction
2.	CD	Cardinal number
3.	DT	Determiner
4.	EX	Existential there
5.	FW	Foreign word
6.	IN	Preposition or subordinating conjunction
7.	JJ	Adjective
8.	JJR	Adjective, comparative
9.	JJS	Adjective, superlative
10.	LS	List item marker
11.	MD	Modal
12.	NN	Noun, singular or mass
13.	NNS	Noun, plural
14.	NNP	Proper noun, singular
15.	NNPS	Proper noun, plural
16.	PDT	Predeterminer
17.	POS	Possessive ending
18.	PRP	Personal pronoun
19.	PRP$	Possessive pronoun
20.	RB	Adverb
21.	RBR	Adverb, comparative
22.	RBS	Adverb, superlative
23.	RP	Particle
24.	SYM	Symbol
25.	TO	to
26.	UH	Interjection
27.	VB	Verb, base form
28.	VBD	Verb, past tense
29.	VBG	Verb, gerund or present participle
30.	VBN	Verb, past participle
31.	VBP	Verb, non-3rd person singular present
32.	VBZ	Verb, 3rd person singular present
33.	WDT	Wh-determiner
34.	WP	Wh-pronoun
35.	WP$	Possessive wh-pronoun
36.	WRB	Wh-adverb

UD mappings are available in this table.
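
The mapping from Penn tags to UD is not always one-to-one, which a small lookup can illustrate. The dictionary below is a hand-picked subset written for this example and is not copied from the linked table; the multi-valued entries reflect the assumption that `IN` covers both adpositions and subordinating conjunctions and that verb tags can surface as either `VERB` or `AUX` in UD. Check the linked table before relying on any entry.

```python
from typing import Dict, Tuple

# Illustrative subset only: Penn Treebank tag -> possible UD POS tags.
PTB_TO_UPOS: Dict[str, Tuple[str, ...]] = {
    "CC": ("CCONJ",),
    "CD": ("NUM",),
    "DT": ("DET",),
    "IN": ("ADP", "SCONJ"),   # preposition or subordinating conjunction
    "JJ": ("ADJ",),
    "NN": ("NOUN",),
    "NNS": ("NOUN",),
    "NNP": ("PROPN",),
    "PRP": ("PRON",),
    "RB": ("ADV",),
    "UH": ("INTJ",),
    "VB": ("VERB", "AUX"),    # auxiliaries share the Penn verb tags
    "VBZ": ("VERB", "AUX"),
}


def upos_candidates(ptb_tag: str) -> Tuple[str, ...]:
    """Return the candidate UD tags, or an empty tuple for unlisted tags."""
    return PTB_TO_UPOS.get(ptb_tag, ())


def needs_context(ptb_tag: str) -> bool:
    """True when the tag alone does not determine the UD tag."""
    return len(upos_candidates(ptb_tag)) > 1


print(upos_candidates("NNP"))   # ('PROPN',)
print(needs_context("IN"))      # True
print(needs_context("NN"))      # False
```

Tags flagged by `needs_context` are the ones where a converter has to look at the word form or its dependency relation as well as the tag.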

Stanford Typed Dependencies and POS

Converting Stanford Typed Dependencies and their POS tags to UD is (also) not straightforward:

Stanford Typed Dependencies documentation
One possible set of rules for converting dependency labels and POS tags into UD

Relation to UniDic (Japanese)

UD	大分類	中分類	小分類	細分類
NOUN	名詞	普通名詞	一般
NOUN/VERB			サ変可能
NOUN			形状詞可能
NOUN/VERB/ADJ			サ変形状詞可能
NOUN			副詞可能
PROPN		固有名詞	一般
PROPN			人名	一般
PROPN				姓
PROPN				名
PROPN			地名	一般
PROPN				国
PROPN			組織名
NUM		数詞
AUX		助動詞語幹
PRON	代名詞
ADJ	形状詞	一般
ADJ		タリ
ADJ		助動詞語幹
DET	連体詞
ADV	副詞
SCONJ	接続詞
INTJ	感動詞	一般
INTJ		フィラー
VERB	動詞	一般
VERB/AUX		非自立可能
ADJ	形容詞	一般
ADJ		非自立可能
AUX	助動詞
ADP	助詞	格助詞
ADP		副助詞
ADP		係助詞
CCONJ		接続助詞
PART		終助詞
SCONJ		準体助詞
NOUN	接頭辞
NOUN	接尾辞	名詞的	一般
NOUN			サ変可能
NOUN			形状詞可能
NOUN			副詞可能
NOUN			助数詞
NOUN		形状詞的
NOUN		動詞的
NOUN		形容詞的
SYM	記号	一般
SYM		文字
PUNCT	補助記号	一般
PUNCT		句点
PUNCT		読点
PUNCT		括弧開
PUNCT		括弧閉
SYM		ＡＡ	一般
SYM		ＡＡ	顔文字
SPACE*	空白
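
Because UniDic tags are hierarchical, the table above can be applied as a longest-prefix lookup on the hyphen-joined tag. The sketch below encodes only a few rows of the table, collapsing all 固有名詞 rows into a single prefix since they all map to PROPN; `UNIDIC_TO_UD` and `ud_candidates` are names invented for this example.

```python
from typing import Dict, Tuple

# A few rows of the table above, keyed by hyphen-joined UniDic prefix.
UNIDIC_TO_UD: Dict[str, Tuple[str, ...]] = {
    "名詞-普通名詞-一般": ("NOUN",),
    "名詞-普通名詞-サ変可能": ("NOUN", "VERB"),
    "名詞-固有名詞": ("PROPN",),
    "名詞-数詞": ("NUM",),
    "代名詞": ("PRON",),
    "連体詞": ("DET",),
    "動詞-一般": ("VERB",),
    "動詞-非自立可能": ("VERB", "AUX"),
    "助動詞": ("AUX",),
    "助詞-格助詞": ("ADP",),
    "補助記号": ("PUNCT",),
    "補助記号-ＡＡ": ("SYM",),
}


def ud_candidates(unidic_tag: str) -> Tuple[str, ...]:
    """Return UD candidates for the longest matching UniDic prefix."""
    parts = unidic_tag.split("-")
    for end in range(len(parts), 0, -1):
        key = "-".join(parts[:end])
        if key in UNIDIC_TO_UD:
            return UNIDIC_TO_UD[key]
    return ()


print(ud_candidates("名詞-普通名詞-サ変可能"))   # ('NOUN', 'VERB')
print(ud_candidates("名詞-固有名詞-人名-姓"))    # ('PROPN',)
print(ud_candidates("補助記号-句点"))            # ('PUNCT',)
print(ud_candidates("補助記号-ＡＡ-顔文字"))     # ('SYM',)
```

Rows with more than one candidate, such as サ変可能, still need the syntactic context to pick the final UD tag.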

Critiques of UD from a Japanese perspective

@DBLP:journals/corr/abs-1906-09719

Relation to Chinese tagsets

@leung-etal-2016-developing provides an overview of efforts to create a UD (v1) Chinese dataset. spaCy also has a tag mapping to UD here.

General critiques

@osborne2019status