Bor Hodošček: Academic Website

Datasets

A good place to start looking for ideas is by getting to know what kind of linguistic data is already out there. Here are some links I recommend:

Universal Dependencies is an attempt at describing the dependency structure of different languages in a unified way. The project provides reference treebanks for many languages. It also provides a unified part-of-speech tagset.
Kaggle datasets is an invaluable resource not only because of the open format in which the data is provided, but in the tools and methodology that researchers all over the world share alongside the data.
NINJAL corpus resources provide access to both historical and contemporary Japanese corpora. Access to the BCCWJ is available upon request.
Archive.org contains a lot of public domain documents and media.
Common Crawl is a large Internet corpus containing over 5 billion web pages.
WikiText long term dependency language modeling dataset.
Google Books Ngrams contains frequency information for all 1–5 grams found in the Google Books corpus. Use the Ngram Viewer to search the data. Bookworm is another interface for this and similar data.

Several of these datasets are installable using NLTK and should already be pre-installed on the server powering the course’s JupyterHub site.

Software

Morphological and Syntactic Parsing

Japanese:

MeCab [@MeCab] and CaboCha [@CaboCha]
- IPAdic, UniDic [@UniDic]
- A good blog on using MeCab from R
JUMAN/jumanpp and KNP
Kytea and EDA

English:

Chinese:

Neural Networks (Deep Learning)

Recommended book on the subject: Deep Learning by @Goodfellow-et-al-2016. Pretty pictures.

Deep Learning frameworks I recommend for beginners:

PyTorch Python
Tensorflow Python

More advanced:

DyNet is specifically geared towards NLP use C++ Python

While it is possible to use deep learning models on the nlp server, training them is not realistic, since there is no GPU. I do have an NVIDIA Titan Xp (12GB) that can be used for this project that is currently in another server. Contact me if you want access.

Visualization

Document Structure

displaCy (spaCy)
- displaCy Dependency Visualizer
- displaCy Named-Entity Visualizer

Word2Vec

word2viz

Statistics

ggplot2 ([Cheat Sheet]) R
d3 (low-level) and Vega-Lite (high-level like ggplot2) Javascript
t-Distributed Stochastic Neighbor Embedding (t-SNE) Python R Javascript
- @wattenberg2016how’s How to Use t-SNE Effectively

Ideas

The following are idea templates that can guide you in your search for an exciting project!

I have provided a difficulty scale for some of the ideas, but be aware that difficulty can change depending on your existing knowledge, skills, and the scope of the project:

← Hard
← Medium
← Easy

Speech Recognition

Kaldi Speech Recognition Toolkit

Spellchecking programming

Morphological Analysis

There are two types of projects envisioned here:

improving the performance of an existing parser
constructing a (toy) parser from scratch

Improving existing parsers

Some parsers have a way of extending the vocabulary or updating the model to better fit new data. As most parsers are trained on relatively homogeneous data (i.e. news), they will perform worse on data from different registers and time periods.

Sub-project

Most parsers are probabilistic and provide a way to show the top “n-best” parses of a sentence. In this sub-project, try to visualize these conflicting parses by converting the n-best parse information from the parser into graphical form. The easiest way to output a graph of nodes is by using graphviz.

Constructing a parser from scratch programming

Matthew Honnibal (spaCy) has a good post on the subject:

A Good Part-of-Speech Tagger in about 200 Lines of Python describes the building of a high-performance English POS tagger.

While this project requires a lot of programming work, the final outcome should be extremely satisfying, as you gain the insight required to better tailor the parser to the data you will be using in your future research.

Syntactic Parsing

Matthew Honnibal (spaCy) has a good post on the subject:

Parsing English in 500 Lines of Python describes how to build a simple–but effective English syntactic parser.

Improving existing parsers programming

Constructing a parser from scratch programming

Chatbot web

Bottery: A conversational agent prototyping platform

Sentiment Analysis classification

There are two types of approaches here:

Use a dataset that has sentiment already annotated and try to extract features that are most useful for predicting the sentiment of new data.
Use a dataset comprised of words that are known to be effective in predicting sentiment. Here the focus would be on the method of using this dataset to best improve prediction accuracy on new data.
NRC Word-Emotion Association Lexicon (@Mohammad13)

Authorship Identification classification

Text Categorization classification

Reuters-21578 Text Categorization Collection Data Set is a standard dataset used to evaluate the performance of different classification algorithms.

Style Transfer

Style transfer is a currently a hot topic in AI (paper; video), but here we are concerned with the transformation of text into a semantically-equivalent, but stylistically-distinct, form. For example, we could make a style transfer for dialects, registers, time-periods, or specific authors.

Knowledge Engineering ontology SPARQL RDF

DBpedia, DBpedia Releases
“DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.”
ConceptNet 5
“ConceptNet is a semantic network containing lots of things computers should know about the world, especially when understanding text written by people.”
Lemon/OntoLex Lexical Ontology
“The aim of lemon is to provide rich linguistic grounding for ontologies. Rich linguistic grounding includes the representation of morphological and syntactic properties of lexical entries as well as the syntax-semantics interface, i.e. the meaning of these lexical entries with respect to an ontology or vocabulary.”
WordNet (Japanese WordNet)
“WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.”
FrameNet
“The FrameNet project is building a lexical database of English that is both human- and machine-readable, based on annotating examples of how words are used in actual texts.”

Annotation

In this project, you choose a public dataset for which you would like to add novel annotations. Your goal will be to do a pilot annotation, write preliminary annotation guidelines, as well as provide a baseline method for their classification.

Final Project Instructions and Ideas

Datasets

Software

Morphological and Syntactic Parsing

Neural Networks (Deep Learning)

Visualization

Document Structure

Word2Vec

Statistics

Ideas

Speech Recognition

Spellchecking programming

Morphological Analysis

Improving existing parsers

Sub-project

Constructing a parser from scratch programming

Syntactic Parsing

Improving existing parsers programming

Constructing a parser from scratch programming

Chatbot web

Sentiment Analysis classification

Authorship Identification classification

Text Categorization classification

Style Transfer

Knowledge Engineering ontology SPARQL RDF

Annotation

References