Datasets
A good place to start looking for ideas is to get to know what kind of linguistic data is already out there. Here are some links I recommend:
- Universal Dependencies is an attempt at describing the dependency structure of different languages in a unified way. The project provides reference treebanks for many languages. It also provides a unified part-of-speech tagset.
- Kaggle datasets are an invaluable resource, not only because of the open format in which the data is provided, but also because of the tools and methodology that researchers all over the world share alongside the data.
- NINJAL corpus resources provide access to both historical and contemporary Japanese corpora. Access to the BCCWJ is available upon request.
- Archive.org contains a lot of public domain documents and media.
- Common Crawl is a large Internet corpus containing over 5 billion web pages.
- WikiText is a long-term dependency language modeling dataset.
- Google Books Ngrams contains frequency information for all 1–5 grams found in the Google Books corpus. Use the Ngram Viewer to search the data. Bookworm is another interface for this and similar data.
Several of these datasets are installable using NLTK and should already be pre-installed on the server powering the course’s JupyterHub site.
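For example, one of the NLTK-bundled corpora can be downloaded (if it is not already present) and inspected in a few lines; the Brown corpus below is used purely as an illustration:

```python
# A minimal sketch: download (if needed) and inspect one NLTK-bundled corpus.
import nltk

nltk.download("brown")                         # no-op if the corpus is already installed
from nltk.corpus import brown

print(brown.categories())                      # registers included in the corpus
print(brown.words(categories="news")[:20])     # first 20 tokens of the news portion
```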
Software
Morphological and Syntactic Parsing
Japanese:
- MeCab [@MeCab] and CaboCha [@CaboCha]
- IPAdic, UniDic [@UniDic]
- A good blog on using MeCab from R
- JUMAN/jumanpp and KNP
- Kytea and EDA
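Of the Japanese tools above, MeCab is the quickest to try from Python. A minimal sketch, assuming the mecab-python3 binding and a dictionary such as IPAdic or UniDic are installed:

```python
# A minimal sketch of morphological analysis with MeCab (mecab-python3 binding).
import MeCab

tagger = MeCab.Tagger()                        # uses the system default dictionary
print(tagger.parse("すもももももももものうち"))    # one analyzed morpheme per line
```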
English:
- spaCy
- Stanford Dependency Parser
- BLLIP reranking parser (also known as Charniak-Johnson parser, Charniak parser, Brown reranking parser)
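As a quick illustration of what these parsers produce, here is a minimal spaCy sketch, assuming the small English model (en_core_web_sm) has been downloaded:

```python
# A minimal sketch of POS tagging and dependency parsing with spaCy.
# Assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
```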
Chinese:
- Stanford Dependency Parser
- BIST Parsers
- NiuParser (proprietary)
Neural Networks (Deep Learning)
Recommended book on the subject: Deep Learning by @Goodfellow-et-al-2016. Pretty pictures.
Deep Learning frameworks I recommend for beginners:
- PyTorch Python
- Tensorflow Python
More advanced:
- DyNet is specifically geared towards NLP use cases C++ Python
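To check that a framework is installed and working, a single training step on toy data is enough. A minimal PyTorch sketch (no GPU required):

```python
# One gradient step on a toy regression problem, just to confirm PyTorch works.
import torch

x = torch.randn(32, 10)                        # toy inputs
y = torch.randn(32, 1)                         # toy targets
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                                # compute gradients
optimizer.step()                               # update parameters
print(loss.item())
```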
While it is possible to use deep learning models on the nlp server, training them is not realistic, since there is no GPU. I do have an NVIDIA Titan Xp (12 GB), currently installed in another server, that can be used for this project. Contact me if you want access.
Visualization
Document Structure
- displaCy (spaCy)
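A minimal sketch of rendering a dependency tree with displaCy, assuming spaCy and the English model from the parsing section above:

```python
# Render a dependency tree with displaCy.
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Visualization makes parse errors easier to spot.")
displacy.render(doc, style="dep")              # in Jupyter; use displacy.serve(doc) from a script
```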
Word2Vec
Statistics
- ggplot2 ([Cheat Sheet]) R
- d3 (low-level) and Vega-Lite (high-level like ggplot2) Javascript
- t-Distributed Stochastic Neighbor Embedding (t-SNE) Python R Javascript
- @wattenberg2016how’s How to Use t-SNE Effectively
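A hedged sketch of projecting word vectors to two dimensions with scikit-learn's t-SNE; the `vectors` array below is a random placeholder standing in for embeddings you would obtain elsewhere (e.g. from a trained word2vec model):

```python
# Project word vectors to 2D with t-SNE; the input vectors here are placeholders.
import numpy as np
from sklearn.manifold import TSNE

vectors = np.random.rand(100, 50)              # placeholder: 100 "words", 50 dimensions
coords = TSNE(n_components=2, perplexity=30).fit_transform(vectors)
print(coords.shape)                            # (100, 2): x/y positions ready for plotting
```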
Ideas
The following are idea templates that can guide you in your search for an exciting project!
I have provided a difficulty scale (Hard, Medium, Easy) for some of the ideas, but be aware that difficulty can change depending on your existing knowledge, skills, and the scope of the project.
Speech Recognition
Spellchecking programming
Morphological Analysis
There are two types of projects envisioned here:
- improving the performance of an existing parser
- constructing a (toy) parser from scratch
Improving existing parsers
Some parsers have a way of extending the vocabulary or updating the model to better fit new data. As most parsers are trained on relatively homogeneous data (e.g. newswire text), they will perform worse on data from other registers and time periods.
Sub-project
Most parsers are probabilistic and provide a way to show the top “n-best” parses of a sentence. In this sub-project, try to visualize these conflicting parses by converting the n-best parse information from the parser into graphical form. The easiest way to output a graph of nodes is by using graphviz.
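A minimal sketch using the graphviz Python package; the nodes and dependency labels below are invented for illustration, and rendering requires the Graphviz binaries to be installed:

```python
# Draw a small (made-up) dependency graph and write it to a PNG file.
from graphviz import Digraph

g = Digraph(comment="n-best parse 1")
g.edge("ROOT", "saw")                          # edges implicitly create their nodes
g.edge("saw", "I", label="nsubj")
g.edge("saw", "stars", label="obj")
g.render("parse1", format="png")               # writes parse1.png
```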
Constructing a parser from scratch programming
Matthew Honnibal (spaCy) has a good post on the subject:
- A Good Part-of-Speech Tagger in about 200 Lines of Python describes the building of a high-performance English POS tagger.
While this project requires a lot of programming work, the final outcome should be extremely satisfying, as you gain the insight required to better tailor the parser to the data you will be using in your future research.
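For comparison, it helps to have a trivial baseline. The sketch below is not the averaged perceptron from the blog post, just a unigram (most-frequent-tag) tagger trained on NLTK's Penn Treebank sample; a hand-built tagger should aim to beat it:

```python
# A unigram most-frequent-tag baseline trained on NLTK's Penn Treebank sample.
import nltk
from nltk.corpus import treebank

nltk.download("treebank")
sents = treebank.tagged_sents()
train, test = sents[:3000], sents[3000:]
baseline = nltk.UnigramTagger(train)
print(baseline.tag("The quick brown fox jumps .".split()))
print(baseline.evaluate(test))                 # renamed to .accuracy() in newer NLTK releases
```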
Syntactic Parsing
Matthew Honnibal (spaCy) has a good post on the subject:
- Parsing English in 500 Lines of Python describes how to build a simple but effective English syntactic parser.
Improving existing parsers programming
Constructing a parser from scratch programming
Chatbot web
Sentiment Analysis classification
There are two types of approaches here:
- Use a dataset that has sentiment already annotated and try to extract features that are most useful for predicting the sentiment of new data.
- Use a dataset comprised of words that are known to be effective in predicting sentiment. Here the focus would be on the method of using this dataset to best improve prediction accuracy on new data.
  - NRC Word-Emotion Association Lexicon (@Mohammad13)
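A hedged sketch of the first approach, using a bag-of-words classifier from scikit-learn; the `texts` and `labels` below are placeholders for a real annotated dataset:

```python
# Train a bag-of-words sentiment classifier on (placeholder) labeled data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I loved this film", "Terrible, boring plot", "What a delight", "Awful acting"]
labels = ["pos", "neg", "pos", "neg"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["a boring plot"]))
```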
Text Categorization classification
- Reuters-21578 Text Categorization Collection Data Set is a standard dataset used to evaluate the performance of different classification algorithms.
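The collection is also available through NLTK's corpus reader, which makes a first look easy (note that a document can carry more than one category label):

```python
# A minimal sketch of loading Reuters-21578 through NLTK.
import nltk
from nltk.corpus import reuters

nltk.download("reuters")
print(len(reuters.fileids()))                  # number of documents
print(reuters.categories()[:10])               # first few topic labels
fid = reuters.fileids()[0]
print(fid, reuters.categories(fid))            # all labels for one document
```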
Style Transfer
Style transfer is currently a hot topic in AI (paper; video), but here we are concerned with the transformation of text into a semantically equivalent, but stylistically distinct, form. For example, we could transfer style across dialects, registers, time periods, or specific authors.
Knowledge Engineering ontology SPARQL RDF
“DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.”
“ConceptNet is a semantic network containing lots of things computers should know about the world, especially when understanding text written by people.”
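DBpedia exposes a public SPARQL endpoint; here is a hedged sketch of querying it with the SPARQLWrapper package (the example query just fetches the English abstract of one resource):

```python
# Query DBpedia's public SPARQL endpoint for one English abstract.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
      dbr:Linguistics dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["abstract"]["value"][:200])
```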
Lemon/OntoLex Lexical Ontology
“The aim of lemon is to provide rich linguistic grounding for ontologies. Rich linguistic grounding includes the representation of morphological and syntactic properties of lexical entries as well as the syntax-semantics interface, i.e. the meaning of these lexical entries with respect to an ontology or vocabulary.”
“WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.”
“The FrameNet project is building a lexical database of English that is both human- and machine-readable, based on annotating examples of how words are used in actual texts.”
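Of the resources above, WordNet is the easiest to explore from Python via NLTK's corpus reader; a minimal sketch:

```python
# Look up synsets, definitions, and hypernyms in WordNet via NLTK.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())
print(wn.synsets("bank")[0].hypernyms())       # more general concepts
```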
Annotation
In this project, you choose a public dataset to which you would like to add novel annotations. Your goal will be to do a pilot annotation, write preliminary annotation guidelines, and provide a baseline method for their classification.
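Whatever classification method you propose, it should be compared against a trivial baseline. A hedged sketch using scikit-learn's DummyClassifier, where the labels stand in for your pilot annotations:

```python
# A majority-class baseline; the features are ignored by DummyClassifier.
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((4, 1))                           # dummy features
labels = ["A", "A", "B", "A"]                  # placeholder pilot annotations

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, labels)
print(baseline.score(X, labels))               # accuracy of always predicting "A"
```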