This course includes four programming assignments and a final project. All assignments must be completed individually unless otherwise specified.
In this assignment, you will build a language identification classifier that distinguishes between six languages.
Some languages can be distinguished easily, because they use different scripts. These six languages, however, use the same (Latin) script with minimal diacritics, so it is difficult to hand-craft classifiers based on the presence or absence of particular characters. Indeed, unless you have linguistic training or familiarity with the languages, it is difficult to tell them apart.
How can they be distinguished? A naïve approach is to use word counts as unigram features. However, the number of possible words in a large corpus of six languages is vast. It is essential to look at something smaller: characters.
Even though the six languages use roughly the same characters, the relative frequencies of these characters vary greatly. Thus, using characters as features (unigram character models) is appealing (and fairly effective). Languages also vary in their phonotactics, the way in which consonants and vowels combine in sequence. Thus, looking at character n-grams (for small values of n) is also appealing (and effective). Note, however, that as the value of n increases, this approach runs into the same problem as the word unigram model (sparsity). In this scenario, the model is likely to overfit.
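Extracting character n-gram counts is a small amount of code. The sketch below (the function name and example string are illustrative, not part of the assignment) shows one common way to do it with a sliding window:

```python
from collections import Counter

def char_ngrams(text, n):
    """Count all character n-grams in the text using a sliding window."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Bigrams begin to capture phonotactic patterns such as common
# consonant-vowel sequences.
counts = char_ngrams("banana", 2)
print(counts["an"])  # → 2  ("b-an-an-a" contains "an" twice)
```

A feature vector for a classifier can then be built by mapping each n-gram in a fixed vocabulary to an index and filling in its (possibly normalized) count.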
Various kinds of classifiers can be used for this application. Naive Bayes (NB) classifiers, for example, are quite effective. However, inference is slow and performance, given the same training set, is likely to be worse than other options. Simple (binary) logistic regression cannot be used because this is an n-way (multinomial) classification problem. Multinomial Logistic Regression (Softmax Regression) is a good fit.
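The core of softmax regression is a linear score per class followed by the softmax function, which turns the scores into a probability distribution over the six languages. A minimal sketch (the names `W`, `b`, and `predict` are placeholders, not the assignment's API):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(W, b, x):
    """Score each of the K classes and return a probability distribution.

    W: (K, D) weight matrix, b: (K,) bias vector, x: (D,) feature vector.
    """
    return softmax(W @ x + b)

# With all-zero weights, every language gets the same probability 1/K.
W = np.zeros((6, 100))
b = np.zeros(6)
x = np.random.rand(100)
print(predict(W, b, x))  # six entries, each 1/6
```

Training then maximizes the log-likelihood of the correct labels (equivalently, minimizes cross-entropy loss) with gradient descent.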
You will perform the following tasks:
In this homework, you will build your first language models: an n-gram language model and a recurrent neural network (RNN) language model. You will also implement Laplace smoothing for the n-gram model (a lazy version) to account for unknown words.
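The idea behind Laplace (add-one, or more generally add-k) smoothing is to pretend every vocabulary item was seen k extra times, so unseen words get a small non-zero probability. A unigram sketch under assumed names (the handout's actual interface in `ngram_lm.py` may differ):

```python
from collections import Counter

def laplace_prob(word, counts, total, vocab_size, k=1.0):
    """Add-k estimate: (count + k) / (total + k * V).

    Unseen words fall back to k / (total + k * V) rather than zero.
    """
    return (counts.get(word, 0) + k) / (total + k * vocab_size)

counts = Counter(["the", "cat", "sat", "the"])
total = sum(counts.values())          # 4 tokens
V = len(counts) + 1                   # 3 seen types + 1 unknown-word type (an assumption)
print(laplace_prob("the", counts, total, V))  # (2+1)/(4+4) = 0.375
print(laplace_prob("dog", counts, total, V))  # (0+1)/(4+4) = 0.125
```

The "lazy" aspect refers to computing the smoothed denominator at query time rather than materializing smoothed counts for the whole vocabulary; the probability formula is the same.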
This assignment will be submitted via Gradescope in two parts. Upload both deliverables, ngram_lm.py and rnn_lm.py, which are included in the handout. Upload these files without zipping them.

In this assignment, you will implement text classification systems to detect clickbait headlines. You will work with word embeddings and analyze their effectiveness for this task.
In this assignment, you will implement a named entity recognition system using sequence modeling techniques. You will explore different architectures and evaluate their performance.
The final exam will cover all topics discussed throughout the course. It will include a mix of conceptual questions and problem-solving tasks.
The exam will consist of 4-5 questions with multiple parts. The questions will be designed to test conceptual understanding rather than recall of facts. The exam is open book and open notes but non-collaborative.
To prepare for the exam, review lecture materials, complete all assignments, and participate in the review session. Practice applying the concepts to novel problems rather than memorizing solutions to specific examples.