This course includes four programming assignments and a final project. All assignments must be completed individually unless otherwise specified.

Assignment 1: Language ID

Released: Feb 4, 2025

Due: Feb 18, 2025 at 11:59 PM

Overview

In this assignment, you will build a language identification classifier that distinguishes among six languages.

Some languages can be distinguished easily, because they use different scripts. These six languages, however, use the same (Latin) script with minimal diacritics so it is difficult to hand-craft classifiers based on the presence or absence of particular characters. Indeed, unless you have linguistic training or familiarity with the languages, it is difficult to tell them apart.

How can they be distinguished? A naïve approach is to use word counts as unigram features. However, the number of possible words in a large corpus of six languages is vast. It is essential to look at something smaller: characters.

Even though the six languages use roughly the same characters, the relative frequencies of these characters vary greatly. Thus, using characters as features (unigram character models) is appealing (and fairly effective). Languages also vary in their phonotactics, the way consonants and vowels combine in sequence. Thus, looking at character n-grams (for small values of n) is also appealing (and effective). Note, however, that as n increases, this approach runs into the same problem as the word unigram model (sparsity), and the model is likely to overfit.
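Character n-gram counting can be sketched in a few lines. The function name `char_ngrams` below is illustrative, not part of the handout:

```python
from collections import Counter

def char_ngrams(text, n):
    """Return a Counter of all character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Character bigram counts for a short string:
counts = char_ngrams("bonjour", 2)  # {'bo': 1, 'on': 1, 'nj': 1, ...}
```

These counts, computed over a training corpus, become the feature vectors fed to the classifier.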

Various kinds of classifiers can be used for this application. Naïve Bayes (NB) classifiers, for example, are quite effective. However, inference is slow, and performance, given the same training set, is likely to be worse than other options. Binary logistic regression cannot be applied directly because this is an n-way (multinomial) classification problem; Multinomial Logistic Regression (Softmax Regression) is a good fit.
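The core of softmax regression is the softmax function, which turns a vector of per-class scores into a probability distribution. A minimal sketch (the six scores here are made-up numbers):

```python
import numpy as np

def softmax(z):
    # Subtract the max score first for numerical stability.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Scores for the six language classes -> probabilities summing to 1.
scores = np.array([2.0, 1.0, 0.5, 0.0, -1.0, -2.0])
probs = softmax(scores)
```

The predicted language is simply the argmax of `probs`.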

Summary

You will perform the following tasks:

  1. Implement a training loop for Multinomial Logistic Regression.
  2. Implement inference for Multinomial Logistic Regression.
  3. Determine the optimal n-gram order n for MNLR trained on the training set.
  4. Calculate and display a confusion matrix for a trigram model evaluated on the test set.
  5. Inspect the feature weights, and display the most predictive features for each language.
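Tasks 1 and 2 amount to batch gradient descent on the cross-entropy loss plus an argmax over class scores. A minimal NumPy sketch, under assumed function names (`train_mnlr`, `predict`) and a dense feature matrix:

```python
import numpy as np

def train_mnlr(X, y, num_classes, lr=0.1, epochs=200):
    """Batch gradient descent for multinomial logistic regression.
    X: (N, D) feature matrix; y: (N,) integer class labels."""
    N, D = X.shape
    W = np.zeros((D, num_classes))
    Y = np.eye(num_classes)[y]             # one-hot targets, shape (N, C)
    for _ in range(epochs):
        Z = X @ W                          # class scores
        Z -= Z.max(axis=1, keepdims=True)  # stabilize the softmax
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)  # predicted probabilities
        W -= lr * X.T @ (P - Y) / N        # cross-entropy gradient step
    return W

def predict(W, X):
    """Inference: pick the highest-scoring class for each row of X."""
    return (X @ W).argmax(axis=1)
```

A real solution would add a bias term and regularization; this shows only the shape of the loop.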

Assignment 2: Language Modelling

Released: February 18, 2025

Due: March 13, 2025 at 11:59 PM

Overview

In this homework, you will build your first language models: an n-gram language model and a recurrent neural network (RNN) language model. You will also implement Laplace smoothing for the n-gram model (a lazy version) to account for unknown words.
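For the n-gram model, add-one (Laplace) smoothing just adds 1 to every count and the vocabulary size to every denominator, so unseen n-grams get small nonzero probability. A bigram sketch (the helper name `bigram_laplace_prob` is my own, not from the handout):

```python
from collections import Counter

def bigram_laplace_prob(bigram_counts, unigram_counts, w_prev, w, vocab_size):
    """Add-one smoothed bigram probability P(w | w_prev)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

p_seen = bigram_laplace_prob(bigrams, unigrams, "the", "cat", V)    # (1+1)/(2+5)
p_unseen = bigram_laplace_prob(bigrams, unigrams, "cat", "the", V)  # (0+1)/(1+5)
```

Because `Counter` returns 0 for missing keys, unseen bigrams need no special case.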

This assignment is submitted via Gradescope in two parts. Upload both deliverables:

  1. Programming: Paste the functions and classes from the notebook into two files called ngram_lm.py and rnn_lm.py, which are included in the handout. Upload these files without zipping them.
  2. Written: Submit answers to the questions as a PDF file.

Assignment 3: Clickbait Detection

Released: March 13, 2025

Due: March 25, 2025 at 11:59 PM

Overview

In this assignment, you will implement text classification systems to detect clickbait headlines. You will work with word embeddings and analyze their effectiveness for this task.

Learning Objectives

  • Implement neural text classifiers for binary classification
  • Work with pre-trained word embeddings
  • Analyze model performance and feature importance
  • Compare different classification approaches

Tasks

  1. Implement a neural classifier for clickbait detection
  2. Use pre-trained word embeddings
  3. Analyze model performance on different types of headlines
  4. Evaluate and compare classification approaches

Assignment 4: Named Entity Recognition

Released: March 25, 2025

Due: April 15, 2025 at 11:59 PM

Overview

In this assignment, you will implement a named entity recognition system using sequence modeling techniques. You will explore different architectures and evaluate their performance.

Learning Objectives

  • Implement sequence labeling architectures
  • Work with contextualized word representations
  • Train and evaluate NER systems
  • Analyze model behavior and output quality

Tasks

  1. Implement a sequence labeling model
  2. Incorporate contextual information
  3. Train models on annotated data
  4. Evaluate NER performance using precision, recall, and F1 score
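NER evaluation boils down to precision, recall, and F1 computed from true positives, false positives, and false negatives (typically at the entity level). A small sketch with made-up counts:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 8 correctly predicted entities, 2 spurious, 4 missed:
p, r, f = prf1(tp=8, fp=2, fn=4)  # precision 0.80, recall ≈ 0.67, F1 ≈ 0.73
```

F1 is the harmonic mean of precision and recall, so it rewards systems that balance the two.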

Final Exam

Date: TBD

Overview

The final exam will cover all topics discussed throughout the course. It will include a mix of conceptual questions and problem-solving tasks.

Format

The exam will consist of 4-5 questions with multiple parts. The questions will be designed to test conceptual understanding rather than recall of facts. The exam is open book and open notes but non-collaborative.

Preparation

To prepare for the exam, review lecture materials, complete all assignments, and participate in the review session. Practice applying the concepts to novel problems rather than memorizing solutions to specific examples.