# Assignment: Project preparation

## Introduction

This course concludes with a project. In this assignment, you will pick your project topic and prepare a short presentation of your proposal.

The course project must be done alone or in a team of two. The project should correspond to the following amount of work (per person):

• 6CP course variant: 2 weeks
• 9CP course variant: 5 weeks

### Project topics

The project should result in a Rust crate with documentation and unit tests. The crate must implement a library or program that solves some task related to computational linguistics or machine learning. Besides these basic requirements, you are free to explore any topic you are interested in.
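As a rough illustration of what "documentation and unit tests" can look like in practice, here is a minimal sketch of a library source file. The function `tokenize` and its whitespace-splitting behaviour are hypothetical placeholders chosen for this example, not a required design. Items are documented with `///` comments (crate-level documentation would go in `//!` comments at the top of `lib.rs`), and the tests live in a `#[cfg(test)]` module.

```rust
/// Splits `text` on whitespace and returns the tokens as string slices.
///
/// This is a deliberately naive placeholder: a real tokenizer for social
/// media text would also have to handle emoticons, URLs, hashtags, etc.
pub fn tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn splits_on_whitespace() {
        assert_eq!(tokenize("hello  world :-)"), vec!["hello", "world", ":-)"]);
    }

    #[test]
    fn blank_input_gives_no_tokens() {
        assert!(tokenize("   ").is_empty());
    }
}
```

With this layout, `cargo test` runs the unit tests and `cargo doc` renders the `///` comments as HTML documentation.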

I encourage you to come up with your own topics, but I list five example project topics here for inspiration:

• Tokenizer for social media text. Write a tokenizer that aims to work well on social media text (with smileys, stray punctuation, etc.). If you tackle German, you could use SoMaJo as an inspiration.
• Single hidden layer neural network for classification. In this project, you would implement a small neural network with training and prediction. You could work out the derivatives for your chosen activation function by hand.
• Spelling correction. Implement a program that does spelling correction of a mistyped word. Peter Norvig provides a description and sample code of how a basic spelling corrector works. He also provides suggestions on how the model could be improved.
• Rust binding for liblinear or libsvm. This project would take the existing liblinear library for linear SVMs or the libsvm library for kernel SVMs, and write a (Rustic) Rust binding for one of these libraries. (Warning: requires experience with C, and an interest in low-level programming.)
• Corpus search tools. Make a set of tools, using suffix arrays, that can be used to search character n-grams in large corpora. Given a character n-gram $q$, you could make tools that:

• Show occurrences of $q$ in its sentential context (commonly called a concordance or KWIC).
• Extract statistics (e.g. number of occurrences of an n-gram).
• Search for skip-grams of n-gram sequences.
• Etc.

In this project, you could use the suffix crate to find n-gram positions in $\mathcal{O}(\log N)$ time, where $N$ is the size of the corpus.
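To make the idea behind the corpus search topic concrete, the following sketch shows a suffix array lookup using only the standard library. It does not use the suffix crate's API; the function names `build_suffix_array` and `find_occurrences` are made up for this example, and the construction is the naive sort-all-suffixes approach, which is far too slow for large corpora. The lookup itself uses binary search, so it needs $\mathcal{O}(\log N)$ comparisons, each of which inspects at most $|q|$ bytes.

```rust
/// Builds a suffix array over the bytes of `text` by sorting all suffix
/// start positions lexicographically (naive construction, for illustration).
pub fn build_suffix_array(text: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    let mut sa: Vec<usize> = (0..bytes.len()).collect();
    sa.sort_by(|&a, &b| bytes[a..].cmp(&bytes[b..]));
    sa
}

/// Returns the byte offsets (in ascending order) at which `query` occurs
/// in `text`, given the suffix array `sa` built from the same `text`.
pub fn find_occurrences(text: &str, sa: &[usize], query: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    let q = query.as_bytes();

    // The suffixes are sorted, so all suffixes that start with `q` form one
    // contiguous range of the suffix array. Find both ends by binary search.
    let lo = sa.partition_point(|&i| &bytes[i..] < q);
    let hi = sa.partition_point(|&i| {
        // Compare only the first `q.len()` bytes of the suffix.
        let end = (i + q.len()).min(bytes.len());
        &bytes[i..end] <= q
    });

    let mut hits = sa[lo..hi].to_vec();
    hits.sort_unstable();
    hits
}
```

For the text `"banana"` and the query `"ana"`, the matching suffixes start at byte offsets 1 and 3. A concordance tool could use such offsets to cut out the surrounding context, and the length of the range `lo..hi` directly gives the n-gram's frequency.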

Another possibility would be to take an existing, widely used Rust program or library and add a feature that is missing.

## The assignment

Make a presentation of four slides with the following content:

1. A title slide that also lists the team member(s).
2. A description of the goal of the project (what does it implement?).
3. A description of the parts of the project that you expect to be difficult.
4. A description of third-party crates that you could use in the project (what do they do, and how do they fit into the project?).

Every team should present their slides to get feedback from me and your colleagues. However, it is not really feasible to let ~20 teams present one after another. So, I would like to do a 'poster session' on July 12:

• In the first hour, teams 1-10 present their projects, and teams 11-20 go to the presentations and give feedback.
• In the second hour, teams 11-20 present their projects, and teams 1-10 go to the presentations and give feedback.