Assignment: Project preparation
This course concludes with a project. In this assignment, you will pick your project topic and give a short presentation of your topic proposal.
The course project must be done alone or in teams of two. The project should correspond to the following amount of work (per person):
- 6CP course variant: 2 weeks
- 9CP course variant: 5 weeks
The project should result in a Rust crate with documentation and unit tests. The crate must implement a library or program that solves some task related to computational linguistics or machine learning. Besides these basic requirements, you are free to explore any topic you are interested in.
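To give an idea of what "documentation and unit tests" can look like in practice, here is a minimal sketch of one documented public function with a test module. The function `char_ngrams` and the test names are hypothetical and chosen only for illustration; your crate will of course contain different functionality.

```rust
/// Returns all character n-grams of `text`, in order of occurrence.
///
/// Lengths are counted in Unicode scalar values (`char`s), not bytes.
/// The result is empty when `n` is zero or larger than the number of
/// characters in `text`.
pub fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // `windows` panics for a window size of zero, so guard against it.
    if n == 0 || n > chars.len() {
        return Vec::new();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

#[cfg(test)]
mod tests {
    use super::char_ngrams;

    #[test]
    fn bigrams_of_short_word() {
        assert_eq!(char_ngrams("Rust", 2), vec!["Ru", "us", "st"]);
    }

    #[test]
    fn too_large_n_yields_nothing() {
        assert!(char_ngrams("ab", 3).is_empty());
    }
}
```

The `///` comments are rendered by `cargo doc`, and the functions marked `#[test]` are run by `cargo test`.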
I encourage you to come up with your own topics, but I list five example project topics here for inspiration:
- Tokenizer for social media text. Write a tokenizer that aims to work well on social media text (with smileys, stray punctuation, etc.). If you tackle German, you could use SoMaJo as an inspiration.
- Single hidden layer neural network for classification. In this project, you’d implement a small neural network with training and prediction. You could work out the derivatives for your chosen activation function by hand.
- Spelling correction. Implement a program that does spelling correction of a mistyped word. Peter Norvig provides a description and sample code of how a basic spelling corrector works. He also provides suggestions on how the model could be improved.
- Rust binding for liblinear or libsvm. This project would take the liblinear library for linear SVMs or the libsvm library for kernel SVMs, and write a (Rustic) Rust binding for one of these libraries. (Warning: requires experience with C, and an interest in low-level programming.)
- Corpus search tools. Make a set of tools, using suffix arrays, that can be used to search for character n-grams in large corpora. Given a character n-gram, you could make tools that:
  - Show occurrences of the n-gram in its sentential context (commonly called a concordance or KWIC).
  - Extract statistics (e.g. the number of occurrences of an n-gram).
  - Search for skip-grams of n-gram sequences.
  In this project, you could use the suffix crate to find n-gram positions in time logarithmic in the size of the corpus.
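To illustrate the idea behind the corpus search topic, here is a minimal sketch of a suffix-array lookup that uses only the standard library. It is not the API of the suffix crate; the function names `build_suffix_array`, `prefix`, and `positions` are hypothetical. The construction is deliberately naive (it sorts all suffixes by direct comparison) and only suitable for small inputs, whereas a real project would rely on a proper construction algorithm. Each lookup performs two binary searches over the sorted suffixes, so the number of comparisons grows logarithmically with the corpus size.

```rust
/// Builds a suffix array: the byte offsets of all suffixes of `text`,
/// sorted lexicographically. Naive construction, for demonstration only.
pub fn build_suffix_array(text: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    let mut sa: Vec<usize> = (0..bytes.len()).collect();
    sa.sort_by(|&a, &b| bytes[a..].cmp(&bytes[b..]));
    sa
}

/// Returns at most `len` bytes of the suffix that starts at `start`.
fn prefix(bytes: &[u8], start: usize, len: usize) -> &[u8] {
    let end = (start + len).min(bytes.len());
    &bytes[start..end]
}

/// Returns the byte offsets at which `query` occurs in `text`, in
/// ascending order. `sa` must be the suffix array of `text`.
pub fn positions(text: &str, sa: &[usize], query: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    let q = query.as_bytes();
    // All suffixes that start with `query` form one contiguous range in
    // the suffix array; two binary searches find its boundaries.
    let lo = sa.partition_point(|&i| prefix(bytes, i, q.len()) < q);
    let hi = sa.partition_point(|&i| prefix(bytes, i, q.len()) <= q);
    let mut hits = sa[lo..hi].to_vec();
    hits.sort_unstable();
    hits
}
```

With `let sa = build_suffix_array("abracadabra");`, the call `positions("abracadabra", &sa, "abra")` returns the offsets 0 and 7. The length of the returned vector is the occurrence count, and each offset can be used to cut out the surrounding context for a concordance view.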
Another possibility would be to take an existing, widely used Rust program or library and add a feature that is missing.
Make a presentation of four slides with the following content:
- A title slide that also lists the team member(s).
- A description of the goal of the project (what does it implement).
- A description of the perceived difficult parts of the project.
- A description of third-party crates that you could use in the project (what do they do, how do they fit into the project).
Every team should present their slides to get feedback from me and your colleagues. However, it is not really feasible to let ~20 teams present. So, I would like to do a ‘poster session’ on July 12:
- The first hour, teams 1-10 present their project. Teams 11-20 go to the presentations and give feedback.
- The second hour, teams 11-20 present their project and teams 1-10 go to the presentations and give feedback.
To avoid unnecessary dead trees, you can use your laptop and prepared slides as your poster.
Participation in the poster session as a team is obligatory. Absence (without a good reason) results in a grade of 0 points for this assignment.
Submit your slides in PDF format through this page. The deadline is July 10 at 14:00.