Preamble

This mini-project is due on Feb. 22nd at 11:59pm. Late work will be automatically subject to a 20% penalty, and can be submitted up to 5 days after the deadline. No submissions will be accepted after this 5-day period.

This mini-project is to be completed in groups of three. All members of a group will receive the same grade. It is not expected that all team members will contribute equally to all components; however, every team member should make integral contributions to the project.

You will submit your assignment on MyCourses as a group, and you will also submit to a Kaggle competition. You must register in the Kaggle competition using the email that you are associated with on MyCourses (i.e., @mail.mcgill.ca for McGill students). You can register for the competition at: https://www.kaggle.com/t/b95c2a432a9445d6a01a7a95d51d1dd5. As with MiniProject 1, you must register your group on MyCourses and any group member can submit. You must also form teams on Kaggle and you must use your MyCourses group name as your team name on Kaggle. All Kaggle submissions must be associated with a valid team registered on MyCourses.

Except where explicitly noted in this specification, you are free to use any Python library or utility for this project. All your code must be compatible with Python 3.

Background

In this mini-project you will develop models to predict the sentiment of IMDB reviews. IMDB is a popular website and database of movie information and reviews (https://www.imdb.com/). The goal is to classify IMDB reviews as positive or negative based on the language they contain. You will be competing with other groups to achieve the best accuracy in a competition. However, your performance on the competition is only one aspect of your grade. We also ask that you implement a minimum set of models and report on their performance in a write-up.

Tasks

You are welcome to try any model you like on this task, and you are free to use any libraries you like to extract features. However, you must meet the following requirements:

You must implement a Bernoulli Naive Bayes model from scratch (i.e., without using any external libraries such as scikit-learn). You are free to use any text preprocessing that you like with this model.
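As a starting point, a from-scratch Bernoulli Naive Bayes can be quite compact. The sketch below is one possible implementation (using only NumPy for array math, which is generally not considered an "external library" in the sense above, but check with the TAs); the class name, interface, and use of add-one (Laplace) smoothing are our own choices, not requirements of the assignment.

```python
import numpy as np

class BernoulliNaiveBayes:
    """Minimal Bernoulli Naive Bayes with Laplace smoothing.

    Expects X as a binary {0, 1} matrix of shape (n_docs, n_features)
    and y as a vector of class labels. This is an illustrative sketch,
    not a required interface.
    """

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        n_docs, n_feats = X.shape
        self.log_prior_ = np.zeros(len(self.classes_))
        self.feat_prob_ = np.zeros((len(self.classes_), n_feats))
        for i, c in enumerate(self.classes_):
            Xc = X[y == c]
            self.log_prior_[i] = np.log(len(Xc) / n_docs)
            # Add-one smoothing on feature-presence counts per class.
            self.feat_prob_[i] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)
        return self

    def predict(self, X):
        # log P(c | x) is proportional to
        # log P(c) + sum_j [x_j log(theta_cj) + (1 - x_j) log(1 - theta_cj)]
        log_theta = np.log(self.feat_prob_)
        log_one_minus = np.log(1 - self.feat_prob_)
        scores = self.log_prior_ + X @ log_theta.T + (1 - X) @ log_one_minus.T
        return self.classes_[np.argmax(scores, axis=1)]
```

Note that, unlike multinomial Naive Bayes, the Bernoulli variant also accumulates evidence from *absent* features (the `(1 - X)` term), which is why binarized inputs are required.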

You must run experiments using at least two different classifiers from the scikit-learn package (which are not Bernoulli Naive Bayes). Possible options are:

You must try at least two different feature extraction pipelines for processing the text data (e.g., using binary occurrences vs. tf-idf weighting). You only need to experiment with these different features for one of your models.
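For the two example pipelines mentioned above (binary occurrences and tf-idf weighting), scikit-learn's `CountVectorizer` and `TfidfVectorizer` are a natural fit. The toy documents below are stand-ins for real IMDB reviews, and the specific vectorizer settings are illustrative choices, not prescribed ones.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["a great movie", "a terrible movie", "great great acting"]

# Pipeline 1: binary occurrences (suits Bernoulli Naive Bayes).
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)

# Pipeline 2: tf-idf weighting (often better for linear classifiers).
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

print(X_binary.toarray())  # entries are 0/1 presence indicators
print(X_tfidf.toarray())   # entries are real-valued tf-idf weights
```

Both vectorizers share the same tokenization defaults, so the two matrices have identical shapes and differ only in how occurrences are weighted; this makes for a clean controlled comparison with a single downstream model.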

You must develop a model validation pipeline (e.g., using k-fold cross-validation or a held-out validation set) and report on the performance of the above-mentioned model variants.
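One simple way to satisfy this requirement is scikit-learn's `cross_val_score` applied to a `Pipeline`, so that the vectorizer is re-fit on each training fold rather than on the full dataset. The tiny labeled corpus and the choice of `TfidfVectorizer` plus `LogisticRegression` below are placeholders for your own data and models.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

docs = ["a great movie", "loved it", "terrible film",
        "awful plot", "great acting", "boring and bad"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Wrapping the vectorizer in the pipeline avoids leaking vocabulary
# and document-frequency statistics from the validation folds.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, docs, labels, cv=3, scoring="accuracy")
print(scores.mean())
```

Fitting the vectorizer inside each fold matters: fitting it once on all the data before splitting would make the validation estimate slightly optimistic.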

Deliverables

You must submit two separate files to MyCourses (using the exact filenames and file types outlined below):

code.zip: A collection of .py, .ipynb, and other supporting code files, which must work with Python version 3. You must include your implementation of Bernoulli Naive Bayes, and it must be possible for the TAs to reproduce all the results in your report and your Kaggle leaderboard submissions using your submitted code. Please submit a README detailing the packages you used and providing instructions to replicate your results.

Project write-up

Your team must submit a project write-up that is a maximum of five pages (single-spaced, 11pt font or larger; an extra page for references/bibliographical content can be used). We highly recommend that students use LaTeX to complete their write-ups and use the BibTeX feature for citations. You have some flexibility in how you report your results, but you should adhere to the following general structure:

Abstract (100-250 words)

Summarize the project task and your most important findings.

Introduction (5+ sentences) Summarize the project task, the dataset, and your most important findings. This should be similar to the abstract but more detailed.

Related work (4+ sentences) Summarize previous literature related to the sentiment classification problem.

Dataset and setup (3+ sentences) Very briefly describe the dataset and any basic data pre-processing methods that are common to all your approaches (e.g., tokenizing). Note: You do not need to explicitly verify that the data satisfies the i.i.d. assumption (or any of the other formal assumptions for linear classification).

Proposed approach (7+ sentences) Briefly describe the different models you implemented/compared and the features you designed, providing citations as necessary. If you use or build upon an existing model based on previously published work, it is essential that you properly cite and acknowledge this previous work. Discuss algorithm selection and implementation. Include any decisions about training/validation splits, regularization strategies, optimization tricks, hyper-parameter settings, etc. It is not necessary to provide detailed derivations for the models you use, but you should provide at least a few sentences of background (and motivation) for each model.

Results (7+ sentences, possibly with figures or tables) Provide results on the different models you implemented (e.g., accuracy on the validation set, runtimes). You should report your leaderboard test set accuracy in this section, but most of your results should be on your validation set (or from cross-validation).

Evaluation

The mini-project is out of 100 points, and the evaluation breakdown is as follows:

Completeness (20 points)

Did you submit all the materials?

Did you perform all the required tasks?

Did you follow the guidelines for the project write-up?

Are your results reproducible?

Performance (50 points)

The performance of your models will be evaluated on the Kaggle competition. Your grade will be computed based on your performance on a held-out test set. The grade computation is a linear interpolation between the performance of a random baseline and the top-performing group in the class; however, Prof. Hamilton's benchmark performance represents a ceiling on this competition: any group beating his benchmark is guaranteed full points, and the grading will not interpolate based on scores above his benchmark.

Thus, if we let X denote your accuracy on the held-out test set, B denote the accuracy of the random baseline, T denote the accuracy of the top-performing group, and H denote Prof. Hamilton's benchmark accuracy, then your performance grade (out of 50) is 50 * min(1, (X - B) / (min(T, H) - B)).

In addition to the above, the top-3 performing groups will receive a bonus of 5 points.

Write-up quality (30 points)

Is your proposed methodology technically sound?

Is your report clear and free of grammatical errors and typos?

Did you go beyond the bare minimum requirements for the write-up (e.g., by including a discussion of related work in the introduction)?

Final remarks

You are expected to display initiative, creativity, scientific rigour, critical thinking, and good communication skills. You don't need to restrict yourself to the requirements listed above: feel free to go beyond and explore further.

You can discuss methods and technical issues with members of other teams, but you cannot share any code or data with other teams. Any team found to cheat (e.g., using external information or using resources without proper references) on the code, predictions, or written report will receive a score of 0 for all components of the project.

Rules specific to the Kaggle competition:

Don’t cheat! You must submit code that can reproduce the numbers of your leaderboard solution.

The classification challenge is based on a public dataset. You must not attempt to cheat by searching for information about the test set. Submissions with suspicious accuracies and/or predictions will be flagged, and your group will receive a 0 if you used external information about the test set at any point.

Do not make more than one team for your group (e.g., to make more submissions). You will receive a grade of 0 for intentionally creating new groups with the purpose of making more Kaggle submissions.