Paper summary by shagunsodhani

### Introduction
* *Curriculum Learning* - When training machine learning models, start with easier subtasks and gradually increase the difficulty level of the tasks.
* Motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum-like strategy.
* [Link](http://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf) to the paper.
### Contributions of the paper
* Explore cases that show that curriculum learning benefits machine learning.
* Offer hypotheses about when and why this happens.
* Explore the relation of curriculum learning to other machine learning approaches.
### Experiments with convex criteria
* Train a perceptron on data where some inputs are irrelevant (not predictive of the target class).
* Difficulty can be defined in terms of the number of irrelevant inputs or the margin from the separating hyperplane.
* The curriculum learning model outperforms the no-curriculum approach.
* Surprisingly, when difficulty is defined in terms of the number of irrelevant inputs, the anti-curriculum strategy also outperforms the no-curriculum strategy.
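
The ordering idea in this experiment can be sketched as follows. This is a minimal illustration, not the paper's exact setup: the sampling distributions, the sizes, and the helper names `make_examples` and `train_perceptron` are assumptions made for the example. Difficulty is taken to be the number of irrelevant inputs that are left noisy (not zeroed out).

```python
import numpy as np

def make_examples(n, n_relevant, n_irrelevant, rng):
    """Generate examples whose label depends only on the relevant inputs."""
    w_true = rng.normal(size=n_relevant)            # assumed: Gaussian true weights
    x_rel = rng.uniform(size=(n, n_relevant))       # assumed: uniform inputs
    x_irr = rng.uniform(size=(n, n_irrelevant))
    # Zero out a random number of irrelevant inputs in each example.
    n_zeroed = rng.integers(0, n_irrelevant + 1, size=n)
    for i, k in enumerate(n_zeroed):
        x_irr[i, :k] = 0.0
    X = np.hstack([x_rel, x_irr])
    y = np.where(x_rel @ w_true >= 0, 1, -1)
    difficulty = n_irrelevant - n_zeroed            # noisy irrelevant inputs left
    return X, y, difficulty

def train_perceptron(X, y, order):
    """One online pass of the perceptron rule, visiting examples in `order`."""
    w = np.zeros(X.shape[1])
    for i in order:
        if y[i] * (w @ X[i]) <= 0:                  # misclassified: update
            w += y[i] * X[i]
    return w

rng = np.random.default_rng(0)
X, y, difficulty = make_examples(200, 5, 5, rng)

curriculum = np.argsort(difficulty, kind="stable")  # easiest first
anti_curriculum = curriculum[::-1]                  # hardest first
no_curriculum = rng.permutation(len(y))             # random order

w_curriculum = train_perceptron(X, y, curriculum)
```

The three index arrays differ only in the order in which the same examples are presented, which is the only thing the curriculum changes in this setting.
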
### Experiments on shape recognition with datasets having different variability in shapes
* Standard (target) dataset - Images of rectangles, ellipses, and triangles.
* Easy dataset - Images of squares, circles, and equilateral triangles.
* Start performing gradient descent on the easy dataset and switch to the target dataset at a particular epoch (called the *switch epoch*).
* For no-curriculum learning, the first epoch is the *switch epoch*, so training uses the target dataset from the start.
* As the *switch epoch* increases, the classification error decreases, with the best performance when the *switch epoch* is half the total number of epochs.
* The paper does not report results for higher values of the *switch epoch*.
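
The switch-epoch schedule can be sketched as a small training loop. The names `dataset_for_epoch`, `train`, and `train_one_epoch` are hypothetical, and epochs are assumed to be 0-indexed, so a switch epoch of 0 corresponds to no curriculum.

```python
def dataset_for_epoch(epoch, switch_epoch):
    """Pick the dataset name for a given (0-indexed) epoch."""
    return "easy" if epoch < switch_epoch else "target"

def train(train_one_epoch, datasets, total_epochs, switch_epoch):
    """Run training, using the easy data before the switch epoch.

    `train_one_epoch` is any callable that performs one epoch of gradient
    descent on the dataset it is given (a placeholder for the real model).
    Returns the name of the dataset used at each epoch.
    """
    history = []
    for epoch in range(total_epochs):
        name = dataset_for_epoch(epoch, switch_epoch)
        train_one_epoch(datasets[name])
        history.append(name)
    return history
```

For example, `train(step, {"easy": easy_data, "target": target_data}, total_epochs=4, switch_epoch=2)` would spend the first two epochs on the easy dataset and the last two on the target dataset.
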
### Experiments on language modeling
* The standard dataset is the set of all windows of 5 consecutive words from Wikipedia text in which every word is among the 20,000 most frequent words.
* The easy dataset considers only those windows in which every word is among the 5,000 most frequent words in the vocabulary.
* Each word from the vocabulary is embedded into a *d*-dimensional feature space using a matrix **W** (to be learnt).
* The model predicts the score of the next word, given a window of words.
* The expected value of a ranking loss function is minimized to learn **W**.
* The curriculum learning-based model overtakes the other model soon after switching to the target vocabulary, indicating that the curriculum-based model quickly learns new words.
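
The way the easy and standard datasets are carved out of the same text can be sketched as below. The helper names are hypothetical, and the hinge form of `ranking_loss` is an assumption: the summary only states that a ranking loss is used, not its exact form.

```python
from collections import Counter

def top_k_vocab(tokens, k):
    """Return the set of the k most frequent words in `tokens`."""
    return {word for word, _ in Counter(tokens).most_common(k)}

def windows_in_vocab(tokens, vocab, size=5):
    """All windows of `size` consecutive words whose words are all in `vocab`."""
    windows = []
    for i in range(len(tokens) - size + 1):
        window = tuple(tokens[i:i + size])
        if all(word in vocab for word in window):
            windows.append(window)
    return windows

def ranking_loss(score_true, score_corrupted, margin=1.0):
    """Assumed margin-based ranking loss: the true window should outscore
    a corrupted window by at least `margin`."""
    return max(0.0, margin - score_true + score_corrupted)
```

With a tokenized corpus `tokens`, the easy dataset would be `windows_in_vocab(tokens, top_k_vocab(tokens, 5000))` and the standard dataset `windows_in_vocab(tokens, top_k_vocab(tokens, 20000))`, so every easy window is also a standard window.
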
### Curriculum as a continuation method
* Continuation methods start with a smoothed objective function and gradually move to a less smoothed function.
* Useful in the case where the objective function is non-convex.
* Consider a family of cost functions $C_\lambda (\theta)$ such that $C_0(\theta)$ can be easily optimized and $C_1(\theta)$ is the actual objective function.
* Start with $C_0 (\theta)$ and increase $\lambda$, keeping $\theta$ at a local minimum of $C_\lambda (\theta)$.
* The idea is to move $\theta$ towards a dominant (if not global) minimum of $C_1(\theta)$.
* Curriculum learning can be seen as a sequence of training criteria starting with an easy-to-optimise objective and moving all the way to the actual objective.
* The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting any one training example at any step).
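
A toy one-dimensional illustration of the continuation idea is sketched below. The specific functions are assumptions chosen for the example, not taken from the paper: $C_0(\theta) = \theta^2$ is convex, $C_1(\theta) = \theta^4 - 3\theta^2 + \theta$ has two minima, and the family is the simple interpolation $C_\lambda = (1-\lambda) C_0 + \lambda C_1$ rather than a smoothing of $C_1$.

```python
def c1(theta):
    """Actual (non-convex) objective, assumed for illustration."""
    return theta**4 - 3 * theta**2 + theta

def grad_c(theta, lam):
    """Gradient of C_lam = (1 - lam) * C_0 + lam * C_1, with C_0 = theta^2."""
    grad_c0 = 2 * theta
    grad_c1 = 4 * theta**3 - 6 * theta + 1
    return (1 - lam) * grad_c0 + lam * grad_c1

def gradient_descent(theta, lam, lr=0.02, steps=500):
    """Plain gradient descent on C_lam starting from `theta`."""
    for _ in range(steps):
        theta -= lr * grad_c(theta, lam)
    return theta

def continuation(theta, n_stages=11):
    """Increase lam from 0 to 1, re-minimizing from the previous solution."""
    for k in range(n_stages):
        lam = k / (n_stages - 1)
        theta = gradient_descent(theta, lam)
    return theta

theta_plain = gradient_descent(2.0, lam=1.0)  # optimize C_1 directly
theta_cont = continuation(2.0)                # follow the path from C_0 to C_1
```

Under these particular choices, direct gradient descent from $\theta = 2$ settles in the shallower minimum (near $\theta \approx 1.13$), while the continuation path ends in the deeper one (near $\theta \approx -1.30$). This is only an illustration of the mechanism and not a guarantee for other objectives.
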
### Advantages of Curriculum Learning
* Faster training in the online setting, as the learner does not try to learn difficult examples before it is ready.
* Guiding training towards better local minima in parameter space, specifically useful for non-convex methods.
### Relation to other machine learning approaches
* **Unsupervised preprocessing** - Both have a regularizing effect and lower the generalization error for the same training error.
* **Active learning** - The learner would benefit most from the examples that are close to the learner's frontier of knowledge and are neither too hard nor too easy.
* **Boosting Algorithms** - Difficult examples are gradually emphasised, though the curriculum starts with a focus on easier examples and the training criteria do not change.
* **Transfer learning** and **Life-long learning** - Initial tasks are used to guide the optimisation problem.
### Criticism
* Curriculum Learning is not well understood, making it difficult to define the curriculum.
* In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse.
