Summary:
This diff achieves the following objectives:
- We will have a cleaner way of specifying optimizers: only the arguments a particular optimizer actually expects can be provided. Passing a momentum argument to Adam will throw an exception.
- It makes it easier to add new optimizers: by replacing the if-else-style create_optimizer(), optimizers can be defined in different files (see the sketch below).
- We remove support for multiple optimizers, as this feature was not being used anywhere and was creating unnecessary complexity.
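A minimal sketch of what per-optimizer configs could look like (the class and field names here are illustrative, not the actual PyText API):
```
from dataclasses import dataclass

import torch.optim as optim


@dataclass
class AdamConfig:
    # Only the arguments Adam actually accepts; passing e.g. momentum here
    # fails immediately instead of being silently ignored.
    lr: float = 0.001
    weight_decay: float = 1e-5

    def create(self, model):
        return optim.Adam(model.parameters(), lr=self.lr,
                          weight_decay=self.weight_decay)


@dataclass
class SGDConfig:
    lr: float = 0.01
    momentum: float = 0.0

    def create(self, model):
        return optim.SGD(model.parameters(), lr=self.lr,
                         momentum=self.momentum)


# New optimizers can live in their own files: define a config class with a
# create() method, with no central if-else chain to edit.
```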
Pull Request resolved: #8
Reviewed By: hikushalhere
Differential Revision: D13227927
fbshipit-source-id: 2da3a1f8f456e6c4561375f62efa6596715a2880

Summary:
Pull Request resolved: #203
```
Main changes
1. read_from_file and hive_reader will accept rank and world_size as input parameters; they will only load the sharded data that is required by the node
2. we take the shard based on rank + padding so that we don't need to know the dataset size ahead of time
offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
len = datasize // world_size + (1 if rank < datasize % world_size else 0)
```
The intention of this diff is to reduce memory usage when each node loads its shard of the dataset.
The current implementation is that every node loads the whole dataset into memory and then takes its shard, which can cause OOM issues because total memory usage scales as num_gpus * dataset_size.
This diff enables the following:
1. each node only loads its shard of the dataset into memory, so total memory usage should be approximately the same when comparing multi-GPU and single-GPU training
2. we take the shard range based on the formulas offset = rank * (datasize // world_size) + min(rank, datasize % world_size) and shard_len = datasize // world_size + (1 if rank < datasize % world_size else 0), and we might need to pad one more example in some shards
Example
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3
shard_1 = [1, 2, 3, 4]
shard_2 = [5, 6, 7, 7]
shard_3 = [8, 9, 10, 10]
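A quick sketch that reproduces this example from the offset/length formulas above (plain Python, not the actual reader code):
```
def shard_range(rank, world_size, datasize):
    # Offset and length from the formulas above; at most one padded example per shard.
    offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
    length = datasize // world_size + (1 if rank < datasize % world_size else 0)
    return offset, length


dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3
pad_to = -(-len(dataset) // world_size)  # ceiling division: common shard length

for rank in range(world_size):
    offset, length = shard_range(rank, world_size, len(dataset))
    shard = dataset[offset:offset + length]
    # Pad by repeating the last example so every shard has the same length.
    shard += [shard[-1]] * (pad_to - len(shard))
    print(rank, shard)
# rank 0 -> [1, 2, 3, 4]
# rank 1 -> [5, 6, 7, 7]
# rank 2 -> [8, 9, 10, 10]
```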
The benefits of this sharding + padding approach are:
1. It doesn't require us to know the total dataset size in advance
2. The padding guarantees that each shard has the same number of examples, so we don't need to handle potential batch mismatch issues
3. For any single shard the padding is at most 1 example, which is negligible when the dataset is large
Note that the current hiveio API is not streamed, so the hive reader can still hit OOM issues even when the sharded dataset would fit in memory.
Reviewed By: ahhegazy
Differential Revision: D13644994
fbshipit-source-id: d51e55de4f9e2fda6d15980db06989f5712ef885

Summary:
Pull Request resolved: #207
The epoch_size parameter used to define epochs is counterintuitive from the user's perspective: one needs to know the dataset size and divide by the batch size to judge what epoch_size should be. This diff removes this parameter in favor of a binary flag and a reasonable default epoch size.
Two parameters are added:
- upsample (in DisjointMultitaskDatahandler)
- target_task_name (moved to DisjointMultitask.Config)
If upsample = True (default):
We'll cycle over each dataset repeatedly in round-robin fashion, so shorter datasets will see more iterations (be upsampled). An epoch is defined as an epoch of the target task, if one is set; otherwise it's the length of the shortest dataset.
If upsample = False:
We do a single pass over each dataset. Shorter datasets will sit idle some of the time. This is used for evaluation.
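A rough sketch of the two iteration modes (an illustrative helper, not the actual DisjointMultitaskDatahandler code):
```
import itertools


def iter_batches(task_iters, upsample=True, epoch_length=None):
    """Yield (task_name, batch) pairs round-robin over all tasks."""
    if upsample:
        # Cycle each task's data forever so shorter datasets repeat
        # (are upsampled); stop after epoch_length round-robin steps,
        # e.g. the target task's length or the shortest dataset.
        cycled = {name: itertools.cycle(data) for name, data in task_iters.items()}
        for _ in range(epoch_length):
            for name, it in cycled.items():
                yield name, next(it)
    else:
        # Single pass over each dataset: tasks whose data is exhausted
        # simply sit idle for the rest of the epoch (used for evaluation).
        active = {name: iter(data) for name, data in task_iters.items()}
        while active:
            for name in list(active):
                try:
                    yield name, next(active[name])
                except StopIteration:
                    del active[name]
```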
Reviewed By: seayoung1112
Differential Revision: D13678696
fbshipit-source-id: 96329f241686bc2479e405feda3a230494f12a39

…sk (#204)
Summary:
Pull Request resolved: #204
I fix the two GPU-memory-related issues identified below; a rough sketch of the fixes follows the list.
- We save the state_dict of the current best model in memory, which takes up space. The fix is to save it to disk and reload it.
- Eval takes unnecessary memory because we're not using the torch.no_grad() context, so autograd bookkeeping is kept around during evaluation.
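A minimal sketch of what the two fixes amount to (assuming a plain PyTorch training loop; maybe_save_best and evaluate are illustrative names, not the actual trainer code):
```
import torch


def maybe_save_best(model, eval_metric, best_metric, path="best_model.pt"):
    # Keep the best snapshot on disk instead of holding a copy in memory.
    if best_metric is None or eval_metric > best_metric:
        torch.save(model.state_dict(), path)
        return eval_metric
    return best_metric


def evaluate(model, data_loader, loss_fn):
    model.eval()
    total_loss, num_batches = 0.0, 0
    # no_grad() stops autograd from keeping activations around during eval,
    # which is where the unnecessary memory was going.
    with torch.no_grad():
        for inputs, targets in data_loader:
            total_loss += loss_fn(model(inputs), targets).item()
            num_batches += 1
    return total_loss / max(num_batches, 1)
```
When training finishes, the best snapshot can be restored with model.load_state_dict(torch.load(path)).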
Reviewed By: hikushalhere, chenyangyu1988
Differential Revision: D13642570
fbshipit-source-id: 246a82d7ebebbeaaa6fc2747bd9f16e87cfc9537

Summary:
Pull Request resolved: #199
reference implementation from https://arxiv.org/pdf/1503.02531.pdf
I'm actually not sure if we need to re-train the teacher using the same temperature. Quoting "In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1."
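For reference, a minimal sketch of the temperature-softened distillation loss described in the paper (PyTorch; distillation_loss and its parameters are illustrative, not necessarily how this diff implements it):
```
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Soft targets: teacher distribution at temperature T, matched by the
    # student at the same temperature (scaled by T^2 to keep gradient
    # magnitudes comparable to the hard-target term).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy at temperature 1.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```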
Reviewed By: gardenia22
Differential Revision: D13605113
fbshipit-source-id: 04dfb2c857db4d41f1b039df4ecfcb9399926683

Summary:
Pull Request resolved: #195
Make TensorBoard work with RNNG. The exporter was crashing with use_tensorboard=true because the RNNG exporter does not have a dummy_input; this is fixed by adding the c2 net directly when calling add_graph.
Also reworked throttling in the Manifold writer (the recent throttling caused data not to show up on the dashboard until the summary_writer was closed).
Reviewed By: hikushalhere
Differential Revision: D13596297
fbshipit-source-id: 88e72186b4c3173996d52ac7d73f2533f238072c

Summary:
This fixes the error that was preventing doc generation in Python 3.7. Note that doc generation will still throw a warning on 3.6, but Chris and I have spent a lot of time trying to fix that, and moving ahead is the best approach anyway. Therefore, this commit also updates doc generation on CircleCI and ReadTheDocs to Python 3.7. Finally, it enables failing on warnings for Sphinx doc gen so we lock this down for good.
Tested by building docs locally without errors
- [ ] Docs change / refactoring / dependency upgrade
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
Pull Request resolved: #188
Reviewed By: hikushalhere
Differential Revision: D13582311
Pulled By: snisarg
fbshipit-source-id: e415d0740f9131ac875ab45183dceab90babb9df

Summary:
Pull Request resolved: #136
Init metadata before spawning processes.
Currently init_metadata loads the whole dataset to build the vocab; if we do this in every process in distributed training, memory usage increases 8x, which might cause OOMs given that some datasets take 30G (x8 is 240G).
Instead of initializing the metadata inside each process, we prepare the distributed context (metadata) before spawning and pass the metadata in when creating the task.
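Schematically, the change moves metadata construction into the parent process (a sketch using torch.multiprocessing; init_metadata, create_task, and the argument names are illustrative, not the exact PyText API):
```
import torch.multiprocessing as mp


def train_worker(rank, world_size, config, metadata):
    # Each worker receives the already-built metadata instead of re-reading
    # the full dataset to rebuild the vocab in every process.
    task = create_task(config, metadata=metadata, rank=rank, world_size=world_size)
    task.train()


def main(config, world_size):
    # Build vocab/metadata once in the parent process ...
    metadata = init_metadata(config)
    # ... then spawn the workers, passing the metadata along.
    mp.spawn(train_worker, args=(world_size, config, metadata), nprocs=world_size)
```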
Reviewed By: seayoung1112
Differential Revision: D13503441
fbshipit-source-id: 423139fb0a7ad0003e4694afc644f494c2a8e370

Summary:
Pull Request resolved: #176
Fix possible race conditions in Hogwild trainer.
We join all the parallel processes after training each epoch, and do evaluation after joining, so that we don't encounter races during evaluation and model selection.
This implementation is similar to what sklearn is doing: https://github.com/srome/sklearn-hogwild/blob/master/hogwildsgd.py#L78
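In outline, the epoch loop looks something like this (a sketch with torch.multiprocessing; worker_fn, data_shards, and evaluate_fn are illustrative names, not the actual trainer code):
```
import torch.multiprocessing as mp


def train_epochs(model, worker_fn, data_shards, num_epochs, evaluate_fn):
    model.share_memory()  # Hogwild: workers update shared parameters in place
    for epoch in range(num_epochs):
        procs = [mp.Process(target=worker_fn, args=(model, shard))
                 for shard in data_shards]
        for p in procs:
            p.start()
        # Join all workers before touching the model, so evaluation and
        # model selection never race with in-flight parameter updates.
        for p in procs:
            p.join()
        evaluate_fn(model, epoch)
```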
Reviewed By: sonalgupta
Differential Revision: D13426423
fbshipit-source-id: 10010c4fbb9832c7429778cb16cec364f357f3c3

Summary:
Moving the enum-style classes used in RNNGParser's config into the Config file itself. This is more in line with how the other config classes are arranged in the package, and also gets rid of doc generation warnings.
Pull Request resolved: #179
Reviewed By: bethebunny
Differential Revision: D13568649
Pulled By: snisarg
fbshipit-source-id: 10b52adbc844f684f34f7bdacac8dd08b6e3d5d3

Summary:
Attacks issues related to indentation and formatting of docstrings. There are 3 families of issues:
1.) bad indentation of individual lines of otherwise good docstrings; these are simple fixes
2.) docstrings not following proper style; simple fixes again
3.) docstrings on config classes, which don't get parsed at all; these are fixed by unindenting them at read time (see the sketch below)
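For the third case, the read-time fix is roughly equivalent to dedenting the raw docstring before it reaches the doc parser (a sketch using the standard library; get_config_doc is an illustrative name, not the exact code in this diff):
```
import inspect


def get_config_doc(config_cls):
    # Config-class docstrings keep their source indentation, which the doc
    # parser chokes on; cleandoc() strips the common leading whitespace.
    raw = config_cls.__doc__ or ""
    return inspect.cleandoc(raw)
```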
Pull Request resolved: #166
Reviewed By: chenyangyu1988
Differential Revision: D13554248
Pulled By: m3rlin45
fbshipit-source-id: 94a184acd30b7454dfc19e4e0a4516c37a2f4532

Summary:
This file is not referenced in our TOC and doesn't appear to actually be used.
If we want to revive it in the future, we can do so, but for now, removing it fixes 4 warnings and doesn't actually change the appearance of the docs.
Pull Request resolved: #174
Reviewed By: hikushalhere
Differential Revision: D13559259
Pulled By: m3rlin45
fbshipit-source-id: d331c01377cbbec48621995d0ff4de69a72fde62

Summary:
Our config objects are being added to the index twice: once in the module docs and once in the config docs. This change removes the config-doc versions from the index. Long term we likely want to merge these, but for now this will do.
based on discussion here:
sphinx-doc/sphinx#3866
Pull Request resolved: #165
Reviewed By: bethebunny
Differential Revision: D13553706
Pulled By: m3rlin45
fbshipit-source-id: 62a23cce983e3e3a10a8169c990b9af9c12278bd

Summary:
We generate a lot of source files as part of our documentation build process. These are generated files and should not be checked in.
Pull Request resolved: #162
Reviewed By: bethebunny
Differential Revision: D13553414
Pulled By: m3rlin45
fbshipit-source-id: d685a4402d74af3d8a680d72cb48e574ce68584b

Summary:
We observed a lot of questions regarding the location of the "config.json" file for the export and train steps. We expect this file to be created by the user, but for the sake of simplicity and initial trials, it's better to point users to the existing file. This will also make it easier for users trying the framework for the first time and reduce support questions on both GitHub Issues and the FB group.
Pull Request resolved: #156
Reviewed By: hikushalhere
Differential Revision: D13550588
Pulled By: snisarg
fbshipit-source-id: 158c08661d78b9f0e82578790cddc258d298ef8e