I have been trying to set up fine-tuning/domain adaptation for an OpenNMT-tf Transformer model. I have trained a decent general-domain base model using open source data and would now like to customise it to this IT-domain data:

98428 English IT-related sentences in indomain_train_en.txt
98428 French IT-related sentences in indomain_train_fr.txt

And eventually test the tuned model on:

2500 English IT-related sentences in indomain_test_en.txt
2500 French IT-related sentences in indomain_test_fr.txt

However, I’m having issues with updating the vocabulary and kicking off the tuning run. I have already trained a SentencePiece model on the original general-domain training data, and when I try to update the vocabulary, there doesn’t seem to be any output.

Update vocabulary

However, nothing shows up in experiments/transformer/avg/finetuned/, even though the docs describe it as “The output directory where the updated checkpoint will be saved.” No finetuned.vocab file is created either; I would have expected an updated checkpoint file and the new vocab file to appear there.

Perhaps I’m misunderstanding something… some questions:

How can I update the vocabulary successfully? What am I doing wrong here?

What exactly is happening internally when the vocabulary is updated?

Am I meant to train a new SentencePiece model for new data that I want to adapt the model to? I would have thought that this model should remain the same as the one trained from the general/base data.

And maybe most importantly:

How can I kick off the training once the vocabulary is updated? Is the correct command something like this:

nat:

How can I update the vocabulary successfully? What am I doing wrong here?

You are missing some options. You should provide the current and new vocabularies:

--src_vocab SRC_VOCAB
Path to the current source vocabulary. (default: None)
--tgt_vocab TGT_VOCAB
Path to the current target vocabulary. (default: None)
--new_src_vocab NEW_SRC_VOCAB
Path to the new source vocabulary. (default: None)
--new_tgt_vocab NEW_TGT_VOCAB
Path to the new target vocabulary. (default: None)
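A full invocation might then look like the following. This is only a sketch: the checkpoint directory matches the paths you mentioned above, but all vocabulary file names are assumptions, so substitute your own.

```shell
onmt-update-vocab \
  --model_dir experiments/transformer/avg \
  --output_dir experiments/transformer/avg/finetuned \
  --src_vocab data/general_en.vocab \
  --tgt_vocab data/general_fr.vocab \
  --new_src_vocab data/indomain_en.vocab \
  --new_tgt_vocab data/indomain_fr.vocab
```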

nat:

What exactly is happening internally when the vocabulary is updated?

Some model weights depend on the vocabulary: the embeddings and the softmax weights. The script resizes these matrices to the new vocabulary size and copies the learned representations of the words that are still present in the new vocabulary.
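The resize-and-copy step can be sketched like this. This is a toy illustration of the idea, not the actual OpenNMT-tf code; the vocabularies and embedding dimension are made up.

```python
import numpy as np

# Toy vocabularies: the base model knows old_vocab; the in-domain
# vocabulary drops "world" and adds "server" and "reboot".
old_vocab = ["<blank>", "<s>", "</s>", "hello", "world"]
new_vocab = ["<blank>", "<s>", "</s>", "hello", "server", "reboot"]

dim = 4
rng = np.random.default_rng(0)
old_emb = rng.normal(size=(len(old_vocab), dim))  # learned weights

# Resize: start from freshly initialized rows, then copy the learned
# rows for tokens that survive into the new vocabulary.
new_emb = rng.normal(size=(len(new_vocab), dim))
old_index = {tok: i for i, tok in enumerate(old_vocab)}
for i, tok in enumerate(new_vocab):
    if tok in old_index:
        new_emb[i] = old_emb[old_index[tok]]
```

Tokens that appear in both vocabularies keep their trained embeddings; brand-new tokens start from fresh initializations and are learned during fine-tuning.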

nat:

Am I meant to train a new SentencePiece model for new data that I want to adapt the model to?

You should keep the same SentencePiece model.
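In practice that means encoding the in-domain data with the existing SentencePiece model and deriving the new vocabularies from that output, along these lines. A sketch: the model and output file names are assumptions.

```shell
# Tokenize the in-domain data with the *existing* SentencePiece model.
spm_encode --model=general_sp.model --output_format=piece \
  < indomain_train_en.txt > indomain_train_en.sp

# Build the new OpenNMT-tf source vocabulary from the tokenized data.
onmt-build-vocab --size 32000 --save_vocab indomain_en.vocab indomain_train_en.sp
```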

nat:

How can I kick off the training once the vocabulary is updated? Is the correct command something like this:

There are 2 ways:

“simple” mode: just change model_dir to the new checkpoint directory and rerun the same command you used for the initial training

“expert” mode: set model_dir to a new directory and set --checkpoint_path to the finetuned checkpoint. This will start a new training (with new optimizer settings and schedules) but load model weights from the checkpoint. This should only be used if you know precisely what you want to do (e.g. change the learning rate, etc.)
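For the “simple” mode, the rerun might look like this. A sketch only: the config file name and model type are assumptions; the key point is that model_dir in your configuration now points at the directory produced by the vocabulary update.

```shell
# Same command as the initial training; only model_dir in config.yml
# has changed, to experiments/transformer/avg/finetuned.
onmt-main train_and_eval --model_type Transformer \
  --config config.yml --auto_config
```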

It initially looks like this works, but I then get a long error message, starting with:

WARNING:tensorflow:You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.

And ending with:

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

I am running OpenNMT-tf==1.15.0 and tensorflow-gpu==1.12.0, building from the Docker image tensorflow/tensorflow:latest-gpu-py3.

Potentially related: when making the new vocabularies, is it necessary to manually process the .vocab files that SentencePiece outputs? I see in one of your scripts that you processed some original (not tuning-specific) .vocab files like this:

# Keep the first field of the vocab file generated by SentencePiece and drop the first line (<unk>)
cut -f 1 wmt$sl$tl.vocab | tail -n +2 > data/wmt$sl$tl.vocab.tmp
# Add the <blank> token in first position, needed by OpenNMT-tf
sed -i '1i<blank>' data/wmt$sl$tl.vocab.tmp
# Last tweak: restore the empty line that is supposed to be the "tab" token (removed by the cut above)
perl -pe '$/=""; s/\n\n/\n\t\n/;' data/wmt$sl$tl.vocab.tmp > data/wmt$sl$tl.vocab