The task at hand is to assign a label of 1 to words in a sentence that correspond with a LOCATION, and a label of 0 to everything else.

In this simplified example, we only ever see spans of length 1.

In [25]:

train_sents = [s.lower().split() for s in ["we 'll always have Paris",
                                           "I live in Germany",
                                           "He comes from Denmark",
                                           "The capital of Denmark is Copenhagen"]]
train_labels = [[0, 0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1, 0, 1]]

assert all([len(train_sents[i]) == len(train_labels[i]) for i in range(len(train_sents))])

When we train our model, we rarely update with respect to a single training instance at a time, because a single instance provides a very noisy estimate of the global loss's gradient. We instead construct small batches of data, and update parameters for each batch.

Given some batch size, we want to construct batch tensors out of the word index lists we've just created with our vocab.

For each length B list of inputs, we'll have to:

(1) Add window padding to sentences in the batch like we just saw.
(2) Add additional padding so that each sentence in the batch is the same length.
(3) Make sure our labels are in the desired format.

At the level of the dataset, we want:

(4) Easy shuffling, because reshuffling from one training epoch to the next gets rid of
pathological batches that are tough to learn from.
(5) A guarantee that inputs and their labels get shuffled together!

PyTorch provides us with an object torch.utils.data.DataLoader that gets us (4) and (5). All that's required of us is to specify a collate_fn that tells it how to do (1), (2), and (3).

def my_collate(data, window_size, word_2_id):
    """
    For some chunk of sentences and labels
        - add window padding
        - pad for lengths using pad_sequence
        - convert our labels to one-hots
        - return padded inputs, one-hot labels, and lengths
    """
    x_s, y_s = zip(*data)

    # deal with input sentences as we've seen
    window_padded = [convert_tokens_to_inds(pad_sentence_for_window(sentence, window_size), word_2_id)
                     for sentence in x_s]
    # append zeros to each list of token ids in the batch so that they are all the same length
    padded = nn.utils.rnn.pad_sequence([torch.LongTensor(t) for t in window_padded],
                                       batch_first=True)

    # convert labels to one-hots
    labels = []
    lengths = []
    for y in y_s:
        lengths.append(len(y))
        label = torch.zeros((len(y), 2))
        true = torch.LongTensor(y)
        false = 1 - true  # flip the 0/1 labels to fill the negative-class column
        label[:, 0] = false
        label[:, 1] = true
        labels.append(label)
    padded_labels = nn.utils.rnn.pad_sequence(labels, batch_first=True)

    return padded.long(), padded_labels, torch.LongTensor(lengths)

In [41]:

# Shuffle True is good practice for train loaders.
# Use functools.partial to construct a partially populated collate function
example_loader = DataLoader(list(zip(train_sents, train_labels)),
                            batch_size=2,
                            shuffle=True,
                            collate_fn=partial(my_collate,
                                               window_size=2,
                                               word_2_id=word_2_id))
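
To see what the collate function hands us, we can pull a single batch out of the loader. This is just a quick sanity check, not a cell from the original notebook; it assumes the helpers used by my_collate are already defined:

batch_inputs, batch_labels, batch_lengths = next(iter(example_loader))
print(batch_inputs.size())   # (B, max_len + 2 * window_size): window-padded, length-padded token indices
print(batch_labels.size())   # (B, max_len, 2): one-hot labels, all-zero rows where padded
print(batch_lengths)         # original, unpadded sentence lengths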

Before we go ahead and build our model, let's think about the first thing it needs to do to its inputs.

We're passed batches of sentences. For each sentence i in the batch, for each word j in the sentence, we want to construct a single tensor out of the embeddings surrounding word j in the +/- n window.

Thus, the first thing we're going to need is a (B, L, 2N+1) tensor of token indices.

A terrible but nevertheless informative iterative solution looks something like the following, where we iterate through the elements of our (dummy) batch, iterate over the non-padded word positions in each, and for each non-padded word position construct a window:
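
Here is a sketch of what such a loop might look like (not the notebook's original cell; it assumes batch_inputs and batch_lengths come from example_loader above, with window_size=2):

window_size = 2
token_windows = []
for i in range(batch_inputs.size(0)):                     # each sentence in the batch
    sentence_windows = []
    for j in range(batch_lengths[i].item()):              # each non-padded word position
        # word j of the original sentence sits at index j + window_size,
        # so its window spans indices [j, j + 2 * window_size + 1)
        sentence_windows.append(batch_inputs[i, j : j + 2 * window_size + 1])
    token_windows.append(torch.stack(sentence_windows))   # (len_i, 2N+1) per sentence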

Technically it works: for each element in the batch, for each word in the original sentence (ignoring window padding), we've got the 5 token indices centered at that word. But in practice it will be painfully slow.

Instead, we ideally want to find the right tensor operation in the PyTorch arsenal. Here, that happens to be Tensor.unfold.
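
Tensor.unfold(dimension, size, step) slides a window of the given size along a dimension and stacks the windows into a new trailing dimension. A sketch of how it replaces the loops above, reusing batch_inputs and window_size from the cells before:

# batch_inputs is (B, max_len + 2 * window_size), so unfolding dimension 1 with
# size 2 * window_size + 1 and step 1 yields (B, max_len, 2 * window_size + 1):
# one window of token indices centered on every position, padded positions included.
token_windows = batch_inputs.unfold(1, 2 * window_size + 1, 1)
print(token_windows.size())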

def loss_function(outputs, labels, lengths):
    """Computes negative LL loss on a batch of model predictions."""
    B, L, num_classes = outputs.size()
    num_elems = lengths.sum().float()

    # get only the values with non-zero labels
    loss = outputs * labels

    # rescale average
    return -loss.sum() / num_elems
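
Since the one-hot labels are zero everywhere except the gold class, and the padded label rows produced by pad_sequence are all zeros, outputs * labels picks out the (log-)score of the correct class at real token positions and contributes nothing at padded ones; dividing by lengths.sum() then averages over real tokens only. A quick check on made-up log-probabilities (the model is assumed to output log-softmax scores; the numbers here are arbitrary):

fake_outputs = torch.log_softmax(torch.randn(2, 3, 2), dim=2)  # (B=2, L=3, num_classes=2)
fake_labels = torch.zeros(2, 3, 2)
fake_labels[0, :, 1] = 1.0    # sentence 1: three words, all labeled LOCATION
fake_labels[1, :2, 0] = 1.0   # sentence 2: two real words, last row is padding
fake_lengths = torch.LongTensor([3, 2])
print(loss_function(fake_outputs, fake_labels, fake_lengths))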

In [56]:

def train_epoch(loss_function, optimizer, model, train_data):
    ## For each batch, we must reset the gradients
    ## stored by the model.
    total_loss = 0
    for batch, labels, lengths in train_data:
        # clear gradients
        optimizer.zero_grad()
        # invoke the model in training mode on the batch
        outputs = model.forward(batch)
        # compute loss w.r.t batch
        loss = loss_function(outputs, labels, lengths)
        # pass gradients back, starting on the loss value
        loss.backward()
        # update parameters
        optimizer.step()

        total_loss += loss.item()

    # return the total to keep track of how you did this time around
    return total_loss
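
A sketch of how an outer training loop might call this; the model, optimizer choice, learning rate, and epoch count here are placeholders rather than values from this notebook:

import torch.optim as optim

# `model` stands in for whatever word-window classifier gets defined next;
# the hyperparameters below are purely illustrative.
optimizer = optim.SGD(model.parameters(), lr=0.01)
num_epochs = 10
for epoch in range(num_epochs):
    epoch_loss = train_epoch(loss_function, optimizer, model, example_loader)
    print(f"epoch {epoch}: total loss = {epoch_loss:.4f}")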