kraken implements a dialect of the Variable-size Graph Specification Language
(VGSL), enabling the specification of different network architectures for image
processing purposes using a short definition string.

A VGSL specification consists of an input block, one or more layers, and an
output block. For example:

[1,48,0,1 Cr3,3,32 Mp2,2 Cr3,3,64 Mp2,2 S1(1x12)1,3 Lbx100 Do O1c103]

The first block defines the input in order of [batch, height, width, channels]
with zero-valued dimensions being variable. Nonzero integer values for height
or width cause the input images to be scaled automatically to that size in the
respective dimension.

When channels is set to 1, grayscale or B/W inputs are expected; a value of 3
expects RGB color images. Higher values in combination with a height of 1
result in the network being fed 1 pixel wide grayscale strips scaled to the
size of the channel dimension.
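
The reading of the input block described above can be sketched in a few lines
of Python. This is an illustrative helper, not part of kraken: the function
name parse_input_block and the choice to report variable (zero-valued)
dimensions as None are assumptions made for this example.

```python
def parse_input_block(spec):
    """Return the input dimensions of a VGSL string as a dict.

    Hypothetical helper for illustration only. Zero-valued dimensions
    are reported as None to mark them as variable.
    """
    # Drop the enclosing brackets and split into whitespace-separated blocks.
    tokens = spec.strip().strip("[]").split()
    if not tokens:
        raise ValueError("empty specification")
    # The first block holds batch, height, width, channels.
    values = [int(v) for v in tokens[0].split(",")]
    if len(values) != 4:
        raise ValueError(f"expected 4 input dimensions, got {len(values)}")
    names = ("batch", "height", "width", "channels")
    return {name: (value or None) for name, value in zip(names, values)}


dims = parse_input_block("[1,48,0,1 S1(1x48)1,3 Lbx100 O1c103]")
print(dims)
```

Here the height is fixed at 48, so images would be scaled to that height, while
the width is left variable.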

After the input, a number of layers are defined. Layers operate on the channel
dimension; this is intuitive for convolutional layers but a recurrent layer
doing sequence classification along the width axis on an image of a particular
height requires the height dimension to be moved to the channel dimension,
e.g.:

[1,48,0,1 S1(1x48)1,3 Lbx100 O1c103]

or using the alternative, slightly faster formulation:

[1,1,0,48 Lbx100 O1c103]

Finally, an output definition is appended. When training sequence
classification networks with the provided tools, the appropriate output
definition is automatically appended to the network based on the alphabet of
the training data.

A model with a small convolutional stack before a recurrent LSTM layer. The
extended dropout layer syntax is used to reduce drop probability on the depth
dimension as the default is too high for convolutional layers. The remainder of
the height dimension (12) is reshaped into the depth dimension before
applying the final recurrent and linear layers.

Adds either an LSTM or GRU recurrent layer to the network using either the x
(width) or y (height) dimension as the time axis. Input features are the
channel dimension and the non-time-axis dimension (height/width) is treated as
another batch dimension. For example, a Lfx25 layer on a 1, 16, 906, 32
input will execute 16 independent forward passes on 906x32 tensors resulting
in an output of shape 1, 16, 906, 25. If this isn’t desired, either run a
summarizing layer in the other direction, e.g. Lfys20, which yields a 1, 1,
906, 20 tensor, or prepend a reshape layer S1(1x16)1,3 combining the height and
channel dimension for a 1, 1, 906, 512 input to the recurrent layer.
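
The shape arithmetic in this paragraph can be checked with a small sketch. It
only models the two forward cases mentioned above and is not kraken's
implementation: the function name rnn_output_shape is hypothetical, and the
assumption that a summarizing layer collapses its time axis to size 1 is read
off the Lfys20 example in the text.

```python
def rnn_output_shape(shape, time_axis, hidden, summarize=False):
    """Output shape of a forward recurrent layer on [batch, height, width, channels].

    Hypothetical helper for illustration only. The channel dimension is
    replaced by the hidden size; a summarizing layer is assumed to reduce
    its time axis to a single step.
    """
    batch, height, width, _channels = shape
    if time_axis not in ("x", "y"):
        raise ValueError("time_axis must be 'x' (width) or 'y' (height)")
    if summarize:
        if time_axis == "y":
            height = 1  # time runs along the height, which is collapsed
        else:
            width = 1  # time runs along the width, which is collapsed
    return (batch, height, width, hidden)


# Lfx25: 16 independent passes along the width, height is kept.
print(rnn_output_shape((1, 16, 906, 32), "x", 25))
# Lfys20: summarizing pass along the height.
print(rnn_output_shape((1, 16, 906, 32), "y", 20, summarize=True))
```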

S[{name}]<d>(<a>x<b>)<e>,<f> Splits one dimension, moves one part to another dimension.

The S layer reshapes a source dimension d to a,b and distributes a into
dimension e and b into dimension f. Either e or f has to be equal to
d. So S1(1x48)1,3 on a 1, 48, 1020, 8 input will first reshape into
1, 1, 48, 1020, 8, leave the 1 part in the height dimension and distribute
the 48 sized tensor into the channel dimension resulting in a 1, 1, 1020,
48*8=384 sized output. S layers are mostly used to remove undesirable non-1
height before a recurrent layer.
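
As a worked check of the example above, the following sketch computes the
output shape of an S layer for fixed-size dimensions. It is only shape
arithmetic under stated assumptions, not kraken's reshape code: the function
name s_layer_shape is hypothetical, and variable (zero-valued) dimensions are
not handled.

```python
def s_layer_shape(shape, d, a, b, e, f):
    """Output shape of S<d>(<a>x<b>)<e>,<f> on a [batch, height, width, channels] shape.

    Hypothetical helper for illustration only; assumes all sizes are fixed.
    """
    if d not in (e, f):
        raise ValueError("either e or f has to be equal to d")
    if shape[d] != a * b:
        raise ValueError(f"dimension {d} has size {shape[d]}, expected {a * b}")
    out = list(shape)
    out[d] = 1   # the source dimension is split up completely
    out[e] *= a  # the a part goes into dimension e
    out[f] *= b  # the b part goes into dimension f
    return tuple(out)


# S1(1x48)1,3 on a 1, 48, 1020, 8 input
print(s_layer_shape((1, 48, 1020, 8), d=1, a=1, b=48, e=1, f=3))
```

The height of 48 ends up in the channel dimension (48*8=384) while the width of
1020 is left untouched.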

Note

This S layer is equivalent to the one implemented in the TensorFlow
implementation of VGSL, i.e. it behaves differently from Tesseract.