Deep Learning

1. Overview

Deep Learning is a milestone machine learning technique in computer vision.
It automatically learns from the training images provided and can effectively generate the solutions for a wide range of applications with minimum effort.
Main advantages of this technique are: simple configuration, short development time, versatility of possible applications, robustness to noise and high performance.

Typical applications include:

localization, segmentation and classification of multiple objects within an image (e.g. bin picking),

quality analysis in variable environments,

localization and classification of key points, characteristic regions and small objects.

Using the deep learning functionality involves two stages:

Training - generating a model based on features learned from training samples (using Deep Learning Editor),

Inference - applying the model on new images in order to perform certain machine vision tasks.

The difference from a classic image processing approach is presented in diagrams below:

Classic approach: the algorithm is the missing element that needs to be designed by a human specialist.

Machine learning approach: only the manually labeled training images need to be provided.

Available Deep Learning tools

Anomaly detection - This technique is used to detect anomalous samples. It only needs a set of fault-free samples to learn the model
of normal appearance. Optionally, several faulty ones can be useful to define the level of tolerable variations. This tool is especially useful
in cases where defects are unknown, too difficult to define upfront or highly variable. The outputs of this tool are a classification result
(normal or faulty), an abnormality score and a heatmap of defects in the image.

An example of missing part detection using DeepLearning_DetectAnomalies2 tool. Left: The original
image with a missing element. Right: The classification result with a heatmap of defects overlay.

Feature detection (segmentation) - This technique is used to precisely segment one or more classes of
features within an image. The pixels belonging to each class must be marked by the user in the training step. The result of this technique
is an array of probability maps for every class.

An example of image segmentation using DeepLearning_DetectFeatures tool. Left: The original
image of the fundus. Right: The segmentation of blood vessels.

Object classification - This technique is used to mark objects/images with one of the predefined classes.
First, it is necessary to provide a training set of labeled images. The results of this technique are the name of
a class and a classification confidence level for a given image.

Instance segmentation - This technique is used to locate, segment and classify single or multiple objects within an image.
The training requires a user to draw regions corresponding to objects on an image and assign them to classes. The results of this technique
are lists with elements describing detected objects - their bounding boxes, masks (segmented regions), class IDs, names and membership probabilities.

An example of instance segmentation using DeepLearning_SegmentInstances tool. Left: The original image.
Right: The resulting list of detected objects.

Point location - This technique is used to precisely locate and classify key points, characteristic regions and small objects within an image.
The training requires a user to mark points of appropriate classes on the training images. The result of this technique is a list of predicted point locations
with corresponding class predictions and confidence scores.

An example of point location using DeepLearning_LocatePoints tool. Left: The original image.
Right: The resulting list of detected points.

Basic terminology

Users do not need specialized scientific knowledge to design their own deep learning solutions. However, it may be very
useful to understand the basic terminology and principles behind the process.

Deep neural networks

Adaptive Vision gives access to deep convolutional neural network architectures created, adjusted and tested to solve industrial-grade
machine vision tasks. Each network is a set of trainable convolutional filters and neural connections which can model complex transformations of an image to
extract relevant features and use them to solve a particular problem. However, the networks are useless without a sufficient amount of good-quality data provided for
the training process (adjusting the weights of filters and connections). This documentation gives the necessary practical hints on preparing an effective deep learning model.

Due to the varying complexity of tasks and different expected execution times, users can choose one of five available network depths.
The Network depth parameter is an abstract value defining the memory capacity of the network (i.e., the number of layers and filters of
a network) and its ability to solve more complex problems. The list below gives hints on selecting the proper depth for the characteristics and conditions of a task.

Low depth (value 1-2)

A problem is simple to define.

A problem could be easily solved by a human inspector.

A short time of execution is required.

Background and lighting do not change across images.

Well-positioned objects and good quality of images.

Standard depth (default, value 3)

Suitable for a majority of applications without any special conditions.

A modern CUDA-enabled GPU is available.

High depth (value 4-5)

A large amount of training data is available.

A problem is hard or very complex to define and solve.

Complicated irregular patterns across images.

Long training and execution times are not a problem.

A large amount of GPU RAM (≥4GB) is available.

Varying background, lighting and/or positioning of objects.

Tip: test your solution with a lower depth first, and then increase it if needed.

Note: a higher network depth will lead to a significant increase in memory and computational complexity of training and execution.

Training

Model training is an iterative process of updating neural network weights based on the training data. One iteration consists of a number of steps (determined automatically), and each step consists of the following operations:

selection of a small subset (batch) of training samples,

calculation of network error for these samples,

updating weights to achieve lower error for these samples.
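The three operations of a single step can be sketched with plain NumPy on a toy linear model. This is only an illustration of mini-batch gradient descent in general: the model, the mean squared error, the learning rate and the function name `training_step` are assumptions made for the example and do not describe the internals of the Deep Learning Add-on.

```python
import numpy as np

def training_step(weights, samples, targets, rng, batch_size=4, learning_rate=0.05):
    """One step of mini-batch gradient descent for a toy linear model (illustrative only)."""
    # 1. Select a small random subset (batch) of the training samples.
    idx = rng.choice(len(samples), size=batch_size, replace=False)
    x, y = samples[idx], targets[idx]
    # 2. Calculate the error for these samples (here: mean squared error).
    residual = x @ weights - y
    error = float(np.mean(residual ** 2))
    # 3. Update the weights to achieve a lower error for these samples.
    gradient = 2.0 * x.T @ residual / batch_size
    return weights - learning_rate * gradient, error

rng = np.random.default_rng(0)
samples = rng.normal(size=(64, 3))
true_weights = np.array([1.0, -2.0, 0.5])   # hypothetical "ground truth" for the toy data
targets = samples @ true_weights

weights = np.zeros(3)
errors = []
for _ in range(200):
    weights, error = training_step(weights, samples, targets, rng)
    errors.append(error)
```

In this toy setting the recorded batch errors shrink as the steps accumulate, which is the same behavior the training graph in the Editor is meant to show.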

At the end of each iteration, the current model is evaluated on a separate set of validation samples selected before the training process. The validation set is chosen automatically from the training samples. It is used to simulate how the neural network would work with real images not used during training. Only the set of network weights corresponding to the best validation score is saved at the end of training as the final solution. Monitoring the training and validation scores (blue and orange lines in the figures below) in consecutive iterations gives fundamental information about the progress:

Both training and validation scores are improving - keep training, model can still improve.

Both training and validation scores have stopped improving - keep training for a few more iterations and stop if there is still no change.

Training score is improving, but validation score has stopped improving or is getting worse - you can stop training; the model has probably started overfitting your training data (memorizing exact samples rather than learning rules about features). This may be caused by too few diverse samples or by a network that is too complex for the problem.
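The three cases above can be expressed as a small helper that compares recent scores with earlier ones. This is a hedged sketch, not a feature of the Editor: the function names, the three-iteration window and the tolerance are assumptions, and scores are assumed to be "higher is better" with one value per iteration.

```python
def is_improving(scores, window=3, tolerance=1e-3):
    """True if the best of the last `window` scores beats the best earlier score."""
    if len(scores) <= window:
        return True  # too little history to call it a plateau
    return max(scores[-window:]) > max(scores[:-window]) + tolerance

def diagnose(train_scores, val_scores, window=3):
    """Map the training/validation trends to one of the three situations."""
    if is_improving(val_scores, window):
        return "keep training"
    if is_improving(train_scores, window):
        return "possible overfitting"
    return "plateau"

status = diagnose([0.5, 0.6, 0.7, 0.8, 0.9, 0.95],
                  [0.5, 0.6, 0.7, 0.69, 0.68, 0.66])
print(status)
```

A "plateau" result corresponds to the advice to train for a few more iterations before stopping.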

An example of correct training.

A graph characteristic for network overfitting.

The graphs above show training progress in the Deep Learning Editor. The blue line indicates the performance on training samples, and the orange line the performance on validation samples. Note that the blue line is plotted more frequently than the orange line, because validation performance is evaluated only at the end of each iteration.

Stopping Conditions

The user can stop the training manually by clicking the Stop button. Alternatively, it is also possible to set one or more stopping conditions:

Iteration Count - training will stop after a fixed number of iterations.

Iterations Without Improvement - training will stop when the best validation score was not improved for a given number of iterations.

Time - training will stop after a given number of minutes has passed.

Validation Accuracy or Validation Error - training will stop when validation score reaches a given value.
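How the stopping conditions combine can be sketched as a single check run after every iteration. The class and field names below are hypothetical and only mirror the options listed above; the sketch assumes a validation accuracy (higher is better) and that training stops as soon as any enabled condition is met.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoppingConditions:
    # None means the condition is disabled.
    iteration_count: Optional[int] = None
    iterations_without_improvement: Optional[int] = None
    time_minutes: Optional[float] = None
    validation_accuracy: Optional[float] = None

def should_stop(cond, val_scores, elapsed_minutes):
    """val_scores holds one validation accuracy per finished iteration."""
    done = len(val_scores)
    if cond.iteration_count is not None and done >= cond.iteration_count:
        return True
    if cond.time_minutes is not None and elapsed_minutes >= cond.time_minutes:
        return True
    if not val_scores:
        return False
    if cond.validation_accuracy is not None and val_scores[-1] >= cond.validation_accuracy:
        return True
    if cond.iterations_without_improvement is not None:
        best_index = val_scores.index(max(val_scores))  # first iteration with the best score
        if done - 1 - best_index >= cond.iterations_without_improvement:
            return True
    return False

cond = StoppingConditions(iterations_without_improvement=2, time_minutes=30)
stop_now = should_stop(cond, [0.5, 0.9, 0.8, 0.85], elapsed_minutes=4.0)
```

For Validation Error the comparison would be reversed (stop when the error drops to the given value or below).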

Preprocessing

To adjust the performance to a particular task, the user can apply additional transformations to input images before training starts:

Downsample - reduction of the image size to shorten training and execution times, at the expense of the level of detail that can be detected. Increasing this parameter by 1 downsamples the image by a factor of 2 along both image dimensions.

Convert to grayscale - for problems where color does not matter, you can choose to work with monochrome versions of images.
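The effect of both options on image size can be illustrated with NumPy. A Downsample value of n reduces each dimension by a factor of 2**n. The averaging of 2x2 pixel blocks and the plain channel mean used below are assumptions chosen for simplicity; the resampling and color conversion actually used by the Add-on are not specified here.

```python
import numpy as np

def downsample(image, level):
    """Halve both image dimensions `level` times by averaging 2x2 pixel blocks."""
    result = image.astype(np.float64)
    for _ in range(level):
        # Drop an odd last row/column so the image splits into whole 2x2 blocks.
        h, w = result.shape[0] // 2 * 2, result.shape[1] // 2 * 2
        result = result[:h, :w]
        result = (result[0::2, 0::2] + result[1::2, 0::2]
                  + result[0::2, 1::2] + result[1::2, 1::2]) / 4.0
    return result

def to_grayscale(image):
    """Average the color channels of an H x W x C image into a single channel."""
    return image.mean(axis=2)

image = np.arange(8 * 12 * 3, dtype=np.float64).reshape(8, 12, 3)
small = downsample(image, 2)   # factor 2**2 = 4 along each dimension
gray = to_grayscale(small)
print(image.shape, small.shape, gray.shape)
```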

Augmentation

When the number of training images may be too small to represent all possible variations of samples, it is recommended to use data augmentation, which adds artificially modified samples during training. This option can also help to avoid overfitting.

Available augmentations are:

Luminance - if greater than 0%, samples with random brightness changes will be added. The value of the parameter defines the maximum brightness change (both darker and brighter) as a percentage of the full image range.

Noise - if greater than 0%, samples with random uniform noise will be added. Each channel and pixel is modified separately, within a range defined as a percentage of the full image range.

Gaussian Blur - if greater than 0, blurred samples will be added. The kernel size is selected randomly between 0 and the provided maximum kernel size. The standard deviation is calculated automatically from the selected kernel size.

Flip Up-Down - if enabled, samples reflected along the X axis will be added.

Flip Left-Right - if enabled, samples reflected along the Y axis will be added.

Relative Translation - if greater than 0%, shifted samples will be added. The horizontal shift is independent of the vertical shift, and both can be negative. The maximum translation is equal to the given percentage of the tile or image size (depending on the mode).

Scale - if the minimum or maximum scale differs from 100%, samples scaled by a value between the minimum and maximum scale will be added.

Horizontal Shear - if greater than 0°, samples sheared horizontally by an angle within ± the parameter value will be added. This transformation does not change the height of the tile (or image). The angle is measured between a vertical line and the same line after the transformation.

Vertical Shear - analogous to Horizontal Shear.

Warning: the choice of augmentation options depends only on the task to be solved, and some options may harm the quality of a solution. As a simple example, the Rotation augmentation should not be enabled if rotations are not expected in the production environment. Enabling augmentations also increases the network training time (but does not affect the execution time).
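A few of the augmentations can be sketched in NumPy to show what "a percentage of the full image range" means in practice. The function below is an illustration under stated assumptions, not the Add-on's implementation: it assumes 8-bit images (full range 255), expresses percentages as fractions, and applies each enabled flip with a probability of 0.5, which is a choice made only for this example.

```python
import numpy as np

def augment(image, rng, luminance=0.0, noise=0.0,
            flip_up_down=False, flip_left_right=False):
    """Return one randomly modified copy of an 8-bit image.

    luminance and noise are fractions of the full range (0.1 means 10%).
    """
    full_range = 255.0
    out = image.astype(np.float64)
    if luminance > 0:
        # One brightness offset for the whole image, darker or brighter.
        out = out + rng.uniform(-luminance, luminance) * full_range
    if noise > 0:
        # Independent uniform noise for every pixel and channel.
        out = out + rng.uniform(-noise, noise, size=out.shape) * full_range
    if flip_up_down and rng.random() < 0.5:
        out = np.flipud(out)   # reflection along the X axis
    if flip_left_right and rng.random() < 0.5:
        out = np.fliplr(out)   # reflection along the Y axis
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = np.full((4, 6, 3), 128, dtype=np.uint8)
augmented = augment(image, rng, luminance=0.1, noise=0.05)
```

With Luminance at 10% and Noise at 5%, no pixel can move by more than about 15% of the full range, that is roughly 38 gray levels.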

2. Anomaly detection

Deep Learning Add-on provides two variants of the DetectAnomalies tool: DeepLearning_DetectAnomalies1 and DeepLearning_DetectAnomalies2. In most cases, the second variant is recommended. The only exception is when very precise defect heatmaps are needed, even at the expense of higher computational time. The difference between them is presented in the example below.

Interactive histogram tool

DetectAnomalies filters measure the deviation of samples from the normal image appearance learned during the training phase. If the deviation exceeds a given threshold, the image is marked as defective. The suggested threshold is calculated automatically after the training phase, but it can be adjusted by the user in the Deep Learning Editor using the interactive histogram tool described below.

After the training phase, scores are calculated for every training sample and are presented in the form of histogram; good samples are marked with green, bad samples with red bars. In the perfect case, the scores for good samples should be all lower than for bad samples and the threshold should be automatically calculated to give the optimal accuracy of the model. However, the groups may sometimes overlap because of:

high variability of sample appearance or environmental conditions.

To achieve a more robust threshold, it is recommended to perform training with a large number of samples from both groups. If the number of samples is limited, the software allows the user to manually set an uncertainty area with additional thresholds (information about the confidence of the model can then be obtained for each sample from the hidden outIsConfident filter port).

The histogram tool where green bars represent correct samples and red bars represent defective samples. T marks the main threshold and T1, T2 define the area of uncertainty.

Left: a histogram presenting well-separated groups indicating a good accuracy of the model. Right: a poor accuracy of the model.
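The role of the three thresholds can be summarized in a few lines of code. This is a minimal sketch assuming that T1 <= T <= T2 and that a sample is treated as uncertain when its score falls between T1 and T2; the function name and the exact handling of boundary values are assumptions, not the documented behavior of the filter.

```python
def classify_score(score, threshold, t1=None, t2=None):
    """Return (is_defective, is_confident) for one anomaly score.

    threshold is the main threshold T; t1 and t2 bound the uncertainty
    area. Passing None for either disables the uncertainty area.
    """
    is_defective = score > threshold
    if t1 is None or t2 is None:
        return is_defective, True
    is_confident = score < t1 or score > t2
    return is_defective, is_confident

results = [classify_score(s, threshold=0.5, t1=0.4, t2=0.6)
           for s in (0.1, 0.45, 0.55, 0.9)]
print(results)
```

Samples reported as not confident are natural candidates for manual review or for relabeling and retraining.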

Feature size

This parameter defines the expected defect size and is the most significant one in terms of both quality and speed of inspection. It is represented by a gray square in the Image window of the Editor. It needs to be adjusted to a specific application using the hints below:

The Feature size should be large enough to contain common defects with some margin.

Too small Feature size may often lead to increased complexity of the model and longer processing time.

It is better to try first with a larger Feature size (small defects should be still detected) or using the Downsample option.

Too large Feature size may lead to less precise results and lower resolution of the inspection.

The Feature size has been found to work best in the range of 48-164.

Sampling density

This parameter controls the spatial resolution of both training and inspection. The higher the density, the more precise the results, but the longer the computation time. It is recommended to use the Low density only for well-positioned and simple objects. The High density is useful when working with complex textures and highly variable objects.

Global and Local network types

The DeepLearning_DetectAnomalies1 filter gives an option to choose between Local and Global types of processing. The Local processing is the default one and is used to analyze images in fragments of size determined by the Feature size parameter. The Global processing is used to analyze images holistically and is recommended for simple applications where objects are well-positioned and have large defects, like in the image below. The Global-like processing can be achieved in DeepLearning_DetectAnomalies2 by setting Feature size close to the image width or height.

Model usage

3. Feature detection (segmentation)

This technique is used to precisely mark pixels belonging to one or more classes called features of the image. They should be repeatable across the dataset and easy to define because they need to be marked by the user first. A few common examples of features to detect using this filter are:

characteristic edges, lines or points,

characteristic patterns and small objects,

repeatable defects.

Preparing the data

Images loaded into the Editor of DeepLearning_DetectFeatures can be of different sizes and can have different ROIs defined. However, it is important to ensure that the scale and appearance of the features are consistent with the production environment.

The features can be marked using the intuitive interface in the Editor or can be imported as masks from a file.

Each and every feature should be marked on all training images, or the ROI should be limited to include only marked defects. Inconsistently marked features are one of the main reasons for poor accuracy.

The marking precision should be adjusted to the application requirements. The more precise the marking, the better the accuracy in the production environment. When marking with low precision, it is better to mark features with some excess margin.

An example of knots marked with low precision.

An example of cracks marked with high precision.

Multiple classes of features

It is possible to detect many classes of features separately using the same model, for example roads and buildings, as in the image below. Features may overlap, but this is usually not recommended. It is also not recommended to define more than a few different classes in a single model.

An example of marking two different classes (red roads and yellow buildings) in one image.
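Since the technique returns one probability map per class, each class can be post-processed independently. The sketch below thresholds every map separately, so a pixel may belong to more than one class when features overlap. It assumes maps scaled to the 0..1 range and a threshold of 0.5; both are assumptions for illustration, and the class names are taken from the example above.

```python
import numpy as np

def maps_to_masks(probability_maps, threshold=0.5):
    """Threshold each class map independently and return boolean masks."""
    return {name: prob >= threshold for name, prob in probability_maps.items()}

# Hypothetical 2x3 probability maps for two classes.
probability_maps = {
    "road":     np.array([[0.9, 0.8, 0.1], [0.2, 0.7, 0.0]]),
    "building": np.array([[0.0, 0.6, 0.9], [0.1, 0.1, 0.4]]),
}
masks = maps_to_masks(probability_maps)
pixel_counts = {name: int(mask.sum()) for name, mask in masks.items()}
overlap = int((masks["road"] & masks["building"]).sum())
print(pixel_counts, overlap)
```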

Feature size

The rules are similar to the ones from the previous section. In particular, it is important to avoid values that are too small, as they are one of the main reasons for poor accuracy and long processing time. The best results should be achieved when the Feature size allows a defect to be included together with a part of its surroundings (context). A trained network examines the input image only by looking at it through windows defined by the Feature size. For this reason, it is recommended to follow the rule that an operator, using the same context, should be able to properly decide whether a given sample contains a defect or not.

Performance tip: a larger Feature size increases the training time and needs more GPU memory and training samples to operate effectively. When feature size exceeds 128 pixels and still looks too small, it is worth considering the Downsample option.

Performance tip: if the execution time is not satisfactory, you can set the inOverlap filter input to False. It should speed up the inspection by 10-30% at the expense of less precise results.

Examples of feature size: too large (red), optimal (green) and acceptable (orange). Remember that this is just a heuristic and can vary in some cases.

4. Object classification

This technique is used to identify the class of an object within an image.

Principle of operation

During training, object classification learns the representation of user-defined classes. The model uses generalized knowledge gained from the samples provided for training and aims to obtain good separation between the classes.

Result of classification after training.

After the training is completed, the user is presented with a confusion matrix. It indicates how well the model separated the user-defined classes and makes it easier to assess model accuracy when a large number of samples is present.
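For readers unfamiliar with confusion matrices, the sketch below builds one from lists of true and predicted class names. The class names and sample labels are made up for the example, and the orientation (rows for true classes, columns for predicted classes) is an assumption that may differ from the layout shown in the Editor.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, class_names):
    """Count samples per (true class, predicted class) pair."""
    index = {name: i for i, name in enumerate(class_names)}
    matrix = np.zeros((len(class_names), len(class_names)), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t], index[p]] += 1   # rows: true class, columns: predicted class
    return matrix

class_names = ["ok", "scratch", "dent"]
true_labels = ["ok", "ok", "scratch", "dent", "dent", "scratch"]
predicted = ["ok", "scratch", "scratch", "dent", "ok", "scratch"]

matrix = confusion_matrix(true_labels, predicted, class_names)
accuracy = np.trace(matrix) / matrix.sum()   # correctly classified / all samples
print(matrix)
print(accuracy)
```

Correct predictions lie on the diagonal, so large off-diagonal values point directly at the pairs of classes the model confuses.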

5. Instance segmentation

This technique is used to locate, segment and classify single or multiple objects within an image. The results of this technique are lists with elements describing detected objects - their bounding boxes, masks (segmented regions), class IDs, names and membership probabilities.

Note that, in contrast to the feature detection technique, instance segmentation detects individual objects and may be able to separate them even if they touch or overlap. On the other hand, instance segmentation is not an appropriate tool to detect and segment features, e.g. scratches or cracks.
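When the results are consumed in custom code, the per-object outputs can be grouped into one record per detected object. The container below is hypothetical: its field names, the bounding box layout and the probability cut-off are assumptions for illustration and do not reflect the actual output types of the filter (masks are omitted for brevity).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple

@dataclass
class DetectedObject:
    bounding_box: Tuple[int, int, int, int]  # x, y, width, height (assumed layout)
    class_id: int
    class_name: str
    probability: float                       # membership probability, 0..1

def count_by_class(objects, min_probability=0.5):
    """Count detected objects per class, ignoring low-probability detections."""
    return Counter(o.class_name for o in objects if o.probability >= min_probability)

objects = [
    DetectedObject((10, 12, 40, 40), 0, "nut", 0.97),
    DetectedObject((60, 15, 42, 38), 0, "nut", 0.45),
    DetectedObject((30, 70, 80, 20), 1, "bolt", 0.88),
]
counts = count_by_class(objects)
print(counts)
```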

Original image.

Visualized instance segmentation results.

Preparing the data

The training requires a user to draw regions corresponding to objects on an image and assign them to classes.

Editor for marking objects.

Training parameters

Instance segmentation training adapts to the data provided by a user and does not require any additional training parameters besides the default ones.

6. Point location

This technique is used to precisely locate and classify key points, characteristic regions and small objects within an image. The result of this technique is a list of predicted point locations with corresponding class predictions and confidence scores.

When to use point location instead of instance segmentation:

precise location of key points and distinctive regions with no strict boundaries,

location and classification of objects (possibly very small) when their segmentation masks and bounding boxes are not needed (e.g. in object counting).

Preparing the data

The training requires a user to mark points of appropriate classes on the training images.

Editor for marking points.

Training parameters

Feature size

In the case of the point location tool, the feature size parameter corresponds to the size of an object or characteristic region. If images contain objects of different scales, it is recommended to use a feature size slightly larger than the average object size, although it may require experimenting with different values to achieve optimal results.

Performance tip: a larger feature size increases the training time and needs more memory and training samples to operate effectively. When feature size exceeds 128 pixels and still looks too small, it is worth considering the Downsample option.

The inMinDistanceRatio parameter can be used to set the minimum distance at which two points are considered different. The distance is computed as MinDistanceRatio * FeatureSize. If the parameter is not enabled, the minimum distance is based on the training data.
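The formula can be illustrated with a simple post-processing routine. For example, with a feature size of 40 pixels and a ratio of 0.5, points closer than 20 pixels are treated as the same point. How the tool itself resolves such points is not described here; the sketch below assumes that the point with the highest confidence score is kept, and the function name is hypothetical.

```python
import math

def suppress_close_points(points, feature_size, min_distance_ratio):
    """Keep the best-scoring points that are at least the minimum distance apart.

    points is a list of (x, y, score) tuples.
    """
    min_distance = min_distance_ratio * feature_size
    kept = []
    # Visit points from the highest to the lowest score.
    for x, y, score in sorted(points, key=lambda p: p[2], reverse=True):
        if all(math.hypot(x - kx, y - ky) >= min_distance for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept

points = [(10, 10, 0.9), (12, 11, 0.6), (100, 100, 0.8)]
kept = suppress_close_points(points, feature_size=40, min_distance_ratio=0.5)
print(kept)
```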

To increase detection speed at the cost of potentially slightly worse precision, inOverlap can be set to False.

7. Troubleshooting

Below you will find a list of most common problems.

1. Network overfitting

A situation in which the network loses its ability to generalize to new images and focuses only on the training data.

Symptoms: during training, the validation graph stops at one level and training graph continues to rise. Defects on training images are marked very precisely, but defects on new images are marked poorly.

A graph characteristic for network overfitting.

Causes:

The number of training samples is too small.

Training time is too long.

Solution:

Provide more real samples.

Add more samples with possible object rotations.

2. Susceptibility to changes in lighting conditions

Symptoms: network is not able to process images properly when even minor changes in lighting occur.

Causes:

Samples with variable lighting were not provided.

Solution:

Provide more samples with variable lighting.

Enable "Luminance" option for automatic lighting augmentation.

3. No progress in network training

Symptoms: even though the training time is optimal, there is no visible training progress.

Training progress with contradictory samples.

Causes:

The number of samples is too small or the samples are not variable enough.

Image contrast is too small.

The chosen network architecture is too small.

There is contradiction in defect masks.

Solution:

Modify lighting to expose defects.

Remove contradictions in defect masks.

Tip: Remember to mark all defects of a given type on the input images or remove images with unmarked defects. Marking only a part of defects of a given type may negatively influence the network learning process.