Computer Vision (CV) problems, such as image classification and segmentation, have traditionally been solved by manually constructing feature hierarchies or by incorporating other prior knowledge. However, image noise, varying viewpoints and lighting conditions, and clutter in real-world images make these problems challenging. Such tasks cannot be solved efficiently without learning from data. Consequently, many Deep Learning (DL) approaches have recently been successful in various CV tasks, for instance image classification, object recognition and detection, action recognition, video classification, and scene labeling. The main focus of this thesis is to investigate a purely learning-based approach, specifically Multi-Dimensional LSTM (MD-LSTM) recurrent neural networks, for tackling two challenging CV tasks, classification and segmentation, on 2D and 3D image data. Owing to its structure, an MD-LSTM network learns directly from raw pixel values and takes the complex spatial dependencies of each pixel into account. This thesis provides several key contributions to the fields of CV and DL.
Several MD-LSTM network architectures are proposed, depending on the type of input and output and on the task at hand. In addition to the main layers (an input layer, a hidden layer, and an output layer), further layers such as a collapse layer and a fully connected layer can be added. First, a single Two-Dimensional LSTM (2D-LSTM) is applied directly to texture images for segmentation and shows an improvement over other texture segmentation methods. Second, a 2D-LSTM layer with a collapse layer is applied to image classification on texture and scene images and provides accurate classification results. In addition, a deeper model with a fully connected layer is introduced to handle more complex images for scene labeling; it outperforms other state-of-the-art methods, including deep Convolutional Neural Networks (CNNs). Several input and output representation techniques are also introduced to achieve robust classification: randomly sampled input windows are transformed by scaling and rotation and then integrated to obtain the final classification. To achieve multi-class image classification on scene images, several pruning techniques are introduced. This framework provides good results in automatic web-image tagging.
The next contribution is an investigation of 3D data with MD-LSTM. The traditional cuboid order of computations in MD-LSTM is rearranged in a pyramidal fashion. The resulting Pyramidal Multi-Dimensional LSTM (PyraMiD-LSTM) is easy to parallelize, especially for 3D data such as stacks of brain slice images. PyraMiD-LSTM was tested on 3D biomedical volumetric images and achieved the best known pixel-wise brain image segmentation results, as well as competitive results on Electron Microscopy (EM) data for membrane segmentation.
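To make the idea of a 2D-LSTM layer followed by a collapse layer more concrete, the following is a minimal sketch rather than the architecture used in the thesis. It assumes a single hidden unit, a single scan direction (top-left to bottom-right), one forget gate per spatial dimension, and a collapse layer that sums hidden activations over all positions. The names `lstm_2d_scan`, `collapse`, and `default_params`, as well as the untrained weight values, are hypothetical and chosen only for illustration; a full MD-LSTM would typically combine scans from several directions and use many hidden units.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


GATES = ("input", "forget_up", "forget_left", "output", "candidate")


def default_params():
    # Hypothetical, untrained weights per gate:
    # (weight on pixel, weight on upper hidden state, weight on left hidden state, bias).
    return {gate: (0.5, 0.5, 0.5, 0.0) for gate in GATES}


def lstm_2d_scan(image, params):
    """Scan a 2D grid from top-left to bottom-right with one LSTM unit."""
    height, width = len(image), len(image[0])
    h = [[0.0] * width for _ in range(height)]  # hidden states
    c = [[0.0] * width for _ in range(height)]  # cell states
    for y in range(height):
        for x in range(width):
            # Context from the two already-visited neighbours (zero at the borders).
            h_up, c_up = (h[y - 1][x], c[y - 1][x]) if y > 0 else (0.0, 0.0)
            h_left, c_left = (h[y][x - 1], c[y][x - 1]) if x > 0 else (0.0, 0.0)

            def pre(gate):
                w_x, w_up, w_left, b = params[gate]
                return w_x * image[y][x] + w_up * h_up + w_left * h_left + b

            i = sigmoid(pre("input"))
            f_up = sigmoid(pre("forget_up"))      # one forget gate per dimension
            f_left = sigmoid(pre("forget_left"))
            o = sigmoid(pre("output"))
            g = math.tanh(pre("candidate"))
            c[y][x] = f_up * c_up + f_left * c_left + i * g
            h[y][x] = o * math.tanh(c[y][x])
    return h


def collapse(hidden):
    """Sum hidden activations over all positions into one image-level value."""
    return sum(sum(row) for row in hidden)
```

In this sketch the hidden state at each pixel depends on every pixel above and to the left of it, which is how the recurrence carries spatial context; the collapsed value could then feed a classifier, whereas the per-pixel hidden states could feed a segmentation output.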
To validate the framework, several challenging databases for classification and segmentation are proposed to overcome the limitations of current databases. First, scene images are randomly collected from the web and used for scene understanding, forming the web-scene image dataset for multi-class image classification. To achieve multi-class image classification, the training and testing images are generated in different settings. For training, each image belongs to a single pre-defined category, and the network is trained as in regular single-class image classification. For testing, however, images containing multiple classes are randomly collected with a web-image search engine by querying the categories. All scene images include noise, background clutter, and unrelated content, and are diverse in quality and resolution. This setting makes the database suitable for evaluating real-world applications. Second, an automated blob-mosaics texture dataset generator is introduced for segmentation. Random 2D Gaussian blobs are generated and filled with random material textures. These textures contain diverse changes in illumination, scale, rotation, and viewpoint. The generated images are very challenging, since the related regions are hard to separate even visually.
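One plausible reading of the blob-mosaic generation step can be sketched as follows. The abstract does not specify the procedure, so everything here is an assumption: each blob is taken to be an isotropic 2D Gaussian with a random centre and width, each pixel is assigned to the blob with the highest density at that location, and each region is filled by tiling a texture patch. The function names `blob_label_map` and `fill_with_textures` and the parameter ranges are hypothetical, and the caller is expected to supply the texture patches (the thesis uses material textures).

```python
import random


def blob_label_map(height, width, num_blobs, seed=0):
    """Assign every pixel to the random Gaussian blob with the highest density."""
    rng = random.Random(seed)
    # Each blob: (centre row, centre column, standard deviation); ranges are assumptions.
    blobs = [
        (rng.uniform(0, height), rng.uniform(0, width),
         rng.uniform(0.1, 0.3) * min(height, width))
        for _ in range(num_blobs)
    ]
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Unnormalised log-density of each isotropic Gaussian at (y, x).
            scores = [
                -((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2)
                for cy, cx, sigma in blobs
            ]
            row.append(max(range(num_blobs), key=scores.__getitem__))
        labels.append(row)
    return labels


def fill_with_textures(labels, textures):
    """Fill each labelled region by tiling the texture patch with the same index."""
    image = []
    for y, row in enumerate(labels):
        out = []
        for x, label in enumerate(row):
            patch = textures[label]  # 2D list of intensities
            out.append(patch[y % len(patch)][x % len(patch[0])])
        image.append(out)
    return image
```

The label map doubles as the ground truth for segmentation, so image and annotation are produced together; variations in illumination, scale, rotation, and viewpoint would come from the texture patches themselves in this sketch.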
Overall, the contributions of this thesis are major advancements toward solving image analysis problems with Long Short-Term Memory (LSTM) networks without the need for any extra processing or manually designed steps. We aim to improve the presented framework to achieve the ultimate goal of accurate fine-grained image analysis and human-like understanding of images by machines.